Posts

Showing posts from February, 2022

PySpark relational operations

In this blog, I explore the PySpark DataFrame API to create SQL JOIN and windowing functionality, using the MovieLens dataset [1] as the example data. In addition to SQL JOIN and CASE, advanced topics such as SQL Window and Pivot are discussed. In my last blog post [2], I explained how to use the Hive metastore, which is more oriented towards using SQL only. For the full list, see the Spark SQL documentation [3]. Topics covered: Create DataFrames, JOIN, UDF and the explode function, Windowing, CASE, and Pivoting.

First, start the Spark session as follows. I've changed the logging level as well.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Relational operations").getOrCreate()
sc = spark.sparkContext
# set the log level
sc.setLogLevel('ERROR')

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). …
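The post goes on to build JOIN and Window queries over the MovieLens data. As a minimal sketch of those two operations (the movies.csv/ratings.csv paths, the column names, and the top-3 cut-off are illustrative assumptions, not values from the post):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("Relational operations sketch").getOrCreate()

# Hypothetical MovieLens-style inputs: movies (movieId, title), ratings (userId, movieId, rating)
movies = spark.read.csv("movies.csv", header=True, inferSchema=True)
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# JOIN: attach movie titles to each rating row
rated = ratings.join(movies, on="movieId", how="inner")

# Window: rank each user's ratings from highest to lowest and keep the top 3
w = Window.partitionBy("userId").orderBy(F.col("rating").desc())
rated.withColumn("rank", F.rank().over(w)) \
     .filter(F.col("rank") <= 3) \
     .select("userId", "title", "rating", "rank") \
     .show(5)
```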

Spark SQL using Hive metastore

This is the second part of PySpark Dataframe DSL basics [1]. The main objective is to create a Hive metastore based database and tables. Topics covered: Create Databases and tables, Metadata, Data Sources, SQL, and User Defined Functions.

The first step is to create the Spark session manually. By default, Spark uses the Apache Hive metastore to persist all table metadata; I have to change the default warehouse location (the spark.sql.warehouse.dir setting) in the following code:

from pyspark.sql import SparkSession

# create a Spark Session
spark = SparkSession.builder.appName("SparkSQLInterfaceExamples") \
    .config("spark.sql.warehouse.dir", "/Users/ojitha/datasets/data-warehouse") \
    .enableHiveSupport().getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")  # changed the default log level WARN to ERROR

Create Databases and tables: get the GroupLens "recommended for new research" dataset [2] and configure access to it. You can create managed and unmanaged tables in Spark …
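As a rough illustration of what "Create Databases and tables" covers, here is a minimal sketch of creating a database plus one managed and one unmanaged table once Hive support is enabled; the database/table names and the CSV path are hypothetical, not taken from the post:

```python
# Assumes `spark` was created with .enableHiveSupport() as above.
# Database/table names and the ratings.csv path are hypothetical.
spark.sql("CREATE DATABASE IF NOT EXISTS movielens_db")
spark.sql("USE movielens_db")

# Managed table: Spark owns both the data and the metastore metadata
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
ratings.write.mode("overwrite").saveAsTable("ratings_managed")

# Unmanaged (external) table: only metadata is registered; the data stays at the given path
spark.sql("""
    CREATE TABLE IF NOT EXISTS ratings_unmanaged
    (userId INT, movieId INT, rating DOUBLE, ts BIGINT)
    USING csv OPTIONS (path 'ratings.csv', header 'true')
""")

spark.catalog.listTables()   # inspect what is registered in the metastore
```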

PySpark Dataframe DSL basics

In this blog post, I explore the PySpark DataFrame structured API and DSL operators. Typical tasks you can learn: connecting to a remote PostgreSQL database, creating a DataFrame from that database using the PostgreSQL Sample Database, and creating a DataFrame from CSV (MovieLens) files. In addition, the equivalent SQL is provided to compare with the DSL. Topics covered: Preparation, Configure Database in PySpark, Aggregations, DataFrame from a CSV file, and Spark SQL.

Preparation: set up the environment described in the blog post PySpark environment for the Postgres database [1] to execute the following PySpark queries on the Postgres Sample Database [2].

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Postgres Connection") \
    .config("spark.jars",  # add the PostgreSQL JDBC driver jar
            "/home/jovyan/work/extlibs/postgresql-9.4.1207.jar").getOrCreate()

As shown in the spark.jars configuration, I am using the JDBC driver, which is in my local macOS …
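With the session configured, the post builds DataFrames from the sample database. Below is a minimal sketch of a JDBC read followed by a DSL aggregation and its SQL equivalent; the hostname, port, credentials, database name (dvdrental) and table name are assumptions for illustration only.

```python
# Hypothetical connection details for the Postgres sample database
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://postgres:5432/dvdrental")
      .option("driver", "org.postgresql.Driver")
      .option("dbtable", "film")
      .option("user", "postgres")
      .option("password", "example")
      .load())

# DSL aggregation ...
df.groupBy("rating").count().show()

# ... and the equivalent SQL for comparison
df.createOrReplaceTempView("film")
spark.sql("SELECT rating, COUNT(*) AS cnt FROM film GROUP BY rating").show()
```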

PySpark environment for the Postgres database

In this blog, I am going to: create a Postgres 13.4 Docker environment, create a Spark-enabled Jupyter Docker environment, run remote Jupyter notebooks via Visual Studio Code, and test the PySpark Jupyter notebook [1] (or follow PySpark Dataframe DSL basics, which is the second part of this blog). As shown in Fig.1, the Jupyter server and the Postgres database run in the Docker environment, and the Jupyter and Postgres Docker instances can communicate with each other.

Fig.1: Tool setup

You need to install and run Docker before going further. Topics covered: Setup Docker, Setup Postgres, Setup Jupyter notebook with PySpark, Use the Jupyter plugin with Visual Studio Code, Jupyter cell magic, Appendix A: Connect to remote Docker machine, and Appendix B: Jupyter notebooks on AWS Glue version 4.

Setup Docker: to go through the rest of the installation, you should have set up Docker on your machine. You can even use a remotely installed Docker machine, either on another host or in a cloud (Glue Development using Jupyter …
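Once the two containers are up, a quick smoke test from a notebook cell confirms the kernel can reach Spark before any database work; the jar path below repeats the one used later in this series and is otherwise an assumption:

```python
# Smoke test for the Jupyter + PySpark container; the jar path is illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("Environment smoke test")
         .config("spark.jars", "/home/jovyan/work/extlibs/postgresql-9.4.1207.jar")
         .getOrCreate())

print(spark.version)           # Spark is available inside the notebook
print(spark.range(5).count())  # a trivial job runs end to end
```

Running a cell like this through the VS Code Jupyter extension against the remote server exercises the same remote-notebook workflow the post sets up.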