PySpark DataFrame DSL basics
In this blog post, I explore the PySpark DataFrame structured API and its DSL operators. Typical tasks you can learn:

- Connecting to a remote PostgreSQL database
- Creating a DataFrame from that database, using the PostgreSQL Sample Database
- Creating a DataFrame from CSV (MovieLens) files

In addition, the equivalent SQL is provided for comparison with the DSL.

- Preparation
- Configure Database in the PySpark
- Aggregations
- DataFrame from a CSV file
- Spark SQL

Preparation

Set up the environment described in the blog post PySpark environment for the Postgres database [1] to execute the following PySpark queries on the Postgres Sample Database [2].

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("Postgres Connection") \
        .config("spark.jars",  # add the PostgreSQL JDBC driver jar
                "/home/jovyan/work/extlibs/postgresql-9.4.1207.jar").getOrCreate()

As shown in line 4, I am using a JDBC driver which is in my local macO...