PySpark relational operations
In this blog I explore the PySpark DataFrame API to reproduce SQL JOIN and windowing functionality, using the MovieLens [1] dataset as the running example. In addition to SQL JOIN and CASE, advanced topics such as SQL Window and Pivot are also discussed. In my last blog post [2], I explained how to use the Hive metastore, which is more oriented toward working in SQL directly. For the full list of operations, see the Spark SQL documentation [3].

- Create DataFrames
- JOIN
- UDF and explode function
- Windowing
- CASE
- Pivoting

First, start the Spark session as follows. I've changed the logging level as well.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Relational operations").getOrCreate()
sc = spark.sparkContext
# set the log level
sc.setLogLevel('ERROR')
```

```
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLeve...
```