PySpark environment for the Postgres database

In this blog, I am going to:

  • Create a Postgres 13.4 Docker environment
  • Create a Spark-enabled Jupyter Docker environment
  • Run remote Jupyter notebooks via Visual Studio Code
  • Test the PySpark Jupyter notebook [1], or follow the PySpark DataFrame DSL basics, which is the second part of this blog.

As shown in Fig.1, the Jupyter server and the Postgres database run in the Docker environment, and the two Docker instances can communicate with each other.

Fig.1: Tool setup

You need to install and run Docker before going further.



Setup Docker

To go through the rest of the installation, you should have Docker set up on your machine.
You can even use a remotely installed Docker engine, either on another machine or in the cloud (see Glue Development using Jupyter [2]).
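Before continuing, you can quickly confirm that the Docker daemon is up and reachable:

# Show client and daemon versions; this fails if the daemon is not running
docker version

# List running containers (succeeds even if the list is empty)
docker ps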

Setup Postgres

Create a working folder for this exercise; the data subfolder mounted below will hold all of the database's contents.

docker run -t -i \
    --name Mastering-postgres \
    --rm \
    -p 5432:5432 \
    -e POSTGRES_PASSWORD=ojitha \
    -v "$(pwd)/data":/var/lib/postgresql/data \
    postgres:13.4
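To verify that the database accepts connections, you can run psql inside the running container. This is a quick check, assuming the default postgres superuser created above:

# Connect to the container's database and print the server version
docker exec -it Mastering-postgres psql -U postgres -c "SELECT version();"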

The Docker panel should show the container as follows:

Fig.2: Verify database is ready

Download the Postgres Sample Database [3] as explained there. Here is the DVD Rental ER model:

Fig.3: DVD Rental ER Model

You can use PgAdmin [4] to import the database; please follow the Load PostgreSQL Sample Database [5] instructions.

Fig.4: PgAdmin
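Alternatively, you can restore the sample database from the command line instead of PgAdmin. The following is a sketch, assuming the extracted archive is dvdrental.tar in your current directory:

# Copy the archive into the running Postgres container
docker cp dvdrental.tar Mastering-postgres:/tmp/dvdrental.tar

# Create the target database, then restore the sample data into it
docker exec -it Mastering-postgres createdb -U postgres dvdrental
docker exec -it Mastering-postgres pg_restore -U postgres -d dvdrental /tmp/dvdrental.tar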

Setup Jupyter notebook with PySpark

The Postgres JDBC driver must be accessible at /home/jovyan/work/extlibs/postgresql-9.4.1207.jar inside the container. Because the docker run command below mounts the current directory to /home/jovyan/work, place the downloaded postgresql-9.4.1207.jar in the ./extlibs directory.

You can set up a Jupyter notebook with PySpark as a Docker container:

docker run -t -i \
    --name pyspark \
    --link Mastering-postgres \
    --rm \
    -e JUPYTER_ENABLE_LAB=yes \
    -e JUPYTER_TOKEN="pyspark" \
    -v "$(pwd)":/home/jovyan/work \
    -p 8888:8888 \
    jupyter/pyspark-notebook:d4cbf2f80a2a

Fig.5: Verify Jupyter lab is ready
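Inside a notebook, you can now create a SparkSession that loads the JDBC driver placed under ./extlibs and read a table from the linked Postgres container. This is a minimal sketch, assuming the dvdrental database and the credentials used above:

from pyspark.sql import SparkSession

# Build a session that registers the Postgres JDBC driver mounted into the container
spark = (SparkSession.builder
         .appName("postgres-test")
         .config("spark.jars", "/home/jovyan/work/extlibs/postgresql-9.4.1207.jar")
         .getOrCreate())

# Read the actor table over JDBC; the hostname comes from the --link to the Postgres container
actor_df = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://Mastering-postgres:5432/dvdrental")
            .option("dbtable", "public.actor")
            .option("user", "postgres")
            .option("password", "ojitha")
            .option("driver", "org.postgresql.Driver")
            .load())

actor_df.show(10)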

Use the Jupyter plugin with Visual Studio Code

First, you must install the Jupyter plugin [6] for VS Code. Open the command palette with Cmd+Shift+P and type Jupyter:

The next time, you can select the existing Jupyter server:

Fig.6: Select Jupyter remote connection in VSCode

As you can see, previously used sessions are listed.

Jupyter cell magic

If you want to work with cell magic such as %%sql, follow this procedure:

  1. First, install pg8000: !pip install pg8000
  2. Install the IPython cell magic for SQL: !pip install ipython-sql
  3. Load the SQL extension: %load_ext sql

Now you are ready to execute queries:

%%sql postgresql+pg8000://postgres:ojitha@Mastering-postgres:5432/dvdrental
SELECT * FROM public.actor LIMIT 10

The first line connects to the database, and the SELECT query in the second line runs against that connection, returning 10 records from the actor table.
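You can also capture the result with the %sql line magic and convert it to a pandas DataFrame. This is a sketch, assuming pandas is available (it ships with the pyspark-notebook image) and the connection above is still active:

# Run the query with line magic and keep the result set
result = %sql SELECT actor_id, first_name, last_name FROM public.actor LIMIT 10

# Convert the ipython-sql result set to a pandas DataFrame
df = result.DataFrame()
df.head()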

Appendix A: Connect to remote Docker machine

For example, I have installed an OpenSSH server and Docker on a Windows 10 machine on my local area network. I created an SSH private/public key pair on my macOS machine, copied the public key to the Windows OpenSSH server, and successfully connected via the iTerm2 macOS terminal.

Using the docker context ls command, you can list the existing Docker contexts. To add a new context on your macOS:

export TARGET_HOST=<ip address of the Windows 10>
docker context update ojwin --docker "host=ssh://$TARGET_HOST"
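For a context that does not exist yet, docker context create takes the same flags; a sketch with the same SSH endpoint:

docker context create ojwin --docker "host=ssh://$TARGET_HOST"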

The command above updates the existing ojwin context; use docker context create instead of update for a brand-new context, as shown above. To switch to the new Docker context:

docker context use ojwin
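You can confirm which context is active; the current one is marked with an asterisk:

docker context ls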

Now you are ready to create docker containers:

docker run -t -i \
    --name Mastering-postgres \
    --rm \
    -p 5432:5432 \
    -e POSTGRES_PASSWORD=ojitha \
    -v "C:\Users\ojitha\workspace\postgres\data":/var/lib/postgresql/data:rw \
    postgres:13.4

NOTE: You have to change the volume mapping. For example, in the above bash command that creates the Postgres container, the volume is given as -v "C:\Users\ojitha\workspace\postgres\data":/var/lib/postgresql/data:rw. In the Windows environment, it is important to add the trailing :rw for read/write permissions.

To install the Python 3.10.4-based Jupyter container:

docker run -t -i \
    --name jupyter \
    --link Mastering-postgres \
    --rm \
    -e JUPYTER_ENABLE_LAB=yes \
    -e JUPYTER_TOKEN="pyspark" \
    -v "C:\Users\ojitha\workspace\jupyter":/home/jovyan/work:rw \
    -p 8888:8888 \
    jupyter/base-notebook:python-3.10.4

Fig.7: Connect notebook to the remote Jupyter server

  1. Click the kernel in the notebook.
  2. You have already created the Docker Jupyter server remotely.
  3. Find the IP address of the remote server.
  4. Enter the remote server URL http://<remote ip>:8888/?token=pyspark (this is the token set with JUPYTER_TOKEN="pyspark" when the Jupyter container was created in the docker run command).
  5. Select the remote server as the Jupyter server.
  6. Select Python as the runtime.

Appendix B: Jupyter notebooks on AWS Glue version 4

AWS Glue version 4 ships with Jupyter notebooks. In this case, I've used Ubuntu 20.x Linux as my Docker host.

To start Postgres:

docker run -t -i \
    --name Mastering-postgres \
    --rm \
    -p 5432:5432 \
    -e POSTGRES_PASSWORD=ojitha \
    -v "/home/ojitha/workspace/postgres/data":/var/lib/postgresql/data:rw \
    postgres:13.4

To start the AWS Glue version 4 container, which includes the Jupyter notebooks:

docker run -it \
    --link Mastering-postgres \
    --rm \
    -v "/home/ojitha/workspace/glue":/home/glue_user/workspace/jupyter_workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME \
    -e DISABLE_SSL=true \
    -p 4040:4040 \
    -p 8888:8888 \
    --name glue_pyspark \
    amazon/aws-glue-libs:glue_libs_4.0.0_image_01

The above command gives you a bash prompt inside the Docker container. To start the Jupyter notebook, follow the Glue Development using Jupyter blog [2].
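If you prefer to launch it directly from that prompt, the Glue image ships a helper script for JupyterLab; the path below is an assumption based on the Glue 4.0 image layout, so verify it against the referenced blog [2]:

# Launch JupyterLab from inside the glue_pyspark container (assumed script location)
/home/glue_user/jupyter/jupyter_start.sh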

References

  1. PySpark Data Frame to Pie Chart

  2. Glue Development using Jupyter

  3. Postgres Sample DB

  4. PgAdmin 4 download

  5. Load PostgreSQL Sample Database

  6. Jupyter Notebook Renderers
