PySpark environment for the Postgres database

In this blog, I am going to:

  • Create a Postgres 13.4 Docker environment
  • Create a Spark-enabled Jupyter Docker environment
  • Run remote Jupyter notebooks via Visual Studio Code
  • Test the PySpark Jupyter notebook [1], or follow the PySpark DataFrame DSL basics, which is the second part of this blog.

As shown in Fig.1, the Jupyter server and the Postgres database run in the Docker environment, and the two Docker instances can communicate with each other.

Fig.1: Tool setup

You need to install and run Docker before going further.



Setup Docker

To go through the rest of the installation, you should have Docker set up on your machine.
You can even use a remotely installed Docker engine, either on another machine or in the cloud (see Glue Development using Jupyter [2]).
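Before continuing, you can quickly confirm that the Docker daemon is up and reachable:

# Show client and daemon versions; this fails if the daemon is not running
docker version

# List running containers (succeeds even if the list is empty)
docker ps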

Setup Postgres

Create a working folder for this exercise; the data subfolder mounted below will hold all of the database's contents.

docker run -t -i \
    --name Mastering-postgres \
    --rm \
    -p 5432:5432 \
    -e POSTGRES_PASSWORD=ojitha \
    -v "$(pwd)/data":/var/lib/postgresql/data \
    postgres:13.4
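To verify that the database accepts connections, you can run psql inside the running container. This is a quick check, assuming the default postgres superuser created above:

# Connect to the container's database and print the server version
docker exec -it Mastering-postgres psql -U postgres -c "SELECT version();"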

The Docker panel should show the container as follows:

Fig.2: Verify database is ready

Download the Postgres Sample Database [3] as explained there. Here is the DVD Rental ER model:

Fig.3: DVD Rental ER Model

You can use PgAdmin [4] to import the database; please follow the Load PostgreSQL Sample Database [5] instructions.

Fig.4: PgAdmin
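Alternatively, you can restore the sample database from the command line instead of PgAdmin. The following is a sketch, assuming the extracted archive is dvdrental.tar in your current directory:

# Copy the archive into the running Postgres container
docker cp dvdrental.tar Mastering-postgres:/tmp/dvdrental.tar

# Create the target database, then restore the sample data into it
docker exec -it Mastering-postgres createdb -U postgres dvdrental
docker exec -it Mastering-postgres pg_restore -U postgres -d dvdrental /tmp/dvdrental.tar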

Setup Jupyter notebook with PySpark

The Postgres JDBC driver must be accessible at /home/jovyan/work/extlibs/postgresql-9.4.1207.jar inside the container. Because the docker run command below mounts the current directory to /home/jovyan/work, place the downloaded postgresql-9.4.1207.jar in the ./extlibs directory.

You can set up a Jupyter notebook with PySpark as a Docker container:

docker run -t -i \
    --name pyspark \
    --link Mastering-postgres \
    --rm \
    -e JUPYTER_ENABLE_LAB=yes \
    -e JUPYTER_TOKEN="pyspark" \
    -v "$(pwd)":/home/jovyan/work \
    -p 8888:8888 \
    jupyter/pyspark-notebook:d4cbf2f80a2a

Fig.5: Verify Jupyter lab is ready
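Inside a notebook, you can now create a SparkSession that loads the JDBC driver placed under ./extlibs and read a table from the linked Postgres container. This is a minimal sketch, assuming the dvdrental database and the credentials used above:

from pyspark.sql import SparkSession

# Build a session that registers the Postgres JDBC driver mounted into the container
spark = (SparkSession.builder
         .appName("postgres-test")
         .config("spark.jars", "/home/jovyan/work/extlibs/postgresql-9.4.1207.jar")
         .getOrCreate())

# Read the actor table over JDBC; the hostname comes from the --link to the Postgres container
actor_df = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://Mastering-postgres:5432/dvdrental")
            .option("dbtable", "public.actor")
            .option("user", "postgres")
            .option("password", "ojitha")
            .option("driver", "org.postgresql.Driver")
            .load())

actor_df.show(10)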

Use the Jupyter plugin with Visual Studio Code

First, you must install the Jupyter plugin [6] for VS Code. Open the command palette with Cmd+Shift+P and type Jupyter:

The next time, you can select the existing Jupyter server:

Fig.6: Select Jupyter remote connection in VSCode

As you can see, previously used sessions are listed.

Jupyter cell magic

If you want to work with cell magic such as %%sql, follow this procedure:

  1. First, install pg8000: !pip install pg8000
  2. Install the IPython cell magic for SQL: !pip install ipython-sql
  3. Load the SQL extension: %load_ext sql

Now you are ready to execute queries:

%%sql postgresql+pg8000://postgres:ojitha@Mastering-postgres:5432/dvdrental
SELECT * FROM public.actor LIMIT 10

The first line connects to the database, and the SELECT query in the second line runs against that connection, returning 10 records from the actor table.
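You can also capture the result with the %sql line magic and convert it to a pandas DataFrame. This is a sketch, assuming pandas is available (it ships with the pyspark-notebook image) and the connection above is still active:

# Run the query with line magic and keep the result set
result = %sql SELECT actor_id, first_name, last_name FROM public.actor LIMIT 10

# Convert the ipython-sql result set to a pandas DataFrame
df = result.DataFrame()
df.head()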

Appendix A: Connect to remote Docker machine

For example, I have installed an OpenSSH server and Docker on a Windows 10 machine on my local area network. I created an SSH private/public key pair on my macOS machine, copied the public key to the Windows OpenSSH server, and successfully connected via the iTerm2 macOS terminal.

Using the docker context ls command, you can list the existing Docker contexts. To add a new context on your macOS:

export TARGET_HOST=<ip address of the Windows 10>
docker context update ojwin --docker "host=ssh://$TARGET_HOST"
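For a context that does not exist yet, docker context create takes the same flags; a sketch with the same SSH endpoint:

docker context create ojwin --docker "host=ssh://$TARGET_HOST"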

The command above updates the existing ojwin context; use docker context create instead of update for a brand-new context, as shown above. To switch to the new Docker context:

docker context use ojwin
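You can confirm which context is active; the current one is marked with an asterisk:

docker context ls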

Now you are ready to create docker containers:

docker run -t -i \
    --name Mastering-postgres \
    --rm \
    -p 5432:5432 \
    -e POSTGRES_PASSWORD=ojitha \
    -v "C:\Users\ojitha\workspace\postgres\data":/var/lib/postgresql/data:rw \
    postgres:13.4

NOTE: You have to change the volume mapping. For example, in the above bash command that creates the Postgres container, the volume is given as -v "C:\Users\ojitha\workspace\postgres\data":/var/lib/postgresql/data:rw. In the Windows environment, it is important to add the trailing :rw for read/write permissions.

To install the Python 3.10.4-based Jupyter container:

docker run -t -i \
    --name jupyter \
    --link Mastering-postgres \
    --rm \
    -e JUPYTER_ENABLE_LAB=yes \
    -e JUPYTER_TOKEN="pyspark" \
    -v "C:\Users\ojitha\workspace\jupyter":/home/jovyan/work:rw \
    -p 8888:8888 \
    jupyter/base-notebook:python-3.10.4

Fig.7: Connect notebook to the remote Jupyter server

  1. Click the kernel in the notebook.
  2. You have already created the Docker Jupyter server remotely.
  3. Find the IP address of the remote server.
  4. Enter the remote server URL http://<remote ip>:8888/?token=pyspark (this is the token set with JUPYTER_TOKEN="pyspark" when the Jupyter container was created in the docker run command).
  5. Select the remote server as the Jupyter server.
  6. Select Python as the runtime.

Appendix B: Jupyter notebooks on AWS Glue version 4

AWS Glue version 4 ships with Jupyter notebooks. In this case, I've used Ubuntu 20.x Linux as my Docker host.

To start Postgres:

docker run -t -i \
    --name Mastering-postgres \
    --rm \
    -p 5432:5432 \
    -e POSTGRES_PASSWORD=ojitha \
    -v "/home/ojitha/workspace/postgres/data":/var/lib/postgresql/data:rw \
    postgres:13.4

To start the AWS Glue version 4 container, which includes the Jupyter notebooks:

docker run -it \
    --link Mastering-postgres \
    --rm \
    -v "/home/ojitha/workspace/glue":/home/glue_user/workspace/jupyter_workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME \
    -e DISABLE_SSL=true \
    -p 4040:4040 \
    -p 8888:8888 \
    --name glue_pyspark \
    amazon/aws-glue-libs:glue_libs_4.0.0_image_01

The above command gives you a bash prompt inside the Docker container. To start the Jupyter notebook, follow the Glue Development using Jupyter blog [2].
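If you prefer to launch it directly from that prompt, the Glue image ships a helper script for JupyterLab; the path below is an assumption based on the Glue 4.0 image layout, so verify it against the referenced blog [2]:

# Launch JupyterLab from inside the glue_pyspark container (assumed script location)
/home/glue_user/jupyter/jupyter_start.sh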

References

  1. PySpark Data Frame to Pie Chart

  2. Glue Development using Jupyter

  3. Postgres Sample DB

  4. PgAdmin 4 download

  5. Load PostgreSQL Sample Database

  6. Jupyter Notebook Renderers
