Posts

Showing posts with the label Python

PySpark Dataframe DSL basics

In this blog post, I explore the PySpark DataFrame structured API and DSL operators. Typical tasks you can learn: connecting to a remote PostgreSQL database; creating a DataFrame from that database using the PostgreSQL Sample Database; creating a DataFrame from CSV (MovieLens) files. In addition, the equivalent SQL is provided for comparison with the DSL.

Preparation, Configure Database in the PySpark, Aggregations, DataFrame from a CSV file, Spark SQL

Preparation

Set up the environment described in the blog post PySpark environment for the Postgres database 1 to execute the following PySpark queries on the Postgres Sample Database 2 .

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("Postgres Connection") \
        .config("spark.jars", # add the PostgreSQL JDBC driver jar
        "/home/jovyan/work/extlibs/postgresql-9.4.1207.jar").getOrCreate()

As shown in line# 4, I am using a JDBC driver which is in my local macO...

PySpark environment for the Postgres database

In this blog, I am going to: create a Postgres 13.4 Docker environment; create a Spark-enabled Jupyter Docker environment; run remote Jupyter notebooks via Visual Studio Code; and test the PySpark Jupyter notebook 1 , or follow the PySpark Dataframe DSL basics, which is the second part of this blog. As shown in Fig.1, the Jupyter server and the Postgres database run in the Docker environment. The Jupyter and Postgres Docker instances can communicate with each other.

Fig.1: Tool setup

You need to install and run Docker before going further.

Setup Docker, Setup Postgres, Setup Jupyter notebook with PySpark, Use the Jupyter plugin with Visual Studio Code, Jupyter cell magic, Appendix A: Connect to remote Docker machine, Appendix B: Jupyter notebooks on AWS Glue version 4

Setup Docker

To go through the rest of the installation, you should have set up Docker on your machine. You can even use a remotely installed Docker machine, either on another machine or in a cloud (Glue Development using Jupyter ...

Python my workflow

My Flow

I combined two tools using pyenv-virtualenv: pyenv manages multiple versions of Python itself; virtualenv ( Python Virtual Environments: A Primer ) manages virtual environments for a specific Python version; pyenv-virtualenv manages virtual environments across varying versions of Python.

Here is the way to create a virtualenv:

    pyenv virtualenv 3.7.2 p3

To activate the environment:

    pyenv activate p3

To deactivate at any time:

    pyenv deactivate

To uninstall the virtualenv:

    pyenv uninstall my-virtual-env

Create a project

Now we have the virtual env p3, for example. Next, create an auto-activating environment for the project myproject as follows:

    mkdir myproject
    cd myproject
    pyenv local p3

Here is the complete story.

Python 3 use of venv

If you want to set up a project with venv, first set the Python version to 3 using pyenv:

    pyenv global 3.8.0

Then create your project:

    python -m venv project

To activate the environment, move to the project directory s...

Parse the namespace based XML using Python

In this blog, I am considering how to parse and modify an XML file using Python. For example, I need to parse the following XML 1 (slightly modified for this blog) and write the modified XML to the out.xml file. Here is the country.xml:

    <?xml version="1.0"?>
    <actors xmlns:fictional="http://characters.example.com" xmlns="http://people.example.com">
      <actor type='T1'>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
      </actor>
      <actor type='T2'>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
      </actor>
    </actors>

In ...
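The excerpt above stops before the parsing code, so here is a minimal sketch of one way to read and modify namespaced XML with the standard library's xml.etree.ElementTree. It is not necessarily the approach taken in the full post. The prefix map NS and the attribute change (setting type to "updated") are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
  <actor type='T1'>
    <name>John Cleese</name>
    <fictional:character>Lancelot</fictional:character>
    <fictional:character>Archie Leach</fictional:character>
  </actor>
  <actor type='T2'>
    <name>Eric Idle</name>
    <fictional:character>Sir Robin</fictional:character>
    <fictional:character>Gunther</fictional:character>
    <fictional:character>Commander Clement</fictional:character>
  </actor>
</actors>
"""

# Prefix map used only for lookups; the prefix names are arbitrary (assumption).
NS = {
    "people": "http://people.example.com",
    "fictional": "http://characters.example.com",
}

# Keep the original prefixes on output instead of generated ns0/ns1 prefixes.
ET.register_namespace("", NS["people"])
ET.register_namespace("fictional", NS["fictional"])

root = ET.fromstring(XML)

# Read: collect the characters played by each actor.
characters = {}
for actor in root.findall("people:actor", NS):
    name = actor.find("people:name", NS).text
    characters[name] = [c.text for c in actor.findall("fictional:character", NS)]

# Modify: a hypothetical change, overwriting the type attribute.
for actor in root.findall("people:actor", NS):
    actor.set("type", "updated")

out = ET.tostring(root, encoding="unicode")
```

To write the result to a file instead of a string, `ET.ElementTree(root).write("out.xml", xml_declaration=True, encoding="utf-8")` can be used.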

Python Mocking Examples

Here is the first example, using a decorator in Python 2.7:

    import unittest
    import random
    import mock

    def myrandom(p):
        return random.random() > p

    class Test(unittest.TestCase):
        @mock.patch('random.random')
        def test_myrandom(self, mock_random):
            mock_random.return_value = 0.1
            val = myrandom(0.0)
            assert val > 0
            assert mock_random.call_count == 1

    if __name__ == '__main__':
        unittest.main()

Here is an example of the assert_called_with() function:

    import unittest
    import mock
    import example

    class Test(unittest.TestCase):
        @mock.patch('example.hello')
        def test1(self,mock_hello):
            x = 'Oj'
            example.hello(x) # Uses patched example.hello
            mock_hello.assert_called_with(x)

    if __name__ == '__main__':
        unittest.main()

The above test can also be run using a context manager:

    import unittest
    import mock
    import example

    class Test(unittest.TestCase):
        def test1(self):
            x = ...
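In Python 3 the mock library ships in the standard library as unittest.mock. The following is a minimal sketch of the decorator and context-manager styles side by side, reusing myrandom from the first example. The class name MyRandomTest and the explicit runner at the end are my own additions so the snippet runs without an external example module.

```python
import random
import unittest
from unittest import mock


def myrandom(p):
    # Looks up random.random at call time, so patching the attribute works.
    return random.random() > p


class MyRandomTest(unittest.TestCase):
    @mock.patch("random.random")
    def test_with_decorator(self, mock_random):
        mock_random.return_value = 0.1
        self.assertTrue(myrandom(0.0))
        self.assertEqual(mock_random.call_count, 1)

    def test_with_context_manager(self):
        # The patch is active only inside the with block.
        with mock.patch("random.random") as mock_random:
            mock_random.return_value = 0.1
            self.assertFalse(myrandom(0.5))
            mock_random.assert_called_with()


# Run the tests programmatically instead of calling unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MyRandomTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The decorator keeps the patch active for the whole test method, while the context manager limits it to a few lines, which helps when only part of a test needs the mock.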

Python Algorithm: create Object from JSON

As shown in the following example, you can use @wraps (which returns another wrapper) to transfer JSON to an object in Python. This blog was written in conjunction with the Python Algorithm to flattening JSON 1 .

    from functools import wraps

    def json_to_object(func):
        @wraps(func)
        def wrapper(self, d):
            for name, value in d.items():
                setattr(self, name,value)
            return func(self, d)
        return wrapper

    class Person(object):
        @json_to_object
        def __init__(self, d):
            pass

    a = Person({'firstName':'Tom', 'lastName':'Hanks', 'age':50})
    a.firstName
    a.lastName
    a.age

More advanced version:

    from functools import wraps

    def json_to_object(func):
        @wraps(func)
        def wrapper(self, d):
            for name, value in d.items():
                if type(value) == dict:
                    print(value)
                    setattr(self,name,Person.fromJson(value))
                else:
                    setattr(self, nam...
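Since the advanced version is cut off above, here is a self-contained Python 3 sketch of the same idea with nested dictionaries. It is an assumption-laden variant, not the post's code: instead of the Person.fromJson helper (whose definition is not shown), nested dicts are wrapped by calling type(self) again, and the address field is made up for the demonstration.

```python
from functools import wraps


def json_to_object(func):
    @wraps(func)  # keeps the wrapped function's name and docstring
    def wrapper(self, d):
        for name, value in d.items():
            if isinstance(value, dict):
                # Assumption: nested dicts become instances of the same class.
                value = type(self)(value)
            setattr(self, name, value)
        return func(self, d)
    return wrapper


class Person:
    @json_to_object
    def __init__(self, d):
        pass


a = Person({
    "firstName": "Tom",
    "lastName": "Hanks",
    "age": 50,
    "address": {"city": "Sydney"},  # hypothetical nested field
})
```

With this in place, `a.firstName` gives 'Tom' and `a.address.city` gives 'Sydney'.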

Python Algorithm to flattening JSON

Algorithm to flatten a deep JSON structure:

    def flat(root, **args):
        d = {}
        for k, v in args.items():
            if type(v) == dict:
                d.update(flat((k if root is None else '{}.{}'.format(root,k)),**v))
            elif type(v) == list:
                for idx, item in enumerate(v):
                    d.update(flat('{}.{}'.format(k if root is None else '{}.{}'.format(root,k),idx), **item))
            else:
                if root is None:
                    d['{}'.format(k)] = v
                else:
                    d['{}.{}'.format(root,k)] = v
            #print ('key: {}, val: {}'.format(k,v))
        return d

For example, if you flatten the following JSON structure:

    tt ={'name':{'firstName':'Tom', 'lastName':'Hanks'}, 'orderlineitems':[{'rice':{'qty':2,'price':10}},{'bread':{'qty':1,'price':2}}], 'age':20, 'location'...
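As a rough alternative to the function above, here is a Python 3 sketch that takes the value directly rather than keyword arguments, so lists of plain scalars are handled too. The function name flatten and its prefix parameter are my own. The sample data is the part of tt visible in the excerpt, with the truncated location entry left out.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a single dict with dotted keys."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become key segments
    else:
        return {prefix: value}    # leaf value: stop recursing

    flat = {}
    for key, child in items:
        path = "{}.{}".format(prefix, key) if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


tt = {
    "name": {"firstName": "Tom", "lastName": "Hanks"},
    "orderlineitems": [
        {"rice": {"qty": 2, "price": 10}},
        {"bread": {"qty": 1, "price": 2}},
    ],
    "age": 20,
}

flat_tt = flatten(tt)
```

Here `flat_tt` contains keys such as 'name.firstName' and 'orderlineitems.0.rice.qty', which matches the key format produced by the original function.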

Python Simple Tips

It is very easy to forget simple Python programming tips. Here is a blog post to remember them.

Collection manipulation tips

How to concatenate to a tuple:

    t = 'ABC', 24
    t = t + ('Sydney',)

In the second line, the trailing comma (,) is the important character: it makes ('Sydney',) a one-element tuple. You can repeat the tuple:

    t * 3 #('ABC', 24, 'Sydney', 'ABC', 24, 'Sydney', 'ABC', 24, 'Sydney')

This one is simple and hardly needs mentioning:

    #simple list
    a = [1,2,3,4,5]
    print(a[1:3]) #[2, 3]

Unpacking a data structure:

    letters = ('A', 'B'), 'a','b'
    (l1,l2),l3,l4 = letters
    print (l1,l2,l3,l4)

Unpack a dictionary:

    d = {'a':1, 'b':2, 'c':3}
    (k1,v1), (k2, v2), (k3,v3) = d.items() # ('c', 3)

Order is not guaranteed (before Python 3.7).

Create own Iterator

Two methods are mandatory: __iter__() and __next__() in Python 3 (next() in Python 2). For example:

    class MyIter...
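The iterator example is cut off above, so here is a small stand-in sketch of the two mandatory methods. The Countdown class is hypothetical and is not the class from the original post. The next alias is only there for Python 2 compatibility.

```python
class Countdown:
    """Iterator that counts down from start to 1 (hypothetical example)."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__().
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals the end of the iteration
        value = self.current
        self.current -= 1
        return value

    next = __next__  # Python 2 looks for next() instead of __next__()


numbers = list(Countdown(3))
```

Because the object is its own iterator, it can be consumed only once; a second `list()` call on the same instance yields an empty list.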

Python tip to group arrays

Grouping is one of the most important tasks in data cleansing. In Python, the itertools module is one of the most important tools for it. For example, the following source shows how to group the array by name and age using Python built-in libraries.

    import itertools
    from operator import itemgetter
    import json
    import StringIO
    import gzip
    import logging
    from module import mytest

    source = [
        {'name':'z', 'age':21, 'other':'z-21'},
        {'name':'z', 'age':21, 'other':'z-21-duplicated-1'},
        {'name':'z', 'age':21, 'other':'z-21-duplicated-2'},
        {'name':'z', 'age':20, 'other':'z-20'},
        {'name':'c', 'age':31, 'other':'c-31'},
        {'name':'c', 'age':30, 'other':'c-30'},
    ]

    grouper = itemgetter('name','age')
    s = sorted(source, key=grouper)
    import...
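The excerpt stops right after the sort, so the following is a minimal sketch of how the grouping step could continue with itertools.groupby. It reuses source and grouper from above. Collecting the other values into a dictionary named groups is my own choice for the demonstration and may differ from the full post.

```python
from itertools import groupby
from operator import itemgetter

source = [
    {'name': 'z', 'age': 21, 'other': 'z-21'},
    {'name': 'z', 'age': 21, 'other': 'z-21-duplicated-1'},
    {'name': 'z', 'age': 21, 'other': 'z-21-duplicated-2'},
    {'name': 'z', 'age': 20, 'other': 'z-20'},
    {'name': 'c', 'age': 31, 'other': 'c-31'},
    {'name': 'c', 'age': 30, 'other': 'c-30'},
]

grouper = itemgetter('name', 'age')

# groupby only merges adjacent rows, so the input must be sorted by the same key.
s = sorted(source, key=grouper)

groups = {
    key: [row['other'] for row in rows]
    for key, rows in groupby(s, key=grouper)
}
```

Each key in `groups` is a (name, age) tuple, for example ('z', 21), and the sort is stable, so duplicates keep their original relative order.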

Python Defensive Iteration

This is item 12 explained in "Effective Python" by Brett Slatkin. Generators are the best safeguard against out-of-memory problems. Here is the problem:

    class StringToList(object):
        def __init__(self, text):
            self.text = text

        def __iter__(self):
            for x in self.text:
                yield x

    words = StringToList('Hello')

    # case 1
    it = iter(words)
    it2 =iter(it)
    print(it, it2) #(<generator object __iter__ at 0x10816d7d0>, <generator object __iter__ at 0x10816d7d0>)
    next(it) # 'H'
    next(it2) # 'e'

The problem is that it and it2 point to the same instance, so iteration does not behave as expected (it2 gives the next element 'e' instead of 'H'). To overcome this problem, the author suggests the following solution, which is applicable to set and dict as well. In case 1, the same iterator is used; in case 2, each call to iter(words) on the container returns a different iterator.

    # case 2
    it = iter(words)
    it2 =iter(words)
    print(it, it2) # (<generator ...
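The difference between the two cases can be turned into a defensive check: calling iter() twice on a real container gives two distinct iterators, while calling it on an iterator returns the same object. Below is a minimal sketch of that check. The helper name two_passes is hypothetical, and the error message wording is my own.

```python
class StringToList:
    def __init__(self, text):
        self.text = text

    def __iter__(self):
        # A fresh generator is created on every call to iter().
        for x in self.text:
            yield x


def two_passes(items):
    """Iterate over items twice, refusing one-shot iterators."""
    if iter(items) is iter(items):
        # An iterator returns itself from iter(), so a second pass would be empty.
        raise TypeError("Must supply a container, not an iterator")
    return list(items), list(items)


words = StringToList('Hello')
first, second = two_passes(words)

try:
    two_passes(iter(words))
    rejected = False
except TypeError:
    rejected = True
```

The same check works for list, set, and dict arguments, because each of them returns a new iterator object every time iter() is called.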

PyPI hosting on AWS S3

Create a web hosting in the AWS S3

It is a common task to create a web hosting in an S3 bucket. However, you have to change the permissions to access the bucket via the HTTP protocol. To do that, right-click the dist folder and select Make public from the drop-down menu.

Create Distribution package

Create a directory for the distribution package source, for example ojservice . Your folder structure is as follows:

    ojservice+
             |
             +index.html
             |
             +error.html
             |
             +LICENSE.txt
             |
             +README.txt
             |
             +setup.py
             |
             +ojservice+
                       |
                       +__init__.py
                       |
                       +helloservice.py

In the above structure, index.html and error.html are the two files given in the AWS S3 host configuration under the bucket’s Static web hosting : respectively the index and error documents. Here is the helloservice.py:

    def hello(name):
        return 'Hello {}...

Collatz Conjecture in Python

Please read the Wikipedia article on the Collatz Conjecture first. Here is the Python function for the conjecture:

    def collatz(n):
        return n // 2 if n % 2 == 0 else 3 * n + 1

When you build the sequence, you get the result:

    n = 10
    sequence = [n]
    while n != 1:
        n = collatz(n)
        sequence.append(n)

Generator functions are lazy. Using the yield statement, you can create a generator function as follows:

    def collatz_iter(n):
        while n != 1:
            n = collatz(n)
            yield n

The above function is lazy. It doesn’t compute anything until you iterate over it:

    s = collatz_iter(10)
    for i in s:
        print(i)

Reference: The Five Kinds of Python Functions by Steven F. Lott. Publisher: O’Reilly Media, Inc.
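To make the laziness visible, here is a small sketch that pulls only the first few values out of the generator with itertools.islice. It uses integer division so the values stay integers in Python 3. The starting value 27 and the variable names are my own choices for the demonstration.

```python
from itertools import islice


def collatz(n):
    # Integer division keeps the result an int in Python 3.
    return n // 2 if n % 2 == 0 else 3 * n + 1


def collatz_iter(n):
    while n != 1:
        n = collatz(n)
        yield n


# The full sequence for a small start value.
full = [10] + list(collatz_iter(10))

# Only five steps are computed here, even though the sequence for 27 is long.
first_five = list(islice(collatz_iter(27), 5))
```

Here `full` is [10, 5, 16, 8, 4, 2, 1] and `first_five` is [82, 41, 124, 62, 31]; the rest of the sequence for 27 is never computed.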

Pandas Land

Introduction

These are Pandas exercises.

Data Frames

In this section, I explore how to create data frames in different ways: from a list, a dictionary, JSON, and a CSV file. In addition, it covers basic Data Frame (DF) manipulations:

    import pandas as pd

    cols = {'name': ['Ted', 'Mak', 'Nina', 'Leo'],
            'age': [50, 20, 33, 25]}

    l_students = [{'name': 'Tailor', 'grade': '10', 'math': 60},
                  {'name': 'Lora', 'grade': '09', 'math': 80},
                  {'name': 'Joe', 'grade': '11', 'math': 56.90},
                  {'name': 'Tailor', 'grade': '11', 'math': 68.98}]

    studemtDF = pd.DataFrame(l_students)

    # read from the json
    import json
    json_students = json.dumps(l_students)
    # [
    #   {
    #     "grade": "10",
    #     ...