Sunday, August 07, 2016

Atom as Spark Editor

Recently I started looking into integrating the Atom editor with PySpark. In addition to Atom, I found out how to integrate PySpark with IntelliJ IDEA, which I plan to discuss later.
The nice thing about Atom is the Hydrogen plugin, which you can use for inline evaluation of Python code.
Here are the steps:
1. Install Spark
2. Install Atom
3. Install the Hydrogen plugin in Atom
4. Most important: set the PYTHONPATH as follows
export PYTHONPATH=/<SPARK_HOME>/python:/<SPARK_HOME>/python/lib/py4j-0.9-src.zip
5. Now run the following test code to verify the setup:
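from pyspark import SparkContext
from pyspark.sql import SQLContext

# Create the Spark context and wrap it for DataFrame support
sc = SparkContext()
sqlContext = SQLContext(sc)
# Build a two-row DataFrame and print it to confirm that PySpark is reachable
df = sqlContext.createDataFrame([("Ojitha", "Kumanayaka"), ("Mark", "Anthony")], ("first_name", "last_name"))
df.show()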



Introduction

Two basic rules:

Rule of sum: if an action can be performed by making A choices or B choices, then it can be performed in $A+B$ ways.

Rule of product: if an action can be performed by making A choices followed by B choices, then it can be performed in $A\times B$ ways.

Permutation: an ordered list where every object appears exactly once

Combination: when order is not important and repetition is not allowed, the number of ways to choose k objects from n distinct objects, for $n \ge k \ge 0$, is:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

In general, choosing k out of n is the same as not choosing n-k out of n:

$$\binom{n}{k} = \binom{n}{n-k}$$

According to Pascal's triangle:

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$$

Again from Pascal's triangle:

| Name | Count | Ordered | Repeating |
|---|---|---|---|
| Sequence | $n^k$ | yes | yes |
| Permutation | $_{n}P_{k}$ | yes | no |
| Multisubset | $\left(\!\!{n\choose k}\!\!\right)$ | no | yes |
| Combination | $_{n}C_{k}$ or $n \choose k$ | no | no |

It is sufficient to implement $n!$: the natural programming construct is recursion, because $n! = n(n-1)!$ with $0! = 1$.
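As a minimal sketch of that idea, the Python functions below build the ordered and unordered counts on top of a recursive factorial. The function names are chosen here for illustration and are not from any library.

```python
def factorial(n):
    """Recursive n!, using n! = n * (n-1)! with base case 0! = 1."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1 if n == 0 else n * factorial(n - 1)


def sequences(n, k):
    """Ordered, repetition allowed: n^k."""
    return n ** k


def permutations(n, k):
    """Ordered, no repetition: n! / (n-k)!."""
    if not 0 <= k <= n:
        return 0
    return factorial(n) // factorial(n - k)


def combinations(n, k):
    """Unordered, no repetition: n! / (k! (n-k)!)."""
    if not 0 <= k <= n:
        return 0
    return factorial(n) // (factorial(k) * factorial(n - k))


if __name__ == "__main__":
    # Choosing 2 out of 5 is the same as leaving out 3 out of 5.
    print(combinations(5, 2), combinations(5, 3))
    print(permutations(5, 2), sequences(5, 2))
```

Integer division is safe here because each quotient is known to be a whole number. For large n an iterative product avoids Python's recursion limit.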

12-fold way

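Here D means distinct and I means identical; A is the collection of a items and B is the collection of b boxes. In the following table, if $a \gt b$ then all the values in the "at most" column (at most one item per box) are zero; if $b \gt a$, then all the values in the "at least" column (at least one item per box) are zero.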

| A | B | any | at most | at least |
|---|---|---|---|---|
| D | D | $b^{a}$ | $_{b}P_{a}$ | $S(a,b)\,b!$ |
| I | D | $\binom {a+b-1} {a}$ | $_{b}C_{a}$ | $\binom {a-1} {b-1}$ |
| D | I | $\sum ^{b}_{k=1} S(a,k)$ | 1 if $a \le b$ else 0 | $S(a,b)$ |
| I | I | $\sum ^{b}_{k=1} p_{k}(a)$ | 1 if $a \le b$ else 0 | $p_{b}(a)$ |

In the above table, Stirling numbers of the second kind $S(a,b)$ count the number of ways to partition a set of a elements into b nonempty subsets. Stirling numbers of the first kind $s(a,b)$ count permutations of a objects with exactly b cycles.

In the above table, $p_{b}(a)$ is the number of ways to partition the number a into a sum of b positive integers, where the order doesn't matter. For $1 \lt k \lt n$:

$$p_{k}(n) = p_{k-1}(n-1) + p_{k}(n-k)$$

Multi-choosing $\left(\!\!{n\choose k}\!\!\right)$ is the number of ways to choose k objects from a set of n objects where order is not important but repetition is allowed.

Simplification:

$$\left(\!\!{n\choose k}\!\!\right) = \binom{n+k-1}{k}$$
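Multi-choosing can be checked numerically. The sketch below compares the closed form $\binom{n+k-1}{k}$ with a brute-force count of multisets produced by `itertools.combinations_with_replacement`; the function names are illustrative, and `math.comb` assumes Python 3.8 or later.

```python
from itertools import combinations_with_replacement
from math import comb


def multichoose(n, k):
    """Closed form: multisets of size k drawn from n distinct objects."""
    if n == 0:
        # Only the empty multiset can be drawn from nothing.
        return 1 if k == 0 else 0
    return comb(n + k - 1, k)


def multichoose_brute_force(n, k):
    """Count the same multisets by enumerating every one of them."""
    return sum(1 for _ in combinations_with_replacement(range(n), k))


if __name__ == "__main__":
    for n in range(1, 5):
        for k in range(0, 4):
            print(n, k, multichoose(n, k), multichoose_brute_force(n, k))
```

The brute-force version is only practical for small n and k, but it is a useful independent check on the formula.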

Multinomial theorem where $a, b,..,c > 0 \text{ and } a+b+ ... +c = n$

Principle of Inclusion-Exclusion

Formula for the Stirling number of the second kind:

$$S(a,b) = \frac{1}{b!}\sum_{i=0}^{b}(-1)^{i}\binom{b}{i}(b-i)^{a}$$
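The inclusion-exclusion formula for $S(a,b)$ can be cross-checked against the standard recurrence $S(a,b) = b\,S(a-1,b) + S(a-1,b-1)$. The following is a sketch under that assumption; the function names are hypothetical, and `math.comb` needs Python 3.8 or later.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def stirling2(a, b):
    """S(a, b) by recurrence: element a joins one of b blocks or starts its own."""
    if a == b:
        return 1          # includes S(0, 0) = 1
    if b == 0 or b > a:
        return 0
    return b * stirling2(a - 1, b) + stirling2(a - 1, b - 1)


def stirling2_pie(a, b):
    """S(a, b) by inclusion-exclusion over the i boxes forced to be empty."""
    total = sum((-1) ** i * comb(b, i) * (b - i) ** a for i in range(b + 1))
    return total // factorial(b)


def surjections(a, b):
    """Distinct items into distinct boxes, at least one per box: S(a, b) * b!."""
    return stirling2(a, b) * factorial(b)


if __name__ == "__main__":
    print(stirling2(4, 2), stirling2_pie(4, 2), surjections(4, 2))
```

The alternating sum counts surjections first, and dividing by $b!$ forgets the labels on the boxes.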

Java

There are two choices for the intermediate representation:

1. portable machine language
2. graph based

Initially Java was created with only a bytecode interpreter, but current Java also has a Just-In-Time (JIT) compiler.

There are two popular representations:

1. Stack based - 0 operand
2. Register based - 3 operand

Java uses Stack based representation.

Consider a single for loop in Java:

for (int i =0; i < N; i++){
...
}

The running time of the above loop is proportional to $N$, that is, linear. The running time of a double nested loop is quadratic, and that of a triple nested loop is cubic. Worse still, the running time can be exponential, such as $2^{N}$.
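To make these growth rates concrete, here is a small Python sketch that simply counts how often the innermost statement runs for each loop shape. The function names are illustrative, and enumerating all subsets is used as one example of exponential work.

```python
def count_single(n):
    """One loop: the body runs N times (linear)."""
    steps = 0
    for _ in range(n):
        steps += 1
    return steps


def count_double(n):
    """Two nested loops: the body runs N^2 times (quadratic)."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps


def count_triple(n):
    """Three nested loops: the body runs N^3 times (cubic)."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            for _ in range(n):
                steps += 1
    return steps


def count_subsets(n):
    """Visiting every subset of N items takes 2^N steps (exponential)."""
    steps = 0
    for _ in range(2 ** n):
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, count_single(n), count_double(n), count_triple(n), count_subsets(n))
```

Doubling N doubles the linear count, quadruples the quadratic one, multiplies the cubic one by eight, and squares the exponential one.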

Order of the Growth

Fortunately, there are only a few growth models to consider. Here they are, from the best running time to the worst:

1. $\log N$
2. $N$ (Knuth shuffle)
3. $N \log N$ (Mergesort, QuickSort)
4. $N^{2}$ (selection sort = $N^{2}/2$, insertion sort = $N^{2}/4$)
5. $N^{3}$
6. $2^{N}$

Each instance of java.util.ArrayList has a capacity; when the list reaches its capacity, the underlying array needs to grow. Assuming the array is doubled each time it reaches capacity, the cost of inserting the first N items is

$$N + (2 + 4 + 8 + \dots + N) \sim 3N$$

This is because ArrayList uses a Java array of Object as its implementation. The advantages are that every operation takes constant amortised time (the average running time per operation over a worst-case sequence of operations) and that less space is wasted than in a linked-list implementation, because every linked-list node is a separate object.
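The amortised argument can be simulated. The sketch below is a toy resizing array in Python, not the real java.util.ArrayList: the class name, the initial capacity of 1 and the cost accounting (one unit per element written, one per element copied during a resize) are all assumptions made for this illustration.

```python
class DoublingArray:
    """Toy resizing array that doubles its capacity when full."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.items = [None] * self.capacity
        self.cost = 0  # units of work: element writes plus element copies

    def append(self, item):
        if self.size == self.capacity:
            self._resize(2 * self.capacity)
        self.items[self.size] = item
        self.size += 1
        self.cost += 1

    def _resize(self, new_capacity):
        # Copy every existing element into a larger backing array.
        new_items = [None] * new_capacity
        for i in range(self.size):
            new_items[i] = self.items[i]
            self.cost += 1
        self.items = new_items
        self.capacity = new_capacity


def total_cost(n):
    """Total work for appending n items to an empty DoublingArray."""
    arr = DoublingArray()
    for i in range(n):
        arr.append(i)
    return arr.cost


if __name__ == "__main__":
    for n in (8, 1000, 100000):
        print(n, total_cost(n), total_cost(n) / n)
```

Under this accounting the total stays below $3N$, so the cost per append is bounded by a constant even though an individual append that triggers a resize is expensive.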

Sorting

Both java.lang.Comparable<T> and java.util.Comparator<T> are based on a total order, which is a binary relation that satisfies:

• Antisymmetry: if $v \leq w$ and $w \leq v$ , then $v = w$
• Transitivity: if $v \leq w$ and $w \leq x$ , then $v \leq x$
• Totality: either $v \leq w$ or $w \leq v$ or both
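As an illustration in Python rather than Java, the sketch below defines a comparator in the style of Comparator.compare (negative, zero or positive result) and checks the three properties on a finite sample. The comparator `by_length_then_alpha` and the checker `is_total_order` are hypothetical names introduced here.

```python
from functools import cmp_to_key


def by_length_then_alpha(v, w):
    """Comparator: negative if v < w, zero if equal, positive if v > w."""
    if len(v) != len(w):
        return len(v) - len(w)
    return (v > w) - (v < w)


def is_total_order(cmp, items):
    """Check antisymmetry, transitivity and totality of cmp on a sample."""
    def leq(v, w):
        return cmp(v, w) <= 0

    for v in items:
        for w in items:
            if not (leq(v, w) or leq(w, v)):
                return False  # totality fails
            if leq(v, w) and leq(w, v) and cmp(v, w) != 0:
                return False  # antisymmetry fails
            for x in items:
                if leq(v, w) and leq(w, x) and not leq(v, x):
                    return False  # transitivity fails
    return True


if __name__ == "__main__":
    words = ["pear", "fig", "apple", "kiwi", "date"]
    print(is_total_order(by_length_then_alpha, words))
    print(sorted(words, key=cmp_to_key(by_length_then_alpha)))
```

A comparator that breaks one of the rules, for example a rock-paper-scissors ordering, gives a sort whose result is not well defined.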

Quicksort is a little faster than Mergesort because Quicksort doesn't always exchange the elements. However, the worst-case running time of Quicksort is quadratic ($N^{2}$), while the average case is about $1.39 N \lg N$ compares. A random shuffle before sorting gives Quicksort a probabilistic guarantee against the worst case.

The java.util.Arrays sort() method uses Quicksort for primitives and Mergesort for objects (since Java 7, a dual-pivot Quicksort and TimSort, a Mergesort variant).

Dijkstra's 3-way partitioning is the way to deal with duplicate keys: when there are many duplicates, the running time is reduced from linearithmic to linear for most applications. The algorithms compare as follows:

| Algorithm | Worst | Average | Best | Remarks |
|---|---|---|---|---|
| selection | $N^{2}/2$ | $N^{2}/2$ | $N^{2}/2$ | N exchanges |
| insertion | $N^{2}/2$ | $N^{2}/4$ | $N$ | use for small N or partially ordered arrays |
| shell | ? | ? | $N$ | tight code |
| merge | $N \lg N$ | $N \lg N$ | $N \lg N$ | $N \lg N$ guaranteed |
| quick | $N^{2}/2$ | $2N \ln N$ | $N \lg N$ | $N \lg N$ probabilistic guarantee |
| 3-way quick | $N^{2}/2$ | $2N \ln N$ | $N$ | supports duplicate keys |
| heapsort | $2N \lg N$ | $2N \lg N$ | $N \lg N$ | In-place algorithm. Inner loop is longer than Quicksort's. Poor use of cache memory. Not stable. |
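Dijkstra's 3-way partitioning can be sketched in a few lines of Python. This is an illustrative version with a random shuffle up front, not the code used by any particular library; the function names are chosen here.

```python
import random


def quicksort_3way(a):
    """Sort list a in place using Quicksort with 3-way partitioning."""
    random.shuffle(a)  # probabilistic guarantee against the quadratic case
    _sort(a, 0, len(a) - 1)


def _sort(a, lo, hi):
    if hi <= lo:
        return
    v = a[lo]                    # partitioning key
    lt, i, gt = lo, lo + 1, hi   # a[lo..lt-1] < v, a[lt..i-1] == v, a[gt+1..hi] > v
    while i <= gt:
        if a[i] < v:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > v:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    # Keys equal to v are already in their final place and are never revisited.
    _sort(a, lo, lt - 1)
    _sort(a, gt + 1, hi)


if __name__ == "__main__":
    data = list("RBWWRWBRRWBR")
    quicksort_3way(data)
    print("".join(data))
```

Because every key equal to the partitioning key is excluded from both recursive calls, an input with only a handful of distinct keys is sorted in roughly linear time.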

Searching

A Binary Search Tree (BST) is a binary tree (BT) in symmetric order. A BT is either empty or a node with links to two disjoint binary trees, left and right (either link can be null).

In a BST, every node has a key that is:

• larger than all keys in its left subtree
• smaller than all keys in its right subtree

If N distinct keys are inserted in random order, the expected number of compares for a search or an insert is about $2 \ln N$.
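A minimal BST sketch in Python shows how symmetric order drives both insert and search. The class and method names (`BST`, `put`, `get`, `keys`) are assumptions made for this example, and no balancing is attempted.

```python
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.left = None   # subtree with smaller keys
        self.right = None  # subtree with larger keys


class BST:
    def __init__(self):
        self.root = None

    def put(self, key, value):
        """Insert key, or overwrite its value if the key already exists."""
        if self.root is None:
            self.root = Node(key, value)
            return
        node = self.root
        while True:
            if key < node.key:
                if node.left is None:
                    node.left = Node(key, value)
                    return
                node = node.left
            elif key > node.key:
                if node.right is None:
                    node.right = Node(key, value)
                    return
                node = node.right
            else:
                node.value = value
                return

    def get(self, key):
        """Return the value for key, or None if the key is absent."""
        node = self.root
        while node is not None:
            if key < node.key:
                node = node.left
            elif key > node.key:
                node = node.right
            else:
                return node.value
        return None

    def keys(self):
        """In-order traversal: yields the keys in ascending order."""
        result, stack, node = [], [], self.root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            result.append(node.key)
            node = node.right
        return result


if __name__ == "__main__":
    tree = BST()
    for k in "SEARCHXMPL":
        tree.put(k, k.lower())
    print(tree.keys(), tree.get("H"), tree.get("Z"))
```

Each step of `get` or `put` discards one subtree, so the cost is the depth of the node reached; keys inserted in sorted order degrade the tree to a linked list, which is what balanced variants avoid.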

Red-Black trees

The java.util.TreeMap is based on the Red-Black tree. Here are some of the characteristics of the Red-Black tree:

• it represents a 2-3 tree as a BST
• it uses an internal left-leaning link, the red link, to glue together the two nodes of a 3-node
• no node has two red links connected to it
• every path from the root to a null link has the same number of black links

Appendix

$\log N$ is called logarithmic ($N \log N$ is called linearithmic)
$N^{3}$ is called cubic ($N^{2}$ is called quadratic)