Friday, December 04, 2015

Java Tip: Append a text to a file

Here is a simple tip on how to append text to a file.

package ojitha.blogspot.com.au;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Date;

/**
 * Append to the file.
 *
 */
public class App 
{
    public static void main( String[] args )
    {
        try (PrintWriter printWriter = new PrintWriter(new BufferedWriter(new FileWriter("test.txt", true)))) {
            //PrintWriter printWriter = new PrintWriter("test.txt"); // not the way: this truncates the file
            printWriter.printf("Hello, how are you %s day!%n", new Date().toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In the above code, the true parameter in the FileWriter constructor enables append mode: new text is added to the end of the file instead of the file being overwritten.

Written with StackEdit.

Friday, November 13, 2015

python fun

These are my notes on some interesting Python tips.

Sort with Lambda

First, here is how a lambda can be used to sort a word-count dictionary (wc) by either the key or the value (the fruit name and its count).

__author__ = 'ojitha'
wc = {'orange': 2, 'mango': 1, 'cherry': 8, 'apple': 5}

# sort on the key (the fruit name)
print(sorted(wc.items(), key=lambda item: item[0]))

# sort on the value (the count)
print(sorted(wc.items(), key=lambda item: item[1]))

The output of the above code is as follows

[('apple', 5), ('cherry', 8), ('mango', 1), ('orange', 2)]
[('mango', 1), ('orange', 2), ('apple', 5), ('cherry', 8)]
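Sorting by the count in descending order is also common; operator.itemgetter does the same job as the lambda, and reverse=True flips the order. A small sketch:

```python
from operator import itemgetter

wc = {'orange': 2, 'mango': 1, 'cherry': 8, 'apple': 5}

# itemgetter(1) picks the count out of each (word, count) pair
print(sorted(wc.items(), key=itemgetter(1), reverse=True))
#output: [('cherry', 8), ('apple', 5), ('orange', 2), ('mango', 1)]
```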

List Comprehensions

Lists of evens and odds can be created as follows:

listOfEvens = [x for x in range(10) if x % 2 == 0]
print(listOfEvens)
listOfOdds = [x for x in range(10) if x % 2 != 0]
print(listOfOdds)

A list comprehension can also nest two for clauses to build all pairs:

pairs = [(x,y) for x in range(5) for y in range(5)]
print(pairs)

This will create pairs as follows

[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4)]
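A condition can be combined with the two for clauses as well; for example, keeping only the pairs where the first element is smaller than the second:

```python
# only the pairs above the diagonal
pairs = [(x, y) for x in range(3) for y in range(3) if x < y]
print(pairs)
#output: [(0, 1), (0, 2), (1, 2)]
```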

map function

The map function applies a given function to the elements of one or more sequences:

def multi(l, r):
    return l * r

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

# in Python 3, map returns an iterator, so wrap it in list() to print
print(list(map(multi, x, y)))
#output: [10, 40, 90, 160]
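An inline lambda avoids defining a named function when the operation is this small:

```python
x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

# element-wise product via a lambda passed straight to map
print(list(map(lambda l, r: l * r, x, y)))
#output: [10, 40, 90, 160]
```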

filter function

Here is filter, which keeps only the elements for which the predicate returns True:

def is_even(x):
    return x % 2 == 0

x = [1,2,3,4,5,6,7,8,14,31,45]

# filter also returns an iterator in Python 3
print(list(filter(is_even, x)))
#output: [2, 4, 6, 8, 14]
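The same filtering is often written as a list comprehension, which many consider the more Pythonic form:

```python
x = [1, 2, 3, 4, 5, 6, 7, 8, 14, 31, 45]

# keep the even numbers without a helper function
print([n for n in x if n % 2 == 0])
#output: [2, 4, 6, 8, 14]
```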

reduce function

The reduce function folds a sequence into a single value by repeatedly applying a two-argument function:

from functools import reduce  # reduce lives in functools in Python 3

def multi(x, y):
    return x * y

x = [1, 2, 3, 4]

print(reduce(multi, x))
#output: 24
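reduce also accepts an optional initial value for the accumulator, which keeps an empty sequence from raising an error; a small sketch:

```python
from functools import reduce

def multi(x, y):
    return x * y

# the third argument seeds the accumulator
print(reduce(multi, [1, 2, 3, 4], 1))
#output: 24
print(reduce(multi, [], 1))
#output: 1
```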

pythonic way of enumeration

x = ['a', 'b', 'c', 'd']

for i, j in enumerate(x):
    print(i, j)
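enumerate can start counting from any value via its start parameter, which is handy for one-based numbering:

```python
x = ['a', 'b', 'c', 'd']

# number the items from 1 instead of 0
print(list(enumerate(x, start=1)))
#output: [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```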

zip and unzip

Here is example code to zip and unzip lists:

x = ['a', 'b', 'c', 'd']
y = [1, 2, 3, 4]
z = ['p', 'q', 'r', 's']

# in Python 3, zip returns an iterator, so materialise it as a list
l = list(zip(x, y, z))

#zip
print(l)

#unzip
p, q, r = zip(*l)

print(p)
print(q)
print(r)

The output is as follows:

[('a', 1, 'p'), ('b', 2, 'q'), ('c', 3, 'r'), ('d', 4, 's')]
('a', 'b', 'c', 'd')
(1, 2, 3, 4)
('p', 'q', 'r', 's')
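A common zip trick is building a dictionary from two parallel lists:

```python
x = ['a', 'b', 'c', 'd']
y = [1, 2, 3, 4]

# pair keys with values
d = dict(zip(x, y))
print(d)
#output: {'a': 1, 'b': 2, 'c': 3, 'd': 4}
```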

Saturday, October 31, 2015

ABC Photo stories term frequency Analysis

Download the ABC dataset, ABC Local Online Photo Stories 2009-2014, from the data.gov.au site; the file is localphotostories20092014csv.csv. Open the file in Numbers and save it with UTF-8 encoding (for example, as ps.csv in my case), because the encoding of the original document is unknown-8bit.

> file -I localphotostories20092014csv.csv
localphotostories20092014csv.csv: text/plain; charset=unknown-8bit

If you type the above command in the Mac terminal, you can find the charset of the CSV file. In RStudio:

> library(tm)
Loading required package: NLP
> ps <- read.csv("data/ps.csv" , stringsAsFactors = FALSE)
> vs <- VectorSource(ps$Keywords)
> corpus <- Corpus(vs)

The tm package is well suited to text mining. First load the tm library, after installing the package if it is not already installed. VectorSource only accepts character vectors. Now create the corpus from the vector source (vs), which was created from ps, the contents of ps.csv. Here we consider only the keywords related to each document.

> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> dtm <- DocumentTermMatrix(corpus)
> dtm2 <- as.matrix(dtm)
> f <- colSums(dtm2)

Now the cleaning: remove all punctuation and remove the stop words, which are noise.

Then create a document-term matrix from the corpus, convert it to an ordinary matrix, and sum the columns to get the frequency of each term.

> head(f)
  000  0439   100  1000  100m 100th 
    1     1    18     3     1     4 
> head (sort(f, decreasing =TRUE))
       abc       news        art queensland      coast    history 
      2326        819        797        710        596        537 
> 

Now you can see that abc is the most frequently used word, with news in second place, and so on.