ABC Photo Stories Term Frequency Analysis

Download the ABC Local Online Photo Stories 2009-2014 dataset from the data.gov.au site; it is available as localphotostories20092014csv.csv. Open the file in Numbers and save it with UTF-8 encoding (as ps.csv in my case), because the encoding of the downloaded document is unknown-8bit.

> file -I localphotostories20092014csv.csv
localphotostories20092014csv.csv: text/plain; charset=unknown-8bit

Running the above command in the Mac terminal shows the charset of the CSV file.
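If you would rather skip the round trip through Numbers, read.csv can declare the file encoding while reading. A minimal sketch, assuming the unknown charset is in fact Latin-1 (an assumption; file only reports unknown-8bit):

> # Declare the source encoding up front instead of re-saving the file
> ps <- read.csv("data/localphotostories20092014csv.csv",
+                fileEncoding = "latin1", stringsAsFactors = FALSE)

Either way, the next steps in RStudio are: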

> library(tm)
Loading required package: NLP
> ps <- read.csv("data/ps.csv" , stringsAsFactors = FALSE)
> vs <- VectorSource(ps$Keywords)
> corpus <- Corpus(vs)

The tm package is well suited to text mining. First install the package if it is not already installed, then load the tm library. VectorSource only accepts character vectors. Now create the corpus from the vector source (vs) built from ps, the contents of ps.csv. Here we consider only the keywords attached to each document.
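To see what VectorSource and Corpus actually do, here is a minimal sketch on a made-up character vector (the strings are hypothetical, just for illustration):

> # Each element of the character vector becomes one document in the corpus
> toy <- c("abc news queensland", "art history coast")
> toyCorpus <- Corpus(VectorSource(toy))
> inspect(toyCorpus)

Each row of ps$Keywords is treated the same way: one keyword string per document.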

> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> dtm <- DocumentTermMatrix(corpus)
> dtm2 <- as.matrix(dtm)
> f <- colSums(dtm2)

Now the cleaning: remove all punctuation and remove the English stop words, which are noisy.
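The original steps stop at punctuation and stop words, but tm ships a few more transformations that are often useful. A minimal sketch of two common ones (optional additions, not part of the original pipeline):

> # Lower-case so that "News" and "news" are counted as one term
> corpus <- tm_map(corpus, content_transformer(tolower))
> # Collapse the extra whitespace left behind by the removals
> corpus <- tm_map(corpus, stripWhitespace)

If you do lower-case, it is best applied before removeWords, because stopwords("english") is all lower-case.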

Create a document-term matrix from the corpus and convert it to a plain matrix. Then sum the columns to get the total frequency of each term.

> head(f)
  000  0439   100  1000  100m 100th 
    1     1    18     3     1     4 
> head(sort(f, decreasing = TRUE))
       abc       news        art queensland      coast    history 
      2326        819        797        710        596        537 

Now you can see that abc is the most frequent term, news is in second place, and so on.
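To make the result easier to read at a glance, the top terms can be plotted with base R graphics. A minimal sketch (the cutoff of ten terms is arbitrary):

> # Bar chart of the ten most frequent keywords; las = 2 rotates the labels
> top <- head(sort(f, decreasing = TRUE), 10)
> barplot(top, las = 2, main = "Top keywords in ABC photo stories")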
