ABC Photo Stories Term Frequency Analysis
Download the ABC Local Online Photo Stories 2009-2014 dataset from the data.gov.au site; it is available as localphotostories20092014csv.csv. Open the file in Numbers and save it with UTF-8 encoding (ps.csv in my case), because the original file's encoding is reported as unknown-8bit.
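If you would rather skip the Numbers step, read.csv can read the raw file directly once you tell it the encoding. This is only a sketch: it assumes the source is Latin-1 (ISO-8859-1), which you should confirm first with the file command shown below.
> # assumes the downloaded file is Latin-1; adjust fileEncoding if file -I suggests otherwise
> ps <- read.csv("data/localphotostories20092014csv.csv", stringsAsFactors = FALSE, fileEncoding = "latin1")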
> file -I localphotostories20092014csv.csv
localphotostories20092014csv.csv: text/plain; charset=unknown-8bit
Running the command above in the macOS Terminal reports the charset of the CSV file. In RStudio:
> library(tm)
Loading required package: NLP
> ps <- read.csv("data/ps.csv" , stringsAsFactors = FALSE)
> vs <- VectorSource(ps$Keywords)
> corpus <- Corpus(vs)
The tm package is a good choice for text mining in R. First load the tm library, installing the package if it is not already installed. VectorSource accepts only character vectors. Now create the corpus from the vector source (vs) built from ps, the contents of ps.csv. Here we consider only the keywords attached to each document; a quick way to check the result is shown below.
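To confirm the corpus was built as expected, tm's inspect function prints a few documents; the indices here are just an example.
> length(corpus)        # one document per photo story row
> inspect(corpus[1:2])  # show the keyword text of the first two documents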
> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> dtm <- DocumentTermMatrix(corpus)
> dtm2 <- as.matrix(dtm)
> f <- colSums(dtm2)
Now the cleaning: remove all punctuation and remove the English stop words, which are noise.
Then create a document-term matrix from the corpus, convert it to an ordinary matrix, and sum the columns to get the frequency of each term.
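Two optional cleaning steps that often help, though they are not part of the pipeline above, are lowercasing and whitespace stripping; they would go before the DocumentTermMatrix call.
> corpus <- tm_map(corpus, content_transformer(tolower))  # fold case so "News" and "news" count as one term
> corpus <- tm_map(corpus, stripWhitespace)               # collapse runs of whitespace left by the removals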
> head(f)
000 0439 100 1000 100m 100th
1 1 18 3 1 4
> head(sort(f, decreasing = TRUE))
abc news art queensland coast history
2326 819 797 710 596 537
Now you can see that abc is the most frequent term, with news in second place, and so on.
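As a quick next step (not part of the steps above), the top terms can be plotted with base R's barplot:
> top <- head(sort(f, decreasing = TRUE), 10)  # ten most frequent keywords
> barplot(top, las = 2)                        # perpendicular axis labels so long terms stay readable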