R Data manipulation with dplyr

First install the package

> install.packages("dplyr")
> library(dplyr)

dplyr provide five functions

  1. filter
  2. select
  3. mutate
  4. summarise
  5. arrange

The dataset of the ABC local stations where in the csv file abc-local-radio.csv downloaded from the data.gov.au web site. Another ABC dataset from the same site is ABC Local Online Photo Stories 2009-2014 which is available localphotostories20092014csv.csv.

Here the headings for the first file radio.

> radio <- read.csv("abc-local-radio.csv")
> names(radio)
 [1] "State"            "Website.URL"      "Station"         
 [4] "Town"             "Latitude"         "Longitude"       
 [7] "Talkback.number"  "Enquiries.number" "Fax.number"      
[10] "Sms.number"       "Street.number"    "Street.suburb"   
[13] "Street.postcode"  "PO.box"           "PO.suburb"       
[16] "PO.postcode"      "Twitter"          "Facebook"

The second file is

> pstories <- read.csv("localphotostories20092014csv.csv")
> names(pstories)
 [1] "Title"                            "URL"                             
 [3] "Date"                             "Primary.image"                   
 [5] "Primary.image.caption"            "Primary.image.rights.information"
 [7] "Subjects"                         "Station"                         
 [9] "State"                            "Place"                           
[11] "Keywords"                         "Latitude"                        
[13] "Longitude"                        "MediaRSS.URL"                    

Filter

Filter command except dataset and the condition criteria to filter out. For example, here the query how to filter out All the NSW stations:

> filter (radio, State=='NSW')[,"Station"]
 [1] "999 ABC Broken Hill"        "92.5 ABC Central Coast"     "ABC Central West"          
 [4] "ABC Coffs Coast"            "97.3 ABC Illawarra"         "ABC Mid North Coast"       
 [7] "1233 ABC Newcastle"         "ABC New England North West" "ABC North Coast"           
[10] "ABC Riverina"               "ABC South East"             "702 ABC Sydney"            
[13] "ABC Upper Hunter"           "ABC Western Plains"        

A shown in the above query there are ABC local radio14 stations in the NSW.

Select

Select command project only the specified columns. For example, above query can be written as:

> select(filter (radio, State=='NSW'),Station,Town)
                      Station             Town
1         999 ABC Broken Hill     Broken Hill 
2      92.5 ABC Central Coast         Gosford 
3            ABC Central West          Orange 
4             ABC Coffs Coast   Coffs Harbour 
5          97.3 ABC Illawarra      Wollongong 
6         ABC Mid North Coast  Port Macquarie 
7          1233 ABC Newcastle       Newcastle 
8  ABC New England North West        Tamworth 
9             ABC North Coast         Lismore 
10               ABC Riverina     Wagga Wagga 
11             ABC South East            Bega 
12             702 ABC Sydney          Sydney 
13           ABC Upper Hunter    Muswellbrook 
14         ABC Western Plains           Dubbo 

As shown in the above select query, there should be a data frame and the columns (e.g. Station and Town columns).

Arrange

You can arrange the data base on the ascending or descending order. For example, the above query is in the descending order of the Town:

> arrange(select(filter (radio, State=='NSW'),Station,Town),desc(Town))
                      Station             Town
1          97.3 ABC Illawarra      Wollongong 
2                ABC Riverina     Wagga Wagga 
3  ABC New England North West        Tamworth 
4              702 ABC Sydney          Sydney 
5         ABC Mid North Coast  Port Macquarie 
6            ABC Central West          Orange 
7          1233 ABC Newcastle       Newcastle 
8            ABC Upper Hunter    Muswellbrook 
9             ABC North Coast         Lismore 
10     92.5 ABC Central Coast         Gosford 
11         ABC Western Plains           Dubbo 
12            ABC Coffs Coast   Coffs Harbour 
13        999 ABC Broken Hill     Broken Hill 
14             ABC South East            Bega 

Mutate

This function add calucalted columns to the returning data frame. For example,

> filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y"))), Title, Station, year),year == 2011)
                                                          Title      Station year
1                Ticket to ride: the motorbikes of ACT Policing ABC Canberra 2011
2    Howzat? Behind the scenes of a great Australian scoreboard ABC Canberra 2011
3                       2011 Saturday Spruce Up in the sunshine ABC Canberra 2011
4                   Live from the Capital Region Farmers Market ABC Canberra 2011
5  Obamarama: president spotting at the Australian War Memorial ABC Canberra 2011
...
...
...

In the above query, new column year has been added to the returning data frame and use that column to filter the date frame to have only the titles published in year 2011.

Summarize

Summarization work on the group of data and publish the calculated result. For example

> summarize(filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1), Title, Station, year, rec),year == 2011), "total records" = sum(rec))
  total records
1            27
> summarize(filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1), Title, Station, year, rec),year == 2012), "total records" = sum(rec))
  total records
1            40
> summarize(filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1), Title, Station, year, rec),year == 2013), "total records" = sum(rec))
  total records
1            53
> summarize(filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1), Title, Station, year, rec),year == 2014), "total records" = sum(rec))
  total records
1            14

In the above code, using mutate function added new column rec=1 to the returning data frame. Now using summerize function, calculated the total number of published stories using sum for the each year in the ACT.

Inner join

Instead of relational algebra, set theory may be the simplest way to understand this concept even though not explain in formally.

if x and y are relational sets ( based on at least one common element) then inner join is shown as

Here the inner join example:

> names(filter(mutate(inner_join(pstories,radio, by="Station"),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1),year == 2011) )
 [1] "Title"                            "URL"                             
 [3] "Date"                             "Primary.image"                   
 [5] "Primary.image.caption"            "Primary.image.rights.information"
 [7] "Subjects"                         "Station"                         
 [9] "State.x"                          "Place"                           
[11] "Keywords"                         "Latitude.x"                      
[13] "Longitude.x"                      "MediaRSS.URL"                    
[15] "State.y"                          "Website.URL"                     
[17] "Town"                             "Latitude.y"                      
[19] "Longitude.y"                      "Talkback.number"                 
[21] "Enquiries.number"                 "Fax.number"                      
[23] "Sms.number"                       "Street.number"                   
[25] "Street.suburb"                    "Street.postcode"                 
[27] "PO.box"                           "PO.suburb"                       
[29] "PO.postcode"                      "Twitter"                         
[31] "Facebook"                         "year"                            
[33] "rec"                             

In the above query shows the names available from the resulting joined table: x and y refers to the x table and y table columns.

Here the example inner join query

> select(filter(mutate(inner_join(pstories,radio, by=c("Station", "State")),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1),year == 2011 & State=="VIC"), Title, Town, State)
                                                                        Title      Town State
1                                                         Marmosets moving in  Horsham    VIC
2                                             I can't believe it's a tapestry  Wodonga    VIC

Left Join

This is a query for left join. This query join the two datasets based on the Station and the State. Actually Station is the only necessary to join these.

> select(filter(mutate(left_join(pstories,radio, by=c("Station", "State")),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1),year == 2011 & State=="VIC"), Station, State, Town)
                             Station State      Town
1   ABC Mildura - Swan Hill Victoria   VIC      <NA>
2   ABC Mildura - Swan Hill Victoria   VIC      <NA>
3               ABC Western Victoria   VIC  Horsham 
4

In the above query result, mean the data not found from the y dataset, but all the data from the x is selected.

Group By

The function group_by() available in the dplyr package .

> summarize( group_by( mutate(radio,rec=1),State),total=sum(rec))
Source: local data frame [7 x 2]

  State total
  (chr) (dbl)
1   NSW    14
2    NT     3
3   QLD    11
4    SA     5
5   TAS     2
6   VIC     9
7    WA     8

As shown in the above example, the number of ABC Local Stations available in the QLD is 11. All the number of Stations of each State and territories shown in the above example.

Comments

Popular posts from this blog

How To: GitHub projects in Spring Tool Suite

Spring 3 Part 7: Spring with Databases

Parse the namespace based XML using Python