R Data manipulation with dplyr
First install the package
> install.packages("dplyr")
> library(dplyr)
dplyr provide five functions
- filter
- select
- mutate
- summarise
- arrange
The dataset of the ABC local stations where in the csv file abc-local-radio.csv downloaded from the data.gov.au web site. Another ABC dataset from the same site is ABC Local Online Photo Stories 2009-2014 which is available localphotostories20092014csv.csv.
Here the headings for the first file radio
.
> radio <- read.csv("abc-local-radio.csv")
> names(radio)
[1] "State" "Website.URL" "Station"
[4] "Town" "Latitude" "Longitude"
[7] "Talkback.number" "Enquiries.number" "Fax.number"
[10] "Sms.number" "Street.number" "Street.suburb"
[13] "Street.postcode" "PO.box" "PO.suburb"
[16] "PO.postcode" "Twitter" "Facebook"
The second file is
> pstories <- read.csv("localphotostories20092014csv.csv")
> names(pstories)
[1] "Title" "URL"
[3] "Date" "Primary.image"
[5] "Primary.image.caption" "Primary.image.rights.information"
[7] "Subjects" "Station"
[9] "State" "Place"
[11] "Keywords" "Latitude"
[13] "Longitude" "MediaRSS.URL"
Filter
Filter command except dataset and the condition criteria to filter out. For example, here the query how to filter out All the NSW stations:
> filter (radio, State=='NSW')[,"Station"]
[1] "999 ABC Broken Hill" "92.5 ABC Central Coast" "ABC Central West"
[4] "ABC Coffs Coast" "97.3 ABC Illawarra" "ABC Mid North Coast"
[7] "1233 ABC Newcastle" "ABC New England North West" "ABC North Coast"
[10] "ABC Riverina" "ABC South East" "702 ABC Sydney"
[13] "ABC Upper Hunter" "ABC Western Plains"
A shown in the above query there are ABC local radio14 stations in the NSW.
Select
Select command project only the specified columns. For example, above query can be written as:
> select(filter (radio, State=='NSW'),Station,Town)
Station Town
1 999 ABC Broken Hill Broken Hill
2 92.5 ABC Central Coast Gosford
3 ABC Central West Orange
4 ABC Coffs Coast Coffs Harbour
5 97.3 ABC Illawarra Wollongong
6 ABC Mid North Coast Port Macquarie
7 1233 ABC Newcastle Newcastle
8 ABC New England North West Tamworth
9 ABC North Coast Lismore
10 ABC Riverina Wagga Wagga
11 ABC South East Bega
12 702 ABC Sydney Sydney
13 ABC Upper Hunter Muswellbrook
14 ABC Western Plains Dubbo
As shown in the above select query, there should be a data frame and the columns (e.g. Station and Town columns).
Arrange
You can arrange the data base on the ascending or descending order. For example, the above query is in the descending order of the Town:
> arrange(select(filter (radio, State=='NSW'),Station,Town),desc(Town))
Station Town
1 97.3 ABC Illawarra Wollongong
2 ABC Riverina Wagga Wagga
3 ABC New England North West Tamworth
4 702 ABC Sydney Sydney
5 ABC Mid North Coast Port Macquarie
6 ABC Central West Orange
7 1233 ABC Newcastle Newcastle
8 ABC Upper Hunter Muswellbrook
9 ABC North Coast Lismore
10 92.5 ABC Central Coast Gosford
11 ABC Western Plains Dubbo
12 ABC Coffs Coast Coffs Harbour
13 999 ABC Broken Hill Broken Hill
14 ABC South East Bega
Mutate
This function add calucalted columns to the returning data frame. For example,
> filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y"))), Title, Station, year),year == 2011)
Title Station year
1 Ticket to ride: the motorbikes of ACT Policing ABC Canberra 2011
2 Howzat? Behind the scenes of a great Australian scoreboard ABC Canberra 2011
3 2011 Saturday Spruce Up in the sunshine ABC Canberra 2011
4 Live from the Capital Region Farmers Market ABC Canberra 2011
5 Obamarama: president spotting at the Australian War Memorial ABC Canberra 2011
...
...
...
In the above query, new column year
has been added to the returning data frame and use that column to filter the date frame to have only the titles published in year 2011.
Summarize
Summarization work on the group of data and publish the calculated result. For example
> summarize(filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1), Title, Station, year, rec),year == 2011), "total records" = sum(rec))
total records
1 27
> summarize(filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1), Title, Station, year, rec),year == 2012), "total records" = sum(rec))
total records
1 40
> summarize(filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1), Title, Station, year, rec),year == 2013), "total records" = sum(rec))
total records
1 53
> summarize(filter(select(mutate(select(filter(pstories, State=="ACT"), Title,Station, Date),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1), Title, Station, year, rec),year == 2014), "total records" = sum(rec))
total records
1 14
In the above code, using mutate function added new column rec=1
to the returning data frame. Now using summerize function, calculated the total number of published stories using sum
for the each year in the ACT.
Inner join
Instead of relational algebra, set theory may be the simplest way to understand this concept even though not explain in formally.
if x and y are relational sets ( based on at least one common element) then inner join is shown as
Here the inner join example:
> names(filter(mutate(inner_join(pstories,radio, by="Station"),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1),year == 2011) )
[1] "Title" "URL"
[3] "Date" "Primary.image"
[5] "Primary.image.caption" "Primary.image.rights.information"
[7] "Subjects" "Station"
[9] "State.x" "Place"
[11] "Keywords" "Latitude.x"
[13] "Longitude.x" "MediaRSS.URL"
[15] "State.y" "Website.URL"
[17] "Town" "Latitude.y"
[19] "Longitude.y" "Talkback.number"
[21] "Enquiries.number" "Fax.number"
[23] "Sms.number" "Street.number"
[25] "Street.suburb" "Street.postcode"
[27] "PO.box" "PO.suburb"
[29] "PO.postcode" "Twitter"
[31] "Facebook" "year"
[33] "rec"
In the above query shows the names available from the resulting joined table: x and y refers to the x table and y table columns.
Here the example inner join query
> select(filter(mutate(inner_join(pstories,radio, by=c("Station", "State")),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1),year == 2011 & State=="VIC"), Title, Town, State)
Title Town State
1 Marmosets moving in Horsham VIC
2 I can't believe it's a tapestry Wodonga VIC
Left Join
This is a query for left join. This query join the two datasets based on the Station and the State. Actually Station is the only necessary to join these.
> select(filter(mutate(left_join(pstories,radio, by=c("Station", "State")),year = as.numeric(format(as.Date(Date, "%d/%m/%Y"),"%Y")), rec = 1),year == 2011 & State=="VIC"), Station, State, Town)
Station State Town
1 ABC Mildura - Swan Hill Victoria VIC <NA>
2 ABC Mildura - Swan Hill Victoria VIC <NA>
3 ABC Western Victoria VIC Horsham
4
In the above query result, mean the data not found from the y dataset, but all the data from the x is selected.
Group By
The function group_by()
available in the dplyr package .
> summarize( group_by( mutate(radio,rec=1),State),total=sum(rec))
Source: local data frame [7 x 2]
State total
(chr) (dbl)
1 NSW 14
2 NT 3
3 QLD 11
4 SA 5
5 TAS 2
6 VIC 9
7 WA 8
As shown in the above example, the number of ABC Local Stations available in the QLD is 11. All the number of Stations of each State and territories shown in the above example.
Comments
Post a Comment
commented your blog