An exercise in R using local open data
Last week I went to the “Government Open Data Hack Day” ([@,#]godhd on Twitter) in Birmingham (UK), organised by Gavin Broughton and James Catell. The idea was to get hold of local open data and make something useful out of it in just one day. You can see some of the work done on the day presented here. It was good fun and I learned a few tricks and resources in the process, which I’m going to go through in this post. I’ll refrain from any data analysis, because I know next to nothing about this type of data. Instead, I’m going to explain how to go from the raw format (in this case an Excel sheet) to something useful and exploitable.
(all files here)
The data I was given come from nomis and consist of job vacancies in the West Midlands for the years 2011 and 2012, broken down by job type. The spreadsheet lists 353 job types for 59 constituencies, one after the other:
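The file alternates between a header line naming the constituency and a block of job-type rows. Roughly, it looks like this (an illustrative sketch of the layout, not the exact file contents):

```
area name : Aldridge-Brownhills
occupation                                        2011   2012   change  change%
1111 : Senior officials in national government       0      0        0        0
1121 : Production, works and maintenance managers   10      4       -6      -60
...
area name : Birmingham, Edgbaston
...
```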
From the spreadsheet to an R data frame
The first thing to do is to turn this into an R data frame for easier manipulation and reshaping. Luckily, each dataset follows the exact same pattern, so it’s easy to extract the name of each constituency and each job type into two files with a simple grep, and then combine both files into a data frame in R. The lines starting with “area name” contain the name of the constituency; those starting with 4 digits contain the job type and the numbers we want. (nomis_jobs_wm_2011_2012.csv is the tab-separated version of the spreadsheet.)
In a terminal:
grep "^area name" nomis_jobs_wm_2011_2012.csv > areaname.csv
grep "^[0-9]\{4\}" nomis_jobs_wm_2011_2012.csv > jobtypes.csv
In R:
# read the data
areas <- read.table("areaname.csv", header=FALSE, sep='\t')
jobs  <- read.table("jobtypes.csv", header=FALSE, sep='\t')

# combine them
areas <- areas$V2
jobs$region <- rep(areas, each=nrow(jobs)/length(areas))
names(jobs) <- c("JobType","Vacancies2011","Vacancies2012","change","changePercent","region")
jobs <- subset(jobs, select=c("JobType","Vacancies2011","Vacancies2012","region"))

# A subtlety here: Excel formatted the numbers with a comma, e.g. 1,234 for 1234,
# so the comma has to be removed for the cast to work properly
jobs$Vacancies2011 <- as.numeric(gsub(",", "", jobs$Vacancies2011))
jobs$Vacancies2012 <- as.numeric(gsub(",", "", jobs$Vacancies2012))
> head(jobs)
                                                       JobType Vacancies2011 Vacancies2012              region
1               1111 : Senior officials in national government             0             0 Aldridge-Brownhills
2 1112 : Directors and chief executives of major organisations             0             0 Aldridge-Brownhills
3                  1113 : Senior officials in local government             0             0 Aldridge-Brownhills
4    1114 : Senior officials of special interest organisations             0             0 Aldridge-Brownhills
5            1121 : Production, works and maintenance managers            10             4 Aldridge-Brownhills
6                              1122 : Managers in construction            47             9 Aldridge-Brownhills
Now that we have the data in one single data frame, it’s much easier to do something with it.
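A quick sanity check never hurts: assuming the grep extracted cleanly, the expected counts follow directly from the description above (353 job types for each of 59 constituencies):

```r
nrow(jobs)                    # should be 353 * 59 = 20827 rows
length(unique(jobs$region))   # should be 59 constituencies
```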
Aggregating the job types
There are 353 job types in total, which is too fine a granularity for us. It turns out that the 4-digit numbers that were so useful for parsing the data are SOC (Standard Occupational Classification) codes and follow a hierarchical pattern, with 1xxx meaning “Managers and Senior Officials”, 2xxx “Professional Occupations”, etc. Somewhere on the internet (I can’t remember where) I tracked down an exploitable list (as in, not a PDF!) of those SOC numbers, which I promptly turned into a tab-separated file.
soc <- read.table("soc.csv", sep='\t', header=TRUE, stringsAsFactors=FALSE)
> head(soc)
  Major.Group Sub.Major.Group Minor.Group Unit...Group                              Group.Title
1           1              NA          NA           NA MANAGERS, DIRECTORS AND SENIOR OFFICIALS
2          NA              11          NA           NA         CORPORATE MANAGERS AND DIRECTORS
3          NA              NA         111           NA    Chief Executives and Senior Officials
4          NA              NA          NA         1115    Chief executives and senior officials
5          NA              NA          NA         1116     Elected officers and representatives
6          NA              NA         112           NA        Production Managers and Directors
Now that we have the description of each level in the SOC code, we can aggregate the 353 jobs into, for example, the 9 job types of level 1 (‘Major Group’).
# select the job types in the major group
level1 <- subset(soc, !is.na(Major.Group), select=c("Major.Group","Group.Title"))
# build a look-up table to go from a digit to a job type
lookup <- 1:nrow(level1)
lookup[level1$Major.Group] <- level1$Group.Title
lookup <- factor(lookup, levels=lookup)
> lookup
[1] MANAGERS, DIRECTORS AND SENIOR OFFICIALS         PROFESSIONAL OCCUPATIONS
[3] ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS ADMINISTRATIVE AND SECRETARIAL OCCUPATIONS
[5] SKILLED TRADES OCCUPATIONS                       CARING, LEISURE AND OTHER SERVICE OCCUPATIONS
[7] SALES AND CUSTOMER SERVICE OCCUPATIONS           PROCESS, PLANT AND MACHINE OPERATIVES
[9] ELEMENTARY OCCUPATIONS
9 Levels: MANAGERS, DIRECTORS AND SENIOR OFFICIALS PROFESSIONAL OCCUPATIONS ... ELEMENTARY OCCUPATIONS
library(plyr)

# add a column 'level1' to jobs, containing one of the 9 possible job titles
jobs$level1 <- lookup[as.numeric(sapply(jobs$JobType,
                                        function(s) substr(s, 1, 1),
                                        USE.NAMES=FALSE))]
# build a new data frame, byLevel1, with the aggregated data
byLevel1 <- ddply(jobs, .(region, level1), summarise,
                  Vacancies2011=sum(Vacancies2011),
                  Vacancies2012=sum(Vacancies2012))
> head(byLevel1)
               region                                           level1 Vacancies2011 Vacancies2012
1 Aldridge-Brownhills         MANAGERS, DIRECTORS AND SENIOR OFFICIALS           173           134
2 Aldridge-Brownhills                         PROFESSIONAL OCCUPATIONS            97           100
3 Aldridge-Brownhills ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS           548           190
4 Aldridge-Brownhills       ADMINISTRATIVE AND SECRETARIAL OCCUPATIONS           288           202
5 Aldridge-Brownhills                       SKILLED TRADES OCCUPATIONS           693          1470
6 Aldridge-Brownhills    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           477           566
We now have a smaller data frame with 59 × 9 = 531 (constituencies × job types) rows. An obvious first graph is the distribution of vacancies within each constituency:
library(ggplot2)
# reverse the order of the constituencies, so they read alphabetically from top to bottom once the plot is flipped
# (reorder the levels with factor(); assigning rev() via levels<- would relabel the underlying data instead)
byLevel1$region <- factor(byLevel1$region, levels=rev(levels(byLevel1$region)))
p <- ggplot(byLevel1)
p <- p + geom_col(aes(x=region, y=Vacancies2011, fill=level1), position='fill')
p <- p + coord_flip() + scale_fill_brewer(type='qual', palette='Set1')
print(p)
This representation shows the relative proportion of job types within each constituency. It would be misleading to compare the raw number of vacancies across constituencies, for example, since they don’t necessarily represent the same population. I don’t have that data, so I can’t normalise in a sensible manner.
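If population figures per constituency were available, normalising would be a one-line merge. A hypothetical sketch (the `population.csv` file and its column names are made up for illustration):

```r
# hypothetical file with columns 'region' and 'population'
pop <- read.table("population.csv", header=TRUE, sep='\t')
byLevel1 <- merge(byLevel1, pop, by="region")
# vacancies per 1,000 inhabitants
byLevel1$Vacancies2011per1000 <- 1000 * byLevel1$Vacancies2011 / byLevel1$population
```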
Maps!
Since we’re dealing with regional data, wouldn’t it be cool to plot it on a map? geom_map from ggplot2 can help with that, but first we need the boundaries of all 59 constituencies. My office mate helpfully pointed me to mapit, a great service from mysociety.org. If you know the id of an area, mapit can give you its boundaries as a JSON object, which you can easily turn into a data frame with the rjson package. Here’s how I did it:
library(rjson)
library(plyr)

# list of all the areas we need
areas <- read.table("areaname.csv", header=FALSE, sep='\t')
areas <- as.character(areas$V2)

# all UK Parliament constituencies
WMC <- fromJSON(file='http://mapit.mysociety.org/areas/WMC')
# extract name and id
constituencies <- sapply(names(WMC), function(id) WMC[[c(id,'name')]])
# select only those we need
constituencies <- constituencies[which(constituencies %in% areas)]
# id and name
areas <- data.frame(group=names(constituencies), region=constituencies)

# boundaries of all West Midlands constituencies
WestMidlands <- ddply(areas, .(group,region), .fun=function(row){
    x <- fromJSON(file=paste('http://mapit.mysociety.org/area/', row$group, '.geojson', sep=''))
    x <- unlist(x$coordinates)
    n <- length(x)/2
    data.frame(long=x[2*(1:n)-1], lat=x[2*(1:n)])
})
(You can ignore the warnings; they’re all due to a missing end-of-line character.)
> head(WestMidlands)
  group                region      long      lat
1 65563 Shrewsbury and Atcham -3.044652 52.66554
2 65563 Shrewsbury and Atcham -3.044531 52.66568
3 65563 Shrewsbury and Atcham -3.044481 52.66573
4 65563 Shrewsbury and Atcham -3.044355 52.66585
5 65563 Shrewsbury and Atcham -3.044110 52.66605
6 65563 Shrewsbury and Atcham -3.043950 52.66621
Let’s look at the relative change in vacancies for each constituency, per job title:
# compute the relative change
byLevel1$percentChange <- 100*(byLevel1$Vacancies2012 - byLevel1$Vacancies2011)/byLevel1$Vacancies2011

# connect map and data
p <- ggplot(byLevel1, aes(map_id = region))
p <- p + geom_map(aes(fill = percentChange), map = WestMidlands)
p <- p + expand_limits(x = WestMidlands$long, y = WestMidlands$lat)
# colour scale from red to blue, cropped at -100 and 100. "Lab" is easier on the eyes than RGB.
p <- p + scale_fill_gradient2(limits=c(-100,100), name="% change", space="Lab")
# Mercator projection
p <- p + coord_map()
# one panel per job level
p <- p + facet_wrap(~level1)
# remove grid, axes, etc.
# (theme() and element_blank() replace the long-deprecated opts() and theme_blank())
p <- p + theme(
    axis.title.x=element_blank(),
    axis.title.y=element_blank(),
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    axis.text.x=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks=element_blank()
)
print(p)  # might take some time!
which is rather nice for a first go. Some areas appear grey because they’re off the scale: their relative change is over 100%. This hard limit is completely arbitrary, but raising it to 220 (which gets rid of the grey areas) results in very low contrast for the rest of the plot. One could fix that with a capping colour scale, for example.
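One way to cap out-of-range values rather than drop them is the `oob` (out-of-bounds) argument of ggplot2's continuous scales, combined with `scales::squish`, which clamps values to the limits instead of marking them as missing. A sketch, replacing the scale_fill_gradient2 call above:

```r
library(scales)
# values outside [-100, 100] are clamped to the nearest limit instead of shown grey
p <- p + scale_fill_gradient2(limits=c(-100, 100), name="% change",
                              space="Lab", oob=squish)
```

The trade-off is that a 220% increase then becomes visually indistinguishable from a 100% one, so it is worth saying so in the caption.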
> summary(byLevel1$percentChange)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-74.580 -18.200   2.732   7.292  26.780 218.600
> subset(byLevel1, percentChange > 100)
                      region                                           level1 Vacancies2011 Vacancies2012 percentChange
5                Wyre Forest                       SKILLED TRADES OCCUPATIONS           693          1470      112.1212
38  Wolverhampton North East                         PROFESSIONAL OCCUPATIONS            63           173      174.6032
83                    Warley                         PROFESSIONAL OCCUPATIONS            91           209      129.6703
114               The Wrekin    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           872          2236      156.4220
129                 Tamworth ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS           370           743      100.8108
135                 Tamworth                           ELEMENTARY OCCUPATIONS           634          1331      109.9369
190   Stoke-on-Trent Central         MANAGERS, DIRECTORS AND SENIOR OFFICIALS           188           599      218.6170
195   Stoke-on-Trent Central    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           667          1427      113.9430
204  Staffordshire Moorlands    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           363           730      101.1019
312       Mid Worcestershire    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           734          1869      154.6322
315       Mid Worcestershire                           ELEMENTARY OCCUPATIONS          1245          2744      120.4016
380             Dudley North                         PROFESSIONAL OCCUPATIONS            97           246      153.6082
384             Dudley North    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS          1465          3117      112.7645
389           Coventry South                         PROFESSIONAL OCCUPATIONS            90           193      114.4444
521    Birmingham, Edgbaston            PROCESS, PLANT AND MACHINE OPERATIVES          1245          2548      104.6586
We can spot a 200% increase in managerial positions in Stoke-on-Trent from 2011 to 2012! I’ll leave it to the professionals to explain those numbers.
I’m stopping here, but there’s obviously quite a lot more you can do with this data; it all depends on what question you want to ask. Again, this data is available for free, which is rather nice, and crossed with other datasets (like the geographical boundaries here) it lets you do quite a lot in a few lines of code, after some preprocessing work. See for example what Andy Pryke and Sarah Mount did with similar datasets on the day: video.