An exercise in R using local open data
Last week I went to the “Government Open Data Hack Day” ([@,#]godhd on Twitter) in Birmingham (UK), organised by Gavin Broughton and James Catell. The idea was to get hold of local open data and make something useful out of it in just one day. You can see some of the work done on the day presented here. It was good fun and I learned a few tricks and resources in the process, which I’m going to go through in this post. I’ll refrain from any data analysis, because I know next to nothing about this type of data. Instead, I’m going to explain how to go from the raw format (in this case an Excel sheet) to something useful and exploitable.
(all files here)
The data I was given come from nomis and consist of job vacancies in the West Midlands for the years 2011 and 2012, broken down by job type. The spreadsheet lists 353 job types for 59 constituencies, one after the other:
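The file alternates between a header line naming the constituency and a block of job-type rows. Roughly, it looks like this (an illustrative sketch of the layout, not the exact file contents):

```
area name : Aldridge-Brownhills
occupation                                        2011   2012   change  change%
1111 : Senior officials in national government       0      0        0        0
1121 : Production, works and maintenance managers   10      4       -6      -60
...
area name : Birmingham, Edgbaston
...
```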
From the spreadsheet to an R data frame
The first thing to do is to turn this into an R data frame for easier manipulation and reshaping. Luckily, each dataset follows the exact same pattern, so it’s easy to extract the name of each constituency and each job type into two files with a simple grep, and then combine both files into a data frame in R. The lines starting with “area name” contain the name of the constituency; those starting with 4 digits contain the job type and the numbers we want. (nomis_jobs_wm_2011_2012.csv is the tab-separated version of the spreadsheet.)
In a terminal:
grep "^area name" nomis_jobs_wm_2011_2012.csv > areaname.csv
grep "^[0-9]\{4\}" nomis_jobs_wm_2011_2012.csv > jobtypes.csv
In R:
# read the data
areas <- read.table("areaname.csv", header=FALSE, sep='\t')
jobs  <- read.table("jobtypes.csv", header=FALSE, sep='\t')

# combine them
areas <- areas$V2
jobs$region <- rep(areas, each=nrow(jobs)/length(areas))
names(jobs) <- c("JobType","Vacancies2011","Vacancies2012","change","changePercent","region")
jobs <- subset(jobs, select=c("JobType","Vacancies2011","Vacancies2012","region"))

# A subtlety here: Excel formatted the numbers with a comma, e.g. 1,234 for 1234,
# so the comma has to be removed for the cast to work properly
jobs$Vacancies2011 <- as.numeric(gsub(",", "", jobs$Vacancies2011))
jobs$Vacancies2012 <- as.numeric(gsub(",", "", jobs$Vacancies2012))
> head(jobs)
                                                       JobType Vacancies2011 Vacancies2012              region
1               1111 : Senior officials in national government             0             0 Aldridge-Brownhills
2 1112 : Directors and chief executives of major organisations             0             0 Aldridge-Brownhills
3                  1113 : Senior officials in local government             0             0 Aldridge-Brownhills
4    1114 : Senior officials of special interest organisations             0             0 Aldridge-Brownhills
5            1121 : Production, works and maintenance managers            10             4 Aldridge-Brownhills
6                              1122 : Managers in construction            47             9 Aldridge-Brownhills
Now that we have the data in one single data frame, it’s much easier to do something with it.
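A quick sanity check never hurts: assuming the grep extracted cleanly, the expected counts follow directly from the description above (353 job types for each of 59 constituencies):

```r
nrow(jobs)                    # should be 353 * 59 = 20827 rows
length(unique(jobs$region))   # should be 59 constituencies
```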
Aggregating the job types
There are 353 job types in total, which is too fine a granularity for us. It turns out that the 4-digit numbers that were so useful for parsing the data are SOC (Standard Occupational Classification) codes and follow a hierarchical pattern, with 1xxx meaning “Managers and Senior Officials”, 2xxx “Professional Occupations”, etc. Somewhere on the internet (I can’t remember where) I tracked down an exploitable list (as in, not a PDF!) of those SOC numbers, which I promptly turned into a tab-separated file.
soc <- read.table("soc.csv", sep='\t', header=TRUE, stringsAsFactors=FALSE)
> head(soc)
  Major.Group Sub.Major.Group Minor.Group Unit...Group                              Group.Title
1           1              NA          NA           NA MANAGERS, DIRECTORS AND SENIOR OFFICIALS
2          NA              11          NA           NA         CORPORATE MANAGERS AND DIRECTORS
3          NA              NA         111           NA    Chief Executives and Senior Officials
4          NA              NA          NA         1115    Chief executives and senior officials
5          NA              NA          NA         1116     Elected officers and representatives
6          NA              NA         112           NA        Production Managers and Directors
Now that we have the description of each level in the SOC code, we can aggregate the 353 jobs into, for example, the 9 job types of level 1 (‘Major Group’).
# select the job types in the major group
level1 <- subset(soc, !is.na(Major.Group), select=c("Major.Group","Group.Title"))
# build a look-up table to go from a digit to a job type
lookup <- 1:nrow(level1)
lookup[level1$Major.Group] <- level1$Group.Title
lookup <- factor(lookup, levels=lookup)
> lookup
[1] MANAGERS, DIRECTORS AND SENIOR OFFICIALS         PROFESSIONAL OCCUPATIONS
[3] ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS ADMINISTRATIVE AND SECRETARIAL OCCUPATIONS
[5] SKILLED TRADES OCCUPATIONS                       CARING, LEISURE AND OTHER SERVICE OCCUPATIONS
[7] SALES AND CUSTOMER SERVICE OCCUPATIONS           PROCESS, PLANT AND MACHINE OPERATIVES
[9] ELEMENTARY OCCUPATIONS
9 Levels: MANAGERS, DIRECTORS AND SENIOR OFFICIALS PROFESSIONAL OCCUPATIONS ... ELEMENTARY OCCUPATIONS
library(plyr)

# add a column 'level1' to jobs, containing one of the 9 possible job titles
jobs$level1 <- lookup[as.numeric(sapply(jobs$JobType,
                                        function(s) substr(s, 1, 1),
                                        USE.NAMES=FALSE))]
# build a new data frame, byLevel1, with the aggregated data
byLevel1 <- ddply(jobs, .(region, level1), summarise,
                  Vacancies2011=sum(Vacancies2011),
                  Vacancies2012=sum(Vacancies2012))
> head(byLevel1)
               region                                           level1 Vacancies2011 Vacancies2012
1 Aldridge-Brownhills         MANAGERS, DIRECTORS AND SENIOR OFFICIALS           173           134
2 Aldridge-Brownhills                         PROFESSIONAL OCCUPATIONS            97           100
3 Aldridge-Brownhills ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS           548           190
4 Aldridge-Brownhills       ADMINISTRATIVE AND SECRETARIAL OCCUPATIONS           288           202
5 Aldridge-Brownhills                       SKILLED TRADES OCCUPATIONS           693          1470
6 Aldridge-Brownhills    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           477           566
We now have a smaller data frame with 59 × 9 = 531 (constituencies × job types) rows. An obvious first graph is the distribution of vacancies within each constituency:
library(ggplot2)
# reverse the order of the constituencies, so they read alphabetically from top to bottom once the plot is flipped
# (reorder the levels with factor(); assigning rev() via levels<- would relabel the underlying data instead)
byLevel1$region <- factor(byLevel1$region, levels=rev(levels(byLevel1$region)))
p <- ggplot(byLevel1)
p <- p + geom_col(aes(x=region, y=Vacancies2011, fill=level1), position='fill')
p <- p + coord_flip() + scale_fill_brewer(type='qual', palette='Set1')
print(p)
This representation shows the relative proportion of job types within each constituency. It would be misleading to compare the raw number of vacancies across constituencies, for example, since they don’t necessarily represent the same population. I don’t have that data, so I can’t normalise in a sensible manner.
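If population figures per constituency were available, normalising would be a one-line merge. A hypothetical sketch (the `population.csv` file and its column names are made up for illustration):

```r
# hypothetical file with columns 'region' and 'population'
pop <- read.table("population.csv", header=TRUE, sep='\t')
byLevel1 <- merge(byLevel1, pop, by="region")
# vacancies per 1,000 inhabitants
byLevel1$Vacancies2011per1000 <- 1000 * byLevel1$Vacancies2011 / byLevel1$population
```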
Maps!
Since we’re dealing with regional data, wouldn’t it be cool to plot it on a map? geom_map from ggplot2 can help with that, but first we need the boundaries of all 59 constituencies. My office mate helpfully pointed me to mapit, a great service from mysociety.org. If you know the id of an area, mapit can give you its boundaries as a JSON object, which you can easily turn into a data frame with the rjson package. Here’s how I did it:
library(rjson)
library(plyr)

# list of all the areas we need
areas <- read.table("areaname.csv", header=FALSE, sep='\t')
areas <- as.character(areas$V2)

# all UK Parliament constituencies
WMC <- fromJSON(file='http://mapit.mysociety.org/areas/WMC')
# extract name and id
constituencies <- sapply(names(WMC), function(id) WMC[[c(id,'name')]])
# select only those we need
constituencies <- constituencies[which(constituencies %in% areas)]
# id and name
areas <- data.frame(group=names(constituencies), region=constituencies)

# boundaries of all West Midlands constituencies
WestMidlands <- ddply(areas, .(group,region), .fun=function(row){
    x <- fromJSON(file=paste('http://mapit.mysociety.org/area/', row$group, '.geojson', sep=''))
    x <- unlist(x$coordinates)
    n <- length(x)/2
    data.frame(long=x[2*(1:n)-1], lat=x[2*(1:n)])
})
(You can ignore the warnings; they’re all due to a missing end-of-line character.)
> head(WestMidlands)
  group                region      long      lat
1 65563 Shrewsbury and Atcham -3.044652 52.66554
2 65563 Shrewsbury and Atcham -3.044531 52.66568
3 65563 Shrewsbury and Atcham -3.044481 52.66573
4 65563 Shrewsbury and Atcham -3.044355 52.66585
5 65563 Shrewsbury and Atcham -3.044110 52.66605
6 65563 Shrewsbury and Atcham -3.043950 52.66621
Let’s look at the relative change in vacancies for each constituency, per job title:
# compute the relative change
byLevel1$percentChange <- 100*(byLevel1$Vacancies2012 - byLevel1$Vacancies2011)/byLevel1$Vacancies2011

# connect map and data
p <- ggplot(byLevel1, aes(map_id = region))
p <- p + geom_map(aes(fill = percentChange), map = WestMidlands)
p <- p + expand_limits(x = WestMidlands$long, y = WestMidlands$lat)
# colour scale from red to blue, cropped at -100 and 100. "Lab" is easier on the eyes than RGB.
p <- p + scale_fill_gradient2(limits=c(-100,100), name="% change", space="Lab")
# Mercator projection
p <- p + coord_map()
# one panel per job level
p <- p + facet_wrap(~level1)
# remove grid, axes, etc.
# (theme() and element_blank() replace the long-deprecated opts() and theme_blank())
p <- p + theme(
    axis.title.x=element_blank(),
    axis.title.y=element_blank(),
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    axis.text.x=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks=element_blank()
)
print(p)  # might take some time!
which is rather nice for a first go. Some areas appear grey because they’re off the scale: their relative change is over 100%. This hard limit is completely arbitrary, but raising it to 220 (which gets rid of the grey areas) results in very low contrast for the rest of the plot. One could fix that with a capping colour scale, for example.
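One way to cap out-of-range values rather than drop them is the `oob` (out-of-bounds) argument of ggplot2's continuous scales, combined with `scales::squish`, which clamps values to the limits instead of marking them as missing. A sketch, replacing the scale_fill_gradient2 call above:

```r
library(scales)
# values outside [-100, 100] are clamped to the nearest limit instead of shown grey
p <- p + scale_fill_gradient2(limits=c(-100, 100), name="% change",
                              space="Lab", oob=squish)
```

The trade-off is that a 220% increase then becomes visually indistinguishable from a 100% one, so it is worth saying so in the caption.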
> summary(byLevel1$percentChange)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-74.580 -18.200   2.732   7.292  26.780 218.600
> subset(byLevel1, percentChange > 100)
                      region                                           level1 Vacancies2011 Vacancies2012 percentChange
5                Wyre Forest                       SKILLED TRADES OCCUPATIONS           693          1470      112.1212
38  Wolverhampton North East                         PROFESSIONAL OCCUPATIONS            63           173      174.6032
83                    Warley                         PROFESSIONAL OCCUPATIONS            91           209      129.6703
114               The Wrekin    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           872          2236      156.4220
129                 Tamworth ASSOCIATE PROFESSIONAL AND TECHNICAL OCCUPATIONS           370           743      100.8108
135                 Tamworth                           ELEMENTARY OCCUPATIONS           634          1331      109.9369
190   Stoke-on-Trent Central         MANAGERS, DIRECTORS AND SENIOR OFFICIALS           188           599      218.6170
195   Stoke-on-Trent Central    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           667          1427      113.9430
204  Staffordshire Moorlands    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           363           730      101.1019
312       Mid Worcestershire    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS           734          1869      154.6322
315       Mid Worcestershire                           ELEMENTARY OCCUPATIONS          1245          2744      120.4016
380             Dudley North                         PROFESSIONAL OCCUPATIONS            97           246      153.6082
384             Dudley North    CARING, LEISURE AND OTHER SERVICE OCCUPATIONS          1465          3117      112.7645
389           Coventry South                         PROFESSIONAL OCCUPATIONS            90           193      114.4444
521    Birmingham, Edgbaston            PROCESS, PLANT AND MACHINE OPERATIVES          1245          2548      104.6586
We can spot a 200% increase in managerial positions in Stoke-on-Trent from 2011 to 2012! I’ll leave it to the professionals to explain those numbers.
I’m stopping here, but there’s obviously quite a lot more you can do with this data; it all depends on what question you want to ask. Again, this data is available for free, which is rather nice, and crossed with other datasets (like the geographical boundaries here) it lets you do quite a lot in a few lines of code, after some preprocessing work. See for example what Andy Pryke and Sarah Mount did with similar datasets on the day: video.