Plotting Scottish census data with some tidyverse magic

[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers].

I’ve been working with the Scottish census recently to investigate employment in land-based industries (agriculture, forestry and fishing). A friend of mine has recently moved to Dumfries and Galloway, a rural, farming area of Scotland. He’s commented on the ageing population in the area, so I pulled out the age profile from the census for his civil parish. This post shows how to plot an age profile from the Scottish census table KS102SC, which is available online.

First up, let’s load our packages and read in the table. Note that I’ve skipped the first four header lines and coded `-` as NA. In reality the `-` entries are actually 0s, so I’ve used `mutate_all` to replace them.

library(tidyverse)

# Skip the four header lines and read "-" as NA; "-" really means 0 people
df = read_csv("~/Downloads/temp/KS102SC.csv", skip=4, na="-") %>%
   # Replace every NA with 0
   mutate_all(~ replace(., is.na(.), 0))
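
To see what `skip` and `na` are doing without downloading the census file, here is a minimal sketch on a tiny made-up CSV. The file contents, the `parish` column name and the `toy` objects are all hypothetical. It also uses `across()` (dplyr 1.0.0 or later) to limit the replacement to numeric columns, which is a small variation on the `mutate_all` call above rather than the same thing.

```r
library(readr)
library(dplyr)

# Write a hypothetical four-line preamble plus a small table to a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Hypothetical table title",
  "Hypothetical area type",
  "Hypothetical source note",
  "Hypothetical blank-ish line",
  "parish,All people,0 to 4,5 to 7",
  "A,20,12,8",
  "B,3,-,3"
), tmp)

# skip = 4 drops the preamble; na = "-" turns the dash into NA
toy <- read_csv(tmp, skip = 4, na = "-")

# Replace NA with 0 in numeric columns only, leaving parish names untouched
toy_filled <- toy %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 0)))

toy_filled
```

Reading the dash as NA first means the count columns are parsed as numbers rather than text, which is why the replacement happens after the read and not before it.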

Next we can filter to the parish of interest, select the columns we’re interested in, convert these to long format, and force the ordering of the age bands (e.g. 8 to 9 should come before 10 to 14, which alphabetical ordering would get wrong). I’ve piped the output of this munging into ggplot and added some styling and an all-important licence statement.

[Figure: Dalton parish population distribution]

df %>%
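   # X1 and X21 are readr's names for unnamed columns 1 and 21
   # (readr 2.0 and later call them ...1 and ...21 instead)
   filter(X1=="Dalton") %>%
   select(-X1, -`All people`, -`Mean age`, -`Median age`, -X21) %>%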
   gather() %>%
   mutate(key = reorder(key, seq_along(key))) %>%
   ggplot(aes(key, value)) +
   geom_col() +
   labs(title="Dalton parish population distribution",
        subtitle="Contains: Scotland's Census data and Scottish Government data\nlicensed under the Open Government Licence v3.0",
        x="",
        y="People") +
   coord_flip() +
   theme_bw() +
   theme(text=element_text(size=20),
         plot.subtitle=element_text(size=10))
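
The `reorder(key, seq_along(key))` step is easy to skim past, so here is a minimal sketch of what it does on a made-up one-row table. The age band names and counts are assumptions for illustration, not values from KS102SC.

```r
library(dplyr)
library(tidyr)

# Hypothetical one-row table with age bands as column names
toy <- tibble(`0 to 4` = 5, `5 to 7` = 3, `8 to 9` = 2, `10 to 14` = 6)

# gather() stacks the columns into key/value pairs, in column order
long <- gather(toy)

# reorder() turns key into a factor whose levels follow row position,
# so the bands keep the order they had as columns
ordered_long <- long %>%
  mutate(key = reorder(key, seq_along(key)))

levels(ordered_long$key)
```

Without this step ggplot would treat `key` as text and sort it alphabetically, putting 10 to 14 ahead of 5 to 7. `forcats::fct_inorder()` is another way to get levels in order of first appearance.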

It’s also of interest to compare one parish against another, so I compared Dalton against Edinburgh. The code is much as before, but adds an extra point layer to the visualisation. The counts are now converted to proportions of each parish’s total population so the two are comparable.

[Figure: Dalton parish (bars) and Edinburgh (dots) population distribution]

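x = df %>%
   filter(X1=="Dalton" | X1=="Edinburgh") %>%
   select(-`Mean age`, -`Median age`, -X21) %>%
   # Divide every count by the parish total, creating new "<band>_prop" columns
   mutate_at(vars(-X1), list(prop = ~ . / `All people`)) %>%
   select(-`All people_prop`) %>%
   select(X1, ends_with("prop")) %>%
   # Long format, then split the "_prop" suffix off the age band names
   gather(key, value, -X1) %>%
   separate(key, c("key", "drop"), "_") %>%
   # Keep the age bands in their original column order
   mutate(key = reorder(key, seq_along(key)))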

x %>%
   filter(X1=="Dalton") %>%
   ggplot(aes(key, value)) +
   geom_col() +
   geom_point(data=filter(x, X1=="Edinburgh"), aes(key, value)) +
   scale_y_continuous(labels=scales::percent) +
   labs(title="Dalton parish (bars) and Edinburgh (dots) population distribution",
        subtitle="Contains: Scotland's Census data and Scottish Government data\nlicensed under the Open Government Licence v3.0",
        x="",
        y="People") +
   coord_flip() +
   theme_bw() +
   theme(text=element_text(size=20),
         plot.subtitle=element_text(size=10))
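
For anyone who finds the suffix-and-separate route a little roundabout, here is a minimal sketch of the same count-to-proportion idea using `across()` and `pivot_longer()`. This is an alternative under stated assumptions, not the code used for the plot above: it needs dplyr 1.0.0 and tidyr 1.0.0 or later, and the parishes and counts are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical parishes of very different sizes
toy <- tibble(
  parish = c("Small", "Large"),
  `All people` = c(10, 1000),
  `0 to 4` = c(2, 150),
  `5 to 7` = c(8, 850)
)

props <- toy %>%
  # Overwrite each age band with its share of the parish total
  mutate(across(c(-parish, -`All people`), ~ .x / `All people`)) %>%
  select(-`All people`) %>%
  # One row per parish and age band
  pivot_longer(-parish, names_to = "key", values_to = "value")

props
```

Because each parish is divided by its own total, the shares within a parish sum to one, which is what makes a parish of a few hundred people comparable with a city.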

Finally, we can compare distributions for the whole of Scotland against Edinburgh and Dalton using boxplots. I can imagine a beautiful plot with density polygons showing the national data, but I don’t have time to figure it out now!

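[Figure: Dalton parish (purple crosses) and Edinburgh (orange triangles) over Scotland's population distribution]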

x = df %>%
   select(-`Mean age`, -`Median age`, -X21) %>%
   # Same proportion calculation as before, this time for every area in the table
   mutate_at(vars(-X1), list(prop = ~ . / `All people`)) %>%
   select(-`All people_prop`) %>%
   select(X1, ends_with("prop")) %>%
   gather(key, value, -X1) %>%
   separate(key, c("key", "drop"), "_") %>%
   mutate(key = reorder(key, seq_along(key)))

x %>%
   filter(X1!="Scotland") %>%
   ggplot(aes(key, value)) +
   geom_boxplot(colour="grey50") +
   geom_point(data=filter(x, X1=="Dalton"), aes(key, value), colour="purple4", shape=4, stroke=2, show.legend=T) +
   geom_point(data=filter(x, X1=="Edinburgh"), aes(key, value), colour="darkorange2", shape=2, stroke=1.5, show.legend=T) +
   scale_y_continuous(labels=scales::percent) +
   labs(title="Dalton parish (purple crosses) and Edinburgh (orange triangles)\nover Scotland's population distribution",
        subtitle="Contains: Scotland's Census data and Scottish Government data\nlicensed under the Open Government Licence v3.0",
        x="",
        y="People") +
   coord_flip() +
   theme_bw() +
   theme(text=element_text(size=20),
         plot.subtitle=element_text(size=10))
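
One possible starting point for a density-style version is `geom_violin()`, which draws a mirrored density polygon per category. The sketch below is only an illustration of that idea: the data are random numbers, not census values, and the `sim` and `highlight` objects and the age band names are hypothetical.

```r
library(ggplot2)

set.seed(1)
bands <- c("0 to 4", "5 to 7", "8 to 9", "10 to 14")

# Hypothetical stand-in for the national data: 50 random shares per age band
sim <- data.frame(
  key = factor(rep(bands, each = 50), levels = bands),
  value = runif(200, min = 0.01, max = 0.10)
)

# Hypothetical single parish to overlay on the distributions
highlight <- data.frame(
  key = factor(bands, levels = bands),
  value = c(0.03, 0.02, 0.02, 0.05)
)

p <- ggplot(sim, aes(key, value)) +
  geom_violin(fill = "grey80", colour = "grey50") +
  geom_point(data = highlight, colour = "purple4", shape = 4, stroke = 2) +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  theme_bw()

p
```

With the real data, `sim` would be the long-format `x` with the Scotland row filtered out, and `highlight` would be the rows for Dalton or Edinburgh.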
