Site icon R-bloggers

Getting Open Data into R from CKAN

[This article was first published on R on Alan Yeung, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Preamble

I’ve got lots of rough pieces of R code written as I’ve been exploring/testing various things in the past. A lot of this is currently stored in a pretty disorganised fashion so I thought it would be a good idea to start writing some of these up into blog posts – at the very least, this should make it easier for me to find things later! To start with, I am writing a short post here about how to download data from the CKAN API directly into R – CKAN is an open source data portal platform (basically a tool for making open data websites) and the reason I became interested in it is because this is the platform that NHS Scotland has chosen to host their open data.

Open Data in Scotland

It is becoming increasingly important for public sector organisations to make their data open. In Scotland, this is clear from the Scottish Government’s open data strategy. They believe that providing open data not only aids transparency, but should also result in wider social and economic benefits in the long run. My opinion generally matches this but I feel that one thing to be wary of is the potential for users to interpret data incorrectly so therefore I feel it is absolutely crucial to provide as much information as possible about the datasets (i.e. provide comprehensive metadata). On balance, the upsides of providing open data should definitely outweigh the downsides, particularly if pragmatic steps are taken to mitigate potential issues such as using the data incorrectly which can lead to producing misleading analyses.

Much work needs to be done on the open data front in Scotland. The level of open data provision by organisations in Scotland is currently lagging behind other parts of the UK in terms of the number of open datasets available but also in terms of how up-to-date the datasets provided are (see this blog post written in February 2019 by Ian Watt for more on this). Encouragingly though the situation is improving. NHS Scotland is contributing to this improvement by launching their open data platform (which I will simply refer to as CKAN from hereafter). All of the data held on this platform is licensed under the UK Open Government Licence (OGL). What you can and can’t do with the data is well described there in detail but to me, the gist of it seems to be that you can more or less do what you want with the data as long as you properly acknowledge the source (please read the detail though rather than just taking my word for it!).

Querying CKAN

Each dataset held on CKAN is assigned a resource ID which uniquely identifies it and to make a query on a particular dataset, you will need to know this. The resource ID can be found in the Additional Information section once you are on a dataset’s page. As an example, let’s say we are interested in downloading data about Data Zone (2011) Population Estimates (this gives data on population estimates for all 6,976 data zones (2011) in Scotland from 2011 to 2017), then, the resource ID for this is c505f490-c201-44bd-abd1-1bd7a64285ee. Note that this resource ID is also contained in the weblink for the dataset so you do not actually need to navigate down to the Additional Information section to find it.

There are three querying methods that you can use which are Custom JSON, SQL and HTSQL. Each of these methods supports different features but I will only discuss some basic querying using Custom JSON and SQL here. The basic idea of making a query on the CKAN API is that you build this into the web URL and the form of the URL will depend on the querying method you decide to use.

Querying with Custom JSON

For Custom JSON, the URL should take the form
https://www.opendata.nhs.scot/api/3/action/ +
datastore_search? +
resource_id=long-id-number +
&your-CustomJSON-query

Let’s test this in R by downloading the Data Zone (2011) Population Estimates dataset. Two R packages are required to help with this: httr and jsonlite. The httr package is needed to work with the API and jsonlite is needed to convert the downloaded JSON data into an R object. I will also load the tidyverse package for good measure!

library(httr)
library(jsonlite)
library(tidyverse)

url <- paste0("https://www.opendata.nhs.scot/api/3/action/",
              "datastore_search?",
              "resource_id=c505f490-c201-44bd-abd1-1bd7a64285ee")
page <- GET(url) # API request
status_code(page) # # Check that the call is successful
# [1] 200
# This means it was successful

# Status codes:
# 1XX - Informational
# 2XX - Success!
# 3XX - Client Error (something’s not right on your end)
# 4XX - Server Error (something’s not right on their end)

# Download the JSON data and convert to an R list
dz2011_list <- fromJSON(url)
# Extract the actual data from the list
dz2011 <- dz2011_list$result$records
glimpse(dz2011, width = 50)
# Observations: 100
# Variables: 97
# $ `_id`     <int> 178, 179, 180, 181, 182, 18...
# $ Year      <int> 2011, 2011, 2011, 2011, 201...
# $ DZ2011    <chr> "S01006593", "S01006594", "...
# $ DZ2011QF  <chr> " ", " ", " ", " ", " ", " ...
# $ Sex       <chr> "Female", "Male", "Female",...
# $ AllAges   <int> 325, 252, 296, 478, 454, 36...
# $ Age0      <int> 2, 2, 6, 8, 2, 4, 4, 3, 1, ...
# $ Age1      <int> 4, 4, 0, 3, 4, 2, 6, 1, 3, ...
# $ Age2      <int> 2, 2, 3, 1, 4, 4, 4, 1, 2, ...
# $ Age3      <int> 6, 1, 0, 5, 1, 4, 4, 2, 1, ...

Note that when I used the glimpse() function, I’ve only shown the first 10 variables as this dataset contains 97 variables and I wanted to keep the output fairly concise (I will do the same for the rest of the examples in this post where appropriate). A further thing to note is that the default setting using Custom JSON is to download only the first 100 records from the dataset which you can also see from the number of observations in the output of glimpse(); if you want to download more rows than this you need to explicitly set this in the query. For example, if you wanted the first 150 rows you can simply add this query into the URL.

url_150r <- paste0(url, "&limit=150")
dz2011_list_150r <- fromJSON(url_150r)
dz2011_150r <- dz2011_list_150r$result$records
glimpse(dz2011_150r, width = 50)
# Observations: 150
# Variables: 97
# $ `_id`    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1...
# $ Year     <int> 2011, 2011, 2011, 2011, 2011...
# $ DZ2011   <chr> "S92000003", "S92000003", "S...
# $ DZ2011QF <chr> "d", "d", " ", " ", " ", " "...
# $ Sex      <chr> "Male", "Female", "Male", "F...
# $ AllAges  <int> 2570300, 2729600, 419, 463, ...
# $ Age0     <int> 30877, 29488, 3, 10, 4, 4, 9...
# $ Age1     <int> 29388, 28294, 5, 8, 6, 0, 3,...
# $ Age2     <int> 30189, 29190, 5, 7, 1, 3, 4,...
# $ Age3     <int> 30173, 29061, 5, 7, 5, 6, 1,...

I am not yet sure how to set up the query in Custom JSON so that it just downloads all the records but I suppose you could just set the limit to a very high number that you know will cover everything.

Querying with SQL

To query with SQL you will obviously need to know how to write SQL queries. I won’t be discussing SQL queries here as the focus is on using the CKAN API. For SQL, the URL should take the form
https://www.opendata.nhs.scot/api/3/action/ +
datastore_search_sql? +
sql= +
your-SQL-query

There are a couple of quirks about how to make SQL query work in R. You need to enclose the name of the resource ID in speech marks (“resource-ID”) so if like me, your default method of creating strings is to use speech marks rather than the apostrophe character (’), you will need to escape the speech mark using the backslash (\”). Similarly, you need to enclose other things such as variable names in speech marks when making more complicated queries. You also need to percent-encode your special characters to make them work as a URL – luckily the URLencode() function makes this easy!

# Download the whole dataset using a SQL query
url_sql <- paste0("https://www.opendata.nhs.scot/api/3/action/",
                  "datastore_search_sql?",
                  "sql=",
                  URLencode("SELECT * from \"c505f490-c201-44bd-abd1-1bd7a64285ee\""))

dz2011_list_sql <- fromJSON(url_sql)
dz2011_sql <- dz2011_list_sql$result$records
glimpse(dz2011_sql, width = 50)
# Observations: 97,678
# Variables: 98
# $ Age28 <chr> "33649", "35002", "12", "6", "6...
# $ Age29 <chr> "33865", "35698", "8", "10", "1...
# $ Age26 <chr> "34410", "35491", "6", "11", "1...
# $ Age27 <chr> "33302", "34274", "11", "7", "5...
# $ Age24 <chr> "35129", "36098", "9", "7", "1"...
# $ Age25 <chr> "35166", "35492", "6", "6", "2"...
# $ Age22 <chr> "36117", "36450", "6", "3", "3"...
# $ Age23 <chr> "36710", "37038", "4", "3", "3"...
# $ Age20 <chr> "37785", "37513", "5", "3", "6"...
# $ Age21 <chr> "36354", "36386", "3", "9", "4"...

Notice that when you use SQL to query CKAN, all 97,678 records are downloaded from the dataset (unless you explicitly tell it not to) so therefore, the query took much longer to finish. The data has been extracted but with a couple of strange issues. The variables are in a weird order and there are 98 variables here instead of the 97 we got previously – it seems to have extracted a variable called _full_text. I don’t know why it does this but I suppose these issues can be sorted out easily enough with some dplyr. Let’s also make a query to download just one variable to demonstrate how to write a marginally more complicated SQL query.

# Download only the DZ2011 variable
url_sql_1v <- paste0("https://www.opendata.nhs.scot/api/3/action/",
                     "datastore_search_sql?",
                     "sql=",
                     URLencode("SELECT \"DZ2011\" from \"c505f490-c201-44bd-abd1-1bd7a64285ee\""))

dz2011_list_1v <- fromJSON(url_sql_1v)
dz2011_1v <- dz2011_list_1v$result$records
glimpse(dz2011_1v, width = 50)
# Observations: 97,678
# Variables: 1
# $ DZ2011 <chr> "S92000003", "S92000003", "S01...

No issues with this query and everything was extracted as expected.

Conclusions and Further Ideas

This post has given a little bit of background to open data in Scotland as well as an introduction to downloading NHS Scotland open data directly into R using simple queries with the CKAN API but there are, of course, much fancier things that you can do when constructing queries (e.g. joins on datasets using SQL or using the API to search CKAN for datasets with a particular tag) which I’ve not covered. I may cover some of these in a future post but for now, the interested reader can trawl through the CKAN user guide to find out more about features like that! As a final point, I think it would be neat to write some wrapper functions in R to make it easier to make queries to CKAN (this could even be packaged up) – it might be a future project! I envision that this would work similarly to the opendatascot package which has been built to help with extracting data from statistics.gov.scot. Anyway that’s for another day.

To leave a comment for the author, please follow the link and comment on their blog: R on Alan Yeung.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.