Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In previous blog posts (Hacking dbplyr for CKAN, Getting Open Data into R from CKAN) I have been exploring how to download data from the NHS Scotland open data platform into R. I’ve recently discovered that ROpenSci has a package to help with just this called ckanr and I wish I’d known about it earlier as it is really pretty handy! It certainly would’ve saved me some time if I’d know about it earlier but I suppose the positive I can take from it is that some of the functions in ckanr perform similar functions to the ideas I had so I guess that shows that my ideas are not completely wacky!
How resources are grouped in CKAN
Before we start testing out some code from ckanr, it is important to consider how resources (I am going to call the individual data items such as specific csv files hosted on CKAN as ‘resources’ but I am not sure if this is necessarily correct) on CKAN are grouped up as this helps to understand the design of some functions within ckanr. Resources can be grouped within ‘Datasets’, ‘Groups’, ‘Tags’ and ‘Themes’ (and possibly more that I don’t yet know about). Out of these, it is clear to me that ckanr offers functions for exploring resources by all of these groupings except themes (although I could also be mistaken about this). With this out of the way, let’s delve into some code.
Initialising ckanr and exploring groups of resources
There are, of course, many different data portals that are powered by CKAN so the first thing we need to do with the ckanr package is to tell it which URL to use by default with ckanr_setup()
. Note that if you are working in a place where you need to use a proxy to connect R to the internet, this can also be set within ckanr_setup()
using the proxy
argument.
library(tidyverse) library(ckanr) ckanr_setup(url = "https://www.opendata.nhs.scot")
Now we can explore the groupings available in the NHS Scotland CKAN website with group_list()
, package_list()
and tag_list()
; from the Figure above, these correspond to ‘Groups’, ‘datasets’ and ‘Tags’ respectively. Note that I only show 10 records in each case to keep things concise.
# View available groups and packages/datasets group_list(as = "table")[1:10] # [1] "acute-hospital-activity" "cancer" # [3] "child-health" "dental-care" # [5] "deprivation" "emergency-care" # [7] "general-practice" "geography" # [9] "hospital-activity" "infection-disease-and-virus-surveillance" package_list(as = "table")[1:10] # [1] "18-weeks-referral-to-treatment" # [2] "27-30-month-review-statistics" # [3] "alcohol-related-hospital-statistics-scotland" # [4] "allied-health-professionals-musculoskeletal-waiting-times" # [5] "allied-health-professional-vacancies" # [6] "annual-cancer-incidence" # [7] "births-in-scottish-hospitals" # [8] "cancelled-planned-operations" # [9] "cancer-mortality" # [10] "cancer-waiting-times" tag_list(as = "table")$name[1:10] # [1] "31 day" "62 day" # [3] "address" "adolescent" # [5] "adult" "age" # [7] "agenda for change" "agenda for change band" # [9] "ahp" "ailment"
Being able to see the names related to different groupings is useful if you want to download data from all resources in a particular group. I’ll give an example of doing this later but first I want to mimic some of the things I done in previous blog posts but using ckanr.
Connect to CKAN with dplyr and download from one resource
Let’s demonstrate downloading from one resource using the fairly small Patients Referred dataset within Allied Health Professionals – Musculoskeletal Waiting Times which has resource ID 3988df43-3516-4190-93da-16189db7329a
. We start by using src_ckan()
to create a connection to CKAN (similar to how you would do so for other non-CKAN databases). From there, you can download data in a similar way to when using dbplyr but using a CKAN resource ID instead of a database table name.
ckan <- src_ckan("https://www.opendata.nhs.scot") res_id <- "3988df43-3516-4190-93da-16189db7329a" dplyr::tbl(src = ckan$con, from = res_id) %>% as_tibble() # A tibble: 1,115 x 9 # `_id` HBT2014 ReferralsPerOne~ `_full_text` Specialty NumberOfReferra~ NumberOfReferra~ # <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> # 1 1 S08000~ 2890. '2015q3':9 ~ All AHP ~ 8904 d # 2 2 S08000~ 411. '1267':5 '2~ Chiropod~ 1267 "" # 3 3 S08000~ 94.4 '2015q3':3 ~ Occupati~ 291 "" # 4 4 S08000~ 178. '178.17':5 ~ Orthotics 549 "" # 5 5 S08000~ 2206. '2015q3':2 ~ Physioth~ 6797 "" # 6 6 S08000~ 1530. '1472':7 '1~ All AHP ~ 1472 d # 7 7 S08000~ 165. '159':1 '16~ Orthotics 159 "" # 8 8 S08000~ 1365. '1313':2 '1~ Physioth~ 1313 "" # 9 9 S08000~ 2562. '2015q3':7 ~ All AHP ~ 3212 d # 10 10 S08000~ 197. '197.02':1 ~ Chiropod~ 247 "" # # ... with 1,105 more rows, and 2 more variables: Quarter <chr>, # # ReferralsPerOneHundredThousandPopulationQF <chr>
The variables look like they’ve been downloaded in a bit of a random order and that _full_text
variable seems to have appeared so this makes me think that ckanr is using SQL to download the data. This is easy enough to confirm.
getAnywhere(tbl.src_ckan) function (src, from, ..., name = NULL) { if (is.null(name)) { tbl_sql("ckan", src = src, from = sql(from), ...) } else { tbl_sql(subclass = "ckan", src = src, from = sql(sprintf("SELECT * FROM \"%s\"", name))) } }
Now let’s try combining this with some basic dplyr functions like select()
and filter()
.
dplyr::tbl(src = ckan$con, from = res_id) %>% select(Quarter, HBT2014) %>% filter(HBT2014 == "S08000015") %>% as_tibble() # A tibble: 89 x 2 # Quarter HBT2014 # <chr> <chr> # 1 2015Q3 S08000015 # 2 2015Q3 S08000015 # 3 2015Q3 S08000015 # 4 2015Q3 S08000015 # 5 2015Q3 S08000015 # 6 2015Q4 S08000015 # 7 2015Q4 S08000015 # 8 2015Q4 S08000015 # 9 2015Q4 S08000015 # 10 2015Q4 S08000015 # # ... with 79 more rows # Warning messages: # 1: Translator is missing window variants of the following aggregate functions: # * all # * any # * cor # * cov # * paste # * sd # # 2: Translator is missing window variants of the following aggregate functions: # * all # * any # * cor # * cov # * paste # * sd # # 3: Translator is missing window variants of the following aggregate functions: # * all # * any # * cor # * cov # * paste # * sd
We get a long list of warnings explaining what you cannot do with the SQL translation available in ckanr but otherwise, works great!
Downloading all resources from a dataset
Often, a dataset on CKAN contains many resources related to the same thing. For example, the Consultant Vacancies dataset (remember you can see all available ‘Datasets’ using package_list()
) contains different csv files for vacancies at different time points.
cons_vac <- package_show("consultant-vacancies", as = "table")$resources cons_vac %>% select(name, id) # name id # 1 Vacancies June 2019 16e27935-325c-471b-89dc-d41c84b3a744 # 2 Vacancies March 2019 ca67b2a4-b2f3-4420-8b77-3771c53b01f4 # 3 Vacancies December 2018 5da80103-4da8-4694-a8b5-2332dfc43e25 # 4 Vacancies September 2018 91d7b780-f2cb-47fb-919f-1c165ed7d301 # 5 Vacancies June 2018 e874f6f4-6cf5-402c-af1d-2d4f26cc669f # 6 Vacancies March 2018 415c2f86-db7c-4c12-9a64-0cd9cf0d9118
Now if you extract the required resource IDs, you can download all of the datasets with some help from the fantastic purrr package.
id_list <- cons_vac$id # Download each resource into a list item cons_vac_list <- map(id_list, ~as_tibble(dplyr::tbl(src = ckan$con, from = .x))) # Check how many variables in each resource map_dbl(cons_vac_list, length) # [1] 12 12 12 14 15 12 # Not all resources have the same structure # Check variable names for a couple that differ map(cons_vac_list[3:4], names) # [[1]] # [1] "WorkforceRegionGrouping" "HB2014" # [3] "HB2014QF" "TotalVacancies" # [5] "_full_text" "Specialty" # [7] "VacanciesGreater6Months" "Date" # [9] "SpecialtyQF" "_id" # [11] "Establishment" "StaffInPost" # # [[2]] # [1] "WorkforceRegionGrouping" "HB2014" # [3] "HB2014QF" "TotalVacancies" # [5] "TotalVacanciesQF" "_full_text" # [7] "Specialty" "EstablishmentQF" # [9] "VacanciesGreater6Months" "Date" # [11] "SpecialtyQF" "_id" # [13] "Establishment" "StaffInPost" # TotalVacanciesQF and EstablishmentQF not in resource 3 but are in resource 4 # Combine just the first 3 resources which look like they all have the same structure bind_rows(cons_vac_list[1:3]) # A tibble: 1,822 x 12 # WorkforceRegion~ HB2014 HB2014QF TotalVacancies `_full_text` # <chr> <chr> <chr> <dbl> <chr> # 1 National Bodies~ SB0806 "" 0 '0':9 '1.4'~ # 2 Scotland S9200~ d 0 '0':5 '2019~ # 3 Scotland S9200~ d 0 '0':5 '2.1'~ # 4 National Bodies~ SB0807 "" 0 '0':9 '0.3'~ # 5 North S0800~ "" 0 '0':5 '0.1'~ # 6 National Bodies~ SB0808 "" 0 '0':9 '1.2'~ # 7 East S0800~ "" 0 '0':3 '12.1~ # 8 East S0800~ "" 0 '0':3 '1.7'~ # 9 National Bodies~ SB0804 "" 0 '0':1 '1':9~ # 10 National Bodies~ SB0807 "" 0 '0':9 '0.4'~ # ... with 1,812 more rows, and 7 more variables: Specialty <chr>, # VacanciesGreater6Months <dbl>, Date <dbl>, SpecialtyQF <chr>, # `_id` <chr>, Establishment <dbl>, StaffInPost <dbl>
So that is a basic workflow using ckanr alongside functions from purrr for combining related resources into one dataset. I’ve also presented some ways of checking consistency in the structure of those datasets (an essential step when trying to do something like this) and in this case, not all of the datasets were the same so I just combined the most recent 3 datasets for consultant vacancies at the end for simplicity here; in reality you might want to look at ways to make all of the data consistent first and then combine them up but I’ll leave that data wrangling exercise up to the interested reader.
My final verdict: ckanr is definitely recommended for working with data from CKAN!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.