Site icon R-bloggers

How to use jailbreakr

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

What is jailbreakr

The jailbreakr package is probably one of the most interesting packages I came across recently. This package makes it possible to extract messy data from spreadsheets. What is meant by messy? I am sure you already had to deal with spreadsheets that contained little tables inside a single sheet for example. As far as I know, there is no simple way of extracting these tables without having to fiddle around a lot. This is now over with jailbreakr. Well not entirely, because jailbreakr is still in development, but it works well already. If you want to know more about the planned features, you can watch the following video by Jenny Bryan, one of the package’s authors.

Installation and data

You will have to install the package from Github, as it is not on CRAN yet. Here is the Github link. To install the package, just run the following commands in an R console:

devtools::install_github(c("hadley/xml2",
                           "rsheets/linen",
                           "rsheets/cellranger",
                           "rsheets/rexcel",
                           "rsheets/jailbreakr"))

If you get the following error:

devtools::install_github("hadley/xml2")
Downloading GitHub repo hadley/xml2@master
from URL https://api.github.com/repos/hadley/xml2/zipball/master
Error in system(full, intern = quiet, ignore.stderr = quiet, ...) : 
    error in running command

and if you’re on a GNU+Linux distribution try to run the following command:

options(unzip = "internal")

and then run github_install() again.

As you can see, you need some other packages to make it work. Now we are going to get some data. We are going to download some time series from the European Commission, data I had to deal with recently. Download the data by clicking here and look for the spreadsheet titled Investment_total_factors_nace2.xlsx. The data we are interested in is on the second sheet, named TOT. You cannot import this sheet easily into R because there are four tables on the same sheet. Let us use jailbreakr to get these tables out of the sheet and into nice, tidy, data frames.

jailbreakr to the rescue

The first step is to read the data in. For this, we are going to use the rexcel package, which is also part of the rsheets organization on Github that was set up by Jenny Brian and Rich Fitzjohn, the authors of these packages. rexcel imports the sheet you want but not in a way that is immediately useful to you. It just gets the sheet into R, which makes it then possible to use jailbreakr’s magic on it. First, let’s import the packages we need:

library("rexcel")
library("jailbreakr")

We need to check which sheet to import. There are two sheets, and we want to import the one called TOT, the second one. But is it really the second one? I have noticed that sometimes, there are hidden sheets which makes importing the one you want impossible. So first, let use use another package, readxl and its function excel_sheets() to make sure we are extracting the sheet we really need:

sheets <- readxl::excel_sheets(path_to_data)

tot_sheet <- which(sheets == "TOT")

print(tot_sheet)
## [1] 3

As you can see, the sheet we want is not the second, but the third! Let us import this sheet into R now (this might take more time than you think; on my computer it takes around 10 seconds):

my_sheet <- rexcel_read(path_to_data, sheet = tot_sheet)

Now we can start using jailbreakr. The function split_sheet() is the one that splits the sheet into little tables:

tables <- split_sheet(my_sheet)
str(tables)
## List of 4
##  $ :Classes 'worksheet_view', 'R6' <worksheet_view>
##   Public:
##     cells: active binding
##     clone: function (deep = FALSE) 
##     data: NULL
##     dim: 34 28
##     header: NULL
##     idx: list
##     initialize: function (sheet, xr, filter, header, data) 
##     lookup: active binding
##     lookup2: active binding
##     merged: active binding
##     sheet: worksheet, R6
##     table: function (col_names = TRUE, ...) 
##     values: function () 
##     xr: cell_limits, list 
##  $ :Classes 'worksheet_view', 'R6' <worksheet_view>
##   Public:
##     cells: active binding
##     clone: function (deep = FALSE) 
##     data: NULL
##     dim: 33 28
##     header: NULL
##     idx: list
##     initialize: function (sheet, xr, filter, header, data) 
##     lookup: active binding
##     lookup2: active binding
##     merged: active binding
##     sheet: worksheet, R6
##     table: function (col_names = TRUE, ...) 
##     values: function () 
##     xr: cell_limits, list 
##  $ :Classes 'worksheet_view', 'R6' <worksheet_view>
##   Public:
##     cells: active binding
##     clone: function (deep = FALSE) 
##     data: NULL
##     dim: 32 28
##     header: NULL
##     idx: list
##     initialize: function (sheet, xr, filter, header, data) 
##     lookup: active binding
##     lookup2: active binding
##     merged: active binding
##     sheet: worksheet, R6
##     table: function (col_names = TRUE, ...) 
##     values: function () 
##     xr: cell_limits, list 
##  $ :Classes 'worksheet_view', 'R6' <worksheet_view>
##   Public:
##     cells: active binding
##     clone: function (deep = FALSE) 
##     data: NULL
##     dim: 33 28
##     header: NULL
##     idx: list
##     initialize: function (sheet, xr, filter, header, data) 
##     lookup: active binding
##     lookup2: active binding
##     merged: active binding
##     sheet: worksheet, R6
##     table: function (col_names = TRUE, ...) 
##     values: function () 
##     xr: cell_limits, list

tables is actually a list containing worksheet_view objects. Take a look at the dim attribute: you see the dimensions of the tables there. When I started using jailbreakr I was stuck here. I was looking for the function that would extract the data frames and could not find it. Then I watched the video and I understood what I had to do: a worksheet_view object has a values() method that does the extraction for you. This is a bit unusual in R (it made me feel like I was using Python); maybe in future versions this values() method will become a separate function of its own in the package. What happens when we use values()?

library("purrr")
list_of_data <-  map(tables, (function(x)(x$values())))
map(list_of_data, head)
## [[1]]
##      [,1]     [,2]    [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
## [1,] "TOT"    NA      NA    NA    NA    NA    NA    NA    NA    NA   
## [2,] "DEMAND" 33603   33969 34334 34699 35064 35430 35795 36160 36525
## [3,] "FDEMT"  "FDEMN" NA    NA    NA    NA    NA    NA    NA    NA   
## [4,] "EU"     ":"     16.9  -1.4  20.2  34.5  31.4  37.5  39    37.3 
## [5,] "EA"     ":"     15.5  -13.1 14.8  30.9  25.1  35.2  39.2  37.1 
## [6,] "BE"     ":"     ":"   ":"   ":"   ":"   ":"   ":"   42.3  43.1 
##      [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## [1,] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
## [2,] 36891 37256 37621 37986 38352 38717 39082 39447 39813 40178 40543
## [3,] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
## [4,] 39.2  27.5  20.6  21.4  29.8  26.4  32.5  47.1  19    -1.3  23.5 
## [5,] 39.5  25.3  18.2  18.9  27.4  23    28.2  46.1  12.3  -9.3  19.3 
## [6,] 45.8  42.2  42.9  43.8  45.8  47.4  49.1  50.9  48.2  46.9  46.3 
##      [,22] [,23] [,24] [,25] [,26] [,27] [,28] 
## [1,] NA    NA    NA    NA    NA    NA    NA    
## [2,] 40908 41274 41639 42004 42369 42735 43100 
## [3,] NA    NA    NA    NA    NA    NA    NA    
## [4,] 29    22    21.1  25.6  31.8  22.9  "30.7"
## [5,] 26.2  18.6  15.7  21.7  28.8  17.3  26.6  
## [6,] 46.8  47.1  48.2  50.1  49.2  34.5  34.4  
## 
## [[2]]
##      [,1]        [,2]    [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
## [1,] "FINANCIAL" 33603   33969 34334 34699 35064 35430 35795 36160 36525
## [2,] "FFINT"     "FFINN" NA    NA    NA    NA    NA    NA    NA    NA   
## [3,] "EU"        ":"     -5.1  -6.2  2.7   6.7   9     14.4  13.9  14   
## [4,] "EA"        ":"     -8.8  -13.5 -3.4  2.6   5.7   12.5  13.2  13.1 
## [5,] "BE"        ":"     ":"   ":"   ":"   ":"   ":"   ":"   21.5  22.4 
## [6,] "BG"        ":"     ":"   ":"   ":"   ":"   ":"   ":"   ":"   ":"  
##      [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## [1,] 36891 37256 37621 37986 38352 38717 39082 39447 39813 40178 40543
## [2,] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
## [3,] 16.4  9.4   7.4   8.1   12.4  8.4   13.6  23.4  4.1   -4    10.9 
## [4,] 16.5  8     6.8   5.1   9.9   4.8   8.4   24.3  -2.8  -10.5 9.3  
## [5,] 20.9  22.3  32.2  33.5  33.8  34.8  35    34.5  37.2  33.5  32.7 
## [6,] ":"   ":"   20.8  24    27.1  28.3  33.4  37.5  37.7  26.6  30.4 
##      [,22] [,23] [,24] [,25] [,26] [,27] [,28] 
## [1,] 40908 41274 41639 42004 42369 42735 43100 
## [2,] NA    NA    NA    NA    NA    NA    NA    
## [3,] 12.4  10.2  8.8   13.4  17.4  6.2   "12.3"
## [4,] 9     7.2   5     11    13.1  -1    6.5   
## [5,] 31.5  32.3  33    31.7  32.2  19.9  20.5  
## [6,] 33.8  35.6  36    41.5  41.6  44.2  43.8  
## 
## [[3]]
##      [,1]        [,2]    [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
## [1,] "TECHNICAL" 33603   33969 34334 34699 35064 35430 35795 36160 36525
## [2,] "FTECT"     "FTECN" NA    NA    NA    NA    NA    NA    NA    NA   
## [3,] "EU"        ":"     39.2  37.6  38.3  40    40.7  42.8  43.5  43.8 
## [4,] "EA"        ":"     39.7  36.2  37.5  41.2  40    44    44.8  44.9 
## [5,] "BE"        ":"     ":"   ":"   ":"   ":"   ":"   ":"   58.8  58.5 
## [6,] "BG"        ":"     ":"   ":"   ":"   ":"   ":"   ":"   ":"   ":"  
##      [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## [1,] 36891 37256 37621 37986 38352 38717 39082 39447 39813 40178 40543
## [2,] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
## [3,] 37    31.1  27.2  30.9  30.4  30.3  27.4  40.5  25.8  23.1  27.4 
## [4,] 37    30.3  27.4  31    29.9  29.7  24.8  41    23.4  19.5  26.4 
## [5,] 58.3  58.4  57.7  59.2  59.6  59.4  60.2  59.5  60.5  57.9  56.3 
## [6,] ":"   ":"   17.3  17.5  21.1  21.5  25.3  28.2  26.1  21    25.3 
##      [,22] [,23] [,24] [,25] [,26] [,27] [,28] 
## [1,] 40908 41274 41639 42004 42369 42735 43100 
## [2,] NA    NA    NA    NA    NA    NA    NA    
## [3,] 28.9  26.3  31.3  32.1  32.1  30.2  "34.6"
## [4,] 28.5  25.9  32.1  32.4  33.1  30.2  36    
## [5,] 56.7  57.7  57.9  58.6  59.1  13.1  13.1  
## [6,] 24.6  26.8  30.4  31.9  34.1  34.8  33.7  
## 
## [[4]]
##      [,1]    [,2]    [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10] [,11]
## [1,] "OTHER" 33603   33969 34334 34699 35064 35430 35795 36160 36525 36891
## [2,] "FOTHT" "FOTHN" NA    NA    NA    NA    NA    NA    NA    NA    NA   
## [3,] "EU"    ":"     2.9   -0.5  3.9   3.9   1     4.1   4.7   7     7.2  
## [4,] "EA"    ":"     2.3   -4.9  1.4   1.3   -2.4  1.1   3.2   5.8   7    
## [5,] "BE"    ":"     ":"   ":"   ":"   ":"   ":"   ":"   14    14.9  15.9 
## [6,] "BG"    ":"     ":"   ":"   ":"   ":"   ":"   ":"   ":"   ":"   ":"  
##      [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## [1,] 37256 37621 37986 38352 38717 39082 39447 39813 40178 40543 40908
## [2,] NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
## [3,] -1.5  6.2   8.1   7.6   1.4   2.4   13.7  -1.9  -3.2  1.1   1.1  
## [4,] -3.7  5.5   7.1   7.2   -2.2  0.4   15.5  -4.6  -8.4  0.3   -3.3 
## [5,] 16.3  22.8  23.1  22.4  24.5  25.3  25.5  26.6  26.6  24.7  24.6 
## [6,] ":"   -2.3  -0.8  2.4   2.9   3.5   4.8   5.5   2.2   3.3   3.2  
##      [,23] [,24] [,25] [,26] [,27] [,28]
## [1,] 41274 41639 42004 42369 42735 43100
## [2,] NA    NA    NA    NA    NA    NA   
## [3,] -1.6  0.9   2.7   1.9   -3.3  "2.1"
## [4,] -2.3  0.6   2.5   2.1   -5.4  1.7  
## [5,] 26.4  25.9  25    25.3  4.7   5.2  
## [6,] 5.9   7     8.2   9.6   9.4   9.1

We are getting really close to something useful! Now we can get the first table and do some basic cleaning to have a tidy dataset:

dataset1 <- list_of_data[[1]]

dataset1 <- dataset1[-c(1:3), ]
dataset1[dataset1 == ":"] <- NA
colnames(dataset1) <- c("country", seq(from = 1991, to = 2017))

head(dataset1)
##      country 1991 1992 1993  1994 1995 1996 1997 1998 1999 2000 2001 2002
## [1,] "EU"    NA   16.9 -1.4  20.2 34.5 31.4 37.5 39   37.3 39.2 27.5 20.6
## [2,] "EA"    NA   15.5 -13.1 14.8 30.9 25.1 35.2 39.2 37.1 39.5 25.3 18.2
## [3,] "BE"    NA   NA   NA    NA   NA   NA   NA   42.3 43.1 45.8 42.2 42.9
## [4,] "BG"    NA   NA   NA    NA   NA   NA   NA   NA   NA   NA   NA   39.6
## [5,] "CZ"    NA   NA   NA    NA   NA   NA   NA   NA   NA   NA   NA   54.9
## [6,] "DK"    49.5 45   50    59.5 62.5 55.5 60.5 57.5 56   61.5 57.5 59.5
##      2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## [1,] 21.4 29.8 26.4 32.5 47.1 19   -1.3 23.5 29   22   21.1 25.6 31.8 22.9
## [2,] 18.9 27.4 23   28.2 46.1 12.3 -9.3 19.3 26.2 18.6 15.7 21.7 28.8 17.3
## [3,] 43.8 45.8 47.4 49.1 50.9 48.2 46.9 46.3 46.8 47.1 48.2 50.1 49.2 34.5
## [4,] 43   42.8 45.5 49.1 52.6 50.7 39.5 45.5 47.4 45.6 50.5 51.4 49.9 53.2
## [5,] 37   48.5 67.9 66.4 66.8 69.3 64.7 61   56   47.5 53   53.5 67.5 58  
## [6,] 53.5 50   59   64   63   56   33.5 57   47   48   52   45.5 40.5 36.5
##      2017  
## [1,] "30.7"
## [2,] 26.6  
## [3,] 34.4  
## [4,] 52.8  
## [5,] 59.5  
## [6,] 37.5

Et voilà! We went from a messy spreadsheet to a tidy dataset in a matter of minutes. Even though this package is still in early development and not all the features that are planned are available, the basics are there and can save you a lot of pain!

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.