Drilling into non-rectangular data with purrr
My PhD is in climate science, and climate data is usually rectangular—or some higher-dimensional analogue of rectangular, anyway. (Blocky?) And since rectangular data is R’s bread and butter, I’ve had a pretty good time of things up until now.
But for the last few months I’ve been working on Is It Hot Right Now with my colleagues, Mat and Stefan, and that’s forced me to get to know data formats that are a bit less familiar to R but ubiquitous on the web: namely, XML and JSON.
Luckily, the xml2 and RJSONIO packages make accessing XML and JSON data really easy, and with purrr (part of the tidyverse), we can reduce them down to something useful really quickly. Like a yummy, rich… bolognese of data, I guess?
Let’s look at some basic examples and then ramp it up.
JSON: it’s lists all the way down
JSON is a data format that’s designed to mimic the way JavaScript stores objects. Here’s an example we’re using on Is It Hot Right Now:
JSON can store data in Arrays (denoted by [square brackets]) or Objects (denoted by {curly brackets}). This example doesn’t illustrate it, but you can also nest Arrays and Objects as deeply as you want.
Arrays and Objects both map naturally to R’s lists, since you can access list elements by a numeric position or a name. So when you load JSON data in R using RJSONIO, you just get a hierarchy of lists:
UPDATE: Maëlle Salmon pointed me to the excellent jsonlite package, which works the same way as RJSONIO but can also process JSON arrays and data frame-like structures into native R vectors and data frames. It could save you a lot of time for simple files!
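Here’s a minimal sketch of what that list hierarchy looks like. It uses jsonlite rather than RJSONIO, and the station records are invented for illustration (they are not the real Is It Hot Right Now data). Setting simplifyVector = FALSE is my assumption for getting the plain list-of-lists behaviour described above.

```r
library(jsonlite)

# Hypothetical station records, invented for illustration
json_text <- '[
  {"code": "A001", "name": "Alpha Point", "lat": -33.9, "lon": 151.2},
  {"code": "B002", "name": "Bravo Hill",  "lat": -37.8, "lon": 145.0},
  {"code": "C003", "name": "Charlie Bay", "lat": -27.5, "lon": 153.0}
]'

# simplifyVector = FALSE keeps every Array and Object as a plain R list
stations <- fromJSON(json_text, simplifyVector = FALSE)

length(stations)                        # one element per station
names(stations[[1]])                    # the named fields inside the first station
first_name <- stations[[1]][["name"]]   # get a single element out

# With jsonlite's default simplification, the same text becomes a data frame
stations_df <- fromJSON(json_text)
```

The second call shows the shortcut mentioned in the update: for a flat Array of Objects like this one, the default settings hand back a data frame directly.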
So this is a list of lists. We can look at one of the outer list elements (a single station) and see that it’s a list with a bunch of named elements inside. We can get an element out:
We can also use purrr’s amazing pluck function to dig as far as we need into lists of lists in a more readable way. For example, to get the name from the third station in the list:
(If you haven’t used the pipe operator %>% from magrittr before, it takes the thing on the left and squeezes it into the function on the right before that function’s other arguments. If you want the thing on the left somewhere else, you can use . to put it wherever you’d like.)
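As a rough sketch of both ideas (pluck and the pipe), here is a small example. The stations list is a hand-built stand-in with invented values, shaped like the list a JSON parser would return.

```r
library(purrr)  # also re-exports the %>% pipe from magrittr

# Hypothetical stand-in for the parsed JSON: a list of named lists
stations <- list(
  list(code = "A001", name = "Alpha Point"),
  list(code = "B002", name = "Bravo Hill"),
  list(code = "C003", name = "Charlie Bay")
)

# Base R accessors: the third station, then its name
third_name_base <- stations[[3]][["name"]]

# pluck walks the same path, one accessor per argument
third_name <- pluck(stations, 3, "name")

# The pipe supplies the left-hand side as the first argument...
third_name_piped <- stations %>% pluck(3, "name")

# ...unless we place it somewhere else with the . placeholder
label <- third_name %>% paste("Station:", .)
```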
But pulling things out one at a time gets old real quick; we want to automate this. In cases where we trust the input data to be structured in a predictable way (like this example, where we expect that each station in the Array will have the same data in it), we can use map in combination with pluck to get every station’s name:
The map_* functions take lists (or vectors) in, run the supplied function on each element of the input, then return the results of the function in a data structure of your choice. Here, pluck gives us a character vector of length 1 (because, on its own, it deals with one thing at a time). Vanilla map plucks from each object in turn and gives us a list with each character vector back. Since the list we get back just has character vectors in it, we could reduce it all the way down to one vector with unlist():
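A minimal sketch of map with pluck, and of flattening the result, might look like this. The stations list is again an invented stand-in; map_chr is shown as one of the map_* variants that returns a character vector directly.

```r
library(purrr)

# Hypothetical stand-in for the parsed JSON
stations <- list(
  list(code = "A001", name = "Alpha Point"),
  list(code = "B002", name = "Bravo Hill"),
  list(code = "C003", name = "Charlie Bay")
)

# map runs pluck(station, "name") on each station and returns a list
name_list <- stations %>% map(pluck, "name")

# unlist flattens that list of length-1 character vectors into one vector
name_vec <- name_list %>% unlist()

# map_chr does both steps at once, and errors if a result isn't a single string
name_vec_chr <- stations %>% map_chr(pluck, "name")
```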
Usually we pluck(something, "like this"), but map can pass arguments along to the function you want to run. So in this case, map passes "name" onto pluck.
A different way to do this is using what’s called an anonymous function: instead of calling a here’s-one-I-prepared-earlier function, we define one on the spot. map gives a shortcut to do this using the tilde (~):
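As a sketch, the tilde version of the name lookup could be written like this (same invented stations list as a stand-in for the real data):

```r
library(purrr)

# Hypothetical stand-in for the parsed JSON
stations <- list(
  list(code = "A001", name = "Alpha Point"),
  list(code = "B002", name = "Bravo Hill"),
  list(code = "C003", name = "Charlie Bay")
)

# Named-function version: map hands each station to pluck for us
names_direct <- stations %>% map(pluck, "name")

# Tilde version: we write the call ourselves and mark the station with .
names_tilde <- stations %>% map(~ pluck(., "name"))
```

Both calls should give back the same list; the tilde form just makes the hand-off explicit.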
That looks a bit more complicated. In the first version, map passed each element of the list onto pluck automatically (just like the pipe operator, %>%, does); in the second, we have to do it ourselves using the . pronoun.
But the second version is also more powerful, not just because we can use functions that don’t expect the data from map to go first, but because now we can pipe functions together inside map. For example:
Here, one pipe is stations %>% map() %>% unlist() %>% head(); the other is pluck() %>% paste(). The data pronoun . is passed into map from the outer pipe.
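A sketch of those two pipes, one nested inside the other, could look like the following. The stations list and the "Station:" prefix are invented for illustration.

```r
library(purrr)

# Hypothetical stand-in for the parsed JSON
stations <- list(
  list(code = "A001", name = "Alpha Point"),
  list(code = "B002", name = "Bravo Hill"),
  list(code = "C003", name = "Charlie Bay")
)

labels <- stations %>%                  # outer pipe starts here
  map(
    ~ pluck(., "name") %>%              # inner pipe: . is the current station
      paste("Station:", .)              # here . is the name coming down the inner pipe
  ) %>%
  unlist() %>%                          # back on the outer pipe
  head()
```

Note that . means two different things inside the anonymous function: on the pluck line it is the station supplied by map, and on the paste line it is whatever the inner pipe is carrying.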
XML unplucked
To show you some of the more complex things we can do with map and nested pipes, let’s look at a more interesting dataset:
Like JSON, XML can store hierarchies of objects. However, in XML, the objects (or nodes, as XML calls them) are stored like <my_thing some_attribute="some_value">Some contents</my_thing>. So nodes have a name (my_thing), some optional attributes with values, and they have contents (which, like JSON, can be straight-up data or more objects).
Unlike JSON, XML doesn’t map neatly to R’s data structures. So we can’t pluck into XML files, or even use R’s built-in [["accessor"]][["syntax"]]. Instead, we use accessor functions that help us isolate parts of the file:
And then we use functions like xml_attr or xml_text to get at the good bits, as pluck did. But, unlike pluck, these XML functions can be given a group of nodes from xml_find_all, and they’ll return the results from each element in the group. No map needed!
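Here is a small sketch of those accessor functions working on a whole group of nodes at once. The XML document, its node names and its attribute names are all invented for illustration; the real observation files will be laid out differently.

```r
library(xml2)

# Hypothetical observations document, invented for illustration
obs <- read_xml('
<observations>
  <station code="A001" name="Alpha Point">
    <element type="air_temperature">21.4</element>
  </station>
  <station code="B002" name="Bravo Hill">
    <element type="air_temperature">18.9</element>
  </station>
</observations>')

# Select every station node in the document with an XPath expression
station_nodes <- xml_find_all(obs, "//station")

# xml_attr is vectorised over the node set: one value per station
station_names <- xml_attr(station_nodes, "name")

# xml_text returns the contents of each selected node
temps <- xml_text(xml_find_all(obs, "//element[@type='air_temperature']"))
```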
(xml2 uses a syntax called XPath to make all sorts of granular selections of nodes. I’m not going to delve into it too deeply, but if you’re interested, MSDN has a good primer on it.)
I guess we don’t need map after all! Well, not quite…
Nesting pipes
I mentioned before that using map with the tilde syntax allows us to chain functions together and repeat the result across a list of things. But nested pipes can get complicated real fast, as the next example will illustrate.
Let’s say I don’t just want a bunch of station names from my XML file—I want a bunch of useful information, like its latitude and longitude, its timezone and the air temperature.
I’ll probably want to put that into a data frame, and I can! But I only want the stations with codes matching my earlier list:
Now we’re cookin’ with gas! (These are current observations, so your numbers might look a little different.)
Let’s leave the rather convoluted XPath filter aside and focus on the pipes. This code is liberally sprinkled with %>% and ., but there are actually two different pipes going on. (This is a really good argument for consistent indentation style: it helps keep nested pipes straight!)
In the first two lines (18 and 19), we have the sort of pipe we would usually expect: it passes the XML document obs onto the XML selector, xml_find_all, and that passes the selection of nodes onto data_frame. Once we’re inside data_frame, we use the data pronoun . to refer back to the selection from xml_find_all. So the pipe %>% stays outside data_frame, following the first level of indentation, and . stays inside.
But there’s also a pipe operator %>% inside data_frame, on line 26. That’s a new pipe. It’s passing a sub-selection on to xml_text. If we continue piping on outside data_frame, we’re back to the outer pipe, passing the whole data frame on.
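The two pipes described above can be sketched roughly as follows. This is a simplified stand-in rather than the original listing, so its line numbers won’t match the ones mentioned in the text. The XML document, attribute names and station codes are invented, tibble() is used in place of the older data_frame(), and the braces around tibble() are my assumption for stopping the pipe from also inserting the node set as the first argument.

```r
library(xml2)
library(tibble)
library(purrr)  # for the %>% pipe

# Hypothetical observations document, invented for illustration
obs <- read_xml('
<observations>
  <station code="A001" name="Alpha Point" lat="-33.9" lon="151.2" tz="Australia/Sydney">
    <element type="air_temperature">21.4</element>
  </station>
  <station code="X999" name="Unwanted Rock" lat="-30.0" lon="150.0" tz="Australia/Sydney">
    <element type="air_temperature">25.0</element>
  </station>
  <station code="C003" name="Charlie Bay" lat="-28.6" lon="153.6" tz="Australia/Sydney">
    <element type="air_temperature">19.8</element>
  </station>
</observations>')

# Build an XPath filter that keeps only the station codes we care about
wanted <- c("A001", "C003")
xpath <- paste0("//station[", paste0("@code='", wanted, "'", collapse = " or "), "]")

station_df <- obs %>%                       # outer pipe
  xml_find_all(xpath) %>%
  {
    tibble(
      code = xml_attr(., "code"),           # . is the node set from the outer pipe
      name = xml_attr(., "name"),
      lat  = as.numeric(xml_attr(., "lat")),
      lon  = as.numeric(xml_attr(., "lon")),
      tz   = xml_attr(., "tz"),
      air_temp = xml_find_first(., ".//element[@type='air_temperature']") %>%
        xml_text() %>%                      # inner pipe: a sub-selection goes on to xml_text
        as.numeric()
    )
  }
```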
There’s just one problem with all of this: my stations are spread across a bunch of different states, and this XML file is for one state, New South Wales. I want a data frame for all of them!
Something-something Inception
Luckily, the Bureau of Meteorology keeps the other states’ observations in the same place, varying only the third letter of the name: IDN60920.xml for New South Wales, IDV60920.xml for Victoria, and so on.
Now, what can we use to repeat a task across a list?
Yep, it’s map! But this time, we’re returning data frames. I mentioned that there are map_* functions for combining the results of our mapping in different ways, and map_dfr can bind data frames you return by row. So we’re going to take our entire last example, and we’re going to jam it into a map_dfr() call using that magical tilde ~:
Okay, that got a little wild. By my count, we have three pipes going here, and two of them have . operators referring back:
- First, we pipe that vector of letter codes into map_dfr (on line 2). Once we’re inside map_dfr, we use . on line 4 to refer to the current letter (because the functions we give to map_* deal with one list element at a time).
- But we also start a new pipe on lines 4 and 5, inside map_dfr, carrying a node selection into data_frame as we did before. And once we’re inside data_frame, we use . to refer back to the second pipe’s data (lines 8–11, 13).
- And, on line 13, we start a third pipe (as we did before) to get the text from each air_temperature element.
Each of those data frames we made before gets row bound (glued from top-to-bottom) by map_dfr. Ta-da!
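Here is a rough, offline sketch of the three-pipe pattern. It is not the original listing, so the line numbers mentioned above won’t line up with it. Instead of downloading the real state files, it reads from state_xml, a hypothetical named list of tiny invented documents keyed by the letter that varies in the file name. The attribute names and station codes are invented too. map_dfr needs dplyr installed to bind the rows.

```r
library(xml2)
library(tibble)
library(purrr)

# Hypothetical stand-ins for the per-state files, keyed by state letter
state_xml <- list(
  N = '<observations>
         <station code="A001" name="Alpha Point"><element type="air_temperature">21.4</element></station>
         <station code="X999" name="Unwanted Rock"><element type="air_temperature">25.0</element></station>
         <station code="C003" name="Charlie Bay"><element type="air_temperature">19.8</element></station>
       </observations>',
  V = '<observations>
         <station code="B002" name="Bravo Hill"><element type="air_temperature">18.9</element></station>
       </observations>'
)

# XPath filter for the station codes of interest
wanted <- c("A001", "B002", "C003")
xpath <- paste0("//station[", paste0("@code='", wanted, "'", collapse = " or "), "]")

all_obs <- c("N", "V") %>%                  # first pipe: letter codes into map_dfr
  map_dfr(
    ~ read_xml(state_xml[[.]]) %>%          # . is the current letter; second pipe starts
      xml_find_all(xpath) %>%
      {
        tibble(
          state = xml_attr(xml_find_first(., "/observations"), "state", default = NA_character_),
          code  = xml_attr(., "code"),      # . is now the second pipe's node set
          name  = xml_attr(., "name"),
          air_temp = xml_find_first(., ".//element[@type='air_temperature']") %>%
            xml_text() %>%                  # third pipe: text from each temperature element
            as.numeric()
        )
      }
  )
```

In recent versions of purrr, map_dfr is marked as superseded in favour of map() followed by list_rbind(), so that pairing may be worth a look if you’re writing this fresh.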
The important thing for keeping track of all these pipes is that the pipe operator %>% and the data pronoun . appear on the outside and inside of a piped function respectively. In a couple of places the pronoun from one pipe and the operator from another appear on the same line. But if we’re consistent about our indentation, we can always see which pipes they belong to, remembering that the pronoun . appears one level further in (because it’s inside the piped function).
So now we have a totally automated way to bring together the interesting observations from all stations of interest across a number of files. In fact, we did automate it: for Is It Hot Right Now, we schedule R to run a script with this pipe in it every half hour.
Next stop
One of the best things about map is its flexibility: you can use this approach to deal with just about any structured data in R, whether it’s complex objects like regression models, data structures you’ve built yourself or files brought in using other packages.
If you’re looking for more detail, I (like many others) recommend Jenny Bryan’s incomparable purrr tutorial. It covers a lot of the other sophisticated ways you can use purrr. One particular use case from her tutorial that I didn’t cover is roundtripping a data frame list column with map inside a mutate verb, the way you would with a regular data frame column. That’s mostly because I’ve only done it once and I still only 80% understand it.
Think I could’ve done this better? Got a question? Let me know!
Cover image: Jackhammer by Martinus Scriblerus. Licensed under CC BY-NC 2.0