[This article was first published on
rstats-tips.net, and kindly contributed to
R-bloggers]. (You can report issue about the content on this page
here)
Want to share your content on R-bloggers?
click here if you have a blog, or
here if you don't.
I’m working on an R-package to access the data of a web service.
So I have to handle large data I get back from an API call. The result is
encoded in JSON-format which itself results in a large list of lists (of lists).
But I want to convert these lists of lists into a data.frame or tibble.
Sounds easy …. Here are the caveats I came across.
Some sample data
Let’s first create some sample data. As I mentioned before the result of an API-call
is a list of lists. The inner lists contain lists, too.
So here’s a function that generates such data.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
options(tidyverse.quiet = TRUE)
library(tidyverse)
set.seed(21)
list_entry <- function(dummy) {
list(
sample(letters, 1),
sample(1:100, 1),
list(paste0("Sublist ", round(runif(1, 0, 100))),
paste0("Sublist ", round(runif(1, 0, 100)))
),
runif(1, 0, 100)
)
}
|
Let’s build a small list and another one we can use for performance measurement.
1
2
3
4
5
|
n_small <- 5
list_of_list <-lapply(as.list(1:n_small), list_entry)
n_large <- 100000
long_list_of_list <-lapply(as.list(1:n_large), list_entry)
|
Tidyverse
First we will use the tidyverse
functions to convert the data.
Naive Way
1
|
list_of_list %>% as_tibble()
|
1
|
## Error: Columns 1, 2, 3, 4, and 5 must be named.
|
Uhh, the error occurs because the list does not contain named values.
So we need to generate (or repair) the names.
1
|
list_of_list %>% as_tibble(.name_repair = "universal")
|
1
2
3
4
5
6
|
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
|
1
2
3
4
5
6
7
|
## # A tibble: 4 x 5
## ...1 ...2 ...3 ...4 ...5
## <list> <list> <list> <list> <list>
## 1 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 2 <int [1]> <int [1]> <int [1]> <int [1]> <int [1]>
## 3 <list [2]> <list [2]> <list [2]> <list [2]> <list [2]>
## 4 <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]>
|
Okay, that’s the wrong way. It generates the transposed version of what we want.
So use a version of all of these fantastic map
-function from the purrr
-package:
(Read here for other
fantastic stuff of purrr.)
1
|
list_of_list %>% map_dfr(as_tibble, .name_repair = "universal")
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
|
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
|
1
2
3
4
5
6
7
8
9
10
11
12
13
|
## # A tibble: 10 x 4
## ...1 ...2 ...3 ...4
## <chr> <int> <list> <dbl>
## 1 a 3 <chr [1]> 91.9
## 2 a 3 <chr [1]> 91.9
## 3 p 21 <chr [1]> 66.8
## 4 p 21 <chr [1]> 66.8
## 5 j 94 <chr [1]> 52.5
## 6 j 94 <chr [1]> 52.5
## 7 b 67 <chr [1]> 77.6
## 8 b 67 <chr [1]> 77.6
## 9 j 46 <chr [1]> 77.8
## 10 j 46 <chr [1]> 77.8
|
Look’s nice. But the original list consists of 5 rows. The result above is
twice as long. So what happend? as_tibble
generates for each entry of the list
of the inner lists a row. That’s not what I’ve expected. I would prefer to get a
another column for each entry.
Flatten the Inner List
When we apply as.data.frame
to an inner list the list of each inner list is
converted into two columns:
1
2
3
4
5
|
column_names <- letters[1:5]
list_of_list %>%
map(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>%
bind_rows() %>%
as_tibble()
|
1
2
3
4
5
6
7
8
|
## # A tibble: 5 x 5
## a b c d e
## <chr> <int> <chr> <chr> <dbl>
## 1 a 3 Sublist 18 Sublist 96 91.9
## 2 p 21 Sublist 99 Sublist 85 66.8
## 3 j 94 Sublist 17 Sublist 4 52.5
## 4 b 67 Sublist 1 Sublist 61 77.6
## 5 j 46 Sublist 79 Sublist 57 77.8
|
Yeah, that’s great. That’s what we want. We can even do it with map_dfr
:
1
2
3
|
column_names <- letters[1:5]
list_of_list %>% map_dfr(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>%
as_tibble()
|
1
2
3
4
5
6
7
8
|
## # A tibble: 5 x 5
## a b c d e
## <chr> <int> <chr> <chr> <dbl>
## 1 a 3 Sublist 18 Sublist 96 91.9
## 2 p 21 Sublist 99 Sublist 85 66.8
## 3 j 94 Sublist 17 Sublist 4 52.5
## 4 b 67 Sublist 1 Sublist 61 77.6
## 5 j 46 Sublist 79 Sublist 57 77.8
|
But how performant is this code. Let’s check it with the long_list_of_list
and
measere the time with the package tictoc
.
1
2
3
4
5
6
|
library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>%
as_tibble()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## # A tibble: 100,000 x 5
## a b c d e
## <chr> <int> <chr> <chr> <dbl>
## 1 w 71 Sublist 93 Sublist 51 74.5
## 2 p 98 Sublist 65 Sublist 34 50.8
## 3 a 49 Sublist 51 Sublist 6 15.1
## 4 u 92 Sublist 77 Sublist 41 87.0
## 5 e 91 Sublist 63 Sublist 83 13.0
## 6 b 68 Sublist 7 Sublist 3 92.2
## 7 x 63 Sublist 13 Sublist 82 92.3
## 8 l 50 Sublist 53 Sublist 97 52.8
## 9 m 85 Sublist 92 Sublist 23 92.8
## 10 t 53 Sublist 99 Sublist 6 86.3
## # … with 99,990 more rows
|
WHAT? One minute for 100,000 rows? That’s long. What can we change?
1
2
3
4
5
|
library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x = unlist(x); x = set_names(x, column_names); x}) %>%
as_tibble()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## # A tibble: 100,000 x 5
## a b c d e
## <chr> <chr> <chr> <chr> <chr>
## 1 w 71 Sublist 93 Sublist 51 74.5060192188248
## 2 p 98 Sublist 65 Sublist 34 50.765589158982
## 3 a 49 Sublist 51 Sublist 6 15.1016460731626
## 4 u 92 Sublist 77 Sublist 41 87.0233365567401
## 5 e 91 Sublist 63 Sublist 83 13.0497927311808
## 6 b 68 Sublist 7 Sublist 3 92.2474714694545
## 7 x 63 Sublist 13 Sublist 82 92.2626771498471
## 8 l 50 Sublist 53 Sublist 97 52.8331008506939
## 9 m 85 Sublist 92 Sublist 23 92.7664576098323
## 10 t 53 Sublist 99 Sublist 6 86.2593436148018
## # … with 99,990 more rows
|
Three seconds – that’s better!
But the downside is we lost the different types of the columns. All columns
are converted to character
.
Don’t lose the types
The solution is still within purrr
: We can use the functionflatten
:
1
2
3
4
5
|
library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x = flatten(x); x = set_names(x, column_names); x}) %>%
as_tibble()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## # A tibble: 100,000 x 5
## a b c d e
## <chr> <int> <chr> <chr> <dbl>
## 1 w 71 Sublist 93 Sublist 51 74.5
## 2 p 98 Sublist 65 Sublist 34 50.8
## 3 a 49 Sublist 51 Sublist 6 15.1
## 4 u 92 Sublist 77 Sublist 41 87.0
## 5 e 91 Sublist 63 Sublist 83 13.0
## 6 b 68 Sublist 7 Sublist 3 92.2
## 7 x 63 Sublist 13 Sublist 82 92.3
## 8 l 50 Sublist 53 Sublist 97 52.8
## 9 m 85 Sublist 92 Sublist 23 92.8
## 10 t 53 Sublist 99 Sublist 6 86.3
## # … with 99,990 more rows
|
Related
What is the best way to read data in JSON format into R? Though really common for almost all modern online applications, JSON is not every R user's best friend. After seeing the slides for my Web Scraping course, in which I somewhat arbitrarily veered between using the packages rjson…
March 13, 2014
In "R bloggers"
When working with data which is a result of json-data converted to a list of lists of lists of lists … (you know what mean ;-)) I often want to convert it a data.frame. Unfortunately there’s often a list in the source data which is unnamed...
February 16, 2022
In "R bloggers"
The jsonlite package is a JSON parser/generator optimized for the web. It implements a bidirectional mapping between JSON data and the most important R data types, which allows for converting objects to JSON and back without manual data restructuring. This is ideal for interacting with web APIs, or to build…
September 5, 2014
In "R bloggers"