Converting Lists of Lists of Lists to Data.Frames (or Tibbles)

rstats-tips.net

1 year ago

[This article was first published on rstats-tips.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m working on an R-package to access the data of a web service. So I have to handle large data I get back from an API call. The result is encoded in JSON-format which itself results in a large list of lists (of lists).

But I want to convert these lists of lists into a data.frame or tibble. Sounds easy …. Here are the caveats I came across.

Some sample data

Let’s first create some sample data. As I mentioned before the result of an API-call is a list of lists. The inner lists contain lists, too.

So here’s a function that generates such data.

options(tidyverse.quiet = TRUE)
library(tidyverse)

set.seed(21)

list_entry <- function(dummy) {
  list(
    sample(letters, 1),
    sample(1:100, 1),
    list(paste0("Sublist ", round(runif(1, 0, 100))), 
         paste0("Sublist ", round(runif(1, 0, 100)))
         ),
    runif(1, 0, 100)
  )
}

Let’s build a small list and another one we can use for performance measurement.

n_small <- 5
list_of_list <-lapply(as.list(1:n_small), list_entry)

n_large <- 100000
long_list_of_list <-lapply(as.list(1:n_large), list_entry)

Tidyverse

First we will use the tidyverse functions to convert the data.

Naive Way

1	list_of_list %>% as_tibble()

1	## Error: Columns 1, 2, 3, 4, and 5 must be named.

Uhh, the error occurs because the list does not contain named values. So we need to generate (or repair) the names.

1	list_of_list %>% as_tibble(.name_repair = "universal")

## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5

## # A tibble: 4 x 5
##   ...1       ...2       ...3       ...4       ...5      
##   <list>     <list>     <list>     <list>     <list>    
## 1 <chr [1]>  <chr [1]>  <chr [1]>  <chr [1]>  <chr [1]> 
## 2 <int [1]>  <int [1]>  <int [1]>  <int [1]>  <int [1]> 
## 3 <list [2]> <list [2]> <list [2]> <list [2]> <list [2]>
## 4 <dbl [1]>  <dbl [1]>  <dbl [1]>  <dbl [1]>  <dbl [1]>

Okay, that’s the wrong way. It generates the transposed version of what we want.

So use a version of all of these fantastic map-function from the purrr-package: (Read here for other fantastic stuff of purrr.)

1	list_of_list %>% map_dfr(as_tibble, .name_repair = "universal")

## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4

## # A tibble: 10 x 4
##    ...1   ...2 ...3       ...4
##    <chr> <int> <list>    <dbl>
##  1 a         3 <chr [1]>  91.9
##  2 a         3 <chr [1]>  91.9
##  3 p        21 <chr [1]>  66.8
##  4 p        21 <chr [1]>  66.8
##  5 j        94 <chr [1]>  52.5
##  6 j        94 <chr [1]>  52.5
##  7 b        67 <chr [1]>  77.6
##  8 b        67 <chr [1]>  77.6
##  9 j        46 <chr [1]>  77.8
## 10 j        46 <chr [1]>  77.8

Look’s nice. But the original list consists of 5 rows. The result above is twice as long. So what happend? as_tibble generates for each entry of the list of the inner lists a row. That’s not what I’ve expected. I would prefer to get a another column for each entry.

Flatten the Inner List

When we apply as.data.frame to an inner list the list of each inner list is converted into two columns:

column_names <- letters[1:5]
list_of_list %>%
  map(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>%
  bind_rows() %>% 
  as_tibble()

## # A tibble: 5 x 5
##   a         b c          d              e
##   <chr> <int> <chr>      <chr>      <dbl>
## 1 a         3 Sublist 18 Sublist 96  91.9
## 2 p        21 Sublist 99 Sublist 85  66.8
## 3 j        94 Sublist 17 Sublist 4   52.5
## 4 b        67 Sublist 1  Sublist 61  77.6
## 5 j        46 Sublist 79 Sublist 57  77.8

Yeah, that’s great. That’s what we want. We can even do it with map_dfr:

1
2
3

column_names <- letters[1:5]
list_of_list %>% map_dfr(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>% 
  as_tibble()

## # A tibble: 5 x 5
##   a         b c          d              e
##   <chr> <int> <chr>      <chr>      <dbl>
## 1 a         3 Sublist 18 Sublist 96  91.9
## 2 p        21 Sublist 99 Sublist 85  66.8
## 3 j        94 Sublist 17 Sublist 4   52.5
## 4 b        67 Sublist 1  Sublist 61  77.6
## 5 j        46 Sublist 79 Sublist 57  77.8

Performance

But how performant is this code. Let’s check it with the long_list_of_list and measere the time with the package tictoc.

library(tictoc)

tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>% 
  as_tibble()

## # A tibble: 100,000 x 5
##    a         b c          d              e
##    <chr> <int> <chr>      <chr>      <dbl>
##  1 w        71 Sublist 93 Sublist 51  74.5
##  2 p        98 Sublist 65 Sublist 34  50.8
##  3 a        49 Sublist 51 Sublist 6   15.1
##  4 u        92 Sublist 77 Sublist 41  87.0
##  5 e        91 Sublist 63 Sublist 83  13.0
##  6 b        68 Sublist 7  Sublist 3   92.2
##  7 x        63 Sublist 13 Sublist 82  92.3
##  8 l        50 Sublist 53 Sublist 97  52.8
##  9 m        85 Sublist 92 Sublist 23  92.8
## 10 t        53 Sublist 99 Sublist 6   86.3
## # … with 99,990 more rows

toc()

1	## 57.617 sec elapsed

WHAT? One minute for 100,000 rows? That’s long. What can we change?

library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x = unlist(x); x = set_names(x, column_names); x}) %>% 
  as_tibble()

## # A tibble: 100,000 x 5
##    a     b     c          d          e               
##    <chr> <chr> <chr>      <chr>      <chr>           
##  1 w     71    Sublist 93 Sublist 51 74.5060192188248
##  2 p     98    Sublist 65 Sublist 34 50.765589158982 
##  3 a     49    Sublist 51 Sublist 6  15.1016460731626
##  4 u     92    Sublist 77 Sublist 41 87.0233365567401
##  5 e     91    Sublist 63 Sublist 83 13.0497927311808
##  6 b     68    Sublist 7  Sublist 3  92.2474714694545
##  7 x     63    Sublist 13 Sublist 82 92.2626771498471
##  8 l     50    Sublist 53 Sublist 97 52.8331008506939
##  9 m     85    Sublist 92 Sublist 23 92.7664576098323
## 10 t     53    Sublist 99 Sublist 6  86.2593436148018
## # … with 99,990 more rows

toc()

1	## 2.7 sec elapsed

Three seconds – that’s better!

But the downside is we lost the different types of the columns. All columns are converted to character.

Don’t lose the types

The solution is still within purrr: We can use the functionflatten:

library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x = flatten(x); x = set_names(x, column_names); x}) %>% 
  as_tibble()

## # A tibble: 100,000 x 5
##    a         b c          d              e
##    <chr> <int> <chr>      <chr>      <dbl>
##  1 w        71 Sublist 93 Sublist 51  74.5
##  2 p        98 Sublist 65 Sublist 34  50.8
##  3 a        49 Sublist 51 Sublist 6   15.1
##  4 u        92 Sublist 77 Sublist 41  87.0
##  5 e        91 Sublist 63 Sublist 83  13.0
##  6 b        68 Sublist 7  Sublist 3   92.2
##  7 x        63 Sublist 13 Sublist 82  92.3
##  8 l        50 Sublist 53 Sublist 97  52.8
##  9 m        85 Sublist 92 Sublist 23  92.8
## 10 t        53 Sublist 99 Sublist 6   86.3
## # … with 99,990 more rows

toc()

1	## 2.661 sec elapsed

To leave a comment for the author, please follow the link and comment on their blog: rstats-tips.net.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.