Embarking on nested dataframes

Jamie Lendrum

4 years ago

[This article was first published on R on R-house, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In a recent sprint, I was faced with the problem of carrying out analysis on data extracted from a database where there were several instances of the same table type and I wanted to do the same tasks on each of them. I know enough about the tidyverse to realise that this was a good opportunity to use functions such as map() and nest(). However, I fell at the first hurdle when the pressure of producing results meant I couldn’t spend the time I needed to get it to work…something to work on in slower time, hence this post.

To give an example of how my problem was structured, I’m going to use a movie example. Say we have a number of movies, and each of these movies have their own set of tables:

library(tidyverse)

pulp_fiction_cast <- data_frame(cast_member = c("Uma Thurman", "John Travolta"),
                                role = c("Mia Wallace", "Vincent Vega"))

terminator_cast <- data_frame(cast_member = c("Arnold Schwarzenegger", "Linda Hamilton"),
                              role = c("T-101", "Sarah Connor"))

pulp_fiction_critic_ratings <- data_frame(critic = c("Reviewer1", "Reviewer2"),
                                          score = c(8.1, 9))

terminator_critic_ratings <- data_frame(critic = c("Reviewer3", "Reviewer4"),
                                        score = c(7.8, 8.5))

pulp_fiction_cast
## # A tibble: 2 x 2
##   cast_member   role        
##   <chr>         <chr>       
## 1 Uma Thurman   Mia Wallace 
## 2 John Travolta Vincent Vega
terminator_cast
## # A tibble: 2 x 2
##   cast_member           role        
##   <chr>                 <chr>       
## 1 Arnold Schwarzenegger T-101       
## 2 Linda Hamilton        Sarah Connor
pulp_fiction_critic_ratings
## # A tibble: 2 x 2
##   critic    score
##   <chr>     <dbl>
## 1 Reviewer1  8.10
## 2 Reviewer2  9.00
terminator_critic_ratings
## # A tibble: 2 x 2
##   critic    score
##   <chr>     <dbl>
## 1 Reviewer3  7.80
## 2 Reviewer4  8.50

Whereas above I’ve defined each dataframe explicitly, in reality I had a function which could return the dataframe I wanted from a database given the film name and table name.

This was the type of setup I settled for; defining each dataframe separately and performing my analysis pipelines manually on each. Of particular note is that the number of movies and table types would increase as more information was loaded into the database. This was very laborious. What I really wanted was a way of storing all of this information in a single dataframe, by nesting the above dataframes within it. I could then store my analysis pipelines in functions which I could then loop over with the map() functions. So the single dataframe would look like the following:

(master <- data_frame(movie_name = c("Pulp Fiction", "Terminator"),
                     cast = c("pulp_fiction_cast df", "terminator_cast df"),
                     critic_ratings = c("pulp_fiction_critic_ratings df", "terminator_critic_ratings df")))
## # A tibble: 2 x 3
##   movie_name   cast                 critic_ratings                
##   <chr>        <chr>                <chr>                         
## 1 Pulp Fiction pulp_fiction_cast df pulp_fiction_critic_ratings df
## 2 Terminator   terminator_cast df   terminator_critic_ratings df

In order to demonstrate the problem, I’ve defined a very rudimentary import function below to return a particular dataframe:

movie_names <- c("Pulp Fiction", "Terminator")
table_names <- c("Cast", "Critic Ratings") 

import_fn <- function(movie_name, table_name){
  if (movie_name == movie_names[1]) {
    if (table_name == table_names[1]) {
      pulp_fiction_cast
    } else if (table_name == table_names[2]){
      pulp_fiction_critic_ratings
    }
  } else if (movie_name == movie_names[2]) {
    if (table_name == table_names[1]) {
      terminator_cast
    } else if (table_name == table_names[2]){
      terminator_critic_ratings
    }
  }
}

import_fn(movie_names[2], table_names[2])
## # A tibble: 2 x 2
##   critic    score
##   <chr>     <dbl>
## 1 Reviewer3  7.80
## 2 Reviewer4  8.50

My first attempt to create the dataframe I wanted looked like this:

master <- map(movie_names, function(x){map(table_names, import_fn, movie_name = x)})
str(master)
## List of 2
##  $ :List of 2
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  2 variables:
##   .. ..$ cast_member: chr [1:2] "Uma Thurman" "John Travolta"
##   .. ..$ role       : chr [1:2] "Mia Wallace" "Vincent Vega"
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  2 variables:
##   .. ..$ critic: chr [1:2] "Reviewer1" "Reviewer2"
##   .. ..$ score : num [1:2] 8.1 9
##  $ :List of 2
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  2 variables:
##   .. ..$ cast_member: chr [1:2] "Arnold Schwarzenegger" "Linda Hamilton"
##   .. ..$ role       : chr [1:2] "T-101" "Sarah Connor"
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  2 variables:
##   .. ..$ critic: chr [1:2] "Reviewer3" "Reviewer4"
##   .. ..$ score : num [1:2] 7.8 8.5

This initial attempt was just to dive straight in with something to see what I got. I knew I’d get a list of lists by using the bare map() function, and despite changing to map_df() I still didn’t get what I was after. But at least all the data is in there and it’s generalised regardless of how many movies or tables I have.

After a few days thinking I knew I wanted to start with a movie name column and then do a series of mutate() commands joining on lists of dataframes. As a starter for ten using the cast table:

master <- data_frame(Movie = movie_names) %>%
            mutate(map(movie_names, import_fn, table_name = table_names[1]))
str(master)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
##  $ Movie                                                   : chr  "Pulp Fiction" "Terminator"
##  $ map(movie_names, import_fn, table_name = table_names[1]):List of 2
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  2 variables:
##   .. ..$ cast_member: chr  "Uma Thurman" "John Travolta"
##   .. ..$ role       : chr  "Mia Wallace" "Vincent Vega"
##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  2 variables:
##   .. ..$ cast_member: chr  "Arnold Schwarzenegger" "Linda Hamilton"
##   .. ..$ role       : chr  "T-101" "Sarah Connor"

Two issues now: how to generalise to several columns instead of just one, and how to give it the right heading – although I could always name the columns afterwards using names(). My current best effort looks like the following:

master <- data_frame(Movie = movie_names) %>%
          bind_cols(
            map_df(table_names, 
                   function(x) {mutate(., map(movie_names, import_fn, table_name = x))})
          )
str(master)

This gives an error, but I’ll keep plugging away at it…

Update 2018-01-23

Whilst doing some more research into the tidyverse, I discovered a neat function set_names() in the purrr package which allows you to set the names() property of your dataframes within a tidy pipeline, so any solution I come up with can include %>% set_names(table_names) and I’m good to go! (This reminds me why I used the base R function setNames() in my Star Wars post).

Whilst I was looking into a solution to my problem, I discovered the reduce() function which uses a two argument function, such as bind_cols() to reduce a list down to a single element. I don’t know if my attempt below is getting me closer or further away!

# Function to take in two table names, make them into two lists of dataframes, and bind them
bind_table_cols <- function(x, y) {
  x_col <- map(movie_names, import_fn, table_name = x)
  y_col <- map(movie_names, import_fn, table_name = y)
  bind_cols(x_col, y_col)
}

master <- reduce(table_names, bind_table_cols) %>%
            set_names(table_names) %>%
            data_frame(Movie = movie_names, .)

Update 2018-02-20

Thanks to the clever people on Stack Overflow, I’ve now got an embarrassingly simple but elegant solution to this problem:

master <- crossing(movie_names, table_names) %>%
                    mutate(df = map2(movie_names, table_names, import_fn)) %>%
                    spread(table_names, df)
master
## # A tibble: 2 x 3
##   movie_names  Cast             `Critic Ratings`
##   <chr>        <list>           <list>          
## 1 Pulp Fiction <tibble [2 x 2]> <tibble [2 x 2]>
## 2 Terminator   <tibble [2 x 2]> <tibble [2 x 2]>
master$Cast[[2]]
## # A tibble: 2 x 2
##   cast_member           role        
##   <chr>                 <chr>       
## 1 Arnold Schwarzenegger T-101       
## 2 Linda Hamilton        Sarah Connor
master %>%
      filter(movie_names == "Terminator") %>%
      select(Cast) %>%
      unnest()
## # A tibble: 2 x 2
##   cast_member           role        
##   <chr>                 <chr>       
## 1 Arnold Schwarzenegger T-101       
## 2 Linda Hamilton        Sarah Connor

The crossing() function was a new one on me. The moral of the story? Don’t necessarily try to create something in its final form straight off the bat!

To leave a comment for the author, please follow the link and comment on their blog: R on R-house.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.