Site icon R-bloggers

R Quick Tip: Collapse a lists of data.frames with data.table

[This article was first published on R – It's a Locke, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

With my HIBPwned package, I consume the HaveIBeenPwned API and return back a list object with an element for each email address. Each element holds a data.frame of breach data or a stub response with a single column data.frame containing NA. Elements are named with the email addresses they relate to. I had a list of data.frames and I wanted a consolidated data.frame (well, I always want a data.table).

Enter data.table …

data.table has a very cool, and very fast function named rbindlist(). This takes a list of data.frames and consolidates them into one data.table, which can, of course, be handled as a data.frame if you didn’t want to use data.table for anything else.

Prep

For this post, I need a list with some data.frames in it. For simplicity, I’m going to split the iris dataset into three separate data.frames.

library(data.table)
myList<-list( p1=iris[1:50,]
             , p2=iris[51:100,]
             , p3=iris[101:150,]
 )

Simplest usage

If all your tables have the same columns and in the same order, then this is super simple.

> myList<-list( p1=iris[1:50,] , p2=iris[51:100,] , p3=iris[101:150,] )
> dt<-rbindlist(myList)
> head(dt)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa

Varying columns

If your data structure varies at all, you can use the arguments use.names and fill to combine the data.frames without an error. It will put columns on the RHS of the table as they appear within the list.

Here I remove the first column from the first part of our dataset to illustrate.

> myList<-list( p1=iris[1:50, -1 ] , p2=iris[51:100,] , p3=iris[101:150,] )
> dt<-rbindlist(myList, use.names=TRUE, fill=TRUE)
> head(dt)
   Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1:         3.5          1.4         0.2  setosa           NA
2:         3.0          1.4         0.2  setosa           NA
3:         3.2          1.3         0.2  setosa           NA
4:         3.1          1.5         0.2  setosa           NA
5:         3.6          1.4         0.2  setosa           NA
6:         3.9          1.7         0.4  setosa           NA

Adding the list element ID

I didn’t RTFM very well at first, so when I wanted the name of the element the specific row came from to be present in my data.frame, I (shudder) wrote this horror:

dt<-rbindlist(lapply(1:length(myList)
                     , function(x){ setDT(myList[[x]])[
  , id:=names(myList)[x]]})
  , use.names=TRUE, fill=TRUE)

When I write things like that, I know I’m in the wrong! This hideous inline function, iterating through the elements and updating values in the element, is frankly something I should be slapped for. So with it playing on my mind, I went back to the manual, and lo’ data.table was awesome.

> myList<-list( p1=iris[1:50, -1 ] , p2=iris[51:100,] , p3=iris[101:150,] ) 
> dt<-rbindlist(myList
+   , use.names=TRUE, fill=TRUE, idcol="myList")
> 
> head(dt)
   myList Sepal.Width Petal.Length Petal.Width Species Sepal.Length
1:     p1         3.5          1.4         0.2  setosa           NA
2:     p1         3.0          1.4         0.2  setosa           NA
3:     p1         3.2          1.3         0.2  setosa           NA
4:     p1         3.1          1.5         0.2  setosa           NA
5:     p1         3.6          1.4         0.2  setosa           NA
6:     p1         3.9          1.7         0.4  setosa           NA

This used the names of the list elements but would use the numeric index number if names weren’t available.

Wrap up

So with just three little arguments, data.table’s rbindlist() bundled up my data.frames into a single data.frame whilst handling divergent columns, and providing a reference back to the original list element!

The post R Quick Tip: Collapse a lists of data.frames with data.table appeared first on It's a Locke.

To leave a comment for the author, please follow the link and comment on their blog: R – It's a Locke.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.