#15: Tidyverse and data.table, sitting side by side … (Part 1)
Welcome to the fifteenth post in the rarely rational R rambling series, or R4 for short. There are two posts I have been meaning to get out for a bit, and hope to get to shortly, but in the meantime we are going to start something else.
Another longer-running idea I had was to present some simple application cases with (one or more) side-by-side code comparisons. Why? Well, at times it feels like R, and the R community, are being split. You're either on one side (increasingly "religious" in its defense of its deemed-superior approach) or the other. And that is of course utter nonsense. It's all R after all.
Programming, just like other fields using engineering methods and thinking, is about making choices, and trading off between certain aspects. A simple example is the fairly well-known trade-off between memory use and speed: think e.g. of a hash map allowing for faster lookup at the cost of some more memory. Generally speaking, solutions are rarely limited to just one way, or just one approach. So it pays off to know your tools, and to choose wisely among all available options. Having choices is having options, and those tend to have non-negative premiums to take advantage of. Locking yourself into one, and just one, paradigm can never be better.
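To make that trade-off concrete, here is a small illustrative sketch of my own (the keys and sizes are made up, and not part of the example that follows): looking a value up by name in a plain named vector requires a linear scan over the names, while an environment used as a hash map answers the same query in roughly constant time, at the cost of the extra memory its hash table occupies.

## illustrative only: named-vector lookup (linear scan) versus a hashed environment
keys <- sprintf("key%06d", 1:100000)
vals <- runif(length(keys))

lookup_vec <- setNames(vals, keys)                     ## compact, but lookup scans the names

lookup_env <- new.env(hash=TRUE, size=length(keys))    ## hashed, but carries extra memory
for (i in seq_along(keys)) assign(keys[i], vals[i], envir=lookup_env)

microbenchmark::microbenchmark(vector = lookup_vec[["key099999"]],
                               hash   = get("key099999", envir=lookup_env))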
In that spirit, I want to (eventually) show a few simple comparisons of code being done two distinct ways.
One obvious first candidate for this is the gunsales repository with some R code which backs an earlier NY Times article. I got involved for a similar reason, and updated the code from its initial form. Then again, this project also helped motivate what we did later with the x13binary package, which permits automated installation of the X13-ARIMA-SEATS binary to support Christoph's excellent seasonal CRAN package (and website), for which we now have a forthcoming JSS paper. But the actual code example is not that interesting, and sits a bit further off the mainstream, because of its more specialised seasonal ARIMA modeling.
But then this week I found a much simpler and shorter example, and quickly converted its code. The code comes from the inaugural "datascience 1" lesson at The Crosstab, a fabulous site by G. Elliot Morris (who may be the highest-energy undergrad I have come across lately) focussed on political polling, forecasts, and election outcomes. Lesson 1 is a simple introduction, and averages some polls of the 2016 US Presidential Election.
Complete Code using Approach “TV”
Elliot does a fine job walking the reader through his code so I will be brief and simply quote it in one piece:
## Getting the polls
library(readr)
polls_2016 <- read_tsv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))

## Wrangling the polls
library(dplyr)
polls_2016 <- polls_2016 %>%
    filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))
library(lubridate)
polls_2016 <- polls_2016 %>%
    mutate(end_date = ymd(end_date))
polls_2016 <- polls_2016 %>%
    right_join(data.frame(end_date = seq.Date(min(polls_2016$end_date),
                                              max(polls_2016$end_date), by="days")))

## Average the polls
polls_2016 <- polls_2016 %>%
    group_by(end_date) %>%
    summarise(Clinton = mean(Clinton),
              Trump = mean(Trump))

library(zoo)
rolling_average <- polls_2016 %>%
    mutate(Clinton.Margin = Clinton-Trump,
           Clinton.Avg = rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right"))

library(ggplot2)
ggplot(rolling_average) +
    geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
    geom_point(aes(x=end_date, y=Clinton.Margin))
It uses five packages to i) read some data off them interwebs, ii) filter / subset / modify it, iii) right (outer) join it with a daily date sequence derived from itself, iv) average the per-day polls and then create rolling averages over 14 days, and v) plot. Several standard verbs are used: filter(), mutate(), right_join(), group_by(), and summarise(). One non-tidyverse function is rollapply(), which comes from zoo, a popular package for time-series data.
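Since rollapply() may be the least familiar piece here, a tiny stand-alone illustration (my own toy vector, not part of Elliot's lesson) shows the effect of the options used above: a right-aligned window, with partial=TRUE averaging over however many observations are available before a full window has accumulated.

library(zoo)

x <- c(2, 4, 6, 8, 10)

## right-aligned window of width 3; partial=TRUE uses whatever is available,
## so there are no leading NAs
rollapply(x, width=3, FUN=mean, align="right", partial=TRUE)
## [1] 2 3 4 6 8

## without partial=TRUE, incomplete windows are padded via fill=NA instead
rollapply(x, width=3, FUN=mean, align="right", fill=NA)
## [1] NA NA  4  6  8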
Complete Code using Approach “DT”
As I will show below, we can do the same with fewer packages, as data.table covers the reading, slicing/dicing and time conversion. We still need zoo for its rollapply() and of course the same plotting code:
## Getting the polls
library(data.table)
pollsDT <- fread("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv")

## Wrangling the polls
pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsDT[, end_date := as.IDate(end_date)]
pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
                                              max(pollsDT[,end_date]), by="days")), on="end_date"]

## Average the polls
library(zoo)
pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
pollsDT[, Clinton.Margin := Clinton-Trump]
pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right")]

library(ggplot2)
ggplot(pollsDT) +
    geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
    geom_point(aes(x=end_date, y=Clinton.Margin))
This uses several of the components of data.table which are often called [i, j, by=...]. Rows are selected (i), columns are either modified (via := assignment) or summarised (via =), and grouping is undertaken via by=.... The outer join is done by having one data.table object indexed by another, and is pretty standard too. That allows us to do all transformations in three lines. We then create the per-day averages by grouping by day, compute the margin, and construct its rolling average as before. The resulting chart is, unsurprisingly, the same.
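For readers newer to data.table, a minimal stand-alone sketch (toy data made up for illustration, not the polling set) of those ingredients may help: rows are picked in i, columns are created or summarised in j, grouping happens via by=, and a join is expressed by indexing one table with another using on=.

library(data.table)

dt <- data.table(grp  = c("a", "a", "b", "b"),
                 x    = 1:4,
                 date = as.IDate("2016-06-01") + 0:3)

dt[x > 1]                              ## i: select rows
dt[, y := x * 2]                       ## j: add a column by reference
dt[, .(mean_x = mean(x)), by=grp]      ## j and by: summarise per group

## join: index dt by another table sharing the 'date' column,
## keeping every row of 'days', the same pattern as the outer join above
days <- data.table(date = seq(min(dt$date), max(dt$date), by="days"))
dt[days, on="date"]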
Benchmark Reading
We can look at how the two approaches do at getting the data read into our session. For simplicity, we will read a local file to keep the (fixed) download aspect out of it:
R> library(microbenchmark)
R> url <- "http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"
R> file <- "/tmp/poll-responses-clean.tsv"
R> download.file(url, destfile=file, quiet=TRUE)
R> res <- microbenchmark(tidy=suppressMessages(readr::read_tsv(file)),
+                        dt=data.table::fread(file, showProgress=FALSE))
R> res
Unit: milliseconds
 expr     min      lq    mean  median      uq      max neval
 tidy 6.67777 6.83458 7.13434 6.98484 7.25831  9.27452   100
   dt 1.98890 2.04457 2.37916 2.08261 2.14040 28.86885   100
R>
That is a clear relative difference, though the absolute amount of time is not that relevant for such a small (demo) dataset.
Benchmark Processing
We can also look at the processing part:
R> rdin <- suppressMessages(readr::read_tsv(file))
R> dtin <- data.table::fread(file, showProgress=FALSE)
R>
R> library(dplyr)
R> library(lubridate)
R> library(zoo)
R>
R> transformTV <- function(polls_2016=rdin) {
+     polls_2016 <- polls_2016 %>%
+         filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))
+     polls_2016 <- polls_2016 %>%
+         mutate(end_date = ymd(end_date))
+     polls_2016 <- polls_2016 %>%
+         right_join(data.frame(end_date = seq.Date(min(polls_2016$end_date),
+                                                   max(polls_2016$end_date), by="days")))
+     polls_2016 <- polls_2016 %>%
+         group_by(end_date) %>%
+         summarise(Clinton = mean(Clinton),
+                   Trump = mean(Trump))
+
+     rolling_average <- polls_2016 %>%
+         mutate(Clinton.Margin = Clinton-Trump,
+                Clinton.Avg = rollapply(Clinton.Margin, width=14,
+                                        FUN=function(x){mean(x, na.rm=TRUE)},
+                                        by=1, partial=TRUE, fill=NA, align="right"))
+ }
R>
R> transformDT <- function(dtin) {
+     pollsDT <- copy(dtin)   ## extra work to protect from reference semantics for benchmark
+     pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
+     pollsDT[, end_date := as.IDate(end_date)]
+     pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
+                                                   max(pollsDT[,end_date]), by="days")), on="end_date"]
+     pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)),
+                        by=end_date][, Clinton.Margin := Clinton-Trump]
+     pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
+                                        FUN=function(x){mean(x, na.rm=TRUE)},
+                                        by=1, partial=TRUE, fill=NA, align="right")]
+ }
R>
R> res <- microbenchmark(tidy=suppressMessages(transformTV(rdin)),
+                        dt=transformDT(dtin))
R> res
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 tidy 12.54723 13.18643 15.29676 13.73418 14.71008 104.5754   100
   dt  7.66842  8.02404  8.60915  8.29984  8.72071  17.7818   100
R>
Not quite a factor of two on the small data set, but again a clear advantage. data.table has a reputation for doing really well for large datasets; here we see that it is also faster for small datasets.
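One detail of the benchmark code is worth spelling out: data.table's := operator modifies a table in place, by reference, which is why transformDT() starts with copy() so that repeated benchmark runs do not keep mutating the shared input. A minimal illustration of that behaviour (again a toy example of mine):

library(data.table)

DT1 <- data.table(x = 1:3)
DT2 <- DT1              ## no copy taken: both names point at the same table
DT2[, y := x * 10]      ## := modifies in place ...
names(DT1)              ## ... so DT1 now has column "y" too

DT3 <- copy(DT1)        ## an explicit deep copy
DT3[, z := 0]
names(DT1)              ## unchanged: "x" "y"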
Side-by-side
Stripping out the reading, as well as the plotting, both of which are about the same, we can compare the essential data operations.
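Condensing the two listings above to just those wrangling steps (the lines are copied from the code shown earlier; reading and plotting are omitted) gives roughly this comparison:

## Approach "TV" (dplyr + lubridate)
polls_2016 <- polls_2016 %>%
    filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))
polls_2016 <- polls_2016 %>% mutate(end_date = ymd(end_date))
polls_2016 <- polls_2016 %>%
    right_join(data.frame(end_date = seq.Date(min(polls_2016$end_date),
                                              max(polls_2016$end_date), by="days")))
polls_2016 <- polls_2016 %>% group_by(end_date) %>%
    summarise(Clinton = mean(Clinton), Trump = mean(Trump))

## Approach "DT" (data.table)
pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsDT[, end_date := as.IDate(end_date)]
pollsDT <- pollsDT[data.table(end_date = seq(min(pollsDT[,end_date]),
                                             max(pollsDT[,end_date]), by="days")), on="end_date"]
pollsDT <- pollsDT[, .(Clinton = mean(Clinton), Trump = mean(Trump)), by=end_date]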
Summary
We found a simple task solved using code and packages from an increasingly popular sub-culture within R, and contrasted it with a second approach. We find the second approach to i) have fewer dependencies, ii) use less code, and iii) run faster.
Now, undoubtedly the former approach will have its staunch defenders (and that is all good and well; after all, choice is good, and even thirty years later some still debate vi versus emacs endlessly), but I thought it instructive to at least be able to make an informed comparison.
Acknowledgements
My thanks to G. Elliot Morris for a fine example, and of course a fine blog and (if somewhat hyperactive) Twitter account.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.