Selecting the max value from each group, a case study: data.table
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In my last post we looked at how to slice a data.frame
by group to obtain the rows for which a particular column in that group is at its maximum value sing base R. In this post, we will be taking a look at how to perform this task using data.table
.
data.table
Solution(s)
For this exercise we will be using datasets::mtcars
and so first, we must convert mtcars
to a data.table
.
library(data.table) mtcars_colnames <- colnames(mtcars) mtcars[, "car"] <- rownames(mtcars) mtcars <- mtcars[, c("car", mtcars_colnames)] mtcars <- as.data.table(mtcars)
The current data.table
syntax suggests that when grouping data, we should use the .SD
syntax. .SD
stands for “S
ubset of D
ata.table”, so when we group the data by
a variable, we are creating subsets of the data. Note that there’s no significance to the initial .
, except that it makes it slightly more unlikely that there will be a clash with a user-defined column name. So, one approach to solving our problem can be seen below.
mtcars[, .SD[which.max(mpg)], by = cyl] # cyl car mpg disp hp drat wt qsec vs am gear carb # 1: 6 Hornet 4 Drive 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1 # 2: 4 Toyota Corolla 33.9 71.1 65 4.22 1.835 19.90 1 1 4 1 # 3: 8 Pontiac Firebird 19.2 400.0 175 3.08 3.845 17.05 0 0 3 2
So this code essentially treats .SD
as a “group”, one for each cylinder level (by
), and subsets the row by the index where mpg
is at its maximum.
However this wasn’t always the case with data.table
, there is some legacy syntax which is still valid within the package. If we take a look at the following code, we will see that we can obtain the row where mpg
is at its maximum using the .I
syntax.
mtcars[, .I[which.max(mpg)]] # [1] 20
In other words, here .I
is a vector representing the row number where mpg
is at its maximum in the original data.table
. Now consider the case where we look at this by each cylinder group; we obtain a data.table
whose column V1
represents the row indices for each cylinder group where mpg
is at its maximum, i.e. one row for each group.
mtcars[, .I[which.max(mpg)], by = cyl] # cyl V1 # 1: 6 4 # 2: 4 20 # 3: 8 25
So using this data.table
we can subset on the column V1
to extract these row indices and subset the original data.table
for those rows.
mtcars[mtcars[, .I[which.max(mpg)], by = cyl]$V1] # car mpg cyl disp hp drat wt qsec vs am gear carb # 1: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 # 2: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 # 3: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
The final way in which we could solve this problem with data.table
is in fact very similar to the base R approach that we saw in the previous post. This involves split
ting the data.table
into three separate data.table
s, one for each cylinder group, and applying a function to each group that finds the index of the maximum row, subsetting the grouped data.table
on that index. We then bind these lists together using the data.table::rbindlist()
function. In fact, we could have even used the do.call(rbind, .)
approach we saw in the first post.
rbindlist(lapply(split(mtcars, mtcars[, cyl]), function(x) x[which.max(mpg)])) # car mpg cyl disp hp drat wt qsec vs am gear carb # 1: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 # 2: Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 # 3: Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
So there we have it, three separate ways to solve the same problem using data.table
.
Benchmarks
Now this wouldn’t be a completely informative blog post without some benchmarks. In fact these three options are what inspired me to write this series of posts. We can see the results of the benchmarking below.
We can see that, on average, the .I
solution is fastest. So you may be wondering, if the .SD
syntax is the newer syntax, why then is the .I
code quicker? Well it’s simply because .SD
has not yet been completely optimised. .I
avoids constructing .SD
, which is the bottleneck in the .SD
solution.
As a side note, we could speed these results up even more with data.table::setkey(mtcars, cyl)
. I won’t show the benchmarking results here as the data is so small it isn’t really a useful representation but it is worth considering should you need to perform a similar task on a larger dataset.
Conclusion
If you need to apply some function to each group within a data.table
, data.table
has many solutions. The newest solution, .SD
, may not always be the fastest approach and you should try the old .I
approach if speed is important.
Are you aware of any other data.table
solutions to this problem? If so, let me know in the comments!
Next time, we will take a look at how to solve this problem in dplyr
as well as sparklyr
.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.