Site icon R-bloggers

My Favorite data.table Feature

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My favorite R data.table feature is the “by” grouping notation when combined with the := notation.

Let’s take a look at this powerful notation.

First, let’s build an example data.frame.

d <- wrapr::build_frame(
   "group"  , "value" |
     "a"    , 1L      |
     "a"    , 2L      |
     "b"    , 3L      |
     "b"    , 4L      )

knitr::kable(d)
group value
a 1
a 2
b 3
b 4

The data is some sort of value with a grouping column telling us which rows are related.

With the data.table:=,by” notation we can add the per-group totals into each row of the data as follows (the extra [] at the end is just the command to also print the results in addition to adding the column in-place).

library("data.table")
dt <- data.table::as.data.table(d)

dt[, group_sum := sum(value), by = "group"][]

#    group value group_sum
# 1:     a     1         3
# 2:     a     2         3
# 3:     b     3         7
# 4:     b     4         7

The “by” signals we are doing a per-group calculation, and the “:=” signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row. For example.

# build a fresh copy as last command altered dt in place
dt <- data.table::as.data.table(d) 

dt[, fraction := value/sum(value), by = "group"][]

#    group value  fraction
# 1:     a     1 0.3333333
# 2:     a     2 0.6666667
# 3:     b     3 0.4285714
# 4:     b     4 0.5714286

In base R (or in a more purely relational data system) the obvious way to solve this requires two steps: computing the per-group summaries and then joining them back into the original table rows. This can be done as follows.

sums <- tapply(d$value, d$group, sum)
d$fraction <- d$value/sums[d$group]
print(d)

#   group value  fraction
# 1     a     1 0.3333333
# 2     a     2 0.6666667
# 3     b     3 0.4285714
# 4     b     4 0.5714286

We called the transform a “window function”, as that is the name that SQL uses for the concept. The SQL code to perform this calculation would look like the following.

 SELECT
  group,
  value,
  value/sum(value) OVER (  PARTITION BY group ) AS fraction
 FROM
   d

And the popular package dplyr uses the following notation for the same problem.

d %>% 
    group_by(group) %>%
    mutate(fraction = value/sum(value)) %>%
    ungroup()

And, as always, let’s end with some timings. For a 1000000 row table with 10 additional irrelevant columns, and group ids picked uniformly from 100000 symbols: we see the various solutions take the following times to complete the task.

##            solution  milliseconds 
##      datatable_soln           384
##  base_R_lookup_soln          1476
##          dplyr_soln          3988

All packages are the current CRAN releases as of 2019-06-29. Timings are sensitive to number of row, columns, and groups. Note the data.table time includes the time to convert to the data.table class.Details on the timings can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.