Releasing collapse 2.0: Blazing Fast Joins, Reshaping, and Enhanced R

I’m excited to announce the release of collapse 2.0, adding blazing fast joins, advanced pivots, a flexible namespace, and many other features, including a brand new website, an updated cheat sheet, and a new vignette aimed at tidyverse users.

In the 3.5 years since the first release of collapse 1.0 to CRAN in March 2020, the package has seen 10 major updates and become a remarkable piece of statistical software that is robust, stable, lightweight, fast, statistically advanced, comprehensive, flexible, class-agnostic, verbose, and well-documented. It is profoundly able to deal with rich (multi-level, irregular, weighted, nested, labelled, and containing missing values) scientific data, and it can enhance the workflow of every R user.

The addition of rich, fast, and verbose joins and pivots in this release, together with secure interactive namespace masking and extensive global configurability, should enable many R users to use it as a workhorse package for data manipulation and statistical computing tasks.

In this post, I briefly introduce the core new features of this release and end with some reflections on why I created the package and why I think its approach to speeding up and enriching R is more encompassing than other approaches.

Fast, Class-Agnostic, and Verbose Table Joins

Joins in collapse have been a much-requested feature. Still, I was long hesitant to take them on because they are complex, high-risk operations, and I was unsure how to go about them or how to provide an implementation that would satisfy my own ambitious demands on performance, generality, and verbosity.

I am glad that, following repeated requests, I have overcome these hesitations and designed an implementation – inspired by polars – that I am very satisfied with. collapse’s join function is simply called join() and provides 6 types of joins (left, inner, full, right, semi, and anti), controlled by a how argument, with a left join as the default. It also provides two separate join algorithms: a vectorized hash join (the default, sort = FALSE) and a sort-merge join (sort = TRUE). The join-column argument is called on and, if left empty, selects the columns present in both datasets. An example with generated data follows:

library(collapse)
df1 <- data.frame(
  id1 = c(1, 1, 2, 3),
  id2 = c("a", "b", "b", "c"),
  name = c("John", "Jane", "Bob", "Carl"),
  age = c(35, 28, 42, 50)
)
df2 <- data.frame(
  id1 = c(1, 2, 3, 3),
  id2 = c("a", "b", "c", "e"),
  salary = c(60000, 55000, 70000, 80000),
  dept = c("IT", "Marketing", "Sales", "IT")
)

# Different types of joins
for (i in c("left","inner","right","full","semi","anti"))
    join(df1, df2, how = i) |> print()
## left join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   1   b Jane  28     NA      <NA>
## 3   2   b  Bob  42  55000 Marketing
## 4   3   c Carl  50  70000     Sales
## inner join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   2   b  Bob  42  55000 Marketing
## 3   3   c Carl  50  70000     Sales
## right join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   2   b  Bob  42  55000 Marketing
## 3   3   c Carl  50  70000     Sales
## 4   3   e <NA>  NA  80000        IT
## full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   1   b Jane  28     NA      <NA>
## 3   2   b  Bob  42  55000 Marketing
## 4   3   c Carl  50  70000     Sales
## 5   3   e <NA>  NA  80000        IT
## semi join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age
## 1   1   a John  35
## 2   2   b  Bob  42
## 3   3   c Carl  50
## anti join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age
## 1   1   b Jane  28

Notice how, by default (verbose = 1), a compact summary of the operation is printed, indicating the type of join, the datasets and join columns, and the number and percentage of records from each dataset matched in the join operation. join() also preserves the attributes of the first argument (x) as well as its order of columns and rows (by default, keep.col.order = TRUE and sort = FALSE). We can thus think of a join operation as adding columns to a data frame-like object (x) from another similar object (y).
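
A quick way to see the class-agnostic behaviour is to start the join from a different class of object, e.g. using collapse’s quick converters qTB() and qDT() (a small sketch; output omitted):

# Class-agnostic: the result keeps the class/attributes of x
join(qTB(df1), df2) |> class()  # expected: "tbl_df" "tbl" "data.frame"
join(qDT(df1), df2) |> class()  # expected: "data.table" "data.frame"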

There are several additional options to increase verbosity and further examine the join operation:

# verbose = 2 also shows classes, allowing you to detect implicit conversions (inside fmatch())
join(df1, df2, how = "left", verbose = 2)
## left join: df1[id1:numeric, id2:character] 3/4 (75%) <m:m> df2[id1:numeric, id2:character] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   1   b Jane  28     NA      <NA>
## 3   2   b  Bob  42  55000 Marketing
## 4   3   c Carl  50  70000     Sales

# Adding join column: useful especially for full join
join(df1, df2, how = "full", column = TRUE)
## full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept   .join
## 1   1   a John  35  60000        IT matched
## 2   1   b Jane  28     NA      <NA>     df1
## 3   2   b  Bob  42  55000 Marketing matched
## 4   3   c Carl  50  70000     Sales matched
## 5   3   e <NA>  NA  80000        IT     df2

# Custom column + rearranging
join(df1, df2, how = "full", column = list("join", c("x", "y", "x_y")), 
     keep.col.order = FALSE)
## full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 join name age salary      dept
## 1   1   a  x_y John  35  60000        IT
## 2   1   b    x Jane  28     NA      <NA>
## 3   2   b  x_y  Bob  42  55000 Marketing
## 4   3   c  x_y Carl  50  70000     Sales
## 5   3   e    y <NA>  NA  80000        IT

# Attaching match attribute
str(join(df1, df2, attr = TRUE))
## left join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
## 'data.frame':	4 obs. of  6 variables:
##  $ id1   : num  1 1 2 3
##  $ id2   : chr  "a" "b" "b" "c"
##  $ name  : chr  "John" "Jane" "Bob" "Carl"
##  $ age   : num  35 28 42 50
##  $ salary: num  60000 NA 55000 70000
##  $ dept  : chr  "IT" NA "Marketing" "Sales"
##  - attr(*, "join.match")=List of 3
##   ..$ call   : language join(x = df1, y = df2, attr = TRUE)
##   ..$ on.cols:List of 2
##   .. ..$ x: chr [1:2] "id1" "id2"
##   .. ..$ y: chr [1:2] "id1" "id2"
##   ..$ match  : 'qG' int [1:4] 1 NA 2 3
##   .. ..- attr(*, "N.nomatch")= int 1
##   .. ..- attr(*, "N.groups")= int 4
##   .. ..- attr(*, "N.distinct")= int 3

Finally, it is possible to validate the join operation as one of "m:m" (the default, no checks), "1:m", "m:1", or "1:1". For example:

join(df1, rowbind(df2, df2), validate = "1:1")
## Error in join(df1, rowbind(df2, df2), validate = "1:1"): Join is not 1:1: df1 (x) is unique on the join columns; rowbind (y) is not unique on the join columns

Another check, executed automatically inside the workhorse function fmatch() (if sort = FALSE), is for overidentified join conditions, i.e., cases where more columns than necessary are specified to identify the records. For example, if we added "name" and "dept" to the join condition, this would issue a warning because the match is already identified by "id1" and "id2":

join(df1, df2, on = c("id1", "id2", "name" = "dept"), how = "left")
## Warning in fmatch(x[ixon], y[iyon], nomatch = NA_integer_, count = count, : Overidentified match/join: the first 2 of 3 columns uniquely match the records. With overid > 0, fmatch() continues to
## match columns. Consider removing columns or setting overid = 0 to terminate the algorithm after 2 columns (the results may differ, see ?fmatch). Alternatively set overid = 2 to silence this warning.
## left join: df1[id1, id2, name] 0/4 (0%) <m:m> df2[id1, id2, dept] 0/4 (0%)
##   id1 id2 name age salary
## 1   1   a John  35     NA
## 2   1   b Jane  28     NA
## 3   2   b  Bob  42     NA
## 4   3   c Carl  50     NA

The warning can be silenced by passing overid = 2 to join(). To better see where this may be useful, consider the following example using fmatch().

df1 <- data.frame(
  id1 = c(1, 1, 2, 3),
  id2 = c("a", "b", "b", "c"),
  name = c("John", "Bob", "Jane", "Carl")
)
df2 <- data.frame(
  id1 = c(1, 2, 3, 3),
  id2 = c("a", "b", "c", "e"),
  name = c("John", "Janne", "Carl", "Lynne")
)

# This gives an overidentification warning: columns 1:2 identify the data
fmatch(df1, df2)
## Warning in fmatch(df1, df2): Overidentified match/join: the first 2 of 3 columns uniquely match the records. With overid > 0, fmatch() continues to match columns. Consider removing columns or setting
## overid = 0 to terminate the algorithm after 2 columns (the results may differ, see ?fmatch). Alternatively set overid = 2 to silence this warning.
## [1]  1 NA NA  3
# This just runs through without warning
fmatch(df1, df2, overid = 2)
## [1]  1 NA NA  3
# This terminates computation after first 2 columns
fmatch(df1, df2, overid = 0)
## [1]  1 NA  2  3
fmatch(df1[1:2], df2[1:2])  # Same thing!
## [1]  1 NA  2  3
# -> note that here we get an additional match based on the unique ids,
# which we didn't get before because "Jane" != "Janne"
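
Returning to join(): as noted above, passing overid = 2 silences the warning for a deliberately overidentified join, e.g. (a sketch using the redefined df1 and df2; output omitted):

# Deliberately overidentified join: "name" is matched after "id1" and "id2"
# overid = 2 is passed on to fmatch() and silences the warning
join(df1, df2, on = c("id1", "id2", "name"), how = "left", overid = 2)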

So, in summary, the implementation of joins in collapse, as provided by the join() function, is not only blazing fast1 and class-agnostic but also allows you to verify all aspects of this high-risk operation.

Advanced Pivots

The second big addition in collapse 2.0 is pivot(), which provides advanced data reshaping capabilities in a single parsimonious API. Notably, it supports longer-, wider-, and recast-pivoting functionality and can accommodate variable labels in the reshaping process.

Fortunately, collapse supplies a perfect test dataset to illustrate these capabilities: the 2014 Groningen Growth and Development Centre 10-Sector Database, which provides sectoral employment and value-added series for 10 broad sectors in 43 countries:

head(GGDC10S)
##   Country Regioncode             Region Variable Year      AGR      MIN       MAN        PU       CON      WRT      TRA     FIRE      GOV      OTH      SUM
## 1     BWA        SSA Sub-saharan Africa       VA 1960       NA       NA        NA        NA        NA       NA       NA       NA       NA       NA       NA
## 2     BWA        SSA Sub-saharan Africa       VA 1961       NA       NA        NA        NA        NA       NA       NA       NA       NA       NA       NA
## 3     BWA        SSA Sub-saharan Africa       VA 1962       NA       NA        NA        NA        NA       NA       NA       NA       NA       NA       NA
## 4     BWA        SSA Sub-saharan Africa       VA 1963       NA       NA        NA        NA        NA       NA       NA       NA       NA       NA       NA
## 5     BWA        SSA Sub-saharan Africa       VA 1964 16.30154 3.494075 0.7365696 0.1043936 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
## 6     BWA        SSA Sub-saharan Africa       VA 1965 15.72700 2.495768 1.0181992 0.1350976 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710

namlab(GGDC10S, N = TRUE, Ndistinct = TRUE)
##      Variable    N Ndist                                                 Label
## 1     Country 5027    43                                               Country
## 2  Regioncode 5027     6                                           Region code
## 3      Region 5027     6                                                Region
## 4    Variable 5027     2                                              Variable
## 5        Year 5027    67                                                  Year
## 6         AGR 4364  4353                                          Agriculture 
## 7         MIN 4355  4224                                                Mining
## 8         MAN 4355  4353                                         Manufacturing
## 9          PU 4354  4237                                             Utilities
## 10        CON 4355  4339                                          Construction
## 11        WRT 4355  4344                         Trade, restaurants and hotels
## 12        TRA 4355  4334                  Transport, storage and communication
## 13       FIRE 4355  4349 Finance, insurance, real estate and business services
## 14        GOV 3482  3470                                   Government services
## 15        OTH 4248  4238               Community, social and personal services
## 16        SUM 4364  4364                               Summation of sector GDP

Evidently, the data is supplied in a format where two variables, employment and value-added, are stacked in each sector column. The data is also labeled, with descriptions attached as "label" attributes (retrievable using vlabels() or, together with names, using namlab()).

There are three different ways to reshape this data to make it easier to analyze. The first is to simply melt it into a long frame, e.g. for plotting with ggplot2:

# Pivot Longer
pivot(GGDC10S, ids = 1:5, 
      names = list(variable = "Sectorcode", value = "Value"), 
      labels = "Sector", how = "longer", na.rm = TRUE) |> head()
##   Country Regioncode             Region Variable Year Sectorcode       Sector    Value
## 1     BWA        SSA Sub-saharan Africa       VA 1964        AGR Agriculture  16.30154
## 2     BWA        SSA Sub-saharan Africa       VA 1965        AGR Agriculture  15.72700
## 3     BWA        SSA Sub-saharan Africa       VA 1966        AGR Agriculture  17.68066
## 4     BWA        SSA Sub-saharan Africa       VA 1967        AGR Agriculture  19.14591
## 5     BWA        SSA Sub-saharan Africa       VA 1968        AGR Agriculture  21.09957
## 6     BWA        SSA Sub-saharan Africa       VA 1969        AGR Agriculture  21.86221

Note how specifying the labels argument created a column that captures the sector descriptions, which would otherwise be lost in the reshaping process, and how na.rm = TRUE removed missing values from the long frame. This operation also has an exact reverse operation, sketched below.
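
Assuming the long frame is first stored under the hypothetical name long_df, the reverse wide pivot could look as follows (a sketch; output omitted):

# Sketch: reversing the long pivot (long_df is a hypothetical name for the frame above)
long_df <- pivot(GGDC10S, ids = 1:5,
                 names = list(variable = "Sectorcode", value = "Value"),
                 labels = "Sector", how = "longer", na.rm = TRUE)
# "Sectorcode" supplies the new column names, "Sector" the variable labels
pivot(long_df, ids = 1:5, values = "Value",
      names = "Sectorcode", labels = "Sector", how = "wider") |> head(2)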

The second way to reshape the data is to create a wider frame with sector-variable columns:

# Pivot Wider
pivot(GGDC10S, ids = 1:5, names = "Variable", how = "wider", na.rm = TRUE) |> 
  namlab(N = TRUE, Ndistinct = TRUE)
##      Variable    N Ndist                                                 Label
## 1     Country 3376    36                                               Country
## 2  Regioncode 3376     6                                           Region code
## 3      Region 3376     6                                                Region
## 4    Variable 3376     2                                              Variable
## 5        Year 3376    67                                                  Year
## 6      AGR_VA 1702  1700                                          Agriculture 
## 7     AGR_EMP 1674  1669                                          Agriculture 
## 8      MIN_VA 1702  1641                                                Mining
## 9     MIN_EMP 1674  1622                                                Mining
## 10     MAN_VA 1702  1702                                         Manufacturing
## 11    MAN_EMP 1674  1672                                         Manufacturing
## 12      PU_VA 1702  1665                                             Utilities
## 13     PU_EMP 1674  1615                                             Utilities
## 14     CON_VA 1702  1693                                          Construction
## 15    CON_EMP 1674  1668                                          Construction
## 16     WRT_VA 1702  1695                         Trade, restaurants and hotels
## 17    WRT_EMP 1674  1670                         Trade, restaurants and hotels
## 18     TRA_VA 1702  1694                  Transport, storage and communication
## 19    TRA_EMP 1674  1662                  Transport, storage and communication
## 20    FIRE_VA 1702  1696 Finance, insurance, real estate and business services
## 21   FIRE_EMP 1674  1674 Finance, insurance, real estate and business services
## 22     GOV_VA 1702  1698                                   Government services
## 23    GOV_EMP 1674  1666                                   Government services
## 24     OTH_VA 1702  1695               Community, social and personal services
## 25    OTH_EMP 1674  1671               Community, social and personal services
## 26     SUM_VA 1702  1702                               Summation of sector GDP
## 27    SUM_EMP 1674  1674                               Summation of sector GDP

Note how the variable labels were copied to each of the two variables created for each sector. It is also possible to pass the argument transpose = c("columns", "names") to change the order of the columns and/or the naming of the cast columns; a sketch follows below. Wide pivots where multiple columns are cast do not have a well-defined reverse operation, but they can nevertheless be very useful, e.g. for analyzing individual sectors.
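
A minimal sketch of the transpose option (output omitted; with "names" included, the cast columns should, as I understand it, be named along the lines of VA_AGR, EMP_AGR, … rather than AGR_VA, AGR_EMP, …):

# Sketch: transposed ordering and naming of the cast columns
pivot(GGDC10S, ids = 1:5, names = "Variable", how = "wider", na.rm = TRUE,
      transpose = c("columns", "names")) |> names()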

The third useful way to reshape this data for analysis is to recast it such that each variable goes into a separate column and the sectors are stacked in one column:

# Pivot Recast
recast_df = pivot(GGDC10S, values = 6:16, 
      names = list(from = "Variable", to = "Sectorcode"),
      labels = list(to = "Sector"), how = "recast", na.rm = TRUE)
head(recast_df)
##   Country Regioncode             Region Year Sectorcode       Sector       VA      EMP
## 1     BWA        SSA Sub-saharan Africa 1964        AGR Agriculture  16.30154 152.1179
## 2     BWA        SSA Sub-saharan Africa 1965        AGR Agriculture  15.72700 153.2971
## 3     BWA        SSA Sub-saharan Africa 1966        AGR Agriculture  17.68066 153.8867
## 4     BWA        SSA Sub-saharan Africa 1967        AGR Agriculture  19.14591 155.0659
## 5     BWA        SSA Sub-saharan Africa 1968        AGR Agriculture  21.09957 156.2451
## 6     BWA        SSA Sub-saharan Africa 1969        AGR Agriculture  21.86221 157.4243

This is useful, for example, if we wanted to run a regression with sector fixed effects (a sketch is given after the reverse pivot below). The code to reverse this pivot is:

# Reverse Pivot Recast
pivot(recast_df, values = c("VA", "EMP"), 
      names = list(from = "Sectorcode", to = "Variable"),
      labels = list(from = "Sector"), how = "recast") |> head(3)
##   Country Regioncode             Region Year Variable      AGR
## 1     BWA        SSA Sub-saharan Africa 1964       VA 16.30154
## 2     BWA        SSA Sub-saharan Africa 1965       VA 15.72700
## 3     BWA        SSA Sub-saharan Africa 1966       VA 17.68066
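
As an illustration of the fixed-effects use case mentioned above, here is a minimal sketch of such a regression on the recast frame (a hypothetical specification for illustration only; restricting to positive values before taking logs, output omitted):

# Sketch: (log) value added vs. (log) employment with sector and country fixed effects
fe_data <- subset(recast_df, VA > 0 & EMP > 0)
fe_mod  <- lm(log(VA) ~ log(EMP) + factor(Sectorcode) + factor(Country), data = fe_data)
coef(summary(fe_mod))[1:2, ]   # intercept and log(EMP) coefficient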

This showcased just some of the functionality of pivot(); more extensive examples are available in the documentation (?pivot). But the above is enough to demonstrate this unified API’s power and flexibility; it is also blazing fast.

Global Configurability and Interactive Namespace Masking

The third major feature of collapse 2.0 is its extensive global configurability via the set_collapse() function. This covers the default behavior for missing values (the na.rm arguments in all statistical functions and algorithms), sorted grouping (sort), multithreading and algorithmic optimizations (nthreads, stable.algo), presentational settings (stub, digits, verbose), and, not least, the package namespace itself (mask, remove).
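
For example, a few of these defaults can be adjusted and inspected as follows (a sketch; output omitted):

# Sketch: adjusting some global defaults and retrieving the current settings
set_collapse(nthreads = 4, na.rm = TRUE, sort = FALSE)
get_collapse()   # lists the current values of all options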

Why should the namespace, in particular, be modifiable? The main reason is that collapse provides many enhanced and performance-improved equivalents of functions present in base R and dplyr, such as the Fast Statistical Functions, the fast Grouping and Ordering functions and algorithms, and the Data Manipulation and Time Series functions.

collapse is intentionally fully compatible with the base R and dplyr namespaces: where conflicts exist, these performance-improved functions carry f-prefixes. Since v1.7.0, there has been a global option "collapse_mask" which can be set before the package is loaded to export non-prefixed versions of these functions, but this was somewhat tedious and best done via an .Rprofile file. collapse 2.0 adds this option to set_collapse() and makes it fully interactive; that is, it can be set and changed at any point within the active session.

Concretely, what does this mean? Base R and dplyr are relatively slow compared to what can be achieved with group-level vectorization, SIMD instructions, and efficient algorithms, especially as data grows. To provide an example, I generate some large vectors and run some benchmarks for basic operations:

ul <- outer(letters, letters, paste0)
l <- sample(ul, 1e7, replace = TRUE)
m <- sample(outer(month.abb, month.abb, paste0), 1e7, replace = TRUE)
x <- na_insert(rnorm(1e7), prop = 0.05)
data <- data.frame(l, m, x)

library(microbenchmark)
microbenchmark(
  unique(l),
  table(l, m),
  sum(x, na.rm = TRUE),
  median(x, na.rm = TRUE),
  mean(x, na.rm = TRUE),
times = 10)
## Warning in microbenchmark(unique(l), table(l, m), sum(x, na.rm = TRUE), : less accurate nanosecond times to avoid potential integer overflows
## Unit: milliseconds
##                     expr       min        lq      mean    median        uq       max neval
##                unique(l)  82.07856  85.14671  91.08940  86.02663  87.65242 124.87604    10
##              table(l, m) 506.91215 546.24042 554.55146 549.27130 565.78729 640.67941    10
##     sum(x, na.rm = TRUE)  15.39316  15.40485  15.58453  15.45362  15.54109  16.57486    10
##  median(x, na.rm = TRUE) 155.55667 157.19572 164.38405 160.75044 165.29642 196.26446    10
##    mean(x, na.rm = TRUE)  53.79520  54.12406  60.39059  55.94202  57.72792  97.40887    10

library(dplyr)
microbenchmark(
  dplyr = data |>
    subset(l %in% ul[1:500]) |>
    group_by(l, m) |>
    summarize(mean_x = mean(x, na.rm = TRUE), 
              median_x = median(x, na.rm = TRUE)), 
times = 10)
## Unit: seconds
##   expr      min       lq     mean   median       uq      max neval
##  dplyr 2.159383 2.193034 2.230608 2.226005 2.250952 2.370895    10

The beauty of namespace masking is that we can turn parts or all of this code into collapse code simply by invoking the mask option of set_collapse(). The most comprehensive setting is mask = "all". It is a secure option because invoking it instantly exports these functions into the collapse namespace and re-attaches the namespace to make sure it is at the top of the search path:

set_collapse(mask = "all")
# This is all collapse code now + no need to set na.rm = TRUE (default in collapse)
# We could use set_collapse(na.rm = FALSE) to bring collapse in-line with base R
microbenchmark(
  unique(l),
  table(l, m),
  sum(x),
  median(x),
  mean(x),
times = 10)
## Unit: milliseconds
##         expr       min        lq       mean     median         uq        max neval
##    unique(l) 14.215561 15.289105  15.516647  15.564871  15.786722  16.411152    10
##  table(l, m) 94.875968 97.539410 107.189014 108.895508 115.038948 118.004806    10
##       sum(x)  1.968984  1.998955   2.053936   2.037946   2.128843   2.152500    10
##    median(x) 90.736772 91.784609  93.486921  92.288725  94.812787  99.236277    10
##      mean(x)  2.384314  2.389931   2.454666   2.419861   2.456023   2.680621    10

microbenchmark(
  collapse = data |>
    subset(l %in% ul[1:500]) |>
    group_by(l, m) |>
    summarize(mean_x = mean(x), 
              median_x = median(x)), 
times = 10)
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  collapse 344.2272 351.6525 363.8762 358.4373 373.9692 402.6943    10

# Reset the masking 
# set_collapse(mask = NULL)

Evidently, the collapse code runs much faster. The 5-10x speedups shown here are quite typical; even higher speedups can be experienced for grouped operations as the number of groups grows large and explicit repetition in R becomes very costly. As indicated before, masking in collapse 2.0 is fully interactive and reversible: invoking set_collapse(mask = NULL) and running the same code again will execute it with base R and dplyr.
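
A quick way to check which implementation a call dispatches to is to inspect the namespace of the function found on the search path (a sketch; the exact printout is an assumption, and masking is left active here for the examples that follow):

# Sketch: checking which mean() is visible while masking is active
environment(mean)             # presumably <environment: namespace:collapse>
# set_collapse(mask = NULL)   # would instantly restore the base R / dplyr versions
# environment(mean)           # would then report <environment: namespace:base>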

So, in summary, collapse 2.0 provides fast R, in R, in a very simple and broadly accessible way. There are many other advantages to using collapse: for example, since its Fast Statistical Functions are S3 generic and support grouped and weighted aggregations and transformations out of the box, they save many unnecessary calls to apply(), lapply(), or summarise() (in addition to many unnecessary specifications of na.rm = TRUE), e.g.:

# S3 generic statistical functions save a lot of syntax
mean(mtcars)                 # = sapply(mtcars, mean)
##        mpg        cyl       disp         hp       drat         wt       qsec         vs         am       gear       carb 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750   0.437500   0.406250   3.687500   2.812500
mean(mtcars, w = runif(32))  # = sapply(mtcars, weighted.mean, w = runif(32))
##         mpg         cyl        disp          hp        drat          wt        qsec          vs          am        gear        carb 
##  20.9009889   5.9682791 221.9436951 137.7951392   3.6740174   3.1670541  18.0195300   0.4878638   0.4240404   3.7486737   2.6293563
mean(mtcars$mpg, mtcars$cyl) # = tapply(mtcars$mpg, mtcars$cyl, mean)
##        4        6        8 
## 26.66364 19.74286 15.10000
mean(mtcars, TRA = "-") |>   # = sweep(mtcars, 2, sapply(mtcars, mean))
  head()
##                         mpg     cyl        disp       hp       drat       wt     qsec      vs       am    gear    carb
## Mazda RX4          0.909375 -0.1875  -70.721875 -36.6875  0.3034375 -0.59725 -1.38875 -0.4375  0.59375  0.3125  1.1875
## Mazda RX4 Wag      0.909375 -0.1875  -70.721875 -36.6875  0.3034375 -0.34225 -0.82875 -0.4375  0.59375  0.3125  1.1875
## Datsun 710         2.709375 -2.1875 -122.721875 -53.6875  0.2534375 -0.89725  0.76125  0.5625  0.59375  0.3125 -1.8125
## Hornet 4 Drive     1.309375 -0.1875   27.278125 -36.6875 -0.5165625 -0.00225  1.59125  0.5625 -0.40625 -0.6875 -1.8125
## Hornet Sportabout -1.390625  1.8125  129.278125  28.3125 -0.4465625  0.22275 -0.82875 -0.4375 -0.40625 -0.6875 -0.8125
## Valiant           -1.990625 -0.1875   -5.721875 -41.6875 -0.8365625  0.24275  2.37125  0.5625 -0.40625 -0.6875 -1.8125
mtcars |> group_by(cyl, vs, am) |> 
  mean() # = summarize(across(everything(), mean))
##   cyl vs am      mpg     disp        hp     drat       wt     qsec     gear     carb
## 1   4  0  1 26.00000 120.3000  91.00000 4.430000 2.140000 16.70000 5.000000 2.000000
## 2   4  1  0 22.90000 135.8667  84.66667 3.770000 2.935000 20.97000 3.666667 1.666667
## 3   4  1  1 28.37143  89.8000  80.57143 4.148571 2.028286 18.70000 4.142857 1.428571
## 4   6  0  1 20.56667 155.0000 131.66667 3.806667 2.755000 16.32667 4.333333 4.666667
## 5   6  1  0 19.12500 204.5500 115.25000 3.420000 3.388750 19.21500 3.500000 2.500000
## 6   8  0  0 15.05000 357.6167 194.16667 3.120833 4.104083 17.14250 3.000000 3.083333
## 7   8  0  1 15.40000 326.0000 299.50000 3.880000 3.370000 14.55000 5.000000 6.000000

Concluding Reflections

It has been a remarkable 3.5-year journey leading up to the development of collapse 2.0, and a tremendous investment of time, energy, and determination. I could probably have published two academic papers instead, but then I would still be writing horrible code like most economists, trying to do complex things in econometrics with mainstream data science libraries not really designed for that, or, worst case, be stuck with commercial software. I’m happy I came out as an open-source developer and that I’ve been accompanied on this path by other great people from my profession.

I have also never regretted choosing R as a primary language. I find it unique in its simplicity and parsimony, in the ability to work with different objects like vectors, matrices, and data frames in a fluent way, and in the ability to do rather complex things, like matching, with a single function call.

On the other hand, R for me always lacked speed and the ability to do advanced statistical and data manipulation operations with ease, as I was used to from commercial environments (STATA, Mathematica).

collapse is my determined attempt to bring statistical complexity, parsimony, speed, and joy to statistics and data manipulation in R. I believe it is the most encompassing attempt out there, and one that preserves the fundamental character of the language.

Notably, the collapse approach is not limited to a certain class of object (unlike, e.g., data.table, which remains a great idea and implementation), and it does not rely on data structures and syntax that are somewhat alien or external to the language and do not integrate with many of its first-order features (e.g. arrow, polars, duckdb). It is also arguably more successful than alternative attempts to implement or compile the language (FastR / GraalVM), because the fundamental performance problem in R is algorithmic efficiency and the lack of low-level vectorization for repetitive tasks2.

By reimplementing core parts of the language using efficient algorithms, and by providing rich and flexible vectorizations for many statistical operations across columns and groups in a class-agnostic way that supports nearly all frequently used data structures, collapse solves the fundamental performance problem in a way that integrates seamlessly with the core of the language. It also adds much-needed statistical complexity, particularly for weighted statistics, time series, and panel data. In short, it provides advanced and fast R, inside GNU R.

It is not, and will never be, the absolute best that can be done in performance terms. The data formats used by the best-performing systems (such as the arrow columnar format underlying polars) are designed at the memory level to make optimal use of computer resources (SIMD etc.), with database applications in mind, and the people doing this did not study economics. But it is not yet clear that such architectures are well suited to languages meant for broad and linear-algebra-heavy statistical computing tasks, and R just celebrated its 30th birthday this year. So, given the constraints imposed by a 30-year-old C-based language and API, frameworks like collapse and data.table are pushing the boundaries very far3.

Let me stop here; collapse 2.0 is out. It changed my R life, and I hope it will change yours.


  1. Some initial benchmarks were shared on Twitter, and collapse is about to enter the DuckDB database-like ops benchmark. fmatch() is also nearly an order of magnitude faster than match() for atomic vectors.↩︎

  2. See e.g. benchmarks for r2c.↩︎

  3. Actually, it is extremely impressive how well data.table still performs compared to modern libraries based on optimized memory models. The benchmarks also show that in high-cardinality settings (many groups relative to the data size), optimized memory models don’t pay off that much, indicating that there is always a tradeoff between the complexity of statistical operations and the possibility of vectorization/bulk processing on modern hardware.↩︎
