Releasing collapse 2.0: Blazing Fast Joins, Reshaping, and Enhanced R
I’m excited to announce the release of collapse 2.0, which adds blazing fast joins, pivots, a flexible namespace, and many other features, including a brand new website, an updated cheat sheet, and a new vignette aimed at tidyverse users.
In the 3.5 years since the first release of collapse 1.0 to CRAN in March 2020, the package has seen 10 major updates and become a remarkable piece of statistical software that is robust, stable, lightweight, fast, statistically advanced, comprehensive, flexible, class-agnostic, verbose, and well-documented. It deals effortlessly with rich (multi-level, irregular, weighted, nested, labelled, and missing) scientific data, and can enhance the workflow of every R user.
The addition of rich, fast, and verbose joins and pivots in this release, together with secure interactive namespace masking and extensive global configurability, should enable many R users to use it as a workhorse package for data manipulation and statistical computing tasks.
In this post, I briefly introduce the core new features of this release and end with some reflections on why I created the package and think that its approach towards speeding up and enriching R is more encompassing than others.
Fast, Class-Agnostic, and Verbose Table Joins
Joins in collapse have been a much-requested feature. Still, I was long hesitant to take them on because they are complex, high-risk operations, and I was unsure how to go about them or how to provide an implementation that would satisfy my own ambitious demands regarding performance, generality, and verbosity.
I am glad that, following repeated requests, I have overcome these hesitations and designed an implementation – inspired by polars – that I am very satisfied with. collapse’s join function is simply called join(), and provides 6 types of joins (left, inner, full, right, semi, and anti), controlled by a how argument – the default being a left join. It also provides two separate join algorithms: a vectorized hash join (the default, sort = FALSE) and a sort-merge join (sort = TRUE). The join-column argument is called on and, if left empty, selects columns present in both datasets. An example with generated data follows:
library(collapse)

df1 <- data.frame(
  id1 = c(1, 1, 2, 3),
  id2 = c("a", "b", "b", "c"),
  name = c("John", "Jane", "Bob", "Carl"),
  age = c(35, 28, 42, 50)
)

df2 <- data.frame(
  id1 = c(1, 2, 3, 3),
  id2 = c("a", "b", "c", "e"),
  salary = c(60000, 55000, 70000, 80000),
  dept = c("IT", "Marketing", "Sales", "IT")
)

# Different types of joins
for (i in c("left", "inner", "right", "full", "semi", "anti"))
  join(df1, df2, how = i) |> print()

## left join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   1   b Jane  28     NA      <NA>
## 3   2   b  Bob  42  55000 Marketing
## 4   3   c Carl  50  70000     Sales
## inner join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   2   b  Bob  42  55000 Marketing
## 3   3   c Carl  50  70000     Sales
## right join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   2   b  Bob  42  55000 Marketing
## 3   3   c Carl  50  70000     Sales
## 4   3   e <NA>  NA  80000        IT
## full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   1   b Jane  28     NA      <NA>
## 3   2   b  Bob  42  55000 Marketing
## 4   3   c Carl  50  70000     Sales
## 5   3   e <NA>  NA  80000        IT
## semi join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age
## 1   1   a John  35
## 2   2   b  Bob  42
## 3   3   c Carl  50
## anti join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age
## 1   1   b Jane  28
Notice how, by default (verbose = 1), a compact summary of the operation is printed, indicating the type of join, the datasets and columns, and the number and percentage of records from each dataset matched in the join operation. join() also preserves the attributes of the first argument (x) and the order of columns and rows in it (defaults keep.col.order = TRUE, sort = FALSE). We can thus think of a join operation as adding columns to a data frame-like object (x) from another similar object (y).
There are several additional options to increase verbosity and examine the join operation more closely:
# verbose = 2 also shows classes, allowing you to detect implicit conversions (inside fmatch())
join(df1, df2, how = "left", verbose = 2)

## left join: df1[id1:numeric, id2:character] 3/4 (75%) <m:m> df2[id1:numeric, id2:character] 3/4 (75%)
##   id1 id2 name age salary      dept
## 1   1   a John  35  60000        IT
## 2   1   b Jane  28     NA      <NA>
## 3   2   b  Bob  42  55000 Marketing
## 4   3   c Carl  50  70000     Sales

# Adding join column: useful especially for full join
join(df1, df2, how = "full", column = TRUE)

## full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 name age salary      dept   .join
## 1   1   a John  35  60000        IT matched
## 2   1   b Jane  28     NA      <NA>     df1
## 3   2   b  Bob  42  55000 Marketing matched
## 4   3   c Carl  50  70000     Sales matched
## 5   3   e <NA>  NA  80000        IT     df2

# Custom column + rearranging
join(df1, df2, how = "full", column = list("join", c("x", "y", "x_y")), keep.col.order = FALSE)

## full join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
##   id1 id2 join name age salary      dept
## 1   1   a  x_y John  35  60000        IT
## 2   1   b    x Jane  28     NA      <NA>
## 3   2   b  x_y  Bob  42  55000 Marketing
## 4   3   c  x_y Carl  50  70000     Sales
## 5   3   e    y <NA>  NA  80000        IT

# Attaching match attribute
str(join(df1, df2, attr = TRUE))

## left join: df1[id1, id2] 3/4 (75%) <m:m> df2[id1, id2] 3/4 (75%)
## 'data.frame':    4 obs. of  6 variables:
##  $ id1   : num  1 1 2 3
##  $ id2   : chr  "a" "b" "b" "c"
##  $ name  : chr  "John" "Jane" "Bob" "Carl"
##  $ age   : num  35 28 42 50
##  $ salary: num  60000 NA 55000 70000
##  $ dept  : chr  "IT" NA "Marketing" "Sales"
##  - attr(*, "join.match")=List of 3
##   ..$ call   : language join(x = df1, y = df2, attr = TRUE)
##   ..$ on.cols:List of 2
##   .. ..$ x: chr [1:2] "id1" "id2"
##   .. ..$ y: chr [1:2] "id1" "id2"
##   ..$ match  : 'qG' int [1:4] 1 NA 2 3
##   .. ..- attr(*, "N.nomatch")= int 1
##   .. ..- attr(*, "N.groups")= int 4
##   .. ..- attr(*, "N.distinct")= int 3
Finally, it is possible to validate the join operation to be one of "m:m" (the default, no checks), "1:m", "m:1", or "1:1". For example:
join(df1, rowbind(df2, df2), validate = "1:1")

## Error in join(df1, rowbind(df2, df2), validate = "1:1"): Join is not 1:1: df1 (x) is unique on the join columns; rowbind (y) is not unique on the join columns
Another check, executed automatically inside the workhorse function fmatch() (if sort = FALSE), is for overidentified join conditions, i.e., cases where the records are already uniquely identified by a subset of the join columns. For example, if we added "name" and "dept" to the join condition, this would issue a warning because the match is already identified by "id1" and "id2":
join(df1, df2, on = c("id1", "id2", "name" = "dept"), how = "left")

## Warning in fmatch(x[ixon], y[iyon], nomatch = NA_integer_, count = count, : Overidentified match/join: the first 2 of 3 columns uniquely match the records. With overid > 0, fmatch() continues to
## match columns. Consider removing columns or setting overid = 0 to terminate the algorithm after 2 columns (the results may differ, see ?fmatch). Alternatively set overid = 2 to silence this warning.

## left join: df1[id1, id2, name] 0/4 (0%) <m:m> df2[id1, id2, dept] 0/4 (0%)
##   id1 id2 name age salary
## 1   1   a John  35     NA
## 2   1   b Jane  28     NA
## 3   2   b  Bob  42     NA
## 4   3   c Carl  50     NA
The warning can be silenced by passing overid = 2 to join(). To see where this may be useful, consider the following example using fmatch():
df1 <- data.frame(
  id1 = c(1, 1, 2, 3),
  id2 = c("a", "b", "b", "c"),
  name = c("John", "Bob", "Jane", "Carl")
)

df2 <- data.frame(
  id1 = c(1, 2, 3, 3),
  id2 = c("a", "b", "c", "e"),
  name = c("John", "Janne", "Carl", "Lynne")
)

# This gives an overidentification warning: columns 1:2 identify the data
fmatch(df1, df2)

## Warning in fmatch(df1, df2): Overidentified match/join: the first 2 of 3 columns uniquely match the records. With overid > 0, fmatch() continues to match columns. Consider removing columns or setting
## overid = 0 to terminate the algorithm after 2 columns (the results may differ, see ?fmatch). Alternatively set overid = 2 to silence this warning.

## [1]  1 NA NA  3

# This just runs through without warning
fmatch(df1, df2, overid = 2)

## [1]  1 NA NA  3

# This terminates computation after first 2 columns
fmatch(df1, df2, overid = 0)

## [1]  1 NA  2  3

fmatch(df1[1:2], df2[1:2])  # Same thing!

## [1]  1 NA  2  3

# -> note that here we get an additional match based on the unique ids,
#    which we didn't get before because "Jane" != "Janne"
So, in summary, the implementation of joins in collapse, as provided by the join() function, is not only blazing fast1 and class-agnostic but also allows you to verify all aspects of this high-risk operation.
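To also illustrate the class-agnostic part, here is a minimal sketch, assuming the data.table package is installed (the small dfx/dfy frames are purely illustrative; qDT() is collapse’s converter to data.table). Because join() preserves the attributes of its first argument, the result carries the class of x:

library(data.table)

dfx <- data.frame(id = 1:3, a = c("x", "y", "z"))
dfy <- data.frame(id = 2:4, b = c(10, 20, 30))

dtx <- qDT(dfx)                    # convert x to a data.table with collapse::qDT()
class(join(dtx, dfy, on = "id"))   # result should inherit "data.table" from x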
Advanced Pivots
The second big addition in collapse 2.0 is pivot(), which provides advanced data reshaping capabilities in a single parsimonious API. Notably, it supports longer-, wider-, and recast-pivoting functionality and can accommodate variable labels in the reshaping process.
Fortunately, collapse supplies a perfect test dataset to illustrate these capabilities: the 2014 Groningen Growth and Development Centre 10-Sector Database, which provides sectoral employment and value-added series for 10 broad sectors in 43 countries:
head(GGDC10S)

##   Country Regioncode             Region Variable Year      AGR      MIN       MAN        PU       CON      WRT      TRA     FIRE      GOV      OTH      SUM
## 1     BWA        SSA Sub-saharan Africa       VA 1960       NA       NA        NA        NA        NA       NA       NA       NA       NA       NA       NA
## 2     BWA        SSA Sub-saharan Africa       VA 1961       NA       NA        NA        NA        NA       NA       NA       NA       NA       NA       NA
## 3     BWA        SSA Sub-saharan Africa       VA 1962       NA       NA        NA        NA        NA       NA       NA       NA       NA       NA       NA
## 4     BWA        SSA Sub-saharan Africa       VA 1963       NA       NA        NA        NA        NA       NA       NA       NA       NA       NA       NA
## 5     BWA        SSA Sub-saharan Africa       VA 1964 16.30154 3.494075 0.7365696 0.1043936 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
## 6     BWA        SSA Sub-saharan Africa       VA 1965 15.72700 2.495768 1.0181992 0.1350976 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710

namlab(GGDC10S, N = TRUE, Ndistinct = TRUE)

##      Variable    N Ndist                                                 Label
## 1     Country 5027    43                                               Country
## 2  Regioncode 5027     6                                           Region code
## 3      Region 5027     6                                                Region
## 4    Variable 5027     2                                              Variable
## 5        Year 5027    67                                                  Year
## 6         AGR 4364  4353                                           Agriculture
## 7         MIN 4355  4224                                                Mining
## 8         MAN 4355  4353                                         Manufacturing
## 9          PU 4354  4237                                             Utilities
## 10        CON 4355  4339                                          Construction
## 11        WRT 4355  4344                         Trade, restaurants and hotels
## 12        TRA 4355  4334                  Transport, storage and communication
## 13       FIRE 4355  4349 Finance, insurance, real estate and business services
## 14        GOV 3482  3470                                   Government services
## 15        OTH 4248  4238               Community, social and personal services
## 16        SUM 4364  4364                               Summation of sector GDP
Evidently, the data is supplied in a format where the two variables, employment and value added, are stacked in each sector column. The data is also labeled, with descriptions attached as "label" attributes (retrievable using vlabels() or, together with names, using namlab()).
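As a quick illustration of that label access, a minimal sketch (vlabels() returns a named character vector of the "label" attributes, so it can be subset by column name):

vlabels(GGDC10S)[c("AGR", "MIN", "MAN")]   # variable labels of selected sector columns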
There are 3 different ways to reshape this data to make it easier to analyze. The first is to simply melt it into a long frame, e.g. for plotting with ggplot2:
# Pivot Longer
pivot(GGDC10S, ids = 1:5,
      names = list(variable = "Sectorcode", value = "Value"),
      labels = "Sector", how = "longer", na.rm = TRUE) |> head()

##   Country Regioncode             Region Variable Year Sectorcode      Sector    Value
## 1     BWA        SSA Sub-saharan Africa       VA 1964        AGR Agriculture 16.30154
## 2     BWA        SSA Sub-saharan Africa       VA 1965        AGR Agriculture 15.72700
## 3     BWA        SSA Sub-saharan Africa       VA 1966        AGR Agriculture 17.68066
## 4     BWA        SSA Sub-saharan Africa       VA 1967        AGR Agriculture 19.14591
## 5     BWA        SSA Sub-saharan Africa       VA 1968        AGR Agriculture 21.09957
## 6     BWA        SSA Sub-saharan Africa       VA 1969        AGR Agriculture 21.86221
Note how specifying the labels argument created a column that captures the sector descriptions, which would otherwise be lost in the reshaping process, and how na.rm = TRUE removed missing values in the long frame. This operation also has an exact reverse: pivot(long_df, 1:5, "Value", "Sectorcode", "Sector", how = "wider"), as sketched below.
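A minimal sketch of that reverse operation, assuming the long frame shown above is stored as long_df (the column names are those created by the longer pivot):

long_df <- pivot(GGDC10S, ids = 1:5,
                 names = list(variable = "Sectorcode", value = "Value"),
                 labels = "Sector", how = "longer", na.rm = TRUE)

# Reverse: spread "Value" back into sector columns, restoring labels from "Sector"
pivot(long_df, 1:5, "Value", "Sectorcode", "Sector", how = "wider") |> namlab() |> head()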
The second way to reshape the data is to create a wider frame with sector-variable columns:
# Pivot Wider
pivot(GGDC10S, ids = 1:5, names = "Variable", how = "wider", na.rm = TRUE) |>
  namlab(N = TRUE, Ndistinct = TRUE)

##      Variable    N Ndist                                                 Label
## 1     Country 3376    36                                               Country
## 2  Regioncode 3376     6                                           Region code
## 3      Region 3376     6                                                Region
## 4    Variable 3376     2                                              Variable
## 5        Year 3376    67                                                  Year
## 6      AGR_VA 1702  1700                                           Agriculture
## 7     AGR_EMP 1674  1669                                           Agriculture
## 8      MIN_VA 1702  1641                                                Mining
## 9     MIN_EMP 1674  1622                                                Mining
## 10     MAN_VA 1702  1702                                         Manufacturing
## 11    MAN_EMP 1674  1672                                         Manufacturing
## 12      PU_VA 1702  1665                                             Utilities
## 13     PU_EMP 1674  1615                                             Utilities
## 14     CON_VA 1702  1693                                          Construction
## 15    CON_EMP 1674  1668                                          Construction
## 16     WRT_VA 1702  1695                         Trade, restaurants and hotels
## 17    WRT_EMP 1674  1670                         Trade, restaurants and hotels
## 18     TRA_VA 1702  1694                  Transport, storage and communication
## 19    TRA_EMP 1674  1662                  Transport, storage and communication
## 20    FIRE_VA 1702  1696 Finance, insurance, real estate and business services
## 21   FIRE_EMP 1674  1674 Finance, insurance, real estate and business services
## 22     GOV_VA 1702  1698                                   Government services
## 23    GOV_EMP 1674  1666                                   Government services
## 24     OTH_VA 1702  1695               Community, social and personal services
## 25    OTH_EMP 1674  1671               Community, social and personal services
## 26     SUM_VA 1702  1702                               Summation of sector GDP
## 27    SUM_EMP 1674  1674                               Summation of sector GDP
Note how the variable labels were copied to each of the two variables created for each sector. It is also possible to pass the argument transpose = c("columns", "names") to change the order and/or naming of the cast columns (see the sketch below). Wide pivots where multiple columns are cast do not have a well-defined reverse operation, but the wide format may nevertheless be very useful for analyzing individual sectors.
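A quick sketch of that transpose option (output omitted; per the argument description above it changes both the column order and the naming of the cast columns):

# Same wide pivot, but with transposed column order and naming of the cast columns
pivot(GGDC10S, ids = 1:5, names = "Variable", how = "wider", na.rm = TRUE,
      transpose = c("columns", "names")) |> names()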
The third useful way to reshape this data for analysis is to recast it such that each variable goes into a separate column and the sectors are stacked in one column:
# Pivot Recast
recast_df = pivot(GGDC10S, values = 6:16,
                  names = list(from = "Variable", to = "Sectorcode"),
                  labels = list(to = "Sector"), how = "recast", na.rm = TRUE)
head(recast_df)

##   Country Regioncode             Region Year Sectorcode      Sector       VA      EMP
## 1     BWA        SSA Sub-saharan Africa 1964        AGR Agriculture 16.30154 152.1179
## 2     BWA        SSA Sub-saharan Africa 1965        AGR Agriculture 15.72700 153.2971
## 3     BWA        SSA Sub-saharan Africa 1966        AGR Agriculture 17.68066 153.8867
## 4     BWA        SSA Sub-saharan Africa 1967        AGR Agriculture 19.14591 155.0659
## 5     BWA        SSA Sub-saharan Africa 1968        AGR Agriculture 21.09957 156.2451
## 6     BWA        SSA Sub-saharan Africa 1969        AGR Agriculture 21.86221 157.4243
This is useful, for example, if we wanted to run a regression with sector fixed effects (a sketch follows the next code block). The code to reverse this pivot is:
# Reverse Pivot Recast
pivot(recast_df, values = c("VA", "EMP"),
      names = list(from = "Sectorcode", to = "Variable"),
      labels = list(from = "Sector"), how = "recast") |> head(3)

##   Country Regioncode             Region Year Variable      AGR
## 1     BWA        SSA Sub-saharan Africa 1964       VA 16.30154
## 2     BWA        SSA Sub-saharan Africa 1965       VA 15.72700
## 3     BWA        SSA Sub-saharan Africa 1966       VA 17.68066
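And here, for illustration, is a minimal sketch of the sector fixed effects regression mentioned above, using a purely hypothetical specification (value added on employment) with base R dummy-variable fixed effects rather than anything prescribed by collapse:

# Hypothetical specification: VA on EMP with sector fixed effects via factor dummies
fe_mod <- lm(VA ~ EMP + factor(Sectorcode), data = recast_df)
summary(fe_mod)$coefficients["EMP", ]   # coefficient on employment, net of sector means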
This showcased just some of the functionality of pivot(); more extensive examples are available in the documentation (?pivot). But the above is enough to demonstrate this unified API’s power and flexibility; it is also blazing fast.
Global Configurability and Interactive Namespace Masking
The third major feature of collapse 2.0 is its extensive global configurability via the set_collapse() function. This includes the default behavior for missing values (na.rm arguments in all statistical functions and algorithms), sorted grouping (sort), multithreading and algorithmic optimizations (nthreads, stable.algo), presentational settings (stub, digits, verbose), and, surpassing all else, the package namespace itself (mask, remove).
Why should the namespace, in particular, be modifiable? The main reason is that collapse provides many enhanced and performance-improved equivalents of functions present in base R and dplyr, such as the Fast Statistical Functions, fast Grouping and Ordering functions and algorithms, and Data Manipulation and Time Series functions.
collapse is intentionally fully compatible with the base R and dplyr namespaces, adding f-prefixes to these performance-improved functions where conflicts exist. Since v1.7.0, there has been a global option "collapse_mask", which could be set before the package is loaded to export non-prefixed versions of these functions, but this was somewhat tedious and best done with an .Rprofile file. collapse 2.0 also adds this option to set_collapse() and makes it fully interactive; that is, it can be set and changed at any point within the active session.
Concretely, what does this mean? Base R and dplyr are relatively slow compared to what can be achieved with group-level vectorization, SIMD instructions, and efficient algorithms, especially as data grows. To provide an example, I generate some large vectors and run some benchmarks for basic operations:
ul <- outer(letters, letters, paste0)
l <- sample(ul, 1e7, replace = TRUE)
m <- sample(outer(month.abb, month.abb, paste0), 1e7, replace = TRUE)
x <- na_insert(rnorm(1e7), prop = 0.05)
data <- data.frame(l, m, x)

library(microbenchmark)
microbenchmark(
  unique(l),
  table(l, m),
  sum(x, na.rm = TRUE),
  median(x, na.rm = TRUE),
  mean(x, na.rm = TRUE),
  times = 10)

## Warning in microbenchmark(unique(l), table(l, m), sum(x, na.rm = TRUE), : less accurate nanosecond times to avoid potential integer overflows

## Unit: milliseconds
##                     expr       min        lq      mean    median        uq       max neval
##                unique(l)  82.07856  85.14671  91.08940  86.02663  87.65242 124.87604    10
##              table(l, m) 506.91215 546.24042 554.55146 549.27130 565.78729 640.67941    10
##     sum(x, na.rm = TRUE)  15.39316  15.40485  15.58453  15.45362  15.54109  16.57486    10
##  median(x, na.rm = TRUE) 155.55667 157.19572 164.38405 160.75044 165.29642 196.26446    10
##    mean(x, na.rm = TRUE)  53.79520  54.12406  60.39059  55.94202  57.72792  97.40887    10

library(dplyr)
microbenchmark(
  dplyr = data |>
    subset(l %in% ul[1:500]) |>
    group_by(l, m) |>
    summarize(mean_x = mean(x, na.rm = TRUE),
              median_x = median(x, na.rm = TRUE)),
  times = 10)

## Unit: seconds
##   expr      min       lq     mean   median       uq      max neval
##  dplyr 2.159383 2.193034 2.230608 2.226005 2.250952 2.370895    10
The beauty of namespace masking is that we can turn parts or all of this code into collapse code simply by invoking the mask option of set_collapse(). The most comprehensive setting is mask = "all". It is a secure option because invoking it instantly exports these functions in the collapse namespace and re-attaches the namespace to make sure it is at the top of the search path:
set_collapse(mask = "all")

# This is all collapse code now + no need to set na.rm = TRUE (default in collapse)
# We could use set_collapse(na.rm = FALSE) to bring collapse in-line with base R
microbenchmark(
  unique(l),
  table(l, m),
  sum(x),
  median(x),
  mean(x),
  times = 10)

## Unit: milliseconds
##        expr       min        lq       mean     median         uq        max neval
##   unique(l) 14.215561 15.289105  15.516647  15.564871  15.786722  16.411152    10
## table(l, m) 94.875968 97.539410 107.189014 108.895508 115.038948 118.004806    10
##      sum(x)  1.968984  1.998955   2.053936   2.037946   2.128843   2.152500    10
##   median(x) 90.736772 91.784609  93.486921  92.288725  94.812787  99.236277    10
##     mean(x)  2.384314  2.389931   2.454666   2.419861   2.456023   2.680621    10

microbenchmark(
  collapse = data |>
    subset(l %in% ul[1:500]) |>
    group_by(l, m) |>
    summarize(mean_x = mean(x),
              median_x = median(x)),
  times = 10)

## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  collapse 344.2272 351.6525 363.8762 358.4373 373.9692 402.6943    10

# Reset the masking
# set_collapse(mask = NULL)
Evidently, the collapse code runs much faster. The 5-10x speedups shown here are quite normal. Higher speedups can be experienced for grouped operations as the number of groups grows large and repetition in R becomes very costly. As indicated before, masking in collapse 2.0 is fully interactive and reversible: invoking set_collapse(mask = NULL) and running the same code again will execute it with base R and dplyr.
So, in summary, collapse 2.0 provides fast R, in R, in a very simple and broadly accessible way. There are many other advantages to using collapse. For example, since its Fast Statistical Functions are S3 generic and support grouped and weighted aggregations and transformations out of the box, they save many unnecessary calls to apply(), lapply(), or summarise(), etc. (in addition to many unnecessary specifications of na.rm = TRUE):
# S3 generic statistical functions save a lot of syntax
mean(mtcars)                     # = sapply(mtcars, mean)

##        mpg        cyl       disp         hp       drat         wt       qsec         vs         am       gear       carb
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750   0.437500   0.406250   3.687500   2.812500

mean(mtcars, w = runif(32))      # = sapply(mtcars, weighted.mean, w = runif(32))

##         mpg         cyl        disp          hp        drat          wt        qsec          vs          am        gear        carb
##  20.9009889   5.9682791 221.9436951 137.7951392   3.6740174   3.1670541  18.0195300   0.4878638   0.4240404   3.7486737   2.6293563

mean(mtcars$mpg, mtcars$cyl)     # = tapply(mtcars$mpg, mtcars$cyl, mean)

##        4        6        8
## 26.66364 19.74286 15.10000

mean(mtcars, TRA = "-") |>       # = sweep(mtcars, 2, sapply(mtcars, mean))
  head()

##                         mpg     cyl        disp       hp       drat       wt     qsec      vs       am    gear    carb
## Mazda RX4          0.909375 -0.1875  -70.721875 -36.6875  0.3034375 -0.59725 -1.38875 -0.4375  0.59375  0.3125  1.1875
## Mazda RX4 Wag      0.909375 -0.1875  -70.721875 -36.6875  0.3034375 -0.34225 -0.82875 -0.4375  0.59375  0.3125  1.1875
## Datsun 710         2.709375 -2.1875 -122.721875 -53.6875  0.2534375 -0.89725  0.76125  0.5625  0.59375  0.3125 -1.8125
## Hornet 4 Drive     1.309375 -0.1875   27.278125 -36.6875 -0.5165625 -0.00225  1.59125  0.5625 -0.40625 -0.6875 -1.8125
## Hornet Sportabout -1.390625  1.8125  129.278125  28.3125 -0.4465625  0.22275 -0.82875 -0.4375 -0.40625 -0.6875 -0.8125
## Valiant           -1.990625 -0.1875   -5.721875 -41.6875 -0.8365625  0.24275  2.37125  0.5625 -0.40625 -0.6875 -1.8125

mtcars |> group_by(cyl, vs, am) |> mean()   # = summarize(across(everything(), mean))

##   cyl vs am      mpg     disp        hp     drat       wt     qsec     gear     carb
## 1   4  0  1 26.00000 120.3000  91.00000 4.430000 2.140000 16.70000 5.000000 2.000000
## 2   4  1  0 22.90000 135.8667  84.66667 3.770000 2.935000 20.97000 3.666667 1.666667
## 3   4  1  1 28.37143  89.8000  80.57143 4.148571 2.028286 18.70000 4.142857 1.428571
## 4   6  0  1 20.56667 155.0000 131.66667 3.806667 2.755000 16.32667 4.333333 4.666667
## 5   6  1  0 19.12500 204.5500 115.25000 3.420000 3.388750 19.21500 3.500000 2.500000
## 6   8  0  0 15.05000 357.6167 194.16667 3.120833 4.104083 17.14250 3.000000 3.083333
## 7   8  0  1 15.40000 326.0000 299.50000 3.880000 3.370000 14.55000 5.000000 6.000000
Concluding Reflections
It has been a remarkable 3.5-year-long journey leading up to the development of collapse 2.0, and a tremendous feat of time, energy, and determination. I could probably have published 2 academic papers instead, but would still be writing horrible code like most economists, trying to do complex things in econometrics using mainstream data science libraries not really designed for that, or, worst case, just be stuck with commercial software. I’m happy I came out as an open-source developer and that I’ve been accompanied on this path by other great people from my profession.
I also never regretted choosing R as a primary language. I find it unique in its simplicity and parsimony, the ability to work with different objects like vectors, matrices, and data frames in a fluent way, and to do rather complex things like matching with a single function call.
On the other hand, R for me always lacked speed and the ability to do advanced statistical and data manipulation operations with the ease I was used to from commercial environments (Stata, Mathematica).
collapse is my determined attempt to bring statistical complexity, parsimony, speed, and joy to statistics and data manipulation in R, and I believe it is the most encompassing attempt out there and preserves the fundamental character of the language.
Notably, the collapse approach is not limited to a certain object (like e.g. data.table, which remains a great idea and implementation), and does not rely on data structures and syntax that are somewhat alien/external to the language and do not integrate with many of its first-order features (e.g. arrow, polars, duckdb). It is also arguably more successful than alternative ways to implement or compile the language (FastR / Graal VM), because the fundamental performance problem in R is algorithmic efficiency and the lack of low-level vectorization for repetitive tasks2.
By reimplementing core parts of the language using efficient algorithms and providing rich and flexible vectorizations for many statistical operations across columns and groups in a class-agnostic way supporting nearly all frequently used data structures, collapse solves the fundamental performance problem in a way that integrates seamlessly with the core of the language. It also adds much-needed statistical complexity, particularly for weighted statistics, time series, and panel data. In short, it provides advanced and fast R, inside GNU R.
It is not, and will never be, the absolute best that can be done in performance terms. The data formats used by the best-performing systems (such as the arrow columnar format underlying polars) are designed at the memory level to optimally use computer resources (SIMD etc.), with database applications in mind, and the people doing this did not study economics. But it is not yet clear that such architectures are very suitable for languages meant to do broad and linear-algebra heavy statistical computing tasks, and R just celebrated its 30th birthday this year. So, given the constraints imposed by a 30-year-old C-based language and API, frameworks like collapse and data.table are pushing the boundaries very far3.
Let me stop here; collapse 2.0 is out. It changed my R life, and I hope it will change yours.
1. Some initial benchmarks were shared on Twitter, and collapse is about to enter the DuckDB database-like ops benchmark.
2. fmatch() is also nearly an order of magnitude faster than match() for atomic vectors.
3. Actually, it is extremely impressive how well data.table still performs compared to modern libraries based on optimized memory models. The benchmarks also show that in high-cardinality settings (many groups relative to the data size), optimized memory models don’t pay off that much, indicating that there is always a tradeoff between the complexity of statistical operations and the possibility of vectorization/bulk processing on modern hardware.