collapse
Author(s): Sebastian Krantz
Maintainer: Sebastian Krantz (sebastian.krantz@graduateinstitute.ch)
collapse is a large C/C++-based infrastructure package facilitating complex statistical computing, data transformation, and exploration tasks in R – at outstanding levels of performance and memory efficiency. It also implements a class-agnostic approach to R programming, supporting vector, matrix, and data frame-like objects (including xts, tibble, data.table, and sf). It has a stable API, depends on Rcpp, and supports R versions >= 3.4.0.
Relationship with data.table
At the C-level, collapse took much inspiration from data.table, and leverages some of its core algorithms like radixsort, while adding significant statistical functionality and new algorithms within a class-agnostic programming framework that seamlessly supports data.table. Notably, collapse::qDT() is a highly efficient anything-to-data.table converter, and all manipulation functions in collapse, if passed data.tables, return valid data.tables, allowing for subsequent reference operations (:=).
Its added functionality includes a rich set of Fast Statistical Functions supporting vectorized (grouped, weighted) statistical operations on matrix-like objects. These are integrated with fast data manipulation functions so that more complex statistical expressions can also be vectorized across groups. It also adds flexible time series functions and classes supporting irregular series and panels, (panel-)data transformations, vectorized hash-joins, fast aggregation and recast pivots, (internal) support for variable labels, powerful descriptive tools, memory efficient programming tools, and recursive tools for heterogeneous nested data.
It is highly and interactively configurable. A navigable internal documentation/overview facilitates its use.
Overview
The easiest way to load collapse and data.table together is via the fastverse package:
library(fastverse)
-- Attaching packages ------------------------------------------------------------------------------- fastverse 0.3.3 --
v data.table 1.15.4 v kit 0.0.19 v magrittr 2.0.3 v collapse 2.0.16
The following demonstrates collapse’s deep integration with data.table:
mtcarsDT <- qDT(mtcars) # This creates a valid data.table (no deep copy)
mtcarsDT[, new := mean(mpg), by = cyl] # Proof: no warning here
There are many reasons to use collapse, e.g., to compute advanced statistics very fast:
# Fast tidyverse-like functions: one of the ways to code with collapse
mtcDTagg <- mtcarsDT |>
  fgroup_by(cyl, vs, am) |>
  fsummarise(mpg_wtd_median = fmedian(mpg, wt),                      # Weighted median
             mpg_wtd_p90 = fnth(mpg, 0.9, wt, ties = "q8"),          # Weighted 90% quantile type 8
             mpg_wtd_mode = fmode(mpg, wt, ties = "max"),            # Weighted maximum mode
             mpg_range = fmax(mpg) %-=% fmin(mpg),                   # Range: vectorized and memory efficient
             lm_mpg_carb = fsum(mpg, W(carb)) %/=% fsum(W(carb)^2))  # coef(lm(mpg ~ carb)): vectorized
# Note: for increased parsimony, can abbreviate fgroup_by -> gby, fsummarise -> smr
mtcDTagg[, new2 := 1][1:3] # Still a data.table
     cyl    vs    am mpg_wtd_median mpg_wtd_p90 mpg_wtd_mode mpg_range lm_mpg_carb  new2
   <num> <num> <num>          <num>       <num>        <num>     <num>       <num> <num>
1:     4     0     1           26.0    26.00000         26.0       0.0         NaN     1
2:     4     1     0           22.8    24.40000         24.4       2.9         2.1     1
3:     4     1     1           30.4    33.48809         30.4      12.5        -1.7     1
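The `lm_mpg_carb` expression uses the textbook formula for a simple-regression slope, sum(y * (x - mean(x))) / sum((x - mean(x))^2), with `W()` supplying the group-centered regressor. As a rough cross-check, the sketch below recomputes the slope per group in base R only and compares it with `lm()`. The helper name `slope_by_hand` is hypothetical and not part of collapse.

```r
# Base R sketch: simple-regression slope per group, computed by hand.
# slope_by_hand is an illustrative helper, not a collapse function.
slope_by_hand <- function(y, x) {
  xc <- x - mean(x)          # center the regressor within the group
  sum(y * xc) / sum(xc^2)    # 0/0 = NaN when the group has a single row
}

# Split mtcars by the same three grouping columns, dropping empty combinations
groups <- split(mtcars, list(mtcars$cyl, mtcars$vs, mtcars$am), drop = TRUE)

by_hand <- sapply(groups, function(d) slope_by_hand(d$mpg, d$carb))
by_lm   <- sapply(groups, function(d) unname(coef(lm(mpg ~ carb, data = d))[2]))

round(cbind(by_hand, by_lm), 4)
```

The single-row group (cyl = 4, vs = 0, am = 1) gives NaN by hand and NA from `lm()`, which is consistent with the NaN in the first row of the output above.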
Or simply, convenience functions like collap() for fast multi-type aggregation:
# World Development Dataset (see ?wlddev)
head(wlddev, 3)
      country iso3c       date year decade     region     income  OECD PCGDP LIFEEX GINI       ODA     POP
1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA 32.446   NA 116769997 8996973
2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA 32.962   NA 232080002 9169410
3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA 33.471   NA 112839996 9351441
# Population weighted mean for numeric and mode for non-numeric columns (multithreaded and
# vectorized across groups and columns, the default in statistical functions is na.rm = TRUE)
wlddev |> collap(~ year + income, fmean, fmode, w = ~ POP, nthreads = 4) |> ss(1:3)
        country iso3c       date year decade                region              income  OECD      PCGDP   LIFEEX GINI
1 United States   USA 1961-01-01 1960   1960 Europe & Central Asia         High income  TRUE 12768.7126 68.59372   NA
2      Ethiopia   ETH 1961-01-01 1960   1960    Sub-Saharan Africa          Low income FALSE   658.4778 38.33382   NA
3         India   IND 1961-01-01 1960   1960            South Asia Lower middle income FALSE   500.7932 45.26707   NA
         ODA       POP
1  911825661 749495030
2  160457982 147355735
3 3278899549 927990163
We can also use the low-level API for statistical programming:
# Grouped mean (by number of gears)
fmean(mtcars$mpg, mtcars$gear)
       3        4        5
16.10667 24.53333 21.38000
# Grouping object from multiple columns
g <- GRP(mtcars, c("cyl", "vs", "am"))
fmean(mtcars$mpg, g)
   4.0.1    4.1.0    4.1.1    6.0.1    6.1.0    8.0.0    8.0.1
26.00000 22.90000 28.37143 20.56667 19.12500 15.05000 15.40000
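For readers who want to sanity-check these grouped means without collapse, the following base R sketch reproduces both results. It is only an equivalent computation for comparison, not how collapse works internally; `interaction()` stands in for the `GRP()` object here.

```r
# Base R sketch of the two grouped means above (no collapse required)
by_gear <- tapply(mtcars$mpg, mtcars$gear, mean)

# lex.order = TRUE sorts groups by cyl, then vs, then am;
# drop = TRUE removes combinations that do not occur in the data
grp <- interaction(mtcars$cyl, mtcars$vs, mtcars$am, drop = TRUE, lex.order = TRUE)
by_cyl_vs_am <- tapply(mtcars$mpg, grp, mean)

round(by_gear, 5)
round(by_cyl_vs_am, 5)
```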
vars <- c("carb", "hp", "qsec") # columns to aggregate
# Aggregating: weighted mean, vectorized across groups and columns
add_vars(g$groups, # Grouping columns
  fmean(get_vars(mtcars, vars), g, w = mtcars$wt, use.g.names = FALSE)
)
  cyl vs am     carb        hp     qsec
1   4  0  1 2.000000  91.00000 16.70000
2   4  1  0 1.720045  83.60420 21.04028
3   4  1  1 1.416115  82.11819 18.75509
4   6  0  1 4.670296 131.78463 16.33306
5   6  1  0 2.522685 115.32202 19.21275
6   8  0  0 3.186582 196.74988 17.20449
7   8  0  1 6.118694 301.60682 14.55297
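The weighted means above can be cross-checked with `weighted.mean()` from base R. The sketch below does this for the `hp` column only, using `wt` as the weight; it loops over groups and is shown purely for comparison with the vectorized call.

```r
# Base R sketch: weighted mean of hp per (cyl, vs, am) group, weights = wt
groups <- split(mtcars, list(mtcars$cyl, mtcars$vs, mtcars$am), drop = TRUE)
hp_wmean <- sapply(groups, function(d) weighted.mean(d$hp, d$wt))

# split() orders groups differently, so sort by group name to match the table above
hp_wmean <- hp_wmean[order(names(hp_wmean))]
round(hp_wmean, 5)
```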
# Let's aggregate a matrix
m <- matrix(abs(rnorm(32^2)), 32)
m |> fmean(g) |> t() |> fmean(g) |> t()
          4.0.1     4.1.0     4.1.1     6.0.1     6.1.0     8.0.0     8.0.1
4.0.1 1.4521108 0.3523314 0.8342670 0.9171383 0.9653782 0.9972716 0.9588307
4.1.0 0.4862770 0.8524147 0.7069511 1.0736184 0.7940877 0.7172582 0.6543751
4.1.1 0.8559884 0.8533788 0.8341950 0.8516854 0.6340604 0.8739776 0.7811358
6.0.1 0.9444262 0.7027453 1.0463235 0.7361824 0.8646207 0.9863881 0.7091550
6.1.0 0.9136786 0.8365409 0.8170907 0.8222345 0.9893137 0.9412397 0.9012829
8.0.0 0.8671352 0.8113418 0.6135990 0.6826202 0.8601678 0.7693314 0.8069385
8.0.1 0.4695412 1.0580121 0.8191335 0.9231220 0.6918469 0.8509011 1.2230739
# Normalizing the columns, by reference
fsum(m, TRA = "/", set = TRUE)
fsum(m) # Check
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
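For comparison, here is a base R sketch of the same column normalization. Unlike the by-reference call above, `sweep()` returns a modified copy. The matrix `m2` and the seed are assumptions introduced only to make the sketch reproducible.

```r
# Base R sketch: divide every column by its sum (this creates a copy)
set.seed(1)                                  # seed chosen only for reproducibility
m2 <- matrix(abs(rnorm(32^2)), 32)           # same construction as m above
m2_norm <- sweep(m2, 2, colSums(m2), "/")    # column j divided by colSums(m2)[j]
colSums(m2_norm)                             # each column sums to 1 (up to floating point)
```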
# Multiply the rows with a vector (by reference)
setop(m, "*", mtcars$mpg, rowwise = TRUE)
# Replace some elements with a number
setv(m, 3:40, 5.76) # Could also use a vector to copy from
whichv(m, 5.76) # get the indices back...
[1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
It is also fairly easy to do more involved data exploration and manipulation:
# Groningen Growth and Development Center 10 Sector Database (see ?GGDC10S)
namlab(GGDC10S, N = TRUE, Ndistinct = TRUE, class = TRUE)
     Variable     Class    N Ndist Label
1     Country character 5027    43 Country
2  Regioncode character 5027     6 Region code
3      Region character 5027     6 Region
4    Variable character 5027     2 Variable
5        Year   numeric 5027    67 Year
6         AGR   numeric 4364  4353 Agriculture
7         MIN   numeric 4355  4224 Mining
8         MAN   numeric 4355  4353 Manufacturing
9          PU   numeric 4354  4237 Utilities
10        CON   numeric 4355  4339 Construction
11        WRT   numeric 4355  4344 Trade, restaurants and hotels
12        TRA   numeric 4355  4334 Transport, storage and communication
13       FIRE   numeric 4355  4349 Finance, insurance, real estate and business services
14        GOV   numeric 3482  3470 Government services
15        OTH   numeric 4248  4238 Community, social and personal services
16        SUM   numeric 4364  4364 Summation of sector GDP
# Describe total Employment and Value-Added
descr(GGDC10S, SUM ~ Variable)
Dataset: GGDC10S, 1 Variables, N = 5027
Grouped by: Variable [2]
       N   Perc
EMP 2516  50.05
VA  2511  49.95
------------------------------------------------------------------------------------------------------------------------
SUM (numeric): Summation of sector GDP
Statistics (N = 4364, 13.19% NAs)
       N   Perc  Ndist         Mean          SD     Min             Max   Skew    Kurt
EMP 2225  50.99   2225     36846.87    96318.65  173.88          764200   5.02   30.98
VA  2139  49.01   2139  43'961639.1  358'350627       0  8.06794210e+09  15.77  289.46

Quantiles
         1%      5%      10%      25%        50%          75%          90%         95%         99%
EMP  256.12  599.38  1599.27  3555.62    9593.98      24801.5     66975.01   152402.28    550909.6
VA        0   25.01   444.54    21302  243186.47  1'396139.11  15'926968.3  104'405351  692'993893
------------------------------------------------------------------------------------------------------------------------
# Compute growth rate (Employment and VA, all sectors)
GGDC10S_growth <- tfmv(GGDC10S, AGR:SUM, fgrowth, # tfmv = transform variables. Alternatively: fmutate(across(...))
                       g = list(Country, Variable), t = Year, # Internal grouping and ordering, passed to fgrowth()
                       apply = FALSE) # apply = FALSE ensures we call fgrowth.data.frame

# Recast the dataset, median growth rate across years, taking along variable labels
GGDC_med_growth <- pivot(GGDC10S_growth,
  ids = c("Country", "Regioncode", "Region"),
  values = slt(GGDC10S, AGR:SUM, return = "names"), # slt = shorthand for fselect()
  names = list(from = "Variable", to = "Sectorcode"),
  labels = list(to = "Sector"),
  FUN = fmedian, # Fast function = vectorized
  how = "recast" # Recast (transposition) method
) |> qDT()
GGDC_med_growth[1:3]
   Country Regioncode             Region Sectorcode      Sector        VA       EMP
    <char>     <char>             <char>     <fctr>      <fctr>     <num>     <num>
1:     BWA        SSA Sub-saharan Africa        AGR Agriculture  8.790267 0.8921475
2:     ETH        SSA Sub-saharan Africa        AGR Agriculture  6.664964 2.5876142
3:     GHA        SSA Sub-saharan Africa        AGR Agriculture 28.215905 1.4045550
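To make the grouped, time-ordered growth computation more concrete, here is a base R sketch on a small made-up panel. It assumes the quantity of interest is the period-over-period percentage change within each group after sorting by time, which is how the growth step above is understood here. The data frame `toy` and its columns are hypothetical.

```r
# Base R sketch: percentage growth within groups, ordered by time.
# toy is a made-up panel used only for illustration.
toy <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  year = c(2001, 2000, 2002, 2000, 2001),   # deliberately unsorted
  x    = c(110, 100, 121, 50, 55)
)

toy <- toy[order(toy$id, toy$year), ]        # order by group, then by time

# Within each id: (x_t - x_{t-1}) / x_{t-1} * 100, with NA for the first period
toy$growth <- ave(toy$x, toy$id,
                  FUN = function(v) c(NA, diff(v) / head(v, -1) * 100))
toy
```

The median of `growth` per group would then correspond to the kind of summary the recast step computes for each country and sector.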
# Finally, let's just join this to wlddev, enabling multiple matches (cartesian product)
# -> on average 61 years x 11 sectors = 671 records per unique (country) match
join(wlddev, GGDC_med_growth, on = c("iso3c" = "Country"),
     how = "inner", multiple = TRUE) |> ss(1:3)
inner join: wlddev[iso3c] 2379/13176 (18.1%) <61:11> GGDC_med_growth[Country] 429/473 (90.7%)
    country iso3c       date year decade                     region              income  OECD    PCGDP LIFEEX GINI
1 Argentina   ARG 1961-01-01 1960   1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055   NA
2 Argentina   ARG 1961-01-01 1960   1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055   NA
3 Argentina   ARG 1961-01-01 1960   1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055   NA
        ODA      POP Regioncode        Region Sectorcode        Sector       VA        EMP
1 219809998 20481779        LAM Latin America        AGR   Agriculture 32.91968 -0.8646301
2 219809998 20481779        LAM Latin America        MIN        Mining 25.72799  1.5627293
3 219809998 20481779        LAM Latin America        MAN Manufacturing 26.66754  1.0801500
In summary: collapse provides flexible high-performance statistical and data manipulation tools, which extend and seamlessly integrate with data.table. The package follows a similar development philosophy emphasizing API stability, parsimonious syntax, and zero dependencies (apart from Rcpp). data.table users may wish to employ collapse for some of the advanced statistical and manipulation functionality showcased above, but also to efficiently manipulate other data frame-like objects, such as sf data frames.