Seal of Approval: collapse

Sebastian Krantz

1 week ago

[This article was first published on Blog, and kindly contributed to R-bloggers.]

`collapse`

Author(s): Sebastian Krantz

Maintainer: Sebastian Krantz (sebastian.krantz@graduateinstitute.ch)

collapse is a large C/C++-based infrastructure package facilitating complex statistical computing, data transformation, and exploration tasks in R – at outstanding levels of performance and memory efficiency. It also implements a class-agnostic approach to R programming, supporting vector, matrix, and data frame-like objects (including xts, tibble, data.table, and sf). It has a stable API, depends on Rcpp, and supports R versions >= 3.4.0.

Relationship with `data.table`

At the C level, collapse took much inspiration from data.table and leverages some of its core algorithms, such as radix sort, while adding significant statistical functionality and new algorithms within a class-agnostic programming framework that seamlessly supports data.table. Notably, collapse::qDT() is a highly efficient anything-to-data.table converter, and all manipulation functions in collapse, if passed data.tables, return valid data.tables, allowing for subsequent reference operations (:=).

Its added functionality includes a rich set of Fast Statistical Functions supporting vectorized (grouped, weighted) statistical operations on matrix-like objects. These are integrated with fast data manipulation functions so that more complex statistical expressions can also be vectorized across groups. It also adds flexible time series functions and classes supporting irregular series and panels, (panel-)data transformations, vectorized hash-joins, fast aggregation and recast pivots, (internal) support for variable labels, powerful descriptive tools, memory-efficient programming tools, and recursive tools for heterogeneous nested data.

It is highly and interactively configurable. A navigable internal documentation/overview facilitates its use.
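As a minimal sketch of what interactive configuration can look like, the snippet below changes two global defaults and then restores them. It assumes collapse 2.0 or later, where set_collapse() and get_collapse() are available; the option values are purely illustrative.

```r
library(collapse)

# Assumption: collapse >= 2.0, which provides set_collapse() / get_collapse()
old <- set_collapse(nthreads = 2, na.rm = FALSE) # previous settings are returned invisibly
get_collapse("nthreads")                         # query a single option

# Restore the previous settings
set_collapse(old)

# Open the package's internal documentation overview (interactive use)
# help("collapse-documentation")
```

Changing defaults globally affects every subsequent call in the session, so restoring the old settings afterwards keeps scripts predictable.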

Overview

The easiest way to load collapse and data.table together is via the fastverse package:

library(fastverse)

-- Attaching packages ------------------------------------------------------------------------------- fastverse 0.3.3 --

v data.table 1.15.4     v kit        0.0.19
v magrittr   2.0.3      v collapse   2.0.16

The following demonstrates collapse’s deep integration with data.table:

mtcarsDT <- qDT(mtcars)                # This creates a valid data.table (no deep copy)
mtcarsDT[, new := mean(mpg), by = cyl] # Proof: no warning here

There are many reasons to use collapse, e.g., to compute advanced statistics very fast:

# Fast tidyverse-like functions: one of the ways to code with collapse
mtcDTagg <- mtcarsDT |> 
  fgroup_by(cyl, vs, am) |> 
  fsummarise(mpg_wtd_median = fmedian(mpg, wt),             # Weighted median
             mpg_wtd_p90 = fnth(mpg, 0.9, wt, ties = "q8"), # Weighted 90% quantile type 8
             mpg_wtd_mode = fmode(mpg, wt, ties = "max"),   # Weighted maximum mode 
             mpg_range = fmax(mpg) %-=% fmin(mpg),          # Range: vectorized and memory efficient   
             lm_mpg_carb = fsum(mpg, W(carb)) %/=% fsum(W(carb)^2)) # coef(lm(mpg ~ carb)): vectorized
# Note: for increased parsimony, can abbreviate fgroup_by -> gby, fsummarise -> smr
mtcDTagg[, new2 := 1][1:3] # Still a data.table

     cyl    vs    am mpg_wtd_median mpg_wtd_p90 mpg_wtd_mode mpg_range lm_mpg_carb  new2
   <num> <num> <num>          <num>       <num>        <num>     <num>       <num> <num>
1:     4     0     1           26.0    26.00000         26.0       0.0         NaN     1
2:     4     1     0           22.8    24.40000         24.4       2.9         2.1     1
3:     4     1     1           30.4    33.48809         30.4      12.5        -1.7     1
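The lm_mpg_carb expression relies on the fact that a simple regression slope equals sum(y * x_c) / sum(x_c^2), where x_c is the regressor centered on its group mean (the role W(carb) plays above). The following base-R sketch cross-checks this on a single group; it is only an illustration of the formula, not of how collapse computes it internally, and the names d, carb_c and slope are introduced here for the example.

```r
# Base-R cross-check of the slope formula on one group of mtcars
# (group cyl = 4, vs = 1, am = 1, chosen for illustration only)
d <- subset(mtcars, cyl == 4 & vs == 1 & am == 1)

carb_c <- d$carb - mean(d$carb)               # within-group centering of the regressor
slope  <- sum(d$mpg * carb_c) / sum(carb_c^2) # OLS slope of mpg on carb

# Compare with the slope estimated by lm()
all.equal(slope, unname(coef(lm(mpg ~ carb, data = d))["carb"]))
```

The result agrees with the -1.7 shown in the third row of the output above.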

Or simply to use convenience functions like collap() for fast multi-type aggregation:

# World Development Dataset (see ?wlddev)
head(wlddev, 3)

      country iso3c       date year decade     region     income  OECD PCGDP LIFEEX GINI       ODA     POP
1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA 32.446   NA 116769997 8996973
2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA 32.962   NA 232080002 9169410
3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA 33.471   NA 112839996 9351441

# Population weighted mean for numeric and mode for non-numeric columns (multithreaded and 
# vectorized across groups and columns, the default in statistical functions is na.rm = TRUE)
wlddev |> collap(~ year + income, fmean, fmode, w = ~ POP, nthreads = 4) |> ss(1:3)

        country iso3c       date year decade                region              income  OECD      PCGDP   LIFEEX GINI
1 United States   USA 1961-01-01 1960   1960 Europe & Central Asia         High income  TRUE 12768.7126 68.59372   NA
2      Ethiopia   ETH 1961-01-01 1960   1960    Sub-Saharan Africa          Low income FALSE   658.4778 38.33382   NA
3         India   IND 1961-01-01 1960   1960            South Asia Lower middle income FALSE   500.7932 45.26707   NA
         ODA       POP
1  911825661 749495030
2  160457982 147355735
3 3278899549 927990163

We can also use the low-level API for statistical programming:

# Grouped mean (mpg by gear)
fmean(mtcars$mpg, mtcars$gear)

       3        4        5 
16.10667 24.53333 21.38000

# Grouping object from multiple columns
g <- GRP(mtcars, c("cyl", "vs", "am"))
fmean(mtcars$mpg, g)

   4.0.1    4.1.0    4.1.1    6.0.1    6.1.0    8.0.0    8.0.1 
26.00000 22.90000 28.37143 20.56667 19.12500 15.05000 15.40000

vars <- c("carb", "hp", "qsec") # columns to aggregate
# Aggregating: weighted mean, vectorized across groups and columns
add_vars(g$groups, # Grouping columns
  fmean(get_vars(mtcars, vars), g, 
        w = mtcars$wt, use.g.names = FALSE)
)

  cyl vs am     carb        hp     qsec
1   4  0  1 2.000000  91.00000 16.70000
2   4  1  0 1.720045  83.60420 21.04028
3   4  1  1 1.416115  82.11819 18.75509
4   6  0  1 4.670296 131.78463 16.33306
5   6  1  0 2.522685 115.32202 19.21275
6   8  0  0 3.186582 196.74988 17.20449
7   8  0  1 6.118694 301.60682 14.55297
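For readers who want to verify what the grouped, weighted mean above computes, here is a rough base-R equivalent for a single column. It is a sketch for checking results only (split-apply-combine, far slower on large data); groups and hp_wtd are names introduced for this example.

```r
# Base-R cross-check: weighted mean of hp by cyl / vs / am, with wt as weights
groups <- split(mtcars, list(mtcars$cyl, mtcars$vs, mtcars$am), drop = TRUE)
hp_wtd <- sapply(groups, function(d) weighted.mean(d$hp, d$wt))

# Group names follow the same "cyl.vs.am" pattern, though in a different order
round(hp_wtd[["4.1.1"]], 5)
```

The value for group 4.1.1 matches the hp entry in row 3 of the table above.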

# Let's aggregate a matrix 
m <- matrix(abs(rnorm(32^2)), 32)
m |> fmean(g) |> t() |> fmean(g) |> t()

          4.0.1     4.1.0     4.1.1     6.0.1     6.1.0     8.0.0     8.0.1
4.0.1 1.4521108 0.3523314 0.8342670 0.9171383 0.9653782 0.9972716 0.9588307
4.1.0 0.4862770 0.8524147 0.7069511 1.0736184 0.7940877 0.7172582 0.6543751
4.1.1 0.8559884 0.8533788 0.8341950 0.8516854 0.6340604 0.8739776 0.7811358
6.0.1 0.9444262 0.7027453 1.0463235 0.7361824 0.8646207 0.9863881 0.7091550
6.1.0 0.9136786 0.8365409 0.8170907 0.8222345 0.9893137 0.9412397 0.9012829
8.0.0 0.8671352 0.8113418 0.6135990 0.6826202 0.8601678 0.7693314 0.8069385
8.0.1 0.4695412 1.0580121 0.8191335 0.9231220 0.6918469 0.8509011 1.2230739

# Normalizing the columns, by reference
fsum(m, TRA = "/", set = TRUE)
fsum(m) # Check

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
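For comparison, the column normalization can be written in base R with sweep(). Unlike the by-reference call above, this sketch allocates a new matrix; m2, m2_norm and the seed are introduced here for illustration only.

```r
set.seed(1)                                # illustrative seed, not from the original
m2 <- matrix(abs(rnorm(32^2)), 32)

# Divide each column by its column sum (returns a modified copy)
m2_norm <- sweep(m2, 2, colSums(m2), "/")

all.equal(colSums(m2_norm), rep(1, 32))    # every column now sums to 1
```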

# Multiply the rows with a vector (by reference)
setop(m, "*", mtcars$mpg, rowwise = TRUE)
# Replace some elements with a number
setv(m, 3:40, 5.76) # Could also use a vector to copy from
whichv(m, 5.76) # get the indices back...

 [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

It is also fairly easy to do more involved data exploration and manipulation:

# Groningen Growth and Development Center 10 Sector Database (see ?GGDC10S)
namlab(GGDC10S, N = TRUE, Ndistinct = TRUE, class = TRUE)

     Variable     Class    N Ndist                                                 Label
1     Country character 5027    43                                               Country
2  Regioncode character 5027     6                                           Region code
3      Region character 5027     6                                                Region
4    Variable character 5027     2                                              Variable
5        Year   numeric 5027    67                                                  Year
6         AGR   numeric 4364  4353                                          Agriculture 
7         MIN   numeric 4355  4224                                                Mining
8         MAN   numeric 4355  4353                                         Manufacturing
9          PU   numeric 4354  4237                                             Utilities
10        CON   numeric 4355  4339                                          Construction
11        WRT   numeric 4355  4344                         Trade, restaurants and hotels
12        TRA   numeric 4355  4334                  Transport, storage and communication
13       FIRE   numeric 4355  4349 Finance, insurance, real estate and business services
14        GOV   numeric 3482  3470                                   Government services
15        OTH   numeric 4248  4238               Community, social and personal services
16        SUM   numeric 4364  4364                               Summation of sector GDP

# Describe total Employment and Value-Added
descr(GGDC10S, SUM ~ Variable)

Dataset: GGDC10S, 1 Variables, N = 5027
Grouped by: Variable [2]
        N   Perc
EMP  2516  50.05
VA   2511  49.95
------------------------------------------------------------------------------------------------------------------------
SUM (numeric): Summation of sector GDP
Statistics (N = 4364, 13.19% NAs)
        N   Perc  Ndist         Mean          SD     Min             Max   Skew    Kurt
EMP  2225  50.99   2225     36846.87    96318.65  173.88          764200   5.02   30.98
VA   2139  49.01   2139  43'961639.1  358'350627       0  8.06794210e+09  15.77  289.46

Quantiles
         1%      5%      10%      25%        50%          75%          90%         95%         99%
EMP  256.12  599.38  1599.27  3555.62    9593.98      24801.5     66975.01   152402.28    550909.6
VA        0   25.01   444.54    21302  243186.47  1'396139.11  15'926968.3  104'405351  692'993893
------------------------------------------------------------------------------------------------------------------------

# Compute growth rate (Employment and VA, all sectors)
GGDC10S_growth <- tfmv(GGDC10S, AGR:SUM, fgrowth, # tfmv = transform variables. Alternatively: fmutate(across(...))
                       g = list(Country, Variable), t = Year, # Internal grouping and ordering, passed to fgrowth()
                       apply = FALSE) # apply = FALSE ensures we call fgrowth.data.frame

# Recast the dataset, median growth rate across years, taking along variable labels 
GGDC_med_growth <- pivot(GGDC10S_growth,
  ids = c("Country", "Regioncode", "Region"),
  values = slt(GGDC10S, AGR:SUM, return = "names"), # slt = shorthand for fselect()
  names = list(from = "Variable", to = "Sectorcode"),
  labels = list(to = "Sector"), 
  FUN = fmedian,  # Fast function = vectorized
  how = "recast"  # Recast (transposition) method
) |> qDT()
GGDC_med_growth[1:3]

   Country Regioncode             Region Sectorcode       Sector        VA       EMP
    <char>     <char>             <char>     <fctr>       <fctr>     <num>     <num>
1:     BWA        SSA Sub-saharan Africa        AGR Agriculture   8.790267 0.8921475
2:     ETH        SSA Sub-saharan Africa        AGR Agriculture   6.664964 2.5876142
3:     GHA        SSA Sub-saharan Africa        AGR Agriculture  28.215905 1.4045550

# Finally, let's just join this to wlddev, enabling multiple matches (cartesian product)
# -> on average 61 years x 11 sectors = 671 records per unique (country) match
join(wlddev, GGDC_med_growth, on = c("iso3c" = "Country"), 
     how = "inner", multiple = TRUE) |> ss(1:3)

inner join: wlddev[iso3c] 2379/13176 (18.1%) <61:11> GGDC_med_growth[Country] 429/473 (90.7%)

    country iso3c       date year decade                    region              income  OECD    PCGDP LIFEEX GINI
1 Argentina   ARG 1961-01-01 1960   1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055   NA
2 Argentina   ARG 1961-01-01 1960   1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055   NA
3 Argentina   ARG 1961-01-01 1960   1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055   NA
        ODA      POP Regioncode        Region Sectorcode        Sector       VA        EMP
1 219809998 20481779        LAM Latin America        AGR  Agriculture  32.91968 -0.8646301
2 219809998 20481779        LAM Latin America        MIN        Mining 25.72799  1.5627293
3 219809998 20481779        LAM Latin America        MAN Manufacturing 26.66754  1.0801500

In summary: collapse provides flexible high-performance statistical and data manipulation tools, which extend and seamlessly integrate with data.table. The package follows a similar development philosophy emphasizing API stability, parsimonious syntax, and zero dependencies (apart from Rcpp). data.table users may wish to employ collapse for some of the advanced statistical and manipulation functionality showcased above, but also to efficiently manipulate other data frame-like objects, such as sf data frames.
