Benchmarking cast in R from long data frame to wide matrix
[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In my daily work I often have to transform a long table to a wide matrix so accommodate some function. At some stage in my life I came across the reshape2 package, and I have been with that philosophy ever since – I find it makes data wrangling easy and straight forward. I particularly like the tidyverse philosophy where data should be in a long table, where one row is an observation, and a column a parameter. It just makes sense. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
However, I quite often have to transform the data into another format, a wide matrix especially for functions of the vegan package, and one day I wondering how to do that in the fastest way.
The code to create the test sets and benchmark the functions is in section ‘Settings and script’ at the end of this document.
I created several data sets that mimic the data I usually work with in terms of size and values. The data sets have 2 to 10 groups, where each group can have up to 50000, 100000, 150000, or 200000 samples. The methods xtabs() from base R, dcast() from data.table, dMcast() from Matrix.utils, and spread() from tidyr were benchmarked using microbenchmark() from the package microbenchmark. Each method was evaluated 10 times on the same data set, which was repeated for 10 randomly generated data sets.
After the 10 x 10 repetitions of casting from long to wide it is clear the spread() is the worst. This is clear when we focus on the size (figure 1).
And the complexity (figure 2).
Close up on the top three methods
Casting from a long table to a wide matrix is clearly slowest with spread(), where as the remaining look somewhat similar. A direct comparison of the methods show a similarity in their performance, with dMcast() from the package Matrix.utils being better — especially with the large and more complex tables (figure 3).I am aware, that it might be to much to assume linearity, between the computation times at different set sizes, but I do believe it captures the point – dMcast() and dcast() are similar, with advantage to dMcast() for large data sets with large number of groups. It does, however, look like dcast() scales better with the complexity (figure 4).
Settings and script
Session info
## ─ Session info ────────────────────────────────────────────────────────── ## setting value ## version R version 3.5.2 (2018-12-20) ## os Ubuntu 18.04.1 LTS ## system x86_64, linux-gnu ## ui X11 ## language en_GB:en_US ## collate en_DE.UTF-8 ## ctype en_DE.UTF-8 ## tz Europe/Berlin ## date 2019-02-03 ## ## ─ Packages ────────────────────────────────────────────────────────────── ## package * version date lib source ## assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.2) ## bindr 0.1.1 2018-03-13 [1] CRAN (R 3.5.2) ## bindrcpp * 0.2.2 2018-03-29 [1] CRAN (R 3.5.2) ## cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.2) ## colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.2) ## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.2) ## digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.2) ## dplyr * 0.7.8 2018-11-10 [1] CRAN (R 3.5.2) ## evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.2) ## ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.2) ## glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.2) ## gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.2) ## highr 0.7 2018-06-09 [1] CRAN (R 3.5.2) ## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.2) ## knitr 1.21 2018-12-10 [1] CRAN (R 3.5.2) ## labeling 0.3 2014-08-23 [1] CRAN (R 3.5.2) ## lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.2) ## magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.2) ## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.2) ## packrat 0.5.0 2018-11-14 [1] CRAN (R 3.5.2) ## pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2) ## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.2) ## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.2) ## purrr 0.2.5 2018-05-29 [1] CRAN (R 3.5.2) ## R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.2) ## Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.2) ## rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) ## rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.2) ## scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.2) ## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2) ## stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.2) ## stringr * 1.3.1 2018-05-10 [1] CRAN (R 3.5.2) ## tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2) ## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.2) ## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.2) ## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.2) ## xfun 0.4 2018-10-23 [1] CRAN (R 3.5.2) ## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.2)[code style=”display: none;”]]czo1OTpcIjxjb2RlIHN0eWxlPVwiZGlzcGxheTogbm9uZTtcIj48Y29kZSBzdHlsZT1cImRpc3BsYXk6IG5vbmU7XCI+ClwiO3tbJiomXX0=[[/code][code style=”display: none;”]]czoxOlwiClwiO3tbJiomXX0=[[/code]
Settings
[code style=”display: none;”]]czo2MDpcIjxjb2RlIHN0eWxlPVwiZGlzcGxheTogbm9uZTtcIj4KPGNvZGUgc3R5bGU9XCJkaXNwbGF5OiBub25lO1wiPgpcIjt7WyYqJl19[[/code]# settings.yml set_size: [50000, 100000, 150000, 200000] num_groups: [2, 3, 4, 5, 6, 7, 8, 9, 10] benchmark_repetitions: 10 num_test_sets: 10 max_value: 500 word_length: 10[code style=”display: none;”]]czo2MDpcIjxjb2RlIHN0eWxlPVwiZGlzcGxheTogbm9uZTtcIj4KPGNvZGUgc3R5bGU9XCJkaXNwbGF5OiBub25lO1wiPgpcIjt7WyYqJl19[[/code]
Data creation and benchmarking scripts
[code style=”display: none;”]]czo2MDpcIjxjb2RlIHN0eWxlPVwiZGlzcGxheTogbm9uZTtcIj4KPGNvZGUgc3R5bGU9XCJkaXNwbGF5OiBub25lO1wiPgpcIjt7WyYqJl19[[/code][code style=”display: none;”]]czoxOlwiClwiO3tbJiomXX0=[[/code][code style=”display: none;”]]czoxOlwiClwiO3tbJiomXX0=[[/code]# main.R # Global variables ---------------------------------------------- # Set this to FALSE if you want to run the complete analysis running_test <- TRUE vars <- yaml::read_yaml("./settings.yml") set_size <- vars$set_size num_groups <- vars$num_groups benchmark_repetitions <- vars$benchmark_repetitions num_test_sets <- vars$num_test_sets max_value <- vars$max_value word_length <- vars$word_length # Test variables ------------------------------------------------ if(running_test){ set_size <- seq.int(0L, 60L, 30L) num_groups <- c(2L:3L) benchmark_repetitions <- 2L num_test_sets <- 2L } # Libraries ----------------------------------------------------- suppressPackageStartupMessages(library(foreach)) suppressPackageStartupMessages(library(doParallel)) # Setup parallel ------------------------------------------------ num_cores <- detectCores() - 1 these_cores <- makeCluster(num_cores, type = "PSOCK") registerDoParallel(these_cores) # Functions ----------------------------------------------------- run_benchmark <- function(as){ source("test_cast.R") num_groups <- as["num_groups"] set_size <- as["set_size"] num_test_sets <- as["num_test_sets"] word_length <- as["word_length"] max_value <- as["max_value"] test_data <- prepare_test_data(set_size, num_groups, word_length, max_value) perform_benchmark(test_data, benchmark_repetitions) } # Setup and run tests ------------------------------------------- set_size <- set_size[set_size > 0] analysis_comb <- expand.grid(num_groups, set_size) analysis_settings <- vector("list", length = nrow(analysis_comb)) for(i in seq_len(nrow(analysis_comb))){ analysis_settings[[i]] <- c(num_groups =analysis_comb[i, "Var1"], set_size = analysis_comb[i, "Var2"], word_length = word_length, max_value = max_value, benchmark_repetitions = benchmark_repetitions) } for(as in analysis_settings){ num_groups <- as["num_groups"] set_size <- as["set_size"] report_str <- paste("ng:", num_groups, "- setsize:", set_size, "\n") cat(report_str) rds_file_name <- paste0("./output/benchmark_setsize-", set_size, "_ng-", num_groups, ".rds") bm_res <- foreach(seq_len(num_test_sets), .combine = "rbind") %dopar% { run_benchmark(as) } bm_res <- dplyr::mutate(bm_res, `Number groups` = num_groups, `Set size` = set_size) saveRDS(bm_res, rds_file_name) report_str }[code style=”display: none;”]]czo2MDpcIjxjb2RlIHN0eWxlPVwiZGlzcGxheTogbm9uZTtcIj4KPGNvZGUgc3R5bGU9XCJkaXNwbGF5OiBub25lO1wiPgpcIjt7WyYqJl19[[/code]
[code style=”display: none;”]]czoxOlwiClwiO3tbJiomXX0=[[/code]
# test_cast.R setup <- function(){ library(data.table) library(tidyr) library(dplyr) library(Matrix.utils) library(tibble) } prepare_test_data <- function(set_size, num_groups, word_length, sample_int_n){ calc_subset_size <- function(){ subset_size <- 0 while(subset_size == 0 | subset_size > set_size){ subset_size <- abs(round(set_size - set_size/(3 + rnorm(1)))) } subset_size } words <- stringi::stri_rand_strings(set_size, word_length) subset_sizes <- replicate(num_groups, calc_subset_size()) purrr::map_df(subset_sizes, function(subset_size){ tibble::tibble(Variable = sample(words, subset_size), Value = sample.int(sample_int_n, subset_size, replace = TRUE), Group = stringi::stri_rand_strings(1, word_length)) }) } test_tidyr <- function(test_df){ test_df %>% spread(Variable, Value, fill = 0L) %>% tibble::column_to_rownames(var = "Group") %>% as.matrix.data.frame() } test_xtabs <- function(test_df){ xtabs(Value ~ Group + Variable, data = test_df) } test_dMcast <- function(test_df){ class(test_df) <- "data.frame" dMcast(test_df, Group ~ Variable, value.var = "Value") } test_dcast <- function(test_df){ test_df_dt <- data.table(test_df) dcast(test_df_dt, Group ~ Variable, value.var = "Value", fill = 0) %>% tibble::column_to_rownames(var = "Group") %>% as.matrix.data.frame() } perform_benchmark <- function(test_df, benchmark_repetitions){ suppressPackageStartupMessages(setup()) bm_res <- microbenchmark::microbenchmark( Spread = test_tidyr(test_df = test_df), Xtabs = test_xtabs(test_df = test_df), dMcast = test_dMcast(test_df = test_df), dcast = test_dcast(test_df = test_df), times = benchmark_repetitions ) class(bm_res) <- "data.frame" bm_res %>% mutate(time = microbenchmark:::convert_to_unit(time, "ms")) %>% rename(Method = expr, `Time (ms)` = time) }
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.