A Simple Example of Using replyr::gapply
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement
and process_that_produced_measurement
. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, replyr::gapply
, from our latest package, replyr
.
Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.
The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.
To keep this post short, we will skip over the preliminary data processing and the modeling; if you are interested, the code for the full example is available here. We will fit a logistic regression model (GLM), a generalized additive model (GAM), and a random forest model (ranger
implementation) to a training set, and evaluate the models’ performance on a hold-out set.
# load the file of model fitting and prediction functions source("modelfitting.R") algolist = list(glm=glm_predictor, gam=gam_predictor, rangerRF=ranger_predictor) # define outcome column and variables outcome = "defaults" varlist = ... # Fit models for each algorithm and gather together the # predictions each model makes on a test set. predictors = fit_models(algolist, outcome, varlist, train) predframe = make_predictions(predictors, test, outcome) library(replyr) replyr_summary(predframe)[, c("column", "class", "nunique")] ## column class nunique ## 2 defaults logical 2 ## 3 model character 3 ## 1 pred numeric 17973 replyr_uniqueValues(predframe, "model") ## # A tibble: 3 × 2 ## model n ## <chr> <dbl> ## 1 gam 5997 ## 2 glm 5997 ## 3 rangerRF 5997
The results of the evaluation are in a single data frame predframe
, with columns defaults
(the true outcome: whether or not this customer defaulted on their loan in the next month); pred
(the predicted probability of default); and model
(the model that made the prediction).
To evaluate each model’s performance, we write a function metric_row
that takes a frame of predictions and true outcomes, and returns a data frame of all the performance metrics (deviance explained, accuracy, precision, and recall; the implementations for each metric are not shown here). This is the function we wish to apply to each group of data.
metric_row = function(subframe, yvar, pred, label) { confmat = cmat(subframe[[yvar]], subframe[[pred]]) devExplained = sigr::formatChiSqTest(subframe, pred, yvar)$pseudoR2 tframe = data.frame(devExplained=devExplained, accuracy=accuracy(confmat), precision=precision(confmat), recall=recall(confmat)) tframe$model = subframe[[label]][1] # assuming there is only one label tframe } # example outcome of metric_row, for the glm model metric_row(subset(predframe, model=="glm"), outcome, "pred", "glm") ## devExplained accuracy precision recall ## 1 0.1125283 0.8094047 0.7238979 0.2335329
In this case our data processing returns a one-row data frame but you could return a multirow frame. For example, if the data we process includes predictions for both the training and test sets, we could return a data frame with one row each for test and training performance.
We would like to use split-apply-combine on all the data, to return a frame of performance metrics for all the models that we evaluated. We can do that explicitly, of course (additionally sorted by deviance explained, descending):
# # Compute performance metrics for all the model types # Order by deviance explained # split(predframe, predframe$model) %>% lapply(function(fi) {metric_row(fi, outcome, 'pred', 'model')}) %>% dplyr::bind_rows() %>% dplyr::arrange(desc(devExplained))
replyr::gapply
provides a convenient function to wrap most of the above pipe.
# # Compute performance metrics for all the model types # Order by deviance explained # replyr::gapply(predframe, 'model', function(fi) metric_row(fi,outcome, 'pred', 'model'), partitionMethod = 'split') %>% dplyr::arrange(desc(devExplained)) ## devExplained accuracy precision recall model ## 1 0.1810591 0.8174087 0.6704385 0.3547904 gam ## 2 0.1767817 0.8180757 0.6680384 0.3645210 rangerRF ## 3 0.1125283 0.8094047 0.7238979 0.2335329 glm
The partitionMethod = 'split'
argument tells gapply
to split the data using base::split
, rather than partitioning the data using dplyr::group_by
before applying the user-supplied function. dplyr::group_by
is the default partitioning method, but isn’t suitable for the function (metric_row
) that I want to apply.
Conclusion
replyr::gapply
implements the split-order-apply pattern in a convenient wrapper function. It supports dplyr
grouped operations and explicit data partitioning (as in base::split
), and can be used on any dplyr
-supported back-end. The replyr
package is on CRAN; the most recent development version is available on Github.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.