Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.
The problem
Different model functions take different types of inputs (data.frames, matrices, etc) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:
Thus, different models may need very different code.
However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.
Two-step modelling
To understand the solution, think of the problem as a two-step process, depicted in this Figure:
Step 1: from data to fitted model
Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!
To demonstrate:
#devtools::install_github("drimonj/twidlr") # To install library(twidlr) lm(mtcars, hp ~ .) #> #> Call: #> stats::lm(formula = formula, data = data) #> #> Coefficients: #> (Intercept) mpg cyl disp drat #> 79.048 -2.063 8.204 0.439 -4.619 #> wt qsec vs am gear #> -27.660 -1.784 25.813 9.486 7.216 #> carb #> 18.749
This means we can pipe data.frames into any model function exposed by twidlr. For example:
library(dplyr) mtcars %>% lm(hp ~ .) #> #> Call: #> stats::lm(formula = formula, data = data) #> #> Coefficients: #> (Intercept) mpg cyl disp drat #> 79.048 -2.063 8.204 0.439 -4.619 #> wt qsec vs am gear #> -27.660 -1.784 25.813 9.486 7.216 #> carb #> 18.749
Step2: fitted model to tidy results
Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance
, tidy
, and augment
! To demonstrate:
#install.packages("broom") # To install library(broom) fit <- mtcars %>% lm(hp ~ .) glance(fit) #> r.squared adj.r.squared sigma statistic p.value df logLik #> 1 0.9027993 0.8565132 25.97138 19.50477 1.89833e-08 11 -142.8905 #> AIC BIC deviance df.residual #> 1 309.7809 327.3697 14164.76 21 tidy(fit) #> term estimate std.error statistic p.value #> 1 (Intercept) 79.0483879 184.5040756 0.4284371 0.672695339 #> 2 mpg -2.0630545 2.0905650 -0.9868407 0.334955314 #> 3 cyl 8.2037204 10.0861425 0.8133655 0.425134929 #> 4 disp 0.4390024 0.1492007 2.9423609 0.007779725 #> 5 drat -4.6185488 16.0829171 -0.2871711 0.776795845 #> 6 wt -27.6600472 19.2703681 -1.4353668 0.165910518 #> 7 qsec -1.7843654 7.3639133 -0.2423121 0.810889101 #> 8 vs 25.8128774 19.8512410 1.3003156 0.207583411 #> 9 am 9.4862914 20.7599371 0.4569518 0.652397317 #> 10 gear 7.2164047 14.6160152 0.4937327 0.626619355 #> 11 carb 18.7486691 7.0287674 2.6674192 0.014412403 augment(fit) %>% head() #> .rownames hp mpg cyl disp drat wt qsec vs am gear carb #> 1 Mazda RX4 110 21.0 6 160 3.90 2.620 16.46 0 1 4 4 #> 2 Mazda RX4 Wag 110 21.0 6 160 3.90 2.875 17.02 0 1 4 4 #> 3 Datsun 710 93 22.8 4 108 3.85 2.320 18.61 1 1 4 1 #> 4 Hornet 4 Drive 110 21.4 6 258 3.08 3.215 19.44 1 0 3 1 #> 5 Hornet Sportabout 175 18.7 8 360 3.15 3.440 17.02 0 0 3 2 #> 6 Valiant 105 18.1 6 225 2.76 3.460 20.22 1 0 3 1 #> .fitted .resid .hat .sigma .cooksd .std.resid #> 1 148.68122 -38.681220 0.2142214 24.75946 0.069964902 -1.6801773 #> 2 140.62866 -30.628664 0.2323739 25.43881 0.049861042 -1.3460408 #> 3 79.99158 13.008418 0.3075987 26.38216 0.014633059 0.6019364 #> 4 125.75448 -15.754483 0.2103960 26.31579 0.011288712 -0.6826601 #> 5 183.21756 -8.217565 0.2016137 26.53317 0.002878707 -0.3541128 #> 6 111.38490 -6.384902 0.3147448 26.55680 0.003682813 -0.2969840
A single, tidy pipeline
So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:
library(twidlr) library(broom) mtcars %>% lm(hp ~ .) %>% glance() #> r.squared adj.r.squared sigma statistic p.value df logLik #> 1 0.9027993 0.8565132 25.97138 19.50477 1.89833e-08 11 -142.8905 #> AIC BIC deviance df.residual #> 1 309.7809 327.3697 14164.76 21
Any model included in twidlr and broom can be used in this same way. Here’s a kmeans
example:
iris %>% select(-Species) %>% kmeans(centers = 3) %>% tidy() #> x1 x2 x3 x4 size withinss cluster #> 1 5.901613 2.748387 4.393548 1.433871 62 39.82097 1 #> 2 5.006000 3.428000 1.462000 0.246000 50 15.15100 2 #> 3 6.850000 3.073684 5.742105 2.071053 38 23.87947 3
And a ridge regression with cross-fold validation example:
mtcars %>% cv.glmnet(am ~ ., alpha = 0) %>% glance() #> lambda.min lambda.1se #> 1 0.2284167 0.8402035
So next time you want to do some tidy modelling, keep this pipeline in mind:
Limitations
Currently, the major limitation for this approach is that a model must be covered by twidlr and broom. For example, you can’t use randomForest
in this pipeline because, although twidlr exposes a data.frame friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.
Sign off
Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.