A tidy model pipeline with twidlr and broom

Simon Jackson

5 years ago

[This article was first published on blogR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

The problem

Different model functions take different types of inputs (data.frames, matrices, etc) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:

Thus, different models may need very different code.

However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.

Two-step modelling

To understand the solution, think of the problem as a two-step process, depicted in this Figure:

Step 1: from data to fitted model

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!

To demonstrate:

#devtools::install_github("drimonj/twidlr")  # To install
library(twidlr)

lm(mtcars, hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat  
#>      79.048       -2.063        8.204        0.439       -4.619  
#>          wt         qsec           vs           am         gear  
#>     -27.660       -1.784       25.813        9.486        7.216  
#>        carb  
#>      18.749

This means we can pipe data.frames into any model function exposed by twidlr. For example:

library(dplyr)

mtcars %>% lm(hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat  
#>      79.048       -2.063        8.204        0.439       -4.619  
#>          wt         qsec           vs           am         gear  
#>     -27.660       -1.784       25.813        9.486        7.216  
#>        carb  
#>      18.749

Step2: fitted model to tidy results

Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance, tidy, and augment! To demonstrate:

#install.packages("broom")  # To install
library(broom)

fit <- mtcars %>% lm(hp ~ .)

glance(fit)
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

tidy(fit)
#>           term    estimate   std.error  statistic     p.value
#> 1  (Intercept)  79.0483879 184.5040756  0.4284371 0.672695339
#> 2          mpg  -2.0630545   2.0905650 -0.9868407 0.334955314
#> 3          cyl   8.2037204  10.0861425  0.8133655 0.425134929
#> 4         disp   0.4390024   0.1492007  2.9423609 0.007779725
#> 5         drat  -4.6185488  16.0829171 -0.2871711 0.776795845
#> 6           wt -27.6600472  19.2703681 -1.4353668 0.165910518
#> 7         qsec  -1.7843654   7.3639133 -0.2423121 0.810889101
#> 8           vs  25.8128774  19.8512410  1.3003156 0.207583411
#> 9           am   9.4862914  20.7599371  0.4569518 0.652397317
#> 10        gear   7.2164047  14.6160152  0.4937327 0.626619355
#> 11        carb  18.7486691   7.0287674  2.6674192 0.014412403

augment(fit) %>% head()
#>           .rownames  hp  mpg cyl disp drat    wt  qsec vs am gear carb
#> 1         Mazda RX4 110 21.0   6  160 3.90 2.620 16.46  0  1    4    4
#> 2     Mazda RX4 Wag 110 21.0   6  160 3.90 2.875 17.02  0  1    4    4
#> 3        Datsun 710  93 22.8   4  108 3.85 2.320 18.61  1  1    4    1
#> 4    Hornet 4 Drive 110 21.4   6  258 3.08 3.215 19.44  1  0    3    1
#> 5 Hornet Sportabout 175 18.7   8  360 3.15 3.440 17.02  0  0    3    2
#> 6           Valiant 105 18.1   6  225 2.76 3.460 20.22  1  0    3    1
#>     .fitted     .resid      .hat   .sigma     .cooksd .std.resid
#> 1 148.68122 -38.681220 0.2142214 24.75946 0.069964902 -1.6801773
#> 2 140.62866 -30.628664 0.2323739 25.43881 0.049861042 -1.3460408
#> 3  79.99158  13.008418 0.3075987 26.38216 0.014633059  0.6019364
#> 4 125.75448 -15.754483 0.2103960 26.31579 0.011288712 -0.6826601
#> 5 183.21756  -8.217565 0.2016137 26.53317 0.002878707 -0.3541128
#> 6 111.38490  -6.384902 0.3147448 26.55680 0.003682813 -0.2969840

A single, tidy pipeline

So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:

library(twidlr)
library(broom)

mtcars %>% 
  lm(hp ~ .)  %>% 
  glance()
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

Any model included in twidlr and broom can be used in this same way. Here’s a kmeans example:

iris %>%
  select(-Species) %>% 
  kmeans(centers = 3) %>% 
  tidy()
#>         x1       x2       x3       x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871   62 39.82097       1
#> 2 5.006000 3.428000 1.462000 0.246000   50 15.15100       2
#> 3 6.850000 3.073684 5.742105 2.071053   38 23.87947       3

And a ridge regression with cross-fold validation example:

mtcars %>% 
  cv.glmnet(am ~ ., alpha = 0) %>% 
  glance()
#>   lambda.min lambda.1se
#> 1  0.2284167  0.8402035

So next time you want to do some tidy modelling, keep this pipeline in mind:

Limitations

Currently, the major limitation for this approach is that a model must be covered by twidlr and broom. For example, you can’t use randomForest in this pipeline because, although twidlr exposes a data.frame friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: blogR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.