The aim of intubate (logo <||>) is to offer a painless way to add R functions that are non-pipe-aware to data science pipelines implemented by magrittr with the operator %>%, without having to rely on workarounds of varying complexity.


# install.packages("devtools")


dplyr, by Hadley Wickham, Romain Francois, and RStudio, is used here to illustrate data transformations.

Suppose you have the following code:


library(dplyr)
tmp <- filter(LifeCycleSavings, dpi >= 1000)
to_fit <- select(tmp, sr, pop15, pop75)

and you would like to avoid creating the temporary object tmp. One approach could be the following:

to_fit <- select(filter(LifeCycleSavings, dpi >= 1000), sr, pop15, pop75)

The problem with this approach is that, as the number of intermediate steps increases, the code becomes error prone and harder to understand.

Pipes in R are made possible by the package magrittr, by Stefan Milton Bache and Hadley Wickham. They provide an elegant alternative:


LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) -> to_fit

Pipelines seem to be a popular way, these days, of doing data science in R. If you need an introduction to pipelines, please see the chapter on data transformation of the forthcoming book “R for Data Science” by Garrett Grolemund and Hadley Wickham.

R statistical functions and pipelines

Suppose you want to perform a regression analysis of sr on pop15 and pop75 (assuming for the sake of argument that it is a valid analysis to perform).

As most R functions are not pipeline-aware, you would do something like the following:

fitted <- lm(sr ~ ., to_fit)
summary(fitted)
## Call:
## lm(formula = sr ~ ., data = to_fit)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5438 -2.1996  0.4071  2.2060  5.4754 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.5981     9.6146   4.015 0.000898 ***
## pop15        -0.6574     0.2481  -2.650 0.016843 *  
## pop75        -2.7315     1.2458  -2.193 0.042536 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.558 on 17 degrees of freedom
## Multiple R-squared:  0.3213,	Adjusted R-squared:  0.2415 
## F-statistic: 4.024 on 2 and 17 DF,  p-value: 0.03709

This is an absolutely correct approach.

But what if, in addition to the data transformation, you would also like to perform your data modeling/analysis under the same pipeline paradigm (by adding lm to it)? This would give consistent notation and would avoid the need to create the temporary object to_fit.

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>% 
  lm(sr ~ .) %>%               ## Adding lm to the pipeline
  summary()
## Error in as.data.frame.default(data): cannot coerce class ""formula"" to a data.frame

You get an error.

The reason for this failure is that pipeline-aware functions (such as the ones in dplyr, which were specifically designed to work in pipelines) receive the data as the first parameter. Most statistical procedures that use formulas to specify the model (as well as graphical functions such as the ones in package lattice) receive the data as the second parameter. This is the case for lm and for many other reliable functions that implement well-established statistical procedures.
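To make the mismatch concrete, here is a minimal sketch that uses only base R. It inspects the first two formal arguments of lm and then makes, by hand, the call that the pipe effectively builds (the piped data frame lands in the first position). The object names first_two, piped_call and direct_call are illustrative only.

```r
# lm() expects the formula first and the data second.
first_two <- names(formals(lm))[1:2]
print(first_two)

# x %>% lm(sr ~ pop15 + pop75) is evaluated as lm(x, sr ~ pop15 + pop75),
# so the data frame is matched to `formula` and the formula to `data`.
piped_call <- try(lm(LifeCycleSavings, sr ~ pop15 + pop75), silent = TRUE)
print(inherits(piped_call, "try-error"))

# Naming the arguments (or passing them in lm's own order) works.
direct_call <- lm(formula = sr ~ pop15 + pop75, data = LifeCycleSavings)
print(coef(direct_call))
```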

There are alternatives that allow you to include lm (and others) in the pipeline without errors and without intubate. They require workarounds of varying levels of complexity. Some of the possible approaches are illustrated in the post Workarounds to include R stat functions in data science pipelines.

If you choose intubate, it is because you do not want to bother with workarounds when working with pipelines that include statistical procedures or other non-pipe-aware functions.

By the way, intubate also implements three extensions for pipelines called intubOrders, intuEnv, and intuBags. These extensions will be covered in later posts.


## install.packages("intubate")
library(intubate)
LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  ntbt_lm(sr ~ pop15 + pop75) %>% 
  summary()
## Call:
## lm(formula = sr ~ pop15 + pop75)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5438 -2.1996  0.4071  2.2060  5.4754 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.5981     9.6146   4.015 0.000898 ***
## pop15        -0.6574     0.2481  -2.650 0.016843 *  
## pop75        -2.7315     1.2458  -2.193 0.042536 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.558 on 17 degrees of freedom
## Multiple R-squared:  0.3213,	Adjusted R-squared:  0.2415 
## F-statistic: 4.024 on 2 and 17 DF,  p-value: 0.03709

The pipeline now runs without error.

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  ntbt_lm(sr ~ pop15 + pop75) %>% 
  ntbt_plot(which = 1) %>%        ## Adding a residual plot
  summary()

## Call:
## lm(formula = sr ~ pop15 + pop75)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5438 -2.1996  0.4071  2.2060  5.4754 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.5981     9.6146   4.015 0.000898 ***
## pop15        -0.6574     0.2481  -2.650 0.016843 *  
## pop75        -2.7315     1.2458  -2.193 0.042536 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.558 on 17 degrees of freedom
## Multiple R-squared:  0.3213,	Adjusted R-squared:  0.2415 
## F-statistic: 4.024 on 2 and 17 DF,  p-value: 0.03709

(as plot returns NULL, intubate automatically forwards its input so summary receives the result of lm).
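The forwarding behavior described above can be pictured with a small base R sketch. The helper call_and_forward and the stand-in returns_null are hypothetical names invented for this illustration. This is not intubate's actual implementation, only one way such forwarding could be written.

```r
# Hypothetical helper (an assumption, not part of intubate):
# call `fun` on `input`; if the result is NULL, pass `input` along instead.
call_and_forward <- function(input, fun, ...) {
  result <- fun(input, ...)
  if (is.null(result)) input else result
}

fit <- lm(sr ~ pop15 + pop75, data = LifeCycleSavings)

# Stand-in for a function that is called for its side effect
# and returns NULL.
returns_null <- function(model) invisible(NULL)

# The fitted model is forwarded, so a later step can still use it.
forwarded <- call_and_forward(fit, returns_null)
print(identical(forwarded, fit))

# A function that returns a value is passed through unchanged.
coefs <- call_and_forward(fit, coef)
print(coefs)
```

With this kind of rule, a plotting step can sit in the middle of a pipeline without cutting off the steps that follow it.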

Calling non-pipe-aware functions “on the fly”

LifeCycleSavings %>% 
  ntbt(lm, sr ~ pop15 + pop75)
## Call:
## lm(formula = sr ~ pop15 + pop75)
## Coefficients:
## (Intercept)        pop15        pop75  
##     30.6277      -0.4708      -1.9341

Note: this approach is meant to work with any function, including those for which intubate provides no interface.

LifeCycleSavings %>%
  ntbt_plot(pop75, sr) %>%
  ntbt(lsfit, pop75, sr)        # Calling lsfit "on the fly" with ntbt

Creating interfaces “on demand”

ntbt_lsfit <- intubate

The only thing you need to remember is that the name of an interface must start with ntbt_ followed by the name of the interfaced function (lsfit in this particular case), no matter which function you want to interface.
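One way to see why the name matters is the following base R sketch, in which a single generic function works out which function to call from the name it was invoked under. Everything here is an assumption made for illustration: demo_interface is a hypothetical function, and the sketch uses a demo_ prefix instead of ntbt_ so that it does not overwrite a real interface. It is not intubate's actual code.

```r
# Hypothetical generic interface; NOT intubate's actual implementation.
demo_interface <- function(data, ...) {
  # Name under which this function was called, e.g. "demo_lsfit".
  called_as <- as.character(sys.call()[[1]])
  # Strip the prefix to find the function to delegate to.
  target <- get(sub("^demo_", "", called_as), mode = "function")
  # Capture the remaining arguments unevaluated ...
  args <- eval(substitute(alist(...)))
  # ... and evaluate the call with the columns of `data` in scope.
  eval(as.call(c(list(target), args)), data, parent.frame())
}

# Creating an "interface" is just an assignment under the right name.
demo_lsfit <- demo_interface

fit <- demo_lsfit(LifeCycleSavings, pop75, sr)
print(fit$coefficients)
```

The sketch only works when the function is called by name, which is the situation inside a pipeline.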

You can now use the newly created interface like any other provided by intubate:

LifeCycleSavings %>%
  ntbt_plot(pop75, sr) %>%
  ntbt_lsfit(pop75, sr)        # Using just created "on demand" interface

Just in case, let’s clarify that the intubate machinery does not perform any statistical computation. The interfaced functions (those that are already well tested) are the ones performing the computations.

Non-formula variants:

Some functions offer non-formula variants (or both variants). For example, including cor.test in a pipeline in either of its variants produces an error:

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  cor.test(pop15, pop75)   ## Non-formula variant
## Error in match.arg(alternative): object 'pop75' not found


LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  cor.test(~ pop15 + pop75)   ## Formula variant
## Error in cor.test.default(., ~pop15 + pop75): 'x' and 'y' must have the same length

Both variants work when using any of the approaches provided by intubate:

LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  ntbt_cor.test(pop15, pop75)   ## Non-formula variant
## 	Pearson's product-moment correlation
## data:  pop15 and pop75
## t = -2.4193, df = 18, p-value = 0.02636
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.76924958 -0.06766132
## sample estimates:
##        cor 
## -0.4953505


LifeCycleSavings %>% 
  filter(dpi >= 1000) %>% 
  select(sr, pop15, pop75) %>%
  ntbt(cor.test, ~ pop15 + pop75)   ## Formula variant
## 	Pearson's product-moment correlation
## data:  pop15 and pop75
## t = -2.4193, df = 18, p-value = 0.02636
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.76924958 -0.06766132
## sample estimates:
##        cor 
## -0.4953505
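For comparison, the two variants can also be run outside a pipeline with base R alone, which shows the computation that the interfaces delegate to. This is a sketch for illustration: subset stands in for the dplyr steps, and the names non_formula and formula_variant are made up here.

```r
# Base R equivalent of the filter() and select() steps.
to_fit <- subset(LifeCycleSavings, dpi >= 1000, select = c(sr, pop15, pop75))

# Non-formula variant: with() makes the columns visible by name.
non_formula <- with(to_fit, cor.test(pop15, pop75))

# Formula variant: the data frame goes in the `data` argument.
formula_variant <- cor.test(~ pop15 + pop75, data = to_fit)

# Both variants report the same estimated correlation.
print(non_formula$estimate)
print(formula_variant$estimate)
```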


Packages containing interfaces

Interfaces have been implemented for 88 R packages so far.

Bugs and Feature requests

The robustness and generality of the interfacing machinery still need to be further verified (and very likely improved), as there are thousands of potential functions to interface and some are bound to fail when interfaced. Some failures have already been addressed while implementing the provided interfaces (as their examples failed).

The goal is to make intubate progressively more robust by addressing the peculiarities of newly discovered failing functions.

For the time being, only cases where the interfaces provided with intubate fail will be considered as bugs.

Cases where user-defined interfaces fail, or where ntbt is used to call directly functions that have no interface in a released version of intubate, will be considered feature requests.

Of course, if you have some coding skills and can follow the code of the interface, it will be greatly appreciated if you provide a proposed solution (one that does not break anything else) together with the feature request.


Workarounds to include R stat functions in data science pipelines

