Mastering the Many Models Approach
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Intro
The tidyverse “many models” approach was formally introduced in the first edition of R for Data Science (R4DS) in 2017. Since then, the tidyverse has evolved significantly, and along with it, the way we can harness the many models approach. This blog post aims to i) review the basic approach and update it using the latest tidyverse syntax, and ii) explore a range of use cases with increasing complexity while introducing new building blocks and helper functions.
The structure of this blog post also reflects my motivation for writing it. I think that this is a powerful approach that should be more widely known. Those who are actually using it, often rely on an older syntax, which makes things more complicated than necessary. In addition to the original building blocks, there are several lesser-known functions that help apply this approach to more complex use cases.
Lately, the tidyverse many models approach hasn’t received much attention. One might expect this to change with the coming release of the second edition of R4DS. However, the entire section on modeling has been omitted from this release. According to the authors, the reasons for this are twofold: First, there was never ample room to address the whole topic of modeling within R4DS. Second, the authors recommend the ‘tidymodels’ packages, which are well documented in Tidy Modeling with R which is filling the gap.
While ‘tidymodels’ is a strong framework with definite advantages when working with various algorithms and model engines, it comes with considerable conceptual and syntactic overhead. For this reason, I believe there is still a lot of room (and use cases) for the “classic” tidyverse many models approach, which is based on ‘dplyr’ syntax but utilizes base R, or alternatively package-specific, models.
But before we delve into the use cases, let’s begin with the setup.
Setup
Unlike the name suggests, we don’t need all of the ‘tidyverse’ packages for the tidyverse many models approach. The heavy lifting is done by ‘dplyr’ and ‘tidyr’. Additionally, we use ‘rlang’ and ‘purrr’ for some extra functionality. In this post we’ll be using the csat
and csatraw
data from my own package ‘dplyover’.
library(dplyr) # <- necessary
library(tidyr) # <- necessary
library(rlang) # <- nice to have
library(purrr) # <- nice to have
library(dplyover) # <- only for the data
csat
is a mock-up dataset resembling data from a customer survey. It comes in two forms: the labeled data, csat
, with meaningful column names and factor levels, and the corresponding raw data, csataw
, where each column is a survey item and responses are just numbers. Since our models will be using the numbers instead of the factor levels, we’ll use the data from csatraw
, but rename the columns according to csat
(see my earlier post on how to rename variables). Additionally, we’ll drop variables that we don’t need. Let’s take a glimpse
at the resulting dataset csat_named
:
# create a look-up vector of old and new names
lookup_vec <- set_names(names(csatraw), names(csat))
# rename the columns
csat_named <- csatraw |>
rename(any_of(lookup_vec)) |>
select(cust_id, type, product, csat,
ends_with("rating"))
glimpse(csat_named)
#> Rows: 150
#> Columns: 9
#> $ cust_id <chr> "61297", "07545", "03822", "88219", "31831", "63646", "…
#> $ type <chr> "existing", "existing", "existing", "reactivate", "new"…
#> $ product <chr> "advanced", "advanced", "premium", "basic", "basic", "b…
#> $ csat <dbl> 3, 2, 2, 4, 4, 3, 1, 3, 3, 2, 5, 4, 4, 1, 4, 4, 2, 3, 2…
#> $ postal_rating <dbl> 3, 2, 5, 5, 2, NA, 3, 3, 4, NA, NA, 4, 3, 4, 1, 5, NA, …
#> $ phone_rating <dbl> 2, 4, 5, 3, 5, 3, 4, 2, NA, 2, NA, NA, 2, 4, NA, 2, 4, …
#> $ email_rating <dbl> NA, 3, NA, NA, 5, 2, 3, 5, 3, 1, 3, 1, 3, 1, 3, 3, 4, 1…
#> $ website_rating <dbl> 5, 2, 3, 4, 1, 3, 3, 1, 3, 2, 4, 2, NA, 4, 1, 2, NA, 5,…
#> $ shop_rating <dbl> 3, 1, 2, 2, 5, 4, 4, 2, 2, 2, 4, 3, 2, NA, 3, 5, 4, 1, …
Every row is the response of a customer, cust_id
, who owns a contract-base product
available in different flavors: “basic”, “advanced” or “premium”. There are three different type
s of customers: “existing”, “new” and “reactivate”. Our dependent variable is the customer satisfaction score, csat
, which ranges from ‘1 = Very unsatisfied’ to ‘5 = Very satisfied’. The independent variables are ratings on the same scale concerning the following touchpoints: “postal”, “phone”, “email”, “website” and “shop”. We’ve dropped all other variables, but interested readers can find both datasets well documented in the ‘dplyover’ package (
?csat
,
?csatraw
).
Fundamentals
Let’s start with the basic approach as it was introduced in R4DS. Keep in mind that syntax and functions have evolved over the past five years, so we’ll be refining the original ideas into a more canonical form. There are four essential components we’ll be discussing:
- nested data
- rowwise operations
- tidy results
- unesting results
If you’re already familiar with these concepts, feel free to skip this section.
Nested data
The central idea of the many-models approach is to streamline the process of running models on various subsets of data. Let’s say we want to perform a linear regression on each product type. In a traditional base R approach, we might have used a for
loop to populate a list object with the results of each run. However, the tidyverse method begins with a nested data.frame
.
So, what is a nested data.frame? We can use dplyr::nest_by(product)
to create a data.frame
containing three rows, one for each product. The second column, data
, is a ‘list-column’ that holds a list of data.frame
's—one for each row. These data.frame
s contain data for all customers within the corresponding product category. If you’re unfamiliar with list-columns, I highly recommend reading chapter 25 of R4DS. Although some parts may be outdated, it remains an excellent resource for understanding the essential components of this approach.
csat_prod_nested <- csat_named |>
nest_by(product)
csat_prod_nested
#> # A tibble: 3 × 2
#> # Rowwise: product
#> product data
#> <chr> <list<tibble[,8]>>
#> 1 advanced [40 × 8]
#> 2 basic [60 × 8]
#> 3 premium [50 × 8]
Looking at the first element (row) of the data
column shows a data.frame
with 40 customers, all of whom have “advanced” products. The product
column is omitted, as this information is already included in our nested data: csat_prod_nested
.
csat_prod_nested$data[[1]]
#> # A tibble: 40 × 8
#> cust_id type csat postal_rating phone_rating email_r…¹ websi…² shop_…³
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 61297 existing 3 3 2 NA 5 3
#> 2 07545 existing 2 2 4 3 2 1
#> 3 63600 existing 1 3 4 3 3 4
#> 4 82048 reactivate 3 3 2 5 1 2
#> 5 41142 reactivate 3 4 NA 3 3 2
#> 6 06387 reactivate 1 4 4 1 4 NA
#> 7 63024 existing 2 NA 4 4 NA 4
#> 8 55743 new 1 4 5 4 1 5
#> 9 32689 new 5 3 NA 3 4 2
#> 10 33603 existing 2 3 3 4 3 2
#> # … with 30 more rows, and abbreviated variable names ¹email_rating,
#> # ²website_rating, ³shop_rating
Rowwise operations
Applying
nest_by()
also groups our data
rowwise()
. This means that subsequent dplyr operations will be applied “one row at a time.” This is particularly helpful when vectorized functions aren’t available, such as the
lm()
function in our case, which we want to apply to the data in each row.
First, let’s define the relationship between our dependent and independent variables using a formula object, my_formula
.
my_formula <- csat ~ postal_rating + phone_rating + email_rating +
website_rating + shop_rating
my_formula
#> csat ~ postal_rating + phone_rating + email_rating + website_rating +
#> shop_rating
Next, we use mutate
to create new columns. We’ll start by creating a column called mod
containing our model. We’ll apply the
lm()
function with the previously defined formula and supply the data
column to it. Since we are working with a rowwise
data.frame
, the
lm()
function is executed three times, one time for each row, each time using a different data.frame
of the list-column data
. As the result of each call is not an atomic vector but an lm
object of type list
, we need to wrap the function call in
list()
. This results in a new list-column, mod
, which holds an lm
object in each row.
csat_prod_nested |>
mutate(mod = list(lm(my_formula, data = data)))
#> # A tibble: 3 × 3
#> # Rowwise: product
#> product data mod
#> <chr> <list<tibble[,8]>> <list>
#> 1 advanced [40 × 8] <lm>
#> 2 basic [60 × 8] <lm>
#> 3 premium [50 × 8] <lm>
Tidy results with broom
To make the results of this model more accessible, we’ll use two functions from the ‘broom’ package:
broom::glance()
returns a data.frame
containing all model statics, such as r-squared, BIC, AIC etc., and the overall p-value of the model itself.
broom::tidy()
returns a data.frame
with all regression terms, their estimates, p-values and other statistics.
Again, we’ll wrap both functions in
list()
and call them on the model in the new mod
column. This yields a final, nested data.frame
. The rows represent the three product subgroups, while the columns contain the input data
, the model mod
, and the results modstat
and res
. Beside mod
, each of these columns is a list of data.frame
s:
csat_prod_nested_res <- csat_prod_nested |>
mutate(mod = list(lm(my_formula, data = data)),
modstat = list(broom::glance(mod)),
res = list(broom::tidy(mod)))
csat_prod_nested_res
#> # A tibble: 3 × 5
#> # Rowwise: product
#> product data mod modstat res
#> <chr> <list<tibble[,8]>> <list> <list> <list>
#> 1 advanced [40 × 8] <lm> <tibble [1 × 12]> <tibble [6 × 5]>
#> 2 basic [60 × 8] <lm> <tibble [1 × 12]> <tibble [6 × 5]>
#> 3 premium [50 × 8] <lm> <tibble [1 × 12]> <tibble [6 × 5]>
Nesting results
With the groundwork laid, it is now easy to access the results. To do this, we’ll use
tidyr::unnest()
to convert a list of data.frame
s back into a regular data.frame
. First, lets look at the model statistics. We’ll select the product
and modstat
columns and unnest
the latter. This produces a data.frame
with different model statistics for the three product subgroups. In this case, we’re interested in the r-squared, the p-value and the number of observations of each model:
csat_prod_nested_res |>
select(product, modstat) |>
unnest(modstat) |>
select(r.squared, p.value, nobs)
#> Adding missing grouping variables: `product`
#> # A tibble: 3 × 4
#> # Groups: product [3]
#> product r.squared p.value nobs
#> <chr> <dbl> <dbl> <int>
#> 1 advanced 0.112 0.941 15
#> 2 basic 0.382 0.192 20
#> 3 premium 0.185 0.645 21
Please note that the results themselves aren’t the main focus here, as the primary goal is to demonstrate how the approach works in general.
Next, we’ll inspect the coefficients, their size and their p-values. We’ll select the product
and res
columns and unnest
the latter. Additionally, since we’re not interested in the size of the intercept, we’ll filter out those rows.
csat_prod_nested_res |>
select(product, res) |>
unnest(res) |>
filter(term != "(Intercept)")
#> # A tibble: 15 × 6
#> # Groups: product [3]
#> product term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 advanced postal_rating 0.396 0.483 0.819 0.434
#> 2 advanced phone_rating -0.235 0.423 -0.556 0.592
#> 3 advanced email_rating 0.349 0.485 0.720 0.490
#> 4 advanced website_rating 0.398 0.456 0.874 0.405
#> 5 advanced shop_rating -0.00183 0.288 -0.00637 0.995
#> 6 basic postal_rating -0.324 0.225 -1.44 0.172
#> 7 basic phone_rating -0.297 0.237 -1.25 0.231
#> 8 basic email_rating -0.136 0.249 -0.547 0.593
#> 9 basic website_rating 0.229 0.229 1.00 0.334
#> 10 basic shop_rating 0.321 0.259 1.24 0.235
#> 11 premium postal_rating 0.0538 0.220 0.245 0.810
#> 12 premium phone_rating -0.508 0.307 -1.65 0.119
#> 13 premium email_rating 0.375 0.309 1.22 0.243
#> 14 premium website_rating -0.175 0.259 -0.677 0.509
#> 15 premium shop_rating 0.0498 0.203 0.246 0.809
From this point, we could further manipulate the resulting data, such as filtering out non-significant coefficients or plotting the model results, and so on.
To wrap up this section, the info box below highlights how the approach above deviates from the original syntax introduced in R4DS.
There are two main differences between the approach outlined above and the syntax which was originally introduced in R4DS.
First, instead of nest_by(product)
, the original syntax used group_by(product) %>% nest()
. Both produce a nested data.frame
. The later, however, returns a data.frame
grouped by “product”, while
nest_by()
returns a rowwise
data.frame
.
While this difference seems negligible, it at has implications on how operations on the nested data are carried out, especially, since rowwise
operations didn’t exist in 2017. The original approach was using
purrr::map()
and friends instead to apply unvectorized functions, such as
lm()
, to list-columns.
csat_named |>
group_by(product) |>
nest() |>
ungroup() |>
mutate(mod = map(data, ~ lm(my_formula, data = .x)),
res = map(mod, broom::tidy),
modstat = map(mod, broom::glance))
#> # A tibble: 3 × 5
#> product data mod res modstat
#> <chr> <list> <list> <list> <list>
#> 1 advanced <tibble [40 × 8]> <lm> <tibble [6 × 5]> <tibble [1 × 12]>
#> 2 premium <tibble [50 × 8]> <lm> <tibble [6 × 5]> <tibble [1 × 12]>
#> 3 basic <tibble [60 × 8]> <lm> <tibble [6 × 5]> <tibble [1 × 12]>
While this approach saves us from wrapping the output in
list()
, it leads to code cluttering especially with functions that take two or more arguments and which need to be wrapped in
pmap()
.
Extensions
Building on the basic approach outlined above, we’ll introduce five advanced building blocks that help to tackle more complex use cases.
- create an overall category with
bind_rows()
- add subgroups through filters with
expand_grid()
- dynamically name list elements with
rlang::list2()
- use data-less grids
- build formulas programmatically with
reformulate()
Create an overall category with ‘bind_rows()’
Often we want to run an analysis not only on different subsets of the data but also on the entire dataset simultaneously. We can achieve this by using
mutate()
and
bind_rows()
to create an additional overall product category that encompasses all products.
csat_all <- csat_named |>
mutate(product = "All") |>
bind_rows(csat_named)
csat_all |> count(product)
#> # A tibble: 4 × 2
#> product n
#> <chr> <int>
#> 1 All 150
#> 2 advanced 40
#> 3 basic 60
#> 4 premium 50
First, we use
mutate()
to overwrite the ‘product’ column with the value “All” effectively grouping all products together under this new label. We then use
bind_rows()
to merge the original csat_named
dataset with the modified dataset from the previous step. This results in a new dataset called csat_all
that contains the original data and an extra set of rows where the product category is labeled as “All”. Consequently, the new dataset is twice the size of the original data, as it includes each row twice.
Now we can apply the same analysis as above:
csat_all |>
nest_by(product) |>
mutate(mod = list(lm(my_formula, data = data)),
res = list(broom::tidy(mod)),
modstat = list(broom::glance(mod)))
#> # A tibble: 4 × 5
#> # Rowwise: product
#> product data mod res modstat
#> <chr> <list<tibble[,8]>> <list> <list> <list>
#> 1 All [150 × 8] <lm> <tibble [6 × 5]> <tibble [1 × 12]>
#> 2 advanced [40 × 8] <lm> <tibble [6 × 5]> <tibble [1 × 12]>
#> 3 basic [60 × 8] <lm> <tibble [6 × 5]> <tibble [1 × 12]>
#> 4 premium [50 × 8] <lm> <tibble [6 × 5]> <tibble [1 × 12]>
Add subgroups through filters with ‘expand_grid()’
Sometimes, we may want to create additional subgroups that meet specific filter criteria. For instance, we might want to analyze all customers and, at the same time, compare the results with an analysis of all customers who are not of the “reactivate” type.
To achieve this, we’ll follow three steps:
-
We create a list of filter expressions, referred to as
filter_ls
.filter_ls <- list( All = TRUE, no_reactivate = expr(type != "reactivate") ) filter_ls #> $All #> [1] TRUE #> #> $no_reactivate #> type != "reactivate"
This results in a list where each element is either
TRUE
or an unevaluated expression that we’ll use later insidefilter()
. Note that we userlang::expr()
to capture an expression, although we could have usedsubstitute()
orquote()
instead. Omitting any of these function would result in an error, as R would try to evaluatetype
which is not desired since we want to delay the evaluation until later. -
We expand our nested
data.frame
for each filter category usingexpand_grid()
.csat_all_grps <- csat_all |> nest_by(product) |> expand_grid(filter_ls) |> mutate(type = names(filter_ls), .after = product) csat_all_grps #> # A tibble: 8 × 4 #> product type data filter_ls #> <chr> <chr> <list<tibble[,8]>> <named list> #> 1 All All [150 × 8] <lgl [1]> #> 2 All no_reactivate [150 × 8] <language> #> 3 advanced All [40 × 8] <lgl [1]> #> 4 advanced no_reactivate [40 × 8] <language> #> 5 basic All [60 × 8] <lgl [1]> #> 6 basic no_reactivate [60 × 8] <language> #> 7 premium All [50 × 8] <lgl [1]> #> 8 premium no_reactivate [50 × 8] <language>
We use
tidyr::expand_grid()
to expand our nested data for each category in our list of filter expressions:filter_ls
. We also add a new column,type
, which shows the name of each element infilter_ls
. Looking at the output reveals that our original nesteddata.frame
contained four rows, while our data now holds eight rows - one for each combination ofproduct
andtype
. -
We apply each filter to our data
rowwise
usingdplyr::filter(eval(filter_ls))
.csat_all_grps_grid <- csat_all_grps |> rowwise() |> mutate(data = list( filter(data, eval(filter_ls)) ), .keep = "unused" ) csat_all_grps_grid #> # A tibble: 8 × 3 #> # Rowwise: #> product type data #> <chr> <chr> <list> #> 1 All All <tibble [150 × 8]> #> 2 All no_reactivate <tibble [120 × 8]> #> 3 advanced All <tibble [40 × 8]> #> 4 advanced no_reactivate <tibble [32 × 8]> #> 5 basic All <tibble [60 × 8]> #> 6 basic no_reactivate <tibble [46 × 8]> #> 7 premium All <tibble [50 × 8]> #> 8 premium no_reactivate <tibble [42 × 8]>
We use
mutate
to applyfilter()
to eachdata.frame
in thedata
column in each row. As filter expression we use the columnfilter_ls
and evaluate it. Since we no longer need this column, we set the.keep
argument inmutate
to “unused” to eventually dropfilter_ls
after it has been used.
From this point, we could continue applying our model and then calculating and extracting the results, but we’ll omit this for the sake of brevity.
csat_all_grps_grid <- csat_all_grps |>
rowwise() |>
mutate(mod = list(lm(my_formula, data = data)),
res = list(broom::tidy(mod)),
modstat = list(broom::glance(mod)))
csat_all_grps_grid |>
select(product, type, modstat) |>
unnest(modstat) |>
select(-c(sigma, statistic, df:df.residual))
#> # A tibble: 8 × 6
#> product type r.squared adj.r.squared p.value nobs
#> <chr> <chr> <dbl> <dbl> <dbl> <int>
#> 1 All All 0.134 0.0479 0.191 56
#> 2 All no_reactivate 0.134 0.0479 0.191 56
#> 3 advanced All 0.112 -0.381 0.941 15
#> 4 advanced no_reactivate 0.112 -0.381 0.941 15
#> 5 basic All 0.382 0.161 0.192 20
#> 6 basic no_reactivate 0.382 0.161 0.192 20
#> 7 premium All 0.185 -0.0867 0.645 21
#> 8 premium no_reactivate 0.185 -0.0867 0.645 21
csat_all_grps_grid |>
select(product, type, res) |>
unnest(res) |>
filter(term == "website_rating")
#> # A tibble: 8 × 7
#> product type term estimate std.error statistic p.value
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 All All website_rating 0.0729 0.141 0.517 0.607
#> 2 All no_reactivate website_rating 0.0729 0.141 0.517 0.607
#> 3 advanced All website_rating 0.398 0.456 0.874 0.405
#> 4 advanced no_reactivate website_rating 0.398 0.456 0.874 0.405
#> 5 basic All website_rating 0.229 0.229 1.00 0.334
#> 6 basic no_reactivate website_rating 0.229 0.229 1.00 0.334
#> 7 premium All website_rating -0.175 0.259 -0.677 0.509
#> 8 premium no_reactivate website_rating -0.175 0.259 -0.677 0.509
Although this example is relatively simple, it demonstrates how this approach can be significantly expanded by providing more filter expressions in our filter_ls
list or by using multiple lists of filter expressions. This is particularly useful when performing robustness checks, where we attempt to reproduce the original findings on specific subgroups of our data.
Dynamically name list elements with ‘rlang::list2()’
So far, we’ve wrapped the results of our rowwise
operations in
list()
when they produced non-atomic vectors.
A common issue when inspecting the results is that these list-columns are often unnamed, making it difficult to determine which element we’re examining. For instance, suppose that we want to double-check the output of our call to broom::glance(mod)
stored in the modstat
column. Let’s look at the fourth element:
csat_all_grps_grid$modstat[4]
#> [[1]]
#> # A tibble: 1 × 12
#> r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 0.112 -0.381 1.39 0.228 0.941 5 -22.4 58.9 63.8 17.5 9
#> # … with 1 more variable: nobs <int>, and abbreviated variable names
#> # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
The result prints nicely, but it’s unclear which subset of the data it belongs to.
Here
rlang::list2()
comes to the rescue. Although it resembles
list()
, it provides some extra functionality. Specifically, it allows us to unquote names on the right-hand side of the walrus operator. To better grasp this idea, let’s look at an example.
We wrap our calls to
lm()
, tidy()
and glance()
in
list2()
and name each element using the walrus operator :=
. On the right-hand side of the walrus operator, we use the glue operator {
within a string to dynamically name each element according to the values in the product
and type
columns in each row. When we inspect the fourth element of the modstat
column, we can quickly see that these model statistics belong to the subset of customers with an “advanced” product and who are not of type “reactivate”.
csat_all_grps_grid <- csat_all_grps |>
rowwise() |>
mutate(mod = list2("{product}_{type}" := lm(my_formula, data = data)),
res = list2("{product}_{type}" := broom::tidy(mod)),
modstat = list2("{product}_{type}" := broom::glance(mod)))
csat_all_grps_grid$modstat[4]
#> $advanced_no_reactivate
#> # A tibble: 1 × 12
#> r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 0.112 -0.381 1.39 0.228 0.941 5 -22.4 58.9 63.8 17.5 9
#> # … with 1 more variable: nobs <int>, and abbreviated variable names
#> # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
Data-less grids
Using the methods described above, we can easily construct nested data.frame
s with several dozen subgroups. However, this approach can be inefficient in terms of memory usage, as we create a copy of our data for every single subgroup. To make this approach more memory-efficient, we can use what I call a “data-less grid”, which is similar to our original nested data.frame
, but without the data column.
Instead of nesting our data with
nest_by()
, we manually create the combinations of subgroups to which we want to apply our model. We start with a vector of all unique values in the product
column and add an overall category “All” to it. Then, we supply this vector along with our list of filter expressions filter_ls
to
expand_grid()
. Finally, we place the names of the elements in filter_ls
in a separate column: type
.
This results in an initial grid all_grps_grid
of combinations between product
and type
, with an additional column of filter expressions.
product <- c(
"All", unique(csat_named$product)
)
all_grps_grid <- expand_grid(product, filter_ls) |>
mutate(type = names(filter_ls),
.after = product)
The challenging aspect here is generating each data subset on the fly in the call to
lm()
. To accomplish this, we
filter()
our initial data csat_named
on two conditions:
-
Firstly, we filter for different
product
types using an advanced filter expression:.env$product == "All" | .env$product == product
.This expression may appear somewhat obscure, so let’s break it down:
The issue here is that both our original data
csat_named
and our gridall_grps_grid
contain a column namedproduct
. By default,product
, in thefilter()
call below, refers to the column incsat_named
. To tell ‘dyplr’ to use the column in our gridall_grps_grid
we use the.env
pronoun.So, the filter expression above essentially states: If the product category in our grid
.env$product
is “All”, then select all rows. This works because when the left side of the or-condition.env$product == "All"
evaluates toTRUE
,filter
selects all rows. If the first part of our condition is not true, then theproduct
column incsat_named
should match the value of theproduct
column of our data-less grid.env$product
. -
Next, we filter the different
type
s of customers. Here, we use the filter expressions stored infilter_ls
and evaluate them.
all_grps_grid_mod <- all_grps_grid |>
rowwise() |>
mutate(mod = list(
lm(my_formula,
data = filter(csat_named,
# 1. filter product categories
.env$product == "All" | .env$product == product,
# 2. filter customer types
eval(filter_ls)
)
)
)
) |>
select(! filter_ls)
all_grps_grid_mod
#> # A tibble: 8 × 3
#> # Rowwise:
#> product type mod
#> <chr> <chr> <list>
#> 1 All All <lm>
#> 2 All no_reactivate <lm>
#> 3 advanced All <lm>
#> 4 advanced no_reactivate <lm>
#> 5 premium All <lm>
#> 6 premium no_reactivate <lm>
#> 7 basic All <lm>
#> 8 basic no_reactivate <lm>
This returns our initial grid, now extended by an additional column, mod
, containing the linear models.
The remaining steps do not significantly differ from our initial approach, so we will omit them for brevity.
all_grps_grid_res <- all_grps_grid_mod |>
mutate(res = list(broom::tidy(mod)),
modstat = list(broom::glance(mod)))
all_grps_grid_res |>
select(product, type, modstat) |>
unnest(modstat) |>
select(-c(sigma, statistic, df:df.residual))
#> # A tibble: 8 × 6
#> product type r.squared adj.r.squared p.value nobs
#> <chr> <chr> <dbl> <dbl> <dbl> <int>
#> 1 All All 0.134 0.0479 0.191 56
#> 2 All no_reactivate 0.180 0.0771 0.145 46
#> 3 advanced All 0.112 -0.381 0.941 15
#> 4 advanced no_reactivate 0.330 -0.340 0.772 11
#> 5 premium All 0.185 -0.0867 0.645 21
#> 6 premium no_reactivate 0.156 -0.196 0.810 18
#> 7 basic All 0.382 0.161 0.192 20
#> 8 basic no_reactivate 0.465 0.221 0.172 17
Build formulas programmatically with ‘reformulate’
The final building block that essentially completes the many models approach is actually a base R function:
reformulate()
.
I recently posted an #RStats meme on Twitter highlighting that
reformulate()
is one of the lesser-known base R functions, even among advanced users. The reactions to my post largely confirmed my impression.
Before applying it in the many models context, let’s have a look at what
reformulate()
does. Instead of manually creating a formula object by typing y ~ x1 + x2
, we can use
reformulate()
to generate a formula object based on character vectors.
Important is the order of the first two arguments. While we start writing a formula from the left-hand side y
,
reformulate()
takes as first argument the right-hand side.
form1 <- y ~ x1 + x2
form1
#> y ~ x1 + x2
form2 <- reformulate(termlabels = c("x1", "x2"),
response = "y")
form2
#> y ~ x1 + x2
identical(form1, form2)
#> [1] TRUE
How can we make use of
reformulate()
in the many models approach?
Let’s begin with a simple case and assume we want to construct a separate model for each independent variable, containing only our response variable and one independent variable at a time: csat ~ indepedent_variable
. And of course, we want to do this for all of our subgroups of the previous approach.
First, we need a character vector holding the names of our independent variables. With this vector, we can now expand our data-less grid from above. This results in a new grid with 40 rows (eight subgroups times five independent variables).
indep_vars <- c("postal_rating",
"phone_rating",
"email_rating",
"website_rating",
"shop_rating")
all_grps_grid_vars <- all_grps_grid |>
expand_grid(indep_vars)
all_grps_grid_vars
#> # A tibble: 40 × 4
#> product type filter_ls indep_vars
#> <chr> <chr> <named list> <chr>
#> 1 All All <lgl [1]> postal_rating
#> 2 All All <lgl [1]> phone_rating
#> 3 All All <lgl [1]> email_rating
#> 4 All All <lgl [1]> website_rating
#> 5 All All <lgl [1]> shop_rating
#> 6 All no_reactivate <language> postal_rating
#> 7 All no_reactivate <language> phone_rating
#> 8 All no_reactivate <language> email_rating
#> 9 All no_reactivate <language> website_rating
#> 10 All no_reactivate <language> shop_rating
#> # … with 30 more rows
We can now apply a similar approach as before, creating data subgroups on the fly. The only change is that we use reformulate(indep_vars, "csat")
instead of our formula object my_formula
. This adds forty different linear models to our grid:
all_grps_grid_vars_mod <- all_grps_grid_vars |>
rowwise() |>
mutate(mod = list(
lm(reformulate(indep_vars, "csat"), # <- this part is new
data = filter(csat_named,
.env$product == "All" | .env$product == product,
eval(filter_ls)
)
)
)
) %>%
select(! filter_ls)
all_grps_grid_vars_mod
#> # A tibble: 40 × 4
#> # Rowwise:
#> product type indep_vars mod
#> <chr> <chr> <chr> <list>
#> 1 All All postal_rating <lm>
#> 2 All All phone_rating <lm>
#> 3 All All email_rating <lm>
#> 4 All All website_rating <lm>
#> 5 All All shop_rating <lm>
#> 6 All no_reactivate postal_rating <lm>
#> 7 All no_reactivate phone_rating <lm>
#> 8 All no_reactivate email_rating <lm>
#> 9 All no_reactivate website_rating <lm>
#> 10 All no_reactivate shop_rating <lm>
#> # … with 30 more rows
Although the example above is instructive, it isn’t particularly useful. In most cases, we don’t want to create a separate model for each independent variable. A much more powerful way to use
reformulate()
is to
update()
a baseline model with additional variables.
Let’s say we have the following base-line model:
my_formula2 <- csat ~ postal_rating + phone_rating + shop_rating
my_formula2
#> csat ~ postal_rating + phone_rating + shop_rating
For our many subgroups from above, we want to check if adding email_rating
or website_rating
improves our model. Let’s create a list of terms that we want to add to our model: update_vars
. Note that we need to include NULL
, as this will represent our baseline model. Again, we expand our grid from above with this list and put the names of each variable (“base”, “email”, and “website”) in a separate column to keep track of which model we are examining.
update_vars <- list(base = NULL,
email = "email_rating",
website = "website_rating")
all_grid_upd_vars <- all_grps_grid |>
expand_grid(update_vars) |>
mutate(model_spec = names(update_vars),
.after = type)
all_grid_upd_vars
#> # A tibble: 24 × 5
#> product type model_spec filter_ls update_vars
#> <chr> <chr> <chr> <named list> <named list>
#> 1 All All base <lgl [1]> <NULL>
#> 2 All All email <lgl [1]> <chr [1]>
#> 3 All All website <lgl [1]> <chr [1]>
#> 4 All no_reactivate base <language> <NULL>
#> 5 All no_reactivate email <language> <chr [1]>
#> 6 All no_reactivate website <language> <chr [1]>
#> 7 advanced All base <lgl [1]> <NULL>
#> 8 advanced All email <lgl [1]> <chr [1]>
#> 9 advanced All website <lgl [1]> <chr [1]>
#> 10 advanced no_reactivate base <language> <NULL>
#> # … with 14 more rows
We could use
update()
directly in our call to
lm()
, but to avoid overcomplicating things, let’s create a column holding our updated formula: form
:
all_grid_upd_vars_form <- all_grid_upd_vars |>
rowwise() |>
mutate(form = list2( "{product}_{type}_{model_spec}" :=
update(my_formula2, # old formula
reformulate(c(".", update_vars))) # changes to formula
)
)
update()
takes two arguments, the formula we want to update, in this case my_formula2
, and the formula we use to update the former. In our case, this is a call to
reformulate()
which says: “take all the original term labels "."
, and add
c()
to them the variable in update_vars
. Now its probably clear why we included NULL
in update_vars
. In cases where it is NULL
the original formula won’t be updated, which corresponds to our baseline model.
Checking the first three rows of our list-column containing the formulas shows that the approach works as intended:
head(all_grid_upd_vars_form$form, 3)
#> $All_All_base
#> csat ~ postal_rating + phone_rating + shop_rating
#>
#> $All_All_email
#> csat ~ postal_rating + phone_rating + shop_rating + email_rating
#>
#> $All_All_website
#> csat ~ postal_rating + phone_rating + shop_rating + website_rating
Endgame
With the building blocks introduced above, we can now combine everything and extend this approach even further.
Let’s say we want to compare two different versions of our dependent variable. The original variable and a collapsed top-box version csat_top
taking 1
if a customer gave the best rating and 0
otherwise.
Again, we create a character vector holding the names of our dependent variables, dep_vars
, and use it to expand our data-less grid from above.
csat_named_top <- csat_named |>
mutate(csat_top = ifelse(csat == 5, 1, 0))
dep_vars <- c("csat", "csat_top")
all_grps_grid_final <- all_grid_upd_vars |>
expand_grid(dep_vars)
all_grps_grid_final
#> # A tibble: 48 × 6
#> product type model_spec filter_ls update_vars dep_vars
#> <chr> <chr> <chr> <named list> <named list> <chr>
#> 1 All All base <lgl [1]> <NULL> csat
#> 2 All All base <lgl [1]> <NULL> csat_top
#> 3 All All email <lgl [1]> <chr [1]> csat
#> 4 All All email <lgl [1]> <chr [1]> csat_top
#> 5 All All website <lgl [1]> <chr [1]> csat
#> 6 All All website <lgl [1]> <chr [1]> csat_top
#> 7 All no_reactivate base <language> <NULL> csat
#> 8 All no_reactivate base <language> <NULL> csat_top
#> 9 All no_reactivate email <language> <chr [1]> csat
#> 10 All no_reactivate email <language> <chr [1]> csat_top
#> # … with 38 more rows
Next, we update the formula, generate the data on the fly, calculate our model and prepare the results using ‘broom’. Finally, we drop columns that we don’t need anymore.
all_grps_grid_final_res <- all_grps_grid_final |>
rowwise() |>
mutate(
# dynamically name list
form = list2( "{product}_{type}_{model_spec}_{dep_vars}" :=
# update formula
update(my_formula2, # old formula
reformulate(c(".", update_vars), dep_vars)) # changes to formula
),
mod = list(
lm(form,
# create data on the fly
data = filter(csat_named_top,
.env$product == "All" | .env$product == product,
eval(filter_ls)
)
)
),
res = list(broom::tidy(mod)),
modstat = list(broom::glance(mod))
) |>
select(product:model_spec, dep_vars, mod:modstat)
all_grps_grid_final_res
#> # A tibble: 48 × 7
#> # Rowwise:
#> product type model_spec dep_vars mod res modstat
#> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 All All base csat <lm> <tibble [4 × 5]> <tibble>
#> 2 All All base csat_top <lm> <tibble [4 × 5]> <tibble>
#> 3 All All email csat <lm> <tibble [5 × 5]> <tibble>
#> 4 All All email csat_top <lm> <tibble [5 × 5]> <tibble>
#> 5 All All website csat <lm> <tibble [5 × 5]> <tibble>
#> 6 All All website csat_top <lm> <tibble [5 × 5]> <tibble>
#> 7 All no_reactivate base csat <lm> <tibble [4 × 5]> <tibble>
#> 8 All no_reactivate base csat_top <lm> <tibble [4 × 5]> <tibble>
#> 9 All no_reactivate email csat <lm> <tibble [5 × 5]> <tibble>
#> 10 All no_reactivate email csat_top <lm> <tibble [5 × 5]> <tibble>
#> # … with 38 more rows
Although we are not specifically interested in the results, below is one way to filter out those models that are statistically significant at the 10% level arranged by adjusted r-squared in descending order:
all_grps_grid_final_res |>
unnest(modstat) |>
select(-c(sigma, statistic, df:df.residual)) |>
filter(p.value < 0.1) |>
arrange(desc(adj.r.squared))
#> # A tibble: 4 × 10
#> product type model…¹ dep_v…² mod res r.squ…³ adj.r…⁴ p.value nobs
#> <chr> <chr> <chr> <chr> <lis> <list> <dbl> <dbl> <dbl> <int>
#> 1 advanced no_reac… base csat_t… <lm> <tibble> 0.423 0.279 0.0765 16
#> 2 advanced All base csat_t… <lm> <tibble> 0.349 0.234 0.0575 21
#> 3 All All email csat <lm> <tibble> 0.112 0.0575 0.0973 70
#> 4 All All base csat <lm> <tibble> 0.0884 0.0538 0.0613 83
#> # … with abbreviated variable names ¹model_spec, ²dep_vars, ³r.squared,
#> # ⁴adj.r.squared
Wrap-up
That’s it. This post started out easy and got quite complex in the end. There’s certainly room to refine this approach by encapsulating parts of those lengthy pipes in custom functions, but that’s a story for another time. For now, I hope you enjoyed this post. If you use other helper functions or have a better approach for calculating many models, I’d love to hear your feedback. Feel free to share your thoughts in the comments below or via Twitter, Mastodon, or GitHub.
Session Info
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os macOS Big Sur ... 10.16
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Berlin
#> date 2023-03-31
#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> broom 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)
#> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.0)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0)
#> downlit 0.4.2 2022-07-05 [1] CRAN (R 4.2.0)
#> dplyover * 0.0.8.9002 2022-10-15 [1] Github (timteafan/dplyover@f0cd984)
#> dplyr * 1.1.0 2023-01-29 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.19 2022-12-13 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0)
#> hugodownplus 0.0.0.9000 2023-02-19 [1] Github (timteafan/hugodownplus@d79c4c0)
#> knitr 1.41 2022-11-18 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> rlang * 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
#> rmarkdown 2.19 2022-12-15 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
#> stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
#> tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
#> tidyr * 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.36 2022-12-21 [1] CRAN (R 4.2.0)
#> yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.