Use tidymodels with weighted and unweighted data
Use case
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. The get started case study helps to take the first steps. Another helpful source is lesson 10 of an R tutorial from a data mining course at George Mason University.
Building on these basics, my next step is to apply frequency weights when estimating a linear regression model in the tidymodels way of coding. However, this blog post shows that this is a feature under development and therefore some of my first attempts to create a reproducible example failed.
The tidymodels how-to add case weights to a workflow gives some examples with code that helps to crack the case. Below I give the code for two reproducible examples, one example of model estimation without using weights and one with using weights.
Data and method
The models that I estimate are linear regression models with a set of predictors and one numeric outcome variable. The parameters of this model are estimated by ordinary least squares.
I use the car_prices data set for the examples and try to predict the car prices with the car brands as predictors. Note that, as a consequence, in my examples the outcome variable is non-negative and the predictors are mutually exclusive (0/1) dummy variables. This makes the examples easy to understand, but the code may apply to a wider range of variables nonetheless. I use Mileage as the weighting variable.
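To make the interpretation of such a model concrete, here is a minimal base R sketch on made-up data. The data frame toy and its columns are hypothetical and not part of car_prices. With mutually exclusive dummies and one brand left out as the reference, the OLS intercept equals the mean price of the reference brand, and each coefficient equals the difference between a brand's mean and the reference mean.

```r
# Hypothetical toy data: three brands, the third one (neither a nor b)
# acts as the reference category
toy <- data.frame(
  price = c(10, 12, 20, 24, 30, 36),
  a     = c(1, 1, 0, 0, 0, 0),
  b     = c(0, 0, 1, 1, 0, 0)
)

# OLS with dummies for a and b only; the reference brand is left out
fit_toy <- lm(price ~ 1 + a + b, data = toy)
coefs <- coef(fit_toy)

# Group means computed directly from the data
mean_a   <- mean(toy$price[toy$a == 1])
mean_b   <- mean(toy$price[toy$b == 1])
mean_ref <- mean(toy$price[toy$a == 0 & toy$b == 0])

# Intercept = reference mean; coefficients = differences from the reference
all.equal(unname(coefs["(Intercept)"]), mean_ref)
all.equal(unname(coefs["a"]), mean_a - mean_ref)
all.equal(unname(coefs["b"]), mean_b - mean_ref)
```

The same reading applies to the car brand coefficients in the examples that follow, with Saturn in the role of the reference brand.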
Let us start with loading the data into memory.
# Load the recipes, parsnip, workflows and hardhat packages, along with the rest of tidymodels
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.1      ✔ recipes      1.0.1
✔ dials        1.0.0      ✔ rsample      1.1.0
✔ dplyr        1.0.10     ✔ tibble       3.1.8
✔ ggplot2      3.3.6      ✔ tidyr        1.2.0
✔ infer        1.0.3      ✔ tune         1.0.0
✔ modeldata    1.0.0      ✔ workflows    1.0.0
✔ parsnip      1.0.1      ✔ workflowsets 1.0.0
✔ purrr        0.3.4      ✔ yardstick    1.0.0
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
Now we select only the relevant variables. Although the weights are not yet used in the first example, Mileage is already defined as the weighting variable in the data set.
# Create a data set with one non-negative continuous outcome and mutually exclusive dummy variables as predictors
db <- select(car_prices, Price, Buick:Saturn, Mileage) %>%
  mutate(Mileage = frequency_weights(Mileage))
str(db)
tibble [804 × 8] (S3: tbl_df/tbl/data.frame)
 $ Price   : num [1:804] 22661 21725 29143 30732 33359 ...
 $ Buick   : int [1:804] 1 0 0 0 0 0 0 0 0 0 ...
 $ Cadillac: int [1:804] 0 0 0 0 0 0 0 0 0 0 ...
 $ Chevy   : int [1:804] 0 1 0 0 0 0 0 0 0 0 ...
 $ Pontiac : int [1:804] 0 0 0 0 0 0 0 0 0 0 ...
 $ Saab    : int [1:804] 0 0 1 1 1 1 1 1 1 1 ...
 $ Saturn  : int [1:804] 0 0 0 0 0 0 0 0 0 0 ...
 $ Mileage : freq_wts [1:804] 20105, 13457, 31655, 22479, 17590, 23635, 17381, 2755...
Example 1: linear regression without weights
Now on with the first example. In the code below we define the recipe, define the model, and set its mode and engine. These are combined into a workflow. Afterwards we look at the properties of these objects to check whether they are as expected. Note that Saturn is the reference dummy variable of my choice: it is excluded from the regression, which in effect fixes its coefficient at zero.
# Get data ready for modeling with recipe package
recipe1 <- db %>%
  recipe(Price ~ 1 + Buick + Cadillac + Chevy + Pontiac + Saab) # add all dummy variables but one

# Define model, mode and engine with parsnip package
model1 <- linear_reg() %>%  # adds the basic model type
  set_engine('lm') %>%      # adds the computational engine to estimate the model parameters
  set_mode('regression')    # adds the modeling context in which it will be used

# Bundle pre-processing, modeling, and post-processing with workflow package
workflow1 <- workflow() %>%
  add_recipe(recipe1) %>%
  add_model(model1)

# View object properties
recipe1
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          5
model1
Linear Regression Model Specification (regression)

Computational engine: lm
workflow1
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm
Now that the objects look alright, the model estimation can be performed and the parameter estimates are printed.
# Now estimate the model via a single call to fit()
fit1 <- fit(workflow1, data = db)

# View fit1 properties
tidy(fit1)
# A tibble: 6 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   13979.      763.     18.3  7.62e- 63
2 Buick          6836.     1009.      6.77 2.46e- 11
3 Cadillac      26958.     1009.     26.7  9.16e-113
4 Chevy          2449.      832.      2.94 3.32e-  3
5 Pontiac        4433.      903.      4.91 1.10e-  6
6 Saab          15516.      943.     16.5  1.26e- 52
glance(fit1)
# A tibble: 1 × 12
  r.squared adj.r.s…¹ sigma stati…²   p.value    df logLik    AIC    BIC devia…³
      <dbl>     <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1     0.645     0.642 5911.    290. 1.53e-176     5 -8120. 16254. 16287. 2.79e10
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹adj.r.squared, ²statistic, ³deviance
Example 2: linear regression with weights
Then come the weights. The first thought is to update the current workflow with a line of code to make clear that weights should be used. However, this approach does not produce the desired result.
Therefore, an alternative approach is followed. Instead of building upon the blocks of the first example, we start with a new workflow() object and add an add_case_weights line of code to it. Next, one would expect a line of code with an add_recipe command, but for some reason this did not work after a “few” tries. Instead, we use add_formula with the regression formula as an argument. Lastly, and surprisingly conventionally, an add_model command is added.
workflow2 <- workflow() %>%
  add_case_weights(Mileage) %>%
  add_formula(Price ~ 1 + Buick + Cadillac + Chevy + Pontiac + Saab) %>%
  add_model(model1)
workflow2
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
Price ~ 1 + Buick + Cadillac + Chevy + Pontiac + Saab

── Case Weights ────────────────────────────────────────────────────────────────
Mileage

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm
Now the parameters are estimated with one line of code as follows.
fit2 <- fit(workflow2, db)

# View fit2 properties
tidy(fit2)
# A tibble: 6 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   13448.      694.     19.4  8.70e- 69
2 Buick          7006.      918.      7.63 6.52e- 14
3 Cadillac      26152.      933.     28.0  7.96e-121
4 Chevy          2452.      759.      3.23 1.28e-  3
5 Pontiac        4647.      828.      5.61 2.73e-  8
6 Saab          15349.      853.     18.0  5.72e- 61
glance(fit2)
# A tibble: 1 × 12
  r.squared adj.r.…¹  sigma stati…²   p.value    df logLik    AIC    BIC devia…³
      <dbl>    <dbl>  <dbl>   <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1     0.664    0.661 7.67e5    315. 5.90e-186     5 -8111. 16237. 16269. 4.70e14
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹adj.r.squared, ²statistic, ³deviance
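To see what the weights do to the estimates, the following base R sketch may help. It uses made-up data (the data frame toy and the weight column w are hypothetical) and calls lm() directly, on the assumption that the lm engine hands the case weights to the weights argument of lm(). With integer weights, the weighted coefficients equal the OLS coefficients on a data set in which each row is repeated as often as its weight says. The residual degrees of freedom are not the same, however, so standard errors from the two routes differ.

```r
# Hypothetical toy data with integer frequency weights w
toy <- data.frame(
  price = c(10, 14, 20, 26, 30, 40),
  a     = c(1, 1, 0, 0, 0, 0),
  b     = c(0, 0, 1, 1, 0, 0),
  w     = c(1L, 3L, 2L, 2L, 4L, 1L)
)

# Weighted least squares through the weights argument of lm()
fit_w <- lm(price ~ 1 + a + b, data = toy, weights = w)

# OLS on an expanded data set in which row i is repeated w[i] times
expanded <- toy[rep(seq_len(nrow(toy)), times = toy$w), ]
fit_rep <- lm(price ~ 1 + a + b, data = expanded)

# The point estimates coincide
all.equal(coef(fit_w), coef(fit_rep))

# The intercept is the weighted mean price of the reference group
ref <- toy$a == 0 & toy$b == 0
all.equal(unname(coef(fit_w)["(Intercept)"]),
          weighted.mean(toy$price[ref], toy$w[ref]))

# The residual degrees of freedom differ between the two fits
df.residual(fit_w)    # number of rows minus number of coefficients
df.residual(fit_rep)  # sum of the weights minus number of coefficients
```

If this assumption about the engine holds, the coefficients of the weighted example can be read as mileage-weighted group means and differences, while the reported standard errors should be interpreted with some care.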
This is a nice first try! With the two examples above it is possible to experiment further in the hope of finding alternative or shorter routes to the estimation results. In the meantime, we wait for tidymodels to include weights in the relevant packages. If you are inspired by these two examples (or not) and have some new ideas for progress, do not hesitate to give feedback to the Tidyverse developers.
Happy coding!