Model calibration, in the context of this post, is about finding optimal hyperparameters for Statistical/Machine Learning (ML) models. Optimal in the sense that they optimize a given criterion: maximizing the model's accuracy or precision on unseen data, or minimizing its Root Mean Squared Error (RMSE), for example. What are ML models' hyperparameters? Let's take the example of a linear model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
Imagine that $y$ is a car's fuel consumption in Miles/(US) gallon, $x_1$ is its horsepower, and $x_2$ its number of cylinders. Knowing the values of $x_1$ and $x_2$, we would like to estimate the average value of $y$ for many different cars. $\beta_1$ and $\beta_2$ are unknown model parameters, typically estimated by minimizing the distance between the observed car's consumption $y$ and the model $\beta_0 + \beta_1 x_1 + \beta_2 x_2$. With such a model, we could obtain, for example:
$$\text{estimated fuel cons.} = 0.1 + 0.4 \times \text{horsepower} + 0.7 \times \text{no. of cylinders}$$
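In R, such coefficients are typically estimated by ordinary least squares with lm. Here is a quick sketch on the mtcars data used later in this post (the numbers displayed above are illustrative, not these actual estimates):

data(mtcars)
# estimate beta_0, beta_1 (horsepower) and beta_2 (cylinders) by least squares
fit <- lm(mpg ~ hp + cyl, data = mtcars)
coef(fit)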
Sometimes, when designing our linear model, we will want the unknown coefficients $\beta_1$ and $\beta_2$ to be bounded ($\beta_1$ and $\beta_2$ could otherwise exhibit a high variance). Or, we could want to consider a different polynomial degree $d$ for $x_1$ or $x_2$. Whereas $\beta_1$ and $\beta_2$ are model parameters, the polynomial degree $d$ on explanatory variables and the bound $s$ put on parameters $\beta_1$ and $\beta_2$ are model hyperparameters.
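As a hedged sketch of tuning the degree $d$ in R (using the mtcars dataset introduced below; the degrees tried and the use of horsepower alone are illustrative):

data(mtcars)
# a higher degree d gives a more flexible model; compare the in-sample fit as d grows
for (d in 1:4) {
  fit <- lm(mpg ~ poly(hp, degree = d), data = mtcars)
  cat("degree:", d, "- residual std. error:", sigma(fit), "\n")
}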
Hyperparameters are those parameters that you can tune in order to increase or decrease the model's performance. $d$ is a degree of freedom: it controls the model's flexibility. The higher $d$, the more flexible our model, meaning that it could almost fit "anything". $s$ is a regularization parameter that stabilizes model estimates. Increasing $d$ might lead to overfitting, and a lower $d$, to underfitting; overfitting and underfitting are about too much flexibility, or not enough. We'll use the mtcars dataset to illustrate these concepts. This dataset is available from the R console, as:
data(mtcars)
According to its description, mtcars is extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption (in Miles/(US) gallon) and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We'll use 5 of the 10 explanatory variables here.
mpg: Miles/(US) gallon        # this is y
cyl: Number of cylinders      # this is x_1
disp: Displacement (cu.in.)   # this is x_2
hp: Gross horsepower          # this is x_3
wt: Weight (1000 lbs)         # this is x_4
carb: Number of carburetors   # this is x_5
Below are the correlations between the target variable (to be explained), mpg, and the explanatory variables $x_1, \ldots, x_5$. We use the R package corrplot to plot these correlations.
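The exact plotting code isn't shown here; a minimal sketch could look like this (the method = "number" display choice is an assumption):

library(corrplot)
data(mtcars)
# correlation matrix of mpg and the 5 explanatory variables
df <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "carb")]
corrplot(cor(df), method = "number")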
All the explanatory variables are negatively correlated with the fuel consumption (in Miles/(US) gallon): a marginal increase in any of them leads, on average, to a decrease in fuel consumption. Now, in order to illustrate the concepts of overfitting and underfitting, we fit a linear model and a smoothing spline to mpg (consumption) and hp (horsepower).
On the left: model fitting on 23 cars, for a linear model and a spline. The linear model fits all the points parsimoniously, but the spline tries to memorize the patterns. On the right: errors obtained by each model on the 9 remaining cars, as a function of the spline’s degrees of freedom. That’s overfitting, illustrated. In other situations, a linear model can also fit (very) poorly, because it’s not flexible enough.
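A hedged sketch of this comparison (the 23/9 split follows the post; the random seed and the spline's degrees of freedom are illustrative):

data(mtcars)
set.seed(123)  # illustrative seed
train_idx <- sample(seq_len(nrow(mtcars)), size = 23)
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]
# parsimonious fit: linear model of consumption vs. horsepower
fit_lm <- lm(mpg ~ hp, data = train)
# flexible fit: smoothing spline with many degrees of freedom
fit_spline <- smooth.spline(x = train$hp, y = train$mpg, df = 10)
# errors on the 9 remaining cars
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(test$mpg, predict(fit_lm, newdata = test))
rmse(test$mpg, predict(fit_spline, x = test$hp)$y)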
So, how do we find a good compromise between overfitting and underfitting? One way to achieve it is to use a hold-out sample, as we did in the previous example, with 23 of the 32 cars in the training procedure, and 9 for testing. Another way is to use cross-validation. The idea of cross-validation is to divide the whole dataset into $k$ parts (usually called folds); each part is successively used as a testing set, while the remaining parts form the training set.
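As a quick, hedged sketch of the idea in base R (the fold assignment, the seed, and the simple lm model are purely illustrative):

data(mtcars)
set.seed(1)  # illustrative seed
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_errors <- sapply(1:k, function(j) {
  train <- mtcars[folds != j, ]  # k-1 folds for training
  test <- mtcars[folds == j, ]   # held-out fold for testing
  fit <- lm(mpg ~ hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})
mean(cv_errors)  # average RMSE across the k folds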
On this graph, we have $k=5$. crossval is a (work in progress) R package for doing just that. WHY did I implement it? Because R models are contributed by many different people, so you're not using a unified interface when training them. For example, in order to obtain predictions from 2 different models, you can have 2 different specifications of the function predict:
predict(fitting_obj_model1, newx)
or
predict(fitting_obj_model2, newdata)
fitting_obj_model* are the trained models 1 and 2; newx and newdata are the unseen data on which we would like to test the trained models. The position of arguments in function calls also matters a lot. The idea: use a common cross-validation interface for many different models. Hence, crossval. There is still room for improvement: if you find cases that are not covered by crossval, you can contribute them here. Currently, the package can be installed from GitHub as (in the R console):
library(devtools)
devtools::install_github("thierrymoudiki/crossval")
Here is an example of use of crossval applied to glmnet (with my old school R syntax, yeah, I like it!):
require(glmnet)
require(Matrix)

# load the dataset
data("mtcars")
df <- mtcars[, c(1, 2, 3, 4, 6, 11)]
summary(df)

# create response and explanatory variables
X <- as.matrix(df[, -1])
y <- df$mpg

# grid of model hyperparameters
tuning_grid <- expand.grid(alpha = c(0, 0.5, 1),
                           lambda = c(0.01, 0.1, 1))
n_params <- nrow(tuning_grid)

# list of cross-validation results
# - 5-fold cross-validation (`k`)
# - repeated 3 times (`repeats`)
# - cross-validation on 80% of the data (`p`)
# - validation on the remaining 20%
cv_results <- lapply(1:n_params, function(i)
  crossval::crossval(x = X, y = y,
                     k = 5, repeats = 3, p = 0.8,
                     fit_func = glmnet::glmnet,
                     predict_func = predict.glmnet,
                     packages = c("glmnet", "Matrix"),
                     fit_params = list(alpha = tuning_grid[i, "alpha"],
                                       lambda = tuning_grid[i, "lambda"])))
names(cv_results) <- paste0("params_set", 1:n_params)
print(cv_results)
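To select the best hyperparameters, you can then compare the cross-validated errors across the grid. The structure of each element of cv_results depends on the package version; the $mean field used below is an assumption, so check str() first:

# inspect the actual structure of one result (field names may differ)
str(cv_results[[1]])
# assuming each result exposes its mean cross-validated error as `$mean`:
cv_means <- sapply(cv_results, function(res) res$mean)
tuning_grid[which.min(cv_means), ]  # hyperparameters with the lowest mean error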
Many other examples of use of the package can be found in the README.
Also, R packages like caret or mlr do similar things, but with a different philosophy. You may want to try them out too.
Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!