Custom errors for cross-validation using crossval::crossval_ml

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers.]

This post is about using custom error measures in crossval, a tool offering generic functions for the cross-validation of Statistical/Machine Learning models. More information about cross-validation of regression models using crossval can be found in two earlier posts. The default error measure for regression in crossval is the Root Mean Squared Error (RMSE). Here, I’ll show you how to obtain two other error measures: the Mean Absolute Percentage Error (MAPE) and the Mean Absolute Error (MAE).

The same principles can be extended to any other error measure of your choice.

Installation of crossval

Let’s start by installing crossval from GitHub, in the R console:

devtools::install_github("thierrymoudiki/crossval")

Cross-validation demo

A simulated dataset is used for this demo, with 100 examples and 5 explanatory variables:

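# dataset creation
set.seed(123)
n <- 100  # number of examples
p <- 5    # number of explanatory variables
X <- matrix(rnorm(n * p), n, p)  # explanatory variables
y <- rnorm(n)                    # response, simulated independently of X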

Define functions for calculating cross-validation error (MAPE and MAE):

# error measure 1: Mean Absolute Percentage Error - MAPE
eval_metric_mape <- function (preds, actual)
{
  res <- mean(abs(preds/actual-1))
  names(res) <- "MAPE"
  return(res)
}
# error measure 2: Mean Absolute Error - MAE
eval_metric_mae <- function (preds, actual)
{
  res <- mean(abs(preds - actual))
  names(res) <- "MAE"
  return(res)
}
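
As a quick sanity check, the two functions can be tried on a handful of made-up values before handing them to crossval. The snippet below repeats the definitions so that it runs on its own; the toy vectors are assumptions chosen for easy mental arithmetic and are not part of the demo dataset.

```r
# Same definitions as above, repeated so this snippet is self-contained
eval_metric_mape <- function(preds, actual) {
  res <- mean(abs(preds / actual - 1))
  names(res) <- "MAPE"
  res
}
eval_metric_mae <- function(preds, actual) {
  res <- mean(abs(preds - actual))
  names(res) <- "MAE"
  res
}

# Toy values, made up for illustration only
actual <- c(1, 2, 4)
preds  <- c(1.5, 1, 5)

mae_toy  <- eval_metric_mae(preds, actual)   # mean of 0.5, 1, 1
mape_toy <- eval_metric_mape(preds, actual)  # mean of 0.5, 0.5, 0.25
print(mae_toy)
print(mape_toy)

# MAPE divides by the actual values, so a response close to zero inflates it
mape_near_zero <- eval_metric_mape(preds = c(0.5, 2), actual = c(0.01, 2))
print(mape_near_zero)
```

The last call is worth keeping in mind for this demo: y is drawn from a standard normal distribution, so many responses are close to zero and MAPE values above 1 are to be expected.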

Linear model fitting, with RMSE, MAPE and MAE errors

X contains the explanatory variables. y is the response. k is the number of folds in k-fold cross-validation. repeats is the number of repeats of the k-fold cross-validation procedure.
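
To make the roles of k and repeats concrete, here is a rough base R sketch of repeated k-fold cross-validation for a linear model. It is an illustration written for this explanation, not crossval's internal code: the helper rmse, the fold assignment via sample, and the use of lm are all assumptions, so the numbers it prints will not match crossval's output.

```r
# Illustrative sketch only: repeated k-fold cross-validation in base R
set.seed(123)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
df <- data.frame(y = y, X)  # columns: y, X1, ..., X5

k <- 5        # number of folds
repeats <- 3  # number of times the k-fold procedure is repeated

# Hypothetical helper: root mean squared error
rmse <- function(preds, actual) sqrt(mean((preds - actual)^2))

# One row per fold, one column per repeat
errors <- matrix(NA_real_, nrow = k, ncol = repeats,
                 dimnames = list(paste0("fold_", seq_len(k)),
                                 paste0("repeat_", seq_len(repeats))))

for (r in seq_len(repeats)) {
  # Each repeat draws a new random assignment of the n rows to k folds
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    test_idx <- which(fold_id == f)
    fit   <- lm(y ~ ., data = df[-test_idx, ])        # train on k - 1 folds
    preds <- predict(fit, newdata = df[test_idx, ])   # predict the held-out fold
    errors[f, r] <- rmse(preds, df$y[test_idx])
  }
}

print(errors)
print(c(mean = mean(errors), sd = sd(as.vector(errors)),
        median = median(errors)))
```

Each cell of the resulting matrix is the error on one held-out fold, which is the same layout as the $folds element in the outputs of this post.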

With the default error measure (RMSE):

crossval::crossval_ml(x = X, y = y, k = 5, repeats = 3)

## 
  |=================================================================| 100%
##    user  system elapsed 
##   0.149   0.005   0.163

## $folds
##         repeat_1  repeat_2  repeat_3
## fold_1 0.8987732 0.9270326 0.7903096
## fold_2 0.8787553 0.8704522 1.2394063
## fold_3 1.0810407 0.7907543 1.3381991
## fold_4 1.0594537 1.1981031 0.7368007
## fold_5 0.7593157 0.8913229 0.7734180
## 
## $mean
## [1] 0.9488758
## 
## $sd
## [1] 0.1902999
## 
## $median
## [1] 0.8913229

With MAPE as the error measure:

crossval::crossval_ml(x = X, y = y, k = 5, repeats = 3,
                      eval_metric = eval_metric_mape)

## 
  |=================================================================| 100%
##    user  system elapsed 
##   0.117   0.003   0.127

## $folds
##        repeat_1  repeat_2  repeat_3
## fold_1 1.486233 0.9517148 1.1181554
## fold_2 1.382454 1.1669799 1.0954839
## fold_3 1.267862 1.0583498 1.7768124
## fold_4 1.110386 1.1569593 1.3466701
## fold_5 1.242622 1.6604326 0.9615794
## 
## $mean
## [1] 1.25218
## 
## $sd
## [1] 0.2411539
## 
## $median
## [1] 1.16698

With MAE as the error measure:

crossval::crossval_ml(x = X, y = y, k = 5, repeats = 3,
                      eval_metric = eval_metric_mae)

## 
  |=================================================================| 100%
##    user  system elapsed 
##   0.118   0.003   0.133

## $folds
##         repeat_1  repeat_2  repeat_3
## fold_1 0.7609698 0.6799802 0.6528781
## fold_2 0.7548409 0.7061494 0.9147533
## fold_3 0.8246641 0.5686014 1.0612401
## fold_4 0.7378648 0.9079500 0.5792025
## fold_5 0.6176459 0.7448324 0.6630864
## 
## $mean
## [1] 0.7449773
## 
## $sd
## [1] 0.1357212
## 
## $median
## [1] 0.7378648
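
The same pattern should carry over to other error measures: a function taking preds and actual and returning a single named number. As a sketch, here is a median absolute error. The name eval_metric_medae is hypothetical and chosen for this example only; the crossval call is guarded so that the snippet still runs when the package is not installed, and no output is shown because it has not been run for this post.

```r
# Hypothetical extra error measure: Median Absolute Error
eval_metric_medae <- function(preds, actual) {
  res <- median(abs(preds - actual))
  names(res) <- "MedAE"
  res
}

# Toy check on made-up values: absolute errors are 0.5, 1 and 1
medae_toy <- eval_metric_medae(preds = c(1.5, 1, 5), actual = c(1, 2, 4))
print(medae_toy)

# Same call as in the post, with the new measure plugged in
if (requireNamespace("crossval", quietly = TRUE)) {
  set.seed(123)
  n <- 100; p <- 5
  X <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)
  print(crossval::crossval_ml(x = X, y = y, k = 5, repeats = 3,
                              eval_metric = eval_metric_medae))
}
```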

Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!
