When estimating a model, the quality of the model fit will always be higher in-sample than out-of-sample. A model will always fit the data that it is trained on, warts and all, and may use those warts and statistical noise to make predictions. As a result, a model that performs very well on a data set may perform poorly when used more generally.
Since we often can’t fit a model, put it into use and then measure how well it performs (and probably wouldn’t want to even if we could), we need methods for estimating out-of-sample error using only the data we have.
The simplest method is to randomly divide the data into a training set and a test set. However, this provides only one test set. If you train a model, test it, then train a new model and re-test it on the same test set and see an improvement, that improvement may come from the model actually being better, or it may come from you over-fitting the test set.
One possible way to address this problem is to use something like k-fold cross validation. The general idea is to use some portion of your data (typically 90%) to fit a model and then test that model on the withheld portion (typically 10%) to assess the quality of the model fit. You repeat this process \(k\) times, so that each point has been used to estimate the model \(k-1\) times and been included in the test set exactly once.
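The bookkeeping behind this can be sketched in a couple of lines of base R (purely illustrative; the rest of the post uses modelr for this instead):

set.seed(1)   # for reproducibility
n <- 100      # number of observations (illustrative)
k <- 10       # number of folds
fold <- sample(rep(seq_len(k), length.out = n))
table(fold)   # each fold holds roughly n/k observations; fold i is the
              # test set on iteration i and part of the training set otherwise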
This can be done with a loop or a member of the apply family, but I’m going to use purrr to write functional code to fit the models and modelr to evaluate them.
My “preamble” loads the packages used below; a minimal version (assuming only the tidyverse and modelr need to be attached, since everything else is called via ::) looks like this:
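# minimal preamble (assumed); the post may have attached more packages
library(tidyverse)  # dplyr, purrr, tibble and friends
library(modelr)     # crossv_kfold() and rmse()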
Data and the Model
For this example, I’m going to use the cars dataset from the caret package. It looks like this:
data("cars", package = "caret") glimpse(cars) ## Observations: 804 ## Variables: 18 ## $ Price <dbl> 22661.05, 21725.01, 29142.71, 30731.94, 33358.77, ... ## $ Mileage <int> 20105, 13457, 31655, 22479, 17590, 23635, 17381, 2... ## $ Cylinder <int> 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,... ## $ Doors <int> 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4,... ## $ Cruise <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ Sound <int> 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,... ## $ Leather <int> 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,... ## $ Buick <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ Cadillac <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ Chevy <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ Pontiac <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ Saab <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ Saturn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ convertible <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ coupe <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ hatchback <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... ## $ sedan <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ wagon <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
We want to build a model to predict the price of the car. We have the mileage, number of cylinders, size, some features (cruise control, sound system, leather seats), the brand and the style of car. The data already has dummy variables constructed, but we need to drop one from each set in order to estimate the model (the dropped category becomes the reference level). We’ll drop the Chevy indicator (setting Chevy as the reference brand) and the sedan indicator (setting sedan as the reference style). We’ll also drop the Doors variable, as it is perfectly collinear with the coupe indicator and the styles likely contain more information than the number of doors.
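If you want to sanity-check how Doors lines up with the body-style dummies before dropping it, a quick cross-tabulation does the job (a sketch I’m adding here; output not shown):

cars %>%
  count(Doors, convertible, coupe, hatchback, sedan, wagon)  # inspect how Doors tracks body style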
Our final model will look something like this:
select(cars, -Doors, -Chevy, -sedan) %>%
  lm(Price ~ Mileage + Cylinder + Cruise + Sound + Leather + Buick + Cadillac +
       Pontiac + Saab + Saturn + convertible + coupe + hatchback + wagon,
     data = .)
The modelr package has a useful tool for making the cross-validation folds. crossv_kfold() will divide the data into \(k\) folds (10 by default) and return a tibble with the list-columns train (the training data for the fold), test (the test set for the fold) and .id (the fold number). For more details on the idea of list-columns, see the Many Models chapter in R for Data Science by Garrett Grolemund and Hadley Wickham.
cars_cv <- cars %>%
  select(-Doors, -Chevy, -sedan) %>%
  crossv_kfold(10)
cars_cv
## # A tibble: 10 × 3
##             train           test   .id
##            <list>         <list> <chr>
## 1  <S3: resample> <S3: resample>    01
## 2  <S3: resample> <S3: resample>    02
## 3  <S3: resample> <S3: resample>    03
## 4  <S3: resample> <S3: resample>    04
## 5  <S3: resample> <S3: resample>    05
## 6  <S3: resample> <S3: resample>    06
## 7  <S3: resample> <S3: resample>    07
## 8  <S3: resample> <S3: resample>    08
## 9  <S3: resample> <S3: resample>    09
## 10 <S3: resample> <S3: resample>    10
You might notice that instead of the elements being data.frames or tibbles, they are resample objects. resample objects are cool because they are lazy: instead of making 10 copies of cars, crossv_kfold() returned what are essentially pointers to the original data. So instead of storing the data 10 times, we store it once, plus a little bit for the pointers. Indeed, we can see that in the object sizes:
pryr::object_size(cars)
## 63.6 kB

pryr::object_size(cars_cv)
## 93.7 kB
Awesome, right?
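When you do need the actual rows, a resample can be materialized into an ordinary tibble with as_tibble(), which is exactly what the first mutate() in the next chunk does. For example (a quick sketch; the name first_train is mine):

first_train <- as_tibble(cars_cv$train[[1]])  # materialize the first fold's training set
dim(first_train)                              # roughly 90% of the 804 rows, 15 columns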
Now it is time to fit the model. It can all be done within the very tidy tidyverse (see what I did there?). The code below uses purrr::map() to fit the model for each one of the 10 folds and returns the RMSE on each test set.
cars_cv %>%
  mutate(train = map(train, as_tibble)) %>%
  mutate(model = map(train,
                     ~ lm(Price ~ Mileage + Cylinder + Cruise + Sound + Leather +
                            Buick + Cadillac + Pontiac + Saab + Saturn +
                            convertible + coupe + hatchback + wagon,
                          data = .))) %>%
  mutate(rmse = map2_dbl(model, test, rmse)) %>%
  select(.id, rmse)
## # A tibble: 10 × 2
##      .id     rmse
##    <chr>    <dbl>
## 1     01 2372.046
## 2     02 2844.509
## 3     03 3398.572
## 4     04 3165.065
## 5     05 2838.691
## 6     06 2675.153
## 7     07 3720.843
## 8     08 2692.567
## 9     09 2362.007
## 10    10 3016.648
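If you want a single cross-validated error estimate rather than ten fold-level numbers, you can summarise that column; a sketch, assuming the result of the chunk above has been assigned to cv_results (a name I’m introducing):

cv_results %>%
  summarise(mean_rmse = mean(rmse),  # average error across the 10 folds
            sd_rmse   = sd(rmse))    # spread across folds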
It is reasonably fast too.
ten_fold <- function(x) {
  x %>%
    mutate(model = map(train,
                       ~ lm(Price ~ Mileage + Cylinder + Cruise + Sound + Leather +
                              Buick + Cadillac + Pontiac + Saab + Saturn +
                              convertible + coupe + hatchback + wagon,
                            data = .))) %>%
    mutate(rmse = map2_dbl(model, test, rmse)) %>%
    select(.id, rmse)
}

microbenchmark::microbenchmark(ten_fold(cars_cv))
## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max neval
##  ten_fold(cars_cv) 45.47155 50.18167 61.01467 53.27679 60.67367 139.0201   100
microbenchmark puts it at a median time of about 50 ms. We can probably make it even faster, though: we don’t actually need to store the model, and we can do it all in one step with a function.
fitter <- function(training_data, testing_data) {
  lm(Price ~ Mileage + Cylinder + Cruise + Sound + Leather + Buick + Cadillac +
       Pontiac + Saab + Saturn + convertible + coupe + hatchback + wagon,
     data = training_data) %>%
    rmse(testing_data)
}

ten_fold_2 <- function(x) {
  x %>%
    mutate(rmse = map2_dbl(train, test, fitter)) %>%
    select(.id, rmse)
}

microbenchmark::microbenchmark(ten_fold_2(cars_cv))
## Unit: milliseconds
##                 expr     min       lq     mean   median       uq      max neval
##  ten_fold_2(cars_cv) 46.1202 49.16919 52.21734 51.25691 53.72752 100.7599   100
Hey, that saved 2 or 3 milliseconds on the median. Totally worth it. (If your model fit takes a long time, it is probably worth it).
It is possible to parallelize the model fitting using multidplyr as the backend for dplyr; however, multidplyr doesn’t seem to work well with the resample objects. The easiest workaround is to materialize the folds, storing roughly nine extra copies of the data so that the cross-validation parallelizes simply; but when dealing with health care administrative data I’m more likely to be short on memory than on CPU cycles. Alternatively, you could use do() and multidplyr to filter by fold ID, fit a model and compute the RMSE within each group, but that solution is much less simple.
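For what it’s worth, if the model fits were slow enough to justify parallelism, one simple alternative is to materialize the folds and fan the fitting out with parallel::mclapply (a sketch, not the multidplyr route described above; it accepts the memory trade-off):

library(parallel)

folds <- cars_cv %>%
  mutate(train = map(train, as_tibble),   # materialize each fold ...
         test  = map(test,  as_tibble))   # ... trading memory for simplicity

rmses <- mclapply(seq_len(nrow(folds)), function(i) {
  fitter(folds$train[[i]], folds$test[[i]])   # reuse fitter() from above
}, mc.cores = 2)                              # forking; use mc.cores = 1 on Windows

unlist(rmses)  # one RMSE per fold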
Packages like caret do provide powerful tools for cross-validation; however, I like the ability to keep my entire workflow smooth and within the tidyverse. I found that, in general, I prefer to use the modelr tools and do the model fitting the “old fashioned way.”