Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
@drsimonj here to show you how to use pipelearner to easily grid-search hyperparameters for a model.
pipelearner is a package for making machine learning pipelines and is currently available to install from GitHub by running the following:
# install.packages("devtools")  # Run this if devtools isn't installed
devtools::install_github("drsimonj/pipelearner")
library(pipelearner)
In this post we’ll grid search hyperparameters of a decision tree (using the rpart package) predicting cars’ transmission type (automatic or manual) using the mtcars data set. Let’s load rpart along with tidyverse, which pipelearner is intended to work with:
library(tidyverse)
library(rpart)
The data
Quickly convert our outcome variable to a factor with proper labels:
d <- mtcars %>%
  mutate(am = factor(am, labels = c("automatic", "manual")))
head(d)
#>    mpg cyl disp  hp drat    wt  qsec vs        am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0    manual    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0    manual    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1    manual    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1 automatic    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0 automatic    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1 automatic    3    1
Default hyperparameters
We’ll first create a pipelearner object that uses the default hyperparameters of the decision tree.
pl <- d %>% pipelearner(rpart, am ~ .)
pl
#> $data
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs        am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <fctr> <dbl> <dbl>
#> 1   21.0     6 160.0   110  3.90 2.620 16.46     0    manual     4     4
#> 2   21.0     6 160.0   110  3.90 2.875 17.02     0    manual     4     4
#> 3   22.8     4 108.0    93  3.85 2.320 18.61     1    manual     4     1
#> 4   21.4     6 258.0   110  3.08 3.215 19.44     1 automatic     3     1
#> 5   18.7     8 360.0   175  3.15 3.440 17.02     0 automatic     3     2
#> 6   18.1     6 225.0   105  2.76 3.460 20.22     1 automatic     3     1
#> 7   14.3     8 360.0   245  3.21 3.570 15.84     0 automatic     3     4
#> 8   24.4     4 146.7    62  3.69 3.190 20.00     1 automatic     4     2
#> 9   22.8     4 140.8    95  3.92 3.150 22.90     1 automatic     4     2
#> 10  19.2     6 167.6   123  3.92 3.440 18.30     1 automatic     4     4
#> # ... with 22 more rows
#>
#> $cv_pairs
#> # A tibble: 1 × 3
#>            train           test   .id
#>           <list>         <list> <chr>
#> 1 <S3: resample> <S3: resample>     1
#>
#> $train_ps
#> [1] 1
#>
#> $models
#> # A tibble: 1 × 5
#>   target model     params     .f   .id
#>    <chr> <chr>     <list> <list> <chr>
#> 1     am rpart <list [1]>  <fun>     1
#>
#> attr(,"class")
#> [1] "pipelearner"
Fit the model with learn():
results <- pl %>% learn()
results
#> # A tibble: 1 × 9
#>   models.id cv_pairs.id train_p         fit target model     params
#>       <chr>       <chr>   <dbl>      <list>  <chr> <chr>     <list>
#> 1         1           1       1 <S3: rpart>     am rpart <list [1]>
#> # ... with 2 more variables: train <list>, test <list>
The fitted results include our single model. Let’s assess the model’s performance on the training and test sets:
# Function to compute accuracy
accuracy <- function(fit, data, target_var) {
  # Coerce `data` to data.frame (needed for resample objects)
  data <- as.data.frame(data)
  # Obtain predicted class
  predicted <- predict(fit, data, type = "class")
  # Return accuracy
  mean(predicted == data[[target_var]])
}

# Training accuracy
accuracy(results$fit[[1]], results$train[[1]], results$target[[1]])
#> [1] 0.92

# Test accuracy
accuracy(results$fit[[1]], results$test[[1]], results$target[[1]])
#> [1] 0.8571429
Looks like we’ve achieved 92% accuracy on the training data and 86% accuracy on the test data. Perhaps we can improve on this by tweaking the model’s hyperparameters.
Adding hyperparameters
When using pipelearner, you can add any arguments that the learning function will accept after we provide a formula. For example, run ?rpart and you’ll see that control options can be added. To see these options, run ?rpart.control.
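If you would rather inspect these options from code than from the help page, here is a minimal sketch using only rpart (no pipelearner). It assumes the same recoding of am as above; the object names `defaults` and `fit` are just for illustration.

```r
library(rpart)

# Same outcome recoding as in the post, done in base R
d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))

# rpart.control() with no arguments returns the default settings as a list
defaults <- rpart.control()
defaults$minsplit  # 20, as documented in ?rpart.control
names(defaults)    # all available control options

# Control options passed directly to rpart() are stored on the fitted object
fit <- rpart(am ~ ., data = d, minsplit = 5)
fit$control$minsplit
```

Because rpart() forwards these extra arguments to rpart.control(), anything listed in names(defaults) is a candidate hyperparameter to pass along.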
An obvious choice for decision trees is minsplit, which determines “the minimum number of observations that must exist in a node in order for a split to be attempted.” By default it’s set to 20. Given that we have such a small data set, this seems like a poor choice. We can adjust it as follows:
pl <- d %>% pipelearner(rpart, am ~ ., minsplit = 5)
results <- pl %>% learn()

# Training accuracy
accuracy(results$fit[[1]], results$train[[1]], results$target[[1]])
#> [1] 0.92

# Test accuracy
accuracy(results$fit[[1]], results$test[[1]], results$target[[1]])
#> [1] 0.8571429
Reducing minsplit will generally increase your training accuracy. Too small, however, and you’ll overfit the training data, resulting in poorer test accuracy.
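To see this effect outside of pipelearner, here is a rough sketch that fits rpart directly on all 32 rows for a few values of minsplit and reports training accuracy only. The helper `train_accuracy()` is hypothetical (not part of any package), and because there is no held-out set here, it says nothing about overfitting by itself.

```r
library(rpart)

# Same outcome recoding as in the post, done in base R
d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))

# Hypothetical helper: training accuracy of a tree fit on every row of `d`
train_accuracy <- function(minsplit) {
  fit <- rpart(am ~ ., data = d, minsplit = minsplit)
  mean(predict(fit, d, type = "class") == d$am)
}

# Try progressively smaller values of minsplit
minsplits <- c(20, 10, 2)
acc <- sapply(minsplits, train_accuracy)
data.frame(minsplit = minsplits, accuracy_train = acc)
```

Comparing these numbers against accuracy on a held-out test set, as pipelearner does, is what reveals whether a small minsplit is actually helping.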
Using vectors
All the model arguments you provide to pipelearner() can be vectors. pipelearner will then automatically expand those vectors into a grid and test all combinations. For example, let’s try out many values for minsplit:
pl <- d %>% pipelearner(rpart, am ~ ., minsplit = c(2, 4, 6, 8, 10))
results <- pl %>% learn()
results
#> # A tibble: 5 × 9
#>   models.id cv_pairs.id train_p         fit target model     params
#>       <chr>       <chr>   <dbl>      <list>  <chr> <chr>     <list>
#> 1         1           1       1 <S3: rpart>     am rpart <list [2]>
#> 2         2           1       1 <S3: rpart>     am rpart <list [2]>
#> 3         3           1       1 <S3: rpart>     am rpart <list [2]>
#> 4         4           1       1 <S3: rpart>     am rpart <list [2]>
#> 5         5           1       1 <S3: rpart>     am rpart <list [2]>
#> # ... with 2 more variables: train <list>, test <list>
Combining mutate from dplyr and map functions from the purrr package (all loaded with tidyverse), we can extract the relevant information for each value of minsplit:
results <- results %>%
  mutate(
    minsplit = map_dbl(params, "minsplit"),
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test, target), accuracy)
  )

results %>% select(minsplit, contains("accuracy"))
#> # A tibble: 5 × 3
#>   minsplit accuracy_train accuracy_test
#>      <dbl>          <dbl>         <dbl>
#> 1        2              1     0.5714286
#> 2        4              1     0.5714286
#> 3        6              1     0.5714286
#> 4        8              1     0.5714286
#> 5       10              1     0.5714286
This applies to as many hyperparameters as you care to add. For example, let’s grid search combinations of values for minsplit, maxdepth, and xval:
pl <- d %>%
  pipelearner(rpart, am ~ .,
              minsplit = c(2, 20),
              maxdepth = c(2, 5),
              xval     = c(5, 10))

pl %>%
  learn() %>%
  mutate(
    minsplit = map_dbl(params, "minsplit"),
    maxdepth = map_dbl(params, "maxdepth"),
    xval     = map_dbl(params, "xval"),
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test, target), accuracy)
  ) %>%
  select(minsplit, maxdepth, xval, contains("accuracy"))
#> # A tibble: 8 × 5
#>   minsplit maxdepth  xval accuracy_train accuracy_test
#>      <dbl>    <dbl> <dbl>          <dbl>         <dbl>
#> 1        2        2     5           1.00     0.8571429
#> 2       20        2     5           0.92     0.8571429
#> 3        2        5     5           1.00     0.8571429
#> 4       20        5     5           0.92     0.8571429
#> 5        2        2    10           1.00     0.8571429
#> 6       20        2    10           0.92     0.8571429
#> 7        2        5    10           1.00     0.8571429
#> 8       20        5    10           0.92     0.8571429
Not much variance in the accuracy, but it demonstrates how you can use this in your own work.
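If the idea of "expanding vectors into a grid" is new, base R's expand.grid() gives a feel for it. This is only an illustration of the concept using the three vectors from the example above; it is not a claim about how pipelearner builds its grid internally.

```r
# Every combination of the three hyperparameter vectors used above
grid <- expand.grid(
  minsplit = c(2, 20),
  maxdepth = c(2, 5),
  xval     = c(5, 10)
)

nrow(grid)  # 2 * 2 * 2 = 8 combinations, one per fitted model
grid
```

Keep in mind that the number of models is the product of the vector lengths, so adding values to several hyperparameters at once grows the search quickly.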
Using learn_models()
A bonus tip for those of you who are comfortable so far: you can use learn_models() to isolate multiple grid searches. For example:
pl <- d %>%
  pipelearner() %>%
  learn_models(rpart, am ~ ., minsplit = c(1, 2), maxdepth = c(4, 5)) %>%
  learn_models(rpart, am ~ ., minsplit = c(6, 7), maxdepth = c(1, 2))

pl %>%
  learn() %>%
  mutate(
    minsplit = map_dbl(params, "minsplit"),
    maxdepth = map_dbl(params, "maxdepth"),
    accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
    accuracy_test  = pmap_dbl(list(fit, test, target), accuracy)
  ) %>%
  select(minsplit, maxdepth, contains("accuracy"))
#> # A tibble: 8 × 4
#>   minsplit maxdepth accuracy_train accuracy_test
#>      <dbl>    <dbl>          <dbl>         <dbl>
#> 1        1        4           1.00     1.0000000
#> 2        2        4           1.00     1.0000000
#> 3        1        5           1.00     1.0000000
#> 4        2        5           1.00     1.0000000
#> 5        6        1           0.88     0.8571429
#> 6        7        1           0.88     0.8571429
#> 7        6        2           0.96     0.8571429
#> 8        7        2           0.96     0.8571429
Notice the separate grid searches for minsplit = c(1, 2), maxdepth = c(4, 5) and minsplit = c(6, 7), maxdepth = c(1, 2).
This is because grid search is applied separately for each model defined by a learn_models() call. This means you can separate various hyperparameter combinations if you want to.
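To make the difference concrete, here is a small base R sketch comparing two separate grids with a single combined one. Again, expand.grid() is only standing in for the idea; the names `separate` and `combined` are made up for this illustration.

```r
# Two isolated grids, mirroring the two learn_models() calls above
grid_a <- expand.grid(minsplit = c(1, 2), maxdepth = c(4, 5))
grid_b <- expand.grid(minsplit = c(6, 7), maxdepth = c(1, 2))
separate <- rbind(grid_a, grid_b)
nrow(separate)  # 4 + 4 = 8 models

# One grid over all the values together would be much larger
combined <- expand.grid(minsplit = c(1, 2, 6, 7), maxdepth = c(4, 5, 1, 2))
nrow(combined)  # 4 * 4 = 16 models
```

Splitting the search this way lets you skip combinations you already know are uninteresting, such as a tiny minsplit paired with a very shallow tree.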
Sign off
Thanks for reading and I hope this was useful for you.
For updates on recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.