
Machine learning using H2O


This post will be a quick introduction to using H2O through R. H2O is a platform for machine learning; it is distributed, meaning it can use all the cores in your computer, offering parallelisation out of the box. You can also hook it up to existing Hadoop or Spark clusters. It is also pitched as industrial-scale, able to cope with large amounts of data.

You can install H2O from CRAN in the usual way using install.packages(). Once you load the package you can initialise a cluster using the h2o.init() command.

require(h2o)
h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 17 minutes 
##     H2O cluster timezone:       Europe/London 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.30.0.1 
##     H2O cluster version age:    2 months and 10 days  
##     H2O cluster name:           H2O_started_from_R_victor.enciso_ird509 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.26 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.0 (2020-04-24)

You will get some details about your cluster, as shown above.

I’ve got a prepared data set that I can load in and start playing around with.

The dataset has 10,000 rows. Using H2O with such a small dataset might be overkill but I just want to illustrate the basics of how it works.

dim(mwTrainSet)

## [1] 10000    28

I preprocess the data using the recipes package as in my xgboost post.

library(recipes)

myRecipe <- recipe(outcome ~ ., data = mwTrainSet) %>%
  step_mutate(os = as.factor(os)) %>%
  step_mutate(ob = as.factor(ob)) %>%
  step_rm(id) %>%
  step_mutate(w50s = ifelse(ds <= 0.5, 'TRUE', 'FALSE')) %>%
  prep()

proc_mwTrainSet <- myRecipe %>% bake(mwTrainSet)
proc_mwTestSet  <- myRecipe %>% bake(mwTestSet)

Also, I get the names of the predictors in a character vector, which will be used as input when the model is constructed.

predictors <- setdiff(colnames(proc_mwTrainSet), c("outcome"))

The training and test sets need to be converted into H2O frames so they can be passed to the model.

train.h2o <- as.h2o(proc_mwTrainSet, destination_frame = "train.h2o")
test.h2o <- as.h2o(proc_mwTestSet, destination_frame = "test.h2o")

Actually, all the preprocessing can be done using H2O-specific commands rather than R commands. This becomes necessary as your dataset grows larger, since the work then happens inside the cluster rather than in your R session.
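For example, the factor conversions and the derived column above can be done directly on an H2OFrame. Here's a minimal sketch, assuming the same column names as in the recipe (note the flag ends up as a factor rather than the character column the recipe created):

raw.h2o <- as.h2o(mwTrainSet, destination_frame = "raw.h2o")
raw.h2o$os <- as.factor(raw.h2o$os)            # categorical encoding, in-cluster
raw.h2o$ob <- as.factor(raw.h2o$ob)
raw.h2o$w50s <- as.factor(raw.h2o$ds <= 0.5)   # derived flag column
raw.h2o <- raw.h2o[, setdiff(colnames(raw.h2o), "id")]  # drop the id column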

I’m going to fit a gradient boosted tree model to the dataset. Originally I wanted to use xgboost here, but I later discovered that H2O doesn’t support it on Windows. However, if you’re running Linux or OS X then you’re in luck. If you’re set on using it on Windows, one solution could be to create a Linux VM.

I specify the GBM model with some parameters I used when I trained the same dataset using xgboost, on the rationale that they should translate reasonably well. Note that I’m doing 5-fold cross-validation through the nfolds parameter, building 1000 trees, and setting a stopping parameter.

gbm <- h2o.gbm(x = predictors, y = "outcome", training_frame = train.h2o,
               ntrees = 1000, nfolds = 5, max_depth = 6, learn_rate = 0.01,
               min_rows = 5, col_sample_rate = 0.8, sample_rate = 0.75,
               stopping_rounds = 25, seed = 2020)
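Before scoring new data, it’s worth a quick look at the fitted model. Printing it shows a summary that includes the cross-validation metrics, and h2o.varimp_plot() (a standard h2o helper, not part of the original run) charts the variable importances:

print(gbm)            # model summary, including cross-validation metrics
h2o.varimp_plot(gbm)  # bar chart of variable importances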

When the cluster is initialised you also get access to a web-based UI called Flow, which you can open locally in a browser at http://localhost:54321/. In principle you can do all your analysis and build all your models directly in the UI, without interacting with R at all.

The UI is handy for getting a quick view of your model results without running any more commands.

Finally, we can feed new data into the model to get predictions.

pred <- h2o.predict(object = gbm, newdata = test.h2o)

kablify(head(pred, 5))

predict       Type1       Type2       Type3
Type2     0.0696576   0.9231076   0.0072348
Type2     0.0051987   0.9566815   0.0381198
Type2     0.0082406   0.9884921   0.0032673
Type2     0.0118451   0.9852316   0.0029233
Type2     0.1531306   0.8428315   0.0040379
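The output of h2o.predict() is itself an H2OFrame living in the cluster; to keep working with the predictions in plain R, convert it with as.data.frame():

pred_df <- as.data.frame(pred)  # pull the predictions back into an R data frame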

I don’t actually know the labels of my test set, but if I did, I could use the following to get the performance on the test set:

h2o.performance(model = gbm, newdata = test.h2o)
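In the meantime, the metrics from the 5-fold cross-validation are available without any test labels; h2o.performance() can report those instead via its xval argument:

h2o.performance(model = gbm, xval = TRUE)  # cross-validated performance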

Once all the work is done, we shut down the cluster:

h2o.shutdown()

## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)?
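If you’re running this non-interactively and want to skip the confirmation prompt, h2o.shutdown() accepts a prompt argument:

h2o.shutdown(prompt = FALSE)  # shut down without asking for confirmation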

That will do for now. This was a very light introduction to H2O, one more tool to be aware of if you work with machine learning.
