
An Introduction to H2O Using R

[This article was first published on R on orrymr.com, and kindly contributed to R-bloggers].

1. Introduction

In this post we discuss the H2O machine learning platform. We talk about what H2O is, and how to get started with it, using R – we create a Random Forest which we use to classify the Iris Dataset.

2. What is H2O?

The definition found on H2O’s Github page is a lot to take in, especially if you’re just starting out with H2O: “H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.”

We spend the rest of section 2 as well as section 3 discussing salient points of the above definition.

2.1 H2O is an In-Memory Platform

H2O is an “in-memory platform”. “In-memory” means that the data being used is loaded into main memory (RAM). Reading from main memory (also known as primary memory) is typically much faster than reading from secondary memory (such as a hard drive).

H2O is a “platform.” A platform is software which can be used to build something – in this case, machine learning models.

Putting this together, we now know that H2O is an in-memory environment for building machine learning models.

2.2 H2O is Distributed and Scalable

H2O can be run on a cluster. A Hadoop cluster is one example of an environment on which H2O can run.

H2O is said to be distributed because your objects can be spread across several nodes in your cluster. H2O does this by using a Distributed Key/Value store (DKV). You can read more about it here, but essentially it means that any object you create in H2O can be distributed across several nodes in the cluster.

The key-value part of DKV means that when you load data into H2O, you get back a key into a hashmap containing your (potentially distributed) object.
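
To see this in action (jumping ahead slightly to the R workflow of section 4): loading data into H2O returns a handle to a keyed object, and that key can be used to fetch the object again. A minimal sketch, where the variable and destination_frame names are purely illustrative:

library(h2o)
h2o.init()
# Loading data stores it in the DKV and hands back a key:
iris_preview <- as.h2o(iris, destination_frame = "iris_frame")
h2o.ls()                                # lists the keys currently held in the DKV
same_frame <- h2o.getFrame("iris_frame")  # fetch the same object by its key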

3. How H2O Runs Under the Hood

We spoke earlier about H2O being a platform. It’s important to distinguish between the R interface for H2O, and H2O itself. H2O can exist perfectly fine without R. H2O is just a .jar which can be run on its own. If you don’t know (or particularly care) what a .jar is – just think of it as Java code packaged with all the stuff you need in order to run it.

When you start H2O, you actually create a server which can respond to REST calls. Again, you don’t really need to know how REST works in order to use H2O. But if you do care, just know that you can use any HTTP client to speak with an H2O instance.

R is just a client interface for H2O. All the R functions you call when working with H2O actually call H2O through its REST API (a JSON POST request) under the hood. The Python H2O library, as well as the Flow UI, interface with H2O in a similar way. If this is all very confusing, just think of it like this: you use R to send commands to H2O. You could equally well use Flow or Python to send commands.
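
You can see this for yourself with any HTTP client. As a minimal sketch, here we use the httr package (not part of h2o, and assumed to be installed) to call /3/Cloud, H2O's cluster-status endpoint:

library(httr)
# Ask a running local H2O instance for its cluster status:
resp <- GET("http://localhost:54321/3/Cloud")
status_code(resp)  # 200 if the instance is up and responding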

4. Running An Example

4.1 Installing H2O

You can install H2O using R: install.packages("h2o"). If you’re having trouble with this, have a look here.

4.2 Starting H2O and Loading Data

First we’ll need to load the packages we’ll be using: h2o and datasets. We load the latter as we’ll be using the famous Iris Dataset, which is part of the datasets package.

library(datasets)
library(h2o)

The Iris Dataset contains sepal and petal measurements for three species of iris flowers.

Let’s load the iris dataset, and start up our H2O instance:

h2o.init(nthreads = -1) 
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (1 year, 7 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
data(iris)

By default, H2O starts up using 2 cores. By calling h2o.init(nthreads = -1), with nthreads = -1, we use all available cores.

Edit: it doesn’t default to two cores anymore (as per this tweet from H2O’s chief ML scientist):

Nice post! BTW, H2O in R no longer defaults to 2 cores, so you can just do h2o.init() now. 🙂

— Erin LeDell (@ledell) December 19, 2018

If h2o.init() was successful, you should have an instance of H2O running locally! You can verify this by navigating to http://localhost:54321. There, you should see the Flow UI.
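
You can also verify this from within R; h2o.clusterInfo() prints details about the running instance:

h2o.clusterInfo()  # version, number of nodes, total cores, allowed memory, etc.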

The iris dataset is now loaded into R. However, it’s not yet in H2O. Let’s go ahead and load the iris data into our H2O instance:

iris.hex <- as.h2o(iris)
h2o.ls()

h2o.ls() lists the dataframes you have loaded into H2O. Right now, you should see only one: iris.

Let’s start investigating this dataframe. We can get the summary statistics of the various columns:

h2o.describe(iris.hex)
##          Label Type Missing Zeros PosInf NegInf Min Max     Mean     Sigma
## 1 Sepal.Length real       0     0      0      0 4.3 7.9 5.843333 0.8280661
## 2  Sepal.Width real       0     0      0      0 2.0 4.4 3.057333 0.4358663
## 3 Petal.Length real       0     0      0      0 1.0 6.9 3.758000 1.7652982
## 4  Petal.Width real       0     0      0      0 0.1 2.5 1.199333 0.7622377
## 5      Species enum       0    50      0      0 0.0 2.0       NA        NA
##   Cardinality
## 1          NA
## 2          NA
## 3          NA
## 4          NA
## 5           3

We can also use H2O to plot histograms:

h2o.hist(iris.hex$Sepal.Length)

You can use familiar R syntax to modify your H2O dataframe:

iris.hex$foo <- iris.hex$Sepal.Length + 2

If we now run h2o.describe(iris.hex), we should see this extra variable:

h2o.describe(iris.hex)
##          Label Type Missing Zeros PosInf NegInf Min Max     Mean     Sigma
## 1 Sepal.Length real       0     0      0      0 4.3 7.9 5.843333 0.8280661
## 2  Sepal.Width real       0     0      0      0 2.0 4.4 3.057333 0.4358663
## 3 Petal.Length real       0     0      0      0 1.0 6.9 3.758000 1.7652982
## 4  Petal.Width real       0     0      0      0 0.1 2.5 1.199333 0.7622377
## 5      Species enum       0    50      0      0 0.0 2.0       NA        NA
## 6          foo real       0     0      0      0 6.3 9.9 7.843333 0.8280661
##   Cardinality
## 1          NA
## 2          NA
## 3          NA
## 4          NA
## 5           3
## 6          NA

(What I still don’t understand is why we don’t see this extra column from the Flow UI. If anyone knows, please let me know in the comments!)

But we don’t really need this nonsense column, so let’s get rid of it:

iris.hex$foo <- NULL

We can also get our dataframe back into R, from H2O:

r_df <- as.data.frame(iris.hex)
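
Note that this copies the (potentially distributed) data into R's memory, so it is best done on data that comfortably fits on one machine. Once back in R, the usual functions apply:

head(r_df)   # r_df is now a plain R data.frame
class(r_df)  # "data.frame"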

4.3 Building a Model

We’ve got our H2O instance up and running, with some data in it. Let’s go ahead and do some machine learning – let’s implement a Random Forest.

First off, we’ll split our data into a training set and a test set. I’m not going to explicitly set a validation set, as the algorithm will use the out-of-bag error instead.

splits <- h2o.splitFrame(data = iris.hex,
                         ratios = c(0.8),  #partition data into 80% and 20% chunks
                         seed = 198)

train <- splits[[1]]
test <- splits[[2]]

h2o.splitFrame() uses approximate splitting. That is, it won’t split the data into an exact 80%-20% split. Setting the seed allows us to create reproducible results.

We can use h2o.nrow() to check the number of rows in our train and test sets.

print(paste0("Number of rows in train set: ", h2o.nrow(train)))
## [1] "Number of rows in train set: 117"
print(paste0("Number of rows in test set: ", h2o.nrow(test)))
## [1] "Number of rows in test set: 33"

Next, let’s call h2o.randomForest() to create our model:

rf <- h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                       y = c("Species"),
                       training_frame = train,
                       model_id = "our.rf",
                       seed = 1234)

The parameters x and y specify our independent and dependent variables, respectively. training_frame specifies the training set, and model_id is the model's name within H2O (not to be confused with the variable rf in the above code – rf is the R handle, whereas our.rf is what H2O calls the model). seed is used for reproducibility.
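
h2o.randomForest() also accepts arguments that control the forest itself, such as ntrees (default 50) and max_depth (default 20). As a sketch (the values below are illustrative, not tuned), here's a variant; h2o.varimp() then extracts variable importances from the fitted model:

rf2 <- h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                        y = c("Species"),
                        training_frame = train,
                        ntrees = 100,    # grow more trees than the default 50
                        max_depth = 10,  # cap tree depth below the default 20
                        seed = 1234)
h2o.varimp(rf2)  # which predictors matter most?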

We can get the model details simply by printing out the model:

print(rf)
## Model Details:
## ==============
## 
## H2OMultinomialModel: drf
## Model ID:  our.rf 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                      150               18940         1
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         7    3.26000          2         12     5.41333
## 
## 
## H2OMultinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("RTMP_sid_b2ea_7")`
## MSE: (Extract with `h2o.mse`) 0.03286954
## RMSE: (Extract with `h2o.rmse`) 0.1812996
## Logloss: (Extract with `h2o.logloss`) 0.09793089
## Mean Per-Class Error: 0.0527027
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##            setosa versicolor virginica  Error      Rate
## setosa         40          0         0 0.0000 =  0 / 40
## versicolor      0         38         2 0.0500 =  2 / 40
## virginica       0          4        33 0.1081 =  4 / 37
## Totals         40         42        35 0.0513 = 6 / 117
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.948718
## 2 2  1.000000
## 3 3  1.000000

That seems pretty good. But let’s see how the model performs on the test set:

rf_perf1 <- h2o.performance(model = rf, newdata = test)
print(rf_perf1)
## H2OMultinomialMetrics: drf
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.05806405
## RMSE: (Extract with `h2o.rmse`) 0.2409648
## Logloss: (Extract with `h2o.logloss`) 0.1708688
## Mean Per-Class Error: 0.1102564
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##            setosa versicolor virginica  Error     Rate
## setosa         10          0         0 0.0000 = 0 / 10
## versicolor      0          9         1 0.1000 = 1 / 10
## virginica       0          3        10 0.2308 = 3 / 13
## Totals         10         12        11 0.1212 = 4 / 33
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.878788
## 2 2  1.000000
## 3 3  1.000000
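
Rather than printing the whole object, you can pull individual metrics out of the performance object with the accessors mentioned in the printout above:

h2o.mse(rf_perf1)              # 0.05806405
h2o.logloss(rf_perf1)          # 0.1708688
h2o.confusionMatrix(rf_perf1)  # the test-set confusion matrix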

Finally, let’s use our model to make some predictions:

predictions <- h2o.predict(rf, test)
print(predictions)
##   predict    setosa versicolor  virginica
## 1  setosa 0.9969698          0 0.00303019
## 2  setosa 0.9969698          0 0.00303019
## 3  setosa 0.9969698          0 0.00303019
## 4  setosa 0.9969698          0 0.00303019
## 5  setosa 0.9969698          0 0.00303019
## 6  setosa 0.9969698          0 0.00303019
## 
## [33 rows x 4 columns]
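
As a final sanity check, we can bring the predictions back into R and compute the raw accuracy ourselves; it should match the top-1 hit ratio reported above:

pred_df <- as.data.frame(predictions)
test_df <- as.data.frame(test)
mean(pred_df$predict == test_df$Species)  # ~0.879, matching the k = 1 hit ratio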

5. Conclusion

This post discussed what H2O is, and how to use it from R. The full code used in this post can be found here.
