H2O.ai: Going for a paddle


Owen Jones, Placement Student

A quick disclaimer: This post isn’t called H2O.ai: Going for the 100m freestyle world record. I’m not trying to win a Kaggle competition. I’m not carrying out detailed, highly-controlled benchmarking tests. I’m not, in fact, claiming to be doing anything particularly useful at all. This is just me, just playing around with some code, just for the fun of it.

H-to-O dot what?

“I’ve never even heard of that,” you might be thinking. Well, don’t worry – neither had I until very recently. Let’s start by having a look at their website.

H2O is the world’s leading open source deep learning platform. H2O is used by over 100,000 data scientists and more than 10,000 organizations around the world. – www.h2o.ai

In other words, H2O – which has been around since way back in 2011 – is a widely-used platform for carrying out machine learning and data analysis, optimised for working with large datasets.

The primary focus really is on the machine learning. There is a wide selection of algorithms to choose from, ranging from random forests and neural networks to PCA and Word2vec (a full list can be found in the documentation).

H2O is a standalone platform in its own right, although under the hood it is largely written in Java. This makes it incredibly flexible: you can use it with R or Python, link into Spark and Hadoop, or use the built-in “Flow” interface. I’ll cover some of this functionality in the rest of this post.

Demo: Random Forest

Let’s test out a couple of H2O’s algorithms. To start with, we’ll use a random forest for a classification task.

I’m going to use the “Covertype” dataset from the UCI ML datasets archive. This dataset contains nearly 600,000 records representing different locations in the USA. The challenge is to predict tree cover type (spruce/fir, pine, aspen etc.) using only cartographic variables (elevation, slope, soil type etc.).

So effectively, the plan is to use a random forest to classify… yep, random forests.

Data import

Without further ado, let’s get the data into R and have a look at it. Note: for now, I’m going to be using the R interface to H2O, but later we’ll try out the Flow interface too.

# Load some packages for data manipulation
library(dplyr)
library(readr)

# Data from http://archive.ics.uci.edu/ml/datasets/Covertype
covtype <- read_csv("~/H2O_exploration/covtype.csv", col_names = FALSE) %>%
    # Convert some columns to factor
    mutate_at(11:55, funs(factor(.)))

# Have a look at what we've got
str(covtype)
## Classes 'tbl_df', 'tbl' and 'data.frame':    581012 obs. of  55 variables:
##  $ X1 : int  2596 2590 2804 2785 2595 2579 2606 2605 2617 2612 ...
##  $ X2 : int  51 56 139 155 45 132 45 49 45 59 ...
##  [... more columns...]
##  $ X10: int  6279 6225 6121 6211 6172 6031 6256 6228 6244 6230 ...
##  $ X11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  [... loads more columns...]
##  $ X54: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ X55: Factor w/ 7 levels "1","2","3","4",..: 5 5 2 2 5 2 5 5 5 5 ...

An explanation of the dataset is available on the download page. The CSV file itself doesn’t contain any column names, and the ones R creates for us aren’t particularly helpful; but actually, it won’t really matter. We only need to take note of a couple of things:

  • The first 10 columns are numeric variables representing various geographical aspects of the location in question. Average yearly rainfall, degree of slope, number of minutes of sunlight between 3 and 4 o’clock on the second Wednesday of August during a leap year, and so on.
  • Columns 11-14 are binary categorical variables signalling whether the forest is within one of four “Wilderness Areas”. So if the forest is in one of these areas, then the column representing that area will contain a 1; otherwise each column will contain a 0.
  • Columns 15-54 are binary categorical variables representing soil type: i.e. Column 15 stores the answer to “is this area Soil Type 1?”, where the answers “yes” or “no” are represented by 1 or 0 respectively. So each record has a 1 in precisely one of these columns, and 0 in all the others (see the quick check after this list).
  • Column 55 is the important one: it’s the label indicating what the cover type actually is. This is the target variable which we’re going to try to predict! There are 7 different possible labels, representing spruce/fir, aspen, Ponderosa pine etc.
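
Since columns 15-54 form a one-hot encoding, a quick sanity check - a hypothetical snippet, not part of the original analysis - is that those columns sum to exactly 1 in every row:

# Every record should have precisely one soil type flagged
covtype %>%
    select(15:54) %>%
    mutate_all(funs(as.numeric(as.character(.)))) %>%
    rowSums() %>%
    table()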

We converted columns 11-55 to factors when we imported the data, so we can crack on with some machine learning right away.

The h2o R package

H2O can be downloaded on its own, but if you’re using it with R it’s easier to install the h2o package (which is available on CRAN). You’ll also need to have Java installed.
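
If you don’t have the package yet, installation is the usual one-liner:

install.packages("h2o")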

Then you can go ahead and load the package into R; before you can do anything else, you’ll need to get an H2O cluster up and running using h2o.init().

library(h2o)
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 54 minutes 
##     H2O cluster version:        3.10.5.3 
##     H2O cluster version age:    1 month and 15 days  
##     H2O cluster name:           H2O_started_from_R_ojones_mlh028 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.50 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)

This shows us some information about the session we’ve just created. Notice the cluster is running on a Java Virtual Machine (JVM), and in my case I’m running on 1 node (my laptop) and 4 cores. If you’re lucky enough to have a larger setup then you can edit these settings to suit you.
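
For example, h2o.init accepts resource arguments - a sketch, to be adjusted to whatever your machine can spare:

# Request more threads and memory for the cluster
h2o.init(nthreads = 8, max_mem_size = "16G")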

The only function we’ve used so far gives a good indication of how the package works: the general form is h2o.*(...), where more often than not the * is replaced by exactly what you would guess. First we want to initialise a cluster, so we use h2o.init. The function names tend to be quite intuitive, so after you’ve seen them once or twice they’re very easy to remember.

On with the show! Recall that our covtype dataset is currently sitting happily in an R dataframe. However, we’re only accessing our H2O cluster through R (remember it’s running behind the scenes on a JVM), so we need to pass the data on to the cluster in a format it can work with.

Fortunately this is very easy: as.h2o converts a dataframe to an “H2OFrame”. In the R session, it’s stored as an environment, but actually this is just a pointer to the data structure which has been created by H2O.

wet <- as.h2o(covtype)

We’re going to want to assess the performance of our random forest once we’ve trained it, so we’ll split our data into training and test sets using h2o.splitFrame - intuitive names, remember?

wet_split <- h2o.splitFrame(wet, ratios = 0.75, seed = 820)

The observations are randomly divided into sets according to the ratios specified. Our split is 75:25, but ratios can be a vector too: if, for example, you wanted to create a cross-validation set as well, then you could specify ratios = c(0.6, 0.2) to get a 60:20:20 split.
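
If we did want that extra set, the call would look something like this (a sketch - for this post we’ll stick with the 75:25 split):

splits <- h2o.splitFrame(wet, ratios = c(0.6, 0.2), seed = 820)
train_set <- splits[[1]]
valid_set <- splits[[2]]
test_set  <- splits[[3]]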

Now we can train our classifier. We want a random forest, so we use - you guessed it - h2o.randomForest. Remember that the label for each observation is in the 55th column, so we set the target variable y to 55. Our training dataframe is the first set from our split data, and then you can include a whole load of other optional parameters. I’m not going to go into huge detail here; I’m just going to set the number of trees to train, and the number of folds for cross-validation. But if you’re super keen on making sure your trees are balanced or on fiddling around with per-class sample rates, then rest assured that you can do so. If you don’t think balancing trees will do your back any good then fear not: H2O does a pretty good job of using sensible defaults for any parameters you don’t explicitly include.

h2o_rf <- h2o.randomForest(y = 55, training_frame = wet_split[[1]],
                           ntrees = 1, nfolds = 2)

You might argue that one tree a forest doth not make, and I would agree with you. This is just a demonstration of the syntax - we’ll make a proper forest in a minute!

Once the model has finished training, we can calculate some useful performance metrics using h2o.performance - I’m passing in the model we’ve just trained, and the second part of our dataset (which is our test set).

h2o.performance(h2o_rf, wet_split[[2]])
## H2OMultinomialMetrics: drf
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.1253515
## RMSE: (Extract with `h2o.rmse`) 0.3540501
## Logloss: (Extract with `h2o.logloss`) 1.204014
## Mean Per-Class Error: 0.256692
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##            1     2    3   4    5    6    7  Error               Rate
## 1      42484  9651   12   1   98   37  442 0.1942 =  10,241 / 52,725
## 2       5576 64313  457   7  266  268   96 0.0940 =   6,670 / 70,983
## 3        167   526 7609  93   13  469    2 0.1430 =    1,270 / 8,879
## 4         27    34  129 463    1   59    0 0.3506 =        250 / 713
## 5        146  1144   44   0 1030    2    3 0.5652 =    1,339 / 2,369
## 6        114   634  584  40    6 3006    7 0.3154 =    1,385 / 4,391
## 7        557   105    4   1    1   15 4401 0.1343 =      683 / 5,084
## Totals 49071 76407 8839 605 1415 3856 4951 0.1505 = 21,838 / 145,144
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-7 Hit Ratios: 
##   k hit_ratio
## 1 1  0.849542
## 2 2  0.969492
## 3 3  0.978669
## 4 4  0.979165
## 5 5  0.979200
## 6 6  0.979200
## 7 7  1.000000

This gives us numeric metrics such as mean square error; a confusion matrix of predicted vs actual classes, with per-class error rates calculated for us; and a “hit ratio table”.

For each observation, the classifier works out how likely it is that the observation belongs to each of the 7 classes. This allows it to make a “best guess” at the actual class (which is the “prediction”), but it can also make a second-best guess, a third-best, and so on.
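
You can see those per-class probabilities directly: h2o.predict returns the predicted class alongside one probability column per class. A sketch, not part of the original run:

# "predict" is the best guess; the remaining columns hold the
# probability assigned to each of the 7 classes
preds <- h2o.predict(h2o_rf, wet_split[[2]])
head(preds)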

The \(k\)th row of the hit ratio table gives us the proportion of observations which are correctly classified within the top \(k\) guesses the classifier makes. So \(k=1\) is the accuracy of the classifier - i.e. the proportion of observations which are correctly predicted - \(k=2\) is the proportion which are within the first two guesses the classifier makes, and so on. The actual class will definitely be in the top 7 guesses (there are only 7 possible classes!) so for \(k=7\) the hit ratio is 100%.
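
In symbols, over the \(n\) test observations:

\[ \text{hit ratio}(k) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\text{true class of observation } i \text{ is among its top } k \text{ guesses}\} \]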

If your classification task is binomial then you can calculate many other relevant metrics (precision, F1 score, false negative rate etc.) using the selection of h2o.metric functions which are available.
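
For example - a sketch only, since our covtype task is multinomial and these calls don’t apply here:

# perf <- h2o.performance(some_binomial_model, test_frame)
# h2o.F1(perf)         # F1 score at each threshold
# h2o.precision(perf)  # precision at each threshold
# h2o.auc(perf)        # area under the ROC curve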

We’ll come back to this random forest soon, but first I’m going to have a little look at H2O Flow.

H2O Flow

H2O Flow is H2O’s own browser-based interface. Getting started is easy:

  1. Initialise an H2O cluster (e.g. by running h2o.init() in R)
  2. Go to localhost:54321 in your browser
  3. Unplug keyboard

Honestly. You can do that. You don’t need it.

Flow is based on a notebook-style layout. Code is added to cells, which are then run in sequence. You can write raw H2O code yourself, but it’s much easier to use the drop-down menus and buttons and let Flow compile the code for you.

The really nice thing about Flow is that it allows you to see exactly what your options are at each stage. When you finish one operation, it gives you a selection of things which you might want to do next. When you are able to change options or parameters, it lists them all out for you so you can see which ones are available and which ones you might want to change.

If you like writing code and you already know exactly what you’re planning on doing, then the R interface is quicker to use and much more concise. But I’d encourage you to play around with Flow for a while at some point, because you’ll probably discover something that you hadn’t used before.

Also, it’s worth clarifying that when you’re using H2O, your data is always on your cluster (in my case, my laptop, but it could be your big server in the basement). Flow uses the browser for display, but it’s not pinging a server - it’s all hosted locally (hence localhost!). So you don’t need to worry about data being intercepted or anything.

Comparison: randomForest and caret

Going back to our random-forest random forest from earlier, we can set up the R equivalent of the H2O forest using the randomForest and caret packages. Specifically, caret is great for dealing with the ML process as a whole - splitting data, managing cross-validation etc. - and in this case it uses randomForest to actually train a classifier.

I’m putting the code for this into a little function called do_rf: it just records the time taken to set up and train the forest, and then returns this time along with performance metrics for the forest.

do_rf <- function(ntrees, nfolds) {

    # Record time at start
    start <- Sys.time()

    # Make/train the forest
    library(caret)
    set.seed(820)

    caret_split <- createDataPartition(covtype$X55, p = 0.75, list = FALSE, times = 1)

    rf <- train(X55 ~ ., data = covtype[caret_split, ],
               trControl = trainControl(method = "cv", number = nfolds),
               method = "rf", ntree = ntrees)

    # Stop the clock
    exec_time <- Sys.time() - start

    # Make predictions
    preds <- predict(rf, covtype[-caret_split, ])

    # Return runtime and performance metrics
    list(exec_time, confusionMatrix(preds, covtype$X55[-caret_split]))
}

Technically, cross-validation isn’t really necessary for a random forest, since the algorithm is relatively resistant to overfitting. However, it’s a useful example to show how the entire ML workflow is integrated into H2O (using cross-validation is as simple as setting the nfolds parameter), whereas in R you really need to use a “meta”-ML package such as caret or mlr to set up ML projects.

Let’s run the do_rf function and see what happens. We’ll use 10 trees and 2-fold cross-validation.

do_rf(ntrees = 10, nfolds = 2)
## [[1]]
## Time difference of 7.788886 mins
## 
## [[2]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     2     3     4     5     6     7
##          1 50591  1564     1     0    29    11   200
##          2  2208 68905   128     0   352    93    31
##          3     2   119  8576    80    30   243     0
##          4     0     0    43   583     0    18     0
##          5    26   126    14     0  1955     4     2
##          6     4    85   176    23     7  3972     0
##          7   129    26     0     0     0     0  4894
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9602          
##                  95% CI : (0.9592, 0.9612)
##                [... some more stats]

Performance is reasonably good - according to the UCI website a simple backprop neural net achieved only 70% accuracy on this dataset, so we’re well above that.

We can compare this with the performance of the H2O forest by writing a similar do_h2o function containing the code from earlier (something like the sketch below), then running it and looking at the results.
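
Based on the earlier snippets, do_h2o might look something like this - a sketch rather than necessarily the exact code I ran, and it assumes the cluster from h2o.init() is still up:

do_h2o <- function(ntrees, nfolds) {

    # Record time at start (the as.h2o conversion counts as part of the work)
    start <- Sys.time()

    wet <- as.h2o(covtype)
    wet_split <- h2o.splitFrame(wet, ratios = 0.75, seed = 820)

    h2o_rf <- h2o.randomForest(y = 55, training_frame = wet_split[[1]],
                               ntrees = ntrees, nfolds = nfolds)

    # Stop the clock
    exec_time <- Sys.time() - start

    # Return runtime and performance metrics on the held-out 25%
    list(exec_time, h2o.performance(h2o_rf, wet_split[[2]]))
}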

h2o_results <- do_h2o(ntrees = 10, nfolds = 2)
print(h2o_results)
## [[1]]
## Time difference of 4.436078 mins
## 
## [[2]]
## H2OMultinomialMetrics: drf
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.09325105
## RMSE: (Extract with `h2o.rmse`) 0.3053703
## Logloss: (Extract with `h2o.logloss`) 0.3262433
## Mean Per-Class Error: 0.183704
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##            1     2    3   4    5    6    7  Error               Rate
## 1      45661  6932    8   0    8    4  112 0.1340 =   7,064 / 52,725
## 2       2887 67758  188   2   55   71   22 0.0454 =   3,225 / 70,983
## 3          0   438 8284  16    0  141    0 0.0670 =      595 / 8,879
## 4          0     0  133 558    0   22    0 0.2174 =        155 / 713
## 5         38  1143   38   0 1147    3    0 0.5158 =    1,222 / 2,369
## 6          7   357  499  24    1 3503    0 0.2022 =      888 / 4,391
## 7        507    22    0   0    0    0 4555 0.1041 =      529 / 5,084
## Totals 49100 76650 9150 600 1211 3744 4689 0.0942 = 13,678 / 145,144
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-7 Hit Ratios: 
##   k hit_ratio
## 1 1  0.905763
## 2 2  0.994791
## 3 3  0.999290
## 4 4  0.999862
## 5 5  0.999972
## 6 6  0.999986
## 7 7  1.000000

It turns out that in this case, H2O is faster than R, but the performance is actually worse. In. This. Case. This is not a solid benchmarking test, and of course it would be possible to improve the performance of both of the implementations through some parameter adjustment, but I’m not going to dive into all that. This is just a paddle, remember.

Comparison: sparklyr

Perhaps the best-known ML platform is Apache Spark, and seeing as there’s an R package to interface with it - sparklyr - I thought I’d take a brief tangent in order to try it out.

It turned out I was being a bit optimistic. It took me about 10 minutes to set up the H2O random forest for the first time; it took a few hours and lots of yelling at incomprehensible Java-related error messages before I managed to get compatible versions of Spark and Java configured correctly. In fairness this was probably more my fault than Spark's...

Once it was finally set up, the ML process was quite similar to H2O (although I must admit that initially I found the syntax a little more difficult to get my head around).

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.0.0")

spark_read_csv(sc, "spark_covtype", "covtype.csv", header = FALSE)

covtype_tbl <- tbl(sc, "spark_covtype")

partition <- sdf_partition(covtype_tbl, train = 0.75, test = 0.25, seed = 820)
train_tbl <- partition$train
test_tbl <- partition$test

ml_rf <- ml_random_forest(train_tbl, V55 ~ ., max.depth = 20, max.bins = 32,
                          num.trees = 20,
                          type = "classification")

preds <- sdf_predict(ml_rf, test_tbl)

ml_classification_eval(preds, "V55", "prediction", "accuracy")
## [1] 0.8295277

At the time of writing there is no facility in sparklyr for cross-validation. Having said that, the Spark MLlib library does include cross-validation, and sparklyr is still under active development, so this is something which may be added in future.

Running a do_spark(ntrees = 10) function (sketched below) reveals a faster time but much lower accuracy than either R or H2O. This isn’t surprising, though: we’re not using cross-validation, so we’re doing less than half the work.
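
For reference, the do_spark wrapper is just the sparklyr code above with a timer around it - again a sketch, reusing train_tbl and test_tbl:

do_spark <- function(ntrees) {

    start <- Sys.time()

    ml_rf <- ml_random_forest(train_tbl, V55 ~ ., num.trees = ntrees,
                              type = "classification")

    exec_time <- Sys.time() - start

    # Accuracy on the test partition
    preds <- sdf_predict(ml_rf, test_tbl)
    list(exec_time, ml_classification_eval(preds, "V55", "prediction", "accuracy"))
}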

Results comparison

                         Accuracy   Time (mins)
randomForest and caret   0.960      7.8
h2o                      0.906      4.4
sparklyr (no X-val)      0.830      1.9

Conversion with rsparkling

If your dataset is already sitting in Spark (or you're accessing it using Spark) and you want to use H2O’s algorithms on it, c’est possible. The rsparkling package allows you to convert a Spark frame to an H2OFrame, which you can then attack with h2o’s functions (e.g. you can take advantage of H2O's cross-validation).

library(rsparkling)

options(rsparkling.sparklingwater.version = "2.0.11")

sc <- spark_connect(master = "local", version = "2.0.0")

rsp_train <- as_h2o_frame(sc, train_tbl, strict_version_check = FALSE)
rsp_test <- as_h2o_frame(sc, test_tbl, strict_version_check = FALSE)

rsp_rf <- h2o.randomForest(y = 55, training_frame = rsp_train,
                           nfolds = 2, ntrees = 10)

Right, we’ve been talking about random forests for quite a long time now. I’d say we’re certainly in over our knees, and by my definition that means we’re wading, not paddling. So I’m going to move on to a new algorithm and a new dataset.

Demo: Neural nets for MNIST

Time for a real classic - according to Kaggle, the de facto “hello world” dataset of computer vision. The MNIST dataset is a collection of 70,000 images of handwritten digits (60,000 training, 10,000 test). Each image is 28x28 pixels in size, and is represented by a 28x28 matrix of values representing the intensity of each pixel.

This matrix is then “unrolled”, row by row, into a 784-length vector; and these vectors can be used to train a neural network to classify new images.
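
A tiny illustration of that unrolling (R reads matrices column-wise, hence the transpose):

m <- matrix(1:9, nrow = 3, byrow = TRUE)
as.vector(t(m))
## [1] 1 2 3 4 5 6 7 8 9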

nnet

First let’s get our data into R and create a split for training and test sets.

library(dplyr)
library(caret)

train <- readr::read_csv("~/H2O_exploration/train.csv") %>%
    # Make the label a factor
    mutate(label = as.factor(label)) %>%
    # Pixel intensities are currently on 0-255 scale;
    # rescale on 0-1 by dividing by 255
    mutate_if(is.numeric, funs(./255))

set.seed(528491)
part <- createDataPartition(1:nrow(train), p = 0.75, list = FALSE)

The nnet package comes as part of the standard R installation and can be used to set up a simple feed-forward neural network. Here we’ll use a 25-unit hidden layer, and we’ll stop optimising after 100 iterations if we haven’t already run out of steam by then.

As before, we can wrap this code in a little timer function:

library(nnet)

do_rnet <- function() {

    start <- Sys.time()

    # Predict label based on all other features (i.e. all the pixels)
    # We have to set a high MaxNWts to avoid a memory error
    rnet <- nnet(label ~ ., data = train, subset = part,
                 size = 25, maxit = 100, MaxNWts = 20000)

    exec_time <- Sys.time() - start

    preds <- predict(rnet, train[-part, -1], type = "class")

    list(exec_time, confusionMatrix(preds, train$label[-part]))
}

do_rnet()
## [[1]]
## Time difference of 33.20143 mins
## 
## [[2]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0  977    0    8    7    2   15    9    3    4    3
##          1    0 1093    5    3    5    1    0    3   11    2
##          2    4   10  977   17    7    5   11   15    7    3
##          3    5    7   23  982    5   29    2    8   26    8
##          4    1    3    7    0  964    1    5    8    2   33
##          5    8    4    4   19    3  880   14    6    9    7
##          6    4    0    3    3   11   14  999    1    7    1
##          7    0    5    7   14    9    4    2 1025    3   20
##          8    9    8    8   20    2   19    7    6  937    9
##          9    4    2    3   13   32   11    1   17    7  973
## 
## Overall Statistics
##                                           
##                Accuracy : 0.934           
##                  95% CI : (0.9291, 0.9387)
## [... etc.]

The accuracy ain’t much to shout about - the top MNIST classifiers nowadays are well above 99% - but it isn’t too bad for such a small implementation. It takes a little while to get there though.

H2O

Let’s see how H2O’s implementation compares (they follow the relatively recent trend of calling neural networks “deep learning” classifiers). Note that we’re using the same setup here as we did for nnet, i.e. one hidden layer of 25 units, although I couldn’t figure out how to limit the number of steps. Also note that the syntax is almost identical to when we implemented the random forest earlier; we’ve just swapped out randomForest for deeplearning.

do_h2o_net <- function() {

    start <- Sys.time()

    library(h2o)
    h2o.init()

    wetter <- as.h2o(train)
    wetter_split <- h2o.splitFrame(wetter, ratios = 0.75, seed = 528491)

    h2o_net <- h2o.deeplearning(
        x = 2:785,
        y = 1,
        training_frame = wetter_split[[1]],
        hidden = 25
    )

    exec_time <- Sys.time() - start

    list(exec_time, h2o.performance(h2o_net, wetter_split[[2]]))
}

results <- do_h2o_net()
results
## [[1]]
## Time difference of 1.802627 mins
## 
## [[2]]
## H2OMultinomialMetrics: deeplearning
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.0657653
## RMSE: (Extract with `h2o.rmse`) 0.2564475
## Logloss: (Extract with `h2o.logloss`) 0.5847708
## Mean Per-Class Error: 0.07453935
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##           0    1    2    3    4   5    6    7    8   9  Error           Rate
## 0      1029    0    1    5    3   6    4    5    6   5 0.0329 =   35 / 1,064
## 1         0 1189    2    6    3   1    3    1    5   0 0.0174 =   21 / 1,210
## 2         8   12  967   17    6   4   15   16   13   2 0.0877 =   93 / 1,060
## 3         1    6   25 1005    1  51    3   16   16   8 0.1122 =  127 / 1,132
## 4         4    6    9    1  976   1    3    9    2  31 0.0633 =   66 / 1,042
## 5         3    5    6   38    5 844   19    2   19   6 0.1088 =    103 / 947
## 6         7    5    6    2    8  14  971    2   11   0 0.0536 =   55 / 1,026
## 7         5    2   15    9    5   4    3 1010    2  23 0.0631 =   68 / 1,078
## 8         2   27    5   23    7  15    7    4  929   8 0.0954 =   98 / 1,027
## 9        10    2    4    7   36  13    2   35    5 913 0.1110 =  114 / 1,027
## Totals 1069 1254 1040 1113 1050 953 1030 1100 1008 996 0.0735 = 780 / 10,613
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-10 Hit Ratios: 
##     k hit_ratio
## 1   1  0.926505
## 2   2  0.972581
## 3   3  0.985490
## 4   4  0.992556
## 5   5  0.995006
## 6   6  0.996608
## 7   7  0.998116
## 8   8  0.998775
## 9   9  0.999152
## 10 10  1.000000

Woah! The accuracy is virtually identical, but we manage it in under 2 minutes, rather than over half an hour.

Just for the sake of it, I also ran a TensorFlow convolutional neural net in Python to see how performance compared. The accuracy was higher than both of the previous implementations and the training process was nearly as fast as H2O had been. However, an upgraded H2O net with two 50-unit hidden layers (as simple as swapping out hidden = 25 for hidden = c(50, 50) in the previous code) surpassed even the convnet, both in terms of accuracy and speed.

Neural nets: results comparison

                                                  Accuracy   Time (mins)
nnet (100 steps, 25 hidden units)                 0.934      33.2
H2O deeplearning (??? steps, 25 hidden units)     0.927      1.8
TensorFlow (500 steps, 2*conv/relu/FC)            0.950      3.7
H2O deeplearning (??? steps, 2*50 hidden units)   0.956      2.2

There’s another slightly subtle point here, but it’s a very important one. Notice that when we increase the complexity of the H2O net, the impact on training time is very small.

However, swapping one 25-unit hidden layer for two 50-unit hidden layers means a massive increase in computational complexity. It’s not just because we have 4 times the number of neurons; that extra hidden layer means another stage of backpropagation, which is very computationally expensive. So actually, from the relatively small increase in training time, we see that the ML process scales very well (for this medium-sized dataset at least) and we can infer that a large chunk of those 2 minutes is in fact spent on data conversion.

H2O or H2No?

So, should you consider using H2O for your ML applications? Let’s review what we’ve discovered.

H2Pros

  • Multiple interfaces available. If you use R or Python you can seamlessly integrate it into your existing workflows.
  • Intuitive to use. The R interface uses sensible, memorable function names. The Flow interface guides you through the process and shows you e v e r y t h i n g.
  • ML scales very well. More data? No problem. Complicated algorithm? No problem.
  • Fine control is possible. Loads of parameters to fiddle around with. Flow shows you exactly what they are. If the “Advanced” ones aren’t hard-core enough for you, there’s an “Expert” section too.
  • Integrated ML workflow. (sparklyr, we’re waiting!)

H2Woes

  • Limited selection of algorithms. I mean, “limited” here is relative. If you need to use Gaussian mixtures or your application desperately requires Latent Dirichlet Allocation, you’d better head on over to Spark. But H2O covers most things you are likely to use on a regular basis.
  • Data conversion can be H2Slow. But again, this is relative. If you have to spend 2 minutes converting your data to an H2OFrame, but this saves you 30 minutes of algorithm runtime, you’re probably prepared to do it!
  • Some algorithms are black box methods. But… ¯\_(ツ)_/¯. The algorithms which are “black-box” are still black-box whichever implementation you’re using, so this probably isn’t a downside to H2O per se - it’s more of a general ML “problem” (whether it’s an actual problem often depends on your point of view). And H2O is open source, so if you want to you can go and dig around in the source code.

So if you’re looking for a way to speed up your ML training times by replacing R algorithm implementations, in my opinion H2O is a great choice of ML framework.

Bon voyage!

I might have mentioned once or twice that this was just a paddle (sorry if you’re bored of the theme, but I promise I’m nearly done!). I really have only touched on the basics here: there’s loads more to H2O than what I’ve covered in this post. If you’re interested in GPU-based deep learning, you might want to go and read about Deep Water. If Spark and H2O sounds like a super combination, check out Sparkling Water. If you’re keen on producing streamlined data analysis applications, have a look at Steam. And of course the documentation for H2O, and for all those other H2O.ai enterprises, goes into much greater detail than I possibly could.

Off you go then! It’s been fun paddling together, but now I’ll leave you to do your own thing. Go and dry off, or carry on frolicking in the waves, or get started training for your 100m freestyle world record attempt. Me? I’m off for an ice cream.


Who is "just me"?

Hi, I'm Owen. I'm a maths undergraduate at the University of Bath. Just over a year ago I found out about machine learning (and consequently data science), and decided that it looked much more fun than all the other careers I'd been half-heartedly researching up to that point. I'm now a couple of months into a year-long placement at Mango: I'm thoroughly enjoying myself, and given that I haven't been kicked out yet I assume they don't mind having me around too much.

If you want to talk about H2O, ML or anything else at all you can reach me by email at [email protected], on Twitter @owenjonesuob, or via my website.
