Predicting binary response variable with h2o framework
Introduction
H2O is an open-source, distributed, scalable framework for training machine learning and deep learning models and for data analysis. It handles large data sets with ease by creating a cluster from the available nodes. Fortunately, it provides an API that lets R users get the most benefit from it, especially with large data sets, which is where R is most limited.
The beauty is that R users can load and use this system via the h2o package, which can be installed and loaded like any other R package.
# install.packages("h2o") if not already installed library(tidyverse) -- Attaching packages ------------------- v ggplot2 3.3.1 v purrr 0.3.4 v tibble 3.0.1 v dplyr 1.0.0 v tidyr 1.1.0 v stringr 1.4.0 v readr 1.3.1 v forcats 0.5.0 -- Conflicts --- tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() library(h2o) ---------------------------------------------------------------------- Your next step is to start H2O: > h2o.init() For H2O package documentation, ask for help: > ??h2o After starting H2O, you can use the Web UI at http://localhost:54321 For more information visit http://docs.h2o.ai ---------------------------------------------------------------------- Attaching package: 'h2o' The following objects are masked from 'package:stats': cor, sd, var The following objects are masked from 'package:base': %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames, colnames<-, ifelse, is.character, is.factor, is.numeric, log, log10, log1p, log2, round, signif, trunc
Then, to launch the cluster, run the following script:
h2o.init(nthreads = -1)

Connection successful!

R is connected to the H2O cluster:
    H2O cluster uptime:         42 minutes 55 seconds
    H2O cluster timezone:       Europe/Paris
    H2O data parsing timezone:  UTC
    H2O cluster version:        3.30.0.1
    H2O cluster version age:    2 months and 25 days
    H2O cluster name:           H2O_started_from_R_dell_uhh542
    H2O cluster total nodes:    1
    H2O cluster total memory:   1.69 GB
    H2O cluster total cores:    4
    H2O cluster allowed cores:  4
    H2O cluster healthy:        TRUE
    H2O Connection ip:          localhost
    H2O Connection port:        54321
    H2O Connection proxy:       NA
    H2O Internal Security:      FALSE
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
    R Version:                  R version 4.0.1 (2020-06-06)
Looking at this output, we see that h2o runs on the Java virtual machine (JVM), so you need Java installed beforehand. Notice that I set the nthreads argument to -1 to tell h2o to build its cluster using all the available cores.
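For reference, here is a minimal sketch of the cluster lifecycle; the max_mem_size value below is only an illustrative choice, not something used in this post:

# Start the cluster on all cores, optionally capping the memory it may use
h2o.init(nthreads = -1, max_mem_size = "4g")

h2o.clusterInfo()               # check the cluster status at any time
# h2o.shutdown(prompt = FALSE)  # shut the cluster down when you are finished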
Since our purpose is to understand how to work with h2o, we will use a small data set in which the response is a binary variable: the creditcard data set, downloaded from the Kaggle website.
Data preparation
To import the data directly into the h2o cluster, we use the function h2o.importFile as follows.
card <- h2o.importFile("../creditcard.csv")
  |======================================================================| 100%

h2o.dim(card)
[1] 284807     31
This data has 284807 rows and 31 columns. According to the description of this data, the response variable is Class, with two values: 1 for a fraudulent transaction and 0 for a regular one. The other variables are PCA components derived from the original ones for privacy reasons, for instance to protect the users' identities.
So first let’s check the summary of this data.
h2o.describe(card)

   Label   Type Missing  Zeros PosInf NegInf         Min          Max
1   Time    int       0      2      0      0    0.000000 1.727920e+05
2     V1   real       0      0      0      0  -56.407510 2.454930e+00
3     V2   real       0      0      0      0  -72.715728 2.205773e+01
4     V3   real       0      0      0      0  -48.325589 9.382558e+00
5     V4   real       0      0      0      0   -5.683171 1.687534e+01
6     V5   real       0      0      0      0 -113.743307 3.480167e+01
7     V6   real       0      0      0      0  -26.160506 7.330163e+01
8     V7   real       0      0      0      0  -43.557242 1.205895e+02
9     V8   real       0      0      0      0  -73.216718 2.000721e+01
10    V9   real       0      0      0      0  -13.434066 1.559499e+01
11   V10   real       0      0      0      0  -24.588262 2.374514e+01
12   V11   real       0      0      0      0   -4.797473 1.201891e+01
13   V12   real       0      0      0      0  -18.683715 7.848392e+00
14   V13   real       0      0      0      0   -5.791881 7.126883e+00
15   V14   real       0      0      0      0  -19.214325 1.052677e+01
16   V15   real       0      0      0      0   -4.498945 8.877742e+00
17   V16   real       0      0      0      0  -14.129855 1.731511e+01
18   V17   real       0      0      0      0  -25.162799 9.253526e+00
19   V18   real       0      0      0      0   -9.498746 5.041069e+00
20   V19   real       0      0      0      0   -7.213527 5.591971e+00
21   V20   real       0      0      0      0  -54.497720 3.942090e+01
22   V21   real       0      0      0      0  -34.830382 2.720284e+01
23   V22   real       0      0      0      0  -10.933144 1.050309e+01
24   V23   real       0      0      0      0  -44.807735 2.252841e+01
25   V24   real       0      0      0      0   -2.836627 4.584549e+00
26   V25   real       0      0      0      0  -10.295397 7.519589e+00
27   V26   real       0      0      0      0   -2.604551 3.517346e+00
28   V27   real       0      0      0      0  -22.565679 3.161220e+01
29   V28   real       0      0      0      0  -15.430084 3.384781e+01
30 Amount   real       0   1825      0      0    0.000000 2.569116e+04
31  Class    int       0 284315      0      0    0.000000 1.000000e+00

            Mean        Sigma Cardinality
1   9.481386e+04 4.748815e+04          NA
2   1.759011e-12 1.958696e+00          NA
3  -8.250683e-13 1.651309e+00          NA
4  -9.647560e-13 1.516255e+00          NA
5   8.321544e-13 1.415869e+00          NA
6   1.648035e-13 1.380247e+00          NA
7   4.249739e-13 1.332271e+00          NA
8  -3.055929e-13 1.237094e+00          NA
9   8.772193e-14 1.194353e+00          NA
10 -1.179836e-12 1.098632e+00          NA
11  7.093134e-13 1.088850e+00          NA
12  1.875289e-12 1.020713e+00          NA
13  1.052772e-12 9.992014e-01          NA
14  7.125770e-13 9.952742e-01          NA
15 -1.477477e-13 9.585956e-01          NA
16 -5.231893e-13 9.153160e-01          NA
17 -2.282734e-13 8.762529e-01          NA
18 -6.425575e-13 8.493371e-01          NA
19  4.950876e-13 8.381762e-01          NA
20  7.057097e-13 8.140405e-01          NA
21  1.766136e-12 7.709250e-01          NA
22 -3.405763e-13 7.345240e-01          NA
23 -5.724055e-13 7.257016e-01          NA
24 -9.725862e-13 6.244603e-01          NA
25  1.464123e-12 6.056471e-01          NA
26 -6.987098e-13 5.212781e-01          NA
27 -5.617541e-13 4.822270e-01          NA
28  3.332080e-12 4.036325e-01          NA
29 -3.518873e-12 3.300833e-01          NA
30  8.834962e+01 2.501201e+02          NA
31  1.727486e-03 4.152719e-02          NA
The most important issues that we usually check first are missing values and, for classification problems, class imbalance.
For missing values, you should know that a value is recognized by R as missing only if it is written as NA or left as a blank cell. If a missing value in the imported data is written in any other format, for instance as a string such as "na" or "missing", we have to tell R that these entries are missing values to be converted to NA. The same reasoning applies when a variable contains zeros it should not be able to have. The Amount variable, for instance, should never be zero since every transaction involves some amount of money, yet in the data it has 1825 zeros; the same applies to the Time variable, which has two zeros. Since the data set is large, this is not a big issue, and we can comfortably remove these rows.
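As an aside, if the raw file really did encode missing values as strings, we could handle that at import time. A minimal sketch, assuming hypothetical codes "na" and "missing" in the file (which is not the case for this data set):

# Hypothetical example: treat the strings "na" and "missing" as NA while parsing
card_raw <- h2o.importFile(
  "../creditcard.csv",
  na.strings = c("na", "missing")
)

Back to our data: we convert the spurious zeros to NA and then drop those rows.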
card$Amount <- h2o.ifelse(card$Amount == 0, NA, card$Amount)
card$Time   <- h2o.ifelse(card$Time == 0, NA, card$Time)
card <- h2o.na_omit(card)
It is good practice to check the output after each transformation to make sure your code did what you expected.
h2o.describe(card)

   Label   Type Missing  Zeros PosInf NegInf         Min          Max
1   Time    int       0      0      0      0    1.000000 1.727920e+05
2     V1   real       0      0      0      0  -56.407510 2.454930e+00
3     V2   real       0      0      0      0  -72.715728 2.205773e+01
4     V3   real       0      0      0      0  -48.325589 9.382558e+00
5     V4   real       0      0      0      0   -5.683171 1.687534e+01
6     V5   real       0      0      0      0 -113.743307 3.480167e+01
7     V6   real       0      0      0      0  -26.160506 7.330163e+01
8     V7   real       0      0      0      0  -43.557242 1.205895e+02
9     V8   real       0      0      0      0  -73.216718 2.000721e+01
10    V9   real       0      0      0      0  -13.320155 1.559499e+01
11   V10   real       0      0      0      0  -24.588262 2.374514e+01
12   V11   real       0      0      0      0   -4.797473 1.201891e+01
13   V12   real       0      0      0      0  -18.683715 7.848392e+00
14   V13   real       0      0      0      0   -5.791881 7.126883e+00
15   V14   real       0      0      0      0  -19.214325 1.052677e+01
16   V15   real       0      0      0      0   -4.498945 8.877742e+00
17   V16   real       0      0      0      0  -14.129855 1.731511e+01
18   V17   real       0      0      0      0  -25.162799 9.253526e+00
19   V18   real       0      0      0      0   -9.498746 5.041069e+00
20   V19   real       0      0      0      0   -7.213527 5.591971e+00
21   V20   real       0      0      0      0  -54.497720 3.942090e+01
22   V21   real       0      0      0      0  -34.830382 2.720284e+01
23   V22   real       0      0      0      0  -10.933144 1.050309e+01
24   V23   real       0      0      0      0  -44.807735 2.252841e+01
25   V24   real       0      0      0      0   -2.836627 4.584549e+00
26   V25   real       0      0      0      0  -10.295397 7.519589e+00
27   V26   real       0      0      0      0   -2.604551 3.517346e+00
28   V27   real       0      0      0      0  -22.565679 3.161220e+01
29   V28   real       0      0      0      0  -15.430084 3.384781e+01
30 Amount   real       0      0      0      0    0.010000 2.569116e+04
31  Class    int       0 282515      0      0    0.000000 1.000000e+00

            Mean        Sigma Cardinality
1   9.484963e+04 4.748196e+04          NA
2  -3.483003e-04 1.956753e+00          NA
3  -2.017854e-03 1.650496e+00          NA
4  -3.302670e-03 1.514214e+00          NA
5  -1.199330e-02 1.404852e+00          NA
6  -2.239576e-03 1.378819e+00          NA
7  -1.305135e-03 1.331596e+00          NA
8   2.509043e-03 1.233944e+00          NA
9   2.694780e-05 1.191177e+00          NA
10  1.464239e-03 1.099065e+00          NA
11 -2.278298e-03 1.087587e+00          NA
12  2.311358e-03 1.018693e+00          NA
13  8.656482e-04 9.972279e-01          NA
14  6.992494e-04 9.945502e-01          NA
15  2.020090e-04 9.555395e-01          NA
16  3.645617e-03 9.137113e-01          NA
17 -1.095849e-03 8.760560e-01          NA
18  1.618984e-03 8.462568e-01          NA
19  1.306725e-03 8.386969e-01          NA
20  1.914719e-03 8.119902e-01          NA
21  9.806997e-04 7.705625e-01          NA
22 -4.814073e-05 7.326525e-01          NA
23 -1.607323e-03 7.255767e-01          NA
24  1.474258e-04 6.230342e-01          NA
25  2.017716e-04 6.057968e-01          NA
26 -5.086588e-04 5.209869e-01          NA
27 -1.364828e-03 4.819297e-01          NA
28  2.533476e-04 4.029874e-01          NA
29  1.926961e-04 3.303524e-01          NA
30  8.891949e+01 2.508252e+02          NA
31  1.643226e-03 4.050347e-02          NA
In contrast, we do have a very serious imbalance problem: the Class variable, which takes only the values 0 and 1, has a mean of about 0.0016, which means that the overwhelming majority of rows carry the label 0.
h2o.table(card$Class)

  Class  Count
1     0 282515
2     1    465

[2 rows x 2 columns]
As expected, the vast majority of cases carry the label 0. Any machine learning model fitted to this data without correcting this problem will be dominated by the label 0 and will rarely predict the fraudulent cards (label 1) correctly, even though they are our main interest.
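To make this concrete, here is a small sketch (the helper object cls is mine, not part of the original analysis) that computes the accuracy a trivial classifier would get by always predicting 0:

# Bring the tiny class-count table back into R and compute the majority-class share
cls <- as.data.frame(h2o.table(card$Class))
cls$Count[cls$Class == 0] / sum(cls$Count)
# about 0.998: always predicting "0" is 99.8% accurate yet catches no fraud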
The h2o package provides a way to correct the imbalance problem. For glm models, for instance, we have three arguments for this purpose:
- balance_classes: if set to TRUE, the training data are resampled (over- or under-sampled) to balance the class counts, as controlled by the next two arguments.
- class_sampling_factors: The desired sampling ratios per class (over or under sampling).
- max_after_balance_size: The desired relative size of the training data after balancing class counts.
Only one of the last two arguments can be used at a time. To reduce the size of our data, we will later set max_after_balance_size to 0.1 so that the balanced training data contains only 10% of the original rows.
Before going further, we should split the data randomly into a training set (80% of the data) and a testing set (the remaining 20%).
card$Class <- h2o.asfactor(card$Class)
parts <- h2o.splitFrame(card, 0.8, seed = 1111)
train <- parts[[1]]
test  <- parts[[2]]

h2o.table(train$Class)

  Class  Count
1     0 226268
2     1    372

[2 rows x 2 columns]

h2o.table(test$Class)

  Class Count
1     0 56247
2     1    93

[2 rows x 2 columns]
Logistic regression
For binary classification problems, the first model that comes to mind is the logistic regression model. This model belongs to the family of glm models: when we set the family argument to binomial we get a logistic regression model. The following are the main arguments of glm models (besides the arguments discussed above):
- x: should contain the predictor names (not the data itself) or their indices.
- y: the name of the response variable (again not the whole column).
- training_frame: the training data frame.
- model_id: to name the model.
- nfolds: the number of folds to use for cross-validation (useful for hyperparameter tuning).
- seed: for reproducibility.
- fold_assignment: the scheme of the cross-validation: AUTO, Random, Stratified, or Modulo.
- family: many distributions are provided, for binary we have binomial, quasibinomial.
- solver: the optimization algorithm used; with AUTO, h2o decides the best one given the data, but you can choose another one such as IRLSM, L_BFGS, COORDINATE_DESCENT, etc.
- alpha: the ratio that mixes the L1 (lasso) and L2 (ridge regression) penalties. Larger values give more weight to the lasso.
- lambda_search: lambda is the overall strength of the regularization; if TRUE, the model tries a sequence of lambda values (a short sketch follows this list).
- standardize: to standardize the numeric columns.
- compute_p_values: whether to compute p-values; it does not work with regularization (it requires lambda = 0).
- link: the link function.
- interactions: the predictors for which we want interaction terms.
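To illustrate the regularization-related arguments, here is a minimal sketch of a cross-validated, regularized logistic regression with lambda_search; this is my own illustrative call, not one of the models fitted later in this post:

# Illustrative only: elastic-net logistic regression with a lambda search
model_reg <- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train,
  family = "binomial",
  alpha = 0.5,            # mix of lasso (L1) and ridge (L2)
  lambda_search = TRUE,   # try a sequence of lambda values
  nfolds = 5,             # 5-fold cross-validation
  seed = 123
)
h2o.coef(model_reg)       # coefficients at the selected lambda
# note: compute_p_values = TRUE would require lambda = 0 (no regularization)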
Now we are ready to train our model with some specified argument values. But first, let's try using the original data without correcting the imbalance problem.
model_logit <- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train,
  model_id = "glm_binomial_no_eg",
  seed = 123,
  lambda = 0,
  family = "binomial",
  solver = "IRLSM",
  standardize = TRUE,
  link = "family_default"
)
  |======================================================================| 100%
h2o provides a bunch of metrics already computed during the training process, along with the confusion matrix. We can access them by calling the function h2o.performance.
h2o.performance(model_logit)

H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.0006269349
RMSE:  0.02503867
LogLoss:  0.003809522
Mean Per-Class Error:  0.1103587
AUC:  0.9731273
AUCPR:  0.7485898
Gini:  0.9462545
R^2:  0.6174137
Residual Deviance:  1726.78
AIC:  1788.78

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0   1    Error        Rate
0      226203  65 0.000287  =65/226268
1          82 290 0.220430     =82/372
Totals 226285 355 0.000649 =147/226640

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold         value idx
1                       max f1  0.135889      0.797799 176
2                       max f2  0.058169      0.808081 200
3                 max f0point5  0.475561      0.833333 102
4                 max accuracy  0.135889      0.999351 176
5                max precision  0.999976      0.909091   0
6                   max recall  0.000027      1.000000 397
7              max specificity  0.999976      0.999960   0
8             max absolute_mcc  0.135889      0.797693 176
9   max min_per_class_accuracy  0.001118      0.919355 345
10 max mean_per_class_accuracy  0.002782      0.934336 314
11                     max tns  0.999976 226259.000000   0
12                     max fns  0.999976    282.000000   0
13                     max fps  0.000007 226268.000000 399
14                     max tps  0.000027    372.000000 397
15                     max tnr  0.999976      0.999960   0
16                     max fnr  0.999976      0.758065   0
17                     max fpr  0.000007      1.000000 399
18                     max tpr  0.000027      1.000000 397

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)`
or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
To extract only the confusion matrix, we call the function h2o.confusionMatrix.
h2o.confusionMatrix(model_logit)

Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.135888872638703:
            0   1    Error        Rate
0      226203  65 0.000287  =65/226268
1          82 290 0.220430     =82/372
Totals 226285 355 0.000649 =147/226640
Looking at the confusion matrix, we get a very low error rate for the majority label (about 0.03%), whereas the error rate for the minority label is quite high (22.04%). This result is expected since the data is heavily dominated by the label 0.
h2o.confusionMatrix(model_logit, test)

Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.0767397449673996:
           0  1    Error      Rate
0      56223 24 0.000427 =24/56247
1         19 74 0.204301    =19/93
Totals 56242 98 0.000763 =43/56340
Using the testing set, the error rates are not far from the previous ones; the error rate of the minority class is even slightly lower than on the training data (20.43% versus 22.04%).
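We can also score the test set explicitly and compute the full set of metrics on it. A quick sketch using standard h2o functions (the exact numbers will of course depend on your run):

# Predicted class and class probabilities for the test set
pred <- h2o.predict(model_logit, newdata = test)
head(pred)

# Full metrics (AUC, logloss, ...) computed on the test set
perf_test <- h2o.performance(model_logit, newdata = test)
h2o.auc(perf_test)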
Now let's correct the imbalance problem by setting the argument balance_classes to TRUE; to keep the data size reasonable, we under-sample the majority class to 20% and over-sample the minority class 20 times.
model_logit2 <- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train,
  model_id = "glm_binomia_balance",
  seed = 123,
  lambda = 0,
  family = "binomial",
  solver = "IRLSM",
  standardize = TRUE,
  link = "family_default",
  balance_classes = TRUE,
  class_sampling_factors = c(0.2, 20)
)
  |======================================================================| 100%
What we care about most when evaluating our model is its performance on the testing set.
h2o.confusionMatrix(model_logit, test)

Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.0767397449673996:
           0  1    Error      Rate
0      56223 24 0.000427 =24/56247
1         19 74 0.204301    =19/93
Totals 56242 98 0.000763 =43/56340
One strategy to improve our model is to remove the least important variables. h2o provides a function that lists the predictors in decreasing order of their importance for predicting the response variable, so we can consider dropping the least important ones in order to reduce the error rate of the minority class.
h2o.varimp(model_logit)

   variable relative_importance scaled_importance   percentage
1        V4         0.994893006        1.000000000 0.1440671995
2       V10         0.916557351        0.921262231 0.1327236697
3       V14         0.532427886        0.535160950 0.0770991394
4       V22         0.481085303        0.483554815 0.0696643880
5        V9         0.371758112        0.373666424 0.0538330752
6       V20         0.360902368        0.362754956 0.0522610906
7       V27         0.340929107        0.342679168 0.0493688281
8       V13         0.324390153        0.326055315 0.0469738762
9       V21         0.309050873        0.310637296 0.0447526453
10      V16         0.230524987        0.231708320 0.0333815688
11   Amount         0.213887758        0.214985688 0.0309723861
12       V8         0.211491780        0.212577411 0.0306254323
13     Time         0.209412404        0.210487362 0.0303243248
14       V6         0.189761276        0.190735361 0.0274787093
15       V5         0.176081346        0.176985209 0.0254977634
16       V1         0.164618852        0.165463875 0.0238379171
17      V12         0.134542774        0.135233410 0.0194826987
18      V11         0.129031560        0.129693906 0.0186846379
19      V28         0.093633317        0.094113956 0.0135587341
20      V26         0.093283287        0.093762130 0.0135080475
21      V17         0.081893193        0.082313568 0.0118586852
22       V7         0.077962451        0.078362649 0.0112894873
23      V23         0.067840817        0.068189058 0.0098238066
24      V18         0.065741510        0.066078975 0.0095198129
25      V25         0.033325258        0.033496323 0.0048257215
26       V2         0.029047974        0.029197083 0.0042063420
27      V24         0.025833162        0.025965769 0.0037408156
28      V19         0.022354254        0.022469003 0.0032370463
29      V15         0.020189854        0.020293493 0.0029236267
30       V3         0.003304571        0.003321534 0.0004785241
Or, as a plot:
h2o.varimp_plot(model_logit)
Another strategy to remove the less important variables, which works even better, is lasso regression (L1), which strips out the less important predictors automatically; it is also known as a feature selection method. The lasso, like ridge regression (L2), is a regularization technique used to fight overfitting, and besides that it acts as a reduction technique since it shrinks some coefficients exactly to zero and thereby reduces the number of predictors. We enable this method in h2o by setting alpha = 1, where alpha is the ratio that trades off between the lasso (L1) and ridge regression (L2) penalties; an alpha closer to zero means more ridge than lasso.
model_lasso <- h2o.glm(
  x = 1:30,
  y = 31,
  training_frame = train,
  model_id = "glm_binomial_lasso",
  seed = 123,
  alpha = 1,
  family = "binomial",
  solver = "IRLSM",
  standardize = TRUE,
  link = "family_default",
  balance_classes = TRUE,
  max_after_balance_size = 0.1
)
  |======================================================================| 100%
Using the testing set, the confusion matrix will be:
h2o.confusionMatrix(model_lasso, test)

Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.0431528451934093:
           0   1    Error      Rate
0      56221  26 0.000462 =26/56247
1         16  77 0.172043    =16/93
Totals 56237 103 0.000745 =42/56340
With the lasso model, the error rate of the minor class has improved from 20.43% to 17.20%.
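Because the lasso shrinks some coefficients exactly to zero, we can also inspect which predictors it effectively removed. A small sketch (the split between zero and non-zero coefficients will depend on the selected lambda):

coefs <- h2o.coef(model_lasso)   # named vector of coefficients
names(coefs)[coefs == 0]         # predictors dropped by the lasso
names(coefs)[coefs != 0]         # predictors kept in the model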
A last note about hyperparameter tuning: some hyperparameters are not supported by the h2o.grid function, for instance the solver argument. This is not an issue, though, since we can simply loop over the hyperparameters in question. Let's explore the most popular solvers by using the R lapply function.
solvers <- c("IRLSM", "L_BFGS", "COORDINATE_DESCENT")

mygrid <- lapply(solvers, function(solver) {
  grid_id <- paste0("glm_", solver)
  h2o.glm(
    x = 1:30,
    y = 31,
    training_frame = train,
    seed = 123,
    alpha = 1,
    model_id = paste0("lasso_", solver),
    family = "binomial",
    solver = solver,
    standardize = TRUE,
    link = "family_default",
    balance_classes = TRUE
  )
})
  |======================================================================| 100%
  |======================================================================| 100%
  |======================================================================| 100%

df <- cbind(
  h2o.confusionMatrix(mygrid[[1]])$Error,
  h2o.confusionMatrix(mygrid[[2]])$Error,
  h2o.confusionMatrix(mygrid[[3]])$Error
)
df <- t(round(df, digits = 6))
dimnames(df) <- list(
  list("IRLSM", "L_BFGS", "COORDINATE_DESCENT"),
  list("Error (0)", "Error (1)", "Total Error")
)
df

                   Error (0) Error (1) Total Error
IRLSM               0.000309  0.212366    0.000657
L_BFGS              0.000292  0.220430    0.000653
COORDINATE_DESCENT  0.000283  0.225806    0.000653
It seems there is no significant difference between these solvers, so we can stick with the default one.
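For hyperparameters that h2o.grid does support, such as alpha, a grid search is more convenient than a manual loop. A minimal sketch of what that could look like (my own illustrative grid, not something run in this post):

# Illustrative only: grid search over the elastic-net mixing parameter alpha
grid <- h2o.grid(
  algorithm = "glm",
  grid_id = "glm_alpha_grid",
  x = 1:30,
  y = 31,
  training_frame = train,
  family = "binomial",
  lambda_search = TRUE,
  balance_classes = TRUE,
  hyper_params = list(alpha = c(0, 0.25, 0.5, 0.75, 1)),
  seed = 123
)
# Rank the models in the grid by AUC
h2o.getGrid("glm_alpha_grid", sort_by = "auc", decreasing = TRUE)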
Random forest