How H2O propels data scientists ahead of itself: enhancing Driverless AI with advanced options, recipes and visualizations
[This article was first published on novyden, and kindly contributed to R-bloggers.]
H2O engineers continually innovate and implement the latest techniques by following and adopting the latest research, working on cutting-edge use cases, and participating in (and winning) machine learning competitions like Kaggle. But thanks to the explosion of AI research and applications, even the most advanced automated machine learning platforms like H2O.ai Driverless AI cannot ship with every bell and whistle to satisfy each data scientist out there. Which means there is always that feature or algorithm a customer may want and not yet find in the H2O docs.
With that in mind we designed several mechanisms to help users lead the way with Driverless AI instead of waiting or looking elsewhere. These mechanisms let data scientists extend functionality with little (or possibly more involved) effort while seamlessly integrating into the Driverless AI workflow and model pipeline:
- experiment configuration profile
- transformer recipes (custom feature engineering)
- model recipes (custom algorithms)
- scorer recipes (custom loss functions)
- data recipes (data load, prep and augmentation; starting with 1.8.1)
- Client APIs for both Python and R
Experiment Configuration
All possible configuration options inside Driverless AI can be found in the config.toml file (see here). Every experiment can override any of these options (as applicable) in its Expert Settings using the Add to config.toml via toml String entry, applying them to that experiment only. For example, while Driverless AI completely automates tuning and selection of its built-in algorithms (GLM, LightGBM, XGBoost, TensorFlow, RuleFit, FTRL), it cannot possibly foresee all use cases and control for every available parameter. So with every experiment the following configuration settings let users customize parameters for each algorithm:
- LightGBM parameters: params_lightgbm and params_tune_lightgbm
- XGBoost GBM: params_xgboost and params_tune_xgboost
- XGBoost Dart: params_dart and params_tune_dart
- Tensorflow: params_tensorflow and params_tune_tensorflow
- GLM: params_gblinear and params_tune_gblinear
- RuleFit: params_rulefit and params_tune_rulefit
- FTRL: params_ftrl and params_tune_ftrl
For example, override TensorFlow parameters:
params_tensorflow = "{'lr': 0.01, 'add_wide': False, 'add_attention': True, 'epochs': 30, 'layers': (100, 100), 'activation': 'selu', 'batch_size': 64, 'chunk_size': 1000, 'dropout': 0.3, 'strategy': 'one_shot', 'l1': 0.0, 'l2': 0.0, 'ort_loss': 0.5, 'ort_loss_tau': 0.01, 'normalize_type': 'streaming'}"
or override LightGBM parameters:
params_lightgbm = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64, 'random_state': 1234}"
or use params_tune_xxxx to provide a grid that limits or extends Driverless AI's hyperparameter search for the optimal model over certain values, like this:
params_tune_xgboost = "{'max_leaves': [8, 16, 32, 64]}"
To add multiple parameters via Expert Settings, wrap the whole configuration string in double double quotes (“”) and separate parameters with a newline (\n):
""params_tensorflow = "{'lr': 0.01, 'epochs': 30, 'activation': 'selu'}" \n params_lightgbm = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64}" \n params_tune_xgboost = "{'max_leaves': [8, 16, 32, 64]}"""
To confirm that the new settings took effect, look inside the experiment’s log file (to do that while the experiment is running see here, or for a completed experiment here) and find Driverless AI Config Settings. Overridden settings should appear with an asterisk and their assigned values:
params_tensorflow *: {'lr': 0.01, 'epochs': 30, 'activation': 'selu'}
params_lightgbm *: {'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64}
params_tune_xgboost *: {'max_leaves': [8, 16, 32, 64]}
Transformer Recipes
Starting with version 1.7.0 (July 2019), Driverless AI supports the Bring Your Own Recipe (BYOR) framework to seamlessly integrate user-written extensions into its workflow. Feature engineering and selection make up a significant part of the automated machine learning (AutoML) workflow, which utilizes a Genetic Algorithm (GA) and a rich set of built-in feature transformers and interactions to maximize model performance. A high-level and rough view of the Driverless AI GA and BYOR workflow, illustrating how its pieces fall together, is displayed below:
Figure 1. Driverless AI GA and BYOR workflow
Still, the variety of data and ever more complex use cases often demand more specialized transformations and interactions performed on features. Using custom transformer recipes (a.k.a. BYOR transformers), core functionality can be extended to include any transformations and interactions written in Python according to the BYOR specification. Implemented in Python and able to use any Python packages, a transformer recipe becomes part of the core GA workflow and competes with built-in feature transformations and interactions.
Such fair competition of transformers inside Driverless AI is good both for Driverless AI models and for customers, who can share and borrow ideas from each other – a true realization of the democratization of AI that H2O.ai stands for. To start with custom transformers, use one of the many recipes found in the transformers section of the public H2O recipes repo: https://github.com/h2oai/driverlessai-recipes/tree/master/transformers. For more help on how to create your own transformer see How to Write a Transformer Recipe.
Model Recipes
XGBoost and LightGBM consistently deliver top models and carry most transactional and time series use cases in Driverless AI. Another workhorse algorithm, delivering top models for NLP and multi-class use cases, is TensorFlow. Still more algorithms – Random Forest, GLM, and FTRL – compete for the best model in Driverless AI. But this competition is not a closed tournament, because BYOR lets any algorithm available in Python compete for the best model. Using BYOR model recipes, users can incorporate their own classification and regression algorithms into the Driverless AI workflow and let it tune and select the one with the best score for the final model or ensemble. Based on the accuracy setting, Driverless AI either picks the best algorithm or continues the workflow with a meta learner that assembles the final ensemble out of the top finishers. Any program written in Python just needs to implement the BYOR model interface to start competing as part of the Driverless AI workflow. For examples and a wide variety of existing recipes refer to the h2oai/driverlessai-recipes/models repository.
Scorer Recipes
Often data scientists swear by their favorite scorer, so Driverless AI includes a sufficiently large set of built-in scorers for both classification and regression. But we don’t pretend to have all the answers and, again, the BYOR framework allows extending the Driverless AI workflow with any scoring (loss) function, be it from the latest research papers or driven by specific business requirements. A rather representative and useful collection of scorers can be found in the h2oai/driverlessai-recipes/scorers repository, and instructions on how to use custom scorers in Driverless AI are here. Remember that Driverless AI uses custom scorers inside the GA workflow to select the best features and model parameters, not inside its algorithms, where doing so would be more dangerous and likely not desirable.
Data Recipes
Starting with version 1.8.1 (December 2019), a new BYOR feature – data recipes – was added to Driverless AI. The concept is simple: bring your Python code into Driverless AI to create new data and manipulate existing data to enhance datasets and elevate models. Data recipes utilize data APIs, datatable, pandas, numpy and other third-party libraries in Python, and belong to one of two types:
- a producing data recipe creates new dataset(s) by prototyping connectors, bringing data in and processing it. They are similar to data connectors in the way they import and munge data from external sources (see here);
- a modifying data recipe creates new dataset(s) by transforming a copy of an existing one (see here). A variety of data preprocessing (data prep) use cases fall into this category, including data munging, data quality checks, data labeling, unsupervised algorithms such as clustering or latent topic analysis, anomaly detection, etc.
Python Client
All Driverless AI features and actions found in the web user interface are also available via the Python Client API. See the docs for instructions on how to install the Python package here and for examples here. For Driverless AI users who are proficient in Python, scripting repeatable and reusable tasks with the Python Client is the next logical step in adopting and productionizing Driverless AI models. Examples of such tasks are re-fitting on the latest data and deploying the final ensemble, executing business-driven workflows that combine data prep and Driverless AI modeling, computing business reports and KPIs using models, implementing the Reject Inference method for credit approval, and other use cases.
R Client
The Driverless AI R Client parallels the functionality of the Python Client, emphasizing consistency with R language conventions, and appeals to data scientists practicing R. Moreover, R’s unparalleled visualization libraries extend model analysis beyond the already powerful tools and features found in the Driverless AI web interface. Let’s conclude with an example of using the ggplot2 package, based on the grammar of graphics by Leland Wilkinson (Chief Scientist at H2O.ai), to create a Response Distribution Chart (RDC) for analyzing binary classification models trained in Driverless AI. The RDC lets us analyze the distribution of responses (probabilities) generated by the model to assess its quality based on how well it distinguishes the two classes (see 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com, section 6). To plot the distributions we show the full workflow: how to connect, import and split data, run an experiment to create a model, and finally score data inside Driverless AI with the R client.
Before creating an R client script, install the Driverless AI R client package by downloading it from the Driverless AI instance itself:
Figure 2. Downloading the Driverless AI R Client package
After the download completes, RStudio lets you find and install the package from its menu: Tools -> Install Packages…
With the dai package installed, every script begins by connecting to the running Driverless AI instance it will use (change the host name, user id, and password):
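A minimal sketch follows; dai.connect() with uri, username and password arguments follows the H2O docs, but check your installed package version for the exact signature (the host below is a placeholder):

library(dai)
# connect to the running Driverless AI instance (replace host, user id, and password)
dai.connect(uri = 'http://mydai.example.com:12345', username = 'h2oai', password = 'h2oai')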
For our examples we will use the infamous Titanic dataset, which I saved and slightly enhanced on my local machine (you can download the dataset here). The following command uploads the data file from the local machine into Driverless AI:
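A sketch assuming dai.upload_dataset() takes a local file path and returns a dataset handle (the file name is a placeholder):

# upload titanic.csv from the local machine and keep a handle to the new dataset
titanic <- dai.upload_dataset('titanic.csv')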
While the Driverless AI pipeline automates the machine learning workflow, including creating and using validation splits, it is best practice to provide a separate test set so that Driverless AI can produce an out-of-sample score estimate for its final model. Splitting data on an appropriate target, fold, or time column is one of many built-in functions and is called like this:
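A sketch of a stratified train/test split; the dai.split_dataset() argument names and return value are assumptions based on the package docs and may differ by version:

# split roughly 80/20, stratified on the binary target 'survived'
splits <- dai.split_dataset(titanic,
                            output_name1 = 'titanic_train',
                            output_name2 = 'titanic_test',
                            ratio = 0.8,
                            seed = 1234,
                            target = 'survived')
train <- splits[[1]]   # assuming the call returns handles to the two new datasets
test  <- splits[[2]]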
Now we can start the automated machine learning workflow to predict survival chances for Titanic passengers, which results in a complete and fully featured classification model:
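A sketch using dai.train(); the argument names follow the H2O docs examples, and the accuracy/time/interpretability dial values are illustrative assumptions:

# run an experiment: binary classification on 'survived' with a held-out test set
model <- dai.train(training_frame = train,
                   testing_frame = test,
                   target_col = 'survived',
                   is_classification = TRUE,
                   is_timeseries = FALSE,
                   accuracy = 7, time = 5, interpretability = 7,
                   seed = 1234)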
If you log into Driverless AI you can observe the just-created model via the browser UI:
With a Driverless AI classifier in hand there are many ways to obtain predictions. One way is to download the file with the computed test predictions to the client and then read it into R:
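As a hedged alternative to downloading the predictions file manually, the sketch below scores the test frame with predict() and coerces the result to a local data frame (the exact return type and coercion may differ by package version):

# score the test set and pull the predictions into R
test_preds <- predict(model, newdata = test)
test_preds_df <- as.data.frame(test_preds)   # assuming an as.data.frame method is available
head(test_preds_df)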
Because we want to use features from the model in visualizations, there is a way to score a dataset and attach hand-picked features to the results (scoring all Titanic data in this case):
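A sketch that scores the full dataset and attaches hand-picked original columns by binding them locally; it assumes predictions come back in the same row order as the input and uses the standard Titanic column names (survived, sex, embarked), which may differ in the enhanced file:

# score the full Titanic dataset
all_preds <- as.data.frame(predict(model, newdata = titanic))
# read the same file locally and attach selected features to the predictions
titanic_local <- read.csv('titanic.csv', stringsAsFactors = FALSE)
scored <- cbind(titanic_local[, c('survived', 'sex', 'embarked')], all_preds)
# assume the last prediction column is the probability of class 1; rename it for plotting
names(scored)[ncol(scored)] <- 'p1'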
At this point the full power of R graphics is available to produce additional visualizations on the model, with predictions stored inside an R data frame. As promised, we show how to implement the method of Response Distribution Analysis:
The method is based on the Response Distribution Chart (RDC), which is simply a histogram of the output of the model, and on the simple observation that the RDC of an ideal model should have one peak at 0 and one peak at 1 (with heights given by the class proportion). Source: https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com First, we plot the RDC on all data:
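With the scored data frame sketched above, the RDC is just a histogram of the predicted probability (ggplot2; the column names come from that sketch):

library(ggplot2)
# Response Distribution Chart: histogram of predicted probabilities on all data
ggplot(scored, aes(x = p1)) +
  geom_histogram(bins = 50) +
  labs(title = 'Response Distribution Chart: all passengers',
       x = 'predicted probability of survival', y = 'count')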
A few more examples of the RDC follow – first with separate distributions for survived and not-survived passengers:
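Continuing with the same scored data frame (a sketch):

# overlay the response distributions for actual survivors and non-survivors
ggplot(scored, aes(x = p1, fill = factor(survived))) +
  geom_histogram(bins = 50, alpha = 0.6, position = 'identity') +
  labs(title = 'RDC by actual outcome',
       x = 'predicted probability of survival', fill = 'survived')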
The next plot compares RDCs for male and female passengers:
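A sketch faceting the same histogram by the assumed sex column:

# compare response distributions for male and female passengers
ggplot(scored, aes(x = p1)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ sex) +
  labs(title = 'RDC by gender', x = 'predicted probability of survival')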
Finally, RDC’s by port of embarkation:
H2O engineers hardly ever stop improving and enhancing the product with new features, so RDCs will likely become part of the model diagnostics tools in Driverless AI soon. But this example still serves its purpose of illustrating how to produce practically any type of model analysis with the help of the R Client.