Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We’re at the final day of Business Science Demo Week. Today we are demoing the h2o package for machine learning on time series data. What’s demo week? Every day this week we are demoing an R package: tidyquant (Monday), timetk (Tuesday), sweep (Wednesday), tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Today you’ll see how we can use timetk + h2o to get really accurate time series forecasts. Here we go!
Previous Demo Week Demos
- class(Monday) <- tidyquant
- class(Tuesday) <- timetk
- class(Wednesday) <- sweep
- class(Thursday) <- tibbletime
h2o: What’s It Used For?
The h2o package is a product offered by H2O.ai that contains a number of cutting-edge machine learning algorithms, performance metrics, and auxiliary functions to make machine learning both powerful and easy. One of the main benefits of H2O is that it can be deployed on a cluster (this will not be discussed today). From the R perspective, there are four main uses:
- Data Manipulation: Merging, grouping, pivoting, imputing, splitting into training/test/validation sets, etc.
- Machine Learning Algorithms: Very sophisticated algorithms in both supervised and unsupervised categories. Supervised include deep learning (neural networks), random forest, generalized linear model, gradient boosting machine, naive Bayes, stacked ensembles, and xgboost. Unsupervised include generalized low rank models, k-means and PCA. There’s also Word2vec for text analysis. The latest stable release also has AutoML: automatic machine learning, which is really cool as we’ll see in this post!
- Auxiliary ML Functionality: Performance analysis and grid hyperparameter search.
- Production, Map/Reduce and Cloud: Capabilities for productionizing into Java environments, cluster deployment with Hadoop / Spark (Sparkling Water), deploying in cloud environments (Azure, AWS, Databricks, etc.).
Sticking with the theme for the week, we’ll go over how h2o can be used for time series machine learning as an advanced algorithm. We’ll use h2o locally to develop a high accuracy time series model on the same data set (beer_sales_tbl) from the timetk and sweep posts. This is a supervised regression problem.
Load Libraries
We’ll need three libraries today:
- h2o: Awesome machine learning library
- tidyquant: For getting data and loading the tidyverse behind the scenes
- timetk: Toolkit for working with time series in R
IMPORTANT FOR INSTALLING H2O
For h2o, you must install the latest stable release. Select H2O » Latest Stable Release » Install in R. Then follow the instructions exactly.
Installing Other Packages
If you haven’t done so already, install the timetk and tidyquant packages:
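A minimal sketch of that install step, assuming both packages are pulled from CRAN (h2o itself should be installed separately, as described above):

```r
# One-time setup: install timetk and tidyquant from CRAN.
install.packages(c("timetk", "tidyquant"))
```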
Loading Libraries
Load the libraries.
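Something along these lines should do it. The comments mirror the package descriptions above; if your tidyquant version does not attach the tidyverse, loading it explicitly with library(tidyverse) is a safe fallback.

```r
library(h2o)        # machine learning
library(tidyquant)  # data retrieval; per this post, also loads the tidyverse
library(timetk)     # toolkit for working with time series
```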
Data
We’ll get data using the tq_get() function from tidyquant. The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales.
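A sketch of the download is below. Two things here are assumptions rather than details stated in this post: the FRED series ID "S4248SM144NCEN" for this data set, and the exact date range. The call needs an internet connection.

```r
library(tidyquant)

# ASSUMPTION: "S4248SM144NCEN" is the FRED series ID for Beer, Wine, and
# Distilled Alcoholic Beverages Sales. The from/to dates are illustrative.
beer_sales_tbl <- tq_get(
  "S4248SM144NCEN",
  get  = "economic.data",
  from = "2010-01-01",
  to   = "2017-10-27"
)
```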
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting, and it’s a good idea to identify spots where we will split the data into training, test and validation sets.
Now that you have a feel for the time series we’ll be working with today, let’s move on to the demo!
DEMO: h2o + timetk, Time Series Machine Learning
We’ll follow a similar workflow for time series machine learning from the timetk + linear regression post on Tuesday. However, this time we’ll swap out the lm() function for h2o.automl() to get superior accuracy!
Time Series Machine Learning
Time series machine learning is a great way to forecast time series data, but before we get started here are a couple of pointers for this demo:
- Key Insight: The time series signature ~ timestamp information expanded column-wise into a feature set ~ is used to perform machine learning.
- Objective: We’ll predict the next 8 months of data for 2017 using the time series signature. We’ll then compare the results to the two prior demos that predicted the same data using different methods: timetk + lm() (linear regression) and sweep + auto.arima() (ARIMA).
We’ll go through a workflow that can be used to perform time series machine learning.
Step 0: Review data
Just to show our starting point, let’s print out our beer_sales_tbl. We use glimpse() to take a quick peek at the data.
Step 1: Augment Time Series Signature
The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame. We’ll again use glimpse() for quick inspection. See how there are now 30 features. Not all will be important, but some will.
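To see the expansion on something small, here is a sketch that runs the function on a synthetic stand-in for beer_sales_tbl. The toy_tbl name and its values are made up for illustration; the real data has the same date-plus-value shape.

```r
library(tibble)
library(timetk)

# Synthetic stand-in for beer_sales_tbl: 24 monthly rows with made-up values.
toy_tbl <- tibble(
  date  = seq(as.Date("2015-01-01"), by = "month", length.out = 24),
  price = seq(100, by = 5, length.out = 24)
)

# Adds the time series signature as new columns next to date and price.
toy_aug <- tk_augment_timeseries_signature(toy_tbl)

# Inspect the new feature names (year, quarter, month, month.lbl, ...).
names(toy_aug)
```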
Step 2: Prep the Data for H2O
We need to prepare the data in a format for H2O. First, let’s remove any unnecessary columns such as dates or those with missing values, and change the ordered classes to plain factors. We prefer dplyr operations for these steps.
Let’s split into training, validation and test sets following the time ranges in the visualization above.
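The cleaning and splitting can be sketched as follows. This is one way to write it with current dplyr verbs, not necessarily the exact code behind this post, and it again uses synthetic data (made-up values covering January 2014 through August 2017) so it runs on its own. The year cut-offs follow the ranges described in the modeling step: training before 2016, validation in 2016, test in 2017.

```r
library(dplyr)
library(tibble)
library(timetk)

# Synthetic stand-in for the augmented beer sales data (made-up values).
toy_aug <- tibble(
  date  = seq(as.Date("2014-01-01"), by = "month", length.out = 44),
  price = seq(100, by = 5, length.out = 44)
) %>%
  tk_augment_timeseries_signature()

toy_clean <- toy_aug %>%
  select(-where(~ inherits(.x, "Date"))) %>%  # drop date columns
  select(where(~ !anyNA(.x))) %>%             # drop columns with missing values
  mutate(across(where(is.ordered), ~ factor(as.character(.x))))  # ordered -> plain factor

# Split by calendar year using the `year` feature added by timetk.
train_tbl <- filter(toy_clean, year < 2016)
valid_tbl <- filter(toy_clean, year == 2016)
test_tbl  <- filter(toy_clean, year == 2017)
```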
Step 3: Model with H2O
First, fire up h2o. This will initialize the Java Virtual Machine (JVM) that H2O uses locally.
We change our data to an H2OFrame object that can be interpreted by the h2o package.
Set the names that h2o will use as the target and predictor variables.
Apply any regression model to the data. We’ll use h2o.automl.
- x = x: The names of our feature columns.
- y = y: The name of our target column.
- training_frame = train_h2o: Our training set consisting of data from 2010 to the start of 2016.
- validation_frame = valid_h2o: Our validation set consisting of data in the year 2016. H2O uses this to ensure the model does not overfit the data.
- leaderboard_frame = test_h2o: The models get ranked based on MAE performance against this set.
- max_runtime_secs = 60: We supply this to speed up H2O’s modeling. The algorithm has a large number of complex models so we want to keep things moving at the expense of some accuracy.
- stopping_metric = "deviance": Use deviance as the stopping metric, which provides very good results for MAPE.
Next we extract the leader model.
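The modeling steps above, from starting H2O to extracting the leader, can be sketched as a single helper. The function name fit_automl_leader and the default target column "price" are assumptions for illustration; the h2o.automl() arguments are the ones listed above. Calling it starts a local H2O instance and runs AutoML for up to 60 seconds.

```r
library(h2o)

# Hypothetical helper. train_tbl, valid_tbl and test_tbl are the data frames
# from the split step; `target` is ASSUMED to be the sales column ("price").
fit_automl_leader <- function(train_tbl, valid_tbl, test_tbl, target = "price") {
  h2o.init()  # start (or connect to) a local H2O JVM

  # Convert the data frames to H2OFrame objects.
  train_h2o <- as.h2o(train_tbl)
  valid_h2o <- as.h2o(valid_tbl)
  test_h2o  <- as.h2o(test_tbl)

  # Target and predictor names.
  y <- target
  x <- setdiff(names(train_h2o), y)

  automl_models_h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame    = train_h2o,
    validation_frame  = valid_h2o,
    leaderboard_frame = test_h2o,
    max_runtime_secs  = 60,
    stopping_metric   = "deviance"
  )

  # Return the top-ranked model along with the test frame for later prediction.
  list(leader = automl_models_h2o@leader, test_h2o = test_h2o)
}
```

Usage would look like fit <- fit_automl_leader(train_tbl, valid_tbl, test_tbl), after which fit$leader holds the leader model.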
Step 4: Predict
Generate predictions using h2o.predict() on the test data.
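A minimal sketch of the prediction step, wrapped in a hypothetical helper (predict_as_vector is not part of h2o). It assumes a running H2O instance, a fitted regression model such as the AutoML leader, and the test set as an H2OFrame.

```r
library(h2o)

# Hypothetical helper: `model` is a fitted H2O regression model and
# `test_h2o` is the test set as an H2OFrame.
predict_as_vector <- function(model, test_h2o) {
  pred_h2o <- h2o.predict(model, newdata = test_h2o)
  # For regression, the prediction column is named "predict".
  as.data.frame(pred_h2o)$predict
}
```

Pulling the predictions back into a plain numeric vector makes it easy to line them up against the actual values later.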
Step 5: Evaluate Performance
There are a few ways to evaluate performance. We’ll go through the easy way, which is h2o.performance(). This yields preset values that are commonly used to compare regression models, including root mean squared error (RMSE) and mean absolute error (MAE).
Our preference for this assessment is mean absolute percentage error (MAPE), which is not included above. However, we can easily calculate it. We can investigate the error on our test set (actuals vs predictions).
For comparison’s sake, we can calculate a few residual metrics.
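These metrics need nothing beyond base R. The sketch below uses a hypothetical helper, forecast_errors, with the convention error = actual - prediction; the numbers in the example call are made up and are not results from this demo.

```r
# Residual metrics for actuals vs predictions (plain numeric vectors).
forecast_errors <- function(actual, pred) {
  error     <- actual - pred
  error_pct <- error / actual
  c(
    me   = mean(error),           # mean error
    rmse = sqrt(mean(error^2)),   # root mean squared error
    mae  = mean(abs(error)),      # mean absolute error
    mape = mean(abs(error_pct)),  # mean absolute percentage error
    mpe  = mean(error_pct)        # mean percentage error
  )
}

# Made-up numbers, for illustration only.
forecast_errors(actual = c(100, 200), pred = c(90, 220))
```

MAPE comes back as a fraction, so multiply by 100 to express it as a percentage.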
And The Winner of Demo Week Is…
The MAPE for the combination of h2o + timetk is superior to the two prior demos:
- timetk + h2o: MAPE = 3.9% (This demo)
- timetk + linear regression: MAPE = 4.3% (timetk demo)
- sweep + ARIMA: MAPE = 4.3% (sweep demo)
A question for the interested reader to figure out: What happens to the accuracy when you average the predictions of all three different methods? Try it to find out.
Note that the accuracy of time series machine learning may not always be superior to ARIMA and other forecast techniques including those implemented by prophet and GARCH methods. The data scientist has a responsibility to test different methods and to select the right tool for the job.
HaLLowEen TRick oR TrEat BoNuS!
We are going to visualize the forecast compared to the actual values, but this time taking a cue from @lenkiefer’s theme_spooky described in one of his recent posts, Mortgage Rates are Low!
We’re going to need to load a few libraries to get set up. The biggest challenge is the fonts, but there’s a really cool package called extrafont that we can use. We’ll use extrafont to load the Chiller font set. Load the bonus library.
Next, you’ll need to set up the Chiller font. Revolution Analytics has a great article, How to Use Your Favorite Fonts in R Charts, which will get you up and running with extrafont. IMPORTANT: Make sure you go through the process of loading your system fonts with font_import().
Once fonts are imported, you can load them using loadfonts().
We’ll use Len’s script for theme_spooky(). I highly encourage you to use theme_spooky() all month of October around the office. Very spooky, and surprisingly engaging. 🙂
Now let’s create the final visualization so we can see our spooky forecast… Conclusion from the plot: It’s scary how accurate h2o is.
Next Steps
We’ve only scratched the surface of h2o. There’s more to learn, including working with classifiers and unsupervised learning. Here are a few resources to help you along the way:
- H2O Website
- H2O documentation
- H2O GitHub Page
- Business Science Articles:
Announcements
We have a busy couple of weeks. In addition to Demo Week, we have:
Facebook LIVE DataTalk
Matt was recently hosted on Experian DataLabs live webcast, #DataTalk, where he spoke about Machine Learning in Human Resources. The talk already has 80K+ views and is growing!! Check it out if you are interested in #rstats, #hranalytics and #MachineLearning.
Machine Learning to Reduce Employee Attrition w/ Business Science, LLC
Posted by Experian News on Thursday, October 26, 2017
EARL
On Friday, November 3rd, Matt will be presenting at the EARL Conference on HR Analytics: Using Machine Learning to Predict Employee Turnover.
Hey #rstats. I'll be presenting @earlconf on #MachineLearning applications in #HumanResources. Get 15% off tickets: https://t.co/b6JUQ6BSTl
— Matt Dancho (@mdancho84) October 11, 2017
Courses
Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to our popular articles:
- HR Analytics: Using Machine Learning to Predict Employee Turnover
- Sales Analytics: How To Use Machine Learning to Predict and Optimize Product Backorders
The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.
About Business Science
Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!
Follow Business Science on Social Media
- @bizScienc is on twitter!
- Check out our Facebook page!
- Check us out on LinkedIn!
- Sign up for our insights blog to stay updated!
- If you like our software, star our GitHub packages!