Product Price Prediction: A Tidy Hyperparameter Tuning and Cross Validation Tutorial
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Product price estimation and prediction is one of the skills I teach frequently. It’s a great way to analyze competitor product information and your own company’s product data, and to develop key insights into which product features influence product prices. Learn how to model car prices and calculate depreciation curves using the brand new tune package for hyperparameter tuning of machine learning models. This is an Advanced Machine Learning Tutorial.
R Packages Covered
Machine Learning
- tune - Hyperparameter tuning framework
- parsnip - Machine learning framework
- dials - Grid search development framework
- rsample - Cross-validation and sampling framework
- recipes - Preprocessing pipeline framework
- yardstick - Accuracy measurements
EDA
- correlationfunnel - Detect relationships using binary correlation analysis
- DataExplorer - Investigate data cleanliness
- skimr - Summarize data by data type
Summary (Why read this?)
Hyperparameter tuning and cross-validation have previously been quite difficult using parsnip, the new machine learning framework that is part of the tidymodels ecosystem. Everything has changed with the introduction of the tune package, the tidymodels hyperparameter tuning framework that integrates:
- parsnip for machine learning
- recipes for preprocessing
- dials for grid search
- rsample for cross-validation
This is a machine learning tutorial where we model auto prices (MSRP) and estimate depreciation curves.
Estimate Depreciation Curves using Machine Learning
To estimate the depreciation curve, you develop a machine learning model that is tuned using a 3-Stage Nested Hyperparameter Tuning Process with 5-Fold Cross-Validation.
3-Stage Nested Hyperparameter Tuning Process
3-Stage Hyperparameter Tuning Process:
- Find Parameters: Use hyperparameter tuning on a “Training Dataset” that is sectioned into 5 folds. The output of Stage 1 is the parameter set.
- Compare and Select Best Model: Evaluate the performance on a hidden “Test Dataset”. The output of Stage 2 is the best model.
- Train Final Model: Once we have selected the best model, we train on the full dataset. This model goes into production.
Need to learn Data Science for Business? This is an advanced tutorial, but you can get the foundational skills, advanced machine learning, business consulting, and web application development using R, Shiny (Apps), H2O (ML), AWS (Cloud), and tidyverse (data manipulation and visualization). I recommend my 4-Course R-Track for Business Bundle.
Why Product Pricing is Important
Product price prediction is an important tool for businesses. Machine learning and algorithmic price modeling support several key pricing actions.
Which brands will customers pay more for?
Defend against inconsistent pricing
There is nothing more confusing to customers than pricing products inconsistently. Price too high, and customers fail to see the difference between competitor products and yours. Price too low and profit can suffer. Use machine learning to price products consistently based on the key product features that your customers care about.
Learn which features drive pricing and customer purchase decisions
In competitive markets, pricing is based on supply and demand economics. Sellers adjust prices to maximize profitability given market conditions. Use machine learning and explainable ML techniques to interpret the value of features such as brand (luxury vs economy), performance (engine horsepower), age (vehicle year), and more.
Develop price profiles (Appreciation / Depreciation Curves)
Another important concept for products like homes and automobiles is the ability to monitor the effect of time. Homes tend to appreciate, and machinery (including automobiles) tends to depreciate in value over time. Use machine learning to develop price curves. We’ll do just that in this tutorial by examining the MSRP of vehicles that were manufactured across time.
Depreciation Curve for Dodge Ram 1500 Pickup
Read on to learn how to make this plot
Product Price Tutorial
Onward – To the Product Price Prediction and Hyperparameter Tuning Tutorial.
1.0 Libraries and Data
Load the following libraries.
Next, get the data used for this tutorial. This data set, containing Car Features and MSRP, was scraped from "Edmunds and Twitter".
Download the Dataset
Car Features and MSRP Dataset
2.0 Data Understanding
Read the data using read_csv() and use clean_names() from the janitor package to clean up the column names.
msrp | make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
46135 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335 | 6 | MANUAL | rear wheel drive | 2 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 |
40650 | BMW | 1 Series | 2011 | premium unleaded (required) | 300 | 6 | MANUAL | rear wheel drive | 2 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 |
36350 | BMW | 1 Series | 2011 | premium unleaded (required) | 300 | 6 | MANUAL | rear wheel drive | 2 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 |
29450 | BMW | 1 Series | 2011 | premium unleaded (required) | 230 | 6 | MANUAL | rear wheel drive | 2 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 |
34500 | BMW | 1 Series | 2011 | premium unleaded (required) | 230 | 6 | MANUAL | rear wheel drive | 2 | Luxury | Compact | Convertible | 28 | 18 | 3916 |
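A minimal sketch of the read-and-clean step. The inline CSV text is a two-row stand-in for the downloaded file, and the raw header names in it are assumed for illustration, so swap in the path to your local copy of the dataset.

```r
library(readr)
library(janitor)

# Stand-in for the downloaded CSV (assumption: raw headers use spaces and
# capitals; the real file has many more columns and rows)
csv_text <- "Make,Model,Year,Engine HP,Number of Doors,MSRP
BMW,1 Series M,2011,335,2,46135
BMW,1 Series,2011,300,2,40650"

# I() tells read_csv() to treat the string as literal data, not a file path
raw_tbl <- read_csv(I(csv_text), show_col_types = FALSE)

# clean_names() converts headers to lower snake_case
car_prices_tbl <- clean_names(raw_tbl)

names(car_prices_tbl)
```

With snake_case names in place, columns such as engine_hp can be referenced without backticks in later steps.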
We can get a sense of the data using some ggplot2 visualizations and correlation analysis to detect key features in the dataset.
2.1 Engine Horsepower vs MSRP
First, let’s take a look at two factors that interest me:
- MSRP (vehicle price) - Our target, the manufacturer’s suggested retail price (a proxy for what customers pay for the vehicle), and
- Engine horsepower - A measure of product performance
We can see that there are two distinct groups in the plots. We can inspect a bit closer with gghighlight and patchwork. I specifically focus on the high end of both groups:
- Vehicles with MSRP greater than $650,000
- Vehicles with Engine HP greater than 350 and MSRP less than $10,000
Key Points:
- The first group makes total sense - Lamborghini, Bugatti and Maybach. These are all super-luxury vehicles.
- The second group is more interesting - It includes the BMW 8 Series and the Mercedes-Benz 600-Class. This is odd to me because these vehicles normally have a higher starting price.
2.2 Correlation Analysis - correlationfunnel
Next, I’ll use my correlationfunnel package to home in on the low price vehicles. I want to find which features correlate most with low prices.
Key Points:
Ah hah! The reduction in price is related to vehicle age. We can see that Vehicle Year less than 2004 is highly correlated with Vehicle Price (MSRP) less than $18,372. This explains why some of the Mercedes-Benz 600-Class vehicles (luxury brand) are in the low price group. Good - I’m not going crazy.
2.3 Engine HP, MSRP by Vehicle Age
Let’s explain this story by redoing the visualization from 2.1, this time segmenting by Vehicle Year. I’ll split the vehicles into those older than 2000 and those newer than 2000.
Key Points:
As Joshua Starmer would say, “Double Bam!” Vehicle Year is the culprit.
3.0 Exploratory Data Analysis
Ok, now that I have a sense of what is going on with the data, I need to figure out what’s needed to prepare the data for machine learning. The data set was web scraped. Datasets like this commonly have missing data, unformatted data, and a general lack of cleanliness, and they need a lot of preprocessing to get into the format needed for modeling. We’ll fix that up with a preprocessing pipeline.
Goal
Our goal in this section is to identify data issues that need to be corrected. We will then use this information to develop a preprocessing pipeline. We will make heavy use of skimr and DataExplorer, two EDA packages I highly recommend.
3.1 Data Summary - skimr
I’m using the skim() function to break down the data by data type so I can assess missing values, number of unique categories, numeric value distributions, etc.
Key Points:
- We have missing values in a few features
- We have several categorical features, some with few unique values and some with many
- We have pretty high skew in several numeric features
Let’s go deeper with DataExplorer.
3.2 Missing Values - DataExplorer
Let’s use the plot_missing() function to identify which columns have missing values. We’ll take care of these missing values using imputation in Section 5.2.
Key Point: We’ll zap missing values and replace with estimated values using imputation.
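A small sketch of the missing-value check, using a made-up five-row data frame in place of the car data. The share of missing values per column is computed by hand first, since that is the quantity plot_missing() charts.

```r
library(DataExplorer)

# Toy data with gaps in two columns (values are made up for illustration)
cars_tbl <- data.frame(
  engine_hp        = c(335, NA, 300, 230, NA),
  engine_cylinders = c(6, 6, NA, 4, 4),
  msrp             = c(46135, 40650, 36350, 29450, 34500)
)

# Share of missing values per column
missing_share <- colMeans(is.na(cars_tbl))
missing_share

# Bar chart of the same information
plot_missing(cars_tbl)
```

Columns with a non-zero share are the ones that need an imputation step in the recipe.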
3.3 Categorical Data - DataExplorer
Let’s use the plot_bar() function to identify the distribution of categories. We have several columns with a lot of categories that have few values. We can lump these into an “Other” Category.
One category that doesn’t show up in the plot is “market_category”. This has 72 unique values - too many to plot. Let’s take a look.
Hmmm. This is actually a “tag”-style category (where one vehicle can have multiple tags). We can clean this up using term-frequency feature engineering.
Key Points:
- We need to lump categories with few observations
- We can use text-based feature engineering on the market categories that have multiple categories.
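To see why market_category behaves like a set of tags, here is a short sketch that splits the comma-separated values and counts each tag. The three rows are taken from the table in Section 2.0; the splitting approach with tidyr is an illustration of the idea, not the term-frequency step used later in the recipe.

```r
library(dplyr)
library(tidyr)

# Three rows from the data preview in Section 2.0
tags_tbl <- tibble(
  model = c("1 Series M", "1 Series", "1 Series"),
  market_category = c(
    "Factory Tuner,Luxury,High-Performance",
    "Luxury,Performance",
    "Luxury"
  )
)

# One row per (vehicle, tag), then count how often each tag occurs
tag_counts <- tags_tbl %>%
  separate_rows(market_category, sep = ",") %>%
  count(market_category, sort = TRUE)

tag_counts
```

The 72 unique market_category values are combinations of a much smaller set of tags, which is what makes a per-tag indicator column a better encoding than a single factor.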
3.4 Numeric Data - DataExplorer
Let’s use plot_histogram() to investigate the numeric data. Some of these features are skewed and others are actually discrete (and best analyzed as categorical data). Skew won’t be much of an issue with tree-based algorithms so we’ll leave those alone (better from an explainability perspective). Discrete features like engine cylinders and number of doors are better represented as factors.
Here’s a closer look at engine cylinders vs MSRP. Definitely need to encode this one as a factor, which will be better because of the non-linear relationship to price. Note that cars with zero engine cylinders are electric.
Key Points:
- We’ll encode engine cylinders and number of doors as categorical data to better represent non-linearity
- Tree-based algorithms can handle skew, so I’ll leave the skewed features alone for explainability purposes. Note that you can use transformations like Box-Cox to reduce skew for linear algorithms, but this transformation makes it more difficult to explain the results to non-technical stakeholders.
4.0 Machine Learning Strategy
You’re probably thinking, “Wow - That’s a lot of work just to get to this step.”
Yes, you’re right. That’s why good Data Scientists are in high demand. Data Scientists who understand business, know data wrangling and visualization techniques, can identify how to treat data prior to Machine Learning, and can communicate what’s going on to the organization - these data scientists get good jobs.
At this point, it makes sense to re-iterate that I have a 4-Course R-Track Curriculum that will turn you into a good Data Scientist in 6-months or less. Yeahhhh!
ML Game Plan
We have a strategy that we are going to use to do what’s called “Nested Cross Validation”. It involves 3 Stages.
Stage 1: Find Parameters
We need to create several machine learning models and try them out. To accomplish this, we do the following:
- Initial Splitting - Separate into random training and test data sets
- Preprocessing - Make a pipeline to turn raw data into a dataset ready for ML
- Cross Validation Specification - Sample the training data into 5 splits
- Model Specification - Select model algorithms and identify key tuning parameters
- Grid Specification - Set up a grid using wise parameter choices
- Hyperparameter Tuning - Implement the tuning process
Stage 2: Select Best Model
Once we have the optimal algorithm parameters for each machine learning algorithm, we can move into stage 2. Our goal here is to compare each of the models on “Test Data”, data that were not used during the parameter tuning process. We re-train on the “Training Dataset” then evaluate against the “Test Dataset”. The best model has the best accuracy on this unseen data.
Stage 3: Retrain on the Full Dataset
Once we have the best model identified from Stage 2, we retrain the model using the best parameters from Stage 1 on the entire dataset. This gives us the best model to go into production with.
Ok, let’s get going.
5.0 Stage 1 - Preprocessing, Cross Validation, and Tuning
Time for machine learning! Just a few more steps and we’ll make and tune high-accuracy models.
5.1 Initial Train-Test Split - rsample
The first step is to split the data into Training and Testing sets. We use an 80/20 random split with the initial_split() function from rsample. The set.seed() function is used for reproducibility. Note that we have 11,914 cars (observations) of which 9,532 are randomly assigned to training and 2,382 are randomly assigned to testing.
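The split can be sketched as follows. The built-in mtcars data stands in for the car price data and the seed value is arbitrary, so the row counts here will not match the ones quoted above.

```r
library(rsample)

set.seed(123)  # arbitrary seed, chosen only for this sketch

# 80/20 random split; mtcars is a stand-in for the car price data
car_initial_split <- initial_split(mtcars, prop = 0.80)

train_tbl <- training(car_initial_split)
test_tbl  <- testing(car_initial_split)

c(train = nrow(train_tbl), test = nrow(test_tbl))
```

The split object only stores row indices, so training() and testing() can be called again later without re-sampling.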
5.2 Preprocessing Pipeline - recipes
What the heck is a preprocessing pipeline? A “preprocessing pipeline” (aka a “recipe”) is a set of preprocessing steps that transform raw data into data formatted for machine learning. The key advantage to a preprocessing pipeline is that it can be re-used on new data. So when you go into production, you can use the recipe to process new incoming data.
Remember in Section 3.0 when we used EDA to identify issues with our data? Now it’s time to fix those data issues using the Training Data Set. We use the recipe() function from the recipes package then progressively add step_* functions to transform the data.
The recipe we implement applies the following transformations:
- Encoding Character and Discrete Numeric data to Categorical Data.
- Text-Based Term Frequency Feature Engineering for the Market Category column
- Consolidate low-frequency categories
- Impute missing data using K-Nearest Neighbors with 5 neighbors (kNN is a fast and accurate imputation method)
- Remove unnecessary columns (e.g. model)
The preprocessing recipe hasn’t yet changed the data. We’ve just come up with the recipe. To transform the data, we use bake(). I create a new variable to hold the preprocessed training dataset.
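A condensed sketch of such a recipe on a made-up ten-row training set. It covers the encoding, lumping, imputation, and column-removal steps listed above and leaves out the term-frequency step for market_category. The data values, the 0.2 lumping threshold, and the variable names are assumptions for illustration. Recent recipes releases name the kNN step step_impute_knn(); older releases called it step_knnimpute().

```r
library(recipes)

# Made-up training rows (assumption: stand-in for the real training set)
train_tbl <- data.frame(
  msrp  = c(46135, 40650, 36350, 29450, 34500, 21000, 18500, 52000, 27000, 31000),
  make  = c(rep("BMW", 5), rep("Audi", 4), "Saab"),
  model = paste("model", 1:10),
  engine_cylinders = c(6, 6, 6, 4, 4, 4, 4, 8, 4, 6),
  engine_hp = c(335, 300, NA, 230, 230, 170, 150, 400, NA, 250),
  year  = c(2011, 2011, 2011, 2011, 2011, 2005, 2003, 2015, 2008, 2010)
)

preprocessing_recipe <- recipe(msrp ~ ., data = train_tbl) %>%
  # Encode character and discrete numeric data as categorical
  step_string2factor(all_nominal_predictors()) %>%
  step_mutate(engine_cylinders = factor(engine_cylinders)) %>%
  # Lump makes seen in fewer than 20% of rows into "other"
  step_other(make, threshold = 0.2, other = "other") %>%
  # Fill missing horsepower from the 5 most similar vehicles
  step_impute_knn(engine_hp, neighbors = 5) %>%
  # Drop columns not used for modeling
  step_rm(model) %>%
  prep()

# Nothing is transformed until bake() applies the trained recipe
train_processed_tbl <- bake(preprocessing_recipe, new_data = train_tbl)
train_processed_tbl
```

Because prep() stores what it learned from the training rows (factor levels, neighbors), the same recipe object can later be baked against test or production data.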
We can use DataExplorer to verify that the dataset has been processed. First, let’s inspect the Categorical Features.
- The category distributions have been fixed - an “other” category is now present, lumping together infrequent categories.
- The text feature processing has added several new columns beginning with “tf_market_category_”.
- Number of doors and engine cylinders are now categorical.
We can review the Numeric Features. The remaining numeric features have been left alone to preserve explainability.
5.3 Cross Validation Specification - rsample
Now that we have a preprocessing recipe and the initial training / testing split, we can develop the Cross Validation Specification. Standard practice is to use either a 5-Fold or 10-Fold cross validation:
- 5-Fold: I prefer 5-fold cross validation to speed up results by using 5 folds and an 80/20 split in each fold
- 10-Fold: Others prefer a 10-fold cross validation to use more training data with a 90/10 split in each fold. The downside is that this calculation requires twice as many models as 5-fold, which is already an expensive (time consuming) operation.
To implement 5-Fold Cross Validation, we use vfold_cv() from the rsample package. Make sure to use your training dataset (training()) and then apply the preprocessing recipe using bake() before the vfold_cv() cross validation sampling. You now have specified the 5-Fold Cross Validation specification for your training dataset.
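A minimal sketch of the cross validation specification. Here mtcars stands in for the baked training data and the seed is arbitrary.

```r
library(rsample)

set.seed(123)  # arbitrary seed for a reproducible fold assignment

# 5-fold cross validation on a stand-in for the baked training data
car_cv_folds <- vfold_cv(mtcars, v = 5)
car_cv_folds

# Each fold fits on roughly 80% of rows and assesses on the remaining 20%
first_fold <- car_cv_folds$splits[[1]]
c(
  analysis   = nrow(analysis(first_fold)),
  assessment = nrow(assessment(first_fold))
)
```

Every row lands in the assessment set of exactly one fold, which is what makes the averaged error a fair estimate across the whole training set.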
5.4 Model Specifications - parsnip
We’ll specify two competing models:
- glmnet - Uses an Elastic Net, which combines the LASSO and Ridge Regression techniques. This is a linear algorithm, which can have difficulty with skewed numeric data, which is present in our numeric features.
- xgboost - A tree-based algorithm that uses gradient boosted trees to develop high-performance models. Tree-based algorithms are not sensitive to skewed numeric data, which can easily be sectioned by the tree-splitting process.
5.4.1 glmnet - Model Spec
We use the linear_reg() function from parsnip to set up the initial specification. We use the tune() function from tune to identify the tuning parameters. We use set_engine() from parsnip to specify the engine as the glmnet library.
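A sketch of what this specification can look like. The object name glmnet_model is an assumption; nothing is fitted at this point, so the glmnet package itself is only needed later at tuning time.

```r
library(parsnip)
library(tune)

glmnet_model <- linear_reg(
  mode    = "regression",
  penalty = tune(),  # amount of regularization, to be tuned
  mixture = tune()   # blend of ridge (0) and LASSO (1), to be tuned
) %>%
  set_engine("glmnet")

glmnet_model
```

The tune() placeholders mark which arguments the grid search is allowed to vary.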
5.4.2 xgboost - Model Spec
A similar process is used for the XGBoost model. We specify boost_tree() and identify the tuning parameters. We set the engine to the xgboost library. Note that an update to the xgboost library has changed the default objective from reg:linear to reg:squarederror. I’m specifying this by adding an objective argument in set_engine() that gets passed to the underlying xgb.train(params = list([goes here])).
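A sketch of the matching xgboost specification, using the 1000 trees and the three tuning parameters mentioned in Section 5.5.2. The object name xgboost_model is an assumption.

```r
library(parsnip)
library(tune)

xgboost_model <- boost_tree(
  mode       = "regression",
  trees      = 1000,     # fixed number of trees
  min_n      = tune(),   # minimum observations in a node
  tree_depth = tune(),   # maximum depth of each tree
  learn_rate = tune()    # shrinkage applied at each boosting step
) %>%
  # Extra engine arguments are forwarded to the xgboost training call
  set_engine("xgboost", objective = "reg:squarederror")

xgboost_model
```

Keeping trees fixed and tuning only three parameters keeps the grid small enough to search in a reasonable time.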
5.5 Grid Specification - dials
Next, we need to set up the grid that we plan to use for Grid Search. Grid Search is the process of specifying a variety of parameter values to be used with your model. The goal is to find which combination of parameters yields the best accuracy (lowest prediction error) for each model.
We use the dials package to set up the hyperparameters. Key functions:
- parameters() - Used to specify ranges for the tuning parameters
- grid_*** - Grid functions including max entropy, hypercube, etc
5.5.1 glmnet - Grid Spec
For the glmnet model, we specify a parameter set, parameters(), that includes penalty() and mixture().
Next, we use the grid_max_entropy() function to make a grid of 20 values using the parameters. I use set.seed() to make this random process reproducible.
Because this is a 2-Dimensional Hyperparameter Space (only 2 tuning parameters), we can visualize what the grid_max_entropy() function did. The grid selections were evenly spaced out to create uniformly distributed hyperparameter selections. Note that the penalty parameter is on the Log Base 10-scale by default (refer to the dials::penalty() function documentation). This functionality results in smarter choices for critical parameters, a big benefit of using the tidymodels framework.
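The glmnet grid described above can be sketched like this. The seed and object names are assumptions. Depending on your dials version, grid_max_entropy() may print a deprecation notice pointing to a newer space-filling grid function; the call shown here follows the one named in this tutorial.

```r
library(dials)

# Parameter set with the default ranges for both tuning parameters
glmnet_params <- parameters(penalty(), mixture())

set.seed(123)  # arbitrary seed; the grid is generated randomly
glmnet_grid <- grid_max_entropy(glmnet_params, size = 20)
glmnet_grid

# penalty is spread evenly on the log10 scale but returned in natural units
range(log10(glmnet_grid$penalty))
```

Plotting penalty on a log10 axis against mixture shows the even spacing; on a linear axis nearly all penalty values would bunch up near zero.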
5.5.2 xgboost - Grid Spec
We can follow the same process for the xgboost model, specifying a parameter set using parameters(). Selecting the tuning parameters for our grid is made super easy by the min_n(), tree_depth(), and learn_rate() functions from dials.
Next, we set up the grid space. Because this is now a 3-Dimensional Hyperparameter Space, I up the size of the grid to 30 points. Note that this will drastically increase the time it takes to tune the models because the xgboost algorithm must be run 30 x 5 = 150 times. Each time it runs with 1000 trees, so we are talking 150,000 tree calculations. My point is that it will take a bit to run the algorithm once we get to Section 5.6 Hyperparameter Tuning.
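A sketch of the 3-parameter grid together with the cost arithmetic from the paragraph above. Object names and the seed are assumptions.

```r
library(dials)

xgboost_params <- parameters(min_n(), tree_depth(), learn_rate())

set.seed(123)  # arbitrary seed
xgboost_grid <- grid_max_entropy(xgboost_params, size = 30)

# Tuning cost: every grid point is fitted once per cross validation fold
n_folds <- 5
n_trees <- 1000
n_fits  <- nrow(xgboost_grid) * n_folds

c(model_fits = n_fits, trees_grown = n_fits * n_trees)
```

Halving the grid size or the number of trees halves the work, which is a quick way to get a rough first pass before committing to the full search.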
5.6 Hyperparameter Tuning - tune
Now that we have specified the recipe, models, cross validation spec, and grid spec, we can use the tune package to bring them all together to implement the Hyperparameter Tuning with 5-Fold Cross Validation.
5.6.1 glmnet - Hyperparameter Tuning
Tuning the model using 5-fold Cross Validation is straightforward with the tune_grid() function. We specify the formula, model, resamples, grid, and metrics.
The only piece that I haven’t explained is the metrics. These come from the yardstick package, which has functions including mae(), mape(), rmse() and rsq() for calculating regression accuracy. We can specify any number of these using metric_set(). Just make sure to use only regression metrics since this is a regression problem. For classification, you can use all of the normal measures like AUC, Precision, Recall, F1, etc.
Use the show_best() function to quickly identify the best hyperparameter values.
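A scaled-down sketch of the tuning call, with mtcars (predicting mpg) standing in for the baked car data and a 5-point grid so it runs in seconds. The argument names object and preprocessor follow recent tune releases and are an assumption about your installed version; the early release this tutorial was written against used formula and model instead. It also assumes the glmnet package is installed.

```r
library(parsnip)
library(rsample)
library(dials)
library(tune)
library(yardstick)

set.seed(123)
cv_folds <- vfold_cv(mtcars, v = 5)

glmnet_model <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

set.seed(123)
glmnet_grid <- grid_max_entropy(parameters(penalty(), mixture()), size = 5)

# Fit every grid point on every fold and score it with the chosen metrics
glmnet_results <- tune_grid(
  object       = glmnet_model,
  preprocessor = mpg ~ .,
  resamples    = cv_folds,
  grid         = glmnet_grid,
  metrics      = metric_set(mae, mape, rmse, rsq)
)

# Top 3 parameter combinations ranked by cross-validated MAE
best_tbl <- show_best(glmnet_results, metric = "mae", n = 3)
best_tbl
```

The mean column is the metric averaged over the 5 folds, and std_err gives a sense of how stable that average is.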
Key Point:
A key observation is that the Mean Absolute Error (MAE) is $16,801, meaning the model is performing poorly. This is partly because we left the numeric features untransformed. Try updating the recipe with step_boxcox() and see if you can do better. Note that if you transform MSRP, the predictions come back on the transformed scale, so you need to invert the Box-Cox transformation (using its estimated power) before calculating the MAE in dollars. But I digress.
5.6.2 xgboost - Hyperparameter Tuning
Follow the same tuning process for the xgboost model using tune_grid().
Warning: This takes approximately 20 minutes to run with a 6-core parallel backend. I have recommendations to speed this up at the end of the article.
We can see the best xgboost tuning parameters with show_best().
Key Point:
A key observation is that the Mean Absolute Error (MAE) is $3,784, meaning the xgboost model is performing more than 4X better than the glmnet model ($16,801 / $3,784 is about 4.4). However, we won’t know for sure until we move to Stage 2, Evaluation.
6.0 Stage 2 - Compare and Select Best Model
Feels like a whirlwind to get to this point. You’re doing great. Just a little bit more to go. Now, let’s compare our models.
The “proper” way to perform model selection is not to use the cross validation results because in theory we’ve optimized the results to the training data. This is why we left out the testing set by doing initial splitting at the beginning.
Now, do I agree with this? Let me just say that normally the Cross Validation Winner is the true winner.
But for the sake of showing you the correct way, I will continue with the model comparison.
6.1 Select the Best Parameters
Use the select_best() function from the tune package to get the best parameters. We’ll do this for both the glmnet and xgboost models.
6.2 Finalize the Models
Next, we can use the finalize_model() function to apply the best parameters to each of the models. Do this for both the glmnet and xgboost models.
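A small sketch of the finalize step for the glmnet specification. To stay self-contained it uses a hand-made one-row tibble in place of the output of select_best(); the parameter values are made up.

```r
library(parsnip)
library(tune)

glmnet_model <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Stand-in for the one-row result of select_best(); values are made up
params_glmnet_best <- tibble::tibble(penalty = 0.01, mixture = 0.5)

# Replace the tune() placeholders with the selected values
glmnet_stage_2_model <- finalize_model(glmnet_model, params_glmnet_best)
glmnet_stage_2_model
```

The finalized specification has no placeholders left, so it can be passed straight to fit().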
6.3 Calculate Performance on Test Data
We can create a helper function, calc_test_metrics(), to calculate the performance on the test set.
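One way such a helper could look. The signature and body below are hypothetical, not the original helper: it re-trains a finalized specification on the training portion of a split and scores the held-out portion. A plain lm engine on mtcars keeps the sketch fast and self-contained.

```r
library(parsnip)
library(rsample)
library(yardstick)

# Hypothetical helper; signature and internals are assumptions
calc_test_metrics <- function(formula, model_spec, split) {
  train_tbl <- training(split)
  test_tbl  <- testing(split)

  # Re-train on the training set, then predict the unseen test set
  model_fit <- fit(model_spec, formula = formula, data = train_tbl)
  preds     <- predict(model_fit, new_data = test_tbl)

  # Pair actual and predicted values for scoring
  outcome_name <- all.vars(formula)[1]
  results_tbl <- data.frame(
    truth    = test_tbl[[outcome_name]],
    estimate = preds$.pred
  )

  # Default regression metrics: rmse, rsq, mae
  metrics(results_tbl, truth = truth, estimate = estimate)
}

set.seed(123)
car_split <- initial_split(mtcars, prop = 0.80)
lm_spec   <- linear_reg() %>% set_engine("lm")

test_metrics_tbl <- calc_test_metrics(mpg ~ wt + hp, lm_spec, car_split)
test_metrics_tbl
```

Calling the same helper with each finalized model and the same split gives directly comparable test-set numbers.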
6.3.1 glmnet - Test Performance
Use the calc_test_metrics() function to calculate the test performance on the glmnet model.
6.3.2 xgboost - Test Performance
Use the calc_test_metrics() function to calculate the test performance on the xgboost model.
6.4 Model Winner!
Winner: The xgboost model had an MAE of $3,209 on the test data set. We select this model with the following parameters to move forward.
7.0 Stage 3 - Final Model (Full Dataset)
Now that we have a winner from Stage 2, we move forward with the xgboost model and train on the full data set. This gives us the best possible model to move into production with.
Use the fit() function from parsnip to train the final xgboost model on the full data set. Make sure to apply the preprocessing recipe to the original car_prices_tbl.
8.0 Making Predictions on New Data
We’ve gone through the full analysis and have a “production-ready” model. Now for some fun - let’s use it on some new data. To avoid repetitive code, I created the helper function, predict_msrp(), to quickly apply my final model and preprocessing recipe to new data.
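A sketch of the final-fit-plus-helper pattern. Everything here is a stand-in: mtcars plays the role of car_prices_tbl, mpg plays the role of msrp, a simple lm engine replaces the tuned xgboost model, and the body of predict_msrp() is hypothetical.

```r
library(parsnip)
library(recipes)

# Stand-in for the full dataset
full_tbl <- mtcars[, c("mpg", "wt", "hp")]

# Train the recipe on the full data
preprocessing_recipe <- recipe(mpg ~ ., data = full_tbl) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

# Fit the final model on the fully preprocessed dataset
final_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = bake(preprocessing_recipe, new_data = full_tbl))

# Hypothetical helper: apply the same preprocessing, then predict
predict_msrp <- function(new_data, model = final_model, rec = preprocessing_recipe) {
  baked_tbl <- bake(rec, new_data = new_data)
  predict(model, new_data = baked_tbl)
}

new_vehicle_tbl <- data.frame(wt = 3.0, hp = 150)
predict_msrp(new_vehicle_tbl)
```

Routing new data through the same prepped recipe is what keeps production inputs consistent with what the model saw during training.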
8.1 What does a
I’ll simulate a 2008 Dodge Ram 1500 Pickup and calculate the predicted price.
8.2 What’s the “Luxury” Effect?
Let’s play with the model a bit. Next, let’s change the market_category from “N/A” to “Luxury”. The price just jumped from $30K to $50K.
8.3 How Do Prices Vary By Model Year?
Let’s see how our XGBoost model views the effect of changing the model year. First, we’ll create a dataframe of Dodge Ram 1500 Pickups where the only feature that changes is the model year.
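A sketch of how such a scenario table can be built. The baseline feature values and the 1990 to 2017 year range are assumptions for illustration, and only a few of the dataset’s columns are shown.

```r
library(dplyr)
library(tidyr)

# One baseline vehicle (placeholder feature values, not real specs)
dodge_ram_tbl <- tibble(
  make      = "Dodge",
  model     = "Ram Pickup 1500",
  year      = 2008,
  engine_hp = 300
)

# Repeat the vehicle once per model year, changing nothing else
year_scenarios_tbl <- dodge_ram_tbl %>%
  select(-year) %>%
  crossing(year = 1990:2017)

year_scenarios_tbl
```

Passing each row through predict_msrp() then yields one predicted price per model year, which is the raw material for the depreciation curve.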
Next, we can use ggplot2 to visualize this “Year” effect, which is essentially a Depreciation Curve. We can see that there’s a huge depreciation between models built in the 1990s and those built in the 2000s and later.
8.4 Which Features Are Most Important?
Finally, we can check which features have the most impact on the MSRP using the vip() function from the vip (variable importance) package.
9.0 Conclusion and Next Steps
The tune package presents a much needed tool for Hyperparameter Tuning in the tidymodels ecosystem. We saw how useful it was to perform 5-Fold Cross Validation, a standard in improving machine learning performance.
An advanced machine learning package that I’m a huge fan of is h2o. H2O provides automatic machine learning, which takes a similar approach (minus the Stage 2 - Comparison Step) by automating the cross validation and hyperparameter tuning process. Because H2O is written in Java, H2O is much faster and more scalable, which is great for large-scale machine learning projects. If you are interested in learning h2o, I recommend my 4-Course R-Track for Business Program - The 201 Course teaches Advanced ML with H2O.
The last step is to productionalize the model inside a Shiny Web Application. I teach two courses on production with Shiny:
- Shiny Dashboards - Build 2 Predictive Web Dashboards
- Shiny Developer with AWS - Build Full-Stack Web Applications in the Cloud
The two Shiny courses are part of the 4-Course R-Track for Business Program.