Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In previous posts, I used popular machine learning algorithms to fit models that predict MPG from the cars_19 dataset, which I created from publicly available Environmental Protection Agency data. The support vector machine (SVM) was the clear winner, producing the model with the lowest RMSE. In this post I use LightGBM to build a predictive model and compare its RMSE to the other models.
The raw data is located on the EPA government site.
As with the other models, the variables/features I am using are: engine displacement (size), number of cylinders, transmission type, number of gears, air aspiration method, regenerative braking type, battery capacity (Ah), drivetrain, fuel type, cylinder deactivation, and variable valve. The LightGBM package does not handle factors, so I have to transform them into dummy variables. After creating the dummy variables, I have 33 input variables.
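One way to build such a matrix in base R is model.matrix() with full (non-reduced) contrasts, so that every factor level gets its own column. The sketch below runs on a made-up data frame (toy_cars and its values are hypothetical, not part of cars_19) and is not necessarily how the dummy variables in this post were created. Applied to cars_19, one column per level for the 7 factors (7 + 5 + 3 + 5 + 5 + 2 + 2 = 29) plus the 4 numeric predictors would account for the 33 inputs.

```r
# Toy data frame (hypothetical values) with one numeric and two factor columns
toy_cars <- data.frame(
  eng_disp       = c(3.5, 1.8, 4.0, 2.0),
  transmission   = factor(c("A", "AM", "A", "CVT")),
  cyl_deactivate = factor(c("N", "N", "Y", "N"))
)

# Request identity contrasts (one column per level) for every factor
is_fac <- vapply(toy_cars, is.factor, logical(1))
full_contrasts <- lapply(toy_cars[is_fac], contrasts, contrasts = FALSE)

# "- 1" drops the intercept; the result is a plain numeric matrix
X <- model.matrix(~ . - 1, data = toy_cars, contrasts.arg = full_contrasts)

colnames(X)
dim(X)
```

The resulting numeric matrix is the kind of input that can then be wrapped in an lgb.Dataset.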
str(cars_19)
'data.frame': 1253 obs. of 12 variables:
 $ fuel_economy_combined: int 21 28 21 26 28 11 15 18 17 15 ...
 $ eng_disp             : num 3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
 $ num_cyl              : int 6 4 8 4 4 16 8 8 8 8 ...
 $ transmission         : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
 $ num_gears            : int 9 6 8 7 8 7 8 8 8 7 ...
 $ air_aspired_method   : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
 $ regen_brake          : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
 $ batt_capacity_ah     : num 4.25 0 0 0 0 0 0 0 0 0 ...
 $ drive                : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
 $ fuel_type            : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
 $ cyl_deactivate       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
 $ variable_valve       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
One of the biggest challenges with this dataset is that it is small for machine learning: the training set has 939 rows and the test set only 314. Ideally there would be more data, but this is real data and it is all that is available.
After getting a working model and estimating the hyperparameters by trial and error, I am going to run a grid search over:
max_depth, num_leaves, num_iterations, early_stopping_rounds, learning_rate
As a general rule of thumb, num_leaves = 2^(max_depth), which is the number of leaves in a full binary tree of that depth, and num_leaves and max_depth need to be tuned together to prevent overfitting. Solving for max_depth:
max_depth = round(log(num_leaves) / log(2),0)
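To see what this formula yields over the range of leaf counts searched later in the post, here is a small check in base R (the variable names mirror those used in the grid search):

```r
# Leaf counts considered in the grid search
num_leaves <- 20:28

# Depth of the full binary tree with about that many leaves: log2(num_leaves), rounded
max_depth <- round(log(num_leaves) / log(2), 0)

data.frame(num_leaves, max_depth)
```

Only two depths result: 4 for 20 to 22 leaves and 5 for 23 to 28 leaves.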
This is just a guideline; I found that values for both hyperparameters higher than those in the final hyper_grid below caused the model to overfit.
After running a few grid searches, the final hyper_grid I am looking to optimize (minimize RMSE) has 4950 rows. This runs fairly quickly on a Mac mini with the M1 processor and 16 GB RAM, thanks to the early_stopping_rounds parameter.
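The row count can be checked before fitting anything. Because expand.grid() crosses every vector with every other one, the number of rows left after unique() is the product of the number of distinct values per hyperparameter. A quick check, assuming the same ranges as the grid defined below:

```r
num_leaves            <- seq(20, 28, 1)                      # 9 values
max_depth             <- round(log(num_leaves) / log(2), 0)  # 2 distinct values (4 and 5)
num_iterations        <- seq(200, 400, 50)                   # 5 values
early_stopping_rounds <- round(num_iterations * .1, 0)       # 5 values
learning_rate         <- seq(.45, .50, .005)                 # 11 values

# Count the distinct values of each hyperparameter and multiply them
n_distinct <- sapply(
  list(max_depth, num_leaves, num_iterations, early_stopping_rounds, learning_rate),
  function(v) length(unique(v))
)
prod(n_distinct)
```

Since the grid pairs every max_depth with every num_leaves, it also contains combinations that depart from the rule of thumb, such as max_depth = 4 with num_leaves = 23.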
#grid search
#create hyperparameter grid
num_leaves = seq(20, 28, 1)
max_depth = round(log(num_leaves) / log(2), 0)
num_iterations = seq(200, 400, 50)
early_stopping_rounds = round(num_iterations * .1, 0)

hyper_grid <- expand.grid(max_depth = max_depth,
                          num_leaves = num_leaves,
                          num_iterations = num_iterations,
                          early_stopping_rounds = early_stopping_rounds,
                          learning_rate = seq(.45, .50, .005))

hyper_grid <- unique(hyper_grid)

Running a for loop:
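library(lightgbm)  # lgb.train()
library(caret)     # RMSE()

# pre-allocate the result vectors so they exist before the loop assigns into them
rmse_fit <- numeric(nrow(hyper_grid))
rmse_predict <- numeric(nrow(hyper_grid))

for (j in 1:nrow(hyper_grid)) {
  set.seed(123)
  light_gbn_tuned <- lgb.train(
    params = list(
      objective = "regression",
      metric = "l2",
      max_depth = hyper_grid$max_depth[j],
      num_leaves = hyper_grid$num_leaves[j],
      num_iterations = hyper_grid$num_iterations[j],
      early_stopping_rounds = hyper_grid$early_stopping_rounds[j],
      learning_rate = hyper_grid$learning_rate[j]
      #feature_fraction = .9
    ),
    valids = list(test = test_lgb),  # the test set doubles as the validation set for early stopping
    data = train_lgb
  )

  # columns 2:34 hold the 33 input variables
  yhat_fit_tuned <- predict(light_gbn_tuned, train[, 2:34])
  yhat_predict_tuned <- predict(light_gbn_tuned, test[, 2:34])

  rmse_fit[j] <- RMSE(y_train, yhat_fit_tuned)
  rmse_predict[j] <- RMSE(y_test, yhat_predict_tuned)
  cat(j, "\n")  # progress indicator
}

I am going to run this model as final: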
set.seed(123)
light_gbn_final <- lgb.train(
  params = list(
    objective = "regression",
    metric = "l2",
    max_depth = 4,
    num_leaves = 23,
    num_iterations = 400,
    early_stopping_rounds = 40,
    learning_rate = .48
    #feature_fraction = .8
  ),
  valids = list(test = test_lgb),
  data = train_lgb
)
postResample(y_test, yhat_predict_final)
     RMSE  Rsquared       MAE
1.7031942 0.9016161 1.2326575

Graph of features that are most explanatory:
sum(abs(r) <= rmse_predict_final) / length(y_test)
[1] 0.7547771
sum(abs(r) <= 2 * rmse_predict_final) / length(y_test)
[1] 0.9522293
summary(r)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-11.21159  -0.96398   0.06337  -0.02708   0.96796   5.77861
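The two proportions above measure how many test-set residuals fall within one and two RMSEs of zero. A minimal self-contained sketch of the same calculation, using a made-up residual vector (the helper name share_within and the values in toy_resid are hypothetical, and the RMSE is computed here from the residuals themselves):

```r
# Share of residuals whose absolute value is at most k times the RMSE
share_within <- function(resid, k = 1) {
  rmse <- sqrt(mean(resid^2))
  mean(abs(resid) <= k * rmse)
}

toy_resid <- c(-2, -1, 0, 1, 2)  # made-up residuals, not from the model

share_within(toy_resid, k = 1)
share_within(toy_resid, k = 2)
```

For comparison, normally distributed errors centered on zero would give roughly 0.68 and 0.95.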
Comparison of RMSE:
svm = .93
lightGBM = 1.7
XGBoost = 1.74
gradient boosting = 1.8
random forest = 1.9
neural network = 2.06
decision tree = 2.49
mlr = 2.6