Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
eXtreme Gradient Boosting (XGBoost) is a machine learning algorithm that became very popular a few years ago after being used to win several Kaggle competitions. It is a powerful algorithm that uses an ensemble of weak learners to obtain a strong learner. Its R implementation is available in the xgboost package, and it is well worth including in anyone's machine learning portfolio.
This is the first part of the eXtremely Boost your machine learning series. For other parts, follow the tag xgboost.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Load the xgboost library and download the German Credit dataset. Your goal in this tutorial will be to predict Creditability (the first column in the dataset).
Exercise 2
Convert columns c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20) to factors and then encode them as dummy variables. HINT: use model.matrix()
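As a rough illustration of the hint, here is a minimal sketch of dummy encoding with model.matrix() on a tiny made-up data frame. The data frame toy and its columns label, purpose and amount are hypothetical stand-ins, not the German Credit data.

```r
# Hypothetical toy data; 'purpose' is an integer-coded category
toy <- data.frame(
  label   = c(1, 0, 1, 0),
  purpose = c(1, 2, 3, 1),
  amount  = c(1200, 500, 3100, 800)
)

# Convert the categorical column (here column 2) to a factor
cat_cols <- c(2)
toy[cat_cols] <- lapply(toy[cat_cols], factor)

# '- 1' drops the intercept, so the first factor keeps one column per level
X <- model.matrix(label ~ . - 1, data = toy)
colnames(X)
```

Numeric columns such as amount pass through unchanged, while each level of purpose becomes its own 0/1 column.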
Exercise 3
Split the data into training and test sets, 700:300. Create an xgb.DMatrix for both sets with Creditability as the label.
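One possible way to do a random 700:300 split and wrap each part in an xgb.DMatrix is sketched below. The matrix X and label vector y are simulated placeholders, and the seed value is arbitrary; with the real data you would use the dummy-encoded predictors and the Creditability column instead.

```r
library(xgboost)

set.seed(42)  # arbitrary seed, only to make the split reproducible

# Hypothetical stand-ins for the encoded predictors and the 0/1 label
n <- 1000
X <- matrix(rnorm(n * 3), nrow = n,
            dimnames = list(NULL, c("f1", "f2", "f3")))
y <- rbinom(n, size = 1, prob = 0.7)

# Sample 700 row indices for training; the other 300 rows form the test set
train_idx <- sample(n, 700)

dtrain <- xgb.DMatrix(data = X[train_idx, ],  label = y[train_idx])
dtest  <- xgb.DMatrix(data = X[-train_idx, ], label = y[-train_idx])
```

Sampling indices rather than taking the first 700 rows avoids any ordering present in the file.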
Exercise 4
Train xgboost with a logistic objective, 30 rounds of training, and a maximum depth of 2.
Exercise 5
To check model performance, calculate the test set classification error.
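Classification error is simply the share of misclassified observations. The sketch below shows the calculation on made-up numbers; prob and truth are hypothetical, and the 0.5 cutoff is an assumption (a logistic objective returns probabilities, so some threshold is needed).

```r
# Hypothetical predicted probabilities and true 0/1 labels
prob  <- c(0.91, 0.40, 0.65, 0.12, 0.55)
truth <- c(1, 0, 0, 0, 1)

# Turn probabilities into class predictions using an assumed 0.5 cutoff
pred_class <- as.numeric(prob > 0.5)

# Fraction of observations where prediction and truth disagree
err <- mean(pred_class != truth)
err
```

With the real model, prob would come from calling predict() on the test set.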
Exercise 6
Plot predictor importance.
Exercise 7
Use xgb.train() instead of xgboost() to add both the train and test sets as a watchlist. Train the model with the same parameters, but for 100 rounds, to see how it performs during training.
Exercise 8
Train the model again, adding AUC and Log Loss as evaluation metrics.
Exercise 9
Plot how AUC and Log Loss for the train and test sets changed during the training process. Use a plotting function/library of your choice.
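As one option, base R's matplot() can draw the train and test curves on the same axes. The sketch below uses an invented data frame log_df in place of the model's evaluation log; the column names iter, train_auc and test_auc and all the values are assumptions for illustration only.

```r
# Hypothetical evaluation log; real values would come from the trained model
log_df <- data.frame(
  iter      = 1:5,
  train_auc = c(0.70, 0.74, 0.77, 0.79, 0.81),
  test_auc  = c(0.68, 0.71, 0.73, 0.74, 0.74)
)

# One line per column: train in blue, test in red
curves <- as.matrix(log_df[, c("train_auc", "test_auc")])
matplot(log_df$iter, curves, type = "l", lty = 1,
        col = c("blue", "red"), xlab = "Iteration", ylab = "AUC")
legend("bottomright", legend = c("train", "test"),
       col = c("blue", "red"), lty = 1)
```

The same pattern works for Log Loss by swapping in the corresponding columns.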
Exercise 10
Check how setting the parameter eta to 0.01 influences the AUC and Log Loss curves.