forester: the simplicity of AutoML

In this blog post, we describe in detail the main function of the forester package, train(). We will focus on showing how the particular steps work and how the user can shape the training process. The concept of the package is described in previous blog posts introducing the package, describing it in greater detail, and providing a use case scenario.
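Before diving into the individual steps, here is a minimal sketch of how train() is typically called. The data frame my_data and the column name "target" are placeholders, and we assume that y names the target column; see the package documentation for the full argument list.

```r
library(forester)

# `my_data` is a placeholder data frame; "target" names its target column.
# train() runs the whole AutoML pipeline described in the sections below.
output <- train(my_data, y = "target")
```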
The train() function components.

Data check report

The first step of the AutoML pipeline is running the check_data() function which provides the user with information about a given dataset divided into a few categories:

  1. Basic info: The number of features and observations, column names, and the name of the target value.
  2. Static columns: Provides the name and dominating value for static columns (where over 99% of the observations have the same value).
  3. Duplicate columns: Informs about the columns which have identical values.
  4. Missing fields: Informs how many observations have missing predictor or target values.
  5. Dimensionality check: Informs if the given dataset has too many columns for the tree-based models to handle (over 30 features), or if there are more columns than rows.
  6. Correlated features: Prints out which categorical or numerical features are highly correlated (> 0.7), according to Cramér’s V and Spearman’s rank correlation coefficient respectively.
  7. Outliers: Detects outliers inside the dataset. An observation is considered to be an outlier when it satisfies three criteria based on the values of mean standard deviation, median absolute deviation, and inter-quartile range.
  8. Target balance: Informs if the target values are imbalanced and prints out the imbalance values. For binary classification, the acceptable class proportion lies between 40% and 60%; for regression, we compute 4 quantile bins and the proportion of observations in the biggest and the smallest bin has to fall between 40% and 60%.
  9. Id column detection: Detects if any column might be an identifier. A feature is considered an id column if its name indicates so or its values grow by one with every row.

The check_data() output is printed in the console only when the user sets the verbose parameter to TRUE.
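For illustration, here is a hedged sketch of running the check outside of train(). It assumes that check_data() accepts the dataset and the target column name, in the same way as train(); my_data and "target" remain placeholders.

```r
library(forester)

# Standalone data check report printed to the console
# (signature assumed: dataset first, target column name as y).
check_data(my_data, y = "target")

# Inside train(), the same report is printed only when verbose = TRUE.
output <- train(my_data, y = "target", verbose = TRUE)
```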

Preprocessing

The preprocessing function is executed right after the data check inside the train() pipeline. It improves the general quality of the data frame in a few steps presented below:

  1. Removal of static columns detected during the data check.
  2. Binarization of the target into the classes ‘1’ and ‘2’ for the binary classification task.
  3. Imputation of the missing values with the MICE algorithm.
  4. Removal of highly correlated (>0.7) features from the data set (one feature per pair).
  5. Removal of the id columns.
  6. Feature selection with the Boruta algorithm.
  7. Saving information about the deleted columns.

In the train() function there is a parameter advanced_preprocessing which indicates which preprocessing method is performed. With the default FALSE value, the basic method covers steps 1, 2, 3, and 7, whereas setting it to TRUE covers all 7 steps. The advanced method might result in better performance; however, this highly depends on the given dataset.
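A short sketch of switching between the two modes (advanced_preprocessing is the parameter named above; my_data and "target" are again placeholders):

```r
# Basic preprocessing (default): steps 1, 2, 3, and 7.
output_basic <- train(my_data, y = "target", advanced_preprocessing = FALSE)

# Advanced preprocessing: all 7 steps, including correlation-based removal,
# id-column removal, and Boruta feature selection.
output_advanced <- train(my_data, y = "target", advanced_preprocessing = TRUE)
```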

Data preparation

In this stage, we first divide the given dataset into train, test, and validation subsets according to the train() parameter called train_test_split. The method used for the split comes from the splitTools package, which enables us to balance the returned datasets.
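As a rough illustration of such a balanced split with splitTools (the proportions below are arbitrary and, inside forester, are controlled by train_test_split; iris with Sepal.Length as the target is used only as a stand-in dataset):

```r
library(splitTools)

# Stratified split into train/test/validation indices;
# partition() balances the splits with respect to the target values.
idx <- partition(iris$Sepal.Length, p = c(train = 0.6, test = 0.2, valid = 0.2))

train_set <- iris[idx$train, ]
test_set  <- iris[idx$test, ]
valid_set <- iris[idx$valid, ]
```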

The resulting data frames are later transformed for every model engine, because every model has different expectations for the incoming data. For example, the xgboost model needs one-hot encoding, while lightgbm accepts only datasets in its own format. During this stage, we also ensure that every categorical column has the level ‘other’, which represents categories unseen in the training dataset.
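The sketch below shows the kind of conversion that happens under the hood. It is an illustration only, not forester’s exact code, and again uses iris with Sepal.Length as a stand-in regression target.

```r
library(xgboost)
library(lightgbm)

# One-hot encode the predictors (Species becomes dummy columns) into a numeric matrix.
X <- stats::model.matrix(~ . - 1, data = iris[, setdiff(names(iris), "Sepal.Length")])
y <- iris$Sepal.Length

# xgboost expects its DMatrix format, lightgbm its own Dataset object.
dtrain_xgb <- xgboost::xgb.DMatrix(data = X, label = y)
dtrain_lgb <- lightgbm::lgb.Dataset(data = X, label = y)
```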

Model training and tuning

This stage involves training up to 5 tree-based model engines: random forest, xgboost, decision tree, lightgbm, and catboost. Each of them comes with a different hyperparameter set in the tuning phase. The training and tuning are performed along 3 substantially different paths.

Default parameters

The basic method, which is always present during the process, is training the selected models with their default parameters. This method is quick and provides the user with a baseline model, which can be a decent comparison for the more advanced optimisation methods.

Random search

This method trains the models with parameters drawn by a random search from a hyperparameter space defined individually for every model. The number of models trained this way is controlled by the train() parameter called random_evals. The value 5 means that the random search will produce 5 models for every engine.

Bayesian optimisation

The last method is the most advanced, as it performs a Bayesian optimisation with the ParBayesianOptimization package in order to train a better model with every iteration. This method is the most effective, but also the most time-consuming one. The number of optimisation iterations is controlled by the train() parameter called bayes_iter. The more iterations the user sets, the longer the training process takes, but the algorithm also covers a larger part of the hyperparameter space.
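A hedged sketch of controlling the tuning budget follows. The random_evals and bayes_iter parameters are the ones named above; the engine argument and its values are an assumption based on the models listed, so check the train() documentation for the exact names.

```r
output <- train(
  my_data,                                             # placeholder data frame
  y            = "target",                             # placeholder target column
  engine       = c("ranger", "xgboost", "lightgbm"),   # assumed engine names
  random_evals = 5,    # 5 random-search models per engine
  bayes_iter   = 10    # 10 iterations of Bayesian optimisation
)
```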

Model evaluation

The model evaluation is performed at the end of the AutoML process and the outputs are presented as a sorted ranked list of models. The list contains columns describing the index of the model, its unique name, the engine, the tuning method, and columns with the evaluation metrics, which differ depending on the task type.

For binary classification, we calculate accuracy, AUC ROC (area under the ROC curve), F1, recall, and precision, whereas for regression these are: RMSE (root mean squared error), MSE (mean squared error), R2, MAD (median absolute deviation), and MAE (mean absolute error).

The user can also provide their own metric function with the train() parameters called metric_function and metric_function_name. One can also select which metrics will be calculated (metrics) and change by which of them the ranked list is sorted (sort_by).
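For example, here is a hedged sketch of plugging in a custom metric; the assumed signature (observed values followed by predictions) should be verified against the train() documentation.

```r
# Balanced accuracy: the mean of the per-class recalls.
balanced_accuracy <- function(observed, predicted) {
  classes <- unique(observed)
  mean(sapply(classes, function(cl) mean(predicted[observed == cl] == cl)))
}

output <- train(
  my_data,                                      # placeholder data frame
  y                    = "target",              # placeholder target column
  metric_function      = balanced_accuracy,
  metric_function_name = "balanced_accuracy",
  sort_by              = "balanced_accuracy"
)
```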

Output object

As the output, the user gets a list containing over 20 different objects, which are briefly described in the train() documentation. They cover the train, test, and validation datasets used during the training, predictions on these datasets and their observed values, the data check report, a list of outliers, the trained models, and many more. The most important, however, are the ranked lists for the train, test, and validation datasets, which are named score_train, score_test, and score_valid.

With the output object, the user can easily create a report describing the training process or explain the model.
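For instance, a hedged sketch of inspecting the output and passing it on: the score_train, score_test, and score_valid elements come from the description above, while report() is assumed here to be the package’s report generator described in the next post, so check the forester documentation for the exact call.

```r
# Ranked lists of models on the three data splits.
output$score_train
output$score_test
output$score_valid

# Generate the automatic report for the whole training process
# (function name assumed; see the forester documentation).
report(output)
```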

In the next blog post, we will take a look at the automatically generated reports present in the package. We will describe the general scheme of the document and all the information it contains for both binary classification and regression task reports.
If you are interested in other posts about explainable, fair and responsible ML, follow #ResponsibleML on Medium.
In order to see more R related content visit https://www.r-bloggers.com.



