TLDR. If you already have some experience with imbalanced learning problems, you’re probably aware of how time-consuming it is to test lengthy lists of resampling strategies (one of the available families of methods for tackling imbalanced learning tasks) and their respective hyperparameters. The idea behind the ATOMIC method in the autoresampling package is to save you a lot of time (and related energy consumption) while getting you a very decent approximation of the best possible solution out of an extensive list of strategy-hyperparameter combos. R code with use-case examples is at the bottom of the post. If you don’t have experience with this type of problem, you might want to read on.
Imbalanced Learning
Imbalanced learning (I usually prefer the term imbalanced domain learning, IDL for short) is a very common issue in machine learning, and particularly so in classification problems. It describes scenarios where the distribution of classes in the data is highly skewed, and the under-represented classes are precisely the ones you’re most interested in predicting accurately. Take fraud detection: there’s a small percentage of fraud cases (we hope), and those are the cases you’re attempting to anticipate.
Why is this a problem? Although deeper theoretical insight into the whys is still lacking, the practical difficulties are well known. When we train a model on imbalanced data, the model will tend to be more accurate on the better-represented concepts (classes) in the training data. This effect is amplified by several additional factors, such as the choice of optimisation/evaluation criteria and the learning algorithms’ internal optimisation processes. In any case, from a general point of view, we can expect that data biases will be ‘transported’ into the concepts captured by the resulting models.
If you dive into the related literature, you will most probably find three different families of methods for tackling IDL tasks. First, data pre-processing methods, most commonly known as resampling strategies. Second, algorithm-level methods, which produce alternative versions of learning algorithms tailored to counter such data bias. Third, post-processing methods, which are applied to models’ predictions (e.g. changing the probability threshold for predicting a particular class). Of all these, there is little doubt that resampling strategies are the go-to solution and by far the most popular (and thus the most developed) set of methods.
Resampling Strategies
Resampling strategies are straightforward in the way they work: they change the original distribution of the data. For example, random undersampling was one of the first strategies used to tackle IDL tasks. It reduces the number of cases in the majority class (the cases that are not our prediction focus) by randomly selecting a sub-sample of them. An example: you apply random undersampling with a 70% hyperparameter setting, and the resulting data set keeps 70% of the majority-class cases, i.e. 30% fewer (a minimal sketch of this is shown below). With this, a new problem arises. There are a ton of resampling strategies, new ones are constantly being proposed, and each has hyperparameters that should be optimised. Choosing which strategies to test is already a demanding task by itself. This usually results in a lot of cherry-picking concerning which strategies to use; worse, you will often find that the strategies’ default hyperparameters are used, both in academia and in industry.
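To make the mechanics concrete, here is a minimal base-R sketch of random undersampling on a binary task. The random_undersample() helper and the 70% rate are purely illustrative assumptions for this post, not the interface of the autoresampling package or any other implementation.

# Minimal, illustrative sketch of random undersampling for a binary target.
# 'perc' is the fraction of majority-class cases to keep (0.7 keeps 70%, i.e. 30% fewer).
random_undersample <- function(data, target, perc = 0.7) {
  counts <- table(data[[target]])
  maj <- names(which.max(counts))             # majority class label
  maj.idx <- which(data[[target]] == maj)     # rows belonging to the majority class
  keep <- sample(maj.idx, round(perc * length(maj.idx)))
  rbind(data[-maj.idx, ], data[keep, ])       # all minority cases + sub-sample of majority
}

library(mlbench)
data(PimaIndiansDiabetes)
table(PimaIndiansDiabetes$diabetes)           # original class distribution
under <- random_undersample(PimaIndiansDiabetes, "diabetes", 0.7)
table(under$diabetes)                         # majority class reduced by ~30%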
Here we’ll focus on the problem of magnitude: the numerous (virtually infinite) combinations one can choose from when selecting a resampling strategy and optimising its hyperparameters. The standard way to go about this is to define a decent (large enough to be credible) list of candidates and validate each of them, with the same methodology one would use to find the best possible model. And here we hit a big problem, one of complexity and high energy consumption. Finding the best possible model is already a time-demanding business; now imagine adding tens or hundreds of variants to each traditional validation step just to evaluate which resampling strategy (and respective hyperparameters) is the best option. A rough sketch of what that loop looks like follows.
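The snippet below is an illustrative sketch only, assuming the random_undersample() helper from above and a tiny, arbitrary grid of undersampling rates evaluated with 5-fold cross-validation and a random forest; a realistic search would cover many strategies and far larger grids.

# Illustrative sketch: cross-validated search over a tiny grid of undersampling rates.
# Assumes the random_undersample() helper sketched earlier; the grid and metric are arbitrary.
library(randomForest)
library(mlbench)
data(PimaIndiansDiabetes)

rates <- c(0.3, 0.5, 0.7, 1.0)   # 1.0 = no undersampling
folds <- sample(rep(1:5, length.out = nrow(PimaIndiansDiabetes)))

# F1-score for the minority ('pos') class
f1 <- function(truth, preds, positive = "pos") {
  tp <- sum(preds == positive & truth == positive)
  prec <- tp / max(sum(preds == positive), 1)
  rec <- tp / max(sum(truth == positive), 1)
  if (prec + rec == 0) 0 else 2 * prec * rec / (prec + rec)
}

mean.f1 <- sapply(rates, function(r) {
  mean(sapply(1:5, function(k) {
    tr <- PimaIndiansDiabetes[folds != k, ]
    ts <- PimaIndiansDiabetes[folds == k, ]
    if (r < 1) tr <- random_undersample(tr, "diabetes", r)
    m <- randomForest(diabetes ~ ., tr)
    f1(ts$diabetes, predict(m, ts))
  }))
})
data.frame(rate = rates, mean.f1 = mean.f1)

Even this toy grid already costs 20 model fits; multiply that by dozens of strategies, larger hyperparameter grids and proper repeated validation, and the cost becomes clear. And thus, we arrive at the primary purpose of this post.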
Automated Imbalanced Classification (ATOMIC)
I’ve been working with my colleague Vitor Cerqueira since late 2018 (on and, mostly, off) on what we’ve come to call the ATOMIC method – shorthand for automated imbalanced classification. It is an automated machine learning (AutoML) solution for significantly reducing the time and effort needed to reach the best possible combination of resampling strategy and respective hyperparameters for a given data set. How? Put simply: we evaluated the performance of a list of 400+ combinations of resampling strategies and respective hyperparameters on 101 data sets. We then combined that evaluation data with an extensive meta-feature characterisation of each of the 101 data sets. The combination yields a data set used to train a meta-model, which tries to answer the following question: given a particular learning problem, which combination of resampling strategy and respective hyperparameters do we anticipate will provide the best possible solution?
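To make the meta-learning idea itself more tangible, here is a deliberately toy sketch of the concept. The meta-features, candidate strategies and performance values below are all made up for illustration; this is not how the autoresampling package actually builds or encodes its meta-model.

# Toy illustration of the meta-learning idea (not the actual ATOMIC meta-model).
# Each row: meta-features of a data set + a strategy/hyperparameter combo + an observed score.
library(randomForest)

set.seed(123)
meta.db <- data.frame(
  imbalance.ratio = runif(300, 1, 50),   # hypothetical meta-feature
  n.features = sample(5:100, 300, replace = TRUE),
  strategy = factor(sample(c("under.0.5", "under.0.7", "smote.200"), 300, replace = TRUE)),
  f1 = runif(300)                        # made-up performance scores
)

# Meta-model: predict expected performance from data set characteristics + strategy
meta.model <- randomForest(f1 ~ ., meta.db)

# For a new learning problem, score every candidate combo and pick the top-ranked one
new.task <- data.frame(
  imbalance.ratio = 12,
  n.features = 20,
  strategy = factor(c("under.0.5", "under.0.7", "smote.200"),
                    levels = levels(meta.db$strategy))
)
new.task$expected.f1 <- predict(meta.model, new.task)
new.task[order(-new.task$expected.f1), ]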
Important note: At this point, the ATOMIC method only builds models using the Random Forest learning algorithm!
The following image is a high-level illustration of the ATOMIC method in both of its phases: development and prediction. This work was recently accepted at Expert Systems with Applications1, so feel free to check it out for details – feedback is very much welcome.
ATOMIC at work
The ATOMIC method is implemented in the autoresampling package and it’s really easy to use. First, let’s install the package.
install.packages("remotes")
remotes::install_github("nunompmoniz/autoresampling")
To use the ATOMIC method, you just need to go through a very simple (and pretty standard) code snippet, as follows.
library(autoresampling)
library(mlbench)

data(PimaIndiansDiabetes)

# Split the data 70%-30% into training and test sets
ind <- sample(1:nrow(PimaIndiansDiabetes), 0.7 * nrow(PimaIndiansDiabetes))
train <- PimaIndiansDiabetes[ind, ]
test <- PimaIndiansDiabetes[-ind, ]

# Build the ATOMIC solution. Note: fitted here on the full data set, so the test
# cases below were also seen during training; pass 'train' instead for a held-out evaluation.
atomic.m <- ATOMIC(diabetes ~ ., PimaIndiansDiabetes)

# Obtain the predictions for the test set
preds <- predict(atomic.m, test)
For the final touch:
table(preds, test$diabetes)

preds neg pos
  neg 154   1
  pos   0  76
Conclusions
If you’re facing an imbalanced learning problem and you want a good-enough baseline that isn’t plain random undersampling/oversampling or SMOTE with more-or-less default hyperparameters, this could be a great solution for you. Also, if you’re just trying to get an idea of how resampling strategies might help, it could prove very useful!
If you have any questions, comments or suggestions, it would be great to receive them! Enjoy!
Nuno Moniz and Vítor Cerqueira (2021). “Automated Imbalanced Classification with Meta-learning”. Expert Systems with Applications, Elsevier.