ZeroR: The Simplest Possible Classifier… or: Why High Accuracy can be Misleading
In one of my most popular posts, So, what is AI really?, I showed that Artificial Intelligence (AI) basically boils down to autonomously learned rules, i.e. conditional statements or, simply, conditionals.
In this post, I create the simplest possible classifier, called ZeroR, to show that even this classifier can achieve surprisingly high accuracy (i.e. the ratio of correctly predicted instances)… and why this is not necessarily a good thing, so read on!
In the above-mentioned post, I gave an example of a classifier that was able to give you some guidance on whether a certain mushroom is edible or not. The basis for this was a set of rules that separate the examples based on the given attributes:
## Rules:
## If odor = almond then type = edible
## If odor = anise then type = edible
## If odor = creosote then type = poisonous
## If odor = fishy then type = poisonous
## If odor = foul then type = poisonous
## If odor = musty then type = poisonous
## If odor = none then type = edible
## If odor = pungent then type = poisonous
## If odor = spicy then type = poisonous
Obviously, the more rules, the more complex the classifier. In the example above we used the so-called OneR classifier, which bases its decision on one attribute alone. Here, I will give you an even simpler classifier! The ZeroR classifier bases its decision on no attribute whatsoever… zero, zilch, zip, nada! How can this be? Easy: it just takes the majority class of the target attribute! I will give you an example.
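Conceptually, that is already the whole "algorithm". Before we wrap it into proper S3 machinery below, here is a minimal sketch of the idea in plain base R (just for illustration, not the implementation we will actually use):

# the whole idea of ZeroR in two lines of base R (illustrative sketch only):
# always predict the most frequent level of the target attribute
zeror_majority <- function(target) names(which.max(table(target)))
zeror_majority(c("edible", "edible", "poisonous"))  # "edible"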
First, we build a function for the classifier by using the OneR package (on CRAN) and some S3-class magic:
library(OneR)

ZeroR <- function(x, ...) {
  output <- OneR(cbind(dummy = TRUE, x[ncol(x)]), ...)
  class(output) <- c("ZeroR", "OneR")
  output
}

predict.ZeroR <- function(object, newdata, ...) {
  class(object) <- "OneR"
  predict(object, cbind(dummy = TRUE, newdata[ncol(newdata)]), ...)
}
As an example we take the well-known German Credit Dataset (originally from my old alma mater, the University of Hamburg) and divide it into a training and a test set:
data <- read.table("data/german.data", header = FALSE)
data <- data.frame(data[ , 1:20], creditrisk = factor(data[ , 21]))
table(data$creditrisk)
## 
##   1   2 
## 700 300

set.seed(805)
random <- sample(1:nrow(data), 0.7 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]
We see that 700 customers are a good credit risk while 300 are a bad one. The ZeroR classifier now takes the majority class (good credit risk) and uses it as the prediction every time! You have read correctly: it simply predicts that every customer is a good credit risk!
Seems a little crazy, right? Well, it illustrates an important point: many of my students, as well as some of my consulting clients, often ask me what a good classifier is and how long it takes to build one. Many people in the area of data science (even some “experts”) will give you something like the following answer (source: A. Burkov):
Machine learning accuracy rule:
0-80%: one day
80-90%: one week
90-95%: one month
95-97%: three months
97-99%: one year (or never)
Well, to be honest with you: this is not a very good answer. Why? Because it very much depends on… the share of the majority class! To understand that, let us have a look at how the ZeroR classifier performs on our dataset:
model <- ZeroR(data_train)
summary(model)
## 
## Call:
## OneR.data.frame(x = cbind(dummy = TRUE, x[ncol(x)]))
## 
## Rules:
## If dummy = TRUE then creditrisk = 1
## 
## Accuracy:
## 481 of 700 instances classified correctly (68.71%)
## 
## Contingency table:
##           dummy
## creditrisk  TRUE Sum
##        1   * 481 481
##        2     219 219
##        Sum   700 700
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 98.063, df = 1, p-value < 2.2e-16

plot(model)
prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   1   2 Sum
##        1   219  81 300
##        2     0   0   0
##        Sum 219  81 300
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2  Sum
##        1   0.73 0.27 1.00
##        2   0.00 0.00 0.00
##        Sum 0.73 0.27 1.00
## 
## Accuracy:
## 0.73 (219/300)
## 
## Error rate:
## 0.27 (81/300)
## 
## Error rate reduction (vs. base rate):
## 0 (p-value = 0.5299)
So, because 70% of the customers are good risks, we get an accuracy of about 70%! You can take this to extremes: if you have a dataset of credit card transactions where 0.1% of the transactions are fraudulent (which is about the actual number), you will get an accuracy of 99.9% just by using the ZeroR classifier! Concretely, just by declaring that no fraud exists (!) you get an accuracy even beyond the “one year (or never)” bracket of the above scheme!
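To make this concrete, here is a minimal sketch with simulated transaction data (the 0.1% fraud rate and all other numbers are made up for illustration), reusing the ZeroR() helper and eval_model() from above:

# simulated, highly imbalanced transaction data: about 0.1% fraud (illustrative only)
set.seed(123)
transactions <- data.frame(amount = round(runif(100000, 1, 500), 2),
                           fraud = factor(sample(c("no", "yes"), 100000,
                                                 replace = TRUE, prob = c(0.999, 0.001))))
model <- ZeroR(transactions)          # always predicts the majority class "no"
prediction <- predict(model, transactions)
eval_model(prediction, transactions)  # accuracy of about 99.9%, yet not a single fraud case is caught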
Another example even concerns life and death: the probability of dying within one year lies at about 0.8% (averaged over all people worldwide, according to “The World Factbook” by the CIA). So by declaring that we are all immortal, we are right in more than 99% of all cases! Many medical studies have a much higher error rate…
Now, let us try the OneR classifier on our credit dataset:
model <- OneR(optbin(data_train))
summary(model)
## 
## Call:
## OneR.data.frame(x = optbin(data_train))
## 
## Rules:
## If V3 = A30 then creditrisk = 2
## If V3 = A31 then creditrisk = 2
## If V3 = A32 then creditrisk = 1
## If V3 = A33 then creditrisk = 1
## If V3 = A34 then creditrisk = 1
## 
## Accuracy:
## 492 of 700 instances classified correctly (70.29%)
## 
## Contingency table:
##           V3
## creditrisk   A30   A31   A32   A33   A34 Sum
##        1      10    14 * 247  * 37 * 173 481
##        2    * 16  * 19   124    21    39 219
##        Sum    26    33   371    58   212 700
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 39.504, df = 4, p-value = 5.48e-08

plot(model)
# Attribute 3: (qualitative)
# Credit history
# A30 : no credits taken/
#       all credits paid back duly
# A31 : all credits at this bank paid back duly
# A32 : existing credits paid back duly till now
# A33 : delay in paying off in the past
# A34 : critical account/
#       other credits existing (not at this bank)

prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   1   2 Sum
##        1   207  63 270
##        2    12  18  30
##        Sum 219  81 300
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2  Sum
##        1   0.69 0.21 0.90
##        2   0.04 0.06 0.10
##        Sum 0.73 0.27 1.00
## 
## Accuracy:
## 0.75 (225/300)
## 
## Error rate:
## 0.25 (75/300)
## 
## Error rate reduction (vs. base rate):
## 0.0741 (p-value = 0.2388)
Here, we see that we get an out-of-sample accuracy of 75%, which is 2 percentage points better than the 73% we got with the ZeroR classifier, here called the base rate, corresponding to an error rate reduction of about 7%. Yet, this is not statistically significant (for an introduction to statistical significance see From Coin Tosses to p-Hacking: Make Statistics Significant Again!).
Because the concept of “error rate reduction” compared to ZeroR (= base rate) and its statistical significance is so relevant, it is displayed by default in the eval_model() function of the OneR package.
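To see where these numbers come from, here is a minimal sketch of the underlying arithmetic (a plausible hand calculation, not necessarily the exact code that eval_model() runs internally): the error rate reduction is one minus the ratio of the model's error rate to the base error rate, and the significance can be assessed with a binomial test of the model's correct classifications against the base rate.

# hand calculation of the error rate reduction vs. the ZeroR base rate (sketch only)
base_acc  <- max(table(data_test$creditrisk)) / nrow(data_test)   # 219/300 = 0.73
model_acc <- mean(prediction == data_test$creditrisk)             # 225/300 = 0.75 for OneR

1 - (1 - model_acc) / (1 - base_acc)   # (0.27 - 0.25) / 0.27, roughly 0.074

# is the model significantly better than always predicting the majority class?
binom.test(sum(prediction == data_test$creditrisk), nrow(data_test),
           p = base_acc, alternative = "greater")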
To end this post, we build a random forest model with the randomForest package (on CRAN) on the dataset (for some more information on random forests see Learning Data Science: Predicting Income Brackets):
set.seed(78)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

model <- randomForest(creditrisk ~ ., data = data_train, ntree = 2000)
prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   1   2 Sum
##        1   209  43 252
##        2    10  38  48
##        Sum 219  81 300
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2  Sum
##        1   0.70 0.14 0.84
##        2   0.03 0.13 0.16
##        Sum 0.73 0.27 1.00
## 
## Accuracy:
## 0.8233 (247/300)
## 
## Error rate:
## 0.1767 (53/300)
## 
## Error rate reduction (vs. base rate):
## 0.3457 (p-value = 9.895e-05)
The out-of-sample accuracy is over 80% here and the error rate reduction (compared to ZeroR) of about one third is statistically significant. Yet 80% is still not that impressive when you keep in mind that 70% is the base rate!
You should now be able to spot why this is one of the worst scientific papers I have ever seen: Applications of rule based Classification Techniques for Thoracic Surgery (2015). This also points to a more general problem: although this is a medical topic, not many medical professionals would be able to spot the elephant in the room here… and the same will be true for most other areas where machine learning is used ever more frequently. (Just as an aside: this type of blunder wouldn’t have happened had the authors used the OneR package: One Rule (OneR) Machine Learning Classification in under One Minute.)
As you can imagine, there are many strategies to deal with the above challenges of imbalanced/unbalanced data, e.g. other model metrics (like recall or precision) and other sampling strategies (like undersampling the majority class or oversampling the minority class)… but those are topics for another post, so stay tuned!
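Just to give you a very first taste before that post, here is a minimal sketch of one such strategy, undersampling the majority class, applied to our credit data (the seed and the random forest settings are arbitrary choices for illustration, not a recommendation):

# undersampling the majority class on the training set (sketch only; class weights,
# oversampling, SMOTE etc. are left for the follow-up post)
set.seed(42)
minority <- data_train[data_train$creditrisk == "2", ]
majority <- data_train[data_train$creditrisk == "1", ]
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
data_balanced <- rbind(majority_down, minority)
table(data_balanced$creditrisk)  # both classes now equally represented

model <- randomForest(creditrisk ~ ., data = data_balanced, ntree = 2000)
eval_model(predict(model, data_test), data_test)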