Unbalanced Data Is a Problem? No, BALANCED Data Is Worse
Say we are doing classification analysis with classes labeled 0 through m-1. Let Ni be the number of observations in class i. There is much handwringing in the machine learning literature over situations in which there is a wide variation among the Ni. I will argue here, though, that the problem is much worse in the case in which there is — artificially — little or no variation among those sample sizes.
To simplify matters, in what follows I will take m = 2, with the population class probabilities denoted by p and 1-p. Let Y be 1 or 0, according to membership in Class 1 or 0, and let X be the vector of v predictor variables.
First, what about this problem of lack of balance? If your data are a random sample from the target population, then the lack of balance is natural if p is near 0 or 1, and there really isn’t much you can do about it, short of manufacturing data. (Some have actually proposed that, in various forms.) And with a parametric model, say a logit, you may do fairly well if the model is pretty accurate over the range of X. To be sure, the lack of balance may result in substantial within-class misclassification rates even if the overall rate is low. One can try different weightings and the like, but one is pretty much stuck with it.
But at least in this unbalanced situation, you will get consistent estimators of the regression function P(Y = 1 | X = t), as the sample size grows. That’s not true for what I will call the artificially balanced case. Here the Ni are typically the same or nearly so, and arise from our doing separate samplings of each of the classes. Clearly we cannot estimate p in this case, and it matters. Here’s why.
By an elementary derivation, we have that (at the population level)
P(Y = 1 | X = t) = 1 / (1 + [(1-p)/p] [f(t)/g(t)]) Eqn. (1)
where f and g are the densities of X within Classes 0 and 1, respectively. Consider the logistic model. Equation (1) implies that
β0 + β1 t1 + … + βv tv = -ln[(1-p)/p] - ln[f(t)/g(t)] Eqn. (2)
From this you can see that
β0 = -ln[(1-p)/p], Eqn. (3)
which in turn implies that if the sample sizes are chosen artificially, then our estimate of β0 in the output of R’s glm() function (or any other code for logit) will be wrong: it will reflect the artificial sampling proportions rather than the true p. If our goal is Prediction, this will cause a definite bias. And worse, it will be a permanent bias, in the sense that we will not have consistent estimates as the sample size grows.
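For readers who want the algebra behind (1) and (2) spelled out, here is a compact rendering in LaTeX; q denotes the artificial Class 1 proportion in the balanced sample (a symbol introduced here, not used above). By Bayes' rule,

\[
P(Y = 1 \mid X = t) = \frac{p\,g(t)}{p\,g(t) + (1-p)\,f(t)}
                    = \frac{1}{1 + \frac{1-p}{p}\,\frac{f(t)}{g(t)}}
\]

and taking the logit of both sides under the logistic model,

\[
\beta_0 + \beta_1 t_1 + \cdots + \beta_v t_v
   = \ln\frac{p}{1-p} + \ln\frac{g(t)}{f(t)}
   = -\ln\frac{1-p}{p} - \ln\frac{f(t)}{g(t)} .
\]

Under artificially balanced sampling, p is in effect replaced by q, which changes only the first term on the right side. So only the intercept is biased, by exactly

\[
\ln\frac{1-q}{q} - \ln\frac{1-p}{p},
\]

while the slope coefficients are unaffected.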
So, arguably the problem of (artificially) balanced data is worse than the unbalanced case.
The remedy is easy, though. Equation (2) shows that even with the artificially balanced sampling scheme, our estimates of βi WILL be consistent for i > 0 (since the within-class densities of X won’t change due to the sampling scheme). So, if we have an external estimate of p, we can just substitute it in Equation (3) to get the right intercept term, and then happily do our future classifications.
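Here is a minimal sketch of that remedy in R, on simulated data; all object names here are mine, and the external estimate p.true is assumed given. The key step shifts the fitted intercept by ln[(1-q)/q] - ln[(1-p)/p], q being the artificial Class 1 proportion, which is exactly what substituting the external p into Equation (3) amounts to.

# minimal sketch: correcting the logit intercept after artificially
# balanced sampling; all names are illustrative
set.seed(1)
p.true <- 0.05                 # true P(Y = 1), assumed known externally
n <- 5000                      # per-class sample size in the balanced design

# within-class densities of a single predictor X:
# Class 0 ~ N(0,1), Class 1 ~ N(1,1)
x0 <- rnorm(n, 0, 1)
x1 <- rnorm(n, 1, 1)
dat <- data.frame(x = c(x0, x1), y = rep(0:1, each = n))

fit <- glm(y ~ x, family = binomial, data = dat)

q <- 0.5                       # artificial Class 1 proportion in the sample
b <- coef(fit)
# shift the intercept from the artificial log-odds to the true one;
# the slope estimate is left alone, since it is already consistent
b["(Intercept)"] <- b["(Intercept)"] +
   log((1 - q) / q) - log((1 - p.true) / p.true)

# corrected estimate of P(Y = 1 | X = t) at, say, t = 2
t0 <- 2
1 / (1 + exp(-(b[1] + b[2] * t0)))

Note that with q = 0.5, the shift reduces to simply adding ln[p/(1-p)] to the fitted intercept.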
As an example of where an external estimate of p would come from, consider the UCI Letters data set. There, the various English letters have approximately equal sample sizes, quite counter to what we know about English. But there are good published sources for the true letter frequencies.
Now, what if we take a nonparametric regression approach? We can still use Equation (1) to make the proper adjustment. For each t at which we wish to predict class membership, we do the following (a code sketch follows the list):
- Estimate the left-hand side (LHS) of (1) nonparametrically, using any of the many methods on CRAN, or the version of kNN in my regtools package. Note that this estimate will reflect the artificial class proportions, not the true p.
- Solve (1) for the estimated ratio f(t)/g(t), plugging in the artificial value of (1-p)/p implied by the sampling scheme.
- Plug that ratio back into (1), this time with the correct value of (1-p)/p from the external source, now yielding the correct value of the LHS.
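And here is a minimal sketch of those three steps in R, again on simulated data. A hand-rolled kNN estimate stands in for whichever CRAN method or regtools call you prefer, and all names are illustrative.

# minimal sketch of the three-step nonparametric adjustment
set.seed(2)
p.true <- 0.05                     # externally known P(Y = 1)
n <- 2500                          # per-class size, artificially balanced
x <- c(rnorm(n, 0, 1), rnorm(n, 1, 1))   # Class 0, then Class 1
y <- rep(0:1, each = n)
q <- mean(y)                       # artificial Class 1 proportion (0.5 here)

# step 1: kNN estimate of the LHS of (1) at a new point t, from the
# balanced sample; this estimate reflects q, not the true p
knnEst <- function(t, x, y, k = 50) mean(y[order(abs(x - t))[1:k]])
t0 <- 2
lhs.art <- knnEst(t0, x, y)

# step 2: solve (1) for f(t)/g(t), using the artificial odds (1-q)/q
fg.ratio <- (1 / lhs.art - 1) / ((1 - q) / q)

# step 3: plug back into (1) with the correct odds (1-p)/p
lhs.true <- 1 / (1 + ((1 - p.true) / p.true) * fg.ratio)
lhs.true                           # corrected estimate of P(Y = 1 | X = t0)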