Simpson’s Paradox in a Logistic Regression
Simpson’s paradox is when a trend that is present in various groups of data seems to disappear or even reverse when those groups are combined. One sees examples of this often in things like medical trials, and the phenomenon is generally due to one or more unmodelled confounding variables, or perhaps differing causal assumptions.
As part of a project I was working on, I wanted an example beyond a simple linear regression where one of the model coefficients had a clearly incorrect sign. There are several reasons why unexpected signs might happen: separation or quasi-separation of the data are the obvious ones, but Simpson’s paradox is another possible cause. The original project ended up not needing the example, but since I had it, I thought I’d write it up, as I’ve never seen Simpson’s paradox presented in quite this way before.
Synthetic Example: Weight Loss Trial
This is a problem statement where we would expect all the coefficients of a logistic regression (except the intercept) to be non-negative.
Consider a trial that tests the efficacy of a specific eating regimen (let’s say 16/8 intermittent fasting, which we’ll call `ifasting`) and a specific exercise regimen (a brisk 30-minute walk every day, which we’ll just call `exercise`). The goal (“success”) is to lose at least five pounds by the end of the trial period. We’ve set up three treatment groups, as follows:
- 200 subjects try exercise alone
- 300 subjects try ifasting alone
- 300 subjects try ifasting plus exercise
Prior to the trial, all the subjects led fairly sedentary lifestyles, and weren’t dieting in any formal way.
For these subjects, one might reasonably expect that neither exercise nor ifasting would be less successful for losing weight than doing nothing. One would also reasonably expect that ifasting plus exercise should do no worse than doing either one alone. Therefore, modeling the results of such an experiment as a logistic regression should lead to a model where the coefficients $\beta_{\text{ifasting}}$ and $\beta_{\text{exercise}}$ are both non-negative, as any treatment should increase (or at least, not decrease) the odds that the subject loses weight.
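Concretely, here is a sketch of the intended model, writing $p$ for the probability of success:

$$
\log \frac{p}{1-p} = \beta_0 + \beta_{\text{ifasting}} \cdot \text{ifasting} + \beta_{\text{exercise}} \cdot \text{exercise}
$$

with the domain expectation that $\beta_{\text{ifasting}} \ge 0$ and $\beta_{\text{exercise}} \ge 0$.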
Let’s show an example where our expectations aren’t met. The easiest way to do that is to generate a dataset that has Simpson’s paradox hidden within it.
First, let’s load the packages we need.
```r
library(poorman)  # or dplyr
library(ggplot2)
library(kableExtra)
library(WVPlots)
```
Here’s a function that will generate a specific subset of data, as needed.
```r
# ifasting: 1 if this group fasted, else 0
# exercise: 1 if this group exercised, else 0
# total: total number of subjects in this group
# successes: number of subjects who successfully lost weight
# label: label for the group.
generate_samples = function(ifasting, exercise, total, successes, label) {
  failures = total - successes
  data.frame(ifasting = ifasting,
             exercise = exercise,
             success = c(rep(1, successes), rep(0, failures)),
             label = label)
}
```
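For example, a single call builds one treatment group as a data frame of individual outcomes. The counts below are illustrative placeholders, not the ones used in the study; the `popA` and `popB` data frames modeled later were presumably assembled by `rbind`-ing groups like this for each subpopulation.

```r
# Illustrative only: a group of 100 exercisers, 40 of whom lost weight.
# (Placeholder counts, not the actual study numbers.)
grp = generate_samples(ifasting = 0, exercise = 1,
                       total = 100, successes = 40,
                       label = "exercise alone")
head(grp)
```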
Modelling
Now let’s fit a logistic regression model to try to infer the effects of the various treatments on weight loss. We’ll do it on the whole population first, since that was the original task.
```r
tab_coeff = function(model, caption) {
  coeff = summary(model)$coefficients[, c(1, 4)] |>
    as.data.frame()
  colnames(coeff) = c('Estimate', 'pval')

  # using cell_spec below breaks the digits setting
  # (because of course it does) so round the numbers first.
  coeff = coeff |>
    mutate(Estimate = as.numeric(formatC(Estimate, format="f", digits=3)),
           pval = as.numeric(format(pval, format="g", digits=3)))

  coeff = coeff |>
    mutate(Estimate = cell_spec(Estimate, color=ifelse(Estimate < 0, "red", "black")),
           pval = cell_spec(pval, color=ifelse(pval < 0.05, "darkblue", "darkgray")))

  knitr::kable(coeff, caption=caption)
}

bothpops = rbind(popA, popB)

mAll = glm(success ~ ifasting + exercise,
           data=bothpops,
           family=binomial)

tab_coeff(mAll, "Model coefficients, whole population")
```
|             | Estimate | pval     |
|-------------|----------|----------|
| (Intercept) | -4.038   | 2.72e-11 |
| ifasting    | 4.731    | 1.6e-15  |
| exercise    | -0.147   | 0.392    |
Intermittent fasting has a positive coefficient, meaning intermittent fasting is positively correlated with weight loss success. But exercise has a negative coefficient, implying that exercise is negatively correlated with weight loss, and that doing both together will be less successful than intermittent fasting alone!
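One quick way to read these effects is on the odds scale, by exponentiating the coefficients of the `mAll` fit above:

```r
# Log-odds coefficients become odds ratios after exponentiation;
# a value below 1 means the treatment is associated with lower odds of success.
exp(coef(mAll))
```

Here exp(-0.147) ≈ 0.86, restating the counterintuitive result: the model estimates that exercising multiplies the odds of success by about 0.86.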
And indeed, if we look at the raw summaries, we’ll see that the data bears these inferences out.
```r
df1 = bothpops |>
  group_by(treatment) |>
  summarize(success_rate = mean(success)) |>
  ungroup()

df2 = bothpops |>
  summarize(success_rate = mean(success)) |>
  mutate(treatment = "overall")

rbind(df1, df2) |>
  knitr::kable(digits=3, caption = "Success rates, entire population")
```
| treatment      | success_rate |
|----------------|--------------|
| exercise alone | 0.015        |
| ifast alone    | 0.667        |
| both           | 0.633        |
| overall        | 0.491        |
This is an example of how Simpson’s paradox might manifest itself in a logistic regression model, and it’s due to the unmodelled confounding variable, population type. This, plus some bad luck in the relative sizes of the treatment groups with respect to population type, leads to the above counterintuitive results.
Note that we have reported p-values, and in this case the coefficient for exercise is insignificant (at $p < 0.05$), implying that exercise may not have any notable effect on weight loss. However, we may still see coefficients with counterintuitive signs in arbitrarily large populations, and those coefficients may then appear significant.
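As a sanity check on the confounding story, one could refit with population type included as a covariate. The `pop_type` column below is hypothetical (it isn’t part of `generate_samples()`), but since `bothpops` was built as `rbind(popA, popB)`, it can be reconstructed from the row order:

```r
# Hypothetical check: add population type as a covariate.
# pop_type is an assumed name; we recover it from the rbind order.
bothpops$pop_type = c(rep("A", nrow(popA)), rep("B", nrow(popB)))

mAdj = glm(success ~ ifasting + exercise + pop_type,
           data = bothpops,
           family = binomial)

tab_coeff(mAdj, "Model coefficients, adjusting for population type")
```

In a Simpson’s-paradox situation like this one, adjusting for the confounder should typically move the `exercise` coefficient back toward its expected, non-negative sign.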
Conclusion
Simpson’s paradox is a case where model inference seems to contradict domain knowledge. Usually it is merely a symptom of some combination of omitted-variable bias, unbalanced studies, or a wrong causal specification (or perhaps no such specification). If you take care to look for this effect, it can in fact give clues toward a better analysis.
Nina Zumel is a data scientist based in San Francisco, with 20+ years of experience in machine learning, statistics, and analytics. She is the co-founder of the data science consulting firm Win-Vector LLC, and (with John Mount) the co-author of Practical Data Science with R, now in its second edition.