Simpson’s Paradox in a Logistic Regression


Simpson’s paradox occurs when a trend that is present in several groups of data seems to disappear, or even reverse, when those groups are combined. One often sees examples of this in medical trials, and the phenomenon is generally due to one or more unmodelled confounding variables, or perhaps to differing causal assumptions.

As part of a project I was working on, I wanted an example beyond a simple linear regression where one of the model coefficients had a clearly incorrect sign. There are several reasons why unexpected signs might happen, separation or quasi-separation of the data being the obvious ones. But Simpson’s paradox is another possible cause. The original project ended up not needing the example, but since I had it, I thought I’d write it up; I’ve never seen Simpson’s paradox presented in quite this way before.

Synthetic Example: Weight Loss Trial

Here is a problem statement for which we would expect all the coefficients of a logistic regression (other than the intercept) to be non-negative.

Consider a trial that tests the efficacy of a specific eating regimen (let’s say 16/8 intermittent fasting, which we’ll call ifasting) and a specific exercise regimen (a brisk 30 minute walk every day, which we’ll just call exercise). The goal (“success”) is to lose at least five pounds by the end of the trial period. We’ve set up three treatment groups, as follows:

  • 200 subjects try exercise alone
  • 300 subjects try ifasting alone
  • 300 subjects try ifasting plus exercise

Prior to the trial, all the subjects led fairly sedentary lifestyles, and weren’t dieting in any formal way.

For these subjects, one might reasonably expect that neither exercise nor ifasting would be less successful for losing weight than doing nothing. One would also reasonably expect that ifasting plus exercise should do no worse than doing either one alone. Therefore, modeling the results of such an experiment as a logistic regression should lead to a model where the coefficients for ifasting and exercise are both non-negative, since any treatment should increase (or at least, not decrease) the odds that the subject loses weight.
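For concreteness, the model in question is the usual logistic regression of success on the two treatment indicators (the notation here is mine, not from the original post):

$$
\Pr(\text{success} = 1) \;=\; \frac{1}{1 + \exp\!\big(-(\beta_0 + \beta_{\text{ifasting}} \cdot \text{ifasting} + \beta_{\text{exercise}} \cdot \text{exercise})\big)}
$$

The expectation above is that $\beta_{\text{ifasting}} \ge 0$ and $\beta_{\text{exercise}} \ge 0$, while the intercept $\beta_0$ is free to be negative.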

Let’s show an example where our expectations aren’t met. The easiest way to do that is to generate a dataset that has Simpson’s paradox hidden within it.

First, let’s load the packages we need.

Show the code
library(poorman) # or dplyr
library(ggplot2)
library(kableExtra)
library(WVPlots)

Here’s a function that will generate a specific subset of data, as needed.

# ifasting: 1 if this group fasted, else 0
# exercise: 1 if this group exercised, else 0
# total: total number of subjects in this group
# successes:  number of subjects who successfully lost weight
# label: label for the group.
generate_samples = function(ifasting, exercise, total, successes, label) {
  failures = total-successes
  data.frame(ifasting = ifasting,
             exercise = exercise,
             success = c(rep(1, successes), rep(0, failures)),
             label=label)
}
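
As a quick sanity check (my addition, not in the original post), a single call produces one row per subject, with a 0/1 success column. The group below is a made-up demonstration group, not part of the trial data:

# hypothetical demonstration group: 10 subjects, 4 of whom lost weight
demo = generate_samples(ifasting = 1, exercise = 0, total = 10, successes = 4, "demo")
head(demo)
table(demo$success)  # 6 failures (0) and 4 successes (1)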

Hidden, Unmodelled Population Effects

Now suppose that, unbeknownst to the researchers, there are two subpopulations among the subjects. These two subpopulations respond differently to the exercise and intermittent fasting regimens. So we will generate our synthetic data by population.

Population A

The first population, Population A, responds quite well to intermittent fasting. Let’s create just this population. This next block of code generates population A, prepares the data for plotting, and plots it.

Show the code
popA = data.table::rbindlist(list(
  generate_samples(ifasting=0, exercise=1, total=100, successes=2, "Population A"),
  generate_samples(ifasting=1, exercise=0, total=200, successes=160, "Population A"),
  generate_samples(ifasting=1, exercise=1, total=100, successes=90, "Population A")
))

Show the code
popA$treatment = with(popA, ifelse(ifasting, ifelse(exercise, 'both', 'ifast alone'),
                                   ifelse(exercise, 'exercise alone', 'none')))
popA$treatment = factor(popA$treatment,
                            levels = c('none', 'exercise alone', 'ifast alone', 'both'))
popA$outcome = factor(with(popA, ifelse(success==1, "success", "failure")),
                      levels = c('success', 'failure'))

pvals = c(success='darkblue', failure='gray')

ggplot(popA, aes(x=treatment, color=outcome, fill=outcome)) + 
  geom_bar() +
  scale_fill_manual(values=pvals) + 
  scale_color_manual(values=pvals) +
  ggtitle("Outcomes for population A") + 
  theme(legend.position="bottom")

If we look at the success rates, we will see that (as expected) subjects who both exercise and practice intermittent fasting lose weight more successfully than subjects who do one or the other alone.

Show the code
df1 = popA |>
  group_by(treatment) |>
  summarize(success_rate=mean(success)) |>
  ungroup() 

df2 = popA |>
  summarize(success_rate = mean(success)) |>
  mutate(treatment = "overall") 

rbind(df1, df2) |>
  knitr::kable(digits=3, caption = "Success rates, Population A")
Success rates, Population A

treatment         success_rate
exercise alone    0.02
ifast alone       0.80
both              0.90
overall           0.63

Population B

There is also a second population, Population B, that has what you might call a “stickier” metabolism. For them, intermittent fasting is not quite as effective.

Show the code
popB = data.table::rbindlist(list(
  generate_samples(ifasting=0, exercise=1, total=100, successes=1, "Population B"),
  generate_samples(ifasting=1, exercise=0, total=100, successes=40, "Population B"),
  generate_samples(ifasting=1, exercise=1, total=200, successes=100, "Population B")
))


Show the code
popB$treatment = with(popB, ifelse(ifasting, ifelse(exercise, 'both', 'ifast alone'),
                                   ifelse(exercise, 'exercise alone', 'none')))
popB$treatment = factor(popB$treatment,
                            levels = c('none', 'exercise alone', 'ifast alone', 'both'))
popB$outcome = factor(with(popB, ifelse(success==1, "success", "failure")),
                      levels = c('success', 'failure'))



pvals = c(success='darkblue', failure='gray')

ggplot(popB, aes(x=treatment, color=outcome, fill=outcome)) + 
  geom_bar() +
  scale_fill_manual(values=pvals) + 
  scale_color_manual(values=pvals) +
  ggtitle("Outcomes for population B") + 
  theme(legend.position="bottom")

Again, subjects who both exercise and practice intermittent fasting are the most successful, but overall, success rates in population B are not as good as in population A.

Show the code
df1 = popB |>
  group_by(treatment) |>
  summarize(success_rate=mean(success)) |>
  ungroup() 

df2 = popB |>
  summarize(success_rate = mean(success)) |>
  mutate(treatment = "overall") 

rbind(df1, df2) |>
  knitr::kable(digits=3, caption = "Success rates, Population B")
Success rates, Population B

treatment         success_rate
exercise alone    0.010
ifast alone       0.400
both              0.500
overall           0.352

One of the things to notice about this experimental design is that, even before taking the hidden population types into account, the treatment groups are not balanced. We saw this in the initial statement above (200 subjects try exercise alone; 300 try ifasting alone; 300 try ifasting plus exercise). This is often what happens in experimental trials with human subjects, as subjects may drop out for one reason or another. Unbalanced treatment groups tend to be the norm in retrospective analyses as well.

The imbalance is even more pronounced once we account for the hidden population types.

Show the code
bothpops = rbind(popA, popB)
palette = c('Population A' = '#1b9e77', 'Population B' = '#d95f02')
ggplot(bothpops, aes(x=treatment, color=label, fill=label)) + 
  geom_bar() + 
  scale_fill_manual(values=palette) + 
  scale_color_manual(values=palette) +
  ggtitle("Treatment Group Sizes, with population labels") + 
  theme(legend.position="bottom")

This population imbalance is part of what can cause Simpson’s paradox.
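
To see the counts behind the plot (a small addition of mine, not in the original post), a cross-tabulation of treatment by hidden population label makes the imbalance explicit:

# treatment group sizes broken out by the hidden population label
with(bothpops, table(label, treatment))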

Modelling

Now let’s fit a logistic regression model to try to infer the effects of the various treatments on weight loss. We’ll do it on the whole population first, since that was the original task.

Show the code
tab_coeff = function(model, caption) {
  coeff = summary(model)$coefficients[, c(1, 4)] |>
    as.data.frame()

  colnames(coeff) = c('Estimate', 'pval')

  # using cell_spec below breaks the digits setting
  # (because of course it does) so round the numbers first.
  coeff = coeff |>
    mutate(Estimate = as.numeric(formatC(Estimate, format="f", digits=3)),
           pval = as.numeric(format(pval, format="g", digits=3)))

  coeff = coeff |>
    mutate(Estimate = cell_spec(Estimate, color=ifelse(Estimate < 0, "red", "black")),
           pval = cell_spec(pval, color=ifelse(pval < 0.05, "darkblue", "darkgray")))

  knitr::kable(coeff, caption=caption)

}

bothpops = rbind(popA, popB)
mAll = glm(success ~ ifasting + exercise, data=bothpops, family=binomial)

tab_coeff(mAll, "Model coefficients, whole population")
Model coefficients, whole population

                Estimate    pval
(Intercept)     -4.038      2.72e-11
ifasting         4.731      1.6e-15
exercise        -0.147      0.392

Intermittent fasting has a positive coefficient, meaning intermittent fasting is positively correlated with weight loss success. But exercise has a negative coefficient, implying that exercise is negatively correlated with weight loss, and that doing both together will be less successful than intermittent fasting alone!
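
Equivalently (my addition, not from the original post), exponentiating the coefficients puts them on the odds-ratio scale, where the exercise term comes out below 1:

# coefficients as odds ratios; values below 1 correspond to negative coefficients
exp(coef(mAll))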

And indeed, if we look at the raw summaries, we’ll see that the data bears these inferences out.

Show the code
df1 = bothpops |>
  group_by(treatment) |>
  summarize(success_rate=mean(success)) |>
  ungroup() 

df2 = bothpops |>
  summarize(success_rate = mean(success)) |>
  mutate(treatment = "overall") 

rbind(df1, df2) |>
  knitr::kable(digits=3, caption = "Success rates, entire population")
Success rates, entire population

treatment         success_rate
exercise alone    0.015
ifast alone       0.667
both              0.633
overall           0.491

This is an example of how Simpson’s paradox might manifest itself in a logistic regression model, and it’s due to the unmodelled confounding variable, population type. This, plus some bad luck in the relative sizes of the treatment groups with respect to population type, leads to the counterintuitive results above.

Note that we have reported p-values, and in this case the coefficient for exercise is insignificant (at the 0.05 level), implying that exercise may not have any notable effect on weight loss. However, we may still see coefficients with counterintuitive signs in arbitrarily large populations, and they may then appear significant.
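
Another way to see this (again, my addition) is through approximate confidence intervals for the coefficients; the interval for exercise straddles zero, consistent with its large p-value:

# Wald 95% confidence intervals for the whole-population model
confint.default(mAll)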

Taking the Hidden Confounder into Account

As you might expect, if we are able to identify the confounding variable and account for it, then we can eliminate the paradoxical negative coefficients. To see this, let’s first model the populations separately and look at the model coefficients.

Show the code
mA = glm(success ~ ifasting + exercise, data=popA, family=binomial)

tab_coeff(mA, "Model coefficients, Population A")
Model coefficients, Population A

                Estimate    pval
(Intercept)     -4.703      5.82e-09
ifasting         6.089      1.12e-14
exercise         0.811      0.0316
Show the code
mB = glm(success ~ ifasting + exercise, data=popB, family=binomial)

tab_coeff(mB, "Model coefficients, Population B")
Model coefficients, Population B

                Estimate    pval
(Intercept)     -5.001      1.36e-06
ifasting         4.595      5.97e-06
exercise         0.405      0.103

In both cases, both intermittent fasting and exercise have positive coefficients, indicating that both activities correlate positively with weight loss.

We can also model the entire population, including the population label as a variable.

Show the code
mAllplus = glm(success ~ ifasting + exercise + label, data=bothpops, family=binomial)

Show the code
tab_coeff(mAllplus, "Model coefficients, whole population with population labels")
Model coefficients, whole population with population labels

                     Estimate    pval
(Intercept)          -4.143      1.69e-11
ifasting              5.584      1.15e-19
exercise              0.523      0.0102
labelPopulation B    -1.917      7.65e-20

As expected, both ifasting and exercise now have positive (and significant, at the 0.05 level) coefficients, indicating that both actions increase the probability of weight loss, and doing them both increases it even more.

The model also correctly identifies that subjects of population type B have a (significantly) lower success rate than subjects from Population A.
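
To make the adjusted model’s story concrete (my addition, not in the original post), we can ask it for predicted success probabilities over every treatment and population combination:

# predicted success probability for each treatment/population combination
newdat = expand.grid(ifasting = c(0, 1), exercise = c(0, 1),
                     label = c("Population A", "Population B"))
newdat$predicted = predict(mAllplus, newdata = newdat, type = "response")
newdat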

Conclusion

Simpson’s paradox is a case where model inference seems to contradict domain knowledge. Usually it is merely a symptom of some combination of omitted-variable bias, unbalanced study design, or a wrong (or missing) causal specification. If you take care to look for this effect, it can in fact give clues towards a better analysis.

Nina Zumel is a data scientist based in San Francisco, with 20+ years of experience in machine learning, statistics, and analytics. She is the co-founder of the data science consulting firm Win-Vector LLC, and (with John Mount) the co-author of Practical Data Science with R, now in its second edition.
