Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The latest issue of the American Statistician has a set of thought-provoking point/counterpoint papers on Simpson’s Paradox, with a tie-in to the controversial issue of causality. (I will not address the causality issue here.) Since I have long had my own thoughts about Simpson’s, I’ll postpone the topic I had planned to post this week, and address Simpson’s. Much of the material here will be adapted from my open-source textbook on probability and statistics. I’ll use R code to perform the analysis.
As readers undoubtedly already know, Simpson’s Paradox describes a situation in which variables X and Y are positively related overall, but suddenly become negatively related when conditioned on a third variable Z. (Or vice versa.) The clever title of a medical journal paper, “Good for Women, Good for Men, Bad for People,” neatly sums it up.
I’ve always contended that Simpson’s is not a paradox in the first place, and instead is simply the effect of using variables in the wrong order. I mean this in a sense similar to forward stepwise regression; I’m not a fan of stepwise as a variable selection procedure, but my point is that I regard Simpson’s so-called “Paradox” as simply being a problem arising from adding a minor factor to a model before including a major factor. This is in contrast to forward stepwise regression, which aims to add the most important factors first.
To illustrate this, let’s consider what is probably the most famous example of Simpson’s, the UC Berkeley graduate admissions data. This concerns a lawsuit that asserted that UCB discriminated against women in admissions to graduate school. In short: Though the data provided in the suit showed that male applicants had a 44% acceptance rate, compared to only 35% for women, some UCB statistics faculty showed that when a third variable is added to the analysis, things change entirely; in fact, this more probing analysis indicates that actually women were faring slightly better than men in the admissions process.
So let’s start up R and take a look. Actually, the UCB data is so famous that it is included in R! Let’s load the data:
> ucb <- UCBAdmissions
Our X, Y and Z here will be Admit (“Admitted” or “Rejected”), Gender (“Male” or “Female”) and Dept. In that last case, six departments are included, referred to as A, B, C, D, E and F.
The data is provided as an R table, which is great, as the R function loglin() expects its input in this form. So we can use R’s table capabilities, for instance determining the total number of observations in the data set:
> sum(ucb) [1] 4526
We’ll use the log-linear model methodology. Again see my open-source textbook if you are not familiar with this approach, but here is a quick summary, in the present context:
Let pijk denote the population probability of cell (i,j,k), so that for example p112 is the proportion of applicants who are admitted, male and apply to Department B. The model writes the log of pijk (or of the expected cell count) as an ANOVA-style expression of sums of main effects and interaction terms, the latter representing various kinds of dependencies among the variables.
Now we’ll perform a log-linear model analysis, specifying a model that consists of all terms through second-order:
> llout <- loglin(ucb,list(1:2,c(1,3),2:3),param=TRUE)
Note carefully the argument param = TRUE. To me, one of the most unfortunate aspects of log-linear analysis as it is commonly practiced is that it is significance testing-centric, rather than based on point or interval estimation. The fact that by default parameter estimates are not returned by loglin() reflects that, but at least the option is available, thus requested here. (If one uses glm() and a “Poisson trick,” one can also get standard errors, though with stepwise model fitting they shouldn’t be taken literally.)
Here we have specified the model to include all two-way interaction terms, and by implication all the main effects as well. Let’s look at some of the parameter estimates that result.
The constant term is 4.802457, which gives us something to compare the other parameter estimates to. Here are the main effects:
Admit:
Admitted | Rejected |
-0.3212111 | 0.3212111 |
Gender:
Male | Female |
0.3287569 | -0.3287569 |
Dept:
A | B | C | D | E | F |
0.15376258 | -0.76516841 | 0.53972054 | 0.43021534 | -0.02881353 | -0.32971651 |
These don’t really tell us much, other than again giving us an idea of what is considered large and small among the terms. The Dept main effects, by the way, are telling us which departments tend to have more or fewer applicants, interesting but not very relevant here.
Let’s look at some interaction terms. Since we are interested in admissions, let’s consider the interactions Admit-Gender and Admit-Department.
Admit-Gender:
. | Male | Female |
Admitted | -0.02493703 | 0.02493703 |
Rejected | 0.02493703 | -0.02493703 |
Admit-Dept:
. | A | B | C | D | E | F |
Admitted | 0.6371804 | 0.6154772 | 0.005914624 | -0.01010004 | -0.232437 | -1.016035 |
Rejected | -0.6371804 | -0.6154772 | -0.005914624 | 0.01010004 | 0.232437 | 1.016035 |
What a difference! In terms of association with Admit, the Dept variable relation is much stronger than that of Gender, meaning that most of the estimated parameters are much larger in the former case. In other words, Dept is the more important variable, not Gender.
And, the results above also show that there is a positive Admit-Female interaction, i.e. that women fare slightly better than men.
Let’s look at this again from a “forward stepwise regression” point of view. Instead of fitting a three-variable model right away, we could first fit the two-variable models Admit-Gender and Admit-Dept. For the latter, for instance, we could run
> loglin(ucb,list(c(1,3)),param=TRUE)
The result is not shown here, but it does reveal that Dept has a much stronger association with Admit than does Gender. We would thus take Admit-Dept as our “first-step” model, and then add Gender to get our second-step model (again, “step” as in the spirit of stepwise regression). That would produce the model we analyzed above, WITH NO PARADOXICAL RESULT.
By contrast, if we “entered the variables” (continuing the stepwise regression analogy) in the wrong order, i.e. used Gender for our first-step model and then added Dept to produce the second-step model, we get a direction reversal–women appear to be disadvantaged in the first-step model, but then turn out to be advantaged once run runs the second-step model.
We could make the regression analogy stronger by actually running logistic regressions, with glm(), instead of using the log-linear model (though the parameter estimates and population values would be somewhat different).
Of course, all of this assumes that we are insightful enough to include Dept in our data set in the first place. But in terms of Simpson’s Paradox, the analysis here could go a long way toward dispelling the notion that there is a paradox at all.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.