[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
“Statistics are used much like a drunk uses a lamppost: for support, not illumination.”< o:p>
A.E. Housman (commonly attributed to Andrew Lang)Recently at work I stumbled into a classic example of Simpson’s Paradox (Yule-Simpson Effect). Many people are not familiar with this phenomena, so I thought I would provide a brief example to help shed some light on it. If you are not aware of the problem of confounding or “lurking” variables in your analysis, you can box yourself into an analytical corner. To further understand what is happening in a Simpson’s Paradox check out the following links:< o:p>
In short what is happening is that a trend or association that appears between two different variables reverses when a third variable is included. You will stumble into or at least need to be on the lookout for this effect of spurious correlation when you have unbalanced group sizes, like you often have using observational data. This is what happened in my recent discovery.< o:p>
In the example below, let’s have a look at the UC Berkeley data, or at least the portion of it by Department, as provided in a Wikipedia article. https://en.wikipedia.org/wiki/Simpson%27s_paradox#cite_note-Bickel-11< o:p>
What the data we explore contains is the number of applications and admissions by gender to six different graduate schools. Again, this is just a portion of the data, focusing in on the largest departments.
dpt = c(“A”,“B”,“C”,“D”,“E”,“F”,“A”,“B”,“C”,“D”,“E”,“F”)
app = c(825,560,325,417,191,272,108,25,593,375,393,341)
adm = c(512,353,120,138,53,16,89,17,202,131,94,24)
gen = c(“m”,“m”,“m”,“m”,“m”,“m”,“f”,“f”,“f”,“f”,“f”,“f”)
df = cbind.data.frame(dpt,app,adm,gen)
str(df)
## ‘data.frame’: 12 obs. of 4 variables:
## $ dpt: Factor w/ 6 levels “A”,”B”,”C”,”D”,..: 1 2 3 4 5 6 1 2 3 4 …
## $ app: num 825 560 325 417 191 272 108 25 593 375 …
## $ adm: num 512 353 120 138 53 16 89 17 202 131 …
## $ gen: Factor w/ 2 levels “f”,”m”: 2 2 2 2 2 2 1 1 1 1 …< o:p>
## $ dpt: Factor w/ 6 levels “A”,”B”,”C”,”D”,..: 1 2 3 4 5 6 1 2 3 4 …
## $ app: num 825 560 325 417 191 272 108 25 593 375 …
## $ adm: num 512 353 120 138 53 16 89 17 202 131 …
## $ gen: Factor w/ 2 levels “f”,”m”: 2 2 2 2 2 2 1 1 1 1 …< o:p>
There are a number of ways to demonstrate the effect, but I thought I would give it a go using dplyr. First, let’s group the data by gender (gen) then look at their overall admission rate.
library(dplyr)
by_gen = group_by(df, gen)
summarize(by_gen, adm_rate=sum(adm)/sum(app))< o:p>
summarize(by_gen, adm_rate=sum(adm)/sum(app))< o:p>
## Source: local data frame [2 x 2]
##
## gen adm_rate
## 1 f 0.3035422
## 2 m 0.4602317< o:p>
##
## gen adm_rate
## 1 f 0.3035422
## 2 m 0.4602317< o:p>
Clearly there is a huge gender bias problem in graduate school admissions at UC Berkeley and it is time to round up a herd of lawyers. On a side note, what is the best way to save a drowning lawyer? It’s easy, just take your foot off their head.< o:p>
We can even provide a statistical test to strengthen the assertion of bias. In R, a proportions test can be done via the prop.test() function, inputting the number of “hits” and the number of “trials”.
summarize(by_gen, sum(adm))< o:p>
## Source: local data frame [2 x 2]
##
## gen sum(adm)
## 1 f 557
## 2 m 1192< o:p>
##
## gen sum(adm)
## 1 f 557
## 2 m 1192< o:p>
summarize(by_gen, sum(app))< o:p>
## Source: local data frame [2 x 2]
##
## gen sum(app)
## 1 f 1835
## 2 m 2590< o:p>
##
## gen sum(app)
## 1 f 1835
## 2 m 2590< o:p>
prop.test(x=c(557,1192), n=c(1835,2590))< o:p>
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(557, 1192) out of c(1835, 2590)
## X-squared = 109.67, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.1856333 -0.1277456
## sample estimates:
## prop 1 prop 2
## 0.3035422 0.4602317< o:p>
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(557, 1192) out of c(1835, 2590)
## X-squared = 109.67, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.1856333 -0.1277456
## sample estimates:
## prop 1 prop 2
## 0.3035422 0.4602317< o:p>
We see in the output a high level of statistical significance and a confidence interval with a difference of roughly 13 to 19 percent. However, beware of the analyst proclaiming truths using a 2×2 table with observational data! In this instance, the confounding variable is the department (dpt).< o:p>
In order to have a look at the data by gender and by department, we will once again use the group_by() function, then calculate the admission rates and finally turn it from “long” to “wide” format.< o:p>
by_dpt = group_by(df, dpt,gen)
df2 = as.data.frame(summarize(by_dpt, adm_rate=sum(adm)/sum(app)))
df2< o:p>
df2 = as.data.frame(summarize(by_dpt, adm_rate=sum(adm)/sum(app)))
df2< o:p>
## dpt gen adm_rate
## 1 A f 0.82407407
## 2 A m 0.62060606
## 3 B f 0.68000000
## 4 B m 0.63035714
## 5 C f 0.34064081
## 6 C m 0.36923077
## 7 D f 0.34933333
## 8 D m 0.33093525
## 9 E f 0.23918575
## 10 E m 0.27748691
## 11 F f 0.07038123
## 12 F m 0.05882353< o:p>
## 1 A f 0.82407407
## 2 A m 0.62060606
## 3 B f 0.68000000
## 4 B m 0.63035714
## 5 C f 0.34064081
## 6 C m 0.36923077
## 7 D f 0.34933333
## 8 D m 0.33093525
## 9 E f 0.23918575
## 10 E m 0.27748691
## 11 F f 0.07038123
## 12 F m 0.05882353< o:p>
library(reshape2)
df2_wide = dcast(df2, dpt~gen, value.var=“adm_rate”)
df2_wide< o:p>
df2_wide = dcast(df2, dpt~gen, value.var=“adm_rate”)
df2_wide< o:p>
## dpt f m
## 1 A 0.82407407 0.62060606
## 1 A 0.82407407 0.62060606
## 2 B 0.68000000 0.63035714
## 3 C 0.34064081 0.36923077
## 4 D 0.34933333 0.33093525
## 5 E 0.23918575 0.27748691
## 6 F 0.07038123 0.05882353
## 3 C 0.34064081 0.36923077
## 4 D 0.34933333 0.33093525
## 5 E 0.23918575 0.27748691
## 6 F 0.07038123 0.05882353
With the inclusion of department we now see that females actually have higher admission rates in four of the six departments (A,B,D,F). How can this be? It had to do with rates and the volume of admission to the various departments. Again, the groups were highly unbalanced. Alright, enough of this as it is now time to focus on something important as Ohio State is squaring off against Notre Dame and the NHL Winter Classic!< o:p>
To leave a comment for the author, please follow the link and comment on their blog: Fear and Loathing in Data Science.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.