Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Updated on 2023-01-16: Explains reasoning in confidence intervals to conclude that some parties are similar.
Updated on 2022-08-03: Corrects chances increase in the text for the 2.775 value.
< section id="context" class="level1">Context
Let’s say we are interested in predicting the gender of a candidate for the British General Elections in 1992 by using the Political Parties as a predictor. We have the next data:
library(dplyr) library(tidyr) elections <- tibble( party = c("Tories", "Labour", "LibDem", "Green", "Other"), women = c(57,126,136,60,135), men = c(577,508,496,192,546) ) elections
# A tibble: 5 × 3 party women men <chr> <dbl> <dbl> 1 Tories 57 577 2 Labour 126 508 3 LibDem 136 496 4 Green 60 192 5 Other 135 546
Being the dependent variable a categorical one, we need to propose a Logistic Model.
Let
In this case the Generalized Linear Model matches the probability of success (i.e., the probability of the candidate being a woman if we define that
A good reference for all the mathematical details is McCullag and Nelder, 1983.
< section id="model-specification" class="level1">Model Specification
Before proceeding, we need to reshape the data.
elections_long <- elections %>% pivot_longer(-party, names_to = "gender", values_to = "candidates") %>% mutate( gender_bin = case_when( gender == "women" ~ 1L, TRUE ~ 0L ) ) %>% mutate_if(is.character, as.factor) elections_long
# A tibble: 10 × 4 party gender candidates gender_bin <fct> <fct> <dbl> <int> 1 Tories women 57 1 2 Tories men 577 0 3 Labour women 126 1 4 Labour men 508 0 5 LibDem women 136 1 6 LibDem men 496 0 7 Green women 60 1 8 Green men 192 0 9 Other women 135 1 10 Other men 546 0
To specify a Generalized Linear Model that considers Gender (i.e., 1: female, 0: male) as the response and the Political Party as the predictor, we fit the proposed model in R.
fit <- glm(gender_bin ~ party, weights = candidates, family = binomial(link = "logit"), data = elections_long) summary(fit)
Call: glm(formula = gender_bin ~ party, family = binomial(link = "logit"), data = elections_long, weights = candidates) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.1632 0.1479 -7.864 3.71e-15 *** partyLabour -0.2310 0.1783 -1.296 0.195 partyLibDem -0.1308 0.1768 -0.740 0.459 partyOther -0.2342 0.1764 -1.328 0.184 partyTories -1.1516 0.2028 -5.678 1.37e-08 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 2683.2 on 9 degrees of freedom Residual deviance: 2628.7 on 5 degrees of freedom AIC: 2638.7 Number of Fisher Scoring iterations: 5
Odds Ratios
We can obtain the odds ratio of a women candidate moving from Tories to Liberal Democrats. This corresponds to
From the model, we have
In R, we obtain the odds ratio as a substraction of the estimated coefficients No. 5 and No. 3 for this case. This is,
exp(coef(fit)[5] - coef(fit)[3])
partyTories 0.3602814
Which means that the chances of having a women candidate drop around 65% by moving from Tories to Liberal Democrats.
The opposite exercise would tell us that the chances increase around 177% by moving from Liberal Democrats to Tories.
exp(coef(fit)[3] - coef(fit)[5])
partyLibDem 2.775608
Hypothesis Testing
Consider the following hypothesis:
To test these hypothesis we can estimate the constrats, followed by their exponentials and the respective confidence intervals. The function to use in this case corresponds to the General Linear Hypotheses.
For
library(multcomp) summary(glht(fit, mcp(party = "Tukey")), test = Chisqtest())
General Linear Hypotheses Multiple Comparisons of Means: Tukey Contrasts Linear Hypotheses: Estimate Labour - Green == 0 -0.231049 LibDem - Green == 0 -0.130770 Other - Green == 0 -0.234193 Tories - Green == 0 -1.151640 LibDem - Labour == 0 0.100278 Other - Labour == 0 -0.003145 Tories - Labour == 0 -0.920591 Other - LibDem == 0 -0.103423 Tories - LibDem == 0 -1.020870 Tories - Other == 0 -0.917447 Global Test: Chisq DF Pr(>Chisq) 1 45.75 4 2.773e-09
The global test returns
For the other hypothesis we have
summary(glht(fit, mcp(party = "Tukey")))
Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: Tukey Contrasts Fit: glm(formula = gender_bin ~ party, family = binomial(link = "logit"), data = elections_long, weights = candidates) Linear Hypotheses: Estimate Std. Error z value Pr(>|z|) Labour - Green == 0 -0.231049 0.178269 -1.296 0.689 LibDem - Green == 0 -0.130770 0.176760 -0.740 0.946 Other - Green == 0 -0.234193 0.176391 -1.328 0.669 Tories - Green == 0 -1.151640 0.202841 -5.678 <1e-04 *** LibDem - Labour == 0 0.100278 0.138831 0.722 0.950 Other - Labour == 0 -0.003145 0.138361 -0.023 1.000 Tories - Labour == 0 -0.920591 0.170806 -5.390 <1e-04 *** Other - LibDem == 0 -0.103423 0.136411 -0.758 0.941 Tories - LibDem == 0 -1.020870 0.169230 -6.032 <1e-04 *** Tories - Other == 0 -0.917447 0.168845 -5.434 <1e-04 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Adjusted p values reported -- single-step method)
exp(confint(glht(fit, mcp(party = "Tukey")))[[10]])
Estimate lwr upr Labour - Green 0.7937008 0.4888709 1.2886038 LibDem - Green 0.8774194 0.5426587 1.4186905 Other - Green 0.7912088 0.4898305 1.2780163 Tories - Green 0.3161179 0.1821282 0.5486822 LibDem - Labour 1.1054788 0.7579634 1.6123249 Other - Labour 0.9968603 0.6843630 1.4520516 Tories - Labour 0.3982834 0.2503463 0.6336411 Other - LibDem 0.9017453 0.6223559 1.3065588 Tories - LibDem 0.3602814 0.2274320 0.5707320 Tories - Other 0.3995379 0.2524772 0.6322571 attr(,"conf.level") [1] 0.95 attr(,"calpha") [1] 2.718402
Here, the differences that are not statistically significant reveal that some parties are similar to each other (in the gender dimension), which is the case for Green vs Labour and Labour vs LibDem but not for Greens vs Tories.
In this case we cannot ignore the exponential transformation to the confidence interval. What happens here is that the “zero is not round”. My apologies for the non-technical explanation, but here if the number one is contained in the confidence interval (i.e., one is the “non-round zero”), then the difference between parties is not statistically significant, and therefore the two parties are similar.
< section id="changing-the-reference-factor" class="level1">Changing the Reference Factor
Consider that Green is the reference factor in the previous model. To change the reference, we can use the Tories or any other party.
library(forcats) elections_long$party <- fct_relevel(elections_long$party, "Tories", after = 0L) fit <- glm(gender_bin ~ party, weights = candidates, family = binomial(link = "logit"), data = elections_long) fit
Call: glm(formula = gender_bin ~ party, family = binomial(link = "logit"), data = elections_long, weights = candidates) Coefficients: (Intercept) partyGreen partyLabour partyLibDem partyOther -2.3148 1.1516 0.9206 1.0209 0.9174 Degrees of Freedom: 9 Total (i.e. Null); 5 Residual Null Deviance: 2683 Residual Deviance: 2629 AIC: 2639
Now we can compute the differences again
summary(glht(fit, mcp(party = "Tukey")))
Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: Tukey Contrasts Fit: glm(formula = gender_bin ~ party, family = binomial(link = "logit"), data = elections_long, weights = candidates) Linear Hypotheses: Estimate Std. Error z value Pr(>|z|) Green - Tories == 0 1.151640 0.202841 5.678 <1e-04 *** Labour - Tories == 0 0.920591 0.170806 5.390 <1e-04 *** LibDem - Tories == 0 1.020870 0.169230 6.032 <1e-04 *** Other - Tories == 0 0.917447 0.168845 5.434 <1e-04 *** Labour - Green == 0 -0.231049 0.178269 -1.296 0.689 LibDem - Green == 0 -0.130770 0.176760 -0.740 0.946 Other - Green == 0 -0.234193 0.176391 -1.328 0.669 LibDem - Labour == 0 0.100278 0.138831 0.722 0.950 Other - Labour == 0 -0.003145 0.138361 -0.023 1.000 Other - LibDem == 0 -0.103423 0.136411 -0.758 0.941 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Adjusted p values reported -- single-step method)
Green vs Tories are still different, but the sign is reversed!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.