Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Just before Christmas I blogged about the positive correlation between depression incidence in US counties and their vote for Trump in the 2024 presidential election. In addition to my casual interest in the topic, I used it as a case study in multilevel modelling while adjusting for spatial correlation. I explicitly said that I didn’t think it likely that the depression-vote relationship was a causal link; I suspected that most likely, some underlying variable that caused depression was also related to voting behaviour.
I’m coming back to the issue because on reflection, I have three bits of unfinished business:
- I had a nagging thought that with 50+ counties per state (and hence some degrees of freedom to spare), I perhaps should have allowed random slopes for depression incidence in each state, rather than just random intercepts
- I thought my spatial statistics workaround, of just adding a rubbery mat to space to suck up the spatial autocorrelation between observations, perhaps was a bit slap-dash and I should be as a matter of course modelling the spatial autocorrelation explicitly
- An alert reader, Jonathan Spring, pointed out that in the USA there are very marked racial differences between depression diagnoses and suggested that perhaps in my model the depression incidence was standing in as a proxy for “whiteness”.
Of these, the first two felt like bits of probably-immaterial-to-the-question methodological details that I should polish up, whereas number 3 seemed likely to be the explanation of the whole phenomenon. In other words, race is likely a confounder of the depression-voting relationship, as per this directed acyclic graph:
Which means that if we want to actually understand the causal relationship of depression on voting we would need to control for race in the regression. Now, there are other things we’d need to do too; in particular to identify and control for the various “other factors”. I’m not up for that right now – this is someone else’s job – but I’m interested enough to go part-way into things to at least check out the degree to which race makes the depression relationship go away.
This is a long post and parts of it are likely to be of interest only to people concerned with the minutiae of multilevel modelling with spatial autocorrelation. If you just want to see what including race in the model does to the depression effect on voting for Trump, you can skip down to the final section.
All the code for this post assumes that the code from the previous post has already been run. If you want a version of the whole script that just runs, it’s here in the source repository of the whole blog.
But here’s the code for just that DAG diagram:
OK, on to fixing up the loose ends with my modelling approach.
Move from gam
to gamm
I decided that the simplest material improvement to my approach to the spatial autocorrelation would be to explicitly model the spatial co-variance of the residuals. One way of doing this that is the minimal change from my approach so far would be to move the model-fitting from mgcv::gam()
to mgcv::gamm()
. gamm()
fits models by iterating between calls to nlme::lme()
and a generalized additive model until convergence. Basically this gives us the ability to use the correlation structures for residuals and mixed random and fixed effects of lme()
while still using the splines and response distributions of gam()
. This is what I need as I want to continue modelling my response with a quasi-binomial family, and I am probably going to want to keep my “rubber sheet” nuisance effect over the US space modelled with a two-dimensional spline (s(x, y)
), even while using the correlation features of lme()
.
So the first thing I do is create a model6b
, as similar as possible to the model6
that was the best and final model from the last blog post, but just changes the estimation method. So here it is, a straight transfer from gam()
to gamm()
… which gives us these two different results:
> summary(model6) Family: quasibinomial Link function: logit Formula: per_gop ~ cpe + s(x, y) + s(state_name, bs = "re") Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.7972 20.7488 -0.135 0.893 cpe 15.5908 0.6778 23.001 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(x,y) 28.39 28.94 7.995 <2e-16 *** s(state_name) 49.00 49.00 7.949 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.386 Deviance explained = 44.8% GCV = 3138.5 Scale est. = 3170.7 n = 3107 > summary(model6b$gam) Family: quasibinomial Link function: logit Formula: per_gop ~ cpe + s(x, y) + s(state_name, bs = "re") Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.8878 0.1481 -19.50 <2e-16 *** cpe 15.0166 0.6271 23.95 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(x,y) 25.41 25.41 6.438 <2e-16 *** s(state_name) 39.91 49.00 7.154 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.382 Scale est. = 3005.3 n = 3107
There’s some differences coming from the different estimation methods. The gamm()
model uses less effective degrees of freedom for both the s(x,y)
rubber mat and the s(state_name, bs="re")
state level random intercept. The fixed effect coefficient for cpe
(‘crude prevalence estimate’ of county-level depression) is a little different – 15.0 versus 15.6. But we can see we’re fitting the same model and getting substantively the same results.
Next small change I make is to move the state level random intercept from the gam
specification into lme
. Again, this is an identical model, just changing how the fitting is done:
This is a big performance speed up (which we’re going to need as the models start getting more complex) for materially the same results.
Random slopes
Now I’m ready to take advantage of the move to the gamm
syntax and I let the slopes, not just the intercepts, vary for each state:
…which gives us these results. Note that the Akaike Information Criterion has decreased by 200 from adding the random slopes, suggesting the model is overall worth the increased complexity. The cpe (depression) coefficient has decreased in size but is still quite large and definitely significant.
> AIC(model7$lme, model6c$lme) df AIC model7$lme 9 8605.692 model6c$lme 7 8806.902 > summary(model7$lme) Linear mixed-effects model fit by maximum likelihood Data: data AIC BIC logLik 8605.692 8660.064 -4293.846 Random effects: Formula: ~Xr - 1 | g Structure: pdIdnot Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7 StdDev: 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 Xr8 Xr9 Xr10 Xr11 Xr12 Xr13 Xr14 StdDev: 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 Xr15 Xr16 Xr17 Xr18 Xr19 Xr20 Xr21 StdDev: 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 Xr22 Xr23 Xr24 Xr25 Xr26 Xr27 StdDev: 4.440986 4.440986 4.440986 4.440986 4.440986 4.440986 Formula: ~1 + cpe | state_name %in% g Structure: General positive-definite, Log-Cholesky parametrization StdDev Corr (Intercept) 3.294624 (Intr) cpe 15.833056 -0.985 Residual 50.963709 Variance function: Structure: fixed weights Formula: ~invwt Fixed effects: list(fixed) Value Std.Error DF t-value p-value X(Intercept) -2.099504 0.5204783 3054 -4.033796 0.0001 Xcpe 10.482779 2.5041023 3054 4.186243 0.0000 Xs(x,y)Fx1 0.221830 0.1339667 3054 1.655859 0.0979 Xs(x,y)Fx2 0.134028 0.1735957 3054 0.772070 0.4401 Correlation: X(Int) Xcpe X(,)F1 Xcpe -0.985 Xs(x,y)Fx1 -0.083 0.084 Xs(x,y)Fx2 -0.121 0.126 0.568 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -9.65917765 0.05092882 0.32541129 0.62277480 6.27537901 Number of Observations: 3107 Number of Groups: g state_name %in% g 1 50
We can use this model to make a new version of the final chart from the previous blog post:
It doesn’t look much different. So my hunch on this aspect was right; there’s enough data to justify letting the depression-vote slope vary in each state (ie we get a better model) but it doesn’t change the substantive conclusion.
Better spatial autocorrelation
Now I’m ready to improve the way I’m handling spatial autocorrelation. As I mentioned last time, my approach to this is to model the centroid of each county with an s(x, y)
two-dimensional spline, a sort of rubber mat overlaid over the USA to soak up the things counties have in common with their neighbouring counties. I still think this is much better than nothing, but admit there is still a problem that even after doing this the residuals of counties that are close together will still be correlated. Which means that each observation is not worth as much as if it were truly independent, which means that my inferences are over-confident in their precision.
First I wanted to check out if this hunch was correct. The standard way to look for spatial autocorrelation is via a variogram. I couldn’t get the Variogram()
function in nlme
to return sensible results from the models that had been fit with gamm()
so I had to make one more explicitly with the variogram()
function in gstat
:
… which gets me this:
A variogram works by calculating the distances between each pair of observations, binning these and then assessing how related the pairs of observations (in this case, residuals after the model fitting) are in each bin. It doesn’t do this by a correlation coefficient but by another measure that I haven’t got my head around but is described as ‘half the variance of the differences between all possible points spaced a constant distance apart.’. The important thing is that a score of zero means perfect correlation, and the higher the numbers are the more independent the pairs of observations contained in the bin are. So to interpret the chart above we can say that the counties that are about 20 degrees (of latitude/longitude, ignoring curvature of the earth) apart or closer have some degree of correlation with eachother; once they get to that far apart this measure more or less stablises.
I don’t know why spatial statistics doesn’t just use a good-ole correlation coefficient for this job, but presume there are interesting historical reasons. To check that I wasn’t mangling things, I decided to “roll my own” spatial correlation measure, with:
That gives me this chart, which is more like the sort of thing we use in time series analysis.
I’m pleased it tells a similar story to the variogram (which is more of a black box to me) – the correlation between pairs of counties’ residuals is above zero for counties that are within 6 degrees/units of eachother, and stablises at a small negative number from about 20 units apart onwards.
My conclusion from this is that yes, there is still some spatial autocorrelation that should be taken into account. Particularly for those counties very close together (around 2 degree/units apart or less).
Now, there is a tricky problem with the shape of spatial autocorrelation. Even if you’re prepared to assume (as I am in this case) that it is symmetrical north-south and east-west (not always the case when it depends on eg wind), there are differing shapes of decay in the level of correlation, as a function of distance. Experts can apparently/allegedly judge which method is best by looking at the shape of the variogram, but I feel safest by trying all five correlation structures available to me and choosing the one with the lowest AIC.
That gets me these results:
> sapply(model8, \(m){round(AIC(m$lme), -1)}) [1] 8230 8600 8590 8330 8580 > AIC(model7$lme) [1] 8605.692
We see that all the models that adjust for spatial correlation are better than model7 (which doesn’t), but the models using corExp
and corRatio
are much better than the other three. The corExp
model (model8[[1]]
) is our new best model so far.
If we look at the t statistics for the coefficients of these five models, we see that for the first time our best model has a “non-significant” slope for cpe (ie depression), 1.695. Now, I’m not going to remove a variable from a complex mixed effects model like this because of a non-significant t statistic, but it is definitely note-worthy that the depression effect has become less important once we let it vary by state and do the best correction for spatial autocorrelation:
> sapply(model8, \(m){summary(m$gam)$p.t}) [,1] [,2] [,3] [,4] [,5] (Intercept) -2.031622 -4.036009 -3.987122 -3.494438 -3.952872 cpe 1.695218 4.206850 4.193251 3.746362 4.167077
Also noteworthy though is that the other spatial correlation shapes still leave depression with significant t statistic.
Here’s the full report from the lme
part of the best fit so far:
> summary(model8[[1]]$lme) Linear mixed-effects model fit by maximum likelihood Data: data AIC BIC logLik 8231.487 8291.901 -4105.744 Random effects: Formula: ~Xr - 1 | g Structure: pdIdnot Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7 Xr8 StdDev: 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 Xr9 Xr10 Xr11 Xr12 Xr13 Xr14 Xr15 Xr16 StdDev: 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 Xr17 Xr18 Xr19 Xr20 Xr21 Xr22 Xr23 Xr24 StdDev: 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 6.68243 Xr25 Xr26 Xr27 StdDev: 6.68243 6.68243 6.68243 Formula: ~1 + cpe | state_name %in% g Structure: General positive-definite, Log-Cholesky parametrization StdDev Corr (Intercept) 2.702224 (Intr) cpe 12.092468 -0.99 Residual 66.289425 Correlation Structure: Exponential spatial correlation Formula: ~x + y | g/state_name Parameter estimate(s): range 0.7289188 Variance function: Structure: fixed weights Formula: ~invwt Fixed effects: list(fixed) Value Std.Error DF t-value p-value X(Intercept) -0.889436 0.4390918 3054 -2.0256268 0.0429 Xcpe 3.370112 1.9919865 3054 1.6918346 0.0908 Xs(x,y)Fx1 0.165268 0.1447102 3054 1.1420604 0.2535 Xs(x,y)Fx2 0.000906 0.1909150 3054 0.0047431 0.9962 Correlation: X(Int) Xcpe X(,)F1 Xcpe -0.988 Xs(x,y)Fx1 -0.065 0.066 Xs(x,y)Fx2 -0.131 0.141 0.510 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -5.2880800 0.2592565 0.5963373 1.0331947 4.6052452 Number of Observations: 3107 Number of Groups: g state_name %in% g 1 50
I believe that Formula: ~Xr - 1 | g
bit refers to the s(x,y)
term from the GAM when brought into the linearised LME fit. The range of 0.73 refers to the exponential spatial correlation.
Race
Finally I’m ready to look at the question that’s of most substantive interest – does incuding race in the model make the “depression effect” go away, suggesting that depression is acting as a proxy for whiteness. I sourced data on county characteristics of various sorts from the UN Census Bureau. I considered four candidate variables:
white_alone
which is the proportion of the county that describe themselves as white and not other racewhite_all
which is the proportion of the county that describe themselves as white, whether or not they also are of another race as wellhispanic
proportion of county that describe themselves as hispanichispanic_multi
proportion of county that describe themselves as hispanic and more races in addition
I deliberately left out the proportion of African-Americans in each county, assuming it would be very collinear with some combination of the others. If were seriously interested in how race worked in this election I would probably have included it anyway.
Looking at these four variables (without peeking at their relationship to vote) gives me this pairs plot:
which convinces me I should save a degree of freedom and drop the white_all
variable from my modelling as containing virtually no extra information from the white_alone
variable. Here’s the code to import that data and draw the pairs plot:
Next I wanted to look at the relationship of these race variables to the logit of vote for Trump, to check if a linearity assumption was going to be reasonable. That got me this chart, which seemed linear enough for me (for my purposes):
…produced with this code:
OK so now I fit a whole bunch of different models to be confident that individual decisions from me weren’t going to be leading to my final conclusions. I fit models with many of the combinations of using a rubber mat to deal with spatial autocorrelation, explicitly modelling the spatial autocorrelation, random or fixed slopes (ie varying by state) for cpe
(depression), random or fixed slopes (ie varying by state) for hte race variables. Then I calculated the AIC of all these models and the t statistics for cpe
(depression), which gets me this table, sorted with the best models (lowest AIC) at the top:
model | AIC | rubber_mat | random_cpe_slope | race | random_race_slope | SAC_fix | cpe_t_stat |
---|---|---|---|---|---|---|---|
15 | 5676.498 | 1 | 1 | 1 | 1 | 1 | -1.612598 |
13 | 5904.910 | 1 | 1 | 1 | 0 | 1 | -1.900880 |
10 | 6063.537 | 0 | 1 | 1 | 0 | 1 | -1.581859 |
14 | 6241.311 | 1 | 0 | 1 | 0 | 1 | -4.612077 |
11 | 6416.801 | 0 | 0 | 1 | 0 | 1 | -2.996010 |
12 | 7136.013 | 0 | 0 | 1 | 0 | 0 | 5.347793 |
8 | 8231.487 | 1 | 1 | 0 | 0 | 1 | 1.695218 |
9 | 8246.613 | 0 | 1 | 0 | 0 | 1 | 2.162823 |
Lots of interesting things here, but most important:
- The best models are definitely those that include race
- The best model of all is also the most complex – random slopes for both race and depression, rubber mat, plus addressing the spatial autocorrelation (SAC) explicitly with an exponential correlation structure
- The best models all have a negative sign for
cpe
, indicating that after controlling for race, it is the counties with less depression that were more likely to vote for Trump – a compelling change in narrative from looking at the data without controlling for race. But in the best model of all, depression isn’t very important. - One model that I describe in the code below as “definitely illegitimate because it makes no attempt at all to correct for spatial autocorrelation” is the one that gives the biggest positive effect (as in t statistics) for the depression impact on voting for Trump
I regard this as good evidence that in fact incidence of depression by county wasn’t important in driving the vote for Trump, but that race composition of counties probably was (or at least ‘more likely’ was). Of course, how that mechanism works, and whether ‘whiteness’ here is in fact standing in for something else, is beyond the scope of this blog post (already far too long) to explain.
Here’s the code that fit all those candidate models described above and generates the table:
Here’s the key results from the overall winning model:
> summary(model15$lme) Linear mixed-effects model fit by maximum likelihood Data: data AIC BIC logLik 5676.498 5797.326 -2818.249 Random effects: Formula: ~Xr - 1 | g Structure: pdIdnot Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7 StdDev: 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 Xr8 Xr9 Xr10 Xr11 Xr12 Xr13 Xr14 StdDev: 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 Xr15 Xr16 Xr17 Xr18 Xr19 Xr20 Xr21 StdDev: 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 Xr22 Xr23 Xr24 Xr25 Xr26 Xr27 StdDev: 4.947787 4.947787 4.947787 4.947787 4.947787 4.947787 Formula: ~1 + cpe + white_alone + hispanic | state_name %in% g Structure: General positive-definite, Log-Cholesky parametrization StdDev Corr (Intercept) 1.503864 (Intr) cpe wht_ln cpe 6.544726 -0.761 white_alone 1.055536 -0.437 -0.188 hispanic 1.719388 -0.134 -0.152 0.180 Residual 43.651012 Correlation Structure: Exponential spatial correlation Formula: ~x + y | g/state_name Parameter estimate(s): range 0.8315562 Variance function: Structure: fixed weights Formula: ~invwt Fixed effects: list(fixed) Value Std.Error DF t-value p-value X(Intercept) -2.395289 0.271082 3051 -8.836033 0.0000 Xcpe -1.862160 1.155263 3051 -1.611892 0.1071 Xwhite_alone 3.678207 0.191161 3051 19.241375 0.0000 Xhispanic -0.544371 0.327396 3051 -1.662729 0.0965 Xhispanic_multi 1.751862 4.307237 3051 0.406725 0.6842 Xs(x,y)Fx1 0.202288 0.093536 3051 2.162664 0.0306 Xs(x,y)Fx2 0.429865 0.131173 3051 3.277087 0.0011 Correlation: X(Int) Xcpe Xwht_l Xhspnc Xhspn_ X(,)F1 Xcpe -0.761 Xwhite_alone -0.461 -0.164 Xhispanic -0.178 -0.053 0.161 Xhispanic_multi -0.002 -0.114 0.101 -0.278 Xs(x,y)Fx1 -0.051 0.030 0.028 -0.032 0.157 Xs(x,y)Fx2 -0.121 0.101 0.083 0.000 -0.081 0.350 Standardized Within-Group Residuals: Min Q1 Med Q3 Max -6.1786290 0.2312745 0.4830700 0.8228585 5.5847101 Number of Observations: 3107 Number of Groups: g state_name %in% g 1 50
Looking at that table of fixed effects, we see the white_alone
variable is definitely significant, as is the rubber mat s(x,y)
spatial correlation absorber. So we could tentatively conclude that this suggests that whiteness strongly contributed to vote for Trump; that depression incidence didn’t contribute anywhere near as much; and there was a strong spatial correlation not explained by either whiteness or depression.
Looking for a way to summarise all this I came up with this chart of the partial relationship of whiteness and of depression incidence to vote for Trump; after controlling for the other things in the final model:
That was produced with this code, which involved fitting a new model to produce the residuals after controlling for race but not controlling for depression (once combination that hadn’t yet been done in the frenzy of model-fitting above):
That’s it for today. I hope that’s of interest for someone, either as a messy but realistic modelling strategy case study, or for those interested in the specific issue of depression and the 2024 election.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.