Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Someone asked on Twitter about the characteristics of New Zealand First voters. While some crude conclusions can be drawn from examining votes by location cast and then comparing that with census data, we really need individual level data to answer the question properly. I set myself this task as a motivation for exploring the New Zealand Election Study.
This turned into a fairly lengthy blog post and includes more than 500 lines of code. Skip down to the very end if you just want to see how things turn out for the four main parties and the main conclusions from that. But first here’s the best version of the result with relation to New Zealand First:
Later down there’s a chart comparing this to the voters for the National, Labour and Green parties.
Functionality
Here’s the R packages I needed to do this analysis, plus a few convenience functions I wrote specifically for it.
Sourcing the New Zealand Election Study data
The New Zealand Election Study generously makes its data available for free. The code snippet below assumes the zip file with the SPSS version of the data has been downloaded into the location specified on d: drive; obviously change this to wherever you have it saved.
SPSS data includes both short column names and full verbose questions, so the data frame of metadata called varlab
in the R snippet above looks like this:
Data prep
The sample size is 2,835 and it includes 234 respondents who claim they party-voted for New Zealand First (note that nine of these were identified by the researchers as actually not voting at all; I made the call to count people on the basis of their self-report). I’m interested in a demographic characterisation of the New Zealand First voters compared to the rest of the New Zealand population (rather than, for example, compared to the voters of a single other party) so I’m going to be using a logistic regression with a response variable of 1 if the respondent voted for NZ First and 0 otherwise.
It’s often not appreciated that, on a rule of thumb, you need more data per explanatory variable for logistic regression than a continuous response: twice as much if the response is split 50/50, 10 times as much if it’s split 90/10.
Based on Frank Harrell’s sample size advice in Regression Modeling Strategies I’ve got somewhere between 11 and 23 degrees of freedom to “spend” on explanatory variables; that is 1/10 or 1/20 of the number of cases of the least frequent value of the categorical response variable. I certainly can’t afford to use all 437 columns in the data. Many of those columns are attitudinal variables that are out of scope for today (hope to come back later), but there’s still far too much socio-economic and demographic data to use it all in a model without more data. So I follow Harrell’s general approach of working out how many degrees of freedom I have to spend; which explanatory variables are committed to be in (no sneaky stepwise reductions based on p values); and how to allocate the degrees of freedom to those variables.
My options were restricted by what is available, and with what is available I had to make choices. For example, I collapsed down the nine household income categories into just two: “lower” and “higher”. In some cases, not being an expert in the exactly best data for theoretically interesting questions, I made some fairly arbitrary choices eg to use the answer to “what is your housing status?” to identify owner-occupiers, rather than “do you or any family member own a residence?”
Here’s the code that created the explanatory variables I wanted, a data set that is just a subset of the explanatory variables, the survey weights, and four possible response variables (party vote for the four highest voting parties).
As the last comment in the code above notes, about 29% of rows are missing at least one value. There are three ways of dealing with this:
- knock out all rows missing a value
- explicitly include a “missing” dummy variable for each factor (which means about a dozen or more extra degrees of freedom)
- multiple imputation (creating multiple datasets with different imputed values, reflecting uncertainty in their expected values as well as population randomness – and pooling models estimated from each dataset).
Imputation is definitely the best and I will come back to it, but in the early stages I am knocking out those 29% of the data, for workflow simplicity.
glm versus svyglm versus glm with weights
The NZES data deliberately oversampled some groups (conspicuously, Māori) and used weights to control for this and for the incidental imbalance by sex, age, education, and non-voting (voters were also oversampled).
My first model was deliberately naive and I chose to ignore the weights, something I wouldn’t do but I wanted to see what impact it had. This gave me these results:
When I correctly used the survey weights and estimated the model via Thomas Lumley’s survey
package I got this, slightly differing set of results:
To check this, I also tried the base function glm
with a weights =
argument. The publishers normalised the weights so their mean value is nearly exactly 1 (which wouldn’t be the case with a survey used to estimate population totals for official statistics purposes), so in principle this should be basically ok:
The naive method returns different results from the two that correctly adjust for the weights. In particular, it over-estimates the evidence linking Māori ethnicity and unknown household income to a New Zealand First vote. The two methods that adjust for weights give basically identical results to eachother.
Here’s the code that fitted those first three trial models:
Non-linear age effect
The data provide an age in years for each respondent (except missing values), which unlike the other variables leaves open the possibility of modelling a non-lineary curved relationship between age and the logarithm of the odds of voting New Zealand First. In my standard models I had split age into three categories (18-29, 30-55, and 56+, with 30-55 the reference point) which would allow some crude non-linearity but it is of interest to see if there is something else going on. To test this, I fit a generalized additive model (GAM) with the mgcv
package, which allows the user to specify weights. The results for all the variables of interest were very similar and there are complexities in producing confidence intervals for fixed effects out of a GAM, so I’m not presenting them here. Instead, here’s the relationship between age and voting New Zealand First:
There’s a definite non-linearity here, but it’s not strong or defined enough for me to worry about, and I decided it’s easier to stick to my three simple categories of young, middle and old from here on. It looks like the strong impact really kicks in at about aged 65 rather 55, but I can’t go back and change my category coding just to accommodate that or I’ll be guilty of p-hacking.
Here’s the code for fitting the GAM:
Elastic net regularization
To get a feel for the robustness of the results, I also tried using both the “lasso” and “ridge regression” to shrink coefficient estimates towards zero. Results not shown here but basically all coefficient estimates were reduced to zero in either method. This intrigues me… but a thorough investigation will have to wait for a later post (maybe).
Mixed-effects (multilevel) modelling with a regional electorate effect
I’ve just finished Andrew Gelman’s Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do. One of my key takeaways was that while richer states vote Democrats over Republicans (and poor states vice versa), within each state richer individuals vote Republican over Democrats (and poorer individuals vice versa); and the slopes vary in each state. That is, there’s an interaction effect between region and individual income as explanatory variables, and also quite a noticeable region effect that is independent of individual characteristics.
More generally, Gelman uses multi-level models with a regional (and district, perhaps?) random effect as well as individual randomness, when he can get the data. My first ever analysis of voting behaviour was working with David Charnock when he was building on similar research he’d done with Australian federal politics.
We don’t have anywhere near enough data to look into the interaction effects, but we can look to see if there is a regional voting pattern that can’t be explained by individual characteristics.
The NZES data has too many missing values for Regional Council, but has near-complete data on which electorate each respondent is in. So I fit a multi-level model with an electorate-level random effect. There was a noticeable amount of variation by electorate – standard deviation of 0.33 – in the tendency to vote New Zealand First in a model with no individual explanatory variables. However, this went all the way down to zero when the individual variables were included.
Interestingly, once we control for individual characteristics, we see no evidence of a residual region-based electorate effect in tendency to vote New Zealand First.
As an aside, this doesn’t apply to the tendency to vote National, Labour or Green (ie there is a residual electorate effect for those three parties), but investigating why would take me too far afield.
Code for the multilevel models:
Bootstrap with imputation and recalibration of survey weights each resample
All the various models described above I regarded as basically exploratory. My “real” model and estimation process was always going to involve a bootstrap, and imputation of the missing values in a way that captures the randomness that would have been expected to be seen if we had a full set of data.
Recalibrating survey weights
The bootstrap involves a resample with replacement of a full sized sample from the sample we’ve got. One result of doing such a resample is that the weights will no longer correctly represent the proportions of people of various age, sex, ethnicity and voting status (did or didn’t vote) as the original weighted sample did. Fixing this requires creating a new set of weights for each bootstrap resample.
To calculate those weights, I need some more information on how the original weights were set. I haven’t been able to find (admittedly, after minimal effort) a definitive description of the weighting scheme of the NZES, but it definitely includes ethnicity, age, sex, and voting status; and I thought it included education. In fact, I particularly hope it included education, because a much-awaited review out today on the polling performance leading up to the 2016 US Presidential election pinged the failure of many state polls to weight by education as a major contributor to error (more educated people respond more to surveys; and in that particular year, education was more strongly related to vote than had been the case in previous elections).
We can get a glimpse of the sampling scheme (both deliberate and incidental under and oversampling) by a regression of the published survey weight on our explanatory variables. Looking at this, I’m sure that the weights are taking education into account:
This graphic nicely shows that Maori, voters, older people, and highly educated people were over-sampled and hence have low weights; whereas young people, men and students were under-sampled (probably through lower response rates rather than research design) and hence get slightly higher weights than usual.
Most commonly, in a bootstrap of a complex survey one would calculate a set of “replicate weights”, which have a one-off calibration to the correct population totals using Lumley’s survey
package. However, I wanted to do the calibration for each resample separately, for convenience, because I was going to perform imputation separately for each resample, which adds quite a complication. Imputation depends on the observed data, hence should be performed for each resample to ensure the full randomness of the approach is taken into account.
To be ready for this recalibration, I needed the sum of weights for various marginal totals from the very first sample – these totals will reveal to me the implicit known, correct population totals the original researchers used in their weighting scheme. I won’t be doing it exactly the way they did, becuase I’m using my own simplified versions of age, qualifications, etc.; but it should be close enough for my purposes. Here’s the code that calculates those marginal totals for me, for later use in the bootstrap cycle:
Bootstrap results
Now, I’m ready to run my loop of bootstrap resamples, to get a decent “best possible” estimate of the relationship between demographic explanatory variables and tendency to party-vote New Zealand First, compared to the rest of the New Zealand adult population. Each resample will:
- have a single set of imputed values calculated to replace its NAs, using the
mice
package that is usually used for multiple imputation even though I only want one imputation set per resample.mice
adequately captures the existing observed structure in the data (much better than just imputing average values), as well as the observed randomness. - have its weights recalibrated to the correct marginal totals
- have a model fit to it with that full set of data and new set of weights
The end result has already been shown at the top of this post, but as this is all so long here it is reproduced:
The code that does this exercise took about 90 minutes to run on 7 cores on my laptop:
Multiple imputation without bootstrap
The bootstrapped, multiply-imputed model gives a warm glow of thorough justificationness (if there is such a word) but is expensive in processor time. I was left with the impression that the standard model was fitting ok enough that even analytically calculated confidence intervals are probably good enough – after all, the bootstrap results looked pretty similar to the original non-bootstrap versions with no imputation at all.
To round out the analysis with fitting similar models to all parties, I skipped the bootstrap part and just used the original sample, but with five different sets of imputed values using the core functionality provided by mice
. I didn’t bother to re-calibrate the weights for each set of imputed values because I’m fairly confident changes from the original weights would have been minimal. This means I have five different complete (no NA values) datasets, each of my four models is fit to each dataset, and the five results for each party are pooled using the Barnard-Rubin method.
Here’s the end results (click image to enlarge):
… and the code to do it is pretty simple:
Discussion
While I do a chunk of analysis on voting behaviour (eg see my election forecasts), I’m the wrong person for all sorts of reason to give actual commentary on New Zealand politics. So I’m going to make only the most obvious comments:
- To answer the original question at the beginning of this post, New Zealand First voters were more likely than the general population to be older, born in New Zealand, identifying as working class and male.
- No surprise that people such as home owners, higher income people, European heritage, those who don’t identify as working class, don’t have a trade union member in the household voted for the centre-right National party
- Perhaps more surprising that being born in New Zealand is estimated to make people less likely to vote for National after controlling for education, ethnicity, income etc (in fact this one is surprising enough for me I’d like to see it replicated with fresh data before making too much of it)
- In addition to the variables that would be expected to lead to Labour vote (union membership, lower income, less qualification), being Maori and female are both estimated to point in that direction
- The characteristics of Green voters are an interesting but not that surprising list – one or more of university qualifications, beign a student, part time, union member in the household, living in a city, not identifying as Christian, not of European ethnicity.
So there we go. Relationship of individual socio-economi status on voting behaviour in the 2014 New Zealand General Election. Rather sadly, I don’t see much of this amazing dataset in the academic literature. Maybe I’m looking in the wrong spot.
Corrections and other comments welcomed. Beware – all the above code isn’t checked by anyone except me.
Big thanks to the Electoral Commission for funding the NZES, the NZES researchers themselves, and (as always) everyone behind the R core project and the 16 R packages listed near the beginning of this post.
> thankr::shoulders() %>% + mutate(maintainer = gsub("<.*>", "", maintainer)) %>% + group_by(maintainer) %>% + summarise(no_packages = sum(no_packages)) %>% + arrange(desc(no_packages)) %>% + as.data.frame() maintainer no_packages 1 Hadley Wickham 18 2 R Core Team 12 3 Brian Ripley 4 4 Kirill Müller 3 5 Rich Calaway 3 6 Yixuan Qiu 3 7 Dirk Eddelbuettel 2 8 Jennifer Bryan 2 9 Martin Maechler 2 10 "Thomas Lumley" 1 11 Achim Zeileis 1 12 Adelchi Azzalini 1 13 Baptiste Auguie 1 14 Ben Bolker 1 15 Charlotte Wickham 1 16 David Robinson 1 17 Deepayan Sarkar 1 18 Duncan Temple Lang 1 19 Gábor Csárdi 1 20 James Hester 1 21 Jelmer Ypma 1 22 Jeroen Ooms 1 23 Jim Hester 1 24 JJ Allaire 1 25 Joe Cheng 1 26 Katharine M. Mullen 1 27 Luke Tierney 1 28 Marek Gagolewski 1 29 Peter Ellis 1 30 R-core 1 31 Simon Wood 1 32 Stef van Buuren 1 33 Stefan Milton Bache 1 34 Terry M Therneau 1 35 Trevor Hastie 1 36 Vitalie Spinu 1 37 William Revelle 1 38 Winston Chang 1 39 Yihui Xie 1
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.