[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers.]
Motivation
This thread on Twitter prompted some questions for me about who actually turns up to vote in New Zealand’s elections. With limited time, I can’t answer many of the important questions raised in that thread and the article it critiques. However, I can use the New Zealand Election Study (NZES) to look into the specific question raised about the demographics of people on the electoral roll who fail to vote, and in particular whether income is a significant factor.
To do this I adapted code from an earlier post where I modelled party vote based on individual socio-economic characteristics. For the main analysis, I used “Voted” as the response variable in a regression model with about 20 possible explanatory variables; this lets us see the impact of each variable while simultaneously controlling for the others. Because I was interested in behaviour rather than reasons, I combined into one category those who claimed they “chose not to vote” and those who just “didn’t manage to vote”.
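As a minimal sketch of that setup, the snippet below collapses the two non-voting answers into a single binary response and fits one regression with several explanatory variables. The data is simulated and the column names (`vote_status`, `age`, `male`, `own_home`) are hypothetical stand-ins, not the actual NZES variable names; it also ignores survey weights and missing data, which the Method section deals with.

```r
# Simulated stand-in data; the real NZES variable names and levels will differ.
set.seed(123)
n <- 500
survey <- data.frame(
  vote_status = factor(sample(
    c("Voted", "Chose not to vote", "Didn't manage to vote"),
    n, replace = TRUE, prob = c(0.8, 0.1, 0.1)
  )),
  age      = round(runif(n, 18, 90)),
  male     = rbinom(n, 1, 0.5),
  own_home = rbinom(n, 1, 0.6)
)

# Collapse both non-voting answers into one category: 1 = voted, 0 = did not vote.
survey$voted <- as.integer(survey$vote_status == "Voted")

# One model with several explanatory variables, so each coefficient is
# estimated while holding the others constant.
mod <- glm(voted ~ age + male + own_home, family = binomial, data = survey)
summary(mod)$coefficients
```

Because the simulated response here is unrelated to the predictors, the coefficients should hover around zero; the point is only the shape of the model.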
The NZES sample is drawn from the electoral roll, which means the important question of who gets enrolled in the first place can’t be analysed from it.
Results
To cut to the chase, it’s not clear that household income is a factor. However, so many people did not tell the surveyors their income that in this analysis and in my earlier post I was obliged to code these people separately; in the chart below they show up as “HHIncome don’t know / NA”. Unlike the “lower” and “higher” income groups they are contrasted with, this variable does show up as negatively related to the likelihood of voting, even after controlling for all the other variables in the chart:
To be clear on the interpretation of this chart, the following are characteristics for which there was significant evidence of an association with being more likely to vote:
Own their own house or flat
Someone in the household is a member of a professional association
Live in a city
Work part time (this is the only one that surprised me)
Has a university qualification
Older
Married or with a long term partner
Generally speaking, these are mostly things associated with people who are doing well out of society.
The following factors are associated with being less likely to vote:
Male
Young
Not European
Income not known (ie not told to the interviewer)
I was interested that being Māori did not show up as significantly related to non-voting, above and beyond the general “non-European” factor (remembering that multiple ethnicities are usually allowed in New Zealand surveys and censuses). Assuming the chart in the Twitter thread referred to above is correct, this must mean that the Māori indicator is confounded with some of the other variables, such as being younger, not owning one’s own house or flat, not living in a city, not being European, or not having a university qualification. To check, I re-ran the regression with “being Māori” as the response variable and found that this confounding was indeed present.
All up, this is pretty strong evidence for socio-economic disadvantage being a unifying factor in non-voting behaviour by people on the electoral roll. The fact that the “being Māori” non-voting effect disappears when we control for these other factors probably counts as a real finding of interest.
It’s a shame that we can’t see a clear income effect in itself (other than for the people who don’t report income to surveyors), but income is notoriously difficult to measure in any social science context, so this is not that surprising.
More exploration
I poked around the data just a little more before writing up.
More granular exploration of income
Here is a mosaic plot of the original survey question on income matched to whether the respondent voted in the 2014 election:
Once one is familiar with these charts, they are a powerful way of visualising a two-way cross-tab. They are conceptually related to the basic Chi-square test used to test for independence of the two variables in a cross-tab like this. The cells coloured blue indicate a “surprisingly” high number of people in that cell, relative to the null hypothesis of no relationship between the two variables. Red means a surprisingly low number. The area of each box indicates the number of people in that particular cell of the table. For this “income by voting” plot we see:
there are fewer people in the “don’t know income and did vote” category than would be expected if the two variables were unrelated
there are more people in the “don’t know income” and “chose not to vote” or “didn’t manage to vote” categories than would be expected
people with incomes between $31,000 and $55,000 turned up in “chose not to vote” more than would have been expected
people with incomes between $76,000 and $148,000 were less likely to be in “chose not to vote” than would have been expected
All up, those exploratory findings broadly match an expectation that people with lower incomes didn’t vote, and those with higher incomes did; albeit with some complications in the detail.
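For readers who want to try this kind of chart, here is a rough sketch using only base R. The counts are made up for illustration and are not NZES results, and base `mosaicplot()` is an assumption on my part rather than necessarily the function used for the chart above. The Pearson residuals from `chisq.test()` are the quantities behind the shading: each is (observed - expected) / sqrt(expected) under the hypothesis of independence.

```r
# Made-up counts for illustration only; these are NOT results from the NZES.
tab <- as.table(matrix(
  c( 60, 20, 18,
    300, 30, 25,
    350, 20, 22),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    income = c("Don't know", "Lower", "Higher"),
    voted  = c("Voted", "Chose not to", "Didn't manage")
  )
))

# Chi-square test of independence for the two-way table.
test <- chisq.test(tab)

# Pearson residuals: positive means more people than expected in that cell,
# negative means fewer.
round(test$residuals, 1)

# Mosaic plot with cells shaded by residual (blue = positive, red = negative);
# box area is proportional to the cell count.
mosaicplot(tab, shade = TRUE, main = "Income by voting (illustrative data)")
```

In this toy table the “Don’t know” row has a negative residual in the “Voted” column and positive residuals in the two non-voting columns, which is the same pattern described in the first two points above.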
This mosaic plot gives a more nuanced view of income than my regression, where I had to lump together categories from both variables. For example, a regression that differentiated between those who “chose not to vote” and those who “didn’t manage to vote” would have been interesting, but would have taken us into the world of multinomial responses, which are extremely hard to explain visually and which use up more degrees of freedom from our fairly small sample.
There aren’t many people in many of the cells in this table. With a bit more data, and a preparedness for some modelling complexity, I suspect we’d find an income effect somewhere. To tackle this seriously and with a big enough sample size I’d want to use all the election studies from previous years.
Some attitudinal variables
Here I present without comment some similar graphics comparing voting behaviour to some of the attitudinal questions in this survey:
There’s lots to say here but to do it justice would require engaging much more with the political science literature than I have time for just now.
Method
Around a third of the 2,835 rows of data are missing at least one of the variables I wanted to include in my regression, so I needed to think carefully about my modelling strategy. Choosing a simpler variant of the different methods I tried in my earlier post on party vote, I used:
the survey weights provided by the NZES organizers
multiple imputation by chained equations (with the R mice package), imputing five alternative values for each missing value so we can fit five regressions and pool the results
glm (with a quasi-binomial family to be safe) from the stats package that ships with R, because it plays nicely with mice and my previous experience suggested there wasn’t much to gain by using svyglm from the survey package with this particular dataset.
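The three ingredients above fit together roughly as in the sketch below. It runs on simulated data with hypothetical column names (`age`, `own_home`, `wt`, `voted`) and only two explanatory variables, so it illustrates the workflow rather than reproducing the actual analysis. The `mice()`, `with()` and `pool()` calls are the standard mice workflow; the model formula and the missingness pattern are assumptions made for illustration.

```r
library(mice)

# Simulated stand-in for the survey; names and values are hypothetical.
set.seed(42)
n <- 400
d <- data.frame(
  age      = round(runif(n, 18, 90)),
  own_home = rbinom(n, 1, 0.6),
  wt       = runif(n, 0.5, 2)          # stand-in for the survey weights
)
d$voted <- rbinom(n, 1, plogis(-1 + 0.03 * d$age + 0.5 * d$own_home))

# Knock out some values so there is something to impute.
d$age[sample(n, 60)] <- NA
d$own_home[sample(n, 60)] <- NA

# Proportion of rows missing at least one variable.
mean(!complete.cases(d))

# Five imputed datasets via chained equations.
imp <- mice(d, m = 5, printFlag = FALSE, seed = 42)

# Fit the same weighted quasi-binomial glm to each imputed dataset.
fits <- with(imp, glm(voted ~ age + own_home,
                      family = quasibinomial, weights = wt))

# Pool the five sets of estimates into one using Rubin's rules.
pooled <- pool(fits)
summary(pooled)
```

One practical side effect of the quasi-binomial family is that `glm()` does not warn about non-integer weights, which the plain binomial family does.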
If I had more time and it was more important to me, I would have used survey::svyglm in combination with a bootstrap that encompasses the imputation process, as per the previous post. My experience suggests that this is unlikely to change the result materially.
Code
Here’s the R code that did the analysis. Two small things have changed since my last post using this data, following the upgrade to R 3.4.x:
the foreign package seems to import the SPSS data slightly differently, which required a tweak to some of the code handling factors (on the plus side, I think it preserves more information from the SPSS version in doing so)
the mice package stores the contrasts for factors it used in imputation in a different spot
Both these issues were food for thought and required a small amount of bug hunting.