New Zealand Election Study individual level data
Individual level data is essential to understand voting behaviour
My previous analysis has occasionally come up against the problem that “only individual level data could resolve that”. Since I last wrote that, the New Zealand Election Study data for the 2014 General Election have become available, and this post is my first glance at it. The New Zealand Election Study makes available data on nine general elections (back to 1990) and looks to be a great resource. Some time in the last half dozen years they took a decision to publish it all pro-actively with minimal (nearly zero) gate-keeping, which is a great thing.
Caveat: I have no association whatsoever with the New Zealand Election Study. Any errors are mine and in fact I looked at the actual data for the first time today, so treat anything I say with caution.
How to get the data
The data are available from the New Zealand Election Study website in SPSS format. There is a brief online form to fill in so they can keep track of who is using their data. Because of this, I’m not planning on publishing a copy of the data or incorporating it into the nzelect R package. The download is straightforward with a minimum of red tape (just the one form, no credentials or gate-keeping). To get it into R, the code below will work with just the necessary tweak to wherever the zip file containing the 2014 data has been saved.
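A minimal sketch of that import step, under stated assumptions: the zip is taken to be already extracted, and the file name `nzes_2014.sav` is hypothetical. `read.spss()` from the `foreign` package attaches the SPSS variable labels to the data frame as a `variable.labels` attribute, which can then be pulled out into a one-column data frame. So that the block runs without the download, it falls back to a tiny made-up stand-in carrying the same attribute.

```r
# Hypothetical path: change to wherever the extracted .sav file lives
sav_file <- "nzes_2014.sav"

if (file.exists(sav_file) && requireNamespace("foreign", quietly = TRUE)) {
  nzes_orig <- foreign::read.spss(sav_file, to.data.frame = TRUE)
} else {
  # Made-up stand-in so the rest of the sketch still runs without the data
  nzes_orig <- data.frame(dwtfin = c(0.8, 1.2),
                          ddirtypol = c("No truth", "Some truth"))
  attr(nzes_orig, "variable.labels") <- c(
    dwtfin    = "illustrative label for the final weight",
    ddirtypol = "B12: how much truth in Nicky Hager's Dirty Politics book"
  )
}

# One-column data frame of labels; row names are the SPSS variable names
varlab <- data.frame(label = attr(nzes_orig, "variable.labels"),
                     stringsAsFactors = FALSE)

dim(nzes_orig)
head(varlab)
```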
The data have 2,835 rows and 438 columns. The column names are the variable names from the original SPSS file. In the code above, I extract the more verbose “variable labels” that each column refers to and store them in a data frame with a single column, varlab. The entries in varlab can now be looked up by their row name, which is the original variable name.
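As an illustration of that lookup, here is a small sketch using a one-row toy version of varlab. The label text is the one quoted later in this post; pairing it with the variable name `ddirtypol` is an assumption based on the summary tables at the end.

```r
# Toy stand-in for varlab (one row only)
varlab <- data.frame(
  label = c(ddirtypol = "B12: how much truth in Nicky Hager's Dirty Politics book"),
  stringsAsFactors = FALSE
)

# Look up one label by variable name
varlab["ddirtypol", ]

# Same lookup, naming the column explicitly
varlab["ddirtypol", "label"]

# Search the labels by keyword when the variable name is not known
varlab[grepl("Dirty Politics", varlab$label), , drop = FALSE]
```

The keyword search is handy with 438 columns, since the variable names themselves are terse.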
There’s an exciting range of questions in here, and when I’ve got my head around them I’ll be doing some interesting analysis; particularly once I get efficient code to join it up with the Census data in my nzcensus package. I think that much of the analysis by others to date has been with SPSS, so I’ll publish any useful R code I develop for others to use too.
Weights
The data have weights to correct for:
- deliberate over-sampling (I think of Maori, although I haven’t yet tracked down a definitive description of the sample design)
- accidental (i.e. arising from differing response rates) disproportionate sampling by age, gender and education
- disproportionate representation of voters and non-voters
I’m pretty sure the correct final weight to use is the column dwtfin.
Note - the weight column sums to the sample size (not to the population). This minimises the chance of an SPSS-related disaster. Unless the specialist complex surveys module is paid for and used, SPSS interprets weights as frequencies, and hence will give completely wrong standard errors and confidence intervals if the weights add up to anything other than sample size. I think that they are still wrong when the weights add up to sample size, but not by an order of magnitude! (SPSS’ limitations in this regard were an important reason for moving to R at my work in 2011, but I digress.)
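To make the scaling point concrete, here is a base-R sketch on simulated data. It is an illustration of the general issue, not a reproduction of what SPSS computes internally: `se_freq()` treats the sum of the weights as the number of observations, while `se_design()` is a simple linearisation estimator for a single-stage design sampled with replacement. All names and numbers are made up.

```r
set.seed(123)
n <- 500
y <- rbinom(n, 1, 0.4)      # simulated yes/no answer
w <- runif(n, 0.2, 3)       # simulated unequal weights

w_n   <- w / sum(w) * n     # scaled to sum to the sample size
w_pop <- w_n * 1000         # scaled to sum to a made-up population of 500,000

# Standard error of a weighted proportion when weights are read as
# frequencies: sum(w) plays the role of the number of observations
se_freq <- function(y, w) {
  p <- sum(w * y) / sum(w)
  sqrt(p * (1 - p) / sum(w))
}

# Linearisation standard error treating w as sampling weights
se_design <- function(y, w) {
  p <- sum(w * y) / sum(w)
  z <- w * (y - p) / sum(w)
  sqrt(length(y) / (length(y) - 1) * sum(z ^ 2))
}

c(freq_n     = se_freq(y, w_n),
  freq_pop   = se_freq(y, w_pop),
  design_n   = se_design(y, w_n),
  design_pop = se_design(y, w_pop))
```

The frequency-style standard error shrinks by a factor of sqrt(1000) when the weights are scaled up to the population, whereas the design-based one does not change with the scale of the weights at all.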
Here’s my prep code before doing further analysis. When I get into this for real, this is likely to be much bigger, and I’ll develop special script/s specifically to do data cleaning and tidying. For now, all I do is make a summary variable that groups party vote for some of the smaller parties into an “other party” category; and set up the data with a survey design object using Thomas Lumley’s survey package, which is the go-to point for anyone studying complex (ie weighted) surveys in R.
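A rough sketch of those two steps on toy data, not the actual prep code. The column name `dpartyvote`, the party list and the simulated values are all assumptions made for illustration; `dwtfin` and `vpartysum` are the names used elsewhere in this post. The survey design step is only run if the `survey` package is installed.

```r
# Toy stand-in for the NZES data (simulated, not real responses)
set.seed(42)
toy <- data.frame(
  dpartyvote = sample(c("National", "Labour", "Green", "NZ First",
                        "ACT", "United Future"),
                      size = 200, replace = TRUE,
                      prob = c(0.45, 0.25, 0.12, 0.10, 0.05, 0.03)),
  dwtfin = runif(200, 0.3, 2.5),
  stringsAsFactors = FALSE
)
# Rescale the toy weights so they sum to the sample size
toy$dwtfin <- toy$dwtfin / sum(toy$dwtfin) * nrow(toy)

# Summary variable: keep the larger parties, lump the rest together
main_parties <- c("National", "Labour", "Green", "NZ First")
toy$vpartysum <- ifelse(toy$dpartyvote %in% main_parties,
                        toy$dpartyvote, "Other party")
toy$vpartysum <- factor(toy$vpartysum, levels = c(main_parties, "Other party"))

# Survey design object with sampling weights and no clustering
if (requireNamespace("survey", quietly = TRUE)) {
  toy_svy <- survey::svydesign(ids = ~1, weights = ~dwtfin, data = toy)
  print(survey::svytable(~vpartysum, toy_svy))
}
```

Fixing the factor levels explicitly keeps "Other party" last in any table or plot, rather than wherever alphabetical order would put it.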
New Zealanders in particular will be interested in the answer to this question, presumably asked as a proxy of how closely people follow politics:
Most people who ventured an opinion got it right; about one in seven didn’t know.
Example analysis - perceptions of Nicky Hager’s Dirty Politics book
I was interested to see that one of the questions was:
"B12: how much truth in Nicky Hager's Dirty Politics book"
Non-New Zealanders may wish to read the Wikipedia article on Dirty Politics for context. The book was released a bit over a month before the 2014 election and at times dominated the news leading up to it. It had a number of revelations, critical in particular of the National Party (which was in government prior to the 2014 election, and won again in 2014), focusing on its relations with right wing bloggers. It’s not surprising that political scientists wanted to collect data about voters’ views of the book in relation to voting behaviour.
Complex relations of categorical variables via a mosaic plot
I thought I’d use this as my first bit of familiarisation with the data, and I started with this graphic:
This is a mosaic plot, which is used to show relationships between categorical variables. The columns show individuals’ party vote, and the number of boxes down shows the view of Mr Hager’s book. For example, the large blue rectangle second from the top under the “National” heading represents survey respondents who said they party-voted National, and thought there was “a little truth” in Mr Hager’s book. In addition:
- The size of each rectangle is proportional to the total number of voters estimated to be in that combination of views and voting behaviour.
- The colour of each rectangle indicates how much that combination of variables differs from what would be expected if the two variables (voting, and view of the book) were unrelated. Blue indicates “surprisingly large number” and red indicates “surprisingly small number” - where ‘surprise’ means “not predicted by the null model of no relationship”.
We can see a pattern of views pretty much along party lines. Compared to the null hypothesis of independence (which of course was never plausible in this instance), there are ‘surprisingly’ many National voters who think there is little or no truth, and surprisingly few who think there is some or a lot of truth. The pattern is reversed for Labour, Green and Internet / Mana voters, none of which would really surprise anyone who was following the news at all during the election campaign. Interestingly, there seems no relation between the view of the Dirty Politics book and voting Conservative (warning for overseas readers - the Conservative Party in New Zealand is much smaller and newer than its UK namesake).
Interestingly, there’s a noticeably large number of non-voters in the “some truth” category, which could fit in with a frustration / cynicism / “pox on both your houses” narrative. Relatively few non-voters thought there was “no truth” in Mr Hager’s book.
Here’s the code to make this graphic.
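A minimal sketch of a shaded mosaic plot in base R, using made-up counts rather than the NZES data (the plotting code behind the graphic above may well have differed). With `shade = TRUE`, `mosaicplot()` colours each rectangle by its residual from the independence model, blue for more than expected and red for fewer, and `chisq.test()` exposes the corresponding Pearson residuals as numbers.

```r
# Made-up counts for illustration only
counts <- matrix(c(60, 20, 30,
                   10, 45, 20,
                    2, 25, 10),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(
                   vote = c("National", "Labour", "Green"),
                   view = c("Little or no truth", "Some or a lot of truth",
                            "Don't know")))
tab <- as.table(counts)

# Pearson residuals under the null model of no relationship:
# positive means more respondents than independence would predict
res <- chisq.test(tab)$residuals
round(res, 1)

# Columns are party vote (first dimension), boxes down are views of the book
mosaicplot(tab, shade = TRUE, las = 1,
           main = "Toy example: vote by view of the book")
```

For weighted survey data the table passed to `mosaicplot()` would be a weighted cross-tabulation rather than raw counts.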
... and via bar charts
The mosaic plot is a nice graphic and I often use mosaic plots in exploratory and analytical stages of an analytical project; but rarely is it a good tool for communication to people who aren’t specialists. A better graphic for that purpose is this one, using the same information but in two plots that use a better-known format:
The messages are still there. Perhaps it’s less stark and immediate than the mosaic plot, but it’s got a better chance of general understanding. I’ll use this one for Twitter I think.
Here’s the code to create those two bar charts; main interest will be for those interested in “extreme ggplot2 polishing” practices. Also worth noting - I’m warming to Hadley Wickham’s forcats R package for manipulating factors, very useful in this context for controlling the order that categories are drawn in a plot.
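A stripped-down sketch of one such bar chart, without any of the polishing. It uses the rounded column percentages printed under “Just the numbers” further down, reorders the factor levels in base R, and only draws the plot if `ggplot2` is installed. The layout choices (facets, flipped axes, axis label) are my assumptions rather than a description of the charts above.

```r
views   <- c("No truth", "A little truth", "Some truth", "A lot of truth",
             "Don't know")
parties <- c("No Vote", "Conservative", "National", "NZ First", "Labour",
             "Green", "Internet / Mana Party", "Other party")

# Views of the book within each group of party voters (columns add to ~100)
pct <- matrix(c( 3,  2, 13,  3,  2,  0,  0,  5,
                12, 21, 33, 11,  7,  4,  0, 15,
                24, 19, 20, 30, 33, 26, 18, 20,
                10, 16,  2, 25, 30, 40, 61, 17,
                51, 41, 31, 31, 27, 29, 21, 43),
              nrow = 5, byrow = TRUE,
              dimnames = list(view = views, party = parties))

# Long format: one row per view/party combination
long <- as.data.frame(as.table(pct), responseName = "percent")

# Reverse the level order so "No truth" is drawn at the top after coord_flip()
long$view <- factor(long$view, levels = rev(views))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = view, y = percent)) +
    geom_col() +
    coord_flip() +
    facet_wrap(~party, nrow = 2) +
    labs(x = "", y = "Percentage of each group of voters")
  print(p)
}
```

With forcats installed, `forcats::fct_rev(long$view)` would do the same level reversal in one call.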
Just the numbers
People might want the summarised numbers behind the charts; here they are:
> # Views about Mr Hager's book, within each group of party voters (columns add to 100)
> round(prop.table(xtabs(Freq ~ ddirtypol + vpartysum, data = as.data.frame(dirtypol)), margin = 2) * 100, 0)
                 vpartysum
ddirtypol        No Vote Conservative National NZ First Labour Green Internet /\nMana Party Other party
No truth               3            2       13        3      2     0                      0           5
A little truth        12           21       33       11      7     4                      0          15
Some truth            24           19       20       30     33    26                     18          20
A lot of truth        10           16        2       25     30    40                     61          17
Don't know            51           41       31       31     27    29                     21          43
>
> # Party votes, within each category of views about Mr Hager's book (rows add to 100)
> round(prop.table(xtabs(Freq ~ ddirtypol + vpartysum, data = as.data.frame(dirtypol)), margin = 1) * 100, 0)
                 vpartysum
ddirtypol        No Vote Conservative National NZ First Labour Green Internet /\nMana Party Other party
No truth              10            1       73        4      7     0                      0           5
A little truth        14            3       64        4      8     2                      0           5
Some truth            22            2       29        8     26     8                      1           5
A lot of truth        15            3        5       10     37    20                      4           6
Don't know            32            3       30        6     15     6                      1           7