Forecasts for the 2020 New Zealand elections using R and Stan by @ellis2013nz

free range statistics - R

2 years ago

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

New New Zealand election forecasts

I have finalised the first proper release version of my forecast for the New Zealand general election to be held on 19 September 2020. Here are the predicted probability distributions for number of seats for several of the realistic possible coalitions:

As can be seen, it’s looking very good for Prime Minister Jacinda Ardern and her Labour Party, with a very high probability of governing either alone or in coalition. Unless things change dramatically, the National Party’s chance of forming a government are vanishingly small without forming an unlikely coalition with New Zealand First, and small even then.

It’s interesting to forecast genuine multi-party politics

New Zealand elections are interesting to forecast because of the mixed-member proportional representation system. The first election on this basis was in 1996. Since then, no single party has governed without coalition partners. This is a stark contrast to earlier years of first-past-the-post single-seat electorates. With New Zealand’s single house of parliament, this led to a single party dominating between elections with minimal checks and balances in the formal political system. In contrast, every Prime Minister since 1996 has had to carefully negotiate with their minor coalition partners on an ongoing basis during their term of government.

In terms of forecasting, we have a two-step process:

estimate probability distributions for party votes on the day of election. These have to be based largely on the imperfect measurements we have from polling data. They need to take into account uncertainty about change between “today” and polling day;
convert those probability distributions for votes to seats in Parliament via simulations, and observe the coalition-building possibilities.

Party votes need to be estimated for all parties with a realistic chance of representation in Parliament, making this a much more computationally intensive exercise than forecasting Australian elections. For the Australian House of Representatives, it suffices to estimate a single two-party-preferred number to get a good model of the election outcom. Of course, the proportionaal representation Senate is another matter.

For my forecasts of vote, I use the same Bayesian latent state space model that I used in my 2017 election forecasts. I have dropped the second approach I used as backup in that election, which was a generalized additive model extrapolated forwards. Extrapolating with splines is hard to do right, and I think adds little to a good quality Bayesian model with explicit assumptions.

The specification of the model in Stan is here on GitHub. It takes just 133 lines of code. The ‘parameters’ block of that code contains around 25,000 unseen values that need to be estimated. These include:

the unobserved support for seven parties (including ‘other’) for each of 3,223 days from 26 November 2011 to 19 September 2020 (the actual parameters estimated are the day to day changes). 7 x 3,223 = 22,561.
the measurement error for each of those seven parties for a total of 239 polls. 7 x 239 = 1,673.
a 7 x 6 matrix of the house effects of six pollsters for seven parties. 7 x 6 = 42.
a 7 by 7 symmetrical correlation matrix for the daily changes. 7 x 6 / 2 = 21.
the standard deviation of the daily changes in support for each party. 7 x 1 = 7.
the impact on each party of a change in methodology by one of the pollsters in 2017. 7 x 1 = 7.

Although there are only two or three pollsters currently publishing voting-intention polls in New Zealand today, I estimate the house effects of previously-active pollsters as they are useful for understanding the house effects of those who are still standing. That is, by looking at all the pollsters over time, I get a better estimate of the “true” unobserved voting tendencies and how far out each pollster is on average.

A key assumption here is that the house effect of the tendency to over- or under-estimate by each pollster for each party is constant over time. This is clearly wrong but seems a necessary simplifying assumption that I hope still leaves the model useful.

The conversion of expected party vote into individual seats has to take into account the vagaries of MMP and in particular the seat of Epsom (held by ACT Party who otherwise would not exceed the 5% threshold for representation) and uncertainty about the Māori seats. This is undertaken with the simulate_seats function which is largely unchanged from my 2017 version, with the main change being to how the Māori seats are treated in the light of Labour’s dominance there last election and minimal information to counteract this in 2020.

Incorporating non-polling expectations as a prior distribution

One improvement in my model from my models for either the 2017 New Zealand election or 2019 Australian election is the introduction of a prior expectation of the vote for the Prime Minister’s party, based on what I am loosely calling a “political science model”. What I mean here is the sort of model that predicts vote based on economic performance, scandals, etc. Unfortunately, we have only a small number of elections to use to estimate the distribution of this (because the pre-MMP elections do not seem to me to be useful in understanding).

Here is the history of swings against the incumbent Prime Minister in New Zealand since MMP:

The code that produces this chart is in this script for getting the parameters of the prior for the forecasting model.

To the human eye there’s an interesting pattern here. After their first election, the Prime Minister’s party has a swing to them, then the swing declines until it becomes negative. Obviously each cycle finishes with a swing against their party, because that’s what’s needed for them to no longer be the incumbent next time around. But the positive swing to the PM in their first go as incumbent is interesting.

However, with only eight data points and a time series at that (time series data is not worth as much as data points that are independent of eachother) I don’t think I should presume that pattern will continue. After all, if the unit of analysis is “first term Prime Ministers facing their first election as incumbent”, we only have two data points – Helen Clark in 2002 and John Key in 2011. Nor is there enough data to try to do a regression of pro-PM swing on economic growth or unemployment as I’d like. Instead, I chose as my prior the very flat assumption that the swing against Jacinda Ardern’s party will be drawn from a distribution with a mean of 1.3% and standard deviation of 3.4%. I still have to make a call on the shape of that distribution; it’s clearly not normally distributed so I chose a relatively fat tailed t distribution with 4 degrees of freedom for the shape. This all feels a bit arbitrary, but at least its transparent.

Why 4 degrees of freedom, not 1 as I did in my first run of the model? This distribution is about expressing what I’d expect to see if I didn’t have any polling information at all. And it turns out if you sample from a t distribution with one degree of freedom and scale to the appropriate mean and standard deviation, there are too many values that are too far away from the mean to strike me as plausible, as an observer of election outcomes. So I chose the prior that really reflects my prior.

We can see the impact of that prior expectation in this chart of the estimated latent voting intention and the polls (imperfect measurement of voting intention). The “pull” of the prior can be seen in the downwards jab for the Labour Party between the two recent very positive polls and the expected result on election day:

The big uncertainty around that downwards jab is also clearly shown. I’m pretty satisfied with the prediction intervals there. They’re wide, but that just reflects the lack of polling data in New Zealand.

More updates will follow as polls come in

The model takes a long time to fit – 18+ hours on my laptop. I have several ideas for reducing that, but all of them seem to involve throwing out some information:

start at 2014, rather than 2011
change the grain from daily to weekly estimates of the underlying voting intention
wrap more of the smaller parties (eg ACT and Māori Party) into “other”

I’d do one or more of these if I thought I was going to have to run the model lots with frequently updating polling data. But with so few polls expected, I don’t think I will bother.

I will update this model (and also the nzelect R package that it draws on) as more polls come in.

Key links:

New Zealand 2020 election forecasts (updating regularly)
Source code for the forecasting preparation and model
Source code for the nzelect R package which holds the polling and election results data (there’s also a version on CRAN but it’s out of date at the moment).

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.