Beta Distribution and the NJ U.S. Senate Election

Wesley

8 years ago

[This article was first published on Statistical Research » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The beta distribution is highly flexible distribution and applies to many situations and environments. The beta distribution applies well when there are percentages. The upcoming New Jersey U.S. Senate election on Wednesday fits that criterion quite well. So here I applied the beta distribution to some pre-election polls where the numbers were obtained through the poll aggregator www.realclearpolitics.com.

The candidates for New Jersey election this Wednesday — to fill the vacant seat left by the death of Frank Lautenberg — are Cory Booker and Steve Lonegan. Though there are other third-party candidates running the race it is effectively between Booker and Lonegan. Though more complex models can be used reducing the candidates to two the beta distribution can be applied to these data and the outcomes and a simple simulation can be achieved using the given data.

Some Historical Notes

This general election is on a non-standard Election Day (Wednesday, October 16th). It happens to be the first time that a New Jersey general election has been held on a Wednesday. Aside from the current Republican senator who was appointed by Chris Christie the last time there was a Republican U.S. Senator in New Jersey was back in the early 1980′s and even then he too was appointed to the office.

The Beta Distribution

As can be seen from the elections since 1990 the democratic candidate has won by an average of about 8.9%.

2012 — Menendez: 58.9% v. Kyrillos: 39.4%
2008 — Lautenberg: 55.5% v. Zimmer: 42.5%
2006 — Menendez: 53.3% v. Kean Jr.: 44.3%
2002 — Lautenberg: 53.9% v. Forrester: 44.0%
2000 — Corzine: 50.1% v. Franks: 47.1%
1996 — Torricelli: 52.7% v. Zimmer: 42.6%
1994 — Lautenberg: 50.3% v. Haytaian: 47.0%
1990 — Bradley: 50.5% v. Whitman: 47.4%

Based on recent pre-election polling it looks like Booker will likely win by a similar margin and maybe a little higher than the average of 8.9% and, based on pre-election polls, closer to 12 percentage points. The marginal difference between Booker and Lonegan is distributed as a beta distribution and we can see that the threshold of zero (0) is out in the far tail of the distribution. So based on historical election and current pre-election polling it seems that the likelihood that Booker will win is very high.

Example Code

[sourcecode language=”css”] library(MCMCpack) ## Set up several of the recent polls but will only work with the most recent on raw.1 = NULL raw.1 = data.frame( rbind( Quinnipiac = c(.53,.41,899), RSC = c(.50,.39,729), FD= c(.45,.29,702), Mon = c(.53, .40,571) ) ) raw.1 = rbind(raw.1, c(apply(raw.1[,1:2],2,weighted.mean,raw.1[,3]),sum(raw.1[,3]))) names(raw.1) = c(“Cand1″,”Cand2″,”size”) raw.1$Other.und = 1-raw.1$Cand1-raw.1$Cand2 raw.1.no.und = data.frame(raw.1[5,1:2] + raw.1[5,1:2]/sum(raw.1[5,1:2])*raw.1[5,4],size=raw.1[5,3],Other.und=0) raw = rbind(raw.1, raw.1.no.und) ################################################################### ## More than two candidates so Beta distribution won’t work ## Function to randomly generate data from a dirichlet distribution ################################################################### j= 4 prob.win = function(j,export=1){ p=rdirichlet(100000, raw$size[j] * c(raw$Cand1[j], raw$Cand2[j], 1-raw$Cand1[j]-raw$Cand2[j])+1 ) if(export==1){ mean(p[,1]>p[,2]) } else { return(p) } } ( cand1.win.probs = sapply(1:nrow(raw),prob.win) ) sim.dir = prob.win(4,export=2) ## set simulated data for plotting and determining parameters sim.dir.diff = sim.dir[,1]-sim.dir[,2] ## Get the marginal. From a Dirichlet the is distributed as a Beta. sim.dir = cbind(sim.dir, sim.dir[,1]-sim.dir[,2]) ## The shape parameters (shape1 and shape2) might need some manual adjusting and tweaking. ## In this case I ran the function a few time to set the start value close to the output fit.distr.1 = fitdistr(sim.dir[,1], “beta”, start=list(shape1=302,shape2=270)) fit.distr.2 = fitdistr(sim.dir[,2], “beta”, start=list(shape1=229,shape2=343)) fit.distr.margin = fitdistr(sim.dir[,4], “beta”, start=list(shape1=5,shape2=5)) ## Could also draw a histogram of simulated data curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]), ylim=c(0,20), xlim=c(.3,.6), col=’blue’, lty=1, lwd=2, ylab=”Density”, xlab=”theta”, main=”Distribution of the NJ U.S. Senate Election 2013″, sub=paste(“Probability that Booker beats Lonegan: “, round(cand1.win.probs[6],2) ) ) ## Candidate 1 curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col=’red’, lty=2, lwd=2) ## Candidate 2 abline(v=c(median(sim.dir[,1]), median(sim.dir[,2])), col=c(‘blue’,’red’), lwd=2, lty=c(1,2,3)) legend(“topleft”,c(“Booker”,”Lonegan”), lwd=2, col=c(‘blue’,’red’), lty=c(1,2)) ## Draw a histogram of simulated data hist(sim.dir[,4], nclass=100, main=”Histogram of the Candidate Differences”, xlab=”Candidate Difference”) abline(v=0, col=c(‘black’), lwd=2, lty=c(1)) [/sourcecode]

To leave a comment for the author, please follow the link and comment on their blog: Statistical Research » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.