Exploring Your Voter File with R
Introduction
In this post I’ll perform an analysis of the voter file for Wake County, NC. I’ll follow the same processes as a political consultant, but I’ll be using R instead of traditional tools. The data and code for this post are all available on my GitHub repository, so a reader should be able to follow along and expand on this analysis.
What is a voter file?
A voter file is a list of every registered voter for some political area, like a county or a congressional district. The voter file is purchased from state parties, 3rd party vendors, PACs, or municipal offices. The file usually contains the name, address, and voting history of every registered voter. Campaigns can purchase extra data (“appends”) like phone numbers, email addresses, demographic information, or organization membership information. All these data are used to sort, group, and prioritize contact of a voter.
Who uses a voter file?
The voter file is used by candidates for political office or issue campaigns to understand their target constituents. A good analysis of the voter file will reveal strategic demographic and turnout information that may otherwise be hidden. Armed with this information a campaign will assemble a voter contact plan and build turnout projections, both of which are essential campaign processes. A candidate without a strong understanding of the demographic and turnout information for their election will most likely waste campaign resources talking to the wrong people.
How would this normally be done?
- A small campaign will probably use MS Excel or Access to perform a precinct analysis and build lists and counts.
- For larger campaigns, CRM systems (NGPVAN, Aristotle) all offer cross-tabs, lists, and counts, but are primarily used for contact and fundraising compliance.
- Very sophisticated campaigns will use something like the Q-Tool from Catalist. This is a voter analysis tool providing data-mining and modeling capabilities, along with the standard counting. Q-Tool is extremely impressive.
Why use R?
From the R-project home page: R is a free software environment for statistical computing and graphics. The R programming language allows a wide range of analysis and visualization not available in traditional political tools. Data manipulation is easier on the messy and disjoint data we deal with in political analysis. The visualization tools (like ggplot2 and lattice) let us easily generate graphics suited to our exact data. Finally, R has excellent support for basic political statistics like clustering and regression analysis, to say nothing of more advanced statistical tools like multilevel modeling and simulation.
The Voter File
We will be exploring the voter file for Wake County, NC. Besides being free and updated regularly, this voter file has several unique features including voter gender, party registration, and absentee voting information. This level of data is almost unheard of in free lists, so our analysis will be very realistic. Wake County has both a strong absentee and vote by mail program, as well as early voting and same day registration. Wake County also allows voters not affiliated with a major party to vote in primary elections, but for only one primary per election season. The Wake County Board of Elections is very far above the average with elections administration and data disbursement in all areas save one. Why would they distribute the voter file as a self-extracting zip archive?
The toolkit
I’ll be using the Wake County voter file downloaded from WakeGOV.com on Nov 21, 2011. I’ll be using version 2.14 of R, along with the following packages: plyr, ggplot2, gmodels, and RColorBrewer. All code and a slightly trimmed-down voter file can be downloaded from my GitHub repository.
Cross tabs
The initial analysis will focus on understanding the demographic makeup and voting history of registered voters in Wake County. We’ll do this with simple counts (cross-tabs) on different segments of the file.
Voting Status
CrossTable(vf$status,prop.c=F,prop.chisq=F,format="SPSS",max.width=10)

Total Observations in Table:  595741

|    Active |  Inactive |
|-----------|-----------|
|    533011 |     62730 |
|   89.470% |   10.530% |
|-----------|-----------|
Our first cross-tab tells us that 10.5% of the voters on our list are inactive. Wake County considers a voter inactive if mail has been returned from their address. Determining voter status ahead of time is usually an expensive or impossible task, but Wake County has helpfully done this for us. Without this information a campaign would waste money sending mail or door knocking at the wrong address.
Party Affiliation
CrossTable(vf$party,prop.c=F,prop.chisq=F,format="SPSS",max.width=10)

|       DEM |       LIB |       REP |       UNA |
|-----------|-----------|-----------|-----------|
|    246641 |      1577 |    178676 |    168847 |
|   41.401% |    0.265% |   29.992% |   28.342% |
|-----------|-----------|-----------|-----------|
The party breakdown in Wake County shows an 11 point Democratic registration advantage (41.4% to 30.0%), which is a solid lead. The 28% of voters who aren’t affiliated with a party could make for some very close elections as each major party aggressively courts the unaffiliated.
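As a quick check on the registration advantage, here is a minimal sketch in base R that recomputes the shares and the Democratic margin directly from the counts in the cross-tab. The variable names are mine and are not part of the original analysis.

```r
# Party registration counts copied from the cross-tab above
party.counts <- c(DEM = 246641, LIB = 1577, REP = 178676, UNA = 168847)

# Share of all registered voters, in percent
party.pct <- 100 * prop.table(party.counts)
round(party.pct, 3)

# Democratic registration advantage over Republicans, in percentage points
dem.advantage <- party.pct[["DEM"]] - party.pct[["REP"]]
round(dem.advantage, 1)
```

Working in percentage points rather than raw counts makes it easier to compare this margin against other counties or districts of different sizes.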
Gender
CrossTable(vf$gender,prop.c=F,prop.chisq=F,format="SPSS",max.width=10)

|    Female |      Male |   Unknown |
|-----------|-----------|-----------|
|    316958 |    273696 |      5087 |
|   53.204% |   45.942% |    0.854% |
|-----------|-----------|-----------|
The gender cross-tab tells us a majority of Wake County voters are female, and by a healthy 7 point margin (about 16% more women than men are registered). This information will inform every level of communication by the campaign.
Age Group
CrossTable(vf$age.group,prop.c=F,prop.chisq=F,format="SPSS",max.width=10)

Total Observations in Table:  595741

|   [17,30) |   [30,40) |   [40,50) |   [50,60) |   [60,70) |
|-----------|-----------|-----------|-----------|-----------|
|    114391 |    122686 |    132329 |    108995 |     67758 |
|   19.201% |   20.594% |   22.213% |   18.296% |   11.374% |
|-----------|-----------|-----------|-----------|-----------|

|   [70,80) |   [80,90) |  [90,100) | [100,110) | [110,120) |
|-----------|-----------|-----------|-----------|-----------|
|     30337 |     15469 |      3599 |       172 |         5 |
|    5.092% |    2.597% |    0.604% |    0.029% |    0.001% |
|-----------|-----------|-----------|-----------|-----------|
The Wake County file has age information, which we’ve binned into roughly 10-year buckets. As much as gender, the age of registered voters will play a role in determining the policy goals and communication methods used by a campaign. For example: a precinct with many older voters is a good candidate for an afternoon or early evening canvass since many of your targets will be home. But a precinct on a college campus or otherwise full of younger voters may be better contacted through alternative channels like social media or email.
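The bucket labels in the table, such as [17,30), look like the output of base R's cut() with left-closed intervals. The sketch below shows one way such bins could be built. The ages are made up for illustration, and the exact breaks used on the real file are an assumption on my part.

```r
# Hypothetical ages, for illustration only (not taken from the voter file)
ages <- c(17, 29, 30, 45, 69, 70, 101)

# Left-closed, right-open bins: [17,30), [30,40), ..., [110,120)
age.breaks <- c(17, seq(30, 120, by = 10))
age.group <- cut(ages, breaks = age.breaks, right = FALSE)

# Count how many of the example ages land in each bucket
table(age.group)
```

With right = FALSE a 30-year-old falls in [30,40) rather than the youngest bucket, which matches the way the labels in the cross-tab read.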
Party affiliation, by Gender
CrossTable(vf$gender,vf$party,prop.c=F,prop.chisq=F,format="SPSS",max.width=10)

|-------------------------|
|                   Count |
|             Row Percent |
|           Total Percent |
|-------------------------|

             | vf$party
   vf$gender |       DEM |       LIB |       REP |       UNA | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
           F |    146512 |       627 |     88295 |     81524 |    316958 |
             |   46.224% |    0.198% |   27.857% |   25.721% |   53.204% |
             |   24.593% |    0.105% |   14.821% |   13.684% |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           M |     98497 |       932 |     89645 |     84622 |    273696 |
             |   35.988% |    0.341% |   32.753% |   30.918% |   45.942% |
             |   16.534% |    0.156% |   15.048% |   14.204% |           |
-------------|-----------|-----------|-----------|-----------|-----------|
           U |      1632 |        18 |       736 |      2701 |      5087 |
             |   32.082% |    0.354% |   14.468% |   53.096% |    0.854% |
             |    0.274% |    0.003% |    0.124% |    0.453% |           |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |    246641 |      1577 |    178676 |    168847 |    595741 |
-------------|-----------|-----------|-----------|-----------|-----------|
The first two-way cross-tab is somewhat intimidating at first but is easy enough to read with a key. Gender is along the left side, and party affiliation is along the top. The three numbers in each box represent the raw count of voters, the percentage of voters in that row, and the percentage of voters overall. The first box tells us that there are 146,512 registered female Democrats, that Democrats make up 46% of female voters, and that female Democrats are 24.6% of the total electorate.
This cross-tab is full of useful information that would be invaluable for campaign planning in Wake County: The Democratic registration advantage is 18 points for women, but only 3 points for men. Men are more likely to be unaffiliated than women; indeed the unaffiliated voters almost make up a 3rd party by themselves. The Wake County Board of Elections says unaffiliated voters may still vote in a partisan primary, but only one party’s primary per election. This quirk will make the unaffiliated voters huge targets of opportunity during primary elections.
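To make the three numbers in each box concrete, the sketch below rebuilds the row and total percentages from the raw counts using base R's prop.table(). The counts are copied from the table above. The object names are mine, and this is only an illustration of the arithmetic, not how CrossTable works internally.

```r
# Gender-by-party counts copied from the two-way cross-tab above
counts <- matrix(
  c(146512, 627, 88295, 81524,
     98497, 932, 89645, 84622,
      1632,  18,   736,  2701),
  nrow = 3, byrow = TRUE,
  dimnames = list(gender = c("F", "M", "U"),
                  party  = c("DEM", "LIB", "REP", "UNA"))
)

# Row percent: share of each party within a gender
row.pct <- 100 * prop.table(counts, margin = 1)

# Total percent: share of each cell within the whole electorate
total.pct <- 100 * prop.table(counts)

round(row.pct, 3)
round(total.pct, 3)

# Democratic minus Republican registration, in points, within each gender
dem.gap <- row.pct[, "DEM"] - row.pct[, "REP"]
round(dem.gap, 1)
```

The row percentages answer "what does this group look like?", while the total percentages answer "how big is this group?", which is why a campaign usually wants both.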
Graphics
As we’ve shown, cross-tabs are very powerful tools but can quickly cause information overload. This is where data visualizations (a fancy word for charts) come in. We’ll use the powerful ggplot2 library to do some quick visualizations on the Wake County voter file. There are many other R packages for data visualization, but the ggplot2 library works very well for our purposes.
Voter Registration, by Age Group
# qplot() has no `type` argument; geom="bar" draws one bar of counts per age group
qplot(age.group,data=vf,geom="bar",main="Wake County Registered Voters, by Age Group")
This histogram represents the same data as the Age Group cross-tab from above, but it's much easier to compare age groups and understand the overall distribution of voters in Wake County.
2010 Turnout, by Age Group
# keep only voters registered by the 2010 general election, then split bars by whether they voted
qplot(age.group,data=vf[vf$regdate <= "2010-11-04",],geom="bar",
      fill=g2010.v,position="dodge",
      main="Wake County 2010 Turnout, by Age Group") +
  scale_fill_brewer(name="Voted 2010", palette="Set1")
In the previous plot we saw what looked like pretty even registration among voters, with 40-49 having the highest numbers. But this chart shows us turnout between the age groups in 2010 is very divergent. Turnout for the 17-29 age group was dismal, 30-39 slightly better. For voters older than forty and younger than eighty, turnout was always greater than 50%. A candidate that needs to win the youth vote will have their work cut out for them in Wake County, if 2010 is any indication of the norm.
Turnout in 2010 vs 2008, by Gender
# one row per gender and election: total registered and number who voted
gender.turnout <- ddply(vf,"gender",function(x) {
  data.frame(total=nrow(x),
             turnout=c(sum(x$g2010.v),sum(x$g2008.v)),
             election=c("2010","2008"))
})

# the bar heights are already computed, so use stat="identity"
qplot(gender,turnout / total, data=gender.turnout, geom="bar",
      stat="identity",fill=election,position="dodge",
      main="Turnout in 2010 vs 2008 Wake County, by Gender") +
  scale_fill_brewer(name="Election cycle",palette="Set1")

  gender  total turnout election
1      F 316958  146812     2010
2      F 316958  227910     2008
3      M 273696  127999     2010
4      M 273696  188888     2008
5      U   5087    1470     2010
6      U   5087    2157     2008
Previously we saw women were much more likely to be Democrats than Republicans, but that didn't tell us anything about their turnout propensity. This chart shows us turnout percentages by gender for the 2008 and 2010 general elections. We see turnout was much higher for both men and women in 2008 than in 2010. Women also turned out at a higher rate than men in 2008 (72% to 69%), but the two were close to par in 2010. While both genders turned out at between 46% and 47% in 2010, almost 19,000 more women turned out than men due to the registration disparity.
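The rates and raw vote gaps can be recomputed from the summary table with a few lines of base R. This is a sketch that retypes the six rows of output shown above rather than rerunning ddply on the full file. The rate and vote.gap names are my own.

```r
# The gender.turnout summary, retyped from the output above
gender.turnout <- data.frame(
  gender   = rep(c("F", "M", "U"), each = 2),
  total    = rep(c(316958, 273696, 5087), each = 2),
  turnout  = c(146812, 227910, 127999, 188888, 1470, 2157),
  election = rep(c("2010", "2008"), times = 3)
)

# Turnout as a percentage of registered voters
gender.turnout$rate <- round(100 * gender.turnout$turnout / gender.turnout$total, 1)
gender.turnout

# Raw vote gap between women and men in each election
f <- subset(gender.turnout, gender == "F")
m <- subset(gender.turnout, gender == "M")
vote.gap <- setNames(f$turnout - m$turnout, as.character(f$election))
vote.gap
```

Looking at rates and raw gaps side by side shows how a near-identical turnout rate can still translate into many more votes from the larger group.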
Change in turnout by precinct from 2008 to 2010
Until now we've been comparing very simple counts that could easily have been done with cross-tabs. The next chart plots turnout percentage by precinct from 2008 against turnout percentage in 2010. With 200 precincts we would be hard pressed to visualize these data without some sort of chart.
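# turnout rate per precinct for each election, plus the number of registered voters
precinct.turnout <- ddply(vf, "precinct", summarize,
  turnout2010=sum(g2010.v) / length(g2010.v),
  turnout2008=sum(g2008.v) / length(g2008.v),
  reg2010=length(g2010.v)
)

# scatter 2010 against 2008 turnout; the diagonal marks equal turnout in both years
qplot(turnout2010, turnout2008, data=precinct.turnout,
      xlim=c(.1,1),ylim=c(.1,1),
      main="Turnout percentage 2008 to 2010, by Precinct") +
  geom_abline(intercept=0,slope=1)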
This chart shows turnout percentage in 2008 along the vertical axis, turnout percentage for 2010 along the horizontal axis, and a line representing equal turnout in both. Points above the line had higher turnout in 2008 than in 2010, while points below the line had lower turnout in 2008. None of the points are below the line, meaning every precinct turned out lower in 2010 than in 2008. We knew turnout fell overall, but we didn't know if the decrease was uniform across precincts. Now we know the decrease wasn't uniform, and there is clearly a relationship between turnout in 2008 and turnout in 2010.
Given this type of relationship we will fit a simple linear regression against the data and see if we can quantify change further:
# fit the turnout with a simple linear model
> lm(turnout2008~turnout2010,data=precinct.turnout)

Call:
lm(formula = turnout2008 ~ turnout2010, data = precinct.turnout)

Coefficients:
(Intercept)  turnout2010
     0.4124       0.6216

# summary() gives us an R-squared of 0.84

# plot the same graph with a line of best fit
qplot(turnout2010, turnout2008, data=precinct.turnout,
      xlim=c(.1,1),ylim=c(.1,1),
      main="Turnout percentage 2008 to 2010, by Precinct") +
  geom_abline(intercept=0,slope=1) +
  geom_abline(intercept=0.4124,slope=.62)
In the R code above we fit a simple linear regression to the precinct turnout data, which gave us an intercept and a single regression coefficient. We used these values to plot a line of best fit on the same graph. From this chart we see that the relationship between 2008 and 2010 turnout is close to linear for most precincts. Because the slope is less than 1, the fitted drop from 2008 to 2010 is larger in precincts with low 2010 turnout than in precincts with high 2010 turnout.
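To see what the fitted line implies, the sketch below plugs a few 2010 turnout rates into the reported intercept and slope. The three example turnout values are arbitrary choices of mine, and the result is only what the fitted line predicts, not observed precinct data.

```r
# Coefficients reported by lm() above
intercept <- 0.4124
slope     <- 0.6216

# Arbitrary example 2010 turnout rates, for illustration only
turnout2010 <- c(0.30, 0.45, 0.60)

# 2008 turnout predicted by the fitted line, and the implied drop to 2010
predicted2008 <- intercept + slope * turnout2010
fitted.drop   <- predicted2008 - turnout2010

data.frame(turnout2010, predicted2008, fitted.drop)
```

Under this fit the implied drop shrinks as 2010 turnout rises, which is one way to read a slope below 1.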
Democratic registration % and precinct size by 2010 turnout
This final graph is a complicated one. We'll look at Democratic voter registration and turnout in 2010 by precinct. A campaign would use this chart to find precincts with high Democratic registration but low 2010 turnout. A precinct with these characteristics will be near the top of a target list for a Democratic candidate to canvass. Finding precincts like these is a valuable part of the precinct analysis process that drives a campaign plan.
# use ddply to summarize registration and turnout by precinct
dem.reg.prec <- ddply(vf, "precinct", summarize,
  registered=length(status),
  turnout2010=sum(g2010.v),
  dem.pct=sum(party == "DEM") / length(party)
)

# now plot it
qplot(dem.pct,turnout2010/registered,data=dem.reg.prec,alpha=I(0.8),
      main="Democratic registration percentage and 2010 turnout\n by precinct",
      xlab="Democratic Registration%", ylab="Turnout 2010 (All Parties)",
      xlim=c(0,1), ylim=c(0,1))
Points in the upper left side of the graph represent precincts with high turnout and low Democratic registration - those are precincts we'll want to ignore since they most likely voted for our opponents. Precincts in the lower right side mean low turnout and high Democratic registration - those are the precincts we'd want to target most heavily for turnout efforts in 2012. Precincts with turnout below 50% with a Democratic registration percentage of 50 or greater will probably be next on the list for canvassing efforts.
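The same cut can be turned into an actual target list. The sketch below uses a small made-up data frame with the same columns as dem.reg.prec, since the real precinct summary isn't reproduced here. The 50% thresholds come from the paragraph above, and everything else is illustrative.

```r
# Made-up precinct summaries with the same columns as dem.reg.prec,
# for illustration only (not real Wake County precincts)
dem.reg.prec <- data.frame(
  precinct    = c("A", "B", "C", "D"),
  registered  = c(2000, 3500, 1800, 4200),
  turnout2010 = c(1300, 1400, 700, 2600),
  dem.pct     = c(0.25, 0.62, 0.55, 0.48)
)

# Turnout as a share of registered voters
dem.reg.prec$turnout.pct <- dem.reg.prec$turnout2010 / dem.reg.prec$registered

# Keep precincts with at least 50% Democratic registration and under 50% turnout
targets <- subset(dem.reg.prec, dem.pct >= 0.5 & turnout.pct < 0.5)

# Put the most Democratic precincts first
targets <- targets[order(-targets$dem.pct), ]
as.character(targets$precinct)
```

A real campaign would likely also weigh precinct size, since a large precinct just under the threshold can hold more missing votes than a small one well below it.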
Wrapping up
Thank you for reading this simple overview of performing a voter file analysis using R. The Wake County voter file is full of interesting information. We've just barely scratched the surface, and I encourage the curious reader to explore the file themselves. Thanks to the Wake County Board of Elections for keeping such a high quality file free and up to date. Please don't hesitate to leave any feedback in the comments below or via Twitter.