I have been working on an analysis, using OP HouseData, of what effect esoteric campaign finance variables might have on election returns in the US House. To kick off this project I needed a baseline idea of how the Democratic vote share in the US House changed during my target period of 2002 to 2008. With this information I could look for intra-year trends or inter-year clusters that could inform which financial variables I’d include in my analysis.
For the baseline summary I considered using a color-coded map (like those from CQ and CNN), but I care more about aggregates than individual districts or states. Instead I created five non-map visualizations of the same vote share data, using R and ggplot2. Each visualization helped me better understand my data and refine my assumptions and expectations, even if I eventually discarded the output. The interactive nature of R allowed me to experiment and iterate very quickly until I got what I needed. The R code and data are available at the end of the post.
Methodology: Figure 1 is a simple scatter plot with the vote share on the Y axis and the election year on the X axis. Each of the 435 seats is plotted as a single point, and the points are alpha blended to highlight groupings of similar returns. A horizontal line is drawn at 50% vote share.
Interpretation: Points below the 50% line show a loss for a Democrat, points above show a win. Lighter gray points mean fewer seats at a particular vote share.
Problems: With 435 points per year the plot suffers from overplotting even with alpha blending. The alpha blend produces too few distinct shades, so 5 points and 25 points look identical.
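A minimal sketch of a plot like Figure 1, not the code used for this post. The data are simulated stand-ins for the real returns, and the names `house`, `year`, `dem_share`, and `stack_opacity` are hypothetical.

```r
library(ggplot2)

# Simulated placeholder data (NOT the real returns): 435 seats in each of
# four elections, vote share rounded to whole percentage points.
set.seed(1)
years <- c(2002, 2004, 2006, 2008)
share <- round(pmin(pmax(rnorm(435 * length(years), mean = 50, sd = 20), 0), 100))
house <- data.frame(year = factor(rep(years, each = 435)), dem_share = share)

fig1 <- ggplot(house, aes(x = year, y = dem_share)) +
  geom_point(alpha = 0.1) +        # alpha blending: stacked points look darker
  geom_hline(yintercept = 50) +    # the 50% win line
  labs(x = "Election year", y = "Democratic vote share (%)")

# Combined opacity of n points drawn on top of each other at alpha a.
stack_opacity <- function(n, a) 1 - (1 - a)^n
```

Printing `fig1` draws the plot. With `alpha = 0.1`, `stack_opacity(5, 0.1)` is about 0.41 and `stack_opacity(25, 0.1)` is about 0.93. A larger alpha saturates sooner, which is one way stacks of 5 and 25 points can end up looking the same.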
Methodology: Figure 2 is another scatter plot with the vote share on the Y axis and the election year on the X axis. Each of the 435 seats is plotted as a single point, and each point is alpha blended to visually highlight similar returns. A random horizontal jitter was added to every point to reduce overplotting. A horizontal line is drawn at 50% vote share.
Interpretation: Points below the 50% line show a loss for a Democrat, points above show a win. I can’t answer how many seats had a given vote share, and due to the jitter I can’t reasonably identify groupings let alone intra-year trends.
Problems: Jittering addresses some of the overplotting problem from Figure 1, but Figure 2 now obscures any patterns because the data look like random noise.
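The horizontal jitter described for Figure 2 could be sketched as follows. This is an illustration on simulated placeholder data, not the original code. The names `house`, `year`, and `dem_share` are hypothetical, and the jitter width of 0.3 is an arbitrary choice.

```r
library(ggplot2)

# Simulated placeholder data (NOT the real returns).
set.seed(2)
years <- c(2002, 2004, 2006, 2008)
share <- pmin(pmax(rnorm(435 * length(years), mean = 50, sd = 20), 0), 100)
house <- data.frame(year = factor(rep(years, each = 435)), dem_share = share)

fig2 <- ggplot(house, aes(x = year, y = dem_share)) +
  # width = 0.3 spreads points sideways within each year;
  # height = 0 leaves the vote share itself untouched.
  geom_jitter(width = 0.3, height = 0, alpha = 0.3) +
  geom_hline(yintercept = 50) +    # the 50% win line
  labs(x = "Election year", y = "Democratic vote share (%)")
```

Setting `height = 0` matters here: only the X position is randomized, so every point keeps its true vote share even though its horizontal placement carries no meaning.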
The scatter plots helped me realize that what I really wanted was a summary of the distribution of the Democratic vote share, not the raw values themselves. That led me to the following:
Methodology: A 4-panel graphic with one histogram of vote share per election year. The histogram bin width is two percentage points. A vertical line is drawn at the 50% vote share mark.
Interpretation: Bars to the left of the 50% line show a loss for a Democrat, bars to the right show a win. The bar heights show how many races were uncontested by Democrats (0 vote share) and how many were uncontested by Republicans (100 vote share). To the left and right of center we can see clear groupings of core Republican and Democratic seats that remain somewhat static across elections, but there is some movement back and forth across the 50% win line as control of the House changed hands in 2006.
Problems: The count is much better at showing the actual distribution of returns, but raw bar heights are hard to compare across years.
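The faceted histogram described above could be sketched like this, again on simulated placeholder data rather than the real returns. The names `house`, `year`, and `dem_share` are hypothetical, and the 60 seats forced to 0 or 100 are only stand-ins for uncontested races.

```r
library(ggplot2)

# Simulated placeholder data (NOT the real returns).
set.seed(3)
years <- c(2002, 2004, 2006, 2008)
n     <- 435 * length(years)
share <- pmin(pmax(rnorm(n, mean = 50, sd = 18), 0), 100)
# Hypothetical stand-ins for uncontested seats at 0 and 100.
share[sample(n, 60)] <- sample(c(0, 100), 60, replace = TRUE)
house <- data.frame(year = factor(rep(years, each = 435)), dem_share = share)

fig3 <- ggplot(house, aes(x = dem_share)) +
  geom_histogram(binwidth = 2) +   # bins two percentage points wide
  geom_vline(xintercept = 50) +    # the 50% win line
  facet_wrap(~ year) +             # one panel per election year
  labs(x = "Democratic vote share (%)", y = "Number of seats")
```

`facet_wrap()` keeps the same X and Y scales in every panel by default, which makes the uncontested spikes at 0 and 100 directly comparable between years.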
Since a histogram was too raw, I decided to switch to a box-and-whisker plot.
Methodology: A box-and-whisker plot summarizing the distribution of Democratic vote share. The box shows the median value with a horizontal line, with the 1st and 3rd quartiles as the bottom and top edges of the box. The whiskers extend from the box to cover values outside the interquartile range; with ggplot2's default settings they reach the most extreme values within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual point.
Interpretation: This plot provides several pieces of useful information. The spread between the 1st and 3rd quartiles shrinks from 2002 to 2008, indicating closer races. The median line jumps over the 50% win mark in 2006 and 2008, which coincides with the Democrats taking back the House.
Problems: The whisker portion of the plot is less useful since we can’t see the distribution of outliers.
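A box-and-whisker plot of this kind, together with the numbers behind it, could be sketched as below. The data are simulated placeholders, and the names `house`, `year`, `dem_share`, `median_by_year`, and `iqr_by_year` are hypothetical.

```r
library(ggplot2)

# Simulated placeholder data (NOT the real returns).
set.seed(4)
years <- c(2002, 2004, 2006, 2008)
share <- pmin(pmax(rnorm(435 * length(years), mean = 50, sd = 20), 0), 100)
house <- data.frame(year = factor(rep(years, each = 435)), dem_share = share)

fig4 <- ggplot(house, aes(x = year, y = dem_share)) +
  geom_boxplot() +                 # default whiskers: up to 1.5 * IQR from the box
  geom_hline(yintercept = 50) +    # the 50% win line
  labs(x = "Election year", y = "Democratic vote share (%)")

# The same summary as numbers, one value per election year.
median_by_year <- tapply(house$dem_share, house$year, median)
iqr_by_year    <- tapply(house$dem_share, house$year, IQR)
```

Printing `iqr_by_year` gives the quartile spread discussed in the interpretation as a plain table, which can be easier to compare across years than box heights.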
This led me to use the established seats-votes plot from the theoretical political science literature (Kastellec, Gelman, and Chandler (2006); Jackman, etc. (PDF)).
Methodology: A 4-panel smoothed density curve of Democratic vote share, with a one-dimensional rug plot marking individual seats along the bottom. A vertical line is drawn at the 50% mark.
Interpretation: The contours of the curves on the seats-votes plot show some very interesting information about the makeup of the House, and taken over time it is very easy to see the emergence of a Democratic majority in 2006. I also see changes in the number of uncontested seats over time, and the stabilization of the safe Democratic seat peak to the right of the 50% mark. The stabilization of the peaks in 2006 and 2008 around the 50% mark aligns well with what we saw in Figure 4.
Problems: I found no significant problems with this figure for my purposes.
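A density-plus-rug plot along these lines could be sketched as follows. It runs on simulated placeholder data and uses ggplot2's default kernel density settings, which may differ from whatever was used for the figure in this post. The names `house`, `year`, and `dem_share` are hypothetical.

```r
library(ggplot2)

# Simulated placeholder data (NOT the real returns).
set.seed(5)
years <- c(2002, 2004, 2006, 2008)
share <- pmin(pmax(rnorm(435 * length(years), mean = 50, sd = 18), 0), 100)
house <- data.frame(year = factor(rep(years, each = 435)), dem_share = share)

fig5 <- ggplot(house, aes(x = dem_share)) +
  geom_density() +                     # smoothed density curve per panel
  geom_rug(sides = "b", alpha = 0.2) + # one tick per seat along the bottom
  geom_vline(xintercept = 50) +        # the 50% win line
  facet_wrap(~ year) +                 # one panel per election year
  labs(x = "Democratic vote share (%)", y = "Density")
```

One thing to keep in mind when reading such a curve: kernel smoothing spreads the spikes at 0 and 100 over neighboring values, so the rug is the more reliable guide to how many seats sit exactly at the extremes.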
It is no surprise that the seats-votes plot proved to be the most useful for my purposes, since it was specifically designed, by very smart social and political scientists, to look at this type of data. The seats-votes plot is very versatile and can be adapted to a single election by looking at all precincts within a single district. I performed this type of analysis in my Aggregate electoral targeting blog post: Democratic vote share, by precinct in VA HOD 13.
Even though the other plots aren’t as useful, they do provide some diagnostic information. The box-and-whisker plot is probably easier to read if you only care about median vote share, and the histogram was excellent for finding uncontested seats. For fewer than 435 points even the scatter plots could be very useful. Please email me or comment with ideas or alternative visualizations of vote share data.