open-source campaign finance analysis with R and MySQL
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In Part 1 of this tutorial we introduced the fechell library by extracting all itemized contributions from individuals made to the Obama For America campaign in 2007 and 2008. In Part 2 of the tutorial we will summarize that data set by importing it into a MySQL database and aggregating contributions by week and zip code. Next we’ll visualize the contribution data amounts on a map, week by week. Below is a sample image:
This visualization contains two separate measures: the top portion is a map of the continental US, marked with a dot at the geographic center of every zip code that had individuals who made contributions to Obama for America. The dot is colored according to how much money in total was raised from that zip code since the 2nd week of 2007, with the colors going from a dark blue to bright red indicating amounts. The second measure is a vertical bar plot showing how much money was raised, per week, from all zip codes combined. The current week is highlighted and annotated with the amount raised for that week. Taken together these measures can show us where the first monetary support for the Obama for America campaign came from and how it progressed geographically and in volume. We will use the free R statistics and visualization package to producing the weekly image we shown above. Looking at still images week by week is informative but not very exciting. After all the images are created we’ll use MEncoder to string together the images into a movie to demonstrate the growth of individual financial support pledged to Obama for America.
Installing the database
To follow along with this exercise you will need to download MySQL 5.1 for your platform. English instructions for installation are available in the MySQL 5.1 Reference Manual. After installation you will also need to create a database user and grant rights to create databases, create indexes and perform LOAD DATA LOCAL functions. All of these administration tasks are covered in the MySQL 5.1 Reference Manual Section 5.5 (MySQL User Account Management). Keep the user name and password of the account you created handy as we’ll use it in the next few steps to access the database.
Creating the database
We will be using two different tables to represent the itemized individual contributions: the transactions table will hold all 2.8 million transaction values, and the transactions_summary_weekly table will hold the sum of all contributions for every combination of year, week, and 5 digit zip code that exists in the data.
To populate these tables, download and save the transaction table creation script create-transactions.sql available in the Resources sub-section below to a directory. Next download the output of the first part of the post into that same directory (obama-data-F3P-2007-2008.csv), available as ZIP archive in Resources sub-section below. If this CSV file isn’t in the same directory as the create-transactions.sql script you’ll need to change line #4 to reference the exact location.
create-transactions.sql looks like this:
1 2 3 4 5 | CREATE DATABASE fechell; USE fechell; CREATE TABLE transactions (contribution_date date,contribution_amount int, zipcode char(5)) engine 'MyISAM'; LOAD DATA LOCAL INFILE 'obama-data-F3P-2007-2008.csv' INTO TABLE transactions FIELDS terminated BY ',' LINES terminated BY 'rn' IGNORE 1 LINES (contribution_date,contribution_amount,zipcode); ALTER TABLE transactions ADD INDEX ix_zipcode (zipcode); |
We are simply creating a new database (‘fechell’), and creating a new MyISAM table called ‘transactions’ on that database. We chose MyISAM since large data imports are dramatically faster with MyISAM than the default InnoDB, at least in the default setup we’re using. You can execute this script by running the following command in the directory where you saved create-transactions.sql and obama-data-F3P-2007-2008.csv. Replace YOURDBUSERNAME with the user you created during the MySQL installation.
mysql -uYOURDBUSERNAME < create-transactions.sql
Since we are loading in 2.8 million records and then adding an index to the zip code field this command might take quite a while to execute depending on your processor speed. After it is finished you should have a large database of all individual itemized transactions in the fechell.transactions table.
Next we will create and populate the transactions_summary_weekly table, which will contain aggregated itemized individual fund raising totals by zip code by week and year.
The script create-transactions_summary_weekly.sql available in the Resources sub-section below looks like this:
1 2 3 4 | USE fechell; CREATE TABLE transactions_summary_weekly (zipcode char(5) NOT NULL,contribution_year int(10) UNSIGNED NOT NULL,contribution_week int(10) UNSIGNED NOT NULL, total int(10) UNSIGNED NOT NULL) engine 'MyISAM'; INSERT INTO transactions_summary_weekly(zipcode,contribution_year,contribution_week,total) (SELECT zipcode,year(contribution_date),week(contribution_date),sum(contribution_amount) FROM transactions GROUP BY zipcode,year(contribution_date),week(contribution_date)); |
You can run the script with the following command:
mysql -uYOURDBUSERNAME < create-transactions_summary_weekly.sql
After the script is finished executing you should have two rather large tables populated with the information we’ll need to visualize the data.
Installing R
Before we can create the visualizations we’ll need to make sure the R toolkit and several add-on packages are installed correctly. If it isn’t installed already, you can download an R installer from the R mirrors list. The R Frequently Asked Questions page contains instructions for Linux, Unix, Mac, and Windows platforms.
Installing R Packages
After R is installed you will need to install several add-on packages that may or may not be included with your distribution: sp, maps,maptools, and RMySQL. All are available via CRAN and platform-dependent instructions for installing add-on packages fcan be found on the R Wiki under How do I install a package?.
Creating the visualization part 1 – images
Before you can create the visualization you will need to download the R script to draw the frames, and a support file containing latitude and longitude pairs for every zip-code in the US. Both are available in the References sub-section below as draw.R and zips.zip. Unzip the zips.zip file and move the newly created zips.csv file into the same directory as draw.R. You will need to make 1 change to the draw.R script before you can run it: line 37 creates the connection to the database we created and populated earlier. You’ll need to change the host, user name, and password arguments to match your database setup.
The draw.R file is pretty simple, so we wont walk through the code line by line. Most of the heavy lifting is done by the fantastic sp and RMySQL packages, with a little bit of help from a lat/long database of zip codes.
Once you’ve edited the draw.R to match your configuration you can can create the images with the following command line, assuming the R bin directory has been correctly added to your PATH:
R CMD BATCH draw.R
Assuming the packages were installed correctly, the zips.csv file was in the same directory, and the database configuration was modified you should now see 96 PNG files in your directory. The files are named o4a_year_week.png.
Creating the visualization part 2 – video
To create a FLV movie from the weekly images you will need to install MPlayer/MEncoder if it isn’t installed on your system already. You can find download and install instructions on the MPlayer download page. Once the installation is complete you can create a FLV movie with the following command line (some options taken from a page on avi-to-flv-conversion">http://jeremychaman.info/
mencoder.exe mf://*.png -mf w=1024:h=768:type=png -ovc lavc -lavcopts vcodec=flv:mbd=2:mv0:v4mv:cbp:last_pred=3:trell:keyint=50:vbitrate=300 -o output.flv -fps 2 -ofps 5 -of lavf
After the command is complete you will have a FLV file called output.flv. The final file is embedded below, and available in the Resources section:
[See post to watch Flash video]Problems with this visualization
While this visualization serves our purpose of tracking the geographic spread and growing amount of individual contributions to the Obama for America campaign, it is not perfect. The entire process, from extraction to visualization was created in about half a day, so there is obviously more work that could be done. Also the specific process used in these tutorials was to be a demonstration of several strategies to perform your own analysis, not necessarily to build the best visualization of independent contributions.
One problem with the current visualization is the representation of an entire zip code by a dot at its geographic center. This is problematic because zip codes represent regions that are vastly different in size, so using single dots will end up in decisions being skewed. Based on our maps alone, we could conclude Obama for America enjoyed very little support in places like Montana and Colorado since they have very few dots within their borders. In truth the campaign received more than $11M of support from Colorado and just over $1.2M in itemized donations from Montana, but this is very difficult to compare this to Philadelphia or New York City where the map is a smear of dots. A better way to display information by zip code could involve drawing the ZIP boundaries and filling with a color, instead of just coloring a single point. This would work fine for places like Montana, Idaho, and Colorado where a zip code might refer to a very large area, but would fail in more populous areas. This is especially true in New York City, where a single zip code could represent an area as small as a single square mile and would be almost impossible to view on a reasonably sized graphic. Merging of zip code areas and averaging their fund raising totals could make this procedure more useful for national analysis or local analysis of heavily populated metro areas.
Additionally, the output format of the visualization could be improved by making it interactive. Combining our weekly summary data with a tool like the Google Motion Chart Gadget along with an Animated Timeline would provide a much greater user experience.
Finally, the national analysis of Obama for America is interesting by itself but would be much more useful if it included data from other candidates. Being able to overlay zip code summaries from primary challengers like Hillary Clinton for President, and general election opponent McCain-Pailn 2008 Inc would allow a much greater depth of analysis to be performed. Including demographic information, primary dates, and election returns on the same visualization would be the best.
I intend to address these topics -as well as analyzing congressional races- in a future tutorial.
What the visualization tells us
Despite the flaws listed above after viewing the final output of our visualization we can draw several conclusions about the fund raising success of the Obama for America campaign. First, the campaign was truly national and had financial support from across the country by the end of the campaign. But the campaign didn’t start out national – during the critical first two months of the campaign Obama for America drew financial support from several metro areas including Chicago, New York City, and DC/Northern Virginia. After several months of heavy development of the grassroots network the campaign started seeing donations coming in from across the US.
Using the weekly summary data we can try to attribute large spikes in total receipts to campaign events. Looking closely we see the campaign saw a large increase of contributions in the last week of March, which potentially could be attributed to popular support of candidate Obama’s More Perfect Union speech or as a result of the large new donor push at the beginning of that month. We can also look at weeks 2008/4 and 2008/5(beginning of February 2008) and see week to week receipts jump by $5M to around $11M for two full weeks. This spike coincides with the Super Tuesday Democratic primaries where candidate Obama nearly split the day with then-front runner Hillary Clinton.
Conclusion
Using free disclosure data and a few open-source tools, like Ruby, fechell, R, and MySQL just about anybody can perform their own professional analysis on federal campaign finance data.
Resources
Database creation script
Database population script
R script to draw images draw.R
Zip codes
FLV file
CSV extract from Part 1
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.