[This article was first published on A Distant ObserveR, and kindly contributed to R-bloggers].
Yesterday I posted an example of plotting 2012 U.S. presidential exit poll results using ggplot2. There I took for granted that a data.frame containing all we need resides in a file called “PresExitPolls2012.Rdata”. Today I want to show how I scraped the data from CNN.
The challenge
At first I tried to scrape the site using RCurl and the XML package, but the result was very disappointing: I got nothing but empty data.frames, although every browser I tried showed the data. Looking at the source code of the page, however, was equally disappointing:
Where I expected the percentage of, say, women voting for Romney, I saw a JavaScript variable name. Only looking at the generated source with Firebug revealed the data. The CNN pages are created dynamically by JavaScript that jqueries the data into variables, so the numbers never appear in the HTML that the server sends. There is no way to get the data with RCurl alone.
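To make the problem concrete, here is a minimal sketch in Python. The HTML snippet and the variable name `pctWomenRomney` are invented for illustration and are not CNN's actual markup. The point is only that a static parser sees the script text, never the value a browser would write into the cell.

```python
from html.parser import HTMLParser

# Hypothetical static source: the value cell is filled in by JavaScript at load time.
STATIC_SOURCE = """
<table>
  <tr><td class="label">Women</td>
      <td class="value"><script>document.write(pctWomenRomney);</script></td></tr>
</table>
"""


class CellText(HTMLParser):
    """Collect the visible text of every <td>, ignoring <script> bodies."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script source is not visible text, so it is skipped.
        if self._in_td and not self._in_script:
            self.cells[-1] += data.strip()


parser = CellText()
parser.feed(STATIC_SOURCE)
static_cells = parser.cells  # the label is there, the value cell is empty
```

A scraper working on this source ends up with an empty value cell, which matches the empty data.frames described above.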
The solution
So I needed a real browser that could be controlled by a script. I decided to use a Python script to read the generated HTML from CNN. Here’s the Python code, which draws heavily on a thread I stumbled upon in a German forum:
Next I needed a function in R that puts together the URL for one of the CNN state sites, calls the Python script and returns a page tree of the generated HTML. getStateData() does the job:
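For readers who want to try the same idea, here is a minimal sketch using Selenium. This is a substitute technique and not the script from the forum thread; the names `fetch_rendered_html` and `_headless_firefox` are made up for this example, and it assumes Selenium 4 and Firefox with its driver are installed.

```python
def fetch_rendered_html(url, make_driver=None):
    """Load `url` in a real browser and return the HTML after scripts have run."""
    if make_driver is None:
        make_driver = _headless_firefox
    driver = make_driver()
    try:
        driver.get(url)
        # page_source reflects the DOM as the browser currently sees it.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()


def _headless_firefox():
    """Assumed default browser: Firefox without a visible window."""
    # Imported lazily so the function above can be used with any driver object.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)
```

If a page fills in its data some time after loading, an explicit wait before reading `page_source` would probably be needed; that is left out here. A small command line wrapper that prints the returned string to standard output would let another program, such as R, capture it.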
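getStateData() itself is an R function. Purely to illustrate the glue logic (build the URL, run the browser script as a separate process, collect what it prints), here is a hedged sketch of the same steps in Python. The URL template is a placeholder, since the real CNN URL scheme is not given here, and `state_url` and `get_state_html` are hypothetical names. Parsing the returned HTML into a tree is left out.

```python
import subprocess
import sys

# Placeholder pattern only; the real URL scheme is an unknown here.
URL_TEMPLATE = "http://example.invalid/exit-polls/{state}"


def state_url(state):
    """Build the (hypothetical) URL for one state's exit poll page."""
    return URL_TEMPLATE.format(state=state)


def get_state_html(state, script_path):
    """Run the browser script as a child process and return what it prints."""
    result = subprocess.run(
        [sys.executable, script_path, state_url(state)],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise if the script exits with an error
    )
    return result.stdout
```

Running the browser in a child process keeps the scraping step separate from the analysis code, which is the same division of labour as calling a Python script from R.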
The page tree getStateData returns contains a lot of noise, like preliminary county results for some, but only some, of the counties. There are some “fake” exit polls designed to explain “how to read exit polls”. And for every question asked, the results appear a couple of times.
Filtering out the noise
To separate the wheat from the chaff, the grain from the husk, I split the job over two functions, parseEpNode and getExitPolls.
getExitPolls parses the tree using XPath, then calls parseEpNode for each of the nodes containing exit polls. (As an aside: this is an application of the “Split-Apply-Combine Strategy for Data Analysis” (pdf) described by Hadley Wickham when he introduced the plyr package. Ironically my getExitPolls doesn’t use plyr::llply but the R standard lapply, though it makes use of plyr::rbind.fill…)
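The apply-then-combine step can be sketched in Python as follows. This is an analogy and not the R code: `rbind_fill` is a hand-written stand-in for what plyr::rbind.fill does (stack tables and fill columns that are missing from some of them), rows are plain dicts, and `parse_node` stands for whatever function turns one node into a table.

```python
def rbind_fill(tables):
    """Stack lists of row dicts; columns missing from a row become None."""
    columns = []
    for table in tables:
        for row in table:
            for key in row:
                if key not in columns:
                    columns.append(key)  # keep first-seen column order
    return [
        {col: row.get(col) for col in columns}
        for table in tables
        for row in table
    ]


def get_exit_polls(nodes, parse_node):
    """Apply `parse_node` to every node, then combine the per-node tables."""
    return rbind_fill([parse_node(node) for node in nodes])
```

The list comprehension inside `get_exit_polls` plays the role of lapply, and the fill step matters because not every question yields the same set of columns.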
parseEpNode is the real workhorse of the process. It filters out duplicate entries and demo polls. Again it relies on the Split-Apply-Combine Strategy without using l*ply. Sometimes lapply is easy enough, and Hadley himself uses it internally in some cases as well.
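The filtering idea can be illustrated with a short Python sketch. How the real function recognises a demo poll is not stated here, so the check is passed in as a caller-supplied predicate `is_demo`; that, the `"question"` key and the function name are assumptions made for this example.

```python
def drop_duplicates_and_demos(polls, is_demo):
    """Keep the first copy of each question and skip demo polls."""
    seen = set()
    kept = []
    for poll in polls:
        question = poll["question"]
        if is_demo(poll) or question in seen:
            continue  # a demo poll, or a repeat of a question already kept
        seen.add(question)
        kept.append(poll)
    return kept
```

Keeping the first occurrence is a design choice of this sketch; it works when the repeated copies of a question carry identical results.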
Putting it all together
This script puts it all together and produces the Rdata file whose existence I merely assumed yesterday. It starts with a list, taken from the Washington Post, of the 19 states + D.C. where no exit polls were conducted in 2012. From that it puts together the states of interest, again as a list to which getExitPolls can be lapply’d.
A probably much shorter post will add some improvements to the process. More later…