Get the exit polls from CNN using R and Python


Yesterday I posted an example of plotting 2012 U.S. presidential exit poll results using ggplot2. There I took for granted that a data.frame containing all we need resides in a file called “PresExitPolls2012.Rdata”. Today I want to show how I scraped the data from CNN.

The challenge


At first I tried to scrape the site using RCurl and the XML package. But the result was very disappointing. I just got empty data.frames while all browsers I used showed the data. Looking at the source code of the page, however, was equally disappointing:

Where I expected the percentage of, say, women voting for Romney, I saw a javascript variable name. Only looking at the generated source with Firebug revealed the data: the CNN pages are created dynamically by javascript that pulls the data into the page via jQuery. There was no way to get at the data with RCurl alone.
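For illustration, here is a minimal sketch of the kind of static scrape that comes up empty (illustration only, not code I kept; the Ohio URL follows the scheme used further below):

## a minimal sketch of the static approach that fails on these pages
library(RCurl)
library(XML)
URI <- "http://us.cnn.com/election/2012/results/state/OH/president"
static_html <- getURL(URI)                  # fetches only the static source
pagetree <- htmlParse(static_html, asText = TRUE)
# the exit poll nodes are empty or hold javascript placeholders, because
# the numbers are only filled in later by the page's javascript
getNodeSet(pagetree, "//div[@class=\"exit_poll\"]")
readHTMLTable(pagetree)                     # likewise: empty data.frames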

The solution


So I needed a real browser that could be controlled by a script. I decided to use a Python script to read the generated html from CNN. Here’s the Python code that draws heavily on a thread I stumbled upon in a German forum:

# harald 2012-11-11
# heavily based on http://www.python-forum.de/viewtopic.php?f=3&t=20308
import sys

from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

# the URL to render is passed on the command line
URI = sys.argv[1]

class P_MainWindow(QWebView):
    def __init__(self, url, parent = None):
        QWebView.__init__(self, parent)
        self.load(QUrl(url))
        # once the page (including its javascript) has finished loading,
        # dump the generated html
        self.connect(self, SIGNAL("loadFinished(bool)"),
                     self.print_html)
        print "OK!"

    def print_html(self, ok = True):
        # loadFinished passes a bool; we only care that loading is done
        x = open("temp.html", "w")
        x.write(unicode(self.page().mainFrame().toHtml()).encode("utf-8"))
        x.close()
        print "OK"
        app.quit()

app = QApplication(sys.argv)
w = P_MainWindow(URI)
sys.exit(app.exec_())
(saved as dl_CNN_EP.py)

Next I needed a function in R that puts together the URL for one of the CNN state sites, calls the Python script and returns a page tree of the generated html. getStateData() does the job:

## harald, 2012-11-11
## function to read all data for a single state from CNN
getStateData <- function(state) {
    require(XML)
    # compose the URI
    base_url <- "http://us.cnn.com/election/2012/"
    exit_polls_url <- "results/state/"
    state <- toupper(state)
    URI <- paste0(base_url, exit_polls_url, state, "/president")
    if (state == "US") {
        URI <- "http://us.cnn.com/election/2012/results/race/president"
    }
    # call python script that downloads the polls from CNN
    # R can't do this, because it's dynamically generated html
    # print a message
    print(paste("processing data for", state))
    syscall <- paste("python dl_CNN_EP.py", URI)
    system(syscall)
    webpage <- readLines("temp.html")
    pagetree <- htmlParse(webpage
                          , asText = TRUE
                          , error = function(...) {}
                          , useInternalNodes = TRUE
                          )
    return(pagetree)
}
(saved as getStateData.R)
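Assuming dl_CNN_EP.py sits in the working directory and python (with PyQt4) is on the PATH, a quick sanity check might look like this:

## quick check for a single state; this calls out to the python script
pagetree <- getStateData("oh")
# count the exit poll nodes found in the generated html
length(getNodeSet(pagetree, "//div[@class=\"exit_poll\"]"))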


The page tree getStateData returns contains a lot of noise, like preliminary county results for some (but only some) of the counties. There are some "fake" exit polls designed to explain how to read exit polls. And for every question asked, the results appear a couple of times.

Filtering out the noise


To separate the wheat from the chaff, the grain from the husk, I split the job over two functions, parseEpNode and getExitPolls.

getExitPolls parses the tree using XPath, then calls parseEpNode for each of the nodes containing exit polls. (As an aside: this is an application of the "Split-Apply-Combine Strategy for Data Analysis" (pdf) described by Hadley Wickham when he introduced the plyr package. Ironically, my getExitPolls doesn't use plyr::llply but standard R lapply, though it does make use of plyr::rbind.fill…)
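As a toy illustration of that pattern (unrelated to the exit polls): apply a function over a list of data.frames with partly different columns, then let plyr::rbind.fill pad whatever is missing with NA, which is presumably why rbind.fill rather than plain rbind shows up below.

## toy example of the lapply + rbind.fill combination
library(plyr)
pieces <- list(data.frame(id = 1:2, a = c(10, 20)),
               data.frame(id = 3,   b = 30))
# "apply": add a column to each piece
pieces <- lapply(pieces, function(x) { x$n <- nrow(x); x })
# "combine": rbind.fill pads the columns a piece doesn't have with NA
rbind.fill(pieces)

Back to the actual function: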

## harald, 2012-11-11
# a function to retrieve the exit polls for one state:
getExitPolls <- function(state) {
    require(XML)
    require(plyr)
    # get the raw state data
    raw <- getStateData(state)
    xpath.expression <- "//div[@class=\"exit_poll\"]"
    ep_nodes <- getNodeSet(doc = raw, path = xpath.expression)
    # the first two nodes contain the "About Exit Polls" examples
    ep_nodes <- ep_nodes[c(-1, -2)]
    # convert elements to data.frames
    # there will be some warnings about "NAs introduced by coercion"
    # those can safely be ignored
    EP <- lapply(ep_nodes, function(x) parseEpNode(x))
    # drop the NULL frames
    EP <- EP[!sapply(EP, is.null)]
    # in some cases the same question has been asked with different
    # breakdowns. to distinguish them we'll add a question number
    for (i in seq_along(EP)) { EP[[i]]$QNo <- i }
    EP <- rbind.fill(EP)
    EP$state <- state
    return(EP)
}
(saved as getExitPolls.R)
parseEpNode is the real workhorse of the process. It filters out duplicate entries and demo polls. Again it relies on the Split-Apply-Combine strategy without using l*ply. Sometimes lapply is easy enough, and Hadley himself uses it internally in some cases as well.

## harald, 2012-11-11
## function to parse a single node
##
parseEpNode <- function(node) {
    require(XML)
    require(plyr)
    if (grepl("\n", xmlValue(xmlChildren(node)[[2]]))
        && grepl("TQ", xmlAttrs(node[4]$div[3]$ul)[1])
    ) {
        question <- xmlValue(xmlChildren(node)[[2]])
        question <- gsub("\n", "", gsub("\nClose\n", "", question))
        foo <-
            lapply(
                xmlChildren(
                    xmlChildren(node)[[4]]
                ),
                function(x) unlist(strsplit(xmlValue(x), "\n"))
            )[-1]
        # add a prefix to make the following lapply possible
        foo[[1]][1] <- paste0("answer:", foo[[1]][1])
        foo <- lapply(foo, function(x) unlist(strsplit(x, ":")))
        # there's at least one state, Massachusetts, where CNN shows
        # an empty exit poll table, so we better check:
        if (length(unlist(foo[-1])) == 0) {
            return(NULL)
        }
        foo.df <-
            as.data.frame(
                do.call(rbind, foo),
                stringsAsFactors = FALSE
            )
        # there's a problem with residual blanks on some CNN pages
        names(foo.df) <- gsub(" ", "", foo.df[1, ])
        foo.df <- foo.df[-1, ]
        for (i in 2:length(foo.df)) {
            # the warnings produced are legitimate coercions from
            # "N/A" to NA. so we can safely suppress them here.
            foo.df[, i] <- suppressWarnings(
                as.numeric(gsub("[%]|[(]|[)]", "", foo.df[, i]))
            )
        }
        foo.df$question <- question
        return(foo.df)
    } else {
        # not a real exit poll node (duplicate or demo entry)
        return(NULL)
    }
}
(saved as parseEpNode.R)

Putting it all together


This script puts it all together and produces the Rdata file whose existence I simply took for granted yesterday. It starts with the list of the 19 states plus D.C. where no exit polls were conducted in 2012, taken from the Washington Post, and builds the list of states of interest, to which getExitPolls can then be lapply'd.

## harald, 2012-11-11
# this script downloads all exit polls for the U.S. presidential
# election on 2012-11-06 from CNN and saves the resulting
# data.frame to file
# for later purposes it can be loaded by
# load(file="PresExitPolls2012.Rdata")
library(XML)
library(plyr)
source("getStateData.R")
source("parseEpNode.R")
source("getExitPolls.R")
# In 2012 exit polls weren't conducted in all states.
# "Here is a list of the states that will be excluded from coverage: Alaska, Arkansas, D$
# (source: http://www.washingtonpost.com/blogs/the-fix/wp/2012/10/04/networks-ap-cancel-$
# a bit of editing gives:
nonep.states <- paste0("Alaska, Arkansas, Delaware, District of Columbia, ",
                       "Georgia, Hawaii, Idaho, Kentucky, Louisiana, ",
                       "Nebraska, North Dakota, Oklahoma, Rhode Island, ",
                       "South Carolina, South Dakota, Tennessee, Texas, ",
                       "Utah, West Virginia, Wyoming"
                       )
excluded <- unlist(strsplit(nonep.states, ", "))
# we need the abbreviations, however, so we have to translate:
ep.states <- as.list(state.abb[!(state.name %in% excluded)])
# add "nationwide" to states
ep.states <- c(ep.states, "US")
# It's "Split" already, so "Apply"...
all_exit_polls <- lapply(ep.states, function(x) getExitPolls(x))
# "Combine"...
EP <- rbind.fill(all_exit_polls)
# Save
save(EP, file = "PresExitPolls2012.Rdata")
(saved as EP2012_1.R)
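Once the script has run, a quick look at the result (the real plotting happened in yesterday's post) might be:

## quick look at the saved data
load(file = "PresExitPolls2012.Rdata")
str(EP)                          # answers, percentages, question, QNo and state
head(subset(EP, state == "US"))  # the nationwide exit polls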


A follow-up post, probably much shorter, will add some improvements to the process. More later…
