Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The GOP recently relaunched its main web site with a new design and numerous interactive and social features like Facebook integration, blogs, etc. Of particular interest is the GOP Faces section, which asks users to submit a photo and answer the question “Why are you a Republican?” Not being a Republican, I was curious to see if there were any common themes among the submissions that would lead to insights about being a Republican and GOP.com user. Not excited about actually reading all 180 reasons, I instead used R to download, transform, analyze and visualize the data for me.
I used two packages (XML and plyr) to fetch and extract the reasons, and then tm to filter stop words and identify commonly used terms. Finally, I used ggplot2, the invaluable ggplot2 book, and a helpful post from the R-help mailing list to perform the visualization.
R code
library(XML)
library(plyr)
library(ggplot2)
library(tm)

# fetch & parse the HTML
doc <- htmlParse("http://gop.com/index.php/learn/republican_faces/", isURL = TRUE)

# pull the matching A elements of CSS class tipz
nodes <- getNodeSet(doc, "//a[@class='tipz']")

# extract the 'title' attribute
titles <- sapply(nodes, function(x) xmlAttrs(x)[["title"]])

# clean up the title attribute
titles <- sub("^[^:]+::", "", titles)

# create the corpus and doc term matrix
co <- Corpus(VectorSource(titles))
tdm <- DocumentTermMatrix(co, control=list("tolower", removeNumbers=TRUE, stopwords=TRUE))

# extract the tags at each level
levels <- c(1,2,3,4)
df <- ldply(levels, function(x) data.frame(freq=x, term=findFreqTerms(tdm,x,x)))

# assign random non-repeating coordinates to the terms
df$x <- sample(1:nrow(df), nrow(df), replace=F)
df$y <- df$freq + rnorm(nrow(df))

# clear standard graph options (thanks mike lawrence on r-help)
clear <- opts(
  legend.position = 'none',
  panel.grid.minor = theme_blank(),
  panel.grid.major = theme_blank(),
  panel.background = theme_blank(),
  axis.line = theme_blank(),
  axis.text.x = theme_blank(),
  axis.text.y = theme_blank(),
  axis.ticks = theme_blank(),
  axis.title.x = theme_blank(),
  axis.title.y = theme_blank()
)

p <- ggplot(df, aes(x=x, y=y, colour=freq, label=term, size=freq)) +
  geom_text() +
  coord_polar() +
  clear

ggsave("because.png", p, dpi=72, scale=1.3)
ggsave("because.pdf", p)
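To see what the cleanup and counting steps above do without fetching the live page, here is a minimal base-R sketch. The sample titles are invented for illustration (the real `title` attributes may be formatted differently), and the tiny stop-word list is an assumption standing in for tm's built-in list, so the counts it produces say nothing about the actual submissions.

```r
# Hypothetical title attributes in a "Name::reason" shape (invented sample data)
titles <- c(
  "Jane D.::I believe in freedom and personal responsibility",
  "John Q.::Freedom and limited government",
  "Sam R.::Family, freedom and equal opportunity"
)

# Same cleanup as above: drop everything up to and including the first "::"
reasons <- sub("^[^:]+::", "", titles)

# Lower-case, split on runs of non-letters, and drop empty strings
words <- unlist(strsplit(tolower(reasons), "[^a-z]+"))
words <- words[nzchar(words)]

# Tiny illustrative stop-word list (an assumption, not tm's list)
stop_words <- c("i", "in", "and", "the", "a", "of")
words <- words[!words %in% stop_words]

# Count terms and sort by descending frequency
term_counts <- sort(table(words), decreasing = TRUE)
print(head(term_counts, 5))
```

With real data, the tm pipeline above does the same job with a much fuller stop-word list; this version only makes the individual steps easy to inspect.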
And the output:
Click for a page-sized PDF, or the raw terms and frequency counts.
The most common term is ‘freedom’, followed by ‘equal’ and ‘pro’. After those come ‘personal’, ‘government’, ‘people’, ‘school’, ‘family’, and ‘believe’. A more robust analysis could use term extraction (pro family, pro life, anti government) or stemming, and then feed the results into a better visualization. That would take more than the 10 minutes I have spent so far, so I’m leaving that as an exercise for somebody else.
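As a starting point for that exercise, here is a rough base-R sketch of one simple form of term extraction: counting adjacent word pairs (bigrams). The sample reasons and the `bigrams` helper are hypothetical, written only for illustration, and a real analysis would likely also filter stop words so that pairs such as "and pro" do not surface.

```r
# Hypothetical reasons (invented sample data, not the GOP.com submissions)
reasons <- c(
  "I am pro life and pro family",
  "Pro family values and smaller government",
  "Personal freedom and pro life"
)

# Build adjacent word pairs (bigrams) for one string
bigrams <- function(text) {
  w <- unlist(strsplit(tolower(text), "[^a-z]+"))
  w <- w[nzchar(w)]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

# Count bigrams across all reasons and keep those seen more than once
bigram_counts <- table(unlist(lapply(reasons, bigrams)))
repeated <- sort(bigram_counts[bigram_counts > 1], decreasing = TRUE)
print(repeated)
```

Multi-word terms like "pro life" and "pro family" then appear as single units rather than being split into a bare ‘pro’.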
As it is, I have the most common answer as to why GOP.com visitors are Republicans: freedom. I think that’s probably why anybody belongs to any political party, but without a corpus from other parties I suppose we’ll never know.