Presidential Debates 2012
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have been playing with the beta version of qdap utilizing the presidential debates as a data set. qdap is in a beta phase lacking documentation though I’m getting there. In previous blog posts (presidential debate 1 LINK and VP debate LINK) I demonstrated some of the capabilities of qdap. I wanted to further show some of qdap’s capabilities while seeking to provide information about the debates.
In previous posts readers made comments or emailed regarding functionality of qdap . This was extremely helpful in working out bugs that arise on various operating systems. If you have praise or methods you used to run the qdap scripts please leave a comment saying so. However, if you are having difficulty please file an issue at qdap’s home, GitHub (LINK).
In this post we’ll be looking at:
1. A faceted gantt plot for each of the speeches via gantt_plot
2. Various word statistics via word_stats
3. A venn diagram showing the overlap in word usage via trans.venn
4. A dissimilarity matrix indicating closeness in speech via dissimilarity
5. iGraph Visualization of dissimilarity
Reading in the data sets and Cleaning
library(qdap) #load qdap # download transcript of the debate to working directory url_dl(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx) # load multiple files with read transcript and assign to global environment dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue")) dat2 <- read.transcript("pres.deb2.docx", c("person", "dialogue")) dat3 <- read.transcript("pres.deb3.docx", c("person", "dialogue")) # qprep for quick cleaning dat1$dialogue <- qprep(dat1$dialogue) dat2$dialogue <- qprep(dat2$dialogue) dat3$dialogue <- qprep(dat3$dialogue) # Split each sentece into it's own line dat1b <- sentSplit(dat1, "dialogue") dat1$person <- factor(dat1$person , levels = qcv(ROMNEY, OBAMA, LEHRER)) dat2b <- sentSplit(dat2, "dialogue") dat3b <- sentSplit(dat3, "dialogue") # Create a large data frame by the three debates times L1 <- list(dat1b, dat2b, dat3b) L1 <- lapply(seq_along(L1), function(i) data.frame(L1[[i]], time = paste("time", i))) dat4 <- do.call(rbind, L1) #view a truncated version of the data (see also htruncdf) truncdf(dat4)
Faceted Gantt Plot
#reorder factor levels dat4$person <- factor(dat4$person, levels=qcv(terms="OBAMA ROMNEY CROWLEY LEHRER QUESTION SCHIEFFER")) with(dat4, gantt_plot(dialogue, person, time, xlab = "duration(words)", scale = "free"))
Basic Word Statistics
This section utilizes the word_stats function in conjunction with ggplot2 to create a heat map for various descriptive word statistics. Below is a list of column names for the function’s default print method.
column title description 1 n.tot number of turns of talk 2 n.sent number of sentences 3 n.words number of words 4 n.char number of characters 5 n.syl number of syllables 6 n.poly number of polysyllables 7 sptot syllables per turn of talk 8 wptot words per turn of talk 9 wps words per sentence 10 cps characters per sentence 11 sps syllables per sentence 12 psps poly-syllables per sentence 13 cpw characters per word 14 spw syllables per word 15 n.state number of statements 16 n.quest number of questions 17 n.exclm number of exclamations 18 n.incom number of incomplete statements 19 p.state proportion of statements 20 p.quest proportion of questions 21 p.exclm proportion of exclamations 22 p.incom proportion of incomplete statements 23 n.hapax number of hapax legomenon 24 n.dis number of dis legomenon 25 grow.rate proportion of hapax legomenon to words 26 prop.dis proportion of dis legomenon to words z <- with(dat4, word_stats(dialogue, list(person, time), tot)) z$ts z$gts plot(z, low="white", high="black") plot(z, label=TRUE, low="white", high="black", lab.digits=1)
Venn Diagram
With proper stop word use and small, variable data sets a Venn diagram can be informative. In this case the overlap is fairly strong and less informative though labels are centered. Thus labels closer in proximity are closer in words used.
with(subset(dat4, person == qcv(ROMNEY, OBAMA)), trans.venn(dialogue, list(person, time), title.name = "Presidential Debates Word Overlap 2012") )
Dissimilarity Matrix
dat5 <- subset(dat4, person == qcv(ROMNEY, OBAMA)) dat5$person <- factor(dat5$person, levels = qcv(OBAMA, ROMNEY)) #a word frequency matrix inspired by the tm package's DocumentTermMatrix with(dat5, wfm(dialogue, list(person, time))) #with row and column sums with(dat5, wfdf(dialogue, list(person, time), margins = TRUE)) #dissimilarity (similar to a correlation #The default emasure is 1 - binary or proportion overlap between grouping variable (sim <- with(dat5, dissimilarity(dialogue, list(person, time)))) OBAMA.time.1 OBAMA.time.2 OBAMA.time.3 ROMNEY.time.1 ROMNEY.time.2 OBAMA.time.2 0.293 OBAMA.time.3 0.257 0.303 ROMNEY.time.1 0.317 0.261 0.245 ROMNEY.time.2 0.273 0.316 0.285 0.317 ROMNEY.time.3 0.240 0.276 0.311 0.265 0.312
Network Graph
The use of igraph may not always be the best way to view the data but this exercise shows one way this package can be utilized. In this plot the wlabels are sized based on number of words used. The distance measures that label the edges are taken from the dissimilarity function (1 – binary). Colors are based on political party.
library(igraph) Z <- with(dat5, adjacency_matrix(wfm(dialogue, list(person, time)))) g <- graph.adjacency(Z$adjacency, weighted=TRUE, mode ='undirected') g <- simplify(g) # set labels and degrees of vertices V(g)$label <- V(g)$name V(g)$degree <- degree(g) set.seed(3952) layout1 <- layout.auto(g) opar <- par()$mar; par(mar=rep(.5, 4)) #Give the graph lots of room plot(g, layout=layout1) edge.weight <- 9 #a maximizing thickness constant z1 <- edge.weight * sim/max(sim)*sim E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge numformat <- function(val, digits = 2) { sub("^(-?)0.", "\\1.", sprintf(paste0("%.", digits, "f"), val)) } z2 <- numformat(round(sim, 3), 3) E(g)$label <- c(z2)[c(z2) != 0] plot(g, layout=layout1) #check it out! label.size <- 15 #a maximizing label size constant WC <- aggregate(dialogue~person +time, data=dat5, function(x) sum(word.count(x), na.rm = TRUE)) WC <- WC[order(WC$person, WC$time), 3] resize <- (log(WC)/max(log(WC))) V(g)$label.cex <- 5 *(resize - .8) plot(g, layout=layout1) #check it out! V(g)$color <- ifelse(substring(V(g)$label, 1, 2)=="OB", "pink", "lightblue") plot(g, layout=layout1) tkplot(g)
This blog post is a rough initial analysis of the three presidential debates. It was meant as a means of demonstrating the capabilities of qdap rather than providing in depth analysis of the candidates. Please share your experiences with using qdap in a comment below and suggestions for improvement via the issues page of qdap’s github site(LINK).
For a pdf version of all the graphics created in the blog post -click here-
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.