The way we communicate is changing. The social media revolution can literally change governments. Twitter is one of the leading mediums through which we, the people, pour forth our informed opinion or raging vitriol, our messages of peace or diatribes of hate. For better or worse, our voices have never been so loud. And when better to listen than on election day? As such, Twitter provides us here at Mango with a fantastic opportunity to quantify the mood of a nation. Today I'm going to show you the impact of last-minute campaigning on the way Twitter users may vote.
So how did we go about this? Well, first, we collected tweets containing general election hashtags such as #GE2015, as well as party-specific hashtags. We then extracted these into a plain text file that looked a little like the data below, taken from an initial run last week:
> head(dat)
                            date               hashtags                 id lang
1 Wed Apr 29 13:25:39 +0000 2015                        593405787155881987   en
2 Wed Apr 29 13:25:39 +0000 2015                 GE2015 593405789487964161   en
3 Wed Apr 29 13:25:40 +0000 2015                 Greens 593405790523936768   en
4 Wed Apr 29 13:25:40 +0000 2015           GE15 voteSNP 593405791589269504   en
5 Wed Apr 29 13:25:42 +0000 2015             votelabour 593405798736338944  und
6 Wed Apr 29 13:25:42 +0000 2015 UKIP conspiracy GE2015 593405802699980800   en
      screen_name
1    IngreyLouise
2      leocullen4
3 STynesideGreens
4    srahmanburgh
5    bryanellis01
6      LDNCalling
                                                                                                                                            text
1                                                                                                          Vote Green in Leicester Castle Ward!
2 RT @WillBlackWriter: David Cameron says "No income tax, no VAT....and this time in 2020 you'll be millionaires."\n\n#GE2015 http://t.co/f61SK…
3   RT @martinbrampton: Top economist attacks Tory austerity – and Labour's limp response http://t.co/hEJydiIOuo Only #Greens offer real change…
4   RT @NicolaSturgeon: Forget polls – only votes win elections. The more seats @theSNP win, the stronger Scotland will be. Let's keep working …
5                                                                                                            https://t.co/RJIeGGUv2n #votelabour
6      RT @PeterMannionMP: "If the…polls are off by 15 (fifteen) % #UKIP win around 100 seats." Yeah, good luck with that. #conspiracy #GE2015 h…
      timestamp                    urls   user_mentions             user_name
1 1430313939261 https://t.co/gAxbWeFnQo                          Louise Young
2 1430313939817                         WillBlackWriter           leon Cullen
3 1430313940064  http://t.co/hEJydiIOuo  martinbrampton  South Tyneside Green
4 1430313940318                         NicolaSturgeon theSNP    selma rahman
5 1430313942022 https://t.co/RJIeGGUv2n                                 bryan
6 1430313942967                          PeterMannionMP           Simon Mason
This is a lot of detailed information! The sheer volume of tweets – some 300,000 records from the last 36 hours – and the amount of detail in each meant that our analysis had to be automated, and what better tool to use than R?! I used the polarity function from the package qdap to generate a numeric opinion score from each tweet. The function estimates an approximate positive or negative sentiment (or polarity).
> polarity(c("happy", "smile", "pleased")) all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity 1 all 3 3 1 0 Inf > polarity(c("sad", "cross", "angry")) all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity 1 all 3 3 -0.667 0.577 -1.155 polarity("oh so plain") all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity 1 all 1 3 0 NA NA The numeric value is generated from a data table of word-score pairs. The default dictionary is key.pol from qdapDictionaries. > qdapDictionaries::key.pol x y 1: a plus 1 2: abnormal -1 3: abolish -1 4: abominable -1 5: abominably -1 --- 6775: zealously -1 6776: zenith 1 6777: zest 1 6778: zippy 1 6779: zombie -1
The first step is to decide to which party each tweet is referring. Looking at representative records within the dataset, many tweets mentioned just one party:
“Vote Green in Leicester Castle Ward! :)”
If we search for the string "Green", and assume that, given the provenance of the data, it refers to the Green Party, we can classify that tweet as an opinion about the Green Party.
However, consider the following tweet:
“Top economist attacks Tory austerity – and Labour’s limp response http://t.co/hEJydiIOuo Only #Greens offer real change”
In this case, we could match the strings "Tory", "Labour", and "Green", but automatically attributing a sentiment to each party separately is a lot more challenging. As this was out of scope for this particular piece of work, I decided that if more than one party was mentioned in any one tweet, that record would be ignored. My function therefore searches for party names using regular expressions, but does not classify a tweet if more than one party is matched.
#' Classify to which party a tweet is referring
#'
#' Provide a pattern for each party and return a vector of labels.
#' Currently a simple search to find tweets that contain patterns matching
#' each party. Tweets mentioning multiple parties are not currently analysed.
#'
#' @param txt character vector
#' @param parties character vector of party names
#' @param patterns character vector of length parties (optional)
#' @param asis single logical if FALSE drop to lower case
#' @return character vector of length txt with value parties, or ""
#' @examples
#' findParty(c("Conservatives", "Greens", "cons", "tories"))
#' findParty(c("Conserv", "Greens", "Conserv cons",
#'     "Conserv tories", "Conserv snp", "snp", NA))

findParty <- function(txt, parties = c("Conservative", "Labour"),
    patterns = NULL, asis = FALSE) {

    if (is.null(patterns)) { patterns <- parties }

    if (!asis) {
        txt <- casefold(x = txt, upper = FALSE)
        patterns <- casefold(x = patterns, upper = FALSE)
    }

    out <- character(length = length(txt))

    findMat <- matrix(FALSE, nrow = length(txt), ncol = length(parties))

    for (party in seq_along(parties)) {
        findMat[, party] <- grepl(pattern = patterns[party], x = txt)
    }

    justOne <- apply(X = findMat, MARGIN = 1L, FUN = sum, na.rm = TRUE) == 1L

    for (party in seq_along(parties)) {
        out[justOne & findMat[, party, drop = TRUE]] <- parties[party]
    }

    return(out)
}
Even when not writing an R package, I always use roxygen2 headers now, to make sure there's sufficient information for others to understand my work. It really isn't much extra effort, and you'll thank yourself later. As described above, this function creates a matrix of results so that each pattern can be matched in turn, then classifies a tweet only where exactly one match is made.
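As a quick sanity check, here is the behaviour described above on some example tweets. The party patterns below are illustrative regular expressions of mine, not necessarily those used in the full analysis:

findParty(c("vote green in leicester castle ward!",
            "top economist attacks tory austerity and labour's limp response",
            "#votelabour"),
    parties = c("Conservative", "Labour", "Green"),
    patterns = c("conservative|tory|tories", "labour", "green"))
# [1] "Green"  ""       "Labour"

The second tweet matches both the Conservative and Labour patterns, so it is left unclassified (""), exactly as intended.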
I then wrote another function that bins the tweets into date groups, performs the polarity calculation on each group, and returns the result. The dictionary lookup methods can be a little slow for mid-sized datasets like this, so I parallelised this loop.
#' Get Polarity of Groups
#'
#' Classify text of tweets in a data file and then use qdap
#' sentiment polarity analysis to guess opinion of tweet.
#'
#' @param data data.frame with columns \enumerate{
#'     \item date character date with specified format
#'     \item text character message posted by user_name at date
#' }
#' The following columns are expected but not currently used \enumerate{
#'     \item hashtags character
#'     \item id numeric
#'     \item lang label, typically "en", also "und", "fr", "cy", etc.
#'     \item screen_name
#'     \item timestamp numeric
#'     \item urls
#'     \item user_mentions
#'     \item user_name
#' }
#' @param fmt single character specifying format of date column (see ?strptime)
#' @param summaryfmt single character specifying
#'     output format of time grouping column (see ?strptime)
#' @param parties character vector of groups to assign
#' @param patterns character vector of length parties
#' @param ncores single integer max number of cores across which to split
#'     group search (default 2)
#' @param onlyclassified single logical should only classified records be
#'     returned? (default TRUE)
#' @return data frame invisibly
#' @import qdap foreach doSNOW
#' @examples
#' littledat <- structure(list(date = c(
#'     "Wed Apr 29 13:25:39 +0000 2015", "Wed Apr 29 13:25:39 +0000 2015",
#'     "Wed Apr 29 13:25:40 +0000 2015", "Wed Apr 29 13:25:40 +0000 2015",
#'     "Wed Apr 29 13:25:42 +0000 2015"), text = c(
#'     "Vote Green in Leicester Castle Ward! :) https://t.co/gAxbWeFnQo",
#'     "RT @WillBlackWriter: David Cameron says \"No income tax, no VAT....and this time in 2020 you'll be millionaires.\"\n\n#GE2015 http://t.co/f61SK…",
#'     "RT @martinbrampton: Top economist attacks Tory austerity – and Labour's limp response http://t.co/hEJydiIOuo Only #Greens offer real change…",
#'     "RT @NicolaSturgeon: Forget polls - only votes win elections. The more seats @theSNP win, the stronger Scotland will be. Let's keep working …",
#'     "https://t.co/RJIeGGUv2n #votelabour")),
#'     .Names = c("date", "text"), class = "data.frame", row.names = c(NA, 5L))
#' littleres <- getGroups(data = littledat)
#' \dontrun{
#' system.time(res <- getGroups(data = dat))
#' }

getGroups <- function(data, fmt = "%a %b %d %H:%M:%S +0000 %Y",
    summaryfmt = "%Y-%m-%d %H:%M",
    parties = c("Conservative", "Labour"),
    patterns = NULL, ncores = 2L, onlyclassified = TRUE) {

    if (missing(data)) { stop("data is missing") }
    if (!all(c("date", "text") %in% colnames(data))) {
        stop("columns 'date' and 'text' must be present")
    }

    if (is.null(patterns)) {
        patterns <- parties
        asis <- FALSE
    } else {
        if (length(patterns) != length(parties)) {
            stop("there must be one pattern for each party")
        }
        asis <- TRUE
    }

    # get date then group by time period
    data$date <- as.POSIXct(x = data$date, format = fmt)
    tGroups <- format.POSIXct(x = data$date, format = summaryfmt)
    uGroups <- unique(tGroups)
    nGroups <- length(uGroups)

    # remove websites
    txt <- gsub(pattern = "http(s){0,1}://t.co/[A-Za-z0-9]{2,10}",
        replacement = "", x = data$text)
    txt <- casefold(x = txt, upper = FALSE)

    # find party
    party <- findParty(txt = txt, parties = parties,
        patterns = patterns, asis = asis)

    # remove party names (casefold so the pattern matches the
    # lowercased text)
    for (rem in seq_along(parties)) {
        txt <- gsub(pattern = paste0(casefold(parties[rem]), "[a-z]{0,9} "),
            replacement = "", x = txt)
    }

    # set up cluster on local machine
    cl <- makeCluster(ncores)
    registerDoSNOW(cl)

    # a foreach loop using local cluster: get polarity for each time group,
    # skipping unclassified records if onlyclassified
    res <- foreach(i = seq_len(nGroups), .packages = "qdap") %dopar% {
        if (onlyclassified) {
            useRecords <- party != "" & tGroups == uGroups[i]
        } else {
            useRecords <- tGroups == uGroups[i]
        }
        pol <- rep(NA, times = sum(useRecords))
        if (sum(useRecords) > 0L) {
            pol <- polarity(txt[useRecords],
                constrain = TRUE)$all[, "polarity"]
        }
        return(pol)
    }

    # tear down cluster
    stopCluster(cl)

    # note: assumes records are ordered by time, so that the groups are
    # contiguous and the concatenated scores align with the rows
    dataGrouped <- data.frame("Time" = tGroups, "Party" = party)
    if (onlyclassified) {
        dataGrouped <- dataGrouped[party != "", ]
    }
    dataGrouped$"Score" <- do.call("c", res)
    return(dataGrouped)
}
The example in this function's header shows how it can be used to get a quantitative measure from data of this structure.
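For reference, here is a hedged sketch of how the full run might look with all seven parties. The party labels, search patterns, and core count are illustrative assumptions rather than the exact values from our run; qdap, foreach, and doSNOW must be installed:

library(doSNOW)  # also attaches foreach, and snow for makeCluster()
library(qdap)

# patterns are lowercase because getGroups lowercases the text before matching
parties  <- c("Conservative", "Labour", "LibDem", "UKIP",
              "Green", "SNP", "PlaidCymru")
patterns <- c("conservative|tory|tories", "labour", "lib ?dem", "ukip",
              "green", "snp|sturgeon", "plaid")

res <- getGroups(data = dat, parties = parties, patterns = patterns,
                 ncores = 4L)
head(res)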
We can then visualise these results:
require(ggplot2)
theme_set(theme_bw(base_size = 14))
theme_update(axis.text.x = element_text(angle = 90, vjust = 1))

# Create basic plot with smoother
partyPlot <- ggplot(aes(x = Time, y = Score), data = res) +
    geom_point(aes(colour = Party)) +
    geom_smooth(colour = "black", size = 2) +
    facet_wrap( ~ Party)

# Party colours
partyPlot <- partyPlot + scale_colour_manual(values = c("#14427311",
    "#6BAE2011", "#FA122C11", "#FF8C3C11", "#41852D11", "#FCBA4011",
    "#8C227E11"))

partyPlot
The above ggplot2 plot shows some interesting patterns in tweets during the last hours of the 2015 election campaign. The Conservatives have the largest volume of tweets and also the largest variation in sentiment. Plaid Cymru received the smallest number of tweets, but these were largely positive. The Liberal Democrats show an upward trend in sentiment, while the SNP show a downward trend. More analysis would be required to determine the statistical significance of these results.
As mentioned above, we could certainly improve the sophistication with which we examine tweets, but this serves as a demonstration that the voice of public opinion can be quantitatively captured and analysed. In our next post, we'll show in more depth how we used the Twitter API to capture a portion of the tweets posted during the 2015 General Election.