For my first blog post on R, I want to show how to use R to mine Twitter data. I wanted to write this guide for several reasons. First, while there are many other guides on R-Bloggers that show similar things, they tend to focus more on setting up your Twitter account and making wordclouds. As researchers, however, we are often also interested in comparing and contrasting data between groups, and the meta-data can be a rich source of information for that. This also means that the scripts we write have to be more reusable, so that we can retrieve, load, and analyze several different groups repeatedly. To this end, I will provide some simple functions built on top of other packages that enable you to download tweets from pre-specified groups of people instead of just whoever tweets on a certain topic.
Lastly, I wanted to show how analysing the Twitter meta-data in addition to the tweets themselves can lead to better behavioural insights.
This guide requires that you have a Twitter account, R(studio), and a Twitter dev account with OAuth. I'll leave the details of how to get OAuth and the dev-app running to the guide here, since it explains it better than I could have.
By the end of this guide you will be able to download tweets from specific users and from lists, plot commonly used words in their tweets, and examine their tweeting behaviour using the meta-data. To make things a little interesting and current, I will use tweets from the ongoing US election. I am going to use a combination of several (excellent) packages to achieve this. There is probably a less messy way of doing all this, but this gets the job done and lets you get to the interesting part, the analysis, more quickly.
#Packages required:
library(twitteR)
library(ggplot2)
library(httr)
library(rjson)
library(tm)
library(gridExtra)
library(lubridate)
If you don't have these packages, install them with install.packages("NameOfPackage") before running the code above.
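If several of them are missing, you can install them all in one go, for example:

install.packages(c("twitteR", "ggplot2", "httr", "rjson",
                   "tm", "gridExtra", "lubridate"))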
To get a feel for the commands we will pass through the code, I urge you to have a look at Twitter's API documentation.
Before we start, establish a connection to the twitter API (use the previously linked guide if this is your first time):
#Type in your app details from Twitter here:
setup_twitter_oauth(API_key, API_secret, access_token, access_secret)
A starting point: The US election
To get warmed up, let’s see what the two US presidential nominees have been talking about.
Downloading tweets from a single user is very easy: just use the userTimeline() function from the twitteR package. The n = argument specifies how many tweets you want from the user. We will use 200 in this example, but the maximum is 3200. Unfortunately you can't use the API to request tweets older than a week or two at most. This is a restriction of the Twitter Search API, and it often means you won't actually get the number of tweets you specified.
#Extract tweets from a single user at a time
clinton_tweets <- userTimeline(user = "@HillaryClinton", n = 200,
                               includeRts = FALSE, retryOnRateLimit = 2000)
trump_tweets <- userTimeline(user = "@realDonaldTrump", n = 200,
                             includeRts = FALSE, retryOnRateLimit = 2000)

#Put the tweets downloaded into a data.frame
clinton_tweets <- twListToDF(clinton_tweets)
trump_tweets <- twListToDF(trump_tweets)
If you do not include retweets you will often not get the number of tweets you specified (retweets count towards the total but are not downloaded), so consider using a higher n = to compensate. If you want to save these tweets for later, use write.csv(downloadedData, "nameCSV.csv") to write them to your hard drive.
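For the two candidates that could look something like this (the file names are just placeholders; row.names = FALSE avoids an extra index column when you re-import the file):

write.csv(clinton_tweets, "clinton_tweets.csv", row.names = FALSE)
write.csv(trump_tweets, "trump_tweets.csv", row.names = FALSE)
#Read them back in later with, e.g., read.csv("clinton_tweets.csv")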
Now that we have the tweets down, it is useful to remove any unnecessary fluff:
#Remove punctuation, numbers, html-links and unnecessary spaces:
textScrubber <- function(dataframe) {
    #Remove links first, before punctuation removal breaks them into pieces
    dataframe$text <- gsub("http\\S+", "", dataframe$text)
    dataframe$text <- gsub("—", " ", dataframe$text)
    dataframe$text <- gsub("&", " ", dataframe$text)
    dataframe$text <- gsub("[[:punct:]]", " ", dataframe$text)
    dataframe$text <- gsub("[[:digit:]]", "", dataframe$text)
    dataframe$text <- gsub("\n", " ", dataframe$text)
    dataframe$text <- gsub("[ \t]{2,}", " ", dataframe$text)
    dataframe$text <- gsub("^\\s+|\\s+$", "", dataframe$text)
    dataframe$text <- tolower(dataframe$text)
    return(dataframe)
}
After executing the previous code, type in:
clinton_tweets <- textScrubber(clinton_tweets)
trump_tweets <- textScrubber(trump_tweets)
The tweets are now easier to work with. The next step is to remove all the so-called "stopwords" (words that do not add meaning to the topic) and convert the text into a Term Document Matrix (TDM). The TDM is then summed up so we get a data.frame of words arranged by how often they are used:
tdmCreator <- function(dataframe, stemDoc = T, rmStopwords = T){
    tdm <- Corpus(VectorSource(dataframe$text))
    if (isTRUE(rmStopwords)) {
        tdm <- tm_map(tdm, removeWords, stopwords())
    }
    if (isTRUE(stemDoc)) {
        tdm <- tm_map(tdm, stemDocument)
    }
    tdm <- TermDocumentMatrix(tdm,
                              control = list(wordLengths = c(1, Inf)))
    tdm <- rowSums(as.matrix(tdm))
    tdm <- sort(tdm, decreasing = T)
    df <- data.frame(term = names(tdm), freq = tdm)
    return(df)
}
Making a function like this is not strictly necessary, and if you want to do serious text mining (or sentiment analysis), then it is advisable to save a separate TDM for later use. For now though, having a function makes repeating the same code over and over again a little easier. The function takes two extra arguments. stemDoc asks if you want to stem the text, so that words with the same root are treated as one term; "great" and "greater", for example, get counted together. In most cases this is a useful feature, unless you are searching for specific words or phrases appearing in the tweets. rmStopwords simply removes words that are often used but don't add any additional meaning. I have defaulted both to TRUE in this example, but you can experiment with switching them on and off to see how it affects your results.
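For example, if you wanted to look for exact words or phrases instead, you could switch both off (a sketch; clinton_raw is just an illustrative name):

#Keep stopwords and skip stemming, e.g. when searching for exact phrases
clinton_raw <- tdmCreator(clinton_tweets, stemDoc = FALSE, rmStopwords = FALSE)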
To build the term-frequency data.frames for the two candidates, simply run:
clinton_tweets <- tdmCreator(clinton_tweets)
trump_tweets <- tdmCreator(trump_tweets)
Finally, we can plot each candidate's most used words:
#Selects the 15 most used words.
trump_tweets <- trump_tweets[1:15,]
clinton_tweets <- clinton_tweets[1:15,]

#Create bar graphs with appropriate colours
#and use coord_flip() to help the labels look nicer.
trump_plot <- ggplot(trump_tweets, aes(x = reorder(term, freq), y = freq)) +
    geom_bar(stat = "identity", fill = "red") +
    xlab("Most Used") + ylab("How Often") +
    coord_flip() +
    theme(text = element_text(size = 25, face = "bold"))

clinton_plot <- ggplot(clinton_tweets, aes(x = reorder(term, freq), y = freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    xlab("Most Used") + ylab("How Often") +
    coord_flip() +
    theme(text = element_text(size = 25, face = "bold"))

#There are other ways to get these plots
#side-by-side, but this is easy.
grid.arrange(trump_plot, clinton_plot, ncol = 2)
At the time of writing, these tweets are a little old, but they still convey a good deal of information. It is immediately obvious that both Trump and Clinton spend a great deal of time talking about Trump. Hillary especially seems to mention him as much as possible. Trump talking about himself is not really that much of a surprise…
But that's only for two Twitter accounts, and a few hundred tweets. What if we instead wanted to look at a selected sample, or a section of a population? To do so we need to curate some Twitter lists. Twitter lists are lists of Twitter members that users create in order to filter their home feed and only receive tweets from those members. These lists are useful for data-mining since they allow us to put people in different quasi-experimental groups that we can then compare afterwards. So all you need to do is create separate lists of the groups of people you want to extract tweets from. After you have created your list, you can simply download all the latest tweets from each member of the list using a simple for loop:
Comparing groups of tweeters
One way to write the previously mentioned for loop is to build it on top of the twitteR::userTimeline() function:
#Download tweets from a group of people specified in a Twitter list.
tweetsFromList <- function(listOwner, listName, sleepTime = 7,
                           n_tweets = 200, includeRts = F) {
    api_url <- paste0("https://api.twitter.com/1.1/lists/members.json?slug=",
                      listName, "&owner_screen_name=", listOwner, "&count=5000")

    #Pull out the users from the list
    response <- GET(api_url, config(token = twitteR:::get_oauth_sig()))
    response_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

    #If you want a list of their profile names,
    #then remove the hashtag underneath and
    #add it to the return statement at the bottom (use a list).
    #Useful if you need to verify identity of people.
    #user_title <- sapply(response_data$users, function(i) i$name)
    user_names <- sapply(response_data$users, function(i) i$screen_name)

    tweets <- list()
    #Loops over list of users, use rbind() to add them to list.
    #Sys.sleep ticks in between to avoid the rate limit.
    for (user in user_names) {
        #Download the user's timeline from Twitter
        raw_data <- userTimeline(user, n = n_tweets,
                                 includeRts = includeRts,
                                 retryOnRateLimit = 2000)
        #If a Twitter user has no tweets,
        #userTimeline and rbind fail.
        #The if-else statement solves this.
        if (length(raw_data) != 0L) {
            tweets <- rbind(tweets, raw_data)
            print('Sleeping to avoid rate limit')
            Sys.sleep(sleepTime)
        } else {
            next
        }
    }
    rm(raw_data)
    tweets <- twListToDF(tweets)
    return(tweets)
}
If you don’t have your own lists (they are easy to make), then you can also use lists someone else made. Simply supply the function with the list owner, as well as the name of the list.
To continue our investigation into the political world, let’s compare tweets from Democrat and Republican House Representatives.
republicans <- tweetsFromList("HouseGOP", "house-republicans")
democrats <- tweetsFromList("TheDemocrats", "house-democrats")
Note that both of these commands will take some time, as a lot of tweets will be downloaded. To get an idea of how long, use this formula: (number of list members * 7 sec) / 60 sec. There are 261 Republicans in our list, so that means it will take a minimum of roughly 30 minutes to download all the tweets. If you want it to go faster you can adjust the sleepTime parameter, but note that if you set it below 5, the Twitter API will start throttling you. Go grab a drink instead.
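As a quick sanity check, the back-of-the-envelope calculation in R (261 is simply the member count quoted above):

#Rough download time in minutes, assuming the default sleepTime of 7 seconds
list_members <- 261
(list_members * 7) / 60
#[1] 30.45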
After that is done, it is strongly advised that you save the contents to your hard drive (so you won't have to download them all over again). Use write.csv(nameOfFile, "nameOfCSV.csv") to save it in your working directory.
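For example, saving the two groups under the file names that the next code block reads back in (row.names = FALSE just avoids an extra index column on re-import):

write.csv(republicans, "republicans.csv", row.names = FALSE)
write.csv(democrats, "democrats.csv", row.names = FALSE)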
All done? Excellent, now we have tens of thousands of tweets to work with. Let's re-run the previous analysis and plot the graphs again. I have omitted the plotting code as it is identical to the one used previously.
democrats <- read.csv("democrats.csv")
republicans <- read.csv("republicans.csv")

democrats <- textScrubber(democrats)
republicans <- textScrubber(republicans)

#Warning: this step requires a lot of working memory available.
#If you do not have this, add removeSparseTerms(tdm, .97)
#to the tdmCreator function code (before the rowSums call).
dem <- tdmCreator(democrats)
rep <- tdmCreator(republicans)
That is certainly interesting. Turns out the House actually seems to care about the same things! Being up for re-election so much of the time seems to have a rather humbling effect.
Using meta-data to dig even deeper.
That's all fine and good, but to really understand their differences it is useful to look at some of the meta-data. For some reason this is rarely done in other guides on analyzing tweets, but I think it is useful for checking a couple of things between groups. For example:
- Do groups/politicians have different tweeting strategies?
- Do they for example tweet at different times of day to target their preferred audience?
- Do they all carefully schedule their tweets, or can they be more impulsive (dependent on party)?
This is not an exhaustive list of topics that can be investigated, but there is a lot you can do with the meta-data that also gets downloaded. Use the str() or View() function to investigate the data.frame of tweets yourself and you will see that there are numerous other things apart from just the tweets that get downloaded.
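For example (the column names follow what twListToDF returns; the exact set may vary with your twitteR version):

#Peek at the meta-data that comes with each tweet
str(republicans)
#Columns such as created and statusSource are the ones used below.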
For now, let's try to answer some of the example questions using the meta-data on when the tweets were posted, as well as which devices/platforms the politicians tweet from:
Do the parties tweet at different times of day?
#Clean the timing meta-data and plot.
democrats$group <- as.factor("dem")
republicans$group <- as.factor("rep")

#Create one large data.frame for comparison
both <- rbind(democrats, republicans)

#Use the lubridate package to make working with time easier.
both$date <- ymd_hms(both$created)
both$hour <- hour(both$date) + minute(both$date)/60
both$day <- weekdays(as.Date(both$date))

#Rearrange the weekdays in chronological order
both$day <- factor(both$day, levels = c("Monday", "Tuesday", "Wednesday",
                                        "Thursday", "Friday", "Saturday",
                                        "Sunday"))

#This last re-factoring below just makes
#the colors in the plot correct.
#For your own groups you want to
#remove the line of code underneath.
both$group <- factor(both$group, levels = c("rep", "dem"))

ggplot(both, aes(hour, colour = group)) +
    geom_density()

ggplot(both, aes(hour, colour = group)) +
    geom_density() +
    facet_wrap(~day, nrow = 3)
Voila; a kernel density plot showing when tweets appear. There actually seems to be a large gap between the peaks of the two parties' representatives! Also notice that the Democrats are slightly more likely to stay up late at night to tweet. Maybe they are more "liberal" with the alcohol and therefore have less impulse control? One can only imagine…
If we divide up the plot by day we also get a lot of info:
Republicans have their peak at almost exactly 15:00 every single day. I am not American myself, so I don't know the political system well enough, but that seems a little too odd to be just a coincidence. Feel free to offer suggestions in the comments. On a more humorous note, the late-night bump for the Democrats is even more pronounced during the weekend.
Who schedules the most and who tweets more on the fly?
To answer this question, we tidy up and count the different names found in the statusSource column of our data.frame. This will act as a proxy for scheduling (as scheduling is a feature limited to certain platforms):
device <- function(dataframe) {
    tot_all <- length(dataframe$statusSource)
    Web <- 100*(length(grep("Twitter Web Client", dataframe$statusSource))/tot_all)
    TweetDeck <- 100*(length(grep("TweetDeck", dataframe$statusSource))/tot_all)
    iPhone <- 100*(length(grep("Twitter for iPhone", dataframe$statusSource))/tot_all)
    iPad <- 100*(length(grep("Twitter for iPad", dataframe$statusSource))/tot_all)
    Blackberry <- 100*(length(grep("Twitter for BlackBerry", dataframe$statusSource))/tot_all)
    Tweetbot <- 100*(length(grep("Tweetbot for i", dataframe$statusSource))/tot_all)
    Hootsuite <- 100*(length(grep("Hootsuite", dataframe$statusSource))/tot_all)
    Android <- 100*(length(grep("Twitter for Android", dataframe$statusSource))/tot_all)
    Ads <- 100*(length(grep("Twitter Ads", dataframe$statusSource))/tot_all)
    #fixed = TRUE so the parentheses are matched literally
    M5 <- 100*(length(grep("Mobile Web (M5)", dataframe$statusSource, fixed = TRUE))/tot_all)
    Mac <- 100*(length(grep("Twitter for Mac", dataframe$statusSource))/tot_all)
    Facebook <- 100*(length(grep("Facebook", dataframe$statusSource))/tot_all)
    Instagram <- 100*(length(grep("Instagram", dataframe$statusSource))/tot_all)
    IFTT <- 100*(length(grep("IFTT", dataframe$statusSource))/tot_all)
    Buffer <- 100*(length(grep("Buffer", dataframe$statusSource))/tot_all)
    CoSchedule <- 100*(length(grep("CoSchedule", dataframe$statusSource))/tot_all)
    GainApp <- 100*(length(grep("Gain App", dataframe$statusSource))/tot_all)
    MobileWeb <- 100*(length(grep("Mobile Web", dataframe$statusSource))/tot_all)
    iOS <- 100*(length(grep("iOS", dataframe$statusSource))/tot_all)
    OSX <- 100*(length(grep("OS X", dataframe$statusSource))/tot_all)
    Echofon <- 100*(length(grep("Echofon", dataframe$statusSource))/tot_all)
    Fireside <- 100*(length(grep("Fireside Publishing", dataframe$statusSource))/tot_all)
    Google <- 100*(length(grep("Google", dataframe$statusSource))/tot_all)
    MailChimp <- 100*(length(grep("MailChimp", dataframe$statusSource))/tot_all)
    TwitterForWebsites <- 100*(length(grep("Twitter for Websites", dataframe$statusSource))/tot_all)

    percentages <- data.frame(Web, TweetDeck, iPhone, iPad, Blackberry,
                              Tweetbot, Hootsuite, Android, Ads, M5, Mac,
                              Facebook, Instagram, IFTT, Buffer, CoSchedule,
                              GainApp, MobileWeb, iOS, OSX, Echofon, Fireside,
                              Google, MailChimp, TwitterForWebsites)
    return(percentages)
}
Now we can sort by the most popular platforms. (Note that clinton_tweets and trump_tweets here need to be the original downloaded data.frames with the statusSource column, not the term-frequency tables we overwrote them with earlier, so re-load them from disk if necessary.)
repDevice <- sort(device(republicans), decreasing = T)
demDevice <- sort(device(democrats), decreasing = T)
clintonDevice <- sort(device(clinton_tweets), decreasing = T)
trumpDevice <- sort(device(trump_tweets), decreasing = T)

#Pick out the top 5 platforms:
repDevice[ ,1:5]
#Web       TweetDeck iPhone    Hootsuite Facebook
#50.40192  21.82417  17.07703  3.487563  1.424141

demDevice[ ,1:5]
#Web       iPhone    TweetDeck Hootsuite iPad
#51.41274  20.81256  17.50693  2.705448  1.745152

trumpDevice[ ,1:5]
#Android   iPhone    Web       TweetDeck iPad
#62.62626  28.28283  9.090909  0         0

clintonDevice[ ,1:5]
#TweetDeck Web       iPhone    iPad      Blackberry
#73.02632  21.05263  5.921053  0         0
Those are some pretty big differences, especially between Trump and Hillary. Whereas Hillary seems to use TweetDeck the most (probably to schedule tweets), Trump seems to live and die by his two phones. This is consistent with his impulsive image.
Republicans and Democrats are more similar, especially in how no-one uses Android phones. Looks like if you're an American representative, then you better use American products. But note how Republicans seem to use scheduling services (TweetDeck and Hootsuite) a little more often. This could potentially explain the 15:00 bump that we found previously.
A more thorough investigation could look at what type of platforms were most used around that time of day, but this article is already getting quite long. So I am going to stop here, and I hope you will play around with the code yourself to see what you can find. In the future I am hoping to make a post on combining different sorts of meta-data using the dplyr package to answer even more fine-grained questions.
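If you want a head start on that kind of follow-up, here is one possible sketch (not part of the analysis above): it assumes the combined both data.frame from the timing section is still in memory, loads dplyr separately, and uses an illustrative split of platforms and an illustrative 14:00-16:00 window around the Republican peak:

library(dplyr)

#Which platforms were used around the 15:00 peak, by party?
#The platform grouping and the time window are illustrative choices.
both %>%
    mutate(platform = ifelse(grepl("TweetDeck|Hootsuite", statusSource),
                             "Scheduling tool", "Other")) %>%
    filter(hour >= 14, hour < 16) %>%
    count(group, platform)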
- If you have any questions, or something isn’t working as it should, please feel free to contact me @andreasose.