Charting Twitter time series data with tweet and unique user counts
[This article was first published on Bommarito Consulting » r, and kindly contributed to R-bloggers].
Let’s say you’ve used my Python script to automate the download of a hashtag or search phrase from Twitter (in a Unicode-safe way, unlike within R). Now let’s say you want to visualize the number of tweets over time. Easy enough: I’ve also shared this R/ggplot2 code that accomplishes the task. However, let’s say you now want a plot that shows tweet frequency in one dimension (height/y) and the number of unique users in another (color, fill transparency, etc.). What do you do?
To make your life easier, I’ve published this simple R/ggplot2 script on GitHub. It is embedded below:
# @author: Bommarito Consulting, LLC; http://michaelbommarito.com/
# @date: May 21, 2012
# @email: michael@bommaritollc.com
# @packages: ggplot2, plyr
# Clear and import.
rm(list=ls())
library(ggplot2)
library(plyr)
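# Controlling parameters.
hashtag <- "#nonato" # Hashtag for label purposes
# First timestamp we will consider. "EDT" is not a valid time zone name and R
# treats unrecognized names as UTC, so an Olson name is used here (assuming
# US Eastern time was intended).
cutoff <- as.POSIXct("2012-01-11 00:00:00", tz="America/New_York")
dt <- 30 # \Delta t, minutes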
# Load and pre-process tweets
tweets <- unique(read.table('data/tweets.csv', sep="\t", quote="", comment.char="",
                            stringsAsFactors=FALSE, header=FALSE, nrows=300000))
names(tweets) <- c("id", "date", "user", "text")
tweets$date <- as.POSIXct(strptime(tweets$date, "%a, %d %b %Y %H:%M:%S %z", tz = "GMT"))
tweets <- tweets[which(tweets$date > cutoff), ]
# Build date breaks
minDate <- min(tweets$date)
maxDate <- max(tweets$date) + 60 * dt
dateBreaks <- seq(minDate, maxDate, by=60 * dt)
# Use hist to count the number of tweets per bin; don't plot.
# right=FALSE makes each bin half-open, [left, right), matching the user count below.
tweetCount <- hist(tweets$date, breaks=dateBreaks, right=FALSE, plot=FALSE)
# Drop the final break, leaving the left endpoint of each bin.
binBreaks <- head(tweetCount$breaks, -1)
# Count number of unique tweeters per bin, using the same half-open interval so
# that a tweet on a bin boundary is not counted in two bins.
userCount <- sapply(binBreaks, function(d) length(unique(tweets$user[which((tweets$date >= d) & (tweets$date < d + 60*dt))])))
# Plot data
plotData <- data.frame(dates=head(dateBreaks, -1), tweets=as.numeric(tweetCount$counts), users=as.numeric(userCount))
# opts() has been removed from ggplot2; ggtitle() sets the plot title instead.
ggplot(plotData) +
  geom_bar(aes(x=dates, y=tweets, color=users), stat="identity") +
  scale_x_datetime("Date") +
  scale_y_continuous("Number of tweets") +
  ggtitle(paste("Number of tweets and unique users :", hashtag))
ggsave("fig/ts_tweet_user.jpg", width=12, height=8)
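If you want to check the binning logic without downloading any tweets, the counting step can be sketched on a few made-up rows. This is only an illustration, not the script above: the `toy` data frame, its user names, and its timestamps are invented, and it swaps the `hist`/`sapply` approach for base R's `cut()` and `tapply()` so that tweet counts and unique-user counts come from the same half-open bins.

```r
# Toy data (hypothetical): five tweets from three users, timestamps in UTC.
toy <- data.frame(
  date = as.POSIXct(c("2012-01-11 00:05:00", "2012-01-11 00:10:00",
                      "2012-01-11 00:30:00", "2012-01-11 00:45:00",
                      "2012-01-11 01:20:00"), tz = "UTC"),
  user = c("alice", "alice", "bob", "carol", "bob"),
  stringsAsFactors = FALSE
)

dt <- 30  # bin width in minutes, as in the script above

# Breaks start on the hour and extend one bin past the last tweet.
start  <- as.POSIXct("2012-01-11 00:00:00", tz = "UTC")
breaks <- seq(start, max(toy$date) + 60 * dt, by = 60 * dt)

# Assign each tweet to a half-open bin [left, right); with right = FALSE a
# tweet exactly on a boundary belongs to the bin that starts there.
bin <- cut(toy$date, breaks = breaks, right = FALSE)

# Tweets per bin and unique users per bin, both derived from the same factor.
tweetsPerBin <- as.numeric(table(bin))
usersPerBin  <- tapply(toy$user, bin, function(u) length(unique(u)))
usersPerBin[is.na(usersPerBin)] <- 0  # empty bins come back as NA

binned <- data.frame(dates  = head(breaks, -1),
                     tweets = tweetsPerBin,
                     users  = as.numeric(usersPerBin))
print(binned)
```

On this toy data the three 30-minute bins hold 2, 2, and 1 tweets from 1, 2, and 1 unique users. The resulting `binned` data frame has the same columns as `plotData`, so it could be passed to the same `ggplot` call.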