Charting Twitter time series data with tweet and unique user counts

[This article was first published on Bommarito Consulting » r, and kindly contributed to R-bloggers.]

Let’s say you’ve used my Python script to automate the download of a hashtag or search phrase from Twitter (in a Unicode-safe way, unlike within R). Now let’s say you want to visualize the number of tweets over time. Easy enough: I’ve also shared this R/ggplot2 code that accomplishes the task. However, let’s say you now want a plot that shows both tweet frequency in one dimension (height/y) and the number of unique users in another (color, fill transparency, etc.). What do you do?

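To make your life easier, I’ve published this simple R/ggplot2 script on GitHub to help. It bins the tweets into intervals of dt minutes, counts the tweets and the unique users in each bin, and draws a bar chart with tweet count as bar height and user count as bar color. Embedded below: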

# @author: Bommarito Consulting, LLC; http://michaelbommarito.com/
# @date: May 21, 2012
# @email: michael@bommaritollc.com
# @packages: ggplot2, plyr
# Clear and import.
rm(list=ls())
library(ggplot2)
library(plyr)
# Controlling parameters.
hashtag <- "#nonato" # Hashtag for label purposes
cutoff <- as.POSIXct("2012-01-11 00:00:00", tz="America/New_York") # First timestamp we will consider (US Eastern; "EDT" is not a valid time zone name in R)
dt <- 30 # \Delta t, minutes
# Load and pre-process tweets
tweets <- unique(read.table('data/tweets.csv', sep="\t", quote="", comment.char="",
                            stringsAsFactors=FALSE, header=FALSE, nrows=300000))
names(tweets) <- c("id", "date", "user", "text")
tweets$date <- as.POSIXct(strptime(tweets$date, "%a, %d %b %Y %H:%M:%S %z", tz = "GMT"))
tweets <- tweets[which(tweets$date > cutoff), ]
# Build date breaks
minDate <- min(tweets$date)
maxDate <- max(tweets$date) + 60 * dt
dateBreaks <- seq(minDate, maxDate, by=60 * dt)
# Use hist to count the number of tweets per bin; don't plot.
# right=FALSE makes each bin half-open: [start, start + dt).
tweetCount <- hist(tweets$date, breaks=dateBreaks, plot=FALSE, right=FALSE)
# Keep the left endpoint of each bin (drop the final break).
binBreaks <- head(tweetCount$breaks, -1)
# Count number of unique tweeters per bin, using the same half-open bins
# so that a tweet on a bin boundary is not counted in two bins.
userCount <- sapply(binBreaks, function(d) length(unique(tweets$user[which((tweets$date >= d) & (tweets$date < d + 60*dt))])))
# Plot data
plotData <- data.frame(dates=head(dateBreaks, -1), tweets=as.numeric(tweetCount$counts), users=as.numeric(userCount))
ggplot(plotData) +
  geom_bar(aes(x=dates, y=tweets, color=users), stat="identity") +
  scale_x_datetime("Date") +
  scale_y_continuous("Number of tweets") +
  # opts() has been removed from ggplot2; ggtitle() sets the plot title.
  ggtitle(paste("Number of tweets and unique users :", hashtag))
ggsave("fig/ts_tweet_user.jpg", width=12, height=8)
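
If you want to sanity-check the counting step before pointing it at real data, here is a minimal sketch on made-up tweets. It is not part of the script above: the toy data frame and its user names are invented for illustration, and it swaps hist() and sapply() for base R's cut() and tapply(), which assign each tweet to exactly one dt-minute bin.

```r
# Toy data (invented for illustration): five tweets from three users.
toy <- data.frame(
  date = as.POSIXct(c("2012-01-11 00:05:00", "2012-01-11 00:10:00",
                      "2012-01-11 00:20:00", "2012-01-11 00:40:00",
                      "2012-01-11 00:55:00"), tz = "GMT"),
  user = c("alice", "alice", "bob", "bob", "carol"),
  stringsAsFactors = FALSE
)

dt <- 30  # bin width in minutes, as in the script above

# Three breaks give two half-open bins: [00:00, 00:30) and [00:30, 01:00).
breaks <- seq(as.POSIXct("2012-01-11 00:00:00", tz = "GMT"),
              by = 60 * dt, length.out = 3)
bin <- cut(toy$date, breaks = breaks, right = FALSE)

# Tweets per bin and distinct users per bin, side by side.
counts <- data.frame(
  bin    = levels(bin),
  tweets = as.numeric(table(bin)),
  users  = as.numeric(tapply(toy$user, bin, function(u) length(unique(u))))
)
print(counts)
```

Here the first half hour holds three tweets from two users and the second holds two tweets from two users. Note that tapply() returns NA rather than 0 for a bin with no tweets, so real data with quiet periods would need those values replaced before plotting.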
