Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve largely avoided “time” in R to date, but following a chat with @mhawksey at #dev8d yesterday, I went down a rathole last night exploring a few ways of visualising a Twitter user timeline and as a result also had a quick initial play with some time handling features of R, such as timeseries objects, and generating daily, weekly and monthly summary counts of data value.
To start, let’s grab a user timeline. As Martin started it (?!), we’ll use his…;-)
require(twitteR)
#the most tweets we can bring back from a user timeline is the most recent 3600...
mht=userTimeline('mhawksey',n=3600)
tw.df=twListToDF(mht)
#As I've done in previous scripts, pull out the names of folk who have been "old-fashioned RTd"...
require(stringr)
trim <- function (x) sub('@','',x)
tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
tw.df$rtt=sapply(tw.df$rt,function(rt) if (is.na(rt)) 'T' else 'RT')
The returned data includes a created attribute (of the form “2012-02-17 11:40:25″) and a replyToSN attribute that includes the username of a user Martin was replying to via a particular tweet.
The simplest way I can think of displaying the data is to just display the screenName atrribute of the sender (which in this case is always mhawskey) against time:
require(ggplot2) ggplot(tw.df)+geom_point(aes(x=created,y=screenName))
As ever, things are never that simple… some tweets with old dates appear to have crept in somehow… A couple of things I tried realting to time based filtering caused R to have all sorts of malloc errors, so here’s a fudge I found to just display tweets that were created within the last 8,000 hours…
tw.dfs=subset(tw.df,subset=((Sys.time()-created)<8000)) ggplot(tw.dfs)+geom_point(aes(x=created,y=screenName))
Okay, so not very interesting… It shows that Martin tweets…
Picking up on views of the style doodled in Visualising Activity Around a Twitter Hashtag or Search Term Using R, where we look at when new users appear in a hashtag stream, we can plot when Martin replies to another twitter user, arranging the user names in the order in which they were first publicly replied to:
require(plyr)
#Order the replyToSN factor levels in the order in which they were first created
tw.dfx=ddply(tw.dfs, .var = "replyToSN", .fun = function(x) {return(subset(x, created %in% min(created),select=c(replyToSN,created)))})
tw.dfxa=arrange(tw.dfx,-desc(created))
tw.dfs$replyToSN=factor(tw.dfs$replyToSN, levels = tw.dfxa$replyToSN)
#and plot the result
ggplot(tw.dfs)+geom_point(aes(x=created,y=replyToSN))
The line at the top are tweets where the replyToSN value was NA (not available).
We can then go a little further and plot when folk are replied to or retweeted, as well as tweets that are neither a reply nor an old-style retweet:
ggplot()+geom_point(data=subset(tw.dfs,subset=(!is.na(replyToSN))),aes(x=created,y=replyToSN),col='red') + geom_point(data=subset(tw.dfs,subset=(!is.na(rt))),aes(x=created,y=rt),col='blue') + geom_point(data=subset(tw.dfs,subset=(is.na(replyToSN) & is.na(rt))),aes(x=created,y=screenName),col='green')
Here, the blue dots are old-style retweets, the red dots are replies, and the green dots are tweets that are neither replies nor old-style retweets. If a blue dot appears on a row before a red dot, it shows Martin RT’d them before ever replying to them. If blue dots are on a row that contains no red dot, then it shows Martin has RT’d but not replied to that person. A heavily populated row shows Martin has repeated interactions with that user.
We can generate an ordered bar chart showing who is most heavily replied to:
#First we need to count how many replies a user gets... #http://stackoverflow.com/a/3255448/454773 r_table <- table(tw.dfs$replyToSN) #..rank them... r_levels <- names(r_table)[order(-r_table)] #..and use this ordering to order the factor levels... tw.dfs$replyToSN <- factor(tw.dfs$replyToSN, levels = r_levels) #Then we can plot the chart... ggplot(subset(tw.dfs,subset=(!is.na(replyToSN))),aes(x=replyToSN)) + geom_bar(aes(y = (..count..)))+opts(axis.text.x=theme_text(angle=-90,size=6))
(Hmmm… how would I filter this to only show folk replied to more than 50 times, for example?)
Sometimes, a text view is easier…
head(table(tw.dfs$replyToSN))
#eg returns:
#psychemedia wilm ambrouk sheilmcn dajbelshaw manmalik
394 66 59 53 48 43
#Hmm..can we generalise this?
topTastic=function(dfc,num=5){
r_table <- table(dfc)
r_levels <- names(r_table)[order(-r_table)]
head(table(factor(dfc, levels = r_levels)),num)
}
#so now, for example, I should be able to display the most old-style retweeted folk?
topTastic(tw.dfs$rt)
#or the 10 most replied to...
topTastic(tw.dfs$replyToSN,10)
Let’s try some time stuff now… From the R Cookbook, I find I can do this:
#label a tweet with the month number
tw.dfs$month=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$mon})
#label a tweet with the hour
tw.dfs$hour=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$hour})
#label a tweet with a number corresponding to the day of the week
tw.dfs$wday=sapply(tw.dfs$created, function(x) {p=as.POSIXlt(x);p$wday})
What this means is we can now chart a count of the number of tweets by day, week, or hour… For example, here’s hour vs. day of the week:
ggplot(tw.dfs)+geom_jitter(aes(x=wday,y=hour))
Note that this jittered scattergraph, where each dot is a tweet, only approximates the time each tweet occurred – the jitter applied is a random quantity designed to separate out tweets posted within the same hour-and-day-of-the-week bin.
What about Martin’s tweeting behaviour over time?
#We can also generate barplots showing the distribution of tweet count over time: ggplot(tw.dfs,aes(x=created))+geom_bar(aes(y = (..count..))) #Hmm... I'm not sure how to manually set binwidth= sensibly, though?!
Here’s a plot of the number of counts per… I’m not sure: the bin width was calculated automatically…
How about using the number of tweets in a particular day or hour bin to see what times of day or days of week Martin is tweeting?
#We can also plot the number of tweets within particular hour or time bins... ggplot(tw.dfs,aes(x=wday))+geom_bar(aes(y = (..count..)),binwidth=1) ggplot(tw.dfs,aes(x=hour))+geom_bar(aes(y = (..count..)),binwidth=1)
This chart shows activity (in terms of count…) per hour of day.
As well as doing the count of tweets per hour, for example, via a ggplot statistical graphical function, we can also get day, week, month, quarter and year counts from a set of functions associated with a particular sort of timeseries object…
Each element in a time series typically has two elements – a timestamp, and a numeric value. We can generate a time series of a sort around a twitter usertimeline by creating a dummy quantity – such as the unit value, 1 – and associate it with each timestamp:
require(xts)
#The xts function creates a timeline from a vector of values and a vector of timestamps.
#If we know how many tweets we have, we can just create a simple list or vector containing that number of 1s
ts=xts(rep(1,times=nrow(tw.dfs)),tw.dfs$created)
#We can now do some handy number crunching on the timeseries, such as applying a formula to values contained with day, week, month, quarter or year time bins.
#So for example, if we sum the unit values in daily bin, we can get a count of the number of tweets per day
ts.sum=apply.daily(ts,sum)
#also apply. weekly, monthly, quarterly, yearly
#If for any resason we need to turn the timeseries into a dataframe, we can:
#http://stackoverflow.com/a/3387259/454773
ts.sum.df=data.frame(date=index(ts.sum), coredata(ts.sum))
colnames(ts.sum.df)=c('date','sum')
#We can then use ggplot to plot the timeseries...
ggplot(ts.sum.df)+geom_line(aes(x=date,y=sum))
#Having got the data in a timeseries form, we can do timeseries based things to it... such as checking the autocorrelation: acf(ts.sum)
Hmmm.. so, one day is much the same as another, but there also appears to be a weekly (7 day periodicity) pattern…
Finally, here’s a handy script I found on the Revolution Analytics site for Charting time series as calendar heat maps in R:
##############################################################################
# Calendar Heatmap #
# by #
# Paul Bleicher #
# an R version of a graphic from: #
# http://stat-computing.org/dataexpo/2009/posters/wicklin-allison.pdf #
# requires lattice, chron, grid packages #
##############################################################################
## calendarHeat: An R function to display time-series data as a calendar heatmap
## Copyright 2009 Humedica. All rights reserved.
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
## You can find a copy of the GNU General Public License, Version 2 at:
## http://www.gnu.org/licenses/gpl-2.0.html
calendarHeat <- function(dates,
values,
ncolors=99,
color="r2g",
varname="Values",
date.form = "%Y-%m-%d", ...) {
require(lattice)
require(grid)
require(chron)
if (class(dates) == "character" | class(dates) == "factor" ) {
dates <- strptime(dates, date.form)
}
caldat <- data.frame(value = values, dates = dates)
min.date <- as.Date(paste(format(min(dates), "%Y"),
"-1-1",sep = ""))
max.date <- as.Date(paste(format(max(dates), "%Y"),
"-12-31", sep = ""))
dates.f <- data.frame(date.seq = seq(min.date, max.date, by="days"))
# Merge moves data by one day, avoid
caldat <- data.frame(date.seq = seq(min.date, max.date, by="days"), value = NA)
dates <- as.Date(dates)
caldat$value[match(dates, caldat$date.seq)] <- values
caldat$dotw <- as.numeric(format(caldat$date.seq, "%w"))
caldat$woty <- as.numeric(format(caldat$date.seq, "%U")) + 1
caldat$yr <- as.factor(format(caldat$date.seq, "%Y"))
caldat$month <- as.numeric(format(caldat$date.seq, "%m"))
yrs <- as.character(unique(caldat$yr))
d.loc <- as.numeric()
for (m in min(yrs):max(yrs)) {
d.subset <- which(caldat$yr == m)
sub.seq <- seq(1,length(d.subset))
d.loc <- c(d.loc, sub.seq)
}
caldat <- cbind(caldat, seq=d.loc)
#color styles
r2b <- c("#0571B0", "#92C5DE", "#F7F7F7", "#F4A582", "#CA0020") #red to blue
r2g <- c("#D61818", "#FFAE63", "#FFFFBD", "#B5E384") #red to green
w2b <- c("#045A8D", "#2B8CBE", "#74A9CF", "#BDC9E1", "#F1EEF6") #white to blue
assign("col.sty", get(color))
calendar.pal <- colorRampPalette((col.sty), space = "Lab")
def.theme <- lattice.getOption("default.theme")
cal.theme <-
function() {
theme <-
list(
strip.background = list(col = "transparent"),
strip.border = list(col = "transparent"),
axis.line = list(col="transparent"),
par.strip.text=list(cex=0.8))
}
lattice.options(default.theme = cal.theme)
yrs <- (unique(caldat$yr))
nyr <- length(yrs)
print(cal.plot <- levelplot(value~woty*dotw | yr, data=caldat,
as.table=TRUE,
aspect=.12,
layout = c(1, nyr%%7),
between = list(x=0, y=c(1,1)),
strip=TRUE,
main = paste("Calendar Heat Map of ", varname, sep = ""),
scales = list(
x = list(
at= c(seq(2.9, 52, by=4.42)),
labels = month.abb,
alternating = c(1, rep(0, (nyr-1))),
tck=0,
cex = 0.7),
y=list(
at = c(0, 1, 2, 3, 4, 5, 6),
labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday"),
alternating = 1,
cex = 0.6,
tck=0)),
xlim =c(0.4, 54.6),
ylim=c(6.6,-0.6),
cuts= ncolors - 1,
col.regions = (calendar.pal(ncolors)),
xlab="" ,
ylab="",
colorkey= list(col = calendar.pal(ncolors), width = 0.6, height = 0.5),
subscripts=TRUE
) )
panel.locs <- trellis.currentLayout()
for (row in 1:nrow(panel.locs)) {
for (column in 1:ncol(panel.locs)) {
if (panel.locs[row, column] > 0)
{
trellis.focus("panel", row = row, column = column,
highlight = FALSE)
xyetc <- trellis.panelArgs()
subs <- caldat[xyetc$subscripts,]
dates.fsubs <- caldat[caldat$yr == unique(subs$yr),]
y.start <- dates.fsubs$dotw[1]
y.end <- dates.fsubs$dotw[nrow(dates.fsubs)]
dates.len <- nrow(dates.fsubs)
adj.start <- dates.fsubs$woty[1]
for (k in 0:6) {
if (k < y.start) {
x.start <- adj.start + 0.5
} else {
x.start <- adj.start - 0.5
}
if (k > y.end) {
x.finis <- dates.fsubs$woty[nrow(dates.fsubs)] - 0.5
} else {
x.finis <- dates.fsubs$woty[nrow(dates.fsubs)] + 0.5
}
grid.lines(x = c(x.start, x.finis), y = c(k -0.5, k - 0.5),
default.units = "native", gp=gpar(col = "grey", lwd = 1))
}
if (adj.start < 2) {
grid.lines(x = c( 0.5, 0.5), y = c(6.5, y.start-0.5),
default.units = "native", gp=gpar(col = "grey", lwd = 1))
grid.lines(x = c(1.5, 1.5), y = c(6.5, -0.5), default.units = "native",
gp=gpar(col = "grey", lwd = 1))
grid.lines(x = c(x.finis, x.finis),
y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native",
gp=gpar(col = "grey", lwd = 1))
if (dates.fsubs$dotw[dates.len] != 6) {
grid.lines(x = c(x.finis + 1, x.finis + 1),
y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native",
gp=gpar(col = "grey", lwd = 1))
}
grid.lines(x = c(x.finis, x.finis),
y = c(dates.fsubs$dotw[dates.len] -0.5, -0.5), default.units = "native",
gp=gpar(col = "grey", lwd = 1))
}
for (n in 1:51) {
grid.lines(x = c(n + 1.5, n + 1.5),
y = c(-0.5, 6.5), default.units = "native", gp=gpar(col = "grey", lwd = 1))
}
x.start <- adj.start - 0.5
if (y.start > 0) {
grid.lines(x = c(x.start, x.start + 1),
y = c(y.start - 0.5, y.start - 0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start + 1, x.start + 1),
y = c(y.start - 0.5 , -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.start),
y = c(y.start - 0.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
if (y.end < 6 ) {
grid.lines(x = c(x.start + 1, x.finis + 1),
y = c(-0.5, -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.finis),
y = c(6.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
} else {
grid.lines(x = c(x.start + 1, x.finis),
y = c(-0.5, -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.finis),
y = c(6.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
}
} else {
grid.lines(x = c(x.start, x.start),
y = c( - 0.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
}
if (y.start == 0 ) {
if (y.end < 6 ) {
grid.lines(x = c(x.start, x.finis + 1),
y = c(-0.5, -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.finis),
y = c(6.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
} else {
grid.lines(x = c(x.start + 1, x.finis),
y = c(-0.5, -0.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.start, x.finis),
y = c(6.5, 6.5), default.units = "native",
gp=gpar(col = "black", lwd = 1.75))
}
}
for (j in 1:12) {
last.month <- max(dates.fsubs$seq[dates.fsubs$month == j])
x.last.m <- dates.fsubs$woty[last.month] + 0.5
y.last.m <- dates.fsubs$dotw[last.month] + 0.5
grid.lines(x = c(x.last.m, x.last.m), y = c(-0.5, y.last.m),
default.units = "native", gp=gpar(col = "black", lwd = 1.75))
if ((y.last.m) < 6) {
grid.lines(x = c(x.last.m, x.last.m - 1), y = c(y.last.m, y.last.m),
default.units = "native", gp=gpar(col = "black", lwd = 1.75))
grid.lines(x = c(x.last.m - 1, x.last.m - 1), y = c(y.last.m, 6.5),
default.units = "native", gp=gpar(col = "black", lwd = 1.75))
} else {
grid.lines(x = c(x.last.m, x.last.m), y = c(- 0.5, 6.5),
default.units = "native", gp=gpar(col = "black", lwd = 1.75))
}
}
}
}
trellis.unfocus()
}
lattice.options(default.theme = def.theme)
}
If we pass the dataframed time series data counting the sum (count) of tweets per day, we can get a calendar heatmap view of Martin’s twitter activity:
calendarHeat(ts.sum.df$date, ts.sum.df$sum, varname="@mhawksey Twitter activity")
I’m not sure if this is even interesting, let alone useful, but I do think now I’ve found out a little bit about working with time in R, that could be handy…
Still to do: extract hashtags and visualise them; extend the twitteR library so it exposes things like retweet counts. But that’s for another day…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
