require("RJSONIO")

The first step is to extract some information from a tweet, and store it in a dataset (details can be found on https://dev.twitter.com/),
obtenir_ligne <- function(unTweet){
  date_courante = unTweet$created_at
  id_courant    = unTweet$id_str
  text          = unTweet$text
  nb_followers  = unTweet$user$followers_count
  nb_amis       = unTweet$user$friends_count
  utc_offset    = unTweet$user$utc_offset
  listeMentions = unTweet$entities$user_mentions
  return(c(list(c(id_courant, date_courante, text,
                  nb_followers, nb_amis, utc_offset)),
           list(do.call("rbind", lapply(listeMentions,
                function(x, id_courant) c(id_courant, x$screen_name),
                unTweet$id_str)))))
}

Now that we have the code to extract information from one tweet, let us find several tweets from one user, say my account,
nom = "Freakonometrics"

The (small) problem here is that there is a limitation: we can only get 100 tweets per call of the function,
n = 100
tweets_courants = scan(paste(
  "http://api.twitter.com/1/statuses/user_timeline.json",
  "?include_entities=true&include_rts=true&screen_name=",
  nom, "&count=", n, sep=""),
  what = "character", encoding = "latin1")
tweets_courants = paste(tweets_courants[
  1:length(tweets_courants)], collapse=" ")
tweets_courants = fromJSON(tweets_courants, method = "C")

Then, we use our function to build a database with 100 lines,
extracTweets <- lapply(tweets_courants, obtenir_ligne)
mentions = do.call("rbind", lapply(extracTweets, function(x) x[[2]]))
colnames(mentions) = c("id", "screen_name")
res = t(sapply(extracTweets, function(x) x[[1]]))
colnames(res) <- c("id", "date", "text",
                   "nb_followers", "nb_amis", "utc_offset")
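As a quick sanity check (this snippet is not in the original post), res should at this stage be a matrix with 100 rows, one per tweet, and the six columns just named,

dim(res)   # should be 100 x 6
head(res[, c("date", "nb_followers")])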
The idea then is simply to use a loop, based on the latest id observed,

dernier_id = tweets_courants[[length(tweets_courants)]]$id_str

So, here we go,
compteurLimite = 100
while(compteurLimite < 4100){
  tweets_courants = scan(paste(
    "http://api.twitter.com/1/statuses/user_timeline.json",
    "?include_entities=true&include_rts=true&screen_name=",
    nom, "&count=", n, "&max_id=", dernier_id, sep=""),
    what = "character", encoding = "latin1")
  tweets_courants = paste(tweets_courants[
    1:length(tweets_courants)], collapse=" ")
  tweets_courants = fromJSON(tweets_courants, method = "C")
  # max_id is inclusive: the first tweet returned is the one
  # we already stored, so we skip it
  extracTweets <- lapply(tweets_courants[
    2:length(tweets_courants)], obtenir_ligne)
  mentions = rbind(mentions, do.call("rbind",
    lapply(extracTweets, function(x) x[[2]])))
  res = rbind(res, t(sapply(extracTweets, function(x) x[[1]])))
  dernier_id = tweets_courants[[length(tweets_courants)]]$id_str
  compteurLimite = compteurLimite + 100
}
resFreakonometrics = res = data.frame(res, stringsAsFactors = FALSE)

All the information about my own tweets (and retweets) is now stored in a nice dataset. Actually, we have even more, since we have also extracted the names of the people mentioned in those tweets,
mentionsFreakonometrics = data.frame(mentions)

We can look at the people I mention in my tweets,
gazouillis = sapply(split(mentionsFreakonometrics,
  mentions$screen_name), nrow)
gazouillis = gazouillis[order(gazouillis, decreasing=TRUE)]
plot(gazouillis)
plot(gazouillis, log="xy")

> gazouillis[1:20]
        tomroud freakonometrics       adelaigue       dmonniaux
            155              84              77              56
    J_P_Boucher         embruns      SkyZeLimit        coulmont
             42              39              35              31
     Fabrice_BM            3wen          obouba          msotod
             31              30              29              27
         StatFr     nholzschuch        renaudjf        squintar
             26              25              23              23
        Vicnent        pareto35        romainqc        valatini
             23              22              22              22

If we plot those frequencies, we can clearly observe a standard Pareto distribution.
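To make that claim a bit more concrete, a standard check (this snippet is not in the original post) is to regress the log-frequencies on the log-ranks: an approximately linear fit, with slope minus the tail index, is consistent with a power law,

rang = 1:length(gazouillis)
fit  = lm(log(gazouillis) ~ log(rang))
summary(fit)$coefficients   # the slope estimates -alpha
plot(log(rang), log(gazouillis))
abline(fit, col="red", lwd=2)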
Twitter returns dates with English day and month abbreviations; since my R session runs in French, they first have to be translated so that strptime can parse them,

changer_date_anglais <- function(date_courante){
  mois   <- c("Jan","Fév","Mar","Avr","Mai","Jui",
              "Jul","Aoû","Sep","Oct","Nov","Déc")
  months <- c("Jan","Feb","Mar","Apr","May","Jun",
              "Jul","Aug","Sep","Oct","Nov","Dec")
  jours  <- c("Lun","Mar","Mer","Jeu","Ven","Sam","Dim")
  days   <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
  leJour <- substr(date_courante, 1, 3)
  leMois <- substr(date_courante, 5, 7)
  return(paste(jours[match(leJour,days)], " ",
               mois[match(leMois,months)],
               substr(date_courante, 8, nchar(date_courante)),
               sep=""))
}

So now, it is possible to plot the times when I am online, tweeting,
DATE  = Vectorize(changer_date_anglais)(res$date)
DATE2 = strptime(as.character(DATE),
                 "%a %b %d %H:%M:%S %z %Y")
lt    = as.POSIXlt(DATE2, origin="1970-01-01")
heure = lt$hour + lt$min/60
plot(DATE2, heure)
On this graph, we can see that I am clearly offline almost 6 hours a day (or at least not on Twitter). It is possible to visualize more precisely the periods of the day when I am likely to be on Twitter,
hist(heure, breaks=0:24, col="light green", proba=TRUE)
# the day is periodic: replicate the sample on [-24,0) and [24,48),
# estimate the density over the three periods, then rescale by 3
# to get a proper density back on [0,24)
X = c(heure-24, heure, heure+24)
d = density(X, n=512, from=0, to=24, bw=1)
lines(d$x, d$y*3, lwd=3, col="red")
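It is natural to run the same analysis on other accounts. The calls below use a function extractR(), which is not shown in the post; presumably it just wraps all the scraping steps above for an arbitrary screen name. Here is a minimal sketch of what it might look like (my reconstruction, not the author's code),

extractR <- function(nom, n = 100, maxTweets = 4100){
  # first call: the latest n tweets of the account
  tweets_courants = scan(paste(
    "http://api.twitter.com/1/statuses/user_timeline.json",
    "?include_entities=true&include_rts=true&screen_name=",
    nom, "&count=", n, sep=""),
    what = "character", encoding = "latin1")
  tweets_courants = fromJSON(paste(tweets_courants, collapse=" "),
    method = "C")
  extracTweets = lapply(tweets_courants, obtenir_ligne)
  res = t(sapply(extracTweets, function(x) x[[1]]))
  colnames(res) = c("id","date","text",
                    "nb_followers","nb_amis","utc_offset")
  dernier_id = tweets_courants[[length(tweets_courants)]]$id_str
  compteur = n
  # then page backwards with max_id, as in the loop above
  while(compteur < maxTweets){
    tweets_courants = scan(paste(
      "http://api.twitter.com/1/statuses/user_timeline.json",
      "?include_entities=true&include_rts=true&screen_name=",
      nom, "&count=", n, "&max_id=", dernier_id, sep=""),
      what = "character", encoding = "latin1")
    tweets_courants = fromJSON(paste(tweets_courants, collapse=" "),
      method = "C")
    extracTweets = lapply(tweets_courants[2:length(tweets_courants)],
      obtenir_ligne)
    res = rbind(res, t(sapply(extracTweets, function(x) x[[1]])))
    dernier_id = tweets_courants[[length(tweets_courants)]]$id_str
    compteur = compteur + n
  }
  data.frame(res, stringsAsFactors = FALSE)
}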
For instance, with @adelaigue, who apparently tweets on French time (hence the two hours added to the GMT timestamps below),

res   = extractR("adelaigue")
DATE  = Vectorize(changer_date_anglais)(res$date)
DATE2 = strptime(as.character(DATE),
        "%a %b %d %H:%M:%S %z %Y", tz="GMT") + 2*60*60
Or I can also look at @skythelimit, who usually tweets from Singapore (while I am in Montréal); I can see clearly when our schedules might overlap,
res = extractR("skythelimit")
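To spot the overlaps more precisely, one possibility (not in the original post) is to overlay the two daily densities, using the same wrap-around trick as above; heure_moi and heure_sky are hypothetical names for the two vectors of tweeting times, each built the way heure was earlier,

# overlay the two periodic daily densities (hypothetical inputs)
X_moi = c(heure_moi-24, heure_moi, heure_moi+24)
X_sky = c(heure_sky-24, heure_sky, heure_sky+24)
d_moi = density(X_moi, n=512, from=0, to=24, bw=1)
d_sky = density(X_sky, n=512, from=0, to=24, bw=1)
plot(d_moi$x, d_moi$y*3, type="l", lwd=3, col="red",
     xlab="hour of the day", ylab="density")
lines(d_sky$x, d_sky$y*3, lwd=3, col="blue")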
Nice, isn't it? But it is possible to do much better... for instance, for those who did not specifically ask not to be geolocated, we can see where they tweet during the day, and during the night... I am quite sure a dozen posts could be written with those functions...
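As a pointer in that direction (again, my sketch, not the post's code): in the version 1 timeline JSON, a geolocated tweet carries a geo field whose coordinates component holds latitude and longitude, so the extraction could be extended along these lines,

# pull coordinates out of one parsed tweet, if present
# (geo is NULL unless the user allows geolocation)
obtenir_geo <- function(unTweet){
  if(is.null(unTweet$geo)) return(c(NA, NA))
  unTweet$geo$coordinates   # c(latitude, longitude) in API v1
}
coords = t(sapply(tweets_courants, obtenir_geo))
colnames(coords) = c("lat", "lon")
plot(coords[,"lon"], coords[,"lat"])   # crude map of tweet locations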