Do you still have time to sleep?
[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers].
Last week, @3wen (Ewen) helped me write nice R functions to extract tweets and build datasets containing a lot of information. I had tried a couple of times on my own: once on tweet contents, which was not convincing, and once on the activity on Twitter following an event (e.g. the death of someone famous). I have to admit that I am not a big fan of the databases that can be generated using standard functions to study tweets. For instance, we can only extract tweets, not re-tweets (which are also an important indicator of tweet activity). @3wen suggested using
    require("RJSONIO")

The first step is to extract some information from a tweet and store it in a dataset (details can be found on https://dev.twitter.com/):
    # Returns a list of two elements for one parsed tweet:
    #   [[1]] a character vector (id, date, text, followers, friends, utc offset)
    #   [[2]] a matrix with one row (tweet id, screen name) per mentioned user
    obtenir_ligne <- function(unTweet){
      date_courante=unTweet$created_at
      id_courant=unTweet$id_str
      text=unTweet$text
      nb_followers=unTweet$user$followers_count
      nb_amis=unTweet$user$friends_count
      utc_offset=unTweet$user$utc_offset
      listeMentions=unTweet$entities$user_mentions
      return(c(list(c(id_courant,date_courante,text,
                      nb_followers,nb_amis,utc_offset)),
               list(do.call("rbind",lapply(listeMentions,
                 function(x,id_courant) c(id_courant,
                 x$screen_name),unTweet$id_str)))))
    }

Now that we have the code to extract information from one tweet, let us find several tweets from one user, say my account,
    nom="Freakonometrics"

The (small) problem here is that we have a limitation: we can only get 100 tweets per call of the function.
    n=100
    # the URL is built from pieces so that no space ends up inside it
    tweets_courants=scan(paste(
      "http://api.twitter.com/1/statuses/user_timeline.json?",
      "include_entities=true&include_rts=true&screen_name=",
      nom,"&count=",n,sep=""),what = "character", encoding="latin1")
    tweets_courants=paste(tweets_courants[
      1:length(tweets_courants)],collapse=" ")
    tweets_courants=fromJSON(tweets_courants, method = "C")

Then, we use our function to build a database with 100 lines,
    extracTweets <- lapply(tweets_courants, obtenir_ligne)
    mentions=do.call("rbind",lapply(extracTweets, function(x) x[[2]]))
    colnames(mentions)=c("id","screen_name")
    res=t(sapply(extracTweets,function(x) x[[1]]))
    colnames(res) <- c("id","date","text",
                       "nb_followers","nb_amis","utc_offset")

The idea then is simply to use a loop, based on the latest id observed,
    dernier_id=tweets_courants[[length(tweets_courants)]]$id_str

So, here we go,
    compteurLimite=100
    while(compteurLimite<4100){
      tweets_courants=scan(paste(
        "http://api.twitter.com/1/statuses/user_timeline.json?",
        "include_entities=true&include_rts=true&screen_name=",
        nom,"&count=",n,"&max_id=",dernier_id,sep=""),
        what = "character", encoding="latin1")
      tweets_courants=paste(tweets_courants[
        1:length(tweets_courants)],collapse=" ")
      tweets_courants=fromJSON(tweets_courants, method = "C")
      # skip the first tweet: it is the one with id dernier_id, already stored
      extracTweets <- lapply(tweets_courants[
        2:length(tweets_courants)],obtenir_ligne)
      mentions=rbind(mentions,do.call("rbind",
        lapply(extracTweets,function(x) x[[2]])))
      res=rbind(res,t(sapply(extracTweets,function(x) x[[1]])))
      dernier_id=tweets_courants[[length(
        tweets_courants)]]$id_str
      compteurLimite=compteurLimite+100
    }
    res=data.frame(res,stringsAsFactors=FALSE)
    resFreakonometrics=res

All the information about my own tweets (and re-tweets) is stored in a nice dataset. Actually, we have even more, since we have also extracted the names of the people mentioned in tweets.
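As an aside on the loop above: it drops the first element of each new page because `max_id` is an inclusive bound, so the tweet whose id is passed comes back again. The sketch below illustrates that pattern on a small in-memory timeline. The `timeline` data and the `fetch_page` helper are hypothetical stand-ins for the API call, and the ids are small integers chosen for readability.

```r
# Hypothetical stand-in for the API: 10 tweets, newest first.
timeline <- data.frame(id = 110:101, text = paste("tweet", 110:101),
                       stringsAsFactors = FALSE)

# Returns at most `count` tweets with id <= max_id (inclusive bound).
fetch_page <- function(count, max_id = NULL) {
  page <- timeline
  if (!is.null(max_id)) page <- page[page$id <= max_id, ]
  head(page, count)
}

# First call, then keep asking for tweets up to the last id seen.
page <- fetch_page(count = 4)
res <- page
while (TRUE) {
  last_id <- page$id[nrow(page)]
  page <- fetch_page(count = 4, max_id = last_id)
  page <- page[-1, ]            # first row repeats the tweet with id == last_id
  if (nrow(page) == 0) break    # nothing older is left: stop
  res <- rbind(res, page)
}
```

Unlike the fixed counter used in the post, this sketch stops as soon as a page brings nothing new, which avoids issuing calls once the timeline is exhausted.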
    mentionsFreakonometrics=data.frame(mentions)

We can look at the people I mention in my tweets:
    # count the mentions per screen name, most mentioned first
    gazouillis=sapply(split(mentionsFreakonometrics,
      mentionsFreakonometrics$screen_name),nrow)
    gazouillis=gazouillis[order(gazouillis,decreasing=TRUE)]
    plot(gazouillis)
    plot(gazouillis,log="xy")

    > gazouillis[1:20]
            tomroud freakonometrics       adelaigue       dmonniaux
                155              84              77              56
        J_P_Boucher         embruns      SkyZeLimit        coulmont
                 42              39              35              31
         Fabrice_BM            3wen          obouba          msotod
                 31              30              29              27
             StatFr     nholzschuch        renaudjf        squintar
                 26              25              23              23
            Vicnent        pareto35        romainqc        valatini
                 23              22              22              22

If we plot those frequencies, we can clearly observe a standard Pareto distribution.
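A rough way to check the Pareto claim is to regress the log of the counts on the log of their rank: under a power law the points should fall close to a straight line. The sketch below uses only the 20 counts printed above, so it is an illustration rather than a proper fit. The variable names are mine.

```r
# Mention counts copied from the gazouillis[1:20] output above.
counts <- c(155, 84, 77, 56, 42, 39, 35, 31, 31, 30,
            29, 27, 26, 25, 23, 23, 23, 22, 22, 22)
rank <- seq_along(counts)

# Under a power law, log(count) is roughly linear in log(rank).
fit <- lm(log(counts) ~ log(rank))
slope <- unname(coef(fit)[2])

# Same picture as plot(gazouillis, log = "xy"), with the fitted line added.
plot(log(rank), log(counts), xlab = "log(rank)", ylab = "log(mentions)")
abline(fit, col = "red")
```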
    # Twitter's created_at uses English day and month abbreviations; this
    # helper swaps them for French ones, presumably so that strptime() can
    # parse %a and %b under a French locale.
    changer_date_anglais <- function(date_courante){
      mois <- c("Jan","Fév", "Mar", "Avr", "Mai", "Jui", "Jul",
                "Aoû", "Sep", "Oct", "Nov", "Déc")
      months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul",
                  "Aug", "Sep", "Oct", "Nov", "Dec")
      jours <- c("Lun","Mar","Mer","Jeu","Ven","Sam","Dim")
      days <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
      leJour <- substr(date_courante,1,3)
      leMois <- substr(date_courante,5,7)
      return(paste(jours[match(leJour,days)]," ",
        mois[match(leMois,months)],substr(
        date_courante,8,nchar(date_courante)),sep=""))
    }

So now, it is possible to plot the times when I am online, tweeting,
    DATE=Vectorize(changer_date_anglais)(res$date)
    DATE2=strptime(as.character(DATE),
      "%a %b %d %H:%M:%S %z %Y")
    lt= as.POSIXlt(DATE2, origin="1970-01-01")
    # hour of the day as a decimal number
    heure=lt$hour+lt$min/60
    plot(DATE2,heure)
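If the only obstacle is a non-English time locale, an alternative to translating the abbreviations with `changer_date_anglais` is to switch `LC_TIME` to `"C"` while parsing. This is a minimal sketch under that assumption. The `created_at` string is a made-up example in Twitter's date format.

```r
# A created_at value in the format Twitter returns (made-up example).
created_at <- "Mon Sep 24 03:35:21 +0000 2012"

# Temporarily switch the time locale to "C" so %a and %b expect English names.
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
parsed <- strptime(created_at, "%a %b %d %H:%M:%S %z %Y", tz = "GMT")
Sys.setlocale("LC_TIME", old_locale)  # restore the previous locale

# Hour of the day as a decimal number, as in the post.
heure <- parsed$hour + parsed$min / 60
```

Passing `tz = "GMT"` keeps the parsed hour in GMT instead of converting it to the local time zone of the machine running the code.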
On this graph, we can see that I am clearly offline for almost 6 hours a day (or at least not on Twitter). It is possible to visualize more precisely the period of the day when I might be on Twitter,
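    hist(heure,breaks=0:24,col="light green",probability=TRUE)
    # replicate the sample one day before and after, so that the
    # density estimate wraps around midnight
    X=c(heure-24,heure,heure+24)
    d=density(X,n = 512, from=0, to=24,bw=1)
    # X holds three copies of the data, hence the factor 3
    lines(d$x,d$y*3,lwd=3,col="red")

The same information can also be illustrated with some kind of heat plot.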
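The `X=c(heure-24,heure,heure+24)` trick treats the hour of the day as circular: a tweet at 23:50 should contribute to the density just after midnight. The sketch below compares it with a naive estimate on simulated hours clustered around midnight. The simulated data and the `trapezoid` helper are illustrative assumptions, not part of the original analysis.

```r
set.seed(1)
# Simulated tweeting hours clustered around midnight (illustrative data only).
heure <- rnorm(500, mean = 0, sd = 2) %% 24

# Naive estimate on [0, 24]: kernel mass leaks out near 0 and 24.
d_naive <- density(heure, n = 512, from = 0, to = 24, bw = 1)

# Wrapped estimate: replicate the sample one day before and one day after.
X <- c(heure - 24, heure, heure + 24)
d_wrap <- density(X, n = 512, from = 0, to = 24, bw = 1)

# Trapezoidal approximation of the integral of y over the grid x.
trapezoid <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

area_naive <- trapezoid(d_naive$x, d_naive$y)
area_wrap  <- trapezoid(d_wrap$x, 3 * d_wrap$y)  # factor 3: X is three copies
```

With the wrapped sample, the rescaled curve integrates to about 1 over the day, while the naive estimate loses the mass that falls outside [0, 24].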
Note that we did it for my Twitter account, but we can also run the code on (almost) anyone on Twitter. Consider e.g. @adelaigue. Since Alexandre is tweeting from France, we have to play with time zones,
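    # extractR() is not defined in this post; it is assumed to wrap the
    # extraction steps above for a given screen name
    res=extractR("adelaigue")
    DATE=Vectorize(changer_date_anglais)(res$date)
    # parse in GMT, then shift by two hours to get French time
    DATE2=strptime(as.character(DATE),
      "%a %b %d %H:%M:%S %z %Y",tz = "GMT")+2*60*60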
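Adding `2*60*60` is only right while France is two hours ahead of GMT, that is during summer time. A hedged alternative is to convert the GMT times to a named time zone and let R handle daylight saving. The sketch assumes the time zone database on the machine knows `"Europe/Paris"`, and the timestamps are made-up examples.

```r
# A tweet time in UTC during summer time (made-up example).
utc_time <- as.POSIXct("2012-09-24 03:35:21", tz = "UTC")

# Fixed offset, as in the post: correct only while France is on UTC+2.
fixed <- utc_time + 2 * 60 * 60

# Named time zone: the same instant expressed on the Paris clock.
paris <- as.POSIXlt(utc_time, tz = "Europe/Paris")
heure_paris <- paris$hour + paris$min / 60

# In winter the offset is one hour, which the fixed shift would miss.
winter <- as.POSIXlt(as.POSIXct("2012-12-24 03:35:21", tz = "UTC"),
                     tz = "Europe/Paris")
heure_winter <- winter$hour + winter$min / 60
```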
I can also look at @skythelimit, who is usually tweeting from Singapore (I am in Montréal). I can clearly see when we might have overlaps,
res=extractR("skythelimit")
Nice, isn't it? But it is possible to do much better. For instance, for those who have not specifically asked not to be geo-located, we can see where they tweet from during the day and during the night. I am quite sure a dozen posts could be written with those functions.