require("RJSONIO")

The first step is to extract some information from a tweet, and store it in a dataset (details can be found on https://dev.twitter.com/),
obtenir_ligne <- function(unTweet){
  date_courante = unTweet$created_at
  id_courant    = unTweet$id_str
  text          = unTweet$text
  nb_followers  = unTweet$user$followers_count
  nb_amis       = unTweet$user$friends_count
  utc_offset    = unTweet$user$utc_offset
  listeMentions = unTweet$entities$user_mentions
  return(c(list(c(id_courant, date_courante, text,
                  nb_followers, nb_amis, utc_offset)),
           list(do.call("rbind", lapply(listeMentions,
                function(x, id_courant) c(id_courant, x$screen_name),
                unTweet$id_str)))))
}

Now that we have the code to extract information from one tweet, let us find several tweets from one user, say my account,
nom = "Freakonometrics"

The (small) problem here is that there is a limitation: we can only get 100 tweets per call of the function,
n = 100
tweets_courants = scan(paste(
  "http://api.twitter.com/1/statuses/user_timeline.json",
  "?include_entities=true&include_rts=true&screen_name=",
  nom, "&count=", n, sep=""),
  what = "character", encoding = "latin1")
tweets_courants = paste(tweets_courants[
  1:length(tweets_courants)], collapse=" ")
tweets_courants = fromJSON(tweets_courants, method = "C")

Then, we use our function to build a database with 100 lines,
extracTweets <- lapply(tweets_courants, obtenir_ligne)
mentions = do.call("rbind", lapply(extracTweets, function(x) x[[2]]))
colnames(mentions) = c("id", "screen_name")
res = t(sapply(extracTweets, function(x) x[[1]]))
colnames(res) <- c("id", "date", "text",
                   "nb_followers", "nb_amis", "utc_offset")
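As a quick sanity check (this snippet is not in the original post), res should at this stage be a matrix with 100 rows, one per tweet, and the six columns just named,

dim(res)   # should be 100 x 6
head(res[, c("date", "nb_followers")])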
The idea then is simply to use a loop, based on the latest id observed,

dernier_id = tweets_courants[[length(tweets_courants)]]$id_str

So, here we go,
compteurLimite = 100
while(compteurLimite < 4100){
  tweets_courants = scan(paste(
    "http://api.twitter.com/1/statuses/user_timeline.json",
    "?include_entities=true&include_rts=true&screen_name=",
    nom, "&count=", n, "&max_id=", dernier_id, sep=""),
    what = "character", encoding = "latin1")
  tweets_courants = paste(tweets_courants[
    1:length(tweets_courants)], collapse=" ")
  tweets_courants = fromJSON(tweets_courants, method = "C")
  # max_id is inclusive: the first tweet returned is the one
  # we already stored, so we skip it
  extracTweets <- lapply(tweets_courants[
    2:length(tweets_courants)], obtenir_ligne)
  mentions = rbind(mentions, do.call("rbind",
    lapply(extracTweets, function(x) x[[2]])))
  res = rbind(res, t(sapply(extracTweets, function(x) x[[1]])))
  dernier_id = tweets_courants[[length(tweets_courants)]]$id_str
  compteurLimite = compteurLimite + 100
}
resFreakonometrics = res = data.frame(res, stringsAsFactors = FALSE)

All the information about my own tweets (and retweets) is now stored in a nice dataset. Actually, we have even more, since we have also extracted the names of the people mentioned in those tweets,
mentionsFreakonometrics = data.frame(mentions)

We can look at the people I mention in my tweets,
gazouillis = sapply(split(mentionsFreakonometrics,
  mentions$screen_name), nrow)
gazouillis = gazouillis[order(gazouillis, decreasing=TRUE)]
plot(gazouillis)
plot(gazouillis, log="xy")

> gazouillis[1:20]
        tomroud freakonometrics       adelaigue       dmonniaux
            155              84              77              56
    J_P_Boucher         embruns      SkyZeLimit        coulmont
             42              39              35              31
     Fabrice_BM            3wen          obouba          msotod
             31              30              29              27
         StatFr     nholzschuch        renaudjf        squintar
             26              25              23              23
        Vicnent        pareto35        romainqc        valatini
             23              22              22              22

If we plot those frequencies, we can clearly observe a standard Pareto distribution.
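To make that claim a bit more concrete, a standard check (this snippet is not in the original post) is to regress the log-frequencies on the log-ranks: an approximately linear fit, with slope minus the tail index, is consistent with a power law,

rang = 1:length(gazouillis)
fit  = lm(log(gazouillis) ~ log(rang))
summary(fit)$coefficients   # the slope estimates -alpha
plot(log(rang), log(gazouillis))
abline(fit, col="red", lwd=2)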
Twitter returns dates with English day and month abbreviations; since my R session runs in French, they first have to be translated so that strptime can parse them,

changer_date_anglais <- function(date_courante){
  mois   <- c("Jan","Fév","Mar","Avr","Mai","Jui",
              "Jul","Aoû","Sep","Oct","Nov","Déc")
  months <- c("Jan","Feb","Mar","Apr","May","Jun",
              "Jul","Aug","Sep","Oct","Nov","Dec")
  jours  <- c("Lun","Mar","Mer","Jeu","Ven","Sam","Dim")
  days   <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
  leJour <- substr(date_courante, 1, 3)
  leMois <- substr(date_courante, 5, 7)
  return(paste(jours[match(leJour,days)], " ",
               mois[match(leMois,months)],
               substr(date_courante, 8, nchar(date_courante)),
               sep=""))
}

So now, it is possible to plot the times when I am online, tweeting,
DATE  = Vectorize(changer_date_anglais)(res$date)
DATE2 = strptime(as.character(DATE),
                 "%a %b %d %H:%M:%S %z %Y")
lt    = as.POSIXlt(DATE2, origin="1970-01-01")
heure = lt$hour + lt$min/60
plot(DATE2, heure)
On this graph, we can see that I am clearly offline almost 6 hours a day (or at least not on Twitter). It is possible to visualize more precisely the periods of the day when I am likely to be on Twitter,
hist(heure, breaks=0:24, col="light green", proba=TRUE)
# the day is periodic: replicate the sample on [-24,0) and [24,48),
# estimate the density over the three periods, then rescale by 3
# to get a proper density back on [0,24)
X = c(heure-24, heure, heure+24)
d = density(X, n=512, from=0, to=24, bw=1)
lines(d$x, d$y*3, lwd=3, col="red")
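It is natural to run the same analysis on other accounts. The calls below use a function extractR(), which is not shown in the post; presumably it just wraps all the scraping steps above for an arbitrary screen name. Here is a minimal sketch of what it might look like (my reconstruction, not the author's code),

extractR <- function(nom, n = 100, maxTweets = 4100){
  # first call: the latest n tweets of the account
  tweets_courants = scan(paste(
    "http://api.twitter.com/1/statuses/user_timeline.json",
    "?include_entities=true&include_rts=true&screen_name=",
    nom, "&count=", n, sep=""),
    what = "character", encoding = "latin1")
  tweets_courants = fromJSON(paste(tweets_courants, collapse=" "),
    method = "C")
  extracTweets = lapply(tweets_courants, obtenir_ligne)
  res = t(sapply(extracTweets, function(x) x[[1]]))
  colnames(res) = c("id","date","text",
                    "nb_followers","nb_amis","utc_offset")
  dernier_id = tweets_courants[[length(tweets_courants)]]$id_str
  compteur = n
  # then page backwards with max_id, as in the loop above
  while(compteur < maxTweets){
    tweets_courants = scan(paste(
      "http://api.twitter.com/1/statuses/user_timeline.json",
      "?include_entities=true&include_rts=true&screen_name=",
      nom, "&count=", n, "&max_id=", dernier_id, sep=""),
      what = "character", encoding = "latin1")
    tweets_courants = fromJSON(paste(tweets_courants, collapse=" "),
      method = "C")
    extracTweets = lapply(tweets_courants[2:length(tweets_courants)],
      obtenir_ligne)
    res = rbind(res, t(sapply(extracTweets, function(x) x[[1]])))
    dernier_id = tweets_courants[[length(tweets_courants)]]$id_str
    compteur = compteur + n
  }
  data.frame(res, stringsAsFactors = FALSE)
}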
For instance, with @adelaigue, who apparently tweets on French time (hence the two hours added to the GMT timestamps below),

res   = extractR("adelaigue")
DATE  = Vectorize(changer_date_anglais)(res$date)
DATE2 = strptime(as.character(DATE),
        "%a %b %d %H:%M:%S %z %Y", tz="GMT") + 2*60*60
Or I can also look at @skythelimit, who usually tweets from Singapore (while I am in Montréal); I can see clearly when our schedules might overlap,
res = extractR("skythelimit")
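To spot the overlaps more precisely, one possibility (not in the original post) is to overlay the two daily densities, using the same wrap-around trick as above; heure_moi and heure_sky are hypothetical names for the two vectors of tweeting times, each built the way heure was earlier,

# overlay the two periodic daily densities (hypothetical inputs)
X_moi = c(heure_moi-24, heure_moi, heure_moi+24)
X_sky = c(heure_sky-24, heure_sky, heure_sky+24)
d_moi = density(X_moi, n=512, from=0, to=24, bw=1)
d_sky = density(X_sky, n=512, from=0, to=24, bw=1)
plot(d_moi$x, d_moi$y*3, type="l", lwd=3, col="red",
     xlab="hour of the day", ylab="density")
lines(d_sky$x, d_sky$y*3, lwd=3, col="blue")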
Nice, isn't it? But it is possible to do much better... for instance, for those who did not specifically ask not to be geolocated, we can see where they tweet during the day, and during the night... I am quite sure a dozen posts could be written with those functions...
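As a pointer in that direction (again, my sketch, not the post's code): in the version 1 timeline JSON, a geolocated tweet carries a geo field whose coordinates component holds latitude and longitude, so the extraction could be extended along these lines,

# pull coordinates out of one parsed tweet, if present
# (geo is NULL unless the user allows geolocation)
obtenir_geo <- function(unTweet){
  if(is.null(unTweet$geo)) return(c(NA, NA))
  unTweet$geo$coordinates   # c(latitude, longitude) in API v1
}
coords = t(sapply(tweets_courants, obtenir_geo))
colnames(coords) = c("lat", "lon")
plot(coords[,"lon"], coords[,"lat"])   # crude map of tweet locations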