An automatic code to extract tweets (and to produce the “Somewhere else” review)
A few weeks ago, I asked in a post the (simple) question “dear reader, who are you?”, just to know more about the readers of my blog. I found the answers extremely interesting (even if, to be honest, I was expecting more of them, to start a more serious sociological study of the readers of my blog). One interesting point was that a lot of readers come to read the “somewhere else” posts, which are a review of interesting posts and articles found on the internet. The links I share there actually come from my tweets. I keep a backup of my tweets on my blog, and usually, that is where I go when I want to find some article, or some graph, or some map I have in mind, that I have seen somewhere (but usually cannot remember where). But most of the time, writing those posts feels boring, because there is nothing new: it is simply a copy and paste of my tweets.
And this afternoon, @tomroud asked how those posts were written: was there an automatic procedure, or was I doing it manually? Until tonight, I was doing it manually. But since it sounded like a fun challenge, I tried to write a code that generates a simple list of my tweets, which I can then use to produce a post.
Nevertheless, there are still two problems I could not fix with code:
- in my “somewhere else” posts, there is a language distinction, with posts and articles in English first, and then those in French. Unfortunately, I could not find a function that detects the language of a tweet. I remember that we tried to write such a code with @3wen, but I could not find it… I believe @3wen had a first draft, so if we can find it, I will upload it on my blog (or he will upload it on his); in the meantime, see the sketch right after this list.
- in my posts, I include the picture, if there is one. This part will still be done manually, because it is much more difficult (but I guess it is possible…)
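That being said, for a rough first pass, an off-the-shelf language-identification package such as textcat might do the job (this is just a sketch, not the draft written with @3wen, and short tweets will often be misclassified),

# rough sketch of language detection, with the n-gram based 'textcat' package
library(textcat)
tweets <- c("an interesting paper on extreme value theory",
            "un billet passionnant sur les valeurs extrêmes")
(lang <- textcat(tweets))
# should return something like "english" "french" (only approximate on short texts)
# put the tweets in English first, then the others
tweets[order(lang != "english")]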
Now, before starting, we will need two functions from an old post, to convert Twitter’s shortened urls into real ones,
library(RCurl)

# extract the substring of 'entree' captured by the first group of the regex 'motif'
extraire <- function(entree, motif){
  res <- regexec(motif, entree)
  if(length(res[[1]]) == 2){
    debut <- (res[[1]])[2]
    fin <- debut + (attr(res[[1]], "match.length"))[2] - 1
    return(substr(entree, debut, fin))
  } else return(NA)
}

# resolve a shortened url by reading the 'location' field of the http header
unshorten <- function(url){
  uri <- getURL(url, header = TRUE, nobody = TRUE, followlocation = FALSE,
                cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
  res <- try(extraire(uri, "\r\nlocation: (.*?)\r\nserver"))
  return(res)
}
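As an aside, instead of parsing the header manually, one could let the httr package follow the redirections and read the final url (a sketch, not the function used below),

# alternative sketch: let curl follow the redirects, then read the final url
library(httr)
unshorten_httr <- function(url){
  res <- try(HEAD(url), silent = TRUE)
  if(inherits(res, "try-error")) return(NA)
  res$url  # the url after all redirections have been followed
}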
Now, let us consider the following code. The first step, of course, is to run some lines that will allow me to use Twitter's API,
require(twitteR)
require(ROAuth)   # OAuthFactory comes from the ROAuth package
reqURL    <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL   <- "https://api.twitter.com/oauth/authorize"
apiKey    <- "yourAPIkey"
apiSecret <- "yourAPIsecret"
twitCred <- OAuthFactory$new(consumerKey = apiKey, consumerSecret = apiSecret,
                             requestURL = reqURL, accessURL = accessURL, authURL = authURL)
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
registerTwitterOAuth(twitCred)
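Note that with more recent versions of the twitteR package, this whole handshake can be replaced by a single call (again, a sketch; the four strings are placeholders, obtained when creating an app on Twitter’s developer site),

# one-call authentication, with twitteR (>= 1.1.9); credentials are placeholders
library(twitteR)
setup_twitter_oauth(consumer_key    = "yourAPIkey",
                    consumer_secret = "yourAPIsecret",
                    access_token    = "yourAccessToken",
                    access_secret   = "yourAccessSecret")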
Then, I need to be cautious because some of my tweets are in French, and some weird symbols might appear otherwise,
Sys.setlocale("LC_CTYPE","fr_FR.UTF-8")
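If that locale is not installed (on Windows, for instance), a possible fallback (just a sketch) is to convert the text to UTF-8 afterwards,

# fallback sketch: convert text from the native encoding to utf-8,
# replacing bytes that cannot be converted by their hex codes
clean_text <- function(x) iconv(x, from = "", to = "UTF-8", sub = "byte")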
Now I can write my function,
somewhere_else <- function(){
  # grab my latest tweets (the API caps how many it actually returns)
  tweets_freak <- searchTwitter("from:@freakonometrics", n = 500)
  save(tweets_freak, file = "somewhere_else.RData")
  tweets_freak_df <- do.call("rbind", lapply(tweets_freak, as.data.frame))
  text_tweets_freak <- tweets_freak_df$text
  # drop replies, i.e. tweets starting with @
  tweets_freak_message <- text_tweets_freak[which(substr(text_tweets_freak, 1, 1) != "@")]
  # keep only tweets posted since the previous "Somewhere else" announcement
  SE <- which(substr(tweets_freak_message, 1, 15) == "\"Somewhere else")
  first_SE <- SE[1]
  tweets_freak <- tweets_freak_message[1:(first_SE - 1)]
  substitute_id <- function(x){
    # replace each @mention with a link to the corresponding twitter page
    split_x <- strsplit(x, "@")[[1]]
    x_id <- paste(split_x, collapse = "http://twitter.com/", sep = "")
    split_x_id <- strsplit(x_id, "http")
    n <- length(split_x_id[[1]])
    tweet_x <- strsplit(split_x_id[[1]], " ")
    if(n == 1) rt <- x_id
    if(n > 1){
      for(i in 2:n){
        url <- tweet_x[[i]][1]
        # detect a punctuation mark glued to the end of the url
        split <- FALSE
        if(substr(url, nchar(url), nchar(url)) %in% c(":", ",", ";", ")", "(")) split <- TRUE
        if(split == FALSE) unshort_url <- unshorten(paste("http", url, sep = ""))
        if(split == TRUE)  unshort_url <- unshorten(paste("http", substr(url, 1, nchar(url) - 1), sep = ""))
        # is it a link to a twitter account, or to an external page?
        tweet <- FALSE
        if(substr(url, 4, 10) == "twitter") tweet <- TRUE
        if((split == FALSE) & (tweet == FALSE)) tweet_x_2 <- c("<a href=\"", unshort_url, "\">", unshort_url, "</a>")
        if((split == TRUE)  & (tweet == FALSE)) tweet_x_2 <- c("<a href=\"", unshort_url, "\">", unshort_url, "</a>", substr(url, nchar(url), nchar(url)))
        if((split == FALSE) & (tweet == TRUE))  tweet_x_2 <- c("<a href=\"", unshort_url, "\">@", substr(unshort_url, 21, nchar(unshort_url)), "</a>")
        if((split == TRUE)  & (tweet == TRUE))  tweet_x_2 <- c("<a href=\"", unshort_url, "\">@", substr(unshort_url, 21, nchar(unshort_url)), "</a>", substr(url, nchar(url), nchar(url)))
        tweet_x[[i]] <- c(tweet_x_2, tweet_x[[i]][-1])
      }
      rt <- paste("<li>", paste(unlist(tweet_x), collapse = " "), "</li>", sep = "")
    }
    return(rt)
  }
  tweets_freak_sub <- lapply(tweets_freak, substitute_id)
  write.table(unlist(tweets_freak_sub), file = "tweets_somewhere_else.txt",
              quote = FALSE, row.names = FALSE)
  cat("Number of tweets.....", length(tweets_freak_sub), "\n")
  cat("File.................", paste(getwd(), "tweets_somewhere_else.txt", sep = "/"), "\n")
  cat("Done\n")
}
The first tricky part was to recognize the names mentioned in my tweets (since some of them are retweets). The second one was to create an html link each time a url appears in a tweet (I did not take hashtags into account here, but the same trick would work; see the sketch below).
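For instance, hashtags could be turned into links to Twitter’s hashtag pages with a simple regular expression (just a sketch, not part of the function above),

# sketch: replace #hashtags by html links to twitter's hashtag pages
linkify_hashtags <- function(x){
  gsub("#(\\w+)", '<a href="https://twitter.com/hashtag/\\1">#\\1</a>', x)
}
linkify_hashtags("a paper presented at #assa2015")

Anyway, if I run my function, I get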
> somewhere_else()
Number of tweets..... 72
File................. /home/arthur/tweets_somewhere_else.txt
Done
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit,  :
  500 tweets were requested but the API can only return 191
If I make a copy and paste from the text file, I have
- Stupid People Have No Idea How Stupid They Are (a.k.a. the Dunning-Kruger Effect) http://www.openculture.com/2014/12/john-cleese-on-stupidity-and-a-cornell-study.html http http://youtu.be/wvVPdyYeaQU
- "On the Measurement of Economic Tail Risk" https://www.aeaweb.org/aea/2015conference/program/retrieve.php?pdfid=288
- "Environmental Protection, Rare Disasters, and Discount Rates" https://www.aeaweb.org/aea/2015conference/program/retrieve.php?pdfid=200 (#assa2015, the most interesting hashtag those days)
- "20 Years in the Professor Game: things I did right and things I did wrong" http://lymanmuseum.wordpress.com/2015/01/01/20-years-in-the-professor-game-things-i-did-right-and-things-i-did-wrong/ by @ta_wheeler
- "You May Believe You Are a Bayesian But You Are Probably Wrong" http://www.rmm-journal.de/downloads/Article_Senn.pdf by @stephensenn
- "An investigation of the false discovery rate and the misinterpretation of p-values" http://rsos.royalsocietypublishing.org/content/1/3/140216 by @david_colquhoun
- "Statisticians: When We Teach, We Dont Practice What We Preach " http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics2.pdf
- "Why continue to teach and use hypothesis testing?" http://andrewgelman.com/2015/01/03/continue-teach-use-hypothesis-testing/
which makes sense, because those are indeed my most recent tweets,
“Stupid People Have No Idea How Stupid They Are” (a.k.a. the Dunning-Kruger Effect) http://t.co/ugdGbuoD2D http://t.co/pHnyWkU20I
— Arthur Charpentier (@freakonometrics) January 3, 2015
etc. I will still have to spend some time including pictures, graphs, maps, videos, etc., but that function should already save me some time!