Scraping Twitter data to visualize trending tweets in Kuala Lumpur
(Disclaimer: I’ve no grudge against the Python programming language per se. I think it’s equally great. In the following post, I’m merely recounting my experience.)
It’s been quite a while since I last posted. The reasons are numerous, the most notable being my inability to decide which programming language to choose for web data scraping. The contenders were the data analytics maestro, R, and the data scraping guru, Python. So, I decided to give myself some time to figure out which language would be best for my use case: given some search keywords, scrape Twitter for related posts and visualize the result. First, I needed the live data. Again, I was at a crossroads, “R or Python”. Apparently Python has some great packages for Twitter data streaming, like twython, python-twitter and tweepy. The equivalent R libraries are twitteR and rtweet. I chose the rtweet package for data collection over Python for the following reasons:
- I do not have to create a credential file (unlike in Python) to log in to my Twitter account. You do, however, need to authenticate the Twitter account when using the rtweet package; this authentication is done just once, and your Twitter credentials are stored locally afterwards.
- Coding and code readability are far easier as compared to Python.
- The rtweet package allows multiple hashtags to be searched for.
- To localize the data, the package also allows for specifying geographic coordinates (see the sketch after this list).
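To make those last two points concrete, here is a minimal sketch of what the one-time authentication and a localized search can look like with rtweet. The app name, API keys, hashtag and the Kuala Lumpur geocode string are placeholder assumptions, not values from this post:

library(rtweet)

# One-time authentication: rtweet caches the resulting token locally,
# so this does not need to be repeated in later sessions
token <- create_token(
  app             = "my_kl_tweets_app",     # placeholder app name
  consumer_key    = "YOUR_CONSUMER_KEY",    # placeholder credentials
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)

# Localized search: restrict results to roughly a 30 mile radius around
# Kuala Lumpur by passing a "latitude,longitude,radius" geocode string
kl_sample <- search_tweets(
  "#KTM",                               # illustrative hashtag
  n           = 100,
  geocode     = "3.1390,101.6869,30mi", # approximate coordinates of KL
  include_rts = FALSE
)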
So, using the following code snippet, I was able to scrape the data. The code has the following parts:
- A custom search-for-tweets function which accepts the search string. If the search string is NULL, it throws a message and stops; otherwise it searches for the hashtags specified in the search string and returns a data frame as output.

# required libraries
library(rtweet)       # for search_tweets()
library(tidytext)
library(tidyverse)
library(stringr)
library(stopwords)

# Create a function that will accept multiple hashtags and will search
# the twitter API for related tweets
search_tweets_queries <- function(x, n = 100, ...) {
  ## check inputs
  stopifnot(is.atomic(x), is.numeric(n))
  if (length(x) == 0L) {
    stop("No query found", call. = FALSE)
  }
  ## search for each string in column of queries
  rt <- lapply(x, search_tweets, n = n, ...)
  ## add query variable to data frames
  rt <- Map(cbind, rt, query = x, stringsAsFactors = FALSE)
  ## merge users data into one data frame
  rt_users <- do.call("rbind", lapply(rt, users_data))
  ## merge tweets data into one data frame
  rt <- do.call("rbind", rt)
  ## set users attribute
  attr(rt, "users") <- rt_users
  ## return tibble (validate = FALSE makes it a bit faster)
  tibble::as_tibble(rt, validate = FALSE)
}
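Assuming the authentication sketched earlier is in place, the function can be exercised along these lines; the hashtags and counts are illustrative only, not the exact calls used later in this post:

# An empty or NULL search string trips the input checks and stops with an error
# search_tweets_queries(NULL)

# Each returned row carries a 'query' column identifying the search term that
# produced it, and the matching users data sits in the "users" attribute
rt <- search_tweets_queries(c("#KTM", "#MRT"), n = 50, include_rts = FALSE)
head(rt$query)
rt_users <- attr(rt, "users")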
- A data frame containing the search terms. Note that my search hashtags here are KTM, MRT and monorail.
# create data frame with query column
df_query <- data.frame(
  query = c("KTM", "monorail", "MRT"),
  n = c(100, 100, 100),  # tweets to collect per search word; keep one entry for each of the 3 keywords in 'query'
  stringsAsFactors = FALSE
)
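Putting the two parts together, the search function above can then be run over the query column. A minimal sketch, assuming an approximate 30 mile radius around Kuala Lumpur (the geocode string and tweet count are placeholders, and note that search_tweets_queries() applies a single n value to every keyword):

# run one search per keyword in df_query, localized to Kuala Lumpur
kl_tweets <- search_tweets_queries(
  df_query$query,
  n           = 100,                    # tweets requested per keyword
  include_rts = FALSE,                  # drop retweets to reduce duplicates
  geocode     = "3.1390,101.6869,30mi"  # assumed radius around Kuala Lumpur
)
table(kl_tweets$query)                  # tweets retrieved per search term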