
Using twitteR to see what the German government spokesman tweets about

[This article was first published on fibosworld » R, and kindly contributed to R-bloggers.]

Find the HTML slides here, and the .Rmd file that was used to generate them here.

For how to work with .Rmd files, see here.

What this is about

These are my first steps with the interface from R to Twitter, using the twitteR package.

We will load the latest 1500 tweets (the maximum the API allows) from the user @RegSprecher, the spokesman of the German government, and run some analyses:

Load data

## text
## 1 @eigensinn83 War jedenfalls eine anregende Sonntagmorgenslektüre; ob es aber für den unmittelbaren politischen Durchbruch reicht ?

The latest 6 tweets of @RegSprecher

Which device does he use?
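
The status source that Twitter returns is an HTML anchor rather than a plain client name, so it has to be stripped before it can be tabulated. A minimal sketch of that parsing in base R follows; the source strings are made up for illustration and are not taken from the actual data.

```r
# Hypothetical examples of the raw "source" field; real values may differ.
raw_sources <- c(
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  'web'
)

# Drop the closing tag, then split on ">" so the client name is the second piece.
no_close <- gsub("</a>", "", raw_sources, fixed = TRUE)
pieces <- strsplit(no_close, ">", fixed = TRUE)
clients <- sapply(pieces, function(x) if (length(x) > 1) x[2] else x[1])

# Count tweets per client; this table is what a pie or bar chart would show.
client_counts <- table(clients)
print(client_counts)
```

Sources without an anchor tag (such as the plain "web" entry here) are passed through unchanged.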

Analysis of frequency
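
Tweets per day can also be counted without a plot. The following is a small sketch in base R, using invented timestamps in place of the `created` column and assuming they are in UTC; days without tweets are kept as zeros.

```r
# Hypothetical timestamps standing in for the `created` column (UTC assumed).
created <- as.POSIXct(
  c("2012-05-06 08:15:00", "2012-05-06 21:45:00", "2012-05-08 17:30:00"),
  tz = "UTC"
)

# Reduce each timestamp to its calendar day.
tweet_day <- as.Date(created, tz = "UTC")

# Build the full range of days so that days without tweets count as zero.
all_days <- seq(min(tweet_day), max(tweet_day), by = "day")
daily_counts <- table(factor(as.character(tweet_day), levels = as.character(all_days)))
print(daily_counts)
```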

At what time of day is he tweeting?
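
The hour of day can be pulled out of a timestamp and tabulated in base R as well. This is only a sketch with invented timestamps; it assumes the times are in UTC, so German local time would be one or two hours later depending on daylight saving time.

```r
# Hypothetical timestamps standing in for the `created` column (UTC assumed).
created <- as.POSIXct(
  c("2012-05-06 08:15:00", "2012-05-06 08:45:00", "2012-05-07 17:30:00"),
  tz = "UTC"
)

# Extract the hour of day (0-23) as an integer.
tweet_hour <- as.integer(format(created, "%H"))

# Tabulate over all 24 hours so that hours without tweets show up as zero.
hour_counts <- table(factor(tweet_hour, levels = 0:23))
print(hour_counts[hour_counts > 0])
```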

What is he tweeting about?

##  [1] "bpa"             "bundesregierung" "deu"
##  [4] "fragreg"         "für"             "kanzlerin"
##  [7] "mehr"            "merkel"          "neue"
## [10] "über"            "uhr"

##    merkel kanzlerin
##      1.00      0.73
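
The association scores above can be read as correlations between terms across tweets. Here is a rough sketch of that idea on a toy term-document matrix, assuming plain Pearson correlation between term rows; the counts are invented, and the exact computation inside tm's `findAssocs` may differ in detail.

```r
# Toy term-document matrix: rows are terms, columns are documents (tweets).
# The counts are invented for illustration.
tdm <- rbind(
  merkel    = c(1, 1, 0, 1, 0, 0),
  kanzlerin = c(1, 1, 0, 0, 0, 0),
  uhr       = c(0, 0, 1, 0, 1, 1)
)

# Correlate the "merkel" row with every term's row across documents.
assoc <- apply(tdm, 1, function(term_row) cor(tdm["merkel", ], term_row))

# Keep terms at or above a chosen lower limit, highest first.
cor_limit <- 0.20
print(round(sort(assoc[assoc >= cor_limit], decreasing = TRUE), 2))
```

A term that tends to appear in the same tweets as "merkel" gets a score near 1, while a term that appears only in other tweets gets a negative score and is dropped by the limit.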

--------------------------------------------------------
The code is below:

## @knitr setup
library(ggplot2)
library(scales)
library(lubridate)
library(twitteR)
library(gridExtra)
library(knitr) # provides opts_chunk, opts_knit and imgur_upload used below
# set global chunk options
opts_chunk$set(fig.path='figure/slides-', cache.path='cache/slides-', cache=TRUE)
# upload images automatically
opts_knit$set(upload.fun = imgur_upload)

## @knitr load_data
# load tweets and convert to dataframe
regsprecher.tweets <- userTimeline("RegSprecher", n=1500)
regsprecher.tweets.df <- twListToDF(regsprecher.tweets)
# keep only tweets from 2011 onwards, because the timeline sometimes contains tweets dated 2004
regsprecher.tweets.df <- subset(regsprecher.tweets.df, created > ymd("2011-01-01"))
#str(regsprecher.tweets.df)
print(head(regsprecher.tweets.df[,c(1,4,10)]))

## @knitr device
# Code from vignette of twitteR-package
# the status source is an HTML anchor; strip the closing tag and keep the link text
sources <- sapply(regsprecher.tweets, function(x) x$getStatusSource())
sources <- gsub("</a>", "", sources)
sources <- strsplit(sources, ">")
sources <- sapply(sources, function(x) ifelse(length(x) > 1, x[2], x[1]))
pie(table(sources))

## @knitr freq
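# histogram of tweets per day (binwidth is in seconds: 86400 s = 1 day)
# geom_histogram is used because newer ggplot2 versions no longer accept binwidth in geom_bar
ggplot() +
  geom_histogram(aes(x = created), data = regsprecher.tweets.df, binwidth = 86400) +
  scale_y_continuous(name = 'Frequency, # tweets/day') +
  scale_x_datetime(name = 'Date', breaks = date_breaks(), labels = date_format(format = '%Y-%b'))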

## @knitr time
# left: hour of day of each tweet over time; right: number of tweets per hour of day
plot1 <- ggplot() +
  geom_point(aes(x = created, y = hour(created)), data = regsprecher.tweets.df, alpha = 0.5) +
  scale_y_continuous(name = 'Hour of day')
plot2 <- ggplot() +
  geom_histogram(aes(x = hour(created)), data = regsprecher.tweets.df, binwidth = 1) +
  scale_x_continuous(name = 'Hour of day', breaks = seq(0, 24, by = 2), limits = c(0, 24)) +
  scale_y_continuous(name = '# tweets')
grid.arrange(plot1, plot2, ncol=2)

## @knitr words
# this passage is entirely from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/
library(tm)
# build a corpus
mydata.corpus <- Corpus(VectorSource(regsprecher.tweets.df$text))
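# make each letter lowercase
# (tm >= 0.6 needs base functions wrapped in content_transformer; older versions accept tolower directly)
mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower))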
# remove punctuation 
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
# remove generic and custom stopwords
my_stopwords <- c(stopwords('german'))
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
# inspect the term-document matrix
#mydata.dtm
# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)
findAssocs(mydata.dtm, 'merkel', 0.20)

## @knitr words2

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.97)
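# convert the sparse term-document matrix to a standard data frame (terms in rows)
# as.matrix is used because inspect() is meant for printing, not for conversion
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix between terms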

## @knitr words3
# "ward" was renamed to "ward.D" in R 3.1.0; use method="ward" on older R versions
fit <- hclust(d, method="ward.D")
plot(fit) # display dendrogram
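
To see what the clustering step does without downloading any tweets, here is a small sketch on an invented term-document matrix. It assumes R 3.1.0 or later for `method = "ward.D"` and uses `cutree`, which does not appear in the code above, to turn the tree into group labels.

```r
# Invented term-document counts: rows are terms, columns are documents.
tdm <- rbind(
  merkel    = c(2, 1, 0, 0),
  kanzlerin = c(2, 1, 0, 1),
  uhr       = c(0, 0, 3, 2),
  fragreg   = c(0, 0, 2, 3)
)

# Euclidean distances between term rows, then Ward-type agglomeration.
d <- dist(tdm, method = "euclidean")
fit <- hclust(d, method = "ward.D")

# Cut the tree into two groups; terms that co-occur share a group label.
groups <- cutree(fit, k = 2)
print(groups)
```

Terms that appear in the same documents end up close together in the dendrogram, which is what `plot(fit)` visualizes.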

