Analyze Twitter Data Using R

[This article was first published on R-Chart, and kindly contributed to R-bloggers.]
Twitter's API provides a wealth of real-time information.  This article uses R to graph user relationships and to analyze the tweets returned by a search.  Keep in mind that Twitter has announced it will remove basic authentication on August 16, 2010.  I am not sure whether this code will work after that point; it depends on the state of the twitteR library at that time and on the API specifics that Twitter implements.

Cornelius Puschmann’s blog provides the background for the graphing code below, which is consolidated here into a function.  It relies upon the twitteR and igraph libraries.

library(twitteR)
library(igraph)

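twitterGraph <- function(username, password, userToPlot)
{
  # Log in, then fetch up to 20 friends and 20 followers of userToPlot
  sess <- initSession(username, password)
  friends.object <- userFriends(userToPlot, n=20, sess)
  followers.object <- userFollowers(userToPlot, n=20, sess)

  # Extract the name of each returned user
  friends <- sapply(friends.object, name)
  followers <- sapply(followers.object, name)

  # Build the edge list; all=TRUE keeps the rows from both data frames
  relations <- merge(data.frame(User=userToPlot, Follower=friends),
                     data.frame(User=followers,  Follower=userToPlot),
                     all=TRUE)

  # Create a directed graph and label each vertex with its user name
  g <- graph.data.frame(relations, directed=TRUE)
  V(g)$label <- V(g)$name
  g
}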

The function returns an object of class igraph that represents Twitter users as nodes and the direction of their relationships as edges.  Printing it gives output like this:

Vertices: 16 
Edges: 16 
Directed: TRUE 
Edges:
                                                     
[0]  ‘ezgraphs’             -> ‘Cranatic’      
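
To make the merge step inside twitterGraph concrete, here is a minimal sketch that builds the same kind of edge list from made-up data, with no Twitter session involved.  The usernames are hypothetical placeholders, not real API output.

```r
# Hypothetical usernames standing in for what the twitteR calls would return
userToPlot <- "alice"
friends    <- c("bob", "carol")   # assumed result of the friends lookup
followers  <- c("carol", "dave")  # assumed result of the followers lookup

# Same merge as in twitterGraph: all=TRUE keeps the rows from both data frames
relations <- merge(data.frame(User=userToPlot, Follower=friends),
                   data.frame(User=followers,  Follower=userToPlot),
                   all=TRUE)

# One row per directed edge, from the User column to the Follower column
print(relations)
```

Each row of relations becomes one edge when it is passed to graph.data.frame, so the number of rows matches the edge count reported for the graph.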



The graph object can be written to a file in a variety of common graph formats.  However, some of these formats currently do not retain label information, only node numbers.  The .dot and .graphml formats do appear to retain it.  You can also plot the graph in a variety of ways within R itself, which is what is demonstrated below.
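
As a rough illustration of writing a graph to a file, the sketch below saves a toy graph as GraphML and checks that a vertex name shows up in the output.  It assumes a recent igraph release, where the functions are called graph_from_data_frame and write_graph; older releases, like the one used in this article, name them graph.data.frame and write.graph.  The usernames and the temporary file are placeholders.

```r
library(igraph)

# Toy edge list with hypothetical usernames (not real Twitter data)
edges <- data.frame(from=c("alice", "alice", "carol"),
                    to=c("bob", "carol", "alice"))
toy <- graph_from_data_frame(edges, directed=TRUE)

# Write the graph as GraphML to a temporary file
out_file <- tempfile(fileext=".graphml")
write_graph(toy, out_file, format="graphml")

# Check whether a vertex name appears in the written file
graphml_text <- readLines(out_file)
name_kept <- any(grepl("alice", graphml_text))
print(name_kept)
```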

You can call the twitterGraph function by passing it a Twitter username and password to log in with, along with the user to plot.

g <- twitterGraph('YOUR_TWITTER_USERNAME', 'PASSWORD', 'USER_TO_PLOT')


The simplest way to visualize this graph in R is by calling plot.  It results in the graph pictured above.

plot(g)

The graph displayed is static and cannot be manipulated or rearranged in any way.  However, you can create an interactive version of the plot by calling the tkplot function for the graph.  The individual nodes can be arranged by clicking and dragging them.  However, you might instead opt for one of the automatic layouts available that implement various algorithms for drawing graphs.


tkplot(g)




The Reingold-Tilford algorithm results in an arrangement like the following:

[Figure: the graph arranged with the Reingold-Tilford layout]

The Kamada-Kawai algorithm, by contrast, renders as follows:

[Figure: the graph arranged with the Kamada-Kawai layout]

tkplot is the best option available at present for drawing graphs, and the Tk interface provides some decent functionality for interacting with the graph.  There is one final possibility, which renders the graph in 3D and lets you rotate it by clicking and dragging.  However, this option does not permit arrangement of individual nodes and does not include any algorithms to rearrange the nodes.

rglplot(g)
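
The layout algorithms named earlier can also be applied to a static plot, without going through the Tk window.  The following is a minimal sketch on a toy graph, assuming a recent igraph release in which the layouts are exposed as layout_as_tree (Reingold-Tilford) and layout_with_kk (Kamada-Kawai).  The usernames are made up.

```r
library(igraph)

# Toy directed graph with hypothetical usernames
edges <- data.frame(from=c("alice", "alice", "alice", "bob"),
                    to=c("bob", "carol", "dave", "erin"))
toy <- graph_from_data_frame(edges, directed=TRUE)

# Each layout function returns a matrix with one row of coordinates per vertex
tree_coords <- layout_as_tree(toy)   # Reingold-Tilford tree layout
kk_coords   <- layout_with_kk(toy)   # Kamada-Kawai layout

# Pass the coordinates to plot() through the layout argument
plot(toy, layout=tree_coords)
plot(toy, layout=kk_coords)
```

Because the coordinates are an ordinary matrix, they can be saved and reused so that the same arrangement is drawn each time.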


There are many other ways you can analyze Twitter data using R.  There is an extensive collection of R packages dedicated to natural language processing tasks.  The following example relies upon the openNLP and openNLPmodels.en packages.


library(openNLP)
library(twitteR)


# Replace the user and password below
sess <- initSession('YOUR_TWITTER_USER', 'PASSWORD')
sea <- searchTwitter("#rstats")


# Cycle through the list and collect the tokens from each tweet's text
names(sea) <- c('tweet')
textdata <- vector()
for (i in seq_along(sea)) {
  textdata <- append(textdata, tokenize(text(sea[[i]])))
}


# Limit to tokens that include alphabetic characters
textdata <- factor(textdata)
textdata <- textdata[grep("[a-zA-Z]", textdata)]


# Only include tokens that appear more than three times.
# table() counts every token; summary() on a factor lumps all but
# the most frequent levels into a single "(Other)" entry.
counts <- table(textdata)
frequent <- counts[counts > 3]


# Rotate the axis labels and enlarge the bottom margin so the token labels fit
par(las=2, cex=.9, mar=c(11, 2, 4, 2) + 0.1)
barplot(frequent, names.arg=names(frequent))

The result is a bar chart showing how often each word occurs in the tweets retrieved.
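
The counting logic can be tried without a Twitter session or the openNLP tokenizer.  The sketch below uses only base R, with made-up tweet texts and a plain whitespace split standing in for tokenize, so the tokens will not match what openNLP would produce.  The threshold is lowered to one because the sample is tiny.

```r
# Hypothetical tweet texts standing in for searchTwitter() results
tweets <- c("Learning #rstats with igraph",
            "igraph makes graphs easy in #rstats",
            "Plotting with #rstats is fun")

# Split on whitespace as a crude stand-in for a real tokenizer
tokens <- unlist(strsplit(tolower(tweets), "\\s+"))

# Keep tokens that contain at least one letter
tokens <- tokens[grepl("[a-z]", tokens)]

# Count occurrences and keep tokens that appear more than once
counts   <- table(tokens)
frequent <- counts[counts > 1]
print(sort(frequent, decreasing=TRUE))
```

Passing frequent to barplot gives the same kind of chart as the example above.
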
R’s interactive nature makes it a great platform for investigating new data sources.  As usual, I am pleasantly surprised at the mature nature of the language and the large number of libraries available to do the heavy lifting required for analyzing data in a variety of formats.
