Facebook-class social network analysis with R and Hadoop

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

In computing, social networks are traditionally represented as graphs: a collection of nodes (people), pairs of which may be connected by edges (friend relationships). Visually, a social network can then be represented like this:

[Figure: the simulated social network, drawn as a graph]

Social network analysis often amounts to calculating statistics on a graph like this: the number of edges (friends) connected to a particular node (person), known as its degree, and the distribution of those counts across the entire graph. When the graph consists of up to 10 billion elements (nodes and edges), such computations can be done on a single server with dedicated graph software like Neo4j. But bigger networks, like Facebook's social network, which is a graph with more than 60 billion elements, require a distributed solution.
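To make these two statistics concrete, here is a minimal sketch in base R on a toy graph. The `edges` data frame and its node names are invented for illustration and do not come from the original post. The sketch treats friendships as undirected, so each edge counts toward the degree of both of its endpoints.

```r
# Hypothetical toy edge list: each row is one friendship (undirected edge).
edges <- data.frame(
  from = c("ann", "ann", "bob", "cat", "dan"),
  to   = c("bob", "cat", "cat", "dan", "eve"),
  stringsAsFactors = FALSE
)

# Degree of each node: the number of edges that touch it.
# Both endpoints of every edge are counted.
degree <- table(c(edges$from, edges$to))

# Degree distribution: how many nodes have each degree value.
degree.dist <- table(as.vector(degree))

print(degree)
print(degree.dist)
```

On a graph this small a single call to `table` is enough. The distributed approach described next matters only once the edge list no longer fits on one machine.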

Marko A. Rodriguez, a graph consultant with Aurelius, shows in a blog post how to use R and Hadoop (integrated with Revolution Analytics' RHadoop packages) to analyze Facebook-scale social networks. He first simulates a social network (shown at the top of this post) using R's igraph package, then distributes the network across the Hadoop cluster with the to.dfs function (from the rmr package). He then uses the mapreduce function (also from the rmr package) to write a simple map-reduce algorithm in R that counts the number of edges associated with each node:

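# Map: emit the node id in the second column of each edge as the key, with value 1.
# Reduce: count the values collected for each node id.
degree.V <- mapreduce(edge.list,
    map=function(k,v) keyval(v[2],1),
    reduce=function(k,v) keyval(k,length(v)))
# Pull the result back from HDFS and show the first element of the result.
from.dfs(degree.V)[[1]]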
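The job above needs a running Hadoop cluster, so it can help to trace the same logic locally first. The following is a minimal sketch in base R that mimics the map, shuffle, and reduce phases in memory. The `local.mapreduce` helper and the toy `edge.list` are hypothetical stand-ins written for this illustration; they are not part of rmr and make no claim about how rmr works internally.

```r
# Hypothetical in-memory stand-in for a map-reduce job; NOT part of rmr.
#   map:    function(record) returning list(key = ..., val = ...)
#   reduce: function(key, vals) returning one result per distinct key
local.mapreduce <- function(records, map, reduce) {
  pairs <- lapply(records, map)            # map phase: one pair per record
  keys  <- vapply(pairs, function(p) as.character(p$key), character(1))
  index <- split(seq_along(pairs), keys)   # shuffle: record positions per key
  lapply(index, function(i) {              # reduce phase: one call per key
    vals <- lapply(pairs[i], function(p) p$val)
    reduce(keys[i[1]], vals)
  })
}

# Toy edge list: each record is c(first node, second node).
edge.list <- list(c(1, 2), c(1, 3), c(2, 3), c(4, 3), c(4, 2))

# Same logic as the job above: key on the second column, count occurrences.
degree.V <- local.mapreduce(edge.list,
  map    = function(v) list(key = v[2], val = 1),
  reduce = function(k, v) length(v))

print(unlist(degree.V))
```

In this toy run, nodes 1 and 4 are missing from the result because they never appear in the second column. Keying on `v[2]` counts only the edges in which a node is the second endpoint, so whether that equals the full degree depends on how the edge list was built.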

From there, it's another simple map-reduce job to calculate the connectivity statistics for the entire network. For more details on how Marko used RHadoop to perform this analysis, see the entire blog post linked below.
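This summary does not show the second job, so the following is an assumption about its general shape rather than Marko's actual code: the map step re-keys each (node, degree) pair by its degree, and the reduce step counts how many nodes share that degree. The sketch emulates the three phases in base R on a hypothetical `node.degree` vector.

```r
# Hypothetical output of the first job: node id -> degree.
node.degree <- c(`1` = 2, `2` = 3, `3` = 4, `4` = 2, `5` = 1)

# Map phase: emit (key = degree, value = 1) for every node.
mapped <- data.frame(key = unname(node.degree), val = 1)

# Shuffle phase: group the emitted values by key.
grouped <- split(mapped$val, mapped$key)

# Reduce phase: count how many nodes share each degree.
degree.dist <- vapply(grouped, length, integer(1))

print(degree.dist)
```

The result is keyed by degree rather than by node, so its size depends on the number of distinct degree values and is typically far smaller than the network itself.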

Aurelius blog: Graph Degree Distributions using R over Hadoop

