Data preparation for Social Network Analysis using R and Gephi
[This article was first published on Enterprise Software Doesn't Have to Suck, and kindly contributed to R-bloggers.]
I want to share my experience in generating the data for social network analysis using R and analyzing it using Gephi.
WHICH DATA STRUCTURE TO USE FOR LARGE GRAPHS?
I quickly realized that edge lists and adjacency matrices become unwieldy as the graph grows. I needed an alternative graph format that was storage-efficient and flexible enough to capture details like edge weight. I chose Gephi's GEXF file format because it can handle large graphs and supports dynamic and hierarchical structures. Check out the GEXF comparison with other formats for details.
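For reference, a minimal GEXF file looks roughly like this (a hand-written sketch of the format, not output of the code below; the node labels are made-up examples):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="john_doe@gmail.com" />
      <node id="1" label="jane_smith@hotmail.com" />
    </nodes>
    <edges>
      <!-- source/target reference node ids; weight carries the email count -->
      <edge id="0" source="0" target="1" weight="4" />
    </edges>
  </graph>
</gexf>
```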
HOW TO HANDLE LARGE DATA SETS IN R?
As I tried to process millions of rows of email log to derive the edgelist, I realized a couple of things…
1) R cannot handle data larger than my computer's RAM, so I had to look for a way to use R with large data sets. R packages like RMySQL and sqldf came in handy for this. sqldf uses SQLite, an in-memory database by default; if your data cannot fit into RAM, you can instruct SQLite to use a persistent store instead.
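A minimal sketch of the persistent-store approach: passing `dbname` makes sqldf run the query in a file-backed SQLite database rather than an in-memory one (the tiny data frame here is just a stand-in for a large table):

```r
library(sqldf)

# Stand-in for a table too large to process comfortably in RAM
big <- data.frame(id = 1:10, name = letters[1:10])

# dbname = a file path tells sqldf to use an on-disk SQLite database,
# so intermediate results spill to disk instead of exhausting RAM
result <- sqldf("SELECT COUNT(*) AS n FROM big", dbname = tempfile())
```
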
Note: There are many other ways to handle large data in R effectively, e.g. R multicore package for parallel processing, R on MapReduce/Hadoop, etc. Check out the presentation on high performance computing in R for other techniques like ff and bigmemory. Please shout if there are other ways that you used…
2) Some operations are better suited for a database/RDBMS: I offloaded RDBMS-suited tasks to SQLite, the default database used by sqldf.
3) Learn memory management in R:
– By default R allocates ~1.5GB memory for its use. I allocated more memory for R to handle larger objects using the command “memory.limit(size=3000)”
– Remove unwanted objects from the R session e.g.
rm(raw_emails, emails, to_nodes,from_nodes,all_nodes, unique_nodes)
gc() # call garbage collection explicitly
LOADING THE GRAPH IN GEPHI
Gephi wasn’t able to handle very large graph files (e.g. for files > 500MB size, Gephi was either too slow or stopped responding). So I had to do a couple of things…
1) Increase the amount of memory Gephi allocates to the JVM at startup: by default Gephi allocates 512MB for the JVM, which wasn't enough to load the large graph file, so I raised the JVM maximum to 1.4GB.
Edit the C:\Program Files\Gephi-0.7\etc\gephidesktop.conf file and change the line
default_options="--branding gephidesktop -J-Xms64m -J-Xmx512m" to
default_options="--branding gephidesktop -J-Xms64m -J-Xmx1400m"
2) Decrease the file size by reducing the text in the graph file, e.g. use shorter node_ids, edge_ids, etc.
Also, Gephi complained about incorrect file format (it expects UTF-8 encoded XML files). I fixed this simply by opening the graph file generated by R in Textpad and saving it in UTF-8 format before feeding it to Gephi.
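If you'd rather skip the manual Textpad step, the re-encoding can be done from R as well; a sketch (the file names are placeholders, and "latin1" is an assumption about the source encoding — adjust it to whatever your system actually produced):

```r
# Read the generated graph file and rewrite it as UTF-8 for Gephi
lines <- readLines("graph.gexf", encoding = "latin1")
con <- file("graph-utf8.gexf", open = "w", encoding = "UTF-8")
writeLines(lines, con)
close(con)
```
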
LESSONS LEARNED
1) R is more than a statistical tool. I was able to manipulate and clean large data sets (500+ million rows) easily. I will continue learning it. It's fun and rewarding.
2) There are other sophisticated tools for visual social network analysis, like Network Workbench. I will explore it for heavy analysis, but Gephi is very easy to use and continues to be my favorite.
3) Use a machine with a lot of RAM, as both Gephi and R are memory-hungry.
MY CODE FOR GENERATING THE GRAPH
By the way, here’s the R code I used for preparing the graph from email logs for social network analysis using R and Gephi. I’m sure there are better ways to accomplish this. Please shout if you notice any.
# Target is to generate a graph file in gexf format (http://gexf.net/format) for Gephi
#-----------------------------------------------------
# STEP 1
# Generate nodes and edgelist from each email log file
#-----------------------------------------------------
setwd("C:/R")
# use sqldf for operations suited for db http://code.google.com/p/sqldf/
library(sqldf)
# define utility functions
object.sizes <- function(obs=ls(envir=.GlobalEnv)){return(rev(sort(sapply(obs, function (object.name) object.size(get(object.name))))))}
# Create an empty data frame from a header list
empty.df<- function(header){
df<-data.frame(matrix(matrix(rep(1,length(header)),1),1))
colnames(df)<-header
return(df[NULL,])
}
# break down large data problem into smaller ones
max_rows<-200000
filelist<-c("Emails Dec 2009.txt", "Emails Jan 2010.txt", "Emails Feb 2010.txt")
# data format in the email logs is as below
# date from_address from_name to_address to_name
# 12-12-2009 john_doe@gmail.com John Doe jane_smith@hotmail.com Jane Smith
for(k in 1:length(filelist)){
system.time(raw_emails<-read.csv(filelist[k],sep="\t",header=T,strip.white=TRUE))
# remove system-generated emails (grepl + negation keeps all rows when nothing matches,
# whereas -grep() would drop every row in that case)
emails<-raw_emails[!grepl('donotreply|error|webmaster|www', paste(raw_emails[,3],raw_emails[,5]), ignore.case=TRUE),]
# filenames for collecting nodes and edges
node_file<-paste("nodes",k,"-",filelist[k],sep="")
edge_file<-paste("edges",k,"-",filelist[k],sep="")
# get to_nodes in "node_id, node_label" format
to_nodes<-emails[c(-1,-2,-3)]
# get from_nodes in "node_id, node_label" format
from_nodes<-emails[c(-1,-4,-5)]
# get edgelist in "from_node_id, to_node_id" format
edgelist<-emails[c(-1,-3,-5)]
# change column names for rbind
colnames(to_nodes)<-c("id","name")
colnames(from_nodes)<-c("id","name")
all_nodes<-rbind(to_nodes,from_nodes)
# convert all nodes and edgelist to lowercase... using SQL
# (column names must match the email log header shown above)
system.time(all_nodes_lowercase<-sqldf('SELECT LOWER(id) uid, LOWER(name) label FROM all_nodes'))
system.time(edgelist_lowercase<-sqldf('SELECT LOWER(from_address) originator, LOWER(to_address) recipient FROM edgelist'))
unique_nodes<-unique(all_nodes_lowercase)
sorted_unique_nodes<- unique_nodes[order(unique_nodes[,1]),]
write.csv(sorted_unique_nodes, file = node_file, row.names=FALSE, quote = FALSE)
num_blocks<-ceiling(nrow(edgelist_lowercase)/max_rows)
start_row<-0
edgecount <- empty.df(c("originator","recipient","count(1)"))
for(i in 1:num_blocks){
sql_statement<-paste('select originator, recipient, count(1) FROM (select originator, recipient FROM edgelist_lowercase LIMIT ', start_row, ',', max_rows, ') group by originator, recipient order by originator, recipient')
print(system.time(counts<-sqldf(sql_statement)))
edgecount<- rbind(edgecount, counts)
start_row<-start_row + max_rows
}
system.time(sqldf("create index edgecount1 on edgecount (originator, recipient)"))
system.time(final_edgecount <- sqldf("select originator, recipient, sum(count_1_) FROM edgecount group by originator, recipient order by originator, recipient"))
write.csv(final_edgecount, file = edge_file, row.names=FALSE, quote = FALSE)
}
#----------------------------------------------------------------------
# STEP 2
# Combine node and edgelist files into one large node and edgelist file
#----------------------------------------------------------------------
all_file_nodes <- empty.df(c("id","label"))
all_file_edges <- empty.df(c("originator","recipient", "sum.count_1_."))
for(k in 1:length(filelist)){
node_file<-paste("nodes",k,"-",filelist[k],sep="")
edge_file<-paste("edges",k,"-",filelist[k],sep="")
# read each node file and rbind with all_file_nodes
system.time(nodes<-read.csv(node_file,sep=",",header=T,strip.white=TRUE))
all_file_nodes <- rbind(all_file_nodes, nodes)
# read each edge file and rbind with all_file_edges
system.time(edges<-read.csv(edge_file,sep=",",header=T,strip.white=TRUE))
all_file_edges <- rbind(all_file_edges, edges)
}
unique_all_file_nodes<-unique(all_file_nodes)
sorted_unique_all_file_nodes<- unique_all_file_nodes[order(unique_all_file_nodes[,1]),]
# write nodes in this form --- <node id="0" label="Hello" />
nodexml<-paste("<node id=\"", sorted_unique_all_file_nodes[,1], "\"", " label=\"", sorted_unique_all_file_nodes[,2],"\""," />", sep="")
write.csv(as.data.frame(nodexml, optional=TRUE), file = "All Nodes.txt", quote = FALSE, row.names = FALSE)
# edge operations
# use pragma table_info to see the table attributes to use in sum sql below
# sqldf("pragma table_info(all_file_edges)")
unique_all_file_edges<-sqldf('select originator, recipient, sum(sum_count_1__) FROM all_file_edges group by originator, recipient order by sum(sum_count_1__)')
nrow(unique_all_file_edges)
# drop weak edges: keep only edges with weight > 3
thicker_edges<-unique_all_file_edges[unique_all_file_edges[,3]>3,]
# write edges in this form --- <edge id="0" source="0" target="1" type="directed" weight="2.4" />
edgelistxml<-paste("<edge id=\"", rownames(thicker_edges), "\" ", "source=\"", thicker_edges[,1], "\" target=\"", thicker_edges[,2], "\" weight=\"", thicker_edges[,3], "\"/>", sep="")
# write edges for gexf file. Convert to data.frame to prevent printing the column name
write.csv(as.data.frame(edgelistxml, optional=TRUE), file = "All Edges.txt", quote = FALSE, row.names = FALSE)
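The two .txt files above still need to be wrapped in the GEXF skeleton before Gephi will open them. A rough sketch of that final assembly step in R — it assumes the files were written by the code above (so each has a one-line CSV header to drop), and the output file name is my own placeholder:

```r
# Stitch the node and edge XML fragments into a complete .gexf file
nodes <- readLines("All Nodes.txt")[-1]  # drop the csv header line
edges <- readLines("All Edges.txt")[-1]
gexf <- c('<?xml version="1.0" encoding="UTF-8"?>',
          '<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">',
          '<graph defaultedgetype="directed">',
          '<nodes>', nodes, '</nodes>',
          '<edges>', edges, '</edges>',
          '</graph>', '</gexf>')
con <- file("email_graph.gexf", open = "w", encoding = "UTF-8")
writeLines(gexf, con)
close(con)
```

Note that node labels taken from raw email logs may contain characters that are illegal in XML attributes (&, <, >, "); escaping them before the paste() calls above would be a worthwhile extra step.
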