I want to share my experience in generating the data for social network analysis using R and analyzing it using Gephi…
WHICH DATA STRUCTURE TO USE FOR LARGE GRAPHS?
I quickly realized that using edge lists and adjacency matrices gets difficult as the graph size increases. So I needed an alternative graph format that was efficient to store and flexible enough to capture details like edge weight. I chose Gephi's GEXF file format, as it can handle large graphs and supports dynamic and hierarchical structures. Check out the GEXF comparison with other formats for details.
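As a minimal sketch (assuming the rgexf package, which is one way to write GEXF from R, though not necessarily how the original file was generated), a weighted edge list can be exported like this:

library(rgexf)

# Toy data: a three-person email network with edge weights.
nodes <- data.frame(id = 1:3, label = c("alice", "bob", "carol"))
edges <- data.frame(source = c(1, 1, 2), target = c(2, 3, 3))
weights <- c(5, 1, 2)  # e.g. number of emails exchanged per pair

# write.gexf produces a GEXF XML file that Gephi can open.
write.gexf(nodes = nodes, edges = edges, edgesWeight = weights,
           output = "email_graph.gexf")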
HOW TO HANDLE LARGE DATA SETS IN R?
As I tried to process millions of rows of email logs to derive the edge list, I realized a couple of things…
1) R cannot handle data larger than my computer's RAM, so I had to look for a way to use R on large data sets. R packages like RMySQL and sqldf came in handy for this. sqldf uses SQLite, an in-memory database by default; if your data cannot fit into RAM, you can instruct SQLite to use a persistent on-disk store instead.
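For example (a sketch with hypothetical file and column names), sqldf's read.csv.sql can load a log straight into an on-disk SQLite database and hand back only the aggregated result:

library(sqldf)

# The raw rows stay in SQLite on disk; only the aggregate comes back to R.
edgelist <- read.csv.sql(
  "email_log.csv",      # hypothetical log file
  sql = "SELECT sender, receiver, COUNT(*) AS weight
         FROM file GROUP BY sender, receiver",
  dbname = tempfile()   # on-disk database rather than :memory:
)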
Note: There are many other ways to handle large data in R effectively, e.g. the R multicore package for parallel processing, R on MapReduce/Hadoop, etc. Check out the presentation on high performance computing in R for other techniques like ff and bigmemory. Please shout if there are other ways you've used…
2) Some operations are better suited for a database/RDBMS: I offloaded those tasks to SQLite, the default database used by sqldf.
3) Learn memory management in R:
– By default R allocates ~1.5GB of memory for its use (the default on my Windows machine). I raised this so R could handle larger objects with memory.limit(size = 3000), which is Windows-only and takes the size in MB.
– Remove unwanted objects from the R session once you're done with them, e.g.:
rm(raw_emails, emails, to_nodes, from_nodes, all_nodes, unique_nodes)  # drop intermediate objects
gc()  # call garbage collection explicitly to reclaim the freed memory
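A related sketch (base R only): list the session's objects by size first, so you know which ones are worth removing.

# Show which objects are eating memory, largest first.
sort(sapply(ls(), function(x) object.size(get(x))), decreasing = TRUE)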
LOADING THE GRAPH IN GEPHI
Gephi wasn't able to handle very large graph files (e.g. for files over 500MB, Gephi was either too slow or stopped responding). So I had to do a couple of things…
1) Increase the amount of memory Gephi allocates to the JVM at startup: By default Gephi allocates 512MB for the JVM. This wasn't enough to load the large graph file, so I increased the maximum memory Gephi allocates to the JVM to 1.4GB.
Edit the C:\Program Files\Gephi-0.7\etc\gephidesktop.conf file and change the line
default_options="--branding gephidesktop -J-Xms64m -J-Xmx512m"
to
default_options="--branding gephidesktop -J-Xms64m -J-Xmx1400m"
2) Decrease the file size by reducing the amount of text in the graph file, e.g. use shorter node_ids, edge_ids, etc.
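As a sketch (column names hypothetical), long email addresses can be swapped for compact integer ids before the file is written:

# Map each address to a short integer id to shrink the graph file.
addresses <- unique(c(edgelist$sender, edgelist$receiver))
id_map <- setNames(seq_along(addresses), addresses)

edgelist$source <- id_map[as.character(edgelist$sender)]
edgelist$target <- id_map[as.character(edgelist$receiver)]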
Also, Gephi complained about an incorrect file format (it expects UTF-8 encoded XML files). I fixed this simply by opening the graph file generated by R in TextPad and saving it in UTF-8 format before feeding it to Gephi.
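An alternative sketch that skips the TextPad round-trip: write the file as UTF-8 from R in the first place (gexf_xml here is a hypothetical character string holding the generated XML).

con <- file("email_graph.gexf", open = "w", encoding = "UTF-8")
writeLines(gexf_xml, con)  # gexf_xml: hypothetical XML string built earlier
close(con)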
LESSONS LEARNED
1) R is more than a statistical tool. I was able to manipulate and clean large data sets (500+ million rows) easily. I will continue learning it. It's fun and rewarding.
2) There are other sophisticated tools for visual social network analysis, like Network Workbench. I will explore it for heavy analysis, but Gephi is very easy to use and continues to be my favorite.
3) Use a machine with a lot of RAM, as both Gephi and R are memory-hungry.
MY CODE FOR GENERATING THE GRAPH
By the way, here's the R code I used for preparing the graph from email logs for social network analysis using R and Gephi. I'm sure there are better ways to accomplish this; please shout if you notice any.
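The full script isn't reproduced here, so as a stand-in, here is a compact end-to-end sketch stitching together the pieces above (all file, column, and object names are hypothetical, and sqldf/rgexf are assumptions rather than the confirmed original tooling):

# End-to-end sketch (hypothetical names; assumes sqldf and rgexf).
library(sqldf)
library(rgexf)

# 1) Aggregate the raw log into a weighted edge list inside SQLite.
edgelist <- read.csv.sql(
  "email_log.csv",
  sql = "SELECT sender, receiver, COUNT(*) AS weight
         FROM file GROUP BY sender, receiver",
  dbname = tempfile()
)

# 2) Map addresses to compact integer ids.
addresses <- unique(c(edgelist$sender, edgelist$receiver))
id_map <- setNames(seq_along(addresses), addresses)

# 3) Write the GEXF file for Gephi.
nodes <- data.frame(id = unname(id_map), label = addresses)
edges <- data.frame(source = id_map[as.character(edgelist$sender)],
                    target = id_map[as.character(edgelist$receiver)])
write.gexf(nodes = nodes, edges = edges,
           edgesWeight = edgelist$weight,
           output = "email_graph.gexf")

# Clean up large intermediates when done.
rm(edgelist); gc()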