Analysing the ISMB 2010 meeting using R
The colossus of bioinformatics meetings, ISMB, convened in Boston this year from July 9 – 13. As in recent years, the meeting was covered online at its website, FriendFeed and Twitter.
I thought it would be fun to run a quick analysis of activity at the FriendFeed room using R.
1. Fetch the data
We can use the FriendFeed API to fetch data in JSON format. R provides two useful packages: RCurl, for making the HTTP request and rjson (or RJSONIO), to parse the results into a list. Since we don’t know in advance how many entries to expect, we set some arbitrarily large maximum number of entries, loop towards it and break when no more entries are returned.
library(RCurl)
library(rjson)

ismb.url  <- "http://friendfeed-api.com/v2/feed/ismb2010"
ismb.data <- list()

# fetch entries 100 at a time; stop when the API returns no more
for(i in seq(0, 900, by = 100)) {
  ismb.json <- fromJSON(getURL(paste(ismb.url, "?start=", i, "&num=100", sep = "")))
  if(length(ismb.json$entries) == 0)
    break
  else
    ismb.data <- append(ismb.data, ismb.json$entries)
}
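If you prefer RJSONIO, its fromJSON() works here too; a minimal sketch of the same loop, with simplify = FALSE to keep the nested list structure that the rest of this post assumes:

# a sketch using RJSONIO instead of rjson; simplify = FALSE keeps the
# nested list structure that the code below expects
library(RCurl)
library(RJSONIO)

ismb.url  <- "http://friendfeed-api.com/v2/feed/ismb2010"
ismb.data <- list()
for(i in seq(0, 900, by = 100)) {
  ismb.json <- fromJSON(getURL(paste(ismb.url, "?start=", i, "&num=100", sep = "")),
                        simplify = FALSE)
  if(length(ismb.json$entries) == 0) break
  ismb.data <- append(ismb.data, ismb.json$entries)
}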
The list ismb.data currently contains 178 entries. Each entry is itself a list of items that describe the entry. You can get an idea of its structure using summary():
> length(ismb.data)
[1] 178

> summary(ismb.data[[1]])
         Length Class  Mode
body     1      -none- character
from     3      -none- list
url      1      -none- character
comments 2      -none- list
to       1      -none- list
likes    1      -none- list
date     1      -none- character
id       1      -none- character
2. Entries, comments and likes
We’d like to see the title, date, number of comments and number of likes for each entry. One way to do that is to convert ismb.data to a data frame. There is surely an elegant way to achieve this using, for example, the plyr package, but here’s an ugly way using sapply():
ismb.df <- data.frame(body     = sapply(ismb.data, function(x) x$body),
                      date     = sapply(ismb.data, function(x) x$date),
                      comments = sapply(ismb.data, function(x) length(x$comments)),
                      likes    = sapply(ismb.data, function(x) length(x$likes)))
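As it happens, plyr can get there in one step; here is a sketch using ldply(), assuming every entry carries body, date, comments and likes elements:

library(plyr)

# a sketch: build the same data frame with ldply(), one row per entry
# (assumes each entry has body, date, comments and likes elements)
ismb.df <- ldply(ismb.data, function(x) {
  data.frame(body     = x$body,
             date     = x$date,
             comments = length(x$comments),
             likes    = length(x$likes),
             stringsAsFactors = FALSE)
})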
Now that we have a data frame, it’s easy to sort. Let’s look at the 10 entries that generated the most discussion. I’ve edited the output here, to highlight just the relevant parts with counts in the first column:
> head(ismb.df[sort.list(ismb.df$comments, decreasing = T), ], n = 10L)

121  PLoS Session on How to Write a Good Paper
100  Keynote: David Altshuler - Genomic Variation and the Inherited Basis of Common Disease
 68  Keynote: Chris Sander - Systems Biology of Cancer Cells
 66  Keynote: Svante Pääbo - Analyses of Pleistocene Genomes
 65  Keynote: George Church - BI/O: Reading and Writing Genomes
 61  Keynote: Steven Brenner - Ultraconserved nonsense: gene regulation by splicing & RNA surveillance
 37  Keynote: Susan Lindquist - Protein Folding and Environmental Stress REDRAW the Relationship between Genotype and Phenotype
 29  Special Public Lecture: Dr. Robert Weinberg - Cancer Stem Cells and the Evolution of Malignancy
 18  HL40: Martin Vingron - Histone modification levels are predictive for gene expression
 18  HL35: Liran Carmel - A universal relationship between gene compactness and expression level in multicellular eukaryotes
So the keynotes and the PLoS session on writing a paper were popular. We can look at the “likes” too:
> head(ismb.df[sort.list(ismb.df$likes, decreasing = T), ], n = 10L)

12  PLoS Session on How to Write a Good Paper
 7  HL21: Rachel Kolodny - FragBag: representing protein structures as 'bags-of-fragments' allows efficient exploration of protein structure space
 7  Keynote: Steven Brenner - Ultraconserved nonsense: gene regulation by splicing & RNA surveillance
 6  LBR11: Mark Wass - Towards the prediction of protein interaction partners using physical docking
 6  Keynote: Chris Sander - Systems Biology of Cancer Cells
 6  HL25: Benjamin Jefferys - Protein Folding Requires Crowd Control in a Simulated Cell
 5  Keynote: George Church - BI/O: Reading and Writing Genomes
 5  Keynote: David Altshuler - Genomic Variation and the Inherited Basis of Common Disease
 5  Keynote: Susan Lindquist - Protein Folding and Environmental Stress REDRAW the Relationship between Genotype and Phenotype
 4  Special Public Lecture: Dr. Robert Weinberg - Cancer Stem Cells and the Evolution of Malignancy
A slightly different picture, with some non-keynotes creeping into the list. This confirms my FriendFeed experience: whilst you might assume that “liking” takes less effort, people will in fact comment at length if they really care about the topic.
How many posts generated no discussion?
# no comments
> nrow(subset(ismb.df, comments == 0))
[1] 122

# no likes
> nrow(subset(ismb.df, likes == 0))
[1] 124
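As a fraction of the 178 entries, the no-comment count works out like this:

# proportion of entries with no comments (122 of 178)
> nrow(subset(ismb.df, comments == 0)) / nrow(ismb.df)
[1] 0.6853933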
Quite a high proportion – around 69%. It did seem as though there was less online activity this year – perhaps attendees could explain why? I have heard rumours that the wireless connectivity was not optimal.
3. Who contributed?
Let’s name some names!
There is, surely, an apply-type function that would grab the commenter names from the list ismb.data and count up the comments in one go. In its absence, here once again is my ugly solution, which does at least employ plyr:
library(plyr)

# loop over ismb.data and append the names of commenters to a list
commenters <- list()
for(i in seq(1:length(ismb.data))) {
  commenters <- append(commenters, llply(ismb.data[[i]]$comments, function(x) x$from$name))
}

# convert to a data frame, count comments using table() and convert again to a data frame
ismb.commenters <- as.data.frame(table(ldply(commenters)))

# How many people left comments?
> nrow(ismb.commenters)
[1] 30

# The top 10 contributors
> head(ismb.commenters[sort.list(ismb.commenters$Freq, decreasing = T), ], n = 10L)
                   Var1 Freq
3                    bb  249
19        Roland Krause  147
2                  arne   89
15       Mickey Kosloff   79
22     Shannon McWeeney   78
7             Dawei lin   70
27 Venkata P. Satagopam   61
25          Ted Laderas   19
11          John Greene    7
5         Burkhard Rost    4
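For comparison, the same counts can be had from base R alone; a sketch that assumes, as above, that each comment carries a from$name element:

# a sketch using base R only: pull out every commenter name,
# then count with table() and sort for the top 10
commenter.names <- unlist(lapply(ismb.data,
                                 function(x) sapply(x$comments, function(y) y$from$name)))
sort(table(commenter.names), decreasing = TRUE)[1:10]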
4. Activity over time
Finally, let’s bring in ggplot2 to display daily comment activity over the course of the conference. This is pretty rough and ready; I’m sure that you can do better.
library(chron)
library(ggplot2)

# loop through the comments as before but this time, collect the comment dates in a list
ismb.datetime <- list()
for(i in seq(1:length(ismb.data))) {
  ismb.datetime <- append(ismb.datetime, llply(ismb.data[[i]]$comments, function(x) x$date))
}

# convert to a data frame
ismb.datetime <- ldply(ismb.datetime)
colnames(ismb.datetime) <- "datetime"

# add the month, day and hour of each comment
ismb.datetime$month <- months(strptime(ismb.datetime$datetime, "%Y-%m-%dT%H:%M:%SZ"))
ismb.datetime$day   <- days(strptime(ismb.datetime$datetime, "%Y-%m-%dT%H:%M:%SZ"))
ismb.datetime$hour  <- hours(strptime(ismb.datetime$datetime, "%Y-%m-%dT%H:%M:%SZ"))

# month is always July, so ignore it and count comments by hour for each day
ismb.datetime <- as.data.frame(table(ismb.datetime$day, ismb.datetime$hour))
colnames(ismb.datetime) <- c("day", "hour", "count")

# why not keep only the conference dates too (July 9-13)
ismb.datetime <- ismb.datetime[as.numeric(ismb.datetime$day) > 8 & as.numeric(ismb.datetime$day) < 14, ]

# finally, plot comments per hour, faceted by day
print(ggplot(ismb.datetime, aes(hour, count)) + geom_bar(stat = "identity") + facet_grid(day ~ .))
I did warn you to expect ugliness. I reckon if you knew your POSIX date/time functions and your ggplot2, you could leap from the list of comment dates to a final plot in a single bound.
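For the record, here is a rough sketch of what that single bound might look like, using base POSIX date handling and letting ggplot2 do the counting; the variable names are mine and the hourly binning simply mirrors the chron/table() approach above:

# a sketch: parse all comment dates as UTC POSIXct, extract day and hour
# with format(), and let ggplot2 count comments per hour
library(ggplot2)

comment.times <- as.POSIXct(unlist(lapply(ismb.data,
                                          function(x) sapply(x$comments, function(y) y$date))),
                            format = "%Y-%m-%dT%H:%M:%SZ", tz = "GMT")
plot.df <- data.frame(day  = as.numeric(format(comment.times, "%d")),
                      hour = as.numeric(format(comment.times, "%H")))
plot.df <- subset(plot.df, day >= 9 & day <= 13)
print(ggplot(plot.df, aes(hour)) + geom_bar() + facet_grid(day ~ .))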
Still, we have a result. It looks as though things were pretty quiet on the SIG/Tutorial days (July 9-10), followed by several bursts of comments on July 11-13. The highest bars presumably correspond to the morning and afternoon keynotes and other popular talks. Note that FriendFeed stores time as UTC; for extra credit, you could convert to Boston time before plotting.
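One way to do that conversion: since comment.times in the sketch above was parsed as UTC, formatting it with a tz argument shifts the display to Boston local time (assuming the America/New_York timezone name is available on your system):

# hypothetical follow-on to the sketch above: comment.times holds UTC
# timestamps, so format() with tz = "America/New_York" gives Boston hours
boston.hour <- as.numeric(format(comment.times, "%H", tz = "America/New_York"))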
In 2011, ISMB returns to Vienna. Let’s hope for more conference microblogging in the years ahead.