Analysing the ISMB 2010 meeting using R
The colossus of bioinformatics meetings, ISMB, convened in Boston this year from July 9 – 13. As in recent years, the meeting was covered online at its website, FriendFeed and Twitter.
I thought it would be fun to run a quick analysis of activity at the FriendFeed room using R.
1. Fetch the data
We can use the FriendFeed API to fetch data in JSON format. R provides two useful packages: RCurl, for making the HTTP request and rjson (or RJSONIO), to parse the results into a list. Since we don’t know in advance how many entries to expect, we set some arbitrarily large maximum number of entries, loop towards it and break when no more entries are returned.
library(RCurl)
library(rjson)

ismb.url  <- "http://friendfeed-api.com/v2/feed/ismb2010"
ismb.data <- list()

# fetch entries 100 at a time; stop when the API returns no more
for(i in seq(0, 900, by = 100)) {
  ismb.json <- fromJSON(getURL(paste(ismb.url, "?start=", i, "&num=100", sep = "")))
  if(length(ismb.json$entries) == 0)
    break
  else
    ismb.data <- append(ismb.data, ismb.json$entries)
}
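If you prefer RJSONIO, its fromJSON() works here too; a minimal sketch of the same loop, with simplify = FALSE to keep the nested list structure that the rest of this post assumes:

# a sketch using RJSONIO instead of rjson; simplify = FALSE keeps the
# nested list structure that the code below expects
library(RCurl)
library(RJSONIO)

ismb.url  <- "http://friendfeed-api.com/v2/feed/ismb2010"
ismb.data <- list()
for(i in seq(0, 900, by = 100)) {
  ismb.json <- fromJSON(getURL(paste(ismb.url, "?start=", i, "&num=100", sep = "")),
                        simplify = FALSE)
  if(length(ismb.json$entries) == 0) break
  ismb.data <- append(ismb.data, ismb.json$entries)
}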
The list ismb.data currently contains 178 entries. Each entry is itself a list of items that describe the entry. You can get an idea of its structure using summary():
> length(ismb.data)
[1] 178

> summary(ismb.data[[1]])
         Length Class  Mode
body     1      -none- character
from     3      -none- list
url      1      -none- character
comments 2      -none- list
to       1      -none- list
likes    1      -none- list
date     1      -none- character
id       1      -none- character
2. Entries, comments and likes
We’d like to see the title, date, number of comments and number of likes for each entry. One way to do that is to convert ismb.data to a data frame. There is surely an elegant way to achieve this using, for example, the plyr package, but here’s an ugly way using sapply():
ismb.df <- data.frame(body     = sapply(ismb.data, function(x) x$body),
                      date     = sapply(ismb.data, function(x) x$date),
                      comments = sapply(ismb.data, function(x) length(x$comments)),
                      likes    = sapply(ismb.data, function(x) length(x$likes)))
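As it happens, plyr can get there in one step; here is a sketch using ldply(), assuming every entry carries body, date, comments and likes elements:

library(plyr)

# a sketch: build the same data frame with ldply(), one row per entry
# (assumes each entry has body, date, comments and likes elements)
ismb.df <- ldply(ismb.data, function(x) {
  data.frame(body     = x$body,
             date     = x$date,
             comments = length(x$comments),
             likes    = length(x$likes),
             stringsAsFactors = FALSE)
})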
Now that we have a data frame, it’s easy to sort. Let’s look at the 10 entries that generated the most discussion. I’ve edited the output here, to highlight just the relevant parts with counts in the first column:
> head(ismb.df[sort.list(ismb.df$comments, decreasing = T), ], n = 10L)

121  PLoS Session on How to Write a Good Paper
100  Keynote: David Altshuler - Genomic Variation and the Inherited Basis of Common Disease
 68  Keynote: Chris Sander - Systems Biology of Cancer Cells
 66  Keynote: Svante Pääbo - Analyses of Pleistocene Genomes
 65  Keynote: George Church - BI/O: Reading and Writing Genomes
 61  Keynote: Steven Brenner - Ultraconserved nonsense: gene regulation by splicing & RNA surveillance
 37  Keynote: Susan Lindquist - Protein Folding and Environmental Stress REDRAW the Relationship between Genotype and Phenotype
 29  Special Public Lecture: Dr. Robert Weinberg - Cancer Stem Cells and the Evolution of Malignancy
 18  HL40: Martin Vingron - Histone modification levels are predictive for gene expression
 18  HL35: Liran Carmel - A universal relationship between gene compactness and expression level in multicellular eukaryotes
So the keynotes and the PLoS session on writing a paper were popular. We can look at the “likes” too:
> head(ismb.df[sort.list(ismb.df$likes, decreasing = T), ], n = 10L)

12  PLoS Session on How to Write a Good Paper
 7  HL21: Rachel Kolodny - FragBag: representing protein structures as 'bags-of-fragments' allows efficient exploration of protein structure space
 7  Keynote: Steven Brenner - Ultraconserved nonsense: gene regulation by splicing & RNA surveillance
 6  LBR11: Mark Wass - Towards the prediction of protein interaction partners using physical docking
 6  Keynote: Chris Sander - Systems Biology of Cancer Cells
 6  HL25: Benjamin Jefferys - Protein Folding Requires Crowd Control in a Simulated Cell
 5  Keynote: George Church - BI/O: Reading and Writing Genomes
 5  Keynote: David Altshuler - Genomic Variation and the Inherited Basis of Common Disease
 5  Keynote: Susan Lindquist - Protein Folding and Environmental Stress REDRAW the Relationship between Genotype and Phenotype
 4  Special Public Lecture: Dr. Robert Weinberg - Cancer Stem Cells and the Evolution of Malignancy
A slightly different picture, with some non-keynotes creeping into the list. This confirms my FriendFeed experience: whilst you might assume that “liking” takes less effort, people will in fact comment at length if they really care about the topic.
How many posts generated no discussion?
# no comments
> nrow(subset(ismb.df, comments == 0))
[1] 122

# no likes
> nrow(subset(ismb.df, likes == 0))
[1] 124
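As a fraction of the 178 entries, the no-comment count works out like this:

# proportion of entries with no comments (122 of 178)
> nrow(subset(ismb.df, comments == 0)) / nrow(ismb.df)
[1] 0.6853933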
Quite a high proportion – around 69%. It did seem as though there was less online activity this year – perhaps attendees could explain why? I have heard rumours that the wireless connectivity was not optimal.
3. Who contributed?
Let’s name some names!
There is, surely, an apply-type function that would grab the commenter names from the list ismb.data and count up the comments in one go. In its absence, here once again is my ugly solution, which does at least employ plyr:
library(plyr)

# loop over ismb.data and append the names of commenters to a list
commenters <- list()
for(i in seq(1:length(ismb.data))) {
  commenters <- append(commenters, llply(ismb.data[[i]]$comments, function(x) x$from$name))
}

# convert to a data frame, count comments using table() and convert again to a data frame
ismb.commenters <- as.data.frame(table(ldply(commenters)))

# How many people left comments?
> nrow(ismb.commenters)
[1] 30

# The top 10 contributors
> head(ismb.commenters[sort.list(ismb.commenters$Freq, decreasing = T), ], n = 10L)
                   Var1 Freq
3                    bb  249
19        Roland Krause  147
2                  arne   89
15       Mickey Kosloff   79
22     Shannon McWeeney   78
7             Dawei lin   70
27 Venkata P. Satagopam   61
25          Ted Laderas   19
11          John Greene    7
5         Burkhard Rost    4
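For comparison, the same counts can be had from base R alone; a sketch that assumes, as above, that each comment carries a from$name element:

# a sketch using base R only: pull out every commenter name,
# then count with table() and sort for the top 10
commenter.names <- unlist(lapply(ismb.data,
                                 function(x) sapply(x$comments, function(y) y$from$name)))
sort(table(commenter.names), decreasing = TRUE)[1:10]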
4. Activity over time
Finally, let’s bring in ggplot2 to display daily comment activity over the course of the conference. This is pretty rough and ready; I’m sure that you can do better.
library(chron)
library(ggplot2)

# loop through the comments as before but this time, collect the comment dates in a list
ismb.datetime <- list()
for(i in seq(1:length(ismb.data))) {
  ismb.datetime <- append(ismb.datetime, llply(ismb.data[[i]]$comments, function(x) x$date))
}

# convert to a data frame
ismb.datetime <- ldply(ismb.datetime)
colnames(ismb.datetime) <- "datetime"

# add the month, day and hour of each comment
ismb.datetime$month <- months(strptime(ismb.datetime$datetime, "%Y-%m-%dT%H:%M:%SZ"))
ismb.datetime$day   <- days(strptime(ismb.datetime$datetime, "%Y-%m-%dT%H:%M:%SZ"))
ismb.datetime$hour  <- hours(strptime(ismb.datetime$datetime, "%Y-%m-%dT%H:%M:%SZ"))

# month is always July, so ignore it and count comments by hour for each day
ismb.datetime <- as.data.frame(table(ismb.datetime$day, ismb.datetime$hour))
colnames(ismb.datetime) <- c("day", "hour", "count")

# why not keep only the conference dates too (July 9-13)
ismb.datetime <- ismb.datetime[as.numeric(ismb.datetime$day) > 8 & as.numeric(ismb.datetime$day) < 14, ]

# finally, plot comments per hour, faceted by day
print(ggplot(ismb.datetime, aes(hour, count)) + geom_bar(stat = "identity") + facet_grid(day ~ .))
I did warn you to expect ugliness. I reckon if you knew your POSIX date/time functions and your ggplot2, you could leap from the list of comment dates to a final plot in a single bound.
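For the record, here is a rough sketch of what that single bound might look like, using base POSIX date handling and letting ggplot2 do the counting; the variable names are mine and the hourly binning simply mirrors the chron/table() approach above:

# a sketch: parse all comment dates as UTC POSIXct, extract day and hour
# with format(), and let ggplot2 count comments per hour
library(ggplot2)

comment.times <- as.POSIXct(unlist(lapply(ismb.data,
                                          function(x) sapply(x$comments, function(y) y$date))),
                            format = "%Y-%m-%dT%H:%M:%SZ", tz = "GMT")
plot.df <- data.frame(day  = as.numeric(format(comment.times, "%d")),
                      hour = as.numeric(format(comment.times, "%H")))
plot.df <- subset(plot.df, day >= 9 & day <= 13)
print(ggplot(plot.df, aes(hour)) + geom_bar() + facet_grid(day ~ .))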
Still, we have a result. It looks as though things were pretty quiet on the SIG/Tutorial days (July 9-10), followed by several bursts of comments on July 11-13. The highest bars presumably correspond to the morning and afternoon keynotes and other popular talks. Note that FriendFeed stores time as UTC; for extra credit, you could convert to Boston time before plotting.
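One way to do that conversion: since comment.times in the sketch above was parsed as UTC, formatting it with a tz argument shifts the display to Boston local time (assuming the America/New_York timezone name is available on your system):

# hypothetical follow-on to the sketch above: comment.times holds UTC
# timestamps, so format() with tz = "America/New_York" gives Boston hours
boston.hour <- as.numeric(format(comment.times, "%H", tz = "America/New_York"))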
In 2011, ISMB returns to Vienna. Let’s hope for more conference microblogging in the years ahead.