Bioinformatics journals: time from submission to acceptance, revisited

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

Before we start: yes, we’ve been here before. There was the Biostars question “Calculating Time From Submission To Publication / Degree Of Burden In Submitting A Paper.” That gave rise to Pierre’s excellent blog post and code + data on Figshare.

So why are we here again? 1. It’s been a couple of years. 2. This is the R (+ Ruby) version. 3. It’s always worth highlighting how the poor state of publicly-available data prevents us from doing what we’d like to do. In this case the interesting question “which bioinformatics journal should I submit to for rapid publication?” becomes “here’s an incomplete analysis using questionable data regarding publication dates.”

Let’s get it out of the way then.

1. Find a list of bioinformatics journals

Here’s one, at the Bioinformatics.org wiki. It includes a metric named Article Influence that we can use to sort the journals. Let’s be completely arbitrary and take the top 20.

getJournalTitles <- function() {
  require(XML)
  journals <- readHTMLTable("http://www.bioinformatics.org/wiki/Journals", stringsAsFactors = FALSE)
  journals <- journals[[2]]
  journals[, 2] <- as.numeric(journals[, 2])
  journals <- journals[order(journals[, 2], decreasing = TRUE), ]
  titles <- head(journals[, 1], 20)
  return(titles)
}

titles <- getJournalTitles()

2. Download PubMed records

Next, we search PubMed for those journal titles and download records in PubMed XML format. During this process I learned that (1) the ampersand in Molecular & Cellular Proteomics should be replaced by “and”, (2) Proteins: Structure, Function, and Bioinformatics should be renamed “Proteins” and (3) IEEE Transactions on Evolutionary Computation is apparently not indexed by PubMed.

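getJournalXML <- function(title) {
  require(rentrez)
  require(XML)  # provides xmlTreeParse(), xmlRoot() and saveXML()
  # search PubMed by journal title, keeping the results on the NCBI history server
  # (newer rentrez releases may expect use_history = TRUE and web_history = e$web_history)
  term <- paste(title, "[JOUR]", sep = "")
  e <- entrez_search("pubmed", term, usehistory = "y")
  # fetch every matching record in PubMed XML format
  f <- entrez_fetch("pubmed", WebEnv = e$WebEnv, query_key = e$QueryKey, 
                    rettype = "xml", retmax = e$count)
  d <- xmlTreeParse(f, useInternalNodes = TRUE)
  # one file per journal, e.g. Molecular_and_Cellular_Proteomics.xml
  outfile <- paste(gsub(" ", "_", title), "xml", sep = ".")
  saveXML(xmlRoot(d), outfile)
}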

titles[6] <- gsub("&", "and", titles[6])
titles[11] <- "Proteins"

# saves XML files in current working directory
sapply(titles, function(x) getJournalXML(x))
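
The title fixes above depend on positions 6 and 11 in the sorted list, which will shift if the wiki table changes. As a rough sketch only, here is the same idea keyed by title instead of position, written in Ruby to match the parsing script later in the post. The names TITLE_OVERRIDES and pubmed_journal_term are made up for this example, and the exact spelling of the wiki title is an assumption.

```ruby
# Hypothetical lookup: title as listed on the wiki => title that PubMed recognises.
# The wiki spelling used as the key here is an assumption.
TITLE_OVERRIDES = {
  "Proteins: Structure, Function, and Bioinformatics" => "Proteins"
}.freeze

# Build a PubMed journal search term such as "Proteins[JOUR]".
# Ampersands are replaced by "and", as described in the text.
def pubmed_journal_term(title)
  name = TITLE_OVERRIDES.fetch(title, title).gsub("&", "and")
  "#{name}[JOUR]"
end

puts pubmed_journal_term("Molecular & Cellular Proteomics")
```

Titles that need no special treatment pass through unchanged, so the function can be applied to the whole list.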

3. Parse for publication dates

Yes, the time from submission to publication includes time for revision(s). However, submission to initial decision times are not readily available (certainly not from PubMed), and acceptance to publication times mean nothing in the age of “ahead of print”, so what we use is the time from received to accepted, as recorded in the PubMed history.

I haven’t figured out how to make the R/XML xpathSApply() function return empty values where nodes don’t exist, so I went for Ruby/Nokogiri which does that by default. Cue extraordinarily-ugly code:

#!/usr/bin/ruby

require 'nokogiri'

f = File.open(ARGV.first)
doc = Nokogiri::XML(f)
f.close

doc.xpath("//PubmedArticle").each do |a|
  r = ["", "", "", "", "", "", "", ""]
  r[0] = a.xpath("MedlineCitation/Article/Journal/ISOAbbreviation").text
  r[1] = a.xpath("MedlineCitation/PMID").text
  r[2] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Year").text
  r[3] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Month").text
  r[4] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Day").text
  r[5] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Year").text
  r[6] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Month").text
  r[7] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Day").text
  puts r.join(",")
end

If you save that as pubmedXML2CSV.rb, you can run it on all the XML files in the current directory using:

find . -name "*.xml" -exec ruby pubmedXML2CSV.rb {} \; > bioinfjournals.csv

4. Analyse data

Figure: Time from received to accepted for selected journals according to PubMed

It’s reasonably plain sailing from here onwards. We read the CSV file into R. Not all records have received or accepted dates and for those that do, it’s messy. Months, for example, are variously represented as January, Jan or 1. It would be nice to have a function that could make a Date object from anything resembling “year-month-day” and happily, the R/lubridate package provides ymd() to do just that.
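
To make the month problem concrete, here is a minimal Ruby sketch of what a forgiving year-month-day parser has to do. It is not how ymd() works internally; the helper name parse_pubmed_date and the MONTH_LOOKUP table are made up for this example, and it assumes English month names only.

```ruby
require 'date'

# Map full and abbreviated English month names to month numbers,
# e.g. "january" => 1 and "jan" => 1. Index 0 of both arrays is nil.
MONTH_LOOKUP = {}
Date::MONTHNAMES.each_with_index { |name, i| MONTH_LOOKUP[name.downcase] = i if name }
Date::ABBR_MONTHNAMES.each_with_index { |name, i| MONTH_LOOKUP[name.downcase] = i if name }

# Hypothetical helper: build a Date from PubMed-style year, month and day strings.
# Returns nil if any part is missing, non-numeric where it should be numeric,
# or if the combination is not a valid calendar date.
def parse_pubmed_date(year, month, day)
  parts = [year, month, day].map { |s| s.to_s.strip }
  return nil if parts.any?(&:empty?)
  y, m, d = parts
  return nil unless y.match?(/\A\d+\z/) && d.match?(/\A\d+\z/)
  month_num = m.match?(/\A\d+\z/) ? m.to_i : MONTH_LOOKUP[m.downcase]
  return nil if month_num.nil?
  return nil unless Date.valid_date?(y.to_i, month_num, d.to_i)
  Date.new(y.to_i, month_num, d.to_i)
end

puts parse_pubmed_date("2013", "Jan", "5")
```

Records with a missing or impossible date come back as nil, which plays the same role as the NA values that the R code has to cope with.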

You’ll also find articles submitted from the future (the year 2919, for example), so best to remove records where submission apparently happened after acceptance. Perhaps the most exciting thing about the following code is that I’m not alone in wanting to sort boxplots by median and found a solution to do so. Click the image, right, for the larger version.

plotJournalTimes <- function(csvfile) {
  require(lubridate)
  require(ggplot2)
  journals <- read.csv(csvfile, header = FALSE, stringsAsFactors = FALSE)
  colnames(journals) <- c("title", "pmid", "rec.year", "rec.month", "rec.day", "acc.year", "acc.month", "acc.day")
  # ymd() copes with months written as January, Jan or 1; unparseable dates become NA
  journals$received  <- ymd(paste(journals$rec.year, journals$rec.month, journals$rec.day, sep = "-"))
  journals$accepted  <- ymd(paste(journals$acc.year, journals$acc.month, journals$acc.day, sep = "-"))
  # ask for days explicitly so the result does not depend on what class ymd() returns
  journals$diff      <- as.numeric(difftime(journals$accepted, journals$received, units = "days"))
  # keep records accepted after they were received; order the boxplots by median
  ggplot(subset(journals, diff > 0), aes(reorder(title, diff, median), diff)) + 
    geom_boxplot(fill = "wheat2") + theme_bw() + coord_flip() + 
    ylab("accepted - received (days)") + xlab("journal")
}

plotJournalTimes("bioinfjournals.csv")
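
The filtering and ordering that the plot relies on can also be sketched without ggplot2. The Ruby below is an illustration only: the function names median and median_days_by_journal are made up for this example, and it assumes the dates have already been parsed into Date objects (or nil where parsing failed).

```ruby
require 'date'

# Median of a non-empty array of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# records: array of [journal, received, accepted], where the dates are
# Date objects or nil. Returns [[journal, median_days], ...] sorted by
# ascending median, after dropping records with a missing date or with
# acceptance on or before submission (the diff > 0 filter in the R code).
def median_days_by_journal(records)
  days = Hash.new { |hash, key| hash[key] = [] }
  records.each do |journal, received, accepted|
    next if received.nil? || accepted.nil?
    diff = (accepted - received).to_i
    next unless diff > 0
    days[journal] << diff
  end
  days.map { |journal, d| [journal, median(d)] }.sort_by { |_, m| m }
end

sample = [
  ["Journal A", Date.new(2013, 1, 1), Date.new(2013, 1, 11)],
  ["Journal A", Date.new(2919, 1, 1), Date.new(2013, 3, 1)],  # submitted from the future
  ["Journal B", Date.new(2013, 1, 1), Date.new(2013, 1, 6)]
]
median_days_by_journal(sample).each { |journal, m| puts "#{journal}: #{m}" }
```

A journal whose records all fail the filter simply drops out of the result, which is roughly what happens to the journals with no usable dates in the plot.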

So there you have it. No data for one of the top 20 journals. No accepted and/or received date for 9 of the others. Of the 10 remaining, only about 48% of the 64 759 records include dates that can be parsed. Of those, at least one and probably more are rather dubious. Very short times are as likely to be outliers (erroneous) as very long times.

If you still care by this point: Mammalian Genome is the winner with a median time to acceptance of 80 days, going up to 175.5 days for Journal of Computational Neuroscience. 11 weeks still seems like a long time to me, even if you believe the numbers. Which you probably should not.


Filed under: bioinformatics, programming, R, ruby, statistics Tagged: journals, publishing, pubmed
