
Screenager: screening times at bioRxiv

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers.]

When a preprint is uploaded to bioRxiv, it undergoes screening before it appears online. How long does it take for Affiliates to screen preprints at bioRxiv?

tl;dr I used R to look at bioRxiv screening times. Even though bioRxiv has expanded massively, screening happens quickly (in about 24 h).

I am a bioRxiv Affiliate – one of the people who do the screening. Preprints wait in a queue to be screened. Over the years, I’ve seen the typical queue get longer and longer. In the early days the queue was maybe 10 preprints. These days it’s often over 100.

It’s a team effort and more Affiliates have been recruited over time. Yet I often wonder how we’re doing. My impression is that there are always lots of Neuroscience and Bioinformatics papers in the queue. Do any subject areas get neglected? If so, should Affiliates in these areas be recruited specifically?

To look at these questions I used rbiorxiv, a wonderful R client for the bioRxiv API written by Nicholas Fraser.

To set up:

devtools::install_github("nicholasmfraser/rbiorxiv")
# load packages
library(rbiorxiv)
library(tidyverse)
library(gridExtra)

# make directory for output if it doesn't exist
if (!dir.exists("./output")) dir.create("./output")
if (!dir.exists("./output/plots")) dir.create("./output/plots")

Use the R client to get a data frame of preprints posted in the first three months of 2020 (1 January to 29 March).

data <- biorxiv_content(from = "2020-01-01", to = "2020-03-29", limit = "*", format = "df")

We only want to look at new preprints (Version 1) and not revisions, so let’s filter for that. Then, we’ll take advantage of bioRxiv’s new-style DOIs, which embed a date after the 10.1101/ prefix (characters 9 to 18 of the DOI), to find the “submission date”.

data <- filter(data, version == 1)
data$doi_date <- substr(data$doi, 9, 18)
data$doi_date <- gsub("\\.", "-", data$doi_date)
data$days <- as.Date(data$date) - as.Date(data$doi_date)
data$category <- as.factor(data$category)

We now have a column called ‘days’ that shows the time in days from “submission” to “publication”. We will use this as a measure of screening time. Note: this is imperfect because the submission date is when an author begins uploading their preprint (they could take several days to do this) and not when it actually gets submitted to bioRxiv.
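
The DOI-to-date step can be illustrated on a single value. This is a minimal sketch, assuming the new-style DOI layout `10.1101/YYYY.MM.DD.NNNNNN` that the `substr(data$doi, 9, 18)` call relies on. The helper name `doi_to_date` and the example DOIs are made up for illustration. It uses a regular expression rather than fixed character positions, so an old-style DOI without an embedded date comes back as `NA` instead of a garbled string.

```r
# Hypothetical helper: pull the date embedded in a new-style bioRxiv DOI.
# Assumes the layout 10.1101/YYYY.MM.DD.NNNNNN; anything else yields NA.
doi_to_date <- function(doi) {
  pattern <- "^10\\.1101/(\\d{4})\\.(\\d{2})\\.(\\d{2})\\..*$"
  iso <- ifelse(grepl(pattern, doi, perl = TRUE),
                sub(pattern, "\\1-\\2-\\3", doi, perl = TRUE),
                NA_character_)
  as.Date(iso)
}

# Made-up DOIs and publication dates, not real preprints
dois <- c("10.1101/2020.03.01.123456", "10.1101/123456")
published <- as.Date(c("2020-03-02", "2019-06-01"))

doi_dates <- doi_to_date(dois)
lag_days <- as.numeric(published - doi_dates)
print(data.frame(doi = dois, doi_date = doi_dates, lag_days = lag_days))
```

Rows with an `NA` lag would need to be dropped (or handled separately) before averaging.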

Let’s look at the screening time per subject area.

p1 <- ggplot(data, aes(x = as.numeric(days))) +
  geom_histogram(binwidth = 1) +
  xlim(NA, 6) +
  facet_grid(category ~ ., scales = "free") +
  labs(x = "Days",
       y= "Preprints") +
  theme(strip.text.y = element_text(angle = 0))
ggsave("./output/plots/screenLag.png", p1, height = 15, width = 6, dpi = 300)

Figure: Histogram of screening times per subject

I was surprised to see that, with the exception of “Scientific Communication and Education”, the screening times were pretty constant across categories.

The subject areas on bioRxiv are not equal in size. Look at the numbers on the axes for Zoology and for Neuroscience to get a feel for the difference. The histogram view conceals these differences.

Next, we can calculate the average screening time and see if the busiest categories suffer delayed screening.

df1 <- aggregate(as.numeric(data$days), list(data$category), mean)
colnames(df1) <- c("category","mean_days")
# dplyr's count() takes the data frame, then the column to tally
df2 <- count(data, category)
colnames(df2) <- c("category","count")
summary_df <- merge(df1,df2)
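
The same per-category summary can also be written as a single dplyr pipeline. This is a sketch on a toy data frame: the values are invented and the column names simply mirror the ones used here. It also reports the median, which is less sensitive than the mean to a few preprints with long lags.

```r
library(dplyr)

# Toy stand-in for `data`: invented values, same column names as above
toy <- data.frame(
  category = factor(c("neuroscience", "neuroscience", "neuroscience",
                      "zoology", "zoology")),
  days = c(0, 1, 5, 1, 1)
)

# One row per category: mean and median lag, plus number of preprints
toy_summary <- toy %>%
  group_by(category) %>%
  summarise(mean_days = mean(days),
            median_days = median(days),
            count = n(),
            .groups = "drop")

print(toy_summary)
```

With the real data, `days` is a time difference, so it would need wrapping in `as.numeric()` first.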

And then make some bar charts to look at the data.

p3 <- ggplot(summary_df, aes(x = category, y = mean_days)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = rev(levels(summary_df$category))) +
  labs(x = "", y = "Mean (days)") +
  coord_flip()
p4 <- ggplot(summary_df, aes(x = category, y = count)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = rev(levels(summary_df$category))) +
  labs(x = "", y = "Preprints") +
  coord_flip()
p5 <- grid.arrange(p3,p4, nrow = 1, ncol = 2)
ggsave("./output/plots/summary.png", p5, height = 8, width = 8, dpi = 300)

The average screening time is about 1 day or less in nearly every category. Neuroscience, microbiology and bioinformatics (the biggest categories) have similar screening delays to less busy categories. So, assuming that Affiliates screen on the basis of expertise, either the pool is enriched for these popular areas or those Affiliates are busier!
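
Whether busier categories wait longer can also be checked directly, with a rank correlation between preprint count and mean screening time. This is a minimal sketch using base R's `cor()` on invented numbers laid out like `summary_df`; it is not a result from the bioRxiv data.

```r
# Invented numbers with the same columns as summary_df
toy_summary <- data.frame(
  category = c("a", "b", "c", "d", "e"),
  mean_days = c(0.9, 0.6, 1.0, 0.7, 2.1),
  count = c(1500, 600, 400, 900, 30)
)

# Spearman's rho: positive if busier categories tend to wait longer,
# near zero if category size and lag are unrelated
rho <- cor(toy_summary$count, toy_summary$mean_days, method = "spearman")
print(rho)
```

With only a few dozen categories, this would be a rough sanity check rather than a formal test.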

The longest lag is for “Scientific Communication and Education”, which is a very small category. Assuming the authors take a similar time to upload these manuscripts, I guess the Affiliates tend to screen these preprints as a lower priority. These papers do tend to be a bit different from other research papers and have separate screening criteria. Anyway, they still get screened in just over 2 days, which is impressive.

I was pleased to see “Cell Biology” had the shortest screening time (around half a day)!

Conclusion

Even though my impression was that Bioinformatics and Neuroscience papers linger in the queue, this is not actually the case. There are likely more of them in the queue because there are more of them, period.

The bioRxiv team have done a great job in maintaining a pool of Affiliates that can screen the huge number of preprints that are uploaded.

The post title comes from “Screenager” by Muse from their Origin of Symmetry album.
