Pre Self: what fraction of a journal’s papers are preprinted?
Answering the question of what fraction of a journal’s papers were previously available as a preprint is quite difficult. The tricky part is matching preprints (from a number of different servers) with the published output from a journal. The easy matches are those that are directly linked together; the remainder can be hard to identify since the manuscript may change (authors, title, abstract) between the preprint and the published version.
A strategy by Crossref called Marple, which aims to match preprints to published outputs, seems like the best effort so far. Their code and data up to Aug 2023 are available. Let’s use this to answer the question!
My code is below, let’s look at the results first.
The papers that have a preprint version are red, and those without are in grey. The bars are stacked in these plots and the scale is free so that the journals with different volumes of papers can be compared. The plots show only research papers. Reviews and all other outputs have been excluded as far as possible.
We can replot this to show the fraction of papers that have an associated preprint:
We can see that eLife is on course for 100% of its papers having a preprint version. This is due to a policy decision taken a few years ago.
Then there is a tranche of journals that seem to be stabilising at 25-50% of outputs having a preprinted version. These journals include: Cell Rep, Dev Cell, Development, EMBO J, J Cell Biol, J Cell Sci, MBoC, Nat Cell Biol, Nat Commun, and PLOS Biol.
Finally, journals with a very small fraction of preprinted papers include Cells, FASEB J, Front Cell Dev Biol, JBC.
My focus here was on journals in the cell and developmental biology area. I suspect that the differences in rates between journals reflect the content they carry. Cell and developmental biology, like genetics and biophysics, have an established pattern of preprinting. A journal like JCB, carrying 100% cell biology papers, tops out at 50% in 2022, whereas EMBO J, which has a lower fraction of cell biology papers, plateaus at ~30%. However, the discipline doesn’t really explain why Cells and Front Cell Dev Biol have such low preprint rates. I know that there are geographical differences in preprinting, so differences in the regional base of authors at a journal may affect its overall preprint rate. There are likely other contributing factors.
Caveats and things to note:
- the data only goes up to Aug 2023, so the final bar is unreliable.
- the assignment is not perfect – there will be some papers here that have a preprint version but are not linked up, and some erroneous linkages. I did a sense check of the data for one journal and could see a couple of duplicates in the Crossref data out of ~600 for that journal. So the error rate seems very low.
- the PubMed data is good but again, it is hard to exclude some outputs that are not research papers if they are not tagged appropriately.
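The sense check for duplicates mentioned above can be sketched roughly as follows. This is a toy example, not the check I actually ran: the data frame is invented, `article_doi` is the column used in this post, and `preprint_doi` is an assumed column name for illustration only.

```r
# Toy stand-in for the Crossref relationships table (values are invented).
# `article_doi` is the column used in this post; `preprint_doi` is an assumed name.
rel <- data.frame(
  preprint_doi = c("10.1101/001", "10.1101/002", "10.1101/003", "10.1101/004"),
  article_doi  = c("10.1083/jcb.1", "10.1083/JCB.1", "10.1083/jcb.2", "10.1083/jcb.3")
)

# DOIs are case-insensitive, so normalise before comparing
rel$article_doi_lc <- tolower(rel$article_doi)

# articles that appear in more than one row of the table
dup_articles <- unique(rel$article_doi_lc[duplicated(rel$article_doi_lc)])

n_dup <- length(dup_articles)
n_articles <- length(unique(rel$article_doi_lc))
dup_rate <- n_dup / n_articles

print(dup_articles)
print(dup_rate)
```

A duplicate here is not necessarily an error (a paper may have more than one preprint record), and because the matching uses `%in%`, a duplicated article DOI does not change whether a paper is counted as "yes" or "no".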
The code
```r
devtools::install_github("ropensci/rentrez")
library(rentrez)
library(XML)
# pre-existing script that parses PubMed XML files
source("Script/pubmedXML.R")

# Fetch papers ----

# search term below exceeds 9999 results, so need to use history
srchTrm <- paste('("j cell sci"[ta] OR',
                 '"mol biol cell"[ta] OR',
                 '"j cell biol"[ta] OR',
                 '"nat cell biol"[ta] OR',
                 '"embo j"[ta] OR',
                 '"biochem j"[ta] OR',
                 '"dev cell"[ta] OR',
                 '"faseb j"[ta] OR',
                 '"j biol chem"[ta] OR',
                 '"cells"[ta] OR',
                 '"front cell dev biol"[ta] OR',
                 '"nature communications"[ta] OR',
                 '"cell reports"[ta] OR',
                 '"development"[ta] OR',
                 '"elife"[ta] OR',
                 '"plos biol"[ta]) AND',
                 '(2016 : 2023[pdat]) AND',
                 '(journal article[pt] NOT review[pt])')

# so we will use this
journalSrchTrms <- c('"j cell sci"[ta]','"mol biol cell"[ta]','"j cell biol"[ta]','"nat cell biol"[ta]','"embo j"[ta]',
                     '"biochem j"[ta]','"dev cell"[ta]','"faseb j"[ta]','"j biol chem"[ta]','"cells"[ta]',
                     '"front cell dev biol"[ta]','"nature communications"[ta]','"cell reports"[ta]',
                     '"development"[ta]','"elife"[ta]','"plos biol"[ta]')

# loop through journals and loop through the years
# 2016:2023
pprs <- data.frame()

for (i in 2016:2023) {
  for(j in journalSrchTrms) {
    # one search per journal per year, e.g. "elife"[ta] AND 2016[pdat]
    srchTrm <- paste(j, ' AND ', i, '[pdat]', sep = "")
    pp <- entrez_search(db = "pubmed",
                        term = srchTrm, use_history = TRUE)
    if(pp$count == 0) {
      next
    }
    pp_rec <- entrez_fetch(db = "pubmed", web_history = pp$web_history, rettype = "xml", parsed = TRUE)
    # save the XML and parse it with the helper functions from pubmedXML.R
    xml_name <- paste("Data/all_", i,"_",extract_jname(j), ".xml", sep = "")
    saveXML(pp_rec, file = xml_name)
    tempdf <- extract_xml_brief(xml_name)
    if(!is.null(tempdf)) {
      pprs <- rbind(pprs, tempdf)
    }
  }
}
```
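To see what the nested loop actually sends to PubMed, here is a small offline sketch that builds the same kind of journal-by-year search terms as a table. It uses only two journals and two years to keep the output short. Appending the publication-type filter from the combined search term to each query is a variation for illustration; the loop above does not do this and instead filters publication types later in R.

```r
# Two journals and two years only, to keep the example small
journals <- c('"j cell biol"[ta]', '"elife"[ta]')
years <- 2021:2022

# every journal/year combination, one row per query
queries <- expand.grid(journal = journals, year = years,
                       stringsAsFactors = FALSE)

# build the search term; the [pt] filter is an optional addition (see text)
queries$term <- paste0(queries$journal, " AND ", queries$year, "[pdat]",
                       " AND (journal article[pt] NOT review[pt])")

print(queries$term)
```

Each element of `queries$term` could then be passed as the `term` argument in place of `srchTrm` inside the loop.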
Now let’s load in the Crossref data and match it up
```r
library(dplyr)
library(ggplot2)

df_all <- read.csv("Data/crossref-preprint-article-relationships-Aug-2023.csv")

# remove duplicates from pubmed data
pprs <- pprs[!duplicated(pprs$pmid), ]
# remove unwanted publication types by using a vector of strings
unwanted <- c("Review", "Comment", "Retracted Publication", "Retraction of Publication", "Editorial", "Autobiography", "Biography", "Historical", "Published Erratum", "Expression of Concern")
# subset pprs to remove unwanted publication types using grepl
pure <- pprs[!grepl(paste(unwanted, collapse = "|"), pprs$ptype), ]
# ensure that ptype contains "Journal Article"
pure <- pure[grepl("Journal Article", pure$ptype), ]
# remove papers with "NA NA" as the sole author
pure <- pure[!grepl("NA NA", pure$authors), ]

# add a column to pure that indicates if a row has a doi that is also found in article_doi
# (DOIs are compared in lower case because they are case-insensitive)
pure$in_crossref <- ifelse(tolower(pure$doi) %in% tolower(df_all$article_doi), "yes", "no")

# count the rows in pure that have a doi found in the Crossref data
nrow(pure[pure$in_crossref == "yes",])

# summarize by year the number of papers in pure and how many are in the yes and no category of in_crossref
summary_df <- pure %>%
  # convert from chr to numeric
  mutate(year = as.numeric(year)) %>%
  group_by(year, journal, in_crossref) %>%
  summarise(n = n())

# make a plot to show stacked bars of yes and no for each year
ggplot(summary_df, aes(x = year, y = n, fill = in_crossref)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  scale_fill_manual(values = c("yes" = "#ae363b", "no" = "#d3d3d3")) +
  lims(x = c(2015.5, 2023.5)) +
  labs(x = "Year", y = "Papers") +
  facet_wrap(~journal, scales = "free_y") +
  theme(legend.position = "none")
ggsave("Output/Plots/preprints_all.png", width = 2400, height = 1800, dpi = 300, units = "px", bg = "white")

# now do plot where the bars stack to 100%
ggplot(summary_df, aes(x = year, y = n, fill = in_crossref)) +
  geom_bar(stat = "identity", position = "fill") +
  theme_minimal() +
  scale_fill_manual(values = c("yes" = "#ae363b", "no" = "#d3d3d3")) +
  lims(x = c(2015.5, 2023.5)) +
  labs(x = "Year", y = "Proportion of papers") +
  facet_wrap(~journal) +
  theme(legend.position = "none")
ggsave("Output/Plots/preprints_scaled.png", width = 2400, height = 1800, dpi = 300, units = "px", bg = "white")
```
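The second plot shows the fractions visually, but it can be handy to have them as numbers too. Below is a minimal sketch of how that could be done from a table shaped like `summary_df` (columns `year`, `journal`, `in_crossref`, `n`). The counts are invented so that the example runs on its own, and it uses base R rather than dplyr only to keep it dependency-free; `frac_df`, `total` and `with_preprint` are names made up for this example.

```r
# Toy stand-in for summary_df (same columns as above; the counts are invented)
summary_df <- data.frame(
  year = c(2021, 2021, 2022, 2022, 2022),
  journal = c("J Cell Biol", "J Cell Biol", "J Cell Biol", "J Cell Biol", "Cells"),
  in_crossref = c("yes", "no", "yes", "no", "no"),
  n = c(40, 60, 50, 50, 200)
)

# total papers per journal and year
totals <- aggregate(n ~ journal + year, data = summary_df, FUN = sum)
names(totals)[3] <- "total"

# papers with a linked preprint per journal and year
preprinted <- aggregate(n ~ journal + year,
                        data = summary_df[summary_df$in_crossref == "yes", ],
                        FUN = sum)
names(preprinted)[3] <- "with_preprint"

# left join so that journal-years with no preprinted papers are kept
frac_df <- merge(totals, preprinted, by = c("journal", "year"), all.x = TRUE)
frac_df$with_preprint[is.na(frac_df$with_preprint)] <- 0
frac_df$fraction <- frac_df$with_preprint / frac_df$total
frac_df <- frac_df[order(frac_df$journal, frac_df$year), ]

print(frac_df)
```

The left join matters: a journal-year with no "yes" rows would otherwise vanish from the table instead of showing a fraction of 0.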
Edit: minor update to first plot and code.
—
The post title comes from “Pre Self” by Godflesh from the “Post Self” album.