Analysis of retractions in PubMed
As so often happens these days, a brief post at FriendFeed got me thinking about data analysis. Entitled “So how many retractions are there every year, anyway?”, the post links to this article at Retraction Watch. It discusses ways to estimate the number of retractions and in particular, a recent article in the Journal of Medical Ethics (subscription only, sorry) which addresses the issue.
As Christina pointed out in a comment at Retraction Watch, there are thousands of scientific journals of which PubMed indexes only a fraction. However, PubMed is relatively easy to analyse using a little Ruby and R. So, here we go…
Code and raw data used for this post are available at Github.
1. Searching for retractions
In the Journal of Medical Ethics article, the authors state: “Every research paper noted as retracted in the PubMed database from 2000 to 2010 was evaluated. PubMed was searched on 22 January 2010 with the limits of ‘items with abstracts, retracted publication, English.’ A total of 788 retracted papers were identified…”
Not a bad approach. There’s another way: at the PubMed website, find a retraction and examine the record in XML format. You’ll see this:
<PublicationTypeList>
  <PublicationType>Retraction of Publication</PublicationType>
</PublicationTypeList>
The equivalent in Medline format is:
PT  - Retraction of Publication
This means that retraction records are tagged with a particular Publication Type, or PTYP for short. If you search at the PubMed website using the term “Retraction of Publication[Publication Type]”, you will retrieve (at the time of writing) ~1621 records.
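The same search can also be run from a script. What follows is a minimal sketch, not the code used for this post: it assumes the NCBI E-utilities esearch endpoint with rettype=count, and the function names (esearch_count_uri, parse_count, pubmed_count) are made up for illustration.

```ruby
require "net/http"
require "uri"

# Build an esearch URL that asks PubMed only for the number of matching records.
# The endpoint and parameters are assumptions based on NCBI E-utilities.
def esearch_count_uri(term)
  uri = URI("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi")
  uri.query = URI.encode_www_form(db: "pubmed", term: term, rettype: "count")
  uri
end

# Pull the first <Count> value out of the esearch XML response.
# A regex is enough for this single field; returns nil if no count is found.
def parse_count(xml)
  value = xml[/<Count>(\d+)<\/Count>/, 1]
  value && value.to_i
end

# Fetch the record count for a search term (performs a network request).
def pubmed_count(term)
  parse_count(Net::HTTP.get(esearch_count_uri(term)))
end

# Example (needs network access, so it is left commented out):
#   pubmed_count("Retraction of Publication[Publication Type]")
```

Asking only for the count keeps the request small, since none of the matching records are downloaded.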
2. Retrieving retraction counts by year
Armed with this information, we can modify the Ruby code that I’ve posted previously to retrieve total and retracted publications between 1900 and 2010. This generates a tab-delimited file with 3 columns: year, total publications and retracted publications.
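As a rough illustration of that step (the actual script is on Github and may well differ), here is a Ruby sketch that builds the three-column rows. It assumes the [DP] (date of publication) and [PTYP] search tags; yearly_count_rows and the counter argument are hypothetical names, and the stand-in counter returns made-up numbers.

```ruby
# Build one tab-delimited row per year: year, total publications, retractions.
# `counter` is any callable that maps a PubMed search term to a record count,
# e.g. a wrapper around an E-utilities esearch request (not shown here).
def yearly_count_rows(years, counter)
  years.map do |year|
    total     = counter.call("#{year}[DP]")
    retracted = counter.call("Retraction of Publication[PTYP] AND #{year}[DP]")
    [year, total, retracted].join("\t")
  end
end

# Stand-in counter with made-up numbers so the sketch runs offline.
fake_counter = ->(term) { term.include?("Retraction") ? 2 : 1000 }
puts yearly_count_rows(1977..1979, fake_counter)

# With a real counter, the rows could be saved for later analysis, e.g.:
#   File.write("retractions.tsv", yearly_count_rows(1900..2010, real_counter).join("\n") + "\n")
```

Passing the counter in as an argument keeps the row-building logic separate from the network code, which makes it easy to try out without hitting PubMed.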
3. Retraction count analysis
Here’s the R code to analyse the retraction counts. There are no recorded retractions until 1977, so we’ll start from that year.
First, a simple plot of retractions for each year.
So, retractions are increasing rapidly. No surprise there, since the total number of publications per year is also increasing rapidly. We need some kind of normalization.
Chris got there first with this graphic, showing retractions each year per 100 000 publications. Here’s my version.
Indeed, it seems that with each year, retractions constitute a greater proportion of publications for that year.
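The normalization itself is simple arithmetic. A minimal Ruby sketch, assuming the tab-delimited layout described earlier (year, total, retracted); the function names are hypothetical and the sample rows are made up.

```ruby
# Retractions per 100,000 publications for a single year.
# Returns nil when there are no publications to normalize by.
def retractions_per_100k(total, retracted)
  return nil if total.zero?
  retracted * 100_000.0 / total
end

# Turn tab-delimited text (year, total, retracted) into [year, rate] pairs.
def normalized_rates(tsv_text)
  tsv_text.each_line.reject { |line| line.strip.empty? }.map do |line|
    year, total, retracted = line.chomp.split("\t").map(&:to_i)
    [year, retractions_per_100k(total, retracted)]
  end
end

# Made-up rows, for illustration only.
sample = "1990\t400000\t20\n2000\t500000\t50\n"
normalized_rates(sample).each { |year, rate| puts "#{year}\t#{rate}" }
```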
Another way to examine the trend is to use the cumulative sum of both total publications and retractions over time. In other words for each year, instead of looking at the numbers for just that year, we look at the total records accumulated in PubMed to date. Here’s that plot.
This shows a smoother upwards trend, with a rapid increase from 2005 onwards.
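For anyone who wants to reproduce the cumulative view outside R, here is a small Ruby sketch. It assumes the plotted quantity is cumulative retractions per 100 000 cumulative publications, which is my reading of the description rather than something confirmed by the original code; names and sample numbers are made up.

```ruby
# Running totals: element i is the sum of values[0..i].
def cumulative_sums(values)
  running = 0
  values.map { |v| running += v }
end

# Cumulative retractions per 100,000 cumulative publications, year by year.
def cumulative_rates(totals, retractions)
  cumulative_sums(totals).zip(cumulative_sums(retractions)).map do |pubs, retr|
    pubs.zero? ? nil : retr * 100_000.0 / pubs
  end
end

# Made-up yearly counts, for illustration only.
p cumulative_sums([100, 200, 300])
p cumulative_rates([100_000, 100_000], [1, 3])
```

Because each point averages over everything accumulated so far, single-year spikes are damped, which is why the curve looks smoother than the per-year plot.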
Finally, we can compare the growth rate of total and retracted publications. One way to do this is to choose 1977 as the baseline and for each year, calculate the percentage increase in both publication types relative to 1977. Here’s the result.
This is somewhat alarming. Whilst there are about 4x as many total publications in PubMed now as there were in 1977, the total number of retractions has risen almost 550x.
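The baseline comparison can be sketched in a few lines of Ruby. This assumes the first element of each series is the 1977 value; the function name and the sample series are hypothetical. Note that a 4x fold change corresponds to a 300% increase, so the two ways of stating the result differ by one baseline unit.

```ruby
# Percentage increase of every value relative to the first (baseline) value.
def percent_increase_from_baseline(values)
  baseline = values.first.to_f
  raise ArgumentError, "baseline must be non-zero" if baseline.zero?
  values.map { |v| (v - baseline) / baseline * 100.0 }
end

# Made-up series: the last value is 4x the baseline, i.e. a 300% increase.
p percent_increase_from_baseline([2, 4, 8])
```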
4. Analysis of Medline data
Using the search term described earlier in the post to retrieve retractions, we can download a file in Medline format. Medline records contain various fields of interest, including the ROF (retraction of) line, describing the publication that was retracted.
Or – as it turns out in some cases – publications. One retraction record may include the retraction of several publications, as we can see with a simple grep:
grep -c "^PMID" retractions.medline && grep -c "^ROF" retractions.medline
1621
1705
We won’t worry about that too much, since the majority of retraction records reference one publication.
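To check that claim rather than take it on faith, one could tally ROF lines per record. A minimal Ruby sketch, assuming records in the Medline file are separated by blank lines and that each ROF line starts with the tag; the function names and sample text are made up.

```ruby
# Count the ROF lines in each blank-line-separated Medline record.
def rof_counts_per_record(medline_text)
  records = medline_text.split(/\n\s*\n/).map(&:strip).reject(&:empty?)
  records.map { |record| record.lines.count { |line| line.start_with?("ROF") } }
end

# Map "number of ROF lines" to "number of records with that many".
def rof_distribution(medline_text)
  rof_counts_per_record(medline_text).each_with_object(Hash.new(0)) do |n, dist|
    dist[n] += 1
  end
end

# Made-up two-record sample: one record retracts one paper, the other two.
sample = [
  "PMID- 1", "PT  - Retraction of Publication", "ROF - Paper A", "",
  "PMID- 2", "PT  - Retraction of Publication", "ROF - Paper B", "ROF - Paper C"
].join("\n")
p rof_distribution(sample)

# On the real file this might be run as:
#   p rof_distribution(File.read("retractions.medline"))
```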
Here is some R code that performs two simple, similar analyses of the Medline file. First, the top 10 journals for retractions:
    so                          Freq
667 Proc Natl Acad Sci U S A      54
707 Science                       52
590 Nature                        42
388 J Biol Chem                   32
450 J Immunol                     28
157 Cell                          20
92  Biochem Biophys Res Commun    16
116 Blood                         16
413 J Clin Invest                 15
566 Mol Cell Biol                 15
A brief glance at that list suggests that higher impact factor = more retractions. We would want to know the total number of publications for those journals to make more sense of that.
Second, the top 10 countries:
   pl             Freq
45 united states   856
12 england         373
28 netherlands      83
15 germany          47
23 japan            42
6  china            25
2  australia        19
24 korea (south)    19
10 denmark          17
42 switzerland      14
Not especially surprising; the ones with the most researchers/scientific output. Again, we’d want more data before drawing any conclusions.
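Both tallies come down to counting the values of one Medline field. Here is a hedged Ruby sketch of that idea, not a port of the R code: it assumes fields are written as a tag padded to four characters followed by "- ", and top_field_values is a hypothetical helper. Going by the "pl" column header, the country table appears to use the PL (place of publication) field, which describes where the journal is published rather than where the authors work.

```ruby
# Return the top `n` values of a Medline field as [value, count] pairs,
# sorted by descending count and then alphabetically.
def top_field_values(medline_text, tag, n = 10, downcase: false)
  prefix = format("%-4s- ", tag) # e.g. "PL  - " or "PMID- "
  counts = Hash.new(0)
  medline_text.each_line do |line|
    next unless line.start_with?(prefix)
    value = line[prefix.length..-1].strip
    value = value.downcase if downcase
    counts[value] += 1
  end
  counts.sort_by { |value, count| [-count, value] }.first(n)
end

# Made-up sample lines, for illustration only.
sample = "PL  - United States\nPL  - England\nPL  - united states\nTA  - Science\n"
p top_field_values(sample, "PL", 10, downcase: true)

# On the real file, something like:
#   text = File.read("retractions.medline")
#   top_field_values(text, "PL", 10, downcase: true)
```

Lower-casing before counting merges spellings that differ only in case, which matters for free-text fields such as PL.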
Final thoughts
- Analysis of all kinds of data from PubMed is relatively straightforward. As to the factors underlying the recent rise in retractions: the JME article focuses on fraud. Your thoughts are welcome.
- It strikes me that it would be relatively easy to build a web application (Rails, Heroku), which constantly monitors retraction data at PubMed and generates a variety of statistics and charts.
- The post at Retraction Watch lists a variety of estimates for numbers of retractions: 328 from 1995-2004, 529 from 1988-2008 and, most amusingly, 95 in 2008 – for the entire Thomson Reuters Science Citation Index. Given that there are 237 records in PubMed alone for 2008, you have to wonder what the Times Higher Education Supplement paid for the latter study. And people wonder why we don’t trust impact factors.
Filed under: bibliography, publications, R, ruby, statistics Tagged: ggplot2, pubmed, retraction, stringr