Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”
Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.
    library(rentrez)

    # Single quotes on the outside so the phrase can keep its double quotes
    es <- entrez_search("pubmed", '"Retracted Publication"[PTYP] 2013[PDAT]', usehistory = "y")
    es$count
    # [1] 117
117 articles. Now let’s fetch the records in XML format.
    xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey,
                        rettype = "xml", retmax = es$count)
Next question: which XML element specifies the “Date of publication” (PDAT)?
To make a long story short: there are several nodes in PubMed XML that contain the word “Date”, but the one which looks most promising is named PubDate. Given that our search used the year (2013), you might think that years can be extracted using the XPath expression //PubDate/Year. You would be mostly, but not entirely right.
    library(XML)

    doc <- xmlTreeParse(xml, useInternalNodes = TRUE)
    table(xpathSApply(doc, "//PubDate/Year", xmlValue))
    # 2013 2014
    #  111    2
Well, that’s confusing. Not only do we get 113 years rather than the expected 117, but two of them have the value 2014. Time to delve deeper into the nodes under PubDate.
    children <- xpathSApply(doc, "//PubDate", xmlChildren)
    table(names(unlist(children)))
    #         Day MedlineDate       Month        Year
    #          25           4          87         113

    table(xpathSApply(doc, "//PubDate/MedlineDate", xmlValue))
    # 2013 Jan-Mar 2013 May-Jun 2013 Nov-Dec 2013 Oct-Dec
    #            1            1            1            1
Interesting. So in addition to //PubDate/Year, 4 records have a node named //PubDate/MedlineDate, which would account for the 4 missing years: 113 + 4 = 117.
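One way to get a year for every record might be to select both node types in a single XPath expression and keep the first four characters. The sketch below runs on a made-up toy document (not real PubMed records) so that it works offline, and it assumes that every MedlineDate value begins with a four-digit year, which is true of the four values seen here but is not something I have verified in general.

```r
library(XML)

# Toy stand-in for PubMed XML (hypothetical, not real records): two citations
# with a Year child under PubDate and one with a MedlineDate child instead.
toy <- '<PubmedArticleSet>
  <PubmedArticle><PubDate><Year>2013</Year><Month>Apr</Month></PubDate></PubmedArticle>
  <PubmedArticle><PubDate><MedlineDate>2013 Oct-Dec</MedlineDate></PubDate></PubmedArticle>
  <PubmedArticle><PubDate><Year>2014</Year><Month>Jan</Month></PubDate></PubmedArticle>
</PubmedArticleSet>'

toy_doc <- xmlParse(toy, asText = TRUE)

# The XPath union operator (|) selects either node type, in document order
raw <- xpathSApply(toy_doc, "//PubDate/Year | //PubDate/MedlineDate", xmlValue)

# ASSUMPTION: the value starts with a four-digit year
years <- substr(raw, 1, 4)
table(years)
```

Swapping toy_doc for doc should give one year per record, provided each record has exactly one of the two child nodes.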
It’s also possible to retrieve records in docsum format, which is XML too but with a different structure. Here, PubDate is the value of the Name attribute of an Item node.
    ds <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey,
                       rettype = "docsum", retmax = es$count)
    ds.doc <- xmlTreeParse(ds, useInternalNodes = TRUE)
    table(xpathSApply(ds.doc, "//Item[@Name='PubDate']", xmlValue))
    #         2013     2013 Apr   2013 Apr 1   2013 Apr 2     2013 Aug  2013 Aug 15  2013 Aug 29
    #           23            7            1            1            2            2            1
    #     2013 Dec   2013 Dec 1     2013 Feb  2013 Feb 26   2013 Feb 7     2013 Jan  2013 Jan 24
    #            3            1            6            1            1           10            2
    #   2013 Jan 3  2013 Jan 30   2013 Jan 7 2013 Jan-Mar     2013 Jul  2013 Jul 25     2013 Jun
    #            1            1            1            1            4            1            3
    #  2013 Jun 18   2013 Jun 5   2013 Jun 7     2013 Mar   2013 Mar 1  2013 Mar 12  2013 Mar 28
    #            1            1            1            5            1            1            1
    #   2013 Mar 9     2013 May   2013 May 1  2013 May 29   2013 May 6   2013 May 8   2013 May 9
    #            1            4            3            1            1            2            1
    #     2013 Nov 2013 Nov-Dec     2013 Oct 2013 Oct-Dec     2013 Sep  2013 Sep 30     2014 Feb
    #            8            1            2            1            5            1            1
    #     2014 Jan
    #            1
A fair old mix of formats in there then, and still the issue of the 2014 years when we searched for PDAT = 2013. We can split on space to get years:
    yr <- xpathSApply(ds.doc, "//Item[@Name='PubDate']",
                      function(x) strsplit(xmlValue(x), " ")[[1]][1])
    which(yr == "2014")
    # [1] 16 26
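To get a feel for that mix of formats, the strings can be sorted into rough shapes with regular expressions. This is only a sketch: classify_pubdate is a hypothetical helper, the patterns cover just the four shapes seen in the table above, and anything else falls through to "other".

```r
# A handful of the PubDate strings from the docsum table
pubdates <- c("2013", "2013 Apr", "2013 Apr 1", "2013 Jan-Mar", "2014 Feb")

# Hypothetical helper: label each string by its apparent format
classify_pubdate <- function(x) {
  ifelse(grepl("^[0-9]{4}$", x), "year",
  ifelse(grepl("^[0-9]{4} [A-Z][a-z]{2}$", x), "year month",
  ifelse(grepl("^[0-9]{4} [A-Z][a-z]{2} [0-9]{1,2}$", x), "year month day",
  ifelse(grepl("^[0-9]{4} [A-Z][a-z]{2}-[A-Z][a-z]{2}$", x), "month range",
         "other"))))
}

formats <- classify_pubdate(pubdates)

# Regex alternative to splitting on space: keep the leading four digits
years <- sub("^([0-9]{4}).*$", "\\1", pubdates)

data.frame(pubdates, formats, years)
```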
And examine records 16 and 26:
    xmlRoot(ds.doc)[[16]]
    # complete output not shown
    # <DocSum>
    #   <Id>24156249</Id>
    #   <Item Name="PubDate" Type="Date">2014 Jan</Item>
    #   <Item Name="EPubDate" Type="Date">2013 Oct 25</Item>

    xmlRoot(ds.doc)[[26]]
    # complete output not shown
    # <DocSum>
    #   <Id>24001238</Id>
    #   <Item Name="PubDate" Type="Date">2014 Feb</Item>
    #   <Item Name="EPubDate" Type="Date">2013 Sep 4</Item>
Not every record has EPubDate. Is it simply the case that where it exists and is earlier than PubDate, then EPubDate == PDAT?
So we haven’t really resolved very much, have we?
- we started with the Entrez search term PDAT (Date of publication)
- both PubMed XML and DocSum contain something called PubDate
- in the former case, most child node names = Year, but some = MedlineDate
- we retrieve some records where PubDate year = 2014, even when searching for 2013[PDAT]
It appears that PDAT does not map consistently to any XML node in either XML or DocSum formats. It might be derived from (1) EPubDate, where that exists and is earlier than PubDate, or (2) PubDate, where EPubDate does not exist.
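That guess is easy enough to write down as code. What follows is a minimal sketch of the hypothesis, not a description of how NCBI actually derives PDAT. The helpers parse_pubmed_date and guess_pdat_year are hypothetical, incomplete dates are assumed to fall on the earliest day they could mean (first month of a range, first day of a month), and the only data used are the two records shown above.

```r
# Hypothetical helper: turn "2013", "2013 Apr", "2013 Apr 1" or "2013 Jan-Mar"
# into a Date. ASSUMPTION: missing parts default to the earliest possible day.
parse_pubmed_date <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)[[1]]
  year  <- as.integer(parts[1])
  # For a range such as "Jan-Mar" keep the first month; default to January
  mon_txt <- if (length(parts) >= 2) sub("-.*$", "", parts[2]) else "Jan"
  month <- match(mon_txt, month.abb)
  if (is.na(month)) month <- 1L   # unrecognised month text, e.g. a season
  day <- if (length(parts) >= 3) as.integer(parts[3]) else 1L
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

# The hypothesis: use EPubDate when it exists and is earlier, else PubDate
guess_pdat_year <- function(pub, epub) {
  pub_d <- parse_pubmed_date(pub)
  if (is.na(epub) || !nzchar(epub)) {
    return(as.integer(format(pub_d, "%Y")))
  }
  epub_d <- parse_pubmed_date(epub)
  as.integer(format(min(pub_d, epub_d), "%Y"))
}

# The two docsum records with a 2014 PubDate
records <- data.frame(
  id       = c("24156249", "24001238"),
  PubDate  = c("2014 Jan", "2014 Feb"),
  EPubDate = c("2013 Oct 25", "2013 Sep 4"),
  stringsAsFactors = FALSE
)

records$pdat_year <- mapply(guess_pdat_year, records$PubDate, records$EPubDate,
                            USE.NAMES = FALSE)
records
```

Both records come out as 2013 under this rule, which is consistent with them matching 2013[PDAT]. Two records do not prove much, though; the rule would need checking against all 117.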
Filed under: bioinformatics, R, statistics Tagged: entrez, eutils, ncbi, pubmed, xml