The National Center for Biotechnology Information (NCBI) is part of the National Institutes of Health’s National Library of Medicine, and best known for hosting PubMed, the go-to search engine for biomedical literature – every (MEDLINE-indexed) publication goes up there.

On a separate but related note, one thing I’m constantly looking to do is get DOIs for papers on demand. Most recently I found an R package, knitcitations, which generates bibliographies automatically from DOIs in the text; it worked quite well for a 10-page literature review chock full of references (I’m a little allergic to Mendeley and other clunky reference managers).

The “Digital Object Identifier”, as the name suggests, uniquely identifies a research paper (and recently it has also been co-opted to reference associated datasets). There are lots of interesting and troublesome exceptions, which I’ve mentioned previously, but in the vast majority of cases any paper published in the last 10 years or so will have one.

Although NCBI PubMed does a great job of cataloguing the biomedical literature, another site, doi.org, provides a consistent gateway to the original source of a paper: you only need to append the DOI to “dx.doi.org/” to get a working redirect link.
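For example, with a placeholder DOI (substitute any real one), a quick way to see the redirect from the shell:

doi="10.1234/example.5678"   # placeholder DOI, purely for illustration
curl -sI "http://dx.doi.org/$doi" | grep -i '^location:'   # prints the publisher URL the resolver redirects to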

Last week the NCBI posted a webinar detailing the inner workings of Entrez Direct, the command-line interface to their databases for Unix computers (GNU/Linux and Macs; Windows users can fake it with Cygwin). It revolves around a custom XML parser written in Perl (typical for bioinformaticians), with subtle ‘switches’ to tailor the output just as you would from the web service (albeit with rather more of the inner workings on show).
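As a minimal example of the sort of pipeline it enables (the query here is arbitrary), this prints a PMID and title for each matching article:

esearch -db pubmed -query "mismatch repair cancer" |
  efetch -format docsum |
  xtract -pattern DocumentSummary -element Id Title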

I’ve been in a bit of a programming phase recently, starting to read some linear/integer and dynamic programming texts (alongside this course from UCB), so I was in the mood to get something working with this.

I’ve pieced together a basic pipeline. It has a function to generate citations for knitcitations from files listing basic bibliographic information, and, as the final piece of the puzzle, a custom function (or several) that systematically tries to find the single article matching a paper’s author, publication year, and title, so that DOIs can be filled in for every entry in such a table.

It’s the sort of thing that’s better shown than described, really. The animation above shows the function I finished today in action, sequentially adding a fourth column of DOIs to a table of bibliographic info, all automated through Entrez Direct. It’ll probably be something I can improve on iteratively, but for now I thought I’d share the code for anyone else starting to play around with the service.

For those not familiar with this sort of development setup: the code below lives in the .bashrc file found in the home directory, which contains useful shortcuts called “aliases”, as well as more elaborate functions that allow all sorts of file, text and program manipulation.
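A couple of made-up examples of the kind of thing that lives there:

alias ll='ls -lh'                       # an alias: a simple shortcut for a longer command
function countlines (){ wc -l "$@"; }   # a function: takes arguments, can be combined with other commands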

The “Unix philosophy” dictates that thou shalt pipe – connect simple functions together in a mix-and-match manner, so that’s my excuse for what seems like an overcomplicated .bashrc file…

Entrez Direct has nice, concise documentation that will explain this a whole lot better than I can. It’s well worth pointing out that PubMed is just one of the NCBI’s databases: you can also access genetic (including OMIM), protein and other types of data through this very same interface.
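For instance, swapping the database name and output format pulls protein sequences through much the same kind of pipeline (the query here is just an arbitrary example):

esearch -db protein -query "human MLH1" | efetch -format fasta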

One technical note not on the NCBI site: when installing, the setup script added a line to my .bashrc sourcing my .bash_profile, which was already in turn sourcing my .bashrc, effectively putting every new terminal prompt in an infinite loop. Watch out for this if your terminals freeze and then quit after installation!
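In other words the two startup files ended up sourcing each other. One way to break such a cycle, sketched here with a made-up guard variable (the exact lines the installer adds will differ on your machine), is to bail out of the re-entered file early:

# near the top of ~/.bashrc (or whichever file ends up being re-entered)
if [ -n "$BASHRC_ALREADY_SOURCED" ]; then return; fi
export BASHRC_ALREADY_SOURCED=1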

The scripts below are available here; I’ll update the GitHub Gist if I make amendments.

function cutf (){ cut -d $'\t' -f "$@"; }
function striptoalpha (){ for thisword in $(echo "$@" | tr -dc "[A-Z][a-z]\n" | tr '[A-Z]' '[a-z]'); do echo $thisword; done; }
function pubmed (){ esearch -db pubmed -query "$@" | efetch -format docsum | xtract -pattern DocumentSummary -present Author -and Title -element Id -first "Author/Name" -element Title; }
function pubmeddocsum (){ esearch -db pubmed -query "$@" | efetch -format docsum; }
function pubmedextractdoi (){ pubmeddocsum "$@" | xtract -pattern DocumentSummary -element Id -first "Author/Name" -element Title SortPubDate -block ArticleId -match "IdType:doi" -element Value | awk '{split($0,a,"\t"); split(a[4],b,"/"); print a[1]"\t"a[2]"\t"a[3]"\t"a[5]"\t"b[1]}'; }
function pubmeddoi (){ pubmedextractdoi "$@" | cutf 4; }
function pubmeddoimulti (){
xtracted=$(pubmedextractdoi "$@")
if [[ $(echo "$xtracted" | cutf 4) == '' ]]
then
xtractedpmid=$(echo "$xtracted" | cutf 1)
pmid2doirestful "$xtractedpmid"
else
echo "$xtracted" | cutf 4
fi
}
function pmid2doi (){ curl -s www.pmid2doi.org/rest/json/doi/"$@" | awk '{split($0,a,",\"doi\":\"|\"}"); print a[2]}'; }
function pmid2doimulti (){
curleddoi=$(pmid2doi "$@")
if [[ $curleddoi == '' ]]
then
pmid2doincbi "$@"
else
echo "$curleddoi"
fi
}
function pmid2doincbi (){
xtracteddoi=$(pubmedextractdoi "$@")
if [[ $xtracteddoi == '' ]]
then
echo "DOI NA"
else
echo "$xtracteddoi"
fi
}
function AddPubTableDOIsSimple () {
old_IFS=$IFS
IFS=$'\n'
for line in $(cat "$@"); do
AddPubDOI "$line"
done
IFS=$old_IFS
}
# Came across NCBI rate throttling while trying to call AddPubDOI in parallel, so added a second attempt for "DOI NA"
# and also writing STDOUT output to STDERR as this function will be used on a file (meaning STDOUT will get silenced)
# so you can see progress through the lines, as in:
# AddPubTableDOIs table.tsv > outputfile.tsv
# I'd recommend it's not wise to overwrite unless you're using version control.
function AddPubTableDOIs () {
old_IFS=$IFS
IFS=$'\n'
for line in $(cat "$@"); do
DOIresp=$(AddPubDOI "$line" 2>/dev/null)
if [[ $DOIresp =~ 'DOI NA' ]]; then
# try again in case it's just NCBI rate throttling, but just the once
DOIresp2=$(AddPubDOI "$line" 2>/dev/null)
if [[ $(echo "$DOIresp2" | awk 'BEGIN{FS="\t"};{print NF}' | uniq | wc -l) == '1' ]]; then
echo "$DOIresp2"
>&2 echo "$DOIresp"
else
DOIinput=$(echo "$line" | cutf 1-3)
echo -e "$DOIinput\tDOI NA: Parse error"
>&2 echo -e "$DOIinput\tDOI NA: Parse error"
fi
else
if [[ $(echo "$DOIresp" | awk 'BEGIN{FS="\t"};{print NF}' | uniq | wc -l) == '1' ]]; then
echo "$DOIresp"
>&2 echo "$DOIresp"
else
DOIinput=$(echo "$line" | cutf 1-3)
echo -e "$DOIinput\tDOI NA: Parse error"
>&2 echo -e "$DOIinput\tDOI NA: Parse error"
fi
fi
done
IFS=$old_IFS
}
function AddPubDOI (){
if [[ $(echo "$@" | cutf 4) != '' ]]; then
echo "$@"
return # the line already has a DOI in column 4, so echo it back unchanged and stop ('continue' only works inside a loop)
fi
printf "$(echo "$@" | cutf 1-3)\t"
thistitle=$(echo "$@" | cutf 3)
if [[ $thistitle != 'Title' ]]; then
thisauthor=$(echo "$@" | cutf 1)
thisyear=$(echo "$@" | cutf 2)
round1=$(pubmeddoimulti "$thistitle AND $thisauthor [AUTHOR]")
round1hits=$(echo "$round1" | wc -l)
if [[ "$round1hits" -gt '1' ]]; then
round2=$(pubmeddoimulti "$thistitle AND $thisauthor [AUTHOR] AND ("$thisyear"[Date - Publication] : "$thisyear"[Date - Publication])")
round2hits=$(echo "$round2" | wc -l)
if [[ "$round2hits" -gt '1' ]]; then
round3=$(
xtracted=$(pubmedextractdoi "$@")
xtractedtitles=$(echo "$xtracted" | cutf 3 | tr -dc "[A-Z][a-z]\n")
alphatitles=$(striptoalpha "$xtractedtitles")
thistitlealpha=$(striptoalpha "$thistitle")
presearchIFS=$IFS
IFS=$'\n'
titlecounter="0" # incremented before use below, so the first title lines up with line 1 of $xtracted
for searchtitle in $(echo "$alphatitles"); do
(( titlecounter++ ))
if [[ "$searchtitle" == *"$thistitlealpha"* ]]; then
echo "$xtracted" | sed $titlecounter'q;d' | cutf 4
fi
done
IFS=$presearchIFS
)
round3hits=$(echo "$round3" | wc -l)
if [[ "$round3hits" -gt '1' ]]; then
echo "ERROR multiple DOIs after 3 attempts to reduce - "$round3
else
echo $round3
fi
else
echo $round2
fi
else
echo $round1
fi
fi
}
function pmid2doirestful (){
curleddoi=$(pmid2doi "$@")
if [[ $curleddoi == '' ]]
then
echo "DOI NA"
else
echo "$curleddoi"
fi
}
function mmrlit { cat ~/Dropbox/Y3/MMR/Essay/literature_table.tsv; }
function mmrlitedit { vim ~/Dropbox/Y3/MMR/Essay/literature_table.tsv; }
function mmrlitgrep (){ grep -i "$@" ~/Dropbox/Y3/MMR/Essay/literature_table_with_DOIs.tsv; }
function mmrlitdoi (){ mmrlitgrep "$@" | cut -d $'\t' -f 4 | tr -d '\n' | xclip -sel p; clipconfirm; }
function mmrlitdoicite (){ mmrlitgrep "$@" | cut -d $'\t' -f 4 | awk '{print "`r citet(\""$0"\")`"}' | tr -d '\n' | xclip -sel p; clipconfirm; }
# As well as processing a tab-separated file if passed to AddPubTableDOIs, the AddPubDOI can be used to make a DOI
# search box if tied to a keyboard shortcut (creating a terminal, instructing it to start a bash shell with commands)
# NB: `xclip -sel p` saved to clipboard, but the clipboard was cleared upon closing this window.
# - Installing parcellite clipboard manager and ensuring it monitors the new
# - clipboard entry ensures retention after the bash child process exits
# Using a terminal profile with larger text called "Bigcommands", I tie the keyboard shortcut to:
# gnome-terminal -e "bash -c \"source /home/louis/.bashrc; read pmstr; getdoispaced "$pmstr" | tr -d '\n' | xclip -sel p; clipconfirm; parcellite -p > /dev/null; read dummyvar;\"" --geometry 80x2 --title="Search for a DOI:" --window-with-profile="Bigcommands"
# Optional extra: check if the spacebar was pressed at the end. If so, open the article webpage
# gnome-terminal -e "bash -c \"source /home/louis/.bashrc; read pmstr; getdoispaced "$pmstr" | tr -d '\n' | xclip -sel p; clipconfirm; parcellite -p > /dev/null; read -d'' -s -n1; if [[ $REPLY = ' ' ]]; then google-chrome "http://dx.doi.org/"$(xclip -o) > /dev/null 2>&1; fi;\"" --geometry 80x2 --title="Search for a DOI:" --window-with-profile="Bigcommands"
function cuts (){ cut -d ' ' -f "$@"; }
function getdoispaced (){
if [[ $(echo "$@" | cuts 2) =~ [0-2][0-9]{3} ]]; then
tabseppub=$(echo -e "$(echo $@ | cuts 1-2 | tr ' ' '\t')\t$(echo $@ | cuts 3-)")
AddPubDOI "$tabseppub" | cutf 4
fi
}

The main functions in the script are AddPubDOI and AddPubTableDOIs (I renamed the latter from the less descriptive title in the screenshot animation above); the former is executed for every line of the input table by the latter. Weird bug or programming-language quirk, who knows where it comes from: I couldn’t use the traditional while read variable; do somefunction "$variable"; done < inputfile construction to handle the file line by line, so I resorted to cat trickery. I blame Perl. Example calls for the main functions follow the notes below.
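For what it’s worth, the usual culprit when a while read loop misbehaves is a command inside the loop consuming the rest of stdin (plausible here, since esearch/efetch are network clients). A hedged sketch of the standard workaround, reading from a separate file descriptor, if anyone wants to try it in place of the cat version:

while IFS= read -r -u 3 line; do
  AddPubDOI "$line"
done 3< "$1"   # feed the table on fd 3 so the loop body keeps stdin to itself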

  • cutf is my shorthand to tell the cut command I want a specific column in a tab-separated file or variable.
  • striptoalpha is a function I made here to turn paper titles into all-lowercase, squished-together strings of letters (no dashes, commas etc. that might get in the way of text comparison): a really crude way of checking one title against another. This part of the script could easily be improved, but I was just sorting out one funny case; usually matching author and year and using a loose title match will be sufficient to find the matching PubMed entry, for which a DOI can then be found.
  • pubmed chains together: esearch to search PubMed for the query; efetch to get the document (i.e. article) summaries as XML; and xtract to pull out the basic info. I don’t use this in my little pipeline setup; rather, I kept my options open and chose to get more information, matching within blocks of the XML for the DOI. It’s not so complicated to follow; as well as my code there’s this example on Biostars.
  • pubmeddocsum just does the first two of the steps above, providing the full unparsed XML ‘docsums’.
  • pubmedextractdoi gets date and DOI information as extra columns, then uses GNU awk to rearrange the columns in the output.
  • pubmeddoi gives just the DOI column from said rearranged output.
  • pubmeddoimulti has ‘multiple’ ways to try to get the DOI for an article matched from searching PubMed: first from the DOI in the Entrez output, then by falling back to the pmid2doi service.
  • pmid2doimulti does the same as pubmeddoimulti but starting from a provided PMID.
  • pmid2doi handles the pmid2doi.org response and pmid2doincbi the Entrez Direct side; both feed into pmid2doimulti.
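Put together, typical calls look something like this (the query and file names are just placeholders; the table is the three-column author/year/title format AddPubDOI expects):

# search straight from a title-and-author query, as AddPubDOI does internally
pubmeddoimulti "A Proteome-Scale Map of the Human Interactome Network AND Rolland [AUTHOR]"

# fill in a DOI column for a whole table, watching progress on stderr
AddPubTableDOIs literature_table.tsv > literature_table_with_DOIs.tsv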

Rookie’s disclaimer: I’m aware pipelines are supposed to contain more, um, pipes, but I can’t quite figure out an easy way to make these functions ‘pipe’ to one another, so I’m sticking with passing the output of one to the next as input ("$@" in bash script).
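For the record, one common way to make this kind of function pipeable is to fall back to reading stdin when no arguments are given; a hypothetical variant, not part of the gist:

function pubmeddoipipe (){
  local query="${*:-$(cat)}"   # use the arguments if present, otherwise read the query from stdin
  pubmeddoimulti "$query"
}
# echo "some title AND Smith [AUTHOR]" | pubmeddoipipe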

Update: the second file added to the GitHub gist has the code needed to tie this to a keyboard shortcut (I’m using Alt+Windows Key+P):

E.g. with Rolland et al. (2014) A Proteome-Scale Map of the Human Interactome Network:
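That is, typing the author surname, year and title at the prompt (or calling the function directly) should print the paper’s DOI, assuming a single match is found:

getdoispaced "Rolland 2014 A Proteome-Scale Map of the Human Interactome Network"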

Update 2: the keybinding now lets you hit the spacebar to open the article in a web browser, and I rejigged the main function to write to a file (see the comments in the code). The end of the GitHub gist also has some functions I use to copy a DOI from one of the references in the generated table straight to the clipboard, including one that formats it as a knitcitations citation for R Markdown, as mentioned above.
