
The National Center for Biotechnology Information (NCBI) is part of the National Institutes of Health’s National Library of Medicine, and is best known for hosting Pubmed, the go-to search engine for biomedical literature – every (Medline-indexed) publication goes up there.
On a separate but related note, one thing I’m constantly looking to do is get DOIs for papers on demand. Most recently I found an R package, knitcitations, that generates bibliographies automatically from DOIs in the text, which worked quite well for a 10-page literature review chock full of references (I’m a little allergic to Mendeley and other clunky reference managers).
The “Digital Object Identifier”, as the name suggests, uniquely identifies a research paper (and recently it’s being co-opted to reference associated datasets). There are lots of interesting and troublesome exceptions, which I’ve mentioned previously, but in the vast majority of cases any paper published in at least the last 10 years or so will have one.
Although NCBI Pubmed does a great job of cataloguing biomedical literature, another site, doi.org, provides a consistent gateway to the original source of the paper. You only need to append the DOI to “dx.doi.org/” to generate a working redirection link.
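As a trivial illustration of that link construction (the function name and the DOI below are made up for the example, not taken from any package or paper), it is just string concatenation:

```shell
# doi2url is an invented helper: prepend the dx.doi.org resolver to a DOI
doi2url () { printf 'http://dx.doi.org/%s\n' "$1"; }

doi2url "10.1000/xyz123"   # placeholder DOI, not a real paper
```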
Last week the NCBI posted a webinar detailing the inner workings of Entrez Direct, the command-line interface for Unix computers (GNU/Linux and Macs; Windows users can fake it with Cygwin). It revolves around a custom XML parser written in Perl (typical for bioinformaticians), with subtle ‘switches’ to tailor the output just as you would from the web service (albeit with a fair bit more of the inner workings on show).
I’ve been in a bit of a programming phase recently, starting to read some linear/integer and dynamic programming texts (alongside this course from UCB), so I was in the mood to get something working with this.
I’ve pieced together a basic pipeline: one function generates citations for knitcitations from files listing basic bibliographic information, and the final piece of the puzzle is a custom function (or several) that does its best to find a single unique article matching the author, publication year, and title of a paper, systematically finding DOIs for the entries in such a table.
It’s the sort of thing that’s better shown than described, really. The animation above is the function I finished today in action, sequentially adding a fourth column with the DOI to a table of file info, all automated through Entrez Direct. It’ll probably be something I can improve upon iteratively, but for now I thought I’d share the code for anyone else starting to play around with the service.
For those not familiar with a development system setup, this code lives in the .bashrc file found in the home directory, which contains useful shortcuts, called “aliases”, as well as more elaborate functions that allow all sorts of file, text and program manipulation.
The “Unix philosophy” dictates that thou shalt pipe – connect simple functions together in a mix-and-match manner, so that’s my excuse for what seems like an overcomplicated .bashrc file…
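To give a flavour of the difference (the names here are invented for illustration, not taken from the script below), an alias versus a function in a .bashrc looks like:

```shell
# An alias is simple textual substitution...
alias ll='ls -lh'

# ...while a function takes arguments and can pipe commands together;
# tsvcol prints one column of tab-separated input
tsvcol () { cut -d $'\t' -f "$1"; }
```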
Entrez Direct has nice, concise documentation that will explain this a whole lot better than I can. It’s well worth pointing out that Pubmed is just one of the NCBI’s databases: you can also access genetic, oncology (OMIM), protein, and other types of data through this very same interface.
One technical note not on the NCBI site: when installing, the setup script added a “source .bashrc” command pulling in my .bash_profile, which was already in turn ‘sourcing’ my .bashrc, effectively putting every new terminal command prompt in an infinite loop – watch out for this if your terminals freeze then quit after installation!
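One defensive option (my own workaround sketch, not something the installer provides) is a guard variable, so that the body of the file only runs the first time it is sourced in a given shell:

```shell
# Only run this file's body once per shell: a re-source (direct, or via a
# .bashrc <-> .bash_profile cycle) becomes a harmless no-op
if [ -z "$MY_BASHRC_LOADED" ]; then
  MY_BASHRC_LOADED=1
  # aliases and functions would go here
fi
```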
The scripts below are available here; I’ll update the GitHub Gist if I make amendments.
function cutf (){ cut -d $'\t' -f "$@"; }
function striptoalpha (){ for thisword in $(echo "$@" | tr -dc "[A-Z][a-z]\n" | tr [A-Z] [a-z]); do echo $thisword; done; }
function pubmed (){ esearch -db pubmed -query "$@" | efetch -format docsum | xtract -pattern DocumentSummary -present Author -and Title -element Id -first "Author/Name" -element Title; }
function pubmeddocsum (){ esearch -db pubmed -query "$@" | efetch -format docsum; }
function pubmedextractdoi (){ pubmeddocsum "$@" | xtract -pattern DocumentSummary -element Id -first "Author/Name" -element Title SortPubDate -block ArticleId -match "IdType:doi" -element Value | awk '{split($0,a,"\t"); split(a[4],b,"/"); print a[1]"\t"a[2]"\t"a[3]"\t"a[5]"\t"b[1]}'; }
function pubmeddoi (){ pubmedextractdoi "$@" | cutf 4; }
function pubmeddoimulti (){
  xtracted=$(pubmedextractdoi "$@")
  if [[ $(echo "$xtracted" | cutf 4) == '' ]]
  then
    xtractedpmid=$(echo "$xtracted" | cutf 1)
    pmid2doirestful "$xtractedpmid"
  else
    echo "$xtracted" | cutf 4
  fi
}
function pmid2doi (){ curl -s www.pmid2doi.org/rest/json/doi/"$@" | awk '{split($0,a,",\"doi\":\"|\"}"); print a[2]}'; }
function pmid2doimulti (){
  curleddoi=$(pmid2doi "$@")
  if [[ $curleddoi == '' ]]
  then
    pmid2doincbi "$@"
  else
    echo "$curleddoi"
  fi
}
function pmid2doincbi (){
  xtracteddoi=$(pubmedextractdoi "$@")
  if [[ $xtracteddoi == '' ]]
  then
    echo "DOI NA"
  else
    echo "$xtracteddoi"
  fi
}
function AddPubTableDOIsSimple () {
  old_IFS=$IFS
  IFS=$'\n'
  for line in $(cat "$@"); do
    AddPubDOI "$line"
  done
  IFS=$old_IFS
}
# Came across NCBI rate throttling while trying to call AddPubDOI in parallel, so added a second attempt for "DOI NA",
# and also write the STDOUT output to STDERR, as this function will be used on a file (meaning STDOUT will get silenced)
# so you can see progress through the lines, as in:
# AddPubTableDOIs table.tsv > outputfile.tsv
# I'd recommend not overwriting the input file unless you're using version control.
function AddPubTableDOIs () {
  old_IFS=$IFS
  IFS=$'\n'
  for line in $(cat "$@"); do
    DOIresp=$(AddPubDOI "$line" 2>/dev/null)
    if [[ $DOIresp =~ 'DOI NA' ]]; then
      # try again in case it's just NCBI rate throttling, but just the once
      DOIresp2=$(AddPubDOI "$line" 2>/dev/null)
      if [[ $(echo "$DOIresp2" | awk 'BEGIN{FS="\t"};{print NF}' | uniq | wc -l) == '1' ]]; then
        echo "$DOIresp2"
        >&2 echo "$DOIresp"
      else
        DOIinput=$(echo "$line" | cutf 1-3)
        echo -e "$DOIinput\tDOI NA: Parse error"
        >&2 echo -e "$DOIinput\tDOI NA: Parse error"
      fi
    else
      if [[ $(echo "$DOIresp" | awk 'BEGIN{FS="\t"};{print NF}' | uniq | wc -l) == '1' ]]; then
        echo "$DOIresp"
        >&2 echo "$DOIresp"
      else
        DOIinput=$(echo "$line" | cutf 1-3)
        echo -e "$DOIinput\tDOI NA: Parse error"
        >&2 echo -e "$DOIinput\tDOI NA: Parse error"
      fi
    fi
  done
  IFS=$old_IFS
}
function AddPubDOI (){
  # if the line already has a fourth (DOI) column, pass it through untouched
  if [[ $(echo "$@" | cutf 4) != '' ]]; then
    echo "$@"
    return
  fi
  printf "$(echo "$@" | cutf 1-3)\t"
  thistitle=$(echo "$@" | cutf 3)
  if [[ $thistitle != 'Title' ]]; then
    thisauthor=$(echo "$@" | cutf 1)
    thisyear=$(echo "$@" | cutf 2)
    round1=$(pubmeddoimulti "$thistitle AND $thisauthor [AUTHOR]")
    round1hits=$(echo "$round1" | wc -l)
    if [[ "$round1hits" -gt '1' ]]; then
      round2=$(pubmeddoimulti "$thistitle AND $thisauthor [AUTHOR] AND ("$thisyear"[Date - Publication] : "$thisyear"[Date - Publication])")
      round2hits=$(echo "$round2" | wc -l)
      if [[ "$round2hits" -gt '1' ]]; then
        round3=$(
          xtracted=$(pubmedextractdoi "$@")
          xtractedtitles=$(echo "$xtracted" | cutf 3 | tr -dc "[A-Z][a-z]\n")
          alphatitles=$(striptoalpha "$xtractedtitles")
          thistitlealpha=$(striptoalpha "$thistitle")
          presearchIFS=$IFS
          IFS=$'\n'
          titlecounter="1"
          for searchtitle in $(echo "$alphatitles"); do
            (( titlecounter++ ))
            if [[ "$searchtitle" == *"$thistitlealpha"* ]]; then
              echo "$xtracted" | sed $titlecounter'q;d' | cutf 4
            fi
          done
          IFS=$presearchIFS
        )
        round3hits=$(echo "$round3" | wc -l)
        if [[ "$round3hits" -gt '1' ]]; then
          echo "ERROR multiple DOIs after 3 attempts to reduce - "$round3
        else
          echo $round3
        fi
      else
        echo $round2
      fi
    else
      echo $round1
    fi
  fi
}
function pmid2doirestful (){
  curleddoi=$(pmid2doi "$@")
  if [[ $curleddoi == '' ]]
  then
    echo "DOI NA"
  else
    echo "$curleddoi"
  fi
}
function mmrlit { cat ~/Dropbox/Y3/MMR/Essay/literature_table.tsv; }
function mmrlitedit { vim ~/Dropbox/Y3/MMR/Essay/literature_table.tsv; }
function mmrlitgrep (){ grep -i "$@" ~/Dropbox/Y3/MMR/Essay/literature_table_with_DOIs.tsv; }
function mmrlitdoi (){ mmrlitgrep "$@" | cut -d $'\t' -f 4 | tr -d '\n' | xclip -sel p; clipconfirm; }
function mmrlitdoicite (){ mmrlitgrep "$@" | cut -d $'\t' -f 4 | awk '{print "`r citet(\""$0"\")`"}' | tr -d '\n' | xclip -sel p; clipconfirm; }
# As well as processing a tab-separated file if passed to AddPubTableDOIs, AddPubDOI can be used to make a DOI
# search box if tied to a keyboard shortcut (creating a terminal, instructing it to start a bash shell with commands)
# NB: `xclip -sel p` saved to clipboard, but the clipboard was cleared upon closing this window.
# Installing the parcellite clipboard manager and ensuring it monitors the new
# clipboard entry ensures retention after the bash child process exits.
# Using a terminal profile with larger text called "Bigcommands", I tie the keyboard shortcut to:
# gnome-terminal -e "bash -c \"source /home/louis/.bashrc; read pmstr; getdoispaced "$pmstr" | tr -d '\n' | xclip -sel p; clipconfirm; parcellite -p > /dev/null; read dummyvar;\"" --geometry 80x2 --title="Search for a DOI:" --window-with-profile="Bigcommands"
# Optional extra: check if the spacebar was pressed at the end. If so, open the article webpage
# gnome-terminal -e "bash -c \"source /home/louis/.bashrc; read pmstr; getdoispaced "$pmstr" | tr -d '\n' | xclip -sel p; clipconfirm; parcellite -p > /dev/null; read -d'' -s -n1; if [[ $REPLY = ' ' ]]; then google-chrome "http://dx.doi.org/"$(xclip -o) > /dev/null 2>&1; fi;\"" --geometry 80x2 --title="Search for a DOI:" --window-with-profile="Bigcommands"
function cuts (){ cut -d ' ' -f "$@"; }
function getdoispaced (){
  if [[ $(echo "$@" | cuts 2) =~ [0-2][0-9]{3} ]]; then
    tabseppub=$(echo -e "$(echo $@ | cuts 1-2 | tr ' ' '\t')\t$(echo $@ | cuts 3-)")
    AddPubDOI "$tabseppub" | cutf 4
  fi
}
The main functions in the script are `AddPubDOI` and `AddPubTableDOIs` (I renamed the latter from the less descriptive title in the screenshot animation above), the former being executed for every line in the input table by the latter. Weird bug or programming-language feature, who knows which – you can’t use the traditional `while read variable; do function(variable); done < inputfile` construction to handle a file line by line, so I resorted to `cat` trickery. I blame Perl.
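For what it’s worth, the usual construction does work if the loop body is stopped from consuming the loop’s stdin – commands like `efetch` reading standard input are a common culprit for this symptom, though I can’t be sure that’s the bug here. A minimal sketch (the `process_line` function and input are placeholders):

```shell
# Redirecting the body's stdin from /dev/null stops it swallowing the
# remaining input lines (a classic "while read only runs once" fix)
process_line () { printf 'got: %s\n' "$1"; }
count=0
while IFS= read -r line; do
  process_line "$line" < /dev/null
  count=$((count+1))
done <<'EOF'
first line
second line
EOF
echo "$count"   # → 2
```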
- `cutf` is my shorthand to tell the `cut` command I want a specific column in a tab-separated file or variable.
- `striptoalpha` is a function I made here to turn paper titles into all-lowercase, squished-together strings of letters (no dashes, commas etc. that might get in the way of text comparison) – a really crude way of checking one name against another. This part of the script could easily be improved, but I was just sorting out one funny case; usually matching author and year and using a loose title match will be sufficient to find the matching Pubmed entry, for which a DOI can be found.
- `pubmed` chains together `esearch` to search Pubmed for the query, `efetch` to get the document (i.e. article) summaries as XML, and `xtract` to get the basic info. I don’t use this in my little pipeline setup; rather, I kept my options open and chose to get more information, and match within blocks of the XML for the DOI. It’s not so complicated to follow – as well as my code, there’s this example on Biostars.
- `pubmeddocsum` just does the first two of the steps above, providing full unparsed XML ‘docsums’.
- `pubmedextractdoi` gets date and DOI information as columns, then uses GNU awk to rearrange the columns in the output.
- `pubmeddoi` gives just the DOI column from said rearranged output.
- `pubmeddoimulti` has ‘multiple’ ways to try and get the DOI for an article matched from searching Pubmed: firstly from the DOI output, then by attempting to use the pmid2doi service output.
- `pmid2doimulti` does as for `pubmeddoimulti` but from a provided PMID.
- `pmid2doi` handles the pmid2doi.org response, `pmid2doincbi` the Entrez Direct side; both feed into `pmid2doimulti`.
Rookie’s disclaimer: I’m aware pipelines are supposed to contain more, um, pipes, but I can’t quite figure out an easy way to make these functions ‘pipe’ to one another, so I’m sticking with passing the output of one to the next as input (`"$@"` in bash script).
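For the record, the usual trick is to fall back to reading stdin when no arguments are given, which makes a function usable both as `fn value` and as `... | fn`. A sketch with invented names (not functions from the script above):

```shell
# Read from arguments if present, otherwise from stdin
argsorstdin () { if [ $# -gt 0 ]; then printf '%s\n' "$*"; else cat; fi; }

# Any function built on argsorstdin becomes pipeable
upcase () { argsorstdin "$@" | tr '[:lower:]' '[:upper:]'; }
```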
Update: the second file added to the GitHub gist has the code needed to tie this to a keyboard shortcut (I’m using Alt+Windows Key+P):
E.g. with Rolland et al. (2014) A Proteome-Scale Map of the Human Interactome Network:
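The network search aside, the reshaping `getdoispaced` performs on such a query string can be shown in isolation (`space2tsv` is a stand-in name for just that step; the real function passes the result on to `AddPubDOI`):

```shell
# "Author Year Title words..." -> a tab-separated Author/Year/Title line,
# the format AddPubDOI expects (cut fields 1, 2, and 3-to-end)
space2tsv () {
  printf '%s\t%s\t%s\n' \
    "$(echo "$@" | cut -d ' ' -f 1)" \
    "$(echo "$@" | cut -d ' ' -f 2)" \
    "$(echo "$@" | cut -d ' ' -f 3-)"
}

space2tsv "Rolland 2014 A Proteome-Scale Map of the Human Interactome Network"
```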
Update 2: the keybinding now lets you hit the spacebar to open the article in a web browser, and I rejigged the main function to write to a file, see comments in code. The end of the GitHub gist also has some functions I use to copy a DOI from one of the references in the table file created to the clipboard, including one to cite it for knitcitations Rmarkdown as mentioned above.