Quick wordclouds from PubMed abstracts – using PMID lists in R

Posted on September 26, 2016 by mareviv in R bloggers | 0 Comments

[This article was first published on R – Tales of R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Wordclouds are one of the most visually straightforward, compelling ways of displaying text info in a graph.

Of course, we have a lot of web pages (and even apps) that, given an input text, will plot you some nice tagclouds. However, when you need reproducible results, or getting done complex tasks -like combined wordclouds from several files-, a programming environment may be the best option.

In R, there are (as always), several alternatives to get this done, such as tagcloud and wordcloud.

For this script I used the following packages:

“RCurl” to retrieve a PMID list, stored in my GitHub account as a .csv file.
“RefManageR” and “plyr“ to retrieve and arrange PM records. To fetch the info from the inets, we’ll be using the PubMed API (free version, with some limitations).
Finally, “tm“, “SnowballC” to prepare the data and “wordcloud” to plot the wordcloud. This part of the script is based on this from Georeferenced.

One of the advantages of using RefManageR is that you can easily change the field from which you are importing from, and it usually works flawlessly with the PubMed API.

My biggest problem sources when running this script… download caps, busy hours, and firewalls!.

At the beginning of the gist, there is also a handy function that automagically downloads all needed packages for you.

To source the script, simply type in the R console:

library(devtools)
source_url("https://gist.githubusercontent.com/aurora-mareviv/697cbb505189591648224ed640e70fb1/raw/b42ac2e361ede770e118f217494d70c332a64ef8/pmid.tagcloud.R")

And there is the code:

Enjoy!