Constructing a Word Cloud for ICML 2015
[This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Word clouds have become a bit cliché, but I still think that they have a place in giving a high level overview of the content of a corpus. Here are the steps I took in putting together the word cloud for the International Conference on Machine Learning (2015).
- Extract the hyperlinks to the PDFs of all of the papers from the Conference Programme web site using a pipeline of grep and uniq.
- In R, extract the text from each of these PDFs using the Rpoppler package.
- Split the text for each page into lines and remove the first line (corresponding to the running title) from every page except the first.
- Intelligently handle word breaks and then concatenate lines within each document.
- Transform text to lower case then remove punctuation, digits and stop words using the tm package.
- Compile the words for all of the documents into a single data.frame.
- Using the dplyr package count the number of times that each word occurs across the entire corpus as well as the number of documents which contain that word. This is what the top end of the resulting data.frame looks like:
> head(word.counts) word count doc 1 can 6496 270 2 learning 5818 270 3 algorithm 5762 252 4 data 5687 254 5 model 4956 242 6 set 4040 269
- Finally, construct a word cloud with the tagcloud package using the word count to weight the word size and the document count to determine grey scale.
The first cloud above contains the top 300 words. The larger cloud below is the top 1000 words.
The post Constructing a Word Cloud for ICML 2015 appeared first on Exegetic Analytics.
To leave a comment for the author, please follow the link and comment on their blog: Exegetic Analytics » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.