
Text analysis made too easy with the tm package

[This article was first published on is.R(), and kindly contributed to R-bloggers.]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today’s Gist takes the CNN transcript of the Denver Presidential Debate, converts paragraphs into a document-term matrix, and does the absolute most basic form of text analysis: a raw word count.
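
The Gist itself is linked at the end of this post. As a rough illustration of the steps just described, here is a minimal sketch, not the Gist's actual code. The three short `paragraphs` are hypothetical stand-ins for the debate transcript, and the cleaning steps (lower-casing, dropping punctuation and English stop words) are assumptions about reasonable preprocessing rather than anything the Gist is known to do.

```r
library(tm)

# Hypothetical stand-in text; the original post works from the CNN transcript.
paragraphs <- c(
  "The economy needs jobs and growth.",
  "Jobs, jobs, jobs: growth comes from small business.",
  "Health care costs affect the economy."
)

# Treat each paragraph as one document in a corpus.
corpus <- VCorpus(VectorSource(paragraphs))

# Assumed preprocessing: normalise case, strip punctuation and stop words.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Rows are documents (paragraphs), columns are terms.
dtm <- DocumentTermMatrix(corpus)

# Raw word count: sum each term's column across all documents.
word_counts <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(word_counts)
```

Note that `DocumentTermMatrix()` drops terms shorter than three characters by default, so very short words never reach the count. On a large corpus, `as.matrix()` can use a lot of memory; it is fine for a single transcript.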

There are actually quite a few steps in this process. The tm vignette makes them easier to follow, but you would do well to update R, re-install the relevant packages, and make sure you have a recent version of Java installed on your computer: this code has lots of dependencies.

Please keep in mind that this Gist is intended only to illustrate the basic functionality of the tm package. Text analysis is difficult to do well, and a term frequency scatter plot does not qualify as “done well.” At least it’s not a Wordle (the mullet of the internet?)

https://gist.github.com/4260167

To leave a comment for the author, please follow the link and comment on their blog: is.R().
