The R code for this tutorial on Methods of Distributional Semantics in R can be found in the accompanying GitHub repository. There you will find the .R, .Rmd, and .html files corresponding to each part of the tutorial (e.g. DistSemanticsBelgradeR-Part2.R, DistSemanticsBelgradeR-Part2.Rmd, and DistSemanticsBelgradeR-Part2.html for Part 2). All auxiliary files are also uploaded to the repository.
Following my Methods of Distributional Semantics in R BelgradeR Meetup with Data Science Serbia, organized at Startit Center, Belgrade, on November 30, 2016, several people asked me for the R code used in the analysis of William Shakespeare's plays that I presented there. I have decided to keep developing that code, in order to turn the examples from the Meetup into a more or less complete and comprehensible text-mining tutorial with {tm}, {openNLP}, and {topicmodels} in R. All files in this GitHub repository are a product of that work.
Part 2 introduces named entity recognition with {openNLP}, an Apache project in Java interfaced by this nice R package, which in turn relies on {NLP} classes. We will try to make machine learning (the MaxEnt models offered in {openNLP}) figure out the characters in Shakespeare's plays, a rather difficult task given that the learning algorithms at our disposal were trained on contemporary English corpora.
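To give you a feel for the basic workflow, here is a minimal sketch of person-name recognition with {openNLP}; it assumes the openNLPmodels.en package (from the datacube repository) is installed, and the quoted sentence is a toy example only, not the corpus used in the tutorial.

library(NLP)
library(openNLP)
# the pre-trained English models live in openNLPmodels.en:
# install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/")

text <- as.String("Horatio and Marcellus wait for Hamlet on the battlements.")

# annotators must be applied in order: sentences, then words, then entities
sentAnnotator <- Maxent_Sent_Token_Annotator()
wordAnnotator <- Maxent_Word_Token_Annotator()
personAnnotator <- Maxent_Entity_Annotator(kind = "person")

annotations <- annotate(text, list(sentAnnotator, wordAnnotator, personAnnotator))

# keep only the spans tagged as entities (here: persons) and extract them
entities <- annotations[annotations$type == "entity"]
text[entities]

Note how the entity annotator cannot run on its own: it needs the sentence and word token annotations already in place, which is exactly the kind of pipeline dependency this tutorial keeps running into.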
Figure: The accuracy of character recognition from Shakespeare's comedies, tragedies, and histories; the black dashed line is the overall density. These results are not realistic (an explanation is given in the respective .Rmd and .html files).
What I really want to show you here is how tricky and difficult serious text-mining can be, and to help you by exemplifying some of the steps that are necessary to ensure the consistency of the results you expect. The text-mining pipelines developed here are in no sense perfect or complete; they are meant to demonstrate important problems and propose solutions, rather than to provide copy-and-paste-ready chunks for future re-use. In essence, except where a standardized information extraction + text-mining pipeline is being developed (a situation where, by assumption, one periodically processes large text corpora, e.g. web-scraped news and other media reports, from various sources and in various formats, and where one simply needs to learn to live with approximations), every text-mining study needs a pipeline of its own. Restlessly chaining tm_map() calls to the various content transformers from {tm}, while ignoring the necessary changes in parameters and the different content-specific transformations required – of which {tm} supports only a few – will simply not do.
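As an illustration of that last point, here is a minimal sketch of a content-specific transformation that {tm} does not ship with. The removeStageDirections helper is hypothetical, written for this toy example only; the larger point is that the order of transformations matters (e.g. lowercasing before a named entity recognition step would destroy the capitalization cues the MaxEnt models rely on).

library(tm)

# a toy two-document corpus standing in for the Shakespeare texts
corpus <- VCorpus(VectorSource(c("Enter HAMLET, reading on a book.",
                                 "HAMLET. Words, words, words.")))

# a content-specific transformation {tm} does not provide out of the box:
# strip stage directions, i.e. lines beginning with Enter/Exit/Exeunt
removeStageDirections <- content_transformer(function(x) {
  gsub("(?m)^(Enter|Exit|Exeunt)\\b.*$", "", x, perl = TRUE)
})

corpus <- tm_map(corpus, removeStageDirections)
corpus <- tm_map(corpus, stripWhitespace)
# lowercase only *after* any step that needs capitalization intact
corpus <- tm_map(corpus, content_transformer(tolower))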
Don’t get hooked on the results presented in the {ggplot2} figure above; {openNLP} is not that successful at recognizing personal names in Shakespeare’s plays (in spite of the fact that it works great for contemporary English documents). I have helped it a bit, by doing something that is not applicable to real-world situations; go take a look at the code in this GitHub repository.
See you soon.