
Distributional Semantics in R: Part 2 Entity Recognition w. {openNLP}

[This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers.]

The R code for this tutorial on Methods of Distributional Semantics in R is found in the respective GitHub repository. There you will find the .R, .Rmd, and .html files corresponding to each part of this tutorial (e.g. DistSemanticsBelgradeR-Part2.R, DistSemanticsBelgradeR-Part2.Rmd, and DistSemanticsBelgradeR-Part2.html for Part 2). All auxiliary files are also uploaded to the repository.

Following my Methods of Distributional Semantics in R BelgradeR Meetup with Data Science Serbia, organized in Startit Center, Belgrade, 11/30/2016, several people asked me for the R code used for the analysis of William Shakespeare’s plays that was presented there. I have decided to continue developing the code that I used during the Meetup, in order to turn the examples I showed then into a more or less complete and comprehensible text-mining tutorial with {tm}, {openNLP}, and {topicmodels} in R. All files in this GitHub repository are a product of that work.

Part 2 introduces named entity recognition with {openNLP}, an Apache project in Java interfaced by this nice R package that, in turn, relies on {NLP} classes. We will try to make machine learning (the MaxEnt models offered in {openNLP}) figure out the characters from Shakespeare’s plays, a quite difficult task given that the learning algorithms at our disposal were trained on contemporary English corpora.
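To give a flavour of what this looks like in practice, here is a minimal sketch of an {openNLP} person-name annotation pipeline. This is not the exact code from the repository: the example sentence is made up, and it assumes the openNLPmodels.en model package is installed (available from the datacube repository, as noted in the comments).

```r
library(NLP)
library(openNLP)
# the pre-trained English MaxEnt models live in openNLPmodels.en:
# install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/")

s <- as.String("Hamlet speaks to Horatio on the walls of Elsinore.")

# entity annotators need sentence and word token annotations first
annotators <- list(Maxent_Sent_Token_Annotator(),
                   Maxent_Word_Token_Annotator(),
                   Maxent_Entity_Annotator(kind = "person"))
a <- annotate(s, annotators)

# keep only the spans that the person-name model marked as entities
kinds <- sapply(a$features,
                function(f) if (is.null(f$kind)) NA_character_ else f$kind)
s[a[!is.na(kinds) & kinds == "person"]]
```

Whether the pre-trained model actually tags names like “Hamlet” and “Horatio” in a given sentence is exactly the kind of thing that becomes unreliable on Early Modern English; that fragility is the point of this part of the tutorial.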

Figure: the accuracy of character recognition from Shakespeare’s comedies, tragedies, and histories; the black dashed line is the overall density. The results are not realistic (an explanation is given in the respective .Rmd and .html files).
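If you want to reproduce a figure of this shape, here is a sketch with simulated, purely hypothetical accuracy values; the real per-character accuracies are computed in the repository code.

```r
library(ggplot2)

# purely illustrative data: per-character recognition accuracy,
# with the play category each character comes from
set.seed(1)
accFrame <- data.frame(
  accuracy = c(rbeta(50, 8, 2), rbeta(50, 6, 3), rbeta(50, 7, 2)),
  category = rep(c("Comedy", "Tragedy", "History"), each = 50)
)

ggplot(accFrame, aes(x = accuracy, fill = category)) +
  geom_density(alpha = 0.3) +
  # overall density across all plays, drawn as a black dashed line
  geom_density(aes(x = accuracy), inherit.aes = FALSE,
               colour = "black", linetype = "dashed") +
  theme_minimal()
```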

What I really want to show you here is how tricky and difficult serious text-mining can be, and to help you by exemplifying some of the steps needed to ensure the consistency of the results you expect. The text-mining pipelines developed here are in no sense perfect or complete; they are meant to demonstrate important problems and propose solutions rather than to provide copy-and-paste-ready chunks for future re-use. In essence, except where a standardized information extraction + text-mining pipeline is being developed (a situation where, by assumption, one periodically processes large text corpora, e.g. web-scraped news and other media reports, from various sources and in various formats, and where one simply needs to learn to live with approximations), every text-mining study will need a specific pipeline of its own. Restlessly chaining tm_map() calls to the various content_transformers from {tm}, while ignoring the necessary changes in parameters and the different content-specific transformations – of which {tm} supports only a few – will simply not do.
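To make that concrete, here is a small sketch of one content-specific transformation written as a custom content_transformer. The stage-direction rule and the sample line are invented for illustration; a real Shakespeare pipeline needs many more rules of this kind.

```r
library(tm)

corpus <- VCorpus(VectorSource(
  "ACT I. SCENE I. Elsinore. Enter BARNARDO and FRANCISCO, two sentinels."
))

# a content-specific transformer: drop stage directions such as "Enter ..." -
# something no stock {tm} transformation will do for you
stripStageDirections <- content_transformer(function(x) {
  gsub("\\bEnter\\b[^.]*\\.", "", x)
})

corpus <- tm_map(corpus, stripStageDirections)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

writeLines(as.character(corpus[[1]]))
```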

Don’t get hooked on the results presented in the {ggplot2} figure above; {openNLP} is not that successful at recognizing personal names from Shakespeare’s plays (in spite of the fact that it works great for contemporary English documents). I have helped it a bit by doing something that is not applicable to real-world situations; go take a look at the code in this GitHub repository.

See you soon.
