Distributional Semantics in R: Part 1 {tm} classes + read/write

The Exactness of Mind

5 years ago

[This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers.]


The R code for this tutorial on Methods of Distributional Semantics in R is available in the accompanying GitHub repository.

Following my Methods of Distributional Semantics in R talk at the BelgradeR Meetup with Data Science Serbia, held at Startit Center, Belgrade, on 11/30/2016, several people asked me for the R code used in the analysis of William Shakespeare’s plays that I presented. I decided to keep developing the code from the Meetup and to turn the examples I showed there into a more or less complete and comprehensible text-mining tutorial with {tm}, {openNLP}, and {topicmodels} in R. All files in this GitHub repository are a product of that work.

The idea here is to provide an overview of selected R packages and functions for text-mining and modeling in distributional semantics. Instead of presenting functions and packages piece by piece, I have decided to develop a full text-mining pipeline that combines the essential steps in the order one would actually follow them to arrive at a useful data science product: data wrangling, integrity checks, text pre-processing, and modeling.

The first notebook, Part 1: The {tm} structures for text-mining in R, introduces the classes provided by the {tm} package and shows you how to index a text corpus with metadata prior to modeling and analytics. It also introduces the essential read and write operations (i.e. VCorpus formation) from {tm}.
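As a rough illustration of the kind of operations Part 1 deals with, here is a minimal sketch that builds a `VCorpus`, attaches metadata, writes the documents to disk, and reads them back. It is not the code from the notebook: the two short quotations, the `genre` tag, and the `corpus_demo` directory are assumptions made up for this example, and the real tutorial works with the full plays.

```r
library(tm)

# Toy stand-in texts (hypothetical; the tutorial uses the complete plays)
texts <- c(
  hamlet  = "To be, or not to be, that is the question.",
  macbeth = "Tomorrow, and tomorrow, and tomorrow, creeps in this petty pace."
)

# Form a volatile (in-memory) corpus from a character vector
corpus <- VCorpus(VectorSource(texts))

# Document-level metadata: give every document an explicit id
for (i in seq_along(corpus)) {
  meta(corpus[[i]], "id") <- names(texts)[i]
}

# Indexed metadata: one value per document, stored alongside the corpus
# ("genre" is an illustrative tag, not one taken from the tutorial)
meta(corpus, "genre") <- c("tragedy", "tragedy")

# Write each document to a plain-text file in a temporary directory
out_dir <- file.path(tempdir(), "corpus_demo")
dir.create(out_dir, showWarnings = FALSE)
writeCorpus(corpus, path = out_dir,
            filenames = paste0(names(texts), ".txt"))

# Read the files back into a new VCorpus
corpus_back <- VCorpus(
  DirSource(out_dir, encoding = "UTF-8"),
  readerControl = list(reader = readPlain, language = "en")
)

# Inspect the result
print(corpus_back)
print(meta(corpus))
```

Note that `writeCorpus()` stores only the document text, so metadata such as `genre` has to be re-attached after reading the files back.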

The forthcoming Part 2 of this tutorial will cover Entity Recognition with {openNLP}. We will check how well machine learning can tell which characters appear in which of Shakespeare’s plays. In Part 3 we will deal with text pre-processing with {tm}, while Part 4 introduces topic modeling with Latent Dirichlet Allocation. Part 5, finally, will present an analytical exploration of the topic model.

A semantic network of Shakespeare’s characters, drawn with {igraph} from an LDA model previously fitted with {topicmodels}.

The video of the Meetup that motivated me to develop this tutorial is on YouTube, although there are no English subtitles yet.

The exercise uses the complete plays of William Shakespeare, kindly provided by the Massachusetts Institute of Technology on its The Complete Works of William Shakespeare pages.

Stay tuned for more text-mining in R.
