LDAvis Show Case on R-Bloggers
Text mining is a new challenge for machine learning practitioners. The increased interest in text mining comes from the growing number of internet users and the rapid growth of internet data, of which an estimated 80% is text. Extracting information from articles, news, posts and comments has become a desirable skill, but what is even more needed are tools for diagnosing and visualizing text mining models. Such visualizations make it easier to understand the insights a model provides and offer a simple interface for presenting your research results to a wider audience. In this post I present the Latent Dirichlet Allocation text mining model for classifying texts into topics, and the great LDAvis package for interactive visualizations of topic models. All this on R-Bloggers posts!
LDAvis is not my package. It was created by cpsievert, and this post's code for the LDAvis preparations is mostly based on his LDAvis tutorial: A topic model for movie reviews.
LDA overview and text preparation
From Wikipedia
In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.
For this post I have used articles from R-Bloggers. They can be downloaded from this repository. The data harvesting process is explained at the end of this post.
Normally I would use the LDA() function from the topicmodels package to fit an LDA model, because its input can be of class DocumentTermMatrix, an object from the tm package. A DocumentTermMatrix object is very convenient for working with text data (check this Norbert Ryciak's post) because the tm_map function can be applied to all documents for stop words removal, lowering capital letters and removing words that did not occur in x % of documents. I haven't seen LDAvis examples for models generated with the topicmodels package, so we will use a traditional approach to text processing. The stemming and stopwords removal were performed during data collection, which is described at the end of the post.
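For reference, the tm + topicmodels route mentioned above could be sketched roughly as follows. This is a minimal sketch, not the code used for this post; `posts_text` is an assumed character vector of article texts, and the sparsity threshold is only illustrative.

```r
library(tm)
library(topicmodels)

# Build a corpus and apply the usual tm_map cleaning steps.
corpus <- Corpus(VectorSource(posts_text))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower capital letters
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop words removal
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix; drop terms missing from more than 99% of documents.
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)

# Fit an LDA model with an illustrative number of topics.
lda_fit <- LDA(dtm, k = 20, control = list(seed = 1234))
```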
The lda.collapsed.gibbs.sampler() function from the lda package has an inconvenient input format (compared to LDA() from the topicmodels package), so I basically used cpsievert's snippets.
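The conversion below follows the format used in cpsievert's tutorial: each document becomes a two-row integer matrix of 0-based vocabulary indices and counts. The object name `doc_words` is an assumption here, standing for a list of tokenized posts.

```r
# `doc_words` is assumed to be a list of character vectors (tokens per post).
term_table <- sort(table(unlist(doc_words)), decreasing = TRUE)
vocab <- names(term_table)

# Convert one tokenized document to the lda package's sparse format.
get_terms <- function(x) {
  index <- match(x, vocab)
  index <- index[!is.na(index)]
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
documents <- lapply(doc_words, get_terms)
```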
Fitting the model: from the lda package documentation
… [this function] takes sparsely represented input documents, performs inference, and returns point estimates of the latent parameters using the state at the last iteration of Gibbs sampling.
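A sketch of the fitting call itself, using the `documents` and `vocab` objects prepared above; the number of topics, iterations and the Dirichlet priors below are illustrative values, not the ones used for the published model.

```r
library(lda)

K     <- 20     # number of topics (illustrative)
alpha <- 0.02   # document-topic Dirichlet prior (illustrative)
eta   <- 0.02   # topic-term Dirichlet prior (illustrative)

set.seed(357)
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab,
                                   num.iterations = 5000,
                                   alpha = alpha, eta = eta,
                                   compute.log.likelihood = TRUE)
```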
The computations took very long, so in case you would like to get the model faster, I have archived it on GitHub with the help of the archivist package. You can easily load this model into R with:
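A sketch of that load, with a placeholder repository path and hash, since the exact archivist address belongs to the post's GitHub repository:

```r
library(archivist)
# The "user/repository/md5hash" path below is a placeholder, not the real address.
model <- aread("user/repository/md5hash")
```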
LDAvis use case
If you google it properly you'll find out that the LDAvis description is:
Tools to create an interactive web-based visualization of a topic model that has been fit to a corpus of text data using Latent Dirichlet Allocation (LDA). Given the estimated parameters of the topic model, it computes various summary statistics as input to an interactive visualization built with D3.js that is accessed via a browser. The goal is to help users interpret the topics in their LDA topic model.
Detailed vignette about LDAvis input and output can be found here.
To visualize the result using LDAvis, we'll need estimates of the document-topic distributions, which we denote by the D×K matrix theta, and the set of topic-term distributions, which we denote by the K×W matrix phi.
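Following cpsievert's tutorial, those estimates and the LDAvis call can be sketched as below; `fit`, `alpha`, `eta`, `documents`, `vocab` and `term_table` come from the earlier sketches, and the output directory name is an assumption.

```r
library(LDAvis)

# Point estimates from the Gibbs sampler state, smoothed by the priors.
theta <- t(apply(fit$document_sums + alpha, 2, function(x) x / sum(x)))  # D x K
phi   <- t(apply(t(fit$topics) + eta,       2, function(x) x / sum(x)))  # K x W

doc_length     <- sapply(documents, function(d) sum(d[2, ]))  # tokens per document
term_frequency <- as.integer(term_table)                      # corpus-wide term counts

json <- createJSON(phi = phi, theta = theta,
                   doc.length = doc_length,
                   vocab = vocab,
                   term.frequency = term_frequency)
serVis(json, out.dir = "ldavis", open.browser = FALSE)
```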
The result is published under this link http://r-addict.com/r-bloggers-harvesting/ where you can check Intertopic Distance Map (via multidimensional scaling) and top N relevant terms for a topic.
Data Harvesting
Below is the code I have used for the R-Bloggers web scraping. At the start I extracted all links to posts from the first 100 main pages of R-Bloggers. Then I created an SQLite database with an empty table called posts.
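A rough sketch of what that scraping and database setup could look like, using rvest and RSQLite; the pagination URL, CSS selector and table schema are assumptions, not the exact code used for this post.

```r
library(rvest)
library(DBI)
library(RSQLite)

# Collect post links from the first 100 main pages of R-Bloggers.
links <- unlist(lapply(1:100, function(page) {
  url  <- paste0("https://www.r-bloggers.com/page/", page, "/")  # assumed URL scheme
  html <- read_html(url)
  html_attr(html_nodes(html, "h2 a"), "href")                    # assumed selector
}))

# Create an SQLite database with an empty `posts` table (assumed schema).
con <- dbConnect(SQLite(), "r_bloggers.db")
dbExecute(con, "CREATE TABLE posts (title TEXT, link TEXT, author TEXT,
                                    date TEXT, text TEXT)")
```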
This table has been used to store information like: post title, post link, post author, date of publication and the whole post text. For the text I extracted only words (with the stringi package) that have length greater than 1 and applied the tolower function to get rid of capital letters. Stop words removal was done thanks to tm::removeWords(). For stemming I used RWeka::LovinsStemmer. I did not perform full lemmatization, as I found it troublesome in R (couldn't install this and this took too long).
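The per-post cleaning described above could be sketched with a small helper; the function name and the exact composition of steps are assumptions, only the packages and functions match the ones named in the text.

```r
library(stringi)
library(tm)
library(RWeka)

clean_text <- function(post_text) {
  words <- unlist(stri_extract_all_words(post_text))  # keep only words
  words <- words[stri_length(words) > 1]              # words longer than one character
  words <- tolower(words)                             # drop capital letters
  text  <- removeWords(paste(words, collapse = " "),
                       stopwords("english"))          # stop words removal
  stems <- LovinsStemmer(unlist(stri_extract_all_words(text)))  # Lovins stemming
  paste(stems, collapse = " ")
}
```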
If you have any questions or comments, please feel free to share your ideas on the Disqus panel below.
Also, if you know how to web-scrape the number of shares per R-Bloggers article, then I would love to hear your feedback, as I am wondering what the correlation is between Hadley Wickham's appearance in a post and its number of shares.