visualizing topic models with crosstalk

Jason Timm

8 months ago

[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

< name="viewport" content="width=device-width, initial-scale=1" />

Introduction

A simple post detailing the use of the crosstalk package to visualize and investigate topic model results interactively. As an example, we investigate the topic structure of correspondences from the Founders Online corpus – focusing on letters generated during the Washington Presidency, ca. 1789-1787.

Founders Online corpus

library(tidyverse)

I have scraped the entirety of the Founders Online corpus, and make it available as a collection of RDS files here. The Washington Presidency portion of the corpus is comprised of ~28K letters/correspondences, ~10.5 million words.

wash <- readRDS(filepath) %>%
  mutate(wc = tokenizers::count_words(text)) %>%
  filter(wc < 1000)

Extract named entities

In my experience, topic models work best with some type of supervision, as topic composition can often be overwhelmed by more frequent word forms. Simple frequency filters can be helpful, but they can also kill informative forms as well. Here, we focus on named entities using the spacyr package. .

#spacyr::spacy_install()
ent1 <- spacyr::entity_extract(spacyr::spacy_parse(wash0$text))

The resulting data structure, then, is a data frame in which each letter is represented by its constituent named entities.

ent2 <- ent1 %>%
  mutate(entity = tolower(entity)) %>%
  group_by(entity) %>%
  mutate(bign = length(unique(doc_id))) %>%
  ungroup() %>%
  count(doc_id, entity, bign) %>%
  filter(bign > 3, nchar(entity) > 4)

Build topic model

Next, we cast the entity-based text representations into a sparse matrix, and build a LDA topic model using the text2vec package. A 50 topic solution is specified. Model results are summarized and extracted using the PubmedMTK::pmtk_summarize_lda function, which is designed with text2vec output in mind.

dtm <- tidytext::cast_sparse(data = ent2,
                             row = doc_id,
                             column = entity,
                             value = n)

lda <- text2vec::LDA$new(n_topics = 50) 
fit <- lda$fit_transform(dtm, progressbar = F)
## INFO  [10:56:37.577] early stopping at 230 iteration 
## INFO  [10:56:39.022] early stopping at 30 iteration
tm_summary <- PubmedMTK::pmtk_summarize_lda(
  lda = lda, topic_feats_n = 15)

tSNE

Based on the topic-word-ditribution output from the topic model, we cast a proper topic-word sparse matrix for input to the Rtsne function.

tmat <- tidytext::cast_sparse(data = tm_summary$topic_word_dist,
                              row = topic_id,
                              column = feature,
                              value = beta)

set.seed(99)
tsne <- Rtsne::Rtsne(X = as.matrix(tmat), 
                     check_duplicates = T,
                     perplexity = 15)

tsne0 <- data.frame(topic_id = as.integer(rownames(tmat)), tsne$Y)

Crosstalk widget

Before getting into crosstalk, we filter the topic-word-ditribution to the top 10 loading terms per topic. Then we create SharedData objects. The group and key parameters specify where the action will be in the crosstalk widget.

x1 <- tm_summary$topic_word_dist %>%
  group_by(topic_id) %>%
  slice_max(order_by = beta, n = 10) %>%
  mutate(beta = round(beta, 3)) 

sd_points <- crosstalk::SharedData$new(tsne0, 
                                       group = "tm", 
                                       key = ~topic_id)
sd_features <- crosstalk::SharedData$new(x1, 
                                         group = "tm", 
                                         key = ~topic_id)

And then the widget. The user can hover on the topic tSNE plot to investigate terms underlying each topic.

library(plotly)
library(magrittr)
library(ggplot2)

p <- sd_points %>%
  
  ggplot(aes(x = X1, 
             y = X2,
             label = topic_id)) + 
  
  geom_hline(yintercept = 0, color = 'gray') +
  geom_vline(xintercept = 0, color = 'gray') +
  
  ggplot2::geom_point(size = 10, 
                      color = '#1a476f',
                      alpha = 0.5) +
  geom_text(size = 3) +
  theme_minimal() +
  theme(legend.position = 'none') 

p1 <- plotly::ggplotly(p) %>% 
  plotly::layout(showlegend = F,
                 autosize = T) %>%
  plotly::style(hoverinfo = 'none') %>%
  plotly::highlight(on = 'plotly_hover',
                    opacityDim = .75)

t1 <- sd_features %>%
  DT::datatable(rownames = FALSE,
                width = "100%",
                options = list(dom = 't',
                               pageLength = 10)) %>%
  
  DT::formatStyle(names(x1[,3]),
                  background = DT::styleColorBar(range(x1[,3]),
                                                 '#e76a53'),
                  backgroundSize = '80% 70%',
                  backgroundRepeat = 'no-repeat',
                  backgroundPosition = 'right')

Topic model summary of the Washington Presidency in letters

crosstalk::bscols (list(p1, t1))

Summary

For a stand-alone flexdashboard/html version of things, see this RPubs post.

To leave a comment for the author, please follow the link and comment on their blog: Jason Timm.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.