Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
A simple post detailing the use of the crosstalk
package to visualize and investigate topic model results interactively. As an example, we investigate the topic structure of correspondences from the Founders Online corpus – focusing on letters generated during the Washington Presidency, ca. 1789-1787.
Founders Online corpus
library(tidyverse)
I have scraped the entirety of the Founders Online corpus, and make it available as a collection of RDS files here. The Washington Presidency portion of the corpus is comprised of ~28K letters/correspondences, ~10.5 million words.
wash <- readRDS(filepath) %>% mutate(wc = tokenizers::count_words(text)) %>% filter(wc < 1000)
Extract named entities
In my experience, topic models work best with some type of supervision, as topic composition can often be overwhelmed by more frequent word forms. Simple frequency filters can be helpful, but they can also kill informative forms as well. Here, we focus on named entities using the spacyr
package.
.
#spacyr::spacy_install() ent1 <- spacyr::entity_extract(spacyr::spacy_parse(wash0$text))
The resulting data structure, then, is a data frame in which each letter is represented by its constituent named entities.
ent2 <- ent1 %>% mutate(entity = tolower(entity)) %>% group_by(entity) %>% mutate(bign = length(unique(doc_id))) %>% ungroup() %>% count(doc_id, entity, bign) %>% filter(bign > 3, nchar(entity) > 4)
Build topic model
Next, we cast the entity-based text representations into a sparse matrix, and build a LDA topic model using the text2vec
package. A 50 topic solution is specified. Model results are summarized and extracted using the PubmedMTK::pmtk_summarize_lda
function, which is designed with text2vec
output in mind.
dtm <- tidytext::cast_sparse(data = ent2, row = doc_id, column = entity, value = n) lda <- text2vec::LDA$new(n_topics = 50) fit <- lda$fit_transform(dtm, progressbar = F) ## INFO [10:56:37.577] early stopping at 230 iteration ## INFO [10:56:39.022] early stopping at 30 iteration tm_summary <- PubmedMTK::pmtk_summarize_lda( lda = lda, topic_feats_n = 15)
tSNE
Based on the topic-word-ditribution
output from the topic model, we cast a proper topic-word sparse matrix for input to the Rtsne
function.
tmat <- tidytext::cast_sparse(data = tm_summary$topic_word_dist, row = topic_id, column = feature, value = beta) set.seed(99) tsne <- Rtsne::Rtsne(X = as.matrix(tmat), check_duplicates = T, perplexity = 15) tsne0 <- data.frame(topic_id = as.integer(rownames(tmat)), tsne$Y)
Crosstalk widget
Before getting into crosstalk
, we filter the topic-word-ditribution
to the top 10 loading terms per topic. Then we create SharedData
objects. The group
and key
parameters specify where the action will be in the crosstalk
widget.
x1 <- tm_summary$topic_word_dist %>% group_by(topic_id) %>% slice_max(order_by = beta, n = 10) %>% mutate(beta = round(beta, 3)) sd_points <- crosstalk::SharedData$new(tsne0, group = "tm", key = ~topic_id) sd_features <- crosstalk::SharedData$new(x1, group = "tm", key = ~topic_id)
And then the widget. The user can hover on the topic tSNE plot to investigate terms underlying each topic.
library(plotly) library(magrittr) library(ggplot2) p <- sd_points %>% ggplot(aes(x = X1, y = X2, label = topic_id)) + geom_hline(yintercept = 0, color = 'gray') + geom_vline(xintercept = 0, color = 'gray') + ggplot2::geom_point(size = 10, color = '#1a476f', alpha = 0.5) + geom_text(size = 3) + theme_minimal() + theme(legend.position = 'none') p1 <- plotly::ggplotly(p) %>% plotly::layout(showlegend = F, autosize = T) %>% plotly::style(hoverinfo = 'none') %>% plotly::highlight(on = 'plotly_hover', opacityDim = .75) t1 <- sd_features %>% DT::datatable(rownames = FALSE, width = "100%", options = list(dom = 't', pageLength = 10)) %>% DT::formatStyle(names(x1[,3]), background = DT::styleColorBar(range(x1[,3]), '#e76a53'), backgroundSize = '80% 70%', backgroundRepeat = 'no-repeat', backgroundPosition = 'right')
Topic model summary of the Washington Presidency in letters
crosstalk::bscols (list(p1, t1))
Summary
For a stand-alone flexdashboard
/html version of things, see this RPubs post.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.