A SHORT COURSE IN COMPUTATIONAL-CORPUS LINGUISTICS – using the R library text2vec
– with a focus on working with (a controlled vocabulary of) multi-word expressions. Here we consider a Twitter corpus comprised of all tweets generated by the 535 voting members of the US Congress during the second session of the 116th Congress.
More specifically, we consider the usage of COVID19-related terms as a function of both time and party affiliation; we also investigate the conceptual relatedness of COVID19-relevant terms using a GloVe model and multi-dimensional scaling.
A cache of scalable & efficient methodologies for some common corpus-based tasks.
A COVID19 lexicon
library(tidyverse)

tweets_dir <- '/home/jtimm/jt_work/GitHub/git_projects/us_lawmaker_tweets_2020/'
covid_dir <- '/home/jtimm/jt_work/GitHub/git_projects/A-covid19-lexicon/'

setwd(covid_dir)
dictionary <- readxl::read_xlsx('covid_glossary_w_variants.xlsx') %>%
  filter(category != 'race-ethnicity') %>%
  ungroup()
I have collated some COVID19-related terms from a few resources, most notably, this Yale Medicine glossary. Per this resource, each term has been categorized as one of the following:
unique(dictionary$category)

## [1] "cv"                "interventions"     "medical_response"
## [4] "prevention"        "socio-political"   "spread_of_disease"
## [7] "transmission"
As I have added terms, I have tried to fit them within this classification framework. I have also added a socio-political category to capture some of the civil-liberties-based rhetoric/protesting happening in the US in response to stay-at-home orders, stimulus legislation, etc. The COVID19 vocabulary is available as an xlsx file. A good start, but one that could certainly be developed further.
The table below illustrates the structure of the vocabulary for two COVID19-related concepts: ANTIVIRAL and HAND-HYGIENE. The descriptor_name column represents the higher-level concept; the term_name column reflects the different (inflectional or orthographical) ways the concept can manifest in text. The actual form of the descriptor/concept is arbitrary.
So, as we move towards identifying/extracting COVID19-related terms from Twitter text, this vocabulary gives us the option of aggregating over terms to the higher-level concept (or descriptor). Some academic fields refer to this process as normalization.
dictionary %>%
  filter(descriptor_name %in% c('antiviral', 'hand-hygiene')) %>%
  group_by(category, descriptor_name) %>%
  summarize(term_names = paste0(term_name, collapse = ' | ')) %>%
  DT::datatable(rownames = FALSE, options = list(dom = 't'))
Congressional Twitter Corpus (2020)
Again, our data set is a Twitter corpus comprised of all tweets generated by the 535 voting members of the US Congress during the second session of the 116th Congress. Code for extracting/building/updating the corpus using the R package rtweet is available here, and the actual corpus as TIF/xlsx is available here. Updated 5/26/20.
setwd(paste0(tweets_dir, 'tweets'))
tweets <- readxl::read_xlsx('us_lawmaker_tweets_full_2020-05-26.xlsx') %>%
  mutate(created_at = as.Date(created_at, format = "%Y-%B-%d"))
Corpus composition:
data.frame(tweets = format(nrow(tweets), big.mark = ','),
           tokens = format(sum(tokenizers::count_words(tweets$text)), big.mark = ',')) %>%
  knitr::kable()
tweets | tokens |
---|---|
157,420 | 5,650,645 |
Next, we quickly grab some details about US lawmakers from the united states project. The Twitter corpus and lawmaker detail data sets can then be joined via Twitter handle.
leg_dets <- 'https://theunitedstates.io/congress-legislators/legislators-current.csv'

twitters <- read.csv(url(leg_dets), stringsAsFactors = FALSE) %>%
  rename(state_abbrev = state, district_code = district)

tweets1 <- tweets %>%
  mutate(screen_name = toupper(screen_name)) %>%
  left_join(twitters %>% mutate(twitter = toupper(twitter)),
            by = c('screen_name' = 'twitter'))
Some sample tweets:
The text2vec framework for NLP

text2vec is a beast of a text analysis R library. Here, we walk through the building of some common text structures relevant to many downstream applications – using our congressional Twitter corpus. As text2vec implements R6 objects (a mystery to me), the framework is a bit funky. So we present some hacks here, specifically for working with multi-word expressions – in the larger context of building document-term matrices, term-co-occurrence matrices, GloVe models & co-occurrence-based graph structures.

The ultimate goal is to investigate (1) some historical- and party-affiliation-based variation in the use of COVID19-related terms on Twitter, and (2) the conceptual relatedness of COVID19-related terms.
Tokens & tokenizers
text2vec, like other text analysis frameworks, operates on a token object, which for a single document/tweet looks like the following:
tokenizers::tokenize_ptb(tweets$text[2], lowercase = TRUE)

## [[1]]
##  [1] "ohioans"           ":"                 "request"
##  [4] "a"                 "mail-in"           "ballot"
##  [7] "today"             "from"              "the"
## [10] "secretary"         "of"                "state"
## [13] "to"                "ensure"            "your"
## [16] "vote"              "is"                "counted"
## [19] "in"                "ohio"              "'s"
## [22] "primary"           "election."         "the"
## [25] "deadline"          "to"                "postmark"
## [28] "your"              "ballot"            "is"
## [31] "monday."           "https"             ":"
## [34] "//t.co/gkdascowqc"
Tokenization, even for English, is a non-trivial task. The tokenize_ptb function from the tokenizers package (based on the Penn Treebank model) is pretty good. But there are still two instances above, eg, in which sentence-final punctuation is not split off from the word: election. and monday. So, when we go to build word-level models, election and election., eg, will be treated as distinct forms.
This bothers me. The code below sorts this and other issues out. Resulting/re-built text can then be fed to any simple space-based tokenizer, and things will be clean. Tokenize > clean tokens > rebuild text > re-tokenize.
## Tokenize --
t1 <- tokenizers::tokenize_ptb(tweets$text, lowercase = TRUE)

## Separate punctuation from adjacent characters --
t2 <- lapply(t1, gsub,
             pattern = '([a-z0-9])([[:punct:]])',
             replacement = '\\1 \\2')
t3 <- lapply(t2, gsub,
             pattern = '([[:punct:]])([a-z0-9])',
             replacement = '\\1 \\2')

## Re-build text --
t4 <- lapply(t3, paste0, collapse = ' ')
tweets$word_text <- unlist(t4)
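A quick sanity check (a minimal sketch; the second tweet in your copy of the corpus may differ): re-tokenizing the cleaned, re-built version of the tweet shown above with a simple space tokenizer should now return election and monday with their sentence-final periods as separate tokens.

## Before: PTB tokens for the second tweet (as above) --
tokenizers::tokenize_ptb(tweets$text[2], lowercase = TRUE)[[1]]

## After: space-tokenizing the cleaned/re-built text of the same tweet --
text2vec::space_tokenizer(tweets$word_text[2])[[1]]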
Multi-word expressions & controlled vocabularies
Units of meaning often (ie, almost always) span multiple words and multiple grammatical categories. Here we briefly consider some supervised approaches to tricking tokenizers (and specifically text2vec) into treating a controlled vocabulary of multi-word expressions as single units of meaning.
§ Some multi-word hacks
The spelling & inflectional variants of the COVID19-related concept FLATTEN THE CURVE are presented below:
term_names: flatten the curve | flatten_the_curve | flatten-the-curve | flattening the curve | flattening_the_curve | flattening-the-curve | flatteningthecurve | flattenthecurve
So, if a lawmaker on Twitter refers to the concept FLATTEN THE CURVE as flattenthecurve, without any spaces (& presumably prefixed with a hashtag), a space-based (or word-based) tokenizer will do right by the analyst investigating multi-word expressions. The same goes for flatten-the-curve and flatten_the_curve.
The form flatten the curve, however, will be tokenized as flatten, the, and curve, which is not helpful. Basically, we want to phrasify these three individual tokens as a single token, such that in downstream applications flatten the curve and flattenthecurve, eg, are (or can be) treated as instantiations of the same conceptual category.
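To make the goal concrete, a minimal illustration (a sketch only; per the code further below, phrasified tokens are joined with a space separator):

## A space-tokenized tweet fragment ...
c('time', 'to', 'flatten', 'the', 'curve')

## ... should, once phrasified, carry the concept as a single token, eg:
c('time', 'to', 'flatten the curve')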
The Collocations function/model from the text2vec package enables an unsupervised approach to identifying multi-word expressions, and results can be used to update token objects such that flatten the curve becomes flatten-the-curve, provided the model identifies flatten + the + curve as an expression.
Here, however, we are interested in a supervised (or controlled) approach, ie, we have our own multi-word lexicon of COVID19-related terms that we want phrasified. text2vec does not provide a straightforward way to do this, so here we present a simple (albeit extended) hack.
multi_word_expressions <- subset(dictionary, grepl(' ', term_name))

## separator used to join phrasified tokens --
sep <- ' '

## multi-word expressions composed of more than two words --
mas_que_dos <- subset(multi_word_expressions, grepl(' [a-z0-9]* ', term_name))
First: text2vec::Collocations builds out phrases in a piecemeal fashion. Long story short: in order to identify (or phrasify) flatten the curve as a multi-word expression, it must first identify (or phrasify), eg, flatten the. Then flatten-the and curve can be phrasified as flatten-the-curve. So, for multi-word expressions composed of more than two words, we have to build out some component parts. Some multi-word expressions in the COVID19 vocabulary with more than two words:
## [1] "flattening the curve" "drive thru tests" ## [3] "personal protective equipment" "flatten the curve" ## [5] "front line worker" "great american comeback" ## [7] "global economic cirsis" "return to work" ## [9] "high risk population" "god bless america"
A simple function for extracting the component 2-word phrases from multi-word expressions of more than two words:
new_two_grams <- lapply(mas_que_dos$term_name, function(x) {
  regmatches(x,
             gregexpr("[^ ]+ [^ ]+",  # sep = ' '
                      x,
                      perl = TRUE))[[1]] }) %>%
  unlist() %>%
  unique()
A look at the “pieces” of our multi-word expressions composed of more than two words:
## [1] "sars cov" "drive through" "drive thru" ## [4] "personal protective" "flatten the" "flattening the" ## [7] "front line" "high risk" "shelter in" ## [10] "long term" "stay at" "home order" ## [13] "home orders" "wash your" "wear a" ## [16] "work from" "working from" "dont bankrupt" ## [19] "global economic" "global health" "god bless" ## [22] "great american" "made in" "open up" ## [25] "paycheck protection" "re open" "return to" ## [28] "person to" "person transmission"
Then we add these “pieces” to the full multi-word portion of the COVID19 lexicon.
multi_word_expressions_replace <- gsub(' ', sep, multi_word_expressions$term_name)
multi_word_expressions_replace <- c(multi_word_expressions_replace, new_two_grams)
§ Some text2vec primitives
Before we can dupe text2vec into phrasifying our multi-word COVID19 terms, we first need to build two basic text2vec (data) structures: an itoken object (or iterator) & a vocabulary object. The former contains (among other things) a generic tokens object. Again, see this vignette for more technical details. Regardless of your text2vec objectives, these will (almost) always be your first two opening moves.
mo <- text2vec::itoken(tweets$word_text,
                       preprocessor = tolower,
                       tokenizer = text2vec::space_tokenizer,
                       n_chunks = 1,
                       ids = tweets$status_id)

vocab <- text2vec::create_vocabulary(mo, stopwords = character(0))  # tm::stopwords()
Then we build a skeleton Collocations model per the code below. But we never actually run the model.
model <- text2vec::Collocations$new(vocabulary = vocab, sep = sep)
Instead, all we want to do is assign our list of multi-word expressions to the parameter model$.__enclos_env__$private$phrases.
model$.__enclos_env__$private$phrases <- multi_word_expressions_replace
Using this dummy Collocations model, we then transform the itoken object built above. Here, transform means updating the token object to account for multi-word expressions.
it_phrases <- model$transform(mo)

term_vocab <- text2vec::create_vocabulary(it_phrases)
term_vocab1 <- text2vec::prune_vocabulary(term_vocab, term_count_min = 2)

## HACK: subset the vocabulary, then restore its attributes --
ats <- attributes(term_vocab1)
term_vocab2 <- subset(term_vocab1, grepl('^[A-Za-z]', term) & nchar(term) > 2)

t2v_vocab <- term_vocab2
attributes(t2v_vocab) <- ats

# egs <- it_phrases$nextElem()$tokens
# egs1 <- lapply(egs, paste0, collapse = ' ')
And now we can investigate frequencies for all forms included in the congressional Twitter corpus, including (but not limited to) our multi-word expressions.
term_freqs <- term_vocab2 %>%
  left_join(dictionary, by = c('term' = 'term_name'))

descriptor_freq <- term_freqs %>%
  group_by(category, descriptor_name) %>%
  summarize(term_freq = sum(term_count)) %>%
  filter(!is.na(descriptor_name))
Some relative frequencies for spelling & lexical variants for a sample of multi-word expressions from the COVID19 lexicon.
term_freqs %>%
  filter(descriptor_name %in% c('social-distancing', 'front-line-workers', 'flatten-the-curve')) %>%
  arrange(desc(term_count)) %>%
  mutate(tf = paste0(term, ' (', term_count, ')')) %>%
  group_by(descriptor_name) %>%
  summarize(relative = paste0(tf, collapse = ' | ')) %>%
  DT::datatable(rownames = FALSE, options = list(dom = 't'))
So, the trickier/hackier part is complete. The text2vec vocabulary object now recognizes the multi-word expressions in our COVID19 lexicon as single units of meaning. And we can carry on.
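A quick sanity check (a minimal sketch; exact counts will vary with the corpus snapshot): the pruned vocabulary should now contain space-joined multi-word tokens alongside single words.

## A few multi-word entries now present in the vocabulary --
head(subset(term_vocab2, grepl(' ', term)), 10)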
DTMs & prevalence of COVID19-related concepts
§ Document-term matrix
DTMs can be leveraged in any number of ways, for any number of downstream linguistic applications. Here, we want to use the DTM to understand variation in the use of COVID19-related concepts as a function of time and political affiliation – the former a feature of the tweet; the latter a feature of the tweeter, ie, the US lawmaker.
The text2vec::create_dtm function requires an additional object, a vocab_vectorizer, which is derived from the vocabulary object discussed/created above. The vocab_vectorizer operates on the phrasified token object contained in it_phrases. Using my lexvarsdatr package, we then convert the DTM to a long data frame.
vectorizer <- text2vec::vocab_vectorizer(t2v_vocab)
dtm0 <- text2vec::create_dtm(it_phrases, vectorizer)

dtm1 <- lexvarsdatr::lvdr_get_closest(dtm0)  ## other ways --
colnames(dtm1) <- c('doc_id', 'term_name', 'count')

dtm2 <- dtm1 %>% inner_join(dictionary)
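As an aside, if you would rather not lean on lexvarsdatr for this step, a rough equivalent (a sketch, not the approach used here) is to coerce the sparse matrix returned by create_dtm to triplet form and rebuild the three columns by hand:

## Sketch: DTM to long data frame without lexvarsdatr --
trips <- methods::as(dtm0, 'TsparseMatrix')  # triplet (i, j, x) representation
dtm1_alt <- data.frame(doc_id    = rownames(dtm0)[trips@i + 1],  # @i/@j are 0-based
                       term_name = colnames(dtm0)[trips@j + 1],
                       count     = trips@x,
                       stringsAsFactors = FALSE)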
§ Historical prevalences
Per this new data structure, we can now add tweet metadata (here, date of creation) as well as tweeter details, eg, name, party affiliation, gender, etc.
dtm3 <- dtm2 %>%
  ## Aggregate -- to descriptor_name
  group_by(doc_id, descriptor_name, category) %>%
  summarize(count = sum(count)) %>%
  ungroup() %>%
  left_join(tweets %>%
              select(status_id, screen_name, created_at) %>%
              mutate(screen_name = toupper(screen_name)),
            by = c('doc_id' = 'status_id')) %>%
  left_join(twitters %>%
              select(full_name, type:district_code, twitter, party) %>%
              mutate(twitter = toupper(twitter)),
            by = c('screen_name' = 'twitter'))
We can now calculate the daily rate of reference for each concept included in the COVID19 lexicon. Here, the rate of reference for concept X is calculated as the # of tweets referring to concept X per 1K total tweets (for a given day).
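A quick worked example with made-up numbers: if 45 of a given day's 900 congressional tweets reference STAY-AT-HOME, that day's rate is 45 / 900 * 1000 = 50 references per 1K tweets.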
for_plot_historical <- dtm3 %>%
  group_by(created_at) %>%
  mutate(daily_total = length(unique(doc_id))) %>%
  group_by(descriptor_name, created_at, daily_total) %>%
  summarize(n = n()) %>%
  mutate(per_k_tweets = round(n / daily_total * 1000, 3)) %>%
  ungroup()

  # mutate(cumulative_n = cumsum(n)) %>%
  # complete(created_at = seq.Date(from = min(created_at),
  #                                to = max(created_at),
  #                                by = "day"),
  #          descriptor_name) %>%
  # fill(cumulative_n)

for_plot_historical$per_k_tweets[is.na(for_plot_historical$per_k_tweets)] <- 0
§ Historical plot
The plot below illustrates daily rates of reference to several COVID19-related concepts since the beginning of March. Black lines represent 7-day moving averages.
keeps <- c('cares-act', 'flatten-the-curve', 'stay-at-home',
           'social-distancing', 'essential-workers', 'face-mask',
           'paycheck-protection-program', 'first-responders', 'ventilator')

###
for_plot_historical %>%
  filter(descriptor_name %in% keeps & created_at > '2020-03-01') %>%
  mutate(moving_n = zoo::rollmean(per_k_tweets, k = 7, fill = NA)) %>%
  ggplot() +
  geom_bar(aes(x = created_at, y = per_k_tweets,  # cumulative_n
               fill = descriptor_name),
           stat = "identity") +
  # geom_bar(stat = "identity") +
  geom_line(aes(x = created_at, y = moving_n),
            color = 'black',  # linetype = 2,
            size = .75) +
  theme_minimal() +
  ggthemes::scale_fill_stata() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_date(date_breaks = '1 week', date_labels = "%b %d") +
  theme(legend.position = 'none', legend.title = element_blank()) +
  ylab('Daily reference rate (per 1K tweets)') +
  facet_wrap(~descriptor_name, scales = 'free_y') +
  labs(title = 'Daily reference rates to some COVID-related concepts since March 1st',
       subtitle = 'With 7-day rolling averages in black',
       caption = 'Source: Congressional Twitter Corpus, 2020')
See this post for an onomatological perspective on how reference to COVID19/coronavirus has changed among members of the US House since February.
§ Party prevalences
Next, we consider variation in reference to COVID19 concepts among US lawmakers as a function of party affiliation. For COVID19 concept X, then, prevalence is calculated as the quotient of (1) the rate of reference for concept X among Republicans and (2) the rate of reference for concept X among Democrats. We then tweak this quotient a bit so that the reported number reflects a percentage increase.
For a similar analysis on how party affiliation influences how US House Reps refer to the President of the USA on Twitter, see this post.
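A worked example using the cares-act row from the table below: Republicans reference the concept at a rate of 86.22 per 1K tweets versus 62.82 for Democrats, so the reported value is (86.22 / 62.82 - 1) * 100 ≈ 37.25, ie, Republicans are ~37% more likely to reference it. When Democrats lead, the code below inverts the quotient and flips the sign (eg, close-contact at -119.57).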
descriptor_by_party <- dtm3 %>%
  filter(party != 'Independent') %>%
  group_by(party) %>%
  mutate(party_total = length(unique(doc_id))) %>%
  group_by(party, descriptor_name, party_total) %>%
  summarize(n = n()) %>%
  group_by(descriptor_name) %>%
  mutate(descriptor_total = sum(n)) %>%
  ungroup() %>%
  filter(descriptor_total > 40) %>%
  mutate(per_1k = round(n / party_total * 1000, 2)) %>%
  select(-n, -party_total) %>%
  spread(party, per_1k) %>%
  mutate(ratio = ifelse(Democrat < Republican,
                        (Republican / Democrat) - 1,
                        (-Democrat / Republican) + 1)) %>%
  mutate(ratio = round(ratio * 100, 2)) %>%
  filter(!is.na(ratio) & abs(ratio) < 200)
The table below illustrates several examples of the relative prevalences of COVID19-related concepts as a function of political affiliation. Interpreting the table, eg: Republicans are ~30% more likely than Democrats to reference american-dream on Twitter.
descriptor_name | descriptor_total | Democrat | Republican | ratio |
---|---|---|---|---|
american-dream | 112 | 2.07 | 2.71 | 30.92 |
cares-act | 3476 | 62.82 | 86.22 | 37.25 |
civil-liberty | 75 | 1.30 | 1.95 | 50.00 |
close-contact | 155 | 4.04 | 1.84 | -119.57 |
community-spread | 54 | 1.37 | 0.70 | -95.71 |
coronavirus | 16397 | 313.64 | 378.67 | 20.73 |
COVID19 concepts politicized: a Democrat-Republican continuum. Concepts in green are ~neutral. There are certainly some interesting differences here, ones that shed light on both ideology and constituency demographics.
descriptor_by_party %>%
  mutate(col1 = ifelse(ratio > 0, 'red', 'blue')) %>%
  mutate(col1 = ifelse(ratio > -25 & ratio < 25, 'x', col1)) %>%
  ggplot(aes(x = reorder(descriptor_name, ratio),
             y = ratio,
             label = descriptor_name,
             color = col1)) +
  geom_hline(yintercept = 25, linetype = 2, color = 'gray') +
  geom_hline(yintercept = -25, linetype = 2, color = 'gray') +
  geom_point(size = 1.5, color = 'darkgray') +
  geom_text(size = 4, hjust = 0, nudge_y = 5) +
  annotate('text', y = -75, x = 20, label = 'DEMOCRAT') +
  annotate('text', y = 100, x = 10, label = 'REPUBLICAN') +
  ggthemes::scale_color_stata() +
  theme_minimal() +
  labs(title = "COVID19-related concepts",
       subtitle = 'Prevalence by party affiliation') +
  theme(legend.position = "none",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  xlab('') + ylab('% more likely') +
  ylim(-150, 200) +
  coord_flip()
GloVe model & COVID19 semantic space
The next piece is to build a GloVe model to investigate semantic relatedness among concepts included in our COVID19 lexicon. The general workflow here is:
- Build a term-co-occurrence matrix (TCM),
- Build an n-dimensional GloVe model based on the TCM,
- Further reduce GloVe dimensions via tSNE, PCA, or MDS,
- Plot terms in a reduced 2D space.
Here, we have the additional task of aggregating the TCM from terms to descriptors (or concepts), before building the GloVe model. The code below creates a simple table that crosswalks terms to concepts.
term_vocab3 <- term_vocab2 %>%
  rename(term_name = term) %>%
  left_join(dictionary) %>%
  mutate(descriptor_name = ifelse(is.na(descriptor_name), term_name, descriptor_name),
         category = ifelse(is.na(category), 'other', category)) %>%
  arrange(term_name)
§ Term-co-occurrence matrix
Utilizing the previously constructed text2vec primitives, we use the text2vec::create_tcm function to construct a term-co-occurrence matrix, specifying a context window-size of 5 x 5.
tcm <- text2vec::create_tcm(it = it_phrases, vectorizer = vectorizer, skip_grams_window = 5L)
Then we use the lvdr_aggregate_matrix function from the lexvarsdatr package to aggregate term vectors to a single descriptor vector (for forms included in the COVID19 lexicon).
tcm <- tcm[, order(colnames(tcm))]
tcm <- tcm[order(rownames(tcm)), ]

tcm1 <- lexvarsdatr::lvdr_aggregate_matrix(tfm = tcm,
                                           group = term_vocab3$descriptor_name,
                                           fun = 'sum')
Dimensions of TCM:
## [1] 42468 42468
§ GloVe model
We specify GloVe model parameters via the text2vec::GlobalVectors function, and build term vectors using fit_transform. Vectors are comprised of n = 128 dimensions.
set.seed(99)
glove <- text2vec::GlobalVectors$new(rank = 128,
                                     # vocabulary = row.names(tcm1),
                                     x_max = 10)

wv_main <- glove$fit_transform(tcm1,
                               n_iter = 10,
                               convergence_tol = 0.01)

wv_context <- glove$components
glove_vectors <- wv_main + t(wv_context)
§ Semantic & conceptual relatedness
With GloVe vectors in tow, options abound. Here, we demonstrate two fairly straightforward applications. The first is a quick look at nearest neighbors for a set of COVID19-related concepts, via cosine similarity and the LSAfun::neighbors function.
eg_terms <- c('stay-at-home', 'outbreak', 'front-line-workers', 'vaccine', 'relief')

x <- lapply(eg_terms, LSAfun::neighbors, glove_vectors, n = 10)
names(x) <- eg_terms
Top 10 nearest neighbors for stay-at-home, outbreak, front-line-workers, vaccine & relief. So, despite a relatively small corpus, some fairly nice results.
## $`stay-at-home`
##      stay-at-home          practice social-distancing flatten-the-curve
##         1.0000000         0.5634366         0.5500531         0.4852880
##              sick              stay             avoid           staying
##         0.4788068         0.4424411         0.4312511         0.4096375
##            unless               you
##         0.3959533         0.3679505
##
## $outbreak
##    outbreak coronavirus    pandemic       covid   covidー19      crisis
##   1.0000000   0.6616114   0.5697251   0.5494554   0.5296838   0.4987309
##    response      spread       virus     impacts
##   0.4826594   0.4798837   0.4642697   0.4621730
##
## $`front-line-workers`
## front-line-workers  essential-workers   first-responders                ppe
##          1.0000000          0.5883052          0.5341892          0.4832370
##             heroes      professionals             nurses          frontline
##          0.4822017          0.4641108          0.4569190          0.4568646
##            doctors          equipment
##          0.4491088          0.4448107
##
## $vaccine
##     vaccine  treatments     develop   therapies   treatment development
##   1.0000000   0.5817043   0.4566783   0.4113966   0.4112393   0.3854895
##        easy       tests     genesis    research
##   0.3832985   0.3499214   0.3297194   0.3265058
##
## $relief
##     relief    funding        aid additional    provide       bill    support
##  1.0000000  0.6063464  0.5879983  0.5798526  0.5768965  0.5732163  0.5670530
##    package assistance  cares-act
##  0.5640401  0.5563580  0.5379026
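For reference, a similar neighbor lookup can be sketched with text2vec alone (a minimal sketch, not the approach used above; it assumes the glove_vectors matrix built earlier):

## Cosine similarities of 'vaccine' against all row vectors --
sims <- text2vec::sim2(glove_vectors,
                       glove_vectors['vaccine', , drop = FALSE],
                       method = 'cosine', norm = 'l2')

## Ten most similar forms (the target itself included) --
head(sort(sims[, 1], decreasing = TRUE), 10)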
For a smarter approach to visualizing nearest neighbors, as well as visualizing lexical semantic change historically, see this post. Or see a similar analysis utilizing Google n-gram data.
In the second application, we consider a two-dimensional perspective on the semantic relatedness of concepts included in our COVID19 lexicon. While we have built vectors for all forms attested in the Congressional Twitter Corpus, here we subset this matrix to just COVID concepts. Via classical multidimensional scaling, we project 128 features (per GloVe) into two-dimensional Euclidean space.
set.seed(99)
keeps <- descriptor_freq %>% filter(term_freq > 30)

glove1 <- glove_vectors[rownames(glove_vectors) %in% unique(keeps$descriptor_name), ]

sim_mat <- text2vec::sim2(glove1, method = "cosine", norm = "l2")

# data set too small for tSNE ---
y1 <- cmdscale(1 - sim_mat, eig = TRUE, k = 2)$points %>%
  data.frame() %>%
  mutate(descriptor_name = rownames(sim_mat)) %>%
  left_join(dictionary %>% distinct(descriptor_name, category))
A semantic map of COVID19 concepts is presented below. Some intuitive structure for sure; some less so.
y1 %>%
  ggplot(aes(X1, X2, label = descriptor_name)) +
  geom_point(aes(color = category), size = 3.5) +
  ggrepel::geom_text_repel(data = y1,
                           nudge_y = 0.025,
                           segment.color = "grey80",
                           direction = "y",
                           hjust = 0,
                           size = 3) +
  ggthemes::scale_colour_stata() +
  theme_minimal() +  # theme_classic() +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 14)) +
  labs(title = "COVID19-related concepts in 2D semantic space")
See a similar application using co-occurrence and place-names to create a text-based map of the world.
Networks & lexical co-occurrence
Lastly, we build & visualize a co-occurrence network based on the previously constructed term-co-occurrence matrix. The lexvarsdatr package streamlines these processes, and enables straightforward extraction of sub-networks from large matrices. See the package description for a more detailed discussion.
Below, we convert our count-based TCM to a positive point-wise mutual information-based matrix (via lvdr_calc_ppmi), and extract the 20 strongest collocates (via lvdr_extract_network) for five (cherry-picked) concepts included in the COVID19 lexicon.
network <- tcm1 %>%
  lexvarsdatr::lvdr_calc_ppmi(make_symmetric = TRUE) %>%
  lexvarsdatr::lvdr_extract_network(target = c('contact-tracing',
                                               'flatten-the-curve',
                                               'return-to-work',
                                               'social-distancing',
                                               # 'remote-learning',
                                               'drive-through-testing'),
                                    n = 20)
And then visualize:
set.seed(66)
network %>%
  tidygraph::as_tbl_graph() %>%
  ggraph::ggraph() +
  ggraph::geom_edge_link(color = 'darkgray') +
  ggraph::geom_node_point(aes(size = value, color = term, shape = group)) +
  ggraph::geom_node_text(aes(label = toupper(label), filter = group == 'term'),
                         repel = TRUE, size = 4) +
  ggraph::geom_node_text(aes(label = tolower(label), filter = group == 'feature'),
                         repel = TRUE, size = 3) +
  ggthemes::scale_color_stata() +
  theme_minimal() +
  ggtitle('A COVID19 co-occurrence network') +
  theme(legend.position = "none",
        plot.title = element_text(size = 14))
Summary
So, more of a resource/guide than a post proper. Mostly an attempt on my part to collate some scattered methods. And a bit of an ode to text2vec.