corpus query and grammatical constructions
This post demonstrates a simple collection of functions from my R package `corpuslingr`. The functions streamline two sets of corpus linguistics tasks:
- annotated corpus search of grammatical constructions and complex lexical patterns in context, and
- detailed summary and aggregation of corpus search results.
While still in development, the package should be useful to linguists and digital humanists interested in having BYU corpora-like search & summary functionality when working with (moderately-sized) personal corpora, as well as researchers interested in performing finer-grained, more qualitative analyses of language use and variation in context. The package is available for download at my github site.
```r
library(tidyverse)
library(ggthemes)    # for theme_fivethirtyeight() & scale_fill_stata(), used below
library(corpuslingr) # devtools::install_github("jaytimm/corpuslingr")
library(corpusdatr)  # devtools::install_github("jaytimm/corpusdatr")
```
Search syntax
Under the hood, `corpuslingr` search is regex- and tuple-based, akin to the `RegexpParser` function in Python's Natural Language Toolkit (NLTK). This facilitates search of grammatical and lexical patterns composed of:
- different types of elements (e.g., form, lemma, or part-of-speech),
- contiguous and/or non-contiguous elements,
- positionally fixed and/or free (i.e., optional) elements.
Regex character matching is streamlined with a simple "corpus querying language" (CQL) modeled after the more intuitive and transparent syntax used in the online BYU suite of English corpora. This allows for convenient specification of search patterns composed of form, lemma, and part-of-speech, with all of the functionality of regex metacharacters and repetition quantifiers.
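The tuple-based idea can be illustrated in a few lines of base R. This is a toy sketch, not the package's actual internals: each token is serialized as a `<form~lemma~pos>` tuple, the tuples are concatenated into one string per text, and a CQL pattern is translated (here by hand, with `ADJ` assumed to map to Penn tag `JJ`) into a regex over that string.

```r
# Toy sketch of tuple-based search (NOT corpuslingr's actual internals).
# Each token becomes a <form~lemma~pos> tuple; a text is one long string.
tokens <- data.frame(
  token = c("a", "happy", "and", "healthy", "dog"),
  lemma = c("a", "happy", "and", "healthy", "dog"),
  pos   = c("DT", "JJ", "CC", "JJ", "NN")
)
text_string <- paste0("<", tokens$token, "~", tokens$lemma, "~", tokens$pos, ">",
                      collapse = " ")

# "ADJ and ADJ" hand-translated into a regex over the tuple string:
pattern <- "<[^>~]+~[^>~]+~JJ> <and~[^>~]+~[^>~]+> <[^>~]+~[^>~]+~JJ>"
regmatches(text_string, regexpr(pattern, text_string))
# → "<happy~happy~JJ> <and~and~CC> <healthy~healthy~JJ>"
```

Because the whole annotated text is a single string, regex quantifiers and alternation come for free, which is what makes optional and non-contiguous elements easy to express.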
Example searches and syntax are presented below; they load with the package as `clr_ref_search_egs`. A full list of part-of-speech codes can be viewed here, or via `clr_ref_pos_codes`.
example search syntax
Corpus search
For demo purposes, we use the `cdr_slate_ann` corpus from my `corpusdatr` package. A simple description of the corpus is available here. Using the `corpuslingr::clr_set_corpus` function (which builds tuples and sets character onsets/offsets), we ready the corpus for search.
```r
slate <- corpusdatr::cdr_slate_ann %>%
  corpuslingr::clr_set_corpus()
```
SIMPLE SEARCH
The `clr_search_gramx()` function returns instantiations of a search pattern without context; it is meant for quick searches. Results are returned as a single data frame.
- ADJECTIVE and ADJECTIVE, e.g. “happy and healthy”
```r
search1 <- "ADJ and ADJ"

slate %>%
  corpuslingr::clr_search_gramx(search = search1) %>%
  select(doc_id, token, tag) %>%
  head()
```
SEARCH IN CONTEXT
The `clr_search_context()` function builds on `clr_search_gramx()` by adding surrounding context to the search phrase. Search windows can be specified with the `LW`/`RW` parameters. The function returns a list of two data frames. The first, `BOW`, presents results in a long format, which can be used to build word embeddings, for example. The second, `KWIC`, presents results with the surrounding context rebuilt in more or less keyword-in-context fashion. Both data frames serve as intermediary data structures for subsequent analyses.
- VERB PRP$ way PREP NPHR, e.g. “make its way through the Senate”
Per the CQL above, `NPHR` can be used as a generic noun phrase search.
```r
search2 <- "VERB PRP$ way (through|into) NPHR"

searchResults <- slate %>%
  corpuslingr::clr_search_context(search = search2, LW = 5, RW = 5)
```
`KWIC` object:
```r
searchResults$KWIC %>%
  head() %>%
  select(-eg)
```
`BOW` object:
```r
searchResults$BOW %>% head()
```
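To illustrate why the long `BOW` format is convenient, here is a minimal dplyr sketch. The `searchLemma` and `lemma` column names are borrowed from the outputs shown later in this post, and the data are invented; the point is that one `count()` call turns a long-format table into co-occurrence frequencies, the basic input for count-based embeddings.

```r
library(dplyr)

# Invented long-format data: one row per (search hit, context word) pair.
bow_long <- tibble::tibble(
  searchLemma = c("CLINTON", "CLINTON", "CLINTON", "MCCAIN"),
  lemma       = c("president", "house", "president", "senator")
)

# Roll the long table up into co-occurrence counts:
cooc <- bow_long %>%
  count(searchLemma, lemma, name = "cofreq") %>%
  arrange(desc(cofreq))
cooc
```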
Search summary
The `clr_get_freq()` function enables quick aggregation of search results. It calculates token and text frequency for search terms, and the `agg_var` parameter lets the user specify how counts are aggregated.
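To make the two measures concrete, here is a small sketch with invented data (the column names are illustrative, not the package's actual output): token frequency counts every occurrence of a form, while text frequency counts the number of texts the form occurs in.

```r
library(dplyr)

# Invented search hits: one row per match, with the document it occurred in.
hits <- tibble::tibble(
  doc_id = c("d1", "d1", "d2", "d3"),
  lemma  = c("GIVE UP", "GIVE UP", "GIVE UP", "PASS UP")
)

freqs <- hits %>%
  group_by(lemma) %>%
  summarize(token_freq = n(),                 # every occurrence
            text_freq  = n_distinct(doc_id))  # texts containing the form
freqs
# "GIVE UP" occurs 3 times across 2 texts; "PASS UP" once in 1 text.
```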
- VERB up, e.g. “pass up”
```r
search3 <- "VERB up"
```
The figure below illustrates the top 20 instantiations of the grammatical construction VERB up.
```r
slate %>%
  corpuslingr::clr_search_gramx(search = search3) %>%
  corpuslingr::clr_get_freq(agg_var = c("lemma"), toupper = TRUE) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(lemma, txtf), y = txtf)) +
  geom_col(width = .6, fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 instantiations of 'VERB up' by frequency")
```
Although search is quicker when multiple search terms are queried simultaneously, in some cases it may be useful to treat search terms individually using `lapply()`:
```r
search3a <- c("VERB across", "VERB through", "VERB out", "VERB down")

vb_prep <- lapply(seq_along(search3a), function(y) {
  corpuslingr::clr_search_gramx(corp = slate, search = search3a[y]) %>%
    corpuslingr::clr_get_freq(agg_var = c("lemma"), toupper = TRUE) %>%
    mutate(search = search3a[y])
}) %>%
  bind_rows()
```
Summary by search:
```r
vb_prep %>%
  group_by(search) %>%
  summarize(gramx_freq = sum(txtf), gramx_type = n())
```
Top 10 instantiations of each search pattern by search term:
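One way to produce such a summary from `vb_prep` is a grouped `top_n()`. A sketch, using the `search`, `lemma`, and `txtf` columns seen above, with a small invented stand-in (`vb_toy`) so the snippet runs on its own:

```r
library(dplyr)

# Invented stand-in for vb_prep, with the same columns (search, lemma, txtf):
vb_toy <- tibble::tibble(
  search = rep(c("VERB out", "VERB down"), each = 3),
  lemma  = c("POINT OUT", "TURN OUT", "FIND OUT",
             "GO DOWN", "SIT DOWN", "SHUT DOWN"),
  txtf   = c(40, 35, 30, 25, 20, 15)
)

# Top n instantiations per search term (n = 10 on the real data):
top_hits <- vb_toy %>%
  group_by(search) %>%
  top_n(n = 2, wt = txtf) %>%
  arrange(search, desc(txtf)) %>%
  ungroup()
top_hits
```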
KWIC & BOW
KEYWORD IN CONTEXT
`clr_context_kwic()` is a super simple function that rebuilds search contexts from the output of `clr_search_context()`. The `include` parameter allows the user to add details about the search pattern, e.g. part-of-speech, to the table. It works nicely with `DT` tables.
- VERB NOUNPHRASE into VERBING, e.g. “trick someone into believing”
```r
search4 <- "VERB NPHR into VBG"

slate %>%
  corpuslingr::clr_search_context(search = search4, LW = 10, RW = 10) %>%
  corpuslingr::clr_context_kwic(include = c('doc_id')) %>%
  DT::datatable(class = 'cell-border stripe', rownames = FALSE,
                width = "450", escape = FALSE) %>%
  DT::formatStyle(c(1:2), fontSize = '85%')
```
BAG OF WORDS
The `clr_context_bow()` function returns a co-occurrence vector for each search term based on the context window size specified in `clr_search_context()`. Again, how features are counted can be specified with the `agg_var` parameter; additionally, the features included in the vector can be filtered to content words with the `content_only` parameter.
- Multiple search terms
```r
search5 <- c("Clinton", "Lewinsky", "Bradley", "McCain", "Milosevic",
             "Starr", "Microsoft", "Congress", "China", "Russia")
```
Here we search for some prominent players of the late 90s (when the articles in the `cdr_slate_ann` corpus were published), and plot the most frequent co-occurring features of each search term.
```r
co_occur <- slate %>%
  corpuslingr::clr_search_context(search = search5, LW = 15, RW = 15) %>%
  corpuslingr::clr_context_bow(content_only = TRUE,
                               agg_var = c('searchLemma', 'lemma', 'pos'))
```
Plotting facets in `ggplot` is problematic when within-facet categories overlap across facets. We add a couple of hacks to address this.
```r
co_occur %>%
  filter(pos == "NOUN") %>%
  arrange(searchLemma, cofreq) %>%
  group_by(searchLemma) %>%
  top_n(n = 10, wt = jitter(cofreq)) %>%
  ungroup() %>%
  # Hack 1: sort order within facet
  mutate(order = row_number(),
         lemma = factor(paste(order, lemma, sep = "_"),
                        levels = paste(order, lemma, sep = "_"))) %>%
  ggplot(aes(x = lemma, y = cofreq, fill = searchLemma)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~searchLemma, scales = "free_y", ncol = 2) +
  # Hack 2: strip the sort prefix back off the labels
  scale_x_discrete(labels = function(x) gsub("^.*_", "", x)) +
  theme_fivethirtyeight() +
  scale_fill_stata() +
  theme(plot.title = element_text(size = 13)) +
  coord_flip() +
  labs(title = "Co-occurrence frequencies for some late 20th century players")
```
Summary and shiny
So, a quick demo of some `corpuslingr` functions for annotated corpus search and summary of complex lexical-grammatical patterns in context.
I have built a Shiny app for searching/exploring the Slate Magazine corpus, available here. Code for building the app is available here. Swapping out the Slate corpus for a personal one should be fairly straightforward, with the caveat that the annotated corpus needs to be set/“tuple-ized” using the `clr_set_corpus` function from `corpuslingr`.