
place from text: geography & distributional semantics

[This article was first published on Jason Timm, and kindly contributed to R-bloggers.]
    In this post, we demonstrate several methods for exploring the geographical information found in text. First, we address some practical issues of extracting place names from an annotated corpus, and demonstrate how to (1) map their geospatial distribution via geocoding and (2) append additional geographic detail to these locations via spatial joins.

    We then consider how these locations “map” in semantic space by comparing context-based word embeddings for each location. Ultimately, the endgame is to investigate the extent to which geospatial proximity is reflected (or not) in distributional similarity in a corpus. In the process, we demonstrate some methods for getting from lexical co-occurrence to a 2D semantic map via latent semantic analysis (LSA) and classical multi-dimensional scaling (MDS).

    library(tidyverse)
    library(ggthemes)
    library(corpuslingr) #devtools::install_github("jaytimm/corpuslingr")
    library(corpusdatr) #devtools::install_github("jaytimm/corpusdatr")
    library(knitr)

    From text to map

    Slate corpus & geopolitical entities

    For demo purposes, we use the annotated Slate Magazine corpus made available as cdr_slate_ann via the corpusdatr package. The articles comprising the corpus are largely political in nature, so they contain many references to place and location, namely foreign and domestic political entities. The first task, then, is to get a roll call of the geopolitical entities included in the corpus.

    The Slate Magazine corpus has been annotated using the spacyr package, and contains named entity tags, including geopolitical entities (GPEs). Here we collapse multi-word entities (eg, “New” “York”) to single tokens (eg, “New_York”), and ready the corpus for search using clr_set_corpus.

    slate <- corpusdatr::cdr_slate_ann %>%
      spacyr::entity_consolidate() %>%
      corpuslingr::clr_set_corpus(ent_as_tag=TRUE)
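
To make the consolidation step concrete, here is a minimal base R sketch of the idea on a toy token table. It is not spacyr's actual implementation; the `_B`/`_I` entity suffixes and the column names are assumptions modelled on a typical parse, and the data are made up.

```r
# Toy token table loosely modelled on an annotated parse; values are made up.
tokens <- data.frame(
  doc_id = "d1",
  token  = c("She", "left", "New", "York", "for", "Kosovo", "."),
  entity = c("", "", "GPE_B", "GPE_I", "", "GPE_B", ""),
  stringsAsFactors = FALSE
)

# A token opens a new group unless it continues an entity (suffix "_I").
group_id <- cumsum(!grepl("_I$", tokens$entity))

# Paste the tokens of each group together with "_" and keep the
# entity type (minus its positional suffix) of the group's first token.
consolidated <- data.frame(
  doc_id = unname(vapply(split(tokens$doc_id, group_id),
                         function(x) x[1], character(1))),
  token = unname(vapply(split(tokens$token, group_id),
                        paste, character(1), collapse = "_")),
  entity_type = unname(vapply(split(tokens$entity, group_id),
                              function(x) sub("_[BI]$", "", x[1]),
                              character(1))),
  stringsAsFactors = FALSE
)

consolidated$token
```

The two-token entity "New" "York" ends up as the single token "New_York", which is what lets a multi-word place name be counted and searched as one unit later on.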

    Next, we obtain text and document frequencies for GPEs included in the corpus, and filter to only those with a text frequency of 10 or greater, excluding references to the United States.

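    slate_gpe <- slate %>%
      bind_rows()%>%
      filter(tag == 'NNGPE')%>%
      corpuslingr::clr_get_freq(agg_var='lemma',toupper=TRUE) %>%
      # Anchor the pattern and escape the dots so that only US variants are dropped;
      # an unanchored 'US' would also remove lemmas such as RUSSIA or AUSTRALIA.
      filter(txtf>9 & !grepl('^(THE_)?(US|USA|AMERICA|UNITED_STATES|U\\.S\\.?|U\\.S\\.A\\.?)$',lemma))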

    The most frequently referenced GPEs in the Slate corpus (not including the US):

    lemma        txtf  docf
    WASHINGTON    398   230
    KOSOVO        298    78
    CHINA         262    94
    NEW_YORK      222   143
    ISRAEL        204    78
    BRITAIN       161    85
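
The difference between the two frequency columns can be sketched in base R. This is a toy illustration with made-up hits, not the internals of clr_get_freq: text frequency counts every occurrence of a lemma, while document frequency counts the distinct documents in which it occurs.

```r
# Toy GPE hits, one row per occurrence (made-up data).
hits <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  lemma  = c("China", "china", "China", "China", "Kosovo", "Kosovo"),
  stringsAsFactors = FALSE
)
hits$lemma <- toupper(hits$lemma)

# Text frequency: total number of occurrences per lemma.
txtf <- table(hits$lemma)

# Document frequency: number of distinct documents per lemma.
docf <- table(unique(hits[, c("doc_id", "lemma")])$lemma)

freq <- data.frame(
  lemma = names(txtf),
  txtf  = as.integer(txtf),
  docf  = as.integer(docf[names(txtf)]),
  stringsAsFactors = FALSE
)
freq <- freq[order(-freq$txtf), ]
freq
```

A lemma mentioned many times in a few articles has a high txtf relative to docf, which is the pattern KOSOVO shows in the table above.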

    Geocoding

    To visualize the geographical distribution of GPEs in the Slate Magazine corpus, we use the geocode function from the ggmap package to transform our corpus locations to lat/lon coordinates that can be mapped. While ggmap works best with proper addresses (eg, street, city, zip, etc), country and city names can be geolocated as well.

    Note that while GPEs are geographical areas, this method approximates GPE location as a single point in lat/lon space at the center (or centroid) of these areas. For our purposes here, this approximation is fine.
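
As a rough illustration of what reducing an area to its centroid means, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula. It treats lon/lat as planar coordinates, which is an assumption, and the helper `polygon_centroid` and the rectangle are hypothetical; a geocoding service may pick its representative point differently.

```r
# Area-weighted centroid of a simple (non-self-intersecting) polygon.
# x, y: vertex coordinates in order, without repeating the first vertex.
polygon_centroid <- function(x, y) {
  # Pair each vertex with the next one, wrapping around to close the ring.
  x_next <- c(x[-1], x[1])
  y_next <- c(y[-1], y[1])
  cross  <- x * y_next - x_next * y
  area   <- sum(cross) / 2          # signed area; the sign cancels below
  c(lon = sum((x + x_next) * cross) / (6 * area),
    lat = sum((y + y_next) * cross) / (6 * area))
}

# A made-up rectangular "region".
box_lon <- c(-109, -102, -102, -109)
box_lat <- c(37, 37, 41, 41)

centroid <- polygon_centroid(box_lon, box_lat)
centroid
```

For a rectangle the result is simply the midpoint of the box; for irregular shapes the centroid can fall outside the area entirely, which is one reason the single-point approximation is only that.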

    The following pipe geocodes the GPEs, removes GPEs that Google Maps cannot geocode, and transforms the new dataframe with lat/lon coordinates into an sf spatial object. The last step enables convenient mapping/geospatial processing within the sf framework.

    library(ggmap)
    library(sf)
    
    slate_gpe_geo <- ggmap::geocode(slate_gpe$lemma, 
                                    output = c("latlon"), 
                                    messaging = FALSE) %>%
      bind_cols(slate_gpe)%>%
      filter(complete.cases(.))%>%
      sf::st_as_sf(coords = c("lon", "lat"), 
                   crs = 4326)

    We then map the geolocated GPEs using the leaflet package; circle radius reflects frequency of occurrence in the Slate corpus.

    library(leaflet)
    library(widgetframe)
    
    x <- slate_gpe_geo %>%
      leaflet(width="100%") %>%
      setView(lng = -5, lat = 31, zoom = 2) %>%
      addProviderTiles ("CartoDB.Positron",
                        options = providerTileOptions (minZoom = 2, maxZoom = 4))%>%
      addCircleMarkers(
        radius = ~txtf/25,
        stroke = FALSE, fillOpacity = .75,
        label=~lemma)
    
    frameWidget(x)

    Spatial joins

    The spData package conveniently makes available a variety of shapefiles/geopolitical polygons as sf objects, including a world country map. Having geocoded the GPEs, we can add features from this country map (eg, country, subregion, continent) to our GPE points via a spatial join. We use the st_join function from the sf package to accomplish this task.

    library(spData)
    slate_gpe_details <- sf::st_join(slate_gpe_geo, spData::world)

    Per the spatial join, we now have information regarding country, continent, and subregion for each GPE from the Slate Magazine corpus.

      lemma      name_long      continent      subregion
    4 ALBANIA    Albania        Europe         Southern Europe
    5 ARGENTINA  Argentina      South America  South America
    6 ARIZONA    United States  North America  Northern America
    7 ARKANSAS   United States  North America  Northern America
    8 ARLINGTON  United States  North America  Northern America
    9 ATHENS     Greece         Europe         Southern Europe
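
Conceptually, a point-to-polygon spatial join asks, for each point, which polygon contains it, and then copies that polygon's attributes onto the point. The base R sketch below illustrates this with a ray-casting test on toy squares. The helper names, regions, and points are all hypothetical, and sf's own implementation is considerably more general.

```r
# Ray casting: count how many polygon edges a horizontal ray from the
# point crosses; an odd number of crossings means the point is inside.
point_in_polygon <- function(px, py, poly_x, poly_y) {
  n <- length(poly_x)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge from vertex j to vertex i straddle the ray's height?
    if ((poly_y[i] > py) != (poly_y[j] > py)) {
      x_at_py <- poly_x[j] + (py - poly_y[j]) *
        (poly_x[i] - poly_x[j]) / (poly_y[i] - poly_y[j])
      if (px < x_at_py) inside <- !inside
    }
    j <- i
  }
  inside
}

# Two made-up square "countries" and three made-up points.
regions <- list(
  list(name_long = "Westland", x = c(0, 10, 10, 0),   y = c(0, 0, 10, 10)),
  list(name_long = "Eastland", x = c(10, 20, 20, 10), y = c(0, 0, 10, 10))
)
points <- data.frame(lemma = c("ALPHA", "BETA", "GAMMA"),
                     lon = c(2, 15, 30), lat = c(5, 5, 5),
                     stringsAsFactors = FALSE)

# Return the name of the first region containing the point, else NA.
match_region <- function(lon, lat) {
  for (r in regions) {
    if (point_in_polygon(lon, lat, r$x, r$y)) return(r$name_long)
  }
  NA_character_
}

points$name_long <- mapply(match_region, points$lon, points$lat)
points
```

Points that fall in no polygon keep an NA, which mirrors the behaviour of a left join: no point is dropped, it just carries no region attributes.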

    We can use this information, for example, to aggregate GPE text and document frequencies to the subregion level:

    slate_gpe_details %>%
      data.frame()%>%
      group_by(subregion) %>%
      summarize (txtf=sum(txtf),docf=sum(docf))%>%
      filter(subregion!='Northern America')%>%
      ggplot(aes(x=docf, y=txtf)) + 
      geom_text(aes(label=toupper(subregion)), 
                size=3, 
                check_overlap = TRUE,
                hjust = "inward")+
      labs(title = "Document vs. text frequency for GPEs outside of Northern America", 
           subtitle="By Subregion")

    Corpus search and context

    So, our next task is to map the GPEs in 2D (semantic) space by comparing context-based word embeddings for each location. What does a map derived from patterns of lexical co-occurrence in text look like?

    The first step in accomplishing this task is to search the Slate Magazine corpus for GPEs in context. For each occurrence of each GPE in the corpus, the token and its surrounding context are extracted using the corpuslingr::clr_search_context function. Here, context is defined as the 15 words to the left and the 15 words to the right of a given GPE token (LW=15, RW=15). We limit our search to the 100 most frequent GPEs.

    gpe_search <- data.frame(slate_gpe_geo) %>%
      arrange(desc(txtf))%>%
      slice(1:100)%>%
      mutate(lemma=paste0(lemma,'~GPE'))

    Perform search:

    gpe_contexts <- corpuslingr::clr_search_context(
      search = gpe_search$lemma, 
      corp=slate, 
      LW=15, RW=15)
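
The windowing idea can be sketched in base R on a toy token vector. The function `get_context` below is a hypothetical helper, not part of corpuslingr; it only borrows the LW/RW argument names to show how a left and right window is cut around each hit and truncated at the edges of the text.

```r
# A made-up, already-tokenized sentence.
tokens <- c("the", "war", "in", "KOSOVO", "has", "drawn",
            "in", "BRITAIN", "and", "others")

# For every occurrence of `target`, return up to LW tokens to its left
# and up to RW tokens to its right.
get_context <- function(tokens, target, LW = 2, RW = 2) {
  hits <- which(tokens == target)
  lapply(hits, function(i) {
    left  <- max(1, i - LW)
    right <- min(length(tokens), i + RW)
    list(position = i,
         left  = tokens[seq_len(i - left) + left - 1],
         right = tokens[seq_len(right - i) + i])
  })
}

ctx <- get_context(tokens, "KOSOVO", LW = 2, RW = 2)
ctx[[1]]$left
ctx[[1]]$right
```

With LW and RW set to 15, as in the search above, each hit contributes up to 30 neighbouring tokens to that GPE's context.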

    A small random sample of the search results is presented below in context. The clr_context_kwic function quickly rebuilds the original user-specified search context, with the search term highlighted.

    gpe_contexts %>%
      corpuslingr::clr_context_kwic(include=c('doc_id')) %>%
      sample_n(5)%>%
      DT::datatable(class = 'cell-border stripe', 
                    rownames = FALSE,
                    width="100%", 
                    escape=FALSE)

    LSA, MDS, and semantic space

    So, having extracted all contexts from the corpus, we can now build a GPE-feature matrix (ie, word embeddings by GPE) by applying the clr_context_bow function to the output of clr_search_context. We limit our definition of features to only content words, and aggregate feature frequencies by lemma.

    term_feat_mat <- gpe_contexts %>%
      corpuslingr::clr_context_bow(
        agg_var = c('searchLemma','lemma'),
        content_only=TRUE)%>%
      spread (searchLemma,cofreq)%>%
      replace(is.na(.), 0)

    Some of the matrix:

    lemma       AFGHANISTAN  ALABAMA  ALASKA
    GOT_MAIL              0        0       0
    GOURMET               0        0       0
    GOV.                  0        0       0
    GOVERN                0        0       0
    GOVERNANCE            0        0       0
    GOVERNMENT            2        1       2
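
The shape of this matrix can be reproduced on toy data with a simple cross-tabulation. The sketch below assumes one row per (search term, context word) pair, with made-up values; it is not the clr_context_bow implementation, only an illustration of counting co-occurrences and filling unseen pairs with zero.

```r
# One row per context token: which GPE it surrounded, and its lemma.
contexts <- data.frame(
  searchLemma = c("KOSOVO", "KOSOVO", "KOSOVO", "ALABAMA", "ALABAMA", "ALASKA"),
  lemma       = c("WAR", "NATO", "WAR", "GOVERNOR", "STATE", "STATE"),
  stringsAsFactors = FALSE
)

# Cross-tabulate features (rows) by GPE (columns); pairs that never
# co-occur get a count of 0 automatically.
toy_mat <- unclass(table(contexts$lemma, contexts$searchLemma))
toy_mat
```

Each column is then the co-occurrence vector for one GPE, and it is these columns that get compared in the next step.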


    Next, we create a cosine-based similarity matrix using the lsa package:

    library(lsa)
    sim_mat <- term_feat_mat %>%
      select(2:ncol(term_feat_mat)) %>%
      data.matrix()%>%
      lsa::cosine(.)

    The lsa::cosine function computes cosine measures between all GPE vectors of the term-feature matrix. The higher the cosine measure between two vectors, the greater their similarity in composition. The top-left portion of this matrix is presented below:

    ##             AFGHANISTAN   ALABAMA    ALASKA
    ## AFGHANISTAN   1.0000000 0.1663644 0.1089837
    ## ALABAMA       0.1663644 1.0000000 0.1570805
    ## ALASKA        0.1089837 0.1570805 1.0000000
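
For readers who want to see what the cosine measure computes, here is a minimal base R sketch on a made-up matrix with GPEs as columns. The function `cosine_sim` is a hypothetical helper that follows the standard definition (dot product divided by the product of the vector lengths); it is meant as an illustration, not as a substitute for lsa::cosine.

```r
# Made-up co-occurrence counts: rows are features, columns are GPEs.
m <- cbind(
  AFGHANISTAN = c(2, 0, 1, 3),
  ALABAMA     = c(1, 2, 0, 0),
  ALASKA      = c(2, 1, 0, 1)
)

# Cosine similarity between all pairs of columns.
cosine_sim <- function(m) {
  dots  <- crossprod(m)        # t(m) %*% m: all pairwise dot products
  norms <- sqrt(diag(dots))    # Euclidean length of each column
  dots / outer(norms, norms)
}

sim <- cosine_sim(m)
round(sim, 3)
```

Because the measure normalizes by vector length, a frequently mentioned GPE and a rarely mentioned one can still be highly similar if their contexts are distributed in the same proportions.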

    The last step is to transform the similarities between co-occurrence vectors into two-dimensional space, such that context-based (ie, semantic) similarity is reflected in spatial proximity.

    To accomplish this task, we apply classical scaling to the similarity matrix using the base R function cmdscale. Two-dimensional coordinates are then extracted from the points element of the cmdscale output. We join the slate_gpe_details object to the output in order to color GPEs in the plot by continent.
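
The classical scaling step can be sketched by hand on a toy similarity matrix. The values below are made up, and treating 1 minus cosine similarity as a distance is an assumption carried over from the approach described here (it is not a true metric in general). The sketch double-centers the squared distances, takes the top two eigenvectors, and checks the result against cmdscale.

```r
# Made-up similarities between three items.
sim <- matrix(c(1.0, 0.8, 0.1,
                0.8, 1.0, 0.2,
                0.1, 0.2, 1.0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
d <- 1 - sim                              # similarities to dissimilarities

# Classical MDS by hand: double-center the squared distances ...
n <- nrow(d)
J <- diag(n) - matrix(1 / n, n, n)        # centering matrix
B <- -0.5 * J %*% (d^2) %*% J

# ... then scale the leading eigenvectors by the root of their eigenvalues.
e <- eigen(B, symmetric = TRUE)
coords <- e$vectors[, 1:2] %*% diag(sqrt(pmax(e$values[1:2], 0)))
rownames(coords) <- rownames(d)

# Reference solution from base R.
ref <- cmdscale(d, k = 2)

# Axes may be flipped between the two solutions, so compare the
# inter-point distances rather than the raw coordinates.
dist(coords)
dist(ref)
```

In this small example the two-dimensional solution reproduces the input distances exactly; with 100 GPEs, two dimensions can only approximate them, so the resulting plot is best read as a summary of the similarity matrix.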

    As the plot demonstrates, we get a fairly good sense of geo-political space (from the perspective of Slate Magazine contributors) by comparing word embeddings derived from a corpus of only 1 million words.

    # Classical MDS on cosine-based dissimilarities (1 - similarity)
    cmdscale(1-sim_mat, eig = TRUE, k = 2)$points %>% 
      data.frame() %>%
      mutate (lemma = rownames(sim_mat))%>%
      # Join on lemma explicitly to add continent for coloring
      left_join(slate_gpe_details, by = "lemma")%>%
      ggplot(aes(X1,X2)) +
      # Label from the lemma column so labels stay aligned with rows after the join
      geom_text(aes(label=lemma,col=continent), 
                size=2.5, 
                check_overlap = TRUE)+
      scale_colour_stata() + theme_fivethirtyeight() +
      theme(legend.position = "none",
            plot.title = element_text(size=14))+ 
      labs(title="Slate GPEs in semantic space",
           subtitle="A two-dimensional solution")

    The first dimension (x-axis) seems to do a very nice job capturing a “Domestic – Foreign” distinction, with some obvious exceptions. The second dimension (y-axis) seems to capture a “City – State” distinction, or an “Urban – Non-urban” distinction. Also, there seems to be a “Europe – Non-Europe” element to the second dimension on the “Foreign” side of the plot.

    Someone better versed in the geo-political happenings of the waning 20th century could likely provide a more detailed analysis here. Suffice it to say, there is some very intuitive structure to the plot above derived from co-occurrence vectors. While not exclusively geospatial, as a “map” of the geo-political “lay of the land” it certainly has utility.

    FIN

    So, we have woven together here a set of methodologies that are often discussed in different classrooms, and demonstrated some different approaches to extracting and analyzing the geospatial information contained in text. Maps and “maps.”
