
Curate language data (1/2): organizing meta-data

  • When working with raw data, whether it comes from a corpus repository, a web download, or a web scrape, it is important to recognize that the attributes we want to organize can be stored or represented in various formats. The three I will cover here have to do with meta-data that is: (1) contained in the file names of a set of corpus files, (2) embedded in the corpus documents inline with the corpus text, and (3) stored separately from the text data. Our goal will be to wrangle this information into a tidy dataset format where each row is an observation and each column a corresponding attribute of the data.

    The following code is available on GitHub recipes-curate_data and is built on the recipes-project_template I have discussed in detail here and made accessible here. I encourage you to follow along by downloading the recipes-project_template with git from the Terminal, or by creating a new RStudio R Project and selecting the “Version Control” option.


    Running text with meta-data in file names

    A common format for storing meta-data for corpora is in the file names of the corpus documents. When this is the approach of the corpus designer, the names will contain the relevant attributes in some regular format, usually using some common character as the delimiter between the distinct attribute elements.

    Download corpus data

    The ACTIV-ES Corpus is structured this way. ACTIV-ES is a corpus of TV/film transcripts from Argentina, Mexico, and Spain. Let’s use this corpus as an example. First we need to download the data. The ACTIV-ES corpus is stored in a GitHub repository. We can download the entire corpus using git to clone the repository, or we can access the specific corpus format (plain-text or part-of-speech annotated) as a compressed .zip file. Let’s download the compressed file for the plain text data. Navigate to the https://github.com/francojc/activ-es/blob/master/activ-es-v.02/corpus/plain.zip file and then copy the link for the ‘Download’ button. We can use the get_zip_data() function we developed in the Acquiring data for language research (1/3): direct downloads post.

    get_zip_data(url = "https://github.com/francojc/activ-es/raw/master/activ-es-v.02/corpus/plain.zip", 
                 target_dir = "data/original/actives/plain")

    Taking a look at the data/original/actives/plain/ directory we can see the files. Below is a subset of files from each of the three countries.

    es_Argentina_2008_Lluvia_movie_Drama_1194615.run
    es_Argentina_2008_Los-paranoicos_movie_Comedy_1178654.run
    es_Mexico_2008_Rudo-y-Cursi_movie_Comedy_405393.run
    es_Mexico_2009_Sin-nombre_movie_Adventure_1127715.run
    es_Spain_2010_También-la-lluvia_movie_Drama_1422032.run
    es_Spain_2010_Tres-metros-sobre-el-cielo_movie_Drama_1648216.run

    Tidy the corpus

    Each of the meta-data attributes is separated by an underscore _. The extension on these files is .run. There is nothing special about this extension (the data is plain text); it simply distinguishes the ‘running text’ version of these files from similarly named files in other versions of the corpus that carry linguistic annotations. The delimited elements correspond to language, country, year, title, type, genre, and imdb_id.
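    To make the delimiter structure concrete, here is a quick base R illustration (not part of the tidying pipeline we will build) of how one of these file names decomposes on the underscore; readtext() will do this parsing for us below, including handling the .run extension.

    strsplit("es_Argentina_2008_Lluvia_movie_Drama_1194615.run", split = "_") # split one file name on `_`
    # returns a list whose single element is the character vector:
    # "es" "Argentina" "2008" "Lluvia" "movie" "Drama" "1194615.run"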

    Ideally we want a data set with columns for each of these attributes in the file names, plus two extra columns: one for the text itself and one for an id (doc_id) to distinguish each document. The readtext package comes in handy here. So let’s load (or install) this package to read the corpus files and the tidyverse package for other miscellaneous helper functions.

    pacman::p_load(readtext, tidyverse) # use the pacman package to load-install

    The readtext() function is quite versatile. It allows us to read multiple files simultaneously and organize the data in a tidy dataset. The file argument lets us provide the path to the directory where the files are located and use a pattern-matching syntax known as regular expressions to match only the files we want to extract the data from. Regular expressions are a powerful tool for manipulating character strings. Getting familiar with how they work is highly recommended.1 We will see them in action at various points throughout the rest of this series. In this case we want all the files from the data/original/actives/plain/ directory that have the extension .run. So we use the Kleene star * as a wildcard in combination with .run to match all files that end in .run.

    Furthermore, the readtext() function allows us to specify where the meta-data is to be found with the docvarsfrom argument, in our case "filenames". The default separator value is the underscore, so we do not have to add this argument. If, however, the separator is not an underscore, you will need to supply the separator argument with the appropriate value. The actual names we want to give to the attributes can be added with the docvarnames argument. Note that docvarnames takes a character vector as its value. Remember, to create a character vector we use the c() function with each element quoted.
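    As a purely hypothetical aside (our files use the default underscore, so we skip the argument below): if a corpus delimited its file-name attributes with hyphens, the call might look something like the following sketch. To my knowledge the separator argument in readtext() is dvsep; confirm with ?readtext.

    # Hypothetical sketch: a corpus whose file names use hyphens as the delimiter
    other <- 
      readtext(file = "data/original/other_corpus/*.txt", # hypothetical path and files
               docvarsfrom = "filenames", # get attributes from filename
               dvsep = "-", # hyphen instead of the default underscore
               docvarnames = c("language", "country", "year")) # hypothetical attribute names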

    aes <- 
      readtext(file = "data/original/actives/plain/*.run", # read each file .run
               docvarsfrom = "filenames", # get attributes from filename
               docvarnames = c("language", "country", "year", "title", "type", "genre", "imdb_id")) # add the column names we want for each attribute
    
    glimpse(aes) # preview structure of the object
    Observations: 430
    Variables: 9
    $ doc_id   <chr> "es_Argentina_1950_Esposa-último-modelo_movie_n_199500.run", "es_Arge...
    $ text     <chr> "No está , señora . Aquí tampoco . No aparece , señora . ¿ Dónde se ha...
    $ language <chr> "es", "es", "es", "es", "es", "es", "es", "es", "es", "es", "es", "es"...
    $ country  <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Argentina", "Arge...
    $ year     <int> 1950, 1952, 1955, 1965, 1969, 1973, 1975, 1977, 1979, 1980, 1981, 1983...
    $ title    <chr> "Esposa-último-modelo", "No-abras-nunca-esa-puerta", "El-amor-nunca-m...
    $ type     <chr> "movie", "movie", "movie", "movie", "movie", "movie", "movie", "video-...
    $ genre    <chr> "n", "Mystery", "Drama", "Documentary", "Horror", "Adventure", "Drama"...
    $ imdb_id  <int> 199500, 184782, 47823, 282622, 62433, 70250, 71897, 333883, 333954, 17...

    The output from glimpse(aes) shows us that there are 430 observations and 9 attributes: the 7 meta-data attributes from the file names plus the added doc_id and text columns, which contain the file name and the text of each of the 430 corpus files. The information in doc_id is already captured in our meta-data, yet the values are not ideal, seeing as they are quite long and informationally redundant. Although not strictly necessary, let’s change the doc_id values to unique numeric values. To overwrite doc_id with numeric values we can use the mutate() function from the tidyverse package in combination with the row_number() function.

    aes <- 
      aes %>% 
      mutate(doc_id = row_number()) # change doc_id to numbers
    
    glimpse(aes) # preview structure of the object
    Observations: 430
    Variables: 9
    $ doc_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,...
    $ text     <chr> "No está , señora . Aquí tampoco . No aparece , señora . ¿ Dónde se ha...
    $ language <chr> "es", "es", "es", "es", "es", "es", "es", "es", "es", "es", "es", "es"...
    $ country  <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Argentina", "Arge...
    $ year     <int> 1950, 1952, 1955, 1965, 1969, 1973, 1975, 1977, 1979, 1980, 1981, 1983...
    $ title    <chr> "Esposa-último-modelo", "No-abras-nunca-esa-puerta", "El-amor-nunca-m...
    $ type     <chr> "movie", "movie", "movie", "movie", "movie", "movie", "movie", "video-...
    $ genre    <chr> "n", "Mystery", "Drama", "Documentary", "Horror", "Adventure", "Drama"...
    $ imdb_id  <int> 199500, 184782, 47823, 282622, 62433, 70250, 71897, 333883, 333954, 17...

    Explore the tidy dataset

    Now that we have the data in a tidy format, where each row is one of our corpus files and each column is a meta-data attribute that describes each corpus file, let’s do some quick exploration of the distribution of the data to get a better feel for what our corpus is like. One thing we can do is calculate the size of the corpus. A rudimentary approach to corpus size is the number of word tokens. The tidytext package provides a very useful function, unnest_tokens(), which offers a simple and efficient way to tokenize text while maintaining the tidy structure we have created. In combination with a set of functions from the tidyverse package, we can tokenize the text into words and count the number of words (count()).

    Let’s take this in two steps so you can appreciate what unnest_tokens() does. First load (or install) tidytext.

    pacman::p_load(tidytext) # use the pacman package to load-install

    Now let’s tokenize the text column into word terms and preview the first 25 rows in the output.

    aes_tokens <- 
      aes %>% 
      unnest_tokens(output = terms, input = text) # tokenize `text` into words `terms`
    aes_tokens %>% 
      head(25) # view first 25 tokenized terms

    We see in the previous table that a column terms has replaced text in our tidy dataset. The meta-data, however, is still intact.

    The unnest_tokens() function from tidytext is very flexible. Here we have used the default arguments, which produce word tokens. There are many other tokenization parameters, which we will use later to create sentence tokens, ngram tokens, and custom tokenization schemes. View ?unnest_tokens to find out more in the R documentation.
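    For instance, here is a minimal sketch (using the same aes object) of two alternative tokenization schemes we will return to later:

    aes %>% 
      unnest_tokens(output = sentence, input = text, token = "sentences") # sentence tokens
    aes %>% 
      unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2) # bigram (2-word) tokens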

    After applying the unnest_tokens() function in the previous code, the rows correspond to tokenized words. Therefore the number of rows corresponds to the total number of words in the corpus. To find the total number of words we can use the count() function.

    aes_tokens %>% 
      count() # count terms

    The count() function can be used with a data frame, like our aes_tokens object, to group our rows by the values of a particular column. A practical application for this functionality is to group the rows (word terms) by the values of country (‘Argentina’, ‘Mexico’, and ‘Spain’). This will give us the number of words in each country sub-corpus.

    aes_tokens %>% 
      count(country) # count terms by `country`

    So now we know the total word count of the corpus and the number of words in each country sub-corpus. If we would like to have a description of the proportion of words from each sub-corpus in the total corpus, we can use the mutate() function to create a new column prop which calculates the total size of the corpus (sum(n)) and then divides each sub-corpus size (n) by this number.

    aes_country_props <- 
      aes_tokens %>% 
      count(country) %>% # count terms by `country`
      mutate(prop = n / sum(n) ) # add the word term proportion for each country
    aes_country_props

    As we have seen in the previous examples, tidy datasets are easy to work with. Another advantage of data frames is that we can use them to create graphics with the ggplot2 package. ggplot2 is a powerful package for creating graphics in R that applies what is known as the ‘Grammar of Graphics’. The Grammar of Graphics recognizes that there are three principal components to any graphic: (1) data, (2) mappings, or ‘aesthetics’ as they are called, and (3) geometries, or ‘geoms’. Data is the data frame which contains our observations (rows) and our variables (columns). We connect certain variables of interest from our data set to certain parameters in the visual space. Typical parameters include the ‘x-axis’ and the ‘y-axis’. The x-axis corresponds to the horizontal plane and the y-axis to the vertical plane. This sets up a base coordinate system for visualizing the data. Once our data has been mapped to a visual space, we then designate an appropriate geometry to represent this space (bar plots, line graphs, scatter plots, etc.). There are many geometries available in ggplot2 for the relevant mapping types.

    Let’s visualize the aes_country_props object as a bar graph, as an example. ggplot2 is included as part of the tidyverse package so we already have access to it. So first we pass the aes_country_props data frame to the ggplot() function. Then we map the x-axis to the country column and the y-axis to the prop column. This mapping is then passed with the plus + operator to the geom_col() function to visualize the mapping in columns, or bars.

    aes_country_props %>% # pass the data frame as our data source
      ggplot(aes(x = country, y = prop)) + # create x- and y-axis mappings
      geom_col() # visualize a column-wise geometry

    The above code will do the heavy lifting to create our plot. Below I’ve added the labs() function to create a more informative graphic with prettier x- and y-axis labels and a title and subtitle.

    aes_country_props %>% # pass the data frame as our data source
      ggplot(aes(x = country, y = prop)) + # create x- and y-axis mappings
      geom_col() + # visualize a column-wise geometry
      labs(x = "Country", y = "Proportion (%)", title = "ACTIV-ES Corpus Distribution", subtitle = "Proportion of words in each country sub-corpus")

    In this example we covered reading a corpus and the meta-data contained within the file names of this corpus with the readtext package. We then did some quick exploratory work to find the corpus size and the proportions of the corpus by country sub-corpus with the tidytext package and assorted functions from the tidyverse package. We rounded things out with a brief introduction to the ggplot2 package, which we used to visualize the country sub-corpus proportions.

    Running text with inline meta-data

    In the previous example, our corpus contained meta-data stored in the individual file names of the corpus. In some other cases the meta-data is stored inline with the corpus text itself. The goal in cases such as these is to separate the meta-data from the text and coerce all the information into a tidy dataset.

    Download corpus data

    As an example we will work with the Switchboard Dialog Act Corpus (SDAC), which extends the Switchboard Corpus with speech act annotation. The SDAC dialogues (swb1_dialogact_annot.tar.gz) are available as a free download from the LDC. The dialogues are contained within a compressed .tar.gz file. This file can be downloaded manually and its contents extracted to disk, but since we are working to create a reproducible workflow we will approach this task programmatically.

    We already have a custom function, get_zip_data(), that deals with .zip files, but we need one that works on .tar.gz files. R provides the untar() function to extract .tar.gz files, which we can use to mimic the functionality of the unzip() function used in get_zip_data(). Instead of writing a new custom function to deal specifically with .tar.gz files, I’ve created a function that deals with both compressed file formats, named it get_compressed_data(), and added it to my functions/acquire_functions.R file.

    get_compressed_data <- function(url, target_dir, force = FALSE) {
      # Get the extension of the target file
      ext <- tools::file_ext(url)
      # Check to see if the target file is a compressed file
      if(!ext %in% c("zip", "gz", "tar")) stop("Target file given is not supported")
      # Check to see if the data already exists
      if(!dir.exists(target_dir) | force == TRUE) { # if data does not exist, download/ decompress
        cat("Creating target data directory \n") # print status message
        dir.create(path = target_dir, recursive = TRUE, showWarnings = FALSE) # create target data directory
        cat("Downloading data... \n") # print status message
        temp <- tempfile() # create a temporary space for the file to be written to
        download.file(url = url, destfile = temp) # download the data to the temp file
        # Decompress the temp file in the target directory
        if(ext == "zip") {
          unzip(zipfile = temp, exdir = target_dir, junkpaths = TRUE) # zip files
        } else {
          untar(tarfile = temp, exdir = target_dir) # tar, gz files
        }
        cat("Data downloaded! \n") # print status message
      } else { # if data exists, don't download it again
        cat("Data already exists \n") # print status message
      }
    }

    Once this function is loaded into R, either by sourcing the functions/acquire_functions.R file (source("functions/acquire_functions.R")) or running the code directly, we apply the function to the resource URL targeting the data/original/sdac/ directory as the extraction location.

    get_compressed_data(url = "https://catalog.ldc.upenn.edu/docs/LDC97S62/swb1_dialogact_annot.tar.gz", 
                        target_dir = "data/original/sdac/")

    The main directory structure of the sdac/ data looks like this:

    .
    ├── README
    ├── doc
    ├── sw00utt
    ├── sw01utt
    ├── sw02utt
    ├── sw03utt
    ├── sw04utt
    ├── sw05utt
    ├── sw06utt
    ├── sw07utt
    ├── sw08utt
    ├── sw09utt
    ├── sw10utt
    ├── sw11utt
    ├── sw12utt
    └── sw13utt
    
    15 directories, 1 file

    The README file contains basic information about the resource, the doc/ directory contains more detailed information about the dialog annotations, and each of the following directories prefixed with sw... contains individual conversation files. Here’s a peek at the internal structure of the first couple of directories.

    .
    ├── README
    ├── doc
    │   └── manual.august1.html
    ├── sw00utt
    │   ├── sw_0001_4325.utt
    │   ├── sw_0002_4330.utt
    │   ├── sw_0003_4103.utt
    │   ├── sw_0004_4327.utt
    │   ├── sw_0005_4646.utt

    Let’s take a look at the first conversation file (sw_0001_4325.utt) to see how it is structured.

    *x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*
    *x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*
    *x*                                                                     *x*
    *x*            Copyright (C) 1995 University of Pennsylvania            *x*
    *x*                                                                     *x*
    *x*    The data in this file are part of a preliminary version of the   *x*
    *x*    Penn Treebank Corpus and should not be redistributed.  Any       *x*
    *x*    research using this corpus or based on it should acknowledge     *x*
    *x*    that fact, as well as the preliminary nature of the corpus.      *x*
    *x*                                                                     *x*
    *x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*
    *x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*
    
    
    FILENAME:   4325_1632_1519
    TOPIC#:     323
    DATE:       920323
    TRANSCRIBER:    glp
    UTT_CODER:  tc
    DIFFICULTY: 1
    TOPICALITY: 3
    NATURALNESS:    2
    ECHO_FROM_B:    1
    ECHO_FROM_A:    4
    STATIC_ON_A:    1
    STATIC_ON_B:    1
    BACKGROUND_A:   1
    BACKGROUND_B:   2
    REMARKS:        None.
    
    =========================================================================
      
    
    o          A.1 utt1: Okay.  /
    qw          A.1 utt2: {D So, }   
    
    qy^d          B.2 utt1: [ [ I guess, +   
    
    +          A.3 utt1: What kind of experience [ do you, + do you ] have, then with child care? /

    There are a few things to take note of here. First, we see that the conversation files have a meta-data header offset from the conversation text by a line of = characters. Second, the header contains meta-information of various types. Third, the text is interleaved with an annotation scheme.

    Some of the information may be readily understandable, such as the various pieces of meta-data in the header, but to get a better understanding of what information is encoded here let’s take a look at the README file. In this file we get a bird’s-eye view of what is going on. In short, the data includes 1155 telephone conversations between two people annotated with 42 ‘DAMSL’ dialog act labels. The README file refers us to the doc/manual.august1.html file for more information on this scheme.

    At this point we open the doc/manual.august1.html file in a browser and do some investigation. We find out that ‘DAMSL’ stands for ‘Discourse Annotation and Markup System of Labeling’ and that the first characters of each line of the conversation text correspond to one or a combination of labels for each utterance. So for our first utterances we have:

    o = "Other"
    qw = "Wh-Question"
    qy^d = "Declarative Yes-No-Question"
    + = "Segment (multi-utterance)"

    Each utterance is also labeled for speaker (‘A’ or ‘B’), speaker turn (‘1’, ‘2’, ‘3’, etc.), and each utterance within that turn (‘utt1’, ‘utt2’, etc.). There is other annotation provided within each utterance, but this should be enough to get us started on the conversations.

    Now let’s turn to the meta-data in the header. We see here that there is information about the creation of the file: ‘FILENAME’, ‘TOPIC’, ‘DATE’, etc. The doc/manual.august1.html file doesn’t have much to say about this information, so I returned to the LDC Documentation and found more information in the Online Documentation section. After some poking around in this documentation I discovered that the meta-data for each speaker in the corpus is found in the caller_tab.csv file. This tabular file does not contain column names, but the caller_doc.txt does. After inspecting these files manually and comparing them with the information in the conversation file, I noticed that the ‘FILENAME’ information contained three pieces of useful information delimited by underscores _.

    *x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*
    
    
    FILENAME:   4325_1632_1519
    TOPIC#:     323
    DATE:       920323
    TRANSCRIBER:    glp

    The first piece is the document id (4325); the second and third correspond to the speaker ids: the first being speaker A (1632) and the second speaker B (1519).

    Tidy the corpus

    In sum, we have 1155 conversation files. Each file has two parts, a header and a text section, separated by a line of = characters. The header section contains a ‘FILENAME’ line which has the document id and ids for speaker A and speaker B. The text section is annotated with DAMSL tags beginning each line, followed by speaker, turn number, utterance number, and the utterance text. With this knowledge in hand, let’s set out to create a tidy dataset with the following column structure: doc_id, damsl_tag, speaker, turn_num, utterance_num, utterance_text, and speaker_id.

    Let’s begin by reading one of the conversation files into R as a character vector using the read_lines() function.

    doc <- read_lines(file = "data/original/sdac/sw00utt/sw_0001_4325.utt") # read file by lines

    To isolate the vector element that contains the document and speaker ids, we use str_detect() from the stringr package. This function takes two arguments, a string and a pattern, and returns a logical value, TRUE if the pattern is matched or FALSE if not. We can use the output of this function, then, to subset the doc character vector and only return the vector element (line) that contains digits_digits_digits with a regular expression. The expression combines the digit matching operator \\d with the + operator to match 1 or more contiguous digits. We then separate three groups of \\d+ with underscores _. The result is \\d+_\\d+_\\d+.

    pacman::p_load(stringr) # load-install `stringr` package
    doc[str_detect(doc, pattern = "\\d+_\\d+_\\d+")] # isolate pattern
    ## [1] "FILENAME:\t4325_1632_1519"

    The next step is to extract the three digit sequences that correspond to the doc_id, speaker_a_id, and speaker_b_id. First we extract the pattern that we have identified with str_extract() and then we can break up the single character vector into multiple parts based on the underscore _. The str_split() function takes a string and then a pattern to use to split a character vector. It will return a list of character vectors.

    doc[str_detect(doc, "\\d+_\\d+_\\d+")] %>% # isolate pattern
      str_extract(pattern = "\\d+_\\d+_\\d+") %>% # extract the pattern
      str_split(pattern = "_") # split the character vector
    ## [[1]]
    ## [1] "4325" "1632" "1519"

    A list is a special object type in R. It is an ordered collection of objects whose lengths (and types) can differ (contrast this with a data frame, which is a collection of equal-length vectors, hence the tabular format). In this case we have a list of length 1 whose sole element is a character vector of length 3, one element per segment returned from our split. This is usually the desired behavior: if we were to pass multiple character vectors to str_split(), we would not want the results conflated into a single character vector, blurring the distinction between the individual inputs. If we would like to conflate, or flatten, a list, we can use the unlist() function.

    doc[str_detect(doc, "\\d+_\\d+_\\d+")] %>% # isolate pattern
      str_extract(pattern = "\\d+_\\d+_\\d+") %>% # extract the pattern
      str_split(pattern = "_") %>% # split the character vector
      unlist() # flatten the list to a character vector
    ## [1] "4325" "1632" "1519"

    Let’s flatten the list in this case, as we have a single character vector, and assign this result to doc_speaker_info.

    doc_speaker_info <- 
      doc[str_detect(doc, "\\d+_\\d+_\\d+")] %>% # isolate pattern
      str_extract(pattern = "\\d+_\\d+_\\d+") %>% # extract the pattern
      str_split(pattern = "_") %>%  # split the character vector
      unlist() # flatten the list to a character vector

    doc_speaker_info is now a character vector of length three. Let’s subset each of the elements and assign them to meaningful variable names so we can conveniently use them later on in the tidying process.

    doc_id <- doc_speaker_info[1]
    speaker_a_id <- doc_speaker_info[2]
    speaker_b_id <- doc_speaker_info[3]

    The next step is to isolate the text section extracting it from rest of the document. As noted previously, a sequence of = separates the header section from the text section. What we need to do is to index the point in our character vector doc where that line occurs and then subset the doc from that point until the end of the character vector. Let’s first find the point where the = sequence occurs. We will again use the str_detect() function to find the pattern we are looking for (a contiguous sequence of =), but then we will pass the logical result to the which() function which will return the element index number of this match.

    doc %>% 
      str_detect(pattern = "=+") %>% # match 1 or more `=`
      which() # find vector index
    ## [1] 31

    So for this file 31 is the index in doc where the = sequence occurs. Now it is important to keep in mind that we are working with a single file from the sdac/ data. We need to be cautious to not create a pattern that may be matched multiple times in another document in the corpus. As the =+ pattern will match =, or ==, or ===, etc. it is not implausible to believe that there might be a = character on some other line in one of the other files. Let’s update our regular expression to avoid this potential scenario by only matching sequences of three or more =. In this case we will make use of the curly bracket operators {}.

    doc %>% 
      str_detect(pattern = "={3,}") %>% # match 3 or more `=`
      which() # find vector index
    ## [1] 31

    We will get the same result for this file, but this safeguards us a bit, as it is unlikely we will find multiple matches for ===, ====, etc.

    31 is the index for the = sequence, but we want the next line to be where we start reading the text section. To do this we increment the index by 1.

    text_start_index <- 
      doc %>% 
      str_detect(pattern = "={3,}") %>% # match 3 or more `=` 
      which() # find vector index
    text_start_index <- text_start_index + 1 # increment index by 1

    The index for the end of the text is simply the length of the doc vector. We can use the length() function to get this index.

    text_end_index <- length(doc)

    We now have the bookends, so to speak, for our text section. To extract the text we subset the doc vector by these indices.

    text <- doc[text_start_index:text_end_index] # extract text
    head(text) # preview first lines of `text`
    ## [1] "  "                                       
    ## [2] ""                                         
    ## [3] "o          A.1 utt1: Okay.  /"            
    ## [4] "qw          A.1 utt2: {D So, }   "        
    ## [5] ""                                         
    ## [6] "qy^d          B.2 utt1: [ [ I guess, +   "

    The text has some extra whitespace on some lines and there are blank lines as well. We should do some cleaning up before moving forward to organize the data. To get rid of the whitespace we use the str_trim() function which by default will remove leading and trailing whitespace from each line.

    text <- str_trim(text) # remove leading and trailing whitespace
    head(text) # preview first lines of `text`
    ## [1] ""                                      
    ## [2] ""                                      
    ## [3] "o          A.1 utt1: Okay.  /"         
    ## [4] "qw          A.1 utt2: {D So, }"        
    ## [5] ""                                      
    ## [6] "qy^d          B.2 utt1: [ [ I guess, +"

    To remove blank lines we will create a logical expression to subset the text vector. text != "" means return TRUE where lines are not blank, and FALSE where they are.

    text <- text[text != ""] # remove blank lines
    head(text) # preview first lines of `text`
    ## [1] "o          A.1 utt1: Okay.  /"                                                                  
    ## [2] "qw          A.1 utt2: {D So, }"                                                                 
    ## [3] "qy^d          B.2 utt1: [ [ I guess, +"                                                         
    ## [4] "+          A.3 utt1: What kind of experience [ do you, + do you ] have, then with child care? /"
    ## [5] "+          B.4 utt1: I think, ] + {F uh, } I wonder ] if that worked. /"                        
    ## [6] "qy          A.5 utt1: Does it say something? /"

    Our first step towards a tidy dataset is to now combine the doc_id and each element of text in a data frame.

    data <- data.frame(doc_id, text) # tidy format `doc_id` and `text`
    head(data) # preview the first rows of the data frame

    With our data now in a data frame, it’s time to parse the text column and extract the damsl tags, speaker, speaker turn, utterance number, and the utterance text itself into separate columns. To do this we will make extensive use of regular expressions. Our aim is to find a consistent pattern that distinguishes each piece of information from the other text in a given row of data$text and extract it.

    The best way to learn regular expressions is to use them. To this end I’ve included a window to the interactive regular expression practice website regex101 in Figure 1.

    Copy the text below into the ‘TEST STRING’ field.

    o          A.1 utt1: Okay.  /
    qw          A.1 utt2: {D So, }
    qy^d          B.2 utt1: [ [ I guess, +
    +          A.3 utt1: What kind of experience [ do you, + do you ] have, then with child care? /
    +          B.4 utt1: I think, ] + {F uh, } I wonder ] if that worked. /
    qy          A.5 utt1: Does it say something? /
    sd          B.6 utt1: I think it usually does.  /
    ad          B.6 utt2: You might try, {F uh, }  /
    h          B.6 utt3: I don't know,  /
    ad          B.6 utt4: hold it down a little longer,  /

    Figure 1: Interactive interface to the regex101 practice website.

    Now manually type the following regular expressions into the ‘REGULAR EXPRESSION’ field one-by-one (each is on a separate line). Notice what is matched as you type and when you’ve finished typing. You can find out exactly what the component parts of each expression are doing by toggling the top right icon in the window or hovering your mouse over the relevant parts of the expression.

    ^.+?\s
    [AB]\.\d+
    utt\d+
    :.+$

    As you can now see, we have regular expressions that will match the damsl tags, speaker and speaker turn, utterance number, and the utterance text. To apply these expressions to our data and extract this information into separate columns we will make use of the mutate() and str_extract() functions. mutate() will take our data frame and create new columns with values we match and extract from each row in the data frame with str_extract(). Notice that str_extract() is different than str_extract_all(). When we work with mutate() each row will be evaluated in turn, therefore we only need to make one match per row in data$text.
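    To see the difference on a single string, here is a quick illustration (the string is made up for demonstration):

    str_extract("utt1 utt2 utt3", pattern = "utt\\d+") # returns the first match only
    ## [1] "utt1"
    str_extract_all("utt1 utt2 utt3", pattern = "utt\\d+") # returns all matches as a list
    ## [[1]]
    ## [1] "utt1" "utt2" "utt3"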

    I’ve chained each of these steps in the code below, dropping the original text column with select(-text), and overwriting data with the results.

    data <- # extract column information from `text`
      data %>% 
      mutate(damsl_tag = str_extract(string = text, pattern = "^.+?\\s")) %>%  # extract damsl tags
      mutate(speaker_turn = str_extract(string = text, pattern = "[AB]\\.\\d+")) %>% # extract speaker_turn pairs
      mutate(utterance_num = str_extract(string = text, pattern = "utt\\d+")) %>% # extract utterance number
      mutate(utterance_text = str_extract(string = text, pattern = ":.+$")) %>%  # extract utterance text
      select(-text) # drop the `text` column
    
    glimpse(data) # preview the data set
    ## Observations: 159
    ## Variables: 5
    ## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
    ## $ damsl_tag      <chr> "o ", "qw ", "qy^d ", "+ ", "+ ", "qy ", "sd ",...
    ## $ speaker_turn   <chr> "A.1", "A.1", "B.2", "A.3", "B.4", "A.5", "B.6"...
    ## $ utterance_num  <chr> "utt1", "utt2", "utt1", "utt1", "utt1", "utt1",...
    ## $ utterance_text <chr> ": Okay.  /", ": {D So, }", ": [ [ I guess, +",...

    One twist you will notice is that regular expressions in R require double backslashes (\\) where other programming environments use a single backslash (\).
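    As a quick illustration of the escaping, using the FILENAME value we extracted earlier:

    writeLines("\\d+_\\d+_\\d+") # the pattern the regex engine actually sees
    ## \d+_\d+_\d+
    str_detect("4325_1632_1519", pattern = "\\d+_\\d+_\\d+") # double backslashes in the R string
    ## [1] TRUE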

    There are a couple things left to do to the columns we extracted from the text before we move on to finishing up our tidy dataset. First, we need to separate the speaker_turn column into speaker and turn_num columns and second we need to remove unwanted characters from the damsl_tag, utterance_num, and utterance_text columns.

    To separate the values of a column into two columns we use the separate() function. It takes a column to separate and character vector of the names of the new columns to create. By default the values of the input column will be separated by non-alphanumeric characters. In our case this means the . will be our separator.

    data <-
      data %>% 
      separate(col = speaker_turn, into = c("speaker", "turn_num")) # separate speaker_turn into distinct columns
    
    glimpse(data) # preview the data set
    ## Observations: 159
    ## Variables: 6
    ## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
    ## $ damsl_tag      <chr> "o ", "qw ", "qy^d ", "+ ", "+ ", "qy ", "sd ",...
    ## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
    ## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
    ## $ utterance_num  <chr> "utt1", "utt2", "utt1", "utt1", "utt1", "utt1",...
    ## $ utterance_text <chr> ": Okay.  /", ": {D So, }", ": [ [ I guess, +",...

    To remove unwanted leading or trailing whitespace we apply the str_trim() function. For removing other characters, we match the character(s) and replace them with an empty string ("") using the str_replace() function. Again, I’ve chained these functions together and overwritten data with the results.

    data <- # clean up column information
      data %>% 
      mutate(damsl_tag = str_trim(damsl_tag)) %>% # remove leading/ trailing whitespace
      mutate(utterance_num = str_replace(string = utterance_num, pattern = "utt", replacement = "")) %>% # remove 'utt'
      mutate(utterance_text = str_replace(string = utterance_text, pattern = ":\\s", replacement = "")) %>% # remove ': '
      mutate(utterance_text = str_trim(utterance_text)) # trim leading/ trailing whitespace
    
    glimpse(data) # preview the data set
    ## Observations: 159
    ## Variables: 6
    ## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
    ## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
    ## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
    ## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
    ## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
    ## $ utterance_text <chr> "Okay.  /", "{D So, }", "[ [ I guess, +", "What...

    To round out our tidy dataset for this single conversation file we will connect the speaker_a_id and speaker_b_id with speaker A and B in our current dataset adding a new column speaker_id. The case_when() function does exactly this: allows us to map rows of speaker with the value “A” to speaker_a_id and rows with value “B” to speaker_b_id.

    data <- # link speaker with speaker_id
      data %>% 
      mutate(speaker_id = case_when(
        speaker == "A" ~ speaker_a_id,
        speaker == "B" ~ speaker_b_id
      ))
    
    glimpse(data) # preview the data set
    ## Observations: 159
    ## Variables: 7
    ## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
    ## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
    ## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
    ## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
    ## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
    ## $ utterance_text <chr> "Okay.  /", "{D So, }", "[ [ I guess, +", "What...
    ## $ speaker_id     <chr> "1632", "1632", "1519", "1632", "1519", "1632",...

    We now have the tidy dataset we set out to create. But this dataset only includes one conversation file. We want to apply this code to all 1155 conversation files in the sdac/ corpus. The approach will be to create a custom function that groups the code we’ve written for this single file, then iteratively send each file from the corpus through this function and combine the results into one data frame.

    Here’s the function with some extra code to print a progress message for each file when it runs.

    extract_sdac_metadata <- function(file) {
      # Function: to read a Switchboard Corpus Dialogue file and extract meta-data
      cat("Reading", basename(file), "...")
      
      # Read `file` by lines
      doc <- read_lines(file) 
      
      # Extract `doc_id`, `speaker_a_id`, and `speaker_b_id`
      doc_speaker_info <- 
        doc[str_detect(doc, "\\d+_\\d+_\\d+")] %>% # isolate pattern
        str_extract("\\d+_\\d+_\\d+") %>% # extract the pattern
        str_split(pattern = "_") %>% # split the character vector
        unlist() # flatten the list to a character vector
      doc_id <- doc_speaker_info[1] # extract `doc_id`
      speaker_a_id <- doc_speaker_info[2] # extract `speaker_a_id`
      speaker_b_id <- doc_speaker_info[3] # extract `speaker_b_id`
      
      # Extract `text`
      text_start_index <- # find where header info stops
        doc %>% 
        str_detect(pattern = "={3,}") %>% # match 3 or more `=`
        which() # find vector index
      
      text_start_index <- text_start_index + 1 # increment index by 1
      text_end_index <- length(doc) # get the end of the text section
      
      text <- doc[text_start_index:text_end_index] # extract text
      text <- str_trim(text) # remove leading and trailing whitespace
      text <- text[text != ""] # remove blank lines
      
      data <- data.frame(doc_id, text) # tidy format `doc_id` and `text`
      
      data <- # extract column information from `text`
        data %>% 
        mutate(damsl_tag = str_extract(string = text, pattern = "^.+?\\s")) %>%  # extract damsl tags
        mutate(speaker_turn = str_extract(string = text, pattern = "[AB]\\.\\d+")) %>% # extract speaker_turn pairs
        mutate(utterance_num = str_extract(string = text, pattern = "utt\\d+")) %>% # extract utterance number
        mutate(utterance_text = str_extract(string = text, pattern = ":.+$")) %>%  # extract utterance text
        select(-text)
      
      data <- # separate speaker_turn into distinct columns
        data %>% 
        separate(col = speaker_turn, into = c("speaker", "turn_num")) 
      
      data <- # clean up column information
        data %>% 
        mutate(damsl_tag = str_trim(damsl_tag)) %>% # remove leading/ trailing whitespace
        mutate(utterance_num = str_replace(string = utterance_num, pattern = "utt", replacement = "")) %>% # remove 'utt'
        mutate(utterance_text = str_replace(string = utterance_text, pattern = ":\\s", replacement = "")) %>% # remove ': '
        mutate(utterance_text = str_trim(utterance_text)) # trim leading/ trailing whitespace
      
      data <- # link speaker with speaker_id
        data %>% 
        mutate(speaker_id = case_when(
          speaker == "A" ~ speaker_a_id,
          speaker == "B" ~ speaker_b_id
        )) 
      cat(" done.\n")
      return(data) # return the data frame object
    }

    As a sanity check we will run the extract_sdac_metadata() function on the conversation file we were just working with to make sure it works as expected.

    extract_sdac_metadata(file = "data/original/sdac/sw00utt/sw_0001_4325.utt") %>% 
      glimpse()
    ## Reading sw_0001_4325.utt ... done.
    ## Observations: 159
    ## Variables: 7
    ## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
    ## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
    ## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
    ## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
    ## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
    ## $ utterance_text <chr> "Okay.  /", "{D So, }", "[ [ I guess, +", "What...
    ## $ speaker_id     <chr> "1632", "1632", "1519", "1632", "1519", "1632",...

    Looks good, so now it’s time to create a vector with the paths to all of the conversation files. list.files() interfaces with our OS file system and will return the paths to the files in the specified directory. We also add a pattern to match conversation files (\\.utt) so we don’t accidentally include other files in the corpus. Setting full.names and recursive to TRUE means we will get the full path to each file and that files in all sub-directories will be returned.

    files <- 
      list.files(path = "data/original/sdac", # path to main directory 
                 pattern = "\\.utt", # files to match
                 full.names = TRUE, # extract full path to each file
                 recursive = TRUE) # drill down in each sub-directory of `sdac/`
    head(files) # preview character vector
    ## [1] "data/original/sdac/sw00utt/sw_0001_4325.utt"
    ## [2] "data/original/sdac/sw00utt/sw_0002_4330.utt"
    ## [3] "data/original/sdac/sw00utt/sw_0003_4103.utt"
    ## [4] "data/original/sdac/sw00utt/sw_0004_4327.utt"
    ## [5] "data/original/sdac/sw00utt/sw_0005_4646.utt"
    ## [6] "data/original/sdac/sw00utt/sw_0006_4108.utt"

    To pass each of these file paths iteratively to the extract_sdac_metadata() function we use map() (from the purrr package, loaded with tidyverse). This will apply the function to each conversation file and return a data frame for each. bind_rows() will then join the resulting data frames by rows to give us a single tidy dataset for all 1155 conversations. Note there is a lot of processing going on here, so be patient.

    # Read files and return a tidy dataset
    sdac <- 
      files %>% # pass file names
      map(extract_sdac_metadata) %>% # read and tidy iteratively 
      bind_rows() # bind the results into a single data frame
    glimpse(sdac) # preview the dataset
    ## Observations: 223,606
    ## Variables: 7
    ## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
    ## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
    ## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
    ## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
    ## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
    ## $ utterance_text <chr> "Okay.  /", "{D So, }", "[ [ I guess, +", "What...
    ## $ speaker_id     <chr> "1632", "1632", "1519", "1632", "1519", "1632",...

    Explore the tidy dataset

    It is always a good idea to perform some diagnostics to confirm the integrity of the data. One thing that can go wrong in tidying a dataset, as we’ve done here, is that our pattern matching failed and did not return what we expected. This can happen because our patterns were not specific enough, or it can arise from transcriber/annotator error. In any case it can lead to missing (NA) values. R provides the function complete.cases() to test for NA values, returning TRUE for rows in a data frame which do not include NA values. We can use this to subset the sdac dataset and identify any rows that contain missing values. Note that because we are subsetting a data frame by rows, we add our expression to the row position in the subsetting operation, i.e. ‘data_frame_name[row, column]’.

    sdac[!complete.cases(sdac), ] # check for missing values

    Great! No missing data. Now let’s make sure we have captured all 1155 conversation files. We will pipe the sdac$doc_id column to the unique() function which returns the unique values of the column. Then we can get the length of that result to find out how many unique conversation files we have in our dataset.

    sdac$doc_id %>% unique() %>% length() # check for unique files
    ## [1] 1155

    Also good news, we have all 1155 conversations in the dataset.

    Let’s find out how many individual speakers are in the dataset.

    sdac$speaker_id %>% unique() %>% length() # check for unique speakers
    ## [1] 441

    Good to know before we proceed to adding speaker meta-data from the stand-off file caller_tab.csv.

    Running text with stand-off meta-data files

    The sdac dataset now contains various pieces of linguistic and non-linguistic meta-data that we extracted from the conversation files in the sdac/ corpus. As part of that extraction we isolated the ids of the speakers involved in each conversation. As noted during the preliminary investigation portion of the curation of the data, these ids appear in a stand-off meta-data file named caller_tab.csv in the online documentation for the Switchboard Dialog Act Corpus. These ids provide us a link between the corpus data and the speaker meta-data that we can exploit to incorporate that meta-data into our existing sdac tidy dataset.

    It is common for stand-off meta-data files to be in a structured format. That is, they will typically be stored in a .csv file or an .xml document. The goal then is to read the data into R as a data frame and then join that data with the existing tidy corpus dataset. To read a .csv file like caller_tab.csv we use the read_csv() function. Before we read it, we should manually download and inspect the data for a couple of things: (1) how is the file delimited? and (2) is there a header row that names the columns in the data?

    We can generally assume that a .csv file will be comma-separated, but this is not always the case; sometimes the file will be delimited by semi-colons (;), tabs (\\t), or single or multiple spaces (\\s+). Whether there is a header row or not can also vary. If a header row does not exist in the file itself, there is a good chance some other file documents what each column in the data represents.
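    As a hedged sketch (the file names below are hypothetical), the readr family covers the common delimiter variants, and the col_names argument controls whether a header row is expected:

    read_csv2("meta_semicolons.csv") # semicolon-delimited (common in European locales)
    read_tsv("meta_tabs.tsv") # tab-delimited
    read_delim("meta_spaces.txt", delim = " ") # single-space-delimited
    read_csv("meta_no_header.csv", col_names = FALSE) # no header row; columns named X1, X2, ...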

    Let’s take a look at a few rows from the caller_tab.csv to see what we have.

    1000, 32, "N", "FEMALE", 1954, "SOUTH MIDLAND", 1, 0, "CASH", 15, "N", "", 2, "DN2"
    1001, 102, "N", "MALE", 1940, "WESTERN", 3, 0, "GIFT", 10, "N", "", 0, "XP"
    1002, 104, "N", "FEMALE", 1963, "SOUTHERN", 2, 0, "GIFT", 11, "N", "", 0, "XP"

    First we see that columns are indeed separated by commas. Second we see there is no header row. Some of the columns seem interpretable, like column 4, but we should try to find documentation to guide us. Poking around in the online documentation I noticed the caller_doc.txt file has names for the columns. This is a file used to generate a database table, but it contains the information we need so we’ll use it to assign names to our columns in caller_tab.csv.

    CREATE TABLE caller (
        caller_no numeric(4) NOT NULL,
        pin numeric(4) NOT NULL,
        target character(1),
        sex character(6),
        birth_year numeric(4),
        dialect_area character(13),
        education numeric(1),
        ti numeric(1),
        payment_type character(5),
        amt_pd numeric(6),
        con character(1),
        remarks character(120),
        calls_deleted numeric(3),
        speaker_partition character(3)
    );

    We can combine this information with the read_csv() function to read the caller_tab.csv and add the column names. Note that I’ve changed the caller_no name to speaker_id to align the nomenclature with the current sdac dataset. This renaming will facilitate the upcoming step to join the tidy dataset and this meta-data.

    sdac_speaker_meta <- 
      read_csv(file = "https://catalog.ldc.upenn.edu/docs/LDC97S62/caller_tab.csv", 
               col_names = c("speaker_id", # changed from `caller_no`
                             "pin",
                             "target",
                             "sex",
                             "birth_year",
                             "dialect_area",
                             "education",
                             "ti",
                             "payment_type",
                             "amt_pd",
                             "con",
                             "remarks",
                             "calls_deleted",
                             "speaker_partition"))
    
    glimpse(sdac_speaker_meta) # preview the dataset
    ## Observations: 543
    ## Variables: 14
    ## $ speaker_id        <int> 1000, 1001, 1002, 1003, 1004, 1005, 1007, 10...
    ## $ pin               <int> 32, 102, 104, 5656, 123, 166, 274, 322, 445,...
    ## $ target            <chr> "\"N\"", "\"N\"", "\"N\"", "\"N\"", "\"N\"",...
    ## $ sex               <chr> "\"FEMALE\"", "\"MALE\"", "\"FEMALE\"", "\"M...
    ## $ birth_year        <int> 1954, 1940, 1963, 1947, 1958, 1956, 1965, 19...
    ## $ dialect_area      <chr> "\"SOUTH MIDLAND\"", "\"WESTERN\"", "\"SOUTH...
    ## $ education         <int> 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 3,...
    ## $ ti                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
    ## $ payment_type      <chr> "\"CASH\"", "\"GIFT\"", "\"GIFT\"", "\"NONE\...
    ## $ amt_pd            <int> 15, 10, 11, 7, 11, 22, 20, 3, 11, 9, 25, 9, ...
    ## $ con               <chr> "\"N\"", "\"N\"", "\"N\"", "\"Y\"", "\"N\"",...
    ## $ remarks           <chr> "\"\"", "\"\"", "\"\"", "\"\"", "\"\"", "\"\...
    ## $ calls_deleted     <int> 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,...
    ## $ speaker_partition <chr> "\"DN2\"", "\"XP\"", "\"XP\"", "\"DN2\"", "\...

    The columns mapped to the data as expected. The character columns, however, contain literal double quotes (shown escaped as \" in the output above). We could proceed without issue (R will treat them as character values just the same), but I would like to clean up the character values for aesthetic purposes. To do this I applied the following code.

    sdac_speaker_meta <- # remove double quotes
      sdac_speaker_meta %>% 
      map(str_replace_all, pattern = '"', replacement = '') %>% # iteratively remove double quotes from each column
      bind_rows() %>%  # combine the results by rows
      type_convert() # return columns to original data types
    
    glimpse(sdac_speaker_meta) # preview the dataset
    ## Observations: 543
    ## Variables: 14
    ## $ speaker_id        <int> 1000, 1001, 1002, 1003, 1004, 1005, 1007, 10...
    ## $ pin               <int> 32, 102, 104, 5656, 123, 166, 274, 322, 445,...
    ## $ target            <chr> "N", "N", "N", "N", "N", "Y", "N", "N", "N",...
    ## $ sex               <chr> "FEMALE", "MALE", "FEMALE", "MALE", "FEMALE"...
    ## $ birth_year        <int> 1954, 1940, 1963, 1947, 1958, 1956, 1965, 19...
    ## $ dialect_area      <chr> "SOUTH MIDLAND", "WESTERN", "SOUTHERN", "NOR...
    ## $ education         <int> 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 3,...
    ## $ ti                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
    ## $ payment_type      <chr> "CASH", "GIFT", "GIFT", "NONE", "GIFT", "GIF...
    ## $ amt_pd            <int> 15, 10, 11, 7, 11, 22, 20, 3, 11, 9, 25, 9, ...
    ## $ con               <chr> "N", "N", "N", "Y", "N", "Y", "N", "Y", "N",...
    ## $ remarks           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
    ## $ calls_deleted     <int> 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,...
    ## $ speaker_partition <chr> "DN2", "XP", "XP", "DN2", "XP", "ET", "DN1",...

    From the preview of the sdac_speaker_meta dataset we can see that there are 14 columns, including the speaker_id. We also see that there are 543 observations. We can assume that each row corresponds to an individual speaker, but to make sure let’s find the length of the unique values of sdac_speaker_meta$speaker_id.

    sdac_speaker_meta$speaker_id %>% unique() %>% length() # check for unique speakers
    ## [1] 543

    So this confirms each row in sdac_speaker_meta corresponds to an individual speaker. It is also clear now that the sdac dataset, which contains 441 individual speakers, is a subset of all the data collected in the Switchboard Corpus project.

    Let’s select the columns that seem most interesting for a future analysis, dropping the other columns. The select() function allows us to specify columns to keep (or drop).

    sdac_speaker_meta <- # select columns of interest
      sdac_speaker_meta %>% 
      select(speaker_id, sex, birth_year, dialect_area, education)
    
    glimpse(sdac_speaker_meta) # preview the dataset
    ## Observations: 543
    ## Variables: 5
    ## $ speaker_id   <int> 1000, 1001, 1002, 1003, 1004, 1005, 1007, 1008, 1...
    ## $ sex          <chr> "FEMALE", "MALE", "FEMALE", "MALE", "FEMALE", "FE...
    ## $ birth_year   <int> 1954, 1940, 1963, 1947, 1958, 1956, 1965, 1939, 1...
    ## $ dialect_area <chr> "SOUTH MIDLAND", "WESTERN", "SOUTHERN", "NORTH MI...
    ## $ education    <int> 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 3, 3, 2...
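
    Since select() can also drop columns, an equivalent call would name the columns to remove with a minus sign. The sketch below is for illustration only; it would take the place of the select() call above, as it assumes the original 14-column version of sdac_speaker_meta.

    sdac_speaker_meta %>% # drop-based equivalent of the selection above
      select(-pin, -target, -ti, -payment_type, -amt_pd, -con, -remarks, -calls_deleted, -speaker_partition)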

    Tidy the corpus

    The next step is to join the two datasets, linking the values of sdac$speaker_id with the values of sdac_speaker_meta$speaker_id. We want to keep all the data in the sdac dataset and only include data from sdac_speaker_meta where there are matching speaker ids. To do this we use the left_join() function. left_join() requires two arguments, which correspond to the two data frames to join. We can optionally specify which column(s) to use as the joining condition, but by default it will use any column names shared by the two data frames. In our case the only shared column is speaker_id, so we can proceed without explicitly specifying the join column.

    sdac <- left_join(sdac, sdac_speaker_meta) # join by `speaker_id`
    ## Error in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y, check_na_matches(na_matches)): Can't join on 'speaker_id' x 'speaker_id' because of incompatible types (integer / character)

    We get an error! Reading the error message, it appears we are trying to join columns of differing data types: sdac$speaker_id is of type character, while sdac_speaker_meta$speaker_id is of type integer.
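
    We can confirm the mismatch directly with a quick check of the column types:

    class(sdac$speaker_id)              # "character"
    class(sdac_speaker_meta$speaker_id) # "integer"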

    Error messages can be difficult to make sense of. If the issue is not clear to you, copy the error and search the web to see if others have had the same issue. Chances are someone has! If not, follow these steps to create a reproducible example and post it to a site such as StackOverflow.
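
    One convenient helper for building such an example (a suggestion on my part, assuming the reprex package is installed) is reprex: copy a minimal failing snippet to the clipboard and render it together with its output.

    # install.packages("reprex") # assumed available; not part of this project's setup
    reprex::reprex() # renders the code currently on the clipboard, plus its output or error,
                     # in a self-contained format ready to share when asking for help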

    To remedy the situation we need to coerce sdac$speaker_id from character to a numeric type so the two id columns are compatible. The as.numeric() function will do this.

    sdac$speaker_id <- sdac$speaker_id %>% as.numeric() # convert from character to numeric

    Now let’s apply our join operation again.

    sdac <- left_join(sdac, sdac_speaker_meta) # join by `speaker_id`
    
    glimpse(sdac) # preview the joined dataset
    ## Observations: 223,606
    ## Variables: 11
    ## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
    ## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
    ## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
    ## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
    ## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
    ## $ utterance_text <chr> "Okay.  /", "{D So, }", "[ [ I guess, +", "What...
    ## $ speaker_id     <dbl> 1632, 1632, 1519, 1632, 1519, 1632, 1519, 1519,...
    ## $ sex            <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE...
    ## $ birth_year     <int> 1962, 1962, 1971, 1962, 1971, 1962, 1971, 1971,...
    ## $ dialect_area   <chr> "WESTERN", "WESTERN", "SOUTH MIDLAND", "WESTERN...
    ## $ education      <int> 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2,...

    Result! Now let’s check our data for any missing data points generated in the join.

    sdac[!complete.cases(sdac), ] %>% glimpse # view incomplete cases
    ## Observations: 100
    ## Variables: 11
    ## $ doc_id         <chr> "3554", "3554", "3554", "3554", "3554", "3554",...
    ## $ damsl_tag      <chr> "sd@", "+@", "sv@", "+@", "+", "sd", "+", "+", ...
    ## $ speaker        <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A...
    ## $ turn_num       <chr> "1", "3", "5", "7", "9", "9", "11", "13", "13",...
    ## $ utterance_num  <chr> "1", "1", "1", "1", "1", "2", "1", "1", "2", "1...
    ## $ utterance_text <chr> "Of a exercise program you have.", "Right. /", ...
    ## $ speaker_id     <dbl> 155, 155, 155, 155, 155, 155, 155, 155, 155, 15...
    ## $ sex            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
    ## $ birth_year     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
    ## $ dialect_area   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
    ## $ education      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

    We have 100 observations that are missing data. Inspecting the dataset preview, it appears that at least one speaker_id occurs in the conversation files but does not appear in the speaker meta-data. Let’s check to see how many speakers this might affect.

    sdac[!complete.cases(sdac), ] %>% select(speaker_id) %>% unique()

    Just one speaker. This could very well be annotator error. Since it affects a relatively small proportion of the data, let’s drop this speaker from the dataset. We can use the filter() function to keep only the rows where speaker_id is not equal to 155.

    sdac <- # remove rows where speaker_id == 155
      sdac %>% 
      filter(speaker_id != 155)
    
    sdac[!complete.cases(sdac), ] %>% glimpse # view incomplete cases
    ## Observations: 0
    ## Variables: 11
    ## $ doc_id         <chr> 
    ## $ damsl_tag      <chr> 
    ## $ speaker        <chr> 
    ## $ turn_num       <chr> 
    ## $ utterance_num  <chr> 
    ## $ utterance_text <chr> 
    ## $ speaker_id     <dbl> 
    ## $ sex            <chr> 
    ## $ birth_year     <int> 
    ## $ dialect_area   <chr> 
    ## $ education      <int>
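
    As one more sanity check, we can confirm that exactly 100 rows were removed by the filter:

    nrow(sdac) # 223,606 - 100 = 223,506 rows remain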

    Explore the tidy dataset

    At this point we have a well-curated dataset which includes linguistic and non-linguistic meta-data. As we did for the ACTIV-ES corpus, let’s get a sense of the distribution of some of the meta-data.

    First we will visualize the number of utterances from speakers of the different dialect regions.

    sdac %>% 
      group_by(dialect_area) %>% 
      count() %>%
      ggplot(aes(x = dialect_area, y = n)) + 
      geom_col() +
      labs(x = "Dialect region", y = "Utterance count", title = "Switchboard Dialog Act Corpus", subtitle = "Utterances per dialect region") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

    Let’s also see how utterances from men and women are distributed across the dialect areas.

    sdac %>% 
      group_by(dialect_area, sex) %>% 
      count() %>% 
      ggplot(aes(x = dialect_area, y = n, fill = sex)) + 
      geom_col() +
      labs(x = "Dialect region", y = "Utterance count", title = "Switchboard Dialog Act Corpus", subtitle = "Utterances per dialect region and sex") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
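
    By default geom_col() stacks the bars within each dialect area. If a side-by-side comparison by sex is easier to read, a dodged variant of the same plot (a sketch) is one option:

    sdac %>% 
      group_by(dialect_area, sex) %>% 
      count() %>% 
      ggplot(aes(x = dialect_area, y = n, fill = sex)) + 
      geom_col(position = "dodge") +
      labs(x = "Dialect region", y = "Utterance count", title = "Switchboard Dialog Act Corpus", subtitle = "Utterances per dialect region and sex (dodged)") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))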

    There are many other ways to group and count the dataset but I’ll leave that to you to look at!
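
    As a starting point for that exploration, here is one more grouping (a sketch using the columns we kept above), counting utterances by sex and education level:

    sdac %>% 
      group_by(sex, education) %>% 
      count() %>% 
      arrange(desc(n)) # most frequent sex/education combinations first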

    Round up

    In this post I covered tidying a corpus from running text files. We looked at three cases where meta-data is typically stored: in filenames, embedded inline with the text itself, and in stand-off files. As usual we made extensive use of the tidyverse package set (readr, dplyr, ggplot2, etc.) and included discussion of other packages: readtext for reading and organizing meta-data from file names, tidytext for tokenizing text, and stringr for text cleaning and pattern matching. I also briefly introduced the ggplot2 package for creating plots based on the Grammar of Graphics philosophy. Along the way we continued to extend our knowledge of R data and object types, working with vectors, data frames, and lists and manipulating them in various ways (subsetting, sorting, transforming, and summarizing).

    In the next post I will turn to working with meta-data in structured documents, specifically .xml documents. These types of documents tend to have rich meta-data including linguistic and non-linguistic information. We will focus on working with linguistic annotations such as part-of-speech and syntactic structure. We will work to parse the linguistic information in these documents into a tidy dataset and also see how to create linguistic annotations for data that does not already contain them.

    References

    Benoit, Kenneth, and Adam Obeng. 2017. Readtext: Import and Handling for Plain and Formatted Text Files. https://CRAN.R-project.org/package=readtext.

    Henry, Lionel, and Hadley Wickham. 2017. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

    Robinson, David, and Julia Silge. 2018. Tidytext: Text Mining Using ’Dplyr’, ’Ggplot2’, and Other Tidy Tools. https://CRAN.R-project.org/package=tidytext.

    Wickham, Hadley. 2018. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

    Wickham, Hadley, and Winston Chang. 2018. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

    Wickham, Hadley, Romain Francois, Lionel Henry, and Kirill Müller. 2017. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

    Wickham, Hadley, Jim Hester, and Romain Francois. 2017. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.


    1. https://regex101.com is a great place to learn more about Regular Expressions and to practice using them.
