Curate language data (1/2): organizing meta-data
When working with raw data, whether it comes from a corpus repository, a web download, or a web scrape, it is important to recognize that the attributes we want to organize can be stored or represented in various formats. The three I will cover here have to do with meta-data that is: (1) contained in the file names of a set of corpus files, (2) embedded inline with the corpus text, and (3) stored separately from the text data. Our goal will be to wrangle this information into a tidy dataset format where each row is an observation and each column a corresponding attribute of the data.
The following code is available on GitHub (recipes-curate_data) and is built on the recipes-project_template I have discussed in detail here and made accessible here. I encourage you to follow along by downloading the recipes-project_template with git from the Terminal, or by creating a new RStudio R Project and selecting the "Version Control" option.
Running text with meta-data in file names
A common format for storing meta-data for corpora is in the file names of the corpus documents. When the corpus designer takes this approach, the file names contain the relevant attributes in some regular format, usually with a common character as the delimiter between the distinct attribute elements.
Download corpus data
The ACTIV-ES Corpus is structured this way. ACTIV-ES is a corpus of TV/film transcripts from Argentina, Mexico, and Spain. Let's use this corpus as an example. First we need to download the data. The ACTIV-ES corpus is stored in a GitHub repository. We can download the entire corpus using git to clone the repository, or we can access a specific corpus format (plain-text or part-of-speech annotated) as a compressed .zip file. Let's download the compressed file for the plain-text data. Navigate to https://github.com/francojc/activ-es/blob/master/activ-es-v.02/corpus/plain.zip and copy the link for the 'Download' button. We can use the get_zip_data() function we developed in the Acquiring data for language research (1/3): direct downloads post.
get_zip_data(url = "https://github.com/francojc/activ-es/raw/master/activ-es-v.02/corpus/plain.zip", target_dir = "data/original/actives/plain")
Taking a look at the data/original/actives/plain/ directory we can see the files. Below is a subset of files from each of the three countries.
es_Argentina_2008_Lluvia_movie_Drama_1194615.run
es_Argentina_2008_Los-paranoicos_movie_Comedy_1178654.run
es_Mexico_2008_Rudo-y-Cursi_movie_Comedy_405393.run
es_Mexico_2009_Sin-nombre_movie_Adventure_1127715.run
es_Spain_2010_También-la-lluvia_movie_Drama_1422032.run
es_Spain_2010_Tres-metros-sobre-el-cielo_movie_Drama_1648216.run
Tidy the corpus
Each of the meta-data attributes is separated by an underscore (_). The extension on these files is .run. There is nothing special about this extension; the data is plain text, but the extension is used to contrast the 'running text' version of these files with similarly named files that carry linguistic annotations in other versions of the corpus. The delimited elements correspond to language, country, year, title, type, genre, and imdb_id.
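To see how these attributes line up, here is a minimal sketch that splits one of the file names listed above on the underscore with base R. The labels assigned below are my own, chosen to mirror the attribute order just described; readtext will do this work for us in a moment.

# Just an illustration: split one ACTIV-ES file name on the underscore
file_name <- "es_Argentina_2008_Lluvia_movie_Drama_1194615.run"
parts <- strsplit(tools::file_path_sans_ext(file_name), split = "_")[[1]] # drop the .run extension, then split
names(parts) <- c("language", "country", "year", "title", "type", "genre", "imdb_id") # hypothetical labels
parts["country"] # returns "Argentina"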
Ideally we want a dataset with columns for each of these attributes from the file names, plus two extra columns: the text itself and a doc_id to distinguish each document. The readtext package comes in handy here. Let's load (or install) this package to read the corpus files, along with the tidyverse package for other miscellaneous helper functions.
pacman::p_load(readtext, tidyverse) # use the pacman package to load-install
The readtext() function is quite versatile. It allows us to read multiple files simultaneously and organize the data in a tidy dataset. The file argument lets us point to the directory where the files are located and use a pattern-matching syntax known as regular expressions to match only the files we want to extract the data from. Regular expressions are a powerful tool for manipulating character strings, and getting familiar with how they work is highly recommended.1 We will see them in action at various points throughout the rest of this series. In this case we want all the files from the data/original/actives/plain/ directory that have the extension .run, so we use the Kleene star * as a wildcard in combination with .run to match all files that end in .run.
Furthermore, the readtext() function allows us to specify where the meta-data is to be found with the docvarsfrom argument, in our case "filenames". The default separator value is the underscore, so we do not have to add this argument. If, however, the separator is not an underscore, you will need to add this argument with the appropriate separator value. The names we want to give to the attributes can be added with the docvarnames argument. Note that docvarnames takes a character vector as a value; remember that to create a character vector we use the c() function with each element quoted.
aes <- readtext(file = "data/original/actives/plain/*.run", # read each .run file
                docvarsfrom = "filenames", # get attributes from filename
                docvarnames = c("language", "country", "year", "title", "type", "genre", "imdb_id")) # add the column names we want for each attribute

glimpse(aes) # preview structure of the object

Observations: 430
Variables: 9
$ doc_id   <chr> "es_Argentina_1950_Esposa-último-modelo_movie_n_199500.run", "es_Arge...
$ text     <chr> "No está , señora . Aquí tampoco . No aparece , señora . ¿ Dónde se ha...
$ language <chr> "es", "es", "es", "es", "es", "es", "es", "es", "es", "es", "es", "es"...
$ country  <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Argentina", "Arge...
$ year     <int> 1950, 1952, 1955, 1965, 1969, 1973, 1975, 1977, 1979, 1980, 1981, 1983...
$ title    <chr> "Esposa-último-modelo", "No-abras-nunca-esa-puerta", "El-amor-nunca-m...
$ type     <chr> "movie", "movie", "movie", "movie", "movie", "movie", "movie", "video-...
$ genre    <chr> "n", "Mystery", "Drama", "Documentary", "Horror", "Adventure", "Drama"...
$ imdb_id  <int> 199500, 184782, 47823, 282622, 62433, 70250, 71897, 333883, 333954, 17...
The output from glimpse(aes) shows us that there are 430 observations and 9 attributes, corresponding to the 430 files in the corpus and the 7 meta-data attributes in the file names, plus the added columns doc_id and text, which contain the name of each file and the text it contains. The information in doc_id is already captured in our meta-data, yet the values are not ideal, seeing as they are quite long and informationally redundant. Although not strictly necessary, let's change the doc_id values to unique numeric values. To overwrite doc_id with numeric values we can use the mutate() function from the tidyverse package in combination with the row_number() function.
aes <- aes %>%
  mutate(doc_id = row_number()) # change doc_id to numbers

glimpse(aes) # preview structure of the object

Observations: 430
Variables: 9
$ doc_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,...
$ text     <chr> "No está , señora . Aquí tampoco . No aparece , señora . ¿ Dónde se ha...
$ language <chr> "es", "es", "es", "es", "es", "es", "es", "es", "es", "es", "es", "es"...
$ country  <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Argentina", "Arge...
$ year     <int> 1950, 1952, 1955, 1965, 1969, 1973, 1975, 1977, 1979, 1980, 1981, 1983...
$ title    <chr> "Esposa-último-modelo", "No-abras-nunca-esa-puerta", "El-amor-nunca-m...
$ type     <chr> "movie", "movie", "movie", "movie", "movie", "movie", "movie", "video-...
$ genre    <chr> "n", "Mystery", "Drama", "Documentary", "Horror", "Adventure", "Drama"...
$ imdb_id  <int> 199500, 184782, 47823, 282622, 62433, 70250, 71897, 333883, 333954, 17...
Explore the tidy dataset
Now that we have the data in a tidy format, where each row is one of our corpus files and each column is a meta-data attribute that describes that file, let's do some quick exploration of the distribution of the data to get a better feel for what our corpus is like. One thing we can do is calculate the size of the corpus. A rudimentary measure of corpus size is the number of word tokens. The tidytext package provides a very useful function, unnest_tokens(), which offers a simple and efficient way to tokenize text while maintaining the tidy structure we have created. In combination with a set of functions from the tidyverse package, we can tokenize the text into words and count them with count().
Let's take this in two steps so you can appreciate what unnest_tokens() does. First, load (or install) tidytext.
pacman::p_load(tidytext) # use the pacman package to load-install
Now let's tokenize the text column into word terms and preview the first 25 rows of the output.
aes_tokens <- aes %>%
  unnest_tokens(output = terms, input = text) # tokenize `text` into words `terms`

aes_tokens %>% head(25) # view first 25 tokenized terms
We see in the previous table that a terms column has replaced text in our tidy dataset. The meta-data, however, is still intact.
The unnest_tokens() function from tidytext is very flexible. Here we have used the default arguments, which produce word tokens. There are many other tokenization parameters available, which we will use later to create sentence tokens, ngram tokens, and custom tokenization schemes. View ?unnest_tokens to find out more in the R documentation.
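As a quick taste of that flexibility, here is a minimal sketch of bigram and sentence tokenization on the aes object from above. The output column names bigram and sentence are my own choices, not anything required by the package.

aes %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2) %>% # two-word (bigram) tokens
  head(5)

aes %>%
  unnest_tokens(output = sentence, input = text, token = "sentences") %>% # sentence tokens
  head(5)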
After applying the unnest_tokens() function in the previous code, the rows correspond to tokenized words. Therefore the number of rows corresponds to the total number of words in the corpus. To find the total number of words we can use the count() function.
aes_tokens %>% count() # count terms
The count() function can be used with a data frame, like our aes_tokens object, to group our rows by the values of a particular column. A practical application of this functionality is to group the rows (word terms) by the values of country ('Argentina', 'Mexico', and 'Spain'). This will give us the number of words in each country sub-corpus.
aes_tokens %>% count(country) # count terms by `country`
So now we know the total word count of the corpus and the number of words in each country sub-corpus. If we would like a description of the proportion of words from each sub-corpus in the total corpus, we can use the mutate() function to create a new column prop which calculates the total size of the corpus (sum(n)) and then divides each sub-corpus size (n) by this number.
aes_country_props <- aes_tokens %>%
  count(country) %>% # count terms by `country`
  mutate(prop = n / sum(n)) # add the word term proportion for each country

aes_country_props
As we have seen in the previous examples, tidy datasets are easy to work with. Another advantage of data frames is that we can use them to create graphics with the ggplot2 package. ggplot2 is a powerful package for creating graphics in R that applies what is known as the 'Grammar of Graphics'. The Grammar of Graphics recognizes three principal components of any graphic: (1) data, (2) mappings, or 'aesthetics' as they are called, and (3) geometries, or 'geoms'. The data is the data frame which contains our observations (rows) and our variables (columns). The mappings connect variables of interest from our dataset to parameters of the visual space. Typical parameters include the x-axis and the y-axis; the x-axis corresponds to the horizontal plane and the y-axis to the vertical plane. This sets up a base coordinate system for visualizing the data. Once our data has been mapped to a visual space, we designate an appropriate geometry to represent this space (bar plots, line graphs, scatter plots, etc.). There are many geometries available in ggplot2 for the relevant mapping types.
Let's visualize the aes_country_props object as a bar graph, as an example. ggplot2 is included as part of the tidyverse package, so we already have access to it. First we pass the aes_country_props data frame to the ggplot() function. Then we map the x-axis to the country column and the y-axis to the prop column. This mapping is then passed with the plus (+) operator to the geom_col() function to visualize the mapping as columns, or bars.
aes_country_props %>% # pass the data frame as our data source
  ggplot(aes(x = country, y = prop)) + # create x- and y-axis mappings
  geom_col() # visualize a column-wise geometry
The above code will do the heavy lifting to create our plot. Below I've added the labs() function to create a more informative graphic with prettier x- and y-axis labels plus a title and subtitle.
aes_country_props %>% # pass the data frame as our data source
  ggplot(aes(x = country, y = prop)) + # create x- and y-axis mappings
  geom_col() + # visualize a column-wise geometry
  labs(x = "Country", y = "Proportion (%)",
       title = "ACTIV-ES Corpus Distribution",
       subtitle = "Proportion of words in each country sub-corpus")
In this example we covered reading a corpus and the meta-data contained within the file names of that corpus with the readtext package. We then did some quick exploratory work to find the corpus size and the proportions of the corpus by country sub-corpus with the tidytext package and assorted functions from the tidyverse package. We rounded things out with a brief introduction to the ggplot2 package, which we used to visualize the country sub-corpus proportions.
Running text with inline meta-data
In the previous example, our corpus contained meta-data stored in the individual file names of the corpus. In some other cases the meta-data is stored inline with the corpus text itself. The goal in cases such as these is to separate the meta-data from the text and coerce all the information into a tidy dataset.
Download corpus data
As an example we will work with the Switchboard Dialog Act Corpus (SDAC), which extends the Switchboard Corpus with speech act annotation. The SDAC dialogues (swb1_dialogact_annot.tar.gz) are available as a free download from the LDC. The dialogues are contained within a compressed .tar.gz file. This file could be downloaded manually and its contents extracted to disk, but since we are working to create a reproducible workflow we will approach this task programmatically.
We have an available custom function, get_zip_data(), that deals with .zip compressed files, but we need a function that works on .tar.gz files. R has a function, untar(), to extract .tar.gz files that we can use to mimic the functionality the unzip() function provides inside get_zip_data(). Instead of writing a new custom function to deal specifically with .tar.gz files, I've created a function that deals with both compressed file formats, named it get_compressed_data(), and added it to my functions/acquire_functions.R file.
get_compressed_data <- function(url, target_dir, force = FALSE) {
  # Get the extension of the target file
  ext <- tools::file_ext(url)
  # Check to see if the target file is a compressed file
  if(!ext %in% c("zip", "gz", "tar")) stop("Target file given is not supported")
  # Check to see if the data already exists
  if(!dir.exists(target_dir) | force == TRUE) { # if data does not exist, download/ decompress
    cat("Creating target data directory \n") # print status message
    dir.create(path = target_dir, recursive = TRUE, showWarnings = FALSE) # create target data directory
    cat("Downloading data... \n") # print status message
    temp <- tempfile() # create a temporary space for the file to be written to
    download.file(url = url, destfile = temp) # download the data to the temp file
    # Decompress the temp file in the target directory
    if(ext == "zip") {
      unzip(zipfile = temp, exdir = target_dir, junkpaths = TRUE) # zip files
    } else {
      untar(tarfile = temp, exdir = target_dir) # tar, gz files
    }
    cat("Data downloaded! \n") # print status message
  } else { # if data exists, don't download it again
    cat("Data already exists \n") # print status message
  }
}
Once this function is loaded into R, either by sourcing the functions/acquire_functions.R file (source("functions/acquire_functions.R")) or by running the code directly, we apply it to the resource URL, targeting the data/original/sdac/ directory as the extraction location.
get_compressed_data(url = "https://catalog.ldc.upenn.edu/docs/LDC97S62/swb1_dialogact_annot.tar.gz", target_dir = "data/original/sdac/")
The main directory structure of the sdac/ data looks like this:
.
├── README
├── doc
├── sw00utt
├── sw01utt
├── sw02utt
├── sw03utt
├── sw04utt
├── sw05utt
├── sw06utt
├── sw07utt
├── sw08utt
├── sw09utt
├── sw10utt
├── sw11utt
├── sw12utt
└── sw13utt

15 directories, 1 file
The README file contains basic information about the resource, the doc/ directory contains more detailed information about the dialog annotations, and each of the directories prefixed with sw... contains individual conversation files. Here's a peek at the internal structure of the first couple of directories.
.
├── README
├── doc
│   └── manual.august1.html
├── sw00utt
│   ├── sw_0001_4325.utt
│   ├── sw_0002_4330.utt
│   ├── sw_0003_4103.utt
│   ├── sw_0004_4327.utt
│   ├── sw_0005_4646.utt
Let's take a look at the first conversation file (sw_0001_4325.utt) to see how it is structured.
*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*
*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*
*x*                                                                     *x*
*x*           Copyright (C) 1995 University of Pennsylvania             *x*
*x*                                                                     *x*
*x*   The data in this file are part of a preliminary version of the    *x*
*x*   Penn Treebank Corpus and should not be redistributed.  Any        *x*
*x*   research using this corpus or based on it should acknowledge      *x*
*x*   that fact, as well as the preliminary nature of the corpus.       *x*
*x*                                                                     *x*
*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*
*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*

FILENAME:     4325_1632_1519
TOPIC#:       323
DATE:         920323
TRANSCRIBER:  glp
UTT_CODER:    tc
DIFFICULTY:   1
TOPICALITY:   3
NATURALNESS:  2
ECHO_FROM_B:  1
ECHO_FROM_A:  4
STATIC_ON_A:  1
STATIC_ON_B:  1
BACKGROUND_A: 1
BACKGROUND_B: 2
REMARKS:      None.
=========================================================================

o          A.1 utt1: Okay.  /
qw         A.1 utt2: {D So, }

qy^d       B.2 utt1: [ [ I guess, +

+          A.3 utt1: What kind of experience [ do you, + do you ] have, then with child care? /
There are a few things to take note of here. First, we see that the conversation files have a meta-data header offset from the conversation text by a line of = characters. Second, the header contains meta-information of various types. Third, the text is interleaved with an annotation scheme.

Some of the information may be readily understandable, such as the various pieces of meta-data in the header, but to get a better understanding of what information is encoded here let's take a look at the README file. In this file we get a bird's-eye view of what is going on. In short, the data includes 1155 telephone conversations between two people annotated with 42 'DAMSL' dialog act labels. The README file refers us to the doc/manual.august1.html file for more information on this scheme.
At this point we open the doc/manual.august1.html file in a browser and do some investigation. We find out that 'DAMSL' stands for 'Discourse Annotation and Markup System of Labeling' and that the first characters of each line of the conversation text correspond to one or a combination of labels for each utterance. So for our first utterances we have:
o    = "Other"
qw   = "Wh-Question"
qy^d = "Declarative Yes-No-Question"
+    = "Segment (multi-utterance)"
Each utterance is also labeled for speaker ('A' or 'B'), speaker turn ('1', '2', '3', etc.), and each utterance within that turn ('utt1', 'utt2', etc.). There is other annotation provided within each utterance, but this should be enough to get us started on the conversations.
Now let's turn to the meta-data in the header. We see here that there is information about the creation of the file: 'FILENAME', 'TOPIC', 'DATE', etc. The doc/manual.august1.html file doesn't have much to say about this information, so I returned to the LDC documentation and found more information in the Online Documentation section. After some poking around in this documentation I discovered that the meta-data for each speaker in the corpus is found in the caller_tab.csv file. This tabular file does not contain column names, but the caller_doc.txt file does. After inspecting these files manually and comparing them with the information in the conversation file, I noticed that the 'FILENAME' information contains three pieces of useful information delimited by underscores (_).
*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*

FILENAME:     4325_1632_1519
TOPIC#:       323
DATE:         920323
TRANSCRIBER:  glp
The first piece is the document id (4325); the second and third correspond to the speaker numbers, the first being speaker A (1632) and the second speaker B (1519).
Tidy the corpus
In sum, we have 1155 conversation files. Each file has two parts, a header section and a text section, separated by a line of = characters. The header section contains a 'FILENAME' line which holds the document id and the ids for speaker A and speaker B. The text section is annotated with a DAMSL tag beginning each line, followed by the speaker, turn number, utterance number, and the utterance text. With this knowledge in hand, let's set out to create a tidy dataset with the following column structure: doc_id, damsl_tag, speaker, turn_num, utterance_num, utterance_text, and speaker_id.
Let's begin by reading one of the conversation files into R as a character vector using the read_lines() function.
doc <- read_lines(file = "data/original/sdac/sw00utt/sw_0001_4325.utt") # read file by lines
To isolate the vector element that contains the document and speaker ids, we use str_detect() from the stringr package. This function takes two arguments, a string and a pattern, and returns a logical value: TRUE if the pattern is matched or FALSE if not. We can use the output of this function to subset the doc character vector and return only the vector element (line) that contains digits_digits_digits, using a regular expression. The expression combines the digit-matching operator \\d with the + operator to match one or more contiguous digits. We then separate three groups of \\d+ with underscores (_). The result is \\d+_\\d+_\\d+.
pacman::p_load(stringr) # load-install `stringr` package

doc[str_detect(doc, pattern = "\\d+_\\d+_\\d+")] # isolate pattern

## [1] "FILENAME:\t4325_1632_1519"
The next step is to extract the three digit sequences that correspond to doc_id, speaker_a_id, and speaker_b_id. First we extract the pattern we have identified with str_extract(), and then we break the single character vector into multiple parts based on the underscore (_). The str_split() function takes a string and then a pattern to use to split the character vector. It returns a list of character vectors.
doc[str_detect(doc, "\\d+_\\d+_\\d+")] %>% # isolate pattern
  str_extract(pattern = "\\d+_\\d+_\\d+") %>% # extract the pattern
  str_split(pattern = "_") # split the character vector

## [[1]]
## [1] "4325" "1632" "1519"
A list is a special object type in R. It is an unordered collection of objects whose lengths can differ (contrast this with a data frame, which is a collection of objects whose lengths are the same, hence the tabular format). In this case we have a list of length 1 whose sole element is a character vector of length 3, one element per segment returned from our split. This is usually the desired result: if we were to pass multiple character vectors to str_split(), we would not want the results conflated into a single character vector, blurring the distinction between the individual inputs. If we would like to conflate, or flatten, a list, we can use the unlist() function.
doc[str_detect(doc, "\\d+_\\d+_\\d+")] %>% # isolate pattern
  str_extract(pattern = "\\d+_\\d+_\\d+") %>% # extract the pattern
  str_split(pattern = "_") %>% # split the character vector
  unlist() # flatten the list to a character vector

## [1] "4325" "1632" "1519"
Let's flatten the list in this case, as we have a single character vector, and assign the result to doc_speaker_info.
doc_speaker_info <- doc[str_detect(doc, "\\d+_\\d+_\\d+")] %>% # isolate pattern
  str_extract(pattern = "\\d+_\\d+_\\d+") %>% # extract the pattern
  str_split(pattern = "_") %>% # split the character vector
  unlist() # flatten the list to a character vector
doc_speaker_info is now a character vector of length three. Let's subset each of the elements and assign them to meaningful variable names so we can conveniently use them later in the tidying process.
doc_id <- doc_speaker_info[1]
speaker_a_id <- doc_speaker_info[2]
speaker_b_id <- doc_speaker_info[3]
The next step is to isolate the text section, extracting it from the rest of the document. As noted previously, a sequence of = characters separates the header section from the text section. What we need to do is find the index of the point in our character vector doc where that line occurs and then subset doc from that point to the end of the character vector. Let's first find the point where the = sequence occurs. We will again use the str_detect() function to find the pattern we are looking for (a contiguous sequence of =), but then we will pass the logical result to the which() function, which returns the element index number of the match.
doc %>%
  str_detect(pattern = "=+") %>% # match 1 or more `=`
  which() # find vector index

## [1] 31
So for this file, 31 is the index in doc where the = sequence occurs. Now, it is important to keep in mind that we are working with a single file from the sdac/ data. We need to be careful not to create a pattern that may be matched multiple times in another document in the corpus. Since the =+ pattern will match =, or ==, or ===, etc., it is not implausible that there might be a = character on some other line in one of the other files. Let's update our regular expression to avoid this potential scenario by only matching sequences of three or more =. In this case we will make use of the curly bracket quantifier {}.
doc %>%
  str_detect(pattern = "={3,}") %>% # match 3 or more `=`
  which() # find vector index

## [1] 31
We will get the same result for this file, but will safeguard ourselves a bit, as it is unlikely we will find multiple matches for ===, ====, etc.
31 is the index for the = sequence, but we want the next line to be where we start reading the text section. To do this we increment the index by 1.
text_start_index <- doc %>%
  str_detect(pattern = "={3,}") %>% # match 3 or more `=`
  which() # find vector index

text_start_index <- text_start_index + 1 # increment index by 1
The index for the end of the text is simply the length of the doc vector. We can use the length() function to get this index.
text_end_index <- length(doc)
We now have the bookends, so to speak, for our text section. To extract the text we subset the doc vector by these indices.
text <- doc[text_start_index:text_end_index] # extract text

head(text) # preview first lines of `text`

## [1] " "
## [2] ""
## [3] "o A.1 utt1: Okay. /"
## [4] "qw A.1 utt2: {D So, } "
## [5] ""
## [6] "qy^d B.2 utt1: [ [ I guess, + "
The text has some extra whitespace on some lines, and there are blank lines as well. We should do some cleaning up before moving forward to organize the data. To get rid of the whitespace we use the str_trim() function, which by default will remove leading and trailing whitespace from each line.
text <- str_trim(text) # remove leading and trailing whitespace

head(text) # preview first lines of `text`

## [1] ""
## [2] ""
## [3] "o A.1 utt1: Okay. /"
## [4] "qw A.1 utt2: {D So, }"
## [5] ""
## [6] "qy^d B.2 utt1: [ [ I guess, +"
To remove blank lines we will create a logical expression to subset the text vector: text != "" means return TRUE where lines are not blank and FALSE where they are.
text <- text[text != ""] # remove blank lines

head(text) # preview first lines of `text`

## [1] "o A.1 utt1: Okay. /"
## [2] "qw A.1 utt2: {D So, }"
## [3] "qy^d B.2 utt1: [ [ I guess, +"
## [4] "+ A.3 utt1: What kind of experience [ do you, + do you ] have, then with child care? /"
## [5] "+ B.4 utt1: I think, ] + {F uh, } I wonder ] if that worked. /"
## [6] "qy A.5 utt1: Does it say something? /"
Our first step towards a tidy dataset is to now combine the doc_id and each element of text in a data frame.
data <- data.frame(doc_id, text) # tidy format `doc_id` and `text`

head(data) # preview first lines of `text`
With our data now in a data frame, it's time to parse the text column and extract the DAMSL tags, speaker, speaker turn, utterance number, and the utterance text itself into separate columns. To do this we will make extensive use of regular expressions. Our aim is to find a consistent pattern that distinguishes each piece of information from the other text in a given row of data$text and extract it.
The best way to learn regular expressions is to use them. To this end, head over to the interactive regular expression practice website regex101 and copy the text below into the 'TEST STRING' field.
o A.1 utt1: Okay. /
qw A.1 utt2: {D So, }
qy^d B.2 utt1: [ [ I guess, +
+ A.3 utt1: What kind of experience [ do you, + do you ] have, then with child care? /
+ B.4 utt1: I think, ] + {F uh, } I wonder ] if that worked. /
qy A.5 utt1: Does it say something? /
sd B.6 utt1: I think it usually does. /
ad B.6 utt2: You might try, {F uh, } /
h B.6 utt3: I don't know, /
ad B.6 utt4: hold it down a little longer, /
Now manually type the following regular expressions into the 'REGULAR EXPRESSION' field one by one (each is on a separate line below). Notice what is matched as you type and when you've finished typing. You can find out exactly what the component parts of each expression are doing by toggling the top right icon on the regex101 page or hovering your mouse over the relevant parts of the expression.
^.+?\s
[AB]\.\d+
utt\d+
:.+$
As you can now see, we have regular expressions that will match the DAMSL tags, speaker and speaker turn, utterance number, and the utterance text. To apply these expressions to our data and extract this information into separate columns we will make use of the mutate() and str_extract() functions. mutate() will take our data frame and create new columns whose values we match and extract from each row in the data frame with str_extract(). Notice that str_extract() is different from str_extract_all(): when we work with mutate(), each row is evaluated in turn, therefore we only need to make one match per row in data$text.
I've chained each of these steps in the code below, dropping the original text column with select(-text) and overwriting data with the results.
data <- # extract column information from `text`
  data %>%
  mutate(damsl_tag = str_extract(string = text, pattern = "^.+?\\s")) %>% # extract damsl tags
  mutate(speaker_turn = str_extract(string = text, pattern = "[AB]\\.\\d+")) %>% # extract speaker_turn pairs
  mutate(utterance_num = str_extract(string = text, pattern = "utt\\d+")) %>% # extract utterance number
  mutate(utterance_text = str_extract(string = text, pattern = ":.+$")) %>% # extract utterance text
  select(-text) # drop the `text` column

glimpse(data) # preview the data set

## Observations: 159
## Variables: 5
## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
## $ damsl_tag      <chr> "o ", "qw ", "qy^d ", "+ ", "+ ", "qy ", "sd ",...
## $ speaker_turn   <chr> "A.1", "A.1", "B.2", "A.3", "B.4", "A.5", "B.6"...
## $ utterance_num  <chr> "utt1", "utt2", "utt1", "utt1", "utt1", "utt1",...
## $ utterance_text <chr> ": Okay. /", ": {D So, }", ": [ [ I guess, +",...
One twist you will notice is that regular expressions in R require double backslashes (\\) where other programming environments use a single backslash (\).
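If it helps, the doubling is only about how the pattern is written inside an R string; the regular expression engine still receives a single backslash. A quick way to see this (just an illustration):

writeLines("\\d+_\\d+_\\d+") # print what the regex engine actually receives

## \d+_\d+_\d+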
There are a couple of things left to do to the columns we extracted from the text before we move on to finishing up our tidy dataset. First, we need to separate the speaker_turn column into speaker and turn_num columns, and second, we need to remove unwanted characters from the damsl_tag, utterance_num, and utterance_text columns.
To separate the values of a column into two columns we use the separate() function. It takes a column to separate and a character vector of the names of the new columns to create. By default the values of the input column will be separated at non-alphanumeric characters; in our case this means the . will be our separator.
data <- data %>%
  separate(col = speaker_turn, into = c("speaker", "turn_num")) # separate speaker_turn into distinct columns

glimpse(data) # preview the data set

## Observations: 159
## Variables: 6
## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
## $ damsl_tag      <chr> "o ", "qw ", "qy^d ", "+ ", "+ ", "qy ", "sd ",...
## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
## $ utterance_num  <chr> "utt1", "utt2", "utt1", "utt1", "utt1", "utt1",...
## $ utterance_text <chr> ": Okay. /", ": {D So, }", ": [ [ I guess, +",...
To remove unwanted leading or trailing whitespace we apply the str_trim() function. For removing other characters we match the character(s) and replace them with an empty string ("") using the str_replace() function. Again, I've chained these functions together and overwritten data with the results.
data <- # clean up column information
  data %>%
  mutate(damsl_tag = str_trim(damsl_tag)) %>% # remove leading/ trailing whitespace
  mutate(utterance_num = str_replace(string = utterance_num, pattern = "utt", replacement = "")) %>% # remove 'utt'
  mutate(utterance_text = str_replace(string = utterance_text, pattern = ":\\s", replacement = "")) %>% # remove ': '
  mutate(utterance_text = str_trim(utterance_text)) # trim leading/ trailing whitespace

glimpse(data) # preview the data set

## Observations: 159
## Variables: 6
## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
## $ utterance_text <chr> "Okay. /", "{D So, }", "[ [ I guess, +", "What...
To round out our tidy dataset for this single conversation file, we will connect speaker_a_id and speaker_b_id with speakers A and B in our current dataset, adding a new column speaker_id. The case_when() function does exactly this: it allows us to map rows of speaker with the value "A" to speaker_a_id and rows with the value "B" to speaker_b_id.
data <- # link speaker with speaker_id
  data %>%
  mutate(speaker_id = case_when(
    speaker == "A" ~ speaker_a_id,
    speaker == "B" ~ speaker_b_id
  ))

glimpse(data) # preview the data set

## Observations: 159
## Variables: 7
## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
## $ utterance_text <chr> "Okay. /", "{D So, }", "[ [ I guess, +", "What...
## $ speaker_id     <chr> "1632", "1632", "1519", "1632", "1519", "1632",...
We now have the tidy dataset we set out to create. But this dataset only includes one conversation file. We want to apply this code to all 1155 conversation files in the sdac/ corpus. The approach will be to create a custom function which groups the code we've written for this single file, then iteratively send each file from the corpus through this function and combine the results into one data frame.
Here’s the function with some extra code to print a progress message for each file when it runs.
extract_sdac_metadata <- function(file) {
  # Function: to read a Switchboard Corpus Dialogue file and extract meta-data
  cat("Reading", basename(file), "...")
  # Read `file` by lines
  doc <- read_lines(file)
  # Extract `doc_id`, `speaker_a_id`, and `speaker_b_id`
  doc_speaker_info <- doc[str_detect(doc, "\\d+_\\d+_\\d+")] %>% # isolate pattern
    str_extract("\\d+_\\d+_\\d+") %>% # extract the pattern
    str_split(pattern = "_") %>% # split the character vector
    unlist() # flatten the list to a character vector
  doc_id <- doc_speaker_info[1] # extract `doc_id`
  speaker_a_id <- doc_speaker_info[2] # extract `speaker_a_id`
  speaker_b_id <- doc_speaker_info[3] # extract `speaker_b_id`
  # Extract `text`
  text_start_index <- # find where header info stops
    doc %>%
    str_detect(pattern = "={3,}") %>% # match 3 or more `=`
    which() # find vector index
  text_start_index <- text_start_index + 1 # increment index by 1
  text_end_index <- length(doc) # get the end of the text section
  text <- doc[text_start_index:text_end_index] # extract text
  text <- str_trim(text) # remove leading and trailing whitespace
  text <- text[text != ""] # remove blank lines
  data <- data.frame(doc_id, text) # tidy format `doc_id` and `text`
  data <- # extract column information from `text`
    data %>%
    mutate(damsl_tag = str_extract(string = text, pattern = "^.+?\\s")) %>% # extract damsl tags
    mutate(speaker_turn = str_extract(string = text, pattern = "[AB]\\.\\d+")) %>% # extract speaker_turn pairs
    mutate(utterance_num = str_extract(string = text, pattern = "utt\\d+")) %>% # extract utterance number
    mutate(utterance_text = str_extract(string = text, pattern = ":.+$")) %>% # extract utterance text
    select(-text)
  data <- # separate speaker_turn into distinct columns
    data %>%
    separate(col = speaker_turn, into = c("speaker", "turn_num"))
  data <- # clean up column information
    data %>%
    mutate(damsl_tag = str_trim(damsl_tag)) %>% # remove leading/ trailing whitespace
    mutate(utterance_num = str_replace(string = utterance_num, pattern = "utt", replacement = "")) %>% # remove 'utt'
    mutate(utterance_text = str_replace(string = utterance_text, pattern = ":\\s", replacement = "")) %>% # remove ': '
    mutate(utterance_text = str_trim(utterance_text)) # trim leading/ trailing whitespace
  data <- # link speaker with speaker_id
    data %>%
    mutate(speaker_id = case_when(
      speaker == "A" ~ speaker_a_id,
      speaker == "B" ~ speaker_b_id
    ))
  cat(" done.\n")
  return(data) # return the data frame object
}
As a sanity check, we will run the extract_sdac_metadata() function on the conversation file we were just working with to make sure it works as expected.
extract_sdac_metadata(file = "data/original/sdac/sw00utt/sw_0001_4325.utt") %>%
  glimpse()

## Reading sw_0001_4325.utt ... done.
## Observations: 159
## Variables: 7
## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
## $ utterance_text <chr> "Okay. /", "{D So, }", "[ [ I guess, +", "What...
## $ speaker_id     <chr> "1632", "1632", "1519", "1632", "1519", "1632",...
Looks good, so now it's time to create a vector with the paths to all of the conversation files. list.files() interfaces with our OS file system and will return the paths to the files in the specified directory. We also add a pattern to match conversation files (\\.utt) so we don't accidentally include other files in the corpus. Setting full.names and recursive to TRUE means we will get the full path to each file and that files in all sub-directories will be returned.
files <- list.files(path = "data/original/sdac", # path to main directory
                    pattern = "\\.utt", # files to match
                    full.names = TRUE, # extract full path to each file
                    recursive = TRUE) # drill down in each sub-directory of `sdac/`

head(files) # preview character vector

## [1] "data/original/sdac/sw00utt/sw_0001_4325.utt"
## [2] "data/original/sdac/sw00utt/sw_0002_4330.utt"
## [3] "data/original/sdac/sw00utt/sw_0003_4103.utt"
## [4] "data/original/sdac/sw00utt/sw_0004_4327.utt"
## [5] "data/original/sdac/sw00utt/sw_0005_4646.utt"
## [6] "data/original/sdac/sw00utt/sw_0006_4108.utt"
To pass each path in our vector of conversation files iteratively to the extract_sdac_metadata() function we use map(). This applies the function to each conversation file and returns a data frame for each. bind_rows() will then join the resulting data frames by rows to give us a single tidy dataset for all 1155 conversations. Note there is a lot of processing going on here, so be patient.
# Read files and return a tidy dataset
sdac <- files %>% # pass file names
  map(extract_sdac_metadata) %>% # read and tidy iteratively
  bind_rows() # bind the results into a single data frame

glimpse(sdac) # preview the dataset

## Observations: 223,606
## Variables: 7
## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
## $ utterance_text <chr> "Okay. /", "{D So, }", "[ [ I guess, +", "What...
## $ speaker_id     <chr> "1632", "1632", "1519", "1632", "1519", "1632",...
Explore the tidy dataset
It is always a good idea to perform some diagnostics to confirm the integrity of the data. One thing that can go wrong in tidying a dataset, as we've done here, is that our pattern matching fails and does not return what we expected. This can happen because our patterns were not specific enough, or it can arise from transcriber/annotator error. In either case it can lead to missing, or NA, values. R provides the function complete.cases() to test for NA values, returning TRUE for rows in a data frame which do not include NA values. We can use this to subset the sdac dataset and identify any rows with missing values. Note that because we are subsetting a data frame by rows, we add our expression to the row position in the subsetting operation, i.e. 'data_frame_name[row, column]'.
sdac[!complete.cases(sdac), ] # check for missing values
Great! No missing data. Now let's make sure we have captured all 1155 conversation files. We will pipe the sdac$doc_id column to the unique() function, which returns the unique values of the column. Then we can take the length of that result to find out how many unique conversation files we have in our dataset.
sdac$doc_id %>% unique() %>% length() # check for unique files

## [1] 1155
Also good news, we have all 1155 conversations in the dataset.
Let’s find out how many individual speakers are in the dataset.
sdac$speaker_id %>% unique() %>% length() # check for unique speakers

## [1] 441
Good to know before we proceed to adding speaker meta-data from the stand-off file caller_tab.csv.
Running text with stand-off meta-data files
The sdac dataset now contains various pieces of linguistic and non-linguistic meta-data that we extracted from the conversation files in the sdac/ corpus. As part of that extraction we isolated the ids of the speakers involved in each conversation. As noted during the preliminary investigation portion of the curation of the data, these ids appear in a stand-off meta-data file named caller_tab.csv in the online documentation for the Switchboard Dialog Act Corpus. These ids provide us a link between the corpus data and the speaker meta-data that we can exploit to incorporate that meta-data into our existing sdac tidy dataset.
It is common for stand-off meta-data files to be in a structured format. That is, they will typically be stored in a .csv file or an .xml document. The goal, then, is to read the data into R as a data frame and join it with the existing tidy corpus dataset. To read a .csv file like caller_tab.csv we use the read_csv() function. Before we read it we should manually download and inspect the data for a couple of things: (1) how is the file delimited? and (2) is there a header row that names the columns in the data?
We can generally assume that a .csv file will be comma-separated, but this is not always the case; sometimes the file will be delimited by semicolons (;), tabs (\\t), or single or multiple spaces (\\s+). Whether there is a header row or not can also vary. If a header row does not exist in the file itself, there is a good chance some other file documents what each column in the data represents.
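For reference, readr has companion functions for these other layouts; a minimal sketch (the file names here are made up for illustration):

read_csv2("speakers_semicolon.csv")              # semicolon-separated values
read_tsv("speakers_tabs.tsv")                    # tab-separated values
read_delim("speakers_custom.txt", delim = " ")   # any single-character delimiter you specify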
Let's take a look at a few rows from caller_tab.csv to see what we have.
1000, 32, "N", "FEMALE", 1954, "SOUTH MIDLAND", 1, 0, "CASH", 15, "N", "", 2, "DN2"
1001, 102, "N", "MALE", 1940, "WESTERN", 3, 0, "GIFT", 10, "N", "", 0, "XP"
1002, 104, "N", "FEMALE", 1963, "SOUTHERN", 2, 0, "GIFT", 11, "N", "", 0, "XP"
First, we see that the columns are indeed separated by commas. Second, we see there is no header row. Some of the columns seem interpretable, like column 4, but we should try to find documentation to guide us. Poking around in the online documentation I noticed that the caller_doc.txt file has names for the columns. This is a file used to generate a database table, but it contains the information we need, so we'll use it to assign names to the columns in caller_tab.csv.
CREATE TABLE caller (
  caller_no          numeric(4) NOT NULL,
  pin                numeric(4) NOT NULL,
  target             character(1),
  sex                character(6),
  birth_year         numeric(4),
  dialect_area       character(13),
  education          numeric(1),
  ti                 numeric(1),
  payment_type       character(5),
  amt_pd             numeric(6),
  con                character(1),
  remarks            character(120),
  calls_deleted      numeric(3),
  speaker_partition  character(3)
);
We can combine this information with the read_csv() function to read caller_tab.csv and add the column names. Note that I've changed the caller_no name to speaker_id to align the nomenclature with the current sdac dataset. This renaming will facilitate the upcoming step of joining the tidy dataset and this meta-data.
sdac_speaker_meta <- read_csv(file = "https://catalog.ldc.upenn.edu/docs/LDC97S62/caller_tab.csv",
                              col_names = c("speaker_id", # changed from `caller_no`
                                            "pin", "target", "sex", "birth_year", "dialect_area",
                                            "education", "ti", "payment_type", "amt_pd", "con",
                                            "remarks", "calls_deleted", "speaker_partition"))

glimpse(sdac_speaker_meta) # preview the dataset

## Observations: 543
## Variables: 14
## $ speaker_id        <int> 1000, 1001, 1002, 1003, 1004, 1005, 1007, 10...
## $ pin               <int> 32, 102, 104, 5656, 123, 166, 274, 322, 445,...
## $ target            <chr> "\"N\"", "\"N\"", "\"N\"", "\"N\"", "\"N\"",...
## $ sex               <chr> "\"FEMALE\"", "\"MALE\"", "\"FEMALE\"", "\"M...
## $ birth_year        <int> 1954, 1940, 1963, 1947, 1958, 1956, 1965, 19...
## $ dialect_area      <chr> "\"SOUTH MIDLAND\"", "\"WESTERN\"", "\"SOUTH...
## $ education         <int> 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 3,...
## $ ti                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ payment_type      <chr> "\"CASH\"", "\"GIFT\"", "\"GIFT\"", "\"NONE\...
## $ amt_pd            <int> 15, 10, 11, 7, 11, 22, 20, 3, 11, 9, 25, 9, ...
## $ con               <chr> "\"N\"", "\"N\"", "\"N\"", "\"Y\"", "\"N\"",...
## $ remarks           <chr> "\"\"", "\"\"", "\"\"", "\"\"", "\"\"", "\"\...
## $ calls_deleted     <int> 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,...
## $ speaker_partition <chr> "\"DN2\"", "\"XP\"", "\"XP\"", "\"DN2\"", "\...
The columns mapped to the data as expected. The character columns, however, contain literal double quotes (e.g. "\"N\""). We could proceed without issue (R will treat them as character values just the same), but I would like to clean up the character values for aesthetic purposes. To do this I applied the following code.
sdac_speaker_meta <- # remove double quotes
  sdac_speaker_meta %>%
  map(str_replace_all, pattern = '"', replacement = '') %>% # iteratively remove doubled quotes
  bind_rows() %>% # combine the results by rows
  type_convert() # return columns to original data types

glimpse(sdac_speaker_meta) # preview the dataset

## Observations: 543
## Variables: 14
## $ speaker_id        <int> 1000, 1001, 1002, 1003, 1004, 1005, 1007, 10...
## $ pin               <int> 32, 102, 104, 5656, 123, 166, 274, 322, 445,...
## $ target            <chr> "N", "N", "N", "N", "N", "Y", "N", "N", "N",...
## $ sex               <chr> "FEMALE", "MALE", "FEMALE", "MALE", "FEMALE"...
## $ birth_year        <int> 1954, 1940, 1963, 1947, 1958, 1956, 1965, 19...
## $ dialect_area      <chr> "SOUTH MIDLAND", "WESTERN", "SOUTHERN", "NOR...
## $ education         <int> 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 3,...
## $ ti                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ payment_type      <chr> "CASH", "GIFT", "GIFT", "NONE", "GIFT", "GIF...
## $ amt_pd            <int> 15, 10, 11, 7, 11, 22, 20, 3, 11, 9, 25, 9, ...
## $ con               <chr> "N", "N", "N", "Y", "N", "Y", "N", "Y", "N",...
## $ remarks           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ calls_deleted     <int> 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,...
## $ speaker_partition <chr> "DN2", "XP", "XP", "DN2", "XP", "ET", "DN1",...
From the preview of the sdac_speaker_meta dataset we can see that there are 14 columns, including speaker_id. We also see that there are 543 observations. We can assume that each row corresponds to an individual speaker, but to make sure let's find the length of the unique values of sdac_speaker_meta$speaker_id.
sdac_speaker_meta$speaker_id %>% unique() %>% length() # check for unique speakers

## [1] 543
So this confirms that each row in sdac_speaker_meta corresponds to an individual speaker. It is also clear now that the sdac dataset, which contains 441 individual speakers, is a subset of all the data collected in the Switchboard Corpus project.
Let's select the columns that seem most interesting for a future analysis, dropping the other columns. The select() function allows us to specify columns to keep (or drop).
sdac_speaker_meta <- # select columns of interest
  sdac_speaker_meta %>%
  select(speaker_id, sex, birth_year, dialect_area, education)

glimpse(sdac_speaker_meta) # preview the dataset

## Observations: 543
## Variables: 5
## $ speaker_id   <int> 1000, 1001, 1002, 1003, 1004, 1005, 1007, 1008, 1...
## $ sex          <chr> "FEMALE", "MALE", "FEMALE", "MALE", "FEMALE", "FE...
## $ birth_year   <int> 1954, 1940, 1963, 1947, 1958, 1956, 1965, 1939, 1...
## $ dialect_area <chr> "SOUTH MIDLAND", "WESTERN", "SOUTHERN", "NORTH MI...
## $ education    <int> 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 3, 3, 2...
Tidy the corpus
The next step is to join the two datasets, linking the values of sdac$speaker_id with the values of sdac_speaker_meta$speaker_id. We want to keep all the data in the sdac dataset and only include data from sdac_speaker_meta where there are matching speaker ids. To do this we use the left_join() function. left_join() requires two arguments, corresponding to two data frames. We can optionally specify which column(s) to use as the joining condition, but by default it will use any column names that match in the two data frames. In our case the only column name that matches is speaker_id, so we can proceed without explicitly specifying the join column.
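If we did want to spell the join column out, or if the id columns had different names in the two data frames, the by argument handles it. A quick sketch of both forms (note these would hit the same column-type mismatch described just below until speaker_id is converted):

left_join(sdac, sdac_speaker_meta, by = "speaker_id") # explicit join column
left_join(sdac, sdac_speaker_meta, by = c("speaker_id" = "speaker_id")) # map a column to a differently named one (same name here, just to show the syntax)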
sdac <- left_join(sdac, sdac_speaker_meta) # join by `speaker_id`

## Error in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y, check_na_matches(na_matches)): Can't join on 'speaker_id' x 'speaker_id' because of incompatible types (integer / character)
We get an error! Reading the error, it appears we are trying to join columns of differing data types: sdac$speaker_id is of type character and sdac_speaker_meta$speaker_id is of type integer.
Error messages can be difficult to make sense of. If the issue is not clear to you, copy the error and search the web to see if others have had the same issue. Chances are someone has! If not, follow these steps to create a reproducible example and post it to a site such as StackOverflow.
To remedy the situation we need to coerce the sdac$speaker_id column to a numeric type. The as.numeric() function will do this.
sdac$speaker_id <- sdac$speaker_id %>% as.numeric() # convert to numeric
Now let’s apply our join operation again.
sdac <- left_join(sdac, sdac_speaker_meta) # join by `speaker_id`

glimpse(sdac) # preview the joined dataset

## Observations: 223,606
## Variables: 11
## $ doc_id         <chr> "4325", "4325", "4325", "4325", "4325", "4325",...
## $ damsl_tag      <chr> "o", "qw", "qy^d", "+", "+", "qy", "sd", "ad", ...
## $ speaker        <chr> "A", "A", "B", "A", "B", "A", "B", "B", "B", "B...
## $ turn_num       <chr> "1", "1", "2", "3", "4", "5", "6", "6", "6", "6...
## $ utterance_num  <chr> "1", "2", "1", "1", "1", "1", "1", "2", "3", "4...
## $ utterance_text <chr> "Okay. /", "{D So, }", "[ [ I guess, +", "What...
## $ speaker_id     <dbl> 1632, 1632, 1519, 1632, 1519, 1632, 1519, 1519,...
## $ sex            <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE...
## $ birth_year     <int> 1962, 1962, 1971, 1962, 1971, 1962, 1971, 1971,...
## $ dialect_area   <chr> "WESTERN", "WESTERN", "SOUTH MIDLAND", "WESTERN...
## $ education      <int> 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2,...
Result! Now let’s check our data for any missing data points generated in the join.
sdac[!complete.cases(sdac), ] %>% glimpse # view incomplete cases

## Observations: 100
## Variables: 11
## $ doc_id         <chr> "3554", "3554", "3554", "3554", "3554", "3554",...
## $ damsl_tag      <chr> "sd@", "+@", "sv@", "+@", "+", "sd", "+", "+", ...
## $ speaker        <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A...
## $ turn_num       <chr> "1", "3", "5", "7", "9", "9", "11", "13", "13",...
## $ utterance_num  <chr> "1", "1", "1", "1", "1", "2", "1", "1", "2", "1...
## $ utterance_text <chr> "Of a exercise program you have.", "Right. /", ...
## $ speaker_id     <dbl> 155, 155, 155, 155, 155, 155, 155, 155, 155, 15...
## $ sex            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ birth_year     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ dialect_area   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ education      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
We have 100 observations that are missing data. Inspecting the dataset preview, it appears that there was at least one speaker_id that appears in the conversation files but does not appear in the speaker meta-data. Let's check to see how many speakers this might affect.
sdac[!complete.cases(sdac), ] %>% select(speaker_id) %>% unique()
Just one speaker. This could very well be annotator error. Since it affects a relatively small proportion of the data, let's drop this speaker from the dataset. We can use the filter() function to keep only the rows where speaker_id is not equal to 155.
sdac <- # remove rows where speaker_id == 155
  sdac %>%
  filter(speaker_id != 155)

sdac[!complete.cases(sdac), ] %>% glimpse # view incomplete cases

## Observations: 0
## Variables: 11
## $ doc_id         <chr>
## $ damsl_tag      <chr>
## $ speaker        <chr>
## $ turn_num       <chr>
## $ utterance_num  <chr>
## $ utterance_text <chr>
## $ speaker_id     <dbl>
## $ sex            <chr>
## $ birth_year     <int>
## $ dialect_area   <chr>
## $ education      <int>
Explore the tidy dataset
At this point we have a well-curated dataset which includes linguistic and non-linguistic meta-data. As we did for the ACTIV-ES corpus, let’s get a sense of the distribution of some of the meta-data.
First we will visualize the number of utterances from speakers of the different dialect regions.
sdac %>%
  group_by(dialect_area) %>%
  count() %>%
  ggplot(aes(x = dialect_area, y = n)) +
  geom_col() +
  labs(x = "Dialect region", y = "Utterance count",
       title = "Switchboard Dialog Act Corpus",
       subtitle = "Utterances per dialect region") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Let’s see how men and women figure across the dialect areas.
sdac %>%
  group_by(dialect_area, sex) %>%
  count() %>%
  ggplot(aes(x = dialect_area, y = n, fill = sex)) +
  geom_col() +
  labs(x = "Dialect region", y = "Utterance count",
       title = "Switchboard Dialog Act Corpus",
       subtitle = "Utterances per dialect region and sex") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
There are many other ways to group and count the dataset but I’ll leave that to you to look at!
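As one more sketch before wrapping up, here is how you might count utterances by other speaker attributes already in sdac; the birth_decade column below is a label I made up for the bucketed birth years.

sdac %>%
  count(sex, education) # utterance counts by speaker sex and education level

sdac %>%
  mutate(birth_decade = floor(birth_year / 10) * 10) %>% # bucket birth years into decades
  count(birth_decade) # utterance counts by speaker birth decade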
Round up
In this post I covered tidying a corpus from running text files. We looked at three cases where meta-data is typically stored: in file names, embedded inline with the text itself, and in stand-off files. As usual we made extensive use of the tidyverse package set (readr, dplyr, ggplot2, etc.) and discussed other packages: readtext for reading and organizing meta-data from file names, tidytext for tokenizing text, and stringr for text cleaning and pattern matching. I also briefly introduced the ggplot2 package for creating plots based on the Grammar of Graphics philosophy. Along the way we continued to extend our knowledge of R data and object types, working with vectors, data frames, and lists and manipulating them in various ways (subsetting, sorting, transforming, and summarizing).
In the next post I will turn to working with meta-data in structured documents, specifically .xml documents. These types of documents tend to have rich meta-data, including linguistic and non-linguistic information. We will focus on working with linguistic annotations such as part-of-speech and syntactic structure. We will work to parse the linguistic information in these documents into a tidy dataset and also see how to create linguistic annotations for data that does not already contain them.
References
Benoit, Kenneth, and Adam Obeng. 2017. Readtext: Import and Handling for Plain and Formatted Text Files. https://CRAN.R-project.org/package=readtext.
Henry, Lionel, and Hadley Wickham. 2017. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
Robinson, David, and Julia Silge. 2018. Tidytext: Text Mining Using ’Dplyr’, ’Ggplot2’, and Other Tidy Tools. https://CRAN.R-project.org/package=tidytext.
Wickham, Hadley. 2018. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley, and Winston Chang. 2018. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics.
Wickham, Hadley, Romain Francois, Lionel Henry, and Kirill Müller. 2017. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, Jim Hester, and Romain Francois. 2017. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
https://regex101.com is a great place to learn more about Regular Expressions and to practice using them.↩