The William Barr congressional hearing on July 28 was a lengthy grilling of the attorney general on a wide variety of subjects, from the protests and riots sweeping across the country to the legality of the president’s orders in response to them. After watching some of the hearing and reading a few articles, I noticed that several commentators seemed to conclude that the event was less an opportunity for the attorney general to communicate the rationale for his recent decisions (or to communicate much of anything) than an opportunity for political grandstanding by congressional representatives. I began thinking of ways to test these claims empirically. Naturally, I proceeded to look for some data and start writing some R code.
As is often the case, there wasn’t a readily available dataset, but it was possible to build something like one by retrieving the transcript of the hearing and parsing the text. Luckily, a transcript was available at rev.com; unfortunately, it was not in a machine-readable form. To get it into a format I could analyze, I decided to scrape the text in R.
Web Scraping in R vs Python
Over the years, I’ve done a fair bit of web scraping in Python using the Beautiful Soup package in combination with Requests. However, each time, I invariably have to open up or search for another Python script to find the boilerplate code necessary to get up and running. The code below is everything needed to go from a url to a structure ready to parse with a variety of useful methods, but in my opinion it is an awful lot of overhead just to get started.
import bs4 as bs
import requests as req

url = 'https://www.somewebpage.com'
html = req.get(url)
soup = bs.BeautifulSoup(html.text, "html.parser")
For this quick weekend-afternoon project, I decided to venture into new territory: web scraping in R. So, I loaded the ever-faithful tidyverse and lubridate packages and, after installing it, the rvest package.
library(tidyverse)
library(lubridate)
library(rvest)
The rvest package is R’s equivalent of Beautiful Soup. As I said earlier, you want to take the url of some webpage and turn it into a well-organized structure that is easily parseable in your language of choice. Both packages do the same thing, but they look quite a bit different.

As in Python, rvest requires you to first grab the raw html, passing in the url you are interested in. For our purposes today, we will get the html for the transcript from rev.com.
url <- 'https://www.rev.com/blog/transcripts/house-judiciary-committee-hearing-of-attorney-general-barr-transcript-july-28'
transcript_raw <- xml2::read_html(url)
Now that we have the html, we can start looking within it for the html tags that hold the transcript text we are interested in. After right-clicking in a browser and selecting “Inspect”, I identified the paragraph (‘p’) tag as the one holding the transcript text. We use the html_nodes function and pass it the paragraph tag. Once we have this, we indicate that we don’t want to see the html anymore, but just the text, using html_text. This returns a long character vector with the contents of each paragraph tag as a separate element. Given the structure of the transcript webpage, this works out quite conveniently, since it also separates each speaking turn into its own element.
transcript_text <- transcript_raw %>% 
  html_nodes("p") %>% 
  html_text()
Maybe it’s a couple of years of heavy R use talking, but I find this much more readable than the Python approach. I can’t speak to its flexibility or the ability to implement complex scraping logic, but for this very basic usage it’s clearer and more concise, and consequently more likely to be remembered.
Taking a look at the first few elements of our transcript vector, we can see that we have a few rows of “metadata”, and then each paragraph has the name of the speaker, the timestamp, and the words spoken. We naturally want to organize this text, with the name of the speaker, the timestamp, and the text in separate columns of a dataframe. This will allow us to calculate some statistics and look at word frequencies and other patterns in the large volume of text spoken during a marathon, nearly five-hour session.
## [1] "Mr. Nadler: (00:00) Any time. We welcome everyone to this morning’s hearing on o...." ## [2] "Mr. Nadler: (00:54) Before we begin, I would like to remind members that we have...." ## [3] "Mr. Nadler: (01:18) I would also remind all members that guidance from the Offic...." ## [4] "Mr. Nadler: (01:38) Thank you for being here Mr. Barr. According to the Congress...."
Regexing
Any time we are presented with a bunch of text from which we need to extract certain types of information (numerical sequences, words, word combinations, symbols), our go-to tool will be regular expressions. Regular expressions are a powerful concept and can be used with most programming languages to search for patterns of considerable complexity within text. We can build all sorts of logic into regular expressions, most of which is far beyond the scope of this blog post. For our purposes, we will cover some R-specific functions and syntax peculiarities.
In order to extract the name of the speaker, the timestamp, and the text, we will use the str_extract function from the incredibly handy stringr package. We will take an item from the transcript and pass a regular expression to the function, iteratively building out the expression and checking the output after each change – one of the great features of working in a computational notebook, of which R notebooks are one example.
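One handy helper for this kind of iteration is stringr’s str_view, which highlights exactly what a pattern matches. A quick sketch (note that older stringr versions show only the first match, with str_view_all showing all of them):

# highlight what the pattern matches in the 19th transcript paragraph,
# re-running after each tweak to the regex
str_view(transcript_text[19], '^.*\\)')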
The first regular expression is composed of the following:
^
A caret, indicating that we want the match anchored to the beginning of the string.
.*
…matching zero or more characters of any kind…
\\)
…up through a close parenthesis. (Note that because .* is greedy, this actually matches through the last close parenthesis in the string; it works here because the speaker/timestamp prefix is usually the only parenthesized text in a paragraph.)
This gets us the name of the speaker and the timestamp.
str_extract(transcript_text[19], '^.*\\)') # name and timestamp

## [1] "Mr. Jordan: (13:20)"
The next one is a bit more complicated. Here we are using lookbehind and lookahead assertions (both positive, despite the intimidating syntax) in order to get just the timestamp, without the surrounding parentheses.
This regular expression specifies the following:
(?<=\\()
A positive lookbehind: the match must be immediately preceded by an open parenthesis, which is not itself included in the match…
[0-9].+?
…then a digit followed by one or more further characters, matched lazily (the trailing ? expands the match only as far as needed)…
(?=\\))
…up to a positive lookahead requiring that the match be immediately followed by a close parenthesis, which is also excluded from the match.
You’ll notice those double backslashes in there. The doubling isn’t part of the regular expression itself; rather, the backslash is also the escape character in R string literals, so to get a single literal backslash through to the regex engine you have to type two. The string '\\)' therefore contains the two characters \), which the regex engine reads as “match a literal close parenthesis”. This isn’t unique to R (Java, and Python without raw strings, behave the same way); languages with raw string literals let you skip the doubling, and R gained its own raw strings, written r"(...)", in version 4.0.
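A quick sketch to make the escaping concrete (the raw-string form assumes R >= 4.0):

# this string contains just two characters: a backslash and a close parenthesis
nchar("\\)")
## [1] 2

# with a raw string, the pattern is passed through without doubled backslashes
str_extract("Mr. Jordan: (13:20)", r"{(?<=\()[0-9].+?(?=\))}")
## [1] "13:20"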
str_extract(transcript_text[19], '(?<=\\()[0-9].+?(?=\\))') # timestamp only

## [1] "13:20"
To extract just the name of the speaker, we will use a regular expression composed of the following:
^
Start at the beginning of the line….
.+?
…lazily matching one or more characters…
(?=:)
…up to (but not including) the first colon
str_extract(transcript_text[19], '^.+?(?=:)') # name only

## [1] "Mr. Jordan"
Once we have our regular expressions built out, we will use them within calls to str_extract to build a tibble (a slightly modified dataframe). To do this, we will use map_chr from the purrr package. The map family of functions in purrr lets you pass a list or vector, a function, and any necessary parameters, and get back a list or vector of the type specified in the latter part of the function’s name, each item having been processed by the function. So, map_chr returns a character vector, map_dbl a double vector, and so on. Type map into your script and hit tab to see all of the available functions. It’s a little hard to explain; see the cheat sheet here for a clearer breakdown of the concept.
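As a tiny illustration of the pattern (with made-up strings rather than the transcript):

# apply str_extract element-wise, passing the pattern as an extra argument;
# map_chr() guarantees that a character vector comes back
map_chr(c("a: (1)", "b: (2)"), str_extract, '^.+?(?=:)')
## [1] "a" "b"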
So, we are going to use the tibble constructor function to build it, and since it takes vectors/lists as its arguments, we will pass in the map_chr calls to extract the timestamp, name, and text from each item in the vector using str_extract (and str_remove for the text). Once the tibble has been created, we pipe it to filter to remove the rows that don’t correspond to someone speaking.
transcript_df <- tibble(
  timestamp = map_chr(transcript_text, str_extract, '(?<=\\()[0-9].+?(?=\\))'),
  name = map_chr(transcript_text, str_extract, '^.+?(?=:)'),
  text = map_chr(transcript_text, str_remove, '^.*\\)')
) %>% 
  filter(!is.na(timestamp))

head(transcript_df, n = 10)

## # A tibble: 10 x 3
##    timestamp name     text                        
##    <chr>     <chr>    <chr>                       
##  1 00:00     Mr. Nad… " Any time. We welcome ever…
##  2 00:54     Mr. Nad… " Before we begin, I would …
##  3 01:18     Mr. Nad… " I would also remind all m…
##  4 01:38     Mr. Nad… " Thank you for being here …
##  5 02:32     Mr. Nad… " Second, Congress establis…
##  6 03:32     Mr. Nad… " There is no precedent for…
##  7 04:32     Mr. Nad… " In your time at the depar…
##  8 05:51     Mr. Nad… " Fourth, at the president’…
##  9 06:41     Mr. Nad… " The message these actions…
## 10 08:14     Mr. Nad… " Again, this failure of le…
Next, we will get a word count for the text column using the str_count function from stringr, counting each run of whitespace and adding 1 to get the total number of words spoken in each paragraph. To compare the number of words spoken by William Barr with everyone else, we also create a column indicating whether the person speaking is William Barr or someone else.
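The heuristic here is simply “words are whitespace-separated tokens”, which is close enough for our purposes. A quick sanity check:

str_count("I reclaim my time", '\\s+') + 1
## [1] 4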
To get a real timestamp out of the “timestamp” character column, we will prepend the date to it, accounting for the fact that before the first hour mark the timestamp displayed only minutes and seconds. Prepending a date to a bare timestamp is a very common approach, since the standard datetime types in both Python and R don’t support calculations on a time of day with no date attached.
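A small sketch of why prepending the date helps (the values are from the transcript above; the difference in the comment is my own arithmetic):

# once the date is prepended, the strings parse to datetimes and arithmetic works
as_datetime("2020-07-28 00:13:20") - as_datetime("2020-07-28 00:00:54")
## Time difference of 12.43333 mins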
After we have our timestamp column formatted correctly, we’ll remove some rows where a video played during the hearing.
transcript_cleaned_df <- transcript_df %>% 
  mutate(
    words = str_count(text, '\\s+') + 1,
    # note: 'Wiliam Barr' (single l) matches how the name appears in the source transcript
    speaker = if_else(!name == 'Wiliam Barr', 'everyone_else', 'william_barr'),
    # timestamps shorter than 6 characters are mm:ss and need a "00:" hour prefix
    timestamp = if_else(
      str_length(str_trim(timestamp, side = "both")) < 6,
      paste0('2020-07-28 00:', timestamp),
      paste0('2020-07-28 ', timestamp)
    ),
    timestamp = as_datetime(timestamp)
  ) %>% 
  filter(
    !is.na(name),
    # drop the rows where a video was played during the hearing
    !between(timestamp, as_datetime('2020-07-28 00:16:21'), as_datetime('2020-07-28 00:16:41'))
  )

head(transcript_cleaned_df, n = 3) %>% kable_table()
timestamp | name | text | words | speaker |
---|---|---|---|---|
2020-07-28 00:00:00 | Mr. Nadler | Any time. We welcome everyone to this morning’s hearing on oversight of the Department of Justice. I apologize for beginning the hearing late as many of you know I was in a minor car accident on the way in this morning. Everyone is fine except perhaps the car but it did cause significant delay. I thank the attorney general and the members for their patience and their flexibility and we will now begin. Before we begin I want to acknowledge … I want to note that we are joined this morning by the distinguished Majority Leader, the gentleman from Maryland, Mr. Hoyer. Leader Hoyer has long recognized the need for vigorous congressional oversight of the executive branch under both parties and we appreciate his presence today as we question the attorney general. | 133 | everyone_else |
2020-07-28 00:00:54 | Mr. Nadler | Before we begin, I would like to remind members that we have established an email address and distribution list dedicated to circulating exhibits, motions or other written materials that members might want to offer as part of our hearing today. If you would like to submit materials, please send them to the email address that has previously been distributed to your offices and we will circulate the materials to members and staff as quickly as we can. | 78 | everyone_else |
2020-07-28 00:01:18 | Mr. Nadler | I would also remind all members that guidance from the Office of the Attending Physician states that face coverings are required for all meetings in an enclosed space such as this committee hearing. I expect all members on both sides of the aisle to wear a mask except when you are speaking. I will now recognize myself for an opening statement. | 62 | everyone_else |
Results
Now, let’s see how many words were spoken by William Barr versus everyone else. We group by the speaker column and sum the words column to get total words spoken for each, then calculate the percentage of all words spoken by each.
pct_breakdown <- function(col_name){
  # enquo() captures the unquoted column name; !! unquotes it inside group_by()
  col_name_quoted <- enquo(col_name)
  transcript_cleaned_df %>% 
    group_by(!!col_name_quoted) %>% 
    summarise(total_words = sum(words, na.rm = TRUE)) %>% 
    arrange(desc(total_words)) %>% 
    mutate(pct = scales::percent(total_words / sum(total_words)))
}

pct_breakdown(speaker)

## # A tibble: 2 x 3
##   speaker       total_words pct  
##   <chr>               <dbl> <chr>
## 1 everyone_else       31545 73%  
## 2 william_barr        11647 27%
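Because pct_breakdown captures its argument with enquo, it works for any grouping column, not just speaker; for example, a per-representative breakdown (output omitted):

pct_breakdown(name)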
It turns out that William Barr accounted for only 27% of all the words spoken at the hearing. It does seem that this was less a hearing than nearly five hours of political theater. I suspected this number would be low, but I did not think it would be quite this low.
I was also curious about the newly trendy “I reclaim my time” phrase. It was used 24 times during the hearing, by 12 different representatives. It appears they “reclaimed” a considerable amount of time.
transcript_cleaned_df %>% 
  filter(str_detect(text, '[rR]eclaim'))

## # A tibble: 24 x 5
##    timestamp           name   text  words speaker
##    <dttm>              <chr>  <chr> <dbl> <chr>  
##  1 2020-07-28 00:58:42 Congr… " I’m…    58 everyo…
##  2 2020-07-28 00:59:15 Congr… " I’m…   172 everyo…
##  3 2020-07-28 01:21:52 Mr. J… " [cr…    30 everyo…
##  4 2020-07-28 01:22:23 Mr. J… " No,…    41 everyo…
##  5 2020-07-28 01:23:57 Mr. J… " Rec…     4 everyo…
##  6 2020-07-28 01:25:29 Mr. J… " Rec…     9 everyo…
##  7 2020-07-28 01:47:10 Bass   " I r…    87 everyo…
##  8 2020-07-28 02:12:47 Jeffr… " Rec…    24 everyo…
##  9 2020-07-28 02:33:11 Ted L… " Rec…    19 everyo…
## 10 2020-07-28 02:34:22 Ted L… " Rec…    97 everyo…
## # … with 14 more rows
Conclusion
We covered web scraping in R, regular expressions for retrieving patterns from a body of text, and some basic operations on dataframes using the tidyverse packages, all in service of empirically answering a question that began as a very subjective, general assessment and ended with a concrete, data-driven answer. As with any analysis, and especially one as limited as this one, forming strong conclusions would require gathering as many data sources as possible and seeing whether different analytic methods reach the same conclusion. Still, this is a good first step in that direction and, in my opinion, better than the intuition-driven, subjective assessments of the news media.