Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity
Like a lot of people, I was intrigued by “I Am Part of the Resistance Inside the Trump Administration”, an anonymous New York Times op-ed written by a “senior official in the Trump administration”. And like many data scientists, I was curious about what role text mining could play.
Ok NLP people, now’s your chance to shine. Just spitballing here but TF-IDF on “the op-ed” compared to the published writing of every senior Trump admin official? I want likelihood estimates with standard errors. GO!
— Drew Conway (@drewconway) September 5, 2018
This is a useful opportunity to demonstrate how to use the tidytext package that Julia Silge and I developed, and in particular to apply three methods:
- Using TF-IDF to find words specific to each document (examined in more detail in Chapter 3 of our book)
- Using widyr to compute pairwise cosine similarity
- Breaking similarity down by word to make it interpretable
Since my goal is R education more than it is political analysis, I show all the code in the post.
Even in the less than 24 hours since the article was posted, I’m far from the first to run a text analysis on it. In particular, Mike Kearney has shared a great R analysis on GitHub (which pointed me towards CSPAN’s cabinet Twitter list), and Kanishka Misra has done some exciting work here.
Downloading data
Getting the text of the op-ed is doable with the rvest package.
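Something along these lines works (the URL and the "p" CSS selector for the article paragraphs are assumptions, so they may need tweaking against the live page):

```r
library(rvest)
library(dplyr)

# Assumed URL of the op-ed; the "p" selector grabs every paragraph of text
url <- "https://www.nytimes.com/2018/09/05/opinion/trump-white-house-anonymous-resistance.html"

op_ed_text <- read_html(url) %>%
  html_nodes("p") %>%
  html_text() %>%
  paste(collapse = " ")
```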
The harder step is getting a set of documents representing “senior officials”. An imperfect but fast approach is to collect text from their Twitter accounts. (If you find an interesting dataset of, say, government FOIA documents, I recommend you try extending this analysis!)
We can look at a combination of two (overlapping) Twitter lists containing administration staff members:
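A sketch of that step with the rtweet package (it needs Twitter API credentials, and the list slugs and owners below are placeholders rather than the exact lists used):

```r
library(rtweet)
library(dplyr)
library(purrr)

# Placeholder list slugs/owners; substitute the actual Twitter lists
members <- bind_rows(
  lists_members(slug = "the-cabinet", owner_user = "cspan"),
  lists_members(slug = "trump-administration", owner_user = "cspan")
) %>%
  distinct(screen_name)

# Up to 3200 recent tweets per account (the API's per-user limit)
tweets <- map_df(members$screen_name, get_timeline, n = 3200)
```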
This results in a set of 136,501 tweets from 69 Twitter handles. There’s certainly no guarantee that the op-ed writer is among these Twitter accounts (or, if they are, that they even write their tweets themselves). But it still serves as an interesting case study of text analysis. How do we find the tweets with the closest use of language?
Tokenizing tweets
First, we need to tokenize the tweets: to turn them from full messages into individual words. We probably want to avoid retweets, and we need a custom regular expression for splitting the text, along with a step to remove links (just as I’d done when analyzing Trump’s Twitter account).
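A sketch of that tokenization, reusing the kind of regex from that earlier post (the text and is_retweet columns are what rtweet returns in the download step above):

```r
library(dplyr)
library(stringr)
library(tidytext)

# Split on anything that isn't a word/hashtag/mention character, so that
# @mentions and #hashtags survive as single tokens
reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

tweet_words <- tweets %>%
  filter(!is_retweet) %>%
  # Drop t.co links and HTML-escaped ampersands before tokenizing
  mutate(text = str_replace_all(text, "https?://t\\.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg)
```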
This parses the corpus of tweets into almost 1.5 million words.
Among this population of accounts, and ignoring “stop words” like “the” and “of”, what are the most common words? We can use ggplot2 to visualize this.
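A sketch of that count and plot:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

tweet_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of uses across these accounts")
```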
No real surprises here. Many accounts mention @POTUS often, as well as words like “people”, “American”, and “Trump” that you’d expect from administration accounts.
Finding a text signature: TF-IDF vectors
What words make up someone’s “signature”? What makes up mine, or Trump’s, or Mike Pence’s, or the op-ed’s?
We could start with the most common words someone uses. But there are some words, like “the” and “of”, that just about everyone uses, as well as words like “President” that everyone in our dataset will use. So we also want to downweight words that appear across many documents. A common tool for balancing these two considerations and turning them into a “signature” vector is tf-idf: term frequency-inverse document frequency. This multiplies how frequently someone uses a term (the term frequency) by how rare that term is across documents (the inverse document frequency: the log of the total number of documents divided by the number that mention the term). For more details, see Chapter 3 of Text Mining with R.
The bind_tf_idf function from tidytext lets us compute tf-idf on a dataset of word counts like this. Before we do, we bring in the op-ed as an additional document (since we’re interested in considering it as one “special” document in our corpus).
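A sketch of that step, treating each Twitter handle as one document and the op-ed text (under the placeholder name "OP-ED") as one more:

```r
library(dplyr)
library(tidytext)

word_counts <- tweet_words %>%
  count(screen_name, word)

oped_words <- tibble(text = op_ed_text) %>%
  unnest_tokens(word, text) %>%
  count(word) %>%
  mutate(screen_name = "OP-ED")

word_tf_idf <- bind_rows(word_counts, oped_words) %>%
  bind_tf_idf(word, screen_name, n) %>%
  arrange(desc(tf_idf))
```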
We can now see the words with the strongest associations to a user. For example, Josh Paciorek (the VP’s Deputy Press Secretary) uses the hashtag #gogreen (supporting Michigan State Football) quite often; it makes up 2% of his words (tf, term frequency). Since no one else uses it (leading to an inverse document frequency, idf, of 4.5), this makes it a critical part of his TF-IDF vector (his “signature”).
We could take a look at the “signatures” of a few selected Twitter accounts.
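For example (the handles chosen here are just illustrative):

```r
library(dplyr)
library(ggplot2)
library(tidytext)

selected <- c("OP-ED", "realDonaldTrump", "DeptVetAffairs", "SecPompeo", "mike_pence")

word_tf_idf %>%
  filter(screen_name %in% selected) %>%
  group_by(screen_name) %>%
  top_n(12, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, screen_name)) %>%
  ggplot(aes(word, tf_idf)) +
  geom_col() +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ screen_name, scales = "free_y")
```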
This gives us a set of words that are quite specific to each account. For instance, @DeptVetAffairs uses hashtags like “#vantagepoint” and “#veteranoftheday” that almost no other account in this set would use. Words that are specific to Trump include “witch” (as in “witch hunt”), “fake” (as in “fake news”) and other phrases that he tends to fixate on while other government officials don’t. (See here for my text analysis of Trump’s tweets as of August 2017).
This shows how TF-IDF offers us a vector (an association of each word with a number) that describes the unique signature of that document. To compare our documents (the op-ed with each Twitter account), we’ll be comparing those vectors.
The widyr package: cosine similarity
How can we compare two vectors to get a measure of document similarity? There are many approaches, but perhaps the most common for comparing TF-IDF vectors is cosine similarity. This is a combination of a dot product (multiplying the same term in document X and document Y together) and a normalization (dividing by the magnitudes of the vectors).
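In base R terms, for two word-score vectors x and y, that's:

```r
# Cosine similarity: dot product divided by the product of the two magnitudes
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x ^ 2)) * sqrt(sum(y ^ 2)))
}

cosine_similarity(c(1, 0, 2), c(2, 1, 2))
#> [1] 0.8944272
```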
My widyr package offers a convenient way to compute pairwise similarities on a tidy dataset:
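A sketch, using the tf-idf values computed above:

```r
library(dplyr)
library(widyr)

similarities <- word_tf_idf %>%
  pairwise_similarity(screen_name, word, tf_idf, sort = TRUE)

similarities
```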
The top results show that this elementary method is able to match people to their positions. The VP Press Secretary and VP Communications Director unsurprisingly work closely together and tweet on similar topics. Similarly, it matches Shulkin, Perry, Carson, DeVos, and Zinke to their (current or former) cabinet positions, and links the two consecutive Health and Human Services directors (Price and Azar) to each other.
It’s worth seeing this document similarity metric in action, but it’s not what you’re here for. What we’re really interested in is the comparison between the op-ed and each Twitter account. We can filter the similarity results down to just those comparisons.
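Continuing the sketch (with the "OP-ED" placeholder name from earlier):

```r
oped_similarities <- similarities %>%
  filter(item1 == "OP-ED") %>%
  arrange(desc(similarity))

oped_similarities
```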
This unveils the most similar writer as… Trump himself.
Hmmm. While that would certainly be a scoop, it doesn’t sound very likely to me. And the other top picks (the official White House account, the Press Secretary, and the Vice President) also seem like suspicious guesses.
Interpreting machine learning: what words contributed to scores?
The method of tf-idf is a fairly basic one for text mining, but as a result it has a useful trait: it’s based on a linear combination of one-score-per-word. This means we can say exactly how much each word contributed to a TF-IDF similarity between the article and a Twitter account. (Other machine learning methods allow interactions between words, which makes them harder to interpret).
We’ll try an approach of decomposing our TF-IDF similarity to see how much each word contributed to it. You could think of this as asking “if the op-ed hadn’t used this word, how much lower would the similarity score be?”
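One way to set this up: cosine similarity is the sum, across words, of the product of the two unit-normalized tf-idf values, so each word’s product is its contribution to the total. A sketch under that framing:

```r
library(dplyr)

# Normalize each document's tf-idf vector to unit length
normalized <- word_tf_idf %>%
  group_by(screen_name) %>%
  mutate(normalized_tf_idf = tf_idf / sqrt(sum(tf_idf ^ 2))) %>%
  ungroup()

# Each word's contribution to an op-ed / account similarity is the product
# of the two normalized values
word_contributions <- normalized %>%
  filter(screen_name == "OP-ED") %>%
  select(word, oped = normalized_tf_idf) %>%
  inner_join(
    normalized %>% filter(screen_name != "OP-ED"),
    by = "word"
  ) %>%
  mutate(contribution = oped * normalized_tf_idf) %>%
  arrange(desc(contribution))
```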
Now the reasons for the TF-IDF similarities become clearer.
The op-ed uses the words “Russia” five times. The Press Secretary and especially Trump mention Russia multiple times on their Twitter accounts, always within the context of defending Trump (as expected). Several accounts also get a high score because they mention the word “Trump” so frequently.
Unfortunately, with a document this short and topical, that’s all it takes to get a high similarity score (a bag-of-words method can’t understand the context, such as whether Russia is mentioned in a negative or a defensive light). This is one reason it’s worth taking a closer look at what goes into an algorithm rather than treating it as a black box.
Having said that, there’s one signature I think is notable.
“Malign behavior”
Many others have noted “lodestar” as a telltale word in the piece, but none of the documents in our set included it. I’d like to focus on another word that did: malign. Emphasis mine:
He complained for weeks about senior staff members letting him get boxed into further confrontation with Russia, and he expressed frustration that the United States continued to impose sanctions on the country for its malign behavior.
“Malign” isn’t as rare a word as “lodestar”, but it’s notable for being used in the exact same context (discussing Russia or other countries’ behavior) in a number of tweets from both Secretary of State Pompeo and the @StateDepartment account. (Pompeo has actually used the term “malign” an impressive seven times since May, though all but one were about Iran rather than Russia).
In #Finland tonight. Monday, @POTUS & I will meet with our Russian counterparts in #Helsinki A better relationship with the Russian government would benefit all, but the ball is in Russia’s court. We will continue to hold Russia responsible for its malign activities @StateDept pic.twitter.com/K4C1PSrRMb
— Secretary Pompeo (@SecPompeo) July 15, 2018
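Spotting this doesn’t need the similarity machinery at all; a quick string search over the tweets we downloaded earlier is enough:

```r
library(dplyr)
library(stringr)

tweets %>%
  filter(str_detect(str_to_lower(text), "malign")) %>%
  count(screen_name, sort = TRUE)
```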
“Malign behavior” has been common language for Pompeo this whole year, as it has for other State Department officials like Jon Huntsman. What’s more, you don’t need data science to notice the letter spends three paragraphs on foreign policy (and praises “the rest of the administration” on that front). I’m not a pundit or a political journalist, but I can’t resist speculating a bit. Pompeo is named by the Weekly Standard as one of four likely authors of the op-ed, but even if he’s not the author my guess would be someone in the State Department.
Conclusion: Opening the black box
It’s worth emphasizing again that this article is just my guess based on a single piece of language (it’s nowhere close to the certainty of my analysis of Trump’s Twitter account during the campaign, which was statistically significant enough that I’d be willing to consider it “proof”).
I was fairly skeptical from the start that we could get strong results with document-comparison methods like this, especially on such a small article. That opinion mirrored people with much more expertise than I have:
That means you need a pretty large sample to not have large error bars. Don’t expect conclusive or even suggestive evidence here.
— David Mimno (@dmimno) September 6, 2018
But I’m satisfied with this analysis both as a demonstration of tidytext methods and as a lesson in the importance of model interpretability. When we ran a TF-IDF comparison, we knew it was wrong because @realDonaldTrump appeared at the top. But what if Trump hadn’t been the one to mention Russia the most, or if another false positive had caused an account to rise to the top? Breaking similarity scores down by word is a useful way to interrogate our model and understand its output. (See here for a similar article about understanding the components of a model).