Who wrote that anonymous NYT op-ed? Text similarity analyses with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In US politics news, the New York Times took the unusual step this week of publishing an anonymous op-ed from a current member of the White House (assumed to be a cabinet member or senior staffer). Speculation about the identity of the author is, of course, rife. Much of the attention has focused on the use of specific words in the article, but can data science provide additional clues? In the last 48 hours, several R users have employed text similarity analysis to try and identify likely culprits.
David Robinson followed a similar process to the one he used to identify which of President Trump's tweets were written by him personally (as opposed to a staffer). He downloaded the tweets written by 69 Cabinet members and WH staff who maintain accounts on on Twitter (about 1.5 million words total). Then, he used the TD-IDF (term-frequency inverse-document-frequency) distance measure to find the rate at which certain words are more frequently (or less frequently) by the each subject. Comparing those rates to the rate of use in the op-ed. After ruling out the VP (and the President himself), this analysis suggests Secretary of State Pompeo as a suspect, in part due to his preference for the word “malign”.
Michael Kearney also used tweets as the comparison text, but limited the suspect pool to cabinet members only. He compared the text of those tweets to the NYT op-ed on using 107 numeric features, such as use of capitalizaion, punctuation, whitespace, and sentence length and structure. On that basis, again ruling out the President and VP, the closest match is Secretary Pompeo.
Kanishka Misra also used tweets from cabinet members, but used the Burows' Delta metric to compare their text to that of the op-ed; those tweets with similar absolute word frequencies to the op-ed would be “closest” by this measure. Secretary of Transportation Elaine Chao is the closest match using this method.
The authors take great pains to point out the limitations of these analyses: not all possible authors are active on Twitter; some of those likely don't write all their own tweets; Twitter is a medium that uses very different language than a newspaper editorial; and it's even possible the author was deliberately concealing their authorship through their language choices. But even so, the use of data has highlighted some unusual turns of phrase in the original op-ed that might have otherwise gone unobserved.
In each case the analysis was performed using the R language, and the authors make use of a common set of R packages to assist with the analysis:
- rvest to scrape data from the NYT website,
- rtweet, to download the text of tweets from likely suspects,
- tidytext, to tokenize and otherwise prepare the texts for analysis, and
- widyr to reshape the data for distance calculations
The full R code behind the analysis is provided by each author, and if nothing else provides interesting examples of different methods for text similarity analysis. These articles are a great demonstration of the power of rapid, in-depth data science with R: the op-ed was only published two days ago!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.