[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].
Topics and Categories in the Russian Troll Tweets

I decided to return to the analysis I conducted for the IRA tweets dataset. (You can read up on that analysis and R code here.) Specifically, I returned to the LDA results, which looked like they lined up pretty well with the account categories identified by Darren Linvill and Patrick Warren. But with slightly altered code, we can confirm that or see if there’s more to the topics data than meets the eye. (Spoiler alert: There is more than meets the eye.)
I reran much of the original code – creating the file, removing non-English tweets and URLs, generating the DTM and conducting the 6-topic LDA. For brevity, I’m not including it in this post, but once again, you can see it here.
I will note that the topics were numbered a bit differently than they were in my previous analysis. Here’s the new plot. The results look very similar to before. (LDA is fit using a variational Bayesian method, and there is an element of randomness to the fit, so the results aren’t a one-to-one match, but they’re very close.)
Before, when I generated a plot of the LDA results, I asked it to give me the top 15 terms by topic. I’ll use the same code, but instead have it give the top topic for each term.
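The original code for this step is not reproduced here, so the following is only a minimal sketch of the idea. It assumes the per-term probabilities have already been tidied into `topic`, `term`, and `beta` columns (the shape `tidytext::tidy()` returns for a topicmodels LDA fit with `matrix = "beta"`). The data frame `tweet_topics` and its values are made up for illustration; `word_topic` is the name used in the join that follows.

```r
library(dplyr)

# Toy stand-in for the tidied LDA output: one row per term-topic pair.
# Terms and beta values are invented for illustration only.
tweet_topics <- data.frame(
  topic = c(1, 2, 3, 1, 2, 3),
  term  = c("news", "news", "news", "police", "police", "police"),
  beta  = c(0.030, 0.004, 0.010, 0.002, 0.020, 0.015)
)

# For each term, keep only the topic with the highest beta
word_topic <- tweet_topics %>%
  group_by(term) %>%
  slice_max(beta, n = 1, with_ties = FALSE) %>%
  ungroup()

word_topic
```

Setting `with_ties = FALSE` guarantees exactly one row per term, which keeps the later join from duplicating words.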
I can then match this dataset up to the original tweetwords dataset, to show which topic each word is most strongly associated with. Because the word variable is known by two different variable names in my datasets, I need to tell R how to match.
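# Attach each word's top LDA topic; the word column is "word" in tweetwords and "term" in word_topic
tweetwords <- tweetwords %>%
  left_join(word_topic, by = c("word" = "term"))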
Now we can generate a crosstable, which displays the matchup between LDA topic (1-6) and account category (Commercial, Fearmonger, Hashtag Gamer, Left Troll, News Feed, Right Troll, and Unknown).
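A crosstable of this kind can be sketched with base R's `table()`. The toy data frame below stands in for the joined dataset; the column names `account_category` and `topic` and all values are assumptions for illustration, not the real data.

```r
# Toy stand-in for the joined data: one row per word, with the account
# category of its tweet and the word's top LDA topic (values made up).
tweetwords_toy <- data.frame(
  account_category = c("NewsFeed", "NewsFeed", "LeftTroll",
                       "RightTroll", "RightTroll", "RightTroll"),
  topic = c(1, 1, 2, 3, 6, 6)
)

# Rows are account categories, columns are LDA topics
cat_by_topic <- table(tweetwords_toy$account_category, tweetwords_toy$topic)
cat_by_topic
```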
This table is a bit hard to read, because it’s frequencies, and the total number of words for each topic and account category differ. But we can solve that problem by asking instead for proportions. I’ll have it generate proportions by column, so we can see the top account category associated with each topic.
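As a rough illustration, column proportions come from `prop.table()` with `margin = 2`. The counts below are invented, and the sketch assumes account categories in rows and topics in columns.

```r
# Invented counts: rows are account categories, columns are LDA topics
counts <- matrix(
  c(8, 2,   # topic 2: Left Troll, Right Troll
    1, 9),  # topic 6: Left Troll, Right Troll
  nrow = 2,
  dimnames = list(account_category = c("Left Troll", "Right Troll"),
                  topic = c("2", "6"))
)

# margin = 2 divides each cell by its column total, so every topic sums to 1
by_topic <- prop.table(counts, margin = 2)
round(by_topic, 2)
```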
Category 1 is News Feed, Category 2 Left Troll, Category 4 Commercial, and Category 5 Hashtag Gamer. But look at Categories 3 and 6. For both, the highest percentage is Right Troll. Fearmonger is not most strongly associated with any specific topic. What happens if we instead ask for a proportion table by row, which tells us which topic each account category is most associated with?
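The two views can disagree when one account category contributes far more words than another. Here is a small sketch with invented counts (rows are account categories, columns are topics) in which the larger category dominates both columns, yet each row peaks on a different topic.

```r
# Invented counts: Right Troll contributes many more words than Fearmonger
counts <- matrix(
  c(30, 200,   # topic 3: Fearmonger, Right Troll
    10, 600),  # topic 6: Fearmonger, Right Troll
  nrow = 2,
  dimnames = list(account_category = c("Fearmonger", "Right Troll"),
                  topic = c("3", "6"))
)

# By column: within each topic, which account category supplies the most words?
by_topic <- prop.table(counts, margin = 2)

# By row: within each account category, which topic gets the most words?
by_category <- prop.table(counts, margin = 1)

round(by_topic, 2)
round(by_category, 2)
```

In this toy table Right Troll has the largest share of both topics by column, while by row Fearmonger puts most of its words in topic 3 and Right Troll in topic 6.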
Based on these results, Fearmonger now seems closest to Category 3 and Right Troll to Category 6. But Right Troll also shows up on Categories 3 (20%) and 1 (16%). Left Trolls show up in these categories at nearly the same proportions. It appears, then, that political trolls show strong similarity in topics with Fearmongers (stirring things up) and News Feed (“informing”) trolls. Unknown isn’t the top contributor to any topic, but it aligns with Categories 3 (showing elements of Fearmongering) and 5 (showing elements of Hashtag Gaming). Let’s focus in on 5 categories.
For now, let’s define our topics like this: 1 = News Feed, 2 = Left Troll, 3 = Fearmonger, 4 = Commercial, 5 = Hashtag Gamer, and 6 = Right Troll. We’ll ask R to go through our PFH dataset and tell us when the account category and topic match and when they mismatch. Then we can look at these terms.
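One way to flag matches, sketched here on a toy data frame, is to map each topic number to its label and compare that label with the account category. The column names, the category spellings, and the toy rows are assumptions for illustration; the real dataset may spell the categories differently.

```r
library(dplyr)

# Topic number -> label, following the definitions in the text
topic_labels <- c("1" = "News Feed", "2" = "Left Troll", "3" = "Fearmonger",
                  "4" = "Commercial", "5" = "Hashtag Gamer", "6" = "Right Troll")

# Toy rows (illustrative only): a word, the account category that tweeted it,
# and the word's top topic
tweetwords_toy <- data.frame(
  word = c("maga", "news", "police", "trump"),
  account_category = c("Right Troll", "Right Troll", "Left Troll", "Left Troll"),
  topic = c(6, 1, 2, 3)
)

# Look up each topic's label, then compare it with the account category
tweetwords_toy <- tweetwords_toy %>%
  mutate(
    topic_label = unname(topic_labels[as.character(topic)]),
    match = ifelse(account_category == topic_label, "Match", "Mismatch")
  )

tweetwords_toy
```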
Red indicates a match and blue indicates a mismatch. So when Fearmongers talk about food poisoning or Koch Farms, it’s a match, but when they talk about Hillary Clinton or the police, it’s a mismatch. Terms like “MAGA” and “CNN” are matches for Right Trolls but “news” and “love” are mismatches. Left Trolls show a match when tweeting about “Black Lives Matter” or “police” but a mismatch when tweeting about “Trump” or “America.” An interesting observation is that Trump is a mismatch for every topic it’s displayed under on the plot. (Now, realdonaldtrump, Trump’s Twitter handle, is a match for Right Trolls.) So where does that term, and associated terms like “Donald”, belong?
These terms apparently were sorted into Category 3, which we’ve called Fearmongers. Once again, this highlights the similarity between political trolls and fearmongering trolls in this dataset.
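A lookup like this can be sketched as a filter on the term-to-topic table. The toy table below mirrors only what the text reports (these terms falling under topic 3, and “MAGA” matching Right Troll, defined as topic 6); the table name and its layout are assumptions.

```r
library(dplyr)

# Toy term-to-topic table; topic values follow what the text reports
word_topic_toy <- data.frame(
  term  = c("trump", "donald", "maga"),
  topic = c(3, 3, 6)
)

# Which topic were the Trump-related terms assigned to?
trump_terms <- word_topic_toy %>%
  filter(term %in% c("trump", "donald"))

trump_terms
```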