[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]
files <- c("IRAhandle_tweets_1.csv", "IRAhandle_tweets_2.csv", "IRAhandle_tweets_3.csv", "IRAhandle_tweets_4.csv", "IRAhandle_tweets_5.csv", "IRAhandle_tweets_6.csv", "IRAhandle_tweets_7.csv", "IRAhandle_tweets_8.csv", "IRAhandle_tweets_9.csv") my_files <- paste0("~/Downloads/russian-troll-tweets-master/",files) each_file <- function(file) { tweet <- read_csv(file) } library(tidyverse) tweet_data <- NULL for (file in my_files) { temp <- each_file(file) temp$id <- sub(".csv", "", file) tweet_data <- rbind(tweet_data, temp) }
Note that this is a large dataset, with 2,973,371 observations of 16 variables. Let’s do some cleaning of this dataset first. The researchers, Darren Linvill and Patrick Warren, identified five major types of trolls:
- Right Troll: These Trump-supporting trolls voiced right-leaning, populist messages, but “rarely broadcast traditionally important Republican themes, such as taxes, abortion, and regulation, but often sent divisive messages about mainstream and moderate Republicans…They routinely denigrated the Democratic Party, e.g. @LeroyLovesUSA, January 20, 2017, “#ThanksObama We’re FINALLY evicting Obama. Now Donald Trump will bring back jobs for the lazy ass Obamacare recipients,” the authors wrote.
- Left Troll: These trolls mainly supported Bernie Sanders, derided mainstream Democrats, and focused heavily on racial identity, in addition to sexual and religious identity. The tweets were “clearly trying to divide the Democratic Party and lower voter turnout,” the authors told FiveThirtyEight.
- News Feed: A bit more mysterious, news feed trolls mostly posed as local news aggregators who linked to legitimate news sources. Some, however, “tweeted about global issues, often with a pro-Russia perspective.”
- Hashtag Gamer: Gamer trolls used hashtag games—a popular call/response form of tweeting—to drum up interaction from other users. Some tweets were benign, but many “were overtly political, e.g. @LoraGreeen, July 11, 2015, “#WasteAMillionIn3Words Donate to #Hillary.”
- Fearmonger: These trolls, who were least prevalent in the dataset, spread completely fake news stories, for instance “that salmonella-contaminated turkeys were produced by Koch Foods, a U.S. poultry producer, near the 2015 Thanksgiving holiday.”
But a quick table of the variable account_category shows 8 categories in the dataset.

table(tweet_data$account_category)
## 
##   Commercial   Fearmonger HashtagGamer    LeftTroll     NewsFeed 
##       122582        11140       241827       427811       599294 
##   NonEnglish   RightTroll      Unknown 
##       837725       719087        13905
The additional three are Commercial, Non-English, and Unknown. At the very least, we should drop the Non-English tweets, since those use Russian characters and any analysis I do will assume data are in English. I’m also going to keep only a few key variables. Then I’m going to clean up this dataset to remove links, because I don’t need those for my analysis – I certainly wouldn’t want to follow them to their destination. If I want to free up some memory, I can then remove the large dataset.
reduced <- tweet_data %>%
  select(author, content, publish_date, account_category) %>%
  filter(account_category != "NonEnglish")

library(qdapRegex)
## 
## Attaching package: 'qdapRegex'

# strip URLs from the tweet text
reduced$content <- rm_url(reduced$content)

# free up memory by dropping the full dataset
rm(tweet_data)
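To see what rm_url is doing, here's a quick demonstration on a made-up tweet (the string below is hypothetical, not from the dataset):

# hypothetical example: rm_url drops the link and keeps the rest of the text
rm_url("Breaking: read the full story here https://t.co/abc123 #news")
## should return something like: "Breaking: read the full story here #news"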
Now we have a dataset of 2,135,646 observations of 4 variables. I’m planning to do some of my own analysis of this dataset (and will of course share what I find), but for now, I thought I’d repeat a technique I’ve covered on this blog and demonstrate a new one.
library(tidytext)

# tokenize the tweets into individual words and drop common stop words
tweetwords <- reduced %>%
  unnest_tokens(word, content) %>%
  anti_join(stop_words)
## Joining, by = "word"

# count word use within each account category
wordcounts <- tweetwords %>%
  count(account_category, word, sort = TRUE) %>%
  ungroup()

head(wordcounts)
## # A tibble: 6 x 3
##   account_category word         n
##   <chr>            <chr>    <int>
## 1 NewsFeed         news    124586
## 2 RightTroll       trump    95794
## 3 RightTroll       rt       86970
## 4 NewsFeed         sports   47793
## 5 Commercial       workout  42395
## 6 NewsFeed         politics 38204
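The categories differ quite a bit in volume, so raw counts are easier to interpret next to each category's total word count. A quick optional check:

# total non-stop-word tokens per account category, for context
tweetwords %>%
  count(account_category)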
First, I’ll conduct a TF-IDF analysis of the dataset. TF-IDF (term frequency-inverse document frequency) highlights words that are used often within one account category but rarely across the others. This code is a repeat from a previous post.
tweet_tfidf <- wordcounts %>%
  bind_tf_idf(word, account_category, n) %>%
  arrange(desc(tf_idf))

# plot the 15 highest tf-idf words within each account category
tweet_tfidf %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(account_category) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~account_category, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
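For anyone curious what bind_tf_idf computes under the hood: term frequency is a word's share of all words in its category, and inverse document frequency is the natural log of the number of categories divided by the number of categories in which the word appears. A minimal sketch that reproduces the calculation by hand:

# manual tf-idf, for comparison with tidytext::bind_tf_idf
n_categories <- n_distinct(wordcounts$account_category)

manual_tfidf <- wordcounts %>%
  group_by(account_category) %>%
  mutate(tf = n / sum(n)) %>%               # share of each category's words
  group_by(word) %>%
  mutate(idf = log(n_categories / n())) %>% # n() = categories containing the word
  ungroup() %>%
  mutate(tf_idf = tf * idf)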