[This article was first published on data science ish and kindly contributed to R-bloggers.]
It’s been about a month since the U.S. presidential election, with Donald Trump’s victory over Hillary Clinton coming as a surprise to most. Reddit user Jason Baumgartner collected and published every submission and comment posted to Reddit on the day of (and a bit surrounding) the U.S. election; let’s explore this data set and see what kinds of things we can learn.
Data wrangling
This first bit was the hardest part of this analysis for me, probably because I am not the most experienced JSON person out there. At first, I took an approach of reading in the lines of each text file and parsing each JSON object separately. I complained about this on Twitter and got several excellent recommendations of much better approaches, including using stream_in from the jsonlite package. This works way better and faster than what I was doing before, and now it is easy!
Notice here that I am using files from November 8 and 9 in UTC time and I’m filtering out some of the earlier posts. This will end up leaving me with 30 hours of Reddit posts starting at noon on Election Day in the Central Time Zone. Also notice that I am not using the files that include Reddit comments, only the parent submissions. I tried most of the following analysis with both submissions and comments, but the comments dominated the results and included lots of repeated words/phrases that obscured what we would like to see. For the approach I am taking here, it worked better to just use submissions.
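As a rough sketch of that reading and filtering step: the three JSON lines below are made up, and the field names (created_utc, subreddit, title, selftext, url) are assumptions about the dump's layout rather than something confirmed here. Noon Central on November 8, 2016 is 18:00 UTC, because Central Standard Time is UTC-6 on that date.

```r
library(jsonlite)
library(dplyr)

# Three made-up submissions in newline-delimited JSON, one object per line.
# Field names are assumed, not taken from the real files.
ndjson <- c(
  '{"created_utc": 1478600000, "subreddit": "politics", "title": "Polls are open", "selftext": "", "url": "https://example.com/1"}',
  '{"created_utc": 1478630000, "subreddit": "news", "title": "Florida is close", "selftext": "Too close to call", "url": "https://example.com/2"}',
  '{"created_utc": 1478700000, "subreddit": "politics", "title": "The day after", "selftext": "", "url": "https://example.com/3"}'
)

# stream_in() parses one JSON object per line into a data frame;
# with the real data you would pass file("...") instead of a text connection
posts <- stream_in(textConnection(ndjson), verbose = FALSE)

# Noon Central Standard Time on Election Day, expressed in UTC
cutoff <- as.POSIXct("2016-11-08 18:00:00", tz = "UTC")

election_posts <- posts %>%
  mutate(posted = as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC")) %>%
  filter(posted >= cutoff)

election_posts
```

Only the two later submissions survive the filter; the first one was posted before the cutoff.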
Finding the words
The submissions include a title and sometimes also some text; sometimes Reddit posts are just the title. Let’s use unnest_tokens from the tidytext package to identify all the words in the title and text fields of the submissions and organize them into a tidy data structure.
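A minimal sketch of that tokenizing step, on two invented submissions. Pasting the title and text together with unite before tokenizing is one way to cover both fields, not necessarily how the original analysis did it, and the column names are assumptions.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two invented submissions; the second has a title but no text
posts <- tibble(
  url = c("https://example.com/1", "https://example.com/2"),
  subreddit = c("politics", "news"),
  title = c("Polls are open", "Florida is too close to call"),
  selftext = c("Go vote today", "")
)

tidy_words <- posts %>%
  # combine title and text into a single column
  unite(text, title, selftext, sep = " ") %>%
  # one row per word, lowercased, with url and subreddit carried along
  unnest_tokens(word, text)

tidy_words
```

Each row of the result is one word, still tagged with the url and subreddit of the submission it came from.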
That’s… almost 18 million rows. People on Reddit are busy.
Which words changed in frequency the fastest?
Right now we have a data frame that has each word on its own row, with an id (url), the time when it was posted, and the subreddit it came from. Let’s use dplyr operations to calculate how many times each word was mentioned in a particular unit of time, so we can model the change with time. We will calculate minute_total, the total words posted in that time unit so we can compare across times of day when people post different amounts, and word_total, the number of times that word was posted so we can filter out words that are not used much.
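Here is a small sketch of those dplyr operations on five invented word rows. The 30-minute bin, the use of floor_date from lubridate, and the cutoff of 2 for word_total are illustrative assumptions; a real analysis would use a much higher threshold.

```r
library(dplyr)
library(lubridate)

# Five invented word occurrences with their posting times
tidy_words <- tibble(
  posted = as.POSIXct(c("2016-11-08 18:05:00", "2016-11-08 18:10:00",
                        "2016-11-08 18:20:00", "2016-11-08 18:40:00",
                        "2016-11-08 18:50:00"), tz = "UTC"),
  word = c("trump", "vote", "trump", "trump", "florida")
)

word_counts <- tidy_words %>%
  # round each timestamp down to the start of its 30-minute bin
  mutate(time_bin = floor_date(posted, unit = "30 minutes")) %>%
  count(time_bin, word, name = "count") %>%
  # total words posted in each time bin
  group_by(time_bin) %>%
  mutate(minute_total = sum(count)) %>%
  # total uses of each word across all bins
  group_by(word) %>%
  mutate(word_total = sum(count)) %>%
  ungroup() %>%
  # drop rarely used words (toy threshold)
  filter(word_total >= 2)

word_counts
```

Only "trump" survives the filter here, with one row per time bin.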
This is the data frame we can use for modeling. We can use nest from tidyr to make a data frame with a list column that contains the little miniature data frames for each word and then map from purrr to apply our modeling procedure to each of those little data frames inside our big data frame. Jenny Bryan has put together some resources on using purrr with list columns this way. This is count data (how many words were posted?) so let’s use glm for modeling.
Now we can use tidy from broom to pull out the slopes for each of these models and find the important ones.
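The nest, map, and tidy steps can be sketched together on toy counts. The binomial formulation (uses of the word versus all other words in the bin) is one reasonable choice for this kind of count data and is an assumption here, as are the numeric hour column and the invented numbers.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Invented counts: one word rising and one falling over four time bins
word_counts <- tibble(
  word = rep(c("trump's", "polls"), each = 4),
  hour = rep(0:3, times = 2),
  count = c(5, 10, 20, 40, 40, 20, 10, 5),
  minute_total = 1000
)

slopes <- word_counts %>%
  # one row per word, with that word's little data frame in a list column
  nest(data = c(hour, count, minute_total)) %>%
  mutate(
    # fit one model per word: successes are uses of the word,
    # failures are all other words posted in that bin
    model = map(data, ~ glm(cbind(count, minute_total - count) ~ hour,
                            data = .x, family = "binomial")),
    # turn each fitted model into a data frame of coefficients
    tidied = map(model, tidy)
  ) %>%
  select(word, tidied) %>%
  unnest(tidied) %>%
  # keep the slope and drop the intercept
  filter(term == "hour") %>%
  arrange(estimate)

slopes
```

Sorting by estimate puts the fastest-falling words at the top and the fastest-rising words at the bottom.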
Which words decreased in frequency of use the fastest during Election Day? Which words increased in use the fastest?
Let’s plot these words.
There are lots of election-related words here, like “elect”, “liberals”, and “policies”. In fact, I think all of these words are conceivably related to the election with the exception of “flex”. I looked at some of the posts with “flex” in them and they were in fact not election-related. I had a hard time deciphering what they were about, but my best guess is either a) fantasy football or b) some kind of gaming. Why do we see “Trump’s” on this plot twice? It is because there is more than one way of encoding an apostrophe. You can see it on the legend if you look closely.
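To illustrate the apostrophe issue: the typewriter apostrophe (U+0027) and the right single quotation mark (U+2019) look nearly identical but are different characters, so the two spellings count as different words. The sketch below shows one way to merge them; it is not a step the analysis above performs.

```r
# The same word with two different apostrophe characters, plus one other word
words <- c("trump's", "trump\u2019s", "florida")

# Replace the curly apostrophe (U+2019) with the plain one (U+0027)
normalized <- gsub("\u2019", "'", words, fixed = TRUE)

length(unique(words))       # distinct words before normalizing
length(unique(normalized))  # distinct words after normalizing
```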
We don’t see Trump’s name by itself on this plot. How far off from being a top word, by my definition here, was it?
The word “Trump” must already have been in heavy use at the start of this period, so its change was not as big as for the word “Trump’s”.
What about the words that dropped in use the most during this day?
These are maybe even more interesting to me. Look at that spike for Florida the night of November 8 when it seemed like there might be flashbacks to 2000 or something. And people’s interest in discussing voters/voting, polls/polling, and fraud dropped off precipitously as Trump’s victory became obvious.
Which subreddits demonstrated the most change in sentiment?
We have looked at which words changed most quickly in use on Election Day; now let’s take a look at changes in sentiment. Are there subreddits that exhibited changes in sentiment over the course of this time period? To look at this, we’ll take a bigger time period (2 hours instead of 30 minutes) since the words with measured sentiment are only a subset of all words. Most of the remaining dplyr operations are similar to the ones we used for words. We can use inner_join to do the sentiment analysis, and then calculate the sentiment content of each board in each time period.
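A sketch of the inner_join step, using a tiny hand-made lexicon as a stand-in for a real scored sentiment lexicon. The lexicon, the scores, the numeric time_bin column, and the choice of the mean score as the "sentiment content" are all assumptions for illustration.

```r
library(dplyr)

# Hand-made stand-in for a scored sentiment lexicon (not a real one)
lexicon <- tibble(
  word = c("happy", "win", "die", "sad"),
  score = c(3, 4, -3, -2)
)

# Invented words, tagged by subreddit and 2-hour time bin
tidy_words <- tibble(
  subreddit = c("parrots", "parrots", "parrots", "1liga", "1liga"),
  time_bin = c(0, 0, 2, 0, 2),
  word = c("happy", "parrot", "sad", "die", "win")
)

sentiment_by_bin <- tidy_words %>%
  # keeps only words that appear in the lexicon, adding their scores
  inner_join(lexicon, by = "word") %>%
  group_by(subreddit, time_bin) %>%
  # average score of the matched words in each board and time bin
  summarise(score = mean(score), .groups = "drop")

sentiment_by_bin
```

Words that are not in the lexicon, like "parrot" here, simply drop out of the join, which is why a wider time bin helps keep enough scored words in each period.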
Let’s again use nest, but this time we’ll nest by subreddit instead of word. This sentiment score is not really count data (since it can be negative) so we’ll use regular old lm here.
Let’s again use unnest, map, and tidy to extract the slopes from the linear models.
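The same nest, map, and tidy pattern with lm can be sketched on invented sentiment scores. The two board names are hypothetical, and sorting by the absolute slope is just one way to surface the biggest changes in either direction.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Invented sentiment scores for two hypothetical boards over four 2-hour bins
sentiment_by_bin <- tibble(
  subreddit = rep(c("rising_board", "falling_board"), each = 4),
  time_bin = rep(c(0, 2, 4, 6), times = 2),
  score = c(0.1, 0.5, 1.2, 1.4, 1.0, 0.2, -0.4, -0.9)
)

sentiment_slopes <- sentiment_by_bin %>%
  # one row per subreddit, with its time series in a list column
  nest(data = c(time_bin, score)) %>%
  mutate(
    model = map(data, ~ lm(score ~ time_bin, data = .x)),
    tidied = map(model, tidy)
  ) %>%
  select(subreddit, tidied) %>%
  unnest(tidied) %>%
  # keep the slope and drop the intercept
  filter(term == "time_bin") %>%
  # biggest changes first, regardless of direction
  arrange(desc(abs(estimate)))

sentiment_slopes
```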
Which subreddits exhibited the biggest changes in sentiment, in either direction?
Let’s plot these!
These relationships are much noisier than the relationships with words were, and you might notice that some p-values are getting kind of high (no adjustment for multiple comparisons has been performed). Also, these subreddits are less related to the election than the quickly changing words were. Really only the shouldvebeenbernie subreddit is that political here.
Again, we see that not really any of these are specifically political, although I could imagine that the aznidentity subreddit (Asian identity board) and the ainbow subreddit (LGBT board) could have been feeling down after Trump’s election. The 1liga board is a German language board and ended up here because it used the word “die” a lot. In case you are wondering, the parrots subreddit is, in fact, about parrots; hopefully nothing too terrible was happening to the parrots on Election Day.
Which subreddits have the most dramatic word use?
Those last plots showed which subreddits were changing in sentiment the fastest around the time of the election, but perhaps we would like to know which subreddits used the largest proportion of high or low sentiment words overall during this time period. To do that, we don’t need to keep track of the timestamp of the posts. Instead, we just need to count by subreddit and word, then use inner_join to find a sentiment score.
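A sketch of that count-and-join approach, again with a hand-made stand-in lexicon and hypothetical board names. Dividing the summed scores by the total number of words a board posted is one possible definition of an overall score, assumed here for illustration.

```r
library(dplyr)

# Hand-made stand-in for a scored sentiment lexicon (not a real one)
lexicon <- tibble(
  word = c("love", "great", "hate", "awful"),
  score = c(3, 3, -3, -3)
)

# Invented words from two hypothetical boards
tidy_words <- tibble(
  subreddit = c(rep("board_a", 4), rep("board_b", 4)),
  word = c("love", "love", "great", "the", "hate", "awful", "the", "the")
)

subreddit_sentiment <- tidy_words %>%
  count(subreddit, word) %>%
  # total words per board, counted before unscored words are dropped
  group_by(subreddit) %>%
  mutate(total_words = sum(n)) %>%
  inner_join(lexicon, by = "word") %>%
  # summed sentiment score per word posted on the board
  summarise(score = sum(n * score) / first(total_words)) %>%
  arrange(desc(score))

subreddit_sentiment
```

Computing total_words before the join matters: it keeps the denominator as all words posted, so a board with a few strongly scored words among lots of neutral text does not look extreme.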
I would print some out for you, or plot them or something, but they are almost all extremely NSFW, both the positive and negative sentiment subreddits. I’m sure you can use your imagination.
The End
This is just one approach to take with this extremely extensive data set. There is still lots and lots more that could be done with it. I first saw this data set via Jeremy Singer-Vine’s Data Is Plural newsletter; this newsletter is an excellent resource and I highly recommend it. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!