Last summer, David Robinson did an interesting text analysis of Donald Trump’s tweets and found that the angrier ones came from Android (which Trump is known to use). But he didn’t consider how Trump’s emotional state varies over time, and he certainly couldn’t have considered what impact the election and recent events would have on Trump.
Using the twitteR package and the tidyverse ecosystem (plus tidytext), this is actually a very simple analysis.
For starters, pulling Trump’s recent tweets is very simple (the Twitter API exposes at most the last 3,200 tweets of a timeline; 3,100 are requested here):
library(twitteR)
library(tidyverse)
library(tidytext)

# Load API credentials (api_key, api_secret, ...) kept outside the script
source("~/twitter_key.R")
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"

# Pull the timeline, then convert the list of status objects to a tibble
trump <- userTimeline("realDonaldTrump", n = 3100,
                      includeRts = TRUE, excludeReplies = FALSE) %>%
  twListToDF() %>%
  as_tibble()
And then we have a tidy tibble with Trump’s tweets:
glimpse(trump)
## Observations: 3,099
## Variables: 16
## $ text          <chr> "Heading to Joint Base Andrews on #MarineOne wit...
## $ favorited     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount <dbl> 77699, 85576, 71312, 220083, 64348, 84125, 62284...
## $ replyToSN     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ created       <time> 2017-02-10 23:24:51, 2017-02-10 13:35:50, 2017-...
## $ truncated     <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ replyToSID    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ id            <chr> "830195857530183684", "830047626414477312", "830...
## $ replyToUID    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ statusSource  <chr> "<a href=\"http://twitter.com/download/iphone\" ...
## $ screenName    <chr> "realDonaldTrump", "realDonaldTrump", "realDonal...
## $ retweetCount  <dbl> 21473, 19779, 15069, 64363, 10082, 14185, 11294,...
## $ isRetweet     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
Using tidytext, it is straightforward to unnest and tokenize the words in the body of the tweets:
# One row per word, keeping the tweet-level columns needed later
words <- trump %>%
  select(id, statusSource, retweetCount, favoriteCount, created, isRetweet, text) %>%
  unnest_tokens(word, text)
words
## # A tibble: 57,239 x 7
##                    id
##                 <chr>
## 1  830195857530183684
## 2  830195857530183684
## 3  830195857530183684
## 4  830195857530183684
## 5  830195857530183684
## 6  830195857530183684
## 7  830195857530183684
## 8  830195857530183684
## 9  830195857530183684
## 10 830195857530183684
## # ... with 57,229 more rows, and 6 more variables: statusSource <chr>,
## #   retweetCount <dbl>, favoriteCount <dbl>, created <time>,
## #   isRetweet <lgl>, word <chr>
Given what David Robinson found, we might want to convert the statusSource variable into a flag for whether the tweet was posted via an Android device:
# TRUE when the posting client's name contains "Android"
words <- words %>%
  mutate(android = stringr::str_detect(statusSource, "Android")) %>%
  select(-statusSource)
words
## # A tibble: 57,239 x 7
##                    id retweetCount favoriteCount             created
##                 <chr>        <dbl>         <dbl>              <time>
## 1  830195857530183684        21473         77699 2017-02-10 23:24:51
## 2  830195857530183684        21473         77699 2017-02-10 23:24:51
## 3  830195857530183684        21473         77699 2017-02-10 23:24:51
## 4  830195857530183684        21473         77699 2017-02-10 23:24:51
## 5  830195857530183684        21473         77699 2017-02-10 23:24:51
## 6  830195857530183684        21473         77699 2017-02-10 23:24:51
## 7  830195857530183684        21473         77699 2017-02-10 23:24:51
## 8  830195857530183684        21473         77699 2017-02-10 23:24:51
## 9  830195857530183684        21473         77699 2017-02-10 23:24:51
## 10 830195857530183684        21473         77699 2017-02-10 23:24:51
## # ... with 57,229 more rows, and 3 more variables: isRetweet <lgl>,
## #   word <chr>, android <lgl>
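To make clear what that flag is keying on, here is a minimal sketch on two statusSource-style strings. The iPhone string follows the pattern visible in the glimpse() output; the Android string is an assumption modeled on it, not a value copied from the data.

```r
library(stringr)

# Hypothetical statusSource values. Only the iPhone URL pattern is visible in
# the post; the Android string is assumed to have the same shape.
status_source <- c(
  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
)

# str_detect() is case-sensitive by default, so the pattern matches the
# capitalized client name, not the lowercase "android" in the URL.
android <- str_detect(status_source, "Android")
android
```

If the client name were ever spelled differently, a case-insensitive pattern such as regex("android", ignore_case = TRUE) would be the more forgiving choice.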
Let’s now score the words in the tweets using the AFINN sentiment lexicon:
# Keep only words that appear in the AFINN lexicon, attaching their score
words <- words %>%
  inner_join(get_sentiments("afinn"))
## Joining, by = "word"
words
## # A tibble: 5,093 x 8
##                    id retweetCount favoriteCount             created
##                 <chr>        <dbl>         <dbl>              <time>
## 1  830047626414477312        19779         85576 2017-02-10 13:35:50
## 2  830047626414477312        19779         85576 2017-02-10 13:35:50
## 3  830042498806460417        15069         71312 2017-02-10 13:15:27
## 4  829721019720015872        10082         64348 2017-02-09 15:58:01
## 5  829721019720015872        10082         64348 2017-02-09 15:58:01
## 6  829689436279603206        14185         84125 2017-02-09 13:52:31
## 7  829689436279603206        14185         84125 2017-02-09 13:52:31
## 8  829689436279603206        14185         84125 2017-02-09 13:52:31
## 9  829689436279603206        14185         84125 2017-02-09 13:52:31
## 10 829689436279603206        14185         84125 2017-02-09 13:52:31
## # ... with 5,083 more rows, and 4 more variables: isRetweet <lgl>,
## #   word <chr>, android <lgl>, score <int>
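The row count falls from 57,239 to 5,093 because an inner join drops every word that has no entry in the lexicon. A minimal sketch of that behaviour, using a tiny hand-made lexicon (the words and scores below are illustrative assumptions, not real AFINN values):

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the AFINN lexicon
toy_lexicon <- tribble(
  ~word,   ~score,
  "great",  3L,
  "bad",   -3L
)

# Two made-up tokenized tweets, one row per word
toy_words <- tibble(
  id   = c("1", "1", "1", "2", "2"),
  word = c("a", "great", "day", "bad", "deal")
)

# Only rows whose word is in the lexicon survive the join
scored <- inner_join(toy_words, toy_lexicon, by = "word")
scored
```

A side effect worth keeping in mind: a tweet with no lexicon words at all disappears from the data entirely, so every later summary describes only tweets containing at least one scored word.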
And now let’s see how the typical sentiment of those tweets has varied from April 2016 (the midst of the Republican primary) to the present:
# Mean AFINN score per original (non-retweet) tweet, smoothed over time
words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  ggplot(aes(x = created, y = sentiment)) +
  geom_smooth() +
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) +  # inauguration
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) +  # election
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) +  # May 3rd 2016
  labs(x = "Date", y = "Mean Afinn Sentiment Score")
The vertical lines denote the date he became the presumptive Republican nominee (May 3rd 2016), the date of the election (Nov 8th 2016) and inauguration day (Jan 20th 2017). Things aren’t looking up for Trump. He seems to be more angry/sad/negative now than at any prior point during the past year.
What if we split the tweets by whether or not they came from Android?
# Same plot, with one smoothed line per platform flag
words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  ggplot(aes(x = created, y = sentiment, color = android)) +
  geom_smooth() +
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) +
  labs(x = "Date", y = "Mean Afinn Sentiment Score")
We see the general trend that David Robinson identified: the Android tweets tended to be more negative than those from other platforms. Interestingly, right before the election they were more positive than the tweets presumed to be written by staff. We can also see that the non-Android tweets were more positive during the transition than the Android tweets, which clearly became more negative. Perhaps the limits of Presidential powers are stricter than he expected. The Android tweets are now more negative than positive, the first time this has occurred.
Interestingly, the plots show no obvious effect of a tweet being positive or negative on the number of retweets
# Per-tweet sentiment joined back to the tweet-level engagement counts
words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>% distinct()) %>%
  ggplot(aes(x = sentiment, y = retweetCount, color = android)) +
  geom_smooth() +
  geom_point() +
  scale_y_log10() +
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Retweets")
## Joining, by = "id"
or the number of favorites
words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>% distinct()) %>%
  ggplot(aes(x = sentiment, y = favoriteCount, color = android)) +
  geom_smooth() +
  geom_point() +
  scale_y_log10() +
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Favorites")
## Joining, by = "id"
that a tweet gets.
Restricting to the Android tweets, however, regression analysis suggests that a negative tweet gets significantly more retweets, but also that this effect wears off (very slightly) with time:
# Android-only, non-retweet tweets: log retweets on time x (negative tweet)
words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>% distinct()) %>%
  lm(log(retweetCount) ~ created * (sentiment < 0), data = .) %>%
  summary()
## Joining, by = "id"
##
## Call:
## lm(formula = log(retweetCount) ~ created * (sentiment < 0), data = .)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.7744 -0.3806  0.0005  0.3576  3.2661
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -1.012e+02  3.942e+00 -25.679  < 2e-16 ***
## created                    7.488e-08  2.680e-09  27.939  < 2e-16 ***
## sentiment < 0TRUE          1.959e+01  6.086e+00   3.219  0.00132 **
## created:sentiment < 0TRUE -1.313e-08  4.135e-09  -3.175  0.00154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5923 on 1198 degrees of freedom
## Multiple R-squared:  0.5195, Adjusted R-squared:  0.5183
## F-statistic: 431.7 on 3 and 1198 DF,  p-value: < 2.2e-16
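The raw coefficients are hard to read because created is a POSIXct, which lm() treats as seconds since 1970-01-01. One rough way to interpret them is to evaluate the implied effect of a negative tweet at specific dates. The sketch below copies the rounded coefficients from the output above; the helper names and the UTC time zone are assumptions made for illustration.

```r
# Coefficients copied (rounded) from the retweet regression output
b_neg      <- 1.959e+01    # sentiment < 0TRUE
b_neg_time <- -1.313e-08   # created:sentiment < 0TRUE (per second)

# Seconds since the epoch for a date; UTC is an assumption, the post does not
# say which time zone was in effect when the model was fitted.
secs <- function(date) as.numeric(as.POSIXct(date, tz = "UTC"))

# Implied multiplicative effect of a negative tweet on retweets at a date:
# exp(main effect + interaction * time), since the outcome is log(retweetCount)
neg_multiplier <- function(date) exp(b_neg + b_neg_time * secs(date))

neg_multiplier("2016-05-03")
neg_multiplier("2017-01-20")
```

Because the two terms nearly cancel and the coefficients are rounded to four significant figures, the resulting multipliers are only approximate; the direction (a premium for negative tweets that shrinks over time) is more trustworthy than the exact values.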
A similar pattern exists for the number of favorites:
words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>% distinct()) %>%
  lm(log(favoriteCount) ~ created * (sentiment < 0), data = .) %>%
  summary()
## Joining, by = "id"
##
## Call:
## lm(formula = log(favoriteCount) ~ created * (sentiment < 0),
##     data = .)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.75782 -0.35691 -0.00795  0.33800  2.48914
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -1.176e+02  3.452e+00 -34.068  < 2e-16 ***
## created                    8.689e-08  2.347e-09  37.020  < 2e-16 ***
## sentiment < 0TRUE          1.435e+01  5.329e+00   2.692  0.00721 **
## created:sentiment < 0TRUE -9.648e-09  3.621e-09  -2.664  0.00781 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5187 on 1198 degrees of freedom
## Multiple R-squared:  0.6525, Adjusted R-squared:  0.6517
## F-statistic: 749.9 on 3 and 1198 DF,  p-value: < 2.2e-16
The most common positive and negative words in the Android tweets differed across three phases: before the election, during the transition, and after Trump was sworn in:
# Top 3 positive and negative words per phase (Android tweets only)
words %>%
  filter(android) %>%
  mutate(phase = ifelse(as.POSIXct("2016-11-08") > created, "Before the election",
                 ifelse(as.POSIXct("2017-01-20") > created, "Transition",
                        "In the White House"))) %>%
  group_by(phase, pos_sentiment = score >= 0, word) %>%
  count() %>%
  group_by(phase, pos_sentiment) %>%
  filter(word != "no") %>%
  top_n(3, wt = n) %>%
  arrange(pos_sentiment, phase, desc(n))
## Source: local data frame [18 x 4]
## Groups: phase, pos_sentiment [6]
##
##                  phase pos_sentiment      word     n
##                  <chr>         <lgl>     <chr> <int>
## 1  Before the election         FALSE       bad    62
## 2  Before the election         FALSE dishonest    27
## 3  Before the election         FALSE    rigged    25
## 4   In the White House         FALSE       bad    10
## 5   In the White House         FALSE      fake    10
## 6   In the White House         FALSE       ban     6
## 7           Transition         FALSE       bad    13
## 8           Transition         FALSE     wrong    11
## 9           Transition         FALSE dishonest    10
## 10 Before the election          TRUE     great   175
## 11 Before the election          TRUE     thank    69
## 12 Before the election          TRUE       big    54
## 13  In the White House          TRUE     great     8
## 14  In the White House          TRUE       big     5
## 15  In the White House          TRUE       win     5
## 16          Transition          TRUE     great    56
## 17          Transition          TRUE       big    16
## 18          Transition          TRUE       win    14
We have the fake news to thank for the debut of "fake" after he was sworn in. At least the election was no longer "rigged" after he won it.
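As an aside, the nested ifelse() that assigns each tweet to a phase can also be written with dplyr's case_when(), which some find easier to read. A minimal sketch on three made-up timestamps (the toy data and the UTC time zone are assumptions; the cut-off dates are the ones used in the post):

```r
library(dplyr)
library(tibble)

# Made-up timestamps, one falling in each phase
toy <- tibble(
  created = as.POSIXct(c("2016-10-01", "2016-12-01", "2017-02-01"), tz = "UTC")
)

election     <- as.POSIXct("2016-11-08", tz = "UTC")
inauguration <- as.POSIXct("2017-01-20", tz = "UTC")

# case_when() checks the conditions in order, mirroring the nested ifelse()
toy <- toy %>%
  mutate(phase = case_when(
    created < election     ~ "Before the election",
    created < inauguration ~ "Transition",
    TRUE                   ~ "In the White House"
  ))
toy
```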