How do you feel about Last Week Tonight?
Welcome, welcome, welcome!
One thing my husband and I enjoy a lot is watching Last Week Tonight with John Oliver every week. It is an HBO political talk show that airs on Sunday nights, and we usually watch it over dinner sometime during the week. We love the show because it covers a huge range of topics and news from all over the world, plus we laugh a lot (bittersweet laughs, mostly).
I think John has a fantastic sense of humor and is a spectacular communicator, but only if you share the way he sees the world. And because he is so enthusiastic about his views, I believe he is a character you either love or hate. I suspect he (as well as the controversial topics he covers) arouses strong feelings in people, and I want to check that by analyzing the comments people leave on his Youtube videos, and on his Facebook ones as well.
I’ve been wanting to try Julia Silge and David Robinson’s tidytext package for a while now, and after I read Erin’s text analysis of the Lizzie Bennet Diaries’ Youtube captions I thought about giving Youtube a try.
Fetching Youtube videos and comments
Every episode has one main story and many short stories that are mostly available to watch online via Youtube.
I’m using the Youtube Data API and the tuber package to get the info from Youtube (I found a bug in the get_comment_threads function in the CRAN version, so I recommend you use the GitHub one instead, where it is fixed). The first time, you need to obtain authorization credentials so your application can submit API requests (you can follow this guide to do so). Then you just call the tuber::yt_oauth function, which launches a browser so you can authorize the application, and you can start retrieving information.
First I search for the Youtube channel, select the correct one, and then retrieve the playlist_id that I’m going to use to fetch all the videos.
library(tuber)
app_id <- "####"
app_password <- "####"
yt_oauth(app_id, app_password)
search_channel <- yt_search("lastweektonight")
channel <- "UC3XTzVzaHQEd30rQbuvCtTQ"
channel_resources <- list_channel_resources(filter = c(channel_id = channel),
                                            part = "contentDetails")
playlist_id <- channel_resources$items[[1]]$contentDetails$relatedPlaylists$uploads
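That hard-coded channel id comes from inspecting the yt_search() results (that is the “select the correct one” step above). A minimal way to do that inspection, using only base R; the exact column names in the result depend on your tuber version, so take this as a sketch:
# Quick look at what yt_search() returned, to spot the id of the official
# LastWeekTonight channel (look for the channel id / channel title columns)
str(search_channel, max.level = 1)
head(search_channel)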
Fetching the videos
To get all the videos I use the get_playlist_items function, but it only retrieves the first 50 elements. I know soodoku is planning on implementing a “get_all”-style argument, but in the meantime I have to handle the pagination myself to get all the videos (I took more than a few ideas from Erin’s script!).
I should warn you ⚠️: the tuber package is all about lists, not tidy dataframes, so I dedicate a lot of effort to tidying this data.
library(dplyr)
library(tuber)
library(purrr)
library(magrittr)
library(tibble)
get_videos <- function(playlist) {
  # pass NA as next page to get the first page
  nextPageToken <- NA
  videos <- tibble()
  # loop over every available page
  repeat {
    vid <- get_playlist_items(filter = c(playlist_id = playlist),
                              page_token = nextPageToken)
    # video ids and publication dates for this page
    vid_id <- map(vid$items, "contentDetails") %>%
      map_df(magrittr::extract, c("videoId", "videoPublishedAt"))
    # title and description for every video on this page
    titles <- lapply(vid_id$videoId, get_video_details) %>%
      map("localized") %>%
      map_df(magrittr::extract, c("title", "description"))
    videos <- videos %>% bind_rows(tibble(id = vid_id$videoId,
                                          created = vid_id$videoPublishedAt,
                                          title = titles$title,
                                          description = titles$description))
    # get the token for the next page
    nextPageToken <- ifelse(!is.null(vid$nextPageToken), vid$nextPageToken, NA)
    # if there are no more pages, we are done
    if (is.na(nextPageToken)) {
      break
    }
  }
  return(videos)
}
videos <- get_videos(playlist_id)
Then I extract the first part of the title and description (the rest is mostly advertisement), and format the video’s creation date:
library(stringr)

videos <- videos %>%
  mutate(short_title = str_match(title, "^([^:]+).+")[, 2],
         short_desc = str_match(description, "^([^\n]+).+")[, 2],
         vid_created = as.Date(created)) %>%
  select(-created)
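To see what that first regular expression is doing, here is a quick check on a made-up title that follows the show’s usual “Topic: Last Week Tonight with John Oliver (HBO)” pattern (the title below is just an illustration):
# Illustration only: the capture group keeps everything before the first colon
str_match("Brexit: Last Week Tonight with John Oliver (HBO)", "^([^:]+).+")[, 2]
# [1] "Brexit"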
Let’s take a look at the videos.
library(DT)
datatable(videos[, c(4:6)], rownames = FALSE,
          options = list(pageLength = 5)) %>%
  formatStyle(c(1:3), `font-size` = '15px')
Fetching the comments
Now I get the comments for every video. I write my own functions for the same reason as before. The function get_video_comments retrieves the comments for a given video_id, with the n parameter setting the maximum number of comments we want.
get_video_comments <- function(video_id, n = 5) {
  nextPageToken <- NULL
  comments <- tibble()
  repeat {
    com <- get_comment_threads(c(video_id = video_id),
                               part = "id, snippet",
                               page_token = nextPageToken,
                               text_format = "plainText")
    # extract the fields of interest from every top-level comment on this page
    for (i in seq_along(com$items)) {
      com_id <- com$items[[i]]$snippet$topLevelComment$id
      com_text <- com$items[[i]]$snippet$topLevelComment$snippet$textDisplay
      com_video <- com$items[[i]]$snippet$topLevelComment$snippet$videoId
      com_created <- com$items[[i]]$snippet$topLevelComment$snippet$publishedAt
      comments <- comments %>% bind_rows(tibble(video_id = com_video,
                                                com_id = com_id,
                                                com_text = com_text,
                                                com_created = com_created))
      if (nrow(comments) == n) {
        break
      }
    }
    # get the token for the next page (NA when there are no more pages)
    nextPageToken <- ifelse(!is.null(com$nextPageToken), com$nextPageToken, NA)
    # stop when there are no more pages or we already have n comments
    if (is.na(nextPageToken) | nrow(comments) == n) {
      break
    }
  }
  return(comments)
}
The function get_videos_comments receives a vector of video_ids and returns n comments for every video, using the previous get_video_comments function. Then I remove empty comments, join with the video information, and drop videos with fewer than 100 comments.
get_videos_comments <- function(videos, n = 10) {
  pmap_df(list(videos, n), get_video_comments)
}
raw_yt_comments <- get_videos_comments(videos$id, n = 300)
yt_comments <- raw_yt_comments %>%
  filter(com_text != "") %>%
  left_join(videos, by = c("video_id" = "id")) %>%
  group_by(short_title) %>%
  mutate(n = n(),
         com_created = as.Date(com_created)) %>%
  ungroup() %>%
  filter(n >= 100)
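As a quick optional check, we can see how many comments per video survived the filter, using the n column computed above:
# Comments retrieved per video, most-commented first
yt_comments %>%
  distinct(short_title, n) %>%
  arrange(desc(n))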
And looking at the first rows, we can already see some of that passion I was talking about!
datatable(head(yt_comments[, c(7, 9, 3)], 30), rownames = FALSE,
          options = list(pageLength = 5)) %>%
  formatStyle(c(1:3), `font-size` = '15px')
Most used words and sentiment
In the tidy text world, a tidy dataset is a table with one token per row. I start by tidying the yt_comments dataframe and removing the stop words (the stop_words dictionary is already included in the tidytext package).
library(tidytext)

tidy_yt_comments <- yt_comments %>%
  tidytext::unnest_tokens(word, com_text) %>%
  anti_join(stop_words, by = "word")
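To make the one-token-per-row idea concrete, here is a tiny toy example (the comment text is invented):
# unnest_tokens() puts one lowercase word per row and drops the punctuation
tibble(com_id = "abc123", com_text = "John Oliver is brilliant!") %>%
  unnest_tokens(word, com_text)
# -> four rows: "john", "oliver", "is", "brilliant"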
Positive and Negative words in comments
I’m using the bing lexicon to evaluate the emotion of each word, categorizing it as positive or negative. I join the words in the tidy_yt_comments dataset with the sentiment in the bing lexicon, and then count how many times each word appears.
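In case you have never looked at it, the bing lexicon is just a two-column table of words and their polarity:
# One row per word, labelled either "positive" or "negative"
get_sentiments("bing") %>% head()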
So let’s find out the most used words in the comments!
library(ggplot2)

tidy_yt_comments %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  # the data already has an n column, so count() names the new count nn
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, nn)) %>%
  ggplot(aes(word, nn, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("red2", "green3")) +
  facet_wrap(~sentiment, scales = "free_y") +
  ylim(0, 2500) +
  labs(y = NULL, x = NULL) +
  coord_flip() +
  theme_minimal()
There are a lot of strong words here! And I’m pretty sure the trump we see among the positive words is not quite the same Trump John has been talking about non-stop for the last two years… and not precisely in a positive way… I could include this word in a custom_stop_words dataframe, but I’m going to leave it like that for now.
Also… I’m not sure why funny is in the negative category. I know it can be used to mean weird or something like that, but I think my confusion here is because I’m not a native English speaker.
Are there more positive or negative words?
tidy_yt_comments %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, nn)) %>%
  ggplot(aes(sentiment, nn)) +
  geom_col(aes(fill = sentiment), show.legend = FALSE) +
  scale_fill_manual(values = c("green3", "red2")) +
  ylab(NULL) +
  xlab(NULL) +
  coord_flip() +
  theme_minimal()
Definitely more negative than positive words. OK.
More sentiments in comments
There is a different lexicon, the nrc one, which classifies words into more categories: two sentiments (positive and negative) and eight basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust). Let’s check what we find!
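A quick way to see these categories, and how many words fall into each (a word can belong to several), is to count them directly in the lexicon:
# The nrc lexicon has one row per word-category pair
get_sentiments("nrc") %>% count(sentiment)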
tidy_yt_comments %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, nn)) %>%
  ggplot(aes(word, nn, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  xlab(NULL) +
  ylab(NULL) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  theme_minimal()
OK… a few comments here.
- john is considered a negative word associated with disgust… So I checked and found out it can mean either a toilet or a prostitute’s client, so now I get it. Either way, I’m going to include it in the custom_stop_words dataframe, because it is so frequent that it makes every other word look disproportionately small.
- trump again is considered a positive word, associated with surprise (no doubt about the surprise element for both the word and the character, though).
custom_stop_words <- bind_rows(data_frame(word = c("john"),
                                          lexicon = c("custom")),
                               stop_words)
yt_comments %>%
  tidytext::unnest_tokens(word, com_text) %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, nn)) %>%
  ggplot(aes(word, nn, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_y_continuous(breaks = c(0, 1000, 2000)) +
  xlab(NULL) +
  ylab(NULL) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  theme_minimal()
There are some very controversial classifications in this nrc lexicon, especially the terms black, classified as negative (and sadness), and white, classified as positive (and joy, anticipation and trust). I don’t like this at all…
I also have some comments:
- government is negative while president is positive. It just caught my attention.
- money: what is wrong with this word?! Apparently it is a very confusing one, because it is linked with the positive sentiment and the anticipation, joy, surprise and trust emotions, but also with anger.
Anyway, these are side comments because they are about the lexicon (or human nature!) and not this analysis, and you can verify these classifications directly in the lexicon, as shown below. Bottom line: I don’t like this lexicon.
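Here is that quick check, filtering the nrc lexicon for the words discussed above:
# How the nrc lexicon classifies some of the words discussed above
get_sentiments("nrc") %>%
  filter(word %in% c("black", "white", "government", "president", "money"))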
What is the most present sentiment/emotion?
yt_comments %>%
  tidytext::unnest_tokens(word, com_text) %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(pos_neg = ifelse(sentiment %in% c("positive", "anticipation", "joy", "trust", "surprise"),
                          "Positive", "Negative")) %>%
  ggplot(aes(reorder(sentiment, nn), nn)) +
  geom_col(aes(fill = pos_neg), show.legend = FALSE) +
  scale_fill_manual(values = c("red2", "green3")) +
  xlab(NULL) +
  ylab(NULL) +
  coord_flip()
According to this lexicon, there are more positive than negative words! The opposite of what we found using the bing lexicon. The nice thing about this one is that it allows us to analyze other emotions as well. But of course, I’m not going to use it anymore.
Most used n-grams
Another interesting thing to do is to find the most common n-grams (sequences of n words that tend to co-occur).
yt_comments %>%
  tidytext::unnest_tokens(five_gram, com_text, token = "ngrams", n = 5) %>%
  count(five_gram, sort = TRUE) %>%
  top_n(10) %>%
  mutate(five_gram = reorder(five_gram, nn)) %>%
  ggplot(aes(five_gram, nn)) +
  geom_col(fill = "red", show.legend = FALSE) +
  xlab(NULL) +
  ylab(NULL) +
  coord_flip() +
  theme_minimal()
“how is this still a” and “is this still a thing” of course ring a bell for those of us who watch the show, since it has a segment called “How is this still a thing?” questioning traditions or things that for some reason seemed adequate at some point in time but are now totally absurd. Like voting in the US presidential elections on a Tuesday, or the swimsuit issue of Sports Illustrated magazine.
The “am i the only one”, “i the only one who” and “is it just me or” 5-grams show us how much people love rhetorical questions! Like, a lot! I’m going to take a peek at these comments!
am_i_the_only_one <- yt_comments %>%
  tidytext::unnest_tokens(five_gram, com_text, token = "ngrams", n = 5) %>%
  filter(five_gram == "am i the only one") %>%
  select(com_id)
datatable(head(yt_comments[yt_comments$com_id %in% am_i_the_only_one$com_id, c(7, 3)], 30),
          rownames = FALSE,
          options = list(pageLength = 5)) %>%
  formatStyle(c(1:2), `font-size` = '15px')
And a very strange 5-gram: “great great great great great”… I have to check what this is about!
great_great_great_great_great <- yt_comments %>%
  tidytext::unnest_tokens(five_gram, com_text, token = "ngrams", n = 5) %>%
  filter(five_gram == "great great great great great") %>%
  select(com_id)
datatable(head(yt_comments[yt_comments$com_id %in% great_great_great_great_great$com_id, c(7, 3)], 1),
          rownames = FALSE,
          options = list(pageLength = 5)) %>%
  formatStyle(c(1:2), `font-size` = '15px')
Just as I suspected, this is one very long concatenation of the word “great”. This guy is a very, very enthusiastic atheist referring to a very distant ancestor, so it doesn’t count for this analysis.
Moving on…
Sentiment Analysis on comments
Similar to what I did for every word, now I join the words in the tidy_yt_comments dataset with the sentiment in the bing lexicon, and then count how many positive and negative words there are in every comment. Then I compute the sentiment as positive - negative, and finally join this to the yt_comments dataset.
library(tidyr)

yt_comment_sent <- tidy_yt_comments %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(com_id, sentiment) %>%
  spread(sentiment, nn, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ungroup() %>%
  left_join(yt_comments, by = "com_id") %>%
  arrange(sentiment)
The longer the comment, the higher the potential for an extreme sentiment score. Let’s take a look at the extremes. The most negative comments according to the bing lexicon are:
datatable(head(yt_comment_sent[, c(10, 12, 6)], 30),
          rownames = FALSE,
          options = list(pageLength = 1)) %>%
  formatStyle(c(1:3), `font-size` = '15px')