By Gabriel Vasconcelos
So I decided to have a quick look at the tuber package to extract YouTube data in R. My cousin is a singer (a hell of a good one) and he has a YouTube channel (Dan Vasc), which I strongly recommend, where he posts his covers. I will focus my analysis on his channel. The tuber package is very friendly: it downloads YouTube statistics on comments, views, likes and more straight into R using the YouTube API.
First, let's look at some general information about the channel (the code for replication is at the end of the text). The table below shows the number of subscribers, views, videos, etc. at the moment I downloaded the data (2017-12-12, 11:20 pm). If you run the code on your computer the results may be different, because the channel will have had more activity. Dan's channel is getting close to 1 million views, and it has 58 times more likes than dislikes. That works out to roughly 13,000 views per video.
Channel | Subscriptions | Views | Videos | Likes | Dislikes | Comments |
---|---|---|---|---|---|---|
Dan Vasc | 5127 | 743322 | 57 | 9008 | 155 | 1993 |
We can also see some of the same statistics for each video. I selected only videos published after January 2016, which is when the channel became more active. The list has 29 videos. You can see that the channel became even more active in 2017; in the last month it moved to weekly publications.
date | title | viewCount | likeCount | dislikeCount | commentCount |
---|---|---|---|---|---|
2016-03-09 | “Heart Of Steel” – MANOWAR cover | 95288 | 1968 | 53 | 371 |
2016-05-09 | “The Sound Of Silence” – SIMON & GARFUNKEL / DISTURBED cover | 13959 | 556 | 6 | 85 |
2016-07-04 | One Man Choir – Handel’s Hallelujah | 9390 | 375 | 6 | 70 |
2016-08-16 | “Carry On” – MANOWAR cover | 19146 | 598 | 12 | 98 |
2016-09-12 | “You Are Loved (Don’t Give Up)” – JOSH GROBAN cover | 2524 | 142 | 0 | 21 |
2016-09-26 | “Hearts On Fire” – HAMMERFALL cover | 6584 | 310 | 4 | 58 |
2016-10-26 | “Dawn Of Victory” – RHAPSODY OF FIRE cover | 10335 | 354 | 5 | 69 |
2017-04-28 | “I Don’t Wanna Miss A Thing” – AEROSMITH cover | 9560 | 396 | 5 | 89 |
2017-05-09 | State of affairs | 906 | 99 | 1 | 40 |
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Japanese) | 2862 | 160 | 4 | 39 |
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Português) | 3026 | 235 | 3 | 62 |
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (English) | 2682 | 108 | 2 | 14 |
2017-06-14 | HOW TO BE A YOUTUBE SINGER | ASKDANVASC 01 | 559 | 44 | 1 | 19 |
2017-06-17 | Promotional Live || Q&A and video games | 206 | 16 | 0 | 2 |
2017-07-18 | “The Bard’s Song” – BLIND GUARDIAN cover (SPYGLASS INN project) | 3368 | 303 | 2 | 47 |
2017-07-23 | “Numb” – LINKIN PARK cover (R.I.P. CHESTER) | 6717 | 350 | 14 | 51 |
2017-07-27 | THE PERFECT TAKE and HOW TO MIX VOCALS | ASKDANVASC 02 | 305 | 29 | 0 | 11 |
2017-08-04 | 4000 Subscribers and Second Channel | 515 | 69 | 1 | 23 |
2017-08-10 | “Hello” – ADELE cover [ROCK VERSION] | 6518 | 365 | 2 | 120 |
2017-08-27 | “The Rains Of Castamere” (The Lannister Song) – GAME OF THRONES cover | 2174 | 133 | 5 | 28 |
2017-08-31 | “Africa” – TOTO cover | 18251 | 642 | 10 | 172 |
2017-09-24 | “Chop Suey!” – SYSTEM OF A DOWN cover | 2562 | 236 | 6 | 56 |
2017-10-09 | “An American Trilogy” – ELVIS PRESLEY cover | 1348 | 168 | 1 | 48 |
2017-11-08 | “Beauty And The Beast” – Main Theme cover | Feat. Alina Lesnik | 2270 | 192 | 2 | 59 |
2017-11-16 | “Bohemian Rhapsody” – QUEEN cover | 2589 | 339 | 3 | 95 |
2017-11-23 | “The Phantom Of The Opera” – NIGHTWISH/ANDREW LLOYD WEBBER cover | Feat. Dragica | 1857 | 209 | 2 | 42 |
2017-11-24 | “Back In Black” – AC/DC cover (RIP MALCOLM YOUNG) | 2202 | 207 | 2 | 56 |
2017-11-30 | “Immigrant Song” – LED ZEPPELIN cover | 3002 | 204 | 1 | 62 |
2017-12-07 | “Sweet Child O’ Mine” – GUNS N’ ROSES cover | 1317 | 201 | 2 | 86 |
Now that we have seen the data, let's explore it to check for structure and information. The plots below show how likes, dislikes and comments relate to views. The positive relationship is obvious. However, there is some degree of nonlinearity for likes and comments: the increment in likes and comments becomes smaller as views increase. The dislikes look more linear in the views, but the number of dislikes is too small to be sure.
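One way to put a number on that diminishing-returns pattern is a log-log regression, where the slope (the elasticity) tells you how fast likes grow relative to views. This is not part of the original analysis, just a sketch; the `views` and `likes` vectors below are made-up stand-ins for `videostats$viewCount` and `videostats$likeCount`.

```r
# Hypothetical data in the same ballpark as the video table above
views <- c(500, 2000, 6000, 10000, 20000, 95000)
likes <- c(40, 150, 320, 380, 600, 1970)

# Elasticity of likes with respect to views: slope of log(likes) on log(views)
fit <- lm(log(likes) ~ log(views))
elasticity <- unname(coef(fit)[2])

# An elasticity below 1 means likes grow more slowly than views,
# which is the nonlinearity visible in the scatter plots
round(elasticity, 2)
```

An elasticity between 0 and 1 is exactly the concave shape seen in the likes-versus-views plot.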
Another interesting piece of information is how comments are distributed over time within each video. I selected the four most recent videos and plotted their comment time series below. All videos have a lot of activity in the first days, but it decreases fast a few days later. Followers and subscribers probably see the videos first, and they must be responsible for the intense activity at the beginning of each plot.
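The per-video series above boil down to counting comments per day from their publish dates. Here is a minimal base R sketch of that step; the dates are invented, whereas in the post they come from the `publishedAt` field returned by `get_comment_threads()`.

```r
# Invented comment publish dates for one video
comment_dates <- as.Date(c("2017-12-07", "2017-12-07", "2017-12-07",
                           "2017-12-08", "2017-12-08", "2017-12-10"))

# Count comments per day: most activity lands on the first day
daily <- as.data.frame(table(comment_dates))
names(daily) <- c("date", "n")
daily
```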
The most important information might be how the channel grows over time. Dan's channel had two important moments in 2017: it became much more active in April, and it started having weekly publications in November. We can clearly see in the plot below that both strategies worked. I put in two dashed lines to mark these events. In April the number of comments increased a lot, and it increased even more in November.
Finally, let's look at what is in the comments using a word cloud (wordcloud package). I removed uninformative words such as "you", "was", "is" and "were", for both English and Portuguese. The result is just below.
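The word-count step behind the cloud is simple: lowercase the comments, split them into words, drop the stopword list, and tabulate. A minimal base R sketch (the comments and stopwords here are invented; the post uses the text pulled with `get_comment_threads()` and a much longer English/Portuguese stopword list):

```r
# Invented comment texts and a tiny stopword list
comments  <- c("Amazing voice man", "amazing cover", "the best cover ever")
stopwords <- c("the", "man", "ever")

# Lowercase, split on whitespace, drop stopwords, count frequencies
words <- unlist(strsplit(tolower(comments), "\\s+"))
words <- words[!words %in% stopwords]
freq  <- sort(table(words), decreasing = TRUE)
freq
```

The resulting frequency table is what `wordcloud()` consumes as its `words` and `freq` arguments.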
Codes
Before using the tuber package you need an ID and a password from the Google Developer Console. Click here for more information. If you are interested, the tubern package has some other tools for working with YouTube data, such as generating reports.
```r
library(tuber)
library(tidyverse)
library(lubridate)
library(stringi)
library(wordcloud)
library(gridExtra)

httr::set_config(config(ssl_verifypeer = 0L)) # = Fixes some certificate problems on Linux = #

# = Authentication = #
yt_oauth("ID", "PASS", token = "")

# = Download and prepare data = #

# = Channel stats = #
chstat = get_channel_stats("UCbZRdTukTCjFan4onn04sDA")

# = Videos = #
videos = yt_search(term = "", type = "video", channel_id = "UCbZRdTukTCjFan4onn04sDA")
videos = videos %>%
  mutate(date = as.Date(publishedAt)) %>%
  filter(date > "2016-01-01") %>%
  arrange(date)

# = Comments = #
comments = lapply(as.character(videos$video_id), function(x){
  get_comment_threads(c(video_id = x), max_results = 1000)
})

# = Prep the data = #

# = Video stats table = #
videostats = lapply(as.character(videos$video_id), function(x){
  get_stats(video_id = x)
})
videostats = do.call(rbind.data.frame, videostats)
videostats$title = videos$title
videostats$date = videos$date
videostats = select(videostats, date, title, viewCount, likeCount,
                    dislikeCount, commentCount) %>%
  as_tibble() %>%
  mutate(viewCount = as.numeric(as.character(viewCount)),
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))

# = General stats table = #
genstat = data.frame(Channel = "Dan Vasc",
                     Subscriptions = chstat$statistics$subscriberCount,
                     Views = chstat$statistics$viewCount,
                     Videos = chstat$statistics$videoCount,
                     Likes = sum(videostats$likeCount),
                     Dislikes = sum(videostats$dislikeCount),
                     Comments = sum(videostats$commentCount))

# = videostats plots = #
p1 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)

# = Comments time series = #
comments_ts = lapply(comments, function(x){
  as.Date(x$publishedAt)
})
comments_ts = tibble(date = as.Date(Reduce(c, comments_ts))) %>%
  group_by(date) %>%
  count()
ggplot(data = comments_ts) +
  geom_line(aes(x = date, y = n)) +
  geom_smooth(aes(x = date, y = n), se = FALSE) +
  ggtitle("Comments by day") +
  geom_vline(xintercept = as.numeric(as.Date("2017-11-08")), linetype = 2, color = "red") +
  geom_vline(xintercept = as.numeric(as.Date("2017-04-28")), linetype = 2, color = "red")

# = Comments by video = #
selected = (nrow(videostats) - 3):nrow(videostats)
top4 = videostats$title[selected]
top4comments = comments[selected]
p = list()
for(i in 1:4){
  df = top4comments[[i]]
  df$date = as.Date(df$publishedAt)
  df = df %>%
    arrange(date) %>%
    group_by(year(date), month(date), day(date)) %>%
    count()
  df$date = make_date(df$`year(date)`, df$`month(date)`, df$`day(date)`)
  p[[i]] = ggplot(data = df) + geom_line(aes(x = date, y = n)) + ggtitle(top4[i])
}
do.call(grid.arrange, p)

## = Word cloud = ##
comments_text = lapply(comments, function(x){
  as.character(x$textOriginal)
})
comments_text = tibble(text = Reduce(c, comments_text)) %>%
  mutate(text = stri_trans_general(tolower(text), "Latin-ASCII"))
remove = c("you","the","que","and","your","muito","this","that","are","for","cara",
           "from","very","like","have","voce","man","one","nao","com","with","mais",
           "was","can","uma","but","ficou","meu","really","seu","would","sua","more",
           "it's","it","is","all","i'm","mas","como","just","make","what","esse","how",
           "por","favor","sempre","time","esta","every","para","i've","tem","will",
           "you're","essa","not","faz","pelo","than","about","acho","isso",
           "way","also","aqui","been","out","say","should","when","did","mesmo",
           "minha","next","cha","pra","sei","sure","too","das","fazer","made",
           "quando","ver","cada","here","need","ter","don't","este","has","tambem",
           "una","want","ate","can't","could","dia","fiquei","num","seus","tinha","vez",
           "ainda","any","dos","even","get","must","other","sem","vai","agora","desde",
           "dessa","fez","many","most","tao","then","tudo","vou","ficaria","foi","pela",
           "see","teu","those","were")
words = tibble(word = Reduce(c, stri_extract_all_words(comments_text$text))) %>%
  group_by(word) %>%
  count() %>%
  arrange(desc(n)) %>%
  filter(nchar(word) >= 3) %>%
  filter(n > 10, !(word %in% remove))
set.seed(3)
wordcloud(words$word, words$n, random.order = FALSE, random.color = TRUE,
          rot.per = 0.3, colors = 1:nrow(words))
```