By Gabriel Vasconcelos
So I decided to have a quick look at the tuber package to extract YouTube data in R. My cousin is a singer (a hell of a good one) and he has a YouTube channel (Dan Vasc), which I strongly recommend, where he posts his covers. I will focus my analysis on his channel. The tuber package is very friendly: it downloads YouTube statistics on comments, views, likes and more straight into R using the YouTube API.
First, let's look at some general information about the channel (the code for replication is at the end of the text). The table below shows the number of subscribers, views, videos, etc. at the moment I downloaded the data (2017-12-12, 11:20 pm). If you run the code on your computer the results may be different, because the channel will have had more activity. Dan's channel is getting close to 1 million views, and it has 58 times more likes than dislikes. That works out to roughly 13,000 views per video.
Channel | Subscriptions | Views | Videos | Likes | Dislikes | Comments |
---|---|---|---|---|---|---|
Dan Vasc | 5127 | 743322 | 57 | 9008 | 155 | 1993 |
We can also see some of the same statistics for each video. I selected only videos published after January 2016, which is when the channel became more active. The list has 29 videos. You can see that the channel became even more active in 2017; in the last month it moved to weekly publications.
date | title | viewCount | likeCount | dislikeCount | commentCount |
---|---|---|---|---|---|
2016-03-09 | “Heart Of Steel” – MANOWAR cover | 95288 | 1968 | 53 | 371 |
2016-05-09 | “The Sound Of Silence” – SIMON & GARFUNKEL / DISTURBED cover | 13959 | 556 | 6 | 85 |
2016-07-04 | One Man Choir – Handel’s Hallelujah | 9390 | 375 | 6 | 70 |
2016-08-16 | “Carry On” – MANOWAR cover | 19146 | 598 | 12 | 98 |
2016-09-12 | “You Are Loved (Don’t Give Up)” – JOSH GROBAN cover | 2524 | 142 | 0 | 21 |
2016-09-26 | “Hearts On Fire” – HAMMERFALL cover | 6584 | 310 | 4 | 58 |
2016-10-26 | “Dawn Of Victory” – RHAPSODY OF FIRE cover | 10335 | 354 | 5 | 69 |
2017-04-28 | “I Don’t Wanna Miss A Thing” – AEROSMITH cover | 9560 | 396 | 5 | 89 |
2017-05-09 | State of affairs | 906 | 99 | 1 | 40 |
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Japanese) | 2862 | 160 | 4 | 39 |
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Português) | 3026 | 235 | 3 | 62 |
2017-05-26 | “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (English) | 2682 | 108 | 2 | 14 |
2017-06-14 | HOW TO BE A YOUTUBE SINGER | ASKDANVASC 01 | 559 | 44 | 1 | 19 |
2017-06-17 | Promotional Live || Q&A and video games | 206 | 16 | 0 | 2 |
2017-07-18 | “The Bard’s Song” – BLIND GUARDIAN cover (SPYGLASS INN project) | 3368 | 303 | 2 | 47 |
2017-07-23 | “Numb” – LINKIN PARK cover (R.I.P. CHESTER) | 6717 | 350 | 14 | 51 |
2017-07-27 | THE PERFECT TAKE and HOW TO MIX VOCALS | ASKDANVASC 02 | 305 | 29 | 0 | 11 |
2017-08-04 | 4000 Subscribers and Second Channel | 515 | 69 | 1 | 23 |
2017-08-10 | “Hello” – ADELE cover [ROCK VERSION] | 6518 | 365 | 2 | 120 |
2017-08-27 | “The Rains Of Castamere” (The Lannister Song) – GAME OF THRONES cover | 2174 | 133 | 5 | 28 |
2017-08-31 | “Africa” – TOTO cover | 18251 | 642 | 10 | 172 |
2017-09-24 | “Chop Suey!” – SYSTEM OF A DOWN cover | 2562 | 236 | 6 | 56 |
2017-10-09 | “An American Trilogy” – ELVIS PRESLEY cover | 1348 | 168 | 1 | 48 |
2017-11-08 | “Beauty And The Beast” – Main Theme cover | Feat. Alina Lesnik | 2270 | 192 | 2 | 59 |
2017-11-16 | “Bohemian Rhapsody” – QUEEN cover | 2589 | 339 | 3 | 95 |
2017-11-23 | “The Phantom Of The Opera” – NIGHTWISH/ANDREW LLOYD WEBBER cover | Feat. Dragica | 1857 | 209 | 2 | 42 |
2017-11-24 | “Back In Black” – AC/DC cover (RIP MALCOLM YOUNG) | 2202 | 207 | 2 | 56 |
2017-11-30 | “Immigrant Song” – LED ZEPPELIN cover | 3002 | 204 | 1 | 62 |
2017-12-07 | “Sweet Child O’ Mine” – GUNS N’ ROSES cover | 1317 | 201 | 2 | 86 |
Now that we have seen the data, let's explore it to check for structure and information. The plots below show how likes, dislikes and comments relate to views. The positive relationship is obvious. However, there is some degree of nonlinearity for likes and comments: the increment in likes and comments becomes smaller as views increase. The dislikes look more linear in the views, but the number of dislikes is too small to be sure.
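One way to put a number on that diminishing-returns pattern is a log-log regression, where the slope (the elasticity) tells you how fast likes grow relative to views. This is not part of the original analysis, just a sketch; the `views` and `likes` vectors below are made-up stand-ins for `videostats$viewCount` and `videostats$likeCount`.

```r
# Hypothetical data in the same ballpark as the video table above
views <- c(500, 2000, 6000, 10000, 20000, 95000)
likes <- c(40, 150, 320, 380, 600, 1970)

# Elasticity of likes with respect to views: slope of log(likes) on log(views)
fit <- lm(log(likes) ~ log(views))
elasticity <- unname(coef(fit)[2])

# An elasticity below 1 means likes grow more slowly than views,
# which is the nonlinearity visible in the scatter plots
round(elasticity, 2)
```

An elasticity between 0 and 1 is exactly the concave shape seen in the likes-versus-views plot.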
Another interesting piece of information is how comments are distributed over time within each video. I selected the four most recent videos and plotted their comment time series below. All videos have a lot of activity in the first days, but it decreases fast a few days later. Followers and subscribers probably see the videos first, and they must be responsible for the intense activity at the beginning of each plot.
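The per-video series above boil down to counting comments per day from their publish dates. Here is a minimal base R sketch of that step; the dates are invented, whereas in the post they come from the `publishedAt` field returned by `get_comment_threads()`.

```r
# Invented comment publish dates for one video
comment_dates <- as.Date(c("2017-12-07", "2017-12-07", "2017-12-07",
                           "2017-12-08", "2017-12-08", "2017-12-10"))

# Count comments per day: most activity lands on the first day
daily <- as.data.frame(table(comment_dates))
names(daily) <- c("date", "n")
daily
```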
The most important information might be how the channel grows over time. Dan's channel had two important moments in 2017: it became much more active in April, and it started having weekly publications in November. We can clearly see in the plot below that both strategies worked. I put in two dashed lines to mark these events. In April the number of comments increased a lot, and it increased even more in November.
Finally, let's look at what is in the comments using a word cloud (wordcloud package). I removed uninformative words such as "you", "was", "is" and "were", for both English and Portuguese. The result is just below.
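The word-count step behind the cloud is simple: lowercase the comments, split them into words, drop the stopword list, and tabulate. A minimal base R sketch (the comments and stopwords here are invented; the post uses the text pulled with `get_comment_threads()` and a much longer English/Portuguese stopword list):

```r
# Invented comment texts and a tiny stopword list
comments  <- c("Amazing voice man", "amazing cover", "the best cover ever")
stopwords <- c("the", "man", "ever")

# Lowercase, split on whitespace, drop stopwords, count frequencies
words <- unlist(strsplit(tolower(comments), "\\s+"))
words <- words[!words %in% stopwords]
freq  <- sort(table(words), decreasing = TRUE)
freq
```

The resulting frequency table is what `wordcloud()` consumes as its `words` and `freq` arguments.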
Codes
Before using the tuber package you need an ID and a password from the Google Developer Console. Click here for more information. If you are interested, the tubern package has some other tools for working with YouTube data, such as generating reports.
```r
library(tuber)
library(tidyverse)
library(lubridate)
library(stringi)
library(wordcloud)
library(gridExtra)

httr::set_config(config(ssl_verifypeer = 0L)) # = Fixes some certificate problems on Linux = #

# = Authentication = #
yt_oauth("ID", "PASS", token = "")

# = Download and prepare data = #

# = Channel stats = #
chstat = get_channel_stats("UCbZRdTukTCjFan4onn04sDA")

# = Videos = #
videos = yt_search(term = "", type = "video", channel_id = "UCbZRdTukTCjFan4onn04sDA")
videos = videos %>%
  mutate(date = as.Date(publishedAt)) %>%
  filter(date > "2016-01-01") %>%
  arrange(date)

# = Comments = #
comments = lapply(as.character(videos$video_id), function(x){
  get_comment_threads(c(video_id = x), max_results = 1000)
})

# = Prep the data = #

# = Video stats table = #
videostats = lapply(as.character(videos$video_id), function(x){
  get_stats(video_id = x)
})
videostats = do.call(rbind.data.frame, videostats)
videostats$title = videos$title
videostats$date = videos$date
videostats = select(videostats, date, title, viewCount, likeCount,
                    dislikeCount, commentCount) %>%
  as_tibble() %>%
  mutate(viewCount = as.numeric(as.character(viewCount)),
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))

# = General stats table = #
genstat = data.frame(Channel = "Dan Vasc",
                     Subscriptions = chstat$statistics$subscriberCount,
                     Views = chstat$statistics$viewCount,
                     Videos = chstat$statistics$videoCount,
                     Likes = sum(videostats$likeCount),
                     Dislikes = sum(videostats$dislikeCount),
                     Comments = sum(videostats$commentCount))

# = videostats plots = #
p1 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)

# = Comments time series = #
comments_ts = lapply(comments, function(x){
  as.Date(x$publishedAt)
})
comments_ts = tibble(date = as.Date(Reduce(c, comments_ts))) %>%
  group_by(date) %>%
  count()
ggplot(data = comments_ts) +
  geom_line(aes(x = date, y = n)) +
  geom_smooth(aes(x = date, y = n), se = FALSE) +
  ggtitle("Comments by day") +
  geom_vline(xintercept = as.numeric(as.Date("2017-11-08")), linetype = 2, color = "red") +
  geom_vline(xintercept = as.numeric(as.Date("2017-04-28")), linetype = 2, color = "red")

# = Comments by video = #
selected = (nrow(videostats) - 3):nrow(videostats)
top4 = videostats$title[selected]
top4comments = comments[selected]
p = list()
for(i in 1:4){
  df = top4comments[[i]]
  df$date = as.Date(df$publishedAt)
  df = df %>%
    arrange(date) %>%
    group_by(year(date), month(date), day(date)) %>%
    count()
  df$date = make_date(df$`year(date)`, df$`month(date)`, df$`day(date)`)
  p[[i]] = ggplot(data = df) + geom_line(aes(x = date, y = n)) + ggtitle(top4[i])
}
do.call(grid.arrange, p)

## = Word cloud = ##
comments_text = lapply(comments, function(x){
  as.character(x$textOriginal)
})
comments_text = tibble(text = Reduce(c, comments_text)) %>%
  mutate(text = stri_trans_general(tolower(text), "Latin-ASCII"))
remove = c("you","the","que","and","your","muito","this","that","are","for","cara",
           "from","very","like","have","voce","man","one","nao","com","with","mais",
           "was","can","uma","but","ficou","meu","really","seu","would","sua","more",
           "it's","it","is","all","i'm","mas","como","just","make","what","esse","how",
           "por","favor","sempre","time","esta","every","para","i've","tem","will",
           "you're","essa","not","faz","pelo","than","about","acho","isso",
           "way","also","aqui","been","out","say","should","when","did","mesmo",
           "minha","next","cha","pra","sei","sure","too","das","fazer","made",
           "quando","ver","cada","here","need","ter","don't","este","has","tambem",
           "una","want","ate","can't","could","dia","fiquei","num","seus","tinha","vez",
           "ainda","any","dos","even","get","must","other","sem","vai","agora","desde",
           "dessa","fez","many","most","tao","then","tudo","vou","ficaria","foi","pela",
           "see","teu","those","were")
words = tibble(word = Reduce(c, stri_extract_all_words(comments_text$text))) %>%
  group_by(word) %>%
  count() %>%
  arrange(desc(n)) %>%
  filter(nchar(word) >= 3) %>%
  filter(n > 10, !(word %in% remove))
set.seed(3)
wordcloud(words$word, words$n, random.order = FALSE, random.color = TRUE,
          rot.per = 0.3, colors = 1:nrow(words))
```