Profanity in Twitch Chat
This post contains references to profane language. All instances of profanity on this page have been censored.
This post is part tutorial, with code chunks throughout; if you are uninterested in the data manipulation steps, you can jump to Visualizing Profanity in Chat to see the results.
Background
In this post we’ll be analyzing profanity in a Twitch.tv chat. The data was collected during the final day of Smash Summit Spring 2017. The tournament boasted the largest prize pool in Melee tournament history at $51,448. The full past broadcast of the final day (including a chat replay) can be found on Beyond the Summit’s Twitch channel. If you are interested in learning more about the competitive Melee scene, you can check out /r/smashbros.
Packages Used in this Post
library(tidyverse)
library(tidytext)
library(stringr)
library(hunspell)
library(fuzzyjoin)
library(plotly)
Data Prep
The Twitch chat data was collected using the autolog feature of the IRC client Irssi. A processed version of the chat log can be found on my GitHub.
In order to detect profanity we’ll be using a list of FCC banned words.
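Neither input is loaded by the chunks below, so here is a minimal sketch of the setup. The file name, column layout, and sample words are placeholders, not the actual data, and the banned words are shown censored here to keep this page clean:

#hypothetical setup -- file name, columns, and sample words are placeholders
chat_df <- read_csv("summit_chat_log.csv") #expected columns: date, time, user, msg

fcc_wont_let_me_be <- tibble(
  bad_token  = c("f**k", "s**t"), #shown censored here; the real list holds uncensored words
  curse_flag = 1                  #flag carried through the fuzzy join below
)

problem_words_df <- tibble(
  problem_words = c("duck", "shot") #manually collected near-misses of banned words
)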
Clean Chat Data
Before we start analyzing the chat messages we’ll need to clean the text. The code chunk below splits chat messages into single words, removes stopwords, and performs word stemming. Lastly, words deemed ‘problem words’ are removed from the chat as well. Problem words have been manually collected to limit false positives when we try to detect misspelled curse words. An example of a problem word is ‘duck’, since it’s one character away from a popular curse word.
#tokenize chat messages into single words
chat_tokens <- chat_df %>%
  unnest_tokens(token, msg) %>%
  #replace na tokens with blank str
  mutate(token = if_else(is.na(token), "", token))

#remove stopwords, problem words, and stem
chat_tokens_clean <- chat_tokens %>%
  #remove stop words
  anti_join(stop_words, by = c("token" = "word")) %>%
  #stem words
  mutate(token_stem = vec_hunspell_stem(token)) %>%
  #blank out manually collected problem words
  mutate(token_stem = if_else(token_stem %in% problem_words_df$problem_words,
                              "", token_stem))
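As a quick illustration of these cleaning steps, here’s a toy run on one fabricated message (vec_hunspell_stem() is the stemming helper defined in the appendix):

#toy example: clean a single fabricated chat message
tibble(msg = "These DUCKS are running wild") %>%
  unnest_tokens(token, msg) %>%                       #lowercase single words
  anti_join(stop_words, by = c("token" = "word")) %>% #drops "these" and "are"
  mutate(token_stem = vec_hunspell_stem(token))       #e.g. "ducks" stems to "duck"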
Fuzzy Join FCC and Chat Data
Now that we have a table of words in chat, we’ll detect FCC banned words using a fuzzy join based on string distance. Using a fuzzy join allows us to catch some typos of the words on our banned list. If the string distance, using Optimal String Alignment (OSA), is at most 1, then we set a curse_flag on the chat word.
#join chat tokens to fcc banned tokens when stringdist <= threshold
#stringdist method = "osa"; threshold = 1
chat_tokens_join <- chat_tokens_clean %>%
  stringdist_left_join(fcc_wont_let_me_be,
                       by = c("token_stem" = "bad_token"),
                       max_dist = 1) %>%
  mutate(curse_flag = if_else(is.na(curse_flag), 0, 1))
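Under the hood, fuzzyjoin’s stringdist_left_join() compares strings with the stringdist package. A few quick distances (using benign stand-in words to keep this page censored) show why a threshold of 1 catches one-character typos, and why a near-miss like ‘duck’ had to be removed as a problem word beforehand:

#OSA string distances, the same method used by the join above
#(benign stand-ins used here to keep the page censored)
stringdist::stringdist("duck", "dcuk", method = "osa")  #1: one transposition
stringdist::stringdist("duck", "luck", method = "osa")  #1: one substitution --
                                                        #the same relation that makes
                                                        #"duck" a problem word
stringdist::stringdist("duck", "melee", method = "osa") #5: well over the threshold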
Visualizing Profanity in Chat
FCC Banned Word Frequencies
Next we’ll take a look at which FCC banned words were the most popular overall. The top 2 words account for 4,013 of the 4,575 total banned words (88%), so we’re going to collapse the words into 3 groups before we plot: f**k, s**t, and all others. Once we have the data summarized into these 3 groups, we’ll make a bar chart using plotly.
The plot’s code is not displayed here but can be seen in the appendix.
fcc_word_count <- chat_tokens_join %>%
  #filter out non curse words
  filter(curse_flag == 1) %>%
  #censor tokens.. fcc wont let me be
  mutate(bad_token = censor(bad_token, "u|c|h|i", "*")) %>%
  #summarise and count each banned word
  count(bad_token) %>%
  #create groups for top 2 curses and catchall group
  mutate(group = map_chr(bad_token,
                         ~case_when(.x == "f**k" ~ .x,
                                    .x == "s**t" ~ .x,
                                    TRUE ~ "all others"))) %>%
  #convert group to factor for order in plot
  mutate(group = factor(group, levels = c("f**k", "s**t", "all others"))) %>%
  #sum count for all others group
  group_by(group) %>%
  summarise(n = sum(n)) %>%
  ungroup()
FCC Banned Words Over Time
Now we’ll plot a time series of FCC banned words per minute and annotate the most profane minute. The average number of banned words per minute is 8 (shown by the black line in the plot), with the max being 58 (annotated with the SwiftRage Twitch emote). The max occurred when Liquid.Hungrybox made a shocking comeback on TSM.Leffen; the most used banned word in that minute was f**k.
The plot’s code is not displayed here but can be seen in the appendix.
fcc_by_min <- chat_tokens_join %>%
  #group at minute level (preserve date)
  group_by(date, time) %>%
  #sum curse counts per minute; store all words in list column
  summarise(curse_count = sum(curse_flag),
            curse_words = list(bad_token[!is.na(bad_token)])) %>%
  ungroup() %>%
  #create var of most used curse word per minute
  #(words concat with / if tie)
  mutate(top_curse = map_chr(curse_words, censored_word_mode)) %>%
  #index minutes for use in plots
  mutate(minute_i = row_number())
Appendix
Plot Code
FCC Banned Word Counts Bar Plot
plot_ly(fcc_word_count, x = ~group, y = ~n, type = "bar",
        marker = list(color = '#6441A4')) %>%
  layout(title = "<b>Total Counts of FCC Banned Words in Chat</b> <br> <sup>Beyond the Summit Spring 2017 Finals Day</sup>",
         xaxis = list(title = ""),
         yaxis = list(title = "Count"),
         margin = list(t = 50))
FCC Banned Words per Minute Line Plot
time_labels <- fcc_by_min %>%
  filter(str_sub(time, -2) == "00") %>%
  select(minute_i, time)

max_row <- fcc_by_min[which.max(fcc_by_min$curse_count), ]

plot_ly(data = fcc_by_min %>%
          filter(minute_i > 113) %>%
          filter(!is.na(time)),
        x = ~minute_i, y = ~curse_count,
        type = "scatter", mode = "lines",
        line = list(color = '#6441A4'),
        hoverinfo = "text",
        text = ~paste0("Count: ", curse_count, "<br>",
                       "Top Word: ", top_curse, "<br>",
                       "Time: ", time)) %>%
  add_trace(x = ~c(min(minute_i):max(minute_i)),
            y = ~mean(curse_count),
            mode = "lines",
            line = list(color = 'black'),
            hoverinfo = "text",
            text = ~paste0("Avg Count: ", floor(mean(curse_count)))) %>%
  layout(title = "<b>FCC Banned Words in Chat per Minute</b> <br> <sup>Beyond the Summit Spring 2017 Finals Day</sup>",
         xaxis = list(title = "Time (EST)",
                      tickmode = "array",
                      tickvals = time_labels$minute_i,
                      ticktext = time_labels$time),
         yaxis = list(title = "Banned Words per Minute"),
         margin = list(t = 50),
         showlegend = FALSE,
         annotations = list(x = max_row$minute_i - 15,
                            y = max_row$curse_count,
                            text = "Leffen rested by Hbox <br> twice to lose G5 LF",
                            xref = "x", yref = "y",
                            showarrow = TRUE, arrowhead = 1,
                            ax = -40, ay = -30),
         images = list(list(source = "https://static-cdn.jtvnw.net/emoticons/v1/34/2.0",
                            xref = "x", yref = "y",
                            x = 580, y = 61.5,
                            sizex = 30, sizey = 10,
                            sizing = "stretch",
                            layer = "above")))
Helper Functions
##vectorised hunspell stemmer----------------------------------------------------
my_hunspell_stem <- function(token) {
  stem_token <- hunspell::hunspell_stem(token)[[1]]
  if (length(stem_token) == 0) return(token)
  else return(stem_token[1])
}

vec_hunspell_stem <- Vectorize(my_hunspell_stem, "token")

##function to censor bad words in print------------------------------------------
censor <- function(sailor_vocab, censor_regex = "a|e|i|o|u", replace = "*") {
  stringr::str_replace_all(tolower(sailor_vocab),
                           stringr::regex(censor_regex),
                           replace)
}

##function to extract mode--------------------------------------------------------
censored_word_mode <- function(curse_vec) {
  if (length(curse_vec) == 0) return("NA")
  curse_vec %>%
    table() %>%
    .[which(. == max(.))] %>%
    names() %>%
    censor() %>%
    paste(collapse = "/")
}
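A few illustrative calls show how these helpers behave (mild stand-in words are used here to keep this page censored):

#usage examples with mild stand-ins
censor("Heck")                                #"h*ck" (default vowel regex)
censor("luck", "u|c|h|i", "*")                #"l**k" (regex used earlier in the post)
censored_word_mode(c("heck", "heck", "dang")) #"h*ck" (single most frequent word)
censored_word_mode(c("heck", "dang"))         #"d*ng/h*ck" (tie, concatenated with /)
censored_word_mode(character(0))              #"NA" (no curses that minute)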