In this post I am going to analyse my departmental WhatsApp group chat, “Statistics Class of 2019”. Thanks to the rwhatsapp package, we can work with WhatsApp text data in R. We are also going to perform some text mining using the tidytext package.
```r
# load libraries
library(rwhatsapp)
library(tidyverse)

# load data
chat <- rwa_read("C:/Users/Adejumo/Downloads/whatsapp.txt") %>%
  # remove messages without author
  filter(!is.na(author))
chat
## # A tibble: 403 x 6
##    time                author            text        source     emoji emoji_name
##    <dttm>              <fct>             <chr>       <chr>      <lis> <list>
##  1 2021-10-17 14:13:16 +234 703 857 6887 "\U0001f60~ C:/Users/~ <chr~ <chr [1]>
##  2 2021-10-17 14:13:16 +234 703 857 6887 "Don't men~ C:/Users/~ <chr~ <chr [1]>
##  3 2021-10-17 14:30:16 Don Sepa          "Congratul~ C:/Users/~ <NUL~ <NULL>
##  4 2021-10-17 15:40:16 Jagaban           "<Media om~ C:/Users/~ <NUL~ <NULL>
##  5 2021-10-17 15:42:16 +234 703 857 6887 "<U+2764><U+FE0F>" C:/Users/~ <chr~ <chr [1]>
##  6 2021-10-17 15:46:16 +234 816 195 5210 "Congratul~ C:/Users/~ <NUL~ <NULL>
##  7 2021-10-17 15:51:16 Sobah             "Congratul~ C:/Users/~ <NUL~ <NULL>
##  8 2021-10-17 15:51:16 Zahra             "Congrats ~ C:/Users/~ <NUL~ <NULL>
##  9 2021-10-17 15:52:16 +234 706 590 8705 "<Media om~ C:/Users/~ <NUL~ <NULL>
## 10 2021-10-17 15:52:16 +234 813 737 2046 "This mess~ C:/Users/~ <NUL~ <NULL>
## # ... with 393 more rows
```
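Before plotting anything, it can help to get a quick feel for the data. This is a minimal sketch (not from the original post) that summarises the `chat` tibble using the `time` and `author` columns shown in the output above:

```r
library(dplyr)

# quick overview: date range, message count, and number of participants
chat %>%
  summarise(
    first_message = min(time),
    last_message  = max(time),
    n_messages    = n(),
    n_authors     = n_distinct(author)
  )
```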
Some messages were lost during the export, so the chat actually contains more than the 403 messages we have here. Let's look at the number of messages sent per day.
```r
chat %>%
  mutate(day = lubridate::date(time)) %>%
  count(day) %>%
  ggplot(aes(x = day, y = n)) +
  geom_bar(stat = "identity") +
  ylab("") + xlab("") +
  ggtitle("Messages per day")
```
```r
chat %>%
  mutate(day = lubridate::date(time)) %>%
  count(author) %>%
  arrange(desc(n)) %>%
  head() %>%
  ggplot(aes(x = reorder(author, n), y = n, fill = author)) +
  geom_bar(stat = "identity") +
  ylab("") + xlab("") +
  coord_flip() +
  ggtitle("Number of messages")
```
```r
library("ggimage")

emoji_data <- rwhatsapp::emojis %>%                       # data built into package
  mutate(hex_runes1 = gsub("\\s.*", "", hex_runes)) %>%   # ignore combined emojis
  mutate(emoji_url = paste0("https://abs.twimg.com/emoji/v2/72x72/",
                            tolower(hex_runes1), ".png"))

chat %>%
  unnest(emoji) %>%
  count(author, emoji, sort = TRUE) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  group_by(author) %>%
  left_join(emoji_data, by = "emoji") %>%
  ggplot(aes(x = reorder(emoji, n), y = n, fill = author)) +
  geom_col(show.legend = FALSE) +
  ylab("") + xlab("") +
  coord_flip() +
  geom_image(aes(y = n + 20, image = emoji_url)) +
  facet_wrap(~author, ncol = 2, scales = "free_y") +
  ggtitle("Most often used emojis") +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
```
The most frequently used emoji in the group is the face with tears of joy. The group admin, Jagaban, is also the member who sends the most emojis. Next, let's compare favourite words.
```r
library(tidytext)

chat %>%
  unnest_tokens(input = text, output = word) %>%
  count(author, word, sort = TRUE) %>%
  head(80) %>%
  group_by(author) %>%
  top_n(n = 6, n) %>%
  ggplot(aes(x = reorder_within(word, n, author), y = n, fill = author)) +
  geom_col(show.legend = FALSE) +
  ylab("") + xlab("") +
  coord_flip() +
  facet_wrap(~author, ncol = 2, scales = "free_y") +
  scale_x_reordered() +
  ggtitle("Most often used words")
```
```r
library("stopwords")

# English stop words plus chat-specific noise words
to_remove <- c(stopwords(language = "en"),
               "media", "omitted", "na", "2", "s", "u", "ahni", "irc", "dey",
               "3", "au", "mak", "don", "naa", "4", "6", "una", "b", "oo",
               "2021", "go", "sir")

chat %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% to_remove) %>%
  count(author, word, sort = TRUE) %>%
  head(90) %>%
  group_by(author) %>%
  top_n(n = 6, n) %>%
  ggplot(aes(x = reorder_within(word, n, author), y = n, fill = author)) +
  geom_col(show.legend = FALSE) +
  ylab("") + xlab("") +
  coord_flip() +
  facet_wrap(~author, ncol = 2, scales = "free_y") +
  scale_x_reordered() +
  ggtitle("Most often used words")
```
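As an aside, tidytext ships its own `stop_words` data frame, so the same filtering can also be written with `anti_join` instead of `%in%`. This is a minimal sketch under that alternative approach (the `custom_stops` tibble is a made-up example, not from the original post):

```r
library(tidytext)
library(dplyr)

# chat-specific noise words, in the same one-column shape as stop_words
custom_stops <- tibble(word = c("media", "omitted", "na"))

chat %>%
  unnest_tokens(input = text, output = word) %>%
  anti_join(stop_words, by = "word") %>%    # drop English stop words
  anti_join(custom_stops, by = "word") %>%  # drop chat-specific noise
  count(author, word, sort = TRUE)
```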
One word that stands out is ngojobsite.com, a link shared in the group chat. Another text mining technique we can apply is to calculate lexical diversity: how many unique words each author uses.
```r
chat %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% to_remove) %>%
  count(author, word, sort = TRUE) %>%
  group_by(author) %>%
  summarise(lex_diversity = n_distinct(word)) %>%
  arrange(desc(lex_diversity)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(author, lex_diversity),
             y = lex_diversity,
             fill = author)) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous(expand = expansion(mult = 0, add = c(0, 500))) +
  geom_text(aes(label = scales::comma(lex_diversity)), hjust = -0.1) +
  ylab("unique words") + xlab("") +
  ggtitle("Lexical Diversity") +
  coord_flip()
```
```r
# words used by everyone except Limitless
o_words <- chat %>%
  unnest_tokens(input = text, output = word) %>%
  filter(author != "Limitless") %>%
  count(word, sort = TRUE)

chat %>%
  unnest_tokens(input = text, output = word) %>%
  filter(author == "Limitless") %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% o_words$word) %>%  # only select words nobody else uses
  top_n(n = 6, n) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  ylab("") + xlab("") +
  coord_flip() +
  ggtitle("Unique words of Limitless")
```
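The "words nobody else uses" filter above can also be expressed with `anti_join`, which avoids keeping the intermediate word counts around as a lookup vector. A sketch of that alternative, assuming the same `chat` data (`words_by_author` and `others` are names I made up for this example):

```r
library(dplyr)
library(tidytext)

words_by_author <- chat %>%
  unnest_tokens(input = text, output = word) %>%
  count(author, word)

# every word that appears in any other author's messages
others <- words_by_author %>%
  filter(author != "Limitless") %>%
  distinct(word)

words_by_author %>%
  filter(author == "Limitless") %>%
  anti_join(others, by = "word") %>%  # keep only words no other author used
  slice_max(n, n = 6)
```

`slice_max()` is the modern replacement for `top_n()`, which is superseded in current dplyr.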