Statistics Sunday: Creating Wordclouds
[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers.]
library(quanteda) # install with install.packages("quanteda") if needed

data(data_corpus_inaugural)
speeches <- data_corpus_inaugural$documents
# note: in quanteda >= 2.0 the $documents slot was removed; there you'd use
# something like convert(data_corpus_inaugural, to = "data.frame") instead
row.names(speeches) <- NULL
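If you're following along at home, a quick str() call (my addition, not part of the original code) shows what we just created:

# peek at the data frame of speeches
str(speeches)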
As you can see, this dataset stores each Inaugural Address in a column called "texts," with the year and the President's name as additional variables. To analyze the words in the speeches and generate a wordcloud, we'll want to unnest the words in the texts column.
library(tidytext)
library(tidyverse)

speeches_tidy <- speeches %>%
  unnest_tokens(word, texts) %>% # one row per word
  anti_join(stop_words)          # drop common stop words
## Joining, by = "word"
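As a quick sanity check (not in the original post), we can count how many words survived the stop-word removal and glance at the most frequent ones:

# how many word tokens remain after removing stop words?
nrow(speeches_tidy)

# the ten most frequent remaining words
speeches_tidy %>%
  count(word, sort = TRUE) %>%
  head(10)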
For our first wordcloud, let's see what the most common words are across all speeches.
library(wordcloud) # install.packages("wordcloud") if needed

speeches_tidy %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 50))
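A small aside that isn't in the original post: wordcloud() takes additional arguments, like colors and random.order, if you want more control over the plot. Here's a sketch using an RColorBrewer palette:

library(RColorBrewer) # installed as a dependency of wordcloud

speeches_tidy %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 50,
                 random.order = FALSE,              # plot the most frequent words first
                 colors = brewer.pal(8, "Dark2")))  # shade words by frequency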
We could just as easily create a wordcloud for one President specifically. For instance, let's create one for Obama, since he gives us two speeches' worth of words. But to take things up a notch, let's add sentiment information to our wordcloud. To do that, we'll use the comparison.cloud function; we'll also need the reshape2 library.
library(reshape2) # install.packages("reshape2") if needed

obama_words <- speeches_tidy %>%
  filter(President == "Obama") %>%
  count(word, sort = TRUE)

obama_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment %in% c("positive", "negative"))) %>%
  filter(n > 1) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "blue"))
## Joining, by = "word"
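One last aside of my own: comparison.cloud draws to the current graphics device, so to save the result you can wrap the same pipeline in a png()/dev.off() pair (the filename here is just an example):

png("obama_sentiment_cloud.png", width = 800, height = 800) # hypothetical filename
obama_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment %in% c("positive", "negative"))) %>%
  filter(n > 1) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "blue"))
dev.off() # close the device to write the file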