[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].
These tools are useful when you have multiple documents you’re analyzing, such as interview text from different people or books by the same author. For my demonstration today, I’ll be using (what else?) song lyrics, this time from Florence + the Machine (one of my all-time favorites), who just dropped a new album, High as Hope. So let’s get started by pulling in those lyrics.
library(geniusR)

high_as_hope <- genius_album(artist = "Florence the Machine", album = "High as Hope")

## Joining, by = c("track_title", "track_n", "track_url")

library(tidyverse)
library(tidytext)

tidy_hope <- high_as_hope %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words)

## Joining, by = "word"

head(tidy_hope)

## # A tibble: 6 x 4
##   track_title track_n  line word
##   <chr>         <int> <int> <chr>
## 1 June              1     1 started
## 2 June              1     1 crack
## 3 June              1     2 woke
## 4 June              1     2 chicago
## 5 June              1     2 sky
## 6 June              1     2 black
Now we have a tidy dataset with stop words removed. Before we go any further, let’s talk about the tools we’re going to apply. Often, when we analyze text, we want to discover what different documents are about – what are their topics or themes? One way to do that is to look at the words commonly used in a document, which can tell us something about its theme. A simple measure of how often a term comes up in a particular document is its term frequency (TF).
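To make TF concrete, here’s a minimal sketch of computing it by hand from the tidy data above (the bind_tf_idf function we’ll meet shortly does this for us; tf_by_hand is just an illustrative name):

# By-hand term frequency: each word's count divided by the total
# number of (non-stop) words in its track
tf_by_hand <- tidy_hope %>%
  count(track_title, word) %>%
  group_by(track_title) %>%
  mutate(tf = n / sum(n)) %>%
  ungroup() %>%
  arrange(desc(tf))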
Removing stop words is an important step before looking at TF, because otherwise the highest-frequency words wouldn’t be very meaningful – they’d be the words that fill every sentence, like “the” or “a.” But many common words still won’t get weeded out by our stop-word anti-join, and it’s often the less frequently used words that tell us something about a document’s meaning. This is where inverse document frequency (IDF) comes in: it accounts for how widespread a word is across the set of documents, giving higher weight to words that appear in few documents and lower weight to words that appear in many. This means a word used a great deal in one song but absent from the others will have a high IDF.
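Under the hood, tidytext computes IDF as the natural log of the number of documents divided by the number of documents containing the term. With this album’s 10 tracks as our documents, a quick sketch of the two extremes:

# IDF for a word appearing in only 1 of the 10 tracks
log(10 / 1)   # ~2.30: maximum weight

# IDF for a word appearing in all 10 tracks
log(10 / 10)  # 0: no weight at all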
We can use these two values at the same time by multiplying them together to form TF-IDF, which tells us the frequency of a term in a document, adjusted for how common it is across the set of documents. And thanks to the tidytext package, these values can be automatically calculated for us with the bind_tf_idf function. First, we need to reformat our data a bit by counting use of each word by song. We do this by referencing the track_title variable in our count function, which tells R to group by this variable, followed by what we want R to count (the variable called word).
song_words <- tidy_hope %>%
  count(track_title, word, sort = TRUE) %>%
  ungroup()
The bind_tf_idf function needs 3 arguments: word (or whatever we called the variable containing our words), the document indicator (in this case, track_title), and the word counts by document (n).
song_words <- song_words %>%
  bind_tf_idf(word, track_title, n) %>%
  arrange(desc(tf_idf))

head(song_words)

## # A tibble: 6 x 6
##   track_title     word          n    tf   idf tf_idf
##   <chr>           <chr>     <int> <dbl> <dbl>  <dbl>
## 1 Hunger          hunger       25 0.236  2.30  0.543
## 2 Grace           grace        16 0.216  2.30  0.498
## 3 The End of Love wash         18 0.209  2.30  0.482
## 4 Hunger          ooh          20 0.189  2.30  0.434
## 5 Patricia        wonderful    10 0.125  2.30  0.288
## 6 100 Years       hundred      12 0.106  2.30  0.245
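These numbers line up with the IDF sketch earlier: every word in this top six has idf = 2.30, i.e., log(10/1), so each appears in exactly one of the 10 tracks. And multiplying the columns checks out:

# "hunger": tf * idf should reproduce the tf_idf column
0.236 * log(10 / 1)  # ~0.543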
Some of the results are unsurprising – “hunger” is far more common in the track called “Hunger” than any other track, “grace” is more common in “Grace”, and “hundred” is more common in “100 Years”. But let’s explore the different words by plotting the highest tf-idf for each track. To keep the plot from getting ridiculously large, I’ll just ask for the top 5 for each of the 10 tracks.
song_words %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(track_title) %>%
  top_n(5) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = track_title)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~track_title, ncol = 2, scales = "free") +
  coord_flip()

## Selecting by tf_idf
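One caveat with the factor trick above: a word that makes the top 5 in more than one track can end up out of order within a panel, because the factor levels are set globally. A sketch of an alternative using tidytext’s reorder_within and scale_x_reordered helpers (available in newer tidytext releases), which order the bars within each facet:

song_words %>%
  group_by(track_title) %>%
  top_n(5, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, track_title)) %>%
  ggplot(aes(word, tf_idf, fill = track_title)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~track_title, ncol = 2, scales = "free") +
  coord_flip()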