[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].
These tools are useful when you have multiple documents you’re analyzing, such as interview text from different people or books by the same author. For my demonstration today, I’ll be using (what else?) song lyrics, this time from Florence + the Machine (one of my all-time favorites), who just dropped a new album, High as Hope. So let’s get started by pulling in those lyrics.
library(geniusR)

high_as_hope <- genius_album(artist = "Florence the Machine", album = "High as Hope")

## Joining, by = c("track_title", "track_n", "track_url")

library(tidyverse)
library(tidytext)

tidy_hope <- high_as_hope %>%
  unnest_tokens(word, lyric) %>%
  anti_join(stop_words)

## Joining, by = "word"

head(tidy_hope)

## # A tibble: 6 x 4
##   track_title track_n  line word
##   <chr>         <int> <int> <chr>
## 1 June              1     1 started
## 2 June              1     1 crack
## 3 June              1     2 woke
## 4 June              1     2 chicago
## 5 June              1     2 sky
## 6 June              1     2 black
Now we have a tidy dataset with stop words removed. Before we go any further, let’s talk about the tools we’re going to apply. Often, when we analyze text, we want to discover what different documents are about – what are their topics or themes? One way to do that is to look at the words commonly used in a document, which can tell us something about its theme. A simple measure of how often a term comes up in a particular document is its term frequency (TF).
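To make TF concrete, here’s a minimal sketch of computing it by hand from the tidy data above (the bind_tf_idf function we’ll meet shortly does this for us; tf_by_hand is just an illustrative name):

# By-hand term frequency: each word's count divided by the total
# number of (non-stop) words in its track
tf_by_hand <- tidy_hope %>%
  count(track_title, word) %>%
  group_by(track_title) %>%
  mutate(tf = n / sum(n)) %>%
  ungroup() %>%
  arrange(desc(tf))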
Removing stop words is an important step before looking at TF, because otherwise the highest-frequency words wouldn’t be very meaningful – they’d be the words that fill every sentence, like “the” or “a.” But many common words still won’t get weeded out by our stop-word anti-join, and it’s often the less frequently used words that tell us something about a document’s meaning. This is where inverse document frequency (IDF) comes in: it accounts for how widespread a word is across the set of documents, giving higher weight to words that appear in few documents and lower weight to words that appear in many. This means a word used a great deal in one song but absent from the others will have a high IDF.
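Under the hood, tidytext computes IDF as the natural log of the number of documents divided by the number of documents containing the term. With this album’s 10 tracks as our documents, a quick sketch of the two extremes:

# IDF for a word appearing in only 1 of the 10 tracks
log(10 / 1)   # ~2.30: maximum weight

# IDF for a word appearing in all 10 tracks
log(10 / 10)  # 0: no weight at all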
We can use these two values at the same time by multiplying them together to form TF-IDF, which tells us the frequency of a term in a document, adjusted for how common it is across the set of documents. And thanks to the tidytext package, these values can be automatically calculated for us with the bind_tf_idf function. First, we need to reformat our data a bit by counting use of each word by song. We do this by referencing the track_title variable in our count function, which tells R to group by this variable, followed by what we want R to count (the variable called word).
song_words <- tidy_hope %>%
  count(track_title, word, sort = TRUE) %>%
  ungroup()
The bind_tf_idf function needs 3 arguments: word (or whatever we called the variable containing our words), the document indicator (in this case, track_title), and the word counts by document (n).
song_words <- song_words %>%
  bind_tf_idf(word, track_title, n) %>%
  arrange(desc(tf_idf))

head(song_words)

## # A tibble: 6 x 6
##   track_title     word          n    tf   idf tf_idf
##   <chr>           <chr>     <int> <dbl> <dbl>  <dbl>
## 1 Hunger          hunger       25 0.236  2.30  0.543
## 2 Grace           grace        16 0.216  2.30  0.498
## 3 The End of Love wash         18 0.209  2.30  0.482
## 4 Hunger          ooh          20 0.189  2.30  0.434
## 5 Patricia        wonderful    10 0.125  2.30  0.288
## 6 100 Years       hundred      12 0.106  2.30  0.245
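These numbers line up with the IDF sketch earlier: every word in this top six has idf = 2.30, i.e., log(10/1), so each appears in exactly one of the 10 tracks. And multiplying the columns checks out:

# "hunger": tf * idf should reproduce the tf_idf column
0.236 * log(10 / 1)  # ~0.543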
Some of the results are unsurprising – “hunger” is far more common in the track called “Hunger” than any other track, “grace” is more common in “Grace”, and “hundred” is more common in “100 Years”. But let’s explore the different words by plotting the highest tf-idf for each track. To keep the plot from getting ridiculously large, I’ll just ask for the top 5 for each of the 10 tracks.
song_words %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(track_title) %>%
  top_n(5) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = track_title)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~track_title, ncol = 2, scales = "free") +
  coord_flip()

## Selecting by tf_idf
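One caveat with the factor trick above: a word that makes the top 5 in more than one track can end up out of order within a panel, because the factor levels are set globally. A sketch of an alternative using tidytext’s reorder_within and scale_x_reordered helpers (available in newer tidytext releases), which order the bars within each facet:

song_words %>%
  group_by(track_title) %>%
  top_n(5, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, track_title)) %>%
  ggplot(aes(word, tf_idf, fill = track_title)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~track_title, ncol = 2, scales = "free") +
  coord_flip()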