[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].
There are two packages – geniusR and geniusr – which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?
To install geniusR, you’ll use a different method than usual – you’ll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.
install.packages("devtools")
devtools::install_github("josiahparry/geniusR")
## Downloading GitHub repo josiahparry/geniusR@master
## from URL https://api.github.com/repos/josiahparry/geniusR/zipball/master
## Installing geniusR
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
##   --no-environ --no-save --no-restore --quiet CMD INSTALL \
##   '/private/var/folders/85/9ygtlz0s4nxbmx3kgkvbs5g80000gn/T/Rtmpl3bwRx/devtools33c73e3f989/JosiahParry-geniusR-5907d82' \
##   --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' \
##   --install-tests
##
Now you’ll want to load geniusR and tidyverse so we can work with our data.
library(geniusR)
library(tidyverse)
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
For today’s demonstration, I’ll be working with data from two artists I love: Taylor Swift and Lorde. Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.
geniusR has a function genius_album that will download lyrics from an entire album, labeling it by track.
swift_lyrics <- genius_album(artist="Taylor Swift", album="Reputation")
## Joining, by = c("track_title", "track_n", "track_url")
lorde_lyrics <- genius_album(artist="Lorde", album="Melodrama")
## Joining, by = c("track_title", "track_n", "track_url")
Now we want to tokenize our datasets, remove stop words, and count word frequency. This code should look familiar, except that this time I’m chaining the steps with the pipe operator (%>%) from the tidyverse, which lets you string together multiple functions without having to nest them.
library(tidytext)
tidy_swift <- swift_lyrics %>%
  unnest_tokens(word,lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
head(tidy_swift)
## # A tibble: 6 x 2
##   word      n
##   <chr> <int>
## 1 call     46
## 2 wanna    37
## 3 ooh      35
## 4 ha       34
## 5 ah       33
## 6 time     32
tidy_lorde <- lorde_lyrics %>%
  unnest_tokens(word,lyric) %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
head(tidy_lorde)
## # A tibble: 6 x 2
##   word         n
##   <chr>    <int>
## 1 boom        40
## 2 love        26
## 3 shit        24
## 4 dynamite    22
## 5 homemade    22
## 6 light       22
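To see what the pipe is buying us, here is a minimal sketch on a made-up two-line tibble. The `toy_lyrics` data is invented for illustration and is not output from geniusR. It runs the same three steps twice, once nested and once piped, so you can compare how each version reads.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for a lyrics tibble
toy_lyrics <- tibble(
  line  = 1:2,
  lyric = c("Call me, call me", "I wanna call")
)

# Nested version: has to be read from the inside out
nested <- count(
  anti_join(unnest_tokens(toy_lyrics, word, lyric), stop_words, by = "word"),
  word, sort = TRUE
)

# Piped version: reads top to bottom, one step per line
piped <- toy_lyrics %>%
  unnest_tokens(word, lyric) %>%      # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop common stop words
  count(word, sort = TRUE)            # frequency table, most common first

piped
```

Both versions should give the same frequency table; the piped one simply lists the steps in the order they happen. Passing `by = "word"` explicitly also silences the "Joining, by" message.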
Looking at the top 6 words for each, it doesn’t look like there will be a lot of overlap. But let’s explore that, shall we? Lorde’s album is 3 tracks shorter than Taylor Swift’s. To make sure our word comparisons are meaningful, I’ll create new variables that take into account the total number of words, so each word metric will be a proportion, allowing for direct comparisons. And because I’ll be joining the datasets, I’ll be sure to label these new columns by artist name.
tidy_swift <- tidy_swift %>%
  rename(swift_n = n) %>%
  mutate(swift_prop = swift_n/sum(swift_n))
tidy_lorde <- tidy_lorde %>%
  rename(lorde_n = n) %>%
  mutate(lorde_prop = lorde_n/sum(lorde_n))
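Why bother with proportions? A small sketch with invented counts (the tibbles `a` and `b` are hypothetical, not the album data) shows how the same raw count can mean different things in corpora of different sizes.

```r
library(dplyr)

# Hypothetical counts: "love" appears 10 times in both corpora,
# but corpus a has 40 words in total and corpus b has only 20
a <- tibble(word = c("love", "time"),  n = c(10, 30))
b <- tibble(word = c("love", "night"), n = c(10, 10))

# Divide each count by the corpus total so the values are comparable
a <- a %>% mutate(prop = n / sum(n))
b <- b %>% mutate(prop = n / sum(n))

a
b
```

Here "love" makes up a quarter of corpus `a` but half of corpus `b`, a difference the raw counts would hide.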
There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When one artist’s dataset contributes a word not found in the other’s, the join fills the other artist’s columns with missing values (NA).
compare_words <- tidy_swift %>%
  full_join(tidy_lorde, by = "word")
summary(compare_words)
##      word              swift_n          swift_prop         lorde_n
##  Length:957         Min.   : 1.000   Min.   :0.00050   Min.   : 1.0
##  Class :character   1st Qu.: 1.000   1st Qu.:0.00050   1st Qu.: 1.0
##  Mode  :character   Median : 1.000   Median :0.00050   Median : 1.0
##                     Mean   : 3.021   Mean   :0.00152   Mean   : 2.9
##                     3rd Qu.: 3.000   3rd Qu.:0.00151   3rd Qu.: 3.0
##                     Max.   :46.000   Max.   :0.02321   Max.   :40.0
##                     NA's   :301      NA's   :301       NA's   :508
##    lorde_prop
##  Min.   :0.0008
##  1st Qu.:0.0008
##  Median :0.0008
##  Mean   :0.0022
##  3rd Qu.:0.0023
##  Max.   :0.0307
##  NA's   :508
The final dataset contains 957 tokens – unique words – and the NAs tell how many words are only present in one artist’s corpus. Lorde uses 301 words Taylor Swift does not, and Taylor Swift uses 508 words that Lorde does not. That leaves 148 words on which they overlap.
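If reading those counts off the NAs feels backwards, here is a minimal sketch on invented data (the tibbles `x` and `y` and their column names are hypothetical) that shows how a full_join produces the NAs and how to turn them into the three counts.

```r
library(dplyr)

# Hypothetical word counts for two small corpora
x <- tibble(word = c("love", "time", "call"), x_n = c(3, 2, 1))
y <- tibble(word = c("love", "boom"),         y_n = c(4, 5))

# full_join keeps every word from both sides; gaps become NA
both <- full_join(x, y, by = "word")

only_y <- sum(is.na(both$x_n))          # NA in x's column: word appears only in y
only_x <- sum(is.na(both$y_n))          # NA in y's column: word appears only in x
shared <- nrow(both) - only_x - only_y  # everything else appears in both

c(only_x = only_x, only_y = only_y, shared = shared)
```

The same arithmetic on the real data gives 957 - 301 - 508 = 148 shared words.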
There are many things we could do with these data, but let’s visualize words and proportions, with one artist on the x-axis and the other on the y-axis.
ggplot(compare_words, aes(x=swift_prop, y=lorde_prop)) +
  geom_abline() +
  geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
  labs(y="Lorde", x="Taylor Swift") +
  theme_classic()
## Warning: Removed 809 rows containing missing values (geom_text).
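The warning is expected: the 809 removed rows are the 301 + 508 words used by only one artist, which have an NA on one axis and so cannot be placed. One option, sketched here on invented data (`toy_compare` and its columns are hypothetical), is to filter down to the shared words yourself before plotting, which makes the choice explicit.

```r
library(dplyr)
library(ggplot2)

# Hypothetical joined data: "time" and "boom" each belong to one corpus only
toy_compare <- tibble(
  word   = c("love", "time", "boom"),
  a_prop = c(0.25, 0.75, NA),
  b_prop = c(0.50, NA, 0.50)
)

# Keep only words that have a proportion on both axes
shared_words <- toy_compare %>%
  filter(!is.na(a_prop), !is.na(b_prop))

# Same layers as the plot above, now built from complete rows only
p <- ggplot(shared_words, aes(x = a_prop, y = b_prop)) +
  geom_abline() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  theme_classic()
```

Either way the picture is the same; ggplot2 drops the incomplete rows for you, and the filter just does it up front.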