LP and Tidy Data Principles (Part 1)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Motivation
The Life Changing Magic of Tidying Text is one of those post I keep re-reading from time to time and I wanted to try the analysis with songs.
I shall use lp
package, a small data package I had for experimental purposes.
Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.
LP Can Be So Tidy
I use unnest_tokens
to divide the lyrics in words. This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.
if (!require("pacman")) install.packages("pacman") p_load(tidyr, tidytext, tibble, dplyr, ggplot2, viridis, purrr, forcats, igraph, ggraph) p_load_gh("dgrtwo/widyr") p_load_gh("pachamaltese/lp") lp_albums <- list( lost_on_you = lost_on_you, forever_for_now = forever_for_now, heart_to_mouth = heart_to_mouth ) lp_albums_tidy <- map_df( seq_along(lp_albums), function(x) { lp_albums[[x]] %>% enframe(name = "song") %>% unnest(cols = "value") %>% filter(!grepl("\\[", value)) %>% unnest_tokens(line, value, token = "lines") %>% group_by(song) %>% mutate(linenumber = row_number()) %>% ungroup() %>% unnest_tokens(word, line) %>% mutate(album = names(lp_albums[x])) %>% select(album, song, word, linenumber) %>% anti_join(stop_words) } )
The data is in one-word-per-row format, and we can manipulate it with tidy tools like dplyr. For example, in the last chunk I used an anti_join
to remove words such a “a”, “an” or “the”.
Then we can use count
to find the most common words in all of LP songs as a whole.
lp_albums_tidy %>% count(word, sort = TRUE) # A tibble: 851 x 2 word n <chr> <int> 1 love 109 2 ooh 75 3 halo 69 4 baby 66 5 living 55 6 yeah 52 7 eh 51 8 lost 47 9 light 37 10 gonna 36 # … with 841 more rows
Most LP songs are about love, and some are covers. For example, halo is the 3rd most repeated word and we can see it in the next song.
forever_for_now$halo_live [1] "Remember those walls I built" [2] "Well, baby they're tumbling down" [3] "And they didn't even put up a fight" [4] "They didn't even make up a sound" [5] "" [6] "I found a way to let you in" [7] "But I never really had a doubt" [8] "Standing in the light of your halo" [9] "I got my angel now" [10] "" [11] "It's like I've been awakened" [12] "Every rule I had you breakin'" [13] "It's the risk that I'm takin'" [14] "I ain't never gonna shut you out" [15] "" [16] "Everywhere I'm looking now" [17] "I'm surrounded by your embrace" [18] "Baby I can see your halo" [19] "You know you're my saving grace" [20] "" [21] "You're everything I need and more" [22] "It's written all over your face" [23] "Baby I can feel your halo" [24] "Pray it won't fade away" [25] "" [26] "I can feel your halo halo halo" [27] "I can see your halo halo halo" [28] "I can feel your halo halo halo" [29] "I can see your halo halo halo" [30] "" [31] "Hit me like a ray of sun" [32] "Burning through my darkest night" [33] "You're the only one that I want" [34] "Think I'm addicted to your light" [35] "" [36] "I swore I'd never fall again" [37] "But this don't even feel like falling" [38] "Gravity can't forget" [39] "To pull me back to the ground again" [40] "" [41] "Feels like I've been awakened" [42] "Every rule I had you breakin'" [43] "The risk that I'm takin'" [44] "I'm never gonna shut you out" [45] "" [46] "Everywhere I'm looking now" [47] "I'm surrounded by your embrace" [48] "Baby I can see your halo" [49] "You know you're my saving grace" [50] "" [51] "You're everything I need and more" [52] "It's written all over your face" [53] "Baby I can feel your halo" [54] "Pray it won't fade away" [55] "" [56] "I can feel your halo halo halo" [57] "I can see your halo halo halo" [58] "I can feel your halo halo halo" [59] "I can see your halo halo halo" [60] "" [61] "I can feel your halo halo halo" [62] "I can see your halo halo halo" [63] "I can feel your halo halo halo" [64] "I can see your halo halo halo" [65] "Halo, halo" [66] "" [67] "Everywhere I'm looking now" [68] "I'm surrounded by your embrace" [69] "Baby I can see your halo" [70] "You know you're my saving grace" [71] "" [72] "You're everything I need and more" [73] "It's written all over your face" [74] "Baby I can feel your halo" [75] "Pray it won't fade away" [76] "" [77] "I can feel your halo halo halo" [78] "I can see your halo halo halo" [79] "I can feel your halo halo halo" [80] "I can see your halo halo halo" [81] "" [82] "I can feel your halo halo halo" [83] "I can see your halo halo halo" [84] "I can feel your halo halo halo" [85] "I can see your halo halo halo"
Sentiment analysis can be done as an inner join. There is one sentiment lexicon in the tidytext package. Let’s examine how sentiment changes changes during each album. Let’s count the number of positive and negative words in the songs of each album
lp_albums_sentiment <- lp_albums_tidy %>% inner_join(sentiments) %>% count(song, sentiment) %>% spread(sentiment, n, fill = 0) %>% mutate(sentiment = positive - negative) %>% left_join( lp_albums_tidy %>% select(song, album) %>% distinct() %>% group_by(album) %>% mutate(song_number = row_number()) %>% ungroup() ) %>% mutate( album = as.factor(album), album = fct_relevel(album, "lost_on_you", "forever_for_now") ) %>% arrange(album, song_number) %>% select(album, song, song_number, everything()) lp_albums_sentiment # A tibble: 46 x 6 album song song_number negative positive sentiment <fct> <chr> <int> <dbl> <dbl> <dbl> 1 lost_on_you death_valley 1 10 8 -2 2 lost_on_you into_the_wild 2 36 8 -28 3 lost_on_you lost_on_you 3 33 5 -28 4 lost_on_you muddy_waters 4 24 3 -21 5 lost_on_you no_witness 5 13 7 -6 6 lost_on_you other_people 6 12 11 -1 7 lost_on_you strange 7 12 6 -6 8 lost_on_you tightrope 8 15 2 -13 9 lost_on_you up_against_me 9 1 3 2 10 lost_on_you you_want_it_all 10 11 0 -11 # … with 36 more rows
Now we can plot these sentiment scores across the plot trajectory of each album.
ggplot(lp_albums_sentiment, aes(song_number, sentiment, fill = album)) + geom_bar(stat = "identity", show.legend = FALSE) + facet_wrap(~album, nrow = 3, scales = "free_x", dir = "v") + theme_minimal(base_size = 13) + labs(title = "Sentiment in LP's Albums", y = "Sentiment") + scale_fill_viridis(end = 0.75, discrete = TRUE) + scale_x_discrete(expand = c(0.02,0)) + theme(strip.text = element_text(hjust = 0)) + theme(strip.text = element_text(face = "italic")) + theme(axis.title.x = element_blank()) + theme(axis.ticks.x = element_blank()) + theme(axis.text.x = element_blank())
Looking at Units Beyond Words
Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that I am not having a good day is a negative sentence, not a positive one, because of negation.
lp_albums_lines <- map_df( seq_along(lp_albums), function(x) { lp_albums[[x]] %>% enframe(name = "song") %>% unnest(cols = "value") %>% filter(!grepl("\\[", value)) %>% unnest_tokens(line, value, token = "lines") %>% ungroup() %>% mutate(album = names(lp_albums[x])) %>% select(album, song, line) } )
Let’s look at just one.
lp_albums_lines$line[44] [1] "there's no light of day"
We can use tidy text analysis to ask questions such as: What are the most negative song in each of LP’s albums? First, let’s get the list of negative words from the lexicon. Second, let’s make a dataframe of how many words are in each song so we can normalize for the length of songs. Then, let’s find the number of negative words in each song and divide by the total words in each song. Which song has the highest proportion of negative words?
sentiment_negative <- sentiments %>% filter(sentiment == "negative") wordcounts <- lp_albums_tidy %>% group_by(album, song) %>% summarize(words = n()) lp_albums_tidy %>% semi_join(sentiment_negative) %>% group_by(album, song) %>% summarize(negativewords = n()) %>% left_join(wordcounts, by = c("album", "song")) %>% mutate(ratio = negativewords/words) %>% top_n(1) # A tibble: 3 x 5 # Groups: album [3] album song negativewords words ratio <chr> <chr> <int> <int> <dbl> 1 forever_for_now wasted_love_live 24 87 0.276 2 heart_to_mouth die_for_your_love 18 88 0.205 3 lost_on_you lost_on_you 33 68 0.485
Networks of Words
Another function in widyr is pairwise_count
, which counts pairs of items that occur together within a group. Let’s count the words that occur together in the songs of the first album.
word_cooccurences <- lp_albums_tidy %>% filter(album == "lost_on_you") %>% pairwise_count(word, linenumber, sort = TRUE) word_cooccurences # A tibble: 7,504 x 3 item1 item2 n <chr> <chr> <dbl> 1 witness bear 11 2 bear witness 11 3 we’re muddy 9 4 muddy we’re 9 5 die gonna 8 6 change gonna 8 7 gonna die 8 8 water muddy 8 9 muddy water 8 10 we’re water 8 # … with 7,494 more rows
This can be useful, for example, to plot a network of co-occuring words with the igraph and ggraph packages.
set.seed(1724) word_cooccurences %>% filter(n >= 3) %>% graph_from_data_frame() %>% ggraph(layout = "kk") + geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#a8a8a8") + geom_node_point(color = "darkslategray4", size = 8) + geom_node_text(aes(label = name), vjust = 2.2) + ggtitle("Word Network in LP's albums") + theme_void()
It looks good!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.