‚Arrest this man, he talks in maths‘ – Animating ten years of listening history on Last.FM

Rcrastinate

3 years ago

[This article was first published on Rcrastinate, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Previously, when Rcrastinate was still on blogspot.com, I had a first look at ten years of my playback history on Last.FM. But there is still a lot one can do with this dataset. I wanted to try {gganimate} for a long time and this nice longitudinal dataset gives me the opportunity.

Loading and preparing the data

First, I am loading the dataset. I already did some preparations like extracting the top five tags for each track and some other stuff I used in my previous entry.

library(knitr) # for 'kable'
scrobs <- readRDS("LastFM-history-proc2.Rds")
# I am getting rid of some columns we won't need.
scrobs <- scrobs[, c("artist", "track", "album",
                     "timestamp")]
nrow(scrobs)
## [1] 65356
kable(head(scrobs))

artist	track	album	timestamp
Cat Power	Ramblin’ (Wo)man	Cat Power – Jukebox (Deluxe Edition)	2009-01-01 13:30:00
Cat Power	Metal Heart (2008 Version)	Cat Power – Jukebox (Deluxe Edition)	2009-01-01 13:33:00
Cat Power	Silver Stallion	Cat Power – Jukebox (Deluxe Edition)	2009-01-01 13:37:00
Cat Power	Aretha, Sing One for Me	Cat Power – Jukebox (Deluxe Edition)	2009-01-01 13:40:00
Cat Power	Lost Someone	Cat Power – Jukebox (Deluxe Edition)	2009-01-01 13:43:00
Cat Power	Lord, Help the Poor & Needy	Cat Power – Jukebox (Deluxe Edition)	2009-01-01 13:46:00

So, you see: There is one row in the dataset for each song I played (65356 in total). Artist, track, and album name as well as the the timestamp of the scrobble are associated with each playback. We will now extract the date from the timestamp and create a new dataset art.day that contains the number of playbacks (or “scrobbles” in Last.FM’s terminology) for each artist on each day in the ten-years period.

scrobs$date <- substr(scrobs$timestamp, 1, 10)
art.day <- as.data.frame(table(scrobs$artist, scrobs$date))
names(art.day) <- c("Artist", "Date", "Scrobbles")
kable(tail(art.day[art.day$Artist == "Beck",]),
      row.names = F)

Artist	Date	Scrobbles
Beck	2018-12-25	1
Beck	2018-12-27	0
Beck	2018-12-28	0
Beck	2018-12-29	0
Beck	2018-12-30	9
Beck	2018-12-31	0

What you see above are the last 6 days of my scrobble history for Beck. Now, for some {dplyr} magic. I am starting to really like what you can do with all the piping and stuff.

Calculating cumulative sums

What I want to do now is:

Group art.day by artist.
Create a new column called cum.plays (for “cumulated playbacks”) which holds the number of scrobbles for this artist up to the respective point in time.
To achieve this, I am sorting the different groups of the dataset by date before computing the cumulated sum – just to be sure that the cumulated sum is correct.
In the end, I am only ungrouping the dataset again and overwrite the variable holding the old one (because the operations before are “non-destructive”).

library(dplyr)
art.day %>% group_by(Artist) %>%
  arrange(Date) %>%
  mutate(cum.plays = cumsum(Scrobbles)) %>%
  ungroup() -> art.day
kable(tail(art.day[art.day$Artist == "Radiohead",]),
      row.names = F)

Artist	Date	Scrobbles	cum.plays
Radiohead	2018-12-25	0	1524
Radiohead	2018-12-27	4	1528
Radiohead	2018-12-28	2	1530
Radiohead	2018-12-29	0	1530
Radiohead	2018-12-30	4	1534
Radiohead	2018-12-31	0	1534

With the last lines for Radiohead we can see that this worked: On Dec 25th, 2018, 1524 scrobbles were registered for Radiohead. On Dec 26th, I didn’t listen to any music, so this day is missing from the dataset altogether. This is no problem because I am converting the Date column to date format later. On Dec 27th, four more scrobbles were registered for Radiohead, so cum.plays is increased by four to 1528 and so on.

Selecting top ten artists

Now, I am doing three things:

Create a table with the overall artist scrobbles to get the top ten artists.
Restricting art.day only to scrobbles by these top ten artists and save the result in top.art.day.
Convert the Date column in the new dataset to date format.

art.tab <- sort(table(scrobs$artist), decreasing = T)
top.art.day <- art.day[art.day$Artist %in% names(art.tab[1:10]),]
top.art.day$Date <- as.Date(top.art.day$Date)
nrow(top.art.day)
## [1] 28260

Scrobbles distribution

Since we’re already here, we can also see how much percent of all scrobbles are registered for the top ten artists only. We can also look at the overall distribution of scrobbles over artists using the table that I created before.

library(ggplot2)
sum(top.art.day$Scrobbles) / sum(art.day$Scrobbles) * 100
## [1] 20.11751
ggplot() + aes(x = 1:length(art.tab), y = unname(art.tab), group = 1) +
  geom_line() +
  labs(x = "Artist rank",
       y = "Artist scrobbles") +
  geom_vline(aes(xintercept = 20), col = "red") +
  theme_minimal()

So, roughly 20% of all scrobbles come from the top ten artists alone. We are dealing with a massively scewed distribution here. See that red line in the plot? 20% of all scrobbles lie to the left of this line! It’s almost like word frequency distributions in a corpus of natural language data.

Animating scrobble history

Now for the main aim of this post. I am assuming {gganimate} is already installed. I then create a plot object p which also includes a call to transition_reveal(Date). This is a function from {gganimate} which tells the plot to be revealed over time. The calls to geom_segment() and geom_text() create the artist labels on the right side of the plot that are connected to the moving dot. The artist names rise according to the current value of cum.plays. I got this code fragment from a nice example by Thomas Lin Pedersen. You can read the discussion there for further explanations. Just a quick summary of important points:

I had to switch off the legend with guides(col = F) to prevent overplotting of the moving labels and the color legend.
I had to increase the plot margin on the right side to fit the longest artist name.
Clipping is turned of with coord_cartesian(clip = 'off') because the moving labels are not shown otherwise.

library(gganimate)
p <- ggplot(top.art.day, aes(x = Date, y = cum.plays, col = Artist, group = Artist)) +
  geom_line() +
  geom_point(size = 2) +
  geom_segment(aes(xend = as.Date("2019-01-10"),
                   yend = cum.plays),
               linetype = 2,
               colour = 'grey') +
  geom_text(aes(x = as.Date("2019-01-20"),
                label = Artist),
            hjust = 0, size = 4) +
  coord_cartesian(clip = 'off') +
  transition_reveal(Date) +
  guides(col = F) +
  labs(x = "Date", y = "Playcount",
       col = "Artist",
       title = "Artist playcount through time") +
  theme_minimal() + 
  theme(plot.margin = margin(5.5, 120, 5.5, 5.5))

Now for animating the plot. Depending on the detailed settings, this could take a while. Especially the number of frames is heavily influencing rendering time.

animate(p, nframes = 200, end_pause = 20,
        height = 1000, width = 1250, res = 120)

< !-- -->

It takes a little bit of tweaking of the parameters to get a readable result. Check out the different parameters for ?animate if you want to tweak your animations a little more. I have a tip for you: If you’re still tweaking your plot parameters like the dimensions or margins, you might want to set the nframes parameter to a lower number like 50 or even 20. This rapidly accelerates the rendering of your animation.

A few things are worth mentioning when I look at this plot:

Do you see how I discovered Scott Matthew only in the end of 2013? He quickly made it into my top ten artists with a rapid sprint of scrobbles.
Radiohead, Prince, and PJ Harvey were a lonely group at the top for quite some time. Then, Menomena entered the picture and joined this top group. Later, Beck also joins. At the end of 2018, with me discovering the album “Colors”, he takes second place after briefly being my top artist.
Janelle Monáe hasn’t been in my top ten artists up until 2018. Then, she released “Dirty Computer”. I listened to this album a lot and she made a final sprint into the final top ten artists.
That’s maybe the biggest flaw of this animation: Only the final top ten artists are included. The plot does not show the top ten artists over time. I would love to see the development of the top ten artists through time, maybe with greyed out lines whenever an artist leaves the top ten. For that, however, I would have to learn a lot more of {gganimate}.

Alright, that’s it. See you next time.

To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.