
Downloading and Analyzing CD1025’s Playlist

[This article was first published on Statistically Significant, and kindly contributed to R-bloggers.]
CD1025 is an “alternative” radio station here in Columbus. They are one of the few remaining independently owned radio stations, and they take great pride in it. For data nerds like me, they also put a real-time list of recently played songs on their website. The page shows the 50 most recently played songs, but you can also click on “Older Tracks” to go back in time. When you do this, the URL ends in “now-playing/?start=50”. If you go back again, it ends in “now-playing/?start=100”.

Using this structure, I decided to see if I could download all of their historical data and see how far it goes back. In the code below, I use the XML package to go to the website and download the 50 songs and then increment the number by 50 to find the previous 50 songs. I am telling the code to keep doing this until I get to January 1, 2012.
library(ggplot2)
theme_set(theme_bw())
library(XML)
library(lubridate)
library(sqldf)
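startNum = 0
while (TRUE) {
    theurl <- paste0("http://cd1025.com/about/playlists/now-playing/?start=", 
        startNum)
    tables <- readHTMLTable(theurl, stringsAsFactors = FALSE)
    # Stop if the page has no table left (the archive may end before the cutoff)
    if (length(tables) == 0 || NROW(tables[[1]]) == 0) {
        break
    }
    table <- tables[[1]]
    # Drop the first column and append this page to the running playlist
    if (startNum == 0) {
        playlist = table[, -1]
    } else {
        playlist = rbind(playlist, table[, -1])
    }
    # The date is the last 10 characters of the Last.Played column
    dt = mdy(substring(table[1, 4], nchar(table[1, 4]) - 9, nchar(table[1, 4])))
    print(dt)
    # Stop once the page reaches songs played before January 1, 2012
    if (dt < mdy("1/1/12")) {
        break
    }
    startNum = startNum + 50  # Move on to the previous 50 songs
}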

playlist = unique(playlist)  # Remove Dupes

write.csv(playlist, "CD101Playlist.csv", row.names = FALSE)
This takes a while to run, and the resulting data set is fairly large: my file has over 150,000 songs. If you only want a little data, change the cutoff date to last week or so. The first thing I will do is parse the dates and times of the songs, order them, and look at the first few songs. You can see that the data only goes back to March of 2012.
dates = mdy(substring(playlist[, 3], nchar(playlist[, 3]) - 9, nchar(playlist[, 
    3])))
times = hm(substring(playlist[, 3], 1, nchar(playlist[, 3]) - 10))
playlist$Month = ymd(paste(year(dates), month(dates), "1", sep = "-"))
playlist$Day = dates
playlist$Time = times
playlist = playlist[order(playlist$Day, playlist$Time), ]
head(playlist)

##                     Artist                Song       Last.Played
## 151638 DEATH CAB FOR CUTIE   YOU ARE A TOURIST 12:34am03/01/2012
## 151637       SLEEPER AGENT          GET BURNED 12:38am03/01/2012
## 151636          WASHED OUT           AMOR FATI 12:41am03/01/2012
## 151635            COLDPLAY       CHARLIE BROWN 12:45am03/01/2012
## 151634           GROUPLOVE         TONGUE TIED 12:49am03/01/2012
## 151633               SUGAR YOUR FAVORITE THING 12:52am03/01/2012
##             Month        Day   Time
## 151638 2012-03-01 2012-03-01 34M 0S
## 151637 2012-03-01 2012-03-01 38M 0S
## 151636 2012-03-01 2012-03-01 41M 0S
## 151635 2012-03-01 2012-03-01 45M 0S
## 151634 2012-03-01 2012-03-01 49M 0S
## 151633 2012-03-01 2012-03-01 52M 0S
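The date and time can also be parsed together into a single date-time value, which keeps the am/pm handling explicit and gives one column to sort on. Here is a minimal sketch using base R's `strptime()`. It assumes every string looks like `12:34am03/01/2012` and that the session locale recognizes `am`/`pm` for `%p`; the `last_played` vector is made-up sample data in the scraped format, not the real playlist.

```r
# Sample strings in the same format as the Last.Played column (illustrative only)
last_played <- c("12:34am03/01/2012", "1:05pm03/01/2012", "11:59pm12/31/2012")

# %I = hour on a 12-hour clock, %p = am/pm marker, followed by month/day/year
played_at <- as.POSIXct(strptime(last_played, "%I:%M%p%m/%d/%Y", tz = "UTC"))

# A single date-time vector sorts chronologically without a separate Time column
sort(played_at)
format(played_at, "%Y-%m-%d %H:%M")
```

If any value comes back as `NA`, the format string or the locale assumption does not hold for that row and is worth inspecting by hand.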
Using the sqldf package, I can easily see what the most played artists and songs are from the data I scraped.
sqldf("Select Artist, Count(Artist) as PlayCount
       From playlist
       Group By Artist
       Order by PlayCount DESC
       Limit 10")

##                   Artist PlayCount
## 1      SILVERSUN PICKUPS      2340
## 2         THE BLACK KEYS      2203
## 3                   MUSE      1988
## 4              THE SHINS      1885
## 5    OF MONSTERS AND MEN      1753
## 6            PASSION PIT      1552
## 7              GROUPLOVE      1544
## 8  RED HOT CHILI PEPPERS      1514
## 9                 METRIC      1495
## 10          ATLAS GENIUS      1494


sqldf("Select Artist, Song, Count(Song) as PlayCount
      From playlist
      Group By Artist, Song
      Order by PlayCount DESC
      Limit 10")

##                 Artist                    Song PlayCount
## 1          PASSION PIT             TAKE A WALK       828
## 2    SILVERSUN PICKUPS                PIT, THE       825
## 3         ATLAS GENIUS                 TROJANS       819
## 4        WALK THE MOON                ANNA SUN       742
## 5       THE BLACK KEYS LITTLE BLACK SUBMARINES       736
## 6          DIVINE FITS  WOULD THAT NOT BE NICE       731
## 7        THE LUMINEERS                  HO HEY       722
## 8       CAPITAL CITIES          SAFE AND SOUND       712
## 9  OF MONSTERS AND MEN          MOUNTAIN SOUND       711
## 10               ALT J            BREEZEBLOCKS       691
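For readers who would rather not go through SQL, roughly the same counts can be obtained with base R. The sketch below is only an illustration: `toy` is a tiny made-up data frame that borrows the `Artist` and `Song` column names from the scraped data, and `artist_counts` and `song_counts` are names introduced here.

```r
# Tiny made-up playlist with the same column names as the scraped data
toy <- data.frame(
    Artist = c("MUSE", "MUSE", "METRIC", "MUSE", "METRIC", "COLDPLAY"),
    Song = c("MADNESS", "MADNESS", "BREATHING UNDERWATER", "UPRISING",
        "BREATHING UNDERWATER", "CHARLIE BROWN"),
    stringsAsFactors = FALSE
)

# Plays per artist, most played first (like Group By Artist, Order by PlayCount DESC)
artist_counts <- sort(table(toy$Artist), decreasing = TRUE)
head(artist_counts, 10)

# Plays per artist/song pair (like Group By Artist, Song): sum a column of ones
song_counts <- aggregate(PlayCount ~ Artist + Song,
    data = cbind(toy, PlayCount = 1), FUN = sum)
song_counts <- song_counts[order(-song_counts$PlayCount), ]
head(song_counts, 10)
```

On the real data, `toy` would be replaced by `playlist`.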
I am a little surprised that Silversun Pickups are the number one band, but everyone on the list makes sense. Looking at how the plays of the top artists have varied from month to month, you can see a few patterns. Muse has been more popular recently and The Shins and Grouplove have lost some steam.
artist.month=sqldf("Select Month, Artist, Count(Song) as Num
      From playlist
      Group By Month, Artist
      Order by Month, Artist")
artist=sqldf("Select Artist, Count(Artist) as Num
      From playlist
      Group By Artist
      Order by Num DESC")
p = ggplot(subset(artist.month, Artist %in% head(artist$Artist, 8)), aes(Month, Num))
p + geom_bar(stat = "identity", aes(fill = Artist), position = "fill", colour = "grey") + 
    labs(y = "Percentage of Plays")

For the play count of the top artists, I see some odd numbers in June and July of 2012. The number of plays went way down.
p + geom_area(aes(fill = Artist), position = "stack", colour = 1) + labs(y = "Number of Plays")


Looking into this further, I plotted the date and time that each song was played against the cumulative number of songs played since the beginning of the list. The plot should be a line with a constant slope, meaning that the plays per day are relatively constant. You can see that in June and July of 2012 there are flat spots where there is no playlist history.
qplot(playlist$Day + playlist$Time, 1:length(dates), geom = "path")

There are also smaller flat spots in September and December, but I will treat those as small enough not to affect any further analyses. Based on this, I am going to use only the data from August 2012 to the present.
playlist = subset(playlist, Day >= mdy("8/1/12"))
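The flat spots can also be located numerically rather than by eye, by looking at the time between consecutive plays. This is a minimal sketch under stated assumptions: `played_at` holds made-up play times (on the real data it could be built from `playlist$Day + playlist$Time`), and the six-hour threshold is an arbitrary choice for illustration.

```r
# Made-up play times: a steady stream with one long gap (illustrative only)
played_at <- as.POSIXct(c("2012-06-01 10:00", "2012-06-01 10:04",
    "2012-06-01 10:08", "2012-06-03 09:00", "2012-06-03 09:03"), tz = "UTC")

# Minutes between consecutive plays, after sorting chronologically
played_at <- sort(played_at)
n <- length(played_at)
gap_min <- as.numeric(difftime(played_at[-1], played_at[-n], units = "mins"))

# Flag gaps longer than the threshold (6 hours here, an assumption)
threshold_min <- 6 * 60
is_gap <- gap_min > threshold_min
gaps <- data.frame(
    gap_start = played_at[-n][is_gap],
    gap_end = played_at[-1][is_gap],
    hours = gap_min[is_gap] / 60
)
gaps
```

Each row of `gaps` marks a stretch with no recorded plays, which corresponds to a flat spot in the cumulative plot.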
Next up, I am going to use this data to analyze the plays of artists from Summerfest, and try to infer if the play counts varied once they were added to the bill.
