[This article was first published on Statistically Significant, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
CD1025 is an “alternative” radio station here in Columbus. They are one of the few remaining radio stations that are independently owned and they take great pride in it. For data nerds like me, they also put a real-time list of recently played songs on their website. The page has the most recent 50 songs played, but you can also click on “Older Tracks” to go back in time. When you do this, the URL ends with “now-playing/?start=50”. If you go back again, it ends with “now-playing/?start=100”.
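As a minimal sketch of that URL pattern, the snippet below builds the addresses of the first few pages. The base URL is the one used in the scraping code later in this post; the helper name page_url is hypothetical and only for illustration. It formats the offset with %d so that large offsets are never written in scientific notation.

```r
# Base URL as it appears in the scraping code of this post
base_url <- "http://cd1025.com/about/playlists/now-playing/?start="

# Hypothetical helper (not part of the original post): URL for a given offset.
# sprintf("%d") keeps the offset as plain digits, even for values like 100000.
page_url <- function(start) sprintf("%s%d", base_url, as.integer(start))

# Offsets of the first three pages: 0, 50, 100
offsets <- seq(0, by = 50, length.out = 3)
urls <- page_url(offsets)
print(urls)
```

Each page holds 50 songs, so page k (counting from zero) starts at offset 50 * k.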
Using this structure, I decided to see if I could download all of their historical data and see how far it goes back. In the code below, I use the XML package to go to the website and download the 50 songs and then increment the number by 50 to find the previous 50 songs. I am telling the code to keep doing this until I get to January 1, 2012.
library(ggplot2)
theme_set(theme_bw())
library(XML)
library(lubridate)
library(sqldf)

startNum = 0
while (TRUE) {
  # format() keeps large offsets as plain digits (paste0 alone would turn 100000 into "1e+05")
  theurl <- paste0("http://cd1025.com/about/playlists/now-playing/?start=",
                   format(startNum, scientific = FALSE))
  # The first table on the page holds the 50 songs
  table <- readHTMLTable(theurl, stringsAsFactors = FALSE)[[1]]
  # Drop the first column and append this page to the running playlist
  if (startNum == 0) {
    playlist = table[, -1]
  } else {
    playlist = rbind(playlist, table[, -1])
  }
  # The last 10 characters of the Last Played field are the date (mm/dd/yyyy)
  dt = mdy(substring(table[1, 4], nchar(table[1, 4]) - 9, nchar(table[1, 4])))
  print(dt)
  # Stop once the page reaches back before January 1, 2012
  if (dt < mdy("1/1/12")) {
    break
  }
  startNum = startNum + 50
}
playlist = unique(playlist)  # Remove Dupes
write.csv(playlist, "CD101Playlist.csv", row.names = FALSE)

This takes a while to run, and the resulting file is fairly large. My file has over 150,000 songs. If you want just a little data, change the date to last week or so. The first thing I will do is parse the dates and times of the songs, order them, and look at the first few songs. You can see that the data only goes back to March of 2012.
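To make the string handling in the parsing step concrete, here is a small base-R sketch of how a single Last.Played value splits into its time and date parts. The sample value is taken from the output shown in this post; the variable names are only illustrative, and as.Date stands in for lubridate's mdy so the snippet runs without extra packages.

```r
# One Last.Played value, as it appears in the scraped table
stamp <- "12:34am03/01/2012"
n <- nchar(stamp)

# The date is always the last 10 characters (mm/dd/yyyy)
date_part <- substring(stamp, n - 9, n)

# Everything before the date is the time of day
time_part <- substring(stamp, 1, n - 10)

# Base-R stand-in for mdy(): parse the date part explicitly
day <- as.Date(date_part, format = "%m/%d/%Y")

print(date_part)
print(time_part)
print(day)
```

Counting from the end of the string works because the date has a fixed width, while the time can be either 6 or 7 characters long (for example 1:05pm versus 12:34am).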
dates = mdy(substring(playlist[, 3], nchar(playlist[, 3]) - 9, nchar(playlist[, 3])))
times = hm(substring(playlist[, 3], 1, nchar(playlist[, 3]) - 10))
playlist$Month = ymd(paste(year(dates), month(dates), "1", sep = "-"))
playlist$Day = dates
playlist$Time = times
playlist = playlist[order(playlist$Day, playlist$Time), ]
head(playlist)

##                     Artist                Song       Last.Played
## 151638 DEATH CAB FOR CUTIE   YOU ARE A TOURIST 12:34am03/01/2012
## 151637       SLEEPER AGENT          GET BURNED 12:38am03/01/2012
## 151636          WASHED OUT           AMOR FATI 12:41am03/01/2012
## 151635            COLDPLAY       CHARLIE BROWN 12:45am03/01/2012
## 151634           GROUPLOVE         TONGUE TIED 12:49am03/01/2012
## 151633               SUGAR YOUR FAVORITE THING 12:52am03/01/2012
##             Month        Day   Time
## 151638 2012-03-01 2012-03-01 34M 0S
## 151637 2012-03-01 2012-03-01 38M 0S
## 151636 2012-03-01 2012-03-01 41M 0S
## 151635 2012-03-01 2012-03-01 45M 0S
## 151634 2012-03-01 2012-03-01 49M 0S
## 151633 2012-03-01 2012-03-01 52M 0S

Using the sqldf package, I can easily see what the most played artists and songs are from the data I scraped.
sqldf("Select Artist, Count(Artist) as PlayCount From playlist Group By Artist Order by PlayCount DESC Limit 10")

##                   Artist PlayCount
## 1      SILVERSUN PICKUPS      2340
## 2         THE BLACK KEYS      2203
## 3                   MUSE      1988
## 4              THE SHINS      1885
## 5    OF MONSTERS AND MEN      1753
## 6            PASSION PIT      1552
## 7              GROUPLOVE      1544
## 8  RED HOT CHILI PEPPERS      1514
## 9                 METRIC      1495
## 10          ATLAS GENIUS      1494

sqldf("Select Artist, Song, Count(Song) as PlayCount From playlist Group By Artist, Song Order by PlayCount DESC Limit 10")

##                 Artist                    Song PlayCount
## 1          PASSION PIT             TAKE A WALK       828
## 2    SILVERSUN PICKUPS                PIT, THE       825
## 3         ATLAS GENIUS                 TROJANS       819
## 4        WALK THE MOON                ANNA SUN       742
## 5       THE BLACK KEYS LITTLE BLACK SUBMARINES       736
## 6          DIVINE FITS  WOULD THAT NOT BE NICE       731
## 7        THE LUMINEERS                  HO HEY       722
## 8       CAPITAL CITIES          SAFE AND SOUND       712
## 9  OF MONSTERS AND MEN          MOUNTAIN SOUND       711
## 10               ALT J            BREEZEBLOCKS       691

I am a little surprised that Silversun Pickups are the number one band, but everyone on the list makes sense. Looking at how the plays of the top artists have varied from month to month, you can see a few patterns. Muse has been more popular recently and The Shins and Grouplove have lost some steam.
# Plays per artist per month
artist.month = sqldf("Select Month, Artist, Count(Song) as Num From playlist Group By Month, Artist Order by Month, Artist")
# Total plays per artist, most played first
artist = sqldf("Select Artist, Count(Artist) as Num From playlist Group By Artist Order by Num DESC")
# Keep only the eight most played artists
p = ggplot(subset(artist.month, Artist %in% head(artist$Artist, 8)), aes(Month, Num))
p + geom_bar(stat = "identity", aes(fill = Artist), position = "fill", colour = "grey") +
  labs(y = "Percentage of Plays")
For the play count of the top artists, I see some odd numbers in June and July of 2012. The number of plays went way down.
p + geom_area(aes(fill = Artist), position = "stack", colour = 1) + labs(y = "Number of Plays")
Looking into this further, I plotted the cumulative number of songs played since the beginning of the list against the date and time each song was played. If the plays per day are roughly constant, the plot should be a line with a constant slope. You can see in June and July of 2012 there are flat spots where there is no playlist history.
qplot(playlist$Day + playlist$Time, 1:length(dates), geom = "path")
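The same flat spots can also be located numerically by looking at the time between consecutive plays. The sketch below is only an illustration under stated assumptions: the timestamps are made up, and the 6-hour threshold is an arbitrary cutoff that is not from the original analysis.

```r
# Toy timestamps (made up for illustration), already sorted in play order
played <- as.POSIXct(c("2012-06-01 10:00", "2012-06-01 10:04",
                       "2012-06-01 10:08", "2012-06-03 09:00",
                       "2012-06-03 09:03"), tz = "UTC")

# Hours elapsed between each song and the next one
gap_hours <- as.numeric(diff(played), units = "hours")

# Assumed cutoff: anything longer than 6 hours counts as missing history
threshold <- 6
idx <- which(gap_hours > threshold)

# One row per gap: when it starts, when it ends, and how long it lasts
gaps <- data.frame(gap_start = played[idx],
                   gap_end = played[idx + 1],
                   hours = gap_hours[idx])
print(gaps)
```

With the real data, the same idea would apply to the sorted play times, for example playlist$Day + playlist$Time converted to a date-time, and the threshold could be tuned to separate ordinary pauses from real holes in the history.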
There are also smaller flat spots in September and December, but I am going to decide that those are small enough not to affect any further analyses. From this, I am going to use data only from August 2012 to present.
playlist = subset(playlist, Day >= mdy("8/1/12"))

Next up, I am going to use this data to analyze the plays of artists from Summerfest, and try to infer if the play counts varied once they were added to the bill.
To leave a comment for the author, please follow the link and comment on their blog: Statistically Significant.