Peace through Music. Country clustering using R and the last.fm API
[This article was first published on Rcrastinate, and kindly contributed to R-bloggers].
last.fm is an internet radio and music suggestion service. Registered users can also use last.fm to ‘scrobble’ tracks they’ve been listening to. last.fm then keeps track of a user’s statistics in terms of top artists, albums and tracks.
Luckily, last.fm also has an API which is accessible as soon as you get a key for it. Thanks to this API, there are a lot of cool web-based applications for last.fm.
Today, I want to show you a few little things we can do with this API using R. I used (and modified) the R package RLastFM by Greg Hirson (thanks again, Greg!) to access the API and get the information.
I had the idea to group countries based on the listening habits (‘scrobbles’) of the people living there. Hierarchical clustering is the way to go here, I guess. As a distance, we can use one minus the share of overlapping artists in the top 50 artists of two countries: the more artists two countries share, the closer they are.
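To make the distance idea concrete, here is a minimal sketch with two made-up top lists. The artist names, the list length of 5 and the helper name overlap.distance are my own assumptions for illustration, not last.fm data or part of RLastFM. The overlap count is a similarity, so it is turned into a distance as one minus the share of shared artists.

```r
# Toy top lists; the names are placeholders, not real chart data.
top.a <- c("Artist 1", "Artist 2", "Artist 3", "Artist 4", "Artist 5")
top.b <- c("Artist 3", "Artist 4", "Artist 5", "Artist 6", "Artist 7")

# Overlap distance: 0 means identical lists, 1 means no shared artist.
# n is the length of the top lists (hypothetical helper, not from RLastFM).
overlap.distance <- function (x, y, n = length(x)) {
  1 - length(intersect(x, y)) / n
}

# Three of the five artists are shared, so the distance is 1 - 3/5.
d.ab <- overlap.distance(top.a, top.b)
print(d.ab)
```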
First, we will need a function to access the API. This is just a convenience wrapper around the functions by Greg Hirson, which already work great.
library(RLastFM)
get.country.artists <- function (country) {
geo.getTopArtists(country)$artist }
Now, we select some countries (I selected all OECD countries; that’s kind of arbitrary, but it’s a start). Note that the country names are defined by the ISO 3166-1 country names standard.
oecd.countries <- c("Belgium", "Denmark", "Germany", "France", "Greece", "Ireland", "Iceland", "Italy", "Canada", "Luxembourg", "Netherlands", "Norway", "Austria", "Portugal", "Sweden", "Switzerland", "Spain", "Turkey", "USA", "United Kingdom", "Japan", "Finland", "Australia", "New Zealand", "Mexico", "Czech Republic", "Korea, Republic of", "Hungary", "Poland", "Slovakia", "Chile", "Slovenia", "Israel", "Estonia")
Now, I access the last.fm API and put the results into a list.
countries <- sort(oecd.countries)
art.list <- list()
for (coun in countries) {
cat(coun, "\n")
art.list[[coun]] <- get.country.artists(coun) }
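The loop above makes one API request per country, so a single failed request stops the whole run. A minimal sketch of a more forgiving loop, assuming you are fine with simply leaving out countries that fail: safe.collect and its fetch argument are hypothetical names of mine, and fake.fetch is a stand-in so the example runs without network access or an API key.

```r
# Collect one result per country. If a request fails, a message is printed
# and that country is left out of the result instead of stopping the loop.
# `fetch` is a hypothetical argument: any function that takes a country name.
safe.collect <- function (countries, fetch) {
  out <- list()
  for (coun in countries) {
    # Assigning NULL to a missing list element adds nothing to the list.
    out[[coun]] <- tryCatch(fetch(coun), error = function (e) {
      message("Skipping ", coun, ": ", conditionMessage(e))
      NULL
    })
  }
  out
}

# Stand-in for the API call; it fails on purpose for one made-up country.
fake.fetch <- function (coun) {
  if (coun == "Atlantis") stop("no such country")
  paste(coun, "artist", 1:3)
}

res <- safe.collect(c("Germany", "Atlantis", "France"), fake.fetch)
print(names(res))
```

With the real API, the call would be safe.collect(countries, get.country.artists). Later steps should then loop over names(art.list) rather than countries, because failed countries are missing from the list.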
Afterwards, we need to create a distance matrix based on the number of overlapping artists of two countries. First, I define a function to intersect two artist lists:
intersect.countries <- function (country1.artists, country2.artists) {
length(intersect(country1.artists, country2.artists)) }
Now, I use the function on every possible pair of countries, write the results into a matrix and convert this matrix into a distance matrix.
result.mat <- c()
for (coun in countries) {
new.vec <- c()
for (i in 1:length(countries)) {
new.dist <- 1 - (intersect.countries(art.list[[coun]], art.list[[countries[i]]]) / 50)
new.vec <- c(new.vec, new.dist) }
result.mat <- rbind(result.mat, new.vec) }
colnames(result.mat) <- countries
rownames(result.mat) <- countries
dists <- as.dist(result.mat, diag = T, upper = T)
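The same matrix can also be built without growing vectors in a loop. This is only a sketch of an alternative, not what I actually ran: toy.list and n.top are made-up stand-ins for art.list and the list length of 50, so the example runs without the API.

```r
# Toy stand-in for art.list: three hypothetical countries, four artists each.
toy.list <- list(
  A = c("a1", "a2", "a3", "a4"),
  B = c("a1", "a2", "b3", "b4"),
  C = c("c1", "c2", "c3", "c4")
)
n.top <- 4  # length of each top list (50 in the post)

# Nested sapply over the named list gives a square matrix
# whose row and column names are the country names.
toy.mat <- sapply(toy.list, function (x) {
  sapply(toy.list, function (y) 1 - length(intersect(x, y)) / n.top)
})

# as.dist keeps the lower triangle, which is all hclust needs.
toy.dists <- as.dist(toy.mat)
print(toy.mat)
```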
Now, I’m doing the hierarchical clustering. I’m choosing the Ward method.
# In R 3.1.0 and later, the method called "ward" here is named "ward.D"
dists.clust <- hclust(dists, method = "ward.D")
And now for the plot (finally!)…
plot(dists.clust, main = "Clustering Dendrogram, Method: Ward", xlab = "Similarities based on number of overlapping artists in top 50 artists", sub = "", cex = 0.9)
It makes sense, doesn’t it? Countries with many overlapping artists in the top 50 share one branch of the clustering tree. Other groups of countries are ‘clustering in’ later. In the right-most branch, large portions of Scandinavia (except Iceland) are clustering together. For some countries, I don’t have an explanation (Iceland and Portugal?).
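If you would rather read the groups off as a list than squint at the tree, cutree() from base R cuts a clustering into a given number of groups. A minimal sketch on made-up distances: the five letter-named ‘countries’, their distances and the choice of k = 2 are all assumptions for illustration, and "ward.D" is the name used since R 3.1.0 for the Ward variant chosen above.

```r
# Toy distances between five hypothetical countries (not last.fm data).
toy <- matrix(c(0.0, 0.1, 0.2, 0.9, 0.8,
                0.1, 0.0, 0.2, 0.9, 0.9,
                0.2, 0.2, 0.0, 0.8, 0.9,
                0.9, 0.9, 0.8, 0.0, 0.1,
                0.8, 0.9, 0.9, 0.1, 0.0),
              nrow = 5, byrow = TRUE,
              dimnames = list(LETTERS[1:5], LETTERS[1:5]))

toy.clust <- hclust(as.dist(toy), method = "ward.D")

# Cut the tree into two groups; the result is a named vector of group ids.
groups <- cutree(toy.clust, k = 2)

# List the members of each group.
print(split(names(groups), groups))
```

On the real data, the call would be cutree(dists.clust, k = ...), with k picked by looking at the dendrogram.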
Currently, I’m experimenting with some visualization techniques using the nice R maps package.
last.fm also supplies metro charts, that is, separate charts for specific cities. Let’s play around with them. First, we’re going to need some new functions (these are adaptations from the RLastFM package, and you’re going to need to insert your own API key to make them work).
# "YOUR_API_KEY" below is a placeholder: insert your own last.fm API key.
get.all.metros <- function (country, lastapi = RLastFM:::baseurl) {
  xpathSApply(xmlParse(getForm(lastapi, method = "geo.getMetros", country = country, api_key = "YOUR_API_KEY"), asText = T), "//metro/name", xmlValue) }
# Parse the XML response into artist names and listener counts
p.geo.getMetroArtistChart <- function (f) {
  doc = xmlParse(f, asText = T)
  list(artist = xpathSApply(doc, "//artist/name", xmlValue),
       playcount = xpathSApply(doc, "//artist/listeners", xmlValue)) }
get.metro.artist <- function (metro, country = "germany", n = 100) {
  p.geo.getMetroArtistChart(
    getForm(RLastFM:::baseurl,
            method = "geo.getMetroArtistChart",
            country = country,
            metro = metro,
            limit = n,
            api_key = "YOUR_API_KEY")) }
Now, let’s use them to extract all metros supported in Germany and France. Afterwards, build two lists with metro charts.
de.metros <- get.all.metros(country = "germany")
fr.metros <- get.all.metros(country = "france")
build.metro.chart.list <- function (metros, country) {
metro.chart.list <- list()
for (metro in metros) {
cat(metro, "\n")
metro.chart.list[[metro]] <- get.metro.artist(metro, country = country) }
metro.chart.list }
de.metro.charts <- build.metro.chart.list(de.metros, "germany")
fr.metro.charts <- build.metro.chart.list(fr.metros, "france")
Now, load the maps package and the dataset of cities that comes with it. Then, draw Germany and France.
library(maps)
data(world.cities)
map(database = "world", regions = c("Germany", "France"), exact = T)
Here comes the fun part: look up each metro in world.cities and write the top artist of each metro at the location of the city (under the city’s name). Please note that there are two Frankfurts and two Lilles in world.cities. I have to select the correct ones.
for (city in names(de.metro.charts)) {
  city.info <- world.cities[world.cities$name == city,]
  if (nrow(city.info) == 0) next  # skip metros that are missing from world.cities
  if (city.info$name[1] == "Frankfurt") city.info <- city.info[1,]
  text(x = city.info$long, y = city.info$lat, labels = city.info$name, cex = .6)
  text(x = city.info$long, y = city.info$lat - 0.25,
       labels = de.metro.charts[[city]]$artist[1],
       col = "#FF0000FF", cex = .6)
}
for (city in names(fr.metro.charts)) {
  city.info <- world.cities[world.cities$name == city,]
  if (nrow(city.info) == 0) next  # skip metros that are missing from world.cities
  if (city.info$name[1] == "Lille") city.info <- city.info[2,]
  text(x = city.info$long, y = city.info$lat, labels = city.info$name, cex = .6)
  text(x = city.info$long, y = city.info$lat - 0.25,
       labels = fr.metro.charts[[city]]$artist[1],
       col = "#FF0000FF", cex = .6)
}
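Picking the right Frankfurt and Lille by row number is a bit fragile. One possible alternative, sketched here under assumptions: filter by country and take the most populous match. The toy data frame only mimics the name, country.etc, pop, lat and long columns of world.cities, all its values are invented, and pick.city is a hypothetical helper. Whether the largest city is really the one last.fm means is also an assumption.

```r
# Toy stand-in for world.cities with the same column names; all values invented.
toy.cities <- data.frame(
  name = c("Springfield", "Springfield", "Springfield", "Shelbyville"),
  country.etc = c("Freedonia", "Freedonia", "Sylvania", "Freedonia"),
  pop = c(500000, 20000, 900000, 80000),
  lat = c(50.1, 52.3, 40.0, 51.0),
  long = c(8.7, 14.5, 3.0, 9.0),
  stringsAsFactors = FALSE
)

# Return the most populous city with this name in this country,
# or a zero-row data frame if nothing matches.
pick.city <- function (cities, city, country) {
  hits <- cities[cities$name == city & cities$country.etc == country, ]
  hits <- hits[order(hits$pop, decreasing = TRUE), ]
  hits[seq_len(min(1, nrow(hits))), ]
}

best <- pick.city(toy.cities, "Springfield", "Freedonia")
print(best)
```

In the loops above, this would replace the name lookup and the special cases, for example pick.city(world.cities, city, "Germany").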
So much for today. I’m too shocked by Coldplay all over Germany to go on 🙂