Using ggplot2 to plot the last.fm top 100 albums

[This article was first published on R Psychologist » R, and kindly contributed to R-bloggers.]

I began by downloading the tab-separated (TSV) data file from last.fm and importing it into R.

# read data
lastfm <- read.delim("~/Downloads/bestof_2011_tsv/bestof_2011_releases.tsv")

Then I did some data cleanup, because one row just contained junk and some columns were unnecessary. I also removed all entries after row 100.

# remove row 541 because it's just junk
lastfm <- lastfm[-541, ]
# remove unnecessary columns (3 and 5)
lastfm <- lastfm[-c(3, 5)]
# keep only the first 100 rows
lastfm <- lastfm[1:100, ]

I searched for missing values, both literal "NULL" strings and real NAs, but found none.

which(lastfm == "NULL", arr.ind = TRUE)
which(is.na(lastfm), arr.ind = TRUE)
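
As a minimal sketch of what these two checks return, here is the same pair of calls on a small made-up data frame. The `toy` object and its values are hypothetical and not part of the last.fm data. Each call returns the row and column of every match, so an empty result means nothing was found.

```r
# Hypothetical toy data with one literal "NULL" string and one real NA
toy <- data.frame(
  artist = c("A", "NULL", "C"),
  plays  = c(10, 20, NA),
  stringsAsFactors = FALSE
)

# Positions (row, col) of cells holding the string "NULL"
null.cells <- which(toy == "NULL", arr.ind = TRUE)

# Positions (row, col) of cells that are truly missing
na.cells <- which(is.na(toy), arr.ind = TRUE)

# A quick yes/no answer for real missing values
any.na <- any(is.na(toy))
```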

The XML file contained information about each artist's location, so I loaded it and cleaned it up a bit. The location column was rather messy, so I edited it manually in Stata's data editor, which I figured was the easiest way. I then read the edited data file back into R and combined that data.frame with the rest of the data from the TSV file.

library(XML)
library(foreign) # provides write.dta() and read.dta()
last.xml <- xmlToDataFrame("~/Downloads/bestof_2011_xml/bestof_2011_releases.xml")
# keep only the first 100 rows
last.xml <- last.xml[1:100, ]
# drop the columns that aren't needed
last.xml <- last.xml[-c(1,4,5,6,7,8,9)]
# export to Stata format for manual editing
write.dta(last.xml, "stata", version = 7L)

# read the edited Stata file back in
last.xml <- read.dta(file="/Users/Kris/stata.dta")
# combine data.frames
lastfm <- cbind(lastfm, location = last.xml$location)
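
Note that cbind() pairs rows purely by position, so it relies on both data frames listing the albums in the same order. If the order were ever in doubt, joining on a shared key would be safer. Here is a minimal sketch on made-up data, assuming both sources carry a common identifier (the `album.id` column and all values are hypothetical):

```r
# Hypothetical stand-ins for the TSV data and the edited XML data
plays <- data.frame(album.id = c(1, 2, 3),
                    album.plays = c(300, 200, 100))
places <- data.frame(album.id = c(3, 1, 2),
                     location = c("UK", "USA", "Canada"),
                     stringsAsFactors = FALSE)

# merge() matches rows on the key column instead of on row position
combined <- merge(plays, places, by = "album.id")
```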

I tried plotting this data.frame with ggplot2, but the location variable contained 17 countries, which made for a messy plot. I therefore chose to merge Sweden and Denmark into one group and to put the less common countries under the label “Other”.

lastfm$location <- as.character(lastfm$location)
lastfm$location[lastfm$location %in% c("Denmark", "Sweden")] <- "Sweden/Denmark"
lastfm$location[lastfm$location %in% c("Germany",
                                       "France","Paris","Australia",
                                       "New Zealand",
                                       "Iceland","Brazil", "Scotland",
                                       "Democratic Republic of the Congo",
                                       "Romania","Belgium",
                                       "Netherlands")] <- "Other"
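
An alternative to listing every country that should become "Other" is to list only the labels to keep and relabel everything else. Here is a small sketch on a made-up vector (the `loc` and `keep` names are mine and do not appear in the original script):

```r
# Hypothetical location vector
loc <- c("USA", "UK", "Sweden", "Iceland", "Canada", "Denmark", "Brazil")

# Merge the two Scandinavian countries first
loc[loc %in% c("Denmark", "Sweden")] <- "Sweden/Denmark"

# Everything not on the keep-list becomes "Other"
keep <- c("Canada", "Sweden/Denmark", "UK", "USA")
loc[!(loc %in% keep)] <- "Other"
```

The trade-off is that a misspelled or unexpected location ends up in "Other" silently, so it is worth checking the result with table(loc).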

I still wasn’t satisfied with the plot, because it wasn’t sorted by album plays. I tried quite a few different ways of sorting the data.frame before figuring out how to do it with reorder(), which reorders the factor levels of the artist names by their play counts.

lastfm$artist.name <- reorder(lastfm$artist.name, lastfm$album.plays)
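
To see why this works, here is a minimal sketch on a made-up data frame (the `toy` object is hypothetical). reorder() leaves the rows where they are and only changes the order of the factor levels, which is what ggplot2 uses to arrange a discrete axis.

```r
# Hypothetical toy data: three artists with different play counts
toy <- data.frame(artist = factor(c("A", "B", "C")),
                  plays  = c(20, 50, 10))

# Levels start out alphabetical
before <- levels(toy$artist)

# Reorder the levels by ascending play count
toy$artist <- reorder(toy$artist, toy$plays)
after <- levels(toy$artist)
```

With coord_flip(), the last level is drawn at the top, so ascending order puts the most played artist first.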

I wanted the axis for album plays to show readable numbers with thousands separators, so I created my own breaks and labels.

library(scales)
x.breaks <- cbreaks(
  c(0, max(lastfm$album.plays)), #range: 0 to album.plays max
  pretty_breaks(10), # 10 ticks
  labels = comma_format()) # create labels with commas, i.e. 10,000
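
The same idea can be sketched in base R without the scales package, which may help show what the breaks and labels are. The maximum used here is a made-up number, not the real last.fm value, and `plays.max`, `my.breaks` and `my.labels` are hypothetical names.

```r
# Hypothetical maximum play count
plays.max <- 1234567

# Roughly 10 evenly spaced, round break points covering 0 to the maximum
my.breaks <- pretty(c(0, plays.max), n = 10)

# Labels with a comma as thousands separator and no scientific notation
my.labels <- format(my.breaks, big.mark = ",", scientific = FALSE, trim = TRUE)
```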

I also used my own custom colors for the plot's legend, which I saved in a named vector before calling ggplot2.

location.color <- c("Canada" = "#7b8dbf",
                    "Other" = "#f97850",
                    "Sweden/Denmark" = "#df72b6",
                    "UK" = "#57b894",
                    "USA" = "#4a4a4a"
                    )
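
Because the colors are matched to the groups by name, a label that is spelled differently in the data would not pick up its intended color. A small sanity check, sketched here with a made-up `locations` vector standing in for lastfm$location:

```r
location.color <- c("Canada" = "#7b8dbf",
                    "Other" = "#f97850",
                    "Sweden/Denmark" = "#df72b6",
                    "UK" = "#57b894",
                    "USA" = "#4a4a4a")

# Hypothetical stand-in for lastfm$location
locations <- c("USA", "UK", "Other", "Canada", "Sweden/Denmark", "USA")

# Groups present in the data that have no color assigned
missing.colors <- setdiff(unique(locations), names(location.color))
```

An empty `missing.colors` means every group in the data has a color.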

Then, at last, I drew the plot with ggplot2.

library(ggplot2)
ggplot(lastfm, aes(artist.name, album.plays, fill = location)) +
  geom_bar(stat = "identity") +
  coord_flip() + # flip x and y
  xlab("Album Artist") +
  ylab("Album plays") +
  # Use the labels and breaks I defined earlier
  scale_y_continuous(breaks = x.breaks$breaks, labels = x.breaks$labels) +
  # Add a plot title
  ggtitle("Last.fm top 100 albums 2011") +
  theme(
    # Move the legend inside the plot to save space.
    legend.position = c(.85, .5),
    # Change its background to white.
    legend.background = element_rect(fill = "#ffffff")) +
  # Use my custom color scale which I defined earlier.
  scale_fill_manual("Artist homeland", values = location.color)


We can see that the plot is dominated by the USA and the UK, and that Adele and Lady Gaga got far more album plays than the rest. To count the albums per location, I used summary().

summary(as.factor(lastfm$location))

Which gave the following:

        Canada          Other Sweden/Denmark             UK            USA
             5             13              4             24             54
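
Since the list holds exactly 100 albums, these counts can also be read as percentages. For a list of any other length, the shares could be computed as in this sketch, which re-enters the counts above by hand (`counts` and `shares` are names I made up):

```r
# Counts copied from the summary() output above
counts <- c(Canada = 5, Other = 13, "Sweden/Denmark" = 4, UK = 24, USA = 54)

# Share of each location in percent, rounded to one decimal
shares <- round(100 * counts / sum(counts), 1)
```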
