Visualising Wikipedia search statistics with R
I have been playing with R to parse HTML. After reading about visualising "fantasy football" search traffic with RGoogleTrends at The Log Cabin blog, I decided to write a few functions that do something similar with Wikipedia search statistics.
This is what I have managed to come up with:
wikiStat <- function(query, lang = 'en', monback = 12, since = Sys.Date()) {
  #load packages
  require(mondate)
  require(XML)
  namespace <- c("a" = "http://www.w3.org/1999/xhtml")
  wikidata <- data.frame()
  #iterate "monback" number of months back
  for (i in 1:monback) {
    #get number of days in a given month and create a vector
    curdate <- strptime(mondate(since) - (i - 1), "%Y-%m-%d")
    previous <- strptime(mondate(since) - (i - 2), "%Y-%m-%d")
    noofdays <- round(as.numeric(previous - curdate), 0)
    days <- seq(from = 1, to = noofdays, by = 1)
    #build url
    if (curdate$mon + 1 < 10) {
      dateurl <- paste(as.character(curdate$year + 1900), "0",
                       as.character(curdate$mon + 1), sep = "")
    } else {
      dateurl <- paste(as.character(curdate$year + 1900),
                       as.character(curdate$mon + 1), sep = "")
    }
    url <- paste("http://stats.grok.se/", lang, '/', dateurl, '/', query, sep = "")
    #get and parse a wikipedia statistics webpage
    wikitree <- xmlTreeParse(url, useInternalNodes = TRUE)
    #find nodes specifying traffic
    traffic <- xpathSApply(wikitree, "//a:li[@class='sent bar']/a:p",
                           xmlValue, namespaces = namespace)
    #edit obtained strings (sometimes it's in the format
    #of e.g. 7.5k meaning 7500)
    traffic <- gsub("\\.", "", traffic)
    traffic <- gsub("k", "00", traffic)
    traffic <- as.numeric(traffic)
    #it seems that there is some kind of a bug in wikipedia statistics
    #and the results are shifted by one day in month - this is a fix
    if (length(traffic) > noofdays) {
      traffic <- traffic[2:length(traffic)]
    }
    #create daily dates relating to traffic vector
    #and create a dataframe
    days <- seq(from = 1, to = length(traffic), by = 1)
    yearmon <- rep(paste(curdate$year + 1900, curdate$mon + 1, sep = "-"),
                   length(traffic))
    date <- as.Date(paste(yearmon, days, sep = "-"), "%Y-%m-%d")
    wikidata <- rbind(wikidata, data.frame(date, traffic))
  }
  #remove rows that are missing (due to the bug?)
  wikidata <- wikidata[!is.na(wikidata$date), ]
  #return dataframe
  return(wikidata)
}

wikiPlotStat <- function(wikitraffic, title = "Wikipedia statistics") {
  require(ggplot2)
  #create a plot
  wikiplot <- ggplot() +
    geom_bar(aes(x = date, y = traffic, fill = traffic),
             stat = "identity", data = wikitraffic) +
    opts(title = title)
  #...with no legend and a theme that fits colours of my blog ;)
  wikiplot <- wikiplot + theme_bw() + opts(legend.position = "none")
  return(wikiplot)
}
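A side note on the gsub() trick: it handles values such as "7.5k" (dot removed, "k" replaced by "00" gives 7500), but a value without a decimal point, say "12k", would come out as 1200 rather than 12000. If you want to be defensive about that, something along these lines could be used instead (a minimal sketch; parseK is just an illustrative helper, not part of the functions above):

#convert strings like "7.5k" or "340" to numbers, multiplying "k" values by 1000
parseK <- function(x) {
  isk <- grepl("k$", x)
  num <- as.numeric(sub("k$", "", x))
  num[isk] <- num[isk] * 1000
  num
}
parseK(c("7.5k", "12k", "340"))  # 7500 12000 340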
With these two functions you can look at the search traffic for any article you wish. For instance, let's examine the search statistics for "Financial crisis". The wikiStat() function returns a data frame with the necessary data:
#look 40 months back from now
critraffic <- wikiStat("Financial_crisis", monback = 40)
To plot the data easily we can use the second function:
criplot <- wikiPlotStat(critraffic, "Wikipedia search traffic for 'Financial crisis'")
criplot
And this is the result:
You can clearly see the outbreak of the crisis in the second half of 2008, when Lehman Brothers collapsed. Since then, people still seem keen to read up on the crisis.
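If you want to mark the Lehman Brothers collapse (15 September 2008) directly on the chart, a vertical line does the job (a minimal sketch, reusing criplot from above):

criplot + geom_vline(xintercept = as.numeric(as.Date("2008-09-15")),
                     linetype = "dashed")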
Do you have any suggestions?