Olympic predictions – from an R web service provider’s point of view
[This article was first published on rapporter, and kindly contributed to R-bloggers.]
Hello, world!
Back in July we read Markus Gesmann’s great blog post about a prediction for the 100m final in London. Soon we decided to create similar estimates for the forthcoming events and started to post our results on Facebook.

We would like to emphasise again that these kinds of extrapolated estimates are rather just for fun, and we also think that any sports fan could give more valid and even more reliable estimates about the results without any statistical knowledge. All our models simply assume that the winners' performance, and so the expected results, fit the historical data. So this blog post has nothing to do with causal models, nor would we encourage anyone to bet based on these kinds of estimates at the forthcoming Olympics; but it might be interesting to see and visualise the ever-improving results from year to year.
In this article we will show what the models look like and what kind of tools we used to build and visualise them, and we also provide a demo web application where anyone can compile a similar plot, with a decent amount of annotations, with a single click.
As you can see in the demo graph above, we built two models for every sport event: a so-called “simple” prediction, which stands for a log-linear model, and a “power” estimate, which stands for a non-linear one.
Of course R provides some really great tools, and databaseolympics.com has all kinds of historical data, so a plot like the one above can be put together in a few minutes. Still, we tried to build an environment that lets even non-R users benefit from all these resources, and that also gives professional R users an instrument with which they can concentrate on the code and do not have to deal with the wearisome process of editing, fine-tuning, finalising and re-thinking the style of the output.
We would ask all our kind readers without any interest in the technical details to check out our demo web application right now and also to leave some feedback below in the comment box to let us know what you would expect from a statistical tool running in your browser.
Hello, R users!
Before digging deeper into the models, you might be interested in our workflow; this article actually concentrates on the latter, as we are pretty sure you are familiar with the former 🙂

Most R users have gotten used to, or have developed, some kind of tools to export the results from their R console to a user-friendly format (pdf, HTML, odt, docx or xls, among others) that can be shared with clients, collaborators etc. They might have dedicated much time to creating custom templates and helper functions to speed up the dry part of the job, or may even still be puzzling over which of the wide variety of R packages to choose for this end.
We are aware of Sweave, Hmisc, xtable, R2HTML and hwriter, even odfWeave, ascii and brew, beside the Windows-specific libraries and others which we have not used in production (but which are also worth checking out on the Reproducible Research CRAN task view), and we have also seen the rise of knitr (although we started to work on this project well before we first heard of that great package back in January). Even so, we decided to introduce some new packages to fit our custom needs: to provide a way to run regular R commands with a pretty GUI and output in a browser.
Introducing rapport
Aleksandar Blagotić and Gergely Daróczi started to work on a package almost one and a half years ago to create reproducible “R templates”. These templates hold some R code with optional parameters, which can be specified in a simple R command.

Of course this sounds as if the templates were just custom R functions, but actually the template body is a Pandoc markdown text with any number of inline R commands (called “chunks”) specified between brew-like tags. The main difference from brew is that the chunks are automatically rendered as markdown, and the resulting output of a template can easily be exported to a bunch of formats (just to name a few: HTML, pdf, odt or docx). If you are interested, please check out our examples (do not forget to click on the “exports” tabs in the list), and for more details please see the above URL.
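To give a flavour of how such a template is run, here is a minimal sketch; the template name and the parameters below are illustrative rather than an exact transcript of the package's bundled templates:

library(rapport)
## run a template on a data set, passing the template's optional
## parameters as regular R arguments (names here are illustrative)
rapport('Descriptives', data = mtcars, var = 'hp')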
We started to build on the awesome ascii and evaluate packages with a bunch of document converter back-ends, but soon realised that we had to tweak those too much for our specific needs, so we decided to write our own “evaluate R call(s) and render the results in markdown” package.

Introducing pander
As stated above, we needed a package which can eval R commands with the following features:

- catching all messages, warnings and errors,
- catching the stdout (like capture.output does) while evaluating,
- storing the results of each R command (not per chunk!) in R objects,
- running each R command and not halting on the first error,
- rendering the results (be it a character vector, table, data.frame or an lm object, among others) in markdown,
- grabbing each plot (be it a so-called base plot, lattice or ggplot2) and saving it to disk (also: showing it to the user on demand and including it in the exported documents),
- and, last but not least, trying to unify the images based on user preferences across the major graph libraries (graphics, lattice and ggplot2).
evals (the base command of the package) can be considered quite stable and mature, and it even has a custom caching mechanism. evals takes a character vector of R commands to run as its main parameter, along with some optional arguments.
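For instance, a minimal sketch of evals in action; the element names in the comments reflect our reading of the package documentation:

library(pander)
## evaluate two commands: the first triggers a warning, the second is fine
res <- evals(c('as.numeric("foo")', 'mean(1:10)'))
res[[1]]$msg$warnings  ## the warning caught while running the first command
res[[2]]$result        ## the value of the second command: 5.5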
evals is frequently called in Pandoc.brew, which was the main purpose of pander: to provide a brew-like engine that results in markdown while keeping the great advantages of brew (running loops or printing parts of a report conditionally). Please check out the examples (and the exported documents at the bottom of each) to see the engine in action.
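Rendering one of the demo templates shipped with the package could look something like this (a sketch; the convert step assumes Pandoc is installed, and please see the package docs for the exact arguments):

library(pander)
## render a brew-flavoured markdown template to stdout
Pandoc.brew(system.file('examples', 'minimal.brew', package = 'pander'))
## or write the markdown to a file and convert it to HTML
Pandoc.brew(system.file('examples', 'minimal.brew', package = 'pander'),
    output = tempfile('report'), convert = 'html')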
Beside Pandoc.brew there is also an R5 reference class to create styled reports in the R console, based on an idea borrowed from the ascii package.

Back to the Olympics
Why did we write that lengthy introduction without any interesting source code? Because we used our packages to generate the above plot, with some chatty annotations, as shown in our demo (or follow the direct link to “100m Freestyle Men”). We encourage you to check that out before reading further.

Fetching data
First, we downloaded the required data set from databaseolympics.com, if not found in our local cache:

## specifying the sport event based on the above URL
u <- 'SWI-120'

## datafile store (caching not to bother the server)
dir.create(file.path(getwd(), 'reports', 'data'), recursive = TRUE, showWarnings = FALSE)
datafile <- file.path(getwd(), 'reports', 'data', u)

## build the URL
uri <- strsplit(u, '-', fixed = TRUE)[[1]]
url <- paste0('http://www.databaseolympics.com/sport/sportevent.htm?sp=', uri[1], '&enum=', uri[2])

## fetching data from databaseolympics.com if not cached
if (!file.exists(datafile)) {
    library(XML)
    d <- readHTMLTable(readLines(url, warn = FALSE), which = 2, header = TRUE)
    saveRDS(d, file = datafile)
} else {
    d <- readRDS(datafile)
}
Data transformations
Now we have all cells of the above HTML table saved to d, which is to be filtered in most cases (we need only the first result from each year, without any duplicates):

## defining events which seem to be invalid
dEvents <- tail(as.character(d$Event[d$Event != '']), 1)
dEvents.invalid <- which(d$Event != dEvents)

## removing events which seem to be invalid
if (length(dEvents.invalid) > 0)
    d <- d[-dEvents.invalid, ]

## just dealing with the winners ATM
golddata <- subset(d, Medal %in% "GOLD")

## removing duplicated rows (we just need the Results)
if (any(table(golddata['Year']) > 1))
    golddata <- golddata[-which(duplicated(golddata['Year'])), ]

## are we dealing with time results?
resultInTime <- any(grepl(':', golddata$Result))

## transforming data
d$Event <- as.character(d$Event)
golddata$Year <- as.numeric(as.character(golddata$Year))
years <- c(as.numeric(as.character(sort(unique(golddata$Year)))), 2012)
golddata$Result <- rms(as.character(golddata$Result), resultInTime)
rownames(golddata) <- NULL
Here we called a custom function (rms), which transforms the human-readable time format to seconds:
#' Revert pretty time printing
#' @param text character in \code{hours:minutes:seconds} format
#' @param transform if set to FALSE the input will be returned without any tweaks
#' @return numeric
rms <- function(text, transform = TRUE) {
    sapply(text, function(x) {
        if (transform) {
            x <- as.numeric(strsplit(x, ':')[[1]])
            if (length(x) == 1)
                return(x)
            if (length(x) == 2)
                return(x[1] * 60 + x[2])
            if (length(x) == 3)
                return(x[1] * 60^2 + x[2] * 60 + x[3])
        } else
            round(as.numeric(x), 2)
    }, USE.NAMES = FALSE)
}
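A quick check with an arbitrary value:

rms('1:51.8')  ## 111.8 (seconds)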
Fitting models
Building the models could not be easier in R. We decided to build a non-linear (power) model and a log-linear one, which fit the data set in most cases:

nonLin <- suppressWarnings(lm(Result ~ poly(Year, 4), data = golddata))
nonLin.predict <- suppressWarnings(predict(nonLin, newdata = data.frame(Year = years)))
logLin <- lm(log(Result) ~ Year, data = golddata)
logLin.predict <- exp(predict(logLin, newdata = data.frame(Year = years)))
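To glance at the two estimates for 2012 side by side, something like the following would do (the column names are ours):

## the last row holds the extrapolated results for 2012
tail(data.frame(Year = years,
    power = round(nonLin.predict, 2),
    loglinear = round(logLin.predict, 2)), 1)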
Of course checking the assumptions of the models also makes sense; here we build on the gvlma package for simplicity:

library(gvlma)
summary(gvlma(nonLin))
summary(gvlma(logLin))
Please note that at the beginning of August (when the first plot of this post was created) we used a more saturated model for the power prediction:
year2 <- golddata$Year^2
year3 <- golddata$Year^3
year4 <- golddata$Year^4
nonlin3 <- lm(Result ~ Year * year2 * year3 * year4, data = cbind(golddata, year2, year3, year4))
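For what it is worth, a raw fourth-degree polynomial can be written more compactly (and without the aliased interaction terms that the formula above introduces); this is just a sketch, not the code we ran in August:

nonlin3.raw <- lm(Result ~ poly(Year, 4, raw = TRUE), data = golddata)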
Visualising results
A quick ggplot2 solution can be seen below, after some data transformation (adding the expected results for 2012 to the data set):

golddata2012 <- rbind(golddata, c(2012, rep(NA, times = ncol(golddata) - 1)))
golddata2012$nonLin <- nonLin.predict
golddata2012$logLin <- logLin.predict

library(ggplot2)
g <- ggplot(golddata2012) +
    geom_point(aes(x = Year, y = Result)) +
    geom_smooth(aes(x = Year, y = nonLin), alpha = 0.2) +
    geom_smooth(aes(x = Year, y = logLin, col = "#56B4E9"), alpha = 0.2, col = "#009E73") +
    opts(title = d$Event[1]) +
    ylab("") + xlab("") +
    theme_bw() +
    opts(legend.position = "none")
if (resultInTime)
    g <- g + scale_y_continuous(labels = ms)
g
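A note for readers on a newer ggplot2 (0.9.2 or later): opts() has since been deprecated there in favour of ggtitle() and theme(), so the title and legend layers above would instead read:

g <- g + ggtitle(d$Event[1]) + theme(legend.position = 'none')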
This produces a plot like the one at the top of this post (automatically styled with pander). In the above ggplot call we used another custom function, which does the opposite of rms:

#' Pretty time printing
#' @param sec integer
#' @param transform if set to FALSE the input will be returned without any tweaks
#' @return character in \code{hours:minutes:seconds} format
ms <- function(sec, transform = TRUE) {
    sapply(sec, function(sec) {
        if (transform) {
            if (sec < 60) {
                msec <- sec %% 1
                if (msec == 0)
                    msec <- ''
                else
                    msec <- paste0('.', round(msec, 2) * 100)
            } else
                msec <- ''
            sec <- round(sec)
            minute <- sec %/% 60
            if (minute > 59) {
                sec <- sprintf("%02d", sec - minute * 60)
                hours <- minute %/% 60
                minute <- sprintf("%02d", minute - hours * 60)
                paste0(paste(hours, minute, sec, sep = ":"), msec)
            } else {
                sec <- sprintf("%02d", sec - minute * 60)
                paste0(paste(minute, sec, sep = ":"), msec)
            }
        } else
            round(sec, 2)
    })
}
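A round trip through the two helpers with an arbitrary value:

ms(3725)        ## "1:02:05"
rms('1:02:05')  ## 3725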
Creating a reusable template for easy access
The above scripts (and even more: the above “complex plot” with some chatty annotations and optionally added weights to the models, which can also be seen in our demo) were compiled into a brew file, which can be found on GitHub.

Please note that this solution uses only pander; creating a template that shows the predictions building on rapport, with a one-liner in R, can easily be accomplished based on the above. We will focus on that in our next post.

Welcome, feedback!
We really hope you find our packages useful. Please do not hesitate to share your opinion about pander or rapport on GitHub, or more generally below in the comment box!