On Sunday the men's 100m sprint final of the Tokyo Olympics will take place. Francesc Montané reminded me in his analysis that 9 years ago I used a simple regression model to predict the winning time of the men's 100m sprint final at the 2012 Olympics in London. My model predicted a winning time of 9.68s; Usain Bolt finished in 9.63s. For this Sunday my prediction is 9.72s, with a 50% credible interval of [9.61s, 9.85s].
Things have changed since the 2012 Olympics: Usain Bolt has retired, and new shoes with advanced spike technology have caused a bit of a controversy. I don't think a log-linear regression is sensible any more. We may see faster times, but there will be a limit. Hence, an S-shaped curve might be a good starting point. There are many to choose from; I picked the following form:
\[ f(x) = L + 1 - \frac{x}{\left(1 + |x|^{k}\right)^{1/k}} \]
With \(L=9\) and \(k=0.9\) it looks like this:
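As a quick sketch (my own snippet, not from the original post), the curve for \(L=9\) and \(k=0.9\) can be evaluated and plotted directly:

```r
# Sketch: the S-shaped curve f(x) = L + 1 - x / (1 + |x|^k)^(1/k)
# with L = 9, k = 0.9. It falls from L + 2 towards the asymptote L.
f <- function(x, L = 9, k = 0.9) {
  L + 1 - x / (1 + abs(x)^k)^(1 / k)
}
curve(f, from = -3, to = 3, xlab = "x", ylab = "f(x)")
abline(h = 9, lty = 2)  # lower asymptote L as x grows large
```

Note that \(f(0) = L + 1\) and \(f(x) \to L\) as \(x \to \infty\), so \(L\) is the limiting value the curve approaches.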
Data
Here is the historical data again:
library(data.table)
golddata <- fread(
  sep=",", header=TRUE, text="
Year, Event, Athlete, Medal, Country, Time
1896, 100m Men, Tom Burke, GOLD, USA, 12.00
1900, 100m Men, Frank Jarvis, GOLD, USA, 11.00
1904, 100m Men, Archie Hahn, GOLD, USA, 11.00
1906, 100m Men, Archie Hahn, GOLD, USA, 11.20
1908, 100m Men, Reggie Walker, GOLD, SAF, 10.80
1912, 100m Men, Ralph Craig, GOLD, USA, 10.80
1920, 100m Men, Charles Paddock, GOLD, USA, 10.80
1924, 100m Men, Harold Abrahams, GOLD, GBR, 10.60
1928, 100m Men, Percy Williams, GOLD, CAN, 10.80
1932, 100m Men, Eddie Tolan, GOLD, USA, 10.30
1936, 100m Men, Jesse Owens, GOLD, USA, 10.30
1948, 100m Men, Harrison Dillard, GOLD, USA, 10.30
1952, 100m Men, Lindy Remigino, GOLD, USA, 10.40
1956, 100m Men, Bobby Morrow, GOLD, USA, 10.50
1960, 100m Men, Armin Hary, GOLD, GER, 10.20
1964, 100m Men, Bob Hayes, GOLD, USA, 10.00
1968, 100m Men, Jim Hines, GOLD, USA, 9.95
1972, 100m Men, Valery Borzov, GOLD, URS, 10.14
1976, 100m Men, Hasely Crawford, GOLD, TRI, 10.06
1980, 100m Men, Allan Wells, GOLD, GBR, 10.25
1984, 100m Men, Carl Lewis, GOLD, USA, 9.99
1988, 100m Men, Carl Lewis, GOLD, USA, 9.92
1992, 100m Men, Linford Christie, GOLD, GBR, 9.96
1996, 100m Men, Donovan Bailey, GOLD, CAN, 9.84
2000, 100m Men, Maurice Greene, GOLD, USA, 9.87
2004, 100m Men, Justin Gatlin, GOLD, USA, 9.85
2008, 100m Men, Usain Bolt, GOLD, JAM, 9.69
2012, 100m Men, Usain Bolt, GOLD, JAM, 9.63
2016, 100m Men, Usain Bolt, GOLD, JAM, 9.81
")
library(ggplot2)
ggplot(golddata, aes(x=Year, y=Time)) +
  geom_line() + geom_point() +
  labs(title="Winning times of Olympic gold medalist 100m sprint men")
Model
First I prepare the data by removing the first measurement, from 1896, which
looks too much like an outlier. To fit the non-linear function it is helpful
to centre and scale the 'Year' variable. However, instead of transforming the
data directly, I will use the centre and scale values reported by scale
to inform the priors in my model.
golddata1900 <- golddata[Year>=1900]
attr(scale(golddata1900$Year), "scaled:center")
## [1] 1958.786
attr(scale(golddata1900$Year), "scaled:scale")
## [1] 36.69833
To fit the non-linear function I use the brms
package. For my Bayesian regression model I assume a Gaussian data distribution and Normal priors for the parameters.
library(brms)
library(parallel)
nCores <- detectCores()
options(mc.cores = nCores)
mdl <- brm(
  bf(Time ~ L + 1 - ((Year-C)/S)/(1+fabs((Year-C)/S)^k)^(1/k),
     C ~ 1, S ~ 1, L ~ 1, k ~ 1,
     nl = TRUE),
  data = golddata1900,
  family = gaussian(),
  prior = c(
    prior(normal(1959, 5), nlpar = "C"),
    prior(normal(37, 1), nlpar = "S"),
    prior(normal(9, 0.2), nlpar = "L", lb=0),
    prior(normal(1, 0.2), nlpar = "k", lb=0)
  ),
  seed = 1234,
  iter = 2000,
  chains = 4,
  cores = nCores
)
mdl
##  Family: gaussian
##   Links: mu = identity; sigma = identity
## Formula: Time ~ L + 1 - ((Year - C)/S)/(1 + fabs((Year - C)/S)^k)^(1/k)
##          C ~ 1
##          S ~ 1
##          L ~ 1
##          k ~ 1
##    Data: golddata1900 (Number of observations: 28)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
##
## Population-Level Effects:
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## C_Intercept  1956.62      4.17  1947.95  1964.53 1.00     2403     2066
## S_Intercept    37.12      1.00    35.18    39.10 1.00     2770     2547
## L_Intercept     9.31      0.05     9.21     9.40 1.00     2483     2388
## k_Intercept     0.90      0.09     0.74     1.10 1.00     3803     2643
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.18      0.03     0.13     0.24 1.00     3269     2478
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The model runs without problems and the estimated parameters are not too far away from my prior assumptions. Nice!
The parameter \(L\) represents the limit of this curve, i.e. the fastest time a human could possibly run according to this model. Its 95% credible interval is between 9.21s and 9.40s.
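As a quick sanity check (my own back-of-the-envelope snippet, plugging the posterior means reported above into the non-linear mean function), the fitted curve at Year = 2021 can be evaluated directly:

```r
# Sanity check: evaluate the fitted mean function at Year = 2021
# using the posterior means reported above (C, S, L, k).
C <- 1956.62; S <- 37.12; L <- 9.31; k <- 0.90
x <- (2021 - C) / S
L + 1 - x / (1 + abs(x)^k)^(1 / k)
## [1] 9.720786
```

This lands close to the posterior predictive mean for 2021 shown below, as it should.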
plot(brms::conditional_effects(mdl, method = "predict"), points=TRUE)
Prediction
The model looks plausible, so what are the predictions for this Sunday, and the Olympics in Paris and Los Angeles?
p <- brms::posterior_predict(
  mdl, newdata=data.frame(Year=c(2021, 2024, 2028)))
colnames(p) <- c(2021, 2024, 2028)
rbind(
  mean=apply(p, 2, mean),
  apply(p, 2, quantile, probs=c(0.25, 0.5, 0.75))
)
##          2021     2024     2028
## mean 9.724736 9.707647 9.695373
## 25%  9.606649 9.585469 9.575641
## 50%  9.723891 9.710383 9.697171
## 75%  9.845782 9.828603 9.819280
A winning time of around 9.72s this Sunday might be achievable. However, the fastest time this year by the American Trayvon Bromell, the favourite to take Bolt's 100m title in Tokyo, was 9.77s.
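Under a rough normal approximation to the 2021 posterior predictive distribution (my own back-of-the-envelope calculation using the reported mean and quartiles; not part of the analysis above), one can gauge how likely the model considers a winning time faster than Bromell's 9.77s:

```r
# Back-of-the-envelope: approximate the 2021 posterior predictive
# distribution as Normal, using the reported mean and quartiles.
pred_mean <- 9.7247
pred_sd   <- (9.8458 - 9.6066) / (2 * qnorm(0.75))  # sd implied by the IQR
pnorm(9.77, mean = pred_mean, sd = pred_sd)         # P(winning time < 9.77s)
```

This gives a probability of roughly 60%, so the model is only mildly optimistic about beating Bromell's season best.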
Interestingly, Francesc's ETS model without Usain Bolt's data predicts a winning time of 9.73s.
Session Info
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.5.1
##
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] ggplot2_3.3.5     data.table_1.14.0
##
## loaded via a namespace (and not attached):
##   [1] minqa_1.2.4          colorspace_2.0-2     ellipsis_0.3.2
##   [4] ggridges_0.5.3       brms_2.15.0          rsconnect_0.8.18
##   [7] markdown_1.1         base64enc_0.1-3      rstudioapi_0.13
##  [10] farver_2.1.0         rstan_2.21.2         DT_0.18
##  [13] fansi_0.5.0          mvtnorm_1.1-2        codetools_0.2-18
##  [16] bridgesampling_1.1-2 splines_4.1.0        knitr_1.33
##  [19] shinythemes_1.2.0    bayesplot_1.8.1      projpred_2.0.2
##  [22] jsonlite_1.7.2       nloptr_1.2.2.2       shiny_1.6.0
##  [25] compiler_4.1.0       backports_1.2.1      assertthat_0.2.1
##  [28] Matrix_1.3-4         fastmap_1.1.0        cli_3.0.1
##  [31] later_1.2.0          htmltools_0.5.1.1    prettyunits_1.1.1
##  [34] tools_4.1.0          igraph_1.2.6         coda_0.19-4
##  [37] gtable_0.3.0         glue_1.4.2           reshape2_1.4.4
##  [40] dplyr_1.0.7          V8_3.4.2             Rcpp_1.0.7
##  [43] jquerylib_0.1.4      vctrs_0.3.8          nlme_3.1-152
##  [46] blogdown_1.4         crosstalk_1.1.1      xfun_0.24
##  [49] stringr_1.4.0        ps_1.6.0             lme4_1.1-27.1
##  [52] mime_0.11            miniUI_0.1.1.1       lifecycle_1.0.0
##  [55] gtools_3.9.2         MASS_7.3-54          zoo_1.8-9
##  [58] scales_1.1.1         colourpicker_1.1.0   promises_1.2.0.1
##  [61] Brobdingnag_1.2-6    parallel_4.1.0       inline_0.3.19
##  [64] shinystan_2.5.0      gamm4_0.2-6          yaml_2.2.1
##  [67] curl_4.3.2           gridExtra_2.3        loo_2.4.1
##  [70] StanHeaders_2.21.0-7 sass_0.4.0           stringi_1.7.3
##  [73] highr_0.9            dygraphs_1.1.1.6     boot_1.3-28
##  [76] pkgbuild_1.2.0       rlang_0.4.11         pkgconfig_2.0.3
##  [79] matrixStats_0.59.0   evaluate_0.14        lattice_0.20-44
##  [82] purrr_0.3.4          rstantools_2.1.1     htmlwidgets_1.5.3
##  [85] labeling_0.4.2       tidyselect_1.1.1     processx_3.5.2
##  [88] plyr_1.8.6           magrittr_2.0.1       bookdown_0.22
##  [91] R6_2.5.0             generics_0.1.0       DBI_1.1.1
##  [94] pillar_1.6.1         withr_2.4.2          mgcv_1.8-36
##  [97] xts_0.12.1           abind_1.4-5          tibble_3.1.3
## [100] crayon_1.4.1         utf8_1.2.2           rmarkdown_2.9
## [103] grid_4.1.0           callr_3.7.0          threejs_0.3.3
## [106] digest_0.6.27        xtable_1.8-4         httpuv_1.6.1
## [109] RcppParallel_5.1.4   stats4_4.1.0         munsell_0.5.0
## [112] bslib_0.2.5.1        shinyjs_2.0.0