Fun with Statistics – Is Usain Bolt really the fastest man on earth?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
If you search for the phrase “fastest man on earth” in Google, chances are that it will return the answer “Usain Bolt”. It certainly does so for me, even though the results might be different if Google decides to personalize the results for you. This is because currently he holds the world record for being the quickest (9.58s) to run a 100 metres race. Justin Gatlin once did it in 9.45s, but that does not count as the record because he was assisted by the wind.
However, as statisticians, we know that a single data point does not necessarily prove anything and we should look at the distribution of timings for the top runners to check if Usain Bolt is truly the fastest man on earth. In this article, we will try to find an answer to this question using data.
Data
Our first step is to find data on the 100m race timings for all top runners. Unfortunately, I could not find any public source of this data. The best option we have is to use data from the excellent website Alltime Athletics maintained by Peter Larsson. This page provides the details of the all-time men’s best 100m race timings. We can answer a slightly different question using this data. Out of all runners who appear in this list, is there any evidence to suggest that there is a statistical difference among the top runners?
Our first step will be to extract and store the data in a R data frame. We use the rvest and readr R packages to do this.
library(rvest) library(readr) male_100_html <- read_html("http://www.alltime-athletics.com/m_100ok.htm")
An examination of the HTML source reveals that the data has been stored within HTML pre-tags. There are also seven tables available as part of the page, out of which we are interested in the table “All-time men’s best 100m”. We use the function html_nodes from the rvest package using an XPath selector. The next step is to extract the HTML text using the function html_text, and restrict to the first element.
male_100_pres <- male_100_html %>% html_nodes(xpath = "//pre") male_100_htext <- male_100_pres %>% html_text() male_100_htext <- male_100_htext[[1]]
The data is stored in fixed-width format in the variable male_100_htext. The read_fwf function from the readr package can be used to store the data in a data frame. The exact arguments to the function were decided using a bit of trial and error.
male_100 <- readr::read_fwf(male_100_htext, skip = 1, n_max = 3178, col_types = cols(.default = col_character()), col_positions = fwf_positions( c(1, 16, 27, 35, 66, 74, 86, 93, 123), c(15, 26, 34, 65, 73, 85, 92, 122, 132) )) str(male_100) ## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3178 obs. of 9 variables: ## $ X1: chr "1" "2" "3" "3" ... ## $ X2: chr "9.58" "9.63" "9.69" "9.69" ... ## $ X3: chr "+0.9" "+1.5" "±0.0" "+2.0" ... ## $ X4: chr "Usain Bolt" "Usain Bolt" "Usain Bolt" "Tyson Gay" ... ## $ X5: chr "JAM" "JAM" "JAM" "USA" ... ## $ X6: chr "21.08.86" "21.08.86" "21.08.86" "09.08.82" ... ## $ X7: chr "1" "1" "1" "1" ... ## $ X8: chr "Berlin" "London" "Beijing" "Shanghai" ... ## $ X9: chr "16.08.2009" "05.08.2012" "16.08.200" "20.09.2009" ... ## - attr(*, "spec")= ## .. cols( ## .. .default = col_character(), ## .. X1 = col_character(), ## .. X2 = col_character(), ## .. X3 = col_character(), ## .. X4 = col_character(), ## .. X5 = col_character(), ## .. X6 = col_character(), ## .. X7 = col_character(), ## .. X8 = col_character(), ## .. X9 = col_character() ## .. )
For this analysis, we are only interested in the name of the runner and the race timing. So we restrict the data to these two columns, and give each column an appropriate name.
library(dplyr) male_100 <- male_100 %>% select(X2, X4) colnames(male_100) <- c("timing", "runner")
We observe that some of the timings have an “A” against the time. The Wikipedia page Athletics abbreviations informs us that this indicates a mark set at altitude. We simply remove the “A” and also convert the variable timing into a numeric variable.
male_100 <- male_100 %>% mutate(timing = gsub("A", "", timing), timing = as.numeric(timing)) male_100 ## # A tibble: 3,178 x 2 ## timing runner ## <dbl> <chr> ## 1 9.58 Usain Bolt ## 2 9.63 Usain Bolt ## 3 9.69 Usain Bolt ## 4 9.69 Tyson Gay ## 5 9.69 Yohan Blake ## 6 9.71 Tyson Gay ## 7 9.72 Usain Bolt ## 8 9.72 Asafa Powell ## 9 9.74 Asafa Powell ## 10 9.74 Justin Gatlin ## # ... with 3,168 more rows
Exploration
As with any statistical analysis, the first step is to explore and understand the data we are working with. We see that there are 318 unique runners in the data.
n_distinct(male_100$runner) ## [1] 328 runner_cnt <- male_100 %>% group_by(runner) %>% summarise(n_rec = n()) %>% arrange(desc(n_rec)) runner_cnt ## # A tibble: 328 x 2 ## runner n_rec ## <chr> <int> ## 1 Asafa Powell 144 ## 2 Mike Rodgers 101 ## 3 Justin Gatlin 98 ## 4 Maurice Greene 83 ## 5 Nesta Carter 73 ## 6 Usain Bolt 71 ## 7 Kim Collins 68 ## 8 Tyson Gay 67 ## 9 Ato Boldon 65 ## 10 Frank Fredericks 62 ## # ... with 318 more rows
As expected, some of the well-known names dominate the list with the maximum number of records. Our next step is to look at the distribution of the number of records.
library(ggplot2) ggplot(runner_cnt, aes(n_rec)) + geom_histogram(binwidth = 5, fill = "lightblue", colour = "black") + xlab(NULL) + ylab(NULL) + ggtitle("Distribution of number of records by runner") + theme_bw() + theme( panel.border = element_blank(), plot.title = element_text(hjust = 0.5, vjust = 0.5) )
As expected, this distribution is heavily skewed as the top runners will tend to dominate the list. We first restrict our data to those who have at least 50 records each.
male_100_2 <- male_100 %>% inner_join( select(filter(runner_cnt, n_rec >= 50), runner), by = c("runner" = "runner") )
There are 16 runners left in this data. Let us have a look at the distribution of their race timings as well as the mean and median race timings for these runners.
ggplot(male_100_2, aes(timing)) + geom_density() + facet_wrap(~runner, nrow = 4, ncol = 4) + xlab(NULL) + ylab(NULL) + ggtitle("Distribution of race timings by runner") + theme_bw() + theme( panel.border = element_blank(), plot.title = element_text(hjust = 0.5, vjust = 0.5) )
male_100_means <- male_100_2 %>% group_by(runner) %>% summarise(mean_timing = mean(timing)) %>% arrange(mean_timing) male_100_means ## # A tibble: 16 x 2 ## runner mean_timing ## <chr> <dbl> ## 1 Usain Bolt 9.90 ## 2 Asafa Powell 9.94 ## 3 Tyson Gay 9.95 ## 4 Yohan Blake 9.96 ## 5 Justin Gatlin 9.96 ## 6 Maurice Greene 9.97 ## 7 Ato Boldon 10.00 ## 8 Jimmy Vicaut 10.0 ## 9 Frank Fredericks 10.0 ## 10 Mike Rodgers 10.0 ## 11 Nesta Carter 10.0 ## 12 Carl Lewis 10.0 ## 13 Linford Christie 10.0 ## 14 Dennis Mitchell 10.0 ## 15 Kim Collins 10.0 ## 16 Michael Frater 10.0 male_100_medians <- male_100_2 %>% group_by(runner) %>% summarise(median_timing = median(timing)) %>% arrange(median_timing) male_100_medians ## # A tibble: 16 x 2 ## runner median_timing ## <chr> <dbl> ## 1 Usain Bolt 9.9 ## 2 Asafa Powell 9.95 ## 3 Yohan Blake 9.96 ## 4 Justin Gatlin 9.97 ## 5 Maurice Greene 9.97 ## 6 Tyson Gay 9.97 ## 7 Ato Boldon 10 ## 8 Mike Rodgers 10 ## 9 Frank Fredericks 10.0 ## 10 Jimmy Vicaut 10.0 ## 11 Nesta Carter 10.0 ## 12 Linford Christie 10.0 ## 13 Dennis Mitchell 10.0 ## 14 Carl Lewis 10.0 ## 15 Kim Collins 10.0 ## 16 Michael Frater 10.0
There are 6 runners who are at the top using both the mean or the median. Also, for all of them, the mean and the median timings are less than 10 seconds. So we will further restrict the data to these 6 runners, and see if their race timings are significantly different.
male_100_3 <- male_100_2 %>% filter(runner %in% c("Usain Bolt", "Asafa Powell", "Yohan Blake", "Justin Gatlin", "Maurice Greene", "Tyson Gay"))
Analysis
Now that we have the data ready, our next step is to try and answer the original question. We will follow two different approaches. Our first approach will be the non-parametric Kruskall-Wallis rank sum test. We will also do a multiple comparison test for the Kruskall-Wallis. The next approach will be a pairwise comparison of the runners using simulations. This approach was popularized in an article by Allen Downey, which said that there is only one statistical test. Another article which implements the same idea, and provides code using the infer package, can be found in this link.
kt <- kruskal.test(timing ~ runner, data = male_100_3) kt ## ## Kruskal-Wallis rank sum test ## ## data: timing by runner ## Kruskal-Wallis chi-squared = 19.202, df = 5, p-value = 0.001762
Since the p-value is significant at the 99% level, it indicates that at least one of the runners is different from the others.
library(pgirmess) ktph <- kruskalmc(male_100_3$timing, male_100_3$runner) ktph$dif.com ## obs.dif critical.dif difference ## Asafa Powell-Justin Gatlin 32.046910 57.86481 FALSE ## Asafa Powell-Maurice Greene 49.073712 60.89670 FALSE ## Asafa Powell-Tyson Gay 24.278711 65.34670 FALSE ## Asafa Powell-Usain Bolt 45.633314 64.07814 FALSE ## Asafa Powell-Yohan Blake 28.887692 68.71975 FALSE ## Justin Gatlin-Maurice Greene 17.026801 65.91562 FALSE ## Justin Gatlin-Tyson Gay 7.768200 70.04750 FALSE ## Justin Gatlin-Usain Bolt 77.680224 68.86559 TRUE ## Justin Gatlin-Yohan Blake 3.159219 73.20426 FALSE ## Maurice Greene-Tyson Gay 24.795001 72.57220 FALSE ## Maurice Greene-Usain Bolt 94.707025 71.43207 TRUE ## Maurice Greene-Yohan Blake 20.186020 75.62365 FALSE ## Tyson Gay-Usain Bolt 69.912024 75.26171 FALSE ## Tyson Gay-Yohan Blake 4.608981 79.25099 FALSE ## Usain Bolt-Yohan Blake 74.521005 78.20829 FALSE
The only two cases where there seems to be a difference are - “Justin Gatlin-Usain Bolt” and “Maurice Green-Usain Bolt”.
We now do the pairwise comparison of athletes using simulations. To do that, we will write a function to compare two individual runners. The first step will be to calculate the difference in medians in the observed data. This is accomplished using the specify and calculate functions in the infer package. specify is used to specify the response and explanatory variables. calculate is used to calculate the test statistic, which is the difference in medians in this case. The second step is similar, except that we include a null hypothesis that there is no difference in the median timings between the runners. We also perform 10,000 simulations - where each simulation permutes the runners in the existing data, and then calculate the difference in medians between the two groups. The infer package also provides a visualize function to visualize the distribution of the simulated test statistic.
library(infer) set.seed(2154) compare_runners <- function(runner1, runner2, visualize = FALSE) { diff_med <- male_100_3 %>% filter(runner %in% c(runner1, runner2)) %>% arrange(runner) %>% specify(timing ~ runner) %>% calculate("diff in medians", order = c(runner1, runner2)) diff_med_null <- male_100_3 %>% filter(runner %in% c(runner1, runner2)) %>% arrange(runner) %>% specify(timing ~ runner) %>% hypothesize(null = "independence") %>% generate(reps = 10000, type = "permute") %>% calculate("diff in medians", order = c(runner1, runner2)) if (visualize) { diff_med_plot <- diff_med_null %>% visualize(method = "simulation") + shade_p_value(obs_stat = diff_med, direction = "less") + xlab("Difference in median run timings") + ylab(NULL) + ggtitle(paste0("Comparing ", runner1, " and ", runner2)) + scale_x_continuous(labels = scales::number_format(accuracy = 0.01)) + theme_bw() + theme( panel.border = element_blank(), plot.title = element_text(hjust = 0.5, vjust = 0.5) ) print(diff_med_plot) } diff_med_null %>% get_pvalue(obs_stat = diff_med, direction = "less") }
The function is applied to each pair of runners. Rather than looking at the results for all the 15 pairs of runners, we do it based on their rankings by median times. There is only one exception, where we also look at the pair Usain Bolt-Yohan Blake, who have the 1st and 3rd least median times respectively. The results for each pair, as well as the p-value generated by the simulation, are shown in the graphs and tables below.
compare_runners("Usain Bolt", "Asafa Powell", TRUE)
## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0.0185 compare_runners("Asafa Powell", "Yohan Blake", TRUE)
## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0.200 compare_runners("Yohan Blake", "Justin Gatlin", TRUE)
## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0.473 compare_runners("Justin Gatlin", "Maurice Greene", TRUE)
## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0.715 compare_runners("Maurice Greene", "Tyson Gay", TRUE)
## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0.702 compare_runners("Usain Bolt", "Yohan Blake", TRUE)
## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0.0027
There is only a 1.85% chance of seeing a difference as large as the observed difference if there is actually no difference between the timings of Usain Bolt and Asafa Powell. The chance for the pair of Usain Bolt and Yohan Blake is even lower at 0.27%. For all the remaining pairs, we fail to reject the null hypothesis and do not find sufficient evidence to conclude that the timings for these pairs are statistically different. It appears that Usain Bolt is better than the next best in the list which is Asafa Powell and Yohan Blake. Given the data we have, we do have reason to believe that Usain Bolt is the fastest man on earth (as measured by 100m race timings).
The reproducible code for this analysis can be found in Github.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.