Site icon R-bloggers

Fun with Statistics – Is Usain Bolt really the fastest man on earth?

[This article was first published on Anindya Mozumdar, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If you search for the phrase “fastest man on earth” in Google, chances are that it will return the answer “Usain Bolt”. It certainly does so for me, even though the results might be different if Google decides to personalize the results for you. This is because currently he holds the world record for being the quickest (9.58s) to run a 100 metres race. Justin Gatlin once did it in 9.45s, but that does not count as the record because he was assisted by the wind.

However, as statisticians, we know that a single data point does not necessarily prove anything and we should look at the distribution of timings for the top runners to check if Usain Bolt is truly the fastest man on earth. In this article, we will try to find an answer to this question using data.

Data

Our first step is to find data on the 100m race timings for all top runners. Unfortunately, I could not find any public source of this data. The best option we have is to use data from the excellent website Alltime Athletics maintained by Peter Larsson. This page provides the details of the all-time men’s best 100m race timings. We can answer a slightly different question using this data. Out of all runners who appear in this list, is there any evidence to suggest that there is a statistical difference among the top runners?

Our first step will be to extract and store the data in a R data frame. We use the rvest and readr R packages to do this.

library(rvest)
library(readr)
male_100_html <- read_html("http://www.alltime-athletics.com/m_100ok.htm")

An examination of the HTML source reveals that the data has been stored within HTML pre-tags. There are also seven tables available as part of the page, out of which we are interested in the table “All-time men’s best 100m”. We use the function html_nodes from the rvest package using an XPath selector. The next step is to extract the HTML text using the function html_text, and restrict to the first element.

male_100_pres <- male_100_html %>%
  html_nodes(xpath = "//pre")
male_100_htext <- male_100_pres %>%
  html_text()
male_100_htext <- male_100_htext[[1]]

The data is stored in fixed-width format in the variable male_100_htext. The read_fwf function from the readr package can be used to store the data in a data frame. The exact arguments to the function were decided using a bit of trial and error.

male_100 <- readr::read_fwf(male_100_htext, skip = 1, n_max = 3178,
                            col_types = cols(.default = col_character()),
                            col_positions = fwf_positions(
                              c(1, 16, 27, 35, 66, 74, 86, 93, 123),
                              c(15, 26, 34, 65, 73, 85, 92, 122, 132)
                            ))
str(male_100)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3178 obs. of  9 variables:
##  $ X1: chr  "1" "2" "3" "3" ...
##  $ X2: chr  "9.58" "9.63" "9.69" "9.69" ...
##  $ X3: chr  "+0.9" "+1.5" "±0.0" "+2.0" ...
##  $ X4: chr  "Usain Bolt" "Usain Bolt" "Usain Bolt" "Tyson Gay" ...
##  $ X5: chr  "JAM" "JAM" "JAM" "USA" ...
##  $ X6: chr  "21.08.86" "21.08.86" "21.08.86" "09.08.82" ...
##  $ X7: chr  "1" "1" "1" "1" ...
##  $ X8: chr  "Berlin" "London" "Beijing" "Shanghai" ...
##  $ X9: chr  "16.08.2009" "05.08.2012" "16.08.200" "20.09.2009" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   .default = col_character(),
##   ..   X1 = col_character(),
##   ..   X2 = col_character(),
##   ..   X3 = col_character(),
##   ..   X4 = col_character(),
##   ..   X5 = col_character(),
##   ..   X6 = col_character(),
##   ..   X7 = col_character(),
##   ..   X8 = col_character(),
##   ..   X9 = col_character()
##   .. )

For this analysis, we are only interested in the name of the runner and the race timing. So we restrict the data to these two columns, and give each column an appropriate name.

library(dplyr)
male_100 <- male_100 %>%
  select(X2, X4)
colnames(male_100) <- c("timing", "runner")

We observe that some of the timings have an “A” against the time. The Wikipedia page Athletics abbreviations informs us that this indicates a mark set at altitude. We simply remove the “A” and also convert the variable timing into a numeric variable.

male_100 <- male_100 %>%
  mutate(timing = gsub("A", "", timing),
         timing = as.numeric(timing))
male_100
## # A tibble: 3,178 x 2
##    timing runner       
##     <dbl> <chr>        
##  1   9.58 Usain Bolt   
##  2   9.63 Usain Bolt   
##  3   9.69 Usain Bolt   
##  4   9.69 Tyson Gay    
##  5   9.69 Yohan Blake  
##  6   9.71 Tyson Gay    
##  7   9.72 Usain Bolt   
##  8   9.72 Asafa Powell 
##  9   9.74 Asafa Powell 
## 10   9.74 Justin Gatlin
## # ... with 3,168 more rows

Exploration

As with any statistical analysis, the first step is to explore and understand the data we are working with. We see that there are 318 unique runners in the data.

n_distinct(male_100$runner)
## [1] 328
runner_cnt <- male_100 %>%
  group_by(runner) %>%
  summarise(n_rec = n()) %>%
  arrange(desc(n_rec))
runner_cnt
## # A tibble: 328 x 2
##    runner           n_rec
##    <chr>            <int>
##  1 Asafa Powell       144
##  2 Mike Rodgers       101
##  3 Justin Gatlin       98
##  4 Maurice Greene      83
##  5 Nesta Carter        73
##  6 Usain Bolt          71
##  7 Kim Collins         68
##  8 Tyson Gay           67
##  9 Ato Boldon          65
## 10 Frank Fredericks    62
## # ... with 318 more rows

As expected, some of the well-known names dominate the list with the maximum number of records. Our next step is to look at the distribution of the number of records.

library(ggplot2)
ggplot(runner_cnt, aes(n_rec)) +
  geom_histogram(binwidth = 5, fill = "lightblue", colour = "black") +
  xlab(NULL) +
  ylab(NULL) +
  ggtitle("Distribution of number of records by runner") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    plot.title = element_text(hjust = 0.5, vjust = 0.5)
  )

As expected, this distribution is heavily skewed as the top runners will tend to dominate the list. We first restrict our data to those who have at least 50 records each.

male_100_2 <- male_100 %>%
  inner_join(
    select(filter(runner_cnt, n_rec >= 50), runner),
    by = c("runner" = "runner")
  )

There are 16 runners left in this data. Let us have a look at the distribution of their race timings as well as the mean and median race timings for these runners.

ggplot(male_100_2, aes(timing)) +
  geom_density() +
  facet_wrap(~runner, nrow = 4, ncol = 4) +
  xlab(NULL) +
  ylab(NULL) +
  ggtitle("Distribution of race timings by runner") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    plot.title = element_text(hjust = 0.5, vjust = 0.5)
  )

male_100_means <- male_100_2 %>%
  group_by(runner) %>%
  summarise(mean_timing = mean(timing)) %>%
  arrange(mean_timing)
male_100_means
## # A tibble: 16 x 2
##    runner           mean_timing
##    <chr>                  <dbl>
##  1 Usain Bolt              9.90
##  2 Asafa Powell            9.94
##  3 Tyson Gay               9.95
##  4 Yohan Blake             9.96
##  5 Justin Gatlin           9.96
##  6 Maurice Greene          9.97
##  7 Ato Boldon             10.00
##  8 Jimmy Vicaut           10.0 
##  9 Frank Fredericks       10.0 
## 10 Mike Rodgers           10.0 
## 11 Nesta Carter           10.0 
## 12 Carl Lewis             10.0 
## 13 Linford Christie       10.0 
## 14 Dennis Mitchell        10.0 
## 15 Kim Collins            10.0 
## 16 Michael Frater         10.0
male_100_medians <- male_100_2 %>%
  group_by(runner) %>%
  summarise(median_timing = median(timing)) %>%
  arrange(median_timing)
male_100_medians
## # A tibble: 16 x 2
##    runner           median_timing
##    <chr>                    <dbl>
##  1 Usain Bolt                9.9 
##  2 Asafa Powell              9.95
##  3 Yohan Blake               9.96
##  4 Justin Gatlin             9.97
##  5 Maurice Greene            9.97
##  6 Tyson Gay                 9.97
##  7 Ato Boldon               10   
##  8 Mike Rodgers             10   
##  9 Frank Fredericks         10.0 
## 10 Jimmy Vicaut             10.0 
## 11 Nesta Carter             10.0 
## 12 Linford Christie         10.0 
## 13 Dennis Mitchell          10.0 
## 14 Carl Lewis               10.0 
## 15 Kim Collins              10.0 
## 16 Michael Frater           10.0

There are 6 runners who are at the top using both the mean or the median. Also, for all of them, the mean and the median timings are less than 10 seconds. So we will further restrict the data to these 6 runners, and see if their race timings are significantly different.

male_100_3 <- male_100_2 %>%
  filter(runner %in% c("Usain Bolt", "Asafa Powell", "Yohan Blake",
                       "Justin Gatlin", "Maurice Greene", "Tyson Gay"))

Analysis

Now that we have the data ready, our next step is to try and answer the original question. We will follow two different approaches. Our first approach will be the non-parametric Kruskall-Wallis rank sum test. We will also do a multiple comparison test for the Kruskall-Wallis. The next approach will be a pairwise comparison of the runners using simulations. This approach was popularized in an article by Allen Downey, which said that there is only one statistical test. Another article which implements the same idea, and provides code using the infer package, can be found in this link.

kt <- kruskal.test(timing ~ runner, data = male_100_3)
kt
## 
##  Kruskal-Wallis rank sum test
## 
## data:  timing by runner
## Kruskal-Wallis chi-squared = 19.202, df = 5, p-value = 0.001762

Since the p-value is significant at the 99% level, it indicates that at least one of the runners is different from the others.

library(pgirmess)
ktph <- kruskalmc(male_100_3$timing, male_100_3$runner)
ktph$dif.com
##                                obs.dif critical.dif difference
## Asafa Powell-Justin Gatlin   32.046910     57.86481      FALSE
## Asafa Powell-Maurice Greene  49.073712     60.89670      FALSE
## Asafa Powell-Tyson Gay       24.278711     65.34670      FALSE
## Asafa Powell-Usain Bolt      45.633314     64.07814      FALSE
## Asafa Powell-Yohan Blake     28.887692     68.71975      FALSE
## Justin Gatlin-Maurice Greene 17.026801     65.91562      FALSE
## Justin Gatlin-Tyson Gay       7.768200     70.04750      FALSE
## Justin Gatlin-Usain Bolt     77.680224     68.86559       TRUE
## Justin Gatlin-Yohan Blake     3.159219     73.20426      FALSE
## Maurice Greene-Tyson Gay     24.795001     72.57220      FALSE
## Maurice Greene-Usain Bolt    94.707025     71.43207       TRUE
## Maurice Greene-Yohan Blake   20.186020     75.62365      FALSE
## Tyson Gay-Usain Bolt         69.912024     75.26171      FALSE
## Tyson Gay-Yohan Blake         4.608981     79.25099      FALSE
## Usain Bolt-Yohan Blake       74.521005     78.20829      FALSE

The only two cases where there seems to be a difference are – “Justin Gatlin-Usain Bolt” and “Maurice Green-Usain Bolt”.

We now do the pairwise comparison of athletes using simulations. To do that, we will write a function to compare two individual runners. The first step will be to calculate the difference in medians in the observed data. This is accomplished using the specify and calculate functions in the infer package. specify is used to specify the response and explanatory variables. calculate is used to calculate the test statistic, which is the difference in medians in this case. The second step is similar, except that we include a null hypothesis that there is no difference in the median timings between the runners. We also perform 10,000 simulations – where each simulation permutes the runners in the existing data, and then calculate the difference in medians between the two groups. The infer package also provides a visualize function to visualize the distribution of the simulated test statistic.

library(infer)
set.seed(2154)
compare_runners <- function(runner1, runner2, visualize = FALSE) {
  
  diff_med <- male_100_3 %>%
    filter(runner %in% c(runner1, runner2)) %>%
    arrange(runner) %>%
    specify(timing ~ runner) %>%
    calculate("diff in medians",
              order = c(runner1, runner2))
  
  diff_med_null <- male_100_3 %>%
    filter(runner %in% c(runner1, runner2)) %>%
    arrange(runner) %>%
    specify(timing ~ runner) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 10000, type = "permute") %>%
    calculate("diff in medians",
              order = c(runner1, runner2))
  
  if (visualize) {
    diff_med_plot <- diff_med_null %>%
      visualize(method = "simulation") +
      shade_p_value(obs_stat = diff_med, direction = "less") +
      xlab("Difference in median run timings") +
      ylab(NULL) +
      ggtitle(paste0("Comparing ", runner1, " and ", runner2)) +
      scale_x_continuous(labels = scales::number_format(accuracy = 0.01)) +
      theme_bw() +
      theme(
        panel.border = element_blank(),
        plot.title = element_text(hjust = 0.5, vjust = 0.5)
      )
    print(diff_med_plot)
  }
  
  diff_med_null %>%
    get_pvalue(obs_stat = diff_med, direction = "less")
    
}

The function is applied to each pair of runners. Rather than looking at the results for all the 15 pairs of runners, we do it based on their rankings by median times. There is only one exception, where we also look at the pair Usain Bolt-Yohan Blake, who have the 1st and 3rd least median times respectively. The results for each pair, as well as the p-value generated by the simulation, are shown in the graphs and tables below.

compare_runners("Usain Bolt", "Asafa Powell", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0185
compare_runners("Asafa Powell", "Yohan Blake", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.200
compare_runners("Yohan Blake", "Justin Gatlin", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.473
compare_runners("Justin Gatlin", "Maurice Greene", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.715
compare_runners("Maurice Greene", "Tyson Gay", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.702
compare_runners("Usain Bolt", "Yohan Blake", TRUE)

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0027

There is only a 1.85% chance of seeing a difference as large as the observed difference if there is actually no difference between the timings of Usain Bolt and Asafa Powell. The chance for the pair of Usain Bolt and Yohan Blake is even lower at 0.27%. For all the remaining pairs, we fail to reject the null hypothesis and do not find sufficient evidence to conclude that the timings for these pairs are statistically different. It appears that Usain Bolt is better than the next best in the list which is Asafa Powell and Yohan Blake. Given the data we have, we do have reason to believe that Usain Bolt is the fastest man on earth (as measured by 100m race timings).

The reproducible code for this analysis can be found in Github.

To leave a comment for the author, please follow the link and comment on their blog: Anindya Mozumdar.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.