Comrades Marathon Attrition Rate

[This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers.]

It is a bit of a mission to get the complete data set for this year’s Comrades Marathon. The full results are easily accessible, but come as an HTML file. Embedded in this file are links to the splits for individual athletes. So with a bit of scripting wizardry it is also possible to download the HTML files for each of the individual athletes. Parsing all of these yields the complete result set, which is the starting point for this analysis.
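The link-harvesting step can be sketched roughly as follows. This is only an illustration using Python's standard library: the post does not describe the actual markup of the results page, so the sample HTML, the `splits?race_no=...` link pattern and the names `LinkCollector` and `SAMPLE_RESULTS_HTML` are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the main results page; the real page layout
# and URL scheme are assumptions, not taken from the CMA site.
SAMPLE_RESULTS_HTML = """
<table>
  <tr><td>1</td><td><a href="splits?race_no=11111">Runner A</a></td></tr>
  <tr><td>2</td><td><a href="splits?race_no=22222">Runner B</a></td></tr>
</table>
"""

collector = LinkCollector()
collector.feed(SAMPLE_RESULTS_HTML)
split_links = collector.links
print(split_links)
```

Each collected link would then be downloaded and parsed in the same way to recover the splits for one athlete.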

The first interesting thing that I found was that according to the main results page there were 19907 entrants (this is also the number quoted in the 2013 Comrades Marathon Highlights). However, there were only detailed data for 19903 individual athletes. This immediately aroused my suspicions, so I had a look for duplicate race numbers and, guess what? Yup! There were four: 57234, 54243, 16266 and 25315. If you don’t believe me, check out the results for yourself. Here are the relevant data:

Position  Race Number  Name                   Time
1980      57234        Izelle Pretorius       09:16:02
1981      57234        Justin Powrie          09:16:02
3179      54243        Daniel Matseme         09:56:55
3180      54243        Headman Magadeni       09:56:55
3786      16266        Doctor Masina          10:17:56
3787      16266        Doctor Patrick Masina  10:17:56
          25315        Paulus Mpho            DNF
          25315        Ludwe Tsoliwe          DNF

That’s interesting: for each duplicated race number there are two names, both of which have the same finishing time and are assigned independent positions in the field. I don’t know what has happened here, but there is clearly a glitch in the data being provided by the CMA. Logic suggests that in each case there was in fact just one runner and so the overall position data are not correct. Not a big issue, but if you came in after position 1980, then your real position may be out by a few places.
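A check for duplicated race numbers might look something like the sketch below. It uses only the eight rows quoted above; the variable names and the tuple layout are my own assumptions, and the full data set would of course be loaded from the parsed HTML instead.

```python
from collections import Counter

# (position, race number, name, time) for the rows quoted in the post.
# Position is None for the two athletes who did not finish.
rows = [
    (1980, 57234, "Izelle Pretorius", "09:16:02"),
    (1981, 57234, "Justin Powrie", "09:16:02"),
    (3179, 54243, "Daniel Matseme", "09:56:55"),
    (3180, 54243, "Headman Magadeni", "09:56:55"),
    (3786, 16266, "Doctor Masina", "10:17:56"),
    (3787, 16266, "Doctor Patrick Masina", "10:17:56"),
    (None, 25315, "Paulus Mpho", "DNF"),
    (None, 25315, "Ludwe Tsoliwe", "DNF"),
]

# Count how often each race number occurs, then keep those seen more than once.
counts = Counter(race_number for _, race_number, _, _ in rows)
duplicates = [number for number, n in counts.items() if n > 1]

# Each duplicated number accounts for one surplus row in the main results.
listed_entrants = 19907
distinct_athletes = listed_entrants - len(duplicates)

print(duplicates)
print(distinct_athletes)
```

The four surplus rows account exactly for the gap between the 19907 listed entrants and the 19903 athletes with detailed data.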

Moving on to something more relevant: attrition. Of the 19903 independent entrants, I find that only 10185 finished. Again this number differs from the official number by 3 (this is because of the duplication issue mentioned above!). But many of those entrants didn’t even start the race: 6008 of them did not make it to the City Hall in Durban on Sunday morning. Of the 13895 athletes who were there when the start gun went off, only 10183 made it to the finish line before the 12 hour cutoff. This means that the attrition rate among starters was 26.7%: just over one quarter of those who started didn’t make it! In view of the carnage that I witnessed on Sunday, I would have expected this number to be a lot higher!
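The arithmetic behind these figures is simple enough to spell out. The sketch below uses only the counts quoted in the paragraph above (the variable names are mine). Note that the paragraph quotes both 10185 and 10183 finishers; the sketch uses the second figure, and either one rounds to the same rate.

```python
entrants = 19903          # distinct athletes with detailed data
did_not_start = 6008      # entrants who never made it to the start
finishers = 10183         # finished before the 12 hour cutoff

# Starters are the entrants who actually lined up.
starters = entrants - did_not_start

# Attrition is measured relative to starters, not entrants.
attrition_rate = 1 - finishers / starters

print(starters)
print(f"{attrition_rate:.1%}")
```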

Let’s break this down by gender. The figure below shows, for each gender, the proportion of athletes who did not start (DNS), did not finish (DNF), or finished the race. The DNS data combine the categories “not yet started”, “pre-race withdrawal” and “substituted”. The DNF data also include “disqualified” and “started and running”.

[Figure: spine plot of race status (DNS, DNF, finished) by gender]

So what can we take away from this plot? Here are the main points:

  • men made up 78.0% of the entrants;
  • women accounted for 20.3% of those that crossed the start line;
  • men made up 80.8% of the finishers.

The proportions are rather consistent! But this is only one way of looking at the data. What about if we consider the proportions within each gender? Then the picture is slightly different:

  • 28.7% of the male entrants did not start the race (compared with 35.6% of the females);
  • 74.3% of the males who started also reached the finish line before the final gun (as opposed to 69.4% for females).
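The two views of the data are linked by straightforward arithmetic, so one can be used as a sanity check on the other. The sketch below starts from the share of male entrants and the within-gender rates quoted above, and recovers the gender split among starters and finishers. It works with fractions of the field rather than head counts, and the variable names are my own.

```python
# Share of entrants by gender (men made up 78.0% of the entrants).
entrant_share = {"male": 0.780, "female": 1 - 0.780}

# Within-gender rates quoted above.
dns_rate = {"male": 0.287, "female": 0.356}      # did not start
finish_rate = {"male": 0.743, "female": 0.694}   # finished, given a start

# Fraction of the whole field that started / finished, per gender.
started = {g: entrant_share[g] * (1 - dns_rate[g]) for g in entrant_share}
finished = {g: started[g] * finish_rate[g] for g in started}

women_share_of_starters = started["female"] / sum(started.values())
men_share_of_finishers = finished["male"] / sum(finished.values())

print(f"women among starters: {women_share_of_starters:.1%}")
print(f"men among finishers:  {men_share_of_finishers:.1%}")
```

Both derived shares agree with the figures in the first list, which suggests the two sets of percentages are internally consistent.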

I am not going to interpret these results any further. I know which side my bread is buttered. Draw your own conclusions.

Next we look at the same data but broken down according to age category. Here the 40-49 age group was the best at getting to the starting line. Obviously they (and I include myself here) have learned that if you don’t start, then you certainly can’t finish! Ahem. Moving on. Of those that did start, runners in the 20-29 age group fared the best, with 81.3% finishing. Things got progressively worse from there, with the percentage of finishers dropping from 79.6% in the 30-39 group to 73.5% in the 40-49 group, 61.6% in the 50-59 group and only 45.1% in the 60 and older group. Still damn impressive for the senior runners, but the youngsters appear to have fared best on the day. Perhaps they are more tolerant of warm weather?

[Figure: spine plot of race status (DNS, DNF, finished) by age category]

Now, let’s put all of this together, looking at gender, age group and finishing status. There is a lot more information and it is a little difficult to make sense of all of it at once. But here are the salient points:

  • men in all three of the 30-39, 40-49 and 50-59 age groups were equally likely to start;
  • men in the 30-39 age group were most likely to finish;
  • among the women, those in the 40-49 age group were the most likely to start;
  • of the women that did start, the 30-39 age group was most likely to finish.

Looks like 30-39 is the prime time to be running the Comrades. That’s not to say that I am past my prime. Hell no! Not at all.

[Figure: mosaic plot of race status by age category and gender]

Over the next few days I will look at the following questions:

  • what is the effect of running a negative split on overall time? and
  • how does the finishing rate vary with time? Is there evidence of a “diamond carat” effect?
