The data itself was in PDF, so it was a bit of a pain to extract in a usable format. After some copying and pasting into a spreadsheet, I finally got it into CSV, which was easy to parse with R. Given the amount of effort required, you'll forgive me for having extracted only the race I was in, the male sprint. It consists of a 400m swim, followed by a 23km ride, and ends with a 5km run.
So first, let’s read the data:
times <- read.table(
    "stratford.csv",
    header=TRUE,
    sep="\t",
    stringsAsFactors=FALSE,
    na.strings="")
head(times)
  Position StartingPosition StartingTime Age Category     Swim    Cycle      Run    Total
1        1              441     08:44:45  32        F 00:06:04 00:36:46 00:19:11 01:02:01
2        2                5     08:46:00  35        G 00:05:55 00:37:23 00:20:18 01:03:36
3        3               26     08:56:00  23        D 00:06:28 00:37:39 00:19:30 01:03:37
4        4              443     10:35:30  31        F     <NA>     <NA> 00:20:51 01:04:09
5        5              445     10:36:00  27        E 00:06:43 00:37:26 00:21:36 01:05:45
6        6               32     08:52:45  32        F 00:06:25 00:39:27 00:21:08 01:07:00
A technical point about the data: the times were manually reported, and some are missing. On top of that, the transition times (e.g. between the end of the swim and the start of the cycling) were not recorded separately but were folded into one of the disciplines. The starting position can be used as an athlete ID.
Next are a few lines that transform the times from character strings to numeric (in minutes). First we define a column-wise function which does exactly that:
library("plyr")     # needed for colwise(), ddply() and .()
library("stringr")
library("ggplot2")

stringToMinutes <- colwise(
    function(s) unlist(
        lapply(
            str_split(s, ":"),
            function(s) sum(as.numeric(s) * c(60, 1, 1/60))
        )
    ),
    .(Swim, Cycle, Run, Total))
Then apply it to the times data frame:
times <- ddply(times, .(Position, StartingPosition, Category), stringToMinutes)
head(times)
  Position StartingPosition Category     Swim    Cycle      Run    Total
1        1              441        F 6.066667 36.76667 19.18333 62.01667
2        2                5        G 5.916667 37.38333 20.30000 63.60000
3        3               26        D 6.466667 37.65000 19.50000 63.61667
4        4              443        F       NA       NA 20.85000 64.15000
5        5              445        E 6.716667 37.43333 21.60000 65.75000
6        6               32        F 6.416667 39.45000 21.13333 67.00000
With the benefit of hindsight, I can tell you that there are a couple of obvious errors (like an hour-long swim). Let’s clean up the data a bit:
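The cleanup snippet didn't survive the extraction; here is a minimal sketch of what it could look like, assuming a one-minute tolerance on the recorded totals (the splitSum name and the tolerance value are mine, not the original post's):

splitSum <- times$Swim + times$Cycle + times$Run
# keep rows where the splits roughly add up to the total,
# or where the sum is NA because at least one split is missing
times <- times[abs(splitSum - times$Total) < 1 | is.na(splitSum), ]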
Ideally we would expect Swim+Cycle+Run=Total, but given the precision of the recording we have to allow for some discrepancy. The is.na(Swim+Cycle+Run) condition is there to handle the cases where at least one of the disciplines is missing: the sum is then NA, so the first test fails and, without that condition, the corresponding row would be lost. None of the Total times is NA.
In fact, let’s only consider the runners with all times defined (most of them):
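That snippet isn't in this extract either; a minimal sketch:

# drop the athletes with at least one missing split time
times <- times[!is.na(times$Swim + times$Cycle + times$Run), ]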
We can now have a first look at the average times for each age category, with a mixture of ddply and summarise.
digest <- ddply(
    times,
    "Category",
    summarise,
    median=median(Total),
    average=mean(Total),
    headCount=length(Total)
)
print(digest)
   Category    median   average headCount
1         A  93.66667  93.33519         9
2         C  79.02500  79.19583         4
3         D  78.31667  80.78333        12
4         E  81.81667  85.83667        25
5         F  84.88333  88.81955        52
6         G  83.27500  87.47064        88
7         H  85.78333  88.57263        81
8         I  84.81667  85.50847        61
9         J  83.90000  85.85541        37
10        K  88.11667  90.53333        11
11        L  99.15833 100.03889         6
12        M  98.01667  98.01667         2
13        N 100.01667 100.01667         1
14     <NA>  90.55833  90.55833         2
As usual, the median is more robust than the average, which is useful here given how uneven the group sizes are and the presence of outliers. A graphical representation shows that the total time is actually fairly flat across age categories.
ggplot(times, aes(x=Category, y=Total, group=1)) +
    geom_jitter(position=position_jitter(width=0.05)) +
    geom_smooth() +
    ylim(70, 125)
Now let's have a look at the distribution of times for each discipline. For this, we're going to melt the data frame, so that swim, cycle, run and total become the levels of a new factor column called Discipline.
library("reshape")  # needed for melt()
meltedTimes <- melt(
    times,
    c("StartingPosition", "Category", "Position"),
    c("Swim", "Cycle", "Run", "Total"),
    variable_name="Discipline")
head(meltedTimes)
  StartingPosition Category Position Discipline    value
1              441        F        1       Swim 6.066667
2                5        G        2       Swim 5.916667
3               26        D        3       Swim 6.466667
4              445        E        5       Swim 6.716667
5               32        F        6       Swim 6.416667
6                2        H        7       Swim 6.033333
We can now show the 4 distributions we’re interested in in just one command, thanks to the faceting option:
ggplot(meltedTimes, aes(x=value)) +
    geom_density() +
    facet_wrap(~Discipline, scales="free") +
    xlab("Time (mn)")
scales="free" is useful here, because the times across disciplines are quite different.
You can also show the same distributions split across age categories:
ggplot(meltedTimes, aes(x=value, fill=Category)) +
    geom_density(alpha=0.3) +
    facet_wrap(~Discipline, scales="free") +
    xlab("Time (mn)")
But it's a bit misleading given the small size of some categories; the kernel smoothing can exaggerate local density. Things can be improved a bit by limiting the plot to age categories with at least 10 athletes (see the sketch below), but not by much.
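The corresponding code isn't shown in the post; one possible sketch, reusing the digest table computed above (the bigCategories name is mine):

bigCategories <- digest$Category[digest$headCount >= 10]
ggplot(subset(meltedTimes, Category %in% bigCategories), aes(x=value, fill=Category)) +
    geom_density(alpha=0.3) +
    facet_wrap(~Discipline, scales="free") +
    xlab("Time (mn)")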
In an attempt at investigating whether some athletes are better in some disciplines than others, we can plot their trajectories:
ggplot(meltedTimes, aes(x=Discipline, y=value, group=StartingPosition)) +
    geom_line(alpha=0.05, colour="#0000FF", size=1) +
    ylab("Time (mn)")
Unfortunately, the graph is too busy to reveal anything. A better way is to plot their ranking for each discipline and see in which discipline they rank best. Once again, plyr makes things very easy for us.
First we add a column ranking to meltedTimes. This will be each athlete's rank within the discipline, i.e. within swim, cycle, run or total.
meltedTimes <- ddply(
    meltedTimes,
    .(Discipline),
    transform,
    ranking=rank(value, ties.method="random")
)
head(meltedTimes)
  StartingPosition Category Position Discipline    value ranking
1              441        F        1       Swim 6.066667       7
2                5        G        2       Swim 5.916667       4
3               26        D        3       Swim 6.466667      17
4              445        E        5       Swim 6.716667      25
5               32        F        6       Swim 6.416667      14
6                2        H        7       Swim 6.033333       6
i.e. Athlete #441 ranked 7th for the swim.
Equipped with the intra-discipline ranking, we can summarise each athlete by his best discipline:
bestDiscipline <- ddply(
    meltedTimes,
    .(StartingPosition),
    summarise,
    Discipline=Discipline[order(ranking)[1]],
    Position=Position[order(ranking)[1]])
head(bestDiscipline)
  StartingPosition Discipline Position
1                1       Swim      204
2                2       Swim        7
3                4       Swim       12
4                5      Cycle        2
5                9        Run      147
6               10       Swim       40
i.e. athlete #5's strong point is the cycling; this is the discipline in which he ranked best. Position is the final position, which I'll need in a minute.
So let's see what the athletes' favourite disciplines are:
ggplot(bestDiscipline) +
    geom_bar(aes(x=Discipline, y=..count..)) +
    xlab("Strong point")
But what about the best athletes? Say, those in the first quartile? Well, let’s see.
bestDiscipline$isInFirstQuartile <- bestDiscipline$Position < nrow(bestDiscipline)/4
ggplot(bestDiscipline) +
    geom_bar(aes(x=Discipline, y=..count.., fill=isInFirstQuartile)) +
    xlab("Strong point")
summaryTable <- table(bestDiscipline[, c("Discipline", "isInFirstQuartile")])
summaryTable
          isInFirstQuartile
Discipline FALSE TRUE
     Swim    113   18
     Cycle    83   30
     Run      79   20
     Total     1   16
The over-representation of top athletes in Total time makes sense: by definition, their rank for Total is already low, so they are less likely to rank even better in any single discipline.
But top athletes do show a different distribution of strong points from the rest of the competitors:
print(100*summaryTable/rep(colSums(summaryTable), 1, each=4), digits=2)
          isInFirstQuartile
Discipline FALSE  TRUE
     Swim  40.94 21.43
     Cycle 30.07 35.71
     Run   28.62 23.81
     Total  0.36 19.05
There are many other things one could do with this data (what was the biggest catch-up? do you have to be in the top 25% of every discipline to finish in the overall top 25%? etc.), but I hope this gave you a flavour of what's possible with plyr and ggplot2, two excellent packages which may take some getting used to but are definitely worth the effort.
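As a parting illustration, here is a quick sketch of the "biggest catch-up" question, taking it to mean the athlete who gained the most places between his swim ranking and his final position (my interpretation, and the catchUp and gained names are mine):

# the Swim rows carry both the swim ranking and the final Position
catchUp <- subset(meltedTimes, Discipline == "Swim")
catchUp$gained <- catchUp$ranking - catchUp$Position
head(catchUp[order(-catchUp$gained),
             c("StartingPosition", "ranking", "Position", "gained")])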
A sequel to this post can be found here.