F1Stats – Correlations Between Qualifying, Grid and Race Classification
Following directly on from F1Stats – Visually Comparing Qualifying and Grid Positions with Race Classification, and continuing my attempt to replicate some of the methodology and results used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing, here’s a quick look at the correlation scores between the final practice, qualifying and grid positions and the final race classification.
I’ve already done a brief review of what correlation means (sort of) in F1Stats – A Prequel to Getting Started With Rank Correlations, so I’m just going to dive straight in with some R code that shows how I set about trying to find the correlations between the different classifications:
Here’s the answer from the paper that we’re aiming for…
Here’s what I got:
> corrs.df[order(corrs.df$V1),]
V1 p3pos.int qpos.int grid.int racepos.raw pval.grid pval.qpos pval.p3pos
2 AUSTRALIA 0.30075188 0.01503759 0.087218045 1 7.143421e-01 9.518408e-01 0.197072158
13 MALAYSIA 0.42706767 0.57293233 0.630075188 1 3.584362e-03 9.410805e-03 0.061725312
6 CHINA -0.26015038 0.57443609 0.514285714 1 2.183596e-02 9.193214e-03 0.266812583
3 BAHRAIN 0.13082707 0.73233083 0.739849624 1 2.900250e-04 3.601434e-04 0.581232598
16 SPAIN 0.25112782 0.80451128 0.804511278 1 2.179221e-05 2.179221e-05 0.284231482
14 MONACO 0.51578947 0.48120301 0.476691729 1 3.513870e-02 3.326706e-02 0.021403708
17 TURKEY 0.52330827 0.73082707 0.730827068 1 3.756531e-04 3.756531e-04 0.019344720
9 GREAT BRITAIN 0.65413534 0.83007519 0.830075188 1 8.921842e-07 8.921842e-07 0.002260234
8 GERMANY 0.32030075 0.46917293 0.452631579 1 4.657539e-02 3.844275e-02 0.168419054
10 HUNGARY 0.49649123 0.37017544 0.370175439 1 1.194050e-01 1.194050e-01 0.032293715
7 EUROPE 0.28120301 0.72030075 0.720300752 1 4.997719e-04 4.997719e-04 0.228898214
4 BELGIUM 0.06766917 0.62105263 0.621052632 1 4.222076e-03 4.222076e-03 0.777083014
11 ITALY 0.52932331 0.52481203 0.524812030 1 1.895282e-02 1.895282e-02 0.017815489
15 SINGAPORE 0.50526316 0.58796992 0.715789474 1 5.621214e-04 7.414170e-03 0.024579520
12 JAPAN 0.34912281 0.74561404 0.849122807 1 0.000000e+00 3.739715e-04 0.143204045
5 BRAZIL -0.51578947 -0.02105263 -0.007518797 1 9.771776e-01 9.316030e-01 0.021403708
1 ABU DHABI 0.42556391 0.66466165 0.628571429 1 3.684738e-03 1.824565e-03 0.062722332
The paper mistakenly reports the grid values as the qualifying positions, so if we look down the grid.int column, which holds the correlation values between the grid and final classifications, we see they broadly match the values quoted in the paper. I also calculated the p-values; they seem to be a little bit off, but of the right order.
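The paper does not say how its p-values were calculated, so the following is only a guess at one way small differences like these can arise. It is a minimal sketch on invented data showing that cor.test() can return different p-values for the same Spearman coefficient depending on its exact argument:

```r
# Toy data (invented for illustration): grid slots and finishing order for 10 cars
grid   <- 1:10
finish <- c(2, 1, 4, 3, 6, 5, 10, 7, 9, 8)

# Spearman's rho is the same whichever way the p-value is computed...
rho <- cor(grid, finish, method = "spearman")

# ...but cor.test() can be asked for an exact or an approximate p-value
p.exact  <- cor.test(grid, finish, method = "spearman", exact = TRUE)$p.value
p.approx <- cor.test(grid, finish, method = "spearman", exact = FALSE)$p.value

print(c(rho = rho, p.exact = p.exact, p.approx = p.approx))
```

Whether this accounts for the gap with the paper is an open question. Ties and missing values in the real classifications could also play a part.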
And here’s the R-code I used to get those results… The first chunk is just the loader, a refinement of the code I have used previously:
require(RSQLite)
require(reshape)
require(plyr)  #provides ddply() and mutate(), used in the tidying step

#Data downloaded from my f1com scraper on scraperwiki
f1 = dbConnect(SQLite(), dbname="f1com_megascraper.sqlite")

getRacesData.full=function(year='2012'){
  #Data query
  results.combined=dbGetQuery(f1,
    paste('SELECT raceResults.year as year, qualiResults.pos as qpos, ',
          'p3Results.pos as p3pos, raceResults.pos as racepos, ',
          'raceResults.race as race, raceResults.grid as grid, ',
          'raceResults.driverNum as driverNum, raceResults.raceNum as raceNum ',
          'FROM raceResults, qualiResults, p3Results ',
          'WHERE raceResults.year==',year,' ',
          'and raceResults.year = qualiResults.year ',
          'and raceResults.year = p3Results.year ',
          'and raceResults.race = qualiResults.race ',
          'and raceResults.race = p3Results.race ',
          'and raceResults.driverNum = qualiResults.driverNum ',
          'and raceResults.driverNum = p3Results.driverNum;',
          sep=''))

  #Data tidying
  #Number the rows within each race 1..n, in the order the query returned them
  results.combined=ddply(results.combined,.(race),mutate,racepos.raw=1:length(race))
  #Add integer versions of the position columns (non-numeric entries become NA)
  for (i in c('racepos','grid','qpos','p3pos','driverNum'))
    results.combined[[paste(i,'.int',sep='')]]=as.integer( as.character(results.combined[[i]]))
  #Order the race factor by round number rather than alphabetically
  results.combined$race=reorder(results.combined$race,results.combined$raceNum)
  results.combined
}

results.combined=getRacesData.full(2009)
Here’s the actual correlation calculation – I use the cor function:
#The cor() function returns data that looks like:
#             p3pos.int   qpos.int   grid.int racepos.raw
#p3pos.int    1.0000000 0.31578947 0.28270677  0.30075188
#qpos.int     0.3157895 1.00000000 0.97744361  0.01503759
#grid.int     0.2827068 0.97744361 1.00000000  0.08721805
#racepos.raw  0.3007519 0.01503759 0.08721805  1.00000000
#Row/col 4 relates to the correlation with the race classification, so for now just return that
corr.rank.race=function(results.combined,cmethod='spearman'){
  ##Correlations
  corrs=NULL
  #Run through the races
  for (i in levels(factor(results.combined$race))){
    results.classified = subset( results.combined, race==i, select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
    #print(i)
    #print( results.classified)
    cp=cor(results.classified,method=cmethod,use="complete.obs")
    #print(cp[4,])
    corrs=rbind(corrs,c(i,cp[4,]))
  }
  corrs.df=as.data.frame(corrs)

  signif=data.frame()
  for (i in levels(factor(results.combined$race))){
    results.classified = subset( results.combined, race==i, select=c('p3pos.int','qpos.int','grid.int','racepos.raw'))
    #p.value
    pval.grid=cor.test(results.classified$racepos.raw,results.classified$grid.int,method=cmethod,alternative = "two.sided")$p.value
    pval.qpos=cor.test(results.classified$racepos.raw,results.classified$qpos.int,method=cmethod,alternative = "two.sided")$p.value
    pval.p3pos=cor.test(results.classified$racepos.raw,results.classified$p3pos.int,method=cmethod,alternative = "two.sided")$p.value
    signif=rbind(signif,data.frame(race=i,pval.grid=pval.grid,pval.qpos=pval.qpos,pval.p3pos=pval.p3pos))
  }

  corrs.df$qpos.int=as.numeric(as.character(corrs.df$qpos.int))
  corrs.df$grid.int=as.numeric(as.character(corrs.df$grid.int))
  corrs.df$p3pos.int=as.numeric(as.character(corrs.df$p3pos.int))
  corrs.df=merge(corrs.df,signif,by.y='race',by.x='V1')
  corrs.df$V1=factor(corrs.df$V1,levels=levels(results.combined$race))
  corrs.df
}

corrs.df=corr.rank.race(results.combined)
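To make the "row 4" step a little more concrete, here is a minimal sketch on an invented six-car race (the toy values are assumptions, not scraped data). It builds the same kind of correlation matrix and picks out the race classification row by name, which gives the same result as indexing by position:

```r
# Toy classifications for one six-car race (values invented for illustration)
toy <- data.frame(
  p3pos.int   = c(3, 1, 2, 6, 4, 5),
  qpos.int    = c(1, 2, 3, 4, 5, 6),
  grid.int    = c(1, 2, 3, 4, 6, 5),
  racepos.raw = c(2, 1, 3, 5, 4, 6)
)

# Full Spearman correlation matrix across the four classifications
cp <- cor(toy, method = "spearman", use = "complete.obs")

# Select the race-classification row by name rather than by position
race.corrs <- cp["racepos.raw", ]
print(round(race.corrs, 3))
```

Selecting by name keeps working if the columns are ever reordered, which indexing with cp[4,] would not.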
It’s then trivial to plot the result:
require(ggplot2)

#Rotate the x-axis tick labels so the race names do not overlap
xRot=function(g,s=5,lab=NULL) g+theme(axis.text.x=element_text(angle=-90,size=s))+xlab(lab)

g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int))
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g
Recalling that there are different types of rank correlation function, specifically “Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series”, I wondered whether it would make sense to look at correlations under this measure. Any obvious-looking differences compared to Spearman’s rho might prompt us to look at the actual grid/race classifications to see which score appears to be more meaningful.
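The concordance idea can be checked by hand on a small case. The following is a minimal sketch using five invented cars (the positions are made up for illustration): it counts concordant and discordant pairs directly and compares the result with the built-in cor() calculations.

```r
# Toy grid and race positions for five cars (invented for illustration)
grid   <- c(1, 2, 3, 4, 5)
finish <- c(2, 1, 3, 5, 4)

# Compare every pair of cars: +1 if both series order the pair the same way
# (concordant), -1 if they disagree (discordant)
pairs <- combn(length(grid), 2)
agree <- sign(grid[pairs[1, ]] - grid[pairs[2, ]]) *
         sign(finish[pairs[1, ]] - finish[pairs[2, ]])

concordant <- sum(agree > 0)
discordant <- sum(agree < 0)

# With no ties, tau is the surplus of concordant over discordant pairs,
# divided by the total number of pairs
tau.byhand <- (concordant - discordant) / ncol(pairs)

# The built-in calculations, for comparison
tau.cor <- cor(grid, finish, method = "kendall")
rho.cor <- cor(grid, finish, method = "spearman")
print(c(tau.byhand = tau.byhand, tau.cor = tau.cor, rho.cor = rho.cor))
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so tau comes out at 0.6, while Spearman's rho for the same data is 0.8.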
The easiest way to spot the difference is probably graphically:
corrs.df2=corr.rank.race(results.combined,'kendall')
corrs.df2[order(corrs.df2$V1),]

#Spearman in red (large points), Kendall in blue
g=ggplot(corrs.df)+geom_point(aes(x=V1,y=grid.int),col='red',size=4)
g=g+geom_point(data=corrs.df2, aes(x=V1,y=grid.int),col='blue')
g=xRot(g,6)+xlab(NULL)+ylab('Correlation')+ylim(0,1)
g=g+ggtitle('F1 2009 Correlation: grid and final classification')
g
> corrs.df2[order(corrs.df2$V1),]
V1 p3pos.int qpos.int grid.int racepos.raw pval.grid pval.qpos pval.p3pos
2 AUSTRALIA 0.17894737 -0.01052632 0.04210526 1 8.226829e-01 9.744669e-01 0.288378196
13 MALAYSIA 0.26315789 0.41052632 0.46315789 1 3.782665e-03 1.110136e-02 0.112604127
6 CHINA -0.20000000 0.41052632 0.35789474 1 2.832863e-02 1.110136e-02 0.233266557
3 BAHRAIN 0.07368421 0.51578947 0.52631579 1 8.408301e-04 1.099522e-03 0.677108239
16 SPAIN 0.17894737 0.64210526 0.64210526 1 2.506940e-05 2.506940e-05 0.288378196
14 MONACO 0.38947368 0.35789474 0.35789474 1 2.832863e-02 2.832863e-02 0.016406081
17 TURKEY 0.37894737 0.64210526 0.64210526 1 2.506940e-05 2.506940e-05 0.019784403
9 GREAT BRITAIN 0.46315789 0.63157895 0.63157895 1 3.622261e-05 3.622261e-05 0.003782665
8 GERMANY 0.23157895 0.31578947 0.30526316 1 6.380788e-02 5.475355e-02 0.164976406
10 HUNGARY 0.36842105 0.36842105 0.36842105 1 2.860214e-02 2.860214e-02 0.028602137
7 EUROPE 0.21052632 0.62105263 0.62105263 1 5.176962e-05 5.176962e-05 0.208628398
4 BELGIUM 0.02105263 0.46315789 0.46315789 1 3.782665e-03 3.782665e-03 0.923502331
11 ITALY 0.35789474 0.36842105 0.36842105 1 2.373450e-02 2.373450e-02 0.028328627
15 SINGAPORE 0.35789474 0.45263158 0.55789474 1 3.589956e-04 4.748310e-03 0.028328627
12 JAPAN 0.26315789 0.57894737 0.69590643 1 6.491222e-06 3.109641e-04 0.124796908
5 BRAZIL -0.37894737 -0.05263158 -0.04210526 1 8.226829e-01 7.732195e-01 0.019784403
1 ABU DHABI 0.34736842 0.61052632 0.55789474 1 3.589956e-04 7.321900e-05 0.033643947
Hmm.. Kendall gives lower grid values for every race, though for Hungary the two scores are almost identical (0.368 against 0.370) – maybe put that on the “must look at Hungary compared to the other races” pile…;-)
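To put numbers on the gap between the two measures, here is a small sketch that takes the grid.int values for a handful of races, copied from the two tables above, and sorts them by the difference. The data frame name cmp and its column names are just illustrative choices.

```r
# grid.int values copied from the Spearman and Kendall tables for five races
cmp <- data.frame(
  race     = c("AUSTRALIA", "SPAIN", "HUNGARY", "JAPAN", "BRAZIL"),
  spearman = c(0.087218045, 0.804511278, 0.370175439, 0.849122807, -0.007518797),
  kendall  = c(0.04210526, 0.64210526, 0.36842105, 0.69590643, -0.04210526)
)

# How much higher is Spearman's rho than Kendall's tau for each race?
cmp$diff <- cmp$spearman - cmp$kendall

# Smallest gap first
cmp <- cmp[order(cmp$diff), ]
print(cmp)
```

With the full data frames to hand, something like merge(corrs.df, corrs.df2, by='V1', suffixes=c('.spearman','.kendall')) should give the same comparison for every race.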
One thing that did occur to me was that I have access to race data from other years, so it shouldn’t be too hard to see how the correlations play out over the years at different circuits (do grid/race correlations tend to be higher at some circuits, for example?).
testYears=function(years=2009:2012){
  bd=NULL
  for (year in years) {
    d=getRacesData.full(year)
    corrs.df=corr.rank.race(d)
    bd=rbind(bd,cbind(year,corrs.df))
  }
  bd
}

a=testYears(2006:2012)
ggplot(a)+geom_point(aes(x=year,y=grid.int))+facet_wrap(~V1)+ylim(0,1)

So Spain and Turkey look like they tend to the processional? Let’s see if a boxplot bears that out:

g=ggplot(a)+geom_boxplot(aes(x=V1,y=grid.int))
g=xRot(g)
g
How predictable have the years been, year on year?
g=ggplot(a)+geom_point(aes(x=V1,y=grid.int))+facet_wrap(~year)+ylim(0,1)
g=xRot(g)
g

And as a boxplot:

ggplot(a)+geom_boxplot(aes(x=factor(year),y=grid.int))
From a betting point of view (e.g. Getting Started with F1 Betting Data and The Basics of Betting as a Way of Keeping Score…), it possibly also makes sense to look at the correlation between the P3 classification and the qualifying classification, to see if there is a testable edge in the data when it comes to betting on quali.
I think I need to tweak my code slightly to make it easy to pull out correlations between specific columns, but that’ll have to wait for another day…
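As a starting point for that tweak, here is a minimal sketch of what such a function might look like. The name corr.cols.race and its arguments are hypothetical, and the toy data frame simply stands in for results.combined with invented values.

```r
# Hypothetical helper: rank correlation between two named columns, race by race
corr.cols.race <- function(df, col1, col2, cmethod = "spearman") {
  # Split the rows by race (dropping unused factor levels), then correlate
  # the two requested columns within each race
  sapply(split(df, df$race, drop = TRUE), function(d) {
    cor(d[[col1]], d[[col2]], method = cmethod, use = "complete.obs")
  })
}

# Toy stand-in for results.combined (two races, five cars each, invented values)
toy <- data.frame(
  race      = rep(c("AUSTRALIA", "MALAYSIA"), each = 5),
  p3pos.int = c(1, 2, 3, 4, 5,  2, 1, 3, 5, 4),
  qpos.int  = c(1, 2, 3, 4, 5,  1, 2, 3, 4, 5)
)

# Correlation between P3 and qualifying classifications for each race
p3.vs.quali <- corr.cols.race(toy, "p3pos.int", "qpos.int")
print(p3.vs.quali)
```

Called as corr.cols.race(results.combined, 'p3pos.int', 'qpos.int'), the same function should give the P3/quali correlations for a real season, and passing cmethod='kendall' switches the measure.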