F1 Doing the Data Visualisation Competition Thing With Tata?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Sort of via @jottevanger, it seems that Tata Communications announces the first challenge in the F1® Connectivity Innovation Prize to extract and present new information from Formula One Management’s live data feeds. (The F1 site has a post Tata launches F1® Connectivity Innovation Prize dated “10 Jun 2014″? What’s that about then?)
Tata Communications are the folk who supply connectivity to F1, so this could be a good call from them. It’ll be interesting to see how much attention – and interest – it gets.
The competition site can be found here: The F1 Innovation Connectivity Prize.
The first challenge is framed as follows:
The Formula One Management Data Screen Challenge is to propose what new and insightful information can be derived from the sample data set provided and, as a second element to the challenge, show how this insight can be delivered visually to add suspense and excitement to the audience experience.
…
The sample dataset provided by Formula One Management includes Practice 1, Qualifying and race data, and contains the following elements:– Position
– Car number
– Driver’s name
– Fastest lap time
– Gap to the leader’s fastest lap time
– Sector 1 time for the current lap
– Sector 2 time for the current lap
– Sector 3 time for the current lap
– Number of laps
If you aren’t familiar with motorsport timing screens, they typically look like this…
A technical manual is also provided for helping makes sense of the data files.
Here are fragments from the data files – one for practice, one for qualifying and one for the race.
First up, practice:
... <transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction> <transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction> <transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction> <transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction> <transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction> <transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction> ...
Then qualifying:
... <transaction identifier="102" messagecount="64968" timestamp="12:22:01.956"><data column="4" row="3" colour="WHITE" value="210"/></transaction> <transaction identifier="102" messagecount="64971" timestamp="12:22:01.973"><data column="3" row="4" colour="WHITE" value="PER"/></transaction> <transaction identifier="102" messagecount="64972" timestamp="12:22:01.973"><data column="4" row="4" colour="WHITE" value="176"/></transaction> <transaction identifier="103" messagecount="876478" timestamp="12:22:02.909"><data column="2" row="1" colour="YELLOW" value="16:04" clock="true"/></transaction> <transaction identifier="101" messagecount="64987" timestamp="12:22:03.731"><data column="2" row="1" colour="WHITE" value="21"/></transaction> <transaction identifier="101" messagecount="64989" timestamp="12:22:03.731"><data column="3" row="1" colour="YELLOW" value="E. GUTIERREZ"/></transaction> ...
Then the race:
... <transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction> <transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction> <transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction> <transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction> <transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction> <transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction> <transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction> <transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction> ...
We can parse the datafiles using python using an approach something like the following:
from lxml import etree pl=[] for xml in open(xml_doc, 'r'): pl.append(etree.fromstring(xml)) pl[100].attrib #{'identifier': '101', 'timestamp': '10:49:56.085', 'messagecount': '9716'} pl[100][0].attrib #{'column': '3', 'colour': 'WHITE', 'value': 'J. BIANCHI', 'row': '12'}
A few things are worth mentioning about this format… Firstly, the identifier is an identifier of the message type, rather then the message: each transaction message appears instead to be uniquely identified by the messagecount. The transactions each update the value of a single cell in the display screen, setting its value and colour. The cell is identified by its row and column co-ordinates. The timestamp also appears to group messages.
Secondly, within a session, several screen views are possible – essentially associated with data labelled with a particular identifier. This means the data feed is essentially powering several data structures.
Thirdly, each screen display is a snapshot of a datastructure at a particular point in time. There is no single record in the datafeed that gives a view over the whole results table. In fact, there is no single message that describes the state of a single row at a particular point in time. Instead, the datastructure is built up by a continual series of updates to individual cells. Transaction elements in the feed are cell based events not row based events.
It’s not obvious how we can make a row based transaction update, even, though on occasion we may be able to group updates to several columns within a row by gathering together all the messages that occur at a particular timestamp and mention a particular row. For example, look at the example of the race timing data above, for timestamp=”14:57:11.681″ and row=”3″. If we parsed each of these into separate dataframes, using the timestamp as the index, we could align the dataframes using the *pandas* DataFrame .align() method.
Note: I’m not sure that the timestamps are necessarily unique across rows, though I suspect that they are likely to be so, which means it would be safer to align, or merge, on the basis of the timestamp and the row number? From inspection of the data, it looks as if it is possible for a couple of timestamps to differ slightly (by milliseconds) yet apply to the same row. I guess we would treat these as separate grouped elements? Depending on the timewidth that all changes to a row are likely to occur in, we could perhaps round times for the basis of the join?
Even with a bundling, we still don’t a have a complete description of all the cells in a row. They need to have been set historically…
The following fragment is a first attempt at building up the timing screen data structure for the practice timing at a particular point of time. To find the state of the timing screen at a particular time, we’d have to start building it up from the start of time, and then stop it updating at the time we were interested in:
#Hacky load and parse of each row in the datafile pl=[] for xml in open('data/F1 Practice.txt', 'r'): pl.append(etree.fromstring(xml)) #Dataframe for current state timing screen df_practice_pos=pd.DataFrame(columns=[ "timestamp", "time", "classpos", "classpos_colour", "racingNumber","racingNumber_colour", "name","name_colour", ],index=range(50)) #Column mappings practiceMap={ '1':'classpos', '2':'racingNumber', '3':'name', '4':'laptime', '5':'gap', '6':'sector1', '7':'sector2', '8':'sector3', '9':'laps', '21':'sector1_best', '22':'sector2_best', '23':'sector3_best' } def parse_practice(p,df_practice_pos): if p.attrib['identifier']=='101' and 'sessionstate' not in p[0].attrib: if p[0].attrib['column'] not in ['10','21','22','23']: colname=practiceMap[p[0].attrib['column']] row=int(p[0].attrib['row'])-1 df_practice_pos.ix[row]['timestamp']=p.attrib['timestamp'] tt=p.attrib['timestamp'].replace('.',':').split(':') df_practice_pos.ix[row]['time'] = datetime.time(int(tt[0]),int(tt[1]),int(tt[2]),int(tt[3])*1000) df_practice_pos.ix[row][colname]=p[0].attrib['value'] df_practice_pos.ix[row][colname+'_colour']=p[0].attrib['colour'] return df_practice_pos for p in pl[:2850]: df_practice_pos=parse_practice(p,df_practice_pos) df_practice_pos
Getting sensible data structures at the timing screen level looks like it could be problematic. But to what extent are the feed elements meaningful in and of themselves? Each element in the feed actually has a couple of semantically meaningful data points associated with it, as well as the timestamp: the classification position, which corresponds to the row; and the column designator.
That means we can start to explore simple charts that map driver number against race classification, for example, by grabbing the row (that is, the race classification position) and timestamp every time we see a particular driver number:
A notebook where I start to explore some of these ideas can be found here: racedemo.ipynb.
Something else I’ve started looking at is the use of MongoDB for grouping items that share the same timestamp (again, check the racedemo.ipynb notebook). If we create an ID based on the timestamp and row, we can repeatedly $set document elements against that key even if they come from separate timing feed elements. This gets us so far, but still falls short of identifying row based sets. We can perhaps get closer by grouping items associated with a particular row in time, for example, grouping elements associated with a particular row that are within half a second of each other. Again, the racedemo.ipynb notebook has the first fumblings of an attempt to work this out.
I’m not likely to have much chance to play with this data over the next week or so, and the time for making entries is short. I never win data competitions anyway (I can’t do the shiny stuff that judges tend to go for), but I’m keen to see what other folk can come up with:-)
PS The R book has stalled so badly I’ve pushed what I’ve got so far to wranglingf1datawithr repo now… Hopefully I’ll get a chance to revisit it over the summer, and push on with it a bit more… WHen I get a couple of clear hours, I’ll try to push the stuff that’s there out onto leanpub as a preview…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.