Programming with R – Processing Football League Data Part I
In this post we will make use of football results data from the football-data.co.uk website to demonstrate creating functions in R to automate a series of standard operations that would be required for results data from various leagues and divisions.
The first step is to consider what control options should be available as part of the function and here is a list of some arguments that will be used for this implementation of a football result data processing function:
- The name of a csv data file from the football-data.co.uk website.
- A text string to specify the country and division for the data.
- A text string specifying the season.
- A list of teams in the division (optional), which could be used to test for data entry errors in the data file.
- The number of points for a win, draw or loss. This might seem a strange option initially but different leagues might award different points for the three outcomes.
Some of this information might appear optional but is included so that we can write a custom print function at a later date to display a meaningful summary of the object (list) that will be created by the function.
The first part of our function is concerned with checking the various values provided to the function arguments. Our skeleton function is as follows:
football.process.v1 = function(datafile, country, divname, season, teams, winPoints = 3, drawPoints = 1, lossPoints = 0) { }
Here we have specified default values for three of the arguments, using the most likely number of points for each match outcome, i.e. 3 points for a win, 1 point for a draw and 0 points for a loss.
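As a rough illustration of why the points are arguments rather than hard-coded values, the sketch below maps full-time results (coded H, D or A, as in the FTR column of the football-data.co.uk files) to points for the home and away sides. The helper name match.points is hypothetical and is not part of the function developed in this post; it only shows one way the three values might be used later on.

```r
## Hypothetical helper (not part of football.process.v1): convert full-time
## results coded "H", "D" or "A" into points for the home and away teams.
match.points = function(ftr, winPoints = 3, drawPoints = 1, lossPoints = 0) {
  ftr = as.character(ftr)
  home = ifelse(ftr == "H", winPoints, ifelse(ftr == "D", drawPoints, lossPoints))
  away = ifelse(ftr == "A", winPoints, ifelse(ftr == "D", drawPoints, lossPoints))
  data.frame(FTR = ftr, HomePoints = home, AwayPoints = away)
}

## Default scoring: 3 for a win, 1 for a draw, 0 for a loss
std.points = match.points(c("H", "D", "A"))
print(std.points)

## A league awarding 2 points for a win only needs one argument changed
old.points = match.points(c("H", "D", "A"), winPoints = 2)
print(old.points)
```

Because the defaults sit in the function signature, the common case needs no extra typing while unusual scoring systems remain possible.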
To illustrate the working of the result processing function we will use a small excerpt from the start of the 2010/2011 English Premiership season, which is shown below:
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee
E0,14/8/2010,Aston Villa,West Ham,3,0,H,2,0,H,M Dean
E0,14/8/2010,Blackburn,Everton,1,0,H,1,0,H,P Dowd
E0,14/8/2010,Bolton,Fulham,0,0,D,0,0,D,S Attwell
E0,14/8/2010,Chelsea,West Brom,6,0,H,2,0,H,M Clattenburg
E0,14/8/2010,Sunderland,Birmingham,2,2,D,1,0,H,A Taylor
E0,14/8/2010,Tottenham,Man City,0,0,D,0,0,D,A Marriner
E0,14/8/2010,Wigan,Blackpool,0,4,A,0,3,A,M Halsey
E0,14/8/2010,Wolves,Stoke,2,1,H,2,0,H,L Probert
E0,15/8/2010,Liverpool,Arsenal,1,1,D,0,0,D,M Atkinson
E0,16/8/2010,Man United,Newcastle,3,0,H,2,0,H,C Foy
E0,21/8/2010,Arsenal,Blackpool,6,0,H,3,0,H,M Jones
E0,21/8/2010,Birmingham,Blackburn,2,1,H,0,0,D,M Oliver
E0,21/8/2010,Everton,Wolves,1,1,D,1,0,H,L Mason
E0,21/8/2010,Stoke,Tottenham,1,2,A,1,2,A,C Foy
E0,21/8/2010,West Brom,Sunderland,1,0,H,0,0,D,K Friend
E0,21/8/2010,West Ham,Bolton,1,3,A,0,0,D,A Marriner
E0,21/8/2010,Wigan,Chelsea,0,6,A,0,1,A,M Dean
E0,22/8/2010,Fulham,Man United,2,2,D,0,1,A,P Walton
E0,22/8/2010,Newcastle,Aston Villa,6,0,H,3,0,H,M Atkinson
E0,23/8/2010,Man City,Liverpool,3,0,H,1,0,H,P Dowd
E0,28/8/2010,Blackburn,Arsenal,1,2,A,1,1,D,C Foy
E0,28/8/2010,Blackpool,Fulham,2,2,D,0,1,A,M Oliver
E0,28/8/2010,Chelsea,Stoke,2,0,H,1,0,H,M Atkinson
E0,28/8/2010,Man United,West Ham,3,0,H,1,0,H,M Clattenburg
E0,28/8/2010,Tottenham,Wigan,0,1,A,0,0,D,P Dowd
E0,28/8/2010,Wolves,Newcastle,1,1,D,1,0,H,S Attwell
E0,29/8/2010,Aston Villa,Everton,1,0,H,1,0,H,M Jones
E0,29/8/2010,Bolton,Birmingham,2,2,D,0,1,A,K Friend
E0,29/8/2010,Liverpool,West Brom,1,0,H,0,0,D,L Probert
E0,29/8/2010,Sunderland,Man City,1,0,H,0,0,D,M Dean
This is stored in a file E0test.csv so that we can use the read.csv function to import the results data and then process it.
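For a quick experiment without creating a file, read.csv can also read from a character string through its text argument. The sketch below uses the first three fixtures of the excerpt; the object names are only for illustration. Note that since R 4.0.0 text columns are read as character vectors by default, whereas older versions of R converted them to factors.

```r
## First three fixtures from the excerpt, supplied as text rather than a file
csv.lines = "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee
E0,14/8/2010,Aston Villa,West Ham,3,0,H,2,0,H,M Dean
E0,14/8/2010,Blackburn,Everton,1,0,H,1,0,H,P Dowd
E0,14/8/2010,Bolton,Fulham,0,0,D,0,0,D,S Attwell"

small.results = read.csv(text = csv.lines)

## One row per fixture and one column per field in the header
print(dim(small.results))
str(small.results)
```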
The first series of commands that we add to the function are for checking various function arguments specified by the user to ensure that they are sensible. First up we check whether a results data file has been specified as we cannot do any processing without any results. The simple test is for whether a file name has been specified:
if (missing(datafile)) {
  stop("Results csv file not specified.")
}
It might be sensible to check whether the object datafile is actually a character string specifying a file, but this hasn’t been done for now. We then check whether the country name and division have been specified and set them to blank strings if they haven’t been set by the user.
if (missing(country)) {
  warning("Country of league not specified.")
  country = ""
}
if (missing(divname)) {
  warning("Name of league division not specified.")
  divname = ""
}
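The stricter check on datafile mentioned above could look something like the following. This is only a sketch under the assumption that datafile should be a single character string naming an existing file; the helper name check.datafile and the error messages are made up for illustration and are not part of the function in this post.

```r
## Hypothetical helper: stricter validation of the datafile argument
check.datafile = function(datafile) {
  if (missing(datafile)) {
    stop("Results csv file not specified.")
  }
  if (!is.character(datafile) || length(datafile) != 1) {
    stop("datafile should be a single character string.")
  }
  if (!file.exists(datafile)) {
    stop("File not found: ", datafile)
  }
  invisible(TRUE)
}

## Demonstration with a temporary file so the example is self-contained
tmp.csv = tempfile(fileext = ".csv")
writeLines("Div,Date,HomeTeam,AwayTeam", tmp.csv)
ok = check.datafile(tmp.csv)

## A non-character value is rejected; try() captures the error
bad = try(check.datafile(42), silent = TRUE)
print(inherits(bad, "try-error"))
```

The same missing() pattern would also apply to season, which the function as written uses without checking.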
Next up we import the data file and keep only the columns of interest (at this point in the development of the function, at least). The raw data from the website contains many more columns of information than we need for now:
tmpResults = read.csv(datafile)[,c("Date","HomeTeam","AwayTeam","FTR","FTHG","FTAG")]
The square brackets subset the imported data frame by column name, so that only the six named columns are kept. Then we check whether the team names have been specified by the user and, if not, extract them from the data provided:
if (missing(teams)) {
  warning("Team names not specified - extracted from results data.")
  teams = sort(unique(c(as.character(tmpResults$HomeTeam),
                        as.character(tmpResults$AwayTeam))))
}
The sort function is used to order the team names alphabetically, which is the order often used in league tables, especially when no games have been played. We then convert the columns HomeTeam and AwayTeam into factors with the full list of teams as levels, which allows teams that haven’t played a fixture yet to be included in the table.
tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)
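The effect of supplying the levels explicitly can be seen in a small made-up example. The team list and fixtures below are invented for illustration, and "Man Utd" is deliberately misspelled relative to the level "Man United" to mimic a data entry error of the kind the teams argument could help to catch.

```r
## Full list of teams (levels) and a few home sides; "Man Utd" is a
## deliberate mismatch with the level "Man United"
all.teams = c("Arsenal", "Chelsea", "Liverpool", "Man United")
home = c("Arsenal", "Chelsea", "Man Utd")

home.factor = factor(home, levels = all.teams)

## Teams without a home fixture still appear, with a count of zero
home.counts = table(home.factor)
print(home.counts)

## Names not found in the levels become NA, which flags possible typos
unmatched = home[is.na(home.factor)]
print(unmatched)
```

A check along these lines, run after the conversion to factors, would be one way to warn the user about team names in the data file that do not match the supplied list.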
To round off the first part of creating the result processing function we create a list object to return at the end of the function.
tmpSummary = list(Country = country, Division = divname, Season = season, Teams = teams, Results = tmpResults)
The function so far:
football.process.v1 = function(datafile, country, divname, season, teams,
                               winPoints = 3, drawPoints = 1, lossPoints = 0)
{
  ## Validation Function Arguments
  if (missing(datafile)) {
    stop("Results csv file not specified.")
  }
  if (missing(country)) {
    warning("Country of league not specified.")
    country = ""
  }
  if (missing(divname)) {
    warning("Name of league division not specified.")
    divname = ""
  }

  ## Import Results
  tmpResults = read.csv(datafile)[,c("Date","HomeTeam","AwayTeam","FTR","FTHG","FTAG")]
  if (missing(teams)) {
    warning("Team names not specified - extracted from results data.")
    teams = sort(unique(c(as.character(tmpResults$HomeTeam),
                          as.character(tmpResults$AwayTeam))))
  }
  tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
  tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)

  ## Return Division Information
  tmpSummary = list(Country = country, Division = divname, Season = season,
                    Teams = teams, Results = tmpResults)
  invisible(tmpSummary)
}
We then test this function with the data file shown above. First up we create our own list of teams in the English Premiership for 2010/2011 and specify some of the other function arguments while using the defaults for points.
> E0teams.1011 = c("Arsenal", "Aston Villa", "Birmingham", "Blackburn",
+   "Blackpool", "Bolton", "Chelsea", "Everton", "Fulham", "Liverpool",
+   "Man City", "Man United", "Newcastle", "Stoke", "Sunderland",
+   "Tottenham", "West Brom", "West Ham", "Wigan", "Wolves")
> print(football.process.v1("E0test.csv", "England", "Premiership", "2010-2011", E0teams.1011))
$Country
[1] "England"

$Division
[1] "Premiership"

$Season
[1] "2010-2011"

$Teams
 [1] "Arsenal"     "Aston Villa" "Birmingham"  "Blackburn"   "Blackpool"
 [6] "Bolton"      "Chelsea"     "Everton"     "Fulham"      "Liverpool"
[11] "Man City"    "Man United"  "Newcastle"   "Stoke"       "Sunderland"
[16] "Tottenham"   "West Brom"   "West Ham"    "Wigan"       "Wolves"

$Results
        Date    HomeTeam    AwayTeam FTR FTHG FTAG
1  14/8/2010 Aston Villa    West Ham   H    3    0
2  14/8/2010   Blackburn     Everton   H    1    0
3  14/8/2010      Bolton      Fulham   D    0    0
4  14/8/2010     Chelsea   West Brom   H    6    0
5  14/8/2010  Sunderland  Birmingham   D    2    2
6  14/8/2010   Tottenham    Man City   D    0    0
7  14/8/2010       Wigan   Blackpool   A    0    4
8  14/8/2010      Wolves       Stoke   H    2    1
9  15/8/2010   Liverpool     Arsenal   D    1    1
10 16/8/2010  Man United   Newcastle   H    3    0
11 21/8/2010     Arsenal   Blackpool   H    6    0
12 21/8/2010  Birmingham   Blackburn   H    2    1
13 21/8/2010     Everton      Wolves   D    1    1
14 21/8/2010       Stoke   Tottenham   A    1    2
15 21/8/2010   West Brom  Sunderland   H    1    0
16 21/8/2010    West Ham      Bolton   A    1    3
17 21/8/2010       Wigan     Chelsea   A    0    6
18 22/8/2010      Fulham  Man United   D    2    2
19 22/8/2010   Newcastle Aston Villa   H    6    0
20 23/8/2010    Man City   Liverpool   H    3    0
21 28/8/2010   Blackburn     Arsenal   A    1    2
22 28/8/2010   Blackpool      Fulham   D    2    2
23 28/8/2010     Chelsea       Stoke   H    2    0
24 28/8/2010  Man United    West Ham   H    3    0
25 28/8/2010   Tottenham       Wigan   A    0    1
26 28/8/2010      Wolves   Newcastle   D    1    1
27 29/8/2010 Aston Villa     Everton   H    1    0
28 29/8/2010      Bolton  Birmingham   D    2    2
29 29/8/2010   Liverpool   West Brom   H    1    0
30 29/8/2010  Sunderland    Man City   H    1    0
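The explicit call to print is needed because the function returns its result through invisible, which suppresses automatic printing at the console. A toy function (hypothetical, and unrelated to the football data) shows the behaviour, and how individual elements of the returned list can be picked out with $:

```r
## Toy function returning a list invisibly, as football.process.v1 does
make.summary = function() {
  invisible(list(Country = "England", Division = "Premiership"))
}

make.summary()            # nothing is printed at the console
res = make.summary()      # the value can still be assigned ...
print(res$Division)       # ... and its elements extracted with $

## withVisible() reports whether a value would have been auto-printed
vis = withVisible(make.summary())$visible
print(vis)
```

Returning the list invisibly keeps the console quiet when the result is assigned to an object, which is the usual way a processing function like this would be called.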