Programming with R – Processing Football League Data Part II
Following on from the previous post about creating a football result processing function for data from the football-data.co.uk website, we will now add code to the function to generate a league table based on the results to date.
To create the league table we need to count various things such as the number of games played, number of wins/draws/losses, goals scored etc. This information is available in the results object that is loaded from a csv file in the function as it stands.
To facilitate these calculations we create a data frame with a row for each team in the division and then calculate the statistics required – this was a reason for ordering the factors in the HomeTeam and AwayTeam columns of the results table. The data frame is created with the code below:
tmpTable = data.frame(Team = teams,
  Games = 0, Win = 0, Draw = 0, Loss = 0,
  HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0,
  AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0,
  Points = 0,
  HomeFor = 0, HomeAgainst = 0,
  AwayFor = 0, AwayAgainst = 0,
  For = 0, Against = 0,
  GoalDifference = 0)
A number of these columns are redundant in the final league table but are used for intermediate calculations, such as HomeWin and AwayWin, which are combined to find the total number of victories for a team.
The number of games played by each team home and away are counted using the table command for the two columns respectively.
tmpTable$HomeGames = as.numeric(table(tmpResults$HomeTeam))
tmpTable$AwayGames = as.numeric(table(tmpResults$AwayTeam))
The labels created by the table command are discarded using the as.numeric function to retain only the number of games. The table command is also used to count the number of wins, draws and losses at home and away for each team. The commands are shown here:
tmpTable$HomeWin = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "H"]))
tmpTable$HomeDraw = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "D"]))
tmpTable$HomeLoss = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "A"]))
tmpTable$AwayWin = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "A"]))
tmpTable$AwayDraw = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "D"]))
tmpTable$AwayLoss = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "H"]))
Note that we subset on the values in the FTR column, which is full-time result, and then count. The subsetting is reversed when looking at the away fixtures because a victory for the team is now an away win rather than a home win.
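The counting step above relies on the team columns being factors with a fixed set of levels, so that teams with no matching games still get a count of zero. The following is a minimal sketch of that behaviour, assuming a hypothetical three-team division with made-up results; the object names are illustrative and are not part of the function.

```r
# Hypothetical three-team division and three results (illustrative data only)
teams <- c("Arsenal", "Chelsea", "Everton")
results <- data.frame(
  HomeTeam = factor(c("Arsenal", "Chelsea", "Arsenal"), levels = teams),
  AwayTeam = factor(c("Chelsea", "Everton", "Everton"), levels = teams),
  FTR      = c("H", "D", "A")
)

# Because the factor levels are fixed, every team gets a count, including zero
homeGames <- table(results$HomeTeam)
print(homeGames)

# as.numeric drops the team labels and keeps the counts in level order
homeGamesCount <- as.numeric(homeGames)

# Home wins: keep only the rows where the home side won, then count per team
homeWin <- as.numeric(table(results$HomeTeam[results$FTR == "H"]))

# Away wins: the filter flips to "A" and the counting column to AwayTeam
awayWin <- as.numeric(table(results$AwayTeam[results$FTR == "A"]))
```

Subsetting a factor keeps all of its levels, which is why the counts stay aligned with the rows of the league table even when a team has no wins yet.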
This information is then combined to get total games played, won etc.
tmpTable$Games = tmpTable$HomeGames + tmpTable$AwayGames
tmpTable$Win = tmpTable$HomeWin + tmpTable$AwayWin
tmpTable$Draw = tmpTable$HomeDraw + tmpTable$AwayDraw
tmpTable$Loss = tmpTable$HomeLoss + tmpTable$AwayLoss
The total points are calculated by multiplying the number of wins, draws and losses by the number of points awarded for each match outcome.
tmpTable$Points = winPoints * tmpTable$Win + drawPoints * tmpTable$Draw + lossPoints * tmpTable$Loss
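Because the points per outcome are function arguments, the same formula can cover different scoring schemes. A small sketch of the arithmetic follows; the helper name leaguePoints and the win/draw/loss totals are hypothetical and only serve to illustrate the calculation.

```r
# Hypothetical win/draw/loss totals for three teams (illustrative only)
tbl <- data.frame(
  Team = c("A", "B", "C"),
  Win  = c(2, 1, 0),
  Draw = c(1, 1, 2),
  Loss = c(0, 1, 1)
)

# Hypothetical helper wrapping the same formula used for the Points column
leaguePoints <- function(tbl, winPoints = 3, drawPoints = 1, lossPoints = 0) {
  winPoints * tbl$Win + drawPoints * tbl$Draw + lossPoints * tbl$Loss
}

pts3 <- leaguePoints(tbl)                 # three points for a win
pts2 <- leaguePoints(tbl, winPoints = 2)  # two points for a win
```

With three points for a win, team A gets 2 * 3 + 1 * 1 = 7 points; with two points for a win, the same record is worth 5.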
The next set of calculations counts the number of goals scored, goals conceded and the goal difference. The tapply function is used for these calculations.
tmpTable$HomeFor = as.numeric(tapply(tmpResults$FTHG, tmpResults$HomeTeam, sum, na.rm = TRUE))
tmpTable$HomeAgainst = as.numeric(tapply(tmpResults$FTAG, tmpResults$HomeTeam, sum, na.rm = TRUE))
tmpTable$AwayFor = as.numeric(tapply(tmpResults$FTAG, tmpResults$AwayTeam, sum, na.rm = TRUE))
tmpTable$AwayAgainst = as.numeric(tapply(tmpResults$FTHG, tmpResults$AwayTeam, sum, na.rm = TRUE))
The tapply function applies the sum to the number of goals scored at home or away, and to the number of goals conceded, by each team in the division. These are then combined to create totals across home and away fixtures:
tmpTable$For = ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) +
  ifelse(is.na(tmpTable$AwayFor), 0, tmpTable$AwayFor)
tmpTable$Against = ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
  ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)
The ifelse statement is used to handle situations where a team hasn’t played a home and/or away fixture yet. The goal difference is easy to calculate:
tmpTable$GoalDifference = tmpTable$For - tmpTable$Against
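To see why the ifelse guard is needed, the sketch below uses a hypothetical three-team division in which one team has not played yet; the data and object names are made up for illustration. For a factor level with no rows, tapply returns NA rather than zero, and the ifelse call replaces that NA before the totals are added.

```r
# Hypothetical division: Everton has not played a game yet (illustrative data)
teams <- c("Arsenal", "Chelsea", "Everton")
results <- data.frame(
  HomeTeam = factor(c("Arsenal", "Chelsea"), levels = teams),
  AwayTeam = factor(c("Chelsea", "Arsenal"), levels = teams),
  FTHG = c(2, 1),  # full-time home goals
  FTAG = c(0, 1)   # full-time away goals
)

# Goals scored at home and away, summed per team; empty levels come back as NA
homeFor <- as.numeric(tapply(results$FTHG, results$HomeTeam, sum, na.rm = TRUE))
awayFor <- as.numeric(tapply(results$FTAG, results$AwayTeam, sum, na.rm = TRUE))

# Replace NA with zero before adding, so teams without fixtures total zero
goalsFor <- ifelse(is.na(homeFor), 0, homeFor) + ifelse(is.na(awayFor), 0, awayFor)
```

Note that na.rm = TRUE only removes missing goal values within a group; it does not turn an empty group into zero, which is why the separate is.na check is still required.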
Now that all of the statistics have been calculated we sort the table based on the number of points, goal difference and finally alphabetically. There might be different ways that we can order the teams but this is what we will use for the time being:
tmpTable = tmpTable[order(- tmpTable$Points, - tmpTable$GoalDifference, tmpTable$Team),]
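The ordering might look odd, but we want to rank from highest to lowest on points and goal difference, and then in ascending alphabetical order on team name. Negating the two numeric columns reverses their sort direction, while the team name is left as it is.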
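As a small illustration of the mixed sort directions, the sketch below orders a hypothetical four-team table; the team names and numbers are invented for the example.

```r
# Hypothetical table with ties on points and goal difference (illustrative only)
tbl <- data.frame(
  Team = c("Wolves", "Burnley", "Leeds", "Derby"),
  Points = c(6, 6, 4, 6),
  GoalDifference = c(3, 3, 0, 5),
  stringsAsFactors = FALSE
)

# Negated numeric keys sort descending; the team name sorts ascending
sorted <- tbl[order(-tbl$Points, -tbl$GoalDifference, tbl$Team), ]
print(sorted)
```

Derby comes first on goal difference, Burnley and Wolves are tied on both numeric keys and are separated alphabetically, and Leeds is last on points.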
The whole function is now:
football.process.v2 = function(datafile, country, divname, season, teams,
  winPoints = 3, drawPoints = 1, lossPoints = 0)
{
  ## Validate Function Arguments
  if (missing(datafile)) {
    stop("Results csv file not specified.")
  }
  if (missing(country)) {
    warning("Country of league not specified.")
    country = ""
  }
  if (missing(divname)) {
    warning("Name of league division not specified.")
    divname = ""
  }
  ## season is used in the returned list, so give it a default as well
  if (missing(season)) {
    warning("Season not specified.")
    season = ""
  }

  ## Import Results
  tmpResults = read.csv(datafile)[,c("Date","HomeTeam","AwayTeam","FTR","FTHG","FTAG")]
  if (missing(teams)) {
    warning("Team names not specified - extracted from results data.")
    teams = sort(unique(c(as.character(tmpResults$HomeTeam), as.character(tmpResults$AwayTeam))))
  }
  tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
  tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)

  ## Create Empty League Table
  tmpTable = data.frame(Team = teams,
    Games = 0, Win = 0, Draw = 0, Loss = 0,
    HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0,
    AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0,
    Points = 0,
    HomeFor = 0, HomeAgainst = 0,
    AwayFor = 0, AwayAgainst = 0,
    For = 0, Against = 0,
    GoalDifference = 0)

  ## Count Number of Games Played
  tmpTable$HomeGames = as.numeric(table(tmpResults$HomeTeam))
  tmpTable$AwayGames = as.numeric(table(tmpResults$AwayTeam))

  ## Count Number of Wins/Draws/Losses
  tmpTable$HomeWin = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "H"]))
  tmpTable$HomeDraw = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "D"]))
  tmpTable$HomeLoss = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == "A"]))
  tmpTable$AwayWin = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "A"]))
  tmpTable$AwayDraw = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "D"]))
  tmpTable$AwayLoss = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == "H"]))
  tmpTable$Games = tmpTable$HomeGames + tmpTable$AwayGames
  tmpTable$Win = tmpTable$HomeWin + tmpTable$AwayWin
  tmpTable$Draw = tmpTable$HomeDraw + tmpTable$AwayDraw
  tmpTable$Loss = tmpTable$HomeLoss + tmpTable$AwayLoss
  tmpTable$Points = winPoints * tmpTable$Win + drawPoints * tmpTable$Draw + lossPoints * tmpTable$Loss

  ## Count Goals Scored and Conceded
  tmpTable$HomeFor = as.numeric(tapply(tmpResults$FTHG, tmpResults$HomeTeam, sum, na.rm = TRUE))
  tmpTable$HomeAgainst = as.numeric(tapply(tmpResults$FTAG, tmpResults$HomeTeam, sum, na.rm = TRUE))
  tmpTable$AwayFor = as.numeric(tapply(tmpResults$FTAG, tmpResults$AwayTeam, sum, na.rm = TRUE))
  tmpTable$AwayAgainst = as.numeric(tapply(tmpResults$FTHG, tmpResults$AwayTeam, sum, na.rm = TRUE))
  tmpTable$For = ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) +
    ifelse(is.na(tmpTable$AwayFor), 0, tmpTable$AwayFor)
  tmpTable$Against = ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
    ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)
  tmpTable$GoalDifference = tmpTable$For - tmpTable$Against

  ## Sort Table
  ## By Points
  ## By Goal Difference
  ## By Team Name (Alphabetical)
  tmpTable = tmpTable[order(- tmpTable$Points, - tmpTable$GoalDifference, tmpTable$Team),]
  tmpTable = tmpTable[,c("Team", "Games", "Win", "Draw", "Loss", "Points", "For", "Against", "GoalDifference")]

  ## Return Division Information
  tmpSummary = list(Country = country, Division = divname, Season = season,
    Teams = teams, Results = tmpResults, Table = tmpTable)
  invisible(tmpSummary)
}
There is other functionality that we might want to add to the function.