In this blog post I will be adapting some code from the wonderful FC R Stats, a great football statistics resource. Be sure to check out their tutorial for more detail on how to compose something like this. This post focuses a lot on writing functions in R, so check out the tutorials here or here if you need a refresher. The functions used are available on my GitHub.
Getting Started with xG
Expected goals (xG) is a pretty simple stat at its core: if a player takes a shot from a given position, what are the chances it goes in? So considering his position striking the ball, Dirk Kuyt probably had an xG of about 1 for this goal against Manchester United in 2011. Asmir Begovic probably had an xG close to 0 for this strike versus Southampton two years later. There are other factors that influence xG, and different statisticians have their own formulae to reach a number, but all we really care about today is the result itself. So let’s see who should be winning games, knowing the xG of each shot.
Random Numbers and xG
Lucky for us, we don’t need true randomness here. Base R has the runif() function, which gives us a pseudo-random number uniformly distributed between 0 and 1. Since xG is also measured between 0 and 1, we don’t need to manipulate these random outputs. So how can we use this against our xG numbers?
We can compare our random number to our xG figure. We generate a random number between 0 and 1, and if that number is less than or equal to our xG we consider this a successful attempt on goal. For example, if our xG = 1, we know that we are guaranteed a goal. Our random number will always be less than or equal to 1 so we know this works.
Let’s say we have a shot with an xG of 0.6, meaning we could expect it to result in a goal 60% of the time. We can check if our shot scored using the code below.
if (runif(1) <= 0.6) {
  print("GOAL!")
} else {
  print("MISS!")
}
But checking just once doesn’t tell us a lot. Let’s run it 10,000 times and see how often we score.
goals <- 0
for (i in 1:10000) {
  if (runif(1) <= 0.6) {
    goals <- goals + 1
  }
}
print(goals)
Running this 10,000 times with a 60% success rate should give us roughly 6,000 goals. My first time running this loop gave me 6,026 goals, so we’re well on our way. Let’s apply this to a match.
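The same experiment can be written without an explicit loop, since runif() accepts a count and comparisons in R are vectorised. This is a minimal sketch rather than the original code; the names nSims, scored and goalRate, and the seed value, are assumptions added for illustration.

```r
# Fix the seed so repeated runs give the same draws (the value 2019 is arbitrary)
set.seed(2019)

nSims <- 10000   # number of simulated shots (assumed name)
xG <- 0.6        # chance of scoring each shot

# Draw all random numbers at once; each TRUE marks a simulated goal
scored <- runif(nSims) <= xG

goals <- sum(scored)       # TRUE counts as 1, so this is the goal total
goalRate <- mean(scored)   # share of shots scored, which should sit near 0.6

print(goals)
print(goalRate)
```

Setting a seed is optional, but it makes the result repeatable when you share the script.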
Who wins the game?
We’ll write a function that takes two arguments – the home team xG, and the away team xG. Inside that function we can store the score of the game, which we’ll start at 0-0.
calculateWinner <- function(home, away) {
  # Set score to nil-nil for start of game
  homeGoals <- 0
  awayGoals <- 0
The two xG arguments we pass into our function can be vectors, so we don’t need to check only one shot. In order to check each shot individually we’ll create another function nested inside our calculateWinner() function.
  testShots <- function(shots) {
    # Start goal count at 0
    goal <- 0
    # If a shot goes in, add a goal
    for (shot in shots) {
      if (runif(1) <= shot) {
        goal <- goal + 1
      }
    }
    # Return the number of goals
    return(goal)
  }
Now all we need to do is to run this function on both of our inputs to the calculateWinner() function. At the start of calculateWinner(), we declared homeGoals and awayGoals to be 0. We can use the above block of code to get a score for each team and change our homeGoals/awayGoals variables.
  homeGoals <- testShots(home)
  awayGoals <- testShots(away)
Now we can finally finish our calculateWinner() function by telling us who won the game.
  # Each print() call needs its own closing parenthesis after paste0()
  if (homeGoals > awayGoals) {
    print(paste0("Home team wins ", homeGoals, "-", awayGoals))
  } else if (awayGoals > homeGoals) {
    print(paste0("Away team wins ", homeGoals, "-", awayGoals))
  } else {
    print(paste0("Draw. Full time score ", homeGoals, "-", awayGoals))
  }
}
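Assembled in one place, the function might look like the sketch below. It is a variant, not the exact code above: the shot test is vectorised with sum() instead of a loop, and the helper name simulateGoals is made up for this example.

```r
# Hypothetical helper: count goals from a vector of shot xG values
simulateGoals <- function(shots) {
  # One random draw per shot; a draw at or below the shot's xG is a goal
  sum(runif(length(shots)) <= shots)
}

calculateWinner <- function(home, away) {
  homeGoals <- simulateGoals(home)
  awayGoals <- simulateGoals(away)

  if (homeGoals > awayGoals) {
    print(paste0("Home team wins ", homeGoals, "-", awayGoals))
  } else if (awayGoals > homeGoals) {
    print(paste0("Away team wins ", homeGoals, "-", awayGoals))
  } else {
    print(paste0("Draw. Full time score ", homeGoals, "-", awayGoals))
  }
}

# A certain goal against a certain miss always ends 1-0 to the home side
calculateWinner(1, 0)
```

Using xG values of exactly 1 and 0 is a handy sanity check, because the outcome no longer depends on the random draws.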
So let’s test this code out. Let’s take a look and see what’s better – one good chance, or several okay chances?
Our home team has one chance with 0.6 xG. Our away team has five chances worth 0.12 each. Running calculateWinner(0.6, c(0.12, 0.12, 0.12, 0.12, 0.12)) returned an away win of 0-1. But this does not tell us the whole story, same as our 0.6 xG shot at the start. A sample size of one is far too small. So let’s run this 10,000 times as we did above and see who comes out on top.
Who really wins the game?
In order to get a nice breakdown of how the games typically go we can put together another quick function. Before we do that, a quick change to the end of our calculateWinner() function is needed. Rather than having the result be a sentence telling us the score, let’s just return a value letting us know the team that won.
  if (homeGoals > awayGoals) {
    return("home")
  } else if (awayGoals > homeGoals) {
    return("away")
  } else {
    return("draw")
  }
Now we’ve made it easy to find the likeliest winner: run calculateWinner() thousands of times and count each result. I’ll call this function calculateChance(). calculateChance() needs to hold the number of wins each team has, plus the number of draws, so I’ll start by setting these to zero.
calculateChance <- function(team1, team2) {
  home <- 0
  away <- 0
  draw <- 0
Now we just run our calculateWinner() code 10,000 times and note each result.
  for (i in 1:10000) {
    matchWinner <- calculateWinner(team1, team2)
    if (matchWinner == "home") {
      home <- home + 1
    } else if (matchWinner == "away") {
      away <- away + 1
    } else {
      draw <- draw + 1
    }
  }
Divide each result by 100 to get a percentage and output the proportion of each. That’s it, calculateChance() is done!
  home <- home/100
  away <- away/100
  draw <- draw/100
  print(paste0("Over 10,000 games home wins ", home, "% of games. Away wins ", away, "% of games. ", draw, "% of games end in a tie"))
}
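If you would rather get numbers back than a printed sentence, a sketch along the following lines returns the three percentages as a named vector. It assumes the winner is reported as "home", "away" or "draw" as described above. The names matchResult, simulateChance and nGames are hypothetical, and matchResult is a condensed stand-in for calculateWinner() so that the block runs on its own.

```r
# Condensed stand-in for calculateWinner(), so this block is self-contained
matchResult <- function(home, away) {
  homeGoals <- sum(runif(length(home)) <= home)
  awayGoals <- sum(runif(length(away)) <= away)
  if (homeGoals > awayGoals) "home" else if (awayGoals > homeGoals) "away" else "draw"
}

simulateChance <- function(team1, team2, nGames = 10000) {
  # Play nGames matches and collect the winner labels
  results <- replicate(nGames, matchResult(team1, team2))
  # Fixing the factor levels keeps a zero count for any outcome that never occurs
  counts <- table(factor(results, levels = c("home", "away", "draw")))
  # Convert counts to percentages and return a plain named numeric vector
  chances <- as.vector(counts) / nGames * 100
  names(chances) <- names(counts)
  chances
}

set.seed(2019)   # arbitrary seed, only so the run is repeatable
chances <- simulateChance(0.6, c(0.12, 0.12, 0.12, 0.12, 0.12))
print(chances)
```

Returning a vector instead of printing makes it easier to reuse the result, for example to compare several matches in one table.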
Running this with our above values of 0.6xG for the home team and 0.12 xG five times for the away team gives us the below output.
So yes, the 0-1 away win we got earlier was a fluke, and one big chance beats many small chances given the same total xG. See this great article from Total Football Analysis for more detail on why this is. Now that we have a working function, we can apply this to real life matches.
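For this particular example the simulation can be cross-checked by hand. The sketch below is not part of the original functions: it assumes the shots are independent and uses the fact that five identical 0.12 xG shots follow a binomial distribution, so base R's dbinom() gives the away goal probabilities directly. The variable names are illustrative.

```r
# Away goals from five identical 0.12 xG shots follow a binomial distribution
awayDist <- dbinom(0:5, size = 5, prob = 0.12)   # P(away scores 0, 1, ..., 5)

pHomeScores <- 0.6   # the single home chance

# Home can only win 1-0: their shot goes in and every away shot misses
homeWin <- pHomeScores * awayDist[1]

# Draws are 0-0 or 1-1
draw <- (1 - pHomeScores) * awayDist[1] + pHomeScores * awayDist[2]

# Every remaining outcome is an away win
awayWin <- 1 - homeWin - draw

print(round(c(home = homeWin, draw = draw, away = awayWin), 3))
```

Under these assumptions the home side wins about 31.7% of the time, the away side about 25.6%, and roughly 42.7% of games are drawn, so a run of 10,000 simulated games should land close to those figures.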
Who should have won the game?
For this, I’ll be looking at the 2019 Champions League final – sorry Tottenham fans. I’ve taken the data from FBref.com, a good resource for stats of this granularity. In this game, Tottenham had six chances worth a total of 1.1 xG, and Liverpool had three chances worth 1.2 xG, including a penalty worth 0.9 xG. As we know from above, one big chance beats many small ones. That said, let’s plug these numbers in and see what we get.
tottenhamxG <- c(0.1, 0.1, 0.1, 0.4, 0.3, 0.1)
liverpoolxG <- c(0.1, 0.9, 0.1)
calculateChance(tottenhamxG, liverpoolxG)
As expected, the one big chance has beaten the many small chances. But both teams had an xG of about one, and Tottenham failed to score while Liverpool scored two. Excluding their penalty, Liverpool scored one goal from an xG of 0.2. Let’s create one more function, to check the odds of a game having a given result.
How did they score those?
We’ll create the calculateScore() function, which is just our calculateWinner() function that outputs the final score as a vector rather than telling us who won.
calculateScore <- function(home, away) {
  # Set score to nil-nil for start of game
  homeGoals <- 0
  awayGoals <- 0

  # Need a function in our function
  # runs runif(1) test for goals in a list
  testShots <- function(shots) {
    # Start goal count at 0
    goal <- 0
    # If a shot goes in, add a goal
    for (shot in shots) {
      if (runif(1) <= shot) {
        goal <- goal + 1
      }
    }
    # Finally, return the number of goals
    return(goal)
  }

  # Run the above function for home and away lists
  homeGoals <- testShots(home)
  awayGoals <- testShots(away)
  score <- c(homeGoals, awayGoals)

  # Return the score
  return(score)
}
For checking the odds of a given score I’ll name this function checkScoreOdds() since I’m checking the odds of a score. Clever, right? Whereas our previous functions only required two inputs, we’ll take three here: our home team’s xG, our away team’s xG, and the score we want to check the chances of. We’ll need to keep track of how often our desired result comes up, so let’s call this correct and set it to zero.
checkScoreOdds <- function(homeXG, awayXG, scoreCheck) {
  correct <- 0
Now we just need to iterate through calculateScore() a few thousand times and see how many times our desired score came through. Divide this by 10,000 (the number of simulated games) and you have your odds of hitting that score.
  for (i in 1:10000) {
    score <- calculateScore(homeXG, awayXG)
    if (score[1] == scoreCheck[1] && score[2] == scoreCheck[2]) {
      correct <- correct + 1
    }
  }
  return(correct/10000)
}
Plugging in our Liverpool and Tottenham xG values, what were the odds this game finished 0-2 to Liverpool?
Liverpool managed to win the game with a result that could only be expected about 5% of the time given the chances in the game, all thanks to Divock Origi. If any bookies had offered odds on Origi scoring the winner in the Champions League final at the start of the season, 5% might have been generous.
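Because that figure comes from 10,000 random games, it will move a little from run to run. A rough sketch of how much, using the standard error of a simulated proportion and a normal approximation; the value 0.05 is the "about 5%" estimate above, and the variable names are assumptions for illustration.

```r
nGames <- 10000   # simulated games per run
pHat <- 0.05      # estimated chance of the 0-2 score, "about 5%"

# Standard error of a simulated proportion: sqrt(p * (1 - p) / n)
stdError <- sqrt(pHat * (1 - pHat) / nGames)

# Rough 95% range for the estimate, using the normal approximation
lower <- pHat - 1.96 * stdError
upper <- pHat + 1.96 * stdError

print(round(c(stdError = stdError, lower = lower, upper = upper), 4))
```

The standard error works out to roughly 0.2 percentage points, so two runs that differ by a few tenths of a percent are telling the same story.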
What about every other score?
We’ve seen that Liverpool winning 0-2 was only going to happen 4.6% of the time with these chances, but what about every other score? Since Tottenham had six chances and Liverpool had three, theoretically this game could have ended 6-3 (provided both keepers had the worst day of their careers). Let’s write one last function to see the odds of any combination of scores.
Our last function everyScore() is pretty short. All we need to do is create an empty dataframe and add to it one row at a time with the results of our calculateScore() function. We’ll run it 10,000 times as we did before.
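everyScore <- function(homeTeam, awayTeam) {
  # Start with an empty data frame holding the two score columns
  df <- data.frame(HomeScore = integer(), AwayScore = integer())
  for (i in 1:10000) {
    score <- calculateScore(homeTeam, awayTeam)
    # Bind a named one-row data frame so the column names are kept
    df <- rbind(df, data.frame(HomeScore = score[1], AwayScore = score[2]))
  }
  return(df)
}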
Pop in our xG data for each team and we get out a 10,000 row dataframe. This isn’t exactly the most readable format, so we can use the table() function to make it a bit nicer. We’ll also take a copy formatted as a dataframe to make it easier to play with further down the line.
thfcLFC <- everyScore(tottenhamxG, liverpoolxG)
thfcLFC <- table(thfcLFC)
thfcLFCdf <- as.data.frame(thfcLFC)
Making this into a table gives us a count for each result, which we can divide by 10,000 to get the proportion of games ending in that result (multiply by 100 for a percentage). We’ll also add a line to sort it in order of most to least likely result.
# Work on the data frame copy made above; a table object has no $Freq column
thfcLFCdf <- cbind(thfcLFCdf, Chance = thfcLFCdf$Freq/10000)
thfcLFCdf <- thfcLFCdf[order(-thfcLFCdf$Freq),]
So here we can see the breakdown of all results and the chance of them happening. Our actual result of 0-2 was only the 5th most likely, with 1-1 draws and 0-1 away wins taking up more than 50% of all outcomes. A 3-1 Tottenham win is almost as likely as what we actually saw. Interestingly, although Liverpool had fewer chances, they only failed to score in approximately 7% of our simulated games, whereas Tottenham failed to score more than a quarter of the time.
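The simulated breakdown can be cross-checked with an exact calculation. The sketch below is an alternative technique, not the method used in this post: it assumes every shot is independent, builds each team's exact goal distribution one shot at a time, and multiplies the two together. The helper name goalDistribution and the object scoreGrid are made up for this example.

```r
# Hypothetical helper: exact probability of 0, 1, ..., n goals from n shots
goalDistribution <- function(shots) {
  dist <- 1   # before any shot, zero goals is certain
  for (p in shots) {
    # A miss keeps the goal count; a goal shifts every count up by one
    dist <- c(dist * (1 - p), 0) + c(0, dist * p)
  }
  names(dist) <- 0:length(shots)
  dist
}

tottenhamxG <- c(0.1, 0.1, 0.1, 0.4, 0.3, 0.1)
liverpoolxG <- c(0.1, 0.9, 0.1)

# Shots are treated as independent, so each score's probability is a product
scoreGrid <- outer(goalDistribution(tottenhamxG), goalDistribution(liverpoolxG))

# Rows are Tottenham goals, columns are Liverpool goals
print(round(scoreGrid, 3))
print(scoreGrid["0", "2"])   # exact chance of the actual 0-2 result
```

Under these assumptions the exact chance of 0-2 is about 4.5%, in line with the simulated 4.6%, and 1-1 is the single most likely score.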
A heatmap made using ggplot2 shows pretty clearly that 1-1 is by far our leader, followed by one-goal swings either side. Anything beyond that, however, looks equally unlikely.
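The plotting code is not shown in this post, so the following is only a guess at how such a heatmap could be put together with ggplot2's geom_tile(). The helper simulateGoalCounts, the objects scores, heatData and heatmapPlot, the seed, and the axis labels are all assumptions. The simulation is condensed so the block runs on its own, and the plot is only built when ggplot2 is installed.

```r
set.seed(2019)   # arbitrary seed so the picture is reproducible

tottenhamxG <- c(0.1, 0.1, 0.1, 0.4, 0.3, 0.1)
liverpoolxG <- c(0.1, 0.9, 0.1)

# Hypothetical helper: goals scored in each of n simulated games
simulateGoalCounts <- function(shots, n) {
  replicate(n, sum(runif(length(shots)) <= shots))
}

scores <- data.frame(
  HomeScore = simulateGoalCounts(tottenhamxG, 10000),
  AwayScore = simulateGoalCounts(liverpoolxG, 10000)
)

# One row per scoreline, with its count and its share of all games
heatData <- as.data.frame(table(scores))
heatData$Chance <- heatData$Freq / nrow(scores)

# Build the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  heatmapPlot <- ggplot2::ggplot(
    heatData,
    ggplot2::aes(x = HomeScore, y = AwayScore, fill = Chance)
  ) +
    ggplot2::geom_tile() +
    ggplot2::labs(x = "Tottenham goals", y = "Liverpool goals", fill = "Share of games")
}
```

Printing heatmapPlot draws the chart; each tile is one scoreline and its colour reflects how often that score came up.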
So maybe Liverpool had all the luck on the day. Maybe strong defending by the outfield players kept the xG of each Tottenham chance down, and good goalkeeping meant the better opportunities were still a big ask. Maybe Divock Origi is the most clinical finisher in world football. Maybe my Liverpool bias is showing itself already.