Site icon R-bloggers

Could an Independent Yorkshire Win the World Cup – LASSOs and Player Positions

[This article was first published on rstats on Robert Hickman, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently, a Yorkshire national football team appeared in a league of national teams for stateless people. This got me wondering how the historic counties of the UK would do at the world cup. Could any of them compete with full international teams?

This is the complete script for an short article I wrote for CityMetric on the topic. It’s split over 6 separate parts and is pretty hefty but contains pretty much everything you need to clone the article. Last time, we got the shapefiles for the historic counties of the UK and scraped the player data we’ll use to build the teams

library(dplyr)
library(magrittr)
library(data.table)
library(ggplot2)
#we'll use glmnet to do LASSO regression to determine players positional ability
library(glmnet)

Work Out Player Position Ability

The data we’ve scraped only gives a player’s overall ‘ability’ and their abilities on specific skills (e.g. strength, long shots, dribbling…). We want to use this to work out how good each player is at each position.

It’s logical to assume that a player’s overall ability is how good they are at their main position (the position listed first on their page on fifaindex.com). We can therefore use LASSO regression to work out which stats are contributing to their overall ability score. For instance, we would expect that a goalkeepers overall ability score is just a function of gk_positioning, gk_diving, gk_handling and so on… and doesn’t care about (e.g.) dribbling.

This positional ability score is important as we can’t just select the 11 best players for each team as we might end up playing a goalkeeper and 10 defenders (or etc.). We need to make sure we select the best palyers for each position on a realistic formation.

Unfortunately, there’s going to be no real way to tell between a players ability to play on either side of the field. There’s some correlation with their footedness ($foot), but it’s not worth going too in the weeds about that. So first we want to make all position symmetrical (i.e. we do discriminate between left and right sided positions).

all_players_data %<>% mutate(symmetric_position = gsub("L|R", "W", main_position))

unique(as.character(all_players_data$main_position))
##  [1] "CF"  "LW"  "ST"  "GK"  "CAM" "CM"  "CB"  "CDM" "RW"  "LB"  "RM" 
## [12] "LM"  "RB"  "LWB" "RWB"
unique(all_players_data$symmetric_position)
##  [1] "CF"  "WW"  "ST"  "GK"  "CAM" "CM"  "CB"  "CDM" "WB"  "WM"  "WWB"

First we need to convert the stats that we’re going to use for this regression/prediction into matrix form

player_stats <- all_players_data %>%
  #select only the players stats for each skill
  select(c(12, 14:47)) %>%
  #transition data into matrix
  model.matrix(overall~., .)

#show the first 6 instances of the first 5 stats
head(player_stats[,c(1:5)])
##   (Intercept) ball_control dribbling marking slide_tackle
## 1           1           96        97      13           26
## 2           1           93        91      22           23
## 3           1           95        96      21           33
## 4           1           91        86      30           38
## 5           1           48        30      10           11
## 6           1           42        18      13           13

To show how this works I’m going to illustrate it using goalkeepers and predicting the ability of each outfield player to play in goal.

#filter out only the goalkeepers
goalkeepers <- all_players_data %>%
  filter(symmetric_position == "GK")

#select a percentage of these to use as training data
sample_percent <- 10
train_samples <- sample(1:nrow(goalkeepers), (nrow(goalkeepers)/100)*sample_percent)

gk_stats <- goalkeepers %>%
  slice(train_samples) %>%
  select(c(12, 14:47))

#get the stats per skill
train_matrix <- model.matrix(overall~., gk_stats)
#and the overall (gk) ability
train_ability <- gk_stats$overall
  
#perform the regression
cv_model <- cv.glmnet(train_matrix, train_ability)

#look at the weightings given to important variables using lambda value that gives minimum mean cross-validated error
coef(cv_model, s = "lambda.min")
## 36 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)    1.48054668
## (Intercept)    .         
## ball_control   .         
## dribbling      .         
## marking        .         
## slide_tackle   .         
## stand_tackle   .         
## aggression     .         
## reactions      0.11010673
## positioning    .         
## interceptions  .         
## vision         .         
## composure      .         
## crossing       .         
## short_pass     .         
## long_pass      .         
## acceleration   .         
## stamina        .         
## strength       .         
## balance        .         
## sprint_speed   .         
## agility        .         
## jumping        .         
## heading        .         
## shot_power     .         
## finishing      .         
## long_shots     .         
## curve          .         
## free_kicks     .         
## penalties      .         
## volleys        .         
## gk_positioning 0.20822323
## gk_diving      0.20474961
## gk_handling    0.20874770
## gk_kicking     0.05344379
## gk_reflexes    0.20770553

The regression selects only variables which have a strong relationship with the outcome (overall ability in the gk position in this case). As expected, it selects only the gk_… skillset and also a players reactions, which makes sense if goalkeepers have to make point blank saves.

We can validate this by predicting the overall ability of the goalkeepers that aren’t in the training set fairly simply

#find non training examples of goalkeepers
test_matrix <- goalkeepers %>%
  slice(-train_samples) %>%
  select(c(12, 14:47)) %>%
  model.matrix(overall~., .)

#get the overall ability for these players
gk_abilities <- goalkeepers %>%
  slice(-train_samples) %>%
  select(overall)

#predict their overall ability based on their stats
gk_abilities$predicted <- as.vector(predict(cv_model, newx = test_matrix, s = "lambda.min", type="response"))

#plot these
p <- ggplot(data = gk_abilities, aes(x = overall, y = predicted)) +
  geom_point() +
  xlab("Overall FIFA Ability") +
  ylab("Predicted FIFA Ability") +
  ggtitle("Actual and Predicted Goalkeeping Ability of Goalkeepers in FIFA18")

plot(p)

Which gives a very good fit. This is expected for the dataset we’re using as the overall ability hasbeen directly calclulated from the complete set of skills using some hidden algorithm. In the real world, the actual vs. predicted results would most likely have more noise.

We can use this model now to predict how well outfield players would fare in goal, given that they have (low) ratings for all of these skills

#find all outfield players and convert stats the matrix
outfield_players <- all_players_data %>%
  filter(symmetric_position != "GK") %>%
  select(c(12, 14:47)) %>%
  model.matrix(overall~., .)

#get the names of each outfield palyer
outfield_goalkeepers <- all_players_data %>%
  filter(symmetric_position != "GK") %>%
  select(name)

#predict how well each outfield player would do in goal
outfield_goalkeepers$predicted_ability <- as.vector(predict(cv_model, newx = outfield_players, s = "lambda.min", type="response"))

head(outfield_goalkeepers)
##                 name predicted_ability
## 1       Lionel Messi          20.84383
## 2  Cristiano Ronaldo          21.78181
## 3             Neymar          21.10118
## 4        Luis Suárez          38.68063
## 5    Kevin De Bruyne          22.22474
## 6 Robert Lewandowski          20.20816

It is of course, extremely ironic that Luis Suarez scores relatively highly as an outfield player in goal given his history.

We can use this technique to predict how each player would play in each position using the following function

#function to predict how each player would play in each position
get_position_weights <- function(position, sample_percent) {
  #filter data
  position_df <- all_players_data %>%
    filter(symmetric_position == position)
  
  #get training data
  train_samples <- sample(1:nrow(position_df), (nrow(position_df)/100)*sample_percent)
  
  train_stats <- position_df %>%
    .[train_samples,] %>%
    select(c(12, 14:47))
  
  train_matrix <- model.matrix(overall~., train_stats)
  
  train_ability <- train_stats$overall
  
  #use LASSO regression to find weighting of significant covariates
  cv_model <- cv.glmnet(train_matrix, train_ability)
  
  #predict players ability in that position
  position_ability <- predict(cv_model, newx = player_stats, s = "lambda.min", type="response")
}

#run through every mirrored position
#using a high percentage of palyers in training set (50%) as in theory should be perfect regression
position_abilities <- lapply(unique(all_players_data$symmetric_position), 
                             get_position_weights, sample_percent = 50) %>%
  do.call(cbind, .) %>%
  data.frame()

#name each position
names(position_abilities) <- unique(all_players_data$symmetric_position)

#bind this to the data we have on all the players
all_players_data <- cbind(all_players_data, position_abilities) %>%
  #convert all non-natural goalkeepers goalkeeping ability to zero
  #want to make sure no non-goalkeepers are chosen in goal
  mutate(GK = ifelse(main_position == "GK", overall, 0)) 

#show the first 6 rows
head(select(all_players_data, c(1, 49:60)))
##                name symmetric_position       CF       WW       ST GK
## 1      Lionel Messi                 CF 93.60706 92.27461 89.70278  0
## 2 Cristiano Ronaldo                 WW 92.88560 91.16822 92.62438  0
## 3            Neymar                 WW 89.48736 89.30329 85.17042  0
## 4       Luis Suárez                 ST 90.48249 87.40918 88.93461  0
## 5      Manuel Neuer                 GK 43.64828 39.22244 36.55564 92
## 6            De Gea                 GK 41.70048 36.48414 34.36504 91
##        CAM       CM       CB      CDM       WB       WM      WWB
## 1 93.36468 85.23987 46.02382 60.18700 58.22984 91.18465 62.26366
## 2 90.39084 83.31839 53.60718 62.65345 62.21724 90.32306 66.13034
## 3 89.01684 80.80476 47.23479 60.16824 60.11565 88.27848 64.19203
## 4 87.59322 81.40708 59.77832 66.74543 65.13876 86.44133 67.67849
## 5 47.58404 49.47508 34.07456 42.88950 35.23432 44.42806 35.91158
## 6 44.18874 46.37230 34.44230 42.30995 36.35090 41.15741 36.12737

Now we have the ability of each player to perform in any position we can use it to build teams. First however, we need to sort British players into the county of their birth, which we’ll do in the next post.

To leave a comment for the author, please follow the link and comment on their blog: rstats on Robert Hickman.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.