Site icon R-bloggers

The Guardian Knowledge June 2019

[This article was first published on rstats on Robert Hickman, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Most Wednesday’s I enjoy reading The Knowledge blog on the Guardian’s website and reading the football trivia therein. When time (and questions) allow, I like to answer some of the questions posed, example of which are here, here, and here.

League of Nations

The first question comes from

Which player had the nationality with the lowest FIFA World Ranking at the time of him winning the Premier League?

— The Tin Boonie (@TheTinBoonie) June 18, 2019

a similar question is also answered in this weeks column:

“Fulham defender Zesh Rehman made his debut for Pakistan, who are ranked 168 by Fifa. Is that the lowest-ranked country a Premier League player has played for?” wondered Zulfiqar Shah in January 2006.’

Answers

The latter question is more clear cut- noone has beaten the record set by Zesh Rehman in December 205 when he made his debut for Pakistan. Using ELO ratings which are more consistent over time, Pakistan were ranked as teh 197th best team in the world at that time. The nearest someone has gotten to this is Neil Etheridge in 2018 when playing for 165th ranked Philippines.

For the first question, it depends how you take the rankings and if you require players to have appeared a certain number of time in the season, but Chistopher Wreh who represented 110th ranked Liberia whilst plaing for Arsenal in the 1997/1998 season is probably the winner with Jonny Evans (Northern Ireland and Manchester United, 2012/2013) and Igors Stepanovs (Latvia and Arsenal, 2001/2002) close behind.

Working

First some libraries we’ll need, and also set a seed for reproducibility.

library(tidyverse)
library(magrittr)
library(rvest)

set.seed(3459)

The first thing we need to answer these is the nationality of EPL players. For this two good sources are transfermarkt.co.uk and 11v11.com. I’m going to opt for the later, just because the tables are a little easier to scrape.

To start, we need to get the links to every team to have competed in the premier league

# get years of EPL seasons
years <- 1993:2019
# base url we'll scrape from
base_url <- "https://www.11v11.com"

# cat together
tables <- paste0(base_url, "/league-tables/premier-league/01-june-", years)

competing_teams <- tables %>%
  # get a list of the links to every teams squad page
  map(., function(x) {
    x %>%
      read_html() %>%
      html_nodes("#table-league > tbody:nth-child(2) > tr > td:nth-child(2) > a:nth-child(1)") %>%
      html_attr("href") %>%
      # paste into working link for year and competition (EPL)
      paste0(base_url, ., "tab/players/season/", gsub(".*01-june-", "", x), "/comp/1/")
  }) %>%
  unlist()

head(competing_teams, n = 5)

We can then scrape the squads for these teams in that specific season. We want the players, their nationality and also the number of appearances they made in the league that season

squads <- competing_teams %>%
  # get the players/appearances/nationalities
  map_df(., function(y) {
    # read once to save server calls
    read <- y %>%
      read_html() 
    
    # get the squad info
    squad <- read %>%
      html_nodes(".squad") %>%
      html_table(fill = TRUE) %>%
      as.data.frame() %>%
      # get rid of rows without player info
      filter(!is.na(Player))
    
    # get the listed nationalities
    flags <- read %>%
      html_nodes(".squad > tbody:nth-child(2) > tr > td:nth-child(3)")
    
    # from here get the actual nationalities per player
    nations <- flags %>%
      html_nodes("img") %>%
      html_attr("title")
    
    # these might mismatch in length
    # in which case append NA
    if(length(flags) != length(nations)) {
      missing <- which(!grepl("img", flags))
      
      nations <- c(
        nations[1:(missing-1)],
        NA,
        nations[missing:length(nations)]
      )
    }
    
    # mutate nationality and team and season
    squad %>%
      mutate(
        nation = nations,
        year = gsub(".*season\\/", "", gsub("\\/comp.*", "", y)),
        team = gsub("\\/tab\\/players.*", "", gsub(".*teams\\/", "", y))
      ) %>%
      # select useful appearance information
      select(player = Player, position = Position,
             appearances = A, sub_appearances = S, 
             nation, year, team)
  }) %>%
  # manually add in some missing nationalities
  mutate(nation = case_when(
    grepl("Steffen Karl", player) ~ "Germany",
    grepl("Marc Muniesa", player) ~ "Spain",
    grepl("Oriol Romeu", player) ~ "Spain",
    grepl("Aleix García", player) ~ "Spain",
    grepl("Martín Montoya", player) ~ "Spain",
    TRUE ~ nation
  ))

head(squads, n = 10)

To answer the question as asked, we’d want historical FIFA ranking data. While it does exist going back to 2007, I’d prefer to have the full data set back to 1993, and in any case, there are also some problems with the historical calculation FIFA used for it’s ratings.

Instead, we can use the ELO method of rating teams (most commonly used to rank chess players). There are two ways to do this- we can calculate the ratings ourselves using a dataframe of international results, or we can take the accepted ratings at eloratings.net.

I’ll outline the first method here and then use the data from the second further below here.

Calculating ELO ratings

To calculate our ratings, first we need to to load up a dataframe of international football results. The one I’m using comes from kaggle and has 40k matches listed since the start of international football in 1872:

# https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017
international_results <- readRDS("../../static/files/international_results.rds")

head(international_results)
##         date home_team away_team home_score away_score tournament    city
## 1 1872-11-30  Scotland   England          0          0   Friendly Glasgow
## 2 1873-03-08   England  Scotland          4          2   Friendly  London
## 3 1874-03-07  Scotland   England          2          1   Friendly Glasgow
## 4 1875-03-06   England  Scotland          2          2   Friendly  London
## 5 1876-03-04  Scotland   England          3          0   Friendly Glasgow
## 6 1876-03-25  Scotland     Wales          4          0   Friendly Glasgow
##    country neutral
## 1 Scotland   FALSE
## 2  England   FALSE
## 3 Scotland   FALSE
## 4  England   FALSE
## 5 Scotland   FALSE
## 6 Scotland   FALSE

We only need a few of these variables- enough to know who wins each match and when/where it was played. We can then use this data to initialise several parameters to be used in our ELO calculation. For team i (in a match of teams i and j), this is calculated as:

\[ rating_{i_{t}} = rating_{i_{t-1}} + K \cdot G \cdot (R – E(R)) \] The rating of team i is their old rating plus the difference between the actual result (R = 1 for a win, 0.5 for a draw, 0 for a loss) and the expected result (where 1 means certain win for team i).

The unexpectedness of the result is then multiplied by two parameters. The first K, is to account for the importance of the match, with more important matches having a higher K factor, and a greater influence of team rating.

\[\begin{equation} K = \begin{cases} 60 & \text{if World Cup Final} \\ 50 & \text{if World Cup/ Major Intercontinental Matches} \\ 40 & \text{if World Cup/Continental Competition Qualifiers} \\ 30 & \text{if Other Tournaments}\\ 20 & \text{if Friendly} \\ \end{cases} \end{equation}\]

The second parameter, G is controlled by the strength of the result

\[\begin{equation} G = \begin{cases} 1 & \text{if } N < 2 \\ 1.5 & \text{if } N = 2 \\ 1.75 & \text{if } N = 3 \\ 1.75 + \frac{N-3}{8} & \text{if N > 3} \\ \end{cases} \end{equation}\]

where N is the goal difference:

\[ N = Goals_{i} – Goals_{j} \]

The expected result is calculated based on the rankings of both teams going into the match

\[ E(result) = \frac{1}{10 ^ \frac{-dr_{i,j}}{400} + 1} \] where the difference in rankings (dr) is calculated as

\[\begin{equation} dr_{i,j} = \begin{cases} rating_{i, t-1} + 100 – rating_{j, t-1} & \text{if i at home} \\ rating_{i, t-1} – 100 – rating_{j, t-1} & \text{if j at home} \\ rating_{i, t-1} – rating_{j, t-1} & \text{if neutral} \\ \end{cases} \end{equation}\]

We then add K, G and R to each row of the data frame to make our calculations easier down the line. Unfortunately, my dataset doesn’t give the context of each game, so I’ve set K to 40 for every match. In theory this shouldn’t make a difference, but will affect the ratings of teams who over/under perform in big matches.

international_results %<>%
  # select relevant columns
  select(
    date,
    home = home_team, away = away_team,
    hgoal = home_score, agoal = away_score,
    neutral
  ) %>%
  # convert date to date format
  mutate(date = as.Date(date)) %>%
  # K = match importance
  # don't have competition data in this dataset so just set to 40
  mutate(K =  40) %>%
  # G = goal difference factor
  # takes into account how much a team is beaten by
  mutate(G = case_when(
    abs(hgoal-agoal) < 2 ~ 1,
    abs(hgoal-agoal) < 3 ~ 1.5,
    abs(hgoal-agoal) >= 3 ~ 1.75 + (abs(hgoal-agoal)-3)/8
  )) %>%
  # results = 1 for win and 0.5 for a draw
  mutate(result = case_when(
    hgoal > agoal ~ 1,
    hgoal < agoal ~ 0,
    hgoal == agoal ~ 0.5
  )) %>%
  # arrange by date so ELO can be updated sequentially
  arrange(date)

We still need to initialise our rating (R) for each team, which for simplicity I’ve set to 1200 to start with. That is, every team starts with the same rating and will gradually tend towards their ‘natural’ rating. Given There’s probably at least 50 years of data for most teams before the Premier League begins in 1992, hopefully it should be enough for this to level out.

team_ratings <- international_results %>%
  # select date and teams
  select(date, home, away) %>%
  # melt
  gather(., "location", "nation", home, away) %>%
  select(-location) %>%
  arrange(date) %>%
  # set out unique teams with a rating of 1200
  filter(!duplicated(nation)) %>%
  mutate(rating = 1200) %>%
  select(-date)

Therefore, using the very first international fixture between England and Scotland in 1872 we have parameters of

head(international_results, n = 1)
##         date     home    away hgoal agoal neutral  K G result
## 1 1872-11-30 Scotland England     0     0   FALSE 40 1    0.5

ratings(t-1) = 1200 for both England and Scotland K = 40 G = 1 for a draw R = 0.5 for a draw

the equal ratings, but home location for Scotland mean that for England:

\[ dr_{i,j} = 1200 + 100 – 1200 = 100 \] and so an expected result (1- the expected home result)

\[ E(R) = 1 – \frac{1}{10 ^ {100/400} + 1} = 1 – \frac{1}{2.78} = 0.36\] and so England will get a post match rating of

\[ rating_{j} = 1200 + (40 \cdot 1 \cdot (0.5 – 0.36)) = 1207.2 \]

calc_ELO <- function(date, home, away, K, G, result) {
  #get the difference in ratings
  hr <- team_ratings$rating[which(team_ratings$nation == home)]
  vr <- team_ratings$rating[which(team_ratings$nation == away)]
  dr <- vr - (hr + 100)
  
  # calculate expected results
  e_result <- 1/ ((10^(dr/400))+1)
  
  # calculate new ratings
  new_hr <- hr + ((K*G) * (result - e_result))
  new_vr <- vr + ((K*G) * ((1-result) - (1-e_result)))
  
  # pipe these back into a df of team ratings to sample from
  team_ratings$rating[which(team_ratings$nation == home)] <<- new_hr
  team_ratings$rating[which(team_ratings$nation == away)] <<- new_vr
  
  # return new ratings
  return(list(h_rating = new_hr, v_rating = new_vr))
}

which can be applied to the dataframe of match information using pmap_df from the purrr package, which allows for some pleasing conciseness. It allows for the speed of applying a function, without needing to split the data frame by row and pass into lapply and rebind together.

elo_data <- international_results %>%
  # select relevant variable
  # keep date so we know a teams ELO at specific date
  select(date, home, away, K, G, result) %>%
  bind_cols(pmap_df(., calc_ELO)) %>%
  # get rid of ELO parameters
  select(date, home, away, h_rating, v_rating) %>%
  # gather twice to get a long df of teams ratings after matches
  gather("location", "nation", -date, -h_rating, -v_rating) %>%
  gather("rating", "value", -date, -location, -nation) %>%
  # filter for home rating for teams at home and vice versa
  filter((location == "home" & rating == "h_rating") |
           (location == "away" & rating == "v_rating")) %>%
  select(date, nation, rating = value) %>%
  # we only care about ratings from August 1992
  filter(date > "1992-07-31")

We’re left with a dataframe of 46992 observations for 3 variables: date, nation and the ranking of the nation at that time. We can plot a random selection of 5 teams just to sanity check and see that teams we know have historically been stronger (e.g. Argentina) show consistently higher rankings than weaker nations (e.g. Greece).

# randomly choose 5 teams with max ELOs > 1600
teams <- elo_data %>%
  filter(rating > 1600) %>%
  .$nation %>%
  unique() %>%
  .[sample(length(.), 5)]

# plot rating over time for these 5 teams
p1 <- elo_data %>%
  filter(nation %in% teams) %>%
  ggplot(aes(x = date, y = rating, colour = nation, group = nation)) +
  geom_point() +
  geom_line() +
  # colour by football shirt colour
  scale_colour_manual(values = c("skyblue", 
                                 "darkblue",
                                 "darkorange",
                                 "darkgreen",
                                 "red")) +
  labs(title = "ELO Ratings of Select Countries over Time",
       subtitle = "ratings calculated using homebrew script",
       x = "date",
       y = "rating") +
  theme_minimal()

plot(p1)

I’m pretty happy with how the code works. With a more complete dataset of matches, and also the time to properly filter countries in and out as they are formed/dissolved, I think it would make a pretty viable answer, however, I wanted to be as accurate as possible, and I can’t compete with the official-unofficial ratings of eloratings.net.

Dynamic Scraping

When scraping data for blog posts, I typically rely on rvest and it’s read_html(url) function. However, while this works for the static websites which make up the vast majority of sites containing tables of data, it struggles with websites that use JavaScript to dynamically generate pages.

Eloratings.net is one such website which rvest is unable to scrape. E.g.

# url to data on Brazil's ELO rating over time
url <- "https://eloratings.net/Brazil"

read <- read_html(url) %>%
  # this is the CSS selector for the page title
  html_nodes("#mainheader")

read
## {xml_nodeset (1)}
## [1] <h1 id="mainheader" class="mainheader"></h1>

does not manage to capture the data displayed in the page mainheader (it ‘should’ return “World Football Elo Ratings: Brazil” from the title of that page).

Instead, what we want to do is save a copy of the generated page as a .html file and then read that into R using read_html(). Luckily, a way exists to do just that, using the (now deprecated, but still working) PhantomJS headless browser. Much of the code I used to get going with this is adapted from a tutorial here.

First you want to install PhantomJS from the above website and run through it’s quick start guide. This is a pretty thorough guide, I would say that there are really only three steps from installation to getting going:

  1. Add phantomjs to the system PATH
  2. Open a text editor and save one of the tutorial scripts as filename.js
  3. run > phantomjs C:/Users/usr/path/to/file.js in a command line console

The file we’re going to use to render the js pages and then save the html is below:

// scrapes a given url (for eloratings.net)

// create a webpage object
var page = require('webpage').create(),
  system = require('system')

// the url for each country provided as an argument
country= system.args[1];

// include the File System module for writing to files
var fs = require('fs');

// specify source and path to output file
// we'll just overwirte iteratively to a page in the same directory
var path = 'elopage.html'

page.open(country, function (status) {
  var content = page.content;
  fs.write(path,content,'w')
  phantom.exit();
});

(which, again, is stolen and adapted from here)

This is saved as scrape_ELO.js in the static directory of my blog folder.

To keep everything in R, we can use the system() family of functions, which provides access to the OS command line. Though the referenced tutorial uses system(), it relies on scraping a single referenced page. To iteratively scrape every country, we’ll need to provide an argument (country) which will contain the link to the page on eloratings.net for that country.

E.g. for Brazil we will provide “https://www.eloratings.net/Brazil” as the country argument

phantom_dir <- "C:/Users/path/to/scrape_ELO/"
country_url <- "https://www.eloratings.net/Brazil"

# use system2 to invoke phantomjs via it's executable
system2("C:/Users/path/to/phantomjs-2.1.1-windows/bin/phantomjs.exe",
        #provide the path to the scraping script and the country url as argument
        args = c(file.path(phantom_dir, "scrape_ELO.js"), country_url))

We can then read in this saved html page using rvest as per usual and recover the information therein.

# read in the saved html file
page <- read_html("elopage.html")

# scrape with rvest as normal
country_name <- page %>%
  html_nodes("#mainheader") %>%
  html_text() %>%
  gsub("Elo Ratings: ", "", .)

country_name
## [1] "Brazil"

I’m not going to include my full script for scraping eloratings.net as usually a reason for doing this obscuring of the data is to prevent exactly what I’m doing. Instead I’ll give a skeleton function of the one I use. If you are having problems with setting up phantomjs to scrape pages, my contact details are listed on my blog homepage

scrape_nation <- function(country) {
  # download the page
  url <- paste0("https://eloratings.net/", country)
  system2("C:/Users/path/to/phantomjs-2.1.1-windows/bin/phantomjs.exe", 
          args = c(file.path(phantom_dir, "scrape_ELO.js"), url))
  
  # read in downloaded page
  page <- read_html("elopage.html")
  
  # recover information
  country_name <- page %>%
    html_nodes("#mainheader") %>%
    html_text() %>%
    gsub("Elo Ratings: ", "", .)
  
  opposing <- page %>%
      html_nodes(".r1 a") %>%
      html_text()
  
  teams <- page %>%
      html_nodes(".r1")
  
  fixtures <- map2_df(teams, opposing, split_teams)

  ratings <- page %>%
    html_nodes(".r4") %>%
    html_text() %>%
    map_df(., split_ratings)
  
  rankings <- page %>%
    html_nodes(".r6") %>%
    map_df(., split_rankings)

  dates <- page %>%
    html_nodes(".r0") %>%
    html_text() %>%
    map_df(., convert_date)

  # bind into a data frame
  df <- fixtures %>%
    cbind(., ratings) %>%
    cbind(., rankings) %>%
    cbind(., dates) %>%
    mutate(table_country = country_name)
}

elO_data <- map_df(country_links, scrape_nation)

Finally, we want to convert this to long format. We have two observations per country and any point in time- the rating, and the ranking. I’m going to filter out just the ranking, as that’s what the questions ask, but if anything there’s possibly more information in the rating data.

elo_data %<>%
  # rename and select variables
  select(
    date,
    home, away,
    rating_home = r1, rating_away = r2,
    ranking_home = ranking1, ranking_away = ranking2
  ) %>%
  # melt twice to convert to long format
  gather(
    "location", "nation",
    -rating_home, -rating_away, -ranking_home, -ranking_away, -date
  ) %>%
  gather("measure", "value", -nation, -date, -location) %>%
  # take only relevant information
  filter(
    (location == "home" & measure %in% c("rating_home", "ranking_home")) |
      (location == "away" & measure %in% c("rating_away", "ranking_away"))
  ) %>%
  separate(measure, into = c("measure", "location"), "_") %>%
  # filter out relevant data
  filter(!duplicated(.)) %>%
  filter(date > "1992-01-01") %>%
  filter(measure == "ranking") %>%
  select(-measure, ranking = value, -location)

Answering the Question

So now we have a database of Premier League players’ nationalities, and also of the ELO rankings of countries since 1992, we can answer the original questions.

First we need to make sure that the data can join to each other, which means making sure that the nation names are common between the two data sets.

# get unique names for countries in both data sets
squad_teams <- unique(squads$nation)
rating_teams <- unique(elo_data$nation)

# find non joining country names
squad_teams[!squad_teams %in% rating_teams]
##  [1] "Ireland Republic"       "Trinidad and Tobago"   
##  [3] "Czech Republic"         "Macedonia FYR"         
##  [5] "St. Kitts and Nevis"    "Bosnia and Herzegovina"
##  [7] "Congo DR"               "Antigua and Barbuda"   
##  [9] "Korea Republic"         "Curacao"               
## [11] "Cape Verde Islands"     "Equatorial Guinea"

Some are missing where country names follow different convention- e.g. the Democratic Republic of Congo is named DR Congo in one, and Congo DR in the other. We can quickly convert these odd countries and join the two data sets together using dplyr. Then we can get an idea of the national rankings of the nationality of Premier League players since it’s inception

# rename mismatching nations
elo_data %<>%
  mutate(nation = case_when(
    grepl("^Ireland", nation) ~ "Ireland Republic",
    grepl("^Czechia", nation) ~ "Czech Republic",
    grepl("Trinidad/Tobago", nation) ~ "Trinidad and Tobago",
    grepl("Macedonia", nation) ~ "Macedonia FYR",
    grepl("St Kitts and Nevis", nation) ~ "St. Kitts and Nevis",
    grepl("Bosnia/Herzeg", nation) ~ "Bosnia and Herzegovina",
    grepl("DR Congo", nation) ~ "Congo DR",
    grepl("Antigua/Barbuda", nation) ~ "Antigua and Barbuda",
    grepl("South Korea", nation) ~ "Korea Republic",
    grepl("Curaçao", nation) ~ "Curacao",
    grepl("Cape Verde", nation) ~ "Cape Verde Islands",
    grepl("Equat Guinea", nation) ~ "Equatorial Guinea",
    TRUE ~ nation
  ))
players_national_elo <- squads %>%
  # convert dates
  mutate(year = as.numeric(year)) %>%
  # join elo rating data
  left_join(., elo_data, by = "nation") %>%
  # take only relevant data per season
  filter(date < as.Date(paste0(year, "-06-30")) &
           date > as.Date(paste0(year-1, "-07-01"))) %>%
  # rename for concise printing
  rename(apps = appearances, sub_apps = sub_appearances)

# histogram of players national team lowest ratings
p2 <- players_national_elo %>%
  arrange(ranking) %>%
  # take only lowest ranked observations
  filter(!duplicated(paste(player, team, year), fromLast = TRUE)) %>%
  # group by decade
  mutate(decade = case_when(
    year < 2000 ~ "1990",
    year < 2010 ~ "2000",
    year < 2020 ~ "2010"
  )) %>%
  ggplot(aes(ranking)) +
  geom_histogram() +
  labs(title = "Distribution of EPL Player's Nation's Ranking",
       subtitle = "(taking lowest point of ranking data)",
       x = "lowest national team ELO rating",
       y = "player count") +
  theme_minimal() +
  facet_wrap(~decade)

plot(p2)

It surprised me that the distribution hasn’t obviously changed over the year. There’s maybe a few more players from ‘mid-level’ nations (ranking 30-50) this decade but I’d doubt it’s significantly more. The majority of players come from nations in the top 20 worldwide consistently since the Premier League’s inception.

We can easily then take the players with the worst national team ranking by arranging by the ranking of national teams (and removing duplicate players so it isn’t just filled with the same 2/3 names)

worst_national_team_players <- players_national_elo %>%
  # arrange rating from low to high
  arrange(-ranking) %>%
  # remove duplicated players
  filter(!duplicated(player)) %>%
  # select only relevant info
  select(year, player, team, apps, sub_apps, nation, ranking)

# show the 25 players with the worst ranking their nation had during that time
head(worst_national_team_players, n = 25)
##    year             player                    team apps sub_apps
## 1  2015     Brandon Comley     queens-park-rangers    0        1
## 2  2004        Zesh Rehman                  fulham    0        1
## 3  2012     George Elokobi wolverhampton-wanderers    3        6
## 4  2014     Leandro Bacuna             aston-villa   28        7
## 5  2013      Kemy Agustien            swansea-city    4       14
## 6  2015        Kenji Gorré            swansea-city    0        1
## 7  1998 Danny Higginbotham       manchester-united    0        1
## 8  1999          Carl Cort               wimbledon    6       10
## 9  2007 Mikele Leigertwood        sheffield-united   16        3
## 10 2007     Moses Ashikodi                 watford    0        2
## 11 2016       Cuco Martina             southampton   11        4
## 12 2003         Neil Danns        blackburn-rovers    1        1
## 13 2005  Dexter Blackstock             southampton    8        1
## 14 2019     Neil Etheridge            cardiff-city   38        0
## 15 2013     Emmerson Boyce          wigan-athletic   36        0
## 16 1997          Mart Poom            derby-county    4        0
## 17 2007     Matthew Briggs                  fulham    0        1
## 18 2018        Nahki Wells                 burnley    0        9
## 19 2012      Jason Roberts        blackburn-rovers    5        5
## 20 1998 Sagi Burton-Godwin          crystal-palace    1        1
## 21 2010     Gunnar Nielsen         manchester-city    0        1
## 22 2009          Leon Cort              stoke-city    9        2
## 23 2000        Adam Newton         west-ham-united    0        2
## 24 2004       Delroy Facey        bolton-wanderers    0        1
## 25 2013    Gaël Bigirimana        newcastle-united    3       10
##                 nation ranking
## 1           Montserrat     227
## 2             Pakistan     204
## 3              Somalia     199
## 4              Curacao     188
## 5              Curacao     186
## 6              Curacao     186
## 7            Gibraltar     181
## 8               Guyana     179
## 9  Antigua and Barbuda     179
## 10 Antigua and Barbuda     179
## 11             Curacao     179
## 12              Guyana     175
## 13 Antigua and Barbuda     174
## 14         Philippines     174
## 15            Barbados     173
## 16             Estonia     169
## 17              Guyana     169
## 18             Bermuda     167
## 19             Grenada     166
## 20 St. Kitts and Nevis     160
## 21       Faroe Islands     160
## 22              Guyana     153
## 23 St. Kitts and Nevis     147
## 24             Grenada     145
## 25             Burundi     143

So Zesh Rehman as the question supposes, is one of the players from extremely low ranked nations. However, he is beaten by Brandon Comley at QPR who represents Montserrat internationally, who fell as low as 227 in the ELO world rankings. The 204 for Zesh Rehman is different to the 168 listed in the question for three reasons:

  1. As earlier discussed we are using ELO rankings, not the official FIFA rankings (and are probably right to)
  2. This lists the lowest that nation fell in the time that player was in the Premier League, not just at the time Zesh Rehman (or any other player) made his debut for Pakistan (/other nation)

Indeed, Brandon Comley did not start to play for Montserrat until September 2018, by which time he was playing for Colchester in League Two.

What is interesting though, is that there are 19 instances of players of nationalities ranked lower than 164- many more than I would have imagined.

To truly answer the question, that is, to only count rankings in games Premier League players played in, we can scrape international match data from national_football-teams.com. 11v11.com which I’ve used thus far does list international appearances, but tends to have thinner data (and vice versa for national-football-teams.com with regards to Premier League squad data).

First, we’ll manually add the links to the profiles of players who are contenders for having the lowest ranked appearance (the lowest 20 in the above dataframe), then we can use this to scrape the international matches they’ve played in. This is then matched by data to the ELO ranking data to find the national ranking of the country they represented. Finally, we take the lowest ranked appearance for each player and see if any can beat Zesh Rehman.

head(players_intl_links)
##           player
## 1 Brandon Comley
## 2    Zesh Rehman
## 3 George Elokobi
## 4 Leandro Bacuna
## 5  Kemy Agustien
## 6    Kenji Gorré
##                                                                        url
## 1 https://www.national-football-teams.com/player/71845/Brandon_Comley.html
## 2    https://www.national-football-teams.com/player/12929/Zesh_Rehman.html
## 3                                                                     <NA>
## 4 https://www.national-football-teams.com/player/63672/Leandro_Bacuna.html
## 5  https://www.national-football-teams.com/player/59413/Kemy_Agustien.html
## 6    https://www.national-football-teams.com/player/74671/Kenji_Gorre.html
get_international_matches <- function(player, url) {
  if(!is.na(url)) {
    df <- url %>%
      read_html() %>%
      html_nodes("#games > table") %>%
      html_table(fill = TRUE) %>%
      as.data.frame() %>%
      .[c(1:2, 5:7)] %>%
      filter(Date != "") %>%
      mutate(result = gsub("\n.*", "", Result), player = player, match_date = as.Date(Date)) %>%
      select(player, match_date, home = Home.Team, away = Away.Team.1, result, event = Event)
  }
}

low_ranked_appearances <- players_intl_links %>%
  pmap_df(., get_international_matches) %>%
  left_join(., players_national_elo, by = "player") %>%
  filter(match_date == date) 

lowest_ranked_appearances <- low_ranked_appearances %>%
  arrange(-ranking) %>%
  filter(!duplicated(player, fromLast = TRUE)) %>%
  mutate(match = paste(home, "vs", away)) %>%
  select(date, player, team, nation, ranking, match, result)

lowest_ranked_appearances
##          date         player             team            nation ranking
## 1  2005-12-07    Zesh Rehman           fulham          Pakistan     197
## 2  2018-11-13 Neil Etheridge     cardiff-city       Philippines     165
## 3  2018-03-26   Cuco Martina          everton           Curacao     136
## 4  2004-05-20  Jason Roberts       portsmouth           Grenada     131
## 5  2008-03-26 Emmerson Boyce   wigan-athletic          Barbados     126
## 6  2019-03-22   Pedro Obiang  west-ham-united Equatorial Guinea     125
## 7  2015-06-13   Modou Barrow     swansea-city            Gambia     122
## 8  2006-11-21     Paul Ifill sheffield-united          Barbados     119
## 9  2019-06-21 Leandro Bacuna     cardiff-city           Curacao     118
## 10 2003-06-07      Mart Poom       sunderland           Estonia      85
##                                     match result
## 1                   Pakistan vs Sri Lanka    1:0
## 2                Philippines vs Singapore    1:0
## 3                      Curaçao vs Bolivia    1:0
## 4                         Cuba vs Grenada    2:2
## 5                    Barbados vs Dominica    1:0
## 6              Sudan vs Equatorial Guinea    1:4
## 7                  South Africa vs Gambia    0:0
## 8  Barbados vs Saint Vincent & Grenadines    3:0
## 9                     Honduras vs Curaçao    0:1
## 10                     Estonia vs Andorra    2:0

And it appears not! It’s quite incredible that in the 13.5 years since the question was answered noone has even got close. This is probably due to the fact that though Pakistan were ranked 168th by FIFA at the time, using the ELO system, they come out as the 197th best team at the end of 2005.

Perhaps more surprising is that it seems that until 2003 the record was 85th, held by Mart Poom playing for Estonia (though this might also be because data on low ranking international matches gets worse as you go back past this point- Mart Poom probably played for Estonia in).

We can graph the appearances by Premier League players in matches involving low ranked nations to see how close people have gotten pretty easily. The below shows the rankings for the countries in the above printed data frame. Matches where Premier League players made an appearance are highlighted in black boxes.

p3 <- elo_data %>%
  filter(nation %in% low_ranked_appearances$nation) %>%
  mutate(date = as.Date(date)) %>%
  ggplot(aes(x = date, y = ranking, colour = nation, group = nation)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_line(alpha = 0.5, size = 2) +
  geom_point(data = low_ranked_appearances, aes(x = match_date),
             colour = "black", fill = NA, size = 3,stroke = 2, shape = 22) +
  # colour by football shirt colour
  scale_colour_manual(values = c("yellow", "darkblue", "darkred", "lightblue", "green", "red", "darkgreen", "darkgoldenrod")) +
  scale_x_date() +
  labs(title = "Lowest National Ranking of Premier League International Caps",
          subtitle = "premier league player caps in red",
          y = "national team ELO ranking") +
  theme_minimal()

plot(p3)

It seems Pakistan have continued to hover around the 200th position pretty consistently over the last 25 years. Indeed, if Zesh Rehman was still playing in the Premier League he would have broken his own record less than 1 month ago, when Pakistan sank to 201st position after a defeat to Cambodia. The Philippines, represented by Neil Etheridge in recent years, have been ranked lower, but not since 2010. Other than that Curacao can be seen to only have come close near to their formation as a FIFA member, and are clearly rapidly improving. Barbados and Grenada might be fruitful for any possible record taker, but are still a good 30 places higher than Pakistan.

Lowest Ranked Countries of Premier League Winners

We can answer the first question and find the players with the worst national team rankings by joining in and selecting for teams that won the league. This is done by first initializing a simple df with the league winners per season. The given answers to this question have differed in when they take the country’s rankings, whether at their lowest point within the season, or at the season’s end. I decided to use a different method- to take the average ranking of the nation over the whole season.

# data frame of EPL winners over year
epl_winners <- data.frame(
  champion = c(
    "manchester-united",
    "manchester-united",
    "blackburn-rovers",
    "manchester-united",
    "manchester-united",
    "arsenal",
    "manchester-united",
    "manchester-united",
    "manchester-united",
    "arsenal",
    "manchester-united",
    "arsenal",
    "chelsea",
    "chelsea",
    "manchester-united",
    "manchester-united",
    "manchester-united",
    "chelsea",
    "manchester-united",
    "manchester-city",
    "manchester-united",
    "manchester-city",
    "chelsea",
    "leicester-city",
    "chelsea",
    "manchester-city",
    "manchester-city"),
  year = 1993:2019
) 

# merge in winning temas
epl_winning_squads <- players_national_elo %>%
  left_join(., epl_winners, by = "year") %>%
  # filter for players that win the league that season
  filter(team == champion) %>%
  # filter for the year that player wins the league
  filter(date < as.Date(paste0(year, "-06-30")) &
           date > as.Date(paste0(year-1, "-07-01"))) %>%
  group_by(player, year, team, apps, sub_apps, nation) %>%
  summarise(av_ranking = mean(ranking)) %>%
  arrange(-av_ranking) %>%
  # take the lowest ranking per player/year combination
  select(year, player, team, apps, sub_apps, nation, av_ranking)

# show the 25 players with the worst ranking their nation had during that time
head(epl_winning_squads, n = 25)
## # A tibble: 25 x 7
## # Groups:   player, year, team, apps, sub_apps [25]
##     year player       team           apps sub_apps nation        av_ranking
##    <dbl> <chr>        <chr>         <int>    <int> <chr>              <dbl>
##  1  1998 Christopher~ arsenal           7        9 Liberia            114. 
##  2  2010 Gaël Kakuta  chelsea           0        1 Congo DR           106. 
##  3  2004 Justin Hoyte arsenal           0        1 Trinidad and~      102. 
##  4  2013 Jonny Evans  manchester-u~    21        2 Northern Ire~      101. 
##  5  2002 Igors Stepa~ arsenal           6        0 Latvia              99.1
##  6  2009 Manucho      manchester-u~     0        1 Angola              89.9
##  7  2006 Eidur Gudjo~ chelsea          16       10 Iceland             79.8
##  8  2003 Roy Carroll  manchester-u~     8        2 Northern Ire~       76.5
##  9  2001 David Healy  manchester-u~     0        1 Northern Ire~       76  
## 10  1999 Dwight Yorke manchester-u~    32        0 Trinidad and~       75.8
## # ... with 15 more rows

Here we have another quite clear winner Christopher Wreh, whose Liberia averaged a ranking of 114 over the 1997/1998 season. Gael Kakuta and Justin Hoyte (and Igors Stepanovs and Manucho) seem like false answers as they only achieved a handful of appearances on the way to winning the league, which only leaves Jonny Evans in 2013 as a real contender for this record. Northern Ireland in this season averaged just under 100th place.

Extra Credit

We can pretty quickly and easily see how these players compare to what we found above in terms of the lowest ranked nation they’ve turned out for. Again, first i had to manually add links to their pages of www.national-football-teams.com and then scrape each match they’ve appeared for their home country in.

head(champions_df)
## # A tibble: 6 x 2
##   player         url                                                       
##   <chr>          <chr>                                                     
## 1 Christopher W~ https://www.national-football-teams.com/player/13930/Chri~
## 2 Gaël Kakuta    https://www.national-football-teams.com/player/67629/Gael~
## 3 Justin Hoyte   https://www.national-football-teams.com/player/52483/Just~
## 4 Jonny Evans    https://www.national-football-teams.com/player/16586/Jonn~
## 5 Igors Stepano~ https://www.national-football-teams.com/player/3728/Igors~
## 6 Manucho        https://www.national-football-teams.com/player/14353/Manu~

When we do this scrape and analyse we get Jonny Evans as the Premier League winner who appeared for the lowest rated nation during a championship campaign. He’s Northern Ireland team slumped to a 2-0 defeat to Israel in 2013 which left them 106th in the world rankings. 2 months later, he won the Premier League with Manchester United. There’s relatively few challengers with only Igors Stepanovs (Latvia 101st, 6 appearances for Arsenal in 2002) getting close and then Roy Carroll (Northern Ireland), Eidur GOdjohnsen (Iceland), and Rihad Mahrez (Algeria) in the 80s (Manucho excluded due to lack of appearances in Manchester United’s 2009 Premier League campaign).

low_ranked_champs <- champions_df %>%
  pmap_df(., get_international_matches) %>%
  left_join(., players_national_elo, by = "player") %>%
  left_join(., epl_winners, by = "year") %>%
  filter(team == champion) %>%
  filter(match_date == date) 

lowest_ranked_champs <- low_ranked_champs %>%
  arrange(-ranking) %>%
  filter(!duplicated(player)) %>%
  mutate(match = paste(home, "vs", away)) %>%
  select(date, player, team, nation, ranking, match, result)

head(lowest_ranked_champs, n = 10)
##          date           player              team           nation ranking
## 1  2013-03-26      Jonny Evans manchester-united Northern Ireland     106
## 2  2001-11-14  Igors Stepanovs           arsenal           Latvia     101
## 3  2009-06-14          Manucho manchester-united           Angola      94
## 4  2003-06-03      Roy Carroll manchester-united Northern Ireland      87
## 5  2018-10-16     Riyad Mahrez   manchester-city          Algeria      85
## 6  2006-02-28 Eidur Gudjohnsen           chelsea          Iceland      82
## 7  2001-06-06      David Healy manchester-united Northern Ireland      82
## 8  2016-06-13       Wes Morgan    leicester-city          Jamaica      72
## 9  2000-09-02       Ryan Giggs manchester-united            Wales      70
## 10 2007-06-02   DONG Fang Zhou manchester-united            China      64
##                                 match result
## 1          Northern Ireland vs Israel    0:2
## 2                    Latvia vs Russia    1:3
## 3                    Angola vs Guinea    0:0
## 4           Italy vs Northern Ireland    2:0
## 5                    Benin vs Algeria    1:0
## 6        Trinidad & Tobago vs Iceland    2:0
## 7  Czech Republic vs Northern Ireland    3:1
## 8                  Uruguay vs Jamaica    3:0
## 9                    Belarus vs Wales    2:1
## 10                       USA vs China    4:1

However! One notable omission is Christopher Wreh, who played in 16 matches for Arsenal in the 1997/1998 season. Wreh only played 26 times for his nation so I assumed he just hadn’t appeared for them that year, but 11v11.com lists him playing on July 27th 1997 (1997/1998 season) at which point Liberia were ranked 110th in the world.

To leave a comment for the author, please follow the link and comment on their blog: rstats on Robert Hickman.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.