The Guardian Knowledge June 2019
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Most Wednesday’s I enjoy reading The Knowledge blog on the Guardian’s website and reading the football trivia therein. When time (and questions) allow, I like to answer some of the questions posed, example of which are here, here, and here.
League of Nations
The first question comes from
Which player had the nationality with the lowest FIFA World Ranking at the time of him winning the Premier League?
— The Tin Boonie (@TheTinBoonie) June 18, 2019
a similar question is also answered in this weeks column:
‘“Fulham defender Zesh Rehman made his debut for Pakistan, who are ranked 168 by Fifa. Is that the lowest-ranked country a Premier League player has played for?” wondered Zulfiqar Shah in January 2006.’
Answers
The latter question is more clear cut- noone has beaten the record set by Zesh Rehman in December 205 when he made his debut for Pakistan. Using ELO ratings which are more consistent over time, Pakistan were ranked as teh 197th best team in the world at that time. The nearest someone has gotten to this is Neil Etheridge in 2018 when playing for 165th ranked Philippines.
For the first question, it depends how you take the rankings and if you require players to have appeared a certain number of time in the season, but Chistopher Wreh who represented 110th ranked Liberia whilst plaing for Arsenal in the 1997/1998 season is probably the winner with Jonny Evans (Northern Ireland and Manchester United, 2012/2013) and Igors Stepanovs (Latvia and Arsenal, 2001/2002) close behind.
Working
First some libraries we’ll need, and also set a seed for reproducibility.
library(tidyverse) library(magrittr) library(rvest) set.seed(3459)
The first thing we need to answer these is the nationality of EPL players. For this two good sources are transfermarkt.co.uk and 11v11.com. I’m going to opt for the later, just because the tables are a little easier to scrape.
To start, we need to get the links to every team to have competed in the premier league
# get years of EPL seasons years <- 1993:2019 # base url we'll scrape from base_url <- "https://www.11v11.com" # cat together tables <- paste0(base_url, "/league-tables/premier-league/01-june-", years) competing_teams <- tables %>% # get a list of the links to every teams squad page map(., function(x) { x %>% read_html() %>% html_nodes("#table-league > tbody:nth-child(2) > tr > td:nth-child(2) > a:nth-child(1)") %>% html_attr("href") %>% # paste into working link for year and competition (EPL) paste0(base_url, ., "tab/players/season/", gsub(".*01-june-", "", x), "/comp/1/") }) %>% unlist() head(competing_teams, n = 5)
We can then scrape the squads for these teams in that specific season. We want the players, their nationality and also the number of appearances they made in the league that season
squads <- competing_teams %>% # get the players/appearances/nationalities map_df(., function(y) { # read once to save server calls read <- y %>% read_html() # get the squad info squad <- read %>% html_nodes(".squad") %>% html_table(fill = TRUE) %>% as.data.frame() %>% # get rid of rows without player info filter(!is.na(Player)) # get the listed nationalities flags <- read %>% html_nodes(".squad > tbody:nth-child(2) > tr > td:nth-child(3)") # from here get the actual nationalities per player nations <- flags %>% html_nodes("img") %>% html_attr("title") # these might mismatch in length # in which case append NA if(length(flags) != length(nations)) { missing <- which(!grepl("img", flags)) nations <- c( nations[1:(missing-1)], NA, nations[missing:length(nations)] ) } # mutate nationality and team and season squad %>% mutate( nation = nations, year = gsub(".*season\\/", "", gsub("\\/comp.*", "", y)), team = gsub("\\/tab\\/players.*", "", gsub(".*teams\\/", "", y)) ) %>% # select useful appearance information select(player = Player, position = Position, appearances = A, sub_appearances = S, nation, year, team) }) %>% # manually add in some missing nationalities mutate(nation = case_when( grepl("Steffen Karl", player) ~ "Germany", grepl("Marc Muniesa", player) ~ "Spain", grepl("Oriol Romeu", player) ~ "Spain", grepl("Aleix García", player) ~ "Spain", grepl("Martín Montoya", player) ~ "Spain", TRUE ~ nation )) head(squads, n = 10)
To answer the question as asked, we’d want historical FIFA ranking data. While it does exist going back to 2007, I’d prefer to have the full data set back to 1993, and in any case, there are also some problems with the historical calculation FIFA used for it’s ratings.
Instead, we can use the ELO method of rating teams (most commonly used to rank chess players). There are two ways to do this- we can calculate the ratings ourselves using a dataframe of international results, or we can take the accepted ratings at eloratings.net.
I’ll outline the first method here and then use the data from the second further below here.
Calculating ELO ratings
To calculate our ratings, first we need to to load up a dataframe of international football results. The one I’m using comes from kaggle and has 40k matches listed since the start of international football in 1872:
# https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017 international_results <- readRDS("../../static/files/international_results.rds") head(international_results) ## date home_team away_team home_score away_score tournament city ## 1 1872-11-30 Scotland England 0 0 Friendly Glasgow ## 2 1873-03-08 England Scotland 4 2 Friendly London ## 3 1874-03-07 Scotland England 2 1 Friendly Glasgow ## 4 1875-03-06 England Scotland 2 2 Friendly London ## 5 1876-03-04 Scotland England 3 0 Friendly Glasgow ## 6 1876-03-25 Scotland Wales 4 0 Friendly Glasgow ## country neutral ## 1 Scotland FALSE ## 2 England FALSE ## 3 Scotland FALSE ## 4 England FALSE ## 5 Scotland FALSE ## 6 Scotland FALSE
We only need a few of these variables- enough to know who wins each match and when/where it was played. We can then use this data to initialise several parameters to be used in our ELO calculation. For team i (in a match of teams i and j), this is calculated as:
\[ rating_{i_{t}} = rating_{i_{t-1}} + K \cdot G \cdot (R - E(R)) \] The rating of team i is their old rating plus the difference between the actual result (R = 1 for a win, 0.5 for a draw, 0 for a loss) and the expected result (where 1 means certain win for team i).
The unexpectedness of the result is then multiplied by two parameters. The first K, is to account for the importance of the match, with more important matches having a higher K factor, and a greater influence of team rating.
\[\begin{equation} K = \begin{cases} 60 & \text{if World Cup Final} \\ 50 & \text{if World Cup/ Major Intercontinental Matches} \\ 40 & \text{if World Cup/Continental Competition Qualifiers} \\ 30 & \text{if Other Tournaments}\\ 20 & \text{if Friendly} \\ \end{cases} \end{equation}\]
The second parameter, G is controlled by the strength of the result
\[\begin{equation} G = \begin{cases} 1 & \text{if } N < 2 \\ 1.5 & \text{if } N = 2 \\ 1.75 & \text{if } N = 3 \\ 1.75 + \frac{N-3}{8} & \text{if N > 3} \\ \end{cases} \end{equation}\]
where N is the goal difference:
\[ N = Goals_{i} - Goals_{j} \]
The expected result is calculated based on the rankings of both teams going into the match
\[ E(result) = \frac{1}{10 ^ \frac{-dr_{i,j}}{400} + 1} \] where the difference in rankings (dr) is calculated as
\[\begin{equation} dr_{i,j} = \begin{cases} rating_{i, t-1} + 100 - rating_{j, t-1} & \text{if i at home} \\ rating_{i, t-1} - 100 - rating_{j, t-1} & \text{if j at home} \\ rating_{i, t-1} - rating_{j, t-1} & \text{if neutral} \\ \end{cases} \end{equation}\]
We then add K, G and R to each row of the data frame to make our calculations easier down the line. Unfortunately, my dataset doesn’t give the context of each game, so I’ve set K to 40 for every match. In theory this shouldn’t make a difference, but will affect the ratings of teams who over/under perform in big matches.
international_results %<>% # select relevant columns select( date, home = home_team, away = away_team, hgoal = home_score, agoal = away_score, neutral ) %>% # convert date to date format mutate(date = as.Date(date)) %>% # K = match importance # don't have competition data in this dataset so just set to 40 mutate(K = 40) %>% # G = goal difference factor # takes into account how much a team is beaten by mutate(G = case_when( abs(hgoal-agoal) < 2 ~ 1, abs(hgoal-agoal) < 3 ~ 1.5, abs(hgoal-agoal) >= 3 ~ 1.75 + (abs(hgoal-agoal)-3)/8 )) %>% # results = 1 for win and 0.5 for a draw mutate(result = case_when( hgoal > agoal ~ 1, hgoal < agoal ~ 0, hgoal == agoal ~ 0.5 )) %>% # arrange by date so ELO can be updated sequentially arrange(date)
We still need to initialise our rating (R) for each team, which for simplicity I’ve set to 1200 to start with. That is, every team starts with the same rating and will gradually tend towards their ‘natural’ rating. Given There’s probably at least 50 years of data for most teams before the Premier League begins in 1992, hopefully it should be enough for this to level out.
team_ratings <- international_results %>% # select date and teams select(date, home, away) %>% # melt gather(., "location", "nation", home, away) %>% select(-location) %>% arrange(date) %>% # set out unique teams with a rating of 1200 filter(!duplicated(nation)) %>% mutate(rating = 1200) %>% select(-date)
Therefore, using the very first international fixture between England and Scotland in 1872 we have parameters of
head(international_results, n = 1) ## date home away hgoal agoal neutral K G result ## 1 1872-11-30 Scotland England 0 0 FALSE 40 1 0.5
ratings(t-1) = 1200 for both England and Scotland K = 40 G = 1 for a draw R = 0.5 for a draw
the equal ratings, but home location for Scotland mean that for England:
\[ dr_{i,j} = 1200 + 100 - 1200 = 100 \] and so an expected result (1- the expected home result)
\[ E(R) = 1 - \frac{1}{10 ^ {100/400} + 1} = 1 - \frac{1}{2.78} = 0.36\] and so England will get a post match rating of
\[ rating_{j} = 1200 + (40 \cdot 1 \cdot (0.5 - 0.36)) = 1207.2 \]
calc_ELO <- function(date, home, away, K, G, result) { #get the difference in ratings hr <- team_ratings$rating[which(team_ratings$nation == home)] vr <- team_ratings$rating[which(team_ratings$nation == away)] dr <- vr - (hr + 100) # calculate expected results e_result <- 1/ ((10^(dr/400))+1) # calculate new ratings new_hr <- hr + ((K*G) * (result - e_result)) new_vr <- vr + ((K*G) * ((1-result) - (1-e_result))) # pipe these back into a df of team ratings to sample from team_ratings$rating[which(team_ratings$nation == home)] <<- new_hr team_ratings$rating[which(team_ratings$nation == away)] <<- new_vr # return new ratings return(list(h_rating = new_hr, v_rating = new_vr)) }
which can be applied to the dataframe of match information using pmap_df from the purrr package, which allows for some pleasing conciseness. It allows for the speed of applying a function, without needing to split the data frame by row and pass into lapply and rebind together.
elo_data <- international_results %>% # select relevant variable # keep date so we know a teams ELO at specific date select(date, home, away, K, G, result) %>% bind_cols(pmap_df(., calc_ELO)) %>% # get rid of ELO parameters select(date, home, away, h_rating, v_rating) %>% # gather twice to get a long df of teams ratings after matches gather("location", "nation", -date, -h_rating, -v_rating) %>% gather("rating", "value", -date, -location, -nation) %>% # filter for home rating for teams at home and vice versa filter((location == "home" & rating == "h_rating") | (location == "away" & rating == "v_rating")) %>% select(date, nation, rating = value) %>% # we only care about ratings from August 1992 filter(date > "1992-07-31")
We’re left with a dataframe of 46992 observations for 3 variables: date, nation and the ranking of the nation at that time. We can plot a random selection of 5 teams just to sanity check and see that teams we know have historically been stronger (e.g. Argentina) show consistently higher rankings than weaker nations (e.g. Greece).
# randomly choose 5 teams with max ELOs > 1600 teams <- elo_data %>% filter(rating > 1600) %>% .$nation %>% unique() %>% .[sample(length(.), 5)] # plot rating over time for these 5 teams p1 <- elo_data %>% filter(nation %in% teams) %>% ggplot(aes(x = date, y = rating, colour = nation, group = nation)) + geom_point() + geom_line() + # colour by football shirt colour scale_colour_manual(values = c("skyblue", "darkblue", "darkorange", "darkgreen", "red")) + labs(title = "ELO Ratings of Select Countries over Time", subtitle = "ratings calculated using homebrew script", x = "date", y = "rating") + theme_minimal() plot(p1)
I’m pretty happy with how the code works. With a more complete dataset of matches, and also the time to properly filter countries in and out as they are formed/dissolved, I think it would make a pretty viable answer, however, I wanted to be as accurate as possible, and I can’t compete with the official-unofficial ratings of eloratings.net.
Dynamic Scraping
When scraping data for blog posts, I typically rely on rvest and it’s read_html(url) function. However, while this works for the static websites which make up the vast majority of sites containing tables of data, it struggles with websites that use JavaScript to dynamically generate pages.
Eloratings.net is one such website which rvest is unable to scrape. E.g.
# url to data on Brazil's ELO rating over time url <- "https://eloratings.net/Brazil" read <- read_html(url) %>% # this is the CSS selector for the page title html_nodes("#mainheader") read ## {xml_nodeset (1)} ## [1] <h1 id="mainheader" class="mainheader"></h1>
does not manage to capture the data displayed in the page mainheader (it ‘should’ return “World Football Elo Ratings: Brazil” from the title of that page).
Instead, what we want to do is save a copy of the generated page as a .html file and then read that into R using read_html(). Luckily, a way exists to do just that, using the (now deprecated, but still working) PhantomJS headless browser. Much of the code I used to get going with this is adapted from a tutorial here.
First you want to install PhantomJS from the above website and run through it’s quick start guide. This is a pretty thorough guide, I would say that there are really only three steps from installation to getting going:
- Add phantomjs to the system PATH
- Open a text editor and save one of the tutorial scripts as filename.js
- run > phantomjs C:/Users/usr/path/to/file.js in a command line console
The file we’re going to use to render the js pages and then save the html is below:
// scrapes a given url (for eloratings.net) // create a webpage object var page = require('webpage').create(), system = require('system') // the url for each country provided as an argument country= system.args[1]; // include the File System module for writing to files var fs = require('fs'); // specify source and path to output file // we'll just overwirte iteratively to a page in the same directory var path = 'elopage.html' page.open(country, function (status) { var content = page.content; fs.write(path,content,'w') phantom.exit(); });
(which, again, is stolen and adapted from here)
This is saved as scrape_ELO.js in the static directory of my blog folder.
To keep everything in R, we can use the system() family of functions, which provides access to the OS command line. Though the referenced tutorial uses system(), it relies on scraping a single referenced page. To iteratively scrape every country, we’ll need to provide an argument (country) which will contain the link to the page on eloratings.net for that country.
E.g. for Brazil we will provide “https://www.eloratings.net/Brazil” as the country argument
phantom_dir <- "C:/Users/path/to/scrape_ELO/" country_url <- "https://www.eloratings.net/Brazil" # use system2 to invoke phantomjs via it's executable system2("C:/Users/path/to/phantomjs-2.1.1-windows/bin/phantomjs.exe", #provide the path to the scraping script and the country url as argument args = c(file.path(phantom_dir, "scrape_ELO.js"), country_url))
We can then read in this saved html page using rvest as per usual and recover the information therein.
# read in the saved html file page <- read_html("elopage.html") # scrape with rvest as normal country_name <- page %>% html_nodes("#mainheader") %>% html_text() %>% gsub("Elo Ratings: ", "", .) country_name ## [1] "Brazil"
I’m not going to include my full script for scraping eloratings.net as usually a reason for doing this obscuring of the data is to prevent exactly what I’m doing. Instead I’ll give a skeleton function of the one I use. If you are having problems with setting up phantomjs to scrape pages, my contact details are listed on my blog homepage
scrape_nation <- function(country) { # download the page url <- paste0("https://eloratings.net/", country) system2("C:/Users/path/to/phantomjs-2.1.1-windows/bin/phantomjs.exe", args = c(file.path(phantom_dir, "scrape_ELO.js"), url)) # read in downloaded page page <- read_html("elopage.html") # recover information country_name <- page %>% html_nodes("#mainheader") %>% html_text() %>% gsub("Elo Ratings: ", "", .) opposing <- page %>% html_nodes(".r1 a") %>% html_text() teams <- page %>% html_nodes(".r1") fixtures <- map2_df(teams, opposing, split_teams) ratings <- page %>% html_nodes(".r4") %>% html_text() %>% map_df(., split_ratings) rankings <- page %>% html_nodes(".r6") %>% map_df(., split_rankings) dates <- page %>% html_nodes(".r0") %>% html_text() %>% map_df(., convert_date) # bind into a data frame df <- fixtures %>% cbind(., ratings) %>% cbind(., rankings) %>% cbind(., dates) %>% mutate(table_country = country_name) } elO_data <- map_df(country_links, scrape_nation)
Finally, we want to convert this to long format. We have two observations per country and any point in time- the rating, and the ranking. I’m going to filter out just the ranking, as that’s what the questions ask, but if anything there’s possibly more information in the rating data.
elo_data %<>% # rename and select variables select( date, home, away, rating_home = r1, rating_away = r2, ranking_home = ranking1, ranking_away = ranking2 ) %>% # melt twice to convert to long format gather( "location", "nation", -rating_home, -rating_away, -ranking_home, -ranking_away, -date ) %>% gather("measure", "value", -nation, -date, -location) %>% # take only relevant information filter( (location == "home" & measure %in% c("rating_home", "ranking_home")) | (location == "away" & measure %in% c("rating_away", "ranking_away")) ) %>% separate(measure, into = c("measure", "location"), "_") %>% # filter out relevant data filter(!duplicated(.)) %>% filter(date > "1992-01-01") %>% filter(measure == "ranking") %>% select(-measure, ranking = value, -location)
Answering the Question
So now we have a database of Premier League players’ nationalities, and also of the ELO rankings of countries since 1992, we can answer the original questions.
First we need to make sure that the data can join to each other, which means making sure that the nation names are common between the two data sets.
# get unique names for countries in both data sets squad_teams <- unique(squads$nation) rating_teams <- unique(elo_data$nation) # find non joining country names squad_teams[!squad_teams %in% rating_teams] ## [1] "Ireland Republic" "Trinidad and Tobago" ## [3] "Czech Republic" "Macedonia FYR" ## [5] "St. Kitts and Nevis" "Bosnia and Herzegovina" ## [7] "Congo DR" "Antigua and Barbuda" ## [9] "Korea Republic" "Curacao" ## [11] "Cape Verde Islands" "Equatorial Guinea"
Some are missing where country names follow different convention- e.g. the Democratic Republic of Congo is named DR Congo in one, and Congo DR in the other. We can quickly convert these odd countries and join the two data sets together using dplyr. Then we can get an idea of the national rankings of the nationality of Premier League players since it’s inception
# rename mismatching nations elo_data %<>% mutate(nation = case_when( grepl("^Ireland", nation) ~ "Ireland Republic", grepl("^Czechia", nation) ~ "Czech Republic", grepl("Trinidad/Tobago", nation) ~ "Trinidad and Tobago", grepl("Macedonia", nation) ~ "Macedonia FYR", grepl("St Kitts and Nevis", nation) ~ "St. Kitts and Nevis", grepl("Bosnia/Herzeg", nation) ~ "Bosnia and Herzegovina", grepl("DR Congo", nation) ~ "Congo DR", grepl("Antigua/Barbuda", nation) ~ "Antigua and Barbuda", grepl("South Korea", nation) ~ "Korea Republic", grepl("Curaçao", nation) ~ "Curacao", grepl("Cape Verde", nation) ~ "Cape Verde Islands", grepl("Equat Guinea", nation) ~ "Equatorial Guinea", TRUE ~ nation )) players_national_elo <- squads %>% # convert dates mutate(year = as.numeric(year)) %>% # join elo rating data left_join(., elo_data, by = "nation") %>% # take only relevant data per season filter(date < as.Date(paste0(year, "-06-30")) & date > as.Date(paste0(year-1, "-07-01"))) %>% # rename for concise printing rename(apps = appearances, sub_apps = sub_appearances) # histogram of players national team lowest ratings p2 <- players_national_elo %>% arrange(ranking) %>% # take only lowest ranked observations filter(!duplicated(paste(player, team, year), fromLast = TRUE)) %>% # group by decade mutate(decade = case_when( year < 2000 ~ "1990", year < 2010 ~ "2000", year < 2020 ~ "2010" )) %>% ggplot(aes(ranking)) + geom_histogram() + labs(title = "Distribution of EPL Player's Nation's Ranking", subtitle = "(taking lowest point of ranking data)", x = "lowest national team ELO rating", y = "player count") + theme_minimal() + facet_wrap(~decade) plot(p2)
It surprised me that the distribution hasn’t obviously changed over the year. There’s maybe a few more players from ‘mid-level’ nations (ranking 30-50) this decade but I’d doubt it’s significantly more. The majority of players come from nations in the top 20 worldwide consistently since the Premier League’s inception.
We can easily then take the players with the worst national team ranking by arranging by the ranking of national teams (and removing duplicate players so it isn’t just filled with the same 2/3 names)
worst_national_team_players <- players_national_elo %>% # arrange rating from low to high arrange(-ranking) %>% # remove duplicated players filter(!duplicated(player)) %>% # select only relevant info select(year, player, team, apps, sub_apps, nation, ranking) # show the 25 players with the worst ranking their nation had during that time head(worst_national_team_players, n = 25) ## year player team apps sub_apps ## 1 2015 Brandon Comley queens-park-rangers 0 1 ## 2 2004 Zesh Rehman fulham 0 1 ## 3 2012 George Elokobi wolverhampton-wanderers 3 6 ## 4 2014 Leandro Bacuna aston-villa 28 7 ## 5 2013 Kemy Agustien swansea-city 4 14 ## 6 2015 Kenji Gorré swansea-city 0 1 ## 7 1998 Danny Higginbotham manchester-united 0 1 ## 8 1999 Carl Cort wimbledon 6 10 ## 9 2007 Mikele Leigertwood sheffield-united 16 3 ## 10 2007 Moses Ashikodi watford 0 2 ## 11 2016 Cuco Martina southampton 11 4 ## 12 2003 Neil Danns blackburn-rovers 1 1 ## 13 2005 Dexter Blackstock southampton 8 1 ## 14 2019 Neil Etheridge cardiff-city 38 0 ## 15 2013 Emmerson Boyce wigan-athletic 36 0 ## 16 1997 Mart Poom derby-county 4 0 ## 17 2007 Matthew Briggs fulham 0 1 ## 18 2018 Nahki Wells burnley 0 9 ## 19 2012 Jason Roberts blackburn-rovers 5 5 ## 20 1998 Sagi Burton-Godwin crystal-palace 1 1 ## 21 2010 Gunnar Nielsen manchester-city 0 1 ## 22 2009 Leon Cort stoke-city 9 2 ## 23 2000 Adam Newton west-ham-united 0 2 ## 24 2004 Delroy Facey bolton-wanderers 0 1 ## 25 2013 Gaël Bigirimana newcastle-united 3 10 ## nation ranking ## 1 Montserrat 227 ## 2 Pakistan 204 ## 3 Somalia 199 ## 4 Curacao 188 ## 5 Curacao 186 ## 6 Curacao 186 ## 7 Gibraltar 181 ## 8 Guyana 179 ## 9 Antigua and Barbuda 179 ## 10 Antigua and Barbuda 179 ## 11 Curacao 179 ## 12 Guyana 175 ## 13 Antigua and Barbuda 174 ## 14 Philippines 174 ## 15 Barbados 173 ## 16 Estonia 169 ## 17 Guyana 169 ## 18 Bermuda 167 ## 19 Grenada 166 ## 20 St. Kitts and Nevis 160 ## 21 Faroe Islands 160 ## 22 Guyana 153 ## 23 St. Kitts and Nevis 147 ## 24 Grenada 145 ## 25 Burundi 143
So Zesh Rehman as the question supposes, is one of the players from extremely low ranked nations. However, he is beaten by Brandon Comley at QPR who represents Montserrat internationally, who fell as low as 227 in the ELO world rankings. The 204 for Zesh Rehman is different to the 168 listed in the question for three reasons:
- As earlier discussed we are using ELO rankings, not the official FIFA rankings (and are probably right to)
- This lists the lowest that nation fell in the time that player was in the Premier League, not just at the time Zesh Rehman (or any other player) made his debut for Pakistan (/other nation)
Indeed, Brandon Comley did not start to play for Montserrat until September 2018, by which time he was playing for Colchester in League Two.
What is interesting though, is that there are 19 instances of players of nationalities ranked lower than 164- many more than I would have imagined.
To truly answer the question, that is, to only count rankings in games Premier League players played in, we can scrape international match data from national_football-teams.com. 11v11.com which I’ve used thus far does list international appearances, but tends to have thinner data (and vice versa for national-football-teams.com with regards to Premier League squad data).
First, we’ll manually add the links to the profiles of players who are contenders for having the lowest ranked appearance (the lowest 20 in the above dataframe), then we can use this to scrape the international matches they’ve played in. This is then matched by data to the ELO ranking data to find the national ranking of the country they represented. Finally, we take the lowest ranked appearance for each player and see if any can beat Zesh Rehman.
head(players_intl_links) ## player ## 1 Brandon Comley ## 2 Zesh Rehman ## 3 George Elokobi ## 4 Leandro Bacuna ## 5 Kemy Agustien ## 6 Kenji Gorré ## url ## 1 https://www.national-football-teams.com/player/71845/Brandon_Comley.html ## 2 https://www.national-football-teams.com/player/12929/Zesh_Rehman.html ## 3 <NA> ## 4 https://www.national-football-teams.com/player/63672/Leandro_Bacuna.html ## 5 https://www.national-football-teams.com/player/59413/Kemy_Agustien.html ## 6 https://www.national-football-teams.com/player/74671/Kenji_Gorre.html get_international_matches <- function(player, url) { if(!is.na(url)) { df <- url %>% read_html() %>% html_nodes("#games > table") %>% html_table(fill = TRUE) %>% as.data.frame() %>% .[c(1:2, 5:7)] %>% filter(Date != "") %>% mutate(result = gsub("\n.*", "", Result), player = player, match_date = as.Date(Date)) %>% select(player, match_date, home = Home.Team, away = Away.Team.1, result, event = Event) } } low_ranked_appearances <- players_intl_links %>% pmap_df(., get_international_matches) %>% left_join(., players_national_elo, by = "player") %>% filter(match_date == date) lowest_ranked_appearances <- low_ranked_appearances %>% arrange(-ranking) %>% filter(!duplicated(player, fromLast = TRUE)) %>% mutate(match = paste(home, "vs", away)) %>% select(date, player, team, nation, ranking, match, result) lowest_ranked_appearances ## date player team nation ranking ## 1 2005-12-07 Zesh Rehman fulham Pakistan 197 ## 2 2018-11-13 Neil Etheridge cardiff-city Philippines 165 ## 3 2018-03-26 Cuco Martina everton Curacao 136 ## 4 2004-05-20 Jason Roberts portsmouth Grenada 131 ## 5 2008-03-26 Emmerson Boyce wigan-athletic Barbados 126 ## 6 2019-03-22 Pedro Obiang west-ham-united Equatorial Guinea 125 ## 7 2015-06-13 Modou Barrow swansea-city Gambia 122 ## 8 2006-11-21 Paul Ifill sheffield-united Barbados 119 ## 9 2019-06-21 Leandro Bacuna cardiff-city Curacao 118 ## 10 2003-06-07 Mart Poom sunderland Estonia 85 ## match result ## 1 Pakistan vs Sri Lanka 1:0 ## 2 Philippines vs Singapore 1:0 ## 3 Curaçao vs Bolivia 1:0 ## 4 Cuba vs Grenada 2:2 ## 5 Barbados vs Dominica 1:0 ## 6 Sudan vs Equatorial Guinea 1:4 ## 7 South Africa vs Gambia 0:0 ## 8 Barbados vs Saint Vincent & Grenadines 3:0 ## 9 Honduras vs Curaçao 0:1 ## 10 Estonia vs Andorra 2:0
And it appears not! It’s quite incredible that in the 13.5 years since the question was answered noone has even got close. This is probably due to the fact that though Pakistan were ranked 168th by FIFA at the time, using the ELO system, they come out as the 197th best team at the end of 2005.
Perhaps more surprising is that it seems that until 2003 the record was 85th, held by Mart Poom playing for Estonia (though this might also be because data on low ranking international matches gets worse as you go back past this point- Mart Poom probably played for Estonia in).
We can graph the appearances by Premier League players in matches involving low ranked nations to see how close people have gotten pretty easily. The below shows the rankings for the countries in the above printed data frame. Matches where Premier League players made an appearance are highlighted in black boxes.
p3 <- elo_data %>% filter(nation %in% low_ranked_appearances$nation) %>% mutate(date = as.Date(date)) %>% ggplot(aes(x = date, y = ranking, colour = nation, group = nation)) + geom_point(alpha = 0.5, size = 2) + geom_line(alpha = 0.5, size = 2) + geom_point(data = low_ranked_appearances, aes(x = match_date), colour = "black", fill = NA, size = 3,stroke = 2, shape = 22) + # colour by football shirt colour scale_colour_manual(values = c("yellow", "darkblue", "darkred", "lightblue", "green", "red", "darkgreen", "darkgoldenrod")) + scale_x_date() + labs(title = "Lowest National Ranking of Premier League International Caps", subtitle = "premier league player caps in red", y = "national team ELO ranking") + theme_minimal() plot(p3)
It seems Pakistan have continued to hover around the 200th position pretty consistently over the last 25 years. Indeed, if Zesh Rehman was still playing in the Premier League he would have broken his own record less than 1 month ago, when Pakistan sank to 201st position after a defeat to Cambodia. The Philippines, represented by Neil Etheridge in recent years, have been ranked lower, but not since 2010. Other than that Curacao can be seen to only have come close near to their formation as a FIFA member, and are clearly rapidly improving. Barbados and Grenada might be fruitful for any possible record taker, but are still a good 30 places higher than Pakistan.
Lowest Ranked Countries of Premier League Winners
We can answer the first question and find the players with the worst national team rankings by joining in and selecting for teams that won the league. This is done by first initializing a simple df with the league winners per season. The given answers to this question have differed in when they take the country’s rankings, whether at their lowest point within the season, or at the season’s end. I decided to use a different method- to take the average ranking of the nation over the whole season.
# data frame of EPL winners over year epl_winners <- data.frame( champion = c( "manchester-united", "manchester-united", "blackburn-rovers", "manchester-united", "manchester-united", "arsenal", "manchester-united", "manchester-united", "manchester-united", "arsenal", "manchester-united", "arsenal", "chelsea", "chelsea", "manchester-united", "manchester-united", "manchester-united", "chelsea", "manchester-united", "manchester-city", "manchester-united", "manchester-city", "chelsea", "leicester-city", "chelsea", "manchester-city", "manchester-city"), year = 1993:2019 ) # merge in winning temas epl_winning_squads <- players_national_elo %>% left_join(., epl_winners, by = "year") %>% # filter for players that win the league that season filter(team == champion) %>% # filter for the year that player wins the league filter(date < as.Date(paste0(year, "-06-30")) & date > as.Date(paste0(year-1, "-07-01"))) %>% group_by(player, year, team, apps, sub_apps, nation) %>% summarise(av_ranking = mean(ranking)) %>% arrange(-av_ranking) %>% # take the lowest ranking per player/year combination select(year, player, team, apps, sub_apps, nation, av_ranking) # show the 25 players with the worst ranking their nation had during that time head(epl_winning_squads, n = 25) ## # A tibble: 25 x 7 ## # Groups: player, year, team, apps, sub_apps [25] ## year player team apps sub_apps nation av_ranking ## <dbl> <chr> <chr> <int> <int> <chr> <dbl> ## 1 1998 Christopher~ arsenal 7 9 Liberia 114. ## 2 2010 Gaël Kakuta chelsea 0 1 Congo DR 106. ## 3 2004 Justin Hoyte arsenal 0 1 Trinidad and~ 102. ## 4 2013 Jonny Evans manchester-u~ 21 2 Northern Ire~ 101. ## 5 2002 Igors Stepa~ arsenal 6 0 Latvia 99.1 ## 6 2009 Manucho manchester-u~ 0 1 Angola 89.9 ## 7 2006 Eidur Gudjo~ chelsea 16 10 Iceland 79.8 ## 8 2003 Roy Carroll manchester-u~ 8 2 Northern Ire~ 76.5 ## 9 2001 David Healy manchester-u~ 0 1 Northern Ire~ 76 ## 10 1999 Dwight Yorke manchester-u~ 32 0 Trinidad and~ 75.8 ## # ... with 15 more rows
Here we have another quite clear winner Christopher Wreh, whose Liberia averaged a ranking of 114 over the 1997/1998 season. Gael Kakuta and Justin Hoyte (and Igors Stepanovs and Manucho) seem like false answers as they only achieved a handful of appearances on the way to winning the league, which only leaves Jonny Evans in 2013 as a real contender for this record. Northern Ireland in this season averaged just under 100th place.
Extra Credit
We can pretty quickly and easily see how these players compare to what we found above in terms of the lowest ranked nation they’ve turned out for. Again, first i had to manually add links to their pages of www.national-football-teams.com and then scrape each match they’ve appeared for their home country in.
head(champions_df) ## # A tibble: 6 x 2 ## player url ## <chr> <chr> ## 1 Christopher W~ https://www.national-football-teams.com/player/13930/Chri~ ## 2 Gaël Kakuta https://www.national-football-teams.com/player/67629/Gael~ ## 3 Justin Hoyte https://www.national-football-teams.com/player/52483/Just~ ## 4 Jonny Evans https://www.national-football-teams.com/player/16586/Jonn~ ## 5 Igors Stepano~ https://www.national-football-teams.com/player/3728/Igors~ ## 6 Manucho https://www.national-football-teams.com/player/14353/Manu~
When we do this scrape and analyse we get Jonny Evans as the Premier League winner who appeared for the lowest rated nation during a championship campaign. He’s Northern Ireland team slumped to a 2-0 defeat to Israel in 2013 which left them 106th in the world rankings. 2 months later, he won the Premier League with Manchester United. There’s relatively few challengers with only Igors Stepanovs (Latvia 101st, 6 appearances for Arsenal in 2002) getting close and then Roy Carroll (Northern Ireland), Eidur GOdjohnsen (Iceland), and Rihad Mahrez (Algeria) in the 80s (Manucho excluded due to lack of appearances in Manchester United’s 2009 Premier League campaign).
low_ranked_champs <- champions_df %>% pmap_df(., get_international_matches) %>% left_join(., players_national_elo, by = "player") %>% left_join(., epl_winners, by = "year") %>% filter(team == champion) %>% filter(match_date == date) lowest_ranked_champs <- low_ranked_champs %>% arrange(-ranking) %>% filter(!duplicated(player)) %>% mutate(match = paste(home, "vs", away)) %>% select(date, player, team, nation, ranking, match, result) head(lowest_ranked_champs, n = 10) ## date player team nation ranking ## 1 2013-03-26 Jonny Evans manchester-united Northern Ireland 106 ## 2 2001-11-14 Igors Stepanovs arsenal Latvia 101 ## 3 2009-06-14 Manucho manchester-united Angola 94 ## 4 2003-06-03 Roy Carroll manchester-united Northern Ireland 87 ## 5 2018-10-16 Riyad Mahrez manchester-city Algeria 85 ## 6 2006-02-28 Eidur Gudjohnsen chelsea Iceland 82 ## 7 2001-06-06 David Healy manchester-united Northern Ireland 82 ## 8 2016-06-13 Wes Morgan leicester-city Jamaica 72 ## 9 2000-09-02 Ryan Giggs manchester-united Wales 70 ## 10 2007-06-02 DONG Fang Zhou manchester-united China 64 ## match result ## 1 Northern Ireland vs Israel 0:2 ## 2 Latvia vs Russia 1:3 ## 3 Angola vs Guinea 0:0 ## 4 Italy vs Northern Ireland 2:0 ## 5 Benin vs Algeria 1:0 ## 6 Trinidad & Tobago vs Iceland 2:0 ## 7 Czech Republic vs Northern Ireland 3:1 ## 8 Uruguay vs Jamaica 3:0 ## 9 Belarus vs Wales 2:1 ## 10 USA vs China 4:1
However! One notable omission is Christopher Wreh, who played in 16 matches for Arsenal in the 1997/1998 season. Wreh only played 26 times for his nation so I assumed he just hadn’t appeared for them that year, but 11v11.com lists him playing on July 27th 1997 (1997/1998 season) at which point Liberia were ranked 110th in the world.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.