Recently, a Yorkshire national football team appeared in a league of national teams for stateless people. This got me wondering how the historic counties of the UK would do at the World Cup. Could any of them compete with full international teams?
This is the complete script for a short article I wrote for CityMetric on the topic. It’s split over 5 separate parts and is pretty hefty, but it contains pretty much everything you need to reproduce the article. Last time, we found the position abilities of each player using LASSO regression. This time, we’ll geolocate the birthplace of the British players in our dataset to find which county team they’d be eligible for.
library(dplyr)
library(magrittr)
library(data.table)
library(ggplot2)
#use rvest to scrape the wikipedia info card (read_html, html_nodes, html_table)
library(rvest)
#use pediarr to query wikipedia to find the birthplace of players
library(pediarr)
#use googleway to geocode birthplaces
library(googleway)
#use sf to bin players into counties
library(sf)
Find British Players’ Birthplaces
To select our county teams, we need to know where each British player was born (and thus their ‘county’ nationality). Fortunately, Wikipedia has an extremely detailed database of thousands of footballers, including their birthplace (which we can assume is at least reasonably correct).
First, the data needs to be filtered to include only players with British nationalities (English, Welsh, Scottish, or Northern Irish) or Irish nationality. It’s possible that some players representing other countries were born in the UK, and so would be eligible for the hypothetical county teams, but this is rare enough that chasing them down is more trouble than it’s worth.
When filtering, I also remove players who have no Wikipedia page or no birthplace listed. For some of these, I was able to locate their birthplace manually. Some players don’t get matched very well by the search (mostly because of Australian/American footballers with the same name), and for these it was easiest just to supply the links to their Wikipedia pages manually.
#players with no wikipedia birthplace listed
players_missing_data <- c("Liam Lindsay","Greg Docherty","Mikey Devlin","Josh Dacres-Cogley",
                          "Tom Broadbent","Callum Gribbin","Sam Hughes"," James Cook","Daniel Jarvis",
                          "Zachary Dearnley","Ro-Shaun Williams", "Jack Fitzwater","Jack Hamilton",
                          "Lewis Banks","Greg Bolger","Chris Shields", "Conor Wilkinson","Barry McNamee",
                          "Keith Ward","Simon Madden","Dylan Connolly", "Brian Gartland","Dinny Corcoran")

#players whose birthplace was manually found
missing_players_data <- readRDS("missing_player_birthplaces.rds")

#players whose wikipedia page is manually linked
manual_links <- readRDS("manual_links.rds")
The function below iterates through every player with a nationality from the British Isles and searches for a matching Wikipedia page.
It then looks for the birthplace of that player on their Wikipedia page and returns a df containing the player and their birthplace.
As a check, it also compares the birthdate listed in FIFA18 with the one on the Wikipedia page and throws a warning if they don’t match. I haven’t looked into the mismatches in detail, but ~50 players overall don’t match perfectly.
uk_players_info <- all_players_data %>%
  #only want data to help identify players by wiki page
  select(id, name, nationality, birthdate) %>%
  #only include UK nations (+Ireland)
  filter(nationality %in% c("England", "Scotland", "Wales", "Northern Ireland", "Republic of Ireland")) %>%
  #remove duplicated names
  #might lose some players here but they're all so far down the pecking order effect should be minimal
  filter(!duplicated(name)) %>%
  #remove players who have no wikipedia birthplace
  filter(!name %in% players_missing_data)

#function to find the wikipedia page of each player
#returns a df with the player name and birthplace scraped from wikipedia
get_info <- function(row) {
  #get player info
  name <- uk_players_info$name[row]
  birthday <- uk_players_info$birthdate[row]
  id <- uk_players_info$id[row]

  #search wikipedia using the player name
  search <- pediasearch(name, extract = TRUE, limit = 10)

  #if a troublesome search use manual link
  if(name %in% manual_links$name) {
    wiki_suffix <- manual_links$link[which(manual_links$name == as.character(name))]
  } else {
    #else find the wikipedia page suffix for the player
    if(length(search) == 1 && search[1] == "") {
      #empty search result: fall back to the player name itself
      wiki_suffix <- name %>%
        gsub(" ", "_", .)
    } else {
      #take the first search result that mentions football
      footballer <- grep("football", search)[1]
      wiki_suffix <- names(search)[footballer] %>%
        gsub(" ", "_", .)
    }
  }

  #read the info card from the players wikipedia page
  info_card <- read_html(paste0("https://en.wikipedia.org/wiki/", wiki_suffix)) %>%
    html_nodes(".vcard") %>%
    .[1] %>%
    html_table(fill = TRUE) %>%
    data.frame()

  names(info_card) <- paste0("X", 1:ncol(info_card))
  info_card$X1 <- tolower(info_card$X1)

  #check if the wikipedia birthdate matches the FIFA one
  birthdate <- info_card %>%
    filter(X1 == "date of birth")
  birthdate <- birthdate$X2 %>%
    as.character() %>%
    #keep the text before the first space, then strip the brackets around it
    gsub(" .*", "", .) %>%
    gsub("\\(|\\)", "", .) %>%
    as.Date()

  if(birthdate != birthday){
    warning(paste(row, "birthdays do not match"))
  }

  #find the players birthplace
  birthplace <- info_card %>%
    filter(X1 == "place of birth")
  #drop any trailing citation markers such as [1]
  birthplace <- birthplace$X2 %>%
    gsub("\\[.*", "", .)

  #return info as a df
  df <- data.frame(id = id, name = name, birthdate = birthdate, birthplace = birthplace)
  return(df)
}

#run the function over the first 1329 players
#after this very few players are found
british_player_birthplaces <- rbindlist(lapply(1:1329, get_info)) %>%
  #bind in the manually found data
  rbind(., missing_players_data)
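The birthdate check is the fiddliest part of the function, so it can help to try it in isolation. The sketch below is only an illustration: the raw string is a made-up example of how an info card date might look, and clean_birthdate() is a hypothetical helper name rather than anything from the script. It also wraps the comparison in isTRUE(), which is my own addition so that a missing date produces a warning instead of an error inside if().

```r
# Made-up example of a raw "date of birth" cell; the real format may differ by page
raw_birthdate <- "(1994-12-08) 8 December 1994 (age 23)"

# Hypothetical helper repeating the cleaning steps used on the info card
clean_birthdate <- function(x) {
  x <- as.character(x)
  x <- gsub(" .*", "", x)      # keep only the text before the first space
  x <- gsub("\\(|\\)", "", x)  # strip the brackets around the ISO date
  as.Date(x)                   # yyyy-mm-dd parses without a format string
}

wiki_date <- clean_birthdate(raw_birthdate)
fifa_date <- as.Date("1994-12-08")

# isTRUE() turns an NA comparison into FALSE, so if() never sees an NA
if (!isTRUE(wiki_date == fifa_date)) {
  warning("birthdays do not match")
}
```

If a page has no usable date, the comparison is NA, and without the isTRUE() guard the if() statement would stop with an error rather than raise the warning.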
Now that we have the birthplaces for each player, we need to convert these into coordinates via geocoding. For this I use googleway, but the geocode() function from ggmap could also be used.
The function takes a place and a key (for the API, which isn’t included in the knitted markdown) and finds the lat/lon for that place. To save on API requests, I only run it on unique birthplaces and then merge the results back into the dataset.
Once we have the lat/lon of each birthplace, we can convert the df of players into an sf (spatial) object. If we do this, we see that a lot of players who are eligible for British nations weren’t actually born on the islands (e.g. Raheem Sterling was born in Jamaica), so I only select those who were born within the grouped spatial object of all 5 countries.
#geocodes locations using googlemaps
#requires a google maps API key (hidden here)
googleway_geocode <- function(place, key){
  data <- google_geocode(place, key = key)
  latlon <- data$results$geometry$location[1,] %>%
    mutate(birthplace = place)
  #returns coordinates in the form latitude/longitude
  return(latlon)
}

birthplace_coords <- rbindlist(lapply(as.character(unique(british_player_birthplaces$birthplace)),
                                      googleway_geocode, key = key))

#also melt into one spatial row for subsetting later
uk <- uk_counties %>%
  group_by("UK") %>%
  summarise()

british_player_birthplaces <- british_player_birthplaces %>%
  merge(., birthplace_coords, by = "birthplace") %>%
  #convert to an sf object
  st_as_sf(coords = c("lng", "lat"), crs = st_crs(uk_counties)) %>%
  #keep only those born within the UK proper
  .[unlist(st_contains(uk, .)),]
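The "geocode each unique place once, then merge back" pattern can be tried without an API key. This is a minimal sketch under stated assumptions: fake_geocode() is a hypothetical stand-in that looks coordinates up in a small table instead of calling Google, the player names are made up, and the coordinates are only approximate.

```r
# Toy data: three players but only two distinct birthplaces
players <- data.frame(
  name = c("Player A", "Player B", "Player C"),
  birthplace = c("Leeds, England", "Cardiff, Wales", "Leeds, England"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real geocoder (no API call is made)
fake_geocode <- function(place) {
  lookup <- data.frame(
    birthplace = c("Leeds, England", "Cardiff, Wales"),
    lat = c(53.80, 51.48),
    lng = c(-1.55, -3.18),
    stringsAsFactors = FALSE
  )
  lookup[lookup$birthplace == place, ]
}

# Geocode each distinct birthplace once...
unique_places <- unique(players$birthplace)
coords <- do.call(rbind, lapply(unique_places, fake_geocode))

# ...then merge the coordinates back onto every player
players_geo <- merge(players, coords, by = "birthplace")
```

Here two lookups cover three players. On the real data the saving depends on how many players share a birthplace.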
If we plot the players, we see they tend to be grouped around London and the large cities of Lancashire and Yorkshire, with relatively few in Northern Ireland, rural Wales and the Highlands.
p <- ggplot(data = uk_counties) +
  geom_sf() +
  geom_sf(data = british_player_birthplaces, colour = "darkred", alpha = 0.3) +
  ggtitle("Players Born in Historic UK Counties") +
  theme_void()

plot(p)
To find which county each player comes from, we can take the lat/lon of their birthplace and find which county shapefile contains it. The name of that county is then returned as a new column on the df of all British players.
#find the historic county each player was born within
british_player_birthplaces$county <- unlist(lapply(seq(nrow(british_player_birthplaces)), function(player) {
  #which county is their birthplace coordinates in
  container <- st_contains(uk_counties, british_player_birthplaces[player,])
  if(length(unlist(container)) == 1) {
    #which county name is this
    county <- as.character(uk_counties$county[as.numeric(t(container))])
  } else {
    #no match (or more than one match)
    county <- NA
  }
  return(county)
}))
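The same point-in-polygon binning can also be written as a single spatial join. This is an alternative sketch rather than the method used for the article: the two square "counties", their names and the three points are invented toy data in an arbitrary planar coordinate system, and square() is a hypothetical helper.

```r
library(sf)

# Hypothetical helper: a closed square polygon with its lower-left corner at (xmin, ymin)
square <- function(xmin, ymin, size = 1) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin),
    c(xmin + size, ymin + size), c(xmin, ymin + size),
    c(xmin, ymin)
  )))
}

# Two made-up counties sitting side by side
toy_counties <- st_sf(
  county = c("Eastshire", "Westshire"),
  geometry = st_sfc(square(1, 0), square(0, 0))
)

# Three made-up players; player C lies outside both counties
toy_players <- st_as_sf(
  data.frame(name = c("A", "B", "C"), x = c(0.5, 1.5, 5), y = c(0.5, 0.5, 5)),
  coords = c("x", "y")
)

# Left join: every player is kept, unmatched players get NA for county
joined <- st_join(toy_players, toy_counties, join = st_within)
joined$county
```

One difference to be aware of: if polygons overlap, st_join() returns one row per match, whereas the loop above assigns NA whenever a point does not fall in exactly one county.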
If we table the results of the county binning, we can see that many counties contain very few players, whereas some contain many more (e.g. Lancashire has 164 available players, whereas Cambridgeshire has only 5). Later, we will only look at counties that can field at least 10 outfield players + 1 goalkeeper.
#the number of players from each historic county
table(british_player_birthplaces$county)

##
##                   Aberdeen                   Anglesey
##                         12                          1
##                      Angus                   Ayrshire
##                          3                         12
##               Bedfordshire                  Berkshire
##                         10                         15
##               Berwickshire            Buckinghamshire
##                          1                         15
##                  Caithness             Cambridgeshire
##                          1                          5
##              Cardiganshire            Carmarthenshire
##                          1                          2
##             Carnarvonshire                   Cheshire
##                          2                         50
##                   Cornwall              County Antrim
##                          5                         13
##              County Armagh County Derry / Londonderry
##                          2                          7
##                County Down           County Fermanagh
##                          3                          2
##              County Tyrone                 Cumberland
##                          3                          8
##               Denbighshire                 Derbyshire
##                          4                         13
##                      Devon                     Dorset
##                         17                          3
##              Dumfriesshire             Dunbartonshire
##                          2                          5
##                     Dundee                     Durham
##                          6                         26
##                  Edinburgh                      Essex
##                         23                         71
##                       Fife                 Flintshire
##                          5                          4
##                  Glamorgan                    Glasgow
##                         12                         35
##            Gloucestershire                  Hampshire
##                         13                         28
##              Herefordshire              Hertfordshire
##                          5                         33
##            Huntingdonshire            Inverness-shire
##                          3                          3
##                       Kent                Lanarkshire
##                         50                         18
##                 Lancashire             Leicestershire
##                        164                         12
##               Lincolnshire                  Middlesex
##                          8                         77
##                 Midlothian              Monmouthshire
##                          5                          5
##                      Nairn                    Norfolk
##                          1                          6
##           Northamptonshire             Northumberland
##                         12                         14
##            Nottinghamshire                Oxfordshire
##                         20                          6
##                 Perthshire               Renfrewshire
##                          3                          3
##               Selkirkshire                 Shropshire
##                          1                         11
##                   Somerset              Staffordshire
##                         12                         46
##              Stirlingshire                    Suffolk
##                          4                         10
##                     Surrey                     Sussex
##                         63                         16
##               Warwickshire               West Lothian
##                         44                          1
##               Wigtownshire                  Wiltshire
##                          1                          6
##             Worcestershire                  Yorkshire
##                          6                        103
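The "10 outfield players + 1 goalkeeper" rule can be sketched on toy data. Everything here is an assumption for illustration: the player counts are invented (they are not the counts from the table above), and the position column with its "GK" label is a hypothetical layout rather than the structure of the real dataset.

```r
# Invented squad list: county of birth plus a hypothetical position label
toy <- data.frame(
  county = c(rep("Lancashire", 12), rep("Cambridgeshire", 5), rep("Kent", 11)),
  position = c(rep("Outfield", 11), "GK",   # Lancashire: 11 outfield + 1 keeper
               rep("Outfield", 4), "GK",    # Cambridgeshire: 4 outfield + 1 keeper
               rep("Outfield", 11)),        # Kent: 11 outfield, no keeper
  stringsAsFactors = FALSE
)

# Count outfield players and goalkeepers separately for each county
outfield <- tapply(toy$position != "GK", toy$county, sum)
keepers  <- tapply(toy$position == "GK", toy$county, sum)

# A county is viable only if it passes both thresholds
viable <- names(outfield)[outfield >= 10 & keepers >= 1]
viable
```

Counting the two groups separately matters: the toy Kent squad has 11 players in total but no goalkeeper, so a simple "at least 11 players" filter would wrongly keep it.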
Obviously not all of these counties can field a complete team of 11 players, but for those that can, in the next post we’ll start picking teams and seeing how counties and nations stack up against each other.