Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Every so often a question on The Guardian’s The Knowledge football trivia section piques my interest and is amenable to analysis using R. Previously, I looked at club name suffixes and young World Cup winners last August. This week (give or take), a question posed on twitter caught my attention:
@TheKnowledge_GU was just chatting to some colleagues in the kitchen at work about why Essex doesn’t have many big football clubs and it got me thinking. If you combined all the points from every league team in the ceremonial counties in England, which county would be on top?
— BoxBoron (@Rutland_Walker) January 8, 2019
To start with as always load the libraries needed to analyse this
#get data library(engsoccerdata) library(rvest) #munging library(tidyverse) library(magrittr) #spatial analysis library(sf) library(rgdal) #for plotting maps library(ggthemes)
The easiest way to get a total of points is using the engsoccerdata:: packages database of every English football match from the top four divisions (this does not include data from the 2017-2018, or 2018-2019 seasons). We can work out the points easily from the goals scored for each team
#load the data match_data <- engsoccerdata::england %>% #select only the necessary columns and melt select(season = Season, home, visitor, hgoal, vgoal, tier) %>% reshape2::melt(id.vars = c("season", "hgoal", "vgoal", "tier"), variable.name = "location", value.name = "team") %>% #will need to match this to location data so some club names need cleaning mutate(team_subbed = case_when( team == "Yeovil" ~ "Yeovil Town", team == "AFC Bournemouth" ~ "A.F.C. Bournemouth", team == "Halifax Town" ~ "F.C. Halifax Town", team == "Aldershot" ~ "Aldershot Town F.C", team == "Wimbledon" ~ "A.F.C. Wimbledon", team == "AFC Wimbledon" ~ "A.F.C. Wimbledon", team == "Macclesfield" ~ "Macclesfield Town", team == "Rushden & Diamonds" ~ "A.F.C. Rushden & Diamonds", team == "Milton Keynes Dons" ~ "Milton Keynes", team == "Dagenham and Redbridge" ~ "Dagenham & Redbridge", team == "Stevenage Borough" ~ "Stevenage" )) %>% #if cleaning isnt required, take original mutate(team_subbed = ifelse(is.na(team_subbed), team, team_subbed)) #peek at the data head(match_data) ## season hgoal vgoal tier location team team_subbed ## 1 1888 1 1 1 home Accrington F.C. Accrington F.C. ## 2 1888 0 2 1 home Accrington F.C. Accrington F.C. ## 3 1888 2 3 1 home Accrington F.C. Accrington F.C. ## 4 1888 5 1 1 home Accrington F.C. Accrington F.C. ## 5 1888 6 2 1 home Accrington F.C. Accrington F.C. ## 6 1888 3 1 1 home Accrington F.C. Accrington F.C.
The 388k (194k matches) data.frame seems daunting, but actually only results in many fewer unique teams that have played at least one match in the top 4 divisions in England
length(unique(match_data$team_subbed)) ## [1] 141
The location of each club can then be found using the wikipedia pages for them/their stadia. This matches 121 of the 141 clubs pretty nicely which is a fairly good percentage all things considered
#find the links to each clubs wikipedia page wiki <- read_html("https://en.wikipedia.org/wiki/List_of_football_clubs_in_England") %>% html_nodes("td:nth-child(1)") %>% .[which(grepl("href", .))] #get the names for each club wiki_clubs <- wiki %>% html_text() %>% gsub(" \\(.*\\)$", "", .) #can match 121/141 right off the bat (unique(match_data$team_subbed) %in% wiki_clubs) %>% which() %>% length() ## [1] 121
We can find the location of these matching clubs by finding the page for their stadia and then finding the coordinates. It’s a bit of a messy function because I was just jamming stuff together to get data out as best as possible. This takes ~1 minute to run through all 121 teams (for the blog post I actually saved an RDS of the output from this and load it just to save time/server calls)
matching_club_locations <- wiki %>% #take only the matching clubs .[which(wiki_clubs %in% unique(match_data$team_subbed))] %>% html_nodes("a") %>% #get the wiki page link html_attr("href") %>% paste0("https://en.wikipedia.org", .) %>% #for each club page find the stadium and its coordinates lapply(., function(team) { link <- read_html(team) %>% html_nodes(".label a") %>% .[1] %>% html_attr("href") %>% paste0("https://en.wikipedia.org",. ) coords <- link %>% read_html() %>% html_nodes("#coordinates a") %>% html_attr("href") %>% .[2] #if coords not found use NA if(is.na(coords)) { coord_df <- data.frame(lat = NA, lon = NA) } else { coords <- coords %>% paste0("https:", .) %>% read_html() %>% html_nodes(".geo") %>% html_text() %>% strsplit(., split = ", ") coord_df <- data.frame(lat = as.numeric(coords[[1]][1]), lon = as.numeric(coords[[1]][2])) } return(coord_df) }) %>% #bind everything together do.call(rbind, .) %>% #add the club name as a new column mutate(team = wiki_clubs[ which(wiki_clubs %in% unique(match_data$team_subbed)) ]) %>% #filter out missing data filter(!is.na(lat) & !is.na(lon))
Which gives us the location of 114 of our 141 clubs. Most of the remaining ones are now-defunct clubs (e.g. Middlesbrough Ironopolis, Leeds City etc.)
missing_teams <- unique(match_data$team_subbed)[which(!unique(match_data$team_subbed) %in% matching_club_locations$team)] missing_teams ## [1] "Accrington F.C." "Darwen" ## [3] "Burton Swifts" "Port Vale" ## [5] "Middlesbrough Ironopolis" "Rotherham Town" ## [7] "Burton Wanderers" "Loughborough" ## [9] "Blackpool" "New Brighton Tower" ## [11] "Burton United" "Leeds City" ## [13] "Rotherham County" "Bristol Rovers" ## [15] "Darlington" "Wigan Borough" ## [17] "Aberdare Athletic" "New Brighton" ## [19] "Thames" "Aldershot Town F.C" ## [21] "Hereford United" "Scarborough" ## [23] "Cheltenham" "A.F.C. Rushden & Diamonds" ## [25] "Accrington" "Crawley Town" ## [27] "Fleetwood Town"
Given it was a Saturday morning where I had nothing better to do, I simply located these clubs home grounds manually and created a data.frame for their locations. It’s not really great practice but whatever.
These are then all bound together and converted to an sf spatial object with the correct projection
#add in the missing locations missing_locations <- data.frame( lat = c(53.7646, 53.711772, 52.799, 53.049722, 54.5641, 53.42644, 52.8146, 52.7743, 53.804722, 53.4359, 52.799, 53.7778, 53.428367, 51.48622, 54.508425, 53.554914, 51.7127, 53.4292, 51.514431, 51.248386, 52.060719, 54.265478, 51.906158, 52.328033, 53.7646, 51.405083, 53.9165), lon = c(-2.358, -2.477292, -1.6354, -2.1925, -1.2456, -1.34377, -1.6335, -1.1992, -3.048056, -3.0377, -1.6354, -1.5722, -1.370231, -2.583134, -1.534394, -2.650661, -3.4374, -3.0407, 0.034739, -0.754789, -2.717711, -0.418247, -2.060211, -0.5999, -2.358, -0.281944, -3.0247), team = as.character(missing_teams) ) #bind together and convert to sf all_locations <- rbind(matching_club_locations, missing_locations) %>% st_as_sf(coords = c("lon", "lat"), crs = st_crs("+init=epsg:4326")) %T>% #make a quick plot of locations for sanity check plot()
Now that we have all the teams, we need the English historical county boundaries to group them by. I’d actually already used these for football analysis, looknig at if an independent Yorkshire could win the World Cup.
Load the data (the boundary file can be download from the Ordnance Survey) and make a quick plot of the boundaries and teams
(I also created an sf object engwal which is just the counties from England and Wales selected out for background plotting)
## OGR data source with driver: ESRI Shapefile ## Source: "C:\Users\Alaa\Desktop\geo_data\boundary\Data\Supplementary_Ceremonial", layer: "Boundary-line-ceremonial-counties_region" ## with 91 features ## It has 2 fields #load the boundary file counties <- readOGR(dsn = "path/to/file", layer = "county_boundaries") %>% #convert to sf and project as northing/easting st_as_sf(., crs = st_crs("+init=epsg:27700")) %>% #only interested in the county name select(county = NAME) %>% #transform the projection to match that of the club locations st_transform(., crs = st_crs("+init=epsg:4326")) engwal <- counties %>% .[c(1:54, 88, 90),] #make a quick plot of counties and teams ggplot() + geom_sf(data = counties, fill = NA) + geom_sf(data = all_locations) + ggtitle("Location of Teams to have Played Top\n 4 English Football Divisions") + theme_minimal() + theme(axis.title=element_blank(), axis.text=element_blank(), axis.ticks=element_blank())
(by the way the artifacts around Bristol and the Wirral are from the OS dataset- it’s very annoying)
Then we need to determine which teams are within which counties. The easiest way to do this is to use a spatial join of the team names in all_locations by which county they fall into (using st_contains from the sf package)
#bind the team names to each county counties %<>% st_join(., all_locations, join = st_contains) %>% #remove counties that contain zero teams filter(!is.na(team)) %>% mutate(county = as.character(county)) ## although coordinates are longitude/latitude, st_contains assumes that they are planar #quick plot of number of teams per county (missing = 0) counties %>% group_by(county) %>% summarise(n_clubs = n()) %>% ggplot(data = .) + geom_sf(data = engwal) + geom_sf(aes(fill = n_clubs), colour = "black") + scale_fill_viridis_c(option = "plasma", name = "# clubs") + ggtitle("Number of Top 4 Division Playing\n Teams in each Ceremonial County") + theme_minimal() + theme(axis.title=element_blank(), axis.text=element_blank(), axis.ticks=element_blank())
Which shows that most English historic counties (and a few Welsh ones due to teams like Cardiff City/ Swansea City etc.) have at least 1 team that has competed in the top 4 flights of English football at some point (those that do not are: Isle of Wight, Rutland, Surrey, Warwickshire, West Sussex and Cornwall).
To finally get the total number of points won by these teams, the county data needs to be joined back onto the match data from the top. First I clean it up a bit then make the left_join by team name. Finally the number of points per match is calculated using case_when and points are grouped by county and summed
county_match_data <- match_data %>% mutate(team = team_subbed) %>% select(-team_subbed) %>% left_join(., counties, by = "team") %>% mutate(points = case_when( location == "home" & hgoal > vgoal ~ 3, location == "visitor" & vgoal > hgoal ~ 3, location == "home" & hgoal < vgoal ~ 0, location == "visitor" & vgoal < hgoal ~ 0, hgoal == vgoal ~ 1 )) county_points <- county_match_data %>% group_by(county) %>% summarise(total_points = sum(points))
Perhaps unsurprisingly, the county with the most points is Greater London, with Greater Manchester following and other footballing hotspots/ large counties in the West Midlands, Lancashire and around Yorkshire in the trailing group
head(arrange(county_points, -total_points)) ## # A tibble: 6 x 2 ## county total_points ## <chr> <dbl> ## 1 Greater London 67189 ## 2 Greater Manchester 47203 ## 3 West Midlands 37413 ## 4 Lancashire 30808 ## 5 South Yorkshire 30061 ## 6 West Yorkshire 24947
By contrast, Worcestshire and Northumberland barely have any points, with a few Welsh counties also struggling
head(arrange(county_points, total_points)) ## # A tibble: 6 x 2 ## county total_points ## <chr> <dbl> ## 1 Worcestershire 275 ## 2 Northumberland 398 ## 3 Mid Glamorgan 744 ## 4 Somerset 813 ## 5 Gloucestershire 994 ## 6 Herefordshire 1739
If we group by tier as well as county, it’s possible to see how well each county has done at specific tiers.
county_match_data %>% group_by(county, tier) %>% summarise(total_points = sum(points)) %>% left_join(., select(counties, county), by = "county") %>% ggplot(data = .) + geom_sf(data = engwal) + geom_sf(aes(fill = total_points), colour = "black") + scale_fill_viridis_c(option = "plasma", name = "total points") + ggtitle("Number of Points Won by each County\n per Tier of English Football") + facet_wrap(~tier) + theme_minimal() + theme(axis.title=element_blank(), axis.text=element_blank(), axis.ticks=element_blank())
And for the Premier League era this clears up to
county_match_data %>% filter(season > 1991) %>% group_by(county, tier) %>% summarise(total_points = sum(points)) %>% left_join(., select(counties, county), by = "county") %>% ggplot(data = .) + geom_sf(data = engwal) + geom_sf(aes(fill = total_points), colour = "black", name = "total points") + scale_fill_viridis_c(option = "plasma") + ggtitle("Number of Points Won by each County\n per Tier of English Football", subtitle = "From Begining of 1992/1993 Season") + facet_wrap(~tier) + theme_minimal() + theme(axis.title=element_blank(), axis.text=element_blank(), axis.ticks=element_blank()) ## Warning: Ignoring unknown parameters: name
Which shows just how dominant London has been in the top division of English football (especially as it is only competitive at lower levels).
I had wanted to weight points by the average ELO of that league and see which county has the most weight-adjusted points but got bored for this small blog post.
Best,
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.