Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Brief summary
- Making a map of New York State by county
- Labeling areas of the map based on their centroid
- Answering this question: “Is there a section in New York State from which divers routinely underdive their qualifying scores at the state championships?”
Full post
I was recently officiating at a New York State High School Championship meet and got into a conversation with some other officials about diving. In New York swimming and diving are contested together with diving being one event of twelve at sectional and state meets. Similarly officals officate both swimming and diving, although when possible some preference is given to individual officals’ areas of expertise. I for example was (and am) a swimmer and while I can and do officiate diving I don’t at the state level. Instead officals with expertise in diving handle diving, and those with expertise in swimming handle swimming – all quite sensible.
I’ve mentioned sections a couple times now. New York State high school athletics are divided geographically into eleven sections, plus the New York City metro area. Sections are named with roman numerals, I through XI. Let’s make a map of where those sections are.
We’ll use the usmap
package to get county-by-county borders for New York State, and the rgeos
and sf
packages to plot the map. We’ll also use the tidyverse
collection, mostly for dplyr::mutate
, purrr::map
, the pipe (%>%
). To actually plot the map we’ll use ggplot2
plus viridis
and ggrepel
because I want it to look nice and be colorblind friendly. Later we’ll use flextable
to print out a nice table of results.
First we’ll load libraries, then use usmap::plot_usmap
to get x and y coordinates for New York State and the county borders. We’ll then extract those coordinates, called data
in the ny_map
object, to a dataframe. We extract data
because the usmap::plot_usmap
object also includes a bunch of other stuff we don’t care about right now.
library(tidyverse) library(usmap) library(rgeos) library(sf) library(viridis) library(ggrepel) library(flextable) ny_map <- usmap::plot_usmap("counties", include = c("NY")) df <- ny_map$data
Now we’ll label the counties based on which section they’re in using dplyr::case_when
.
df <- df %>% mutate( county = str_replace(county, " County", ""), Section = case_when( county %in% c( "Dutchess", "Putnam", "Rockland", "Westchester" ) ~ "I", county %in% c( "Albany", "Columbia", "Greene", "Montgomery", "Rensselaer", "Saratoga", "Schenectady", "Schoharie", "Warren", "Washington", "Fulton" ) ~ "II", county %in% c( "Cayuga", "Onondaga", "Oneida", "Madison", "Herkimer", "Jefferson", "Lewis", "Oswego" ) ~ "III", county %in% c( "Tompkins", "Broome", "Chemung", "Tioga", "Delaware", "Otsego", "Schuyler", "Cortland", "Chenango" ) ~ "IV", county %in% c( "Orleans", "Genesee", "Wyoming", "Allegany", "Steuben", "Livingston", "Monroe", "Wayne", "Ontario", "Yates", "Seneca" ) ~ "V", county %in% c( "Niagara", "Erie", "Chautauqua", "Cattaraugus" ) ~ "VI", county %in% c("Clinton", "Essex", "Hamilton") ~ "VII", county %in% c("Nassau") ~ "VIII", county %in% c("Orange", "Sullivan", "Ulster") ~ "IX", county %in% c("St. Lawrence", "Franklin") ~ "X", county %in% c("Suffolk") ~ "XI", TRUE ~ "Metro" ) )
Of course it’s not enough to just have the sections differentiated on a map. We need centered labels featuring section names. The center of an irregular shape is called a centroid, and can be calculated using rgeos::gCentroid
– that’s where we’ll put the label. First we’ll need to extract the x
and y
values from our df
and convert them into spacial coordinates using sp::coordinate
. After that we’ll clean up the resulting dataframe df_centroids
df_centroids <- df %>% split(as.factor(df$Section)) coords <- function(x){ coordinates(x) <- c("x", "y") return(x) } df_centroids <- map(df_centroids, coords) df_centroids <- map(df_centroids, gCentroid) df_centroids <- map(df_centroids, coordinates) df_centroids <- unlist(df_centroids, recursive = FALSE) df_centroids <- stack(df_centroids) df_centroids <- df_centroids %>% mutate_if(is.factor, as.character) %>% mutate(coord = case_when(str_detect(ind, "1") ~ "x", TRUE ~ "y"), ind = str_replace(ind, "\\d", "")) %>% pivot_wider(names_from = coord, values_from = values)
To take a look at df_centroids
we can see x
and y
values labeled as such, and ind
, which has the corresponding section name.
str(df_centroids) ## tibble [12 × 3] (S3: tbl_df/tbl/data.frame) ## $ ind: chr [1:12] "I" "II" "III" "IV" ... ## $ x : num [1:12] 2143439 2077969 1914280 1945690 2078878 ... ## $ y : num [1:12] -61830 101916 111371 -4687 -43960 ...
Between df
, which has our county-level state map, and df_centroids
which has the center of each section and the section name we now have enough to make a map.
ggplot() + geom_polygon( data = df, mapping = aes(x = x, y = y, group = fips, fill = Section), color = "white") + ggrepel::geom_label_repel(data = df_centroids, aes(label = ind, x = x, y = y), size = 3.5) + theme_void() + theme(legend.position = "none") + viridis::scale_fill_viridis(discrete = TRUE) + labs(title = "New York State Sections")
Lovely! Okay, back to the story. Several officials where talking about diving, and some were saying that divers from a particular section were underperforming relative to their qualifying score and that this had been going on for some time.
By way of explanation, athletes qualify for the state championships by meeting a certain standard, called a “cut”. In swimming this cut is a time – swim a particular race faster than the cut time, and you qualify. In diving the cut is a score – score higher than the cut score, and you qualify. The difference though is that times are objective, timing systems are automatic, so there’s no human element. Diving scores on the other hand are dependent on officials who judge diving, including me (below the state level). Officials take scoring very seriously and do their best, but there’s real variance between officials depending on how they’re positioned relative to the diving board, glare off the water or from windows, and personal factors as well. Once at the state championships athletes are seeded with respect their qualifying time or score, with better performing athletes given preference with respect to lane assignment (swimming) or competition order (diving).
What was being postulated in this conversation is that diving scores from a particular section were higher than they should have been, meaning that athletes who perhaps should not have qualified for the state championships at all where present and competing, and that some athletes were gaining a benefit with respect to their competition order on the basis of inflated scores. Serious stuff!
So the question is: “Is there a section in New York State from which divers routinely underdive their qualifying scores at the state championships?”
To look back at years of New York State diving results we’ll need to source those results, and then wrestle them into R. This is a job for my SwimmeR
package. It’s also a long job, taking upwards of 30 minutes depending on your computer and internet connection, because swimming and diving results are entangled. Rather than going through all of that I’m going to start with a subset, containing only diving results going back to 1998, and available here.
urlfile <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/all_sections_clean.csv" all_sections_clean <- read_csv(url(urlfile))
The ID
variable indicates unique athletes and is an attempt to address people sometimes using nicknames (Greg vs. Gregory for example).
Fist we’ll collect all the divers who competed in a given year’s state championship meet.
all_state_divers <- all_sections_clean %>% group_by(Year, ID) %>% filter(any(str_detect(Meet, "States")))
Next we’ll create new variables Season_Best
for the highest score the athlete achieved in a non-state meet for a given year. This would be the athlete’s qualifying score. We’ll also collect each athlete’s score from that year’s state meet
all_state_divers <- all_state_divers %>% group_by(Year, ID) %>% unique() %>% mutate(Season_Best = max(as.numeric(Finals_Time[str_detect(Meet, "States", negate = TRUE)])), State_Score = max(as.numeric(Finals_Time[str_detect(Meet, "States")]), na.rm = TRUE))
Then we’ll filter
for only divers who placed in the top 20 at the state meet. This is because only the top 20 divers get to do all their dives. Divers are eliminated in rounds, after having completed a subset of dives until only the top 20 remain. Only scores from these top 20 can be reasonably compared to qualifying scores.
We’ll filter
to keep only divers with Season_Best >= 395
, a somewhat arbitrary cutoff, but one that removes divers who qualified during a 6 dive competition, rather than the 11 dives used at state championships. Please note that for several years now divers have not been able to qualify for the state championships with 6 dive scores.
These two new variables will be summarize
d to get each athlete’s Performance
at States relative to their qualifying score, and the Performance_Ratio
, the ratio between their qualifying and state scores.
getmode <- function(x) { unique_x <- unique(x[!is.na(x)]) unique_x[which.max(tabulate(match(x, unique_x)))] } all_state_divers <- all_state_divers %>% filter(str_detect(Meet, "States")) %>% ungroup() %>% filter(Place <= 20) %>% group_by(Year, ID) %>% filter(is.infinite(Season_Best) == FALSE, Season_Best >= 395) %>% summarise(Place = min(Place), Season_Best = max(Season_Best), State_Score = max(State_Score), Section = getmode(Section), Performance_Ratio = State_Score/Season_Best, Performance = State_Score - Season_Best)
Finally we’ll group_by
Section
and see if there’s a section that’s underdiving qualifying times.
all_state_divers_sum <- all_state_divers %>% ungroup() %>% # group_by(Section, Year) %>% group_by(Section) %>% filter(Year >= 2015) %>% summarise(Performance = round(mean(Performance, na.rm = TRUE),2), `Performance Ratio` = round(mean(Performance_Ratio, na.rm = TRUE), 2)) %>% arrange(Performance) all_state_divers_sum %>% flextable::flextable() %>% flextable::align(align = "center", part = "all") %>% flextable::bold(part = "header") %>% flextable::bg(bg = "#D3D3D3", part = "header") %>% flextable::autofit()
Section | Performance | Performance Ratio |
XI | -105.32 | 0.81 |
IV | -87.10 | 0.84 |
II | -84.72 | 0.84 |
IX | -82.58 | 0.84 |
III | -76.65 | 0.85 |
VIII | -65.68 | 0.87 |
V | -60.51 | 0.88 |
VI | -44.75 | 0.91 |
I | -43.44 | 0.91 |
Oh. Well this is embarrassing. There isn’t one section underdiving, there’s nine and Sections VII & X aren’t represented at all. That’s 100% of sections with divers in the top 20 that are underdiving by 10-20%.
Just for fun, let’s stick that on our map too.
df_centroids_perf <- df_centroids %>% left_join(all_state_divers_sum, by = c("ind" = "Section")) %>% mutate(ind = paste0(ind, ": ", `Performance Ratio`), naughty = case_when(is.na(Performance) ~ "N", TRUE ~ "Y")) ggplot() + geom_polygon( data = df, mapping = aes( x = x, y = y, group = fips, fill = Section ), color = "white" ) + ggrepel::geom_label_repel(data = df_centroids_perf, aes( label = ind, x = x, y = y, color = naughty), size = 3.5) + theme_void() + scale_color_manual(values = c("green", "red")) + theme(legend.position = "none") + viridis::scale_fill_viridis(discrete = TRUE) + labs(title = "New York State Sections With Underdiving Percentages")
To be fair – Section XI is underdiving by the widest margin, but since all the sections are doing it this is now an issue of degrees, not occurrence. As to why it’s happening – a few non conspiracy theories:
Winning glow. Everyone looks good when they’re winning. Swimmers, divers, those fancy dogs at Crufts, everyone. Only the top ~7% or so of athletes qualify for the state meet, so most who qualify won their previous meets, or at least finished very high. They benefit from the winning glow, and perhaps look better than the really should because they’re outshining their competition.
Dive order, which I mentioned before, is determined preferentially based on qualifying score. The highest qualifying divers get to choose their position in the order, and in practice they choose the last one available. This means that the highest qualifying diver goes last, second highest qualifying goes second-to-last etc. Since divers coming into states are the cream of the crop, as mentioned in point #1, they likely dove last, or near-last when they got their state qualifying score. Divers like to dive last because they reason, and not incorrectly, that if the quality of diving is perceived by the officials to increase, scores will also increase. So for example if an early dive merits a score of 5, a better dive done later must merit at least a 5.5, and if an official scored that first dive as a 6 instead of a 5, a later diver will benefit from that mistake, scoring a 6.5 where perhaps they only deserved a 5.5. Basically later divers stand to benefit from an accumulation of small mistakes made by officials earlier in the dive order. Officials for their part try not to make such mistakes, and not to compound them.
Higher quality officiating at state championship meets. In the same way that athletes at state championships are the best of the best, so are the officials. Diving officials at state meets are diving experts, above and beyond the level of regular diving officials (like me), who only score lower tier diving meets. These elite officials are perhaps more likely to spot issues that others might miss, and to not compound errors like discussed in point #2.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.