Scraping Websites and Building a Large Dataset with SwimmeR

Welcome on Swimming + Data Science

1 year ago

[This article was first published on Welcome on Swimming + Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today we’re going to use SwimmeR to build a large database of results – that’s the overall goal. Having a more specific goal in mind can be useful though, so here goes.

My wife is a college administrator. It’s part of her job to keep up on what Kids These Days (TM) are doing and one thing they’re apparently doing is getting participation trophies. Looking back to my childhood I participated in a lot of swim meets, and swam a lot of races. For my trouble I’ve got piles of ribbons, enough medals to cover my entire head if I were to try and wear them all at once, and even a few plaques – but almost no trophies. My lack of trophies could be due to my modest swimming abilities, but I’ve checked with some of my faster friends and teammates and they report the same thing – very few trophies. I don’t want swimmers to be left out of the zeitgeist, and if that means participation trophies then I’ve got some wrongs to right.

Who should get these participation trophies though? Not everyone of course, just handing them out would cheapen them. I want to award participation trophies to only the most participatory athletes, the crème de la crème of participation. To that end today I’m going to try and determine the most participatory swimmers in my USA Swimming LSC for the 2018-2019 season. These ultra-participatory athletes will win the coveted Swimming + Data Science Participation Trophies.

Rules are as follows:

The winners shall be the athlete(s) who have participated in, and legally completed, the most individual events in the 2018-2019 Niagara LSC age group season and the athlete(s) who have swam the furthest distance in their events during the same period
Any individual event swam at a meet in Niagara LSC during the 2018-2019 age group season will count. Sectionals will also count if the results are on the Niagara LSC website in .pdf or .html form
Only swimmers from Niagara LSC are eligible
Tiebreaker will be the other category – total distance swam for the events category, and number of events swam for the distance category

To do this I’ll of course need results for all meets swam in Niagara LSC for 2018-2019 – SwimmeR, take your mark!

library(SwimmeR)
library(dplyr)
library(purrr)
library(stringr)
library(rvest)
library(flextable)

flextable_style <- function(x) {
  x %>%
    flextable() %>%
    bold(part = "header") %>% # bold header
    bg(bg = "#D3D3D3", part = "header") %>% # puts gray background behind the header row
    align_nottext_col(align = "center", header = TRUE, footer = TRUE) %>% # center alignment
    autofit()
}

Please note the following analysis was done using the development version of SwimmeR current on November 22nd, 2020, equivalent to version 0.6.0 from CRAN. Please make sure your version of SwimmeR is up-to-date.

Getting Links to Results

During the State-Off Tournament we sometimes built lists of links to pull results from. In those cases the link addresses followed a pattern, which we used to generate them within R. What do we do though if there are links we’re interested in, but their addresses don’t follow a pattern?

Niagara LSC has a repository of results for each season, and the 2018-2019 one is here. It’s basically a list of meets, each with a link called “Results”, and each with a different address. We want the content of all of those links, but there’s a lot and typing all the addresses in manually is of course out of the question. To get them we need the CSS selector that defines them. If you’re a CSS whiz maybe you can just read the website source code a determine it, but I use the Selector Gadget Google Chrome add-on. After a bit of clicking around I’ve determined that the selector in this case is called span a.

Now it’s time to deploy the rvest package. rvest is another hit from the good folks at RStudio and for our purposes today it’s pretty simple. We’ll need to direct it to the results repository and then extract all the links from span a. We’ll use read_html to read the results repository web page, then html_nodes to get the content of our selector, and also html_attr to make sure that content is a link. In this way we’ll build a list of links.

web_url <- "https://www.teamunify.com/team/eznslsc/page/times/2018-2019-results"
selector <- "span a"

page_contents <- read_html(web_url)
links <- html_attr(html_nodes(page_contents, selector), "href")

If we look at our list of links we’ll see that most of them look like .pdf files of results (what we want), but some of them, like the first one, are other junk.

head(links, 5)
## [1] "/cdn-cgi/l/email-protection#9fe8fafdf2feecebfaeddff1f6fef8feedfeece8f6f2b1f0edf8"              
## [2] "/eznslsc/UserFiles/File/Meet-Results/2018-2019/jets08082019_067546.pdf"                        
## [3] "/eznslsc/UserFiles/File/Meet-Results/2018-2019/ttsc08062019_079354.pdf"                        
## [4] "/eznslsc/UserFiles/File/Meet-Results/2018-2019/2019-ez-senrior-lc-champs-results_092276.pdf"   
## [5] "/eznslsc/UserFiles/File/Meet-Results/2018-2019/2019-ez-senrior-lc-champs-tt-results_029460.pdf"

We only want links that end in either “.pdf” or “.htm” because a) those are the results and b) SwimmeR can read those file types.

links <- links[links %>% 
  str_detect("\\.pdf$|.htm")]

Looking at head(links) again we can also see that these aren’t full links. They’re partials, missing their beginnings. Links should begin with something like “https://www.r-bloggers.com/”. In this case the desired beginning, as visible in your browser window, is “http://www.teamunify.com”. We’ve got to be a bit careful though – there might be a few fully formed links in our list. Let’s check.

links[str_detect(links, "http")]
## [1] "http://easternzoneswimming.org/meet_results/2019SpringSectionals_LongCourse_Results_Women.html"     
## [2] "http://easternzoneswimming.org/meet_results/2019SpringSectionals_LongCourse_Results_Men.html"       
## [3] "http://easternzoneswimming.org/meet_results/2019SpringSectionals_LongCourse_Results_TimeTrials.html"
## [4] "http://easternzoneswimming.org/meet_results/2019SpringSectionals_LongCourse_TeamScores.html"        
## [5] "http://easternzoneswimming.org/meet_results/2019SpringSectionals_LongCourse_HighPoint.html"         
## [6] NA

So there’s a few Sectional results that are already full links. Let’s modify the rest to make them complete as well.

links <- ifelse(str_detect(links, "http"), paste0("", links), paste0("http://www.teamunify.com", links))

Reading in Results

Now that we have our list of links we can use read_results to pull their contents into R. Because there might still be some bad links we’ll use the safely function inside of map. map will apply read_results to every element of links, but because read_results is wrapped in safely if it fails rather than posting an error and aborting it will note the issue and keep going, reading in the rest of the elements of links.

We’ll then name each element of the list of results produced by read_results with the link from whence it came.

Please note – running the following code will take several minutes. I also have the results hosted on github. You can download them directly using this code: raw_results <- readRDS(url("https://github.com/gpilgrim2670/Pilgrim_Data/raw/master/Niagara%20LSC%202018-2019%20Dataset/Niagara_raw_results.rds)).

raw_results <- map(links, safely(read_results, otherwise = NA))

names(raw_results) <- links

We now have a list of lists. Each sublist contains results (hopefully) and also an error register as elements 1 and 2 respectively. We want to remove sublists where the error register isn’t NULL – that is, sublists that have an error. We also then want only the first element (the results) of the remaining sublists – we don’t want the empty error register. Here’s a handy little function to do just that, which we can apply to raw_results to clean them.

discard_errors <- function(results) {
  element_extract <- function(lst, n) {
    sapply(lst, `[`, n)
  }
  clean_results <- discard(results, ~ !is.null(.x$error))
  clean_results <- element_extract(clean_results, 1)
  return(clean_results)
}

clean_results <- discard_errors(raw_results)

Parsing Results

With our clean_results in hand we now can pass them to swim_parse, again via map, to apply over every element of clean_results, and again inside safely, to power through (but note) any errors.

The typo and replacement arguments to swim_parse are used for correcting issues in the source materials. Their use requires a bit of practice so that typos can be correctly identified and fixed, but in general the issues relate to one of two things.

Special strings. Strings like “DQ” and “–” (two or three dashes together) have special meaning in swimming results, and therefore inside swim_parse. “DQ” obviously means a swimmer has been disqualified. The two (or three) dashes, “–” are used instead of a place number for athletes who were DQ’d, or scratched etc. Incorrect usage of these strings within results can cause problems, but are easily corrected with typo and replace
White space issues. swim_parse uses instances of two or more spaces to split results into dataframe columns. If those spaces aren’t as they should be, for example if an athlete or team name is long enough that it encroaches on the next column parsing issues will result.

Please note – running the following code will take a long time,as much as 30 minutes denpending on your internet connection and computer. I also have the results hosted on github. You can download them directly using this code: Niagara_Results <- readRDS(url("https://github.com/gpilgrim2670/Pilgrim_Data/raw/master/Niagara%20LSC%202018-2019%20Dataset/Niagara_2018_2019.rds")).

Niagara_Results <-
  map(
    clean_results,
    safely(swim_parse, otherwise = NA),
    typo = c(
      "111", # an actual typo in a kid's name
      "Greater Rochester Area YMCA --NI", # incorrect usage of special string "--"
      "(?<=[:alpha:]) (?=[:digit:])", # regex to identify any space that is preceded by a letter and followed by a number, so that it can be expanded to two+ spaces - a white space issue
      "PENINSULA WAVE RIDERS SWIMMING-A", # team name so long that it got to close to athlete's times - a white space issue
      "DQY", # wrong DQ string
      "XDQ", # wrong DQ string
      "Alexis/Lexi J", # an actual typo in a kid's name
      "La Face, Isabella or Bella F", # an actual typo in a kid's name
      "Cunningham, Rhys*", # an actual typo in a kid's name
      " FUT " # extraneous string used to signify a Futures cut
    ),
    replacement = c(
      "III",
      "Greater Rochester Area YMCA-NI",
      "   ",
      "PENINSULA WAVE RIDERS SWIMMING  ",
      "DQ",
      "DQ",
      "Alexis J",
      "La Face, Isabella",
      "Cunningham, Rhys",
      ""
    )
  )

We’ve got a list of results, but a few of them will contain errors. For example http://easternzoneswimming.org/meet_results/2019SpringSectionals_LongCourse_HighPoint.html doesn’t actually contain results, just a list of high-point scores. That one won’t read. We’ll just use our discard errors function again and then data.table::rbindlist to stick all the results together with a column called Source, containing the element name, which we already made the link it was downloaded from.

Niagara_Results <- discard_errors(Niagara_Results)

Niagara_Results <- data.table::rbindlist(Niagara_Results, use.names = TRUE, idcol = "Source", fill = TRUE)

How big is our data set? dim can tell us.

dim(Niagara_Results)
## [1] 108127     11

We’ve got 108127 rows and 11 columns. There are bigger sets for sure, but this one isn’t small.

Cleaning the Data

A few minor things before we get going. The Source column has appended a string to the end of our links. Let’s get rid of that. Let’s also convert the Age column to numeric, and then use select to order the columns in a sensible way.

Niagara_Results  <- Niagara_Results  %>%
  mutate(Source = str_remove(Source, "\\.result\\.result")) %>% 
  mutate(Age = as.numeric(Age)) %>% 
  select(Source, Place, Name, Age, Event, everything())

Names – A Thorny Problem

It’s tempting to just rush in and count events by athlete name – let’s try it.

Niagara_Results %>%
  filter(is.na(Name) == FALSE,
         DQ == 0) %>% # DQs don't count
  group_by(Name) %>%
  summarise(No_Swims = n(), # number of events swam by each athlete
            Team = get_mode(Team), # use only most common team name
            Age = round(mean(Age, na.rm = TRUE), 0)) %>%
  arrange(desc(No_Swims)) %>%
  head(5) %>% 
  flextable_style()

Name	No_Swims	Team	Age
Chung, Emily	99	Clarence Swim Club-NI	13
Anderson, Max	96	VELO-NI	10
Cavallerano, Jack S	96	Liverpool Jets Swim Club-NI	11
Anderson, Madden	91	VELO-NI	11
Barillari, Emma	91	Velocity Aquatics, Llc-NI	12

It worked I guess, we get some names, with Emily Chung on top. Let’s look a bit more closely at Miss Chung’s results though.

Niagara_Results %>%
  filter(is.na(Name) == FALSE,
         DQ == 0) %>% # DQs don't count
  filter(str_detect(Name, "Chung")) %>%
  filter(str_detect(Name, "Emily")) %>%
  group_by(Name) %>%
  summarise(No_Swim = n(), # number of events swam
            Team = get_mode(Team)) %>% # use only most common team name
  flextable_style()

Name	No_Swim	Team
Chung, Emily	99	Clarence Swim Club-NI
Emily Chung	4	Clarence Swim Club-NI

Turns out she’s here twice, once as “Chung, Emily” and again as “Emily Chung”. Our first analysis didn’t catch her as “Emily Chung” though. We’ll need to count both together, and other athletes may be similarly undercounted.

There’s another issue highlighted by Sidra El Ghissassi. Let’s take a look at her doings.

Niagara_Results %>%
  filter(is.na(Name) == FALSE,
         DQ == 0) %>% # DQs don't count
  filter(str_detect(Name, "El Ghissassi")) %>%
  filter(str_detect(Name, "Sidra")) %>%
  group_by(Name) %>%
  summarise(No_Swim = n(),
            # number of events swam
            Team = get_mode(Team),
            # use only most common team name
            Age = round(mean(Age, na.rm = TRUE), 0)) %>%
  flextable_style()

Name	No_Swim	Team	Age
El Ghissassi, Sidra	15	UNATTACHED-NI	9
El Ghissassi, Sidra S	77	Vestal Area Swim Team-NI	9
Sidra El Ghissassi	3	Vestal Area Swim Team-NI	9

She’s here three times, sometimes with a middle initial and sometimes not, sometimes with her last name first, sometimes with her first name first. What’s more she has two words in her last name. We need to keep track of all this stuff.

There’s also the question of nicknames. For example everyone calls me Greg, but my name is actually Gregory. In swimming results sometimes I’m Gregory and sometimes I’m Greg. The goal here is to be able to recognize “Emily Chung” and “Chung, Emily” as the same person, and also “Greg Pilgrim” and “Pilgrim, Gregory A” as the same person, plus “El Ghissassi, Sidra S” and “Sidra El Ghissassi” as the same person. Let’s take a whack at assigning everyone an ID based on the first 3 letters of their first name and the last 5 letters of their last name. To do that we’ll need to get everyone’s name in the same order – let’s do Firstname Lastname.

Niagara_Results <- Niagara_Results %>%
  mutate(Name_2 = str_replace(Name, " [:upper:]$", "")) %>% # removes middle initials
  mutate(Last_Name = case_when( # pulls out last name when it's followed by a comma
    str_detect(Name_2, ",") ~ str_split_fixed(Name_2, ",", n = 2)[, 1],
    TRUE ~ ""
  )) %>%
  na_if("") %>%
  mutate(First_Name = case_when(is.na(Last_Name) == FALSE ~ str_remove(Name_2, paste0(Last_Name, ",", " ")), # gets first name if last name was identified above
                                TRUE ~ "")) %>%
  na_if("") %>%
  mutate(Name_3 = case_when( # recombines first names and last names identified above into Firstname Lastname order
    is.na(First_Name) == FALSE &
      is.na(Last_Name) == FALSE ~ paste(First_Name, Last_Name, sep = " "),
    TRUE ~ Name_2 # names not identified above are already in Firstname Lastname order
  )) %>%
  mutate(ID = paste0( # make an ID value based on first 3 letters of first name, last 5 letters of last name
    stringr::str_extract(Name_3, "^.{3}"),
    str_extract(Name_3, ".{5}$")
  )) %>%
  select("Name_Rework" = Name_3, # cleanup columns
         ID,
         everything(),
         -Name_2,
         -First_Name,
         -Last_Name)

As a check, let’s look at Emily again and see if her ID is doing what we’d like.

Niagara_Results %>%
  filter(str_detect(Name, "Chung")) %>%
  filter(str_detect(Name, "Emily")) %>%
  group_by(Name) %>%
  summarise(No_Swim = n(),
            Team = get_mode(Team), # use only most common team name
            Age = round(mean(Age, na.rm = TRUE), 0),
            ID = unique(ID)) %>% # show ID
  flextable_style()

Name	No_Swim	Team	Age	ID
Chung, Emily	100	Clarence Swim Club-NI	12	EmiChung
Emily Chung	4	Clarence Swim Club-NI	13	EmiChung

And Sidra:

Niagara_Results %>%
  filter(str_detect(Name, "El Ghissassi")) %>%
  filter(str_detect(Name, "Sidra")) %>%
  group_by(Name) %>%
  summarise(No_Swim = n(),
            Team = get_mode(Team), # use only most common team name
            Age = round(mean(Age, na.rm = TRUE), 0),
            ID = unique(ID)) %>% 
  flextable_style()

Name	No_Swim	Team	Age	ID
El Ghissassi, Sidra	15	UNATTACHED-NI	9	Sidsassi
El Ghissassi, Sidra S	86	Vestal Area Swim Club-NI	9	Sidsassi
Sidra El Ghissassi	3	Vestal Area Swim Team-NI	9	Sidsassi

Both girls have their own ID, which is consistent across the multiple variants of their names. We’re going to stop here, but it is worth noting that there are real and sometimes unsolvable problems of this type – often called record linkage.

If there were two athletes named Emily Chung (and there might be) we wouldn’t be able to tell with this approach. We could maybe differentiate them based on team, but people do sometimes change teams, so even if there were Emily Chungs who swam for two different teams they could be the same person. We could also check their ages, but if you look back to Emily’s table above you’ll see that her age changes. She was 12, but then she turned 13. Probably. It’s also possible that there are two Emily Chungs who both swim for Clarence Swim Club, but one is a year older than the other. We could go even further and try to extract dates for each meet, and see if the (potentially) two Emily Chungs ever swim in the same meet while being different ages but if they didn’t we still couldn’t be sure there was only one Emily Chung.

Additionally if there was a third girl named Emily ZChung her ID would also be EmiChung, same as the original Emily Chung(s) – it’s tricky! We could instead make the ID from the last 6 characters of each name, but that doesn’t solve the fundamental issue.

What’s more, if your name is Andre, but you’re known as Dre – this approach doesn’t help you. Same for Rebecca/Becca and William/Bill.

My point isn’t that this is hopeless, but rather that data science can be hard, and it requires some checking of assumptions to make sure your code is actually doing what you think it’s doing. Solving your problem might be much harder than you first realize. There are more tools we could bring to bear on this names issue, and they may be the subject of another article down the line, but for now let’s press forward.

Distance

With our IDs in hand there’s now the matter of other category – distance swam. Distance is included in Event, but there are two wrinkles. The first is getting the number of yards/meters swam out, and the second is converting them all to a consistent unit, in our case yards.

We’ll extract all groups of digits from each Event using str_extract_all. This will get us the distances, but also the age groups, like “12” from “12 & Under” or “13” and “14” from “Boys 13-14…”. Crucially though the distance will always be the last set of numbers, so we can use tail to get it out.

Then we’ll do something similar with the strings “Yard” and “Meter”, multiplying distance by 1.1 whenever str_detect detects “Meter” in Event and keeping distance the same when str_detect detects “Yard”. A meter is 1.1 yards.

Niagara_Results <- Niagara_Results %>% 
  mutate(Event_Numb = str_extract_all(Event, "\\d{1,}")) %>% # get all groups of digits as a list
  rowwise() %>% # since rows contain lists we need this to act across rows rather than down columns
  mutate(Distance = as.numeric(tail(Event_Numb, 1))) %>% # take the last element of each list
  mutate(Distance = case_when(str_detect(Event, "Yard") ~ Distance ,
                              str_detect(Event, "Meter") ~ Distance * 1.1, # convert distances swam in meters to yards
                              TRUE ~ Distance)) %>% 
  select(-Event_Numb)

Niagara_Results <- Niagara_Results %>% 
  filter(is.na(Name) == FALSE,
         DQ == 0) %>% # DQ events don't count
  group_by(ID) %>% 
  summarise(No_Swims = n(), # count up total swims
            Name = get_mode(Name), # use only most common name
            Team = get_mode(Team), # use only most common team
            Age = round(mean(Age, na.rm = TRUE), 0),
            Distance = sum(Distance, na.rm = TRUE)) %>% 
  ungroup() %>% 
  select(-ID)

Final Results

We’ve reached the end, who will win the coveted Swimming + Data Science Participation Trophy for most events swam?

Niagara_Results %>% 
  arrange(desc(No_Swims)) %>% # sort by number of swims, highest to lowest
  head(5) %>% 
  flextable_style()

No_Swims	Name	Team	Age	Distance
183	Alessi, Luciana M	BAAC-NI	9	15635
138	Upcraft, Malia L	Mexico Tiger Sharks-NI	12	20535
118	Moran, Aiden	Clarence Swim Club-NI	10	20885
118	Cole, Matthew J	Oswego Laker Swim Club-NI	14	32655
117	Seelig, Jenna	Clarence Swim Club-NI	12	16845

It’s Luciana M. Alessi from Buffalo Aquatics (BAAC)! She swam a gargantuan 183 races averaging just over 85 yards per race. That’s a lot!

And now for the Swimming + Data Science Participation Trophy for most furthest distance swam…

Niagara_Results %>% 
  arrange(desc(Distance)) %>% # sort by distance swam, highest to lowest
  head(5) %>% 
  flextable_style()

No_Swims	Name	Team	Age	Distance
118	Cole, Matthew J	Oswego Laker Swim Club-NI	14	32655
73	Kruglov, Max	Star Swimming-NI	15	28345
80	Senglaub, Katie	Victor Swim Club-NI	15	26435
66	Signore, Christopher A	Town of Tonawanda Titans-NI	15	26275
82	Creed, Jason	VELO-NI	15	25755

It’s Matthew J. Cole from the Oswego Lakers, who swam 32,655 yards over 118 events for a ~275 yard per event average! That’s over 18 miles by the way.

A massive congratulations to both Luciana and Matthew for your staggering levels of participation! If either of you happen to be reading this (or if you’re a coach from BAAC or the Lakers), get in touch, I really will have trophies made and sent.

Conclusion

Thank you for joining us here at Swimming + Data Science where we’ve used SwimmeR to get a big dataset of swimming results and then answered the questions – who swam the most events and furthest distance. I hope you;ve enjoyed yourself – be sure to check back in next time for another adventure.

To leave a comment for the author, please follow the link and comment on their blog: Welcome on Swimming + Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.