Site icon R-bloggers

Scraping old player data

[This article was first published on Analysis of AFL, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As its been pointed out to me on that it would be handy if within fitzRoy that it should contain past players data from footywire.

So here is roughly how to do that.

Step 1 – get all the packages you need

library(rvest)
## Loading required package: xml2
library(tidyverse)
## ── Attaching packages ──────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ purrr   0.3.0  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ─────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ purrr::pluck()          masks rvest::pluck()
library(naniar)

naniar isn’t something I have used on the blog before but I find its pretty handy. What we are doing here is we are web-scraping and we have an issue in our scrape. Basically the row in which we pluck our height, weight and position from sometimes contains height and weight but not position.

In that case when we make our data tidy, when we go to find a ‘’position’’ for a player who doesn’t have one, we get the whole row. Hopefully this will be a bit more clear with the screenshots and alike below.

Step 2 – Pick a team to scrape (this case I’m going to do West Coast)

Keeping in mind we want to scrape the webpages lets just look at a few eagles players and see how their pages are structured. To do this I don’t think we need to really dig into the html but lets just look at the pages.

First player I decided to click on was John Annear we can see we can get his date of birth, his height and weight.

John Annear screenshot footywire

Second player I decided to look at was Corey Adamson here we can see we can get his date of birth, his height, his weight, his position and his draft position.

Corey Adamson screenshot footywire

Thirdy player I decided to look at was Mark Lecras

Here we can see we get his date of birth, games played, height, weight, position and draft position.

Mark Lecras screenshot footywire

Then after randomly clicking on a few other players I am reasonble confident that these 3 cover the variety of different bits of information provided for all past players.

So now lets get scraping

Step 3 – Scrape a single player

For this example lets do Mark Lecras, the reason is he has the most amount of information for past players on his page.

cbind.fill <- function(...){
  nm <- list(...) 
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow)) 
  do.call(cbind, lapply(nm, function (x) 
    rbind(x, matrix(, n-nrow(x), ncol(x))))) 
}

page<-read_html("https://www.footywire.com/afl/footy/pp-west-coast-eagles--mark-lecras")
  
  player<- page%>%
    html_nodes("#playerProfileName")%>%
    html_text()
  player
## [1] "Mark Lecras"
  playing.for<-  page%>%
    html_nodes("#playerProfileTeamDiv a b")%>%
    html_text() %>% as.tibble()
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
  playing.for
## # A tibble: 1 x 1
##   value            
##   <chr>            
## 1 West Coast Eagles
  games<-page%>%
    html_nodes("#playerProfileData1")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_extract(pattern =("(?<=Games: ).*(?=Born:)"))%>%as.tibble()
  games
## # A tibble: 1 x 1
##   value 
##   <chr> 
## 1 "219 "
  born<-     page%>%
    html_nodes("#playerProfileData1")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_remove(".*Born: ")%>%
    str_squish() %>%  as.tibble()
  born
## # A tibble: 1 x 1
##   value          
##   <chr>          
## 1 August 30, 1986
  weight<-page%>%
    html_nodes("#playerProfileData2")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_extract(pattern =("(?<=Weight:).*(?=kg)"))%>%as.tibble()
  weight
## # A tibble: 1 x 1
##   value
##   <chr>
## 1 " 82"
  height<-page%>%
    html_nodes("#playerProfileData2")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_extract(pattern =("(?<=Height:).*(?=cm)"))%>%as.tibble()
  height
## # A tibble: 1 x 1
##   value 
##   <chr> 
## 1 " 184"
  draft_position <- page%>%
    html_nodes("#playerProfileDraftInfo")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_extract(pattern =("(?<=Drafted: ).*(?=by)"))%>%as.tibble()
  draft_position
## # A tibble: 1 x 1
##   value                                
##   <chr>                                
## 1 Round 3, Pick #37 2004 National Draft
  club_drafted <- page%>%
    html_nodes("#playerProfileDraftInfo a+ a")%>%
    html_text()%>%str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_remove(".*by") %>% as.tibble()
  club_drafted
## # A tibble: 1 x 1
##   value            
##   <chr>            
## 1 West Coast Eagles
  position <-     page%>%
    html_nodes("#playerProfileData2")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_remove(".*Position: ")%>%
    str_squish() %>% as.tibble()
  position
## # A tibble: 1 x 1
##   value  
##   <chr>  
## 1 Forward
player_information <- cbind.fill(player, playing.for, games,born, weight, height,draft_position, club_drafted, position)
  
  player_information <- as.tibble(player_information)
  player_information
## # A tibble: 1 x 9
##   V1      value     V3     V4      V5    V6    V7            V8       V9   
##   <chr>   <chr>     <chr>  <chr>   <chr> <chr> <chr>         <chr>    <chr>
## 1 Mark L… West Coa… "219 " August… " 82" " 18… Round 3, Pic… West Co… Forw…

Selecter gadget is doing a great job, but what we can notice here is that the html_nodes do not change for height, weight, position. So in our earlier case with John Annear what this means is that in his position column it will contain the whole row as we were not able to remove words post ‘position’ like we did for Mark Lecras.

This is where naniar will come in handy, for all those rows that do not have a valid position (duplicated text values we have already in other columns) we can just use naniar and specifically its replace_with_na function to make those specific data entries na

Lets see how this would work below.

Step 4 – Scrape a whole retired team list.

url<-"https://www.footywire.com/afl/footy/ti-west-coast-eagles"


link<-read_html(url)%>%
  html_nodes(".lnormtop a")%>%
  html_attr("href")
# david-brown #nwws or ewmocw

url_players<-str_c("https://www.footywire.com/afl/footy/",link)
# url_players<-head(url_players,19)
#need to get rid of david brown
# url_players<-url_players[-21]
player_info <- function(x){
  # page <- read_html(x)
  page<-read_html(x)
  
  player<- page%>%
    html_nodes("#playerProfileName")%>%
    html_text()
  player
  
  playing.for<-  page%>%
    html_nodes("#playerProfileTeamDiv a b")%>%
    html_text() %>% as.tibble()
  playing.for
  games<-page%>%
    html_nodes("#playerProfileData1")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_extract(pattern =("(?<=Games: ).*(?=Born:)"))%>%as.tibble()
  games
  born<-page%>%
    html_nodes("#playerProfileData1")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_remove(".*Born: ")%>%
    str_squish() %>%  as.tibble()
  born
  
  weight<-page%>%
    html_nodes("#playerProfileData2")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_extract(pattern =("(?<=Weight:).*(?=kg)"))%>%as.tibble()
  weight
  
  height<-page%>%
    html_nodes("#playerProfileData2")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_extract(pattern =("(?<=Height:).*(?=cm)"))%>%as.tibble()
  height
  
  draft_position <- page%>%
    html_nodes("#playerProfileDraftInfo")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_extract(pattern =("(?<=Drafted: ).*(?=by)"))%>%as.tibble()
  draft_position
  club_drafted <- page%>%
    html_nodes("#playerProfileDraftInfo a+ a")%>%
    html_text()%>%str_replace_all("[\r\n]" , "")%>%
    str_squish()%>%
    str_remove(".*by") %>% as.tibble()
  club_drafted
  position <-     page%>%
    html_nodes("#playerProfileData2")%>%
    html_text()%>%
    str_replace_all("[\r\n]" , "")%>%
    str_remove(".*Position: ")%>%
    str_squish() %>% as.tibble()
  position
  
  
  
  
  
  
  
  #combine, name, and make it a tibble
  player_information <- cbind.fill(player, playing.for, games,born, weight, height,draft_position, club_drafted, position)
  
  player_information <- as.tibble(player_information)
  
  # print(x)
  # return(x)
  return(player_information)
}
footywire <- purrr::map_df(url_players, player_info)
footywire
## # A tibble: 238 x 9
##    V1      value   V3     V4      V5    V6    V7         V8      V9        
##    <chr>   <chr>   <chr>  <chr>   <chr> <chr> <chr>      <chr>   <chr>     
##  1 Corey … West C… <NA>   Februa… " 86" " 18… Round 2, … West C… Defender,…
##  2 Damien… West C… "54 "  March … " 75" " 18… Round 4, … Collin… Midfield  
##  3 John A… West C… <NA>   June 1… " 80" " 17… <NA>       <NA>    Playing H…
##  4 David … West C… "3 "   June 1… " 93" " 19… Round 3, … West C… Playing H…
##  5 Steven… West C… "79 "  Januar… " 83" " 18… Round 1, … West C… Forward   
##  6 Ashley… West C… <NA>   April … " 86" " 18… Round 2, … West C… Midfield  
##  7 Jason … Sydney… "193 " Novemb… " 10… " 20… Pick #1 1… West C… Ruck      
##  8 Drew B… West C… "265 " Februa… " 89" " 18… Pick #1 1… West C… Defender,…
##  9 Adrian… West C… <NA>   Decemb… " 85" " 17… <NA>       <NA>    Playing H…
## 10 Glen B… West C… <NA>   June 1… " 95" " 19… <NA>       <NA>    Playing H…
## # … with 228 more rows
names(footywire) <- c("player", "club", "games","born","weight","height",  "draft_position", "club_drafted", "position")
df_replace<-footywire%>%filter(!position %in% c("Midfield", "Defender", "Defender, Forward", "Defender, Midfield", 
                                                "Forward", "Forward, Ruck", "Midfield, Forward", "Ruck"))%>%
  select(position)


footywire_eagles<-footywire%>%naniar::replace_with_na(replace=list(position=df_replace))
footywire_eagles
## # A tibble: 238 x 9
##    player club  games born  weight height draft_position club_drafted
##    <chr>  <chr> <chr> <chr> <chr>  <chr>  <chr>          <chr>       
##  1 Corey… West… <NA>  Febr… " 86"  " 185" Round 2, Pick… West Coast …
##  2 Damie… West… "54 " Marc… " 75"  " 181" Round 4, Pick… Collingwood…
##  3 John … West… <NA>  June… " 80"  " 178" <NA>           <NA>        
##  4 David… West… "3 "  June… " 93"  " 194" Round 3, Pick… West Coast …
##  5 Steve… West… "79 " Janu… " 83"  " 180" Round 1, Pick… West Coast …
##  6 Ashle… West… <NA>  Apri… " 86"  " 188" Round 2, Pick… West Coast …
##  7 Jason… Sydn… "193… Nove… " 104" " 201" Pick #1 1991 … West Coast …
##  8 Drew … West… "265… Febr… " 89"  " 184" Pick #1 1992 … West Coast …
##  9 Adria… West… <NA>  Dece… " 85"  " 177" <NA>           <NA>        
## 10 Glen … West… <NA>  June… " 95"  " 195" <NA>           <NA>        
## # … with 228 more rows, and 1 more variable: position <chr>

Bingo there you have it, how to scrape a whole teams worth of data.

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.