Site icon R-bloggers

Age a factor?

[This article was first published on Analysis of AFL, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

fitzRoy is as always a work in progress and I know that James and I are always keen on getting ideas on how to encourage users to produce great content that hopefully we all love to read, learn and debate.

One of my personal favourite bloggers Tony over at matterofstats has some thoughts on age vs experience. To investigate this you need a couple of things mainly the players experience but also their age!

To work out age should be relatively simple you can just take their date of birth but the problem is how do you get the players date of birth?

So here is a quick blog post on how to get players age, for those who want to analyse it in detail. But also I think eventually it will get integrated within fitzRoy.

So like with anything, the first step in web-scraping the data is to first find the page you want to scrape it from, do one page and then see if we can go through all the pages that contain the data we are after.

An example of a page that contains all the players date of birth for the adelaide crows

So the first step would be lets just scrape that single page.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
adelaide<-read_html("https://afltables.com/afl/stats/alltime/adelaide.html", encoding = "UTF-8")

tables<-html_table(adelaide, fill = TRUE)
adelaide_crows_players<-tables[[1]] 
head(adelaide_crows_players)
##   Cap  #         Player        DOB    HT   WT   Games (W-D-L) Goals
## 1  66 23 McLeod, Andrew 1976-08-04 181cm 81kg 340 (185-0-155)   275
## 2  67  9 Edwards, Tyson 1976-08-06 178cm 81kg 321 (175-0-146)   192
## 3  51 32 Ricciuto, Mark 1975-06-08 184cm 93kg 312 (162-1-149)   292
## 4  38 34      Hart, Ben 1974-07-09 190cm 83kg 311 (166-1-144)    45
## 5  16  7   Smart, Nigel 1969-05-21 188cm 91kg 278 (140-1-137)   116
## 6  77 36 Goodwin, Simon 1976-12-26 185cm 86kg 275 (155-0-120)   162
##     Seasons    Debut     Last
## 1 1995-2010 18y 274d 33y 346d
## 2 1995-2010 18y 308d 33y 303d
## 3 1993-2007 17y 326d  32y 92d
## 4 1992-2006 17y 257d  32y 49d
## 5 1991-2004 21y 305d  35y 30d
## 6 1997-2010  20y 94d 33y 202d

So now we have an example page, the next thing we have to think about is how will we join this data onto the data in fitzRoy

Lets have a look at the player data in fitzRoy specifically the player data from afltables

library(fitzRoy)
stats <- get_afltables_stats(start_date = "1990-01-01", end_date = "2018-10-01")
## Returning data from 1990-01-01 to 2018-10-01
## Downloading data
## 
## Finished downloading data. Processing XMLs
## Finished getting afltables data
names(stats)
##  [1] "Season"                  "Round"                  
##  [3] "Date"                    "Local.start.time"       
##  [5] "Venue"                   "Attendance"             
##  [7] "Home.team"               "HQ1G"                   
##  [9] "HQ1B"                    "HQ2G"                   
## [11] "HQ2B"                    "HQ3G"                   
## [13] "HQ3B"                    "HQ4G"                   
## [15] "HQ4B"                    "Home.score"             
## [17] "Away.team"               "AQ1G"                   
## [19] "AQ1B"                    "AQ2G"                   
## [21] "AQ2B"                    "AQ3G"                   
## [23] "AQ3B"                    "AQ4G"                   
## [25] "AQ4B"                    "Away.score"             
## [27] "First.name"              "Surname"                
## [29] "ID"                      "Jumper.No."             
## [31] "Playing.for"             "Kicks"                  
## [33] "Marks"                   "Handballs"              
## [35] "Goals"                   "Behinds"                
## [37] "Hit.Outs"                "Tackles"                
## [39] "Rebounds"                "Inside.50s"             
## [41] "Clearances"              "Clangers"               
## [43] "Frees.For"               "Frees.Against"          
## [45] "Brownlow.Votes"          "Contested.Possessions"  
## [47] "Uncontested.Possessions" "Contested.Marks"        
## [49] "Marks.Inside.50"         "One.Percenters"         
## [51] "Bounces"                 "Goal.Assists"           
## [53] "Time.on.Ground.."        "Substitute"             
## [55] "Umpire.1"                "Umpire.2"               
## [57] "Umpire.3"                "Umpire.4"               
## [59] "group_id"
tail(stats)
## # A tibble: 6 x 59
##   Season Round Date       Local.start.time Venue Attendance Home.team  HQ1G
##    <dbl> <chr> <date>                <int> <chr>      <int> <chr>     <int>
## 1   2018 Gran… 2018-09-29             1430 "M.C…     100022 West Coa…     2
## 2   2018 Gran… 2018-09-29             1430 "M.C…     100022 West Coa…     2
## 3   2018 Gran… 2018-09-29             1430 "M.C…     100022 West Coa…     2
## 4   2018 Gran… 2018-09-29             1430 "M.C…     100022 West Coa…     2
## 5   2018 Gran… 2018-09-29             1430 "M.C…     100022 West Coa…     2
## 6   2018 Gran… 2018-09-29             1430 "M.C…     100022 West Coa…     2
## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>,
## #   HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>,
## #   Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>,
## #   AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>,
## #   First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>,
## #   Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>,
## #   Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## #   Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>,
## #   Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>,
## #   Contested.Possessions <dbl>, Uncontested.Possessions <dbl>,
## #   Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>,
## #   Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>,
## #   Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>,
## #   Umpire.4 <chr>, group_id <int>

So looking at the data from fitzRoy our best way to join on the date of births would be from joining based on player name, team playing for and season. This works best so long as there hasn’t been a team that has played two players of the same first name and last name in a game in a season.

So looking at what we just scraped from the adelaide crows page on afltables we have player names, the team they played for and seasons.

The main issue is that season is a little difficult to work with so lets edit the season column and the player name column so its a little bit easier to analyse.

Lets do the player name first.

In fitzRoy when we use the get_afltables_stats1 function, we have seperated player names by first name and surname columns. But when we just scraped the data from afltables and it doesn’t appear like this.

adelaide_crows_players<- adelaide_crows_players%>%
      separate(Player,c("Surname","First.name"),sep=",")
head(adelaide_crows_players)
##   Cap  #  Surname First.name        DOB    HT   WT   Games (W-D-L) Goals
## 1  66 23   McLeod     Andrew 1976-08-04 181cm 81kg 340 (185-0-155)   275
## 2  67  9  Edwards      Tyson 1976-08-06 178cm 81kg 321 (175-0-146)   192
## 3  51 32 Ricciuto       Mark 1975-06-08 184cm 93kg 312 (162-1-149)   292
## 4  38 34     Hart        Ben 1974-07-09 190cm 83kg 311 (166-1-144)    45
## 5  16  7    Smart      Nigel 1969-05-21 188cm 91kg 278 (140-1-137)   116
## 6  77 36  Goodwin      Simon 1976-12-26 185cm 86kg 275 (155-0-120)   162
##     Seasons    Debut     Last
## 1 1995-2010 18y 274d 33y 346d
## 2 1995-2010 18y 308d 33y 303d
## 3 1993-2007 17y 326d  32y 92d
## 4 1992-2006 17y 257d  32y 49d
## 5 1991-2004 21y 305d  35y 30d
## 6 1997-2010  20y 94d 33y 202d

A few football fans might wonder how hyphenate names work on afltables the first name that popped into my head was Nick Dal Santo and if we go to the St Kilda Saints page we can see that it should seperate fine.

Now lets look at season

So why is this a problem firstly?

Gary Ablett

stats%>%filter(First.name =="Gary" & Surname=="Ablett")%>%head()
## # A tibble: 6 x 59
##   Season Round Date       Local.start.time Venue Attendance Home.team  HQ1G
##    <dbl> <chr> <date>                <int> <chr>      <int> <chr>     <int>
## 1   1992 8     1992-05-09             1410 Kard…      18565 Geelong      10
## 2   1993 13    1993-06-27             1738 Foot…      46496 Adelaide      6
## 3   1994 4     1994-04-17             1745 Foot…      45638 Adelaide      3
## 4   1994 19    1994-07-31             1408 Kard…      15383 Geelong       2
## 5   1995 10    1995-06-03             1410 Kard…      26314 Geelong       3
## 6   1996 18    1996-08-03             1410 Kard…      17818 Geelong       5
## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>,
## #   HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>,
## #   Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>,
## #   AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>,
## #   First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>,
## #   Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>,
## #   Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## #   Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>,
## #   Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>,
## #   Contested.Possessions <dbl>, Uncontested.Possessions <dbl>,
## #   Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>,
## #   Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>,
## #   Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>,
## #   Umpire.4 <chr>, group_id <int>
stats%>%filter(First.name =="Gary" & Surname=="Ablett")%>%tail()
## # A tibble: 6 x 59
##   Season Round Date       Local.start.time Venue Attendance Home.team  HQ1G
##    <dbl> <chr> <date>                <int> <chr>      <int> <chr>     <int>
## 1   2018 19    2018-07-28             1410 "Kar…      28226 Geelong       2
## 2   2018 20    2018-08-03             1950 "M.C…      67054 Richmond      4
## 3   2018 21    2018-08-11             1345 "M.C…      59529 Hawthorn      0
## 4   2018 22    2018-08-18             1410 "Kar…      24507 Geelong       1
## 5   2018 23    2018-08-25             1345 "Kar…      28004 Geelong       4
## 6   2018 Elim… 2018-09-07             1950 "M.C…      91767 Melbourne     5
## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>,
## #   HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>,
## #   Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>,
## #   AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>,
## #   First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>,
## #   Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>,
## #   Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## #   Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>,
## #   Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>,
## #   Contested.Possessions <dbl>, Uncontested.Possessions <dbl>,
## #   Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>,
## #   Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>,
## #   Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>,
## #   Umpire.4 <chr>, group_id <int>

So these are two players who have the same name but played for the same team. So we couldn’t just join on first name and surname we need to know the season in which the player played. (This would break down the moment two players with the same name play for the same team).

So I am thinking one way to do this could be to get the start season and end season as a sequence, then we just duplicate the rows for all values in the sequence. Then we can join it onto the dataset in fitzRoy

So looking at the dataset if we look at the Seasons column the first 4 characters are the first season the player has played, and our end of sequence should be the last 4 characters. So we will create two new columns called start of sequence and end of sequence.

First lets get our start of sequences

adelaide_crows_players$start_sequence<-substr(adelaide_crows_players[,10], start = 1, stop = 4)
head(adelaide_crows_players)
##   Cap  #  Surname First.name        DOB    HT   WT   Games (W-D-L) Goals
## 1  66 23   McLeod     Andrew 1976-08-04 181cm 81kg 340 (185-0-155)   275
## 2  67  9  Edwards      Tyson 1976-08-06 178cm 81kg 321 (175-0-146)   192
## 3  51 32 Ricciuto       Mark 1975-06-08 184cm 93kg 312 (162-1-149)   292
## 4  38 34     Hart        Ben 1974-07-09 190cm 83kg 311 (166-1-144)    45
## 5  16  7    Smart      Nigel 1969-05-21 188cm 91kg 278 (140-1-137)   116
## 6  77 36  Goodwin      Simon 1976-12-26 185cm 86kg 275 (155-0-120)   162
##     Seasons    Debut     Last start_sequence
## 1 1995-2010 18y 274d 33y 346d           1995
## 2 1995-2010 18y 308d 33y 303d           1995
## 3 1993-2007 17y 326d  32y 92d           1993
## 4 1992-2006 17y 257d  32y 49d           1992
## 5 1991-2004 21y 305d  35y 30d           1991
## 6 1997-2010  20y 94d 33y 202d           1997

Now lets get our end of sequence

library(stringr)
adelaide_crows_players$end_sequence<-str_sub(adelaide_crows_players[,10], start = -4)
head(adelaide_crows_players)
##   Cap  #  Surname First.name        DOB    HT   WT   Games (W-D-L) Goals
## 1  66 23   McLeod     Andrew 1976-08-04 181cm 81kg 340 (185-0-155)   275
## 2  67  9  Edwards      Tyson 1976-08-06 178cm 81kg 321 (175-0-146)   192
## 3  51 32 Ricciuto       Mark 1975-06-08 184cm 93kg 312 (162-1-149)   292
## 4  38 34     Hart        Ben 1974-07-09 190cm 83kg 311 (166-1-144)    45
## 5  16  7    Smart      Nigel 1969-05-21 188cm 91kg 278 (140-1-137)   116
## 6  77 36  Goodwin      Simon 1976-12-26 185cm 86kg 275 (155-0-120)   162
##     Seasons    Debut     Last start_sequence end_sequence
## 1 1995-2010 18y 274d 33y 346d           1995         2010
## 2 1995-2010 18y 308d 33y 303d           1995         2010
## 3 1993-2007 17y 326d  32y 92d           1993         2007
## 4 1992-2006 17y 257d  32y 49d           1992         2006
## 5 1991-2004 21y 305d  35y 30d           1991         2004
## 6 1997-2010  20y 94d 33y 202d           1997         2010

Now we just need to duplicate the rows so we can join on the date of birth for the players.

To do this we can use tidyverse specifically we can use gather.

  gather(data=adelaide_crows_players, key = "career_start_end", value = "year", c("start_sequence", "end_sequence")) %>%head()
##   Cap  #  Surname First.name        DOB    HT   WT   Games (W-D-L) Goals
## 1  66 23   McLeod     Andrew 1976-08-04 181cm 81kg 340 (185-0-155)   275
## 2  67  9  Edwards      Tyson 1976-08-06 178cm 81kg 321 (175-0-146)   192
## 3  51 32 Ricciuto       Mark 1975-06-08 184cm 93kg 312 (162-1-149)   292
## 4  38 34     Hart        Ben 1974-07-09 190cm 83kg 311 (166-1-144)    45
## 5  16  7    Smart      Nigel 1969-05-21 188cm 91kg 278 (140-1-137)   116
## 6  77 36  Goodwin      Simon 1976-12-26 185cm 86kg 275 (155-0-120)   162
##     Seasons    Debut     Last career_start_end year
## 1 1995-2010 18y 274d 33y 346d   start_sequence 1995
## 2 1995-2010 18y 308d 33y 303d   start_sequence 1995
## 3 1993-2007 17y 326d  32y 92d   start_sequence 1993
## 4 1992-2006 17y 257d  32y 49d   start_sequence 1992
## 5 1991-2004 21y 305d  35y 30d   start_sequence 1991
## 6 1997-2010  20y 94d 33y 202d   start_sequence 1997

To check this gives us what we want, lets take an example player, say the great Andrew McLeod.

  gather(data=adelaide_crows_players, key = "career_start_end", value = "year", c("start_sequence", "end_sequence")) %>%filter(Surname=="McLeod")
##   Cap  # Surname First.name        DOB    HT   WT   Games (W-D-L) Goals
## 1  66 23  McLeod     Andrew 1976-08-04 181cm 81kg 340 (185-0-155)   275
## 2  66 23  McLeod     Andrew 1976-08-04 181cm 81kg 340 (185-0-155)   275
##     Seasons    Debut     Last career_start_end year
## 1 1995-2010 18y 274d 33y 346d   start_sequence 1995
## 2 1995-2010 18y 274d 33y 346d     end_sequence 2010

So we can see this gives us what we want, now all we have to do is expand this dataset so we have entries for Andrew McLeod for each year he was with the Crows.

So how do we do that?

gathered_dataset<- gather(data=adelaide_crows_players, key = "career_start_end", value = "year", c("start_sequence", "end_sequence"))


gathered_dataset$year<-as.numeric(gathered_dataset$year)
## Warning: NAs introduced by coercion
gathered_dataset<-gathered_dataset[complete.cases(gathered_dataset),]
adelaide_data_to_join<-gathered_dataset%>%
  select(Surname, First.name, DOB, career_start_end, year)%>%
  group_by(Surname, First.name, career_start_end)%>%
  complete(Surname, First.name, DOB, career_start_end, year=seq(min(year), max(year),by=1))

Now view it

head(adelaide_data_to_join)
## # A tibble: 6 x 5
## # Groups:   Surname, First.name, career_start_end [6]
##   Surname   First.name DOB        career_start_end  year
##   <chr>     <chr>      <chr>      <chr>            <dbl>
## 1 Abernethy " Bruce"   1962-05-10 end_sequence      1992
## 2 Abernethy " Bruce"   1962-05-10 start_sequence    1991
## 3 Anderson  " Greg"    1966-05-14 end_sequence      1996
## 4 Anderson  " Greg"    1966-05-14 start_sequence    1993
## 5 Armstrong " Tony"    1989-09-29 end_sequence      2011
## 6 Armstrong " Tony"    1989-09-29 start_sequence    2010

So it seems OK.

Now the next part is lets take the crows stuff as an example how do we join it back onto the fitzRoy afltables data.

Lets first quickly look at the two datasets we have * dataset in fitzRoy we have already stats * dataset of the adelaide crows date of births we have just created adelaide_data_to_join

head(stats)
## # A tibble: 6 x 59
##   Season Round Date       Local.start.time Venue Attendance Home.team  HQ1G
##    <dbl> <chr> <date>                <int> <chr>      <int> <chr>     <int>
## 1   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 2   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 3   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 4   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 5   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 6   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>,
## #   HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>,
## #   Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>,
## #   AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>,
## #   First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>,
## #   Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>,
## #   Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## #   Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>,
## #   Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>,
## #   Contested.Possessions <dbl>, Uncontested.Possessions <dbl>,
## #   Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>,
## #   Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>,
## #   Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>,
## #   Umpire.4 <chr>, group_id <int>
tail(adelaide_data_to_join)
## # A tibble: 6 x 5
## # Groups:   Surname, First.name, career_start_end [6]
##   Surname First.name DOB        career_start_end  year
##   <chr>   <chr>      <chr>      <chr>            <dbl>
## 1 Wintle  " Darryl"  1976-04-07 end_sequence      1999
## 2 Wintle  " Darryl"  1976-04-07 start_sequence    1999
## 3 Wright  " Matthew" 1989-12-14 end_sequence      2015
## 4 Wright  " Matthew" 1989-12-14 start_sequence    2011
## 5 Young   " Will"    1990-08-03 end_sequence      2010
## 6 Young   " Will"    1990-08-03 start_sequence    2010

So remembering what we wanted to join on was First.name, Surname, year and Playing.for.

Well the column I forgot to add is the Playing.for column.

We can do that as follows:

adelaide_data_to_join$Playing.for<-"Adelaide" 

Now we can left_join it onto the stats dataset.

check_crows_dataset<-left_join(stats, adelaide_data_to_join, by = c("First.name" = "First.name", "Surname" = "Surname", "Season"="year", "Playing.for"="Playing.for"))

head(check_crows_dataset)
## # A tibble: 6 x 61
##   Season Round Date       Local.start.time Venue Attendance Home.team  HQ1G
##    <dbl> <chr> <date>                <int> <chr>      <int> <chr>     <int>
## 1   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 2   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 3   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 4   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 5   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 6   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## # ... with 53 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>,
## #   HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>,
## #   Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>,
## #   AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>,
## #   First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>,
## #   Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>,
## #   Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## #   Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>,
## #   Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>,
## #   Contested.Possessions <dbl>, Uncontested.Possessions <dbl>,
## #   Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>,
## #   Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>,
## #   Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>,
## #   Umpire.4 <chr>, group_id <int>, DOB <chr>, career_start_end <chr>

This didn’t work, which means that our column values don’t exactly align.

To check why this might occur lets check how the join from our adelaide_data_to_join works.

adelaide_data_to_join%>%
  filter(Playing.for=="Adelaide" &Surname=="Jarman")%>%head()
## # A tibble: 4 x 6
## # Groups:   Surname, First.name, career_start_end [4]
##   Surname First.name DOB        career_start_end  year Playing.for
##   <chr>   <chr>      <chr>      <chr>            <dbl> <chr>      
## 1 Jarman  " Andrew"  1966-01-14 end_sequence      1996 Adelaide   
## 2 Jarman  " Andrew"  1966-01-14 start_sequence    1991 Adelaide   
## 3 Jarman  " Darren"  1967-01-28 end_sequence      2001 Adelaide   
## 4 Jarman  " Darren"  1967-01-28 start_sequence    1996 Adelaide

So what we can see here is that there seems to be a space between the first letter of the First.name which is why our join isn’t working.

adelaide_data_to_join$First.name<-trimws(adelaide_data_to_join$First.name, which = c( "both","left", "right"))
check_crows_dataset<-left_join(stats, adelaide_data_to_join, by = c("First.name" = "First.name", "Surname" = "Surname", "Season"="year", "Playing.for"="Playing.for"))

head(check_crows_dataset)
## # A tibble: 6 x 61
##   Season Round Date       Local.start.time Venue Attendance Home.team  HQ1G
##    <dbl> <chr> <date>                <int> <chr>      <int> <chr>     <int>
## 1   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 2   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 3   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 4   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 5   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## 6   1991 1     1991-03-22             1940 Foot…      44902 Adelaide      6
## # ... with 53 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>,
## #   HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>,
## #   Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>,
## #   AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>,
## #   First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>,
## #   Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>,
## #   Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## #   Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>,
## #   Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>,
## #   Contested.Possessions <dbl>, Uncontested.Possessions <dbl>,
## #   Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>,
## #   Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>,
## #   Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>,
## #   Umpire.4 <chr>, group_id <int>, DOB <chr>, career_start_end <chr>
check_crows_dataset%>%filter(Playing.for=="Adelaide")%>%select(DOB)%>%head()
## # A tibble: 6 x 1
##   DOB       
##   <chr>     
## 1 1962-03-18
## 2 1964-01-03
## 3 1970-01-06
## 4 1970-01-06
## 5 1970-06-30
## 6 1966-01-14

Ok so this seems to work now just need to do this for all teams.

Looking forward to reading anywork looking at age and its impact on team or player performance.

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.