Age a factor?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
fitzRoy is as always a work in progress and I know that James and I are always keen on getting ideas on how to encourage users to produce great content that hopefully we all love to read, learn and debate.
One of my personal favourite bloggers Tony over at matterofstats has some thoughts on age vs experience. To investigate this you need a couple of things mainly the players experience but also their age!
To work out age should be relatively simple you can just take their date of birth but the problem is how do you get the players date of birth?
So here is a quick blog post on how to get players age, for those who want to analyse it in detail. But also I think eventually it will get integrated within fitzRoy.
So like with anything, the first step in web-scraping the data is to first find the page you want to scrape it from, do one page and then see if we can go through all the pages that contain the data we are after.
An example of a page that contains all the players date of birth for the adelaide crows
So the first step would be lets just scrape that single page.
library(tidyverse) ## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ── ## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5 ## ✔ tibble 1.4.2 ✔ dplyr 0.7.8 ## ✔ tidyr 0.8.2 ✔ stringr 1.3.1 ## ✔ readr 1.3.1 ✔ forcats 0.3.0 ## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() library(rvest) ## Loading required package: xml2 ## ## Attaching package: 'rvest' ## The following object is masked from 'package:purrr': ## ## pluck ## The following object is masked from 'package:readr': ## ## guess_encoding adelaide<-read_html("https://afltables.com/afl/stats/alltime/adelaide.html", encoding = "UTF-8") tables<-html_table(adelaide, fill = TRUE) adelaide_crows_players<-tables[[1]] head(adelaide_crows_players) ## Cap # Player DOB HT WT Games (W-D-L) Goals ## 1 66 23 McLeod, Andrew 1976-08-04 181cm 81kg 340 (185-0-155) 275 ## 2 67 9 Edwards, Tyson 1976-08-06 178cm 81kg 321 (175-0-146) 192 ## 3 51 32 Ricciuto, Mark 1975-06-08 184cm 93kg 312 (162-1-149) 292 ## 4 38 34 Hart, Ben 1974-07-09 190cm 83kg 311 (166-1-144) 45 ## 5 16 7 Smart, Nigel 1969-05-21 188cm 91kg 278 (140-1-137) 116 ## 6 77 36 Goodwin, Simon 1976-12-26 185cm 86kg 275 (155-0-120) 162 ## Seasons Debut Last ## 1 1995-2010 18y 274d 33y 346d ## 2 1995-2010 18y 308d 33y 303d ## 3 1993-2007 17y 326d 32y 92d ## 4 1992-2006 17y 257d 32y 49d ## 5 1991-2004 21y 305d 35y 30d ## 6 1997-2010 20y 94d 33y 202d
So now we have an example page, the next thing we have to think about is how will we join this data onto the data in fitzRoy
Lets have a look at the player data in fitzRoy specifically the player data from afltables
library(fitzRoy) stats <- get_afltables_stats(start_date = "1990-01-01", end_date = "2018-10-01") ## Returning data from 1990-01-01 to 2018-10-01 ## Downloading data ## ## Finished downloading data. Processing XMLs ## Finished getting afltables data names(stats) ## [1] "Season" "Round" ## [3] "Date" "Local.start.time" ## [5] "Venue" "Attendance" ## [7] "Home.team" "HQ1G" ## [9] "HQ1B" "HQ2G" ## [11] "HQ2B" "HQ3G" ## [13] "HQ3B" "HQ4G" ## [15] "HQ4B" "Home.score" ## [17] "Away.team" "AQ1G" ## [19] "AQ1B" "AQ2G" ## [21] "AQ2B" "AQ3G" ## [23] "AQ3B" "AQ4G" ## [25] "AQ4B" "Away.score" ## [27] "First.name" "Surname" ## [29] "ID" "Jumper.No." ## [31] "Playing.for" "Kicks" ## [33] "Marks" "Handballs" ## [35] "Goals" "Behinds" ## [37] "Hit.Outs" "Tackles" ## [39] "Rebounds" "Inside.50s" ## [41] "Clearances" "Clangers" ## [43] "Frees.For" "Frees.Against" ## [45] "Brownlow.Votes" "Contested.Possessions" ## [47] "Uncontested.Possessions" "Contested.Marks" ## [49] "Marks.Inside.50" "One.Percenters" ## [51] "Bounces" "Goal.Assists" ## [53] "Time.on.Ground.." "Substitute" ## [55] "Umpire.1" "Umpire.2" ## [57] "Umpire.3" "Umpire.4" ## [59] "group_id" tail(stats) ## # A tibble: 6 x 59 ## Season Round Date Local.start.time Venue Attendance Home.team HQ1G ## <dbl> <chr> <date> <int> <chr> <int> <chr> <int> ## 1 2018 Gran… 2018-09-29 1430 "M.C… 100022 West Coa… 2 ## 2 2018 Gran… 2018-09-29 1430 "M.C… 100022 West Coa… 2 ## 3 2018 Gran… 2018-09-29 1430 "M.C… 100022 West Coa… 2 ## 4 2018 Gran… 2018-09-29 1430 "M.C… 100022 West Coa… 2 ## 5 2018 Gran… 2018-09-29 1430 "M.C… 100022 West Coa… 2 ## 6 2018 Gran… 2018-09-29 1430 "M.C… 100022 West Coa… 2 ## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, ## # HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, ## # Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, ## # AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>, ## # First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>, ## # Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>, ## # Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>, ## # Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>, ## # Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>, ## # Contested.Possessions <dbl>, Uncontested.Possessions <dbl>, ## # Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>, ## # Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>, ## # Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>, ## # Umpire.4 <chr>, group_id <int>
So looking at the data from fitzRoy our best way to join on the date of births would be from joining based on player name, team playing for and season. This works best so long as there hasn’t been a team that has played two players of the same first name and last name in a game in a season.
So looking at what we just scraped from the adelaide crows page on afltables we have player names, the team they played for and seasons.
The main issue is that season is a little difficult to work with so lets edit the season column and the player name column so its a little bit easier to analyse.
Lets do the player name first.
In fitzRoy when we use the get_afltables_stats1
function, we have seperated player names by first name and surname columns. But when we just scraped the data from afltables and it doesn’t appear like this.
adelaide_crows_players<- adelaide_crows_players%>% separate(Player,c("Surname","First.name"),sep=",") head(adelaide_crows_players) ## Cap # Surname First.name DOB HT WT Games (W-D-L) Goals ## 1 66 23 McLeod Andrew 1976-08-04 181cm 81kg 340 (185-0-155) 275 ## 2 67 9 Edwards Tyson 1976-08-06 178cm 81kg 321 (175-0-146) 192 ## 3 51 32 Ricciuto Mark 1975-06-08 184cm 93kg 312 (162-1-149) 292 ## 4 38 34 Hart Ben 1974-07-09 190cm 83kg 311 (166-1-144) 45 ## 5 16 7 Smart Nigel 1969-05-21 188cm 91kg 278 (140-1-137) 116 ## 6 77 36 Goodwin Simon 1976-12-26 185cm 86kg 275 (155-0-120) 162 ## Seasons Debut Last ## 1 1995-2010 18y 274d 33y 346d ## 2 1995-2010 18y 308d 33y 303d ## 3 1993-2007 17y 326d 32y 92d ## 4 1992-2006 17y 257d 32y 49d ## 5 1991-2004 21y 305d 35y 30d ## 6 1997-2010 20y 94d 33y 202d
A few football fans might wonder how hyphenate names work on afltables the first name that popped into my head was Nick Dal Santo and if we go to the St Kilda Saints page we can see that it should seperate fine.
Now lets look at season
So why is this a problem firstly?
stats%>%filter(First.name =="Gary" & Surname=="Ablett")%>%head() ## # A tibble: 6 x 59 ## Season Round Date Local.start.time Venue Attendance Home.team HQ1G ## <dbl> <chr> <date> <int> <chr> <int> <chr> <int> ## 1 1992 8 1992-05-09 1410 Kard… 18565 Geelong 10 ## 2 1993 13 1993-06-27 1738 Foot… 46496 Adelaide 6 ## 3 1994 4 1994-04-17 1745 Foot… 45638 Adelaide 3 ## 4 1994 19 1994-07-31 1408 Kard… 15383 Geelong 2 ## 5 1995 10 1995-06-03 1410 Kard… 26314 Geelong 3 ## 6 1996 18 1996-08-03 1410 Kard… 17818 Geelong 5 ## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, ## # HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, ## # Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, ## # AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>, ## # First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>, ## # Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>, ## # Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>, ## # Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>, ## # Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>, ## # Contested.Possessions <dbl>, Uncontested.Possessions <dbl>, ## # Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>, ## # Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>, ## # Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>, ## # Umpire.4 <chr>, group_id <int> stats%>%filter(First.name =="Gary" & Surname=="Ablett")%>%tail() ## # A tibble: 6 x 59 ## Season Round Date Local.start.time Venue Attendance Home.team HQ1G ## <dbl> <chr> <date> <int> <chr> <int> <chr> <int> ## 1 2018 19 2018-07-28 1410 "Kar… 28226 Geelong 2 ## 2 2018 20 2018-08-03 1950 "M.C… 67054 Richmond 4 ## 3 2018 21 2018-08-11 1345 "M.C… 59529 Hawthorn 0 ## 4 2018 22 2018-08-18 1410 "Kar… 24507 Geelong 1 ## 5 2018 23 2018-08-25 1345 "Kar… 28004 Geelong 4 ## 6 2018 Elim… 2018-09-07 1950 "M.C… 91767 Melbourne 5 ## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, ## # HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, ## # Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, ## # AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>, ## # First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>, ## # Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>, ## # Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>, ## # Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>, ## # Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>, ## # Contested.Possessions <dbl>, Uncontested.Possessions <dbl>, ## # Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>, ## # Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>, ## # Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>, ## # Umpire.4 <chr>, group_id <int>
So these are two players who have the same name but played for the same team. So we couldn’t just join on first name and surname we need to know the season in which the player played. (This would break down the moment two players with the same name play for the same team).
So I am thinking one way to do this could be to get the start season and end season as a sequence, then we just duplicate the rows for all values in the sequence. Then we can join it onto the dataset in fitzRoy
So looking at the dataset if we look at the Seasons
column the first 4 characters are the first season the player has played, and our end of sequence should be the last 4 characters. So we will create two new columns called start of sequence and end of sequence.
First lets get our start of sequences
adelaide_crows_players$start_sequence<-substr(adelaide_crows_players[,10], start = 1, stop = 4) head(adelaide_crows_players) ## Cap # Surname First.name DOB HT WT Games (W-D-L) Goals ## 1 66 23 McLeod Andrew 1976-08-04 181cm 81kg 340 (185-0-155) 275 ## 2 67 9 Edwards Tyson 1976-08-06 178cm 81kg 321 (175-0-146) 192 ## 3 51 32 Ricciuto Mark 1975-06-08 184cm 93kg 312 (162-1-149) 292 ## 4 38 34 Hart Ben 1974-07-09 190cm 83kg 311 (166-1-144) 45 ## 5 16 7 Smart Nigel 1969-05-21 188cm 91kg 278 (140-1-137) 116 ## 6 77 36 Goodwin Simon 1976-12-26 185cm 86kg 275 (155-0-120) 162 ## Seasons Debut Last start_sequence ## 1 1995-2010 18y 274d 33y 346d 1995 ## 2 1995-2010 18y 308d 33y 303d 1995 ## 3 1993-2007 17y 326d 32y 92d 1993 ## 4 1992-2006 17y 257d 32y 49d 1992 ## 5 1991-2004 21y 305d 35y 30d 1991 ## 6 1997-2010 20y 94d 33y 202d 1997
Now lets get our end of sequence
library(stringr) adelaide_crows_players$end_sequence<-str_sub(adelaide_crows_players[,10], start = -4) head(adelaide_crows_players) ## Cap # Surname First.name DOB HT WT Games (W-D-L) Goals ## 1 66 23 McLeod Andrew 1976-08-04 181cm 81kg 340 (185-0-155) 275 ## 2 67 9 Edwards Tyson 1976-08-06 178cm 81kg 321 (175-0-146) 192 ## 3 51 32 Ricciuto Mark 1975-06-08 184cm 93kg 312 (162-1-149) 292 ## 4 38 34 Hart Ben 1974-07-09 190cm 83kg 311 (166-1-144) 45 ## 5 16 7 Smart Nigel 1969-05-21 188cm 91kg 278 (140-1-137) 116 ## 6 77 36 Goodwin Simon 1976-12-26 185cm 86kg 275 (155-0-120) 162 ## Seasons Debut Last start_sequence end_sequence ## 1 1995-2010 18y 274d 33y 346d 1995 2010 ## 2 1995-2010 18y 308d 33y 303d 1995 2010 ## 3 1993-2007 17y 326d 32y 92d 1993 2007 ## 4 1992-2006 17y 257d 32y 49d 1992 2006 ## 5 1991-2004 21y 305d 35y 30d 1991 2004 ## 6 1997-2010 20y 94d 33y 202d 1997 2010
Now we just need to duplicate the rows so we can join on the date of birth for the players.
To do this we can use tidyverse specifically we can use gather
.
gather(data=adelaide_crows_players, key = "career_start_end", value = "year", c("start_sequence", "end_sequence")) %>%head() ## Cap # Surname First.name DOB HT WT Games (W-D-L) Goals ## 1 66 23 McLeod Andrew 1976-08-04 181cm 81kg 340 (185-0-155) 275 ## 2 67 9 Edwards Tyson 1976-08-06 178cm 81kg 321 (175-0-146) 192 ## 3 51 32 Ricciuto Mark 1975-06-08 184cm 93kg 312 (162-1-149) 292 ## 4 38 34 Hart Ben 1974-07-09 190cm 83kg 311 (166-1-144) 45 ## 5 16 7 Smart Nigel 1969-05-21 188cm 91kg 278 (140-1-137) 116 ## 6 77 36 Goodwin Simon 1976-12-26 185cm 86kg 275 (155-0-120) 162 ## Seasons Debut Last career_start_end year ## 1 1995-2010 18y 274d 33y 346d start_sequence 1995 ## 2 1995-2010 18y 308d 33y 303d start_sequence 1995 ## 3 1993-2007 17y 326d 32y 92d start_sequence 1993 ## 4 1992-2006 17y 257d 32y 49d start_sequence 1992 ## 5 1991-2004 21y 305d 35y 30d start_sequence 1991 ## 6 1997-2010 20y 94d 33y 202d start_sequence 1997
To check this gives us what we want, lets take an example player, say the great Andrew McLeod.
gather(data=adelaide_crows_players, key = "career_start_end", value = "year", c("start_sequence", "end_sequence")) %>%filter(Surname=="McLeod") ## Cap # Surname First.name DOB HT WT Games (W-D-L) Goals ## 1 66 23 McLeod Andrew 1976-08-04 181cm 81kg 340 (185-0-155) 275 ## 2 66 23 McLeod Andrew 1976-08-04 181cm 81kg 340 (185-0-155) 275 ## Seasons Debut Last career_start_end year ## 1 1995-2010 18y 274d 33y 346d start_sequence 1995 ## 2 1995-2010 18y 274d 33y 346d end_sequence 2010
So we can see this gives us what we want, now all we have to do is expand this dataset so we have entries for Andrew McLeod for each year he was with the Crows.
So how do we do that?
gathered_dataset<- gather(data=adelaide_crows_players, key = "career_start_end", value = "year", c("start_sequence", "end_sequence")) gathered_dataset$year<-as.numeric(gathered_dataset$year) ## Warning: NAs introduced by coercion gathered_dataset<-gathered_dataset[complete.cases(gathered_dataset),] adelaide_data_to_join<-gathered_dataset%>% select(Surname, First.name, DOB, career_start_end, year)%>% group_by(Surname, First.name, career_start_end)%>% complete(Surname, First.name, DOB, career_start_end, year=seq(min(year), max(year),by=1))
Now view it
head(adelaide_data_to_join) ## # A tibble: 6 x 5 ## # Groups: Surname, First.name, career_start_end [6] ## Surname First.name DOB career_start_end year ## <chr> <chr> <chr> <chr> <dbl> ## 1 Abernethy " Bruce" 1962-05-10 end_sequence 1992 ## 2 Abernethy " Bruce" 1962-05-10 start_sequence 1991 ## 3 Anderson " Greg" 1966-05-14 end_sequence 1996 ## 4 Anderson " Greg" 1966-05-14 start_sequence 1993 ## 5 Armstrong " Tony" 1989-09-29 end_sequence 2011 ## 6 Armstrong " Tony" 1989-09-29 start_sequence 2010
So it seems OK.
Now the next part is lets take the crows stuff as an example how do we join it back onto the fitzRoy afltables data.
Lets first quickly look at the two datasets we have * dataset in fitzRoy we have already stats
* dataset of the adelaide crows date of births we have just created adelaide_data_to_join
head(stats) ## # A tibble: 6 x 59 ## Season Round Date Local.start.time Venue Attendance Home.team HQ1G ## <dbl> <chr> <date> <int> <chr> <int> <chr> <int> ## 1 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 2 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 3 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 4 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 5 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 6 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, ## # HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, ## # Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, ## # AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>, ## # First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>, ## # Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>, ## # Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>, ## # Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>, ## # Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>, ## # Contested.Possessions <dbl>, Uncontested.Possessions <dbl>, ## # Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>, ## # Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>, ## # Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>, ## # Umpire.4 <chr>, group_id <int> tail(adelaide_data_to_join) ## # A tibble: 6 x 5 ## # Groups: Surname, First.name, career_start_end [6] ## Surname First.name DOB career_start_end year ## <chr> <chr> <chr> <chr> <dbl> ## 1 Wintle " Darryl" 1976-04-07 end_sequence 1999 ## 2 Wintle " Darryl" 1976-04-07 start_sequence 1999 ## 3 Wright " Matthew" 1989-12-14 end_sequence 2015 ## 4 Wright " Matthew" 1989-12-14 start_sequence 2011 ## 5 Young " Will" 1990-08-03 end_sequence 2010 ## 6 Young " Will" 1990-08-03 start_sequence 2010
So remembering what we wanted to join on was First.name
, Surname
, year
and Playing.for
.
Well the column I forgot to add is the Playing.for column.
We can do that as follows:
adelaide_data_to_join$Playing.for<-"Adelaide"
Now we can left_join it onto the stats
dataset.
check_crows_dataset<-left_join(stats, adelaide_data_to_join, by = c("First.name" = "First.name", "Surname" = "Surname", "Season"="year", "Playing.for"="Playing.for")) head(check_crows_dataset) ## # A tibble: 6 x 61 ## Season Round Date Local.start.time Venue Attendance Home.team HQ1G ## <dbl> <chr> <date> <int> <chr> <int> <chr> <int> ## 1 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 2 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 3 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 4 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 5 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 6 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## # ... with 53 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, ## # HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, ## # Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, ## # AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>, ## # First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>, ## # Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>, ## # Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>, ## # Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>, ## # Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>, ## # Contested.Possessions <dbl>, Uncontested.Possessions <dbl>, ## # Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>, ## # Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>, ## # Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>, ## # Umpire.4 <chr>, group_id <int>, DOB <chr>, career_start_end <chr>
This didn’t work, which means that our column values don’t exactly align.
To check why this might occur lets check how the join from our adelaide_data_to_join
works.
adelaide_data_to_join%>% filter(Playing.for=="Adelaide" &Surname=="Jarman")%>%head() ## # A tibble: 4 x 6 ## # Groups: Surname, First.name, career_start_end [4] ## Surname First.name DOB career_start_end year Playing.for ## <chr> <chr> <chr> <chr> <dbl> <chr> ## 1 Jarman " Andrew" 1966-01-14 end_sequence 1996 Adelaide ## 2 Jarman " Andrew" 1966-01-14 start_sequence 1991 Adelaide ## 3 Jarman " Darren" 1967-01-28 end_sequence 2001 Adelaide ## 4 Jarman " Darren" 1967-01-28 start_sequence 1996 Adelaide
So what we can see here is that there seems to be a space between the first letter of the First.name which is why our join isn’t working.
adelaide_data_to_join$First.name<-trimws(adelaide_data_to_join$First.name, which = c( "both","left", "right")) check_crows_dataset<-left_join(stats, adelaide_data_to_join, by = c("First.name" = "First.name", "Surname" = "Surname", "Season"="year", "Playing.for"="Playing.for")) head(check_crows_dataset) ## # A tibble: 6 x 61 ## Season Round Date Local.start.time Venue Attendance Home.team HQ1G ## <dbl> <chr> <date> <int> <chr> <int> <chr> <int> ## 1 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 2 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 3 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 4 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 5 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## 6 1991 1 1991-03-22 1940 Foot… 44902 Adelaide 6 ## # ... with 53 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, ## # HQ3G <int>, HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, ## # Away.team <chr>, AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, ## # AQ3G <int>, AQ3B <int>, AQ4G <int>, AQ4B <int>, Away.score <int>, ## # First.name <chr>, Surname <chr>, ID <dbl>, Jumper.No. <dbl>, ## # Playing.for <chr>, Kicks <dbl>, Marks <dbl>, Handballs <dbl>, ## # Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>, ## # Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>, ## # Frees.For <dbl>, Frees.Against <dbl>, Brownlow.Votes <dbl>, ## # Contested.Possessions <dbl>, Uncontested.Possessions <dbl>, ## # Contested.Marks <dbl>, Marks.Inside.50 <dbl>, One.Percenters <dbl>, ## # Bounces <dbl>, Goal.Assists <dbl>, Time.on.Ground.. <int>, ## # Substitute <int>, Umpire.1 <chr>, Umpire.2 <chr>, Umpire.3 <chr>, ## # Umpire.4 <chr>, group_id <int>, DOB <chr>, career_start_end <chr> check_crows_dataset%>%filter(Playing.for=="Adelaide")%>%select(DOB)%>%head() ## # A tibble: 6 x 1 ## DOB ## <chr> ## 1 1962-03-18 ## 2 1964-01-03 ## 3 1970-01-06 ## 4 1970-01-06 ## 5 1970-06-30 ## 6 1966-01-14
Ok so this seems to work now just need to do this for all teams.
Looking forward to reading anywork looking at age and its impact on team or player performance.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.