Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This week in the AFL, Nathan Fyfe, one of the Brownlow favourites, got rubbed out for an ear tickler to Levi Greenwood.
When looking at the Brownlow odds, it's a bit surprising to see Max Gawn listed as the second favourite. While he is having a great season, historically ruckmen have not polled well. But is Max having a Brownlow-worthy year?
One way to think about Max is to compare him to his best ever year, which I personally think was 2016, when he was named in the All-Australian team.
One of the first things we could do is check how Max Gawn did in the 2016 Brownlow Medal count.
library(fitzRoy)
library(tidyverse)
## -- Attaching packages -------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

df <- fitzRoy::afldata

df %>%
  filter(Season == 2016) %>%
  group_by(First.name, Surname, Playing.for) %>%
  summarise(brownlowvotes = sum(Brownlow.Votes))
## # A tibble: 656 x 4
## # Groups:   First.name, Surname [?]
##    First.name Surname    Playing.for            brownlowvotes
##    <fct>      <fct>      <fct>                          <int>
##  1 Aaron      Francis    Essendon                           0
##  2 Aaron      Hall       Gold Coast                        11
##  3 Aaron      Mullett    North Melbourne                    0
##  4 Aaron      Sandilands Fremantle                          0
##  5 Aaron      Vandenberg Melbourne                          0
##  6 Aaron      Young      Port Adelaide                      5
##  7 Adam       Cooney     Essendon                           4
##  8 Adam       Kennedy    Greater Western Sydney             0
##  9 Adam       Marcon     Richmond                           0
## 10 Adam       Oxley      Collingwood                        0
## # ... with 646 more rows
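The table above lists every player alphabetically, so one player's row still has to be picked out. Here is a minimal sketch of that extra step. The data frame is made up (the players, teams and vote counts are hypothetical, not real Brownlow results); only the column names mirror the ones used above.

```r
library(dplyr)

# Hypothetical per-game votes; column names mirror fitzRoy::afldata as used above.
votes <- data.frame(
  Season         = c(2016, 2016, 2016, 2016, 2015),
  First.name     = c("Player", "Player", "Other", "Player", "Player"),
  Surname        = c("A", "A", "B", "A", "A"),
  Playing.for    = c("Team X", "Team X", "Team Y", "Team X", "Team X"),
  Brownlow.Votes = c(3, 0, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Keep one season and one player before totalling the votes.
player_votes <- votes %>%
  filter(Season == 2016, First.name == "Player", Surname == "A") %>%
  group_by(First.name, Surname, Playing.for) %>%
  summarise(brownlowvotes = sum(Brownlow.Votes)) %>%
  ungroup()

player_votes
```

With the real data you would presumably swap in First.name == "Max" and Surname == "Gawn", assuming that is how his name is stored.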
Visually, how can we compare his 2016 season to his 2018 season so far?
One way to do this is to look at a Cleveland dot plot.
This is something you see quite a bit of online when people are trying to compare many things split by a factor. In our case for this blog post, we are trying to compare many things (various Max Gawn statistics) split by a factor (season: 2016 or 2018).
So how is Max going in 2018 compared to 2016? Let's use dot plots to visualise this.
Step 1 – Get the Data
Luckily we know we are using player-level data, and thankfully the great people over at Footywire have collected stats that include metres gained (MG), intercepts (ITC) and a host of other things.
library(tidyverse)
library(fitzRoy)

df <- fitzRoy::player_stats

# drop the 2018 rows (incomplete: they were downloaded when fitzRoy was first installed)
df <- df %>%
  filter(Season != 2018)

df1 <- fitzRoy::get_footywire_stats(9514:9611)  # match IDs up to the end of round 11
## Getting data from footywire.com
## Finished getting data

df2 <- rbind(df, df1)  # stacks the datasets on top of each other
So what steps are we taking here?
- First we load our packages
library(tidyverse)
library(fitzRoy)
- Then we get all the player data from 2010-2018 from Footywire
df<-fitzRoy::player_stats
- BUT depending on when you first installed fitzRoy, you might not have all the up-to-date data.
- So to make sure we have all the up-to-date data, we delete the rows that have Season 2018
df<-df%>%filter(Season != 2018)
- We then go to this page and click on the first game and the last game so far in 2018 to get the unique game IDs. Then we scrape them fresh using fitzRoy's scraper function
df1<-fitzRoy::get_footywire_stats(9514:9611)
- We then just stack the datasets on top of each other using rbind
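The drop-then-stack idea in the last steps can be tried without scraping anything. The sketch below uses two small made-up data frames as stand-ins for the bundled data and the fresh scrape; the values are hypothetical. Note that rbind expects both data frames to have the same column names.

```r
# Hypothetical stand-ins for the bundled data (old) and a fresh scrape (new).
old <- data.frame(
  Season = c(2017, 2018),
  Player = c("Player A", "Player A"),
  SC     = c(100, 90),
  stringsAsFactors = FALSE
)
new <- data.frame(
  Season = c(2018, 2018),
  Player = c("Player A", "Player B"),
  SC     = c(110, 95),
  stringsAsFactors = FALSE
)

# Drop the stale 2018 rows, then stack the fresh ones underneath.
old_clean <- old[old$Season != 2018, ]
combined  <- rbind(old_clean, new)

combined
```

Dropping the stale rows first matters: stacking without it would leave the old 2018 games in the data twice.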
Step 2 – Filter out Max Gawn and his 2016, 2018 Seasons
df3 <- df2 %>%
  filter(Season %in% c(2016, 2018)) %>%
  filter(Player == "Max Gawn")
- We had 2 basic steps when filtering. The first step was to keep only the 2016 and 2018 seasons
filter(Season %in% c(2016,2018))
- and the next step was to keep only the player Max Gawn
filter(Player =="Max Gawn")
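To see what the two filters keep, here is a minimal sketch on a made-up data frame (the players and seasons are hypothetical). It uses %in% to match against several seasons and == to match a single player.

```r
library(dplyr)

# Hypothetical rows: one per player-season.
stats <- data.frame(
  Season = c(2015, 2016, 2017, 2018, 2016),
  Player = c("Player A", "Player A", "Player A", "Player A", "Player B"),
  stringsAsFactors = FALSE
)

kept <- stats %>%
  filter(Season %in% c(2016, 2018)) %>%  # keep rows from either season
  filter(Player == "Player A")           # then keep a single player

kept
```

Using %in% rather than == for the seasons is deliberate: Season == c(2016, 2018) would recycle the two values down the column and compare them row by row, which is not what we want here.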
Step 3 – Summarise Max Gawn's averages for each season
df3 <- df2 %>%
  filter(Season %in% c(2016, 2018)) %>%
  filter(Player == "Max Gawn") %>%
  group_by(Season) %>%
  # one average per statistic, per season
  summarise(ave.ho  = mean(HO),
            ave.CM  = mean(CM),
            ave.SC  = mean(SC),
            ave.MG  = mean(MG),
            ave.ITC = mean(ITC),
            ave.AF  = mean(AF),
            ave.Mi5 = mean(MI5))
So remember, to get summary measures by something we have to group_by that something, which in this case is Season.
The summary measure I am thinking of here is the mean, so let's just start by looking at that. But the mean of what variables?
Well, one way to decide is by using names(df2). This will give you the names of all your columns; another way to think about it is that it's a quick way to list the variables.
names(df2)
##  [1] "Date"           "Season"         "Round"          "Venue"
##  [5] "Player"         "Team"           "Opposition"     "Status"
##  [9] "GA"             "Match_id"       "CP"             "UP"
## [13] "ED"             "DE"             "CM"             "MI5"
## [17] "One.Percenters" "BO"             "TOG"            "K"
## [21] "HB"             "D"              "M"              "G"
## [25] "B"              "T"              "HO"             "I50"
## [29] "CL"             "CG"             "R50"            "FF"
## [33] "FA"             "AF"             "SC"             "CCL"
## [37] "SCL"            "SI"             "MG"             "TO"
## [41] "ITC"            "T5"
From there, I’m going to pick the following variables:
- Hitouts (HO)
- Contested marks (CM)
- Metres gained (MG)
- Intercepts (ITC)
- Supercoach scores (SC)
- AFL Fantasy scores (AF)
- Marks inside 50 (MI5)
and I want to summarise them. summarise works as follows:
- summarise(new_variablename=summarymeasure(variable))
- example:
new_variablename = ave.ho
summarymeasure = mean
variable = HO
- putting it all together:
ave.ho=mean(HO)
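The group_by and summarise pattern above can be sketched on a tiny made-up data frame. The game-by-game numbers are hypothetical; only the column names HO and CM come from the post.

```r
library(dplyr)

# Hypothetical game-by-game stats for one player across two seasons.
games <- data.frame(
  Season = c(2016, 2016, 2018, 2018),
  HO     = c(40, 44, 50, 46),
  CM     = c(2, 3, 1, 3)
)

# One row per season, one new column per summary measure.
averages <- games %>%
  group_by(Season) %>%
  summarise(ave.ho = mean(HO), ave.CM = mean(CM))

averages
```

Each name = mean(variable) pair inside summarise adds one column, so adding another statistic is just a matter of adding another pair.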
Step 4 – Go from wide to long data
What is wide data and what is long data?
The best way to see the difference, in my opinion, is to look at the same dataset both ways.
So first, wide:
df3
## # A tibble: 2 x 8
##   Season ave.ho ave.CM ave.SC ave.MG ave.ITC ave.AF ave.Mi5
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1   2016   42.2   2.09   119.   149.    2.91   106.       1
## 2   2018   48.5   2.09   130.   175.    3.45   116.       1

df4 <- gather(df3, variables, values, -Season)
Next let's look at long:
df4
## # A tibble: 14 x 3
##    Season variables values
##     <dbl> <chr>      <dbl>
##  1   2016 ave.ho     42.2
##  2   2018 ave.ho     48.5
##  3   2016 ave.CM      2.09
##  4   2018 ave.CM      2.09
##  5   2016 ave.SC    119.
##  6   2018 ave.SC    130.
##  7   2016 ave.MG    149.
##  8   2018 ave.MG    175.
##  9   2016 ave.ITC     2.91
## 10   2018 ave.ITC     3.45
## 11   2016 ave.AF    106.
## 12   2018 ave.AF    116.
## 13   2016 ave.Mi5     1
## 14   2018 ave.Mi5     1
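The wide-to-long step can be reproduced on a cut-down copy of the wide table (two of the columns, with the rounded values as printed above). The sketch also shows pivot_longer, which is not used in this post; it is the newer tidyr function for the same job, and it assumes tidyr 1.0.0 or later.

```r
library(tidyr)

# Two columns of the wide table, using the rounded values printed above.
wide <- data.frame(
  Season = c(2016, 2018),
  ave.ho = c(42.2, 48.5),
  ave.CM = c(2.09, 2.09)
)

# gather(), as used in the post: every column except Season is stacked.
long_gather <- gather(wide, variables, values, -Season)

# pivot_longer(), the newer equivalent (an alternative, not the post's method).
long_pivot <- pivot_longer(wide, cols = -Season,
                           names_to = "variables", values_to = "values")

long_gather
long_pivot
```

Both give one row per season and variable. The row order differs (gather stacks column by column, pivot_longer goes row by row), which makes no difference to the plots.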
Step 5 – Give it the ggplot2 treatment
df4 %>%
  ggplot(aes(x = values, y = variables)) +
  geom_point(aes(colour = Season))
Step 6 – Change colour to a factor, add in a connector line
df4 %>%
  ggplot(aes(x = values, y = variables)) +
  geom_point(aes(colour = as.factor(Season))) +  # Season as a factor gives two distinct colours
  geom_line(aes(group = variables))              # one connecting line per variable
Step 7 – Have a think, does this graph give you what you want?
To be honest, it's quite hard to see the differences for variables like ITC when they are compared on the same axis as the differences in SC scores, for example. So when looking at multiple variables, a dot plot in this example doesn't make much sense, as the range of likely values varies a lot between variables, i.e. we wouldn't expect any AFL player to have as many intercepts as they get SC points.
So instead let's try facet_wrap
df4 %>%
  ggplot(aes(x = as.factor(Season), y = values)) +
  geom_point() +
  facet_wrap(~variables, scales = "free")  # each panel gets its own axis range
Here we use facet_wrap exactly like we have been doing in previous blog posts. We use the argument scales="free" so that the graph for each variable can be on an appropriate scale. We are probably not interested in absolute changes, i.e. intercepts going from 3 to 4 should be seen differently to his Supercoach score going from 123 to 124.
So what is missing here is that when we are doing dot plots, we want our variables to be on the same scale.
To think about this a bit more clearly, let's come up with another example for comparison.
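To make the point about absolute versus relative changes concrete, here is a small worked sketch. It uses the rounded ITC and SC season averages from the wide table printed in Step 4. Percentage change is just one way to put variables on a common footing; it is an illustration, not something done elsewhere in this post.

```r
# Season averages as printed (rounded) in the wide table in Step 4.
ave_2016 <- c(ITC = 2.91, SC = 119)
ave_2018 <- c(ITC = 3.45, SC = 130)

abs_change <- ave_2018 - ave_2016          # change in each stat's own units
pct_change <- 100 * abs_change / ave_2016  # change relative to the 2016 level

round(abs_change, 2)
round(pct_change, 1)
```

In absolute terms the Supercoach average moves far more (11 points against about half an intercept), but relative to 2016 the intercepts change is the bigger one (roughly 19% against roughly 9%).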
A better example of cleveland dot plots
When looking at example plots like the one above, I like to think about how I can do the same thing but for footy, and how my variables/values/observations would change.
So if I wanted to do the same plot but for footy ruckmen, I would change the city axis to be ruckmen and the gender variable to be the years I want to compare, and to measure them on the same values I would use, say, SC scores.
So let's get the top 10 ruckmen in 2016 and see how they stack up against their versions of themselves today. A quick way to get actual ruckmen is to set the number of HOs in the filter to an arbitrary number you think captures the main ruck. I went with 15, but you could choose a higher or lower number yourself.
df2 %>%
  filter(Season == 2016) %>%
  filter(HO > 15) %>%
  group_by(Player) %>%
  summarise(ave.SC = mean(SC)) %>%
  arrange(desc(ave.SC))
## # A tibble: 47 x 2
##    Player            ave.SC
##    <chr>              <dbl>
##  1 Max Gawn           119.
##  2 Todd Goldstein     109.
##  3 Nicholas Naitanui  106.
##  4 Jackson Trengove    99.6
##  5 Shane Mumford       97.3
##  6 Brodie Grundy       94.1
##  7 Archie Smith        93
##  8 Scott Lycett        92.5
##  9 Sam Jacobs          91.0
## 10 Stefan Martin       90.3
## # ... with 37 more rows
This is where being a nuffie comes in handy and you can overlay your domain expertise. We hopefully know that Shane Mumford has retired, so he probably shouldn't be used. Archie Smith's average is based off three games.
We can check that using the script below.
df2 %>%
  filter(Season == 2016) %>%
  filter(Player == "Archie Smith")
##         Date Season    Round         Venue       Player     Team
## 1 2016-07-30   2016 Round 19         Gabba Archie Smith Brisbane
## 2 2016-08-06   2016 Round 20 Adelaide Oval Archie Smith Brisbane
## 3 2016-08-13   2016 Round 21         Gabba Archie Smith Brisbane
##      Opposition Status GA Match_id CP UP ED   DE CM MI5 One.Percenters BO
## 1 Port Adelaide   Home  0     6329 17  5 15 78.9  1   0              1  0
## 2      Adelaide   Away  0     6339  9  5  9 60.0  1   0              4  0
## 3       Carlton   Home  0     6344  4  2  4 66.7  0   1              3  0
##   TOG K HB  D M G B T HO I50 CL CG R50 FF FA AF SC CCL SCL SI  MG TO ITC
## 1  80 2 17 19 1 0 0 5 30   2  9  1   0  6  1 96 96   8   1  8  65  1   1
## 2  83 6  9 15 1 0 0 3 32   0  3  1   3  1  0 84 90   2   1  1 200  6   2
## 3  63 2  4  6 1 1 1 2  8   0  1  3   0  1  2 35 52   1   0  4  26  1   0
##   T5
## 1  0
## 2  0
## 3  0
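Rather than checking players one at a time, the number of games behind each average can be counted in the same summarise call with n(). This is a minimal sketch on made-up data; the players, scores and the three-game cut-off are all hypothetical.

```r
library(dplyr)

# Hypothetical game-by-game Supercoach scores.
ruck_games <- data.frame(
  Player = c("Ruck A", "Ruck A", "Ruck A", "Ruck A", "Ruck B", "Ruck B"),
  SC     = c(120, 110, 100, 130, 96, 90),
  stringsAsFactors = FALSE
)

ruck_summary <- ruck_games %>%
  group_by(Player) %>%
  summarise(ave.SC = mean(SC), games = n()) %>%  # n() counts rows per player
  filter(games >= 3) %>%                         # hypothetical cut-off; pick your own
  arrange(desc(ave.SC))

ruck_summary
```

Keeping the games column next to the average makes small samples easy to spot before deciding who to compare.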
So let's compare the following ruckmen:
- Max Gawn
- Todd Goldstein
- Nicholas Naitanui
- Sam Jacobs
- Stefan Martin
- Brodie Grundy
- Scott Lycett
To do this using the script we had before, we would go:
df2 %>%
  filter(Season %in% c(2016, 2018)) %>%
  filter(Player %in% c("Max Gawn", "Todd Goldstein", "Nicholas Naitanui",
                       "Sam Jacobs", "Stefan Martin", "Brodie Grundy",
                       "Scott Lycett")) %>%
  group_by(Season, Player) %>%
  summarise(ave.SC = mean(SC)) %>%
  ggplot(aes(x = ave.SC, y = Player)) +
  geom_point(aes(colour = as.factor(Season))) +
  geom_line(aes(group = Player))
Now that we have a template, we could quickly compare a few variables if we wanted by just copying and pasting the above script and changing, say, SC to CM.
df2 %>%
  filter(Season %in% c(2016, 2018)) %>%
  filter(Player %in% c("Max Gawn", "Todd Goldstein", "Nicholas Naitanui",
                       "Sam Jacobs", "Stefan Martin", "Brodie Grundy",
                       "Scott Lycett")) %>%
  group_by(Season, Player) %>%
  summarise(ave.CM = mean(CM)) %>%
  ggplot(aes(x = ave.CM, y = Player)) +
  geom_point(aes(colour = as.factor(Season))) +
  geom_line(aes(group = Player))
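If the copy and paste gets tedious, the template could be wrapped in a small function. The sketch below is one possible way to do that, not part of the original script: compare_seasons is a hypothetical helper name, the toy data is made up, and the {{ }} syntax assumes a reasonably recent dplyr (1.0 or later).

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: average one stat per player and season, then draw the dot plot.
compare_seasons <- function(data, players, seasons, stat) {
  data %>%
    filter(Season %in% seasons, Player %in% players) %>%
    group_by(Season, Player) %>%
    summarise(average = mean({{ stat }})) %>%  # {{ }} passes the bare column name through
    ungroup() %>%
    ggplot(aes(x = average, y = Player)) +
    geom_line(aes(group = Player)) +               # connector drawn first, under the points
    geom_point(aes(colour = as.factor(Season))) +
    labs(colour = "Season")
}

# Made-up game-by-game data, only to show the call.
toy <- data.frame(
  Season = c(2016, 2016, 2018, 2018, 2016, 2018),
  Player = c("Ruck A", "Ruck A", "Ruck A", "Ruck A", "Ruck B", "Ruck B"),
  SC     = c(110, 120, 125, 135, 95, 100),
  CM     = c(2, 3, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

p_sc <- compare_seasons(toy, c("Ruck A", "Ruck B"), c(2016, 2018), SC)
p_cm <- compare_seasons(toy, c("Ruck A", "Ruck B"), c(2016, 2018), CM)
```

Printing p_sc or p_cm draws the plot. With the real data, the first argument would be df2 and the player vector would be the list of ruckmen above.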
Hopefully you now have a quickfire template to go off and explore AFL ruckmen yourself.
Do you think Max Gawn should be second fav?
As always hit me up on twitter if any Qs #makemeauseR