Max Gawn a Brownlow Fancy

[This article was first published on Analysis of AFL, and kindly contributed to R-bloggers.]

This week in the AFL, Nathan Fyfe, one of the Brownlow favourites, got rubbed out for an ear tickler on Levi Greenwood.

When looking at the Brownlow odds, it's a bit surprising to see Max Gawn listed as the second favourite. While he is having a great season, historically ruckmen have not polled well. But is Max having a Brownlow-worthy year?

Max Gawn Brownlow Odds 07-06-2018.

One way to think about Max is to compare him to his best ever year, which I personally think was 2016, when he was named in the All-Australian team.

One of the first things we could do is check how Max Gawn did in the 2016 Brownlow Medal count.

library(fitzRoy)
library(tidyverse)
## -- Attaching packages -------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df<-fitzRoy::afldata 
df%>% 
  filter(Season==2016)%>% 
  group_by(First.name , Surname, Playing.for)%>% summarise(brownlowvotes=sum(Brownlow.Votes))
## # A tibble: 656 x 4
## # Groups:   First.name, Surname [?]
##    First.name Surname    Playing.for            brownlowvotes
##    <fct>      <fct>      <fct>                          <int>
##  1 Aaron      Francis    Essendon                           0
##  2 Aaron      Hall       Gold Coast                        11
##  3 Aaron      Mullett    North Melbourne                    0
##  4 Aaron      Sandilands Fremantle                          0
##  5 Aaron      Vandenberg Melbourne                          0
##  6 Aaron      Young      Port Adelaide                      5
##  7 Adam       Cooney     Essendon                           4
##  8 Adam       Kennedy    Greater Western Sydney             0
##  9 Adam       Marcon     Richmond                           0
## 10 Adam       Oxley      Collingwood                        0
## # ... with 646 more rows
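
The printout above is sorted alphabetically, so only the first ten players show up. Here is a minimal sketch of how you could narrow the same summary down to one player. It uses a small made-up data frame that only mimics the column names from the output above; the players, teams and vote values are invented for illustration. On the real data you would presumably filter on First.name == "Max" and Surname == "Gawn", assuming that is how his name is stored.

```r
library(dplyr)

# Toy stand-in for fitzRoy::afldata: one row per player per game.
# Every name and vote value here is invented for illustration.
toy_votes <- data.frame(
  Season         = c(2016, 2016, 2016, 2016, 2015),
  First.name     = c("Player", "Player", "Other", "Other", "Player"),
  Surname        = c("One", "One", "Two", "Two", "One"),
  Playing.for    = c("Team A", "Team A", "Team B", "Team B", "Team A"),
  Brownlow.Votes = c(3, 1, 0, 2, 3)
)

one_player <- toy_votes %>%
  filter(Season == 2016) %>%                          # keep one season
  group_by(First.name, Surname, Playing.for) %>%      # one group per player
  summarise(brownlowvotes = sum(Brownlow.Votes)) %>%  # season vote total
  ungroup() %>%
  filter(First.name == "Player", Surname == "One")    # keep a single player

one_player
```

The only new step compared with the script above is the final filter, which runs after the votes have been totalled.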

Visually, how can we compare his 2016 season to his 2018 season so far?

One way to do this is to look at a Cleveland dot plot.

Cleveland Dot Plot Example.

This is something you see quite a bit of online when people are trying to compare many things but split by a factor. So in our case for this blog post, we are trying to compare many things (various Max Gawn statistics) split by a factor (season: 2016 or 2018).

So how is Max going in 2018 compared to 2016? Let's use these dot plots to visualise it.

Step 1 – Get the Data

Luckily we know we are using player-level data, and thankfully the great people over at footywire have collected stats that include metres gained (MG), intercepts (ITC) and a host of other things.

library(tidyverse)
library(fitzRoy)
df<-fitzRoy::player_stats
df<-df%>%
  filter(Season != 2018) # drop the 2018 rows (incomplete: they were downloaded when fitzRoy was first installed)
df1<-fitzRoy::get_footywire_stats(9514:9611) #(end round 11)
## Getting data from footywire.com
## Finished getting data
df2<-rbind(df, df1) #stacks the datasets on top of each other

So what steps are we taking here?

  • First we load our packages: library(tidyverse) and library(fitzRoy)
  • Then we get all the player data from 2010-2018 from footywire: df<-fitzRoy::player_stats
  • BUT depending on when you first installed fitzRoy, you might not have all the up-to-date data.
  • So to make sure we have all the up-to-date data, we delete the rows that have Season 2018: df<-df%>%filter(Season != 2018)
  • We then go to this page to click on the first game and last game so far in 2018 to get the unique game IDs. Then we scrape them freshly using fitzRoy's scraper function: df1<-fitzRoy::get_footywire_stats(9514:9611)
  • We then just stack the datasets on top of each other using rbind: df2<-rbind(df, df1)
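
The drop-then-stack pattern in the steps above can be sketched without touching footywire at all. This is a toy version under stated assumptions: `bundled` stands in for fitzRoy::player_stats and `fresh` for the newly scraped games, both names and all values are invented, and the real data has many more columns.

```r
library(dplyr)

# Toy stand-ins with invented values.
# `bundled` plays the role of the data shipped with the package,
# including one stale 2018 row.
bundled <- data.frame(
  Season = c(2017, 2017, 2018),
  Player = c("A", "B", "A"),
  SC     = c(100, 90, 80)
)

# `fresh` plays the role of the newly scraped 2018 games.
fresh <- data.frame(
  Season = c(2018, 2018),
  Player = c("A", "B"),
  SC     = c(110, 95)
)

refreshed <- bundled %>%
  filter(Season != 2018) %>%  # drop the possibly stale 2018 rows
  rbind(fresh)                # stack the fresh 2018 rows underneath

refreshed
```

rbind needs both data frames to have the same column names, which is why the fresh scrape has to come from the same source as the bundled data.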

Step 2 – Filter down to Max Gawn and his 2016 and 2018 seasons

df3<-df2%>%
  filter(Season %in% c(2016,2018))%>%
  filter(Player =="Max Gawn") 
  • We have 2 basic steps when filtering: the first keeps only the 2016 and 2018 seasons with filter(Season %in% c(2016,2018)), and the next keeps only the player Max Gawn with filter(Player =="Max Gawn")
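
Here is a small sketch of those two filters on a made-up data frame, so you can see exactly which rows survive. The hitout values and the second player are invented for illustration.

```r
library(dplyr)

# Toy per-game rows with invented hitout numbers.
toy <- data.frame(
  Season = c(2015, 2016, 2017, 2018, 2016, 2018),
  Player = c("Max Gawn", "Max Gawn", "Max Gawn", "Max Gawn",
             "Someone Else", "Someone Else"),
  HO     = c(30, 40, 35, 45, 20, 25)
)

kept <- toy %>%
  filter(Season %in% c(2016, 2018)) %>%  # keep rows from either season
  filter(Player == "Max Gawn")           # then keep a single player

kept
```

The reason for %in% rather than == with two seasons is that == compares element by element and recycles the shorter vector, so it would not test each row against both years.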

Step 3 – Summarise Max Gawn's averages for each season

df3<-df2%>%
  filter(Season %in% c(2016,2018))%>%
  filter(Player =="Max Gawn") %>%
  group_by(Season)%>%
  summarise(ave.ho=mean(HO),
            ave.CM=mean(CM),
            ave.SC=mean(SC),
            ave.MG=mean(MG),
            ave.ITC=mean(ITC), 
            ave.AF=mean(AF),
            ave.Mi5=mean(MI5))

Remember, to get summary measures by something we have to group_by that something, which in this case is Season. The summary measure I have in mind is the mean, to start with, but of which variables?

One way to find out is names(df2), which gives you the names of all your columns. Another way to think about it: it's a quick way to list the variables.

names(df2)
##  [1] "Date"           "Season"         "Round"          "Venue"         
##  [5] "Player"         "Team"           "Opposition"     "Status"        
##  [9] "GA"             "Match_id"       "CP"             "UP"            
## [13] "ED"             "DE"             "CM"             "MI5"           
## [17] "One.Percenters" "BO"             "TOG"            "K"             
## [21] "HB"             "D"              "M"              "G"             
## [25] "B"              "T"              "HO"             "I50"           
## [29] "CL"             "CG"             "R50"            "FF"            
## [33] "FA"             "AF"             "SC"             "CCL"           
## [37] "SCL"            "SI"             "MG"             "TO"            
## [41] "ITC"            "T5"

From there, I’m going to pick the following variables:

  • Hitouts (HO)
  • Contested Marks (CM)
  • Metres Gained (MG)
  • Intercepts (ITC)
  • Supercoach scores (SC)
  • AFL Fantasy scores (AF)
  • Marks inside 50 (MI5)

I want to summarise them, and summarise works as follows:

  • summarise(new_variablename=summarymeasure(variable))
  • example
  • new_variablename=ave.ho
  • summarymeasure = mean
  • variable = HO
  • putting it all together ave.ho=mean(HO)
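
The recipe above can be tried on a tiny made-up data frame where the averages are easy to check by hand. The numbers are invented, not Max's actual games.

```r
library(dplyr)

# Four invented games, two per season.
toy_games <- data.frame(
  Season = c(2016, 2016, 2018, 2018),
  HO     = c(40, 44, 50, 46),
  CM     = c(2, 3, 1, 3)
)

toy_avgs <- toy_games %>%
  group_by(Season) %>%            # one group per season
  summarise(ave.ho = mean(HO),    # new_variablename = summarymeasure(variable)
            ave.CM = mean(CM))

toy_avgs
```

Each extra argument to summarise adds one more column to the result, which is how the Step 3 script ends up with one column per statistic.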

Step 4 – Go from wide to long data

What is wide data and what is long data?

The best way in my opinion is to look at the same dataset but both ways.

So first wide:

df3
## # A tibble: 2 x 8
##   Season ave.ho ave.CM ave.SC ave.MG ave.ITC ave.AF ave.Mi5
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1   2016   42.2   2.09   119.   149.    2.91   106.       1
## 2   2018   48.5   2.09   130.   175.    3.45   116.       1
df4<-gather(df3,variables, values, -Season)   

Next let's look at long:

df4
## # A tibble: 14 x 3
##    Season variables values
##     <dbl> <chr>      <dbl>
##  1   2016 ave.ho     42.2 
##  2   2018 ave.ho     48.5 
##  3   2016 ave.CM      2.09
##  4   2018 ave.CM      2.09
##  5   2016 ave.SC    119.  
##  6   2018 ave.SC    130.  
##  7   2016 ave.MG    149.  
##  8   2018 ave.MG    175.  
##  9   2016 ave.ITC     2.91
## 10   2018 ave.ITC     3.45
## 11   2016 ave.AF    106.  
## 12   2018 ave.AF    116.  
## 13   2016 ave.Mi5     1   
## 14   2018 ave.Mi5     1
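
To see what gather is doing, here is a minimal sketch on a two-column toy version of the wide table. The values are invented; the column names `variables` and `values` match the ones used above.

```r
library(tidyr)

# Toy wide table: one row per season, one column per statistic.
wide <- data.frame(
  Season = c(2016, 2018),
  ave.ho = c(42, 48),
  ave.CM = c(2.5, 2)
)

# Wide to long: every column except Season is stacked into
# a name column (variables) and a value column (values).
long <- gather(wide, variables, values, -Season)

long
```

Two seasons times two statistics gives four rows, the same way two seasons times seven statistics gave the 14 rows above. In tidyr 1.0.0 and later, pivot_longer(wide, -Season, names_to = "variables", values_to = "values") is the newer way to do the same reshape, though the rows come back in a different order.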

Step 5 – Give it the ggplot2 treatment

df4%>%   
  ggplot(aes(x=values, y=variables)) +geom_point(aes(colour=Season))

Step 6 – Change colour to a factor, add in a connector line

df4%>%   
  ggplot(aes(x=values, y=variables)) +geom_point(aes(colour=as.factor(Season)))+ geom_line(aes(group = variables))

Step 7 – Have a think, does this graph give you what you want?

To be honest, it's quite hard to see the differences for variables like ITC when they share an axis with something like SC scores. So when looking at multiple variables, a dot plot doesn't make much sense in this example, as the range of likely values varies a lot between variables, i.e. we wouldn't expect any AFL player to have as many intercepts as they get SC points.

So instead let's try facet_wrap

df4%>%   
    ggplot(aes(x=as.factor(Season), y=values)) +
  geom_point()+
  facet_wrap(~variables,scales = "free")

Here we use facet_wrap exactly like we have been doing in previous blog posts. We use the argument scales="free" so that the graph for each variable can be on an appropriate scale. We are probably not interested in absolute changes, i.e. intercepts going from 3 to 4 is a very different thing to his Supercoach score going from 123 to 124.
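
A quick sketch of what scales="free" changes, using a made-up long table with one small-valued and one large-valued variable. The values are invented; only the column names follow df4.

```r
library(ggplot2)

# Toy long table: a small-valued and a large-valued variable.
long <- data.frame(
  Season    = rep(c(2016, 2018), times = 2),
  variables = rep(c("ave.ITC", "ave.SC"), each = 2),
  values    = c(3, 4, 120, 130)
)

base <- ggplot(long, aes(x = as.factor(Season), y = values)) +
  geom_point()

# Default: every panel shares one y axis, so ave.ITC is squashed near zero.
p_fixed <- base + facet_wrap(~variables)

# scales = "free": each panel gets its own axes.
p_free <- base + facet_wrap(~variables, scales = "free")
```

Print p_fixed and p_free one after the other to compare. If you only want the y axis to vary between panels, scales = "free_y" is the narrower option.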

So what is missing here is that, for dot plots, we want our variables to be on the same scale.
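
To put some rough numbers on the scale problem, here is a small sketch using the rounded ITC and SC averages from the df3 printout earlier. Percent change is just one possible way to compare variables that live on different scales; it is my assumption for illustration and not something the plots in this post use.

```r
# Rounded 2016 and 2018 averages copied from the df3 printout above.
ave_2016 <- c(ITC = 2.91, SC = 119)
ave_2018 <- c(ITC = 3.45, SC = 130)

abs_change <- ave_2018 - ave_2016          # change in raw units
pct_change <- 100 * abs_change / ave_2016  # change relative to 2016

round(abs_change, 2)
round(pct_change, 1)
```

In raw units the SC change dwarfs the ITC change, but relative to the 2016 baseline the ITC change is the bigger of the two. Because the inputs are already rounded, treat the percentages as approximate.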

To think about this a bit more clearly, let's come up with another example for comparison.

A better example of Cleveland dot plots

When looking at example plots like the one above, I like to think about how I can do the same thing but for footy, and how my variables/values/observations would change.

So if I wanted to do the same plot but for footy ruckmen, I would change the city axis to be ruckmen and the gender variable to be the years I want to compare, and to keep everything measured on the same scale I would use, say, SC scores.

So let's get the top 10 ruckmen in 2016 and see how they stack up against their versions of themselves today. A quick way to get actual ruckmen is to set the number of HOs in the filter to an arbitrary number you think captures the main ruck. I went with 15, but you could choose a higher or lower number yourselves.

df2%>%
  filter(Season==2016)%>%
  filter(HO>15)%>%
  group_by(Player)%>%
  summarise(ave.SC=mean(SC))%>%
  arrange(desc(ave.SC))
## # A tibble: 47 x 2
##    Player            ave.SC
##    <chr>              <dbl>
##  1 Max Gawn           119. 
##  2 Todd Goldstein     109. 
##  3 Nicholas Naitanui  106. 
##  4 Jackson Trengove    99.6
##  5 Shane Mumford       97.3
##  6 Brodie Grundy       94.1
##  7 Archie Smith        93  
##  8 Scott Lycett        92.5
##  9 Sam Jacobs          91.0
## 10 Stefan Martin       90.3
## # ... with 37 more rows

This is where being a nuffie comes in handy and you can overlay your domain expertise. We hopefully know that Shane Mumford has retired, so he probably shouldn't be used. Archie Smith played just three games in 2016, so his average rests on a tiny sample.

We can check that by using the script below:

df2%>%
  filter(Season==2016)%>%
  filter(Player=="Archie Smith")
##         Date Season    Round         Venue       Player     Team
## 1 2016-07-30   2016 Round 19         Gabba Archie Smith Brisbane
## 2 2016-08-06   2016 Round 20 Adelaide Oval Archie Smith Brisbane
## 3 2016-08-13   2016 Round 21         Gabba Archie Smith Brisbane
##      Opposition Status GA Match_id CP UP ED   DE CM MI5 One.Percenters BO
## 1 Port Adelaide   Home  0     6329 17  5 15 78.9  1   0              1  0
## 2      Adelaide   Away  0     6339  9  5  9 60.0  1   0              4  0
## 3       Carlton   Home  0     6344  4  2  4 66.7  0   1              3  0
##   TOG K HB  D M G B T HO I50 CL CG R50 FF FA AF SC CCL SCL SI  MG TO ITC
## 1  80 2 17 19 1 0 0 5 30   2  9  1   0  6  1 96 96   8   1  8  65  1   1
## 2  83 6  9 15 1 0 0 3 32   0  3  1   3  1  0 84 90   2   1  1 200  6   2
## 3  63 2  4  6 1 1 1 2  8   0  1  3   0  1  2 35 52   1   0  4  26  1   0
##   T5
## 1  0
## 2  0
## 3  0
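
Rather than checking suspicious names one at a time, one option is to carry a game count along with the average. This is a sketch on made-up data: the players, the numbers and the cut-off of 3 games are all invented, and the cut-off is as arbitrary as the HO>15 one.

```r
library(dplyr)

# Invented per-game rows for two hypothetical rucks.
toy <- data.frame(
  Player = c("Ruck A", "Ruck A", "Ruck A", "Ruck A", "Ruck B", "Ruck B"),
  HO     = c(30, 35, 28, 40, 32, 8),
  SC     = c(110, 120, 100, 130, 96, 52)
)

ranked <- toy %>%
  filter(HO > 15) %>%              # same main-ruck filter as above
  group_by(Player) %>%
  summarise(games  = n(),          # how many games the average is based on
            ave.SC = mean(SC)) %>%
  filter(games >= 3) %>%           # arbitrary minimum sample size
  arrange(desc(ave.SC))

ranked
```

Note that n() counts only the games that passed the HO filter, so a player can have fewer games in the average than they actually played.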

So let's compare the following ruckmen:

  • Max Gawn
  • Todd Goldstein
  • Nicholas Naitanui
  • Sam Jacobs
  • Stefan Martin
  • Brodie Grundy
  • Scott Lycett

To do this using the script we had before, we would go:

df2%>%
  filter(Season %in% c(2016,2018))%>%
  filter(Player %in% c("Max Gawn","Todd Goldstein",
                       "Nicholas Naitanui","Sam Jacobs",
                       "Stefan Martin","Brodie Grundy",
                       "Scott Lycett")) %>%
  group_by(Season, Player)%>%
  summarise(ave.SC=mean(SC))%>%
  ggplot(aes(x=ave.SC, y=Player)) +
  geom_point(aes(colour=as.factor(Season)))+
  geom_line(aes(group = Player))

Now that we have a template, we could quickly compare a few variables by just copying and pasting the above script and changing, say, SC to CM:

df2%>%
  filter(Season %in% c(2016,2018))%>%
  filter(Player %in% c("Max Gawn","Todd Goldstein",
                       "Nicholas Naitanui","Sam Jacobs",
                       "Stefan Martin","Brodie Grundy",
                       "Scott Lycett")) %>%
  group_by(Season, Player)%>%
  summarise(ave.CM=mean(CM))%>%
  ggplot(aes(x=ave.CM, y=Player)) +
  geom_point(aes(colour=as.factor(Season)))+
  geom_line(aes(group = Player))
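
If the copy and paste gets tedious, the template could be wrapped in a function that takes the statistic's column name as a string. This is a sketch under stated assumptions: season_means and ruck_dot_plot are hypothetical helper names I made up, the data is invented, and the .data pronoun needs reasonably recent versions of dplyr and ggplot2.

```r
library(dplyr)
library(ggplot2)

# Invented per-game numbers for two hypothetical rucks; not real AFL data.
toy <- data.frame(
  Season = c(2016, 2016, 2018, 2018, 2016, 2016, 2018, 2018),
  Player = rep(c("Ruck A", "Ruck B"), each = 4),
  SC     = c(100, 120, 130, 110, 90, 110, 80, 100),
  CM     = c(2, 4, 1, 3, 1, 1, 2, 0)
)

# Hypothetical helper: average one statistic by Season and Player.
# `stat` is the column name as a string, e.g. "SC" or "CM".
season_means <- function(data, stat) {
  data %>%
    group_by(Season, Player) %>%
    summarise(ave = mean(.data[[stat]])) %>%
    ungroup()
}

# Hypothetical helper: Cleveland dot plot of those averages,
# one line per player with one point per season.
ruck_dot_plot <- function(data, stat) {
  season_means(data, stat) %>%
    ggplot(aes(x = ave, y = Player)) +
    geom_line(aes(group = Player)) +
    geom_point(aes(colour = as.factor(Season))) +
    labs(x = paste("Average", stat), colour = "Season")
}

p_sc <- ruck_dot_plot(toy, "SC")
p_cm <- ruck_dot_plot(toy, "CM")
```

On the real data you would first filter df2 down to the seasons and players you care about, then pass the result in place of toy.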

Hopefully you now have a quickfire template to go off and explore AFL ruckmen yourselves.

Do you think Max Gawn should be second fav?

As always, hit me up on Twitter if you have any Qs. #makemeauseR
