Site icon R-bloggers

Plotting Background Data

[This article was first published on Analysis of AFL, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One of the best things about the online R community is there are lots and lots of great people to follow online especially. One person I follow dataandme consistently links great how to do things on R tutorials.

Recently I saw this post where she linked a post by drsimonj running through how to plot background data in R. My first thought as when reading most data viz things is can I do this with footy?

Step 1 – Figuring out what I want to visualise

Luckily I received this message recently and thought that this would be a pretty good application of visualising background data.

I’m interested in seeing the way the opposition style affects a teams handball vs kick disposals. So for last x seasons, what was each teams unconditional average handball/kicks ratio. Then look at each game based on the opponent, and summarise the average incremental effect that each opposition team has on handballs/kicks. So 16 numbers where positive means they allow more handballs to kicks than league avg.

What I think its handy to do, is to draw using a bit of pen and paper what you think you want the graph to look like.

Step 2a – Get the data for each team by round

First thing we do is we get the datasets, and then we create the variable we want (hb2k) which is our handball to kick ratio for a team.

library(fitzRoy)
## Warning: package 'fitzRoy' was built under R version 3.5.1
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ----------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tidyr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'stringr' was built under R version 3.5.1
## -- Conflicts -------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df<-fitzRoy::player_stats

df1<-fitzRoy::get_footywire_stats(9514:9594) #gets 2018 data until end of round 9
## Getting data from footywire.com
## Finished getting data
df<-df%>%
      filter(Season != 2018) #filters out the 2018 data (incomeplete that was downloaded when installing fitzRoy for first time) 
df2<-rbind(df, df1) #stacks the datasets on top of each other

  df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk)
## # A tibble: 3,408 x 8
## # Groups:   Season, Round, Team, Opposition [3,406]
##    Season Round          Team      Opposition Date          tk   thb  hb2k
##     <dbl> <chr>          <chr>     <chr>      <date>     <int> <int> <dbl>
##  1   2010 Elimination F~ Carlton   Sydney     2010-09-05   191   128 0.670
##  2   2010 Elimination F~ Fremantle Hawthorn   2010-09-04   216   113 0.523
##  3   2010 Elimination F~ Hawthorn  Fremantle  2010-09-04   197   148 0.751
##  4   2010 Elimination F~ Sydney    Carlton    2010-09-05   173   177 1.02 
##  5   2010 Grand Final    Collingw~ St Kilda   2010-09-25   198   128 0.646
##  6   2010 Grand Final    Collingw~ St Kilda   2010-10-02   250   129 0.516
##  7   2010 Grand Final    St Kilda  Collingwo~ 2010-09-25   193   145 0.751
##  8   2010 Grand Final    St Kilda  Collingwo~ 2010-10-02   178   148 0.831
##  9   2010 Preliminary F~ Collingw~ Geelong    2010-09-17   228   101 0.443
## 10   2010 Preliminary F~ Geelong   Collingwo~ 2010-09-17   179   229 1.28 
## # ... with 3,398 more rows
  • From df2 we select the columns we want which are Season, Round, K (which is kicks), HB which is handballs, Team Opposition and Date
  • We then use group_by so we can summarise the total kicks tk and the total handballs thb by that a team has done vs a given oppostion (Team, Opposition) for a game (Season, Round, Date)
  • As teams don’t play each other more than once per Round you would think Date would be redundant right? But no in 2010 we had a replay of a grand final
  • lastly we create the variable we want using mutate hb2k which is the total handballs divided by the total kicks for a given game. . # Step 2b – Get the teams yearly average for the straight lines
  df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date ) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk) %>%
  group_by(Season, Opposition) %>%
  summarise(average_hb2k=mean(hb2k)) %>%
  filter(Season==2018)
## # A tibble: 18 x 3
## # Groups:   Season [1]
##    Season Opposition       average_hb2k
##     <dbl> <chr>                   <dbl>
##  1   2018 Adelaide                0.763
##  2   2018 Brisbane                0.810
##  3   2018 Carlton                 0.745
##  4   2018 Collingwood             0.795
##  5   2018 Essendon                0.788
##  6   2018 Fremantle               0.729
##  7   2018 Geelong                 0.701
##  8   2018 Gold Coast              0.721
##  9   2018 GWS                     0.760
## 10   2018 Hawthorn                0.824
## 11   2018 Melbourne               0.836
## 12   2018 North Melbourne         0.741
## 13   2018 Port Adelaide           0.718
## 14   2018 Richmond                0.801
## 15   2018 St Kilda                0.767
## 16   2018 Sydney                  0.657
## 17   2018 West Coast              0.753
## 18   2018 Western Bulldogs        0.717

Remember we wanted to be able to reference for a given team what is their average handball to kick ratio conceded. One way to do this is to add a reference line. To get the values for the reference line, we use group_by(Season, Opposition) which allows us then to summarise by Season for each Opposition team their average handball to kick ratio. summarise(average_hb2k=mean(hb2k)). Then as we want a graph for 2018 we use filter(Season==2018) to achieve this. Which means we get the average values by Oppostion for 2018 only.

Step 3 – What team do you want to see?

For tonights game lets look at Collingwood vs the Western Bulldogs. Lets have a look at what the teams handball to kick ratio has been when playing Collingwood and what it has been when playing Western Bulldogs

df2%>%
  select(Season, Round, K,HB, Team, Opposition, Date ) %>%
  group_by(Season, Round, Team, Opposition, Date)%>%
  summarise(tk=sum(K), thb=sum(HB)) %>%
  mutate(hb2k=thb/tk) %>%
  filter(Season==2018)%>%
  filter(Opposition=="Collingwood")%>%
  ggplot(aes(x=Date, y=hb2k)) +geom_point() +
  geom_text(aes(label=Team), size=2) 

Our first check is that the teams that are playing the pies are in the right order. We can do this using geom_text(aes(label=Team)). We can check the pies footywire page and see that they do play hawks round 1

Our next check is that the datapoints are correct, we can check this by just adding a data label.

df2%>%
  select(Season, Round, K,HB, Team, Opposition, Date ) %>%
  group_by(Season, Round, Team, Opposition, Date)%>%
  summarise(tk=sum(K), thb=sum(HB)) %>%
  mutate(hb2k=thb/tk) %>%
  filter(Season==2018)%>%
  filter(Opposition=="Collingwood")%>%
  ggplot(aes(x=Date, y=hb2k)) +geom_point() +
  geom_text(aes(label=Team), size=2) +
  geom_text(aes(label=hb2k), vjust=-1, size=1.5)

We check if Hawthorns handball to kick ratio vs the pies in round 1 was 0.7866.

Step 4 Get the Competitions yearly average handball to kick ratio

df2%>%
  select(Season, Round, K,HB, Team, Opposition, Date ) %>%
  group_by(Season, Round, Team, Opposition, Date)%>%
  summarise(tk=sum(K), thb=sum(HB)) %>%
  mutate(hb2k=thb/tk) %>%
  group_by(Season)%>%
  summarise(meanhb2k=mean(hb2k))
## # A tibble: 9 x 2
##   Season meanhb2k
##    <dbl>    <dbl>
## 1   2010    0.845
## 2   2011    0.761
## 3   2012    0.739
## 4   2013    0.737
## 5   2014    0.774
## 6   2015    0.788
## 7   2016    0.834
## 8   2017    0.819
## 9   2018    0.757

From here we can see that the average handball to kick ratio for 2018 so far is 0.757.

Step 5 – Putting it all together for one team

df2%>%
  select(Season, Round, K,HB, Team, Opposition, Date ) %>%
  group_by(Season, Round, Team, Opposition, Date)%>%
  summarise(tk=sum(K), thb=sum(HB)) %>%
  mutate(hb2k=thb/tk) %>%
  filter(Season==2018)%>%
  filter(Opposition=="Collingwood")%>%
  ggplot(aes(x=Date, y=hb2k)) +geom_point() +
  geom_text(aes(label=Team), size=2) +
  geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Collingwood")

Step 6 – Compare the pair

From above, we cacn see all we need to do is change “Collingwood” to “Western Bulldogs”

df2%>%
  select(Season, Round, K,HB, Team, Opposition, Date ) %>%
  group_by(Season, Round, Team, Opposition, Date)%>%
  summarise(tk=sum(K), thb=sum(HB)) %>%
  mutate(hb2k=thb/tk) %>%
  filter(Season==2018)%>%
  filter(Opposition=="Western Bulldogs")%>%
  ggplot(aes(x=Date, y=hb2k)) +geom_point() +
  geom_text(aes(label=Team), size=2) +
  geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Western Bulldogs")

df2%>%
  select(Season, Round, K,HB, Team, Opposition, Date ) %>%
  group_by(Season, Round, Team, Opposition, Date)%>%
  summarise(tk=sum(K), thb=sum(HB)) %>%
  mutate(hb2k=thb/tk) %>%
  filter(Season==2018)%>%
  filter(Opposition=="Collingwood")%>%
  ggplot(aes(x=Date, y=hb2k)) +geom_point() +
  geom_text(aes(label=Team), size=2) +
  geom_hline(yintercept = 0.757) +ggtitle("Opponents handball 2 kick ratio vs Collingwood")

Step 7 – But what about the background stuff?

First lets do Collingwood, I’m thinking lets see a similar plot, but lets change Date to Round as games are on different days of the week and lets make Collingwood a different colour to the rest of the competition.

df3<-df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date ) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk) %>%
    filter(Season==2018)%>%
    filter(Opposition == "Collingwood")



df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date ) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk) %>%
    filter(Season==2018)%>%
    filter(Opposition != "Collingwood") %>%
    ggplot(aes(x=Round,y=hb2k))+
  geom_point(colour="grey", alpha=0.5)+
    geom_point(data=df3,colour="black")

We can check our black points are the points we want simply by labelling them.

df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date ) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk) %>%
    filter(Season==2018)%>%
    filter(Opposition != "Collingwood") %>%
    ggplot(aes(x=Round, y=hb2k))+geom_point(data=df3)+
    geom_point(colour="grey") +geom_text(data=df3, aes(label=round(hb2k,2),hjust=-0.2))

We use geom_text(data=df3,aes(label=round(hb2k,2),hjust=-.2)) to label our data, what this does is takes the Collingwood only subdataset data=df3 and we label it by the corresponding hb2k variable we created earlier. We also round the numbers so it doesn’t look too messy to 2 decimals places using round(hb2k, 2) and we move the label to slightly right of the actual data point using hjust=-0.2

So what are the steps involved in this?

After reading the post, you can think of it as two parts.

  • Part one – plot the data faded without the subdata you want to emphasize.

  • Part two – plot the subdata you want to emphasize and colour it differently to the rest to make it stand out.

So to do this, first lets get the subdata we want which is Collingwoods handball to kick ratio conceded to teams. We call this subdataset df3

df3<-df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date ) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk) %>%
    filter(Season==2018)%>%
    filter(Opposition == "Collingwood")

Then we plot the full dataset without the Collingwood data, we then want to make this look faded we do this by colouring the points in grey geom_point(colour="grey" and fading the points by using , alpha=0.5)

df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date ) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk) %>%
    filter(Season==2018)%>%
    filter(Opposition != "Collingwood") %>%
    ggplot(aes(x=Round,y=hb2k))+
  geom_point(colour="grey", alpha=0.5)

Lastly we then add in the Collingwood data and make it black geom_point(data=df3,colour="black") so it stands out a bit more from our faded grey earlier for the rest of the competition.

df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date ) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk) %>%
    filter(Season==2018)%>%
    filter(Opposition != "Collingwood") %>%
    ggplot(aes(x=Round,y=hb2k))+
    geom_point(colour="grey", alpha=0.5)+
    geom_point(data=df3,colour="black")

That’s cool but how do you do the cool small multiple plots that had a bit more colour to it?

  df4<-df2%>%
    select(Season, Round, K,HB, Team, Opposition, Date ) %>%
    group_by(Season, Round, Team, Opposition, Date)%>%
    summarise(tk=sum(K), thb=sum(HB)) %>%
    mutate(hb2k=thb/tk) %>%
    filter(Season==2018)
  d_bg <- df4[, -4]

  ggplot(df4, aes(x = thb, y = tk, colour=Opposition)) +
    geom_point(data=d_bg, colour = "grey", alpha = .5)+
    geom_point() + guides(colour = FALSE)+ geom_text(aes(label=Team), size=1.5)+
  
    facet_wrap(~ Opposition)+
    guides(colour = FALSE) +
    theme_bw()

Ok so what is actually going on here?

This post here has a great explanation and is by far better than anything I could do myself so please read it!

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.