Recreating the NBA lead tracker graphic

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For each NBA game, nba.com has a really nice graphic which tracks the point differential between the two teams throughout the game. Here is the lead tracker graphic for the game between the LA Clippers and the Phoenix Suns on 10 Dec 2018:

I thought it would be cool to try recreating this graphic with R. You might ask: why try to replicate something that exists already? If we are able to pull out the data underlying this graphic, we could do much more than just replicate what is already out there; we have the power to make other visualizations which could be more informative or powerful. (For example, how does this chart look like for all games that the Golden State Warriors played in? Or, how does the chart look like for each quarter of the game?)

The full R code for this post can be found here. For a self-contained script that accepts a game ID parameter and produces the lead tracker graphic, click here.

First, we load the packages that we will use:

library(lubridate)
library(rvest)
library(stringr)
library(tidyverse)

We can get play-by-play data from Basketball-Reference.com (here is the link for the LAC @ PHX game on 2018-12-10). Here is a snippet of the play-by-play table on that webpage, we would like to extract the columns in red:

Play-by-play data from basketball-reference.com.

The code below extracts the webpage, then pulls out rows from the play-by-play table:

# get webpage
url <- paste0("https://www.basketball-reference.com/boxscores/pbp/", current_id,
              ".html")
webpage <- read_html(url)

# pull out the events from the play-by-play table
events <- webpage %>% 
    html_nodes("#pbp") %>%
    html_nodes("tr") %>% 
    html_text()

events is a character vector that looks like this:

We would really like to pull out the data in the boxes above. Timings are easy enough to pull out with regular expressions (e.g. start of the string: at least 1 digit, then :, then at least one digit, then ., then at least one digit). Pulling out the score is a bit trickier: we can’t just use the regular expression denoting a dash – with a number on each side. An example of why that doesn’t work is in the purple box above. Whenever a team scores, basketball-reference.com puts a “+2” or “+3” on the left or right of the score, depending on which team scored. In events, these 3 columns get smushed together into one string. If the team on the left scores, pulling out number-dash-number will give the wrong value (e.g. the purple box above would give 22-2 instead of 2-2).

To avoid this issue, we extract the “+”s that may appear on either side of the score. In fact, this has an added advantage: we only need to extract a score if it is different from the previous timestamp. As such, we only have to keep the scores which have a “+” on either side of it. We then post-process the scores.

# get event times & scores
times  <- str_extract(events, "^\\d+:\\d+.\\d+")
scores <- str_extract(events, "[\\+]*\\d+-\\d+[\\+]*")
scores <- ifelse(str_detect(scores, "\\+"), scores, NA)

df <- data.frame(time = times, score = scores, stringsAsFactors = FALSE) %>%
    na.omit()

# remove the +'s
parseScore <- function(x) {
    if (startsWith(x, "+")) {
        return(str_sub(x, 3, str_length(x)))
    } else if (endsWith(x, "+")) {
        return(str_sub(x, 1, str_length(x) - 1))
    } else {
        return(x)
    }
}
df$score <- sapply(df$score, parseScore)

Next, we split the score into visitor and home score and compute the point differential (positive means the visitor team is winning):

# split score into visitor and home score, get home advantage
df <- df %>% 
    separate(score, into = c("visitor", "home"), sep = "-") %>%
    mutate(visitor = as.numeric(visitor), 
           home = as.numeric(home),
           time = ms(time)) %>%
    mutate(visitor_adv = visitor - home)

Next we need to process the timings. Each of the 4 quarters lasts for 12 minutes, while each overtime period (if any) lasts for 5 minutes. The time column shows the amount of time remaining in the current period. We will amend the times so that they show the time elapsed (in seconds) from the start of the game. This notion of time makes it easier for plotting, and works for any number of overtime periods as well.

# get period of play (e.g. Q1, Q2, ...)
df$period <- NA
period <- 0
prev_time <- ms("0:00")
for (i in 1:nrow(df)) {
    curr_time <- df[i, "time"]
    if (prev_time < curr_time) {
        period <- period + 1
    }
    df[i, "period"] <- period
    prev_time <- curr_time
}

# convert time such that it runs upwards. regular quarters are 12M long, OT 
# periods are 5M long
df <- df %>% mutate(time = ifelse(period <= 4, as.duration(12 * 60) - as.duration(time), as.duration(5 * 60) - as.duration(time))) %>%
    mutate(time = ifelse(period <= 4,
                         time + as.duration(12 * 60 * (period - 1)),
                         time + as.duration(12 * 60 * 4) + 
                             as.duration(5 * 60 * (period - 5))
                         ))

At this point, we have enough to make crude approximations of the lead tracker graphic:

ggplot() +
    geom_line(data = df, aes(x = time, y = visitor_adv)) +
    labs(title = "LAC @ PHX, 2018-12-10") +
    theme_minimal() +
    theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

ggplot() +
    geom_step(data = df, aes(x = time, y = visitor_adv)) +
    labs(title = "LAC @ PHX, 2018-12-10") +
    theme_minimal() +
    theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

Getting the fill colors that NBA.com’s lead tracker has requires a bit more work. We need to split the visitor_adv into two columns: the visitor’s lead (0 if they are behind) and the home’s lead (0 if they are behind). We can then draw the chart above and below the x-axis as two geom_ribbons. (It’s a little more complicated than that, see this StackOverflow question and this gist for details.) Colors were obtained using imagecolorpicker.com.

df$visitor_lead <- pmax(df$visitor_adv, 0)
df$home_lead    <- pmin(df$visitor_adv, 0)

df_extraSteps <- df %>% mutate(visitor_adv = lag(visitor_adv),
                               visitor_lead = lag(visitor_lead),
                               home_lead = lag(home_lead))
df2 <- bind_rows(df_extraSteps, df) %>%
    arrange(time)

ggplot() +
    geom_ribbon(data = df2, aes(x = time, ymin = 0, ymax = visitor_lead), fill = "#F7174E") +
    geom_ribbon(data = df2, aes(x = time, ymin = home_lead, ymax = 0), fill = "#F16031") +
    labs(title = "LAC @ PHX, 2018-12-10") +
    theme_minimal() +
    theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5))

Almost there! The code below does some touch up to the figure, giving it the correct limits for the y-axis as well as vertical lines for the end of the periods.

# get score differential range (round to nearest 5)
ymax <- round(max(df$visitor_adv) * 2, digits = -1) / 2
ymin <- round(min(df$visitor_adv) * 2, digits = -1) / 2

# get period positions and labels
periods <- unique(df$period)
x_value <- ifelse(periods <= 4, 12 * 60 * periods, 
                  12 * 60 * 4 + 5 * 60 * (periods - 4))
x_label <- ifelse(periods <= 4, paste0("Q", periods), 
                  paste0("OT", periods - 4))

ggplot() +
    geom_ribbon(data = df2, aes(x = time, ymin = 0, ymax = visitor_lead), fill = "#F7174E") +
    geom_ribbon(data = df2, aes(x = time, ymin = home_lead, ymax = 0), fill = "#F16031") +
    geom_vline(aes(xintercept = x_value), linetype = 2, col = "grey") +
    scale_y_continuous(limits = c(ymin, ymax)) +
    labs(title = "LAC @ PHX, 2018-12-10") +
    scale_x_continuous(breaks = x_value, labels = x_label) +
    theme_minimal() +
    theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
          axis.title.x = element_blank(), panel.grid.minor.x = element_blank(),
          panel.grid.minor.y = element_blank())

The figure above is what we set out to plot. However, since we have the underlying data, we can now make plots of the same data that may reveal other trends (code at the end of this R file). Here are the line and ribbon plots where we look at the absolute score rather than the point differential:

Here, we add points to the line plot to indicate whether a free throw, 2 pointer or 3 pointer was scored:

The possibilities are endless!

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)