Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A few weeks ago, I was listening to The Athletic NBA Show podcast (Episode 581: “5 Players I was wrong about, 20 Games in Contenders, and Sam Vecenie on the 2021 Rookie Class”) and the following statistic caught my attention:
Question: Since the 2000-2001 NBA finals, there have been 42 teams in the NBA finals. (This episode aired in Dec 2021, so the most recent finals was the 2020-2021 edition.) At the 20-game mark in the regular season, how many of these 42 teams were below 0.500 (that is, less than 50% winning percentage)?
Answer: Zero (!) (One of the guests on the show guessed 6.)
(If you want to hear the full segment on the stat, it’s the first topic of the podcast, right after the recap of the week.) Since a full regular season in the NBA is 82 games per team, 20 games represents just under a quarter of the season. It seemed a little surprising that NBA finalists were so consistently >= 0.500 so early in the season, so I wanted to verify the statistic for myself.
What I found was even more surprising: none of the teams in the NBA finals since the 1981-1982 edition were below 0.500 at the 20-game mark! That’s 40 years of NBA finals, almost double the number considered in the podcast. The 1980-1981 Houston Rockets were the most recent NBA finalists to be under 0.500 at the 20-game mark: they were 9-11.
The rest of this post goes through the data and R code to verify this statistic. The full R script is available here.
To verify this statistic, I needed two from two different sources. The first data source is Tristan Malherbe’s NBA Finals and MVPs dataset hosted on data.world. This tells us which teams were in the NBA finals in each year. The dataset only goes to the 2017-2018 season, so I added NBA finals data up to the 2020-2021 season (inclusive) manually. The second data source is Wyatt Walsh’s Basketball Dataset hosted on Kaggle. This is an extremely comprehensive dataset on the NBA stored as an SQLite database; we only use the Game
table, which contains statistics on every regular season game.
(Note: If you play around with the data, you might find some oddities/errors. For example, there are one or two teams that had more than 82 games in the regular season when there should be at most 82 games. There is also some missing data. Since this is quick analysis for a blog post I didn’t try to ensure 100% correctness; this is something one should do or strive toward for more high-stakes analyses.)
First, let’s load the regular season data. I’m being a bit lazy by importing the whole Game
table, but since the data isn’t very big we don’t suffer a huge efficiency loss.
library(DBI) library(tidyverse) theme_set(theme_bw()) firstYear <- 1980 # represents 1980-1981 season # these filepaths are specific to my local drive mainFile <- "../data/nba-kaggle-wyattowalsh/basketball.sqlite" finalsFile <- "../data/datatouille-nba-finals-and-mvps/data/data-updated.csv" # get all regular season games (only relevant columns # selected) mydb <- dbConnect(RSQLite::SQLite(), mainFile) df <- dbGetQuery(mydb, "SELECT * FROM Game") dbDisconnect(mydb) regular_df <- df %>% mutate(GAME_DATE = as.Date(GAME_DATE), SEASON = as.numeric(SEASON)) %>% filter(SEASON >= firstYear) %>% select(SEASON, GAME_DATE, TEAM_NAME_HOME, TEAM_NAME_AWAY, WL_HOME, WL_AWAY) %>% arrange(SEASON, GAME_DATE) head(regular_df) # SEASON GAME_DATE TEAM_NAME_HOME TEAM_NAME_AWAY WL_HOME WL_AWAY # 1 1980 1980-10-10 Boston Celtics Cleveland Cavaliers W L # 2 1980 1980-10-10 Detroit Pistons Washington Bullets L W # 3 1980 1980-10-10 Seattle SuperSonics Los Angeles Lakers L W # 4 1980 1980-10-10 Phoenix Suns Golden State Warriors W L # 5 1980 1980-10-10 Philadelphia 76ers Milwaukee Bucks L W # 6 1980 1980-10-10 Utah Jazz Portland Trail Blazers W L
Next, let’s load the NBA finals data. Note that this data uses a different convention for labeling the season (e.g. 2021 refers to the 2020-2021 season), so we have to subtract 1 in order for the season column to match across the two data sources. For the rest of this analysis, season
# get NBA finals info # season = year - 1 to match regular_df finals_df <- read_csv(finalsFile) %>% select(season = year, nba_champion, nba_vice_champion) %>% mutate(season = season - 1) %>% filter(season >= firstYear) names(finals_df) <- toupper(names(finals_df)) head(finals_df) # # A tibble: 6 × 3 # SEASON NBA_CHAMPION NBA_VICE_CHAMPION # <dbl> <chr> <chr> # 1 1980 Boston Celtics Houston Rockets # 2 1981 Los Angeles Lakers Philadelphia 76ers # 3 1982 Philadelphia 76ers Los Angeles Lakers # 4 1983 Boston Celtics Los Angeles Lakers # 5 1984 Los Angeles Lakers Boston Celtics # 6 1985 Boston Celtics Houston Rockets
I’ve never heard the term “vice champion” before! Since the dataset uses that to refer to the team that was in the finals but didn’t win, I’ll use that term too. Join the two data frames and keep just the games that an NBA finalist played in:
joined_df <- regular_df %>% inner_join(finals_df, by = "SEASON") %>% filter(TEAM_NAME_HOME == NBA_CHAMPION | TEAM_NAME_HOME == NBA_VICE_CHAMPION | TEAM_NAME_AWAY == NBA_CHAMPION | TEAM_NAME_AWAY == NBA_VICE_CHAMPION) %>% mutate(FOR_CHAMPION = TEAM_NAME_HOME == NBA_CHAMPION | TEAM_NAME_AWAY == NBA_CHAMPION, FOR_VICE_CHAMPION = TEAM_NAME_HOME == NBA_VICE_CHAMPION | TEAM_NAME_AWAY == NBA_VICE_CHAMPION)
Next, let’s compute the win percentages across games for the NBA finalists. I did it separately for the champions and vice champions then joined the two datasets together since it was a little easier to figure out the code.
# Compute win percentages across games for champions champion_df <- joined_df %>% filter(FOR_CHAMPION) %>% select(SEASON, GAME_DATE, TEAM_NAME_HOME, WL_HOME, NBA_CHAMPION) %>% group_by(SEASON) %>% mutate(GAME = 1, WIN = ifelse((TEAM_NAME_HOME == NBA_CHAMPION & WL_HOME == "W") | (TEAM_NAME_HOME != NBA_CHAMPION & WL_HOME == "L"), 1, 0)) %>% transmute(TEAM_TYPE = "Champion", SEASON = factor(SEASON), GAMES_PLAYED = cumsum(GAME), WINS = cumsum(WIN), WIN_PCT = WINS / GAMES_PLAYED) # Compute win percentages across games for vice champions vice_champion_df <- joined_df %>% filter(FOR_VICE_CHAMPION) %>% select(SEASON, GAME_DATE, TEAM_NAME_HOME, WL_HOME, NBA_VICE_CHAMPION) %>% group_by(SEASON) %>% mutate(GAME = 1, WIN = ifelse((TEAM_NAME_HOME == NBA_VICE_CHAMPION & WL_HOME == "W") | (TEAM_NAME_HOME != NBA_VICE_CHAMPION & WL_HOME == "L"), 1, 0)) %>% transmute(TEAM_TYPE = "Vice Champion", SEASON = factor(SEASON), GAMES_PLAYED = cumsum(GAME), WINS = cumsum(WIN), WIN_PCT = WINS / GAMES_PLAYED) # put champion & vice champion data together # final_win_pct_df is used for the year labels in the plots win_pct_df <- rbind(champion_df, vice_champion_df) final_win_pct_df <- win_pct_df %>% group_by(TEAM_TYPE, SEASON) %>% slice_tail(n = 1) head(win_pct_df) # # A tibble: 6 × 5 # # Groups: SEASON [1] # TEAM_TYPE SEASON GAMES_PLAYED WINS WIN_PCT # <chr> <fct> <dbl> <dbl> <dbl> # 1 Champion 1980 1 1 1 # 2 Champion 1980 2 1 0.5 # 3 Champion 1980 3 2 0.667 # 4 Champion 1980 4 2 0.5 # 5 Champion 1980 5 3 0.6 # 6 Champion 1980 6 3 0.5
We are now ready to make the plot! First, let’s make the plot of win percentage vs. games played for the NBA champions:
ggplot(champion_df, aes(x = GAMES_PLAYED, y = WIN_PCT, col = SEASON)) + geom_line() + geom_hline(yintercept = 0.5, linetype = "dashed") + geom_vline(xintercept = 20, linetype = "dashed") + geom_text(data = filter(final_win_pct_df, TEAM_TYPE == "Champion"), aes(x = max(GAMES_PLAYED), y = WIN_PCT, label = SEASON), hjust = 0, size = 3) + coord_cartesian(ylim = c(0, 1)) + theme(legend.position = "none") + labs(title = "Win percentage by games played (champions)", x = "Games played", y = "Win percentage")
Indeed, at 20 games all NBA champions were 0.500 or better. Only one team was exactly at 0.500: the 2005-2006 Miami Heat.
Let’s make the same plot for the vice champions:
ggplot(vice_champion_df, aes(x = GAMES_PLAYED, y = WIN_PCT, col = SEASON)) + geom_line() + geom_hline(yintercept = 0.5, linetype = "dashed") + geom_vline(xintercept = 20, linetype = "dashed") + geom_text(data = filter(final_win_pct_df, TEAM_TYPE == "Vice Champion"), aes(x = max(GAMES_PLAYED), y = WIN_PCT, label = SEASON), hjust = 0, size = 3) + coord_cartesian(ylim = c(0, 1)) + theme(legend.position = "none") + labs(title = "Win percentage by games played (vice champions)", x = "Games played", y = "Win percentage")
Again, at 20 games all NBA vice champions after the 1980-1981 season were 0.500 or better. Only one team at 0.500 was the 2004-2005 Detroit Pistons. As we can see from the plot, the most recent NBA finalist to be below 0.500 was the 1980-1981 Houston Rockets, who also happen to be the only NBA finalist in this time window to finish the entire regular season under 0.500.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.