Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Home-court advantage in the NBA
For obvious reasons, 2020 has not been the best of years, and alongside the terrible societal and economic turmoil created by the COVID-19 pandemic, the sports and entertainment industry has also undergone some dramatic transformations. Of all major sports leagues, the NBA was able to adapt quickly and efficiently to the crisis (albeit at a cost for players and staff, who had to live in a bubble for an extended amount of time), thereby setting a standard for all other leagues on how to safely resume play. The NBA paused its season in mid-March 2020 but, after extensive planning, was able to resume play in July by creating a bubble that very few people were allowed to enter or leave. In this bubble, all eligible teams had to complete the remainder of the season and the playoffs, and all games took place on neutral ground, where the only spectators were players, coaching and refereeing staff, and a few select members of the media.
Now, I would pick an arena packed with loud fans any day of the week, but the NBA bubble provided the environment for a unique real-world experiment to estimate the potential impact of fans and home-court advantage on team performance. While the first half of the season followed the normal schedule of home and away games for each team, the second half saw all games played on neutral ground. In this post, we will analyze NBA game data from the last few seasons to assess whether home-court advantage does indeed impact team performance.
Collecting NBA game data
To begin, I wrote a quick chunk of Python code to scrape NBA game results from www.basketball-reference.com. The code iterates through all combinations of year/month since 2015 and extracts the results into pandas DataFrames.
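The original scraping code is not reproduced here, so what follows is only a minimal sketch of how such a loop might look. The URL pattern in BASE_URL, the helper names month_url and scrape_results, and the delay parameter are all assumptions for illustration, not the author's code. The sketch relies on pandas.read_html, which needs an HTML parser such as lxml to be installed.

```python
import time
from itertools import product

# Assumed URL pattern for the monthly schedule pages (not confirmed by the post).
BASE_URL = "https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html"

YEARS = range(2015, 2021)
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def month_url(year, month):
    """Build the (assumed) schedule URL for one year/month pair."""
    return BASE_URL.format(year=year, month=month)


def scrape_results(years=YEARS, months=MONTHS, delay=3.0):
    """Return {(year, month): DataFrame} for every page that could be parsed."""
    import pandas as pd  # imported here so the URL helper works without pandas

    raw_dict = {}
    for year, month in product(years, months):
        time.sleep(delay)  # hypothetical pause, to avoid hammering the site
        try:
            tables = pd.read_html(month_url(year, month))
        except (ValueError, OSError):
            # ValueError: no table on the page; OSError: HTTP or network failure
            # (for example, a month in which no games were played).
            continue
        raw_dict[(year, month)] = tables[0]
    return raw_dict
```

Calling raw_dict = scrape_results() would then perform the actual requests. Pages that do not exist are simply skipped, so the dictionary only holds year/month pairs for which a results table was found.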
Running the code stores all the data in a raw_dict dictionary whose keys are tuples of the form (year, 'month') (for example, (2015, 'january'), (2020, 'april'), etc.) and whose values are pandas DataFrames containing the results of the games played during the corresponding year/month pair. For example, all NBA data for December 2019 is stored under the key (2019, 'december').
Cleaning NBA game data
Once we have collected the data that we need, we can do some minor cleaning and processing to prepare it for analysis. The code concatenates all monthly NBA data into a single DataFrame and cleans up some of the row values and column names. We also add a column, score_diff, which holds the score difference between the home and away teams (home score minus away score). If score_diff is positive, the home team scored more points than the away team (i.e. the home team won). Conversely, if score_diff is negative, the home team scored fewer points than the away team (i.e. the home team lost).
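As a rough illustration of these cleaning steps, here is a sketch that runs on a tiny stand-in for raw_dict. The raw column names ("Visitor/Neutral", "PTS", "Home/Neutral", "PTS.1") are assumptions about what the scraped tables look like, and the February game is made up so that there are two months to concatenate.

```python
import pandas as pd

# Stand-in for the scraped raw_dict; the raw column names are assumptions.
raw_dict = {
    (2015, "january"): pd.DataFrame({
        "Date": ["Thu, Jan 1, 2015", "Thu, Jan 1, 2015"],
        "Visitor/Neutral": ["Denver Nuggets", "Sacramento Kings"],
        "PTS": [101, 110],
        "Home/Neutral": ["Chicago Bulls", "Minnesota Timberwolves"],
        "PTS.1": [106, 107],
    }),
    # Made-up game, only here so that a second month exists.
    (2015, "february"): pd.DataFrame({
        "Date": ["Sun, Feb 1, 2015"],
        "Visitor/Neutral": ["Example Visitors"],
        "PTS": [96],
        "Home/Neutral": ["Example Hosts"],
        "PTS.1": [100],
    }),
}

# Tag each monthly table with its key, then stack everything into one frame.
frames = [
    df.assign(year=year, month=month) for (year, month), df in raw_dict.items()
]
winloss_data = pd.concat(frames, ignore_index=True).rename(columns={
    "Visitor/Neutral": "visiting_team",
    "PTS": "visiting_team_score",
    "Home/Neutral": "home_team",
    "PTS.1": "home_team_score",
})

# Coerce scores to numbers and drop rows without a valid score.
score_cols = ["visiting_team_score", "home_team_score"]
winloss_data[score_cols] = winloss_data[score_cols].apply(
    pd.to_numeric, errors="coerce"
)
winloss_data = winloss_data.dropna(subset=score_cols).assign(
    # Positive: the home team won. Negative: the home team lost.
    score_diff=lambda d: d["home_team_score"] - d["visiting_team_score"]
)

print(winloss_data[["year", "month", "home_team", "score_diff"]])
```

Coercing with errors="coerce" is a defensive choice: any non-numeric score cell becomes missing and is dropped, rather than breaking the subtraction.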
Running the code stores all the data in a winloss_data DataFrame, which should contain the following data:
index | year | month | Date | Start (ET) | visiting_team | visiting_team_score | home_team | home_team_score | Unnamed: 6 | Unnamed: 7 | Attend. | Notes | score_diff |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015 | january | Thu, Jan 1, 2015 | 8:00p | Denver Nuggets | 101 | Chicago Bulls | 106 | Box Score | nan | 21794 | nan | 5 |
1 | 2015 | january | Thu, Jan 1, 2015 | 8:00p | Sacramento Kings | 110 | Minnesota Timberwolves | 107 | Box Score | nan | 13337 | nan | -3 |
2 | 2015 | january | Fri, Jan 2, 2015 | 7:00p | Cleveland Cavaliers | 91 | Charlotte Hornets | 87 | Box Score | nan | 19307 | nan | -4 |
3 | 2015 | january | Fri, Jan 2, 2015 | 7:00p | Brooklyn Nets | 100 | Orlando Magic | 98 | Box Score | nan | 17008 | nan | -2 |
4 | 2015 | january | Fri, Jan 2, 2015 | 7:30p | Dallas Mavericks | 119 | Boston Celtics | 101 | Box Score | nan | 18624 | nan | -18 |
A quick analysis of home-court advantage in the NBA
Now that we have gathered and cleaned up our NBA game data, we can proceed to a small analysis. To begin, we can simply look at the summary statistics for the score_diff column across different seasons.
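A per-year summary of this kind can be produced with a pandas groupby followed by describe. The sketch below uses a toy winloss_data: the five 2015 values are the score_diff values from the sample rows shown earlier, while the 2020 values are made up for illustration.

```python
import pandas as pd

# Toy data: the 2015 values come from the sample rows; the 2020 values are made up.
winloss_data = pd.DataFrame({
    "year": [2015] * 5 + [2020] * 4,
    "score_diff": [5, -3, -4, -2, -18, 1, -1, 3, -5],
})

# One row per year with count, mean, std, min, quartiles, and max.
summary = winloss_data.groupby("year")["score_diff"].describe()
print(summary)
```

The "50%" column is the median, which is the value discussed alongside the mean in the analysis of the full table.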
year | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
2015 | 1311 | 2.39054 | 13.5141 | -54 | -7 | 4 | 11 | 53 |
2016 | 1316 | 3.02736 | 13.5698 | -51 | -6 | 4 | 12 | 50 |
2017 | 1309 | 3.08327 | 13.8825 | -44 | -6 | 4 | 12 | 49 |
2018 | 1312 | 2.36662 | 13.7199 | -48 | -7 | 4 | 11 | 61 |
2019 | 1319 | 2.77786 | 14.3607 | -56 | -7 | 4 | 12 | 50 |
2020 | 1258 | 1.86248 | 13.4501 | -41 | -7 | 0 | 10 | 49 |
We can immediately see that the 2020 season's summary statistics differ from those of all other seasons. On average, the “home” team's margin was 1.86 points in 2020, roughly 0.5 to 1.2 points lower than in the 2015-2019 seasons, where it ranged from 2.37 to 3.08 (remember that score_diff is the home team's score minus the away team's score, so a positive value means that the home team scored more points). Similarly, the median score_diff was 0 during the 2020 season, whereas it was 4 for all other seasons since 2015. This indicates that, until the 2020 season, half of the teams playing at home outscored their opponents by 4 or more points. In 2020, by contrast, a median of 0 means that home teams won only about half of their games (basketball has no draws, so no individual game has a score_diff of 0). If we recall that half of the 2020 season was played on neutral ground, the numbers above suggest (but do not prove) that home-court advantage may have a meaningful impact on the win probability of the home team.
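To see how a median of 0 can arise even though no game ends in a tie, consider this small worked example. The score differentials are made up, and it uses only the standard library.

```python
from statistics import median

# Hypothetical score differentials (home minus away); none of them is 0.
score_diffs = [-9, -4, -1, 1, 3, 12]

# With an even number of games, the median is the mean of the two middle
# values, here -1 and 1, so it can be 0 without any tied game.
med = median(score_diffs)

# Share of games won by the home team (score_diff > 0).
home_win_share = sum(d > 0 for d in score_diffs) / len(score_diffs)

print(med, home_win_share)
```

In this toy sample the home team wins exactly half of the games, which is the situation a median of 0 points to.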
We can also visualize the overall distribution of the score differentials between the home and away teams across different seasons:
The distribution of the score differentials between the home and away teams also displays some interesting patterns. The distributions for seasons 2015-2019 appear to be bimodal, with one spike on either side of zero (there is no such thing as a draw in basketball). In all cases, the “positive” spike is larger than the “negative” spike, which aligns with the previous observation that home teams were more likely to win their games during those seasons. This bimodal shape is not as clearly defined for the 2020 season, where we instead see a noticeably larger “negative” spike, which indicates that home teams lost more often that season than in previous ones.
When plotting the normalized cumulative distribution function for the score differentials between the home and away teams, we obtain the following figure:
Again, the 2020 season stands apart from all previous seasons in our dataset, showing a distinct uptick of games on the negative side of the zero mark, which denotes games that the “away” team won. However, if we really want to show the potential impact of home-court advantage, we can focus on the before/after distribution of score differentials during the 2020 season. To begin, we define the months during which games were played inside and outside of the bubble:
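The post's own month definitions are not shown, so the lists below are assumptions made for the sake of a sketch: games before the mid-March pause are treated as pre-bubble and games from the July restart onward as bubble games. October is left out on purpose, because how it should be classified depends on how the scraped data labels years. The names PRE_BUBBLE_MONTHS, BUBBLE_MONTHS, and bubble_period are hypothetical.

```python
# Assumed month lists for the 2020 rows; not taken from the original post.
PRE_BUBBLE_MONTHS = {"january", "february", "march"}
BUBBLE_MONTHS = {"july", "august", "september"}


def bubble_period(month):
    """Classify a lowercase month name as 'pre_bubble', 'bubble', or 'other'."""
    if month in PRE_BUBBLE_MONTHS:
        return "pre_bubble"
    if month in BUBBLE_MONTHS:
        return "bubble"
    return "other"  # months not covered by either assumed list


print([bubble_period(m) for m in ["february", "august", "october"]])
```

With pandas, a period column could then be derived for the 2020 rows with something like winloss_data["month"].map(bubble_period).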
We can then compute and plot the normalized cumulative distribution function for the score differentials before and after the bubble:
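The computation behind such a figure is an empirical cumulative distribution function (ECDF): for each value x, the share of games whose score differential is at most x. Here is a dependency-free sketch with made-up differentials for the two periods; the ecdf helper is hypothetical and the plotting step is omitted.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function x -> share of values that are <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical score differentials (home minus away) for each period.
before_bubble = [-8, -3, -1, 2, 5, 7, 11, 14]
inside_bubble = [-10, -6, -2, -1, 1, 3, 4, 9]

cdf_before = ecdf(before_bubble)
cdf_inside = ecdf(inside_bubble)

# The ECDF evaluated at 0 is the share of games the home team did not win.
print(cdf_before(0), cdf_inside(0))
```

Because both curves are normalized to run from 0 to 1, periods with different numbers of games can be compared directly. A curve that sits higher at 0 means a larger share of home losses.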
The figure clearly shows the impact that moving to a “bubble” environment had on the score differentials between home and away teams. By entering a bubble in which all games were played in neutral arenas, the NBA removed the influence that fans could have on the outcome of games. As a result, we can reasonably conclude that the NBA bubble had a real impact on team performance, and that home-court advantage is a real phenomenon.