How to Build a Predictive Model for NBA Games

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers].

Introduction

In this tutorial, we will walk through an example of how you can build a starter predictive model for NBA games. The steps are the following:

  • Scrape the game results from ESPN for each team.
  • Transform the data, generate some features and get the running totals of each team per game.
  • Build the predictive model.
  • Make predictions.

Scrape the Data

We would like to get the results per team. The ESPN URL is of the form https://www.espn.com/nba/team/schedule/_/name/tor, where the last part is the team code: tor for the Toronto Raptors, bos for the Boston Celtics, and so on. Let’s have a look at the Boston Celtics page:

[Image: the Boston Celtics schedule page on ESPN]

We only care about the columns DATE, OPPONENT, RESULT and W-L. Let’s create a script that gets the results of all teams and stores them in a data frame called by_team. Note that I had to look up the team codes (tor, mil, den and so on) myself.

library(rvest)
library(lubridate)
library(tidyverse)
library(stringr)
library(zoo)
library(h2o)

teams<-c("tor", "mil", "den", "gs", "ind", "phi", "okc", "por", "bos", "hou", "lac", "sa",
         "lal", "utah", "mia", "sac", "min", "bkn", "dal", "no", "cha", "mem", "det", "orl",
         "wsh", "atl", "phx", "ny", "chi", "cle")

teams_fullname<-c("Toronto", "Milwaukee", "Denver", "Golden State", "Indiana", "Philadelphia", "Oklahoma City","Portland",
                  "Boston", "Houston", "LA", "San Antonio", "Los Angeles", "Utah", "Miami", "Sacramento", "Minnesota", "Brooklyn",
                  "Dallas", "New Orleans", "Charlotte", "Memphis", "Detroit", "Orlando", "Washington", "Atlanta", "Phoenix",
                  "New York", "Chicago", "Cleveland")

by_team <- NULL
for (i in seq_along(teams)) {
  url <- paste0("https://www.espn.com/nba/team/schedule/_/name/", teams[i])
  webpage <- read_html(url)
  team_table <- html_nodes(webpage, 'table')
  team_c <- html_table(team_table, fill=TRUE, header = TRUE)[[1]]
  # keep only the games already played: the list of upcoming games starts
  # at the row whose RESULT cell reads "TIME"
  first_upcoming <- which(team_c$RESULT == "TIME")
  if (length(first_upcoming) > 0) {
    team_c <- team_c[seq_len(first_upcoming[1] - 1), ]
  }
  team_c$URLTeam <- toupper(teams[i])
  team_c$FullURLTeam <- teams_fullname[i]
  by_team <- rbind(by_team, team_c)
}

# remove the postponed games
by_team<-by_team%>%filter(RESULT!='Postponed')
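
The truncation step in the loop is easy to get subtly wrong, so here is a minimal sketch of it on a toy table. The data frame `schedule` and its values are made up for illustration, and the layout (a repeated header row whose RESULT cell reads "TIME" before the upcoming games) is an assumption inferred from the scraping code, not a guarantee about the ESPN page.

```r
# Toy stand-in for one scraped schedule table (values are made up).
schedule <- data.frame(
  DATE   = c("Wed, Oct 23", "Fri, Oct 25", "DATE", "Sun, Oct 27"),
  RESULT = c("W 112-106", "L 93-107", "TIME", "7:30 PM"),
  stringsAsFactors = FALSE
)

# Row index of the header row that opens the list of upcoming games.
cut_row <- which(schedule$RESULT == "TIME")

# Pitfall: `:` binds tighter than `-`, so 1:cut_row - 1 means (1:cut_row) - 1.
# It only selects the intended rows because R silently drops the index 0.
idx_implicit <- 1:cut_row - 1

# Explicit version that also copes with a table that has no "TIME" row.
played <- if (length(cut_row) > 0) {
  schedule[seq_len(cut_row[1] - 1), ]
} else {
  schedule
}
played
```

Writing the index as `seq_len(cut_row[1] - 1)` makes the intent visible and avoids an error once a season is over and no upcoming games are listed.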
 

Transform the Data and Feature Engineering

Now we need to clean and reshape the data so that we are able to train the model. This is the most difficult part of Machine Learning modelling. What we actually need is the running percentage of wins of each team before the game, as well as the final outcome (Win=1, Loss=0). We will also take into consideration other features, such as the percentage of wins in the last 10 games and the percentage of wins when the team plays at home and when it plays away. Let’s start:

by_team_mod <- by_team %>%
  select(-(`Hi Points`:`Hi Assists`)) %>%
  # opponent name without the leading "vs" (or other non-letter prefix) and without a trailing " *"
  mutate(CleanOpponent = str_replace(str_extract(str_replace(OPPONENT, "^vs", ""), "[A-Za-z].+"), " \\*", ""),
         HomeAway = ifelse(substr(OPPONENT, 1, 2) == "vs", "Home", "Away"),
         WL = `W-L`) %>%
  separate(WL, c("W", "L"), sep = "-") %>%
  mutate(Tpct = as.numeric(W) / (as.numeric(L) + as.numeric(W)),
         dummy = 1,
         Outcome = ifelse(substr(RESULT, 1, 1) == "W", 1, 0)) %>%
  group_by(URLTeam) %>%
  mutate(Rank = row_number(),
         TeamMatchID = paste0(Rank, URLTeam, HomeAway),
         TLast10 = rollapplyr(Outcome, 10, sum, partial = TRUE) / rollapplyr(dummy, 10, sum, partial = TRUE)) %>%
  group_by(URLTeam, HomeAway) %>%
  mutate(Rpct = cumsum(Outcome) / cumsum(dummy),
         RLast10 = rollapplyr(Outcome, 10, sum, partial = TRUE) / rollapplyr(dummy, 10, sum, partial = TRUE)) %>%
  # lag every rate by one game so that it excludes the game we want to predict
  mutate_at(vars(Rpct, RLast10), dplyr::lag) %>%
  group_by(URLTeam) %>%
  mutate_at(vars(Tpct, TLast10), dplyr::lag) %>%
  na.omit() %>%
  select(TeamMatchID, Rank, DATE, URLTeam, FullURLTeam, CleanOpponent, HomeAway, Tpct, TLast10, Rpct, RLast10, Outcome)

Tpct and TLast10 are the URL team’s win rate over the season so far and over its last 10 games, respectively. Rpct and RLast10 are the corresponding “relevant” win rates, where relevant means computed only over home games when the team plays at home and only over away games when it plays away. Please pay attention to the lag function: we want each rate up until the game, without including the outcome of the game itself, since this is what we try to predict. Otherwise, we would have “data leakage”.
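
To make the role of the lag concrete, here is a minimal base R sketch on made-up results for a single team. The helper names `lag1` and `last_k_rate` are hypothetical, and a window of 3 games stands in for the 10-game window used in the article.

```r
# Toy results for one team, oldest game first (1 = win, 0 = loss); values are made up.
outcome <- c(1, 0, 1, 1, 0)

# Win rate that INCLUDES the current game: using it as a feature would leak the target.
pct_incl <- cumsum(outcome) / seq_along(outcome)

# Shift by one game so that each row only sees earlier games (NA for the first game).
lag1 <- function(x) c(NA, head(x, -1))
pct_before <- lag1(pct_incl)

# Win rate over the last k games, with a shorter window while fewer than k games exist.
last_k_rate <- function(x, k) {
  sapply(seq_along(x), function(i) mean(x[max(1, i - k + 1):i]))
}
last3_before <- lag1(last_k_rate(outcome, 3))

data.frame(outcome, pct_incl, pct_before, last3_before)
```

The first game has no history, so its lagged features are NA; this is why rows with missing values are dropped afterwards.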

Now we should store Rpct and RLast10 as HRpct and HRLast10 when they refer to home games, or as ARpct and ARLast10 when they refer to away games. Let’s do it:


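# empty data frame with the 12 existing columns plus the 4 new ones
df <- data.frame(matrix(ncol = 16, nrow = 0))
x <- c(colnames(by_team_mod), "HRpct", "HRLast10",  "ARpct", "ARLast10")
colnames(df) <- x


for (i in seq_len(nrow(by_team_mod))) {
  if(by_team_mod[i,"HomeAway"]=="Home") {
    # home game: copy Rpct and RLast10 (columns 10:11) into HRpct and HRLast10 (columns 13:14)
    df[i,c(1:14)]<-data.frame(by_team_mod[i,c(1:12)], by_team_mod[i,c(10:11)])
  } else {
    # away game: copy Rpct and RLast10 into ARpct and ARLast10 (columns 15:16)
    df[i,c(1:12)]<-by_team_mod[i,c(1:12)]
    df[i,c(15:16)]<-by_team_mod[i,c(10:11)]
  }
}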

# fill the NA values with the previous ones, group by team

df<-df%>%group_by(URLTeam)%>%fill(HRpct , HRLast10, ARpct,  ARLast10, .direction=c("down"))%>%ungroup()%>%na.omit()%>%filter(Rank>=10)
 

Notice that for the Machine Learning model we keep only the rows from each team’s tenth game onward (filter(Rank>=10)), so that every running rate is based on several earlier games.
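
Filling a data frame row by row can be slow on a full season. As an alternative, here is a minimal base R sketch of the same idea (split the relevant rate into a home and an away column, then carry the last known value forward within each team). The toy data and the helper `locf` are hypothetical, and only Rpct is shown; RLast10 would be handled in the same way.

```r
# Toy input: one row per team and game, oldest first within each team (values are made up).
toy <- data.frame(
  URLTeam  = c("BOS", "BOS", "BOS", "TOR", "TOR"),
  HomeAway = c("Home", "Away", "Home", "Away", "Home"),
  Rpct     = c(0.5, 0.4, 0.6, 0.3, 0.7),
  stringsAsFactors = FALSE
)

# Last observation carried forward; leading NAs stay NA.
locf <- function(x) {
  idx <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[idx + 1]
}

# Home rows feed the H column, all other rows feed the A column.
toy$HRpct <- ifelse(toy$HomeAway == "Home", toy$Rpct, NA)
toy$ARpct <- ifelse(toy$HomeAway != "Home", toy$Rpct, NA)

# Fill the gaps downwards, separately for each team.
toy$HRpct <- ave(toy$HRpct, toy$URLTeam, FUN = locf)
toy$ARpct <- ave(toy$ARpct, toy$URLTeam, FUN = locf)
toy
```

Rows that still contain NA after the fill (a team that has not yet played at home, or not yet away) are the ones that na.omit() removes in the article’s pipeline.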


The final step is to create Full_df, which is an inner join of the home data frame (H_df) and the away data frame (A_df).

# create the home df
H_df<-df%>%filter(HomeAway=="Home")%>%ungroup()
colnames(H_df)<-paste0("H_", names(H_df))


# create the away df
A_df<-df%>%filter(HomeAway!="Home")%>%ungroup()
colnames(A_df)<-paste0("A_", names(A_df))


Full_df<-H_df%>%inner_join(A_df, by=c("H_CleanOpponent"="A_FullURLTeam", "H_DATE"="A_DATE"))%>%
  select(H_DATE, H_URLTeam, A_URLTeam, H_Tpct, H_TLast10, H_HRpct, H_HRLast10, H_ARpct, H_ARLast10, 
         A_Tpct, A_TLast10, A_HRpct, A_HRLast10, A_ARpct, A_ARLast10,  H_Outcome)
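
As a minimal sketch of what this join does, the toy example below pairs a home row with the away row of the same game, using base R’s merge() in place of inner_join(). The two small data frames are made up; only the key columns follow the naming used above.

```r
# Toy home-side rows: the opponent is stored by its full name (values are made up).
H <- data.frame(
  H_DATE          = c("Mon, Jan 6", "Tue, Jan 7"),
  H_URLTeam       = c("BOS", "TOR"),
  H_CleanOpponent = c("Toronto", "Miami"),
  H_Outcome       = c(1, 0),
  stringsAsFactors = FALSE
)

# Toy away-side rows: the travelling team is identified by its full name.
A <- data.frame(
  A_DATE        = "Mon, Jan 6",
  A_URLTeam     = "TOR",
  A_FullURLTeam = "Toronto",
  stringsAsFactors = FALSE
)

# Inner join: a game survives only if both its home row and its away row exist.
games <- merge(H, A,
               by.x = c("H_CleanOpponent", "H_DATE"),
               by.y = c("A_FullURLTeam", "A_DATE"))
games
```

The second home row has no matching away row in this toy data, so the inner join drops it; the same happens in the real data when one side of a game was filtered out earlier.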


Build the Predictive Model

Now we are ready to build the Machine Learning model. We will work with the H2O library and a Random Forest, although we could have used other algorithms, such as Logistic Regression.

# Build the model

h2o.init()
Train_h2o<-as.h2o(Full_df)

Train_h2o$H_Outcome<-as.factor(Train_h2o$H_Outcome)

# random forest model: column 16 (H_Outcome) is the response, columns 4 to 15 are the predictors
model1 <- h2o.randomForest(y = 16, x = c(4:15), training_frame = Train_h2o, max_depth = 4)

h2o.performance(model1)
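
For comparison, here is a minimal sketch of the logistic regression alternative mentioned above, using base R’s glm() rather than H2O. The data are synthetic and generated only to make the example runnable; just two of the predictors (H_Tpct and A_Tpct) are used, and the coefficient 2 in the simulation is an arbitrary assumption.

```r
set.seed(1)

# Synthetic stand-in for Full_df: season win rates of the home and the away team.
n <- 200
toy <- data.frame(H_Tpct = runif(n), A_Tpct = runif(n))

# Simulated outcome: the home team wins more often when its win rate is higher.
toy$H_Outcome <- rbinom(n, 1, plogis(2 * (toy$H_Tpct - toy$A_Tpct)))

# Logistic regression of the home outcome on both win rates.
fit <- glm(H_Outcome ~ H_Tpct + A_Tpct, data = toy, family = binomial)

# Estimated probability of a home win for a hypothetical match-up.
new_game <- data.frame(H_Tpct = 0.7, A_Tpct = 0.4)
p_home <- predict(fit, newdata = new_game, type = "response")
p_home
```

With the real data, the formula would list the same twelve predictors that are passed to the random forest.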


Make Predictions

The model is ready and we are able to make predictions. We give the home team and the away team as input, and the algorithm returns the corresponding probability of each team winning. The predictors of the model are the most recent data of each team, so in order to get the most recent observation by team we will use slice(n()).

#######################
### most recent by team
#######################


### create an empty data frame and fill it in order to get the summary statistics


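# rebuild the unfiltered data frame, since df was filtered to Rank>=10 for training
df <- data.frame(matrix(ncol = 16, nrow = 0))
x <- c(colnames(by_team_mod), "HRpct", "HRLast10",  "ARpct", "ARLast10")
colnames(df) <- x


for (i in seq_len(nrow(by_team_mod))) {
  if(by_team_mod[i,"HomeAway"]=="Home") {
    # home game: copy Rpct and RLast10 (columns 10:11) into HRpct and HRLast10 (columns 13:14)
    df[i,c(1:14)]<-data.frame(by_team_mod[i,c(1:12)], by_team_mod[i,c(10:11)])
  } else {
    # away game: copy Rpct and RLast10 into ARpct and ARLast10 (columns 15:16)
    df[i,c(1:12)]<-by_team_mod[i,c(1:12)]
    df[i,c(15:16)]<-by_team_mod[i,c(10:11)]
  }
}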


# fill the NA values with the previous ones group by team

m_df<-df%>%group_by(URLTeam)%>%fill(HRpct , HRLast10, ARpct,  ARLast10, .direction=c("down"))%>%ungroup()%>%
  na.omit()%>%group_by(URLTeam)%>%slice(n())%>%ungroup()
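
The effect of group_by(URLTeam) followed by slice(n()) is to keep the last row of each team. A minimal base R sketch of the same selection on made-up data is shown below; it assumes, as in the article, that rows are in chronological order within each team.

```r
# Toy data: games of two teams, oldest first within each team (values are made up).
toy <- data.frame(
  URLTeam = c("BOS", "TOR", "BOS", "TOR", "BOS"),
  Rank    = c(1, 1, 2, 2, 3),
  Tpct    = c(0.50, 0.40, 0.55, 0.45, 0.60),
  stringsAsFactors = FALSE
)

# Keep the last occurrence of each team, i.e. its most recent game.
latest <- toy[!duplicated(toy$URLTeam, fromLast = TRUE), ]
latest
```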


Let’s get the predictions of the following 5 games:



### Make predictions

df <- NULL
a <- c("DET", "BOS", "ATL", "ORL", "PHI")   # away teams
h <- c("CHA", "BKN", "TOR", "MIA", "CHI")   # home teams

for (i in seq_along(a)) {

  # latest features of the home team, prefixed with H_ as in the training data
  th<-m_df%>%filter(URLTeam==h[i])%>%select(Tpct:ARLast10, -Outcome)
  colnames(th)<-paste0("H_", colnames(th))

  # latest features of the away team, prefixed with A_
  ta<-m_df%>%filter(URLTeam==a[i])%>%select(Tpct:ARLast10, -Outcome)
  colnames(ta)<-paste0("A_", colnames(ta))

  pred_data<-cbind(th,ta)

  tmp<-data.frame(Away=a[i], Home=h[i],as.data.frame(predict(model1,as.h2o(pred_data))))
  df<-rbind(df, tmp)

}

# drop the predicted class and keep only the probabilities
df<-df%>%select(-predict)
df
 

So, according to the model, DET has a 29.5% chance of winning against CHA, BOS a 40.9% chance of winning against BKN, and so on.

Final Thoughts

This is a relatively simple model. We can enrich it by taking into account other features, such as injuries, the days between two games, the travelling distance of the teams and so on.
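
As an illustration of one such feature, here is a minimal sketch of the days between two games for a single team. The dates are hypothetical, and the sketch assumes they have already been parsed to the Date class and sorted, which the scraped DATE column would still need.

```r
# Hypothetical game dates for one team, already parsed and sorted.
game_dates <- as.Date(c("2020-01-03", "2020-01-05", "2020-01-06", "2020-01-10"))

# Days since the team's previous game (NA for the first game).
days_since_last <- c(NA, as.integer(diff(game_dates)))

# A back-to-back is a game played the day after the previous one.
back_to_back <- !is.na(days_since_last) & days_since_last == 1

data.frame(game_dates, days_since_last, back_to_back)
```

Because the value for each game depends only on earlier dates, this feature is known before tip-off and does not need to be lagged.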
