Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by John Mount (more articles) and Nina Zumel (more articles)
When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models that you rejected? In this Part 2 of our four part mini-series "How do you know if your model is going to work?" we develop in-training set measures. Previously we worked on:
- Part 1: Defining the scoring problem
In-training set measures
The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.
Run it once procedure
A common way to asses score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.
- The null model (on the graph as "null model"). This is the performance of the best constant model (model that returns the same answer for all datums). In this case it is a model scores each and every row as having an identical 7% chance of churning. This is an important model that you want to better than. It is also a model you are often competing against as a data science as it is the "what if we treat everything in this group the same" option (often the business process you are trying to replace).The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5) and packages like logistic regression routinely report this statistic.
- The best single variable model (on the graph as "best single variable model"). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to out perform as it represents the "maybe one of the columns is already the answer case" (if so that would be very good for the business as they could get good predictions without modeling infrastructure).The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model you have not outperformed what an analyst can find with a single pivot table.
At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:
To read more: win-vector blog: How do you know if your model is going to work? Part 2: In-training set measures
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.