Kaggle Competition In 30 Minutes: Predict Home Credit Default Risk With R
We were very excited when Home Credit teamed up with Kaggle to host the Home Credit Default Risk Challenge. Default risk is a topic that impacts all financial institutions, one that machine learning can help solve. We decided to flip the goal of this challenge: Kaggle competitions are performance driven, where a data scientist has months to fine tune a model to get maximum performance. This is not reality. We turned this goal upside down, focusing on a combination of speed, efficiency, and performance.
In the real world businesses have finite schedules for data science projects. Because of this, we redesigned the competition goal to get very good results quickly. We took a look at the top performance (measured by AUC or area under the curve), which is currently around 0.80. We decided to see how quickly we could develop a competitive model. Follow along and learn how we got a model of 0.70 on a Kaggle submission in about 30 minutes with H2O automated machine learning (AutoML)!
Related Business Machine Learning Articles
- HR Analytics: Using Machine Learning To Predict Employee Turnover
- Customer Analytics: Using Deep Learning With Keras To Predict Customer Churn
- Sales Analytics: How To Use Machine Learning To Predict And Optimize Product Backorders
Learning Trajectory
In this article, we flip the Kaggle Competition goal upside down, focusing on a combination of efficiency and performance. Matt shows you how to get a very good model that achieves a 0.70 AUC in about 30 minutes of analysis effort.
We kick the project off with an overview of credit default risk. You’ll begin by understanding what credit default risk is and how machine learning can help identify personal loan applications that are likely to lead to default. We’ll also go over why a Kaggle challenge redesign is more representative of what you’ll need to do in the real world.
Next, we’ll go through the process Matt used to build a competitive model in about 30 minutes. This includes about 10 minutes of strategizing and 20 minutes of actual coding. You’ll see how Matt used the following process to efficiently build a high performance model.
H2O AutoML Preprocessing Workflow
We end with next steps, which provides details on our end-to-end data science project course that will transform your data science abilities in 10 weeks.
Transform Your Abilities In 10 Weeks
If you’re interested in learning how to apply critical thinking and data science while solving a real-world business problem following an end-to-end data science project, check out Data Science For Business (DS4B 201-R). Over the course of 10 weeks you will solve an end-to-end Employee Churn data science project following our systematic Business Science Problem Framework.
Overview
We’ll briefly touch on what credit default risk is, how machine learning can help, and why we went with a 30-minute Kaggle Challenge.
What Is Credit Default Risk?
In the world of finance, Credit or Default Risk is the probability that companies or individuals will be unable to make the required payments on their debt obligations. Lenders are exposed to default risk on every extension of credit. Think of financial institutions like Citi, which lends to millions of people worldwide, or JP Morgan Chase & Co, which lends to hundreds of thousands of companies worldwide.
JP Morgan: A Major Lender Of Corporate Loans
Every time Citi or JP Morgan extends a loan to a person or a corporation, it assesses the level of risk involved in the transaction. This is because every once in a while people or companies won’t be able to make payments. In the dynamic world we live in, unfortunate events happen and circumstances change. Default risk is comprised of two components:
- Systemic Risk – The risk related to an external system of interconnected relationships (think of this as the risk associated with the economy, interest rate changes, inflation, recessions, and wars). This risk is often considered unavoidable; the only options are to hedge or to stay out of the market.
- Unsystematic Risk – The risk inherent to a person or company (think of a company whose competitor moves in and drives down sales). This risk is avoidable through diversification: making bets on many assets, preferably uncorrelated ones.
In this article, we’ll be focusing on a specific type of risk called Credit or Default Risk, which has both systemic and unsystematic drivers. The main point is that the drivers of default risk can be measured and analyzed for patterns related to default. As a result, the probability of default for a person or an institution is not random. This is where machine learning can help.
Machine Learning For Personal Default Risk
Like any machine learning problem, default risk has drivers, or features related to the person or company that have a relationship with the likelihood of a default. As discussed previously, some of the drivers are systemic and others are unsystemic. Identifying and measuring these drivers are the keys to predicting default.
For the Kaggle Competition, Home Credit (the company) has supplied us with data from several data sources. The following Data Architecture Diagram shows the interrelationships between the data files provided.
Data Architecture Diagram For Home Credit Default Risk Competition
For the purposes of this article, we’ll focus on the application_train.csv and application_test.csv files, which contain a significant amount of useful information for predicting credit default. This is the main source of information on people who have applied for personal loans, including features related to their loan applications.

Several other data sources are provided, such as bureau.csv and previous_application.csv. These data have “one-to-many relationships”, which can be combined through a process called feature engineering. In a future article, we will go into strategies for feature engineering with data that have “one-to-many relationships” (e.g. creating features from bureau.csv that can be combined with application_train.csv).
Why A Kaggle Challenge Redesign?
While we enjoy the idea of the Kaggle Challenge, it’s not representative of a real world scenario where projects have fixed time schedules. The Kaggle Competition goal is to compete on maximum performance. The goal in the real world is quite often to get a good model within a defined project schedule.
Time to completion is important because projects have fixed resources. The longer you are on a project, the more you cost the company money and the more you delay other projects that you could be helping with.
The task of developing a high performance model is just one small piece of a data scientist’s job. It’s unrealistic to think that an organization will be OK with you (the data scientist) dedicating hundreds of man-hours to squeezing every last drop of performance out of the model. It’s not reality. Remember the saying, “Great is the enemy of Good”. This is often the case in real-world data science.
Real-World End-To-End Data Science Project Structure
In the real world, you won’t have months that you can dedicate to a model. Rather, you will have maybe a day or two. The majority of your project time will be spent on:
- Understanding the business (2 weeks)
- Communicating with decision makers to turn their input into a data collection strategy (2 weeks)
- Collecting and cleaning the data (1 week to several months)
- Building a model (1 to 2 days)
- Converting machine learning results into expected savings or profitability from a business change/improvement and communicating results (1 to 2 weeks)
- Releasing the results into production via web applications (several weeks)
This is what we call an “end-to-end” data science project.
Interested in Solving an End-To-End Data Science Project?
If you’re interested in learning how to apply critical thinking and data science while solving a real-world business problem following an end-to-end data science project, check out Data Science For Business (DS4B 201-R). Over the course of 10 weeks you will solve an end-to-end Employee Churn data science project following our systematic Business Science Problem Framework.
Completing A Kaggle Competition In 30 Minutes
By Matt Dancho, Founder of Business Science
Why 30-minutes for a Kaggle Challenge?
Because I wanted to show you that if you leverage high performance tools, you can drastically cut your modeling time down while still getting a very good model. As a result, you can spend more time getting better features (up front in the project) and communicating results (after the model is developed). And if you have time left after the first round of models is built, you can always go back and improve performance by adding more data (more features) and tuning the models further.
I’ll be providing details how I personally approached the Kaggle Competition and what I did each step of the way to get a good solution fast. Here’s the general process with code that I used to complete a high performing model in about 30 minutes.
Important Note: I did not set out with 30 minutes in mind.
My actual time was roughly 30 minutes.
Basis For Kaggle Competition Redesign
One of my favorite episodes of the Tim Ferriss Podcast is one in which he interviews Walter O’Brien, “the real life Scorpion”. Walter is the basis for the hit TV show, Scorpion, which brings together a team of hackers and intellectual misfits to solve crime with technology. While the podcast episode sparked a lot of controversy related to Walter’s history and claims that could not be verified, I found a particular piece of the episode very interesting: the setup of the 1993 Computing Olympics, which Walter reportedly participated in.
In the final challenge of the Computing Olympics, a computer scientist had a fixed amount of time (3 hours) to solve a very complicated business problem by creating an algorithm. The best contestants followed this strategy:
- The top contestants began by critically thinking: thinking about the problem and what tools they had in their arsenal to solve it. This could take upwards of 50% to 2/3 of the allotted competition time.
- They spent the remaining 50% to 1/3 speed coding: hammering out their solution, converting it to code in the programming languages they knew would be the best tools for the scenario.
There’s a lot of value in this approach because of the critical thinking process and the fact that in the real world you will likely only have a fixed amount of days you can work on a model before the business deadline is hit. I wanted to see if this approach could be applied to the Home Credit Default Risk challenge.
Setup And Data (5 min)
This part took about 5 minutes excluding the time to download the data and place it into a folder.
You can collect the data from the Kaggle Home Credit Default Risk Competition site. Just a heads up: the datasets total 688 MB, so we need to be mindful of disk space and, more importantly, RAM.
Once the data is downloaded, I added it to a folder in my home directory called “00_data”.
Libraries
I loaded the following libraries to tackle this Home Credit Default Risk problem. The tidyverse pulls in dplyr, ggplot2, and several other must-haves (e.g. forcats, purrr, etc). The recipes package is loaded for preprocessing. The h2o package is going to do the heavy lifting via automated machine learning (AutoML).
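A minimal sketch of the setup (assuming the packages are installed):

```r
library(tidyverse)  # Loads dplyr, ggplot2, purrr, forcats, readr, and friends
library(recipes)    # Preprocessing workflow
library(h2o)        # Automated machine learning (AutoML)
```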
Data
Next, I read the data with read_csv(), which takes care of most of the column formatting for us. Note that the file size is 158 MB. This can be a problem when making multiple copies of the data in R, which can quickly cause us to run out of RAM. A strategy I used was to remove previous iterations of the data as I went through processing. You’ll see me use rm() a lot.
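The import step looks something like this (the “00_data” folder comes from the setup above):

```r
# Read the application data; read_csv() infers the column types
application_train_tbl <- read_csv("00_data/application_train.csv")
application_test_tbl  <- read_csv("00_data/application_test.csv")
```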
Next, I took a quick slice() of the data to see what features are present. I use kable() to format the table for printing here, but this was not done during the 30 minutes of coding.
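Something along these lines:

```r
# Peek at the first 10 rows (kable() just formats the output for the web)
application_train_tbl %>%
    slice(1:10) %>%
    knitr::kable()
```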
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100002 | 1 | Cash loans | M | N | Y | 0 | 202500 | 406597.5 | 24700.5 | 351000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648 | -2120 | NA | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.0830370 | 0.2629486 | 0.1393758 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0 | 0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2 | 2 | 2 | 2 | -1134 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
100003 | 0 | Cash loans | F | N | N | 0 | 270000 | 1293502.5 | 35698.5 | 1129500 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186 | -291 | NA | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.3112673 | 0.6222458 | NA | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0 | 0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1 | 0 | 1 | 0 | -828 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500 | 135000.0 | 6750.0 | 135000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260 | -2531 | 26 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NA | 0.5559121 | 0.7295667 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 0 | -815 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100006 | 0 | Cash loans | F | N | Y | 0 | 135000 | 312682.5 | 29686.5 | 297000 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833 | -2437 | NA | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NA | 0.6504417 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 0 | 2 | 0 | -617 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA |
100007 | 0 | Cash loans | M | N | Y | 0 | 121500 | 513000.0 | 21865.5 | 513000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311 | -3458 | NA | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NA | 0.3227383 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 0 | -1106 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100008 | 0 | Cash loans | M | N | Y | 0 | 99000 | 490495.5 | 27517.5 | 454500 | Spouse, partner | State servant | Secondary / secondary special | Married | House / apartment | 0.035792 | -16941 | -1588 | -4970 | -477 | NA | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 16 | 0 | 0 | 0 | 0 | 0 | 0 | Other | NA | 0.3542247 | 0.6212263 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 0 | -2536 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
100009 | 0 | Cash loans | F | Y | Y | 1 | 171000 | 1560726.0 | 41301.0 | 1395000 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.035792 | -13778 | -3130 | -1213 | -619 | 17 | 1 | 1 | 0 | 1 | 1 | 0 | Accountants | 3 | 2 | 2 | SUNDAY | 16 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.7747614 | 0.7239999 | 0.4920601 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 0 | 1 | 0 | -1562 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
100010 | 0 | Cash loans | M | Y | Y | 0 | 360000 | 1530000.0 | 42075.0 | 1530000 | Unaccompanied | State servant | Higher education | Married | House / apartment | 0.003122 | -18850 | -449 | -4597 | -2379 | 8 | 1 | 1 | 1 | 1 | 0 | 0 | Managers | 2 | 3 | 3 | MONDAY | 16 | 0 | 0 | 0 | 0 | 1 | 1 | Other | NA | 0.7142793 | 0.5406545 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 0 | 2 | 0 | -1070 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100011 | 0 | Cash loans | F | N | Y | 0 | 112500 | 1019610.0 | 33826.5 | 913500 | Children | Pensioner | Secondary / secondary special | Married | House / apartment | 0.018634 | -20099 | 365243 | -7427 | -3514 | NA | 1 | 0 | 0 | 1 | 0 | 0 | NA | 2 | 2 | 2 | WEDNESDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.5873340 | 0.2057473 | 0.7517237 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
100012 | 0 | Revolving loans | M | N | Y | 0 | 135000 | 405000.0 | 20250.0 | 405000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.019689 | -14469 | -2019 | -14437 | -3992 | NA | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 1 | 2 | 2 | THURSDAY | 8 | 0 | 0 | 0 | 0 | 0 | 0 | Electricity | NA | 0.7466436 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 0 | 2 | 0 | -1673 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA |
application_train_tbl: First 10 Rows
A few points about the data:
- The training data contains 307K observations, which are people who applied for and received loans
- The “TARGET” column identifies whether or not the person defaulted
- The remaining 121 features describe various attributes about the person, loan, and application
- There are several additional data sets (bureau.csv, previous_application.csv). I ignored these since most of the important data was likely to be in the main file. These auxiliary files contain data with one-to-many relationships, which would require aggregation (a feature engineering method) and considerably more work for unknown benefit.
I was more interested in agile iteration: Building a good model quickly, then going back to try to improve later.
Train And Test Sets
Next, I formatted the data as train and test sets. I noted that the “application_test_tbl” did not have a TARGET column (I found this out when I tried to process it with recipes during the first go-around). The “test” set is the one the Kaggle Competition is scored against, which we will submit later. Note that I removed the unnecessary data sets once I was finished with them in order to save memory.
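A sketch of the split, using the object names referenced later in the article:

```r
# Separate the features from the target; the Kaggle test set has no TARGET column
x_train_tbl <- application_train_tbl %>% select(-TARGET)
y_train_tbl <- application_train_tbl %>% select(TARGET)

x_test_tbl <- application_test_tbl

# Remove the raw data to free up RAM
rm(application_train_tbl)
rm(application_test_tbl)
```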
Data Inspection
I used one of my favorite packages for data investigation, skimr. This package helps me understand how much missing data I’m dealing with and get an overall sense of the data by data type.
Normally I just run skim() on the data. However, for the purposes of outputting in a readable format on the web, I’ll use skim_to_list().
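For example:

```r
library(skimr)

# Summarize each feature by data type; skim_to_list() returns web-friendly tibbles
x_train_tbl %>%
    skim_to_list()
```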
I see that there are 3 data type formats: integer, numeric, and character. For H2O we need numeric and factor. I also see that some of the integer columns are “FLAG” variables, which typically indicates categorical data.
Analysis Strategy (5 min)
The first 5 minutes were dedicated to mocking up a plan. I began by framing a solution on a piece of scrap paper.
Analysis Strategy on Scrap Paper
The general workflow (1) is at the top where I wanted to accomplish the following goals:
- Read the data
- Prepare the data
- Model: For this step I wanted to use H2O because I knew I could leverage automated machine learning (AutoML) to vastly speed up the modeling process
- Make predictions
The second half (2) of the scrap paper is how I planned to execute each step. The workflow involved a special tool, recipes, to accomplish the preprocessing. I planned to minimally transform the data, addressing missing values and categorical data.
With a plan in hand, I got started.
Preprocessing Strategy (5 min)
I leveraged my knowledge of how H2O AutoML works under the hood to develop a simplified strategy to tackle preprocessing. H2O AutoML handles common tasks internally such as creating dummy variables (or one hot encoding), performing statistical transformations (e.g. YeoJohnson), etc. As a result, this greatly simplifies the preprocessing tasks that I needed to worry about.
The main things I focused on were:
- Character data that needs to be factored: h2o.automl() requires any character data to be converted to a factor.
- Numeric data with a low number of unique values: This is common in business data, where numeric features with few unique values are typically categorical data.
- Missing values: Missing values need to be either imputed or have their columns and/or rows dropped. I made a point to impute all data before modeling with h2o.automl(), although I believe AutoML has methods for handling missing data.
Here’s the simplified strategy I used for preprocessing for AutoML. I diagrammed it after the fact to show the flow of how the entire process works beyond my “chicken scratch” scrap paper diagram shown previously. Keep in mind that H2O handles most data transformations “under the hood”, which is really useful for getting good models quickly.
H2O AutoML Preprocessing Workflow
Interested in Learning Preprocessing Strategies For Business?
If you're interested in learning how to apply preprocessing strategies using R programming beyond the simplified strategy shown, we teach preprocessing with recipes in Chapter 3 of Data Science For Business (DS4B 201-R) as part of our 10-week end-to-end Employee Churn data science project.
Let’s implement the H2O AutoML Preprocessing Workflow shown in the diagram.
Character Data
First, I collected the names of all character data. These columns will be converted into factors.
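For example (the object name is illustrative):

```r
# Names of all character columns; these will become factors
string_2_factor_names <- x_train_tbl %>%
    select_if(is.character) %>%
    names()
```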
Numeric Factor Data
Next, I looked at which numeric data should be factored (treated as categorical). These typically have a low number of unique levels. I used a few dplyr and purrr operations to get a count of the unique levels. The real trick is using map_df() with the anonymous function ~ unique(.) %>% length(), which counts the unique values in each column, then gather()-ing the results into the long format.

I used a factor limit of 7, meaning any numeric feature with fewer than 7 unique values will be converted to a factor. I then collected the column names by pull()-ing the key column and storing them as character data.
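A sketch of those operations (intermediate names are illustrative):

```r
# Count the unique values in every numeric column, gathered into long format
unique_numeric_values_tbl <- x_train_tbl %>%
    select_if(is.numeric) %>%
    map_df(~ unique(.) %>% length()) %>%
    gather() %>%
    arrange(value)

# Numeric columns with fewer than 7 unique values get treated as categorical
factor_limit <- 7

num_2_factor_names <- unique_numeric_values_tbl %>%
    filter(value < factor_limit) %>%
    pull(key) %>%
    as.character()
```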
Missing Data
I wanted to see how much missing data I was dealing with. The trick here was using the summarize_all() function with an anonymous function (.funs = ~ sum(is.na(.)) / length(.)) to calculate the percentage of missing data in each column. A number of the features have over 65% missing values. My plan was to impute all of the data, using the mean for numeric features and the mode for categorical features. More on this next.
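Along these lines:

```r
# Percentage of missing values per feature, worst offenders first
missing_tbl <- x_train_tbl %>%
    summarize_all(.funs = ~ sum(is.na(.)) / length(.)) %>%
    gather() %>%
    arrange(desc(value)) %>%
    filter(value > 0)
```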
Preprocessing Implementation And H2O Modeling (15 min)
This step took about 15 minutes in total. We first use recipes to preprocess the data for H2O AutoML. We then model with h2o.
Recipes Preprocessing
I used recipes to quickly preprocess the data for AutoML:

- Converting character data to factors: I used step_string2factor() with the character names we stored earlier.
- Converting numeric values with few unique levels to factors: I used step_num2factor() with the numeric column names we stored earlier.
- Imputing missing data: I used step_meanimpute() for the numeric features and step_modeimpute() for the categorical features.
After applying the step_ functions, I prepared the recipe using prep(). The trick here was to use stringsAsFactors = FALSE, which eliminates a pesky error.
Next, I used bake() on the training and testing data sets to implement the minimal transformations.
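Here’s a sketch of the full recipe. Note that on recent versions of recipes the imputation steps have been renamed (step_impute_mean()/step_impute_mode()) and step_num2factor() requires explicit levels, so treat this as a period-style sketch rather than copy-paste code:

```r
# Minimal preprocessing recipe: factor conversion + imputation
rec_obj <- recipe(~ ., data = x_train_tbl) %>%
    step_string2factor(string_2_factor_names) %>%
    step_num2factor(num_2_factor_names) %>%
    step_meanimpute(all_numeric()) %>%
    step_modeimpute(all_nominal()) %>%
    prep(stringsAsFactors = FALSE)

# Apply the recipe to both data sets
x_train_processed_tbl <- bake(rec_obj, x_train_tbl)
x_test_processed_tbl  <- bake(rec_obj, x_test_tbl)
```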
We can see the results of the transformation below. For brevity, I show the first 30 columns and use the glimpse() function to quickly scan the data.
Before Transformation
Before the transformation (x_train_tbl), we can see that NA values are present and many of the flags are coded as integer.
After Transformation
After the transformation (x_train_processed_tbl), the NA values are imputed, and we are left with factor and numeric (integer and double) data.
I also converted the target to a factor, which is the format H2O needs to perform binary classification. The trick is to convert the numeric value to character and then to factor.
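For instance:

```r
# H2O needs the binary target as a factor; go numeric -> character -> factor
y_train_processed_tbl <- y_train_tbl %>%
    mutate(TARGET = TARGET %>% as.character() %>% as.factor())
```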
Last, I removed the unprocessed datasets and recipe object to save memory.
Interested in Learning Preprocessing Strategies For Business?
If you're interested in learning how to apply preprocessing strategies using R programming beyond the simplified strategy shown, we teach preprocessing with recipes in Chapter 3 of Data Science For Business (DS4B 201-R) as part of our 10-week end-to-end Employee Churn data science project.
Modeling
I implemented Automated Machine Learning (AutoML) with H2O. First, I loaded h2o and initialized a local cluster with h2o.init(). We can also call h2o.no_progress() to eliminate progress bars.
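That startup is just:

```r
h2o.init()         # Fire up a local H2O cluster
h2o.no_progress()  # Turn off progress bars for cleaner output
```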
Next, I converted the data frame to an H2O Frame using as.h2o(). H2O expects both the target and the training features to be in the same data frame, so I used bind_cols() first.
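A sketch, using the processed objects from above (the H2O frame names are illustrative):

```r
# Combine target and features into one frame for H2O
data_h2o <- as.h2o(bind_cols(y_train_processed_tbl, x_train_processed_tbl))

# The Kaggle application test set (no target), used for the final submission
new_data_h2o <- as.h2o(x_test_processed_tbl)
```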
I split the training data into train, validation, and test sets with a 70/15/15 split using h2o.splitFrame(). Note that I only used the loan application training data for this part, not the testing data (the testing data will be used to make the Kaggle Competition submission).
Next, I ran h2o.automl(). I set max_runtime_secs = 90 to speed up the training at the expense of some model performance (the default is 3600 seconds, which can take quite a while). The resulting leader model was produced in approximately 6 minutes.
Interested in Learning H2O AutoML For Business?
If you're interested in learning how to perform H2O Automated Machine Learning using R Programming beyond the simplified modeling shown, we teach H2O AutoML modeling (including Grid Search and Cross Validation) in Chapter 4 of Data Science For Business (DS4B 201-R) as part of our 10-Week end-to-end Employee Churn data science project.
H2O Performance And Kaggle Submission
This part was excluded from the timed challenge since the model was already developed; it simply tests the validity of the results and submits the final solution.
H2O Performance
We can perform a brief performance analysis to assess the AutoML Leader model. First, let’s create an H2O performance object using h2o.performance(), setting newdata to test_h2o. Note that this is not the application test set, but rather a random sample of 15% of the application training set.
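That is:

```r
performance_h2o <- h2o.performance(automl_leader, newdata = test_h2o)
```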
Next, let’s output the Confusion Matrix using the h2o.confusionMatrix() function. It’s important to understand that this model uses a threshold of 0.139, which maximizes the F1 score. This is NOT always the best case for the business. If you want to learn how to optimize the classification threshold, our Data Science For Business With R course covers Expected Value, Threshold Optimization, and Sensitivity Analysis in Chapters 7 and 8.
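The call is a one-liner:

```r
performance_h2o %>% h2o.confusionMatrix()
```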
We can see that 1,528 people who defaulted were correctly detected, 5,516 were incorrectly predicted to default but actually did not, and 2,185 were incorrectly predicted not to default but actually did. This last group is the False Negatives, and this population can be EXTREMELY COSTLY to the lender, typically a lot more costly than the False Positives. This is why threshold optimization is important. Check out DS4B 201-R if you want to learn how to do threshold optimization.
We can also check out the AUC on the performance set. We’re getting about 0.73, which is pretty good.
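Pulling the AUC from the same performance object:

```r
performance_h2o %>% h2o.auc()
```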
Interested in Learning H2O Performance Analysis For Business?
If you're interested in learning how to do performance analysis for H2O modeling using R Programming beyond the simplified performance analysis shown, we teach:
- Chapter 5: H2O Performance Analysis (ROC, Precision vs Recall, Gain & Lift) on an entire leaderboard of H2O models (stacked ensembles, deep learning, random forest, and more)
- Chapter 7: Expected Value Framework and targeting using cost/benefit analysis
- Chapter 8: Optimizing the threshold to maximize profitability and performing sensitivity analysis to account for unknowns.
Check out Data Science For Business (DS4B 201-R) as part of our 10-Week end-to-end Employee Churn data science project.
Kaggle Submission
This is the moment of truth. I made my final predictions for the Kaggle Competition. Here’s the procedure and final results.
First, we can make predictions on the application test set using h2o.predict().
Next, let’s convert to a tibble and bind with the correct SK_ID_CURR. We’ll grab the SK_ID_CURR and p1 (positive class probability) columns, which is the format expected for the Kaggle Competition.
Finally, we can output the submission to a CSV file using the write_csv() function.
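Here’s a sketch of the whole submission step (the output file name is illustrative; Kaggle’s sample submission names the probability column TARGET):

```r
# Predict on the Kaggle application test set
prediction_h2o <- h2o.predict(automl_leader, newdata = new_data_h2o)

# Keep SK_ID_CURR plus the positive class probability (p1)
prediction_tbl <- prediction_h2o %>%
    as.data.frame() %>%
    as_tibble() %>%
    bind_cols(x_test_processed_tbl %>% select(SK_ID_CURR)) %>%
    select(SK_ID_CURR, p1) %>%
    rename(TARGET = p1)

prediction_tbl %>% write_csv("submission_001.csv")
```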
Submitting my results, I got an AUC of approximately 0.70. Not bad for 30 minutes of modeling.
Kaggle Submission
The important thing to understand is that this is a great baseline for a model. We can begin to improve it, which can take significant effort depending on how far you want to go.
We did not perform feature engineering or any additional model tuning beyond what AutoML performs by default. Doing so would get us closer to the top results.
Next Steps (Transform Your Abilities)
It’s critical to understand that modeling is just one part of the overall data science project. Make no mistake, it’s an important part, but the other parts are equally if not more important, which is why our Data Science For Business With R (DS4B 201-R) course is successfully teaching data science students how to apply data science in the real world.
We teach end-to-end data science. This means you learn how to:
- Chapter 1: Sizing The Problem, Custom Functions With Tidy Eval: Learn how to understand whether a business problem exists by sizing the problem. In addition, create custom functions using Tidy Eval, a programming interface needed to build custom functions on top of dplyr and ggplot2.
- Chapter 2: Exploration with GGally, skimr: We show you how to explore data and identify relationships efficiently and effectively.
- Chapter 3: recipes, Premodeling Correlation Analysis: We teach you how to use recipes to develop a data transformation workflow, and we show you how to perform a pre-modeling correlation analysis so you do not move into modeling prematurely, which again saves you time.
- Chapters 4 and 5: H2O AutoML: You will first learn how to use all of the major h2o functions to perform automated machine learning for binary classification, including working with the H2O leaderboard, visualizing results, and performing grid search and cross validation. In the next chapter, you learn how to evaluate multiple models against each other with ROC, Precision vs Recall, Gain and Lift, which is necessary to explain your choice of best model to the business stakeholders (your manager and key decision makers).
- Chapter 6: LIME: You learn how to create local explanations to interpret black-box machine learning models. Explanations of the model predictions are critical because the business cannot do anything to improve unless they understand why. We show you how with lime.
- Chapters 7 and 8: Expected Value Framework, Threshold Optimization, Sensitivity Analysis: These are my two favorite chapters because they show you how to link the churn classification model to financial results, and they teach purrr for iterative grid-search style optimization. These are POWERFUL CONCEPTS.
- Chapter 9: Recommendation Algorithm: We again use a correlation analysis but in a different way. We discretize, creating a special visualization that enables us to develop a recommendation strategy.
Data Science For Business With R (DS4B 201-R)
Learn everything you need to know to complete a real-world, end-to-end data science project with the R programming language. Transform your abilities in 10 weeks.
Business Science University Course Roadmap
We have one course out and two courses coming soon!
Data Science For Business With R (DS4B 201-R)
Over the course of 10 weeks, we teach you how to solve an end-to-end data science project. Available now!
Transform your abilities by solving employee churn over a 10-week end-to-end data science project in DS4B 201-R
Building A Shiny Application (DS4B 301-R)
Our next course teaches you how to take the H2O Model, the LIME explanation results, and the recommendation algorithm you develop in DS4B 201-R and turn them into a Shiny Web Application that predicts employee attrition! Coming in Q3 2018.
Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in DS4B 301-R (Building A Shiny Web App)
Kelly O’Briant is lead developer for the Shiny App Course coming soon. She’s a brilliant software engineer / data scientist that knows how to make a great looking and performing Shiny app.
Data Science For Business With Python (DS4B 201-P)
Did we mention we have a DS4B Python Course coming?!?! Well, we do! Coming in Q4 2018.
The problem changes: Customer Churn! The tools will be H2O, LIME, and a host of other tools implemented in Python + Spark.
Python Track: Data Science For Business With Python And Spark
Favio Vazquez, Principal Data Scientist at OXXO, is building the Python + Spark equivalent of DS4B 201-R. He’s immensely talented, knowing Python, Spark, and R, along with a host of other data science tools.
Don’t Miss A Beat
- Sign up for the Business Science “5 Topic Friday” Newsletter!
- Get started with Business Science University to learn how to solve real-world data science problems from Business Science
- Check out our Open Source Software
Connect With Business Science
If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:
- business-science on GitHub
- Business Science, LLC on LinkedIn
- bizScienc on twitter
- Business Science, LLC on Facebook