
Predictive Power Score vs CorrelationFunnel

[This article was first published on business-science.io, and kindly contributed to R-bloggers].

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks.


Here are the links to get set up.



Battle of the EDA Packages

Exploratory Data Analysis (EDA) is how every data scientist uncovers actionable insights in the data. This process used to take forever. Not anymore. We’re kicking the tires on two EDA packages in a Battle Royale to determine which one reigns supreme.

As you follow along, you can use my Ultimate R Cheatsheet. It consolidates the most important R packages (ones I use every day) into 1 cheatsheet. We’ll also be using the tidyverse and tidymodels / xgboost from the cheatsheet as the tie-breaker.

So let’s get started. Today you’re making two data visualizations: a PPScore Heatmap and a Correlation Funnel.

The Drawbacks of Correlation
Did you know that you miss important insights with correlation?

Correlation plots are must-know plots. Everyone understands correlations (even non-technical people). Correlations are great for explaining insights in simple terms.

But Correlation has serious drawbacks.
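
One well-known drawback is that the usual Pearson correlation only measures linear association. The sketch below uses simulated data (an assumption made purely for illustration, not the churn data in this post) where `y` is almost completely determined by `x`, yet the correlation coefficient comes out close to zero.

```r
# Simulated data: an illustrative assumption, not the data set used in this post
set.seed(123)
x <- seq(-3, 3, length.out = 201)
y <- x^2 + rnorm(length(x), sd = 0.1)

# Pearson correlation is close to zero because the relationship is
# symmetric around x = 0, so there is no linear trend to detect
linear_cor <- cor(x, y)
linear_cor

# A model that allows curvature explains nearly all of the variation in y
fit <- lm(y ~ poly(x, 2))
curved_r2 <- summary(fit)$r.squared
curved_r2
```

A correlation plot would rank `x` as unimportant here, even though it predicts `y` almost perfectly.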

Enter the Predictive Power Score, a new technique for finding relationships.

The Contender: Predictive Power Score
An enhanced version of the traditional correlation that scores feature importance

Predictive Power Score fixes the issues. It works with:

Here’s the implementation in the ppsr package.
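
A minimal sketch of such a call, under a few assumptions: the data is the `customer_churn_tbl` data set that ships with correlationfunnel, the `customerID` column is dropped because it only identifies rows, missing `TotalCharges` values are filled with `MonthlyCharges`, and character columns are converted to factors. The object names `churn_tbl` and `pps_tbl` are hypothetical.

```r
library(dplyr)
library(ppsr)
library(correlationfunnel)  # assumed source of customer_churn_tbl

# Assumed preparation: drop the ID column, fill missing TotalCharges,
# and convert character columns to factors
churn_tbl <- customer_churn_tbl %>%
    select(-customerID) %>%
    mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
    mutate(across(where(is.character), as.factor))

# Score every feature as a predictor of Churn (one row per feature),
# then sort so the strongest predictors come first
pps_tbl <- score_predictors(df = churn_tbl, y = "Churn") %>%
    arrange(desc(pps)) %>%
    select(x, y, pps)
pps_tbl

# With no target supplied, visualize_pps() scores every pair of columns
# and draws the scores as a heatmap (this can take a while on wide data)
visualize_pps(df = churn_tbl)
```

Passing `y = "Churn"` to `visualize_pps()` restricts the plot to the scores for that one target.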

This outputs a nice “PPScore Heatmap”, which looks very similar to a Correlation Matrix Heatmap. We can see that Tenure and TotalCharges are the top features identified with relationship to Churn.

Awesome. In a few lines of code, we got some insights telling us to look closer at TotalCharges and Tenure.

But did the PPScore miss anything?

Let’s try another method: My Correlation Funnel R Package.

The Defender: Correlation Funnel
Uses the “binning trick” to fix the issues of traditional correlation

Correlation Funnel fixes the issues with traditional correlation using a clever binning strategy (i.e. the “binning trick”) that allows categorical data to be one-hot encoded and numeric features to be binned into binary groups.

This creates an easy-to-read plot that shows not only the magnitude of each relationship, but also which parts of a feature correlate most with the segment of interest (e.g. Contract = Month-to-Month correlates with Churn = Yes).
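
The binning trick can be sketched with the three main correlationfunnel verbs: `binarize()`, `correlate()`, and `plot_correlation_funnel()`. The settings below (`n_bins = 5`, `thresh_infreq = 0.01`) and the fill for missing `TotalCharges` values are assumptions, not necessarily the ones used for the plot discussed here. The object names are hypothetical.

```r
library(dplyr)
library(correlationfunnel)

# Step 1: binarize. Numeric features are cut into bins and categorical
# features are one-hot encoded, so every column becomes a 0/1 flag.
churn_binarized_tbl <- customer_churn_tbl %>%
    select(-customerID) %>%
    # Assumption: binarize() needs complete data, so fill the few missing
    # TotalCharges values with MonthlyCharges
    mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
    binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)

# Step 2: correlate every binary flag with the segment of interest
churn_corr_tbl <- churn_binarized_tbl %>%
    correlate(target = Churn__Yes)

# Step 3: plot the correlations as a funnel, strongest features at the top
churn_corr_tbl %>%
    plot_correlation_funnel()
```

Because each bin gets its own correlation, the result points to a specific level or range of a feature instead of the feature as a whole.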

We see new features that PPScore didn’t have:

Which is right? We need a tie-breaker.

Tie-Breaker: XGBoost + VIP
Use feature importance from a model designed for non-linear relationships

To help us decide which to believe, we can use a third model, XGBoost:

Watch the YouTube Video for this one.
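
For readers who prefer code to video, here is a rough sketch of one way to set up such a tie-breaker. It is not the code from the video. It assumes the `customer_churn_tbl` data from correlationfunnel, default XGBoost settings through parsnip (part of tidymodels), the vip package for the importance plot, and an installed xgboost package. The object names are hypothetical.

```r
library(dplyr)
library(parsnip)            # modeling interface from tidymodels
library(vip)
library(correlationfunnel)  # assumed source of customer_churn_tbl

# Assumed preparation: drop the ID column, fill missing TotalCharges,
# and convert character columns (including the Churn target) to factors
churn_model_tbl <- customer_churn_tbl %>%
    select(-customerID) %>%
    mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
    mutate(across(where(is.character), as.factor))

# Fit a boosted tree classifier with the xgboost engine, default hyperparameters
set.seed(123)
fit_xgb <- boost_tree(mode = "classification") %>%
    set_engine("xgboost") %>%
    fit(Churn ~ ., data = churn_model_tbl)

# Plot variable importance from the underlying xgboost model
vip(fit_xgb$fit)
```

This fits on all rows with no tuning or hold-out set, which is enough for ranking features in EDA but not for judging predictive accuracy.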



And the results are in. The top features for each method are:

Interpretation of Results



SETUP R-TIPS WEEKLY PROJECT

  1. Get the Code
  2. Check out the R-Tips Setup Video

Once you take these actions, you’ll be set up to receive R-Tips with Code every week. =)





