Excel to R, Part 2 – Speed Up Exploratory Data Analysis 100X (R Code!)
You’re a Business Analyst – well versed in tools like Tableau, PowerBI, and maybe even SQL – and you want to take your data analytics abilities to the next level by improving productivity and delivering predictive business insights with data science (rather than just descriptive insights). Then R is the language for you. In this article, you’ll learn how to perform Exploratory Data Analysis (EDA) 100X FASTER using R!
Don’t believe me? This is one of the techniques that one of my Business Science University students used to place 3rd in a time-sensitive Tableau competition… without knowing Tableau! He used R and Correlation Analysis to understand his data and develop the best story, delivering critical business insights fast. You’ll learn one of the tools he used in this article.
Bonus! – I’m demonstrating a new R package that’s under development: correlationfunnel. You’ll see how I use it to get business insights fast!
Article Series Overview
This is the 2nd Article in a series on why you should learn R if you’re coming from a Business Analytics background.
Articles in Series
- Excel to R, Part 2 – Speed Up Exploratory Data Analysis 100X (You’re here)
- Excel to R, Part 3 – Automated Business Reporting (Coming Soon)
What You Learn Today: Correlation Funnel for 100X EDA Speedup
I am going to show you one of the simplest and fastest ways to analyze your data. In this article, I’m going to introduce you to a concept called Correlation Analysis using my new correlationfunnel R package. First, we’ll see why Exploratory Data Analysis (EDA) in Excel (with Pivot Tables) is painfully slow and makes it easy to miss important aspects of your data. Then, we’ll show you how to create a Correlation Funnel Visualization that brings out the best features FAST!
Here’s what the Correlation Funnel looks like for the bank marketing campaign dataset we will use. High correlation features go to the top. From the data visualization, we can quickly see that DURATION (319_Inf) and POUTCOME (success) are the top positive correlations to TERM DEPOSIT (yes). I’ll explain what this means later. For now, just realize that (1) these are really important business insights, and (2) we can get them FAST!
Correlation Funnel – Fast EDA
Bank Marketing Campaign Data
This 3-Part series follows a real-world study that companies like Bank of America and others that provide Certificates of Deposit (CDs), also called Term Deposits, would go through to analyze customer purchasing behavior.
The bank is interested in contacting customers that are likely to enroll in a Term Deposit (a CD for a fixed term). Banks love these products because they can earn interest on the customers’ money. More enrollments mean more revenue for the bank.
The dataset we’ll use is a modified version of the “Bank Marketing Data Set” provided by the UCI Machine Learning Repository.1 The version we will use is in an Excel file with multiple tabs covering the business process. You can get the Bank Marketing Campaign dataset in Excel here.
The Problem with Pivot Tables
The normal approach to Exploratory Data Analysis (EDA) is to investigate each feature, mining for relationships to some goal or target. In our case, enrollment in the TERM DEPOSIT (financial product).
In Excel, we use Pivot Tables to do this. The Marketing Campaign data has 16 predictor features (excluding the target and the ID field). This would involve making at least 16 plots (one for each comparison to the target). There are 2 problems with this approach:
- Pivot Tables are not systematic – It’s easy to miss one of the features or to get sidetracked in the analysis project.
- Pivot Tables are time-consuming to make – Each Pivot Table can take 10+ minutes to make and format.
Here’s an example of one pivot table. It took me about 10 minutes to make and format. It looks great, but it’s going to be a long day if I have to do 15 more of these just to investigate relationships.
Pivot Table – Takes 10 Minutes Per Plot x 16 Plots = 160 Minutes (~ 3hrs of nonstop visualization = Long Day)
The Solution – Use Correlation Analysis
Let’s see how we can get straight to business insights.
1.0 Libraries
R packages we’ll need:
- correlationfunnel – Install from GitHub
- tidyverse
- readxl
- tidyquant
- knitr

You’ll need to install my new R package, correlationfunnel, from GitHub (I’ll most likely be adding it to CRAN in the near future). The tidyverse, readxl, tidyquant, and knitr packages can be installed from CRAN using install.packages().
Load the libraries.
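Here’s a minimal installation and loading sketch. The GitHub repository path is an assumption on my part; check the package README for the authoritative install instructions.

```r
# Install correlationfunnel from GitHub (repository path assumed)
# devtools::install_github("business-science/correlationfunnel")

# Install the CRAN packages
# install.packages(c("tidyverse", "readxl", "tidyquant", "knitr"))

library(correlationfunnel)
library(tidyverse)
library(readxl)
library(tidyquant)
library(knitr)
```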
2.0 Read Data
We can quickly load the marketing data and join the appropriate Excel sheets (4 through 7) using the code from the Previous Article in the Excel to R series.
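For reference, here’s a sketch of that read-and-join step. The file name, sheet positions, and tibble name are assumptions based on the description above; the Previous Article has the exact code.

```r
# Path to the Excel file (file name assumed)
path <- "bank_term_deposit_marketing_analysis.xlsx"

# Sheets 4 through 7 each hold one table keyed by the ID column
sheet_names <- excel_sheets(path)[4:7]

# Read each sheet and join them all by ID
marketing_campaign_tbl <- sheet_names %>%
    map(~ read_excel(path = path, sheet = .x)) %>%
    reduce(left_join, by = "ID")

marketing_campaign_tbl
```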
Key Points:
- Data from the 4 tables in the Excel sheets (4 – 7) were joined by the “ID” column
- The target is TERM_DEPOSIT
- The data has 16 predictors (AGE:POUTCOME columns)
- ID is a unique identifier and will not provide predictive value
- There are no “date” or “date-time” features. DAY and MONTH are already “feature engineered”. We could add Day of Week if we knew what year this dataset was from, but we’ll leave that for another day.
- The data is formatted as either continuous numeric features (e.g. AGE is double format) or categorical features (e.g. JOB is chr format)
- We still need to check for missing values, which would throw the correlation analysis off
3.0 Check Data for Missing Values
I’m going to do a quick check for missing values. The following code does a quick count of NA values by column using the combo of map_df() and gather().
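A minimal sketch of that check, assuming the joined data lives in marketing_campaign_tbl (the tibble name carried over from the read step):

```r
# Count NA values in every column, then reshape to long format for easy scanning
marketing_campaign_tbl %>%
    map_df(~ sum(is.na(.))) %>%
    gather(key = "feature", value = "missing_count") %>%
    arrange(desc(missing_count))
```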
Key Points:
- We can see that no missing values are present, so we are OK to move on to Correlation Analysis.
4.0 Correlation Analysis
I designed a 3-step process that makes performing a Correlation Analysis super simple.
Step 1 – Convert to Binary Format
We will first convert the features into a binary format using a function called binarize(). The data is feature engineered as follows:
- Numeric data are binned into 4 quantiles (categories) using n_bins.
- Then all data are categorical, and the categories are converted to a one-hot encoded binary format. You can use the thresh_infreq parameter to prevent features with 100s of infrequent categories from adding excessive columns.
Don’t worry if the “binarization” step sounds like gibberish. It’s just a step we need to do to tease out correlations.
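Here’s what that step looks like in code, a sketch assuming the joined data is in marketing_campaign_tbl and using the n_bins and thresh_infreq arguments described above. The first rows of the binarized output follow.

```r
# Drop the non-predictive ID column, then binarize:
#  - numeric features are binned into 4 quantile ranges
#  - infrequent categories (below the threshold) are lumped together
marketing_campaign_binarized_tbl <- marketing_campaign_tbl %>%
    select(-ID) %>%
    binarize(n_bins = 4, thresh_infreq = 0.01)

marketing_campaign_binarized_tbl
```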
AGE__-Inf_33 | AGE__33_39 | AGE__39_48 | AGE__48_Inf | JOB__admin. | JOB__blue.collar | JOB__entrepreneur | JOB__housemaid | JOB__management | JOB__retired | JOB__self.employed | JOB__services | JOB__student | JOB__technician | JOB__unemployed | JOB__unknown | MARITAL__divorced | MARITAL__married | MARITAL__single | EDUCATION__primary | EDUCATION__secondary | EDUCATION__tertiary | EDUCATION__unknown | DEFAULT__no | DEFAULT__yes | BALANCE__-Inf_72 | BALANCE__72_448 | BALANCE__448_1428 | BALANCE__1428_Inf | HOUSING__no | HOUSING__yes | LOAN__no | LOAN__yes | CONTACT__cellular | CONTACT__telephone | CONTACT__unknown | DAY__-Inf_8 | DAY__8_16 | DAY__16_21 | DAY__21_Inf | MONTH__apr | MONTH__aug | MONTH__dec | MONTH__feb | MONTH__jan | MONTH__jul | MONTH__jun | MONTH__mar | MONTH__may | MONTH__nov | MONTH__oct | MONTH__sep | DURATION__-Inf_103 | DURATION__103_180 | DURATION__180_319 | DURATION__319_Inf | CAMPAIGN__-Inf_2 | CAMPAIGN__2_3 | CAMPAIGN__3_Inf | POUTCOME__failure | POUTCOME__other | POUTCOME__success | POUTCOME__unknown | TERM_DEPOSIT__no | TERM_DEPOSIT__yes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
Key Points:
- All data is binary
- Numeric data was made categorical. The continuous features were binned into ranges.
- All categorical data (including binned numeric data) were one-hot encoded. This just means they were converted into binary features (0’s and 1’s).
Step 2 – Correlate the Data to the Target
Next, we need to correlate() our “binarized” features to the target. Our target (our goal) is whether or not the customers enroll in the “TERM DEPOSIT”. However, we need to be aware that our TERM_DEPOSIT column is now 2 binarized columns:
- TERM_DEPOSIT__yes
- TERM_DEPOSIT__no (the opposite of TERM_DEPOSIT__yes)

This is perfectly fine because we can just correlate all columns to “TERM_DEPOSIT__yes”, which contains 1’s if the customer enrolled.
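A sketch of the correlation step, assuming the binarized data from Step 1 is in marketing_campaign_binarized_tbl; the top of the resulting table is shown below.

```r
# Correlate every binary feature to the target column TERM_DEPOSIT__yes
marketing_campaign_correlated_tbl <- marketing_campaign_binarized_tbl %>%
    correlate(target = TERM_DEPOSIT__yes)

marketing_campaign_correlated_tbl
```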
feature | bin | correlation |
---|---|---|
TERM_DEPOSIT | no | -1.0000000 |
TERM_DEPOSIT | yes | 1.0000000 |
DURATION | 319_Inf | 0.3178767 |
POUTCOME | success | 0.3067882 |
DURATION | -Inf_103 | -0.1908996 |
POUTCOME | unknown | -0.1670507 |
CONTACT | unknown | -0.1509350 |
HOUSING | no | 0.1391727 |
HOUSING | yes | -0.1391727 |
CONTACT | cellular | 0.1358729 |
Key Points:
- The resulting data contains each feature-bin’s correlation to the TERM_DEPOSIT__yes column. This is our measure of relationship to the target (goal).
- This is exactly the format needed for the Correlation Funnel (next).
Step 3 – Make a Correlation Funnel!
The final step is to make a correlation funnel. We take the correlation data from the previous step and pipe (%>%) it into the plot_correlation_funnel() function.
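A sketch of that final step, assuming the correlation output from Step 2 is in marketing_campaign_correlated_tbl:

```r
# Plot the correlation funnel: features sort from highest to lowest
# absolute correlation with the target
marketing_campaign_correlated_tbl %>%
    plot_correlation_funnel()
```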
5.0 Business Insights
Insights are critical to business. Here’s how we can use the Correlation Funnel to develop a Customer Story.
Insights from the Correlation Funnel
From the Correlation Funnel, we can see that the following groups of customers have a much greater correlation with enrollment in the TERM DEPOSIT product.
- When DURATION, the amount of time a prospect is engaged with marketing campaign material, is 319 seconds or longer.
- When POUTCOME, whether or not a prospect has previously enrolled in a product, is “success”.
- When CONTACT, the medium used to contact the person, is “cellular”.
- When HOUSING, whether or not the contact has a HOME LOAN, is “no”.
We can then focus our time on exploring the high correlation features rather than plotting 16 different plots (or more if we have many features in our dataset).
Develop the Customer Story
Here’s an example of how we can quickly investigate DURATION, the top correlated feature, which is critical to our ability to develop a customer story. We’ll make a visualization with ggplot2 that shows the median engagement duration for the marketing campaign. Those enrolling in the TERM DEPOSIT product are spending 3X more time engaging with the campaign.
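Here’s one way that visualization might be built, a sketch assuming the original (non-binarized) data is in marketing_campaign_tbl; the plot in the article may differ in styling.

```r
# Median engagement duration (seconds) by enrollment outcome
marketing_campaign_tbl %>%
    group_by(TERM_DEPOSIT) %>%
    summarize(median_duration = median(DURATION)) %>%
    ggplot(aes(x = TERM_DEPOSIT, y = median_duration, fill = TERM_DEPOSIT)) +
    geom_col() +
    scale_fill_tq() +
    theme_tq() +
    labs(
        title    = "Median Engagement Duration by Enrollment Outcome",
        subtitle = "Customers who enroll spend roughly 3X longer engaging with the campaign",
        x = "Enrolled in Term Deposit?",
        y = "Median Duration (seconds)"
    )
```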
Conclusion
R can dramatically speed up the exploratory data analysis process – all we need is a few new tactics. Rather than searching through features, we can use Correlation Analysis to point us in the right direction.
We saw how to implement Correlation Analysis using the new correlationfunnel R package. This pointed us right to the high correlation features in the data set.
The best part is that the Correlation Analysis is only 5 lines of code. We can quickly build the correlation analysis pipeline by piping (%>%) together our 3 functions: binarize(), correlate(), and plot_correlation_funnel().
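Put together, the whole analysis is one short pipe; here’s a sketch using the tibble name from earlier:

```r
# The full correlation analysis pipeline
marketing_campaign_tbl %>%
    select(-ID) %>%
    binarize(n_bins = 4, thresh_infreq = 0.01) %>%
    correlate(target = TERM_DEPOSIT__yes) %>%
    plot_correlation_funnel()
```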
This is a lot better than using Excel Pivot Tables. I estimated about 160 minutes (almost 3 hours) of non-stop Pivot Table visualizations, not counting the possibility of missing information or getting sidetracked. Correlation analysis in 5 lines of code is an easy win.
More Awesome EDA Packages
Want to go even further with Exploratory Data Analysis? Here are several great EDA packages that can help you understand data issues (cleanliness) and get data prepared for Correlation Analysis!
- correlationfunnel – My new correlation analysis R package
- DataExplorer – Automates exploration and data treatment. Amazing for investigating features quickly and efficiently, including by data type, missing data, feature engineering, and identifying relationships.
- naniar – For understanding missing data.
- UpSetR – For generating upset plots.
- GGally – The ggpairs() function is one of my all-time favorites for visualizing many features quickly.
Correlation Analysis Success Story
Stephen Lung is my student at Business Science University. Stephen recently placed 3rd in a Tableau competition. Here’s the AMAZING part – Stephen did NOT know Tableau.
How did Stephen win 3rd place without knowing Tableau?
Stephen beat out peers with 2+ years of Tableau skills. How is this possible? Stephen used R and Correlation Analysis to rapidly iterate through the business dataset, finding features that were critical to the story. The final data product was in Tableau, but R was the real workhorse.
How did Stephen learn R?
Stephen learned R through Business Science University. Stephen is taking the following courses:
- Business Analysis with R (DS4B 101-R) – A 7-week course where you learn the foundations of data science for business using R, the tidyverse, parsnip (machine learning), dplyr (data wrangling), ggplot2 (visualization), rmarkdown (reporting), and more!
- Advanced Machine Learning & Business Consulting (DS4B 201-R) – A 10-week program where you learn a framework for solving business problems with data science, and you apply cutting-edge tools including H2O Automatic Machine Learning, LIME Local Feature Explanation, Correlation Analysis, and ROI Analysis.
- Learning Labs PRO – A continuous learning system where students receive a new 1-hour course every 2 weeks on advanced topics like wrangling large data sets, creating APIs, and more!
References
1. [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.