How to Automate EDA with DataExplorer in R

AbdulMajedRaja RS

3 years ago

[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers.]

EDA (Exploratory Data Analysis) is one of the key steps in any Data Science project. The better the EDA, the better the Feature Engineering that follows. From Modelling to Communication, EDA has many hidden benefits that aren’t often emphasised when teaching Data Science to beginners.

The Problem

That said, EDA is also one of the areas of the Data Science pipeline where a lot of manual code is written for different types of plots and different types of inference. Let’s say you want a bar plot of a categorical variable and a histogram of a continuous variable to understand their distributions. Each of these adds lines of code, and with them the time spent writing that code, which is costly if you’re participating in Hackathons or online competitions like Kaggle, where a time-bound response is usually required to move up the leaderboard.
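To make the manual effort concrete, here is a minimal sketch of the hand-written approach for just two variables. It assumes ggplot2 is installed and uses the built-in mtcars dataset purely for illustration (it is not the data used in this post); the object names are made up for the example.

```r
library(ggplot2)

# Built-in dataset used purely for illustration (not the data from this post)
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)  # treat the cylinder count as categorical

# One hand-written plot per variable: a bar plot for the categorical column...
bar_cyl <- ggplot(cars, aes(x = cyl)) +
  geom_bar() +
  labs(title = "Count by cyl")

# ...and a histogram for the continuous column
hist_mpg <- ggplot(cars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(title = "Distribution of mpg")

print(bar_cyl)
print(hist_mpg)
```

Two variables already take two separate plot definitions; repeating this for every column is where the line count grows.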

The Solution

That’s where Automated EDA tools come in very handy, and one such popular tool for Automated EDA in R is DataExplorer by Boxuan Cui.

DataExplorer

The stable version of DataExplorer can be installed from CRAN.

install.packages("DataExplorer")

And if you’d like to try the development version:

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")

Automating EDA – Get started

Before we start with EDA, we should first get the data that we would like to explore. In this case, we’ll use data generated by fakir.

library(fakir)
library(tidyverse)
library(DataExplorer)
web <- fakir::fake_visits()
glimpse(web)
## Observations: 365
## Variables: 8
## $ timestamp <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017-…
## $ year      <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ month     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ home      <int> 352, 203, 103, 484, 438, NA, 439, 273, 316, 193, 322, …
## $ about     <int> 176, 115, 59, 113, 138, 75, 236, 258, 206, 260, NA, 29…
## $ blog      <int> 521, 492, 549, 633, 423, 478, 364, 529, 320, 315, 578,…
## $ contact   <int> NA, 89, NA, 331, 227, 289, 220, 202, 367, 369, 241, 28…

# Convert year, month and day to factors
web$year <- as.factor(web$year)
web$month <- as.factor(web$month)
web$day <- as.factor(web$day)

To go along with glimpse(), DataExplorer has its own function called introduce():

introduce(web)
## # A tibble: 1 x 9
##    rows columns discrete_columns continuous_colu… all_missing_col…
##   <int>   <int>            <int>            <int>            <int>
## 1   365       8                4                4                0
## # … with 4 more variables: total_missing_values <int>,
## #   complete_rows <int>, total_observations <int>, memory_usage <dbl>

The same information from introduce() can also be plotted as a graph with plot_intro().

plot_intro(web)

Automating EDA – Missing

Personally, I find plot_missing() to be the most useful function in DataExplorer.

plot_missing(web)

It’s so handy that I no longer have to copy and paste a custom function from Stack Overflow or my previous code.
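If you need the numbers behind the missing-values plot rather than the figure, DataExplorer also exports profile_missing(). The sketch below assumes DataExplorer is installed and uses the built-in airquality dataset as a stand-in for web, so that it runs without fakir.

```r
library(DataExplorer)

# airquality ships with base R; it stands in for the `web` data from this post
missing_profile <- profile_missing(airquality)

# Sort so that the columns with the most missing values come first
missing_profile <- missing_profile[order(-missing_profile$num_missing), ]
print(missing_profile)
```

The result has one row per column, with the number and the share of missing values, which is convenient when the missing-value check needs to feed a later step instead of a plot.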

Automating EDA – Continuous

As with most EDA on continuous variables (numbers), we’ll start off with histograms, which help us understand the underlying distributions.

And that takes just one function, plot_histogram():

DataExplorer::plot_histogram(web)

There is a similar function for density plots, plot_density():

plot_density(web)

That’s all univariate. Moving on to bivariate analysis, we can start with boxplots with respect to a categorical variable.

plot_boxplot(web, by = "month", ncol = 2)
## Warning: Removed 111 rows containing non-finite values (stat_boxplot).

And, the super-useful correlation plot.

plot_correlation(web, cor_args = list(use = "complete.obs"))
## 2 features with more than 20 categories ignored!
## timestamp: 365 categories
## day: 31 categories
## Warning in cor(x = structure(list(home = c(352L, 203L, 103L, 484L, 438L, :
## the standard deviation is zero
## Warning: Removed 32 rows containing missing values (geom_text).

In case you want the correlation plot for continuous variables only:

plot_correlation(web, type = "continuous", cor_args = list(use = "complete.obs"))

Well, that’s how simple it is to make a bunch of plots for continuous variables.

Automating EDA – Categorical

A bar plot can combine a categorical and a continuous variable. By default (when the with argument isn’t set), plot_bar() plots each categorical variable against its frequency/count.

plot_bar(web, maxcat = 20, parallel = TRUE)
## 2 columns ignored with more than 20 categories.
## timestamp: 365 categories
## day: 31 categories

We also have the option to pass the name of a continuous variable to be summed up, using the with argument.

plot_bar(web, with = "home", maxcat = 20, parallel = TRUE)
## 2 columns ignored with more than 20 categories.
## timestamp: 365 categories
## day: 31 categories

EDA Report

The functions above each produce a specific type of plot (plotted for the whole dataset), which already makes EDA a very quick process. A single call can also bundle them into one report:

create_report(web)

create_report() helps us in generating an output report combining all the required plots for different types of variables.
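As a hedged sketch of how the report call might be customised, the example below sets the output file, output directory and report title. It assumes DataExplorer, rmarkdown and pandoc are available, uses the built-in airquality dataset instead of web, and the file name is made up for the example.

```r
library(DataExplorer)

# airquality ships with base R; it stands in for the `web` data from this post
create_report(
  airquality,
  output_file  = "airquality_eda.html",   # hypothetical file name
  output_dir   = tempdir(),               # write to a temporary folder
  report_title = "EDA report: airquality"
)

# Path of the rendered HTML report
report_path <- file.path(tempdir(), "airquality_eda.html")
print(report_path)
```

Writing to tempdir() keeps the working directory clean while experimenting; point output_dir at a project folder once the report is worth keeping.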

Plot Aesthetics

It’s worth mentioning that the ggplots built here don’t have to be the final version: DataExplorer lets us supply a ggplot2 theme through ggtheme and pass further theme parameters through theme_config. Functions like plot_boxplot() or plot_histogram() also take plot-specific arguments. For more details, check out the relevant help files.

plot_intro(web,
           ggtheme = theme_minimal(),
           title = "Automated EDA with Data Explorer")
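The same idea carries over to the other plot functions. The following sketch combines ggtheme, theme_config and a plot-specific argument (geom_histogram_args) in plot_histogram(). It assumes DataExplorer and ggplot2 are installed and uses the built-in airquality dataset as a stand-in for web; the specific theme settings are arbitrary choices for illustration.

```r
library(DataExplorer)
library(ggplot2)

# airquality ships with base R; it stands in for the `web` data from this post
hist_pages <- plot_histogram(
  airquality,
  geom_histogram_args = list(bins = 20L),   # plot-specific argument
  ggtheme = theme_light(),                  # a complete ggplot2 theme
  theme_config = list(                      # extra settings passed on to theme()
    plot.title  = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ),
  title = "airquality: histograms",
  ncol = 3L
)
```

Here ggtheme sets the overall look, while theme_config overrides individual theme elements on top of it.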

Summary

DataExplorer is extremely handy for automating EDA in a lot of use cases, like missing-value reporting in an ETL process or basic EDA in a Hackathon. It’s a generalist tool that can be customized to fit your needs.

If you liked this, please subscribe to my Data Science Newsletter and share it with your friends!