Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
EDA (Exploratory Data Analysis) is one of the key steps in any data science project. The better the EDA, the better the feature engineering that follows. From modelling to communication, EDA has many hidden benefits that are rarely emphasised when teaching data science to beginners.
The Problem
That said, EDA is also one of the areas of the data science pipeline where a lot of manual code is written for different types of plots and different types of inference. Let’s say you want a bar plot of a categorical variable and a histogram of a continuous variable to understand their distributions. Each of these adds lines of code, and with them time, which is costly if you’re participating in hackathons or online competitions like Kaggle, where a quick turnaround is usually needed to move up the leaderboard.
The Solution
That’s where automated EDA tools come in very handy, and one popular tool for automated EDA in R is DataExplorer by Boxuan Cui.
DataExplorer
The stable version of DataExplorer can be installed from CRAN.
install.packages("DataExplorer")
And if you’d like to try the development version:
if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")
Automating EDA – Get started
Before we start with EDA, we should first get the data that we would like to explore. In this case, we’ll use data generated by fakir.
library(fakir)
library(tidyverse)
library(DataExplorer)

web <- fakir::fake_visits()
glimpse(web)
## Observations: 365
## Variables: 8
## $ timestamp <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017-…
## $ year      <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
## $ month     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ home      <int> 352, 203, 103, 484, 438, NA, 439, 273, 316, 193, 322, …
## $ about     <int> 176, 115, 59, 113, 138, 75, 236, 258, 206, 260, NA, 29…
## $ blog      <int> 521, 492, 549, 633, 423, 478, 364, 529, 320, 315, 578,…
## $ contact   <int> NA, 89, NA, 331, 227, 289, 220, 202, 367, 369, 241, 28…

# year, month, day to factor
web$year <- as.factor(web$year)
web$month <- as.factor(web$month)
web$day <- as.factor(web$day)
To go with glimpse(), DataExplorer itself has a function called introduce().
introduce(web)
## # A tibble: 1 x 9
##    rows columns discrete_columns continuous_colu… all_missing_col…
##   <int>   <int>            <int>            <int>            <int>
## 1   365       8                4                4                0
## # … with 4 more variables: total_missing_values <int>,
## #   complete_rows <int>, total_observations <int>, memory_usage <dbl>
The same introduce() summary can also be plotted as a chart with plot_intro().
plot_intro(web)
Automating EDA – Missing
Personally, I find plot_missing() the most useful function in DataExplorer: it plots the missing values in each column.
plot_missing(web)
It’s handy enough that I no longer have to copy and paste a custom function from Stack Overflow or my previous code.
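For a rough idea of the numbers behind such a plot, the per-column missing counts and shares can be worked out by hand. The sketch below uses base R on a small made-up data frame (toy and its values are hypothetical, not the fakir data), and it is an approximation rather than DataExplorer's own implementation:

```r
# Hypothetical toy data with a few missing values (not the fakir data)
toy <- data.frame(
  home    = c(352, 203, NA, 484),
  about   = c(176, NA, NA, 113),
  contact = c(89, 331, 227, 289)
)

# Count missing values per column, then express them as a share of rows
num_missing <- colSums(is.na(toy))
missing_profile <- data.frame(
  feature     = names(num_missing),
  num_missing = as.integer(num_missing),
  pct_missing = as.numeric(num_missing) / nrow(toy),
  stringsAsFactors = FALSE
)

# Columns with the largest share of missing values first
missing_profile <- missing_profile[order(-missing_profile$pct_missing), ]
print(missing_profile)
```

DataExplorer also ships profile_missing() for a similar table; check its help file for the exact columns it returns.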
Automating EDA – Continuous
As with most EDA on continuous variables (numbers), we’ll start off with histograms, which help us understand the underlying distributions.
And that’s just one function, plot_histogram():
DataExplorer::plot_histogram(web)
There is a similar function for density plots, plot_density():
plot_density(web)
That’s all univariate. Moving on to bivariate analysis, we can start with boxplots grouped by a categorical variable.
plot_boxplot(web, by = 'month', ncol = 2)
## Warning: Removed 111 rows containing non-finite values (stat_boxplot).
And, the super-useful correlation plot.
plot_correlation(web, cor_args = list('use' = 'complete.obs'))
## 2 features with more than 20 categories ignored!
## timestamp: 365 categories
## day: 31 categories
## Warning in cor(x = structure(list(home = c(352L, 203L, 103L, 484L, 438L, :
## the standard deviation is zero
## Warning: Removed 32 rows containing missing values (geom_text).
If you want the correlation plot only for continuous variables:
plot_correlation(web, type = 'c',cor_args = list( 'use' = 'complete.obs'))
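The cor_args list is handed on to cor(), so 'use' = 'complete.obs' means that rows containing any NA are dropped before the correlations are computed. Here is a small base R sketch of that behaviour on made-up numbers (the toy data frame is hypothetical and only illustrates the use argument, not the plot itself):

```r
# Hypothetical toy data: home and about each have one missing value
toy <- data.frame(
  home  = c(10, 20, NA, 40, 50),
  about = c(1, 2, 3, NA, 5),
  blog  = c(5, 4, 3, 2, 1)
)

# Default (use = "everything"): a pair involving a column with NA gives NA
cor_default <- cor(toy)

# use = "complete.obs": rows with any NA (rows 3 and 4 here) are dropped first
cor_complete <- cor(toy, use = "complete.obs")

print(cor_default)
print(cor_complete)
```

Without this setting, the missing values in home, about and contact would leave most cells of the correlation matrix empty.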
Well, that’s how simple it is to make a bunch of plots for continuous variables.
Automating EDA – Categorical
A bar plot can combine a categorical and a continuous variable. By default (with no with value), plot_bar() plots each categorical variable against its frequency/count.
plot_bar(web, maxcat = 20, parallel = TRUE)
## 2 columns ignored with more than 20 categories.
## timestamp: 365 categories
## day: 31 categories
We’ve also got an option to specify the name of a continuous variable to be summed up.
plot_bar(web, with = c("home"), maxcat = 20, parallel = TRUE)
## 2 columns ignored with more than 20 categories.
## timestamp: 365 categories
## day: 31 categories
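Conceptually, with = "home" swaps the row count per category for the sum of home within that category. The base R sketch below shows both aggregations on made-up data (the toy data frame is hypothetical, and this only approximates the numbers plot_bar() would draw):

```r
# Hypothetical toy data: one categorical and one continuous column
toy <- data.frame(
  month = factor(c(1, 1, 2, 2, 2, 3)),
  home  = c(100, 150, 200, 50, 25, 300)
)

# Default bar heights: frequency of each category
counts <- table(toy$month)

# With a continuous variable: its total within each category
sums <- tapply(toy$home, toy$month, sum)

print(counts)
print(sums)
```

The two views can rank categories differently: here month 2 has the most rows, but month 3 has the largest total.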
EDA Report
While the functions above each produce a specific type of plot (for the whole dataset), which already makes EDA a very quick process, create_report() goes a step further: it generates an output report combining all the required plots for the different types of variables.
Plot Aesthetics
It’s worth mentioning that the ggplots produced aren’t the final version: DataExplorer lets us supply a ggtheme (a ggplot2 theme) and a theme_config list to pass on theme parameters. Functions like plot_boxplot() and plot_histogram() also accept plot-specific arguments. For more details, check the relevant help files.
plot_intro(web, ggtheme = theme_minimal(), title = "Automated EDA with Data Explorer")
Summary
DataExplorer is extremely handy for automating EDA in many use cases, such as missing-value reporting in an ETL process or basic EDA in a hackathon. It’s a generalist tool that can be customized for better usage.
If you liked this, please subscribe to my Data Science Newsletter and share it with your friends!