How to Automate Exploratory Analysis Plots
Updates
Interested in automating code? Learn more R automation tips with How to Automate Excel with R and How to Automate PowerPoint with R.
Getting Started
When plotting different charts during your exploratory data analysis (EDA), you sometimes end up doing a lot of repetitive coding. What we’ll show here is a better way to do your EDA, with less unnecessary coding and more flexibility. So, let me introduce you to a powerful package combo: ggplot2 and purrr.
ggplot2 is an awesome data visualization package, well known in the data science community and probably the library you already use to build your charts during EDA. The purrr package enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
Here we’ll use the purrr::map() function, which does the same job as a loop but is more concise and easier to read.
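As a quick illustration of how purrr::map() compares with a loop, here is a minimal sketch. The `values` list is made up for this example and is not part of the HR dataset.

```r
library(purrr)

# Hypothetical input: a named list of numeric vectors
values <- list(a = 1:3, b = 4:6)

# Loop version: pre-allocate the result, then fill it element by element
means_loop <- vector("list", length(values))
names(means_loop) <- names(values)
for (i in seq_along(values)) {
  means_loop[[i]] <- mean(values[[i]])
}

# map() version: apply mean() to every element and keep the names
means_map <- map(values, mean)
```

Both versions return a named list with one mean per element; map() simply removes the bookkeeping.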
Data
The dataset is a fictional HR dataset created by IBM data scientists. In our analysis, we’ll look only at the categorical variables and plot the proportion of each class within each one.
The dataset has 35 features in total, 9 of which are categorical; those are the ones we will use.
Requirements
Before we start building our plot, we need to specify which variables will be used in the analysis. We’ll look only at categorical features, in order to see the proportions of their different classes, so we’ll write a named vector containing their names.
The set_names function is super handy for naming character vectors since it can use the values of the vector as names.
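A minimal sketch of that naming step is shown here. The three column names are placeholders for illustration; in the real analysis the vector would hold the names of the nine categorical columns.

```r
library(purrr)

# Placeholder column names (assumed for illustration)
cat_vars <- c("Attrition", "BusinessTravel", "Department")

# With no names supplied, set_names() uses the vector's own values as names
named_vars <- set_names(cat_vars)

names(named_vars)
```

Naming the vector matters later: purrr::map() carries these names over to its output, so each plot in the resulting list can be looked up by variable name.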
Create a plot
My approach to this problem is to first plot the chart that you want, and then replace the variable with a function input. This part is where you need to put more effort into coding.
We start with a plot for one specific variable, “Attrition”.
I like to use this kind of plot instead of bar charts when we have many plots together. The filled bars of a bar chart can give the reader too much information when only the end of the bar matters.
One tip to really grasp the steps for building this kind of chart (a lollipop chart) is to think of plots as layers (the grammar of graphics) stacked one on top of the other.
There are three core layers that need to be built in sequence:
geom_segment()
geom_point()
geom_label()
The rest of the plot is the same as in any ggplot chart you have already built.
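The three layers can be sketched as follows. This is an illustrative reconstruction rather than the original code chunk: `hr_toy` is a tiny made-up data frame standing in for the HR data, and the styling choices (sizes, theme, label format) are assumptions.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the HR data (values are made up for illustration)
hr_toy <- data.frame(
  Attrition = c("No", "No", "No", "Yes", "No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Proportion of each class within the variable
attrition_prop <- hr_toy %>%
  count(Attrition) %>%
  mutate(prop = n / sum(n))

p <- ggplot(attrition_prop, aes(x = prop, y = Attrition)) +
  # Layer 1: the stick, drawn from 0 to the proportion
  geom_segment(aes(x = 0, xend = prop, yend = Attrition)) +
  # Layer 2: the dot at the end of the stick
  geom_point(size = 4) +
  # Layer 3: the label showing the proportion as a percentage
  geom_label(aes(label = scales::percent(prop, accuracy = 1)), hjust = -0.3) +
  scale_x_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(title = "Attrition", x = "Proportion", y = NULL) +
  theme_minimal()

p
```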
Now we need to replace the variable used above with a function input.
Creating our plotting function
To do this replacement we will use the .data pronoun from the rlang package. This pronoun lets you be explicit about where to find objects when programming with data-masked functions.
Using this strategy we will have the following function:
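A sketch of such a function is given here, under a few assumptions: the name `plot_frequency`, its `data` and `var` arguments, the intermediate column name `level`, and the toy data frame `hr_toy` are all chosen for this illustration and may differ from the original code.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the HR data (values are made up for illustration)
hr_toy <- data.frame(
  Attrition  = c("No", "No", "Yes", "No", "Yes", "No"),
  Department = c("Sales", "R&D", "R&D", "HR", "Sales", "R&D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lollipop chart of class proportions for one column.
# `var` is the column name as a string; .data[[var]] looks it up in `data`.
plot_frequency <- function(data, var) {
  data %>%
    count(level = .data[[var]]) %>%
    mutate(prop = n / sum(n)) %>%
    ggplot(aes(x = prop, y = level)) +
    geom_segment(aes(x = 0, xend = prop, yend = level)) +
    geom_point(size = 4) +
    geom_label(aes(label = scales::percent(prop, accuracy = 1)), hjust = -0.3) +
    scale_x_continuous(labels = scales::percent, limits = c(0, 1)) +
    labs(title = var, x = "Proportion", y = NULL) +
    theme_minimal()
}

p <- plot_frequency(hr_toy, "Department")
```

Storing the counted column under a fixed name (`level`) keeps the ggplot2 aesthetics simple, since only the count() call needs to know which column was requested.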
Using purrr
Here is the important step, where we apply the function we created to all the character features in the dataset. We’ll also use cowplot::plot_grid() to put together all the ggplot2 objects in the all_plots list.
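Putting the pieces together might look like the sketch below. As before, `hr_toy` and `plot_frequency` are hypothetical stand-ins defined only so the example runs on its own; `all_plots` is the list name used in the text.

```r
library(dplyr)
library(ggplot2)
library(purrr)

# Toy stand-in for the HR data (values are made up for illustration)
hr_toy <- data.frame(
  Attrition  = c("No", "No", "Yes", "No", "Yes", "No"),
  Department = c("Sales", "R&D", "R&D", "HR", "Sales", "R&D"),
  Age        = c(41, 49, 37, 33, 27, 32),
  stringsAsFactors = FALSE
)

# Hypothetical plotting helper (compact version of a lollipop chart)
plot_frequency <- function(data, var) {
  data %>%
    count(level = .data[[var]]) %>%
    mutate(prop = n / sum(n)) %>%
    ggplot(aes(x = prop, y = level)) +
    geom_segment(aes(x = 0, xend = prop, yend = level)) +
    geom_point(size = 4) +
    labs(title = var, x = "Proportion", y = NULL)
}

# Names of all character columns, named by their own values
cat_vars <- hr_toy %>%
  select(where(is.character)) %>%
  names() %>%
  set_names()

# One plot per categorical variable; the list keeps the variable names
all_plots <- map(cat_vars, ~ plot_frequency(hr_toy, .x))

# Arrange every plot in a single grid
grid <- cowplot::plot_grid(plotlist = all_plots)
grid
```

Because the numeric column `Age` is filtered out by `where(is.character)`, only the categorical variables reach the plotting function.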
Conclusion
In this tutorial, you learned how to save time when you need to draw the same chart many times. I hope it was useful for you.
Author: Luciano Oliveira Batista
Luciano is a chemical engineer and data scientist in training. Learn more on his blog at lobdata.com.