Exploratory Data Analysis Guide
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Quick Overview
Exploring-Data is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (and more fun).
R for Data Science
Hadley Wickham and Garrett Grolemund wrote an incredible book called R for Data Science (R4DS). In the book they teach how to “turn raw data into understanding, insight, and knowledge.” The authors do this by being laser-focused on the tools that help the data practitioner import, tidy, transform, visualize, and model data, and then communicate the findings.
I dug into the chapter on Exploratory Data Analysis (EDA) this past week, and I highly recommend it.
Exploring Data
I was super excited about the chapter and the knowledge it packed!
In my own path to improving my EDA skills I thought I’d capture what stuck out most in the form of a blog post.
EDA Overview
The authors of R4DS explain EDA as the process of “using visualization and transformation to explore your data in a systematic way.” They recommend an iterative cycle as follows:
- Generate questions about your data.
- Search for answers by visualizing, transforming, and modeling your data.
- Use what you learn to refine your questions and/or generate new questions.
Favorite Quotes (from Intro)
The chapter was packed full of content – here are the quotes that stood out the most from the Introduction:
“EDA is not a formal process with a strict set of rules.”
“EDA is a state of mind.” – I love this one
“During the initial phases of EDA you should feel free to investigate every idea that occurs to you.”
“As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.”
Questions Drive EDA
One of my biggest takeaways from the chapter was the emphasis on EDA being a creative process that is driven by asking questions, and lots of them.
Section on Questions
The section titled Questions was packed full of nuggets – here are a few of the quotes that stuck out:
“EDA is fundamentally a creative process.”
“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation.”
“the key to asking quality questions is to generate a large quantity of questions.” – So good
“each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.”
“There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:”
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
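To make those two question types concrete, here is a minimal sketch of my own (it is not taken from the chapter). It uses the diamonds data set that ships with ggplot2. Picking carat, price, and cut as the example variables is my assumption, as are the object names.

```r
library(ggplot2)
library(dplyr)

# Variation: how do the values of ONE variable (carat) spread out?
carat_summary <- diamonds %>%
  summarise(
    min    = min(carat),
    median = median(carat),
    max    = max(carat),
    sd     = sd(carat)
  )
carat_summary

# Covariation: how does one variable (price) change across the
# levels of ANOTHER variable (cut)?
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(n = n(), median_price = median(price))
price_by_cut

# A boxplot is one way to look at the same covariation visually
p_cov <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
p_cov
```

The first summary looks inside a single variable, while the second always involves at least two, which is a quick way to tell the two kinds of question apart.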
Section on Variation
We are always interested in discovering what explains the variation seen in data, so the section on Variation was very interesting. Here are a few of the quotes that stood out:
“Variation is the tendency of the values of a variable to change from measurement to measurement.”
“Every variable has its own pattern of variation, which can reveal interesting information.”
“The best way to understand that pattern is to visualize the distribution of the variable’s values.”
Visualizing Distributions
This post wouldn’t be complete without a bit of code and a few visualizations.
# load libraries
library(tidyverse)
library(tidyquant)
library(patchwork)

# use the tidyquant theme for every plot in this post
theme_set(theme_tq())
Categorical Variables
- “To examine the distribution of a categorical variable, use a bar chart:”
p1 <- ggplot(diamonds) +
  geom_bar(aes(x = cut)) +
  labs(title = "Bar Charts for Categorical Variables")

p1
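The bar heights in a bar chart are simply the number of rows in each category, so they can also be computed by hand with `dplyr::count()`. The sketch below is a small addition of mine; the names `cut_counts` and `p1_manual` are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Count the rows in each level of cut; the result has columns cut and n
cut_counts <- diamonds %>%
  count(cut)
cut_counts

# Plot the precomputed counts with geom_col(), which should give
# the same bar heights that geom_bar() computes on its own
p1_manual <- ggplot(cut_counts) +
  geom_col(aes(x = cut, y = n)) +
  labs(title = "Bar Chart from Precomputed Counts")
p1_manual
```

Having the counts in a table is handy when you want the exact numbers rather than reading them off the axis.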
Continuous Variables
“To examine the distribution of a continuous variable, use a histogram:”
“You can set the width of the intervals in a histogram with the binwidth argument, which is measured in the units of the x variable.”
# Same data (diamonds under 3 carats), four different binwidths
p2 <- ggplot(diamonds %>% filter(carat < 3)) +
  geom_histogram(aes(x = carat), binwidth = 1.0) +
  labs(title = "Binwidth = 1.0")

p3 <- ggplot(diamonds %>% filter(carat < 3)) +
  geom_histogram(aes(x = carat), binwidth = 0.5) +
  labs(title = "Binwidth = 0.5")

p4 <- ggplot(diamonds %>% filter(carat < 3)) +
  geom_histogram(aes(x = carat), binwidth = 0.1) +
  labs(title = "Binwidth = 0.1")

p5 <- ggplot(diamonds %>% filter(carat < 3)) +
  geom_histogram(aes(x = carat), binwidth = 0.01) +
  labs(title = "Binwidth = 0.01")
- “You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.”
# Show plots
patchwork <- p2 + p3 + p4 + p5

patchwork +
  plot_annotation(
    title = "Histograms for Continuous Variables",
    caption = "Notice the different patterns that are revealed."
  )
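If you want the numbers behind a histogram, one option is to bin the variable yourself with `ggplot2::cut_width()` and count the rows per bin. This is a rough sketch of mine; the names `smaller` and `bin_counts` are just labels I chose, and I am assuming the default bin alignment of `cut_width()` matches the histogram's, so treat the comparison as approximate.

```r
library(ggplot2)
library(dplyr)

# Same filter as the histograms: only diamonds under 3 carats
smaller <- diamonds %>%
  filter(carat < 3)

# Cut carat into intervals 0.5 wide and count the rows in each interval
bin_counts <- smaller %>%
  count(bin = cut_width(carat, width = 0.5))
bin_counts
```

Changing `width` to 0.1 or 0.01 gives the table version of trying different binwidths.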
Typical Values
Remember how the authors emphasize the use of questions to drive EDA?
This section on Typical Values continues that thread and suggests the following questions to ask when visualizing distributions:
“Which values are the most common? Why?”
“Which values are rare? Why? Does that match your expectations?”
“Can you see any unusual patterns? What might explain them?”
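A simple way to start answering the first two questions is to count how often each value occurs. The sketch below is my own addition, with carat as the example variable; the names `carat_counts`, `most_common`, and `rare` are assumptions for illustration, and "rare" is loosely defined here as values that appear only once.

```r
library(ggplot2)
library(dplyr)

# Count each distinct carat value, most frequent first
carat_counts <- diamonds %>%
  count(carat, sort = TRUE)

# Which values are the most common?
most_common <- head(carat_counts, 5)
most_common

# Which values are rare? Here: values that occur exactly once
rare <- carat_counts %>%
  filter(n == 1)
rare
```

The "Why?" part of each question still needs your own judgment; the table only tells you where to look.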
Another lesson from this section is to focus on the clusters revealed when visualizing your distributions.
For example, take a look at the histogram using a binwidth of 0.01; it reveals obvious clusters that are not visible in the other three plots.
p5
The authors point out that “clusters of similar values suggest that subgroups exist in your data.”
What do you think they suggest doing in this situation? Ask more questions.
Understand the subgroups by asking:
“How are the observations within each cluster similar to each other?”
“How are the observations in separate clusters different from each other?”
“How can you explain or describe the clusters?”
“Why might the appearance of clusters be misleading?”
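Here is one hedged way to start on those questions in code: label the observations by cluster and compare simple summaries. This is a sketch of mine, not something from the chapter. The 0.90 to 1.10 carat window, the split at 1.00 carat, and the cluster labels are all assumptions I picked for illustration.

```r
library(ggplot2)
library(dplyr)

# Look at diamonds close to 1 carat and split them into two groups:
# those just below 1.00 carat and those at or just above it
near_one <- diamonds %>%
  filter(carat >= 0.90, carat < 1.10) %>%
  mutate(
    cluster = if_else(carat < 1, "just below 1.00", "1.00 and just above")
  )

# Compare the two groups: size, typical weight, typical price
cluster_summary <- near_one %>%
  group_by(cluster) %>%
  summarise(
    n            = n(),
    median_carat = median(carat),
    median_price = median(price)
  )
cluster_summary
```

Swapping in other summary columns (for example, counts by cut or clarity) is a natural next question to ask of the same two groups.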
Wrap Up
The chapter goes on to discuss unusual values, missing values, and covariation, and briefly touches on using models to further your EDA. I highly recommend giving it a read and trying out the exercises in the chapter.
Get the code here: GitHub Repo.
Questions. Questions. Questions.
The biggest lesson I took from this chapter, and what I’ve started using in my own EDA process, is getting better at asking questions.
Beyond that, I now try to let my curiosity drive the EDA early on in the process.
And my favorite quote is probably this one:
“the key to asking quality questions is to generate a large quantity of questions.”
Watch an Expert do EDA
This is a bonus for making it this far in the post.
David Robinson is an expert when it comes to all things EDA and R. For the last year or so he has been recording weekly screencasts where he explores data he has never seen before.
I watch these videos often to get insights into how to think analytically when doing EDA.
You can check them out here at his YouTube channel: Tidy Tuesday R Screencasts
Learn R Quickly
I’ve expedited my R and data science journey using the courses over at Business Science University.
The instructor, Matt Dancho, has given me a 15% discount to share with my audience. Get the discount and join me on the journey.
Link to my favorite R course (with a 15% discount): Data Science for Business 101