Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R

[This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers].

Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on 5-number summaries, which were previously mentioned in the post on descriptive statistics in this series.  I will define and calculate the 5-number summary in 2 different ways that are commonly used in R.  (It turns out that different methods arise from the lack of universal agreement among statisticians on how to calculate quantiles.)  I will show that the fivenum() function uses a simpler and more interpretable method to calculate the 5-number summary than the summary() function.  This post expands on a recent comment that I made to correct an error in the post on box plots.

 

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     3.5     6.0     6.0     8.5    11.0 

Why do these 2 methods of calculating the 5-number summary in R give different results?  Read the rest of this post to find out the answer!

 

What is a 5-Number Summary?

A 5-number summary is a set of 5 descriptive statistics for summarizing a continuous univariate data set.  It consists of the data set’s minimum, 1st quartile, median, 3rd quartile, and maximum.

This is a simple but very useful way of summarizing your data for several reasons.

2 Different Ways to Get the 5-Number Summary in R

There are 2 functions that are commonly used to calculate the 5-number summary in R: fivenum() and summary().

I have discovered a subtle but important difference in the way the 5-number summary is calculated between these two functions.

Here is an instance when they provide the same output.

> x = seq(1, 9, by = 2)
> x
[1] 1 3 5 7 9
> fivenum(x)
[1] 1 3 5 7 9
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       3       5       5       7       9 

Here is an instance when they provide different output.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     3.5     6.0     6.0     8.5    11.0 

*fivenum() does not have an argument for controlling the number of decimal places in its output, while summary() has the “digits” option for doing so.  You may need to invoke this option in summary() to get more decimal places when comparing its output with fivenum()’s output.

Notice that x has an odd number of data, while y has an even number of data.  The 2 functions gave the same output for x, but different 1st and 3rd quartiles for y.  What causes this difference?

The Difference Between fivenum() and summary()

The difference between fivenum() and summary() lies in the lack of universal agreement on how the 1st and 3rd quartiles should be calculated.

Here is how fivenum() calculates the 1st and 3rd quartiles.

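  1. Sort your data from smallest to largest.
  2. Find the median.  If your data set has an odd number of data, then the median is the datum such that the number of data above the median is the same as the number of data below the median.  If your data set has an even number, n, of data, the median is the average of the (n/2)th and (n/2 + 1)th smallest data.
  3. Find the set, L, of data in the lower half of the sorted data.  If n is even, L is the n/2 smallest data; if n is odd, L also includes the median itself.  The 1st quartile is the median of L.
  4. Find the set, U, of data in the upper half of the sorted data, defined in the same way (so the median belongs to both L and U when n is odd).  The 3rd quartile is the median of U.  (These 2 values are also known as Tukey’s lower and upper hinges.)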
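
As a rough sketch of this median-of-halves method (not fivenum()’s actual source code), the helper below splits the sorted data by position, keeps the median in both halves when n is odd, and takes the median of each half.  The name hinge_summary() is hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not part of base R): a by-hand version of the
# median-of-halves method, for comparison with fivenum().
hinge_summary = function(v) {
  v = sort(v)
  n = length(v)
  half = ceiling(n / 2)          # size of each half; includes the median when n is odd
  lower = v[1:half]              # L: the lower half of the sorted data
  upper = v[(n - half + 1):n]    # U: the upper half of the sorted data
  c(min(v), median(lower), median(v), median(upper), max(v))
}

x = seq(1, 9, by = 2)
y = seq(1, 11, by = 2)
hinge_summary(x)   # compare with fivenum(x)
hinge_summary(y)   # compare with fivenum(y)
```

For x, the lower half is {1, 3, 5}, whose median is 3; this is why fivenum(x) returns 3 rather than 2 as its 1st quartile.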

summary() uses the quantile() function to calculate the 25% and 75% quantiles as the 1st and 3rd quartiles.  Thus, let’s discuss how quantile() calculates quantiles.  (See “Terminology Clarification” near the end of this post on the definitions of quantile and percentile.)

There is no universal agreement on how quantiles are calculated among statisticians (Hyndman and Fan, 1996).  The quantile() function’s documentation shows 9 different ways to calculate quantiles, with Type 7 being used for summary().  Here is how Type 7 works:

  1. Sort the data, $x_1, x_2, \ldots, x_n$, from smallest to largest.  Denote the order statistics as $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.
  2. Assign the minimum, $x_{(1)}$, as the 0% quantile and the maximum, $x_{(n)}$, as the 100% quantile.
  3. The position of the q% quantile along the ordered data is at $1 + (n - 1)q/100$, where n is the sample size.  Thus, the position of the 0% quantile is $1 + (n - 1)(0)/100 = 1$; this is the first number along the ordered data, so the 0% quantile is the minimum.  Denote the position of the q% quantile as $j$.
  4. If the position, $j$, from Step #3 is an integer, then simply extract the $j$th ordered datum, $x_{(j)}$, from the list of ordered data; this is the q% quantile.
  5. If the position, $j$, from Step #3 is not an integer, but a decimal number, then let’s find the 2 integers immediately below and above $j$.  Denote these integers as $j_1$ and $j_2$, respectively.  To be precise, $j_1 = \lfloor j \rfloor$ and $j_2 = \lceil j \rceil$.  The q% quantile is then found by linear interpolation between the corresponding ordered data: $x_{(j_1)} + (j - j_1)(x_{(j_2)} - x_{(j_1)})$.
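
The Type 7 calculation can be sketched by hand as follows.  This is a minimal illustration, not quantile()’s actual source code; the helper name type7_quantile() and the variable names j, j1 and j2 are assumptions made for this example.  It computes the position 1 + (n - 1)q/100 and interpolates linearly between the 2 neighbouring ordered data.

```r
# Hypothetical helper (not part of base R): Type 7 quantile by hand.
# q is a percentage between 0 and 100.
type7_quantile = function(v, q) {
  v = sort(v)
  n = length(v)
  j = 1 + (n - 1) * q / 100    # position along the ordered data
  j1 = floor(j)                # integer at or immediately below j
  j2 = ceiling(j)              # integer at or immediately above j
  # Linear interpolation; when j is an integer, j1 == j2 and this
  # reduces to extracting the j-th ordered datum.
  v[j1] + (j - j1) * (v[j2] - v[j1])
}

y = seq(1, 11, by = 2)
type7_quantile(y, 25)   # compare with quantile(y, 0.25, type = 7)
type7_quantile(y, 75)   # compare with quantile(y, 0.75, type = 7)
```

Because type = 7 is the default in quantile(), these values should also agree with the 1st and 3rd quartiles printed by summary(y).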

Distinguishing fivenum() and summary() – An Example

Consider again the data set y.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11

Let’s follow the above steps for summary() and find the 1st quartile accordingly.

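  1. y is already sorted in ascending order.
  2. The position of the 25% quantile is $1 + (6 - 1)(25)/100 = 2.25$.
  3. This position is not an integer, so we cannot simply extract the 2.25th ordered datum from y.
  4. The 2 integers immediately below and above 2.25 are 2 and 3; the 2nd and 3rd ordered data in y are 3 and 5.
  5. Interpolating between them, the 25% quantile is $3 + (2.25 - 2)(5 - 3) = 3.5$.  This is the 1st quartile reported by summary().  By contrast, fivenum() reports 3, the median of the lower half of y, {1, 3, 5}.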
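
The same arithmetic can be repeated in R for the 3rd quartile of y.  This is only a short check using base R; the variable names pos, lo, hi and q3 are chosen for illustration.

```r
y = seq(1, 11, by = 2)
n = length(y)

pos = 1 + (n - 1) * 75 / 100               # position of the 75% quantile: 4.75
lo = floor(pos)                            # 4
hi = ceiling(pos)                          # 5
q3 = y[lo] + (pos - lo) * (y[hi] - y[lo])  # 7 + 0.75 * (9 - 7)
q3                                         # 8.5

quantile(y, 0.75)   # the same value, which summary(y) reports as the 3rd quartile
fivenum(y)[4]       # 9, the median of the upper half {7, 9, 11}
```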

Conclusion

The R functions fivenum() and summary() use different methods to calculate the 5-number summary.  Given the complexity of summary()’s method and the ease of calculation and interpretation of fivenum()’s method, I encourage using fivenum(), and I will use it from now on in my blog posts.

I asked about the differences between these 2 methods by initiating a discussion thread called “5-Number Summaries in R” in the LinkedIn group “R Programming”.  I thank Marco Biffino and Mukul Mehta for sharing valuable contributions in this thread.  I also thank David Maxwell and Allan Reese from the Centre for Environment, Fisheries and Aquaculture Science in Great Britain for noting and explaining these issues in personal emails with me.

Terminology Clarifications

*Here is the definition of percentile that I learned in my introductory statistics class (STAT 270 at Simon Fraser University):

The pth percentile** of a data set sorted from smallest to largest is the value such that p percent of the data are at or below this value.  The quartiles are special percentiles; the 1st quartile is the 25th percentile, and the 3rd quartile is the 75th percentile.  The median is also a quartile – it is the 50th percentile.

**The terms quantile and percentile denote essentially the same thing.  However, percentile refers to the percentage of the data at or below its value, while quantile refers to the fraction of data at or below its value.  In the context of probability distributions and cumulative distribution functions (CDFs), I see “quantile” being used all the time, and rarely see “percentile” being used.  (In the context of CDFs, the quantile function is the inverse CDF: it maps a probability to the value of the random variable at or below which that fraction of the distribution lies.)  Nonetheless, they do mean the same thing.
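
To illustrate the correspondence between quantiles and the CDF, here is a small sketch.  The standard normal distribution is chosen only as an example, and y is the data set used earlier in this post.

```r
# Quantile function as the inverse of the CDF (standard normal as an example)
p = 0.975
q = qnorm(p)     # the 0.975 quantile, i.e. the 97.5th percentile
pnorm(q)         # applying the CDF to the quantile recovers p

# The same idea on a sample: fraction of data at or below the 50% quantile
y = seq(1, 11, by = 2)
med = quantile(y, 0.5)
mean(y <= med)   # 0.5
```

For a small sample, this fraction matches the nominal percentage only approximately at other quantiles, which is one reason the competing quantile definitions exist.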

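References

Hyndman, R. J. and Fan, Y. (1996).  Sample quantiles in statistical packages.  The American Statistician, 50, 361-365.
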