Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R
Introduction
Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on 5-number summaries, which were previously mentioned in the post on descriptive statistics in this series. I will define and calculate the 5-number summary in 2 different ways that are commonly used in R. (It turns out that different methods arise from the lack of universal agreement among statisticians on how to calculate quantiles.) I will show that the fivenum() function uses a simpler and more interpretable method to calculate the 5-number summary than the summary() function. This post expands on a recent comment that I made to correct an error in the post on box plots.
> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.5     6.0     6.0     8.5    11.0
Why do these 2 methods of calculating the 5-number summary in R give different results? Read the rest of this post to find out the answer!
Previous posts in this series on EDA include:
- Descriptive statistics
- Box plots
- The conceptual foundations of kernel density estimation
- How to construct kernel density plots and rug plots in R
- Violin plots
- The conceptual foundations of empirical cumulative distribution functions (CDFs)
- 2 ways of plotting empirical CDFs in R
- Conceptual foundations of histograms and how to plot them in R
What is a 5-Number Summary?
A 5-number summary is a set of 5 descriptive statistics for summarizing a continuous univariate data set. It consists of the data set’s
- minimum
- 1st quartile
- median
- 3rd quartile
- maximum
This is a simple but very useful way of summarizing your data, for several reasons:
- the median gives a measure of the centre of the data
- the minimum and maximum give the range of the data
- the 1st and 3rd quartiles give a sense of the spread of the data, especially when compared to the minimum, maximum, and median
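As a minimal sketch of these points, the code below pulls the minimum, median, and maximum out of both fivenum() and summary() for the data set y used in this post. These 3 statistics do not depend on how the quartiles are calculated, so the 2 functions should agree on them. The positions used for indexing (1, 3, 5 for fivenum() and 1, 3, 6 for summary()) assume the default output order of each function.

```r
y <- seq(1, 11, by = 2)

f <- fivenum(y)               # min, 1st quartile, median, 3rd quartile, max
s <- as.numeric(summary(y))   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

# The minimum, median, and maximum do not depend on the quartile method
f[c(1, 3, 5)]
s[c(1, 3, 6)]

# The range of the data is the maximum minus the minimum
diff(range(y))
```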
2 Different Ways to Get the 5-Number Summary in R
There are 2 functions that are commonly used to calculate the 5-number summary in R: fivenum() and summary().
I have discovered a subtle but important difference in the way the 5-number summary is calculated between these two functions.
Here is an instance when they provide the same output.
> x = seq(1, 9, by = 2)
> x
[1] 1 3 5 7 9
> fivenum(x)
[1] 1 3 5 7 9
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       3       5       5       7       9
Here is an instance when they provide different output.
> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     3.5     6.0     6.0     8.5    11.0
Note: fivenum() does not have an argument for controlling the number of digits in its output, while summary() has the “digits” option for doing so. You may need to invoke this option in summary() to get more decimal places when comparing its output with fivenum()’s output.
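Here is a small sketch of that comparison. The vector z is made up purely for illustration, and the exact rounding that summary() applies has changed across R versions, so treat the printed output as version-dependent. Wrapping the result in as.numeric() is one way to drop the print formatting before comparing.

```r
# Hypothetical data with many decimal places
z <- c(1.23456789, 2.3456789, 3.456789, 4.56789, 5.6789, 6.789)

fivenum(z)

# Ask summary() for more significant digits via its "digits" argument
summary(z, digits = 10)

# Compare the medians as plain numbers, without print formatting
median_fivenum <- fivenum(z)[3]
median_summary <- as.numeric(summary(z, digits = 10))[3]
c(median_fivenum, median_summary)
```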
Notice that x has an odd number of data, while y has an even number of data. The 2 functions gave the same output for x, but different 1st and 3rd quartiles for y. What causes this difference?
The Difference Between fivenum() and summary()
The difference between fivenum() and summary() lies in the lack of universal agreement on how the 1st and 3rd quartiles should be calculated.
Here is how fivenum() calculates the 1st and 3rd quartiles.
- Sort your data from smallest to largest
- Find the median. If your data set has an odd number of data, then the median is the datum such that the number of data above the median is the same as the number of data below the median. If your data set has an even number, n, of data, the median is the average of the (n/2)th and (n/2 + 1)th largest data.
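- Find the set, L, of data in the lower half of the sorted data. If your data set has an even number of data, L is the set of data below the median. If it has an odd number of data, the median itself is also included in L. The 1st quartile is the median of L.
- Find the set, U, of data in the upper half of the sorted data, again including the median itself if your data set has an odd number of data. The 3rd quartile is the median of U.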
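The steps above can be sketched in a few lines of R. The function five_by_halves() is a hypothetical helper written for this illustration, not part of R. It assumes that the median is included in both halves when the number of data is odd, which is what reproduces fivenum()’s output for x earlier in this post.

```r
# Hypothetical helper: 5-number summary via the median of each half
five_by_halves <- function(v) {
  v <- sort(v)
  n <- length(v)
  half <- ceiling(n / 2)           # size of each half; includes the median when n is odd
  lower <- v[1:half]               # the set L
  upper <- v[(n - half + 1):n]     # the set U
  c(min(v), median(lower), median(v), median(upper), max(v))
}

x <- seq(1, 9, by = 2)    # odd number of data
y <- seq(1, 11, by = 2)   # even number of data

five_by_halves(x)   # 1 3 5 7 9, same as fivenum(x)
five_by_halves(y)   # 1 3 6 9 11, same as fivenum(y)
```

Comparing five_by_halves() with fivenum() on a few more data sets of your own is a quick way to check that this reading of the method holds.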
summary() uses the quantile() function to calculate the 25% and 75% quantiles as the 1st and 3rd quartiles. Thus, let’s discuss how quantile() calculates quantiles. (See “Terminology Clarifications” near the end of this post on the definitions of quantile and percentile.)
There is no universal agreement on how quantiles are calculated among statisticians (Hyndman and Fan, 1996). The quantile() function’s documentation shows 9 different ways to calculate quantiles, with Type 7 being used for summary(). Here is how Type 7 works:
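- Sort the data, $x_1, x_2, \ldots, x_n$, from smallest to largest. Denote the order statistics as $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.
- Assign the minimum, $x_{(1)}$, as the 0% quantile and the maximum, $x_{(n)}$, as the 100% quantile.
- The position of the q% quantile along the ordered data is at $1 + (n - 1)q/100$, where n is the sample size. Thus, the position of the 0% quantile is $1 + (n - 1) \times 0/100 = 1$; this is the first number along the ordered data, so the 0% quantile is the minimum. Denote this position as $j$.
- If the position, $j$, from the 3rd step is an integer, then simply extract the ordered datum $x_{(j)}$ from the list of ordered data – this is the q% quantile.
- If the position, $j$, from the 3rd step is not an integer, but a decimal number, then let’s find the 2 integers immediately below and above $j$. Denote these integers as $j_{lo}$ and $j_{hi}$, respectively. To be precise, $j_{lo} = \lfloor j \rfloor$ and $j_{hi} = \lceil j \rceil$. The q% quantile is then found by linear interpolation between the corresponding ordered data: $x_{(j_{lo})} + (j - j_{lo})(x_{(j_{hi})} - x_{(j_{lo})})$.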
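To make the Type 7 steps concrete, here is a rough sketch that computes the position $1 + (n - 1)q/100$ and interpolates linearly between the 2 neighbouring ordered data. The function type7_quantile() and its variable names are hypothetical, written only for this illustration. R’s own quantile() is the function to use in practice.

```r
# Hypothetical helper: Type 7 quantile by linear interpolation
# v is a numeric vector; q is a percentage between 0 and 100
type7_quantile <- function(v, q) {
  v <- sort(v)
  n <- length(v)
  j <- 1 + (n - 1) * q / 100   # position along the ordered data
  lo <- floor(j)               # integer immediately at or below the position
  hi <- ceiling(j)             # integer immediately at or above the position
  v[lo] + (j - lo) * (v[hi] - v[lo])
}

y <- seq(1, 11, by = 2)

type7_quantile(y, 25)   # 3.5
type7_quantile(y, 75)   # 8.5

# Compare with R's built-in Type 7 quantiles
unname(quantile(y, c(0.25, 0.75), type = 7))
```

When the position is an integer, lo and hi coincide and the interpolation term is zero, so the function simply returns that ordered datum.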
Distinguishing fivenum() and summary() – An Example
Consider again the data set y.
> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
Let’s follow the above steps for summary() and find the 1st quartile accordingly.
- y is already sorted in ascending order.
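- The position of the 25% quantile is $1 + (6 - 1) \times 25/100 = 2.25$.
- This position is not an integer, so we cannot simply extract the 2.25th ordered datum from y. The integers immediately below and above 2.25 are 2 and 3, and the 2nd and 3rd ordered data are 3 and 5.
- Interpolating linearly between these 2 data, the 25% quantile is $3 + (2.25 - 2) \times (5 - 3) = 3.5$. This is the 1st quartile that summary() returned.
By contrast, fivenum() takes the median of the lower half of y, {1, 3, 5}, which is 3.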
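The same comparison can be checked directly in R. This is only a sketch: the names q1_summary and q1_fivenum are made up for this illustration. The last line prints the 25% quantile of y under each of the 9 types in quantile()’s documentation, to show how much the choice of method can matter for a small data set.

```r
y <- seq(1, 11, by = 2)

# 1st quartile as summary() reports it (Type 7 is the default in quantile())
q1_summary <- unname(quantile(y, 0.25, type = 7))

# 1st quartile as fivenum() reports it (2nd element of its output)
q1_fivenum <- fivenum(y)[2]

q1_summary   # 3.5
q1_fivenum   # 3

# The 25% quantile of y under each of quantile()'s 9 types
sapply(1:9, function(t) unname(quantile(y, 0.25, type = t)))
```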
Conclusion
The R functions fivenum() and summary() use different methods to calculate the 5-number summary. Given the complexity of summary()’s method and the ease of calculation and interpretation of fivenum()’s method, I encourage using fivenum(), and I will use it from now on in my blog posts.
I asked about the differences between these 2 methods by initiating a discussion thread called “5-Number Summaries in R” in the LinkedIn group “R Programming”. I thank Marco Biffino and Mukul Mehta for sharing valuable contributions in this thread. I also thank David Maxwell and Allan Reese from the Centre for Environment, Fisheries and Aquaculture Science in Great Britain for noting and explaining these issues in personal emails with me.
Terminology Clarifications
*Here is the definition of percentile that I learned in my introductory statistics class (STAT 270 at Simon Fraser University):
The pth percentile** of a data set sorted from smallest to largest is the value such that p percent of the data are at or below this value. The quartiles are special percentiles; the 1st quartile is the 25th percentile, and the 3rd quartile is the 75th percentile. The median is also a quartile – it is the 50th percentile.
**The terms quantile and percentile denote essentially the same thing. However, percentile refers to the percentage of the data at or below its value, while quantile refers to the fraction of data at or below its value. In the context of probability distributions and cumulative distribution functions (CDFs), I see “quantile” being used all the time, and rarely see “percentile” being used. (In the context of CDFs, quantiles are just the values of the random variable or, equivalently, the values of the inverse CDF.) Nonetheless, they do mean the same thing.
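As a brief illustration of quantiles as the inverse CDF, the sketch below uses the standard normal distribution, which is chosen here only as a convenient example. In R, pnorm() is its CDF and qnorm() is its quantile function, so applying one after the other should recover the starting probabilities.

```r
p <- c(0.25, 0.50, 0.75)   # fractions of the distribution at or below each quantile

q <- qnorm(p)   # quantiles of the standard normal distribution
q

pnorm(q)        # applying the CDF recovers p
```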
References
- Ross Ihaka’s lecture slides on quantiles for Statistics 787 at the University of Auckland.
- John Verzani. “simpleR – Using R for Introductory Statistics”
- “Sample Quantiles in Statistical Packages” by Rob J. Hyndman and Yanan Fan. The American Statistician. Vol. 50, No. 4 (November, 1996), pp. 361-365