Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
An explanation of quartiles, quintiles deciles, and boxplots.
Previously
“Again with variability of long-short decile tests” and its predecessor discusses using deciles but doesn’t say what they are.
The *iles
These are concepts that have to do with approximately equally sized groups created from sorted data. There are 4 groups with quartiles, 5 with quintiles and 10 with deciles.
But it isn’t quite as easy as perhaps it should be. These words are used for two different concepts:
- the data in a group
- the dividing line between groups
So the top decile can be either the 10% of the data that are biggest, or a point that divides that group of biggest values from the next smaller group.
There is one fewer dividing line than groups.
Boxplot
The premier graph that uses quartiles is the boxplot.
Figure 1 is an example boxplot — it shows the daily log returns during 2012 (so far) for a particular stock (MMM).
Figure 1: Boxplot of daily log returns of MMM in 2012 year-to-date.
The middle half of the data is in the box. The line inside the box is the median.
The interquartile range (IQR) is the 3rd quartile minus the 1st quartile. This is one of the statistics given in the market portraits.
The whiskers end at a data point, but can be no longer than some length. In R the default is that whiskers can be no longer than 1.5 times the IQR.
Points beyond the whiskers are sometimes referred to as “outliers”. That is not particularly good nomenclature. For there to be an outlier, there really needs to be a model.
Boxplots are useful for a single variable, but they really shine when you have multiple variables to compare. Figure 3 is a boxplot of daily returns of a few stocks.
Figure 3: Boxplot of daily log returns of a few stocks in 2012 year-to-date.
Epilogue
Some are watching it from the wings
Some are standing in the center
–from “People’s Parties” by Joni Mitchell
< embed width="450" type="application/x-shockwave-flash" src="https://www.youtube.com/v/4HMHA0f-7v4?version=3&hl=en_GB" allowFullScreen="true" allowscriptaccess="always" allowfullscreen="true" />
Appendix R
R is a good environment for whatever *ile you are interested in.
multiple boxplots
There are three likely ways of getting multiple boxplots.
The first is to give the boxplot function a list. In particular, a data frame is a list. So we can do:
boxplot(data.frame(retMat[,1:5])) # Figure 4
This extracts the first 5 columns of the return matrix and then changes it into a data frame.
Figure 4: Boxplot from a list (data frame).
The second way is via a formula. An R formula involves the ~ operator. boxplot expects the formula to be of the form:
values ~ groups
where values
and groups
are vectors of the same length.
It is often the case that you’ll give a data frame as the data
argument, and the columns of that data frame hold the variables that appear in the formula. However, that is not mandatory:
boxplot(retMat[,1] ~ substr(rownames(retMat), 6,7)) # Figure 5
Here we use the first column of the return matrix as the values part of the formula, and extract the month of the year (2012) from the row names to use as the groups. Hence we have a separate boxplot for each month of 2012. The result is Figure 5.
Figure 5: Daily returns of MMM by month of 2012 (base).
All of the graphics above use the base graphics in R. An alternative is the ggplot2
package. We can do the same thing as in Figure 5 with the ggplot2
command:
qplot(substr(rownames(retMat), 6,7), retMat[,1], geom='boxplot') # Figure 6
Figure 6: Daily returns of MMM by month of 2012 (ggplot2).
annotated boxplot
The function that created Figure 2 was:
function (filename = "boxexplan.png") { if(length(filename)) { png(file=filename, width=512) par(mar=c(5,4, 0, 2) + .1) } bp <- boxplot(retMat[,1] * 100, ylab="Daily log returns (%)", col="gold", xlim=c(.5, 3)) bps <- bp$stats loc <- c(mean(bps[1:2]), bps[2:4], mean(bps[4:5]), -2.5, 2.5) arrows(2, loc, c(1.1, 1.3, 1.3, 1.3, 1.1, 1.1, 1.1), loc) text(2.03, loc, adj=0, c("bottom whisker", "1st quartile", "2nd quartile (median)", "3rd quartile", "top whisker", "extreme data points", "extreme data points")) if(length(filename)) { dev.off() } }
computing dividing lines
The quantile
function will provide values that divide the groups. For example deciles can be computed as:
quantile(x, probs=seq(.1, .9, by=.1))
or possibly:
quantile(x, probs=seq(0, 1, by=.1))
finding groups
We’ll go from simplistic to less simple.
1. do-it-yourself
The do-it-yourself approach is:
cut(x, quantile(x,(0:10)/10), labels=FALSE, include.lowest=TRUE)
2. cut2
You can use the cut2
function in the Hmisc
package. If you want deciles, then you would do something along the lines of:
cut2(x, m=length(x)/10)
3. quantcut
The quantcut
function in gtools
takes care of a problem that can arise in the solutions above. If there are a lot of repeated values, then the dividing points may not be unique. quantcut
checks for that and if it occurs, then it reduces the number of groups returned.
4. fixer-upper
You may really want the same number of groups that you requested. Hence you could end up with observations that have the same value but are in different groups.
In general, the number of data points is not divisible by the number of groups. A nicety is to make groups in the center have an additional observation, rather than the first (smallest) groups.
Here is a function to do that:
ntile <- function (x, ngroups, na.rm=FALSE) { # function to get ntile # (quartile, quintile, decile, etc.) # groups from a vector # placed in the public domain 2012 # by Burns Statistics # testing status: # seems to work stopifnot(is.numeric(ngroups), length(ngroups) == 1, ngroups > 0) if(na.rm) { x <- x[!is.na(x)] } else if(nas <- sum(is.na(x))) { stop(nas, " missing values present") } nx <- length(x) if(nx < ngroups) { stop("more groups (", ngroups, ") than observations (", nx, ")") } basenum <- nx %/% ngroups extra <- nx %% ngroups repnum <- rep(basenum, ngroups) if(extra) { eloc <- seq(floor((ngroups - extra)/2 + 1), length=extra) repnum[eloc] <- repnum[eloc] + 1 } split(sort(x), rep(1:ngroups, repnum)) }
This can be used like:
ntile(rnorm(63), 10) # deciles ntile(rnorm(63), 5) # quintiles ntile(rnorm(63), 4) # quartiles
Anyone see any problems with this function?
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.