Checking for Normality with Quantile Ranges and the Standard Deviation

Eric Cai - The Chemical Statistician

9 years ago

[This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers.]

Introduction

I was reading Michael Trosset’s “An Introduction to Statistical Inference and Its Applications with R”, and I learned a basic but interesting fact about the normal distribution’s interquartile range and standard deviation that I had not learned before. This turns out to be a good way to check for normality in a data set.

In this post, I introduce several traditional ways of checking for normality (or goodness of fit in general), talk about the method that I learned from Trosset’s book, then build upon this method to propose what may be a new way to check for normality. I have not fully established this idea, so I welcome your thoughts and ideas.

Common Ways to Check for Normality

Introductory statistics teaches several popular ways to check whether a univariate data set comes from a normal distribution:

– plotting a histogram (i.e. an empirical distribution) and comparing it to the curve of a normal PDF (i.e. the theoretical distribution)

– checking if the data fall close to a straight line in a normal Q-Q plot (the identity line, if the data have been standardized)

– checking if the distribution follows the 68-95-99.7 rule
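As a rough illustration of these three informal checks, here is a minimal R sketch on simulated data. The seed, sample size, and variable names (y, coverage, qq) are assumptions made for this example and are not part of the original post.

```r
# Simulated data for illustration only (assumed mean, sd, and sample size)
set.seed(123)
y <- rnorm(1000, mean = 10, sd = 2)
y.mean <- mean(y)
y.sd <- sd(y)

# 68-95-99.7 rule: proportion of observations within 1, 2, and 3
# sample standard deviations of the sample mean
coverage <- sapply(1:3, function(k) mean(abs(y - y.mean) <= k * y.sd))
print(round(coverage, 3))

# Normal Q-Q coordinates without drawing: theoretical quantiles in $x,
# the corresponding data values in $y; a correlation near 1 suggests
# that the points fall close to a straight line
qq <- qqnorm(y, plot.it = FALSE)
qq.correlation <- cor(qq$x, qq$y)
print(qq.correlation)

# Draw the histogram and the Q-Q plot only in an interactive session
if (interactive()) {
     hist(y, freq = FALSE, main = "Histogram with fitted normal PDF")
     curve(dnorm(x, mean = y.mean, sd = y.sd), add = TRUE)
     qqnorm(y)
     qqline(y)  # reference line through the first and third quartiles
}
```

For normal data, the three proportions should be near 0.68, 0.95 and 0.997, and the Q-Q correlation should be close to 1. These are informal checks, so how close is "close enough" remains a judgment call.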

In my graduate mathematical statistics course, I learned some hypothesis tests that can be used to more rigorously check for normality or, more generally, goodness of fit to a given probability distribution. Wikipedia has an entire article on checking for normality, including many of the following tests:

– Pearson chi-squared test

– Shapiro-Wilk test

– Shapiro-Francia test

– Anderson-Darling test

– Kolmogorov-Smirnov test
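Two of these tests are available in base R's stats package, and a short sketch on simulated data shows how they might be called. The samples and variable names below are assumptions for illustration. The other tests in the list are, to my knowledge, provided by the nortest package (for example, ad.test() for Anderson-Darling), which is not used here.

```r
# Simulated samples for illustration only
set.seed(123)
x.normal <- rnorm(200, mean = 5, sd = 3)
x.skewed <- rexp(200, rate = 1)

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw.normal <- shapiro.test(x.normal)
sw.skewed <- shapiro.test(x.skewed)
print(sw.normal$p.value)
print(sw.skewed$p.value)

# Kolmogorov-Smirnov test against a normal distribution whose parameters
# are estimated from the same data; this makes the p-value only approximate
ks.skewed <- ks.test(x.skewed, "pnorm", mean = mean(x.skewed), sd = sd(x.skewed))
print(ks.skewed$p.value)
```

A small p-value is evidence against normality. A large one only means that the test did not detect a departure from normality.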

Another Simple Method and a New Test?

Today, I will share with you another simple but effective way to check for normality. I learned this tip from reading Page 150 in Michael Trosset’s “An Introduction to Statistical Inference and Its Applications with R”. Despite its simplicity, I had never learned of it before, yet it is quite intuitive. This method has also motivated me to generalize it for a new hypothesis test, though I’m struggling with the distribution of the test statistic. I can’t find a method to check for normality that is similar to what I’m thinking of, so it may be a new test for normality. In any case, it needs further work and thought, and I would be glad to hear your ideas.

The Inter-Quartile Range and Normality

It turns out that the interquartile range of a normal random variable is about 1.34898 times its standard deviation. In other words, if $X$ is a normal random variable with a mean of $\mu$ and a standard deviation of $\sigma$, then

$$F^{-1}(0.75) - F^{-1}(0.25) \approx 1.34898\,\sigma,$$

where $F^{-1}(p)$ is the inverse cumulative distribution function (CDF) or quantile function of $X$, and $p$ is a probability. Thus, if your univariate data’s sample interquartile range is not roughly 1.35 times the sample standard deviation, then you have reason to believe that your data do not come from the normal distribution.

[Figure omitted. Source: Ark0n and Gato ocioso, Wikimedia]
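To see how this rule of thumb behaves on data, here is a small R sketch that computes the ratio of the sample interquartile range to the sample standard deviation for one normal and two non-normal simulated samples. The distributions, sample size, and names (samples, iqr.sd.ratio) are assumptions chosen for this example.

```r
# Simulated samples for illustration only
set.seed(123)
n <- 10000
samples <- list(
     normal = rnorm(n, mean = 2, sd = 5),
     exponential = rexp(n, rate = 1),
     uniform = runif(n)
)

# Ratio of the sample interquartile range to the sample standard deviation
iqr.sd.ratio <- sapply(samples, function(x) IQR(x) / sd(x))
print(round(iqr.sd.ratio, 3))
```

The normal sample should give a ratio near 1.35. The theoretical ratios for the other two are log(3), or about 1.10, for the exponential distribution and sqrt(3), or about 1.73, for the uniform distribution, so their sample ratios should be visibly different. Note that a ratio near 1.35 does not by itself establish normality, since other distributions can share that ratio.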

I wanted to mathematically prove this fact with the inverse CDF of the normal distribution (for the standard normal, this is also called the probit function), but I then learned that neither the normal CDF nor its inverse has a closed-form expression in terms of elementary functions. Thus, I checked this in R by computing the ratio of the interquartile range to the standard deviation for multiple values of the standard deviation $\sigma$. I used the qnorm() function in R to compute the quantiles (i.e. the inverse normal CDF values).

##### Checking for Normality with the Inter-Quartile Range and the Standard Deviation
##### By Eric Cai - The Chemical Statistician

# Create a vector of standard deviations
sigma.vector <- seq(1, 20, by = 0.5)

# For each standard deviation, print the interquartile range divided by sigma
for (sigma in sigma.vector)
{
     quartiles <- qnorm(c(0.25, 0.75), mean = 2, sd = sigma)
     print(diff(quartiles) / sigma)
}

If you run the above script, you’ll find that every value is 1.34898.

I then wondered if, for other quantile ranges, this quotient is a constant for all values of the standard deviation $\sigma$. Indeed, if you play around with other values of probabilities in the 2-vector in the first argument of qnorm(), you’ll find that it is true. For example, I tried c(0.57, 0.29), and the quotient was 0.7297589 for all values of $\sigma$.
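This constancy can also be shown without a closed-form expression for the normal CDF, by standardizing. The short derivation below uses notation that is assumed here rather than taken from the original post: $F^{-1}$ is the quantile function of a normal random variable $X$ with mean $\mu$ and standard deviation $\sigma$, $Z = (X - \mu)/\sigma$ is the standardized variable, and $\Phi$ is the standard normal CDF.

```latex
\begin{align*}
p &= P\left(X \le F^{-1}(p)\right)
   = P\left(Z \le \frac{F^{-1}(p) - \mu}{\sigma}\right)
   = \Phi\left(\frac{F^{-1}(p) - \mu}{\sigma}\right)
   \quad\Longrightarrow\quad
   F^{-1}(p) = \mu + \sigma\,\Phi^{-1}(p), \\
F^{-1}(p_2) - F^{-1}(p_1) &= \sigma\left[\Phi^{-1}(p_2) - \Phi^{-1}(p_1)\right].
\end{align*}
```

The mean cancels, and the bracketed factor depends only on the two probabilities, so the ratio of any quantile range to $\sigma$ is the same for every normal distribution. For the quartiles, symmetry gives $\Phi^{-1}(0.25) = -\Phi^{-1}(0.75)$, so the factor is $2\,\Phi^{-1}(0.75) \approx 2 \times 0.67449 = 1.34898$.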

A New Test for Normality?

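Given that, for every quantile range of a normal random variable with standard deviation $\sigma$ and quantile function $F^{-1}$, defined by probabilities $p_1 < p_2$, there exists a positive number $c$ such that

$$F^{-1}(p_2) - F^{-1}(p_1) = c\,\sigma,$$

I then wondered if a hypothesis test could be developed to test for normality by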

1) aggregating multiple empirical quantile ranges

2) calculating their deviations from the theoretical quantile ranges implied by the normal distribution

3) testing if the aggregated deviation exceeds a certain critical value

The test statistic could be something like

The hard part, of course, is determining the distribution of this test statistic, which I call AD for “aggregate deviation” in my proposed hypothesis test. Once I figure that out, I can then find critical values for that third step.
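Short of deriving that distribution analytically, one way to explore it is by simulation. The R sketch below uses a hypothetical form of the AD statistic, chosen here purely for illustration and possibly different from the formula intended in this post: the sum of absolute differences between each empirical quantile range divided by the sample standard deviation and its theoretical constant. The function name ad.statistic, the probability pairs, the sample size, and the number of replications are all assumptions.

```r
# Hypothetical aggregate-deviation statistic (an assumption for illustration,
# not necessarily the formula intended in the post): the sum over probability
# pairs of |empirical quantile range / sample sd - theoretical constant|
ad.statistic <- function(x, lower = c(0.05, 0.10, 0.25), upper = c(0.95, 0.90, 0.75))
{
     ranges <- quantile(x, upper, names = FALSE) - quantile(x, lower, names = FALSE)
     empirical <- ranges / sd(x)
     theoretical <- qnorm(upper) - qnorm(lower)
     sum(abs(empirical - theoretical))
}

set.seed(2013)
n <- 100

# The statistic is unchanged by shifting or rescaling the data, so its
# distribution under normality can be simulated from the standard normal
null.ad <- replicate(2000, ad.statistic(rnorm(n)))

# Simulated 5% critical value for samples of size n
critical.value <- quantile(null.ad, 0.95, names = FALSE)
print(critical.value)

# Apply the test to a skewed sample and compute a Monte Carlo p-value
x.skewed <- rexp(n)
ad.skewed <- ad.statistic(x.skewed)
p.value <- mean(null.ad >= ad.skewed)
print(c(statistic = ad.skewed, p.value = p.value))
```

The simulated critical value applies only to the chosen sample size and probability pairs, so it would have to be recomputed for other settings. This does not replace a derivation of the statistic's distribution, but it does give a way to carry out the third step numerically.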

Unfortunately, my mathematical/statistical knowledge is limited in this regard. If you have any ideas, please share them in the comments.
