Introduction
I was reading Michael Trosset’s “An Introduction to Statistical Inference and Its Applications with R”, and I learned a basic but interesting fact about the normal distribution’s interquartile range and standard deviation that I had not learned before. This turns out to be a good way to check for normality in a data set.
In this post, I introduce several traditional ways of checking for normality (or goodness of fit in general), describe the method that I learned from Trosset’s book, then build upon it to sketch a possible new way to check for normality. I have not fully established this idea, so I welcome your thoughts and ideas.
Common Ways to Check for Normality
Introductory statistics courses teach several popular ways to check whether a univariate data set comes from a normal distribution:
– plotting a histogram (i.e. an empirical distribution) and comparing it to the curve of a normal PDF (i.e. the theoretical distribution)
– checking if the data fall close to a straight line in a normal Q-Q plot (the identity line if the data have been standardized)
– checking if the distribution follows the 68-95-99.7 rule
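The three informal checks above can be sketched in R as follows. This is only a minimal illustration on simulated data; the sample `x`, its size, and the seed are assumptions made for the example and should be replaced with a real data set.

```r
# Simulated data for illustration only (an assumption, not real data)
set.seed(1)
x <- rnorm(500, mean = 10, sd = 2)

# Estimate the mean and standard deviation once, outside of curve(),
# because curve() re-uses the name "x" for its own plotting grid
m <- mean(x)
s <- sd(x)

# 1) Histogram on the density scale with the fitted normal PDF overlaid
hist(x, freq = FALSE, main = "Histogram with fitted normal PDF")
curve(dnorm(x, mean = m, sd = s), add = TRUE)

# 2) Normal Q-Q plot with a reference line through the quartiles
qqnorm(x)
qqline(x)

# 3) 68-95-99.7 rule: proportion of data within 1, 2 and 3 SDs of the mean
z <- abs(x - m) / s
prop.within <- c(mean(z <= 1), mean(z <= 2), mean(z <= 3))
names(prop.within) <- c("1 SD", "2 SD", "3 SD")
print(round(prop.within, 3))
```

For roughly normal data, the three proportions should be close to 0.68, 0.95 and 0.997.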
In my graduate mathematical statistics course, I learned some hypothesis tests that can be used to more rigorously check for normality or, more generally, goodness of fit to a given probability distribution. Wikipedia has an entire article on checking for normality that covers many such tests.
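As a rough illustration of such tests, base R's `stats` package provides `shapiro.test()` (Shapiro-Wilk) and `ks.test()` (Kolmogorov-Smirnov). The sketch below applies them to simulated samples; the sample sizes, the seed, and the choice of these two tests are assumptions made for the example.

```r
# Simulated samples for illustration only (assumed, not real data)
set.seed(1)
x.normal <- rnorm(200, mean = 5, sd = 3)   # truly normal
x.skewed <- rexp(200, rate = 1)            # clearly non-normal

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw.normal <- shapiro.test(x.normal)
sw.skewed <- shapiro.test(x.skewed)
print(sw.normal$p.value)
print(sw.skewed$p.value)

# Kolmogorov-Smirnov test against a fully specified Normal(5, 3^2).
# The mean and SD are fixed in advance here; the usual p-value is not
# valid if they are estimated from the same data.
ks.normal <- ks.test(x.normal, "pnorm", mean = 5, sd = 3)
print(ks.normal$p.value)
```

A small p-value is evidence against normality; a large one only means that the test failed to detect a departure.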
Another Simple Method and a New Test?
Today, I will share with you another simple but effective way to check for normality. I learned this tip from page 150 of Michael Trosset’s “An Introduction to Statistical Inference and Its Applications with R”. Despite its simplicity and intuitive appeal, I had never come across it before. This method has also motivated me to generalize it into a new hypothesis test, though I’m struggling with the distribution of the test statistic. I can’t find an existing method for checking normality that is similar to what I’m thinking of, so it may be a new test for normality. In any case, it needs further work and thought, and I would be glad to hear your ideas.
The Inter-Quartile Range and Normality
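It turns out that the interquartile range of a normal random variable is approximately 1.34898 times its standard deviation. In other words, if

$X \sim \text{Normal}(\mu, \sigma^2)$,

then

$\text{IQR}(X) = Q_3 - Q_1 \approx 1.34898\,\sigma$,

where $Q_1$ and $Q_3$ are the first and third quartiles (the 0.25 and 0.75 quantiles) of $X$.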
Source: Ark0n and Gato ocioso, Wikimedia
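To use this fact as a quick check on a data set, one can compare the sample ratio `IQR(x) / sd(x)` to 1.34898. The sketch below does so on simulated data; the helper name `iqr.sd.ratio`, the samples, and the seed are all assumptions introduced for illustration.

```r
# Hypothetical helper: ratio of the sample IQR to the sample SD
iqr.sd.ratio <- function(x) IQR(x) / sd(x)

# Simulated samples for illustration only
set.seed(1)
x.normal  <- rnorm(1000, mean = 2, sd = 5)
x.uniform <- runif(1000)

# For normal data the ratio should be near 1.349;
# for Uniform(0, 1) the theoretical ratio is 0.5 / sqrt(1/12), about 1.732
print(iqr.sd.ratio(x.normal))
print(iqr.sd.ratio(x.uniform))
```

A ratio far from 1.349 suggests non-normality, but a ratio near 1.349 does not prove normality, since other distributions can share the same ratio.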
I wanted to prove this fact mathematically with the inverse CDF of the normal distribution (also called the probit function), but I then learned that there is no closed-form expression for the normal CDF or its inverse. Thus, I checked it in R by computing the ratio of the interquartile range to the standard deviation for multiple values of $\sigma$.
```r
##### Checking for Normality with the Inter-Quartile Range and the Standard Deviation
##### By Eric Cai - The Chemical Statistician

# Create a vector of standard deviations
sigma.vector = seq(1, 20, by = 0.5)

# For each standard deviation, print the ratio of the IQR to sigma
for (sigma in sigma.vector) {
  print(abs(diff(qnorm(c(0.75, 0.25), mean = 2, sd = sigma)) / sigma))
}
```
If you run the above script, you’ll find that every value is 1.34898.
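Although the normal CDF has no closed form, the constancy of this ratio can be seen from the location-scale structure of the normal distribution. Here is a short sketch, writing $z_p = \Phi^{-1}(p)$ for the standard normal quantile (notation introduced here, not taken from Trosset's book):

```latex
\begin{align*}
X = \mu + \sigma Z,\quad Z \sim \text{Normal}(0, 1)
  &\implies Q_p(X) = \mu + \sigma z_p, \\
\frac{\text{IQR}(X)}{\sigma}
  &= \frac{(\mu + \sigma z_{0.75}) - (\mu + \sigma z_{0.25})}{\sigma}
   = z_{0.75} - z_{0.25}
   = 2 z_{0.75} \approx 1.34898 .
\end{align*}
```

The last equality uses the symmetry $z_{0.25} = -z_{0.75}$. Both $\mu$ and $\sigma$ cancel, which is why the script prints the same value for every standard deviation.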
I then wondered if, for other quantile ranges, this quotient is a constant for all values of $\sigma$.
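This can be checked numerically in the same way as for the interquartile range. In the sketch below, the particular pairs of probabilities in `ranges` are arbitrary choices made for illustration; for each pair, the ratio of the quantile range to the standard deviation is computed over a grid of standard deviations.

```r
# Grid of standard deviations, as in the interquartile range check
sigma.vector <- seq(1, 20, by = 0.5)

# Hypothetical (lower, upper) probabilities defining each quantile range
ranges <- list(c(0.10, 0.90), c(0.20, 0.80), c(0.25, 0.75), c(0.40, 0.60))

for (p in ranges) {
  # Ratio of the quantile range to sigma, for every sigma in the grid
  ratios <- sapply(sigma.vector, function(sigma) {
    diff(qnorm(p, mean = 2, sd = sigma)) / sigma
  })
  # Report the ratio and how much it varies across the grid
  cat(sprintf("(%.2f, %.2f): ratio = %.5f, spread across sigma = %.1e\n",
              p[1], p[2], ratios[1], diff(range(ratios))))
}
```

The spread should be zero up to floating-point error for each range, meaning the ratio depends only on the chosen probabilities and not on the standard deviation.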
A New Test for Normality?
Given that, for every quantile range, there exists a positive number $k$ such that the range equals $k\sigma$ under normality, I wondered if a hypothesis test for normality could be developed by
1) aggregating multiple empirical quantile ranges
2) calculating their deviations from the corresponding values of $k\sigma$ expected under normality
3) testing if the aggregated deviation exceeds a certain critical value
The test statistic could be something like
The hard part, of course, is determining the distribution of this test statistic, which I call AD for “aggregate deviation” in my proposed hypothesis test. Once I figure that out, I can then find critical values for that third step.
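One practical way to sidestep the analytical distribution is to approximate it by simulation. The sketch below is only a hedged illustration: the exact form of the statistic is not specified here, so `ad.stat()` assumes one possible form (the sum of absolute deviations of the empirical ratios from their normal-theory constants `k`), and the probabilities, sample size, and number of replicates are arbitrary assumptions. Because sample quantiles and the sample standard deviation are both location-scale equivariant, this statistic does not depend on the mean or standard deviation, so its null distribution can be simulated from the standard normal.

```r
# Hypothetical probabilities defining symmetric quantile ranges (assumed)
lower <- c(0.10, 0.20, 0.25, 0.40)
upper <- 1 - lower

# Normal-theory constants: (quantile range) / sigma for each pair
k <- qnorm(upper) - qnorm(lower)

# Assumed form of an "aggregate deviation" statistic (not the author's
# definition): sum of |empirical range / sample SD - k| over all ranges
ad.stat <- function(x) {
  emp <- (quantile(x, upper, names = FALSE) -
          quantile(x, lower, names = FALSE)) / sd(x)
  sum(abs(emp - k))
}

# Monte Carlo approximation of the null distribution for sample size n
set.seed(1)
n <- 100
null.ad <- replicate(5000, ad.stat(rnorm(n)))

# Approximate 5% critical value: reject normality if the statistic exceeds it
crit <- quantile(null.ad, 0.95, names = FALSE)

# Example: apply the statistic to a simulated non-normal sample
x <- rexp(n)
print(c(statistic = ad.stat(x), critical.value = crit))
```

The critical value has to be re-simulated for each sample size, and the power of such a test against specific alternatives would still need to be studied.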
Unfortunately, my mathematical/statistical knowledge is limited in this regard. If you have any ideas, please share them in the comments.