Site icon R-bloggers

How to make a rough check to see if your data is normally distributed

[This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Now then, in the previous article I wrote about hypothesis testing with data that is normally distributed, in this article I’m going to post some quick test you can do to check if it is fairly safe to assume your data is normal. I would like to highlight the fact that you can never be 100% sure that your data follows a particular distribution, however, in order to be able to say something about the phenomenon you are trying to understand, it is often practical to approximate its distribution with the best fit according to the measurements (the samples). At the very least this is a starting point.

First, I should point out that if you have a small sample (< 10 samples) then it is really hard to draw some conclusions, but if the sample is large enough (as a broad and very generous rule of thumb > 30) then we can consider some indices and plots.

Skewness:
First of all, if your data is normally distributed, then it should be approximately symmetric. In order to asses this characteristic you can either use a boxplot or the skewness index g4. However I suggest to use both. Suppose you have two datasets as below:

If we plot the boxplots we obtain:

Remembering that the black thick line is the median and that the coloured part of the boxplot contains 50% of the data, we can already see that the distribution on the left is approximately symmetric, while the one on the right has a right tail (right skew).
Then, by running the skewness function we check the numbers in order to double check our guess

Since the skewness of the first dataset is close to zero it seems reasonable to assume the first dataset is normally distributed. Note that the beta simulated dataset has a positive skewness significantly higher than 0. A final check at the histograms can help a bit more

Ok, the histogram may cast some doubt on whether the first dataset is normal, however it certainly helps us ruling out normality for the second dataset

Finally, we run the kurtosis test, which gives us some information on the shape of the bell curve

We are now in a pretty safe position to assume that the first dataset is normally distributed, since the kurtosis is pretty close to 3, and we can see that the second seems to have a “peak” which is indeed what the histogram tells us. Having so many samples in the second dataset makes us pretty confident in ruling out normality, however it could also not be the case, for instance, for some reason, say the instruments with which we made the measurements were biased, we could have ended up with biased data.

Edit: Last day, while I was writing this article I forgot a very important plot: the qqplot! The function qqnorm lets you compare  the quantiles of your data with those of the normal distribution, giving you another bit of information that could point you in the right direction.
By calling the function qqnorm(data_beta) and qqnorm(data_norm) we obtain the following plots

Now, ideally, what you would expect if your data is approximately normally distributed, is that the quantiles of the sample match the theoretical samples, therefore you should get a 45 degree line. Can you guess which one of the two plots is the one generated by qqnorm(data_beta)? 😉

To leave a comment for the author, please follow the link and comment on their blog: The Beginner Programmer.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.