What statistical test should I do?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Being a teaching assistant in statistics for students with diverse backgrounds, I have the chance to see what is globally not well understood by students.
I have realized that it is usually not a problem for students to do a specific statistical test when they are told which one to use (as long as they have good resources and they have been attentive during classes, of course). However, it appears that the task is much more difficult for them when they need to choose what test to do.
This article presents a flowchart to help students in selecting the most appropriate statistical test based on a couple of criteria:
Due to the large number of tests, the image is quite wide so it may not render well on all screens. In that case, you can see it in full screen by clicking directly on the image or by clicking on the following link:
As you can see in the flowchart, the selection of the most appropriate test is based on:
- the number of variables of interest: one, two or more than two variables
- the variable type: quantitative or qualitative
- in case of a qualitative variable, the number of groups and whether they are independent or paired (i.e., dependent)
- whether you want the parametric or nonparametric version1
Summarizing so many tests in a single image is not an easy task. The goal of this flowchart is to provide students with a quick and easy way to select the most appropriate statistical test (or to see what are the alternatives). Obviously, this flowchart is not exhaustive. There are many other tests but most of them have been omitted on purpose to keep it simple and readable. I decided to keep it simple so that the flowchart is not overwhelming, with the hope that it is still complete and precise enough for most students.
For the sake of completeness, here are a few additional remarks about this flowchart:
- Tests for more than 2 variables are applicable to the case of 2 variables as well. For simplicity, I however tend to suggest the simplest test when more than one is possible. For instance, with two quantitative variables, both a correlation test and a simple linear regression can be done. In introductory statistics classes, I will most likely teach the concept of correlation but not necessarily the concept of linear regression. For this reason, I will most likely recommend a correlation test over a linear regression, unless the students have a more advanced level.
- The Kolmogorov-Smirnov test, used to compare a sample with a reference probability distribution or to compare two samples, has been omitted because it is generally not taught in introductory classes. Keep in mind, however, that this test is useful both in the uni and bivariate cases.
- Normality tests (such as, among others, the Shapiro-Wilk or Kolmogorov-Smirnov test) have also been omitted as they are part of another family of tests (they are used to answer the question “Is my dataset well-modeled by a normal distribution?”). Remember, nonetheless, that they are very useful to verify the normality assumption required in many hypothesis tests. For example, the Student’s t-test requires that the data follow approximately a normal distribution in case of small sample size. If this is not the case, the nonparametric version (i.e., the Wilcoxon test) should be preferred. The same goes for ANOVA and many other statistical tests.
- The flowchart could be extended to include more advanced linear or non-linear models, but this is beyond its scope and goal. Remember that I created it to help non-experts to see more clearly and have a broad overview of the most common statistical tests, not to confuse them even more.
I hope this guide will help you in determining the right statistical test. Feel free to share it with all students who might be interested.
As always, if you have a question or a suggestion (for example, if I missed a test which you believe should be included), please add it as a comment so other readers can benefit from the discussion.
A parametric test means that it is based on a theoretical statistical distribution, which depends on some defined parameters. On the contrary, a nonparametric test does not rely on data belonging to any particular parametric family of probability distributions. Nonparametric tests have the same objective as their parametric counterparts. However, they have two advantages over parametric tests: (i) they do not require the assumption of normality of distributions and (ii) they can deal with outliers. The trade-off is that nonparametric tests are usually less powerful than their corresponding parametric version when the normality assumption holds. Therefore, all else being equal, with a nonparametric test you are less likely to reject the null hypothesis when it is false if the data follow a normal distribution. It is thus preferred to use the parametric version when the assumptions are met.↩︎
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.