What is a Permutation Test?
Permutation tests, which I’ll be discussing in this post, aren’t that widely used by econometricians. However, they shouldn’t be overlooked.
Let’s begin with some background discussion to set the scene. This might seem a bit redundant, but it will help us to see how permutation tests differ from the sort of tests that we usually use in econometrics.
Background Motivation
When you took your first course in economic statistics, or econometrics, no doubt you encountered some of the basic concepts associated with testing hypotheses. I’m sure that the first exposure that you had to this was actually in terms of “classical”, Neyman-Pearson, testing.
It probably wasn’t described to you in so many words. It would have just been “statistical hypothesis testing”. The whole procedure would have been presented, more or less, along the following lines:
- We want to test the validity of a statement (the “null hypothesis”) about a parameter associated with a well-defined underlying population.
- We also state clearly what situation will prevail if the hypothesis to be tested is not true. We call this the “alternative hypothesis”.
- Then we take a carefully constructed sample from the population of interest. For example, we might use simple random sampling, so that all sample values are mutually independent of each other.
- We combine the sample values into a single statistic. Usually, this would be a statistic that had already been found to be a “good” estimator of the parameter under test. Typically, this estimator would have to be transformed (e.g., “standardized”) to make it “pivotal” – that is, having a sampling distribution that does not depend on any other unknown parameters.
- Then, we ask: “If in fact the null hypothesis were actually true, is the value of our test statistic ‘surprising’, or not?”
- If it is surprising, we would reject the null hypothesis. If not, we would not reject the null hypothesis.
You would have met two ways of implementing the last of these steps:
Either: by checking whether the test statistic’s value was “more extreme” than a critical value (or values) obtained by choosing an acceptable significance level in advance, and using the sampling distribution of your test statistic, based on the assumptions about the underlying population and the sampling method that was used.
Or: by asking, “If the null hypothesis really were true, how likely is it that I’d see a value for my test statistic that’s as extreme as (or more extreme than) what I’ve actually observed here?” (Again, taking into account the assumptions about the underlying population and the sampling method that was used.) The answer to this question is, of course, the familiar p-value.
Now, you probably didn’t need to be reminded of all of this. However, there are some key aspects to this hypothesis testing story that I’ve highlighted above, and which are worth remembering when we consider (below) a different possible way of proceeding.
- First, notice that there’s this idea that we know (or can reasonably assume) quite a lot about the underlying population. For instance, we might assume that the population values follow a Normal distribution, with a known variance, but a mean that is unknown and whose value we’re testing. That’s quite a stretch, in many cases. (Yes, we can think about our test’s “robustness” to these assumptions, but that’s a different issue.)
- Second, the type of sampling that we use is crucial. Usually, simple random sampling is assumed. In reality, much (most) of the data that we use have been obtained using “complex surveys”, involving multi-level cluster sampling and stratified sampling. This changes everything!
- Finally, we have to be able to precisely specify the null and alternative hypotheses (in advance of seeing the sample data), and these hypotheses have to relate to some parameter(s) in the population.
Suppose that the underlying population is N[μ, σ²], where σ² is known, and we obtain a sample of n observations using simple random sampling. Then, you’ll recall that the sample average, x*, is the most efficient unbiased estimator of μ. Using x* as our test statistic sounds like a good idea, but its sampling distribution is N[μ, σ²/n], so it depends on the unknown value of μ. This test statistic isn’t “pivotal”.
However, if we transform x* into z = (x* - μ) / (σ / √n), this new statistic has a sampling distribution that is N[0, 1]. It is now pivotal. (It’s the pivotal property that matters here, not the standardization itself!)
If all of the assumptions that we’ve made are correct, then we can easily test the null hypothesis H0: μ = 1 (say) against the alternative hypothesis HA: μ > 1 (say). If we assume that H0 is true (i.e., μ = 1), then we can compute a value for z, because now it’s as if everything is known.
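To fix ideas, here is a minimal R sketch of that classical calculation; the sample, the sample size, and the value of σ are all invented for the illustration:

```r
# Hypothetical setup: n = 25 observations, sigma known to be 2
set.seed(123)
sigma <- 2
x <- rnorm(25, mean = 1.4, sd = sigma)  # stand-in for our observed sample

mu0 <- 1                                            # H0: mu = 1
z   <- (mean(x) - mu0) / (sigma / sqrt(length(x)))  # pivotal test statistic
p   <- 1 - pnorm(z)                                 # one-sided p-value for HA: mu > 1
c(z = z, p.value = p)
```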
Now we can proceed as outlined above, because we have a value for our test statistic, and we know its sampling distribution if H0 were true.
Let’s think about a totally different way of testing some hypothesis that interests us. There are three “bulleted” points just before the example above. Let’s toss them all out of the window and start afresh!
Permutation Tests
Let’s suppose that we want to test some hypothesis, and we have a sample of size n that we plan to use. This sample could be very small, and just how we obtained it is not that important. In particular, we’re not going to assume that it was obtained randomly. Moreover, the (null) hypothesis that we’re wanting to test doesn’t have to be expressed in terms of a statement about some parameter associated with a population. It can just be some general statement. In fact, we’re not going to be that much interested in “populations” as such, so we certainly don’t have to make a whole bunch of heroic assumptions involving properties such as “normality” or “constant variance”.
This sounds different, already!
We also have in mind some alternative hypothesis, and we’ll see shortly why this is important. Then, using our sample of data, {x1, x2, x3, …, xn}, we calculate a statistic that seems to be relevant when it comes to examining the possible truth of our null hypothesis. For instance, we may want it to be one that increases monotonically as we move further from H0 in the direction of HA. This statistic does not have to be pivotal. We’ll call the value of our statistic from this original sample the “observed statistic”, and label it Sobs.
Now comes the interesting part. We “shuffle” (i.e., permute) the n sample observations. Perhaps they’re now ordered {x3, x2, xn,….., x1}. We re-calculate the test statistic. Then we repeat the process for every possible permutation of the sample. This will give us n! test statistics (including Sobs). Let’s call their values Sπ, where π refers to a particular permutation of the sample.
Finally, we ask the question, “Is Sobs very different from the other Sπ values?” Or, more particularly, “What fraction of the n! values of Sπ are as extreme as, or more extreme than, our original Sobs?”
This is where the alternative hypothesis comes in. The meaning of “extreme” is determined by HA being one-sided or two-sided.
The fraction of the Sπ values that are as extreme as, or more extreme than, Sobs is the p-value for our test.
What we have here is a testing method that is “non-parametric”, or “distribution free”. However, it’s more than that. We can show that the test will be “exact”, in the sense that it will have exactly the significance level that we decide we want, no matter how small the sample is!
Now, let’s be clear. All of this comes at a cost. Actually, there are three types of “cost” involved.
First, we’re going to be making an assumption about our data. This is actually quite a mild assumption, especially when we compare it with the various strong assumptions that underlie classical hypothesis testing. And often it will be clear from the context of our testing problem that this assumption is obviously satisfied.
This key assumption is usually described as “exchangeability” of the data. That is, under the null hypothesis, any shuffling (reordering) of the observed data points is just as likely to have arisen as the ordering that we actually observed.
Second, this procedure that I’ve outlined isn’t going to be applicable to all testing problems, including some obvious and simple ones. For example, if we have a single sample (as above) and our test statistic is chosen to be the sample average, x* (or the sample median, say), then every Sπ will have the same value as Sobs, and we’ll get nowhere.
Third, you can see immediately that even with very modest sample sizes, the number of possible permutations becomes extremely large, very quickly. For example, if n = 10 there are 10! = 3,628,800 permutations; and if n = 100 there are about 9.33×10^157.
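You can verify these counts directly in R:

```r
factorial(10)   # 3,628,800
factorial(100)  # about 9.33e+157 (still representable as a double, just)
```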
Yikes!
We can’t do much about the second of these “costs”. We’ll have to focus on testing problems that are suited to being solved using this particular tool. However, the computational cost can be handled quite easily, as we’ll see below.
Now, why does a permutation test actually work at all?
Basically, if the data are exchangeable, then every permuted sample is just as likely to occur as any other one, and so every Sπ is as likely to arise as any other one. If H0 is true, any observed differences among them will be small and random. If Sobs is “unusual”, then something is amiss, and we would be inclined to reject our null hypothesis.
Permutation tests date back to the 1930s, and were first proposed by Fisher (1935), Pitman (1937a, 1937b, 1938), and others. This is certainly not a new idea! However, as you can imagine, in the 1930s these tests could be used only with very small samples, and this limited their appeal to some degree.
Even with today’s computing power, the large number of permutations associated with even modest sample sizes is still a challenge. For this reason, in practice we typically just randomly sample a large number of permutations from the total number that are available. This is easy to do, and although the resulting test is no longer (literally) “exact”, we can make it as accurate as we wish by choosing a suitable number of random permutations.
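To make the idea concrete, here is a rough base-R sketch of such a randomization test. The function name (perm_test), the choice of a difference-in-group-means statistic, and the default number of permutations are my own choices for illustration, not anything from the post:

```r
# Randomization test: randomly sample permutations rather than enumerating all n!.
# x: numeric data; g: logical vector marking one of the two groups;
# nperm: number of randomly drawn permutations.
perm_test <- function(x, g, nperm = 9999) {
  stat   <- function(x, g) mean(x[g]) - mean(x[!g])  # S: difference in group means
  s_obs  <- stat(x, g)
  s_perm <- replicate(nperm, stat(x, sample(g)))     # shuffle the group labels
  # One-sided p-value: fraction of statistics at least as extreme as S_obs.
  mean(c(s_obs, s_perm) >= s_obs)
}
```

Shuffling the group labels, rather than the observations themselves, amounts to the same thing for a statistic like this, and counting Sobs among the permuted values keeps the estimated p-value strictly positive.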
Usually (but not always), when we use just a random sample of the possible permutations, rather than all of them, the resulting test is referred to as a “randomization test”. However, be aware that some authors use the terms “permutation” and “randomization” interchangeably in this context. And sometimes the term “exact test” is used in either case.
Obviously, permutation/randomization tests are still going to be computationally intensive. But they’re very manageable. There are several R packages for various types of randomization tests. These include the “coin” package; the “jmuOutlier” package; and the “RATest” package.
As I’ve noted already, I think that students of econometrics need to be more aware of permutation tests, even though they aren’t well-suited for some testing problems. A brief historical perspective is given by David (2008).
Back in 1995, Peter Kennedy wrote a very useful “overview” paper about permutation tests (Kennedy, 1995) for an econometrics audience. His paper also contains a useful summary of the philosophical arguments for and against permutation tests – something that I won’t go into here.
A lot has happened in the associated literature since 1995, of course, especially when it comes to applying these tests in the context of multiple regression analysis. I’ll say a bit more about the latter in the follow-up post that I’m currently preparing.
Some Examples
Let’s take a look at a couple of simple examples that will illustrate the use of permutation/randomization tests. The first involves a “treatment effect” – a classic situation where permutation tests are used.
Example 1: Suppose that two econometrics students, Jane and John, sit a pre-semester test and a final exam. Here are their % scores:
|                    | Jane | John | (Sample) Average |
|--------------------|------|------|------------------|
| Pre-Semester Test  | 70   | 75   | 72.5             |
| Final Exam.        | 76   | 72   | 74.0             |
Based on this (tiny) sample we’re interested in whether or not the course improves students’ knowledge of econometrics. You might think of conducting a two-sample t-test of the null hypothesis that the (population) mean scores are the same in the test and in the exam., against a one-sided (positive) alternative hypothesis.
However, nothing is known or has been assumed about the underlying population distributions; about how the sample was selected, etc. So, the conditions that are needed for the t-test may not hold, and our inferences could be badly distorted. We certainly can’t appeal to standard asymptotic results, given the very small sample size that we have here.
Can a permutation test help us?
Consider the null hypothesis that there is no improvement in knowledge, and the alternative hypothesis that there is a positive improvement. We don’t need a fancy test statistic. For instance, we can simply use the difference between the “after” and “before” sample average scores. Then, the observed value of our test statistic is Sobs = (74.0 – 72.5) = 1.5. Yes, this is positive, but is it different from zero in any important sense? (I’m trying to avoid the term “statistically significant” here!)
Notice that there are 4 scores in the table above, and the 2 scores associated with the pre-semester test can be assigned in [4! / (2! 2!)] = 6 ways. Here are those 6 possible permutations:

| π | 70 | 72 | 75 | 76 | Sπ = (X*F - X*P) |
|---|----|----|----|----|------------------|
| 1 | P  | P  | F  | F  | 4.5              |
| 2 | P  | F  | P  | F  | 1.5 (Sobs)       |
| 3 | P  | F  | F  | P  | 0.5              |
| 4 | F  | P  | P  | F  | -0.5             |
| 5 | F  | P  | F  | P  | -1.5             |
| 6 | F  | F  | P  | P  | -4.5             |
Here, “P” and “F” refer to the pre-semester test and the final exam., respectively. X*P and X*F are the sample average (permuted) scores in the test and in the final exam., and Sπ denotes our test statistic for permutation π (= 1, …, 6).
Given that we have a one-sided alternative hypothesis, the question is, “what fraction of the six Sπ values is greater than or equal to Sobs?” The answer is (2/6) = 0.333. This is the p-value associated with our permutation test. On the basis of its value I, personally, wouldn’t reject the null hypothesis. I’d conclude that the course led to no improvement in students’ knowledge.
If we had a two-sided alternative hypothesis (namely, that the course led to some change in performance, in either direction), then the p-value would be 0.667. Yes, we’re doubling the one-sided p-value in this case, but only because the distribution of the Sπ values happens to be symmetric. More generally, we get this answer by noting that four of the six | Sπ | values are greater than or equal to | Sobs |.
Now, you might wonder whether the particular form of test statistic that we chose affected the result in any substantive way. Actually, the answer is “no”. If we converted the Sπ statistics into (pseudo-) “t-statistics”, using the usual standardization, and then proceeded as above, we’d get exactly the same p-value for the permutation test. What matters is that S should be monotone with respect to changes in the data that make the null hypothesis less likely; the particular choice of statistic may (and often will) affect the power of the test.
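If you want to reproduce the table and both p-values, here is a small base-R sketch (my own code, not anything from the blog’s code page):

```r
scores  <- c(70, 75, 76, 72)   # Jane pre, John pre, Jane final, John final
pre_idx <- combn(4, 2)         # the 6 ways of labelling two of the scores as "pre-test"

S     <- apply(pre_idx, 2, function(i) mean(scores[-i]) - mean(scores[i]))
S_obs <- mean(scores[3:4]) - mean(scores[1:2])   # 74.0 - 72.5 = 1.5

mean(S >= S_obs)             # one-sided p-value: 2/6 = 0.333
mean(abs(S) >= abs(S_obs))   # two-sided p-value: 4/6 = 0.667
```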
In this first example it was a simple matter to spell out all of the possible permutations that we needed to consider, and then we could apply an exact permutation test. Now let’s look at a second simple example which is also a classic permutation test. In this case, because of the sample size, random selection among all possible permutations has to be used.
Example 2:
Suppose that we want to test the hypothesis that there is no correlation between the supply (Q) of a particular good and its market price (P). We’d probably want a one-sided alternative hypothesis, namely that this correlation is positive.
Suppose that we have a sample of 20 prices and the corresponding 20 quantities supplied:
P: 5.0, 4.8, 4.7, 4.0, 5.3, 4.1, 5.5, 4.7, 3.3, 4.0, 4.0, 4.6, 5.3, 3.0, 3.5, 3.9, 4.7, 5.0, 5.2, 4.6
Q: 60, 59, 58, 47, 65, 48, 67, 70, 55, 63, 62, 65, 71, 56, 59, 60, 74, 77, 78, 62
We’re not going to assume that the population(s) from which we’ve sampled are normal, or anything else for that matter. We’re going to use the sample (Pearson) correlation coefficient as our test statistic. We’ll keep the ordering of the Q values just as they are in the observed sample, and we’ll permute the 20 observations on P. Even with this small sample there are about 2.43×10^18 possible permutations of the observations on P. So, we’ll draw a random selection of 50,000 of these permutations in constructing our randomization test.
The sample correlation for the 20 original values of P and Q is robs = 0.6183. We’ll denote the sample correlation associated with Q and a particular set of permuted P observations by rπ (π = 1, …, 50000). The (one-sided) p-value is the fraction of these 50,000 values of rπ that are greater than or equal to robs. You’ll find both EViews and R code for this example on the code page of this blog.
The p-values that I got using EViews and R were 0.0017 and 0.0016 respectively. The distribution of the 50,000 correlation coefficients generated in R looks like this:
Again, if we wanted to compute a two-sided p-value this would be the fraction of the 50,000 values of | rπ | that are greater than or equal to | robs |. Modifying the R code, we get a value of 0.00348 for this p-value.
In any case, on the basis of the very small p-value I’d (personally) reject the hypothesis that there is no correlation between price and quantity supplied.
Finally, although it was quite natural to use the sample correlation coefficient, r, as the test statistic for our hypothesis, again this is not actually necessary. Recalling the formula for r, we can use the test statistic Sπ = Σ(PiQi), where the summation runs over the sample, i = 1, …, n. Because Q and the set of P values are fixed across permutations, this statistic differs from rπ only by a location shift and a scaling. Constructing a permutation test of H0 using Sπ instead of rπ, we get exactly the same p-value(s). The EViews program code that I’ve supplied illustrates this. (If you’re not an EViews user, you can open the program file with any text editor, and there are enough comments in the code for you to see what’s going on.)
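The author’s EViews and R programs are on the blog’s code page; what follows is just my own minimal base-R sketch of the same kind of calculation, using the data listed above, and showing that the Σ(PiQi) statistic gives the same one-sided p-value as r when the same permutations are used:

```r
P <- c(5.0, 4.8, 4.7, 4.0, 5.3, 4.1, 5.5, 4.7, 3.3, 4.0,
       4.0, 4.6, 5.3, 3.0, 3.5, 3.9, 4.7, 5.0, 5.2, 4.6)
Q <- c(60, 59, 58, 47, 65, 48, 67, 70, 55, 63,
       62, 65, 71, 56, 59, 60, 74, 77, 78, 62)

set.seed(42)
nperm <- 50000
perms <- replicate(nperm, sample(P))   # each column is a random permutation of P

r_obs  <- cor(P, Q)                    # 0.6183 in the post
r_perm <- apply(perms, 2, cor, y = Q)
mean(r_perm >= r_obs)                  # one-sided p-value (post reports about 0.0016-0.0017)
mean(abs(r_perm) >= abs(r_obs))        # two-sided p-value (post reports 0.00348)

# The simpler statistic S = sum(P * Q) gives the identical one-sided p-value,
# because for fixed Q (and a fixed set of P values) S is an increasing linear
# function of r, so it ranks the permutations in exactly the same order.
s_obs  <- sum(P * Q)
s_perm <- apply(perms, 2, function(p) sum(p * Q))
mean(s_perm >= s_obs)
```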
The important things to note about these two examples are:
- The permutation/randomization tests were constructed and applied without the need for any particular knowledge or assumptions about some underlying population.
- There was no need for the sampling to be strictly random.
- It was legitimate to “shuffle” (permute) the observations, treating them as if they were “exchangeable”. (This will not be the case for some testing problems.)
- The tests are very simple, conceptually, to construct.
- Although they are modestly computer-intensive, this “cost” is on a par with bootstrapping, for example.
By now, you’re probably wanting to see how these tests can help in the situation that we’re most familiar with as econometricians – regression analysis. My next post will illustrate this, and it will also demonstrate that permutation/randomization tests are exact when it comes to their significance level.
Fisher, R. A., 1935. The Design of Experiments. Oliver & Boyd, Edinburgh.
Kennedy, P. E., 1995. Randomization tests in econometrics. Journal of Business and Economic Statistics, 13, 85-94.
Pitman, E., 1937a. Significance tests which may be applied to samples from any populations: I. Journal of the Royal Statistical Society, B, 4, 119-130.
Pitman, E., 1937b. Significance tests which may be applied to samples from any populations: II. Journal of the Royal Statistical Society, B, 4, 225-232.
Pitman, E., 1938. Significance tests which may be applied to samples from any populations: III. Biometrika, 29, 322-335.