Lilliefors, Kolmogorov-Smirnov and cross-validation
In statistics, the Kolmogorov–Smirnov test is a popular procedure to test whether a sample {x1,⋯,xn} is drawn from a distribution F, or usually Fθ0, where Fθ is some parametric distribution. For instance, we can test H0:Xi∼N(0,1) (where θ0=(μ0,σ0²)=(0,1)) using that test. More specifically, I wanted to discuss p-values today. Given n, let us draw N(0,1) samples of size n, and compute the p-values of the Kolmogorov–Smirnov tests
n = 300
p = rep(NA, 1e5)
for(s in 1:1e5){
  X = rnorm(n, 0, 1)
  p[s] = ks.test(X, "pnorm", 0, 1)$p.value
}
We can visualise the distribution of the p-values below (with a Beta distribution fitted on top)
library(fitdistrplus)
fit.dist = fitdist(p, "beta")
hist(p, probability = TRUE, main = "", xlab = "", ylab = "")
vu = seq(0, 1, by = .01)
vv = dbeta(vu, shape1 = fit.dist$estimate[1], shape2 = fit.dist$estimate[2])
lines(vu, vv, col = "dark red", lwd = 2)
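Since the Beta(1,1) distribution is exactly the uniform distribution on [0,1], a quick sanity check is to look at the fitted shape parameters, which should both be close to 1:

fit.dist$estimate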
It looks quite uniform (and indeed, under H0, the p-value is theoretically uniformly distributed on [0,1]). More specifically, the p-value was below 5% in (almost exactly) 5% of the samples
mean(p <= .05)
[1] 0.0479
i.e. we wrongly reject H0:Xi∼N(0,1) in 5% of the samples.
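And since, under H0, the p-values should themselves be uniformly distributed on [0,1], we can even turn the visual impression above into a formal check, e.g. with a Kolmogorov–Smirnov test of the simulated p-values against the uniform distribution (just a quick sanity check, on top of the Beta fit):

ks.test(p, "punif")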
As discussed previously on this blog, in many cases we care about the distribution, and not really the parameters, so we wish to test something like H0:Xi∼N(μ,σ²), for some μ and σ². A natural idea is therefore to test H0:Xi∼N(μ̂,σ̂²), for some estimates of μ and σ². That is the idea of the Lilliefors test. More specifically, the Lilliefors test uses the Kolmogorov–Smirnov statistic, but corrects the p-value, since the classical p-value is no longer valid once the parameters are estimated from the sample. Indeed, if we draw many samples, and use the Kolmogorov–Smirnov statistic and its classical p-value to test H0:Xi∼N(μ̂,σ̂²),
n = 300
p = rep(NA, 1e5)
for(s in 1:1e5){
  X = rnorm(n, 0, 1)
  p[s] = ks.test(X, "pnorm", mean(X), sd(X))$p.value
}
we see clearly that the distribution of p-values is no longer uniform
fit.dist = fitdist(p, "beta")
hist(p, probability = TRUE, main = "", xlab = "", ylab = "")
vu = seq(0, 1, by = .01)
vv = dbeta(vu, shape1 = fit.dist$estimate[1], shape2 = fit.dist$estimate[2])
lines(vu, vv, col = "dark red", lwd = 2)
More specifically, if the xi's are actually drawn from some Gaussian distribution, we have almost no chance of rejecting H0, the p-value being almost never below 5%
mean(p <= .05)
[1] 0.00012
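As a side note, a corrected version of that test is implemented, for instance, in the nortest package; the sketch below (assuming nortest is installed) uses lillie.test(), which computes the Kolmogorov–Smirnov statistic with estimated mean and variance but returns Lilliefors' corrected p-value, so the rejection frequency should be back close to 5%

library(nortest)
p_lillie = rep(NA, 1e4)
for(s in 1:1e4){
  X = rnorm(300, 0, 1)
  p_lillie[s] = lillie.test(X)$p.value  # Lilliefors-corrected p-value
}
mean(p_lillie <= .05)                   # should be close to 5%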
The usual heuristic to interpret that result is that μ̂ and σ̂² are both based on the sample, while previously 0 and 1 were based on some prior knowledge. Somehow, it reminded me of the classical problem we mention when we introduce cross-validation, namely Goodhart's law
"When a measure becomes a target, it ceases to be a good measure."
i.e. we cannot assess goodness of fit using the same data as the ones used to estimate the parameters. So here, why not use some hold-out (or cross-validation) procedure: split the dataset into two parts, {x1,⋯,xk} (with k<n), used to compute the Kolmogorov–Smirnov statistic, and {xk+1,⋯,xn}, used to estimate the parameters μ̂ and σ̂²
p = matrix(NA, 1e5, 4)
for(s in 1:1e5){
  X = rnorm(n, 0, 1)
  p[s,1] = ks.test(X, "pnorm", 0, 1)$p.value           # known parameters
  p[s,2] = ks.test(X, "pnorm", mean(X), sd(X))$p.value # estimated on the full sample
  p[s,3] = ks.test(X[1:200], "pnorm", mean(X[201:300]), sd(X[201:300]))$p.value # estimate on 1/3, test on 2/3
  p[s,4] = ks.test(X[201:300], "pnorm", mean(X[1:200]), sd(X[1:200]))$p.value   # estimate on 2/3, test on 1/3
}
Again, we can visualise the distribution of p-values, first in the case where 1/3 of the data is used to estimate μ and σ², and 2/3 of the data is used to test
fit.dist = fitdist(p[,3], "beta")
hist(p[,3], probability = TRUE, main = "", xlab = "", ylab = "")
vu = seq(0, 1, by = .01)
vv = dbeta(vu, shape1 = fit.dist$estimate[1], shape2 = fit.dist$estimate[2])
lines(vu, vv, col = "dark red", lwd = 2)
and then in the case where 2/3 of the data is used to estimate μ and σ², and 1/3 of the data is used to test
fit.dist = fitdist(p[,4], "beta")
hist(p[,4], probability = TRUE, main = "", xlab = "", ylab = "")
vu = seq(0, 1, by = .01)
vv = dbeta(vu, shape1 = fit.dist$estimate[1], shape2 = fit.dist$estimate[2])
lines(vu, vv, col = "dark red", lwd = 2)
Observe here that we now (wrongly) reject H0 too frequently: the p-values are below 5% in 24% of the scenarios in the first case (less data used to estimate), and in 9% of the scenarios in the second case (less data used to test)
mean(p[,3] <= .05)
[1] 0.24168
mean(p[,4] <= .05)
[1] 0.09334
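For completeness, the first two columns simply replicate the two previous experiments, so their rejection frequencies can be checked the same way:

mean(p[,1] <= .05)  # known parameters: should be close to 5%
mean(p[,2] <= .05)  # parameters estimated on the full sample: close to 0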
We can actually compute that rejection probability as a function of k/n, the proportion of observations used to compute the test statistic
n = 300
p = matrix(NA, 1e4, 99)
for(s in 1:1e4){
  X = rnorm(n, 0, 1)
  # a proportion 'prop' of the sample is used to compute the statistic,
  # the remaining (1-prop) to estimate the parameters
  KS = function(prop) ks.test(X[1:(prop*n)], "pnorm",
                              mean(X[(prop*n+1):n]), sd(X[(prop*n+1):n]))$p.value
  p[s,] = Vectorize(KS)((1:99)/100)
}
The evolution of that probability is shown below
prob5pc = apply(p, 2, function(x) mean(x <= .05))
plot((1:99)/100, prob5pc)
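To make the comparison with the nominal level easier, one can add a horizontal reference line at 5% (a small addition to the plot above):

abline(h = .05, lty = 2)  # nominal 5% level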
So, it looks like we can use some sort of hold-out procedure to test H0:Xi∼N(μ,σ²), for some μ and σ², using the Kolmogorov–Smirnov test with μ=μ̂ and σ²=σ̂², but the proportion of data used to estimate those quantities should be (much) larger than the proportion used to compute the statistic. Otherwise, we clearly reject H0 too frequently.