Lilliefors, Kolmogorov-Smirnov and cross-validation
In statistics, the Kolmogorov–Smirnov test is a popular procedure to test whether a sample \(\{x_1,\cdots,x_n\}\) is drawn from a distribution \(F\), or usually \(F_{\theta_0}\), where \(F_{\theta}\) is some parametric distribution. For instance, we can test \(H_0:X_i\sim\mathcal{N}(0,1)\) (where \(\theta_0=(\mu_0,\sigma_0^2)=(0,1)\)) using that test. More specifically, I wanted to discuss \(p\)-values today. Given \(n\), let us draw samples of size \(n\) from \(\mathcal{N}(0,1)\), and compute the \(p\)-values of Kolmogorov–Smirnov tests
n = 300
p = rep(NA, 1e5)
for(s in 1:1e5){
  X = rnorm(n, 0, 1)
  p[s] = ks.test(X, "pnorm", 0, 1)$p.value
}
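As an aside, to make explicit what ks.test computes, here is a minimal sketch (not from the original post) of the Kolmogorov–Smirnov statistic itself, namely the sup distance between the empirical cumulative distribution function and the hypothesised one,

# minimal sketch of the KS statistic, using the standard formula
n = 300
X = sort(rnorm(n, 0, 1))
Fx = pnorm(X, 0, 1)                            # hypothesised CDF at the data points
D = max(pmax((1:n)/n - Fx, Fx - (0:(n-1))/n))  # sup |F_n - F|
D
ks.test(X, "pnorm", 0, 1)$statistic            # should coincide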
We can visualise the distribution of the \(p\)-values below (I added a Beta distribution fit)
library(fitdistrplus)
fit.dist = fitdist(p, "beta")
hist(p, probability = TRUE, main = "", xlab = "", ylab = "")
vu = seq(0, 1, by = .01)
vv = dbeta(vu, shape1 = fit.dist$estimate[1], shape2 = fit.dist$estimate[2])
lines(vu, vv, col = "dark red", lwd = 2)
It looks quite uniform (and indeed, under \(H_0\) the \(p\)-value is uniformly distributed on \([0,1]\)). More specifically, the \(p\)-value was below 5% in about 5% of the samples
mean(p <= .05)
[1] 0.0479
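One can even run a quick sanity check (not in the original post) and test those \(p\)-values themselves for uniformity,

# hypothetical sanity check: under H0 the p-values should be uniform on [0,1]
ks.test(p, "punif")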
i.e. we wrongly reject \(H_0:X_i\sim\mathcal{N}(0,1)\) in 5% of the samples.
As discussed previously on this blog, in many cases we care about the distribution, and not really the parameters, so we wish to test something like \(H_0:X_i\sim\mathcal{N}(\mu,\sigma^2)\), for some \(\mu\) and \(\sigma^2\). A natural idea is then to test \(H_0:X_i\sim\mathcal{N}(\hat\mu,\hat\sigma^2)\), for some estimates of \(\mu\) and \(\sigma^2\). That is the idea of Lilliefors' test: use the Kolmogorov–Smirnov statistic, but correct the \(p\)-value. Indeed, if we draw many samples, and use the Kolmogorov–Smirnov statistic with its classical \(p\)-value to test \(H_0:X_i\sim\mathcal{N}(\hat\mu,\hat\sigma^2)\),
n = 300
p = rep(NA, 1e5)
for(s in 1:1e5){
  X = rnorm(n, 0, 1)
  p[s] = ks.test(X, "pnorm", mean(X), sd(X))$p.value
}
we see clearly that the distribution of \(p\)-values is no longer uniform
fit.dist = fitdist(p, "beta")
hist(p, probability = TRUE, main = "", xlab = "", ylab = "")
vu = seq(0, 1, by = .01)
vv = dbeta(vu, shape1 = fit.dist$estimate[1], shape2 = fit.dist$estimate[2])
lines(vu, vv, col = "dark red", lwd = 2)
More specifically, if the \(x_i\)'s are actually drawn from some Gaussian distribution, there is almost no chance of rejecting \(H_0\), the \(p\)-value being almost never below 5%
mean(p <= .05)
[1] 0.00012
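As a side note, Lilliefors' corrected test is implemented, for instance, in the nortest package (this is an aside, not used in the original post); with the correction, the rejection rate is back near its nominal level,

# a sketch using nortest::lillie.test, Lilliefors' corrected KS normality test
library(nortest)
p = rep(NA, 1e4)
for(s in 1:1e4){
  X = rnorm(300, 0, 1)
  p[s] = lillie.test(X)$p.value
}
mean(p <= .05)   # should be close to the nominal 5%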
The usual heuristic to interpret that result is that \(\hat\mu\) and \(\hat\sigma^2\) are both based on the sample, while previously \(0\) and \(1\) were based on some prior knowledge. Somehow, it reminded me of the classical problem we mention when introducing cross-validation, namely Goodhart's law
When a measure becomes a target, it ceases to be a good measure
i.e. we cannot assess goodness of fit using the same data as the ones used to estimate the parameters. So why not use some hold-out (or cross-validation) procedure: split the dataset in two parts, \(\{x_1,\cdots,x_k\}\) (with \(k<n\)) and \(\{x_{k+1},\cdots,x_n\}\), use one part to estimate \(\hat\mu\) and \(\hat\sigma^2\), and the other to compute the Kolmogorov–Smirnov statistic
p = matrix(NA, 1e5, 4)
for(s in 1:1e5){
  X = rnorm(n, 0, 1)
  # column 1: known parameters; column 2: parameters estimated on the full sample
  p[s,1] = ks.test(X, "pnorm", 0, 1)$p.value
  p[s,2] = ks.test(X, "pnorm", mean(X), sd(X))$p.value
  # column 3: estimate on 1/3, test on 2/3; column 4: estimate on 2/3, test on 1/3
  p[s,3] = ks.test(X[1:200], "pnorm", mean(X[201:300]), sd(X[201:300]))$p.value
  p[s,4] = ks.test(X[201:300], "pnorm", mean(X[1:200]), sd(X[1:200]))$p.value
}
Again, we can visualise the distribution of the \(p\)-values, in the case where \(1/3\) of the data is used to estimate \(\mu\) and \(\sigma^2\), and \(2/3\) of the data is used to test
fit.dist = fitdist(p[,3], "beta")
hist(p[,3], probability = TRUE, main = "", xlab = "", ylab = "")
vu = seq(0, 1, by = .01)
vv = dbeta(vu, shape1 = fit.dist$estimate[1], shape2 = fit.dist$estimate[2])
lines(vu, vv, col = "dark red", lwd = 2)
and in the case where \(2/3\) of the data is used to estimate \(\mu\) and \(\sigma^2\), and \(1/3\) of the data is used to test
fit.dist = fitdist(p[,4], "beta")
hist(p[,4], probability = TRUE, main = "", xlab = "", ylab = "")
vu = seq(0, 1, by = .01)
vv = dbeta(vu, shape1 = fit.dist$estimate[1], shape2 = fit.dist$estimate[2])
lines(vu, vv, col = "dark red", lwd = 2)
Observe here that we (wrongly) reject \(H_0\) too frequently: the \(p\)-values are below 5% in roughly 24% of the scenarios in the first case (less data used to estimate), and in roughly 9% of the scenarios in the second case (less data used to test)
mean(p[,3] <= .05)
[1] 0.24168
mean(p[,4] <= .05)
[1] 0.09334
We can actually compute that probability as a function of \(k/n\)
n = 300
p = matrix(NA, 1e4, 99)
for(s in 1:1e4){
  X = rnorm(n, 0, 1)
  # the first prop*n observations are tested, the remaining ones estimate mu and sigma
  KS = function(prop) ks.test(X[1:(prop*n)], "pnorm",
                              mean(X[(prop*n+1):n]), sd(X[(prop*n+1):n]))$p.value
  p[s,] = Vectorize(KS)((1:99)/100)
}
The evolution of the probability is the following
prob5pc = apply(p, 2, function(x) mean(x <= .05))
plot((1:99)/100, prob5pc)
So it looks like we can use some sort of hold-out procedure to test \(H_0:X_i\sim\mathcal{N}(\mu,\sigma^2)\), for some \(\mu\) and \(\sigma^2\), using the Kolmogorov–Smirnov test with \(\mu=\hat\mu\) and \(\sigma^2=\hat\sigma^2\), but the proportion of data used to estimate those quantities should be (much) larger than the proportion used to compute the statistic. Otherwise, we clearly reject \(H_0\) too frequently.
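As a final illustration, here is a minimal sketch with an arbitrary 90/10 split (the exact proportion is my choice, not taken from the post); using most of the sample to estimate and a small part to test should bring the rejection rate much closer to the nominal level,

# hedged sketch: 90% of the sample estimates (mu, sigma), 10% is tested
p = rep(NA, 1e4)
for(s in 1:1e4){
  X = rnorm(300, 0, 1)
  p[s] = ks.test(X[1:30], "pnorm", mean(X[31:300]), sd(X[31:300]))$p.value
}
mean(p <= .05)   # still above, but much closer to, 5%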