Maximum Likelihood versus Goodness of Fit
Thursday, I got an interesting question from a colleague of mine (JP). I mean, the way I understood the question turned out to be a nice puzzle (but I have to confess I might have misunderstood). The question is the following: consider an i.i.d. sample of continuous variables. We would like to choose between two (parametric) families for the distribution, F and G. If we use maximum likelihood techniques, we get two estimators, one for each family, and hence two fitted distributions. Clearly, the fitted distribution from family F is much better than the one from family G, in the sense of a standard goodness-of-fit test (e.g. Kolmogorov-Smirnov, since the sample is assumed to be drawn from a continuous variable). Does that mean that family F is (somehow) better than family G?
This is my interpretation of the question, and I found it amusing. So I will try to show (using simulated samples) that some odd situations can easily be obtained.
Consider a sample from a mixture of log-normal distributions,
> set.seed(228)
> X=exp(c(rnorm(50,1,1),rnorm(50,2,1.2)))
Consider two standard families for positive random variables: the Gamma family and the lognormal family.
> library(MASS)
> ab=fitdistr(X,"gamma")
> ms=fitdistr(X,"lognormal")
If we want to visualize those two distributions, let us use
> u=seq(0,max(X)+1,length=501)  # grid of values (not defined in the original snippet)
> vab=pgamma(u,ab$estimate[1],ab$estimate[2])
> vms=plnorm(u,ms$estimate[1],ms$estimate[2])
> plot(ecdf(X))
> lines(u,vab,col="red")
> lines(u,vms,col="blue")
Here, we get the following graph, with the empirical cumulative distribution function, the fitted Gamma distribution (in red) and the fitted lognormal distribution (in blue).
What else can we say? Actually, we can also compute the Kolmogorov-Smirnov statistic,

D = sup_x | F_n(x) - F(x) |,

where F_n denotes the empirical cumulative distribution function of the sample, and F the fitted (Gamma or lognormal) cumulative distribution function.
This can be done using
> ks.test(X,"plnorm",ms$estimate[1],ms$estimate[2]) One-sample Kolmogorov-Smirnov test data: X D = 0.0693, p-value = 0.7231 alternative hypothesis: two-sided > ks.test(X,"pgamma",ab$estimate[1],ab$estimate[2]) One-sample Kolmogorov-Smirnov test data: X D = 0.148, p-value = 0.02507 alternative hypothesis: two-sided
From a theoretical point of view, we should not look at the p-values, since the null distribution is based on a fixed distribution, not a fitted one (see the Lilliefors test for normal samples). But still. The Gamma distribution seems to be very far away from the true distribution. The statistic is twice the one we get with our lognormal distribution. And one p-value is 72%, while the other one is 2.5%. Here, we should prefer this lognormal distribution to that Gamma one. But here, we considered only one distribution in each family. Does that mean that we cannot find one Gamma distribution that will be better than all possible lognormal distributions? Better, for instance, according to the Kolmogorov-Smirnov statistic…
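That said, if we really wanted a p-value that accounts for the estimated parameters, the standard fix is a parametric bootstrap, in the spirit of the Lilliefors correction: re-estimate the parameters on each simulated sample, and compare the observed statistic to that simulated null distribution. A minimal sketch, for the lognormal family (my own addition; the function name pboot.ks and the number of replications B=500 are arbitrary choices),

> pboot.ks=function(x,B=500){
+ fit=fitdistr(x,"lognormal")
+ D0=ks.test(x,"plnorm",fit$estimate[1],fit$estimate[2])$statistic
+ Db=replicate(B,{
+ xs=rlnorm(length(x),fit$estimate[1],fit$estimate[2])
+ fs=fitdistr(xs,"lognormal")
+ ks.test(xs,"plnorm",fs$estimate[1],fs$estimate[2])$statistic})
+ mean(Db>=D0)}  # bootstrap p-value
> pboot.ks(X)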
Well, it is possible to use another strategy to find appropriate parameters. We can minimize this statistic actually. Define
> ks1=function(ms) {m=ms[1];s=ms[2];ks.test(X,"plnorm",m,s)$statistic}
> ks2=function(ab) {a=ab[1];b=ab[2];ks.test(X,"pgamma",a,b)$statistic}
and compute
> n1=nlm(ks1,c(ms$estimate[1],ms$estimate[2]))
> n1
$minimum
[1] 0.05252692

$estimate
[1] 1.547437 1.121864

> n2=nlm(ks2,c(ab$estimate[1],ab$estimate[2]))
> n2
$minimum
[1] 0.04737725

$estimate
[1] 1.1449041 0.167041
So here, it is possible to find a distribution much closer to the empirical sample, within the Gamma family actually.
> vab=pgamma(u,n2$estimate[1],n2$estimate[2])
> vms=plnorm(u,n1$estimate[1],n1$estimate[2])
> lines(u,vab,col="red",lwd=2)
> lines(u,vms,col="blue",lwd=2)
What would be the point here? Maybe just the idea that the maximum likelihood estimator is only one estimator among many. And while it has interesting asymptotic properties, on small samples it might not be the best estimator to consider…
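To see the trade-off from the other side (again a small check of my own, not in the original post), one can compare log-likelihoods: by construction, the parameters that minimize the Kolmogorov-Smirnov statistic cannot have a higher log-likelihood than the maximum likelihood estimates, within each family,

> sum(dgamma(X,ab$estimate[1],ab$estimate[2],log=TRUE))  # Gamma, maximum likelihood
> sum(dgamma(X,n2$estimate[1],n2$estimate[2],log=TRUE))  # Gamma, minimum Kolmogorov-Smirnov
> sum(dlnorm(X,ms$estimate[1],ms$estimate[2],log=TRUE))  # lognormal, maximum likelihood
> sum(dlnorm(X,n1$estimate[1],n1$estimate[2],log=TRUE))  # lognormal, minimum Kolmogorov-Smirnov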
And to be completely honest, I’ve been cheating here… I mean, not really cheating (not more than any researcher using a statistical test to validate their findings). But here, I did fix the seed of the random number generator. Actually, such an example does not occur that frequently. Here, out of 1000 samples, I got this odd conclusion almost 15 times. And the smaller the sample, the more likely we are to observe that story, where the maximum likelihood estimator can be far away from the best fit. Here is the proportion of opposite conclusions, as a function of the sample size,
> SIM=function(ns=1000,n=100){
+ t=0
+ for(s in 1:ns){
+ set.seed(s)
+ X=exp(c(rnorm(n/2,1,1),rnorm(n/2,2,1.2)))
+ ks1=function(ms) {m=ms[1];s=ms[2];ks.test(X,"plnorm",m,s)$statistic}
+ ks2=function(ab) {a=ab[1];b=ab[2];ks.test(X,"pgamma",a,b)$statistic}
+ library(MASS)
+ ab=fitdistr(X,"gamma")
+ ms=fitdistr(X,"lognormal")
+ n1=nlm(ks1,c(ms$estimate[1],ms$estimate[2]))
+ n2=nlm(ks2,c(ab$estimate[1],ab$estimate[2]))
+ if( (ks.test(X,"plnorm",ms$estimate[1],ms$estimate[2])$statistic-
+ ks.test(X,"pgamma",ab$estimate[1],ab$estimate[2])$statistic)
+ *(n1$minimum-n2$minimum)<=0 ) t=t+1
+ }
+ return(t/ns)}
> VM=rep(NA,20)
> VS=seq(10,200,by=10)
> for(i in 1:20){VM[i]=SIM(n=VS[i],ns=1000)}
> plot(VS,VM,type="p")
So, to provide a more complete answer to JP’s question: with a very large sample, I guess that your intuition should be valid… but clearly not on a small sample.
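If one wants to check that large-sample intuition directly (a sketch of my own, not run in the original post, and rather slow), the same SIM() function can simply be called with bigger sample sizes,

> VS2=c(500,1000,2000)
> VM2=sapply(VS2,function(n) SIM(n=n,ns=200))  # proportion of opposite conclusions
> VM2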