
More significant? so what…


Following my non-life insurance class this morning, I had an interesting question from a student, which I will try to illustrate and reformulate as accurately as possible. Consider a simple regression model, with one variable of interest and one possible explanatory variable. Assume that we have two possible models, with the following output (yes, I do hide the interesting parts here, but only to get quickly to my student’s point)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.92883    0.06391  14.534   <2e-16 ***
X           -0.12499    0.06108  -2.046   0.0421 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

for the first model – a GLM with some distribution, and some link function – and

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.92901    0.06270  14.817   <2e-16 ***
X           -0.09883    0.05816  -1.699   0.0909 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

for the second one – another GLM, with another distribution, but the same link function (I guess I could have changed it, but it does not really matter here). Then, I got the following statement: “I would like to choose the first model, because the explanatory variable is more significant, and therefore this model should have stronger predictive power.”

That’s a nice idea, isn’t it? Actually, I guess this is why I love teaching, because I would never have come up with such an idea by myself. When you look at that statement, somehow it could make sense. Except that, from my point of view, it is not valid at all. My first thought was to recall a standard example in statistical inference: you cannot claim that one distribution is better than another just by looking at the parameter estimates.

> library(MASS)
> fitdistr(Y,"normal")
      mean          sd    
  0.93685011   0.90700830 
 (0.06413517) (0.04535042)
> fitdistr(Y,"exponential")
      rate   
  1.06740661 
 (0.07547704)

Can I claim that the Gaussian distribution is better than the exponential one because its parameter estimates have smaller standard errors? Of course not. Yet somehow, this is exactly what we did when we claimed previously that the first model was better than the second one.
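If we really want to compare the two fits, we should compare likelihoods rather than standard errors. A quick check, not in the original post (fitdistr objects carry their log-likelihood, so AIC can be computed directly),

> fitN=fitdistr(Y,"normal")
> fitE=fitdistr(Y,"exponential")
> AIC(fitN)   # two estimated parameters
> AIC(fitE)   # one estimated parameter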

Let me get back to the outputs of the two regressions, and explain what I did. Actually, I wanted a story close to the one on the Gaussian versus exponential fit. So I generated some exponential random variables,

> set.seed(5)
> n=200
> U=runif(n)
> Y=-log(U)   # inverse transform: Y is Exp(1) distributed

Here, we can visualize the histogram of this sample, as well as the estimated exponential distribution

> hist(Y,proba=TRUE,col="light green",border="white",lwd=2,breaks=seq(0,16/3,by=1/3))
> x=seq(0,6,by=.02)
> lines(x,dexp(x,1/mean(Y)),col="red",lty=2)
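Since the sample was actually drawn from a unit exponential distribution, we can also overlay the true density, as a sanity check (this line is not in the original post),

> lines(x,dexp(x,1),col="grey",lty=3)   # true Exp(1) density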

On top of that, let us fit a gamma distribution, using a GLM (where the regression is on a constant only), just to practice, because later on we will run a gamma regression on that variable

> reg0=glm(Y~1,family=Gamma(link="identity"))
> a=reg0$coefficients          # fitted mean (identity link, intercept only)
> b=summary(reg0)$dispersion   # estimated dispersion parameter
> lines(x,dgamma(x,shape=1/b,scale=a*b),col="blue")
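As a cross-check (not in the original post), the same gamma distribution can be fitted by maximum likelihood; since the data are actually exponential, the estimated shape should be close to 1 (the GLM-based estimate above and the MLE will be close, but not identical),

> fitdistr(Y,"gamma")   # a shape close to 1 means almost exponential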

Now we need a covariate, to run some regressions. What I wanted was a variable slightly correlated with our previous variable. Slightly, just to make sure that the p-value in the regression would be close to 5% or 10%. So here, I generated a variable such that the pair (Y,X) has a Clayton copula, with parameter 0.1 (which is small, extremely small)

> a=.1
> set.seed(5)
> n=200
> U=runif(n)
> V=(U^(-a)*(runif(n)^(-a/(1+a))-1)+1)^(-1/a)   # conditional inverse of the Clayton copula
> Y=-log(U)
> X=qnorm(V)
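For a Clayton copula with parameter a, Kendall’s tau is a/(a+2), about 0.048 here, so the dependence is indeed weak. A quick sanity check on the simulated uniforms (not in the original post),

> a/(a+2)                      # theoretical Kendall's tau of the Clayton copula
> cor(U,V,method="kendall")    # empirical counterpart, should be close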

To visualize the density of that copula, we can use

> cop=function(u,v){
+ (a+1)*(u*v)^(-(a+1))*
+ (u^(-a)+v^(-a)-1)^(-(2*a+1)/a) }
> x=y=seq(.05,.95,by=.05)
> z=outer(x,y,cop)
> mat=persp(x,y,z,col="green",shade=TRUE,xlim=c(0,1),ylim=c(0,1),zlim=c(0,2),theta=-30,
+ ticktype ="detailed",zlab="")

We should not be far from independence (actually, there is a negative – and significant – Pearson correlation between X and Y).
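A quick check, not in the original post,

> cor.test(X,Y)   # Pearson's correlation: negative, and significantly non-zero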

Now, consider two models: a Gaussian linear regression of Y on X, and a gamma regression with the same (identity) link. The outputs are the following (you will recognize the outputs given previously)

> reg1=lm(Y~X)
> reg2=glm(Y~X,family=Gamma(link="identity"))
> summary(reg1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.92883    0.06391  14.534   <2e-16 ***
X           -0.12499    0.06108  -2.046   0.0421 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9021 on 198 degrees of freedom
Multiple R-squared:  0.02071,	Adjusted R-squared:  0.01576 
F-statistic: 4.187 on 1 and 198 DF,  p-value: 0.04206

> summary(reg2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.92901    0.06270  14.817   <2e-16 ***
X           -0.09883    0.05816  -1.699   0.0909 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.9086447)

    Null deviance: 229.72  on 199  degrees of freedom
Residual deviance: 226.58  on 198  degrees of freedom
AIC: 379.22

Number of Fisher Scoring iterations: 10
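Note, in passing, that if we want a single-figure comparison of the two regressions, we can use a likelihood-based criterion rather than the significance of one coefficient; both models are fitted to the same data, so their AICs are directly comparable (a quick check, not in the original post),

> AIC(reg1,reg2)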

And here are the two predictions: 0.92883 - 0.12499 X for the Gaussian model, and 0.92901 - 0.09883 X for the gamma one.

So, which model should we use? As usual, my answer will be “let’s have a look at the data” instead of looking only at tables of figures. Using some code posted a few days ago, let us visualize the two regressions. The Gaussian model is here

(for the lower part, I do not go below 0, since we are modeling a positive variable) while the gamma one is here
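The original figures are not reproduced here, but here is a minimal sketch of how such bands can be drawn, assuming – as an illustration, not necessarily the code used in the original post – pointwise 90% bands derived from the fitted conditional distributions,

> u=seq(-3,3,by=.1)
> nd=data.frame(X=u)
> m1=predict(reg1,newdata=nd)                  # Gaussian fitted mean
> s1=summary(reg1)$sigma                       # residual standard error
> m2=predict(reg2,newdata=nd,type="response")  # gamma fitted mean
> phi=summary(reg2)$dispersion                 # gamma dispersion
> plot(X,Y)
> lines(u,m1,col="red")
> lines(u,qnorm(.95,m1,s1),col="red",lty=2)
> lines(u,pmax(0,qnorm(.05,m1,s1)),col="red",lty=2)    # truncated at 0
> lines(u,m2,col="blue")
> lines(u,qgamma(.95,shape=1/phi,scale=m2*phi),col="blue",lty=2)
> lines(u,qgamma(.05,shape=1/phi,scale=m2*phi),col="blue",lty=2)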

And if we believe that the explanatory variable has no predictive power (since we can claim that the parameter is not significant in the regression) and remove it from the regression, we get

Here, I do believe that the gamma model (not to say the exponential one) is better, because it is clearly more consistent with the properties of the variable of interest. I trust the confidence interval obtained above with the gamma model more than the one obtained with a Gaussian distribution, even if the parameter in the regression is “more significant”.
