This morning, Benoit sent me an email about an exercise on linear regression that he found in an econometrics textbook. Consider the following dataset,
Here, variable X denotes the income, and Y the expenses. The goal was to fit a linear regression (actually, in the email, it was mentioned that we should try to fit a heteroscedastic model, but let us skip this part). So Benoit's question was more or less: how do you fit a linear regression from a contingency table?
Usually, when I get an email on a Saturday morning, I try to postpone answering it. But the kids had their circus class, so I had some time to answer. And this did not look like a complex puzzle… Let us import this dataset into R, so that we can start playing
> df=read.table("http://freakonometrics.free.fr/baseexo.csv",sep=";",header=TRUE)
> M=as.matrix(df[,2:ncol(df)])
> M[is.na(M)]<-0
> M
     X14 X19 X21 X23 X25 X27 X29 X31 X33 X35
[1,]  74  13   7   1   0   0   0   0   0   0
[2,]   6   4   2   7   4   0   0   0   0   0
[3,]   2   3   2   2   4   0   0   0   0   0
[4,]   1   1   2   3   3   2   0   0   0   0
[5,]   2   0   1   3   2   0   6   0   0   0
[6,]   2   0   2   1   0   0   1   2   1   0
[7,]   0   0   0   2   0   0   1   1   3   0
[8,]   0   1   0   1   0   0   0   0   2   0
[9,]   0   0   0   0   1   1   0   1   0   1
The first idea I had was to use those counts as weights. Weighted least squares should be perfect. The dataset is built from this matrix,
> W=as.vector(M)
> x=df[,1]
> X=rep(x,ncol(M))
> y=as.numeric(substr(names(df)[-1],2,3))
> Y=rep(y,each=nrow(M))
> base1=data.frame(X1=X,Y1=Y,W1=W)
Here we have
> head(base1,10)
   X1 Y1 W1
1  16 14 74
2  23 14  6
3  25 14  2
4  27 14  1
5  29 14  2
6  31 14  2
7  33 14  0
8  35 14  0
9  37 14  0
10 16 19 13
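As a side note (this shortcut is mine, not part of Benoit's exercise): since as.vector(M) stacks the columns of M, with the income varying fastest, the same weighted dataset should also be obtained in one line with expand.grid,
> base1b=cbind(expand.grid(X1=x,Y1=y),W1=as.vector(M))
> head(base1b,3)
  X1 Y1 W1
1 16 14 74
2 23 14  6
3 25 14  2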
The regression is the following,
> reg1=lm(Y1~X1,data=base1,weights=W1)
> summary(reg1)

Call:
lm(formula = Y1 ~ X1, data = base1, weights = W1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.35569    2.03022   2.145    0.038 *  
X1           0.68263    0.09016   7.572 3.04e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.892 on 40 degrees of freedom
Multiple R-squared: 0.589,  Adjusted R-squared: 0.5787 
F-statistic: 57.33 on 1 and 40 DF,  p-value: 3.038e-09
It looks like the output is the same as what Benoit found, so we should be happy. But then I had a second thought: why not build the implied dataset? By replicating each observation according to its count, we should be able to recover the dataset that was used to get this contingency table,
> vX=vY=rep(NA,sum(W))
> sumW=c(0,cumsum(W))
> for(i in 1:length(W)){
+   if(W[i]>0){
+     vX[(1+sumW[i]):sumW[i+1]]=X[i]
+     vY[(1+sumW[i]):sumW[i+1]]=Y[i]
+ }}
> base2=data.frame(X2=vX,Y2=vY)
Here, the dataset is much larger, and there are no weights,
> tail(base2,10)
    X2 Y2
172 31 31
173 33 31
174 37 31
175 31 33
176 33 33
177 33 33
178 33 33
179 35 33
180 35 33
181 37 35
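A small side remark (again, my shortcut, not part of the original email): the loop above is not really needed, since rep() accepts a vector of counts in its times argument. Something like the following should give the same 181 rows, cells with a zero count simply contributing nothing,
> base2b=data.frame(X2=rep(X,W),Y2=rep(Y,W))
> dim(base2b)
[1] 181   2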
If we run a linear regression on this dataset, we obtain
> reg2=lm(Y2~X2,data=base2)
> summary(reg2)

Call:
lm(formula = Y2 ~ X2, data = base2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.35569    0.95972   4.538 1.04e-05 ***
X2           0.68263    0.04262  16.017  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.731 on 179 degrees of freedom
Multiple R-squared: 0.589,  Adjusted R-squared: 0.5867 
F-statistic: 256.5 on 1 and 179 DF,  p-value: < 2.2e-16
If we compare the two regressions, we have
> rbind(coefficients(summary(reg1)),
+       coefficients(summary(reg2)))
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 4.3556857 2.03021637  2.145429 3.804237e-02
X1          0.6826296 0.09015771  7.571506 3.038443e-09
(Intercept) 4.3556857 0.95972279  4.538483 1.036711e-05
X2          0.6826296 0.04261930 16.016913 2.115373e-36
The estimates are exactly the same (which does not surprise me), but the standard errors (and the significance levels) are quite different. And to be honest, I find that surprising. Which approach here is the most legitimate (since they are, in the end, not equivalent)?
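At least we can see where the numerical difference comes from (a quick back-of-the-envelope check, using only the outputs above): each replicated observation has the same residual as its weighted counterpart, so the two fits share the same (weighted) residual sum of squares and the same cross-product matrix; the standard errors can then differ only through the residual degrees of freedom, 40 cells with a positive weight against 181 - 2 = 179 replicated observations. And indeed,
> 2.03021637/0.95972279   # ratio of the two intercept standard errors
[1] 2.11542
> sqrt(179/40)            # ratio implied by the degrees of freedom
[1] 2.11542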