[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Visualization is a key to success in regression analysis. This is one of the (many) reasons I am also suspicious when I read an article with a quantitative (econometric) analysis without any graph. Consider for instance the following dataset, obtained from http://data.worldbank.org/, with, for each country, the GDP per capita (in some common currency) and the infant mortality rate (deaths before the age of 5),Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
> library(gdata) > XLS1=read.xls("http://api.worldbank.org/datafiles /NY.GDP.PCAP.PP.CD_Indicator_MetaData_en_EXCEL.xls", sheet = 1) > data1=XLS1[-(1:28),c("Country.Name","Country.Code","X2010")] > names(data1)[3]="GDP" > XLS2=read.xls("http://api.worldbank.org/datafiles /SH.DYN.MORT_Indicator_MetaData_en_EXCEL.xls", sheet = 1) > data2=XLS2[-(1:28),c("Country.Code","X2010")] > names(data2)[2]="MORTALITY" > data=merge(data1,data2) > head(data) Country.Code Country.Name GDP MORTALITY 1 ABW Aruba NA NA 2 AFG Afghanistan 1207.278 149.2 3 AGO Angola 6119.930 160.5 4 ALB Albania 8817.009 18.4 5 AND Andorra NA 3.8 6 ARE United Arab Emirates 47215.315 7.1
If we estimate a simple linear regression –
> regBB=lm(MORTALITY~GDP,data=data) > summary(regBB) Call: lm(formula = MORTALITY ~ GDP, data = data) Residuals: Min 1Q Median 3Q Max -45.24 -29.58 -12.12 16.19 115.83 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 67.1008781 4.1577411 16.139 < 2e-16 *** GDP -0.0017887 0.0002161 -8.278 3.83e-14 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 39.99 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.2909, Adjusted R-squared: 0.2867 F-statistic: 68.53 on 1 and 167 DF, p-value: 3.834e-14We can look at the scatter plot, including the linear regression line, and some confidence bounds,
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5)",cex=.5) > text(data$GDP,data$MORTALITY,data$Country.Name,pos=3) > x=seq(-10000,100000,length=101) > y=predict(regBB,newdata=data.frame(GDP=x), + interval="prediction",level = 0.9) > lines(x,y[,1],col="red") > lines(x,y[,2],col="red",lty=2) > lines(x,y[,3],col="red",lty=2)
We should be able to do a better job here. For instance, if we look at the Box-Cox profile likelihood,
> boxcox(regBB)
it looks like taking the logarithm of the mortality rate should be better, i.e.
> regLB=lm(log(MORTALITY)~GDP,data=data) > summary(regLB) Call: lm(formula = log(MORTALITY) ~ GDP, data = data) Residuals: Min 1Q Median 3Q Max -1.3035 -0.5837 -0.1138 0.5597 3.0583 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.989e+00 7.970e-02 50.05 <2e-16 *** GDP -6.487e-05 4.142e-06 -15.66 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7666 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.5949, Adjusted R-squared: 0.5925 F-statistic: 245.3 on 1 and 167 DF, p-value: < 2.2e-16 > plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5) log scale",cex=.5,log="y") > text(data$GDP,data$MORTALITY,data$Country.Name) > x=seq(300,100000,length=101) > y=exp(predict(regLB,newdata=data.frame(GDP=x)))* + exp(summary(regLB)$sigma^2/2) > lines(x,y,col="red") > y=qlnorm(.95, meanlog=predict(regLB,newdata=data.frame(GDP=x)), + sdlog=summary(regLB)$sigma^2) > lines(x,y,col="red",lty=2) > y=qlnorm(.05, meanlog=predict(regLB,newdata=data.frame(GDP=x)), + sdlog=summary(regLB)$sigma^2) > lines(x,y,col="red",lty=2)
on the log scale or
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita", + ylab="Mortality rate (under 5) log scale",cex=.5)
on the standard scale. Here we use quantiles of the log-normal distribution to derive confidence intervals.
But why shouldn’t we take also the logarithm of the GDP ? We can fit a model
> regLL=lm(log(MORTALITY)~log(GDP),data=data) > summary(regLL) Call: lm(formula = log(MORTALITY) ~ log(GDP), data = data) Residuals: Min 1Q Median 3Q Max -1.13200 -0.38326 -0.07127 0.26610 3.02212 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 10.50192 0.31556 33.28 <2e-16 *** log(GDP) -0.83496 0.03548 -23.54 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.5797 on 167 degrees of freedom (47 observations deleted due to missingness) Multiple R-squared: 0.7684, Adjusted R-squared: 0.767 F-statistic: 554 on 1 and 167 DF, p-value: < 2.2e-16 > plot(data$GDP,data$MORTALITY,xlab="GDP per capita ", + ylab="Mortality rate (under 5)",cex=.5,log="xy") > text(data$GDP,data$MORTALITY,data$Country.Name) > x=exp(seq(1,12,by=.1)) > y=exp(predict(regLL,newdata=data.frame(GDP=x)))* + exp(summary(regLL)$sigma^2/2) > lines(x,y,col="red") > y=qlnorm(.95, meanlog=predict(regLL,newdata=data.frame(GDP=x)), + sdlog=summary(regLL)$sigma^2) > lines(x,y,col="red",lty=2) > y=qlnorm(.05, meanlog=predict(regLL,newdata=data.frame(GDP=x)), + sdlog=summary(regLL)$sigma^2) > lines(x,y,col="red",lty=2)
on the log scales or
> plot(data$GDP,data$MORTALITY,xlab="GDP per capita ", + ylab="Mortality rate (under 5)",cex=.5)
on the standard scale. If we compare the last two predictions, we have
with in blue is the log model, and in red is the log-log model (I did not include the first one for obvious reasons).
To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.