Site icon R-bloggers

An Inconvenient Statistic

[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As I sit here waiting on more frigid temperatures subsequent to another 10 inches of snow, suffering from metastatic cabin fever, I can’t help but ponder what I can do examine global warming/climate change.  Well, as luck would have it, R has the tools to explore this controversy.  Using two packages, vars and forecast, I will see if I should be purchasing carbon offsets or continue with a life of conspicuous consumption, oblivious to the consequences of my actions.

The concept is to find data on man-made carbon emissions and global surface temperatures.  Then, using vector autoregression to identify the proper number of lags to put into a granger causality model.  I will not get into any theory here, but you can see a discussion of granger causality in my very first post where I showed how to solve the age-old mystery of what comes first, the chicken or the egg (tongue firmly planted in cheek).

It is important to point out that two prior papers have shown no causal linkage between CO2 emissions and surface temperatures (Triacca, 2005 & an unpublished manuscript from Bilancia/Vitale).  In essence, past observations of CO2 concentrations do not improve the statistical predictions of current surface temperatures.  Be that as it may, I will attempt to duplicate such an analysis, giving any adventurous data scientist the tools and techniques to dig into this conundrum on their own.

Where can we find the data?  Global CO emission estimates can be found at the Carbon Dioxide Information Analysis Center (CDIAC) at the following website –  http://cdiac.ornl.gov/.  You can download data of total emissions of fossil fuel combustion and cement manufacture.  Surface temperature takes some detective work, but a clever soul can find it at the website of the UK Met Office Hadley Centre, part of the climate research center at the University of East Anglia, website – http://www.metoffice.gov.uk/hadobs/hadcrut4/ .  An anomaly is calculated as the difference between the average annual surface temperature versus the average of the reference years, 1961 – 1990.

The data have common years from 1850 until 2010 and I downloaded and put it into a .csv for import into R.  Now, it’s on to the code!

> require(forecast)
> require(vars)
> var.data = read.csv(file.choose())
> head(var.data)
  Year CO2   Temp
1 1850  54 -0.374
2 1851  54 -0.219
3 1852  57 -0.223
4 1853  59 -0.268
5 1854  69 -0.243
6 1855  71 -0.264

> #put data into a time series
> carbon.ts = ts(CO2, frequency=1, start=c(1850), end=c(2010))
> temp.ts = ts(Temp, frequency=1, start=c(1850), end=c(2010))
#subset the data from 1900 until 2010
> surfacetemp = window(temp.ts, start=c(1900), end=c(2010))
> co2 = window(carbon.ts, start=c(1900), end=c(2010))
> climate.ts = cbind(co2, surfacetemp)
> plot(climate.ts)

















> #determine stationarity and number of lags to achieve stationarity
> ndiffs(co2, alpha = 0.05, test = c(“adf”))
[1] 1
> ndiffs(surfacetemp, alpha = 0.05, test = c(“adf”))
[1] 1

Using the adf test above in the ndiffs command of the forecast package, we can see that a 1st difference will allow us to achieve stationarity, which is necessary for vector autoregression and granger causality.

> #difference to achieve stationarity
> d.co2 = diff(co2)
> d.temp = diff(surfacetemp)

> #again, we need a mts class dataframe
> climate2.ts = cbind(d.co2, d.temp)
> plot(climate2.ts)


















> #determine the optimal number of lags for vector autoregression
> VARselect(climate2.ts, lag.max=10) $selection

AIC(n)  HQ(n)  SC(n) FPE(n)
     7          3         1        7

I find that the above divergence in the tests for optimal VAR modeling is quite common.  Now, one can peruse the literature for what is the best statistical test to determine optimal lag length, but I like to use brute force and ignorance and try all of the above (i.e. lags 1, 3 and 7).

> #vector autoregression with lag1
> var = VAR(climate2.ts, p=1)

It is important now to test for serial autocorrelation in the model residuals and below is for the Portmanteau test (several options in the vars package are available).

> serial.test(var, lags.pt=10, type=”PT.asymptotic”)

Portmanteau Test (asymptotic)

data:  Residuals of VAR object var
Chi-squared = 55.4989, df = 36, p-value = 0.01996

#The null hypothesis is no serial correlation, so we can reject it with extreme prejudice…on to var3
> var3 = VAR(climate2.ts, p=3)
> serial.test(var3, lags.pt=10, type=”PT.asymptotic”)

Portmanteau Test (asymptotic)

data:  Residuals of VAR object var3
Chi-squared = 36.1256, df = 28, p-value = 0.1394

That is more like it.  You can review the details of the var model, in this case temperature, if you so choose:

> summary(var3, equation=”d.temp”)

VAR Estimation Results:
=========================
Endogenous variables: d.co2, d.temp
Deterministic variables: const
Sample size: 107
Log Likelihood: -548.435
Roots of the characteristic polynomial:
0.7812 0.7265 0.7265 0.6491 0.5846 0.5846
Call:
VAR(y = climate2.ts, p = 3)


Estimation results for equation d.temp:
=======================================
d.temp = d.co2.l1 + d.temp.l1 + d.co2.l2 + d.temp.l2 + d.co2.l3 + d.temp.l3 + const

               Estimate     Std. Error        t value   Pr(>|t|)  
d.co2.l1     7.603e-05  1.014e-04        0.749    0.455372  
d.temp.l1   -4.103e-01  9.448e-02     -4.343    3.37e-05 ***
d.co2.l2    -2.152e-05  1.115e-04      -0.193    0.847339  
d.temp.l2   -3.922e-01  9.544e-02     -4.109    8.15e-05 ***
d.co2.l3     7.905e-05  1.041e-04       0.759    0.449465  
d.temp.l3   -3.366e-01  9.263e-02     -3.633   0.000444 ***
const        7.539e-03  1.340e-02        0.563    0.574960  

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 0.1014 on 100 degrees of freedom
Multiple R-Squared: 0.254, Adjusted R-squared: 0.2093
F-statistic: 5.676 on 6 and 100 DF,  p-value: 4.15e-05



Covariance matrix of residuals:
              d.co2           d.temp
d.co2   10972.588    -1.28920
d.temp        -1.289     0.01028

Correlation matrix of residuals:
             d.co2       d.temp
d.co2   1.0000     -0.1214
d.temp -0.1214     1.0000

> #does co2 granger cause temperature
> grangertest(d.temp ~ d.co2, order=3)

Granger causality test

Model 1: d.temp ~ Lags(d.temp, 1:3) + Lags(d.co2, 1:3)
Model 2: d.temp ~ Lags(d.temp, 1:3)
  Res.Df   Df        F          Pr(>F)
1    100              
2    103   -3    0.5064     0.6787

> #Clearly the model is not significant, so we can say that carbon emissions do not granger-cause surface temperatures.

> #does temperature granger cause co2
> grangertest(d.co2 ~ d.temp, order =3)

Granger causality test

Model 1: d.co2 ~ Lags(d.co2, 1:3) + Lags(d.temp, 1:3)
Model 2: d.co2 ~ Lags(d.co2, 1:3)
  Res.Df    Df         F        Pr(>F)
1    100              
2    103     -3    0.7799    0.5079

> #try again using lag 7
> grangertest(d.temp ~ d.co2, order=7)

Granger causality test

Model 1: d.temp ~ Lags(d.temp, 1:7) + Lags(d.co2, 1:7)
Model 2: d.temp ~ Lags(d.temp, 1:7)
  Res.Df    Df         F        Pr(>F)
1     88              
2     95      -7    0.5817    0.7691

Again, nothing significant using lag 7.  So, using this data and the econometric techniques spelled out above, it seems there is no causal effect (statistically speaking) between fossil fuel emissions and global surface temperatures.  Certainly, this is not the final word on the matter as there is much measurement error in the data that the stewards have attempted to account for.

On a side note, we can use vars for predictions and forecast for time series plots of the predicted values.

> predict(var3, n.ahead=6, ci=0.95)

$d.co2
         fcst                 lower            upper             CI
[1,] 202.5888      -2.717626     407.8953     205.3065
[2,] 110.3385  -105.847948     326.5249     216.1864
[3,] 192.1802    -26.160397     410.5207     218.3406
[4,] 152.5464    -74.948000     380.0408     227.4944
[5,] 108.4343  -122.198058     339.0666     230.6323
[6,] 123.9001  -107.882219     355.6824     231.7823

$d.temp
             fcst                    lower            upper               CI
[1,]  0.026737000    -0.1719770    0.2254510    0.1987140
[2,] -0.057081637    -0.2731569    0.1589936    0.2160753
[3,]  0.040419451    -0.1803409    0.2611798    0.2207603
[4,]  0.032591047    -0.1893108    0.2544929    0.2219019
[5,]  0.013708836    -0.2143756    0.2417933    0.2280844
[6,] -0.004319714   -0.2324070    0.2237675    0.2280873

> fcst = forecast(var3)
> plot(fcst)


















So what can we conclude from this exercise?  Well, let’s look to the good Doctor, Hunter S. Thompson for some philosophical insight.  He would likely advise us…

“res ipsa locquitur”

References:

BILANCIA, MASSIMO, and DOMENICO VITALE. “GRANGER CAUSALITY ANALYSIS OF BIVARIATE CLIMATIC TIME SERIES: A NOTE ON THE ROLE OF CO2 EMISSIONS IN GLOBAL CLIMATE WARMING.”

Morice, C. P., J. J. Kennedy, N. A. Rayner, and P. D. Jones (2012), Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 dataset, J. Geophys. Res., 117, D08101, doi:10.1029/2011JD017187.

Triacca, U, Is Granger causality analysis appropriate to investigate the relationship between atmospheric concentration of carbon dioxide and global surface air temperature?, Theoretical and Applied Climatology, 81, 133-135





To leave a comment for the author, please follow the link and comment on their blog: Fear and Loathing in Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.