What value is cross-country GDP correlation? [Part One]
The above graph borders on chartjunk (and is nothing like Paul Butler's amazing Facebook map). We can see some variation in color, but mostly it is a set of lines between 152 country capitals with no way to determine which lines are important! However, the creation of the graph and the data behind it are interesting. Visualizing correlation between multiple series is often difficult without some additional structure or information. We can plot a block of scatterplots from a data frame, but readability suffers for more than ten or so variables; I also think ccf() refuses to render for multiple time series of more than ten variables. If we have information strictly ordering the variables (and hopefully related to correlation) we can make a heatmap or a 3D plot; with countries we can order by GDP per capita or some other variable, but such an ordering becomes unwieldy. And the alternative isn't pretty, as you see above.
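To see why an ordering makes the heatmap alternative workable, here is a minimal base-R sketch with simulated series; the country names and the "GDP per capita" ordering variable are invented for illustration.

```r
# Simulate yearly growth series for a handful of fake countries
set.seed(42)
n.years <- 19
countries <- paste0("country", 1:8)
growth <- sapply(countries, function(x) rnorm(n.years))
colnames(growth) <- countries

# Pairwise correlation matrix of the simulated series
corr.mat <- cor(growth)

# Order the countries by an external variable (invented GDP per capita here);
# the heatmap is only readable because the ordering imposes some structure
gdp.pcap <- runif(length(countries), 1000, 40000)
ord <- order(gdp.pcap)
image(corr.mat[ord, ord], main = "Pairwise correlation, ordered by GDP per capita")
```

With 150+ countries even this breaks down, which is roughly the point of the map above.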
This chart started out as an exploration into the correlation of GDP growth between countries. An abundance of evidence supports the claim that correlation between asset classes moves toward one in a downturn, but this may not be the case for countries (at least for an unweighted sample). There is also some evidence suggesting that movements in GDP across countries have become more synchronized, which accords with our basic story of increasing financialization and decreasing transaction costs. Testing these hypotheses rigorously is beyond the scope of this blog, but I want to poke at a few interesting avenues for research.
- Part One: Can we reliably measure correlation in GDP growth? Without a rigorous panel framework or a functional form for the process generating per capita GDP, can we naively compute correlation (and rely on that naive computation)? How stationary are GDP growth series?
- Part Two: How informative are the classification schemes for countries? First in a back-of-the-envelope test and later in a Bayesian framework, we want to test the informational content of the region, income, or cultural classifications available to us. In relation to economic outcomes, high-level classification schemes are often arbitrary, the easiest example being the need to split Northern Africa from Sub-Saharan Africa for economic analysis. But sorting countries by income may be just as uninformative when it comes to predicting when countries move together. Membership in the OECD represents (roughly) a statement about a level of GDP, not a rate.
- Part Three: What is the appropriate timescale for measuring correlation between countries? If we were to optimally group countries into maximum (or minimum) covariance bundles, how much does the era for backtesting matter? I leave this to the end because working with the full range of dates involves sensibly dealing with missing values in the GDP series and may take some time.
To measure GDP correlation we downloaded GDP per capita for 207 countries from the World Bank using the WDI package, available on CRAN and maintained by Vincent Arel-Bundock at the University of Michigan. With it we can get country-year data from 1991-2009 (a range chosen simply to minimize missing values due to name changes, revolutions, etc.) on GDP and a number of other indicators. If you haven't tried out the package, please do so; it is a wonderful tool.
Once the data are imported into a data frame we can convert them to ts (or xts) and check for stationarity. Eyeballing the GDP series gives us a hint that they are non-stationary, but this is far from clear for GDP growth. Though ccf() may sometimes detrend series before showing a result, a non-stationary series can lead us to find correlation (either within a single series or among multiple series) where none may exist had we checked the detrended series. More worrisome is the shaky foundation a non-stationary series provides for further modeling.
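The spurious-correlation worry can be illustrated in base R with a simulated random walk (the variable names here are invented for illustration): the level series shows strong autocorrelation that largely vanishes once the series is differenced.

```r
set.seed(1)
# A random walk (non-stationary) and its first difference (white noise)
shocks <- rnorm(200)
random.walk <- cumsum(shocks)

# Lag-1 autocorrelation of the level series vs. the differenced series
acf.level <- acf(random.walk, plot = FALSE)$acf[2]
acf.diff  <- acf(diff(random.walk), plot = FALSE)$acf[2]

# The level series looks highly "correlated" purely because of its trend
print(c(level = acf.level, differenced = acf.diff))
```

The same logic applies across two series: two independent random walks will routinely show large sample correlations in levels.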
Using Cheung and Lai (1995) as a rough guide, we cannot reject non-stationarity for most of the GDP series; however, we can safely reject non-stationarity for the bulk (though not all) of the differenced GDP series. I suspect these numbers would improve if we converted differenced GDP to a growth rate (rather than a per capita change), but that will wait for Part Three.
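The distinction between a per capita change and a growth rate, sketched with an invented toy series: diff() gives the absolute change, while diff(log()) approximates the percentage growth rate.

```r
# Invented per capita GDP levels growing at a constant 10% per year
gdp <- c(100, 110, 121, 133.1)

per.capita.change <- diff(gdp)       # absolute changes: 10, 11, 12.1 (not comparable across countries)
growth.rate <- diff(log(gdp))        # log growth rate, constant at log(1.1) here

print(round(growth.rate, 4))
```

A rich and a poor country with identical growth rates would show very different per capita changes, which is why the growth-rate transformation should tighten the ADF results.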
Now we can convert our time series (171 countries after removing those with missing years, and roughly 20 years each) into observations of country-pair correlation: roughly 14,000 observations, since 171 countries form 14,535 unique pairs. Some statistical violence is performed in expanding roughly 3,000 country-year observations into 14,000 pair observations, but I'm confident we haven't violated the Prime Directive. With only GDP growth as a guide, we have no idea which of these observations are economically important, and in Part One at least no attempt is made to determine statistical significance. Further, many of these correlations are simply accounting results. Even if there is some positive correlation between Albania's and Vietnam's growth, we have no reason to believe such a result arises out of an economic relationship between the two, or even a coherent shared causal factor. But what do we find?
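The pair count follows directly from the combinatorics:

```r
n.countries <- 171
n.pairs <- choose(n.countries, 2)  # unordered country pairs: 171 * 170 / 2
print(n.pairs)
# 14535, i.e. the "roughly 14,000" observations
```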
The distribution is not centered around zero. Unfortunately, due to our ambiguous stationarity tests, we cannot conclusively say whether the positive correlation is due to an underlying trend, to massive co-movement during the recent recession, or to a third and potentially more interesting explanation. At least it explains why most of the lines in the map are purple!
As we explore this issue over the next few weeks look for some more respectable econometrics and more informative graphs. Code to reproduce the above graphs or make your own is below.
#Long list of packages here. WDI is used to pull the data down,
#plyr and reshape are used for melt/cast and ddply
#tseries is used for the ADF test and could be excluded if you
#weren't interested in that. ggplot2 is used to bring in
#rescale() but that could be done by hand if you wanted.
#maps and geosphere are used for the great circle plotting
library(WDI)
library(plyr)
library(reshape)
library(tseries)
library(maps)
library(geosphere)
library(ggplot2)
GDP.input<- WDI(country="all",indicator="NY.GDP.PCAP.CD",start=1991,end=2009,extra=TRUE)
GDP.input<- subset(GDP.input, GDP.input$Region != "Aggregates")
names(GDP.input)[4]<- c("GDP.PCAP")
GDP.min<- GDP.input[,2:4]
names(GDP.min)[3]<- c("GDP.PCAP")
GDP.cast<- recast(GDP.min, id.var=c("country","year"), measure.var="GDP.PCAP", formula= year ~ country)
GDP.pre.ts<- as.data.frame(do.call("cbind",GDP.cast))
GDP.pre.ts<- GDP.pre.ts[,-1]
rm(GDP.cast)
rm(GDP.min)
#thanks to http://stackoverflow.com/questions/4862178/r-remove-rows-with-nas-in-data-frame
col.has.na<- apply(GDP.pre.ts, 2, function(x){any(is.na(x))})
GDP.pre.ts<- GDP.pre.ts[,!col.has.na]
GDP.ts<- ts(GDP.pre.ts, start=1991, end=2009)
rm(GDP.pre.ts)
#Run the ADF test on each column of a multivariate ts, returning the test statistics
block.adf<- function(data) {
  adf.res<- matrix(0, nrow=ncol(data), ncol=1)
  options(warn=-1)
  for (i in 1:ncol(data)) {
    adf.res[i,]<- adf.test(data[,i], alternative="explosive")[[1]]
  }
  options(warn=0)
  rownames(adf.res)<- colnames(data)
  colnames(adf.res)<- c("ADF.stat")
  return(adf.res)
}
GDP.stat<- block.adf(GDP.ts)
plot.adf<- function() {
  plot(density(block.adf(log(GDP.ts))), ylim=c(0.0,0.8), xlim=c(-5,2),
    main="Are GDP or GDP growth series stationary?",
    xlab="Value of Augmented Dickey-Fuller test statistic")
  text(-.75, 0.6, labels="Differenced GDP", col=3)
  text(0, 0.4, labels="GDP")
  lines(density(block.adf(diff(log(GDP.ts)))), col=3)
  abline(v=-1.931, col=2, lty=2)
  text(0, 0.75, labels="Rough ADF cutoff for stationarity", col=2)
}
GDP.diff<- diff(GDP.ts)
#I'll admit this is a mess and could be vectorized. I think the double for loop
#takes up more processor time than rendering the final plot, but it is close.
#If I have time I'll switch this to apply() but it isn't straightforward because
#we want to operate over columns and rows.
bad.ccf.fxn<- function() {
  ccf.mat<- matrix(0, ncol(GDP.diff), ncol(GDP.diff))
  rownames(ccf.mat)<- colnames(ccf.mat)<- colnames(GDP.diff)
  for(i in 1:ncol(GDP.diff)) {
    for(j in 1:ncol(GDP.diff)) {
      ccf.mat[i,j]<- mean(ccf(GDP.diff[,i], GDP.diff[,j], plot=FALSE, lag.max=1)$acf)
    }
  }
  return(ccf.mat)
}
df.ccf<- as.data.frame(bad.ccf.fxn())
#This is a hack. The final column data frame needs to behave like a database
#where we could normally store relationships quite easily.
#Melting the transpose of the df stores one row for each country pair.
df2.melt<- melt(t(df.ccf))
#Cleanup of character strings for merging with the maps df.
#The first line drops own-country correlations.
df2.melt<- df2.melt[unclass(df2.melt$X1) != unclass(df2.melt$X2),]
df2.melt$X1<- as.character(df2.melt$X1)
df2.melt$X1<- tolower(df2.melt$X1)
df2.melt$X1<- gsub(" ", "", df2.melt$X1)
df2.melt$X1<- as.factor(df2.melt$X1)
df2.melt$X2<- as.character(df2.melt$X2)
df2.melt$X2<- tolower(df2.melt$X2)
df2.melt$X2<- gsub(" ", "", df2.melt$X2)
df2.melt$X2<- as.factor(df2.melt$X2)
data(world.cities)
world.caps<- subset(world.cities, capital == 1, select=c(country.etc, lat, long))
world.caps$country.etc<- gsub(" ", "", world.caps$country.etc)
world.caps$country.etc<- tolower(world.caps$country.etc)
#I didn't change any country name but the US, mostly because missing a few European
#countries wouldn't be noticed in the plot
world.caps$country.etc[215]<- "unitedstates"
world.caps<- world.caps[order(world.caps[,1]),]
#Dropping missing countries in either data frame.
#Maybe not necessary with an argument to merge()
df2.melt<- df2.melt[df2.melt$X1 %in% world.caps[,1],]
world.caps<- world.caps[world.caps[,1] %in% df2.melt$X1,]
#This is ugly with the iterative renaming.
names(df2.melt)[2]<- "country.etc"
merged.ccf<- merge(df2.melt, world.caps, by.x="country.etc")
names(merged.ccf)<- c("start.c","country.etc","corr","stl","slg")
merged.ccf<- merge(merged.ccf, world.caps, by.x="country.etc")
names(merged.ccf)<- c("start.c","country.etc","corr","stl","slg","elat","elon")
merged.ccf<- merged.ccf[,c(1:3,5,4,7,6)]
names(merged.ccf)[4:7]<- c("start.lon","start.lat","end.lon","end.lat")
#Rescale the correlation to [0,1] in order to serve as an input to rgb()
merged.ccf$corr.res<- rescale(merged.ccf$corr)
#start.cord and end.cord are the capital cities for country pairs.
#The if statement draws lines differently for pairs where the shortest
#path crosses the dateline. The breakAtDateLine argument
#outputs a list and we use only the first half.
#Sort of cargo cult programming, but it works.
plotm<- function() {
  map(fill=TRUE, col=rgb(0.3,0.3,0.3,0.5))
  for (i in 1:nrow(merged.ccf)) {
    start.cord<- as.matrix(cbind(merged.ccf[i,4], merged.ccf[i,5]))
    end.cord<- as.matrix(cbind(merged.ccf[i,6], merged.ccf[i,7]))
    gci<- gcIntermediate(p1=start.cord, p2=end.cord, addStartEnd=TRUE, breakAtDateLine=TRUE)
    if (is.list(gci) == TRUE)
      lines(gci[[1]], col=rgb(1 - merged.ccf[i,8], 0, merged.ccf[i,8], 0.01))
    else lines(gci, col=rgb(1 - merged.ccf[i,8], 0, merged.ccf[i,8], 0.01))
  }
}
#This doesn't need to be a function but is easy enough to call. Just don't name it
#plot.ecdf() as it will conflict with the convenience function for ecdf()
plot.corr.cdf<- function() {
  plot(ecdf(merged.ccf[,3]), main="Cumulative Distribution of Cross-Country GDP growth correlation")
}