Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Bootstrap your way into robust inference. Wow, that was fun to write..
Introduction
Say you made a simple regression, now you have your \( \widehat{\beta} \). You wish to know if it is significantly different from (say) zero. In general, people look at the statistic or p.value reported by their software of choice, (heRe). Thing is, this p.value calculation relies on the distribution of your dependent variable. Your software assumes normal distribution if not told differently, how so? for example, the (95%) confidence interval is \( \widehat{\beta} \pm 1.96 \times sd( \widehat{\beta}) \), the 1.96 comes from the normal distribution.
It is advisable not to do that, the beauty in bootstrapping* is that it is distribution untroubled, it’s valid for dependent which is Gaussian, Cauchy, or whatever. You can defend yourself against misspecification, and\or use the tool for inference when the underlying distribution is unknown.
I was not here 40 years ago but they say calculations were slow back then, not any more, take advantage of your Dell or your “meant for you” MacBook. We will see that it is better to add the non in the nonparametric, specifically in financial context, but that holds in general. You can still keep your distribution assumption but at least have a look at what happens when you relax it. The way to do that is by using bootstrap, the idea is intuitive and simple.
The idea
John Fox wrote: “The population is to the sample as the sample is to the bootstrap samples”.. Tadaaam.. but what does this mean? your estimate of \( \beta \) which came from the sample, is suppose to estimate the “real” \( \beta \), the population \( \beta \), which is unknown. Now take a sample from the sample, we call that sample a bootstrap sample, estimate your \( \beta \) according to this (bootstrap)sample, now this new estimate is an estimate for your original \( \widehat{\beta} \), the one coming from the original data. For clarity, say you have 3 observations, first is {x = 0.7,y = 0.6}, second is {whatever}, third is {whatever}, now, an example of sample from the sample would be the shuffled ordering: first is {whatever}, second is {x = 0.7,y = 0.6}, third is {whatever}. This shuffling, or permuted sample is what we call the bootstrap sample. Now we estimate the same statistic (in our case \( \beta \)) again.
Repeat this sample and estimate many times, you have many bootstrapped estimates, now you can check how do these guys behave. One thing you can do with it is to bootstrap confidence intervals (CI) for your estimate, without the need for underlying distributional assumption.
In R
In R, “boot” package will do the trick:
library(boot) #load the package # Now we need the function we would like to estimate # In our case the beta: betfun = function(data,b,formula){ # b is the random indexes for the bootstrap sample d = data[b,] return(lm(d[,1]~d[,2], data = d)$coef[2]) # thats for the beta coefficient } # now you can bootstrap: bootbet = boot(data="your data here", statistic=betfun, R=5000) # R is how many bootstrap samples names(bootbet) plot(bootbet) hist(bootbet$t, breaks = 100) |
You can zoom in on which indices were picked at each bootstrap take, what was the exact permutation, you do that using the function boot.array:
zoombot = boot.array(bootbet, indices = T) dim(zoombot) hist(zoombot[1,], breaks = 100) # this is the frequency of each index, [1,] for the first #bootstrap run |
More transparent
You can better understand it if you can code it yourself, for a simple problem like bootstrapping confidence intervals it is even simpler and faster:
nb = 5000 ; bet = NULL ; n = NROW("your data here") ptm <- proc.time() # have a look at the time it takes for (i in 1:nb){ unifnum = sample(c(1:n),n,replace = T) # pick random indices bet[i] = lm(data[unifnum,1]~data[unifnum,2])$coef[2] } proc.time() - ptm # about 80 seconds on my end |
How bad are your current confidence interval?
Does it really matter? maybe it’s not worth the trouble. As an illustration I use stock returns which are known to have fat tails, meaning more observations far from the center. Take a look at the following figure:
That is the Morgan Stanley \(\widehat{\beta}\) with the market. The estimate is centered at 1.87. The green vertical lines are the (95%) confidence interval reported by the the “lm” function, the red vertical lines are the equivalent nonparametric confidence intervals, the light blue curve is the normal density.
Notice how different is the interval from the nonparametric bootstrap, which is more accurate in this case. For example, it may be that the parameter is actually 2, you can see the software output rejects this possibility since it assumes normality, yet the bootstrap confidence interval indeed covers the value 2. So, an investor who thinks that “in my portfolio all beta’s are smaller than 2 with a 95% CI” is mistaken to hold Morgan Stanley as such. This check takes about 80 seconds, so I would think twice before plugging it in a double “for” loop. However, if you are social scientist or similar, beef up your standard (normal) output with this robust analysis.
Wrap up
Here you can see that when you use the bootstrapped confidence interval, when the normal distribution assumption IS valid, you are not that worse off (though you will need to wait these 80 seconds nontheless
Again, I simulated from a normal distribution using same center and standard deviation as reported by the “lm”, you can see that the intervals are close to each other which concludes this post, using the parametric confidence interval, from the assumed normal distribution is in some sense suboptimal, since you don’t lose much even if it IS normal.
You can see the remaining code and some references below. Thanks for reading.
Comments
* Naturally we only cover the idea here, there is lots of bootstrapping literature, we only tasted nonparametric bootstrap.
Related literature
An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics & Applied Probability)
Resampling Methods: A Practical Guide to Data Analysis
library(quantmod) ; library(xtable) ; library(tseries) library(ggplot2) ; library(fields) ; library(grDevices) tckr = c('MS', 'SPY') end<- format(Sys.Date(),"%Y-%m-%d") # yyyy-mm-dd start<-format(Sys.Date() - 365*10,"%Y-%m-%d") # 10 years of data dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE)) dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE)) ret1 = as.numeric((dat1[,4] - dat1[,1])/dat1[,1] ) ret2 = as.numeric((dat2[,4] - dat2[,1])/dat2[,1]) plot(ret1) length(as.numeric(ret1)) ; length(as.numeric(ret1)) plot(as.numeric(ret1)~as.numeric(ret2)) t1 = as.data.frame(cbind(as.numeric(ret1),as.numeric(ret2))) names(t1) <- tckr lm1 = lm(MS~SPY, data = t1) ; summary(lm1) nb = 5000 ; bet = NULL ; n = NROW(t1) ptm <- proc.time() for (i in 1:nb){ unifnum = sample(c(1:n),n,replace = T) bet[i] = lm(t1[unifnum,1]~t1[unifnum,2])$coef[2] } proc.time() - ptm den = density(bet) par(mfrow = c(1,1)) bethat <- bquote(Histogram~of ~bold(widehat(beta))) h = hist(bet, breaks = 100, freq = NULL, probability = F, ylim = c(0,280), xlab = expression(bold(widehat(beta))), , cex.lab = 1.6, main = bethat ) lwd1 = 2.5 xline(mean(bet), col = 4, lwd = lwd1) ; xline(lm1$coef[2], col = 6, lwd = lwd1) xline(mean(bet)+1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1) xline(mean(bet)-1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1) xline(quantile(bet,.025), col = 2, lwd = lwd1) ; xline(quantile(bet,.975), col = 2, lwd = lwd1) xfit<-seq(min(bet),max(bet),length=length(bet)) yfit<-dnorm(xfit,mean=mean(bet),sd=sd(bet)) yfit <- yfit*diff(h$mids[1:2])*length(bet) lines(xfit, yfit, col="blue", lwd=2) yfit2<-dnorm(xfit,mean=lm1$coef[2],sd=summary(lm1)$coef[2,2]) yfit2 <- yfit2*diff(h$mids[1:2])*length(bet) lines(xfit, yfit2, col=5, lwd=2) ################################################### ### Now I generate actually from the normal distribution ################################################### eps = rnorm(n, mean(ret1), sd(ret1)) fakey = summary(lm1)$coef[1,1]+summary(lm1)$coef[2,1]*ret1 +eps # hist(fakey, offset=0.16, width=0.13, add=TRUE, col = "green", breaks = 200) lm2 = lm(fakey~ret1) fakebet = NULL ptm <- proc.time() for (i in 1:nb){ unifnum = sample(c(1:n),n,replace = T) fakebet[i] = lm(fakey[unifnum]~ret1[unifnum])$coef[2] } proc.time() - ptm fakebethat <- bquote(Histogram~of~fake~bold(widehat(beta))) h2 = hist(fakebet, breaks = 100, freq = NULL, probability = F, ylim = c(0,470), xlab = expression(bold(widehat(beta))), , cex.lab = 1.6, main = fakebethat ) xline(mean(fakebet), col = 4, lwd = lwd1) xline(lm2$coef[2], col = 6, lwd = lwd1) xline(mean(fakebet)+1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1) xline(mean(fakebet)-1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1) xline(quantile(fakebet,.025), col = 2, lwd = lwd1) ; xline(quantile(fakebet,.975), col = 2, lwd = lwd1) xfit<-seq(min(fakebet),max(fakebet),length=length(fakebet)) yfit<-dnorm(xfit,mean=mean(fakebet),sd=sd(fakebet)) yfit <- yfit*diff(h2$mids[1:2])*length(fakebet) lines(xfit, yfit, col="blue", lwd=2) yfit2<-dnorm(xfit,mean=lm2$coef[2],sd=summary(lm2)$coef[2,2]) yfit2 <- yfit2*diff(h2$mids[1:2])*length(fakebet) lines(xfit, yfit2, col=5, lwd=2) |
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.