Basic Linear Regressions for Finance
Linear Regression
In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The relationship is modeled as a linear combination of (possibly non-linear) basis functions of the inputs: each input can be replaced by a function of that input, as long as the model remains linear in the coefficients. This is linear regression:
\[Y = \alpha + \beta_1 f_1(X) + \beta_2 f_2(X) + \dots + \beta_n f_n(X) + \epsilon\]
This is only a subclass of linear regression:
\[Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon\]
This is linear regression as well:
\[Y = \alpha + \beta_1 X_1^2 + \beta_2 \log(X_1) + \dots + \beta_n \sin(X_n) + \epsilon\]
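To make the point concrete, here is a minimal sketch with simulated data and arbitrarily chosen coefficients: lm fits a model whose regressors are non-linear transformations of the input, because the model is still linear in the coefficients.

# simulate data from y = 0.5 + 2*x^2 - 3*log(x) + noise (values chosen arbitrarily)
set.seed(1)
x <- runif(200, min = 1, max = 10)
y <- 0.5 + 2*x^2 - 3*log(x) + rnorm(200)
# the regressors are non-linear in x, but the model is linear in the coefficients
coef(lm(y ~ I(x^2) + log(x)))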
Estimation
In R, the lm function is used to fit linear models. For panel data, the plm function from the plm package can be used (see Introduction to Econometrics with R).
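As a minimal illustration (assuming the plm package is installed; the Grunfeld data set that ships with plm is used purely as an example):

# ordinary linear model vs. panel model with fixed effects
library(plm)
data("Grunfeld", package = "plm")
# pooled OLS, ignoring the panel structure
mod.ols <- lm(inv ~ value + capital, data = Grunfeld)
# fixed-effects ("within") estimator using the firm/year panel structure
mod.fe <- plm(inv ~ value + capital, data = Grunfeld,
              index = c("firm", "year"), model = "within")
coef(mod.ols)
coef(mod.fe)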
Exercise Simulate an exponential growth model \(y(t) = y_0e^{kt}\) and estimate the growth rate \(k\) and the initial population \(y_0\).
# time grid
t <- seq(0, 10, by = 0.01)
# simulate y values for k = 0.33 and initial population y0 = 1000
y <- 1000*exp(0.33*t)
# add random noise
y <- y * rnorm(n = length(y), mean = 1, sd = 0.1)
# plot
plot(y ~ t, main = "Population Growth")
Assume the \(y\) values generated above are given. We know neither the initial population \(y_0\) nor the growth rate \(k\). To estimate these parameters we proceed as follows:
\[z = \ln(y(t)) = \ln(y_0e^{kt}) = \ln(y_0) + k\,t = \alpha + \beta\,t\] where \(\alpha=\ln(y_0)\) and \(\beta=k\).
# transform the output variable
z <- log(y)
# fit the model
mod <- lm(z ~ t)
# extract the coefficients
mod.c <- coefficients(mod)
# extract alpha
alpha <- mod.c[1]
# extract beta
beta <- mod.c[2]
# compute y0
y0 <- exp(alpha)
# compute k
k <- beta
# print estimates
sprintf("y0 = %s; k = %s", y0, k)
## [1] "y0 = 997.365557000044; k = 0.329840311556089"
The estimates seem close to the true values \(y_0=1000\) and \(k = 0.33\), but how can we test whether they are consistent with these values? We need confidence intervals.
# compute confidence intervals for the parameters in the model
mod.i <- confint(mod, level = 0.95)
mod.i
##                 2.5 %    97.5 %
## (Intercept) 6.8927301 6.9175046
## t           0.3276953 0.3319853
The true value of \(k = \beta = 0.33\) is inside the confidence interval obtained above and has been consistently estimated. To check for \(y_0\) we need to transform the confidence interval obtained for \(\alpha\).
# compute the confidence interval for y0
low <- exp(mod.i[1,1])
upp <- exp(mod.i[1,2])
# print
sprintf("Confidence interval for y0: %s - %s", round(low,1), round(upp,1))
## [1] "Confidence interval for y0: 985.1 - 1009.8"
Model Selection
In the previous example we knew the functional form linking the inputs to the output variable. This is often not the case in economics and finance, where the model is not known a priori and has to be deduced from the data.
Exercise Repeat the same exercise as in the previous section, but assume no model is given a priori. Deduce a reasonable model and estimate its parameters.
# visualize the data
plot(y ~ t, main = "First Look at the Data")
The data are not linear in \(t\): they could be an exponential, quadratic, or cubic function of \(t\). We can take the log of \(y\) and see what the transformed data look like.
plot(log(y) ~ t, main = "Log Output")
Much better! This looks linear, but we also want to test for quadratic and cubic effects. Build the full model:
\[ln(y) = \alpha + \beta_1 t + \beta_2 t^2 + \beta_3 t^3+ \epsilon\] and fit it to the data.
# build a data frame of regressors
data <- data.frame(log.y = log(y), t1 = t, t2 = t^2, t3 = t^3)
# fit the model
mod <- lm(log.y ~ t1 + t2 + t3, data = data)
# summary statistics
summary(mod)
##
## Call:
## lm(formula = log.y ~ t1 + t2 + t3, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.32659 -0.06253  0.00485  0.06780  0.28404
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.908e+00  1.260e-02 548.241   <2e-16 ***
## t1           3.272e-01  1.092e-02  29.972   <2e-16 ***
## t2           6.196e-04  2.538e-03   0.244    0.807
## t3          -3.954e-05  1.668e-04  -0.237    0.813
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1 on 997 degrees of freedom
## Multiple R-squared:  0.9891, Adjusted R-squared:  0.9891
## F-statistic: 3.029e+04 on 3 and 997 DF,  p-value: < 2.2e-16
From the output we discover that:
- only the intercept (\(\alpha\)) and t1 (\(\beta_1\)) are statistically different from zero. The probability of observing such estimates if their true values were zero is in fact less than \(10^{-16}\).
- t2 and t3 are not statistically different from zero. The probability of observing such estimates if their true values were zero is in fact quite high: around 80%. We cannot reject the hypothesis that \(\beta_2\) and \(\beta_3\) are zero, so we drop these terms from the model.
- the R-squared is close to 1: the model is able to capture almost all the variability in the data.
Since \(\beta_2\) and \(\beta_3\) are not statistically different from zero, we reduce the full model and estimate it again.
\[ln(y) = \alpha + \beta_1 t+ \epsilon\]
# fit the model
mod <- lm(log.y ~ t1, data = data)
# summary statistics
summary(mod)
##
## Call:
## lm(formula = log.y ~ t1, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.32628 -0.06206  0.00441  0.06815  0.28363
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.905117   0.006312  1093.9   <2e-16 ***
## t1          0.329840   0.001093   301.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09993 on 999 degrees of freedom
## Multiple R-squared:  0.9891, Adjusted R-squared:  0.9891
## F-statistic: 9.106e+04 on 1 and 999 DF,  p-value: < 2.2e-16
To understand the meaning of the estimated coefficients, we proceed as follows:
\[\ln(y) = \alpha + \beta_1 t \;\rightarrow\; y = \exp(\alpha + \beta_1 t) = e^{\alpha}e^{\beta_1 t} = y_0 e^{k t}\] where \(y_0 = e^{\alpha}\) and \(k = \beta_1\):
# extract estimates
mod.c <- coef(mod)
# y0
y0 <- exp(mod.c[1])
# k
k <- mod.c[2]
# print
sprintf("y0 = %s; k = %s", y0, k)
## [1] "y0 = 997.365557000044; k = 0.329840311556089"
R-squared
R-squared is a goodness-of-fit measure for linear regression models. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. R-squared measures the strength of the relationship between your model and the dependent variable on a convenient 0 – 100% scale.
A good predictive model should achieve a high R-squared, but this measure plays no role when assessing the significance of the parameters.
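As a quick check of this definition, the R-squared reported by summary() can be reproduced from the residuals. A minimal sketch, reusing the reduced model mod and the data frame data from the previous section:

# R-squared = 1 - residual sum of squares / total sum of squares
ss.res <- sum(residuals(mod)^2)
ss.tot <- sum((data$log.y - mean(data$log.y))^2)
1 - ss.res / ss.tot
# value reported by summary()
summary(mod)$r.squared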
Exercise Simulate a dataset from the model \(y = 2\sin(x) + 1\) and see how the R-squared changes when increasing the noise in the data. Is the significance of the estimates affected?
# x grid
x <- seq(0, 2*pi, by = 0.01)
# y
y <- 2*sin(x) + 1
# y: low noise
y.low <- y + rnorm(n = length(y), mean = 0, sd = 0.1)
# y: medium noise
y.mid <- y + rnorm(n = length(y), mean = 0, sd = 1)
# y: high noise
y.high <- y + rnorm(n = length(y), mean = 0, sd = 10)
# plot
layout(t(1:3))
plot(y.low ~ x, main = "Low Noise")
plot(y.mid ~ x, main = "Medium Noise")
plot(y.high ~ x, main = "High Noise")
# low noise
summary(lm(y.low ~ sin(x)))
##
## Call:
## lm(formula = y.low ~ sin(x))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.27887 -0.06682  0.00323  0.06331  0.33386
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.004789   0.003924   256.0   <2e-16 ***
## sin(x)      1.995093   0.005553   359.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09842 on 627 degrees of freedom
## Multiple R-squared:  0.9952, Adjusted R-squared:  0.9952
## F-statistic: 1.291e+05 on 1 and 627 DF,  p-value: < 2.2e-16

# medium noise
summary(lm(y.mid ~ sin(x)))
##
## Call:
## lm(formula = y.mid ~ sin(x))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.0789 -0.6506 -0.0113  0.7228  2.8477
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.97263    0.03964   24.53   <2e-16 ***
## sin(x)       2.08865    0.05610   37.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9943 on 627 degrees of freedom
## Multiple R-squared:  0.6886, Adjusted R-squared:  0.6881
## F-statistic:  1386 on 1 and 627 DF,  p-value: < 2.2e-16

# high noise
summary(lm(y.high ~ sin(x)))
##
## Call:
## lm(formula = y.high ~ sin(x))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -28.8317  -6.3656  -0.1938   6.7277  29.8982
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0089     0.3936   2.563   0.0106 *
## sin(x)        2.3579     0.5570   4.233 2.65e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.873 on 627 degrees of freedom
## Multiple R-squared:  0.02779, Adjusted R-squared:  0.02624
## F-statistic: 17.92 on 1 and 627 DF,  p-value: 2.648e-05
The R-squared is almost 100% for y.low, 68% for y.mid and only 3% for y.high. In the first case, we are able to predict y from x with very high accuracy. In the second case the accuracy drops. In the third case we have basically no predictive power, yet we were still able to assess the statistically significant impact of \(sin(x)\) on \(y\). On the other hand, the uncertainty associated with the coefficient estimates increased and the significance levels dropped. For even higher noise levels we would no longer be able to detect a statistically significant impact of the regressor on the response variable, but this problem can be solved by increasing the number of observations when possible (try it as an exercise; a sketch follows below).
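A possible sketch of the suggested exercise: keeping the high noise level (sd = 10) but using roughly 100 times more observations, the regressor typically becomes clearly significant again even though the R-squared stays low. The grid size below is an arbitrary choice.

# same model and noise level as y.high, but with many more observations
x.big <- seq(0, 2*pi, length.out = 60000)
y.big <- 2*sin(x.big) + 1 + rnorm(n = length(x.big), mean = 0, sd = 10)
summary(lm(y.big ~ sin(x.big)))$coefficients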
After running a regression analysis, we should check whether the model represents the data well. We have paid attention to regression results such as slope coefficients, p-values, and R-squared, but that is not the whole picture. Residuals can show how poorly a model represents the data: they are what is left of the outcome variable after fitting the model to the data, and they can reveal patterns that the fitted model leaves unexplained. Using this information, we can not only check whether the linear regression assumptions are met, but also improve the model in an exploratory way. Refer to: Understanding Diagnostic Plots for Linear Regression Analysis.
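In base R, the standard diagnostic plots can be obtained directly from a fitted lm object; for example, for the reduced model mod fitted above:

# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(mod)
par(mfrow = c(1, 1))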
Testing CAPM
\[E[R_i - r_f] = \beta_i E[R_{mkt} - r_f]\] where:
- \(R_{i,t}\): return on asset \(i\) at time \(t\)
- \(r_f\): risk-free return at time \(t\)
- \(R_{m,t}\): return on the market portfolio at time \(t\)
To test the model we use the following data file containing stock data from the website of Kenneth R. French. It includes monthly simple stock returns, in percentage points, for decile portfolios formed on beta over the period 1963-2017. These are total returns (i.e. they include dividends).
# read data
data <- read.csv('https://storage.guidotti.dev/course/asset-pricing-unine-2019-2020/basic-linear-regressions-for-finance.csv')
# drop date
data <- data[,-1]
# print
head(data)
##   Lo.10 Dec.2 Dec.3 Dec.4 Dec.5 Dec.6 Dec.7 Dec.8 Dec.9 Hi.10 Mkt.RF   RF
## 1  1.35  0.77  0.08 -0.24 -0.69 -1.20 -0.49 -1.39 -1.94 -0.77  -0.39 0.27
## 2  3.52  3.89  4.29  5.25  5.23  7.55  7.57  4.91  9.04 10.47   5.07 0.25
## 3 -3.09 -2.24 -0.54 -0.97 -1.37 -0.27 -0.63 -1.00 -1.92 -3.68  -1.57 0.27
## 4  1.25 -0.12  2.00  5.12  2.32  1.78  6.63  4.78  3.10  3.01   2.53 0.29
## 5 -0.91 -0.15  1.60 -2.05 -0.94 -0.69 -1.32 -0.51 -0.20  0.52  -0.85 0.27
## 6  3.86  0.63  2.31  1.83  3.00  2.36  1.25  3.45  0.30  1.28   1.83 0.29
# get the portfolios
portfolios <- data[,-c(11,12)]
# compute excess returns
portfolios <- portfolios - data$RF
# print
head(portfolios)
##   Lo.10 Dec.2 Dec.3 Dec.4 Dec.5 Dec.6 Dec.7 Dec.8 Dec.9 Hi.10
## 1  1.08  0.50 -0.19 -0.51 -0.96 -1.47 -0.76 -1.66 -2.21 -1.04
## 2  3.27  3.64  4.04  5.00  4.98  7.30  7.32  4.66  8.79 10.22
## 3 -3.36 -2.51 -0.81 -1.24 -1.64 -0.54 -0.90 -1.27 -2.19 -3.95
## 4  0.96 -0.41  1.71  4.83  2.03  1.49  6.34  4.49  2.81  2.72
## 5 -1.18 -0.42  1.33 -2.32 -1.21 -0.96 -1.59 -0.78 -0.47  0.25
## 6  3.57  0.34  2.02  1.54  2.71  2.07  0.96  3.16  0.01  0.99
Time-Series Approach
The time-series approach consists of running the following regression:
\[R_{i,t} - r_f = \alpha_i + \beta_i (R_{m,t} - r_f)+ \epsilon_{i,t}\] i.e.
\[Y_{i,t} = \alpha_i + \beta_i X_t + \epsilon_{i,t}\]
where:
- \(R_{i,t}\): return on asset \(i\) at time \(t\)
- \(r_f\): risk-free return at time \(t\)
- \(R_{m,t}\): return on the market portfolio at time \(t\)
- \(Y_{i,t} = R_{i,t} - r_f\): excess return on asset \(i\) at time \(t\)
- \(X_t = R_{m,t} - r_f\): excess return on the market portfolio at time \(t\)
The CAPM implies \(\alpha_i = 0\). In fact, if \(\alpha_i \neq 0\), taking expectations on both sides of the equation violates the CAPM:
\[E[R_{i,t} - r_f] = E[\alpha_i + \beta_i (R_{m,t} - r_f)] = \alpha_i + \beta_i E[R_{m,t} - r_f] \neq \beta_i E[R_{m,t} - r_f]\]
Therefore, the CAPM is rejected if we observe an \(\alpha_i\) statistically different from zero.
# define an empty data frame
capm <- data.frame()
# define a matrix to store residuals
eps <- matrix(NA, nrow = nrow(portfolios), ncol = ncol(portfolios))
# for each portfolio...
for(i in 1:ncol(portfolios)){
  # linear regression
  mod <- lm(portfolios[,i] ~ data$Mkt.RF)
  # summary
  mod.s <- summary(mod)
  # store residuals
  eps[,i] <- residuals(mod)
  # extract coefficients
  alpha <- mod.s$coefficients[1,'Estimate']
  beta <- mod.s$coefficients[2,'Estimate']
  # extract standard errors of the estimates
  sd.alpha <- mod.s$coefficients[1,'Std. Error']
  sd.beta <- mod.s$coefficients[2,'Std. Error']
  # compute the average excess return
  excess <- mean(portfolios[,i])
  # store everything into the capm dataframe
  row <- c(excess, alpha, sd.alpha, beta, sd.beta)
  capm <- rbind(capm, row)
}
# assign colnames
colnames(capm) <- c('<excess>', 'alpha', 'sd.alpha', 'beta', 'sd.beta')
# print
capm
##     <excess>        alpha   sd.alpha      beta    sd.beta
## 1  0.5465291  0.219840960 0.08358282 0.6152566 0.01892420
## 2  0.5221713  0.131098193 0.07708074 0.7365138 0.01745205
## 3  0.5882875  0.145659989 0.06624880 0.8336070 0.01499956
## 4  0.6657951  0.149145083 0.06207395 0.9730148 0.01405432
## 5  0.5541590  0.013151107 0.05977863 1.0188884 0.01353463
## 6  0.6346483  0.059530018 0.06389041 1.0831290 0.01446559
## 7  0.5194801 -0.095702502 0.07022164 1.1585827 0.01589906
## 8  0.6728287 -0.005224589 0.08177222 1.2769881 0.01851426
## 9  0.6400306 -0.098993437 0.10113883 1.3918151 0.02289910
## 10 0.6306269 -0.224398053 0.13376284 1.6102814 0.03028559
We estimated \(\alpha_i\) and its standard error for all ten portfolios. Each \(\alpha_i\) is (approximately) normally distributed with standard deviation \(\sigma_{\alpha_i}\). Therefore, to test whether all the \(\alpha_i\) are jointly equal to zero, we can define the following random variable
\[\chi^2_N=\sum_{i=1}^N \Bigl(\frac{\alpha_i-0}{\sigma_{\alpha_i}}\Bigl)^2\]
which is the sum of \(N\) (approximately) independent standard normal variables, i.e. it has a (approximate) chi-squared distribution with \(N\) degrees of freedom.
# chi squared random variable
chi.sq <- sum((capm$alpha/capm$sd.alpha)^2)
chi.sq
## [1] 26.96824
What is the probability of observing a value equal to or greater than 26.9682402 under a chi-squared distribution with ten degrees of freedom?
pchisq(q = chi.sq, df = nrow(capm), lower.tail = FALSE)
## [1] 0.002634639
The CAPM would be rejected at a confidence level of 99%. The problem is that \(cov(\alpha_i,\alpha_j)\) will not be zero. Thus, it is common to use \(\boldsymbol \alpha^\intercal cov(\boldsymbol \alpha)^{-1} \boldsymbol \alpha\). Now we follow this approach to take correlation into account and compute the following statistic (GRS test), which follows an F distribution assuming normally distributed error terms:
\[f_{GRS} \sim F(n,\tau - n - k)=\frac{\tau-n-k}{n}\frac{\hat{\alpha}^\intercal\hat\Omega^{-1}\hat\alpha}{1+\hat\mu_f^\intercal\hat\Sigma^{-1}_f\hat\mu_f}\] where:
- \(\tau\): number of time periods
- \(n\): number of assets
- \(k\): number of factors (in our case 1)
- \(\alpha\): vector of estimated \(\alpha_i\)
- \(\Omega\): covariance matrix of residuals
- \(\mu\): vector giving the sample means of the factor(s)
- \(\Sigma\): covariance matrix of factors (in our case it reduces to the variance of the market excess return)
# number of time periods
t <- nrow(portfolios)
# number of assets
n <- ncol(portfolios)
# number of factors (in our case 1)
k <- 1
# vector of estimated alpha_i
alpha <- capm$alpha
# covariance matrix of residuals
omega <- cov(eps)
# vector giving the sample means of the factor
mu <- mean(data$Mkt.RF)
# covariance matrix of factors
sigma <- var(data$Mkt.RF)
# F-statistic (GRS test)
f <- (t-n-k)/n * (alpha %*% solve(omega) %*% alpha)/(1 + mu %*% solve(sigma) %*% mu)
# p-value
pf(q = f, df1 = n, df2 = t-n-1, lower.tail = FALSE)
##            [,1]
## [1,] 0.03102946
The CAPM is still rejected at a confidence level of 95%, even when taking into account the correlations between the \(\alpha_i\).
Finally, dropping the assumption of normally distributed error terms while still taking correlation into account, there exists a test statistic that asymptotically follows a \(\chi^2\) distribution:
\[J \sim \chi^2(n)=\tau\frac{\hat{\alpha}^\intercal\hat\Omega^{-1}\hat\alpha}{1+\hat\mu_f^\intercal\hat\Sigma^{-1}_f\hat\mu_f}\]
# chi squared statistic
x <- t * (alpha %*% solve(omega) %*% alpha)/(1 + mu %*% solve(sigma) %*% mu)
# p-value
pchisq(q = x, df = n, lower.tail = FALSE)
##            [,1]
## [1,] 0.02617805
The CAPM is still rejected at a confidence level of 95%, even when taking into account the non-normality of error terms together with correlation of \(\alpha_i\).
We now consider a different approach to test the CAPM. Note: what is done below is essentially the same as using dummy variables (a compact dummy-variable formulation in R is sketched after the derivation below). Consider the model:
\[R_{i,t} - r_f = \alpha + \sum_{j=1}^N\beta_j \delta_{i,j}(R_{m,t} - r_f)+ \epsilon_{i,t}\]
where \(\delta_{i,j}\) is the Kronecker delta, i.e
\[\delta_{i,j} = \begin{cases} 1, & \text{if } i=j,\\ 0, & \text{if } i\neq j. \end{cases}\]
The model correctly reduces to the standard CAPM for each asset \(i\). For example, consider the first asset \(i=1\):
\[R_{1,t} - r_f = \alpha + \sum_{j=1}^N\beta_j \delta_{1,j}(R_{m,t} - r_f)+ \epsilon_{1,t}\] Now, \(\delta_{1,j}\) equals 1 only for \(j=1\) and vanishes for all other terms. The only term which contributes to the summation is therefore \(j=1\) and we have the standard CAPM for the first asset, which predicts \(\alpha=0\):
\[R_{1,t} - r_f = \alpha + \beta_1 (R_{m,t} - r_f)+ \epsilon_{1,t}\]
Repeating the procedure for all assets, we obtain the standard CAPM for each of them, where \(\alpha\) is now a common parameter, equal to 0 according to the CAPM. By testing \(\alpha=0\) we therefore test whether the CAPM holds.
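Since the model above is just a dummy-variable regression, it could be written compactly in R with a portfolio factor interacted with the market excess return. The following is a sketch, assuming the portfolios and data objects created earlier; in the next chunk we instead build the design matrix explicitly, which makes the structure of the regressors more transparent.

# stack the portfolios in long format with a factor identifying the portfolio
long <- data.frame(
  excess    = unlist(portfolios),
  mkt       = rep(data$Mkt.RF, times = ncol(portfolios)),
  portfolio = factor(rep(colnames(portfolios), each = nrow(portfolios)))
)
# one common intercept (alpha) and one slope (beta_i) per portfolio
summary(lm(excess ~ mkt:portfolio, data = long))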
# number of assets
n.p <- ncol(portfolios)
# number of observations for each asset
n.t <- nrow(portfolios)
# matrix of excess returns and the n.p regressors (delta_{i,j} * (R_{m,t} - r_f))
M <- matrix(0, nrow = n.t*n.p, ncol = n.p+1)
colnames(M) <- c('excess', colnames(portfolios))
# fill the first column with the excess returns
M[,1] <- unlist(portfolios)
# fill each column with (R_{m,t} - r_f) only if i==j
for(i in 1:n.p){
  M[1:n.t + (i-1)*n.t, i+1] <- data$Mkt.RF
}
# linear regression
mod <- lm(excess ~ ., data = as.data.frame(M))
# summary
summary(mod)
##
## Call:
## lm(formula = excess ~ ., data = as.data.frame(M))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.8914  -1.0903  -0.0251   1.0637  13.0585
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.02941    0.02622   1.122    0.262
## Lo.10        0.62044    0.01865  33.270   <2e-16 ***
## Dec.2        0.73928    0.01865  39.643   <2e-16 ***
## Dec.3        0.83677    0.01865  44.870   <2e-16 ***
## Dec.4        0.97627    0.01865  52.351   <2e-16 ***
## Dec.5        1.01845    0.01865  54.612   <2e-16 ***
## Dec.6        1.08395    0.01865  58.125   <2e-16 ***
## Dec.7        1.15518    0.01865  61.944   <2e-16 ***
## Dec.8        1.27605    0.01865  68.426   <2e-16 ***
## Dec.9        1.38832    0.01865  74.446   <2e-16 ***
## Hi.10        1.60337    0.01865  85.978   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.105 on 6529 degrees of freedom
## Multiple R-squared:  0.8421, Adjusted R-squared:  0.8419
## F-statistic:  3482 on 10 and 6529 DF,  p-value: < 2.2e-16
Note that all the \(\beta_i\) are the same as those estimated independently, while the intercept is not statistically significant, i.e. \(\alpha\) is not statistically different from zero. The CAPM cannot be rejected. Note that when performing this kind of test the reverse does not hold: we cannot say that, based on this test, the CAPM holds. In fact, we could have observed an \(\alpha\) not statistically different from zero because either:
- the true value of \(\alpha\) is zero
- we don't have enough data and the uncertainty of the estimates is too high to detect the difference between the true \(\alpha\) and zero. In other words, we didn't have enough statistical power to tell the difference between zero and something close to zero. Increasing the size of the dataset would allow us to estimate a significant \(\alpha \neq 0\)
What do we learn from this? First, not rejecting a hypothesis does not mean accepting it; otherwise the last approach would contradict the previous ones. Second, for the same purpose there can be many different approaches, more or less suited to it, and several tests with different statistical power, i.e. with different ability to distinguish between the true value and something close to it.
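As a toy illustration of statistical power (simulated data, with arbitrarily chosen values): a small but non-zero intercept typically goes undetected in a small sample, but becomes clearly significant once the sample grows.

# small intercept, large slope, unit-variance noise (all values arbitrary)
set.seed(42)
alpha.true <- 0.05
x.sim <- rnorm(10000)
y.sim <- alpha.true + 1.2 * x.sim + rnorm(10000)
# p-value of the intercept with 100 observations vs. 10000 observations
summary(lm(y.sim[1:100] ~ x.sim[1:100]))$coefficients[1, "Pr(>|t|)"]
summary(lm(y.sim ~ x.sim))$coefficients[1, "Pr(>|t|)"]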
Cross-Sectional Approach
The cross-sectional approach consists of running the following regression:
\[E[R_i - r_f] = \beta_i E[R_{mkt} - r_f]\]
i.e.
\[Y_i = \lambda X_i + \theta + \epsilon_{i}\]
where:
- \(Y_i=E[R_i-r_f]\): average excess return on asset \(i\)
- \(X_i=\beta_i\): the beta estimated for asset \(i\) in the time-series approach
The CAPM implies \(\lambda=E[R_{mkt}-r_f]\) and \(\theta=0\). In fact, if \(\lambda \neq E[R_{mkt}-r_f]\) and/or \(\theta \neq 0\) then:
\[E[R_i-r_f] = Y_i = \lambda X_i + \theta = \lambda \beta_i + \theta \neq \beta_i E[R_{mkt} - r_f]\]
Therefore, the CAPM is rejected if we observe \(\lambda\) statistically different from \(E[R_{mkt}-r_f]\) and/or \(\theta \neq 0\).
# linear regression
sml <- lm(capm$`<excess>` ~ capm$beta)
# print
summary(sml)
##
## Call:
## lm(formula = capm$`<excess>` ~ capm$beta)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.087237 -0.034292  0.002738  0.030721  0.078437
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.48585    0.06353   7.647 6.03e-05 ***
## capm$beta    0.10432    0.05734   1.819    0.106
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05231 on 8 degrees of freedom
## Multiple R-squared:  0.2927, Adjusted R-squared:  0.2043
## F-statistic:  3.31 on 1 and 8 DF,  p-value: 0.1063
The estimated \(\theta\) (the intercept) is statistically different from zero, so the CAPM is rejected. Regarding \(\lambda\), we estimated a value of 0.1043244. Is it statistically different from \(E[R_{mkt} - r_f]\)?
# mean excess return on the market portfolio
mean(data$Mkt.RF)
## [1] 0.5309786
# confidence intervals at 95%
confint(sml, level = 0.95)
##                   2.5 %    97.5 %
## (Intercept)  0.33934005 0.6323572
## capm$beta   -0.02790093 0.2365497
The mean excess return does not fall inside the confidence interval: \(\lambda\) is statistically different from \(E[R_{mkt} - r_f]\). The CAPM is rejected.
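Instead of eyeballing the confidence interval, the same conclusion can be reached with an explicit t-test of the null hypothesis \(\lambda = E[R_{mkt} - r_f]\). A sketch, reusing the sml regression and the data above:

# t-statistic for H0: lambda = mean market excess return
lambda.hat <- coef(sml)[2]
lambda.se  <- summary(sml)$coefficients[2, "Std. Error"]
t.stat <- (lambda.hat - mean(data$Mkt.RF)) / lambda.se
# two-sided p-value with the residual degrees of freedom of the regression
2 * pt(abs(t.stat), df = df.residual(sml), lower.tail = FALSE)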
To conclude, we represent the results graphically.
# grid of beta
betas <- seq(0, 2, by = 0.01)
# excess returns by CAPM
E.R <- betas * mean(data$Mkt.RF)
# plot
plot(E.R ~ betas, type = 'l', lwd = 2, col = 'orange',
     main = "SML vs Beta Regression", xlab = 'Beta', ylab = 'Mean Excess Return')
# add points estimated in the time-series approach
points(x = capm$beta, y = capm$`<excess>`, pch = 16, cex = 1)
text(labels = 1:10, x = capm$beta, y = capm$`<excess>`, cex = 1, pos = 3)
# add regression line
abline(sml, lty = 'dashed')