[This article was first published on We think therefore we R, and kindly contributed to R-bloggers.]
While working on my Financial Economics project I came across an elegant tool called Principal Component Analysis (PCA), which is extremely powerful when it comes to reducing the dimensionality of a data set comprising highly correlated variables. The tool finds wide application in genetic research, which deals with data sets containing many highly correlated variables.
I will try to be explicit and refrain from using statistical/mathematical jargon in explaining the what and how of this tool. To state a few stylized facts, PCA is used mainly for:
- compressing the data
- filtering out some of the noise in the data
Problem at hand:
I was trying to investigate the factors that affect the returns of stocks in the Indian equity market, and I wanted to take into account all of the S&P CNX 500 companies. It would be really nice if I could somehow squeeze the 500 companies into, say, no more than 2-3 variables that are representative of the entire set of 500 companies. This is precisely where PCA comes into play and does a fantastic job: it gives me just 1 variable that I can use instead of all the 500 companies!
Hats off and a bow of respect to the contributors of packages on the CRAN servers: the above simplification can be achieved with just one line of script in R. Sounds easy, but what one really needs is to understand what PCA does and how the output of that script can be interpreted. Again, at the risk of oversimplification (while trying hard to keep my commandment of simplicity), I will illustrate in a crude manner the working of PCA.
What PCA does:
Let me explain this in relation to the above example. If I do a PCA on the returns data for the 500 companies, I obtain 500 principal components. These components are nothing but linear combinations of the existing 500 variables (companies), arranged in decreasing order of their variance: the 1st principal component (PC) has the maximum variance and the 500th PC has the least. The variance of a PC represents nothing but variance in the data, so the 1st PC explains the maximum amount of variance in my data. One magical feature of PCA is that all these 500 components are orthogonal to each other, meaning the components are uncorrelated with each other. So, viewing PCA as a black box, it takes as input a data set of highly correlated variables and gives as output PCs that explain the variance in the input data while being uncorrelated with each other. (I don't leverage this feature in this particular problem; I will illustrate that use in another part of this blog.)
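To make the orthogonality claim concrete, here is a minimal sketch on made-up toy data (not the CNX 500 returns): the component scores come out uncorrelated, and their variances decrease from the 1st PC onwards.
## Toy illustration: PCs are uncorrelated and ordered by variance ##
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.3)   ## deliberately correlated with x1
x3 <- -0.5 * x1 + rnorm(100, sd = 0.5)
toy <- cbind(x1, x2, x3)
pc <- princomp(toy)
round(cor(pc$scores), 3)   ## off-diagonal entries are ~0: the components are uncorrelated
pc$sdev                    ## standard deviations in decreasing order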
How PCA does it:
Since I have taken a vow of simplicity, I don't have much to say here. :-) However, for the mathematically inclined and certainty freaks like Madhav, this paper does a brilliant job of illustrating the matrix algebra that goes on behind PCA computations. There are essentially 2 ways of computing a PCA: eigenvalue decomposition (done using the princomp() command in R) and singular value decomposition (done using the prcomp() command in R).
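For the curious, here is a small sketch (again on made-up toy data) showing that the two routines recover the same directions: princomp() works off the eigen decomposition of the covariance matrix while prcomp() uses the SVD of the centred data, so the loading vectors agree up to sign (and a divisor-of-n difference in the variance estimates).
## Toy comparison of the two routines ##
set.seed(2)
toy <- matrix(rnorm(300), ncol = 3)
toy[, 2] <- 0.8 * toy[, 1] + 0.3 * toy[, 2]   ## make the columns correlated
eig <- princomp(toy)   ## eigenvalue decomposition of the covariance matrix
sv  <- prcomp(toy)     ## singular value decomposition of the centred data
round(abs(eig$loadings[, 1]), 3)
round(abs(sv$rotation[, 1]), 3)    ## same first loading vector, up to sign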
How this can be done in R:
####### Calculating Principal component of returns of S&P CNX 500 companies ########
## Access the relevant file ##
returns <- read.csv("Returns_CNX_500.csv")
One caveat to keep in mind is that there should be no NA values in your data set. The presence of an NA would impede the computation of the variance-covariance matrix and hence of its eigenvectors (i.e., the factor loadings).
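A quick way to see how much interpolation will be needed (a sketch, assuming the file has been read into returns as above, with the Year/date in column 1):
## Count missing values per company column before interpolating ##
na.count <- colSums(is.na(returns[, -1]))   ## NAs per company column
sum(na.count)                               ## total missing values in the data
head(sort(na.count, decreasing = TRUE))     ## companies with the most gaps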
## Dealing with missing values in the returns data for companies ##
returns1 <- returns   ## copy of the data that will hold the interpolated series
for(i in 2:ncol(returns))
{
returns1[, i] <- approx(returns$Year, returns[, i], returns$Year)$y   ## approx() fits a linear approximation between the observed points around each missing value; the $y component holds the interpolated series
}
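If the zoo package is available, the same linear interpolation can be done without an explicit loop. This is an alternative sketch, not part of the original script, and assumes zoo is installed.
## Alternative using zoo::na.approx() ##
library(zoo)
returns1 <- returns
returns1[, -1] <- na.approx(as.matrix(returns[, -1]), x = returns$Year, na.rm = FALSE)   ## interpolate NAs column by column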
## Convert the data into matrix ##
ret <- as.matrix(returns1)   ## princomp() expects a numeric matrix
##Computing the principal component using eigenvalue decomposition ##
princ.return <- princomp(ret) ## This is it!
## Identifying what components to be used ##
barplot(height = princ.return$sdev[1:10] / princ.return$sdev[1]) ## plot the standard deviation of each of the first 10 PCs relative to that of PC 1; this gives a benchmark for deciding which components are worth keeping
[Figure: Standard deviation of the first 10 components relative to the 1st PC]
We can clearly see from the above figure that, as expected, the 1st PC explains the majority of the variance in the returns data for the 500 companies. So, if we want to identify factors that influence the returns of the S&P CNX 500 companies, I can use the 1st PC as a variable in my regression. So far we have calculated the principal components; now we will extract the 1st PC as a numeric variable from the fitted object (princ.return).
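A couple of other quick ways to make the same judgement from the fitted object (both are base R functions):
## Variance explained and scree plot ##
summary(princ.return)                               ## proportion of variance explained by each PC
screeplot(princ.return, npcs = 10, type = "lines")  ## scree plot of the first 10 components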
## To get the first principal component in a variable ##
load <- loadings(princ.return)[,1] ## loadings() returns the weights with which the input variables are combined to form each component; the first column gives the loadings for the 1st PC.
pr.cp <- ret %*% load ## Matrix multiplication of the input data with the loadings for the 1st PC gives us the 1st PC in matrix form.
pr <- as.numeric(pr.cp) ## Gives us the 1st PC as a numeric vector in pr.
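As a cross-check, princomp() already stores the component scores, so the 1st PC can also be read straight off the fitted object; because princomp() centres the data, this version differs from ret %*% load only by an additive constant.
## Equivalent extraction from the stored scores ##
pr.alt <- princ.return$scores[, 1]   ## 1st PC computed on the centred data
head(cbind(pr, pr.alt))              ## identical series up to a constant shift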
One question that might be raised is: why not just use the S&P CNX 500 index returns as an input to the regression? The simple answer is that PC 1 gives you a relatively clean signal of the returns, whereas the index carries a lot of noise. The objection would have made sense in the 1900s, when technology was not so efficient in terms of computation; since computational time and effort now carry minimal weight in any researcher's mind, there is no reason to settle for anything but the best.
There is an important caveat to keep in mind while doing analysis using PCA: though PCA has a clear mathematical intuition, it lacks an economic one. A one-unit change in PC 1 of returns has a mathematical meaning but no economic meaning; you cannot make sense of a statement that PC 1 of returns for the 500 companies has gone up by "x" amount. Therefore the use of this analysis should be limited to factor analysis and not be extended to predictive analysis.
In case you wish to replicate the above exercise the data can be obtained from here.