Often, it is not helpful or informative to only look at all the variables in a dataset for correlations or covariances. A preferable approach is to derive new variables from the original variables that preserve most of the information given by their variances. Principal component analysis is a widely used and popular statistical method for reducing data with many dimensions (variables) by projecting the data onto fewer dimensions using linear combinations of the variables, known as principal components. The new projected variables (principal components) are uncorrelated with each other and are ordered so that the first few components retain most of the variation present in the original variables. Thus, PCA is also useful in situations where the independent variables are correlated with each other and can be employed in exploratory data analysis or for building predictive models. Principal component analysis can also reveal important features of the data such as outliers and departures from multivariate normality.
Defining Principal Components
The first step in defining the principal components of \(p\) original variables is to find a linear function \(a_1'y\), where \(a_1\) is a vector of \(p\) constants, of the observation vector \(y\) that has maximum variance. This linear function is defined as:
\( a_1'y = a_{11}y_1 + a_{12}y_2 + \cdots + a_{1p}y_p = \sum_{j=1}^p a_{1j}y_j \)
Principal component analysis continues by finding a linear function \(a_2'y\) that is uncorrelated with \(a_1'y\) and has maximum variance, and so on, up to \(k\) principal components.
Derivation of Principal Components
The principal components of a dataset are obtained from either the sample covariance matrix \(S\) or the correlation matrix \(R\). Although using \(S\) is the original method of principal component analysis, components obtained from \(R\) may be more interpretable when the original variables have different units or widely differing variances (Rencher 2002, p. 393). For the derivation that follows, the covariance matrix will be written as \(\Sigma\), denoting a known population covariance matrix.
The goal of the derivation is to find the vectors \(a_k\) that maximize the variance \(Var(a_k'y) = a_k' \Sigma a_k\). We begin with the first vector \(a_1\), which maximizes \(Var(a_1'y) = a_1' \Sigma a_1\). This maximization requires a constraint to rein in otherwise arbitrarily large values of \(a_1\). The constraint used here is that \(a_1\) has unit length, \(a_1' a_1 = 1\). The constraint is imposed with a Lagrange multiplier \(\lambda\) so that the function is maximized subject to an equality constraint \(g(a_1) = 0\). Thus the Lagrangian function is defined as:
\( a_1' \Sigma a_1 - \lambda(a_1' a_1 - 1) \)
Brief Aside: Lagrange Multipliers
The Lagrange multiplier method is used for finding a maximum or minimum of a multivariate function subject to a constraint on the input values. As we are interested in maximization, the problem can be briefly stated as 'maximize \(f(x)\) subject to \(g(x) = c\)'. In this example, \(g(a_1) = a_1' a_1 = 1\) and \(f(a_1) = a_1' \Sigma a_1\). The Lagrange multiplier, denoted \(\lambda\), allows \(f\) and \(g\) to be combined into a new function \(L(x, \lambda)\), defined as:
\( L(a_1, \lambda) = f(a_1) - \lambda(g(a_1) - c) \)
The sign of \(\lambda\) can be positive or negative. The new function is then solved for a stationary point by setting its partial derivatives to zero:
\( \frac{\partial L(a_1, \lambda)}{\partial a_1} = 0 \)
Returning to principal component analysis, we differentiate \(L(a_1) = a_1' \Sigma a_1 - \lambda(a_1' a_1 - 1)\) with respect to \(a_1\):
\( \frac{\partial L}{\partial a_1} = 2\Sigma a_1 - 2\lambda a_1 = 0 \)
\( \Sigma a_1 - \lambda a_1 = 0 \)
Expressing the above with an identity matrix, \(I\):
\( (\Sigma - \lambda I) a_1 = 0 \)
This shows that \(\lambda\) is an eigenvalue of the covariance matrix \(\Sigma\) and \(a_1\) is the corresponding eigenvector. As stated previously, we are interested in finding \(a_1'y\) with maximum variance, so \(\lambda\) must be as large as possible; it follows that \(a_1\) is the eigenvector corresponding to the largest eigenvalue of \(\Sigma\).
The remaining principal components are found in the same manner: the \(k\)th principal component \(a_k'y\) corresponds to the eigenvector of the \(k\)th largest eigenvalue of \(\Sigma\). Thus the second principal component is \(a_2'y\), where \(a_2\) is the eigenvector of the second largest eigenvalue of \(\Sigma\), and so on.
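The result is easy to check numerically. The sketch below is a minimal example on simulated data (the objects X, S.sim, and e are illustrative and not part of the pilot example that follows); it confirms that projecting onto the leading eigenvector of the sample covariance matrix yields the largest eigenvalue as its variance, and that an arbitrary unit-length vector does no better.

# Minimal numerical check of the derivation on simulated data.
set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3) %*% matrix(c(2, 1, 0, 0, 1, 0, 0, 0, 0.5), nrow = 3)
S.sim <- cov(X)
e <- eigen(S.sim)
a1 <- e$vectors[, 1]

# The variance of the projection onto a1 equals the largest eigenvalue.
drop(var(X %*% a1))
e$values[1]

# A random unit-length vector never achieves a larger variance.
b <- rnorm(3)
b <- b / sqrt(sum(b^2))
drop(var(X %*% b)) <= e$values[1]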
Principal Component Analysis
Twenty engineer apprentices and twenty pilots were given six tests. The data were obtained from the companion FTP site of the book Methods of Multivariate Analysis by Alvin Rencher. The tests measured the following attributes:
- Intelligence
- Form Relations
- Dynamometer
- Dotting
- Sensory Motor Coordination
- Perservation
Principal component analysis will be performed on the data to transform the attributes into new variables that will hopefully be more open to interpretation and allow us to find any irregularities in the data such as outliers.
Load the data and name the columns. The factors in the Group column are renamed to their actual grouping names.
pilots <- read.table('PILOTS.DAT', col.names = c('Group', 'Intelligence', 'Form Relations', 'Dynamometer',
                                                 'Dotting', 'Sensory Motor Coordination', 'Perservation'))
pilots$Group <- ifelse(pilots$Group == 1, 'Apprentice', 'Pilot')
Inspect the first few rows of the data.
head(pilots)
##        Group Intelligence Form.Relations Dynamometer Dotting
## 1 Apprentice          121             22          74     223
## 2 Apprentice          108             30          80     175
## 3 Apprentice          122             49          87     266
## 4 Apprentice           77             37          66     178
## 5 Apprentice          140             35          71     175
## 6 Apprentice          108             37          57     241
##   Sensory.Motor.Coordination Perservation
## 1                         54          254
## 2                         40          300
## 3                         41          223
## 4                         80          209
## 5                         38          261
## 6                         59          245
The variables appear to be measured in different units which may lead to the variables with larger variances dominating the principal components of the covariance matrix \(S\). We will perform principal component analysis on the correlation matrix \(R\) later in the example to find a scaled and more balanced representation of the components.
Find the covariance matrix \(S\) of the data. The grouping column is not included.
S <- cov(pilots[,2:7])
S
##                            Intelligence Form.Relations Dynamometer
## Intelligence                  528.19487       35.98974    27.97949
## Form.Relations                 35.98974       68.91282   -12.09744
## Dynamometer                    27.97949      -12.09744   145.18974
## Dotting                       104.42821      -81.75128   128.88205
## Sensory.Motor.Coordination    -20.03077      -33.00513   -30.85641
## Perservation                  291.15385      -18.28205    29.38462
##                               Dotting Sensory.Motor.Coordination
## Intelligence                104.42821                  -20.03077
## Form.Relations              -81.75128                  -33.00513
## Dynamometer                 128.88205                  -30.85641
## Dotting                    1366.43013                 -113.58077
## Sensory.Motor.Coordination -113.58077                  264.35641
## Perservation                395.18590                  -79.85897
##                            Perservation
## Intelligence                  291.15385
## Form.Relations                -18.28205
## Dynamometer                    29.38462
## Dotting                       395.18590
## Sensory.Motor.Coordination    -79.85897
## Perservation                 1069.11538
The total variance is defined as:
\( \sum^p_{j=1} s_{jj} \)
which is also equal to the sum of the eigenvalues of \(S\).
sum(diag(S))
## [1] 3442.199
Compute the eigenvalues and corresponding eigenvectors of \(S\).
s.eigen <- eigen(S)
s.eigen
## $values
## [1] 1722.0424  878.3578  401.4386  261.0769  128.9051   50.3785
##
## $vectors
##             [,1]        [,2]        [,3]        [,4]        [,5]
## [1,] -0.21165160 -0.38949336  0.88819049  0.03082062 -0.04760343
## [2,]  0.03883125 -0.06379320  0.09571590 -0.19128493 -0.14793191
## [3,] -0.08012946  0.06602004  0.08145863 -0.12854488  0.97505667
## [4,] -0.77552673  0.60795970  0.08071120  0.08125631 -0.10891968
## [5,]  0.09593926 -0.01046493  0.01494473  0.96813856  0.10919120
## [6,] -0.58019734 -0.68566916 -0.43426141  0.04518327  0.03644629
##             [,6]
## [1,]  0.10677164
## [2,] -0.96269790
## [3,] -0.12379748
## [4,] -0.06295166
## [5,] -0.20309559
## [6,] -0.03572141
The eigenvectors represent the principal components of \(S\). The eigenvalues of \(S\) are used to find the proportion of the total variance explained by the components.
for (s in s.eigen$values) {
  print(s / sum(s.eigen$values))
}
## [1] 0.5002739
## [1] 0.2551734
## [1] 0.1166227
## [1] 0.07584597
## [1] 0.03744848
## [1] 0.01463556
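An equivalent, vectorised way to obtain the same proportions (a small alternative to the loop above) also gives the cumulative proportions, which show how much variance the first few components retain together.

# Vectorised proportions and cumulative proportions of the total variance
s.eigen$values / sum(s.eigen$values)
cumsum(s.eigen$values) / sum(s.eigen$values)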
The first two principal components account for 75.5% of the total variance. A scree graph of the eigenvalues can be plotted to visualize the proportion of variance explained by each successive eigenvalue.
plot(s.eigen$values, xlab = 'Eigenvalue Number', ylab = 'Eigenvalue Size', main = 'Scree Graph')
lines(s.eigen$values)
The elements of the eigenvectors of \(S\) are the ‘coefficients’ or ‘loadings’ of the principal components.
s.eigen$vectors
##             [,1]        [,2]        [,3]        [,4]        [,5]
## [1,] -0.21165160 -0.38949336  0.88819049  0.03082062 -0.04760343
## [2,]  0.03883125 -0.06379320  0.09571590 -0.19128493 -0.14793191
## [3,] -0.08012946  0.06602004  0.08145863 -0.12854488  0.97505667
## [4,] -0.77552673  0.60795970  0.08071120  0.08125631 -0.10891968
## [5,]  0.09593926 -0.01046493  0.01494473  0.96813856  0.10919120
## [6,] -0.58019734 -0.68566916 -0.43426141  0.04518327  0.03644629
##             [,6]
## [1,]  0.10677164
## [2,] -0.96269790
## [3,] -0.12379748
## [4,] -0.06295166
## [5,] -0.20309559
## [6,] -0.03572141
The first two principal components are thus:
\( z_1 = a_1'y = -0.212y_1 + 0.039y_2 - 0.080y_3 - 0.776y_4 + 0.096y_5 - 0.580y_6 \)
\( z_2 = a_2'y = -0.389y_1 - 0.064y_2 + 0.066y_3 + 0.608y_4 - 0.010y_5 - 0.686y_6 \)
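If desired, these component scores can be evaluated for every observation by centering the data and projecting onto the corresponding eigenvectors; a short sketch using the objects created above:

# Evaluate z1 and z2 for each observation: center the data and project it
# onto the first two eigenvectors of S.
pilots.centered <- sweep(pilots[, 2:7], 2, colMeans(pilots[, 2:7]))
z <- as.matrix(pilots.centered) %*% s.eigen$vectors[, 1:2]
colnames(z) <- c('z1', 'z2')
head(z)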
Principal Component Analysis with R
Computing the principal components in R is straightforward with the functions prcomp() and princomp(). The difference between the two is simply the method employed to calculate PCA. According to ?prcomp:
The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy.
From ?princomp:
The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result.
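As a quick, optional check of the difference between the two functions, note that princomp() uses divisor \(n\) rather than \(n - 1\) for the covariance matrix, so its reported standard deviations should differ from the square roots of the eigenvalues computed earlier by a factor of \(\sqrt{(n-1)/n}\) (a small sketch, reusing the s.eigen object from above):

# princomp() divides by n rather than n - 1, so its standard deviations are a
# factor of sqrt((n - 1)/n) smaller than sqrt(s.eigen$values).
n <- nrow(pilots)
unname(princomp(pilots[, 2:7])$sdev)
sqrt(s.eigen$values * (n - 1) / n)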
pilots.pca <- prcomp(pilots[,2:7])
pilots.pca
## Standard deviations:
## [1] 41.497499 29.637102 20.035932 16.157875 11.353640  7.097781
##
## Rotation:
##                                    PC1         PC2         PC3         PC4
## Intelligence                0.21165160 -0.38949336  0.88819049 -0.03082062
## Form.Relations             -0.03883125 -0.06379320  0.09571590  0.19128493
## Dynamometer                 0.08012946  0.06602004  0.08145863  0.12854488
## Dotting                     0.77552673  0.60795970  0.08071120 -0.08125631
## Sensory.Motor.Coordination -0.09593926 -0.01046493  0.01494473 -0.96813856
## Perservation                0.58019734 -0.68566916 -0.43426141 -0.04518327
##                                    PC5         PC6
## Intelligence               -0.04760343 -0.10677164
## Form.Relations             -0.14793191  0.96269790
## Dynamometer                 0.97505667  0.12379748
## Dotting                    -0.10891968  0.06295166
## Sensory.Motor.Coordination  0.10919120  0.20309559
## Perservation                0.03644629  0.03572141
Although our manual computation did not use the preferred singular value decomposition approach, the components reported by prcomp() are the same as those computed earlier, up to an arbitrary scaling of \(-1\) applied to some of the eigenvectors.
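This can be verified directly; the small cross-check below ignores the signs of the loadings and also confirms that the reported standard deviations are the square roots of the eigenvalues of \(S\):

# The loadings agree with the eigenvectors of S up to the sign of each column,
# and the standard deviations are the square roots of the eigenvalues of S.
all.equal(abs(unname(pilots.pca$rotation)), abs(s.eigen$vectors))
all.equal(pilots.pca$sdev, sqrt(s.eigen$values))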
The summary method of prcomp() also outputs the proportion of variance explained by the components.
summary(pilots.pca)
## Importance of components:
##                            PC1     PC2     PC3      PC4      PC5     PC6
## Standard deviation     41.4975 29.6371 20.0359 16.15788 11.35364 7.09778
## Proportion of Variance  0.5003  0.2552  0.1166  0.07585  0.03745 0.01464
## Cumulative Proportion   0.5003  0.7554  0.8721  0.94792  0.98536 1.00000
Plotting of Principal Components
The first two principal components are often plotted as a scatterplot which may reveal interesting features of the data, such as departures from normality, outliers or non-linearity. The first two principal components are evaluated for each observation vector and plotted.
The ggfortify package provides a handy method for plotting the first two principal components with autoplot().
library(ggfortify)
The autoplot() function also generates a useful data table of the calculated principal components, which we will use later.
pca.plot <- autoplot(pilots.pca, data = pilots, colour = 'Group')
pca.plot
The points of the two groups are clustered for the most part; however, the three points at the top of the graph may be outliers. The data does not appear to depart widely from multivariate normality. We will see if this conclusion changes when the PCs from the correlation matrix \(R\) are plotted.
To recreate the graph generated by autoplot(), scale the data using the standard deviations of the principal components multiplied by the square root of the number of observations. The principal components are then computed for each observation vector. Note the first eigenvector is multiplied by a scaling factor of \(-1\) so the signs match what was reported by the prcomp() function.
scaling <- pilots.pca$sdev[1:2] * sqrt(nrow(pilots))
pc1 <- rowSums(t(t(sweep(pilots[,2:7], 2, colMeans(pilots[,2:7]))) * s.eigen$vectors[,1] * -1) / scaling[1])
pc2 <- rowSums(t(t(sweep(pilots[,2:7], 2, colMeans(pilots[,2:7]))) * s.eigen$vectors[,2]) / scaling[2])
Collect the PCs in a data.frame and plot using ggplot (loaded when ggfortify was loaded).
df <- data.frame(pc1, pc2, c(rep('Apprentice', 20), rep('Pilot', 20)))
colnames(df) <- c('PC1', 'PC2', 'Group')
ggplot(df, aes(x = PC1, y = PC2, color = Group)) + geom_point()
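An equivalent way to obtain the same scaled scores (a small cross-check, not part of the original workflow) is to divide the scores that prcomp() already stores in its x component by the same scaling factors; the sign flip applied to the first component above corresponds to the sign prcomp() chose for its first loading vector.

# Cross-check: the manually computed scores match the prcomp() scores divided
# by the same scaling factors (prcomp()'s first column already carries the
# flipped sign used above).
head(cbind(pc1, pc2))
head(sweep(pilots.pca$x[, 1:2], 2, scaling, '/'))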
The scaling employed when calculating the PCs can be omitted. To remove the scaling in the autoplot() function, set the scaling argument to 0.
Principal Component Analysis with the Correlation Matrix \(R\)
As mentioned previously, although principal component analysis is typically performed on the covariance matrix \(S\), it often makes more intuitive sense to apply PCA to the correlation matrix. Cases where using \(R\) may be preferable to \(S\) include data that is measured in different units or has wide variances. The pilot data analyzed does not appear to have commensurate units for each variable, and because we have very little information regarding the tests and the measurements collected, it might make sense to employ the \(R\) matrix rather than \(S\).
The correlation matrix is found with the cor() function.
R <- cor(pilots[,2:7])
R
##                            Intelligence Form.Relations Dynamometer
## Intelligence                 1.00000000     0.18863907  0.10103566
## Form.Relations               0.18863907     1.00000000 -0.12094150
## Dynamometer                  0.10103566    -0.12094150  1.00000000
## Dotting                      0.12292126    -0.26641020  0.28935484
## Sensory.Motor.Coordination  -0.05360504    -0.24453244 -0.15750071
## Perservation                 0.38744776    -0.06735388  0.07458298
##                               Dotting Sensory.Motor.Coordination
## Intelligence                0.1229213                -0.05360504
## Form.Relations             -0.2664102                -0.24453244
## Dynamometer                 0.2893548                -0.15750071
## Dotting                     1.0000000                -0.18898014
## Sensory.Motor.Coordination -0.1889801                 1.00000000
## Perservation                0.3269606                -0.15021611
##                            Perservation
## Intelligence                 0.38744776
## Form.Relations              -0.06735388
## Dynamometer                  0.07458298
## Dotting                      0.32696061
## Sensory.Motor.Coordination  -0.15021611
## Perservation                 1.00000000
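Performing PCA on \(R\) is equivalent to performing PCA on the standardized data; a quick check confirms that the covariance matrix of the scaled variables is exactly the correlation matrix.

# The covariance matrix of the standardized variables equals the correlation matrix.
all.equal(R, cov(scale(pilots[, 2:7])))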
Find the eigenvalues and eigenvectors of the \(R\) matrix.
r.eigen <- eigen(R)
As with the covariance matrix, we can compute the proportion of total variance explained by the eigenvalues.
for (r in r.eigen$values) {
  print(r / sum(r.eigen$values))
}
## [1] 0.2958546
## [1] 0.225736
## [1] 0.1787751
## [1] 0.1357993
## [1] 0.08843547
## [1] 0.07539955
What is readily noticeable is that the first eigenvalue accounts for 30% of the total variance, compared with 50% for the \(S\) matrix. The first two components of \(R\) account for only 52% of the total variance, while the last two components have little significance. Thus, one may want to keep the first four components here, rather than the first two or three as with the \(S\) matrix.
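The cumulative proportions make this explicit; mirroring the earlier computation for \(S\), the first four components of \(R\) account for roughly 84% of the total variance.

# Cumulative proportion of variance explained by the components of R
cumsum(r.eigen$values) / sum(r.eigen$values)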
To perform principal component analysis on the correlation matrix with the prcomp() function, set the scale argument to TRUE.
pilots.pca.scaled <- prcomp(pilots[,2:7], scale = TRUE)
pilots.pca.scaled
## Standard deviations:
## [1] 1.3323392 1.1637937 1.0356884 0.9026604 0.7284317 0.6726049
##
## Rotation:
##                                    PC1        PC2        PC3        PC4
## Intelligence                0.40239072 -0.3964661  0.4617841 -0.3928149
## Form.Relations             -0.09715877 -0.7472294 -0.1752970 -0.1315611
## Dynamometer                 0.38541311  0.2181560 -0.4329575 -0.7177525
## Dotting                     0.54333623  0.3144601 -0.1065065  0.2453920
## Sensory.Motor.Coordination -0.31188931  0.3559400  0.6268314 -0.3992852
## Perservation                0.53629229 -0.1062657  0.4053555  0.3058981
##                                   PC5        PC6
## Intelligence               -0.2103062 -0.5187674
## Form.Relations             -0.2801896  0.5528697
## Dynamometer                 0.2585104  0.1855163
## Dotting                    -0.7066663  0.1869825
## Sensory.Motor.Coordination -0.2012981  0.4279773
## Perservation                0.5201339  0.4155385

summary(pilots.pca.scaled)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5    PC6
## Standard deviation     1.3323 1.1638 1.0357 0.9027 0.72843 0.6726
## Proportion of Variance 0.2959 0.2257 0.1788 0.1358 0.08844 0.0754
## Cumulative Proportion  0.2959 0.5216 0.7004 0.8362 0.92460 1.0000
Plot the first two PCs of the correlation matrix using the autoplot() function.
pca.plot.scaled <- autoplot(pilots.pca.scaled, data = pilots, colour = 'Group')
pca.plot.scaled
The points remain clustered for the most part; however, more points now lie outside the main cluster and may be considered outliers, though they do not appear to be too far removed from it. It is important to remember that the first two PCs of the \(R\) matrix represent only 52% of the total variance and thus may not fully capture the variance in the dataset.
Interpreting Principal Components
Interpretation of principal components is still a heavily researched topic in statistics, and although the components may be readily interpreted in most settings, this is not always the case (Jolliffe, 2002).
One method of interpreting the principal components is to calculate the correlation between the original data and each component. The autoplot() function also generates a convenient data table with the original variables and the calculated PCs, which we will use here to find the correlations.
First, compute the correlations between the data and the calculated components of the covariance matrix \(S\).
comps <- pca.plot$data[,c(1:2,4:13)]
cor(comps[,3:8], comps[,c(1:2,9)])
##                                   PC1         PC2         PC3
## Intelligence                0.3821610 -0.50227170  0.77431660
## Form.Relations             -0.1941124 -0.22775091  0.23101683
## Dynamometer                 0.2759600  0.16238412  0.13544989
## Dotting                     0.8706127  0.48743512  0.04374714
## Sensory.Motor.Coordination -0.2448631 -0.01907555  0.01841632
## Perservation                0.7363518 -0.62149561 -0.26610222
The PCs can then be interpreted based on which variables they are most correlated in either a positive or negative direction. The level at which the correlations are significant is left to the researcher.
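For the covariance-based components, these correlations can also be computed directly from the eigen decomposition, since \( cor(y_j, z_k) = a_{jk}\sqrt{\lambda_k} / \sqrt{s_{jj}} \) (Rencher 2002). The sketch below reproduces the table above, up to the sign flips between eigen() and prcomp() noted earlier.

# Correlations between the original variables and the components of S, computed
# from the eigenvalues and eigenvectors. Signs may be flipped relative to the
# table above because of the eigenvector sign ambiguity.
r.vars.pcs <- sweep(s.eigen$vectors %*% diag(sqrt(s.eigen$values)), 1, sqrt(diag(S)), '/')
rownames(r.vars.pcs) <- colnames(S)
r.vars.pcs[, 1:3]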
The first component is correlated positively with Dotting, Perservation, Intelligence and Dynamometer. This suggests these four variables vary together: when one increases, the others tend to increase as well. The component is most strongly correlated with Dotting at \(0.871\) and could be considered primarily a measure of Dotting.
The second component is most correlated with Perservation and Intelligence, both in a negative direction. Dotting is correlated with the second component in a positive direction, which would indicate that as Perservation and Intelligence decrease, Dotting increases.
The third component is primarily correlated with Intelligence and not much else. This component could be viewed as a measure of the intelligence of the individual apprentice or pilot.
It was decided previously that due to lack of information regarding the variables and their units of measurement, it makes more sense to use the correlation matrix \(R\) for performing principal component analysis. Let’s see how the interpretation of the principal components changes when we use the \(R\) matrix.
comps.scaled <- pca.plot.scaled$data[,c(1:2,4:13)]
cor(comps.scaled[,3:8], comps.scaled[,c(1:2,9)])
##                                   PC1        PC2        PC3
## Intelligence                0.5361209 -0.4614047  0.4782644
## Form.Relations             -0.1294484 -0.8696209 -0.1815530
## Dynamometer                 0.5135010  0.2538886 -0.4484090
## Dotting                     0.7239081  0.3659667 -0.1103075
## Sensory.Motor.Coordination -0.4155423  0.4142408  0.6492020
## Perservation                0.7145232 -0.1236713  0.4198220
The most apparent changes between the correlations of the original variables with the PCs of the \(S\) and \(R\) matrices are in components 2 and 3. The first principal component is still strongly correlated with Dotting and Perservation, but Intelligence and Dynamometer are now much more strongly correlated with it as well, suggesting all four of these variables tend to increase together.
The second component is now correlated most with Form Relations and not much else, whereas with the \(S\) matrix the component was correlated more with Perservation and Intelligence. This difference in variable correlations between the components of the two matrices may indicate that Perservation and Intelligence were unduly dominating the variances.
The third component is now most correlated with Sensory Motor Coordination and secondarily with Intelligence and Perservation, which indicates that subjects with high Sensory Motor Coordination test scores also tend to have higher Intelligence and Perservation scores.
Summary
This post ended up being much longer than I had anticipated but I hope it is a good introduction to the power and benefits of principal component analysis. The post covered PCA with the covariance and correlation matrices as well as plotting and interpreting the principal components. I plan to continue discussing PCA in the future as there are many more topics and applications related to the dimension reduction technique.
References
Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer.
Rencher, A. C. (2002). Methods of Multivariate Analysis (2nd ed.). Wiley.
https://onlinecourses.science.psu.edu/stat505/node/54