Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
How to perform PCA trough singular value decomposition using R.
What is singular value decomposition?
Singular value decomposition (SVD) is a factorization of a real or complex matrix which generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any m x n matrix:
Where M is m x n, U is m x m, S is m x n, and V is n x n.
The diagonal entries si of S are know as the singular values of M. The number of non-zero singular values is equal to the rank of M. The columns of U and the columns of V are called the left-singular vectors and right singular vectors of M, respectively.
In R, you can perform SVD through svd()
which by default uses compact SVD, a similar decomposition M = USV´ in which S is square diagonal of size r x r, where r ≤ min{m, n} is the rank of M, and has only the non-zero singular values. In this variant, V is an m x r semi-unitary matrix and V is an n x r semi-unitary matrix, such that UU = VV = I.
How can we use SVD to perform principal component analysis?
Among other applications, SVD can be used to perform principal component analysis (PCA) since there is a close relationship between both procedures. Check out the post “Relationship between SVD and PCA. How to use SVD to perform PCA?” to see a more detailed explanation.
Let’s say you have a data matrix M of m x n size, where m is the number of samples (treatments, states, etc.) and n is the number of variables (response variable measures on each sample). Also let’s assume you have centered your data by subtracting the mean of each variable. Then you can calculate the covariance matrix C given by:
This symmetric matrix can be diagonalized as:
V is a matrix of eigenvectors (each column is an eigenvector) and L is a diagonal matrix with eigenvalues ʎi in the decreasing order on the diagonal. When performing PCA, the rest is just to obtain the projections of the original data through the matrix multiplication MV. The new points can be called principal components or PC scores. Besides, eigenvalues also tell us the variation covered by each principal component.
Now, if we perform SVD on M:
We can easily see that:
Meaning that right singular vectors V are principal directions (eigenvectors) and that singular values are related to the eigenvalues of covariance matrix via ʎi = si2/(n-1). Principal components are given by XV = US and loadings by columns of VS/(m-1)1/2.
Now, it’s time to see the above in action with some data and R code.
Data
The data set I’ve used for this post comes from tissuesGeneExpression, which in turn is a subset of gene expression data from the following papers:
- http://www.ncbi.nlm.nih.gov/pubmed/17906632
- http://www.ncbi.nlm.nih.gov/pubmed/21177656
- http://www.ncbi.nlm.nih.gov/pubmed/24271388
You can download this data from its original repository:
devtools::install_github("genomicsclass/tissuesGeneExpression") library(tissuesGeneExpression) data(tissuesGeneExpression) data_gene <- t(e) data_gene <- as.data.frame(data_gene)
I’ve used t()
to transpose the data, so in rows tissue samples are marked while in columns expression for different genes:
dim(data_gene) ## [1] 189 22215
As you can see, there are 189 tissue samples where the expression levels of 22215 genes has been measured.
PCA through svd()
Center the data
To perform a PCA, first is necessary to center the data by subtracting the mean from each gene:
library(purrr) # Calculate the mean of each gene gene_means <- map_dbl(data_gene, mean) # Subtract each gene mean on its corresponding row data_gene_c <- map2(data_gene, gene_means, .f = function(x, mean) x - mean) data_gene_c <- data.frame(data_gene_c)
SVD with svd()
Now it’s easy to perform SVD:
dg_svd <- svd(data_gene_c)
dg_svd
contains a list with a u matrix, a v matrix, and a numerical vector with singular values:
# U matrix dg_u <- dg_svd$u # V matrix dg_v <- dg_svd$v # Singular values dg_d <- diag(dg_svd$d)
Scree plot
The we can calculate each eigenvalue and the variation covered by each principal component:
library(dplyr) # Obtain each eigenvalue ev_dg_svd <- dg_svd$d^2 / (nrow(data_gene) - 1) # Percentage of variation of each PC pervar_dg_svd <- data.frame(ev_dg_svd) %>% mutate( per_var = ev_dg_svd * 100 / sum(ev_dg_svd), pc = map_chr(1:nrow(dg_u), .f = function(x) paste0("PC", as.character(x)) ) )
And make a scree plot:
library(ggplot2) # Scree plot with the first 15 principal components ggplot(pervar_dg_svd[1:15,], aes(reorder(pc, -per_var), per_var)) + geom_col() + xlab("Principal component") + ylab("Percentage of variation (%)") + ggtitle("Scree plot - svd()") + theme_classic()
Since there are 189 principal components, I’m just took the first 15.
Score plot
First we calculate the projections of the original data:
# Scores scores_dg_svd <- data.frame(dg_u %*% dg_d) # Change the default names for PC names colnames(scores_dg_svd) <- map_chr( 1:nrow(scores_dg_svd), .f = function(x) paste0("PC", as.character(x)) )
I’m adding a new column to the previous data, that indicates the name of the tissue which each sample corresponds:
scores_dg_svd <- scores_dg_svd %>% mutate(Tissue = tissue) %>% relocate(Tissue)
The character vector tissue
is included in the data that I downloaded first:
table(tissue) ## tissue ## cerebellum colon endometrium hippocampus kidney liver ## 38 34 15 31 39 26 ## placenta ## 6
Finally we can easily make a score plot using ggplot2
:
ggplot(scores_dg_svd, aes(PC1, PC2, color = Tissue)) + geom_point(size = 2) + xlab("PC1 (33%)") + ylab("PC2 (14%)") + ggtitle("PCA on gene expression data using svd()") + theme_classic()
Loadings
Loadings can also be easily obtained:
ls_dg_svd <- (dg_v %*% dg_d) / sqrt(nrow(data_gene) - 1) dim(ls_dg_svd) ## [1] 22215 189
Each row on this matrix corresponds to a gene and each column to a principal component:
ls_dg_svd[1:5, 1:5] ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.28109040 -0.51722727 0.06503399 0.44542686 0.01671733 ## [2,] -0.12941227 -0.08617018 0.04809022 -0.08718245 -0.08208977 ## [3,] -0.18520759 0.22871016 0.05254447 0.10137023 0.04573839 ## [4,] -0.04402067 -0.07799667 -0.08576623 0.04628856 0.46190597 ## [5,] 0.06064367 0.04732136 -0.04864953 0.08728058 -0.02975143
Eigenvectors (usally reported as loadings)
Unfortunately, I have not been able to found another source that supports the loadigns definition on the previous section. Usually, in papers eigenvectors are reported and discussed as loadings. So if you want to see or use this values you have to take the v matrix:
dg_v[1:5, 1:5] ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.0046965883 -0.013275124 0.002087159 0.017092900 0.0006956308 ## [2,] -0.0021622800 -0.002211639 0.001543377 -0.003345557 -0.0034158660 ## [3,] -0.0030945339 0.005870061 0.001686328 0.003890002 0.0019032363 ## [4,] -0.0007355177 -0.002001858 -0.002752526 0.001776287 0.0192205309 ## [5,] 0.0010132625 0.001214547 -0.001561327 0.003349323 -0.0012379970
Remember this vectors are normalized, so their norm is equal to 1, and they are orthogonal, so their dot product will be 0.
PCA trhough prcomp()
A more direct way to do the above, is through the function prcomp()
, which uses SVD to perform PCA. Even you do not need to center your data since this function do it by default, among other options.
dg_pca <- prcomp(data_gene)
dg_pca
is basically a list that contains the PC scores, loadings and singular values.
Scree plot
To make a scree plot, with percentage of variation covered by each principal component, use the values on sdev
whithin dg_pca
:
# Eigenvalues ev_dg_prcomp <- dg_pca$sdev^2 / (nrow(data_gene) - 1) # Percentage of variation for each PC per_var_prcomp <- data.frame(ev_dg_prcomp) %>% mutate( per_var = ev_dg_prcomp * 100 / sum(ev_dg_prcomp), pc = map_chr( 1:nrow(dg_pca$x), .f = function(x) paste0("PC", as.character(x)) ) ) # Scree plot with the first 15 principal components ggplot(per_var_prcomp[1:15,], aes(reorder(pc, -per_var), per_var)) + geom_col() + xlab("Principal component") + ylab("Percentage of variation (%)") + ggtitle("Scree plot - prcomp()") + theme_classic()
If you compare this graph with that obtained using svd()
you will find it is the same.
Score plot
PC scores are in the object x
within dg_pca
:
# Scores scores_dg_prcomp <- dg_pca$x %>% as.data.frame() %>% mutate(Tissue = tissue) %>% relocate(Tissue) # Score plot ggplot(scores_dg_prcomp, aes(PC1, PC2, color = Tissue)) + geom_point(size = 2) + xlab("PC1 (33%)") + ylab("PC2 (14%)") + ggtitle("PCA on expression data using prcomp()") + theme_classic()
Again, exactly the same result as with svd()
.
Loadings
You can see the eigenvectors in the object rotation
within dg_pca
. To obtain the loadings you only need to multiply by the singular values:
ls_dg_prcomp <- dg_pca$rotation %*% diag(dg_pca$sdev) dim(ls_dg_prcomp) ## [1] 22215 189
In the same way as svd()
, on this matrix each row corresponds to a gene and each column to a principal component:
ls_dg_prcomp[1:5, 1:5] ## [,1] [,2] [,3] [,4] [,5] ## 1007_s_at 0.28109040 -0.51722727 0.06503399 0.44542686 0.01671733 ## 1053_at -0.12941227 -0.08617018 0.04809022 -0.08718245 -0.08208977 ## 117_at -0.18520759 0.22871016 0.05254447 0.10137023 0.04573839 ## 121_at -0.04402067 -0.07799667 -0.08576623 0.04628856 0.46190597 ## 1255_g_at 0.06064367 0.04732136 -0.04864953 0.08728058 -0.02975143
Eigenvectors (usally reported as loadings)
You can use or see the eigenvectors, usually reported as loadings as I previously wrote, in the component rotation
:
dg_pca$rotation[1:5, 1:5] ## PC1 PC2 PC3 PC4 PC5 ## 1007_s_at 0.0046965883 -0.013275124 0.002087159 0.017092900 0.0006956308 ## 1053_at -0.0021622800 -0.002211639 0.001543377 -0.003345557 -0.0034158660 ## 117_at -0.0030945339 0.005870061 0.001686328 0.003890002 0.0019032363 ## 121_at -0.0007355177 -0.002001858 -0.002752526 0.001776287 0.0192205309 ## 1255_g_at 0.0010132625 0.001214547 -0.001561327 0.003349323 -0.0012379970
You can find all the code on this post on the next repository:
Principal-Component-Analysis-through-Singular-Value-Decomposition
Thank you so much for your time and visit this site.
Juan Pablo Carreón Hidalgo
jpch_26@outlook.com
https://twitter.com/JuanPa2601
This work is licensed under a Creative Commons Attribution 4.0 International License.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.