Principal Component Analysis through Singular Value Decomposition

R in the Lab

8 months ago

[This article was first published on R in the Lab, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

How to perform PCA trough singular value decomposition using R.

What is singular value decomposition?

Singular value decomposition (SVD) is a factorization of a real or complex matrix which generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any m x n matrix:

Where M is m x n, U is m x m, S is m x n, and V is n x n.

The diagonal entries si of S are know as the singular values of M. The number of non-zero singular values is equal to the rank of M. The columns of U and the columns of V are called the left-singular vectors and right singular vectors of M, respectively.

In R, you can perform SVD through svd() which by default uses compact SVD, a similar decomposition M = USV´ in which S is square diagonal of size r x r, where r ≤ min{m, n} is the rank of M, and has only the non-zero singular values. In this variant, V is an m x r semi-unitary matrix and V is an n x r semi-unitary matrix, such that UU = VV = I.

How can we use SVD to perform principal component analysis?

Among other applications, SVD can be used to perform principal component analysis (PCA) since there is a close relationship between both procedures. Check out the post “Relationship between SVD and PCA. How to use SVD to perform PCA?” to see a more detailed explanation.
Let’s say you have a data matrix M of m x n size, where m is the number of samples (treatments, states, etc.) and n is the number of variables (response variable measures on each sample). Also let’s assume you have centered your data by subtracting the mean of each variable. Then you can calculate the covariance matrix C given by:

This symmetric matrix can be diagonalized as:

V is a matrix of eigenvectors (each column is an eigenvector) and L is a diagonal matrix with eigenvalues ʎi in the decreasing order on the diagonal. When performing PCA, the rest is just to obtain the projections of the original data through the matrix multiplication MV. The new points can be called principal components or PC scores. Besides, eigenvalues also tell us the variation covered by each principal component.

Now, if we perform SVD on M:

We can easily see that:

Meaning that right singular vectors V are principal directions (eigenvectors) and that singular values are related to the eigenvalues of covariance matrix via ʎi = si²/(n-1). Principal components are given by XV = US and loadings by columns of VS/(m-1)^1/2.

Now, it’s time to see the above in action with some data and R code.

Data

The data set I’ve used for this post comes from tissuesGeneExpression, which in turn is a subset of gene expression data from the following papers:

You can download this data from its original repository:

devtools::install_github("genomicsclass/tissuesGeneExpression")

library(tissuesGeneExpression)
data(tissuesGeneExpression)

data_gene <- t(e)
data_gene <- as.data.frame(data_gene)

I’ve used t() to transpose the data, so in rows tissue samples are marked while in columns expression for different genes:

dim(data_gene)
## [1]   189 22215

As you can see, there are 189 tissue samples where the expression levels of 22215 genes has been measured.

PCA through svd()

Center the data

To perform a PCA, first is necessary to center the data by subtracting the mean from each gene:

library(purrr)

# Calculate the mean of each gene
gene_means <- map_dbl(data_gene, mean)

# Subtract each gene mean on its corresponding row
data_gene_c <- map2(data_gene, gene_means, .f = function(x, mean) x - mean)
data_gene_c <- data.frame(data_gene_c)

SVD with svd()

Now it’s easy to perform SVD:

dg_svd <- svd(data_gene_c)

dg_svd contains a list with a u matrix, a v matrix, and a numerical vector with singular values:

# U matrix 
dg_u <- dg_svd$u

# V matrix
dg_v <- dg_svd$v

# Singular values
dg_d <- diag(dg_svd$d)

Scree plot

The we can calculate each eigenvalue and the variation covered by each principal component:

library(dplyr)

# Obtain each eigenvalue
ev_dg_svd <- dg_svd$d^2 / (nrow(data_gene) - 1)

# Percentage of variation of each PC
pervar_dg_svd <- data.frame(ev_dg_svd) %>% 
  mutate(
    per_var = ev_dg_svd * 100 / sum(ev_dg_svd),
    pc = map_chr(1:nrow(dg_u), .f = function(x) paste0("PC", as.character(x))
    )
  )

And make a scree plot:

library(ggplot2)

# Scree plot with the first 15 principal components 
ggplot(pervar_dg_svd[1:15,], aes(reorder(pc, -per_var), per_var)) +
  geom_col() +
  xlab("Principal component") +
  ylab("Percentage of variation (%)") +
  ggtitle("Scree plot - svd()") +
  theme_classic()

Since there are 189 principal components, I’m just took the first 15.

Score plot

First we calculate the projections of the original data:

# Scores 
scores_dg_svd <- data.frame(dg_u %*% dg_d)

# Change the default names for PC names
colnames(scores_dg_svd) <- map_chr(
  1:nrow(scores_dg_svd), .f = function(x) paste0("PC", as.character(x))
)

I’m adding a new column to the previous data, that indicates the name of the tissue which each sample corresponds:

scores_dg_svd <- scores_dg_svd %>% 
  mutate(Tissue = tissue) %>% 
  relocate(Tissue)

The character vector tissue is included in the data that I downloaded first:

table(tissue)
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Finally we can easily make a score plot using ggplot2:

ggplot(scores_dg_svd, aes(PC1, PC2, color = Tissue)) +
  geom_point(size = 2) +
  xlab("PC1 (33%)") + 
  ylab("PC2 (14%)") +
  ggtitle("PCA on gene expression data using svd()") +
  theme_classic()

Loadings

Loadings can also be easily obtained:

ls_dg_svd <- (dg_v %*% dg_d) / sqrt(nrow(data_gene) - 1)

dim(ls_dg_svd)
## [1] 22215   189

Each row on this matrix corresponds to a gene and each column to a principal component:

ls_dg_svd[1:5, 1:5]
##             [,1]        [,2]        [,3]        [,4]        [,5]
## [1,]  0.28109040 -0.51722727  0.06503399  0.44542686  0.01671733
## [2,] -0.12941227 -0.08617018  0.04809022 -0.08718245 -0.08208977
## [3,] -0.18520759  0.22871016  0.05254447  0.10137023  0.04573839
## [4,] -0.04402067 -0.07799667 -0.08576623  0.04628856  0.46190597
## [5,]  0.06064367  0.04732136 -0.04864953  0.08728058 -0.02975143

Eigenvectors (usally reported as loadings)

Unfortunately, I have not been able to found another source that supports the loadigns definition on the previous section. Usually, in papers eigenvectors are reported and discussed as loadings. So if you want to see or use this values you have to take the v matrix:

dg_v[1:5, 1:5]
##               [,1]         [,2]         [,3]         [,4]          [,5]
## [1,]  0.0046965883 -0.013275124  0.002087159  0.017092900  0.0006956308
## [2,] -0.0021622800 -0.002211639  0.001543377 -0.003345557 -0.0034158660
## [3,] -0.0030945339  0.005870061  0.001686328  0.003890002  0.0019032363
## [4,] -0.0007355177 -0.002001858 -0.002752526  0.001776287  0.0192205309
## [5,]  0.0010132625  0.001214547 -0.001561327  0.003349323 -0.0012379970

Remember this vectors are normalized, so their norm is equal to 1, and they are orthogonal, so their dot product will be 0.

PCA trhough prcomp()

A more direct way to do the above, is through the function prcomp(), which uses SVD to perform PCA. Even you do not need to center your data since this function do it by default, among other options.

dg_pca <- prcomp(data_gene)

dg_pca is basically a list that contains the PC scores, loadings and singular values.

Scree plot

To make a scree plot, with percentage of variation covered by each principal component, use the values on sdev whithin dg_pca:

# Eigenvalues
ev_dg_prcomp <- dg_pca$sdev^2 / (nrow(data_gene) - 1)

# Percentage of variation for each PC
per_var_prcomp <- data.frame(ev_dg_prcomp) %>% 
  mutate(
    per_var = ev_dg_prcomp * 100 / sum(ev_dg_prcomp),
    pc = map_chr(
      1:nrow(dg_pca$x), .f = function(x) paste0("PC", as.character(x))
      )
  )

# Scree plot with the first 15 principal components 
ggplot(per_var_prcomp[1:15,], aes(reorder(pc, -per_var), per_var)) +
  geom_col() +
  xlab("Principal component") +
  ylab("Percentage of variation (%)") +
  ggtitle("Scree plot - prcomp()") +
  theme_classic()

If you compare this graph with that obtained using svd() you will find it is the same.

Score plot

PC scores are in the object x within dg_pca:

# Scores
scores_dg_prcomp <- dg_pca$x %>% 
  as.data.frame() %>% 
  mutate(Tissue = tissue) %>% 
  relocate(Tissue)

# Score plot
ggplot(scores_dg_prcomp, aes(PC1, PC2, color = Tissue)) +
  geom_point(size = 2) +
  xlab("PC1 (33%)") + 
  ylab("PC2 (14%)") +
  ggtitle("PCA on expression data using prcomp()") +
  theme_classic()

Again, exactly the same result as with svd().

Loadings

You can see the eigenvectors in the object rotation within dg_pca. To obtain the loadings you only need to multiply by the singular values:

ls_dg_prcomp <- dg_pca$rotation %*% diag(dg_pca$sdev)

dim(ls_dg_prcomp)
## [1] 22215   189

In the same way as svd(), on this matrix each row corresponds to a gene and each column to a principal component:

ls_dg_prcomp[1:5, 1:5]
##                  [,1]        [,2]        [,3]        [,4]        [,5]
## 1007_s_at  0.28109040 -0.51722727  0.06503399  0.44542686  0.01671733
## 1053_at   -0.12941227 -0.08617018  0.04809022 -0.08718245 -0.08208977
## 117_at    -0.18520759  0.22871016  0.05254447  0.10137023  0.04573839
## 121_at    -0.04402067 -0.07799667 -0.08576623  0.04628856  0.46190597
## 1255_g_at  0.06064367  0.04732136 -0.04864953  0.08728058 -0.02975143

Eigenvectors (usally reported as loadings)

You can use or see the eigenvectors, usually reported as loadings as I previously wrote, in the component rotation:

dg_pca$rotation[1:5, 1:5]
##                     PC1          PC2          PC3          PC4           PC5
## 1007_s_at  0.0046965883 -0.013275124  0.002087159  0.017092900  0.0006956308
## 1053_at   -0.0021622800 -0.002211639  0.001543377 -0.003345557 -0.0034158660
## 117_at    -0.0030945339  0.005870061  0.001686328  0.003890002  0.0019032363
## 121_at    -0.0007355177 -0.002001858 -0.002752526  0.001776287  0.0192205309
## 1255_g_at  0.0010132625  0.001214547 -0.001561327  0.003349323 -0.0012379970

You can find all the code on this post on the next repository:

Principal-Component-Analysis-through-Singular-Value-Decomposition

Thank you so much for your time and visit this site.

Juan Pablo Carreón Hidalgo

jpch_26@outlook.com
https://twitter.com/JuanPa2601

This work is licensed under a Creative Commons Attribution 4.0 International License.

To leave a comment for the author, please follow the link and comment on their blog: R in the Lab.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.