This is the second post of the "Create your Machine Learning library from scratch with R!" series. Today, we will see how to implement Principal Component Analysis (PCA) using only the linear algebra available in R. Previously, we implemented linear regression and logistic regression from scratch, and next time we will deal with K-nearest neighbors (KNN).
Principal components analysis
PCA is a dimensionality reduction method which seeks the directions that explain most of the variance in the dataset. From a mathematical standpoint, PCA is just a change of coordinates that represents the points in a more appropriate basis. Keeping only a few of these coordinates is enough to explain an important part of the variance in the dataset.
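As a quick illustration of this idea (a sketch, not part of the library code we are building), take a strongly correlated 2-D cloud: once expressed in the eigenbasis of its covariance matrix, almost all of the variance sits on the first coordinate.

## Illustration: a change of basis can concentrate the variance
set.seed(42)
x1 <- rnorm(1000)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(1000)  ## strongly correlated with x1
cloud <- cbind(x1, x2)
diag(cov(cloud))  ## each original coordinate carries about half the variance
## Rotate the points into the eigenbasis of the covariance matrix
rotated <- cloud %*% eigen(cov(cloud))$vectors
diag(cov(rotated))  ## the first coordinate now carries roughly 95% of the total variance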
The mathematics of PCA
Let $X$ be the $n \times p$ matrix of observations, where each of the $p$ columns has been centered and scaled to unit standard deviation.

Then the covariance matrix $\Sigma = \frac{1}{n-1} X^T X$ is symmetric and positive semi-definite. It can therefore be diagonalised in an orthonormal basis:

$$\Sigma = P \Lambda P^T,$$

where $P$ is an orthogonal matrix whose columns are the eigenvectors of $\Sigma$ and $\Lambda$ is the diagonal matrix of its eigenvalues.

We denote $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0$ the eigenvalues and $v_1, \dots, v_p$ the associated eigenvectors, called the principal components. The variance of the data along $v_i$ is exactly $\lambda_i$.

It can also be shown that, among all subspaces of dimension $k$, the one spanned by $v_1, \dots, v_k$ retains the largest share of the total variance, namely $\frac{\lambda_1 + \dots + \lambda_k}{\lambda_1 + \dots + \lambda_p}$.
This is exactly what we wanted! We have a smaller basis which explains as much of the variance as possible!
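As a quick numerical check of the formula above (a sketch, not part of the library we are building), we can diagonalise $X^T X$ for the scaled iris measurements with base R's eigen():

## Eigendecomposition of X'X for the centered and scaled iris data
X <- scale(as.matrix(iris[, 1:4]))
eig <- eigen(crossprod(X))
eig$values / sum(eig$values)           ## proportion of variance per component
cumsum(eig$values) / sum(eig$values)   ## the first two components explain about 96%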
PCA in R
The implementation in R has three steps:
- We center the data and divide them by their standard deviations. Our data now comply with the PCA hypotheses.
- We diagonalise $X^T X$ and store the eigenvectors and eigenvalues.
- The cumulative variance is computed and the number $k$ of eigenvectors required to reach the variance threshold is stored. We only keep the first $k$ eigenvectors.
### PCA
my_pca <- function(x, variance_explained = 0.9, center = TRUE, scale = TRUE) {
  my_pca <- list()
  ## Compute the mean of each variable
  if (center) {
    my_pca[['center']] <- colMeans(x)
  } else {
    ## Otherwise, we set the mean to 0
    my_pca[['center']] <- rep(0, dim(x)[2])
  }
  ## Compute the standard deviation of each variable
  if (scale) {
    my_pca[['std']] <- apply(x, 2, sd)
  } else {
    ## Otherwise, we set the sd to 1
    my_pca[['std']] <- rep(1, dim(x)[2])
  }
  ## Normalization: centering ...
  x_std <- sweep(x, 2, my_pca[['center']])
  ## ... then standardization
  x_std <- x_std %*% diag(1 / my_pca[['std']])
  ## Diagonalise X'X: it has the same eigenvectors as the covariance matrix
  eigen_cov <- eigen(crossprod(x_std, x_std))
  ## Computing the cumulative variance
  my_pca[['cumulative_variance']] <- cumsum(eigen_cov[['values']])
  ## Number of components required to reach the variance threshold
  my_pca[['n_components']] <- sum((my_pca[['cumulative_variance']] / sum(eigen_cov[['values']])) < variance_explained) + 1
  ## Selection of the principal components
  my_pca[['transform']] <- eigen_cov[['vectors']][, 1:my_pca[['n_components']]]
  attr(my_pca, "class") <- "my_pca"
  return(my_pca)
}
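A quick interactive check of the fitted object (a hypothetical session; the exact output depends on your threshold). Note that eigen() returns the eigenvalues in decreasing order, which is what the cumulative-variance computation relies on.

fit <- my_pca(as.matrix(iris[, 1:4]), variance_explained = 0.9)
fit[['n_components']]    ## 2 components reach 90% of the variance on iris
dim(fit[['transform']])  ## a 4 x n_components projection matrix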
Now that we have the transformation matrix, we can perform the projection onto the new basis.
predict.my_pca <- function(pca, x, ...) {
  ## Centering
  x_std <- sweep(x, 2, pca[['center']])
  ## Standardization
  x_std <- x_std %*% diag(1 / pca[['std']])
  ## Projection on the selected eigenvectors
  return(x_std %*% pca[['transform']])
}
The function applies the change of basis formula and projects the observations on the first n_components eigenvectors.
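A useful sanity check (not in the original post): in the new basis the coordinates are uncorrelated, so the covariance matrix of the projected data should be diagonal up to numerical noise.

## The projected coordinates are uncorrelated by construction
proj <- predict(my_pca(as.matrix(iris[, 1:4]), 0.9), as.matrix(iris[, 1:4]))
round(cov(proj), 10)  ## off-diagonal terms vanish up to floating-point error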
Plot the PCA projection
Using the predict function, we can now plot the projection of the observations on the two main components. As in part 1, we use the Iris dataset.
library(ggplot2)
pca1 <- my_pca(as.matrix(iris[, 1:4]), 1, scale = TRUE, center = TRUE)
projected <- predict(pca1, as.matrix(iris[, 1:4]))
ggplot() + geom_point(aes(x = projected[, 1], y = projected[, 2], color = iris[, 5]))
Comparison with the FactoMineR implementation
We can now compare our implementation with the standard FactoMineR implementation of Principal Component Analysis.
library(FactoMineR)
pca_stats <- PCA(as.matrix(iris[, 1:4]))
projected_stats <- predict(pca_stats, as.matrix(iris[, 1:4]))$coord[, 1:2]
ggplot(data = iris) + geom_point(aes(x = projected_stats[, 1], y = -projected_stats[, 2], color = Species)) +
  xlab('PC1') + ylab('PC2') + ggtitle('Iris dataset projected on the two main PCs (FactoMineR)')
When running this, you should get a plot very similar to the previous one (eigenvectors are only defined up to a sign flip, hence the minus sign on the second coordinate). This confirms the sanity of our implementation.
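For a further cross-check (a sketch using base R's stats::prcomp rather than the post's code), the loadings returned by prcomp should match our transform matrix up to the sign of each column:

pr <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
## Eigenvectors are defined up to a sign, so compare absolute values
round(abs(pr$rotation[, 1:2]) - abs(pca1[['transform']][, 1:2]), 6)  ## ~ all zeros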
Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.