This is the second post of the "Create your Machine Learning library from scratch with R!" series. Today, we will see how to implement Principal Components Analysis (PCA) using only the linear algebra available in R. Previously, we implemented linear regression and logistic regression from scratch, and next time we will deal with K nearest neighbors (KNN).
Principal components analysis
PCA is a dimensionality reduction method which seeks the directions that explain most of the variance in the dataset. From a mathematical standpoint, PCA is just a change of coordinates which represents the points in a more appropriate basis. Keeping only a few of these coordinates is enough to explain an important part of the variance in the dataset.
The mathematics of PCA
Let \(x_1, \dots, x_n\) be the observations of our dataset; the points lie in \(\mathbb{R}^p\). We assume that they are centered and of unit variance. We denote by \(X \in \mathbb{R}^{n \times p}\) the matrix of observations.
Then \(X^\top X\) can be diagonalized and has real and non-negative eigenvalues (it is a symmetric positive semi-definite matrix).
We denote by \(\lambda_1 \geq \dots \geq \lambda_p \geq 0\) its eigenvalues and by \(v_1, \dots, v_p\) the associated orthonormal eigenvectors. Projecting the observations on the first \(k\) eigenvectors keeps a proportion \(\sum_{i=1}^{k} \lambda_i \big/ \sum_{i=1}^{p} \lambda_i\) of the total variance, and this choice is optimal among all \(k\)-dimensional subspaces.
This is exactly what we wanted! We have a smaller basis which explains as much of the variance as possible!
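As a quick sanity check (a toy example, not from the original post), the snippet below verifies in R that eigen() diagonalizes \(X^\top X\) and that the share of the leading eigenvalue gives the proportion of variance explained by the first component.
## Toy check: eigen() diagonalizes X'X (sketch, not part of the original post)
set.seed(1)
X=scale(matrix(rnorm(100*3),ncol=3))   ## centered, unit-variance toy data
eig=eigen(crossprod(X))                ## eigenvalues and eigenvectors of X'X
## Reconstruction error of X'X from its eigendecomposition (should be ~0)
max(abs(crossprod(X)-eig$vectors%*%diag(eig$values)%*%t(eig$vectors)))
## Proportion of variance explained by the first component
eig$values[1]/sum(eig$values)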
PCA in R
The implementation in R has three steps:
We center the data and divide each variable by its standard deviation. Our data now comply with the PCA hypotheses.
We diagonalise \(X^\top X\) and store the eigenvectors and eigenvalues.
The cumulative variance is computed and the number of eigenvectors required to reach the variance threshold is stored. We only keep these first eigenvectors.
###PCA
my_pca<-function(x,variance_explained=0.9,center=T,scale=T)
{
my_pca=list()
##Compute the mean of each variable
if (center)
{
my_pca[['center']]=colMeans(x)
}
## Otherwise, we set the mean to 0
else
my_pca[['center']]=rep(0,dim(x)[2])
####Compute the standard dev of each variable
if (scale)
{
my_pca[['std']]=apply(x,2,sd)
}
## Otherwise, we set the sd to 1
else
my_pca[['std']]=rep(1,dim(x)[2])
##Normalization
##Centering
x_std=sweep(x,2,my_pca[['center']])
##Standardization
x_std=x_std%*%diag(1/my_pca[['std']])
##Eigendecomposition of t(x_std) %*% x_std (proportional to the covariance matrix)
eigen_cov=eigen(crossprod(x_std,x_std))
##Computing the cumulative variance
my_pca[['cumulative_variance']] =cumsum(eigen_cov[['values']])
##Number of required components
my_pca[['n_components']] =sum((my_pca[['cumulative_variance']]/sum(eigen_cov[['values']]))<variance_explained)+1
##Selection of the principal components
my_pca[['transform']] =eigen_cov[['vectors']][,1:my_pca[['n_components']]]
attr(my_pca, "class") <- "my_pca"
return(my_pca)
}
Now that we have the transformation matrix, we can project the data onto the new basis.
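Below is a minimal sketch of how such a projection could look (the method name predict.my_pca and the plotting code are my own additions, assuming x is a numeric matrix): we center and scale new data with the stored statistics and multiply by the transformation matrix.
## A possible predict method for my_pca objects (sketch, not part of the original post)
predict.my_pca<-function(object,x,...)
{
  ##Centering with the stored means
  x_std=sweep(x,2,object[['center']])
  ##Scaling with the stored standard deviations
  x_std=x_std%*%diag(1/object[['std']])
  ##Projection on the selected eigenvectors
  x_std%*%object[['transform']]
}
library(ggplot2)
pca_iris=my_pca(as.matrix(iris[,1:4]))
projected=predict(pca_iris,as.matrix(iris[,1:4]))
ggplot(data=iris)+geom_point(aes(x=projected[,1],y=projected[,2],color=Species))+xlab('PC1')+ylab('PC2')+ggtitle('Iris dataset projected on the two main PCs (my_pca)')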
Projection of the Iris dataset on the two main principal components
Comparison with the FactoMineR implementation
We can now compare our implementation with the standard FactoMineR implementation of Principal Component Analysis.
library(FactoMineR)
library(ggplot2)
pca_stats=PCA(as.matrix(iris[,1:4]))
projected_stats=predict(pca_stats,as.matrix(iris[,1:4]))$coord[,1:2]
ggplot(data=iris)+geom_point(aes(x=projected_stats[,1],y=-projected_stats[,2],color=Species))+xlab('PC1')+ylab('PC2')+ggtitle('Iris dataset projected on the two main PCs (FactoMineR)')
When running this, you should get a plot very similar to the previous one, which confirms the sanity of our implementation.
Projection of the Iris dataset using the FactoMineR implementation
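Beyond the visual check, a quick numerical comparison is possible. The sketch below (my own addition, reusing the projected and projected_stats matrices defined above) checks that the two sets of coordinates agree component by component, up to sign, since eigenvectors are only defined up to their sign.
## Correlation between our coordinates and FactoMineR's (should be close to +1 or -1)
cor(projected[,1],projected_stats[,1])
cor(projected[,2],projected_stats[,2])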
Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.