Computing and visualizing LDA in R
As I have described before, Linear Discriminant Analysis (LDA) can be seen from two different angles. The first classifies a given sample of predictors x to the class C_l with the highest posterior probability Pr(C_l | x); this rule minimizes the total probability of misclassification. To compute Pr(C_l | x) it uses Bayes' rule and assumes that x, conditional on the class, follows a Gaussian distribution with a class-specific mean and a common covariance matrix. The second tries to find a linear combination of the predictors that gives maximum separation between the centers of the data while at the same time minimizing the variation within each group of data.
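Written out, the first view relies on the standard LDA form of Bayes' rule (added here for reference; the notation is mine, since the original formulas were not preserved):

$$\Pr(C_l \mid x) = \frac{\pi_l \, f(x \mid C_l)}{\sum_k \pi_k \, f(x \mid C_k)}, \qquad f(x \mid C_l) = \mathcal{N}(x;\ \mu_l,\ \Sigma),$$

where \(\pi_l\) is the prior probability of class \(C_l\), \(\mu_l\) the class-specific mean and \(\Sigma\) the common covariance matrix.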
The second approach [1] is usually preferred in practice due to its dimension-reduction property and is implemented in many R packages, as in the lda function of the MASS package, for example. In what follows, I will show how to use the lda function and visually illustrate the difference between Principal Component Analysis (PCA) and LDA when applied to the same dataset.
Using lda from the MASS R package
As usual, we are going to illustrate lda using the iris dataset. The data contains four continuous variables which correspond to physical measures of flowers and a categorical variable describing the flowers' species.
require(MASS)

# Load data
data(iris)

> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
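Before fitting anything, it can be worth confirming the class balance (a small check of my own, not part of the original post):

# Sanity check: the three species are balanced, with 50 flowers each
table(iris$Species)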
A usual call to lda contains formula, data and prior arguments [2].
r <- lda(formula = Species ~ ., data = iris, prior = c(1,1,1)/3)
The . in the formula argument means that we use all the remaining variables in data as covariates. The prior argument sets the prior probabilities of class membership. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels.
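As a quick illustration of that default behaviour (my own addition, assuming MASS and iris are already loaded): leaving prior out makes lda fall back on the observed class proportions, which for the balanced iris data are again 1/3 each.

# Fit without an explicit prior; the class proportions of the training data are used
r_default <- lda(Species ~ ., data = iris)
r_default$prior   # 1/3 for each species, since iris is balanced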
> r$prior
    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333

> r$counts
    setosa versicolor  virginica
        50         50         50

> r$means
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

> r$scaling
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

> r$svd
[1] 48.642644  4.579983
As we can see above, a call to lda returns the prior probability of each class, the counts for each class in the data, the class-specific means for each covariate, the linear combination coefficients (scaling) for each linear discriminant (remember that in this case, with 3 classes, we have at most two linear discriminants) and the singular values (svd), which give the ratio of the between- and within-group standard deviations on the linear discriminant variables.
prop <- r$svd^2/sum(r$svd^2)

> prop
[1] 0.991212605 0.008787395
We can use the singular values to compute the amount of the between-group variance that is explained by each linear discriminant. In our example we see that the first linear discriminant explains more than 99% of the between-group variance in the iris dataset.
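To make the role of the scaling matrix more concrete, here is a small sketch of how the LD scores can be reproduced by hand. It mirrors what MASS::predict.lda does internally (centering at the prior-weighted average of the group means); treat that centering choice as an assumption of mine rather than something stated in the post.

# Reproduce the discriminant scores from the scaling coefficients
centroid <- colSums(r$prior * r$means)   # prior-weighted average of the group means
scores <- scale(as.matrix(iris[, 1:4]), center = centroid, scale = FALSE) %*% r$scaling
head(scores, 3)                          # LD1 and LD2 scores of the first flowers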
If we call lda with CV = TRUE, it uses leave-one-out cross-validation and returns a named list with components:

class: the Maximum a Posteriori Probability (MAP) classification (a factor)
posterior: posterior probabilities for the classes
r2 <- lda(formula = Species ~ .,
          data = iris,
          prior = c(1,1,1)/3,
          CV = TRUE)

> head(r2$class)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(r2$posterior, 3)
  setosa   versicolor    virginica
1      1 5.087494e-22 4.385241e-42
2      1 9.588256e-18 8.888069e-37
3      1 1.983745e-19 8.606982e-39
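A natural follow-up (my own addition) is the leave-one-out misclassification rate implied by these cross-validated predictions:

# Proportion of flowers misclassified under leave-one-out cross-validation
mean(r2$class != iris$Species)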
There is also a predict method implemented for lda objects. It returns the classification and the posterior probabilities of the new data based on the Linear Discriminant model. Below, I use half of the dataset to train the model and the other half for predictions.
train <- sample(1:150, 75)

r3 <- lda(Species ~ .,          # training model on half of the data
          iris,
          prior = c(1,1,1)/3,
          subset = train)

plda <- predict(object = r3,    # predictions from the model fitted on the training half
                newdata = iris[-train, ])

> head(plda$class)              # classification result
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

> head(plda$posterior, 3)       # posterior prob.
  setosa   versicolor    virginica
3      1 1.463849e-19 4.675932e-39
4      1 1.268536e-16 3.566610e-35
5      1 1.637387e-22 1.082605e-42

> head(plda$x, 3)               # LD projections
       LD1        LD2
3 7.489828 -0.2653845
4 6.813201 -0.6706311
5 8.132309  0.5144625
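To quantify how well the held-out half is classified, one could tabulate the predictions against the true species (a small sketch of my own; the exact numbers depend on the random train/test split):

# Confusion matrix and accuracy on the held-out half
table(predicted = plda$class, observed = iris$Species[-train])
mean(plda$class == iris$Species[-train])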
Visualizing the difference between PCA and LDA
As I have mentioned at the end of my post about Reduced-rank DA, PCA is an unsupervised learning technique (it does not use class information) while LDA is a supervised technique (it uses class information), but both provide the possibility of dimensionality reduction, which is very useful for visualization. Therefore we would expect (by definition) LDA to provide better data separation than PCA, and this is exactly what we see in the Figure below when both LDA (upper panel) and PCA (lower panel) are applied to the iris dataset. The code to generate this Figure is available on github.
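The Figure itself is not reproduced here, but a minimal base-R sketch along the same lines could look as follows. This is my own illustration of the idea, not the code from the github link mentioned above.

# LDA projection (supervised) vs PCA projection (unsupervised) of the iris data
lda_fit  <- lda(Species ~ ., data = iris)
lda_proj <- predict(lda_fit)$x                        # LD1, LD2 scores
pca_fit  <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
pca_proj <- pca_fit$x[, 1:2]                          # PC1, PC2 scores

old_par <- par(mfrow = c(2, 1))
plot(lda_proj, col = iris$Species, pch = 19, main = "LDA projection (LD1 vs LD2)")
plot(pca_proj, col = iris$Species, pch = 19, main = "PCA projection (PC1 vs PC2)")
par(old_par)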
Although this is an easy dataset to work with, it allows us to clearly see that the versicolor species is well separated from the virginica one in the upper panel, while there is still some overlap between them in the lower panel. This kind of difference is to be expected, since PCA tries to retain most of the total variability in the data while LDA tries to retain most of the between-class variance. Note also that in this example the first LD explains more than 99% of the between-group variance in the data, while the first PC explains a smaller share of the total variability (the short sketch below shows how to compute it).
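To check the PCA side of this comparison, the proportion of total variance explained by each principal component can be computed like this (my own addition; note that the exact figures depend on whether the variables are scaled first):

# Proportion of total variance explained by each principal component
pca_fit <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca_fit)$importance["Proportion of Variance", ]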
Closing remarks
Although I have not applied it in my illustrative example above, pre-processing [3] of the data is important for the application of LDA. Users should transform, center and scale the data prior to the application of LDA. It is also useful to remove near-zero-variance predictors (predictors that are almost constant across units). Given that we need to invert the covariance matrix, it is necessary to have fewer predictors than samples. Attention is therefore needed when using cross-validation.
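A hedged sketch of those pre-processing steps in base R (the post does not prescribe a specific implementation, so the threshold and object names below are my own choices, and the variance check is a simplified stand-in for a full near-zero-variance filter):

x <- iris[, 1:4]

# Drop (near-)zero variance predictors; for iris nothing is actually dropped
keep <- apply(x, 2, sd) > 1e-8
x <- x[, keep, drop = FALSE]

# Center and scale the predictors, then fit the LDA model on the transformed data
x_scaled <- scale(x, center = TRUE, scale = TRUE)
r_pre <- lda(x_scaled, grouping = iris$Species, prior = c(1,1,1)/3)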
References:
[1] Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer.
[2] lda (MASS) help file.
[3] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.