Discriminant Analysis for Group Separation in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The term ‘discriminant analysis’ is often used interchangeably to represent two different objectives. These objectives of discriminant analysis are:
- Description of group separation. Linear combinations of variables, known as discriminant functions, of the dependent variables that maximize the separation between the groups are used to identify the relative contribution of the p variables that best predict group membership.
- Prediction of observations to groups using either linear or quadratic discriminant functions, known as LDA and QDA, respectively.
This post will explore the first objective of discriminant analysis with two groups. Future posts will examine the classification and prediction objective of discriminant analysis.
Discriminant Analysis for Two Groups
Discriminant analysis assumes the two samples or populations being compared have the same covariance matrix Σ but distinct mean vectors μ1 and μ2 with p variables. The discriminant function that maximizes the separation of the groups is the linear combination of the p variables. The linear combination denoted z=a′y transforms the observation vectors to a scalar. The discriminant functions thus take the form:
z1i=a′y1i=a1y1i1+a2y1i2+⋯+apy1ipi=1,2,⋯,n1
z2i=a′y2i=a2y2i1+a2y2i2+⋯+apy2ipi=1,2,⋯,n2
To compute the discriminant function coefficients, first find the sample means ˉz1 and ˉz2. The mean can be found by averaging the n values or as a linear combination of the sample mean vector y1, ˉy.
ˉz1=1n1∑n1i=1z1i=a′ˉy1
ˉz2=1n2∑ni=1z2i=a′ˉy2
Where,
ˉy1=∑n1i=1y1in1
ˉy1=∑n2i=1y2in2
The goal is to then find a vector a that maximizes the standardized squared difference (ˉz1–ˉz2)2/s2z. The sample variance s2z is the sample variance of z1,z2,⋯,zn or from the vector a and the sample covariance matrix of the mean vectors y1,y2,⋯,yn, denoted by S.
s2z=∑ni=1(zi–ˉz)2n–1=a′SaThus the standardized squared distance (ˉz1–ˉz2)2/s2z can also be written as the following:
(ˉz1–ˉz2)2s2z=[a′(ˉy1–ˉy2)]2a′Sp1aWhere Sp1 is an unbiased estimator of the covariance matrix Σ. Sp1 is defined as:
Sp1=1n1+n2–2(W1+W2)Where W1 and W2 are defined as matrices of the sample sum of squares and cross products.
W1=∑n1i=1(y1i–ˉy1)(y1i–ˉy1)′=(n1–1)S1
W2=∑n2i=1(y2i–ˉy2)(y2i–ˉy2)′=(n2–1)S2
For Sp1 to exist, n1+n2–2>p must be satisified.
The maximum of the above function is found when a is equivalent or a multiple of the following:
a=S−1p1(ˉy1–ˉy2)Since a can be a multiple of the above, it is not unique; however, its direction is unique. By ‘direction’, it is implied the relative values of the vector a, a1,a2,⋯,ap are unique.
Discriminant Analysis in R
The data we are interested in is four measurements of two different species of flea beetles. All measurements are in micrometers (μm) except for the elytra length which is in units of .01 mm. The data were obtained from the companion FTP site of the book Methods of Multivariate Analysis by Alvin Rencher.
Read the data and give the columns names for reference.
beetles <- read.table('BEETLES.DAT', col.names = c('Measurement.Number', 'Species', 'transverse.groove.dist', 'elytra.length', 'second.antennal.joint.length', 'third.antennal.joint.length'))
The dplyr package will be used for simple data manipulation.
library(dplyr)
Separate the two groups into different data frames.
beetle1 <- filter(beetles, Species == 1)[,3:6] beetle2 <- filter(beetles, Species == 2)[,3:6]
Store the sample size and means of the two groups for later.
n1 <- nrow(beetle1) n2 <- nrow(beetle2) beetle1.means <- apply(beetle1, 2, mean) beetle2.means <- apply(beetle2, 2, mean)
First, Sp1 must be calculated.
w1 <- (n1 - 1) * var(beetle1) w2 <- (n2 - 1) * var(beetle2) sp1 <- 1 / (n1 + n2 - 2) * (w1 + w2)
As mentioned above, the groups are maximally separated when a=S−1p1(ˉy1−ˉy2).
a <- solve(sp1) %*% (beetle1.means - beetle2.means) a ## [,1] ## transverse.groove.dist 0.3452490 ## elytra.length -0.1303878 ## second.antennal.joint.length -0.1064338 ## third.antennal.joint.length -0.1433533
The output of which gives us the linear discriminant function coefficients. However, as noted earlier, the data is not commensurate and therefore needs to be scaled to provide any meaningful interpretation. The linear discriminant analysis coefficients can be standardized by diag(Sp1)1/2a.
diag(sp1)^(1/2) * a ## [,1] ## transverse.groove.dist 4.136640 ## elytra.length -2.500550 ## second.antennal.joint.length -1.157705 ## third.antennal.joint.length -2.067833
Which gives us the following discriminant function:
z=4.137y1−2.501y2−1.158y3−2.068y4The interpretation of the discriminant function can be made in several ways. The most simple is to rank the absolute value of the coefficients and determine contribution based on the order of the coefficients. Another method is to perform a partial F-test to find the significance of the variables.
Judging from our discriminant function, it appears the first measurement is the most significant while the second and third measurements have similar contribution to group separation.
The MASS package contains the function lda()
for performing linear discriminant analysis.
library(MASS)
The lda()
function takes a formula argument.
beet.lda <- lda(Species ~ .-Measurement.Number, data = beetles) beet.lda$scaling ## LD1 ## transverse.groove.dist -0.09327642 ## elytra.length 0.03522706 ## second.antennal.joint.length 0.02875538 ## third.antennal.joint.length 0.03872998
Note the discriminant function coefficients are different than what we computed earlier. This difference is due to another scaling method employed by the lda()
function. Since any multiple of a can be taken as the maximum vector, either vector would suffice as the solution. We can see the coefficients are scaled differently than the a vector we found earlier. Despite this difference in scaling, output of the lda()
function would still provide the same interpretation of the coefficients if they were ordered by their absolute values.
beet.lda$scaling / a ## LD1 ## transverse.groove.dist -0.2701715 ## elytra.length -0.2701715 ## second.antennal.joint.length -0.2701715 ## third.antennal.joint.length -0.2701715
The group separation can be plotted by using the plot()
function from the MASS package. Note since we are only concerned with one discriminant function the plot will be a histogram rather than a scatterplot.
plot(beet.lda)
Summary
This post explored the objective of group separation using discriminant function analysis. By performing and interpreting a discriminant analysis function, one can get a better sense of what contributes the most distinction between the sample groups. As we will see in future posts, the discriminant function can also be used to classify and predict future observations.
References
Rencher, A. (n.d.). Methods of Multivariate Analysis (2nd ed.). Brigham Young University: John Wiley & Sons, Inc.
The post Discriminant Analysis for Group Separation in R appeared first on Aaron Schlegel.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.