Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
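The Mahalanobis distance measures how far a point lies from another point, or from the center of a distribution, in multivariate space, taking the variances of and correlations between the variables into account. It is frequently used to detect outliers in statistical analyses involving several variables.
This tutorial describes how to calculate the Mahalanobis distance in R.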
Mahalanobis Distance in R
Step 1: Create the Dataset
First, we need to create a data frame. We will use a dataset of 20 students that records each student's exam score (score), the number of hours spent studying (hours), a preparation count (prep), and current grade (grade).
data <- data.frame(score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
                   hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
                   prep = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
                   grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79))
head(data)
    score hours prep grade
1    81     7    3    80
2    83     8    4    78
3    92     3    0    80
4    87     1    3    80
5    96     4    5    84
6    73     3    0    85
Step 2: Calculate the Mahalanobis Distance for Each Observation
We can use the mahalanobis() function built into R, which returns the squared Mahalanobis distance of each row of x from center.
Its syntax is as follows:
mahalanobis(x, center, cov)
where:
x: a matrix or data frame of observations, one row per observation
center: the mean vector of the distribution
cov: the covariance matrix of the distribution
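To make the three arguments concrete, the following is a minimal sketch of the quadratic form that mahalanobis() evaluates, (x - center)' cov^-1 (x - center), computed by hand and compared with the built-in result. The variable names (centered, d2_manual, d2_builtin) are illustrative choices and do not come from the original post.

```r
# Student data from Step 1
data <- data.frame(
  score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
  hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
  prep  = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
  grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79)
)

x      <- as.matrix(data)
center <- colMeans(x)   # mean vector
S      <- cov(x)        # sample covariance matrix

# Subtract the mean vector from every row (sweep() subtracts by default)
centered <- sweep(x, 2, center)

# Row-wise quadratic form: (x - center)' S^-1 (x - center)
d2_manual <- rowSums((centered %*% solve(S)) * centered)

# Built-in equivalent
d2_builtin <- mahalanobis(x, center, S)

# TRUE when the two results agree up to floating-point tolerance
all.equal(unname(d2_manual), unname(d2_builtin))
```

Because the result is a squared distance, take sqrt() of it if you need the distance on the original scale.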
Now we can calculate the distance for each observation.
mahalanobis(data, colMeans(data), cov(data))
 [1] 3.3431887 5.7202321 7.3521513 3.1990061 4.2208239
 [6] 3.4181516 3.1017453 2.8156955 1.9605904 5.6692191
[11] 5.3856421 3.5954695 3.9963068 5.9551989 2.4928251
[16] 2.4151973 4.3417003 0.9334786 1.4406139 4.6427634
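A long vector of distances is easier to read once it is sorted. The sketch below, which is an illustration and not part of the original post, ranks the observations from largest to smallest squared distance and adds the unsquared distance alongside. The helper names (d2, ranked) are hypothetical.

```r
# Student data from Step 1
data <- data.frame(
  score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
  hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
  prep  = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
  grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79)
)

# Squared Mahalanobis distance of every row from the column means
d2 <- mahalanobis(data, colMeans(data), cov(data))

# Collect the row index, squared distance, and unsquared distance
ranked <- data.frame(obs = seq_along(d2), d2 = unname(d2), d = sqrt(unname(d2)))

# Sort so that the most distant observations come first
ranked <- ranked[order(ranked$d2, decreasing = TRUE), ]

# Show the three most distant observations
head(ranked, 3)
```

Sorting does not change any values; it only puts the candidates worth inspecting at the top.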
Step 3: Calculate the p-value
Based on the Step 2 result, some of the distances are much higher than others. To identify whether any of the distances are statistically significant, we need to calculate p-values.
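The p-value for each distance is the upper-tail probability of the Chi-Square distribution at that observation's (squared) Mahalanobis distance. This tutorial uses k-1 degrees of freedom, where k is the number of variables; here k = 4, so df = 3. Note that the Chi-Square approximation for a squared Mahalanobis distance over k variables is more commonly stated with k degrees of freedom, so treat the choice of df as an assumption of this walkthrough. First, we add the distances to the data frame.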
data$mahalnobis <- mahalanobis(data, colMeans(data), cov(data))
data
    score hours prep grade mahalnobis
1     81     7    3    80  3.3431887
2     83     8    4    78  5.7202321
3     92     3    0    80  7.3521513
4     87     1    3    80  3.1990061
5     96     4    5    84  4.2208239
6     73     3    0    85  3.4181516
7     68     2    1    88  3.1017453
8     77     5    2    94  2.8156955
9     78     5    1    91  1.9605904
10    97     5    2    95  5.6692191
11    99     2    3    79  5.3856421
12    86     3    5    82  3.5954695
13    84     4    3    95  3.9963068
14    96     8    2    84  5.9551989
15    70     3    2    81  2.4928251
16    80     3    1    93  2.4151973
17    83     7    5    83  4.3417003
18    83     3    3    80  0.9334786
19    73     4    2    89  1.4406139
20    70     1    3    79  4.6427634
Next, we compute the p-value for each distance.
data$pvalue <- pchisq(data$mahalnobis, df=3, lower.tail=FALSE)
data
   score hours prep grade mahalnobis     pvalue
1     81     7    3    80  3.3431887 0.34167668
2     83     8    4    78  5.7202321 0.12604387
3     92     3    0    80  7.3521513 0.06148152
4     87     1    3    80  3.1990061 0.36194826
5     96     4    5    84  4.2208239 0.23858527
6     73     3    0    85  3.4181516 0.33153375
7     68     2    1    88  3.1017453 0.37620253
8     77     5    2    94  2.8156955 0.42092267
9     78     5    1    91  1.9605904 0.58062647
10    97     5    2    95  5.6692191 0.12886057
11    99     2    3    79  5.3856421 0.14564075
12    86     3    5    82  3.5954695 0.30858950
13    84     4    3    95  3.9963068 0.26186321
14    96     8    2    84  5.9551989 0.11381036
15    70     3    2    81  2.4928251 0.47658914
16    80     3    1    93  2.4151973 0.49081192
17    83     7    5    83  4.3417003 0.22685238
18    83     3    3    80  0.9334786 0.81734205
19    73     4    2    89  1.4406139 0.69604281
20    70     1    3    79  4.6427634 0.19990417
In general, an observation whose p-value is less than 0.001 is considered an outlier. In this case, all the p-values are greater than 0.001, so none of the observations is flagged as an outlier.
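The 0.001 rule can also be applied programmatically. The following is a minimal sketch, assuming the same data and the df = k-1 choice used in this tutorial; the names alpha, cutoff, flag_by_p, and flag_by_cutoff are illustrative and not from the original post. It flags outliers in two equivalent ways: by comparing p-values with the threshold, and by comparing the squared distances with the matching Chi-Square critical value from qchisq().

```r
# Student data from Step 1
data <- data.frame(
  score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
  hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
  prep  = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
  grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79)
)

# Squared Mahalanobis distances, computed before any extra columns are added
d2 <- mahalanobis(data, colMeans(data), cov(data))

k     <- ncol(data)  # number of variables
df    <- k - 1       # assumption: the tutorial's choice; df = k is a common alternative
alpha <- 0.001       # significance threshold used in the text

# Upper-tail Chi-Square probability for each squared distance
pvalue <- pchisq(d2, df = df, lower.tail = FALSE)

# Critical value: squared distances above this have p-value below alpha
cutoff <- qchisq(alpha, df = df, lower.tail = FALSE)

# Two equivalent ways to list the row numbers of flagged observations
flag_by_p      <- which(pvalue < alpha)
flag_by_cutoff <- which(d2 > cutoff)

flag_by_p
```

Keeping df and alpha as named variables makes it easy to rerun the check under a different degrees-of-freedom convention or a less strict threshold.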
The post How to Calculate Mahalanobis Distance in R appeared first on finnstats.
