How to Calculate Jaccard Similarity in R

[This article was first published on finnstats », and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Jaccard similarity index measures how similar two sets of data are. It ranges from 0 to 1; the higher the value, the more similar the two sets.

The Jaccard index is a statistic frequently used to compare sets of binary variables. It is the size of the intersection of the two sets divided by the size of their union.

The following formula is used to calculate the Jaccard similarity index:

Jaccard Similarity = (number of observations in both sets) / (number in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|
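
As a quick check of the formula, the sketch below evaluates it by hand with base R's intersect() and union(). The vectors A and B and the variable names are made up for this illustration and are not part of the article's data.

```r
# Two small illustrative sets (hypothetical values)
A <- c(1, 2, 3)
B <- c(2, 3, 4)

n_both   <- length(intersect(A, B))  # values in both sets: 2 and 3
n_either <- length(union(A, B))      # values in either set: 1, 2, 3, 4

J <- n_both / n_either               # 2 / 4
J
```

Two of the four distinct values are shared, so the index is 0.5.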

This article will show you how to use R to calculate Jaccard Similarity between two sets of data.

Jaccard similarity in R

Assume that we have the following two sets of data.

a <- c(1,5,8,10,22,14,15,16,2,7)
b <- c(10,12,13,2,7,9,2,7,23,15)

To determine the Jaccard Similarity between the two sets, we can use the following function.

Define Jaccard Similarity function

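jaccard <- function(a, b) {
    # Number of distinct values that appear in both a and b
    intersection <- length(intersect(a, b))
    # Union size by inclusion-exclusion; assumes neither a nor b repeats a value
    union <- length(a) + length(b) - intersection
    return(intersection / union)
}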

Let’s find the Jaccard Similarity between the two sets

jaccard(a, b)

[1] 0.25

The Jaccard Similarity between the two vectors is 0.25. As mentioned above, the greater the number, the more similar the two data sets are.
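
Note that b repeats the values 2 and 7, and length(b) counts those repeats in the denominator. If the inputs should be treated as true sets, one option is to drop duplicates before counting. The following is a minimal sketch; the name jaccard_set is an assumption for illustration and does not come from the original post.

```r
# Set-based variant: duplicates inside a or b are ignored
jaccard_set <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

a <- c(1,5,8,10,22,14,15,16,2,7)
b <- c(10,12,13,2,7,9,2,7,23,15)

# 4 shared values (10, 2, 7, 15) out of 14 distinct values overall
res <- jaccard_set(a, b)
res
```

On these vectors the deduplicated index is 4/14, roughly 0.286, rather than 0.25. For vectors without repeated values the two versions agree.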

If the two sets share no values, the function will return 0. If the two sets are identical, the function will return 1.

Let's look at two examples.

a <- c(1,5,8,10)
b <- c(11,6,12,13)
jaccard(a, b)

[1] 0

a <- c(1,5,8,10)
b <- c(1,5,8,10)
jaccard(a, b)

[1] 1
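
One boundary case is not covered by these examples: when both inputs are empty, the ratio becomes 0/0, which R evaluates to NaN. The sketch below shows one possible way to handle it. The name jaccard_safe and the choice to return NA are assumptions for illustration; some applications prefer to define the index of two empty sets as 1.

```r
# Variant that returns NA when both inputs are empty (a convention, not a rule)
jaccard_safe <- function(a, b) {
  union_size <- length(union(a, b))
  if (union_size == 0) {
    return(NA_real_)  # index is undefined for two empty sets
  }
  length(intersect(a, b)) / union_size
}

jaccard_safe(numeric(0), numeric(0))         # both empty
jaccard_safe(c(1,5,8,10), c(11,6,12,13))     # disjoint sets
jaccard_safe(c(1,5,8,10), c(1,5,8,10))       # identical sets
```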

The function is also applicable to sets containing strings.

a <- c('potato', 'tomato', 'chips', 'balloon')
b <- c('car', 'chips', 'bird', 'salt')
jaccard(a, b)

[1] 0.1428571

You can also use this function to find the Jaccard distance between two sets, which measures their dissimilarity and is calculated as 1 - Jaccard Similarity.

a <- c(1,5,8,10,22,14,15,16,2,7)
b <- c(10,12,13,2,7,9,2,7,23,15)
1-jaccard(a, b)

[1] 0.75
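
The Jaccard distance can also be read directly as the share of distinct values that belong to only one of the two sets. The sketch below checks that both views agree on a small example; the vectors x and y and all variable names are hypothetical.

```r
# Small illustrative sets (hypothetical values)
x <- c("a", "b", "c", "d")
y <- c("c", "d", "e")

similarity <- length(intersect(x, y)) / length(union(x, y))  # 2 / 5
distance   <- 1 - similarity

# Values found in exactly one of the two sets: "a", "b", "e"
only_one <- union(setdiff(x, y), setdiff(y, x))
distance_direct <- length(only_one) / length(union(x, y))    # 3 / 5

all.equal(distance, distance_direct)
```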

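If you need a matrix of Jaccard values, the vegan package is a good place to start. Its vegdist() function can calculate many other similarity/dissimilarity measures as well. Note that vegdist() returns dissimilarities (distances) between the rows of its input and, by default, treats the values as quantities rather than presence/absence (setting binary = TRUE converts the data to presence/absence first). The output below therefore compares the 10 rows of df pairwise; it is not the set-based Jaccard index between a and b.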

install.packages("vegan")
library(vegan)
a <- c(1,5,8,10,22,14,15,16,2,7)
b <- c(10,12,13,2,7,9,2,7,23,15)
df<-data.frame(a,b)
vegdist(df, method = "jaccard")
          1         2         3         4         5         6         7         8
2  0.3529412                                                                      
3  0.4761905 0.1904762                                                            
4  0.8500000 0.6818182 0.5652174                                                  
5  0.7500000 0.6470588 0.5714286 0.5862069                                        
6  0.5833333 0.4615385 0.3703704 0.4782609 0.3225806                              
7  0.8800000 0.7407407 0.6428571 0.2941176 0.4137931 0.3333333                    
8  0.6923077 0.5714286 0.4827586 0.4782609 0.2068966 0.1600000 0.2608696          
9  0.5600000 0.5000000 0.5161290 0.8787879 0.8000000 0.7027027 0.8947368 0.7692308
10 0.5000000 0.2272727 0.1304348 0.6400000 0.6216216 0.4482759 0.7000000 0.5483871
           9
2           
3           
4           
5           
6           
7           
8           
9           
10 0.4333333
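
To get a matrix of the set-based similarity J(A, B) for several vectors at once, one option that needs no extra package is to apply the formula to every pair with nested sapply() calls. This is a minimal sketch: the helper jaccard_set, the list sets, and the third vector d are assumptions added for illustration.

```r
# Set-based Jaccard similarity (duplicates within a vector are ignored)
jaccard_set <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# A named list of vectors to compare; d is a hypothetical third set
sets <- list(
  a = c(1,5,8,10,22,14,15,16,2,7),
  b = c(10,12,13,2,7,9,2,7,23,15),
  d = c(1,5,8,10)
)

# Pairwise similarity matrix: entry [i, j] is J(sets[[i]], sets[[j]])
sim <- sapply(sets, function(x) sapply(sets, function(y) jaccard_set(x, y)))
round(sim, 3)
```

The result is a symmetric matrix with 1 on the diagonal; subtracting it from 1 gives the corresponding Jaccard distances.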

The post How to Calculate Jaccard Similarity in R appeared first on finnstats.
