Set Operations in R and Python. Useful!
Set operations are super useful when cleaning data or testing scripts. They are a must-have in any analyst’s (data scientist’s/statistician’s/data wizard’s) toolbox. Here is a quick rundown in both R and Python.
Say we have two vectors x and y…
# vector x
x = c(1,2,3,4,5,6)
# vector y
y = c(4,5,6,7,8,9)
What if we ‘combined’ x and y, ignoring any duplicate elements?
# x UNION y
union(x, y)
[1] 1 2 3 4 5 6 7 8 9
What are the common elements in x and y?
# x INTERSECTION y
intersect(x, y)
[1] 4 5 6
What elements feature in x but not in y?
# x members not in y
setdiff(x, y)
[1] 1 2 3
What elements feature in y but not in x?
# y members not in x
setdiff(y, x)
[1] 7 8 9
How might we visualise all this?
# required package
library(VennDiagram)
# plot venn
# area1 and area2 are the sizes of x and y; cross.area is the size of their intersection
draw.pairwise.venn(area1 = 6, area2 = 6, cross.area = 3,
                   category = c('x', 'y'),
                   fill = c('darkred', 'darkgreen'),
                   alpha = rep(0.3, 2),
                   scaled = FALSE)

What about Python? Python has a built-in ‘set’ type that can be created from a list. (The older ‘sets’ module, with its ‘Set’ class, was deprecated in Python 2.6 and removed in Python 3.) A set object has methods that provide the same functionality as the R functions above.
# creating set x
x = set([1,2,3,4,5,6])
# creating set y
y = set([4,5,6,7,8,9])
# x UNION y
x.union(y)
{1, 2, 3, 4, 5, 6, 7, 8, 9}
# x INTERSECTION y
x.intersection(y)
{4, 5, 6}
# x members not in y
x.difference(y)
{1, 2, 3}
# y members not in x
y.difference(x)
{7, 8, 9}
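Python sets also support operator shorthand for the same four operations, which can be handy in quick data-cleaning checks. Here is a minimal sketch of that. The column names in the second half are made up purely for illustration and are not from any real dataset.

```python
# Same sets as in the example above
x = {1, 2, 3, 4, 5, 6}
y = {4, 5, 6, 7, 8, 9}

# Operator forms of the four method calls
assert x | y == x.union(y)         # union
assert x & y == x.intersection(y)  # intersection
assert x - y == x.difference(y)    # in x but not in y
assert y - x == y.difference(x)    # in y but not in x

# Hypothetical data-cleaning check (illustrative column names only)
expected_cols = {"id", "date", "amount"}
actual_cols = {"id", "date", "amt", "notes"}

missing = sorted(expected_cols - actual_cols)     # expected but absent
unexpected = sorted(actual_cols - expected_cols)  # present but not expected

print(missing)     # ['amount']
print(unexpected)  # ['amt', 'notes']
```

Sets are unordered, so sorted() is used here to get a stable list for printing or comparing in a test.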
References:
http://rstudio-pubs-static.s3.amazonaws.com/13301_6641d73cfac741a59c0a851feb99e98b.html
https://docs.python.org/2/library/sets.html
