Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Consider the following dataset, with (only) ten points
x=c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85) y=c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3) plot(x,y,pch=19,cex=2)
We want to get – say – two clusters. Or more specifically, two sets of observations, each of them sharing some similarities.
Since the number of observations is rather small, it is actually possible to get an exhaustive list of all partitions, and to minimize some criteria, such as the within variance. Given a vector with clusters, we compute the within variance using
within_var = function(I){ I0=which(I==0) I1=which(I==1) xbar0=mean(x[I0]) xbar1=mean(x[I1]) ybar0=mean(y[I0]) ybar1=mean(y[I1]) w=sum(I0)*sum( (x[I0]-xbar0)^2+(y[I0]-ybar0)^2 )+ sum(I1)*sum( (x[I1]-xbar1)^2+(y[I1]-ybar1)^2 ) return(c(I,w)) }
Then, to compute all possible partitions, use
base2=function(z,n=10){ Base.b=rep(0,n) ndigits=(floor(logb(z, base=2))+1) for(i in 1:ndigits){ Base.b[ n-i+1]=(z%%2) z=(z%/%2)} return(Base.b)} L=function(x) within_var(base2(x)) S=sapply(1:(2^10),L)
The cluster indices at the mimimum is here
I=S[1:n,which.min(S[n+1,])]
To visualize those clusters, use
cluster_viz = function(indices){ library(RColorBrewer) CL2palette=rev(brewer.pal(n = 9, name = "RdYlBu")) CL2f=CL2palette[c(1,9)] plot(x,y,pch=19,xlab="",ylab="",xlim=0:1,ylim=0:1,cex=2,col=CL2f[1+I]) CL2cl=CL2palette[c(3,7)] I0=which(indices==0) I1=which(indices==1) xbar0=mean(x[I0]) xbar1=mean(x[I1]) ybar0=mean(y[I0]) ybar1=mean(y[I1]) segments(x[I0],y[I0],xbar0,ybar0,col=CL2c[1]) segments(x[I1],y[I1],xbar1,ybar1,col=CL2c[2]) points(xbar0,ybar0,pch=19,cex=1.5,col=CL2c[1]) points(xbar1,ybar1,pch=19,cex=1.5,col=CL2c[2])}
and then, simply
cluster_viz(I)
But that was possible only because
One idea can be to use hierarchical clustering techniques. For instance, with a complete aggregation technique, we get
H=hclust(dist(cbind(x,y)),method="complete") I=cutree(H,k=2)-1 cluster_viz(I)
If we change the aggregation technique, to Ward, we get
H=hclust(dist(cbind(x,y)),method="ward.D") I=cutree(H,k=2)-1 cluster_viz(I)
or, for the single one,
H=hclust(dist(cbind(x,y)),method="single") I=cutree(H,k=2)-1 cluster_viz(I)
Another idea is to use some
km=kmeans(cbind(x,y), centers=2, nstart = 1, algorithm = "Lloyd") I=km$cluster-1 cluster_viz(I)
But if we run it another time, we might have different starting points, and different clusters
km=kmeans(cbind(x,y), centers=2, nstart = 1, algorithm = "Lloyd") I=km$cluster-1 cluster_viz(I)
or
even
A natural strategy is to run it many times, with different starting points, and to keep the one with the smallest within-variance.
But to be honest, I like the idea that drawing different starting point will yield different clusters. Actually, we can search some kind of robustness. For instance, which observations are more likely to belong to the same cluster as the second one (on top) ?
ref=2 M=NULL for(s in 1:10000) { km=kmeans(cbind(x,y), centers=2, nstart = 1, algorithm = "Lloyd") I=km$cluster-1 I=(I[ref]==1)*I+(I[ref]==0)*(1-I) M=rbind(M,I) } M_prop=apply(M,2,mean)
or the fifth one (on the left)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.