Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In order to illustrate hierarchical clustering techniques and k-means, I did borrow François Husson‘s dataset, with monthly average temperature in several French cities.
> temp=read.table( + "http://freakonometrics.free.fr/FR_temp.txt", + header=TRUE,dec=",")
We have 15 cities, with monthly observations
> X=temp[,1:12] > boxplot(X)
Since the variance seems to be rather stable, we will not ‘normalize’ the variables here,
> apply(X,2,sd) Janv Fevr Mars Avri 2.007296 1.868409 1.529083 1.414820 Mai Juin juil Aout 1.504596 1.793507 2.128939 2.011988 Sept Octo Nove Dece 1.848114 1.829988 1.803753 1.958449
In order to get a hierarchical cluster analysis, use for instance
> h <- hclust(dist(X), method = "ward") > plot(h, labels = rownames(X), sub = "")
An alternative is to use
> library(FactoMineR) > h2=HCPC(X) > plot(h2)
Here, we visualise observations with a principal components analysis. We have here also an automatic selection of the number of classes, here 3. We can get the description of the groups using
> h2$desc.ind
or directly
> cah=hclust(dist(X)) > groups.3 <- cutree(cah,3)
We can also visualise those classes by ourselves,
> acp=PCA(X,scale.unit=FALSE) > plot(acp$ind$coord[,1:2],col="white") > text(acp$ind$coord[,1],acp$ind$coord[,2], + rownames(acp$ind$coord),col=groups.3)
It is possible to plot the centroïds of those clusters
> PT=aggregate(acp$ind$coord,list(groups.3),mean) > points(PT$Dim.1,PT$Dim.2,pch=19)
If we add Voroid sets around those centroïds, here we do not see them (actually, we see the point – in the middle – that is exactly at the intersection of the three regions),
> library(tripack) > V <- voronoi.mosaic(PT$Dim.1,PT$Dim.2) > plot(V,add=TRUE)
To visualize those regions, use
> p=function(x,y){ + which.min((PT$Dim.1-x)^2+(PT$Dim.2-y)^2) + } > vx=seq(-10,12,length=251) > vy=seq(-6,8,length=251) > z=outer(vx,vy,Vectorize(p)) > image(vx,vy,z,col=c(rgb(1,0,0,.2), + rgb(0,1,0,.2),rgb(0,0,1,.2))) > CL=c("red","black","blue") > text(acp$ind$coord[,1],acp$ind$coord[,2], + rownames(acp$ind$coord),col=CL[groups.3])
Actually, those three groups (and those three regions) are also the ones we obtain using a k-mean algorithm,
> km=kmeans(acp$ind$coord[,1:2],3) > km K-means clustering with 3 clusters of sizes 3, 7, 5
(etc). But actually, since again we have some spatial data, it is possible to visualize them on a map
> library(maps) > map("france") > points(temp$Long,temp$Lati,col=groups.3,pch=19)
or, to visualize the regions, use e.g.
> library(car) > for(i in 1:3) + dataEllipse(temp$Long[groups.3==i], + temp$Lati[groups.3==i], levels=.7,add=TRUE, + col=i+1,fill=TRUE)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.