bigkmeans also works well for ordinary matrix objects: The biganalytics package
[This article was first published on sfchaos' blog, and kindly contributed to R-bloggers].
bigmemory is an excellent package for handling large matrices in R. "The Bigmemory Project" provides several sister packages: biganalytics for analysis, bigtabulate for tabulation, bigalgebra for linear algebra functionality, and synchronicity for synchronization via mutexes, interprocess communication, and message passing.
biganalytics provides a few functions for analysis: linear regression models, generalized linear models, and clustering. In this post, I would like to focus on clustering, namely the bigkmeans function. There are several k-means algorithms, for example the Hartigan-Wong, Lloyd, Forgy, and MacQueen methods. bigkmeans implements the last of these. The authors say in the manual that bigkmeans also works for ordinary matrix objects. Where does bigkmeans outperform the ordinary kmeans? I decided to run an experiment.
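As a small illustration of the algorithm choice mentioned above, the following sketch runs the four algorithms offered by the base `stats::kmeans` function on the same data. The data are synthetic (two Gaussian blobs, an assumption made only for this example, not the Gisette data), and bigkmeans itself is not used here.

```r
# Synthetic data: two well-separated Gaussian blobs (illustrative only).
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))

# Fit the same data with each algorithm that stats::kmeans offers.
algorithms <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
fits <- lapply(algorithms, function(a) {
  kmeans(x, centers = 2, iter.max = 50, nstart = 5, algorithm = a)
})
names(fits) <- algorithms

# Compare the total within-cluster sum of squares across algorithms.
wss <- sapply(fits, function(f) f$tot.withinss)
print(round(wss, 2))
```

On data this easy the algorithms should land on very similar solutions, so any difference between them in practice is mostly a matter of run time and memory use.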
The "Gisette Data Set" was used. This dataset is well known from the handwritten digit recognition problem and is one of the datasets of the NIPS 2003 feature selection challenge. It contains 13,500 records and 5,000 features.
The experiments were conducted on two conditions:
1. kmeans with data.frame
2. bigkmeans with matrix
Here is the source code:
First of all, load the biganalytics package and set the parameters for the k-means algorithms.
    library(biganalytics)

    # conditions for conducting the k-means algorithm
    size <- c(1000, 3000, 5000, 7500, 10000, 11000)
    centers <- 2
    iter.max <- 50
    nstart <- 100
    algorithm <- "MacQueen"
    nsize <- length(size)
Second, read the dataset as a data.frame object, and also convert it to a matrix object for use in bigkmeans. Please note that the data file was generated by combining "gisette_train.data", "gisette_test.data", and "gisette_valid.data".
    # read data
    gisette.km <- read.table("../data/gisette_all.data", sep="", header=FALSE)
    gisette.bkm <- as.matrix(gisette.km)
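The read-and-convert step can be checked on a tiny inline example. This is only a sketch: the three rows of numbers below are made up for illustration and are not taken from the Gisette files, and the `text` argument of `read.table` stands in for the real file path.

```r
# Made-up whitespace-separated rows (hypothetical values, not real Gisette data).
txt <- "550 0 495 0\n0 0 983 991\n742 0 0 684"

# sep="" splits on any whitespace; header=FALSE yields default names V1, V2, ...
df <- read.table(text = txt, sep = "", header = FALSE)

# Convert to a matrix, the kind of object a matrix-based function expects.
m <- as.matrix(df)

print(class(df))   # the object read.table returns
print(dim(m))      # rows and columns after conversion
print(is.numeric(m))
```

Checking `dim()` and `is.numeric()` after the conversion is a cheap way to catch a stray non-numeric column, which would silently turn the whole matrix into character.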
Third, generate the object for storing the calculation time, and measure the calculation time in the two cases while varying the size of the dataset.
    # generate the object for storing calculation time
    # (one column per condition, so ncol must match the two column names)
    calc.time <- matrix(NA, nrow=nsize, ncol=2,
                        dimnames=list(size, c("kmeans with data.frame",
                                              "bigkmeans with matrix")))

    # measure calculation time
    for (i in seq_len(nsize)) {
      size.i <- size[i]
      gisette.km.i <- gisette.km[1:size.i, ]
      gisette.bkm.i <- gisette.bkm[1:size.i, ]

      # 1. kmeans with data.frame
      cat("1.kmeans with data.frame", "\n")
      calc.time[i, 1] <- system.time(
        kmeans(gisette.km.i, centers, iter.max, nstart, algorithm)
      )[3]  # elapsed time
      rm(gisette.km.i)
      gc()

      # 2. bigkmeans with matrix
      cat("2.bigkmeans with matrix", "\n")
      calc.time[i, 2] <- system.time(
        bigkmeans(gisette.bkm.i, centers, iter.max, nstart)
      )[3]  # elapsed time
      rm(gisette.bkm.i)
      gc()
    }
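The same timing pattern can be tried without the Gisette files or the biganalytics package. The sketch below is a stand-in under stated assumptions: it uses synthetic Gaussian data, much smaller sizes, and base `kmeans` for both columns, so it only isolates the effect of passing a data.frame versus a matrix. It does not reproduce the bigkmeans comparison.

```r
# Synthetic stand-in for the real data (assumption: random values, not Gisette).
set.seed(123)
n <- 2000
p <- 20
x.mat <- matrix(rnorm(n * p), nrow = n, ncol = p)
x.df <- as.data.frame(x.mat)

size <- c(500, 1000, 2000)
conditions <- c("kmeans with data.frame", "kmeans with matrix")

# One column per condition: ncol must equal the number of column names.
calc.time <- matrix(NA_real_, nrow = length(size), ncol = length(conditions),
                    dimnames = list(size, conditions))

for (i in seq_along(size)) {
  rows <- seq_len(size[i])
  # Elapsed time for the data.frame input.
  calc.time[i, 1] <- system.time(
    kmeans(x.df[rows, ], centers = 2, iter.max = 50, nstart = 5,
           algorithm = "MacQueen")
  )["elapsed"]
  # Elapsed time for the matrix input.
  calc.time[i, 2] <- system.time(
    kmeans(x.mat[rows, ], centers = 2, iter.max = 50, nstart = 5,
           algorithm = "MacQueen")
  )["elapsed"]
}

print(calc.time)
```

Because the experiment in this post changes both the function and the input type at once, a run like this can help judge how much of a measured difference comes from the input type alone.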
Finally, plot the result.
    col <- c("blue", "red")
    matplot(size, calc.time, type="l", col=col, lty=1, xlab="N", ylab="time[s]")
    # label the two lines with the condition names
    legend(2000, 6000, legend=colnames(calc.time), col=col, lty=1, cex=0.8)
The result is shown below:
The result clearly shows that bigkmeans is faster than kmeans even for an ordinary matrix object: by a factor of 1.26 at N=5000, 1.39 at N=7500, 1.83 at N=10000, and almost 2 at N=11000.
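Speed-up factors like these are simply the ratio of the two timing columns. The sketch below shows the arithmetic on a matrix shaped like calc.time; the timing values in it are made-up placeholders for illustration only and are not the measurements from this experiment.

```r
# Placeholder timings in seconds (made-up values, NOT the measured results).
calc.time <- matrix(c(10, 20, 40,    # column 1: kmeans with data.frame
                       8, 12, 20),   # column 2: bigkmeans with matrix
                    nrow = 3,
                    dimnames = list(c(1000, 2000, 4000),
                                    c("kmeans with data.frame",
                                      "bigkmeans with matrix")))

# Speed-up of the second condition over the first, per dataset size.
speedup <- calc.time[, 1] / calc.time[, 2]
print(round(speedup, 2))
```

A ratio above 1 means the second column was faster at that size.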
I plan to try datasets with fewer features in the near future.
LINKS:
The Bigmemory Project (vignette)
Big data analysis in R (sorry, in Japanese)