Implementing K-means clustering for Hadoop in R and Java

David Smith

10 years ago

[This article was first published on Revolutions and kindly contributed to R-bloggers.]

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

At the Bay Area R User Group meeting this week, Antonio Piccolboni gave an overview of the design goals and implementation of the RHadoop Project packages that connect Hadoop and R: rhdfs, rhbase and rmr:

(The image above was captured from Antonio's slides.) For me, the most revealing part of the talk was the comparison between implementing the K-means clustering algorithm the "standard" way (using Python, Pig and Java, as shown on slides 8-10) and using just R (with the rmr package, shown on slides 14-15): the R version takes much less code and can be written in a single language. Antonio expands on this example at the RHadoop wiki, which is a great place to start if you're looking to implement big-data statistical models with the rmr package.

RHadoop wiki: Comparison of high level languages for mapreduce: k means
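
For readers who have not seen K-means expressed as a MapReduce job, here is a rough sketch of how one iteration splits into a map step and a reduce step. It is plain, in-memory Python, not the rmr code from Antonio's slides or the wiki, and it does not touch Hadoop at all. The function names (kmeans_map, kmeans_reduce, kmeans_iteration), the toy data and the fixed iteration count are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points assigned to one center."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centers):
    """One in-memory map/shuffle/reduce pass; returns the updated centers."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centers)
        groups[key].append(value)  # stands in for Hadoop's shuffle phase
    # A center that attracted no points is left where it was.
    return [kmeans_reduce(groups[i]) if i in groups else tuple(centers[i])
            for i in range(len(centers))]


# Hypothetical toy data: two obvious clusters in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(5):  # fixed number of iterations, for simplicity
    centers = kmeans_iteration(points, centers)
print(centers)
```

In a real Hadoop job the map and reduce functions would run on separate nodes and the grouping by key would be done by the framework. The appeal of the rmr approach described in the talk is that both functions, plus the driver loop, can be written in the same language.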

