Implementing K-means clustering for Hadoop in R and Java
At the Bay Area R User Group meeting this week, Antonio Piccolboni gave an overview of the design goals and implementation of the RHadoop Project packages that connect Hadoop and R: rhdfs, rhbase and rmr:
(The image above was captured from Antonio's slides.) The most revealing part of the talk for me was the comparison between implementing the K-means clustering algorithm the “standard” way (using Python, Pig and Java, as shown on slides 8-10) and using just R (with the rmr package, shown on slides 14-15): the R version takes much less code and can be written in a single language. Antonio expands on this example at the RHadoop wiki, which is a great place to start if you're looking to implement big-data statistical models with the rmr package.
RHadoop wiki: Comparison of high level languages for mapreduce: k means
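For readers who haven't seen K-means phrased as a MapReduce job, here is a rough sketch of the idea in plain Python. It is not the code from Antonio's slides or from the RHadoop wiki, and it doesn't use Hadoop or rmr at all: the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`) are made up for illustration, and the "shuffle" between map and reduce is just a dictionary in memory. The map step tags each point with its nearest center, and the reduce step averages the points that share a tag.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centers):
    """Map step (illustrative): emit (index of nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step (illustrative): average all points assigned to one center."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_iteration(points, centers):
    """One K-means pass: map every point, group by key, reduce each group."""
    groups = defaultdict(list)  # stands in for the MapReduce shuffle phase
    for p in points:
        key, value = map_point(p, centers)
        groups[key].append(value)
    # Assumption: a center with no assigned points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]


def kmeans(points, centers, iterations=10):
    """Repeat the map/reduce pass a fixed number of times."""
    for _ in range(iterations):
        centers = kmeans_iteration(points, centers)
    return centers
```

Calling `kmeans(points, initial_centers)` with a list of coordinate tuples returns the updated centers. On a real cluster each iteration would be a separate MapReduce job, which is where a framework such as rmr comes in.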