Resampling data in Hadoop with RHadoop
On Revolution Analytics partner Cloudera's blog, Uri Laserson has posted an excellent guide to resampling from a large data set in Hadoop. Resampling is an important step in fitting ensemble models (including random forests and other bagging techniques), and Uri provides a step-by-step guide to implementing resampling methods using RHadoop. He provides the complete map-reduce code in the R language, as well as a useful script for installing RHadoop on a Cloudera instance.
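For readers who want a feel for how resampling with replacement fits the map-reduce pattern, here is a minimal, language-neutral sketch in plain Python. It is not Uri's code and does not use the RHadoop API. It assumes one common approach, in which each record independently draws a Poisson-distributed number of copies for every bootstrap sample, so no mapper needs to know the total size of the data set. The names `poisson`, `map_record`, and `resample` are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import defaultdict


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def map_record(record, num_samples, rate, rng):
    """Mapper: emit (sample_id, record) once per copy drawn for that sample."""
    for sample_id in range(num_samples):
        for _ in range(poisson(rng, rate)):
            yield sample_id, record


def resample(records, num_samples, rate=1.0, seed=0):
    """Run the map step over every record, then group the emitted pairs
    by sample id, which stands in for the shuffle and reduce phases."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for record in records:
        for sample_id, value in map_record(record, num_samples, rate, rng):
            samples[sample_id].append(value)
    return dict(samples)
```

With `rate=1.0`, each resample has about as many rows as the input, though the exact size varies from sample to sample. On a real cluster the grouping would be done by Hadoop's shuffle, and each reducer would fit one model to its sample.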
By the way, if you're new to RHadoop, here's RHadoop creator and project leader Antonio Piccolboni introducing RHadoop at last year's Strata CA conference.
Cloudera blog: How-to: Resample from a Large Data Set in Parallel (with R on Hadoop)