Run R in parallel on a Hadoop cluster with AWS in 15 minutes
If you're looking to apply massively parallel resources to an R problem, the most time-consuming part might not be the computations themselves, but setting up the cluster in the first place. You can use Amazon Web Services to set up the cluster in the cloud, but even that takes some time, especially if you haven't done it before.
Jeffrey Breen created his first AWS cluster this weekend, and in just 15 minutes demonstrated how to use 5 nodes to generate and analyze a billion simulations in R. It was a toy example, sure — estimating pi — but it's a great illustration of how quickly you can set up a parallel computing environment for R. Jeffrey used JD Long's segue package, which works with the Hadoop Streaming service on AWS via Elastic MapReduce. The segue package is still experimental, but even so, this is a great demonstration of applying cloud-based hardware to parallel problems in R.
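To give a flavor of the approach, here is a minimal sketch using segue's documented workflow (`createCluster`, `emrlapply`, `stopCluster`). The worker function name `estimatePiChunk` and the chunk sizes are assumptions for illustration, not Jeffrey's exact code; actually running the cluster portion requires AWS credentials configured for segue and will incur EMR charges.

```r
library(segue)

# Monte Carlo worker: draw n points in the unit square and return
# the fraction landing inside the quarter circle, scaled to estimate pi
estimatePiChunk <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * sum(x^2 + y^2 <= 1) / n
}

# Spin up a 5-node Elastic MapReduce cluster (this takes several minutes,
# and requires your AWS credentials to be set up for segue)
myCluster <- createCluster(numInstances = 5)

# Farm out 1,000 chunks of one million draws each -- a billion draws total.
# emrlapply works like lapply, but each element is processed on the cluster.
estimates <- emrlapply(myCluster, as.list(rep(1e6, 1000)), estimatePiChunk)

# Average the per-chunk estimates, then shut the cluster down
mean(unlist(estimates))
stopCluster(myCluster)
```

Because `emrlapply` mirrors `lapply`, you can develop and test the worker function locally with ordinary `lapply` before paying for cluster time.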
Jeffrey Breen: Abusing Amazon’s Elastic MapReduce Hadoop service… easily, from R