StarCluster and R
StarCluster is a utility for creating and managing
distributed computing clusters hosted on Amazon’s Elastic Compute
Cloud (EC2). StarCluster uses Amazon's EC2 web service to create
and destroy clusters of Linux virtual machines on demand.
StarCluster provides a convenient way to quickly set up a cluster of machines for running data-parallel jobs with a distributed-memory framework.
Install StarCluster using
$ sudo easy_install StarCluster
and then create a configuration file using
$ starcluster help
Add your AWS credentials to the config file and follow the instructions in the StarCluster Quick-Start guide.
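For reference, the relevant parts of the config file (usually ~/.starcluster/config) look roughly like the sketch below. The key names follow the StarCluster documentation, but the values, the mykey key pair, and the smallcluster template name are placeholders you would replace with your own:

[global]
DEFAULT_TEMPLATE = smallcluster

[aws info]
# credentials from your AWS account
AWS_ACCESS_KEY_ID = your-access-key-id
AWS_SECRET_ACCESS_KEY = your-secret-access-key
AWS_USER_ID = your-aws-user-id

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster smallcluster]
# a 10-node cluster, matching the example later in this post
KEYNAME = mykey
CLUSTER_SIZE = 10
NODE_INSTANCE_TYPE = m1.small

With the config in place, $ starcluster start mycluster launches a cluster named mycluster from the default template.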
Once you have StarCluster up and running, you need to install R on all the cluster nodes and any packages you require. I wrote a shell script to automate the process:
#!/bin/zsh
starcluster put $1 starcluster.setup.zsh /home/starcluster.setup.zsh
starcluster put $1 Rpkgs.R /home/Rpkgs.R
numNodes=`starcluster listclusters | grep "Total nodes" | cut -d' ' -f3`
nodes=(`eval echo $(seq -f node%03g 1 $(($numNodes-1)))`)
for node in $nodes; do
  cmd="source /home/starcluster.setup.zsh >& /home/install.log.$node"
  starcluster sshmaster $1 "ssh $node $cmd" &
done
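If you save this driver script as, say, install_r.zsh (the file name here is just for illustration), you call it with the name of your running cluster:

$ zsh install_r.zsh mycluster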
The script takes the name of your cluster as a parameter and pushes
the two helper files to the cluster. It then runs the installation on
the master and every node. It assumes you are running an Ubuntu Server
based StarCluster AMI, which is the default. The first helper script, starcluster.setup.zsh, installs the basic software required:
#!/bin/zsh
echo "deb http://stat.ethz.ch/CRAN/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
gpg --keyserver keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9
gpg -a --export E298A3A825C0D65DFD57CBB651716619E084DAB9 | sudo apt-key add -
apt-get update
apt-get install -y r-base r-base-dev
echo "DONE with Ubuntu package installation on $(hostname -s)."
R CMD BATCH --no-save /home/Rpkgs.R /home/install.Rpkgs.log
echo "DONE with R package installation on $(hostname -s)."
The second script, Rpkgs.R, is just an R script containing the packages you want installed:
install.packages(c("randomForest", "caret", "mboost", "plyr", "glmnet"),
                 repos = "http://cran.cnr.berkeley.edu")
print(paste("DONE with R package installation on ",
            system("hostname -s", intern = TRUE), "."))
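After the driver script finishes, the per-node logs it writes (/home/install.log.<node> and /home/install.Rpkgs.log) should end with the DONE messages above. As an additional spot check, something along these lines should work, assuming starcluster sshnode accepts a remote command the same way sshmaster does:

$ starcluster sshnode mycluster node001 'Rscript -e "library(randomForest)"'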
Once you have everything installed, you can ssh
into your master node and start up R as usual:
$ starcluster sshmaster mycluster
$ R
Since StarCluster has set up all the networking nicely, you can use parLapply from the parallel package to run a task on your cluster without further configuration. Running a data-parallel task on a cluster with 10 nodes is now as easy as this (parLapply is just like lapply, except it distributes the tasks over the cluster):
library("parallel") cluster_names <- paste("node00", 1:9, sep="") cluster_names <- c(cluster_names, "node010") cluster <- makePSOCKcluster(names = cluster_names) output <- parLapply(cluster, some_input, some_function) stopCluster(cluster)
Now you can watch 10 machines working for you. Like!