Lessons Learned from EC2

A week or so ago I had my first experience using someone else’s cluster on Amazon EC2, the Amazon Elastic Compute Cloud. Users set up a virtual computing platform that runs on Amazon’s servers “in the cloud.” Amazon EC2 is not just another cluster: it lets the user create a disk image containing an operating system and all of the software needed for their computations. In my case, the disk image would contain Hadoop, R, Python, and all of the R and Python packages I need for my work. This spares both the user and the provider from worrying about installing or upgrading software and running into compatibility issues.

No subscription is required; users pay only for the resources used during a computing session. Hourly prices are very cheap, but they accrue quickly. Additionally, Amazon charges for pretty much every single thing you can do with an OS: transferring data to/from the cloud per GB, data storage per GB, CPU time per core per hour, etc.
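
To see how those small hourly rates add up, here is a back-of-the-envelope sketch. The rates and quantities are made-up placeholders, not actual Amazon prices; the only point is that per-hour and per-GB charges multiply quickly.

    # Back-of-the-envelope cost accrual for a modest cluster.
    # All rates below are hypothetical placeholders, NOT actual Amazon prices.
    nodes          = 20      # instances in the cluster
    hours          = 24      # wall-clock hours the cluster runs
    rate_per_hour  = 0.10    # assumed $ per instance-hour
    gb_transferred = 50      # GB moved into/out of the cloud
    rate_per_gb    = 0.10    # assumed $ per GB transferred

    compute_cost  = nodes * hours * rate_per_hour
    transfer_cost = gb_transferred * rate_per_gb
    print("compute: $%.2f  transfer: $%.2f  total: $%.2f"
          % (compute_cost, transfer_cost, compute_cost + transfer_cost))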

This is somewhat of a tangent, but EC2 was a brilliant business move in my opinion.

Anyway, life gets a bit more difficult when the EC2 instance you’re working with is not your own. My experience using someone else’s instance was new and short, and I don’t think either party fully appreciated how foreign the system can be to a user outside of the loop. I offer these tips so that others know what to expect when working with someone else’s instance. Some of the points I add in hindsight; they did not come up in my own experience, but they are important anyway. The people I worked with were great and assured me that I wasn’t being an absolute pain in the butt. Most of these points concern Hadoop, but they may apply to other systems as well.

  1. Get an upfront inventory of what is installed and where it is located. /usr/local is a typical place for Hadoop, for example, but it was not there when I looked for it; it was somewhere else. Do not expect things to be in their usual places.
  2. Get an upfront inventory of what data is available to you and where it is located. This is critical: if paths are hardcoded in your code, they will most likely fail. Instead, write code that reads from stdin and writes to stdout wherever possible (see the streaming sketch after this list).
  3. Understand whether your data lives inside the instance, in S3 buckets, or somewhere else. The syntax of your commands changes depending on the answer (also illustrated in the sketch after this list).
  4. Know where to find the logs of processes that write to stderr, such as Hadoop. In my case they were in an odd place, neither /var/log nor HADOOP_HOME/logs.
  5. ASK before you transfer data to or from the instance. Particularly in academia, it is easy to take free, unrestricted bandwidth for granted. That does not apply to Amazon EC2, where transfers are billed per GB.
  6. Monitoring Hadoop progress on EC2 is a pain. With a typical EC2 instance you will not have a static IP; it changes every time the instance is started or restarted. To reach the web interface, you need to establish an SSH tunnel to the instance and then use something like FoxyProxy to map the internal URLs Hadoop spits out to addresses reachable from outside the cloud (a tunnel sketch follows this list). I was on a deadline, and this was pretty frustrating.
  7. Know that any packages you need must be installed not only on the master but on every worker as well. Depending on the size of the cluster, this can take a while. Oh, and it will also require admin rights (see the install loop after this list).
  8. If the main user of the cluster is generous, ask for admin rights so that you can use sudo. Depending on the application, this is not a big deal, because the rights only last while the cluster is running; on a local hardware cluster, admin rights exist until somebody revokes them. Of course, the main user may have a problem granting you admin privileges if lost profit is a possibility. For example, a credit card company probably would not give admin access to one of its employees.
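
For points 2 and 3, here is a minimal sketch of the stdin/stdout style in Python: a word-count mapper and reducer for Hadoop Streaming. The file names, bucket name, and streaming-jar path are hypothetical; the read-from-stdin, write-to-stdout pattern is the point, and the launch comment shows how only the input/output URIs change between HDFS and S3.

    # mapper.py -- reads raw lines from stdin and emits "word<TAB>1" pairs.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            sys.stdout.write("%s\t1\n" % word)

    # reducer.py -- sums the counts for each word; Hadoop delivers the mapper
    # output sorted by key, so equal words arrive on adjacent lines.
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                sys.stdout.write("%s\t%d\n" % (current_word, total))
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        sys.stdout.write("%s\t%d\n" % (current_word, total))

    # Launching the job (illustrative paths; the streaming jar's location
    # varies by installation). Note how only the -input/-output URIs change
    # depending on whether the data lives in HDFS on the instance or in S3:
    #
    #   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
    #       -mapper mapper.py -reducer reducer.py \
    #       -file mapper.py -file reducer.py \
    #       -input /user/hadoop/input -output /user/hadoop/output     # HDFS
    #
    #   ... -input s3n://mybucket/input -output s3n://mybucket/output # S3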
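
For point 6, the tunnel itself is just an SSH SOCKS proxy. The sketch below opens one from Python and is purely illustrative: the key file, login user, master hostname, and port are assumptions, and FoxyProxy (or any SOCKS-aware browser proxy setting) must then be pointed at localhost on the chosen port.

    import subprocess

    # Hypothetical values -- substitute your own key pair and the master
    # node's current public DNS name (it changes when the instance restarts).
    key_file = "mykey.pem"
    master   = "ec2-203-0-113-10.compute-1.amazonaws.com"
    port     = "8157"

    # -N: run no remote command; -D: open a dynamic (SOCKS) forward on port.
    tunnel = subprocess.Popen(["ssh", "-i", key_file, "-N", "-D", port,
                               "hadoop@" + master])

    # With the tunnel up, configure FoxyProxy to use the SOCKS proxy at
    # localhost:8157, then browse to the Hadoop web UIs on the master's
    # internal address (e.g. the JobTracker on port 50030).
    # Call tunnel.terminate() to close the proxy when you are done.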
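
For point 7, one way to avoid installing packages by hand on every node is a small loop over the workers. The hostnames, key file, login user, and package manager below are assumptions about the image, not a prescription.

    import subprocess

    # Hypothetical worker hostnames -- in practice, pull them from the
    # cluster's configuration (e.g. Hadoop's conf/slaves file).
    workers = ["ec2-203-0-113-11.compute-1.amazonaws.com",
               "ec2-203-0-113-12.compute-1.amazonaws.com"]

    # Assumes a Debian/Ubuntu image and an account that is allowed to sudo.
    install_cmd = "sudo apt-get -y install r-base python-numpy"

    for host in workers:
        subprocess.check_call(["ssh", "-i", "mykey.pem",
                               "ubuntu@" + host, install_cmd])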

What are your bits of advice when using someone else’s Amazon EC2 cluster?

Map-Reduce on the Fly

If all you need is Hadoop, your best bet is to use Amazon Elastic MapReduce.

Elastic MapReduce boots Hadoop on EC2 instances without the user having to do it themselves. Your data is read from S3, and the output is written back to S3, which keeps your work organized without your having to worry about where to put it in the filesystem. The user simply writes a data-processing application (a mapper and a reducer) in Hive, Pig, Cascading, Java, Ruby, Perl, Python, PHP, R, C++, etc., and uploads the data and the application code to S3.
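
As a concrete illustration of that workflow, here is a minimal sketch using the boto library’s Elastic MapReduce support. The bucket name, key paths, and region are hypothetical, and the mapper and reducer scripts are assumed to have been uploaded to S3 along with the input data.

    import boto.emr
    from boto.emr.step import StreamingStep

    # Everything the job needs -- code, input, output, logs -- lives in S3.
    conn = boto.emr.connect_to_region("us-east-1")

    step = StreamingStep(name="word count",
                         mapper="s3n://mybucket/code/mapper.py",
                         reducer="s3n://mybucket/code/reducer.py",
                         input="s3n://mybucket/input/",
                         output="s3n://mybucket/output/")

    jobflow_id = conn.run_jobflow(name="my streaming job",
                                  log_uri="s3n://mybucket/logs/",
                                  steps=[step])
    print("Started job flow %s" % jobflow_id)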

Elastic MapReduce also keeps all of the output logs in one nice place for you! When processing is complete, Amazon tears down the instance so you don’t pay for what you don’t use.
