
You can Hadoop it! It’s elastic! Boogie woogie woog-ie!

[This article was first published on Cerebral Mastication » R, and kindly contributed to R-bloggers.]

[Image: This blog's name in Chinese!]

I just came back from the future, and let me be the first to tell you this: learn some Chinese. And more than just cào nǐ niáng (肏你娘), which your friend in grad school told you means “Live happy with many blessings.” Trust me, I’ve been hanging out with Madam Wu, and she told me it doesn’t mean that. It’s an obscene insult involving your mother.

So how did I travel to the future to visit with Madam Wu, you ask? Well, the short answer is Hadoop. Yeah, the cute little elephant. As I have told you before, multicore makes your R code run fast by using wormholes to shoot your results back from the future. Well, Hadoop actually takes you to the future on the back of an elephant, and you can bring your own results back! I couldn’t make this up if I tried, so you know it’s true! And what’s fantastic about all of this is that Hadoop works with R! And Amazon will let you rent a time-traveling elephant through their Elastic MapReduce service! I think Amazon coined the term “Time Travel as a Service,” or TTaaS, generally pronounced “ta-tas” in the industry. If you are a CTO, be sure to use this in your next “vision statement” pitch so everyone will know you’re hip to all this cloud stuff.

So you use R, and you want to travel into the future on the back of an elephant to visit Madam Wu and get your model results back, don’t you? Well, it’s a damn good thing you read this blog, because I’m going to give you the keys to the Wu dynasty and a little 福寿 (fortune and longevity) while we’re at it.

I’ve never had an original thought in my life, so I started with this discussion over at the AMZN E M/R discussion forum. Peter Skomoroch from Data Wrangling gives a very good example with all the data and code provided, so you can run it yourself. Pete’s example really shakes the yáng guǐzi (foreign devils), as we say in the future. In addition, I read the documentation for David Rosenberg’s HadoopStreaming package, which was good for insight, but I didn’t use the package as it’s really focused on the ‘big data’ problem.

[Image: That elephant is so freaking cute!]

Prior to my foray into time travel, I knew that Hadoop could be used to process big text files and do something like rip out all the links and count them. But I thought Hadoop was all about processing big data. I never paid attention to the big Hadoop elephant in the room because I don’t have big data. I have big CPU-hogging models (mostly slow because I don’t code worth a shit). What got me reconsidering my world view was John Myles White’s comment on my earlier multicore post. John encouraged me to look into running my simulations on AMZN’s E M/R service using Hadoop streaming. So instead of giving Hadoop a big fat text file to parse, I just gave it a text file with 10,000 rows, each containing one integer from 1:10,000. Then I refactored my R code to read a line from stdin, trim it down to just the integer, and then run the simulation with that number. When done, I had it serialize the resulting model output and return that to stdout. Hadoop takes care of chopping up the input and pulling the output back together. A sketch of what such a mapper looks like is below.
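To make that concrete, here is a minimal sketch of such a Hadoop streaming mapper in R. It’s an illustration, not my production script: runMySim() is a hypothetical stand-in for whatever model you actually run, and the newline-escaping trick is just one way to keep a serialized object on a single line of stdout.

#! /usr/bin/env Rscript
# Hadoop streaming mapper: one simulation per input line.

runMySim <- function(i) {   # hypothetical stand-in for your CPU-hogging model
  set.seed(i)
  mean(rnorm(1e6))
}

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  i <- as.integer(gsub("[^0-9]", "", line))  # trim the line down to the integer
  result <- runMySim(i)
  # serialize() the result as ASCII, then escape newlines so the whole
  # object fits on one line of stdout (Hadoop streaming is line-oriented)
  out <- gsub("\n", "\\\\n", rawToChar(serialize(result, NULL, ascii = TRUE)))
  cat(i, "\t", out, "\n", sep = "")
}
close(con)

Generating the input file, by the way, is a one-liner: writeLines(as.character(1:10000), "input.txt").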

I learned a few “gotchas” or, as we say in the future, chòu biǎozi (臭婊子), “stinking bitches” (I think that should be plural). I’ll do a whole blog post on gotchas soon, but here are the bullet points:

  1. Your packages have to work in R 2.7.1 (until AMZN upgrades to the next stable version of Debian).
  2. As far as I can tell, all the output has to come out of stdout. So if you want to end up with R objects that you can use for other things, you should get comfortable with the serialize() command and with reading text files back into R, which, as you can see from this question, I am not yet comfortable with. (There’s a read-back sketch after this list.)
  3. There will be multiple instances of R running on every machine, so if they all try to download a package to the same directory, you are going to get file lock errors. One solution is to have each R instance create a package directory that includes the PID of that R instance. That way there’s no possibility of a conflict! Here’s the Hadoop streaming option I use to ship the plyr package to each node (the R side is sketched just below):

-cacheFile s3n://rdata/plyr_0.1.9.tar.gz#plyr_0.1.9.tar.gz
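That -cacheFile option to the Hadoop streaming jar copies the tarball from S3 into each task’s working directory. On the R side, here’s a minimal sketch of the per-PID install; the /tmp library path is my assumption, so adjust to taste:

# Install the cached plyr tarball into a per-process library so that
# concurrent R instances on the same node don't trip over file locks.
pidLib <- paste("/tmp/Rlibs_", Sys.getpid(), sep = "")
dir.create(pidLib, showWarnings = FALSE)
install.packages("plyr_0.1.9.tar.gz", lib = pidLib, repos = NULL)
library(plyr, lib.loc = pidLib)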
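And back to gotcha #2: when the job finishes, the output lands in files named like part-00000. Assuming the mapper used the newline-escaping trick sketched earlier, reading a result back looks roughly like this:

# Each output line is "key <tab> escaped-serialized-object".
lines <- readLines("part-00000")
fields <- strsplit(lines[1], "\t")[[1]]
payload <- gsub("\\\\n", "\n", fields[2])   # undo the newline escaping
result <- unserialize(charToRaw(payload))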

More to come later. I’ve gotta get back to the future.

[Image: You hold the elephant and I'll plug this in.]
