Snowdoop/partools Update

matloff

7 years ago

[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve put together an updated version of my partools package, including Snowdoop, an alternative to MapReduce algorithms. You can download it here, version 1.0.1.

To review: The idea of Snowdoop is to create your own file chunking, rather than having something like Hadoop do it for you, and then using ordinary R coding to perform parallel operations. This avoids the need to deal with new constructs and complicated configuration issues with Hadoop and R interfaces to it.

Major changes are as follows:

There is a k-means clustering example of Snowdoop in the examples/ directory. Among other things, it illustrates the fact that with the Snowdoop approach, one automatically achieves a “caching” effect lacking in Hadoop, trivially by default.
There is a filesort() function, to sort a distributed file, keeping the result in memory in distributed form. I don’t know yet how efficient it will be relative to Hadoop.
There are various new short utility functions, such as filesplit().

Still not on Github yet, but Yihui should be happy that I converted the Snowdoop vignette to use knitr. 🙂

All of this is still preliminary, of course. It remains to be seen to what scale this approach will work well.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.