
Snowdoop, Part II

[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers.]

In my last post, I questioned whether the fancy Big Data processing tools such as Hadoop and Spark are really necessary for us R users.  My argument was that (a) these tools tend to be difficult to install and configure, especially for non-geeks; (b) the tools require learning new computation paradigms and function calls; and (c) one should be able to generally do just as well with plain ol’ R.  I gave a small example of the idea, and promised that more would be forthcoming.  I’ll present one in this posting.

I called my approach Snowdoop for fun, and will stick with that name.  I hastened to explain at the time that although some very short support routines could be turned into a package (see below), Snowdoop is more a concept than a real package.  It’s just a simple idea for attacking problems that are normally handled through Hadoop and the like.

The example I gave last time involved the “Hello World” of Hadoop-dom, a word count.  However, mine simply counted the total number of words in a document, rather than the usual app in which it is reported how many times each individual word appears.  I’ll present the latter case here.

Here is the code:

library(parallel)  # provides clusterCall(), clusterExport(), clusterApply()

# each node executes this function
wordcensus <- function(basename,ndigs) {
 fname <- filechunkname(basename,ndigs)
 words <- scan(fname,what="")
 tapply(words,words,length, simplify=FALSE)
}

# manager 
fullwordcount <- function(cls,basename,ndigs) {
 assignids(cls)
 clusterExport(cls,"filechunkname")
 clusterExport(cls,"addlists")
 counts <- clusterCall(cls,wordcensus,basename,ndigs)
 addlistssum <- function(lst1,lst2)
   addlists(lst1,lst2,sum)
 Reduce(addlistssum,counts)
}

And here are the library functions:

# give each node in the cluster cls an ID number 
assignids <- function(cls) {
 # note that myid will be global
 clusterApply(cls,1:length(cls),
 function(i) myid <<- i)
}

# determine the file name for the chunk to be handled by node myid
filechunkname <- function(basename,ndigs) {
 # left-pad myid with zeros to ndigs digits, e.g. myid 3 with ndigs 2 gives "03"
 n0s <- ndigs - nchar(as.character(myid))
 zeros <- paste(rep('0',n0s),collapse="")
 paste(basename,".",zeros,myid,sep="")
} 

# "add" 2 lists, applying the operation 'add' to elements in common,
# copying non-null others
addlists <- function(lst1,lst2,add) {
 lst <- list()
 for (nm in union(names(lst1),names(lst2))) {
  if (is.null(lst1[[nm]])) lst[[nm]] <- lst2[[nm]]
  else if (is.null(lst2[[nm]])) lst[[nm]] <- lst1[[nm]]
  else lst[[nm]] <- add(lst1[[nm]],lst2[[nm]])
 }
 lst
}
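
To try the idea on a single machine, here is a minimal self-contained sketch that builds two tiny chunk files and counts words on a two-worker local cluster. It is a variant for illustration, not the code above: the chunk index is passed to each worker directly instead of going through assignids(), and the file names, the sample text and the helper name countchunk() are all made up.

```r
library(parallel)

# hypothetical input: two chunk files, demo.1 and demo.2, in a temp directory
base <- file.path(tempdir(), "demo")
writeLines("the cat saw the dog", paste0(base, ".1"))
writeLines("the dog ran", paste0(base, ".2"))

# worker: count the words in chunk i
countchunk <- function(i, base) {
  words <- scan(paste0(base, ".", i), what = "", quiet = TRUE)
  tapply(words, words, length)
}

cls <- makeCluster(2)
counts <- clusterApply(cls, 1:2, countchunk, base)
stopCluster(cls)

# manager: flatten the per-chunk counts into one named vector,
# then sum the entries that share a word
allcounts <- unlist(lapply(counts, function(v) setNames(as.vector(v), names(v))))
total <- tapply(allcounts, names(allcounts), sum)
print(total)
```

In this toy input, "the" appears twice in the first chunk and once in the second, so its merged count is 3.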

All pure R!  No Java, no configuration.  Indeed, it’s worthwhile comparing to the word count example in SparkR, the R interface to Spark.  There we see calls to SparkR functions such as flatMap(), reduceByKey() and collect().  Well, guess what!  The reduceByKey() function is pretty much the same as R’s tried and true tapply(), which wordcensus() uses above.  The collect() function is more or less our Snowdoop library function addlists().  So, again, there is no need to resort to Spark, Hadoop, Java and so on.
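
The reduceByKey() comparison can be made concrete with tapply(), the function used in wordcensus() above. In this small sketch, with made-up keys and values, tapply() groups the values by key and reduces each group with the supplied function, which is the kind of key-wise reduction reduceByKey() performs.

```r
# hypothetical (key, value) pairs, such as a word count would emit
keys <- c("a", "b", "a", "c", "b", "a")
vals <- rep(1, length(keys))

# group vals by key and reduce each group with sum
counts <- tapply(vals, keys, sum)
print(counts)
```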

And as noted last time, in Snowdoop, we can easily keep objects persistent in memory between cluster calls, like Spark but unlike Hadoop.  Consider k-means clustering, for instance.  Each node would keep its data chunk in memory across the iterations (say using R’s <<- operator upon read-in).  The distance computation at each iteration could be done with CRAN’s pdist package, finding distances from the node’s chunk to the current set of centroids.
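
A rough sketch of how such a k-means might look follows. Everything in it is an assumption made for illustration: the simulated data, the names mychunk and kmstep(), the choice of starting centroids and the fixed iteration count. Distances are computed with base R arithmetic rather than pdist, simply to avoid the extra dependency.

```r
library(parallel)

set.seed(1)
# hypothetical data: 50 points near (0,0) and 50 points near (10,10)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

cls <- makeCluster(2)

# send each worker its rows once; <<- stores the chunk in the worker's
# global environment, where it persists across later cluster calls
rowsets <- clusterSplit(cls, 1:nrow(x))
chunks <- lapply(rowsets, function(idx) x[idx, , drop = FALSE])
invisible(clusterApply(cls, chunks, function(chunk) { mychunk <<- chunk; NULL }))

# worker step: assign each row of the resident chunk to its nearest centroid,
# then return per-centroid coordinate sums and point counts
kmstep <- function(centroids) {
  k <- nrow(centroids)
  # squared distance from every row of the chunk to centroid j
  d <- sapply(1:k, function(j) rowSums(sweep(mychunk, 2, centroids[j, ])^2))
  nearest <- apply(d, 1, which.min)
  sums <- matrix(0, k, ncol(mychunk))
  rs <- rowsum(mychunk, nearest)  # one row per centroid that got any points
  sums[as.integer(rownames(rs)), ] <- rs
  list(sums = sums, counts = tabulate(nearest, nbins = k))
}

# manager: combine the partial sums and counts, then update the centroids
centroids <- x[c(1, nrow(x)), ]  # assumed start: first and last data rows
for (iter in 1:5) {              # fixed iteration count, for brevity
  parts <- clusterCall(cls, kmstep, centroids)
  sums <- Reduce(`+`, lapply(parts, `[[`, "sums"))
  counts <- Reduce(`+`, lapply(parts, `[[`, "counts"))
  centroids <- sums / pmax(counts, 1)  # guard against empty clusters
}
stopCluster(cls)
print(centroids)
```

Only the small centroid matrix and the per-centroid sums and counts travel between manager and workers at each iteration; the data chunks themselves are sent once and stay put.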

Again, while the support routines, e.g. addlists() etc. above, plus a few not shown here, could be collected into a package for convenience, Snowdoop is more a method than a formal package.

So, is there a price to be paid for all this convenience and simplicity?  As noted last time, Snowdoop doesn’t have the fault tolerance redundancy of Hadoop/Spark.  Conceivably there may be a performance penalty in applications in which the Hadoop distributed shuffle-sort is key to the algorithm.  Also, I’m not sure anyone has ever tried R’s parallel library with a cluster of hundreds of nodes or more.

But the convenience factor makes Snowdoop highly attractive.  For example, try plugging “rJava install” into Google, and you’ll see that many people have trouble with this package, which is needed for SparkR (especially if the user doesn’t have root privileges on the machine).

