{"id":111258,"date":"2015-12-08T12:13:51","date_gmt":"2015-12-08T18:13:51","guid":{"rendered":"http:\/\/www.r-bloggers.com\/?guid=8dcf8406a0c82c6471143edfaa6b9980"},"modified":"2015-12-06T19:43:35","modified_gmt":"2015-12-07T01:43:35","slug":"fun-with-ddr-using-distributed-data-structures-in-r","status":"publish","type":"post","link":"https:\/\/www.r-bloggers.com\/2015\/12\/fun-with-ddr-using-distributed-data-structures-in-r\/","title":{"rendered":"Fun with ddR: Using Distributed Data Structures in R"},"content":{"rendered":"<div style=\"border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;\">\r\n[This article was first published on <strong><a href=\"http:\/\/blog.revolutionanalytics.com\/2015\/12\/fun-with-ddr-using-distributed-data-structures-in-r.html\"> Revolutions<\/a><\/strong>, and kindly contributed to <a href=\"https:\/\/www.r-bloggers.com\/\" rel=\"nofollow\">R-bloggers<\/a>]. (You can report issues about the content on this page <a href=\"https:\/\/www.r-bloggers.com\/contact-us\/\">here<\/a>)\r\n<hr>Want to share your content on R-bloggers?<a href=\"https:\/\/www.r-bloggers.com\/add-your-blog\/\" rel=\"nofollow\"> click here<\/a> if you have a blog, or <a href=\"http:\/\/r-posts.com\/\" rel=\"nofollow\"> here<\/a> if you don't.\r\n<\/div>\n\n<div><p><strong><em>by Edward Ma and Vishrut Gupta (Hewlett Packard Enterprise)<\/em><\/strong><\/p>\n<p>A few weeks ago, we revealed <a href=\"https:\/\/cran.r-project.org\/web\/packages\/ddR\/index.html\" rel=\"nofollow\" target=\"_blank\"><strong>ddR<\/strong><\/a> (Distributed Data-structures in R), an exciting new project started by R-Core, Hewlett Packard Enterprise, and others that provides a fresh set of computational primitives for distributed and parallel computing in R. 
The package sets the seed for what may become a standardized and easy way to write parallel algorithms in R, regardless of the computational engine of choice.<\/p>\n<p>In designing <strong>ddR<\/strong>, we wanted to keep things simple and familiar. We expose only a small number of new user functions that are very close in semantics and API to their R counterparts. You can read the introductory material about the package <a href=\"http:\/\/community.hpe.com\/t5\/Behind-the-scenes-Labs\/Introducing-Distributed-Data-structures-in-R\/ba-p\/6809433#.Vl4NtnarQQh\" rel=\"nofollow\" target=\"_blank\">here<\/a>. In this post, we show how to use <strong>ddR<\/strong> functions.<\/p>\n<p><strong>Classes dlist, darray, and dframe:<\/strong> These classes are the distributed equivalents of list, matrix, and data.frame, respectively. Keeping their APIs similar to those of the vanilla R classes, we implemented operators and functions that work on these classes in the same way. The example below creates two distributed lists: one containing five 3s and one containing the elements 1 through 5.<\/p>\n<pre>a <- dmapply(function(x) { x }, rep(3,5))\nb <- dlist(1,2,3,4,5,nparts=1L)<\/pre>\n<p>The argument <em>nparts<\/em> specifies the number of partitions to split the resulting <em>dlist<\/em> <strong>b<\/strong> into. For <em>darrays<\/em> and <em>dframes<\/em>, which are two-dimensional, <em>nparts<\/em> also permits a two-element vector, which specifies the two-dimensional partitioning of the output.<\/p>\n<p><strong>Functions dmapply and dlapply:<\/strong> Following R\u2019s functional-programming paradigm, we have created these two functions as the distributed equivalents of R\u2019s <em>mapply<\/em> and <em>lapply<\/em>. 
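<\/p>\n<p>For instance, <em>dlapply<\/em> applies a function to every element of a <em>dlist<\/em>, just as <em>lapply<\/em> does for a list. A minimal sketch (an illustration, not verbatim package output: it assumes ddR is loaded and reuses <strong>b<\/strong> from above; see the user guide for exact signatures):<\/p>\n<pre>library(ddR)\nb <- dlist(1, 2, 3, 4, 5, nparts = 1L)\n# apply a function to every element of the dlist\ndoubled <- dlapply(b, function(x) 2 * x)\ncollect(doubled)   # fetches a regular R list, each element doubled<\/pre>\n<p>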
One can supply any combination of distributed objects and regular R arguments to <em>dmapply<\/em>:<\/p>\n<pre>addThenSubtract <- function(x, y, z) { x + y - z }\nc <- dmapply(addThenSubtract, a, b, MoreArgs = list(z = 2))<\/pre>\n<p><strong>Functions parts and collect:<\/strong> The <em>parts<\/em> construct gives users explicit access to the individual partitions of a distributed object. <em>parts<\/em> is often used in conjunction with <em>dmapply<\/em> to achieve partition-level parallelism. To fetch data, the <em>collect<\/em> function is used. 
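<\/p>\n<p>For example, a partition-level operation can be sketched as follows (a sketch, assuming ddR is loaded; here <strong>b<\/strong> is re-created with five partitions so that each element sits in its own partition):<\/p>\n<pre>b <- dlist(1, 2, 3, 4, 5, nparts = 5L)\n# dmapply over parts(b) runs the function once per partition,\n# not once per element\npartition_sizes <- dmapply(length, parts(b))\ncollect(partition_sizes)   # one result per partition<\/pre>\n<p>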
So, to check our result in <strong>c<\/strong> from the <em>dmapply<\/em> example above, we can do:<\/p>\n<pre>collect(c)\n## [[1]]\n## [1] 2\n##\n## [[2]]\n## [1] 3\n##\n## [[3]]\n## [1] 4\n##\n## [[4]]\n## [1] 5\n##\n## [[5]]\n## [1] 6<\/pre>\n<p>Backends can easily provide custom implementations of <em>dlist<\/em>, <em>darray<\/em>, and <em>dframe<\/em>, as well as of <em>dmapply<\/em>. At a minimum, a backend defines only a couple of new custom classes (extending ddR\u2019s classes) and implementations of a couple of generic functions, including <em>dmapply<\/em>.<\/p>\n<p>With these definitions in place, ddR knows how to dispatch work properly to backends where behaviors differ, while taking care of the rest of the work itself, since most other operations can be defined using just <em>dmapply<\/em>. 
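<\/p>\n<p>The idea can be seen in miniature with plain R (no ddR required): treat a matrix as a list of row-blocks, and a column sum becomes a per-partition map followed by a combine. This is exactly the pattern that a single generic <em>dmapply<\/em> captures:<\/p>\n<pre># a 4x2 matrix stored as two row-block \"partitions\"\nblocks <- list(matrix(1:4, 2, 2), matrix(5:8, 2, 2))\npartial <- lapply(blocks, colSums)   # map: per-partition column sums\ntotal <- Reduce(`+`, partial)        # combine: add the partial results\ntotal   # 14 22, the same as colSums(rbind(blocks[[1]], blocks[[2]]))<\/pre>\n<p>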
For example, <em>colSums<\/em> should automatically work on any <em>darray<\/em> created by a backend that has defined <em>dmapply<\/em>!<\/p>\n<p><strong>Putting it to Work: RandomForest written in ddR<\/strong><\/p>\n<p>In addition to adding new backend drivers for <strong>ddR<\/strong> (e.g., for Spark), part of this initiative is to develop an initial suite of algorithms written in ddR, such that they are portable to all ddR backends. <a href=\"https:\/\/cran.r-project.org\/web\/packages\/randomForest.ddR\/index.html\" rel=\"nofollow\" target=\"_blank\"><strong>RandomForest.ddR<\/strong><\/a> is one such algorithm, now available on CRAN. ddR packages for <a href=\"https:\/\/cran.r-project.org\/web\/packages\/kmeans.ddR\/index.html\" rel=\"nofollow\" target=\"_blank\">K-Means<\/a> and <a href=\"https:\/\/cran.r-project.org\/web\/packages\/glm.ddR\/index.html\" rel=\"nofollow\" target=\"_blank\">GLM<\/a> (generalized linear models) are now also available.<\/p>\n<p>Random Forest is an algorithm that can be parallelized in a very simple way, by asking each worker to create a subset of the trees:<\/p>\n<pre>simple_RF <- function(formula, data, ntree = 500, ..., nparts = 2)\n{\n  execute_randomForest_parallel <- function(ntree, formula, data, inputArgs)\n  {\n    inputArgs$formula <- formula\n    inputArgs$data <- data\n    inputArgs$ntree <- ntree\n    suppressMessages(requireNamespace(\"randomForest\"))\n    model <- do.call(randomForest::randomForest, inputArgs)\n  }\n\n  # each worker builds ceiling(ntree\/nparts) of the trees\n  dmodel <- dmapply(execute_randomForest_parallel,\n                    ntree = rep(ceiling(ntree\/nparts), nparts),\n                    MoreArgs = list(formula = formula,\n                                    data = data, inputArgs = list(...)),\n                    output.type = \"dlist\", nparts = nparts)\n\n  model <- do.call(randomForest::combine, collect(dmodel))\n}\n\nmodel <- simple_RF(Species ~ ., iris)<\/pre>\n<p>The main <strong>dmapply<\/strong> in the code above simply broadcasts all of the objects passed to the function to the workers and calls <em>randomForest<\/em> on each with the same parameters. An important point: even if \u2018data\u2019 is a distributed object, it will still be broadcast in full, because it is listed in <strong><em>MoreArgs<\/em><\/strong>, which accepts a key-value list of either distributed objects or normal R objects.<\/p>\n<p>Here is a sample performance plot of running <strong>randomForest.ddR<\/strong>:<\/p>\n<p><a class=\"asset-img-link\" href=\"http:\/\/revolution-computing.typepad.com\/.a\/6a010534b1db25970b01b8d17fcbc9970c-pi\" style=\"display: inline;\" rel=\"nofollow\" target=\"_blank\"><img alt=\"Blog_post_plot_rf\" border=\"0\" class=\"asset asset-image at-xid-6a010534b1db25970b01b8d17fcbc9970c image-full img-responsive\" src=\"http:\/\/revolution-computing.typepad.com\/.a\/6a010534b1db25970b01b8d17fcbc9970c-800wi\" title=\"Blog_post_plot_rf\"><\/a><\/p>\n<p>We tested the <strong>randomForest.ddR<\/strong> package on a medium-sized dataset to measure the speedup from increasing the number of cores. 
From the graph, it is clear that the speedup is substantial up to 4 cores, and only beyond that does it begin to reach the point of diminishing returns. Since most computers these days have several cores, the <strong>randomForest.ddR<\/strong> package should be helpful for most people. On a single node, the <strong><em>parallel<\/em><\/strong> backend stops at 24 cores, which corresponds to all the cores of the test machine; the <strong><em>DistributedR<\/em><\/strong> backend can continue to scale beyond 24 cores, as it can utilize multiple machines.<\/p>\n<p>To read more about ddR and its semantics, visit our GitHub page <a href=\"https:\/\/github.com\/vertica\/ddR\" rel=\"nofollow\" target=\"_blank\">here<\/a> or read the user guide on <a href=\"https:\/\/cran.r-project.org\/web\/packages\/ddR\/vignettes\/user_guide.html\" rel=\"nofollow\" target=\"_blank\">CRAN<\/a>.<\/p><\/div>\n\n<div style=\"border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;\">\r\n<div style=\"text-align: center;\">To <strong>leave a comment<\/strong> for the author, please follow the link and comment on their blog: <strong><a href=\"http:\/\/blog.revolutionanalytics.com\/2015\/12\/fun-with-ddr-using-distributed-data-structures-in-r.html\"> Revolutions<\/a><\/strong>.<\/div>\r\n<hr \/>\r\n<a href=\"https:\/\/www.r-bloggers.com\/\" rel=\"nofollow\">R-bloggers.com<\/a> offers <strong><a href=\"https:\/\/feedburner.google.com\/fb\/a\/mailverify?uri=RBloggers\" rel=\"nofollow\">daily e-mail updates<\/a><\/strong> about <a title=\"The R Project for Statistical Computing\" href=\"https:\/\/www.r-project.org\/\" rel=\"nofollow\">R<\/a> news and tutorials about <a title=\"R tutorials\" href=\"https:\/\/www.r-bloggers.com\/how-to-learn-r-2\/\" rel=\"nofollow\">learning R<\/a> and many other topics. 
<a title=\"Data science jobs\" href=\"https:\/\/www.r-users.com\/\" rel=\"nofollow\">Click here if you're looking to post or find an R\/data-science job<\/a>.\r\n<\/div>","protected":false},"excerpt":{"rendered":"<p>by Edward Ma and Vishrut Gupta (Hewlett Packard Enterprise) A few weeks ago, we revealed ddR (Distributed Data-structures in R), an exciting new project started by R-Core, Hewlett Packard Enterprise, and others that provides a fresh new set of computational primitives for distributed and parallel computing in R. The package …<\/p>\n","protected":false},"author":106,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[4],"tags":[],"aioseo_notices":[],"jetpack-related-posts":[{"id":109578,"url":"https:\/\/www.r-bloggers.com\/2015\/11\/introducing-distributed-data-structures-in-r\/","url_meta":{"origin":111258,"position":0},"title":"Introducing Distributed Data-structures in R","date":"November 9, 2015","format":false,"excerpt":"\u00a0 [social4i size=\"large\" align=\"float-right\"] \u00a0 Indrajit Roy, Principal Researcher, Hewlett Packard Labs Due to R\u2019s popularity as a data mining tool, many Big Data systems expose an R based interface to users. However, these interfaces are custom, non-standard, and difficult to learn. 
Earlier in the year, we hosted a workshop\u2026","rel":"nofollow","context":"In \"R guest posts\"","img":{"src":"","width":0,"height":0},"classes":[]},{"id":113181,"url":"https:\/\/www.r-bloggers.com\/2016\/01\/the-evolution-of-distributed-programming-in-r\/","url_meta":{"origin":111258,"position":1},"title":"The Evolution of Distributed Programming in R","date":"January 7, 2016","format":false,"excerpt":"By Paulin Shek Both R and distributed programming rank highly on my list of \u201cgood things\u201d, so imagine my delight when two new packages, ddR (https:\/\/github.com\/vertica\/ddR) and multidplyr (https:\/\/github.com\/hadley\/multidplyr), used for distributed programming in R were released in November last \u2026 Continue reading \u2192","rel":"nofollow","context":"In \"R bloggers\"","img":{"src":"","width":0,"height":0},"classes":[]},{"id":207368,"url":"https:\/\/www.r-bloggers.com\/2020\/10\/the-evolution-of-distributed-programming-in-r-2\/","url_meta":{"origin":111258,"position":2},"title":"The Evolution of Distributed Programming in R","date":"October 23, 2020","format":false,"excerpt":"Both R and distributed programming rank highly on my list of \u201cgood things\u201d, so imagine my delight when two new... 
The post The Evolution of Distributed Programming in R appeared first on Mango Solutions.","rel":"nofollow","context":"In \"R bloggers\"","img":{"src":"","width":0,"height":0},"classes":[]}],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/posts\/111258"}],"collection":[{"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/users\/106"}],"replies":[{"embeddable":true,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/comments?post=111258"}],"version-history":[{"count":0,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/posts\/111258\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/media?parent=111258"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/categories?post=111258"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/tags?post=111258"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}