[This article was first published on ipub » R, and kindly contributed to R-bloggers.]
This example is inspired by the examples of the treemap package.
You’ll learn how to:
- convert a data.frame to a data.tree structure
- navigate a tree and locate specific nodes
- use Aggregate and Cumulate
- manipulate an existing tree, e.g. by using the Prune method
- use data.tree in connection with the treemap package
This code builds on version 0.2.4 of the data.tree package, which you can get from CRAN or from GitHub. You will also find this example in the package’s applications vignette.
 
Original treemap Example (to be improved)
The original example, as available in the treemap package documentation, visualises the world population as a treemap.

```r
library(treemap)
data(GNI2010)
treemap(GNI2010,
        index = c("continent", "iso3"),
        vSize = "population",
        vColor = "GNI",
        type = "value")
```
Conversion from data.frame
First, let’s convert the population data into a data.tree structure:

```r
library(data.tree)
# Build one path per row: root/continent/country
GNI2010$pathString <- paste("world", GNI2010$continent, GNI2010$country, sep = "/")
n <- as.Node(GNI2010[,])
print(n, pruneMethod = "dist", limit = 20)
```
```
##                        levelName
## 1  world
## 2   ¦--North America
## 3   ¦   ¦--Aruba
## 4   ¦   ¦--Antigua and Barbuda
## 5   ¦   ¦--Bahamas
## 6   ¦   °--… 30 nodes w/ 0 sub
## 7   ¦--Asia
## 8   ¦   ¦--Afghanistan
## 9   ¦   ¦--United Arab Emirates
## 10  ¦   °--… 45 nodes w/ 0 sub
## 11  ¦--Africa
## 12  ¦   ¦--Angola
## 13  ¦   ¦--Burundi
## 14  ¦   °--… 52 nodes w/ 0 sub
## 15  ¦--Europe
## 16  ¦   ¦--Albania
## 17  ¦   ¦--Austria
## 18  ¦   °--… 41 nodes w/ 0 sub
## 19  ¦--South America
## 20  ¦   ¦--Argentina
## 21  ¦   ¦--Bolivia
## 22  ¦   °--… 10 nodes w/ 0 sub
## 23  °--Oceania
## 24      ¦--American Samoa
## 25      ¦--Australia
## 26      °--… 16 nodes w/ 0 sub
```
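To make the role of the pathString column concrete, here is a minimal base R sketch on a toy data.frame. The rows and the helper variable `levels_of_first` are made up for illustration; only the column names and the `paste()` call mirror the conversion step above.

```r
# Toy data.frame with hypothetical rows; column names mirror GNI2010.
toy <- data.frame(
  continent = c("Europe", "Europe", "Asia"),
  country   = c("Switzerland", "Austria", "Japan"),
  stringsAsFactors = FALSE
)

# One path per row, in the form root/continent/country.
toy$pathString <- paste("world", toy$continent, toy$country, sep = "/")

# Splitting a path on "/" recovers the levels of the tree for that row.
levels_of_first <- strsplit(toy$pathString[1], "/", fixed = TRUE)[[1]]

print(toy$pathString)
print(levels_of_first)
```

Each path element becomes one level of the tree, which is why rows sharing a prefix such as world/Europe end up under the same parent node.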
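We can navigate the tree to look up a specific node, for example the population of Switzerland (code completion helps here, e.g. via CTRL + SPACE):

```r
n$Europe$Switzerland$population
```

```
## [1] 7826
```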
We can also pick a sub-tree, sort it, and print selected attributes:

```r
northAm <- n$`North America`
northAm$Sort("GNI", decreasing = TRUE)
print(northAm, "iso3", "population", "GNI", limit = 12)
```

```
##                       levelName iso3 population   GNI
## 1  North America                             NA    NA
## 2   ¦--United States of America  USA     309349 47340
## 3   ¦--Canada                    CAN      34126 43250
## 4   ¦--Bahamas                   BHS        343 22240
## 5   ¦--Puerto Rico               PRI       3978 15500
## 6   ¦--Trinidad and Tobago       TTO       1341 15380
## 7   ¦--Antigua and Barbuda       ATG         88 13280
## 8   ¦--Saint Kitts and Nevis     KNA         52 11830
## 9   ¦--Mexico                    MEX     113423  8930
## 10  ¦--Panama                    PAN       3517  6970
## 11  ¦--Grenada                   GRD        104  6960
## 12  °--… 23 nodes w/ 0 sub                 NA    NA
```
Aggregate and Cumulate
We now want to aggregate the population. For non-leaves, this will recursively iterate through children, and cache the result in the population field. The main reason why we do this is not to calculate the population of the world, but to store the result via the cacheAttribute.

```r
Aggregate(node = n,
          attribute = "population",
          aggFun = sum,
          cacheAttribute = "population")
```

```
## [1] 6727766
```

Next, we sort each node by population:

```r
n$Sort(attribute = "population", decreasing = TRUE, recursive = TRUE)
```

Finally, we cumulate among siblings, and store the running sum in an attribute called cumPop:

```r
n$Do(function(x) Cumulate(x, "population", sum, "cumPop"))
```

The tree now looks as follows. Note the new attribute cumPop, as well as the sort order:

```r
print(n, "population", "cumPop", pruneMethod = "dist", limit = 20)
```

```
##                           levelName population  cumPop
## 1  world                               6727766 6727766
## 2   ¦--Asia                            4089247 4089247
## 3   ¦   ¦--China                       1338300 1338300
## 4   ¦   ¦--India                       1224615 2562915
## 5   ¦   ¦--Indonesia                    239870 2802785
## 6   ¦   °--… 44 nodes w/ 0 sub              NA      NA
## 7   ¦--Africa                           954502 5043749
## 8   ¦   ¦--Nigeria                      158423  158423
## 9   ¦   ¦--Ethiopia                      82950  241373
## 10  ¦   °--… 52 nodes w/ 0 sub              NA      NA
## 11  ¦--Europe                           714837 5758586
## 12  ¦   ¦--Russian Federation           141750  141750
## 13  ¦   ¦--Germany                       81777  223527
## 14  ¦   °--… 41 nodes w/ 0 sub              NA      NA
## 15  ¦--North America                    540446 6299032
## 16  ¦   ¦--United States of America     309349  309349
## 17  ¦   ¦--Mexico                       113423  422772
## 18  ¦   °--… 31 nodes w/ 0 sub              NA      NA
## 19  ¦--South America                    392162 6691194
## 20  ¦   ¦--Brazil                       194946  194946
## 21  ¦   ¦--Colombia                      46295  241241
## 22  ¦   °--… 10 nodes w/ 0 sub              NA      NA
## 23  °--Oceania                           36572 6727766
## 24      ¦--Australia                     22299   22299
## 25      ¦--Papua New Guinea               6858   29157
## 26      °--… 16 nodes w/ 0 sub              NA      NA
```
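For readers who want to check the cumPop column by hand, the following base R sketch reproduces the sort-then-cumulate logic on a flat data.frame. It does not use data.tree; the population figures are copied from the printed tree, and the data.frame `pop` is only an illustration of what the steps compute among siblings.

```r
# A few leaves taken from the printed tree (Asia and Oceania only).
pop <- data.frame(
  continent  = c("Oceania", "Asia", "Asia", "Oceania", "Asia"),
  country    = c("Papua New Guinea", "India", "China", "Australia", "Indonesia"),
  population = c(6858, 1224615, 1338300, 22299, 239870),
  stringsAsFactors = FALSE
)

# Sort siblings by decreasing population within each continent.
pop <- pop[order(pop$continent, -pop$population), ]

# Running sum among siblings, i.e. restarted for every continent.
pop$cumPop <- ave(pop$population, pop$continent, FUN = cumsum)

print(pop, row.names = FALSE)
```

The running sum restarts for each parent, which is why China and Australia both have a cumPop equal to their own population.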
Prune
The previous steps were done to define our threshold: big countries should be displayed, while small ones should be grouped together. This lets us define a pruning function that allows a maximum of 7 countries per continent. Additionally, within each continent it keeps a country only while the running population total (cumPop) stays below 90% of the continent’s population; the remaining, smaller countries are pruned:

```r
myPruneFun <- function(x, cutoff = 0.9, maxCountries = 7) {
  # Always keep the root and the continents
  if (isNotLeaf(x)) return (TRUE)
  # Keep at most maxCountries countries per continent
  if (x$position > maxCountries) return (FALSE)
  # Keep a country only while the running total is below the cutoff
  return (x$cumPop < (x$parent$population * cutoff))
}
```
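To see what this rule does to the leaves of one continent, here is a small base R sketch. The function name `keep_leaf` and the population figures are hypothetical; the two conditions mirror the leaf branch of the pruning function, applied to a vector of siblings already sorted by decreasing population.

```r
# Vectorised version of the leaf rule (hypothetical helper, not part of data.tree).
keep_leaf <- function(position, cumPop, parentPop, cutoff = 0.9, maxCountries = 7) {
  position <= maxCountries & cumPop < parentPop * cutoff
}

# Made-up continent with four countries, sorted by decreasing population.
population <- c(50, 30, 15, 5)
cumPop     <- cumsum(population)   # 50 80 95 100
parentPop  <- sum(population)      # 100

keep <- keep_leaf(seq_along(population), cumPop, parentPop)
print(data.frame(population, cumPop, keep))

# A tighter country limit overrides the cutoff.
keep_one <- keep_leaf(seq_along(population), cumPop, parentPop, maxCountries = 1)
```

Note that the country whose running total first reaches the cutoff is itself pruned, so the kept countries always add up to less than 90% of the parent.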
We clone the tree. The reason is that data.tree uses reference semantics, and we want to store the original tree, because we might want to play around later with different parameters:
```r
n2 <- Clone(n, pruneFun = myPruneFun)
print(n2$Oceania, "population", pruneMethod = "simple", limit = 20)
```

```
##              levelName population
## 1 Oceania                   36572
## 2  ¦--Australia             22299
## 3  °--Papua New Guinea       6858
```
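Reference semantics can be surprising if you are used to R's usual copy-on-modify behaviour. The sketch below illustrates the idea with a plain base R environment standing in for a node (data.tree nodes are R6 objects, which are built on environments). The variable names are made up, and the copy via `list2env()` is only an analogy for cloning, not how Clone is implemented.

```r
# An environment as a stand-in for a tree node.
node <- new.env()
node$population <- 100

# Assignment does not copy: both names refer to the same object.
alias <- node
alias$population <- 1
print(node$population)   # the change is visible through the original name

# An explicit copy is independent of the original.
copy <- list2env(as.list(node))
copy$population <- 999
print(node$population)   # unchanged by edits to the copy
```

This is why an explicit clone is needed to keep the unpruned tree around for later experiments.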
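Finally, for each continent we add up the population of the pruned countries and attach it as a new "Other" node:

```r
n2$Do(function(x) {
  # Population not covered by the remaining children
  missing <- x$population - sum(sapply(x$children, function(x) x$population))
  other <- x$AddChild("Other")
  other$iso3 <- "OTH"
  other$country <- "Other"
  other$continent <- x$name
  other$GNI <- 0
  other$population <- missing
},
filterFun = function(x) x$level == 2   # continents only
)
```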
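As a quick sanity check of the "Other" arithmetic, the figures printed for Oceania can be plugged into a few lines of base R. The variable names are illustrative; the numbers come from the pruned Oceania sub-tree shown earlier.

```r
# Continent total and the two countries that survived pruning.
continent_population <- 36572
kept <- c(Australia = 22299, `Papua New Guinea` = 6858)

# Population of all pruned countries, assigned to the "Other" node.
other <- continent_population - sum(kept)

# After adding "Other", the children again sum to the continent total.
children <- c(kept, Other = other)
print(children)
print(sum(children) == continent_population)
```

With "Other" in place, the tile sizes in the treemap still add up to the full population of each continent.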
Plotting the treemap
In order to plot the treemap, we need to convert the data.tree structure back to a data.frame:
 
```r
df <- ToDataFrameTable(n2, "iso3", "country", "continent", "population", "GNI")
treemap(df,
        index = c("continent", "iso3"),
        vSize = "population",
        vColor = "GNI",
        type = "value")
```
And here we go: our treemap now shows at most 7 countries per continent, and groups the remaining smaller countries of each continent into a single "Other" tile.

