[This article was first published on ipub » R, and kindly contributed to R-bloggers].
This example is inspired by the examples of the treemap package.
You’ll learn how to
- convert a data.frame to a data.tree structure
- navigate a tree and locate specific nodes
- use Aggregate and Cumulate
- manipulate an existing tree, e.g. by using the Prune method
- use data.tree in connection with the treemap package
This code builds on version 0.2.4 of the data.tree package, which you can get from CRAN or from GitHub. For more posts on data.tree, see here. You will also find this example in the package’s applications vignette.
Original treemap Example (to be improved)
The original example, as available in the treemap package documentation, visualises the world population as a treemap.

    library(treemap)
    data(GNI2010)
    treemap(GNI2010,
            index = c("continent", "iso3"),
            vSize = "population",
            vColor = "GNI",
            type = "value")

There are many countries, so the chart gets cluttered with many very small boxes. In this example, we will limit the number of countries shown, and sum the remaining population in a catch-all country called “Other”.
We use the data.tree package to do this aggregation.
Conversion from data.frame
First, let’s convert the population data into a data.tree structure:

    library(data.tree)
    GNI2010$pathString <- paste("world", GNI2010$continent, GNI2010$country, sep = "/")
    n <- as.Node(GNI2010[,])
    print(n, pruneMethod = "dist", limit = 20)

    ##                         levelName
    ## 1  world
    ## 2   ¦--North America
    ## 3   ¦   ¦--Aruba
    ## 4   ¦   ¦--Antigua and Barbuda
    ## 5   ¦   ¦--Bahamas
    ## 6   ¦   °--... 30 nodes w/ 0 sub
    ## 7   ¦--Asia
    ## 8   ¦   ¦--Afghanistan
    ## 9   ¦   ¦--United Arab Emirates
    ## 10  ¦   °--... 45 nodes w/ 0 sub
    ## 11  ¦--Africa
    ## 12  ¦   ¦--Angola
    ## 13  ¦   ¦--Burundi
    ## 14  ¦   °--... 52 nodes w/ 0 sub
    ## 15  ¦--Europe
    ## 16  ¦   ¦--Albania
    ## 17  ¦   ¦--Austria
    ## 18  ¦   °--... 41 nodes w/ 0 sub
    ## 19  ¦--South America
    ## 20  ¦   ¦--Argentina
    ## 21  ¦   ¦--Bolivia
    ## 22  ¦   °--... 10 nodes w/ 0 sub
    ## 23  °--Oceania
    ## 24      ¦--American Samoa
    ## 25      ¦--Australia
    ## 26      °--... 16 nodes w/ 0 sub

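To see what the conversion does on a smaller scale, here is a minimal sketch with a made-up three-row data.frame (the data and the names toy and tree are hypothetical, not part of GNI2010). It assumes that as.Node picks up the pathString column by default and turns the remaining columns into attributes of the leaves.

```r
library(data.tree)

# Hypothetical toy data: two continents, three countries.
toy <- data.frame(
  continent  = c("A", "A", "B"),
  country    = c("a1", "a2", "b1"),
  population = c(5, 3, 2),
  stringsAsFactors = FALSE
)

# Each row becomes a leaf; the path decides where the leaf hangs in the tree.
toy$pathString <- paste("world", toy$continent, toy$country, sep = "/")
tree <- as.Node(toy)

print(tree, "population")

# The other columns are now attributes of the leaves.
tree$A$a1$population

# Inner nodes (world, A, B) are created from the paths and carry no population yet.
tree$leafCount
```

Note that only the leaves carry a population after the conversion, which is why the inner nodes need to be aggregated later on.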
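We can now navigate the tree to look up a specific node, for example the population of a single country (in RStudio, code completion with CTRL + SPACE helps):

    n$Europe$Switzerland$population

    ## [1] 7826
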
Or we can look at a sub-tree, here North America sorted by GNI:

    northAm <- n$`North America`
    northAm$Sort("GNI", decreasing = TRUE)
    print(northAm, "iso3", "population", "GNI", limit = 12)

    ##                       levelName iso3 population   GNI
    ## 1  North America                             NA    NA
    ## 2   ¦--United States of America  USA     309349 47340
    ## 3   ¦--Canada                    CAN      34126 43250
    ## 4   ¦--Bahamas                   BHS        343 22240
    ## 5   ¦--Puerto Rico               PRI       3978 15500
    ## 6   ¦--Trinidad and Tobago       TTO       1341 15380
    ## 7   ¦--Antigua and Barbuda       ATG         88 13280
    ## 8   ¦--Saint Kitts and Nevis     KNA         52 11830
    ## 9   ¦--Mexico                    MEX     113423  8930
    ## 10  ¦--Panama                    PAN       3517  6970
    ## 11  ¦--Grenada                   GRD        104  6960
    ## 12  °--... 23 nodes w/ 0 sub                 NA    NA

Aggregate and Cumulate
We now want to aggregate the population. For non-leaves, this will recursively iterate through children, and cache the result in the population field. The main reason why we do this is not to calculate the population of the world, but to store the result on each node via the cacheAttribute.

    Aggregate(node = n,
              attribute = "population",
              aggFun = sum,
              cacheAttribute = "population")

    ## [1] 6727766

Next, we sort the children of each node by population:

    n$Sort(attribute = "population", decreasing = TRUE, recursive = TRUE)

Finally, we cumulate among siblings, and store the running sum in an attribute called cumPop:

    n$Do(function(x) Cumulate(x, "population", sum, "cumPop"))

The tree now looks as follows. Note the new attribute cumPop, as well as the sort order:

    print(n, "population", "cumPop", pruneMethod = "dist", limit = 20)

    ##                                levelName population  cumPop
    ## 1  world                                    6727766 6727766
    ## 2   ¦--Asia                                 4089247 4089247
    ## 3   ¦   ¦--China                            1338300 1338300
    ## 4   ¦   ¦--India                            1224615 2562915
    ## 5   ¦   ¦--Indonesia                         239870 2802785
    ## 6   ¦   °--... 44 nodes w/ 0 sub                 NA      NA
    ## 7   ¦--Africa                                954502 5043749
    ## 8   ¦   ¦--Nigeria                           158423  158423
    ## 9   ¦   ¦--Ethiopia                           82950  241373
    ## 10  ¦   °--... 52 nodes w/ 0 sub                 NA      NA
    ## 11  ¦--Europe                                714837 5758586
    ## 12  ¦   ¦--Russian Federation                141750  141750
    ## 13  ¦   ¦--Germany                            81777  223527
    ## 14  ¦   °--... 41 nodes w/ 0 sub                 NA      NA
    ## 15  ¦--North America                         540446 6299032
    ## 16  ¦   ¦--United States of America          309349  309349
    ## 17  ¦   ¦--Mexico                            113423  422772
    ## 18  ¦   °--... 31 nodes w/ 0 sub                 NA      NA
    ## 19  ¦--South America                         392162 6691194
    ## 20  ¦   ¦--Brazil                            194946  194946
    ## 21  ¦   ¦--Colombia                           46295  241241
    ## 22  ¦   °--... 10 nodes w/ 0 sub                 NA      NA
    ## 23  °--Oceania                                36572 6727766
    ## 24      ¦--Australia                          22299   22299
    ## 25      ¦--Papua New Guinea                    6858   29157
    ## 26      °--... 16 nodes w/ 0 sub                 NA      NA

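The calls above use the cacheAttribute argument as it exists in version 0.2.4. To my knowledge, later releases of data.tree no longer accept that argument, and the result is stored with Do instead. The following is a sketch under that assumption, on a made-up toy tree (toy and tree are hypothetical names); check the documentation of your installed version before relying on it.

```r
library(data.tree)

# Hypothetical toy tree: two continents, three countries.
toy <- data.frame(
  pathString = c("world/A/a1", "world/A/a2", "world/B/b1"),
  population = c(3, 5, 2),
  stringsAsFactors = FALSE
)
tree <- as.Node(toy)

# Post-order visits children before their parent, so every parent can
# sum values that are already stored on its children.
tree$Do(function(x) x$population <- Aggregate(x, "population", sum),
        traversal = "post-order")

# Largest siblings first, on every level.
tree$Sort("population", decreasing = TRUE, recursive = TRUE)

# Running sum among siblings; the root has no siblings and is skipped here.
tree$Do(function(x) x$cumPop <- Cumulate(x, "population", sum),
        filterFun = isNotRoot)

print(tree, "population", "cumPop")
```

The order matters: cumPop is a running sum over the siblings up to and including the current node, so it only makes sense after the sort.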
Prune
The previous steps were done to define our threshold: big countries should be displayed, while small ones should be grouped together. This lets us define a pruning function that allows a maximum of 7 countries per continent. Within that limit, a country is kept only as long as the running sum cumPop stays below 90% of its continent’s population; all remaining, smaller countries are pruned:
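
    myPruneFun <- function(x, cutoff = 0.9, maxCountries = 7) {
      # always keep the root and the continents
      if (isNotLeaf(x)) return (TRUE)
      # keep at most maxCountries countries per continent
      if (x$position > maxCountries) return (FALSE)
      # keep a country only while the running sum stays below the cutoff share
      return (x$cumPop < (x$parent$population * cutoff))
    }
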
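To see how this rule plays out, the following base R sketch redoes the arithmetic for Oceania with the two figures printed earlier. The third row is made up for illustration only, and the variable names are hypothetical; the tree itself is not needed here.

```r
# Oceania total and its two largest countries as printed earlier.
continent_pop <- 36572
countries <- data.frame(
  country    = c("Australia", "Papua New Guinea", "Hypothetical"),
  population = c(22299, 6858, 4000),  # last value is invented for illustration
  stringsAsFactors = FALSE
)

cutoff        <- 0.9
max_countries <- 7

# Rows are already sorted by population, largest first.
countries$position <- seq_len(nrow(countries))
countries$cumPop   <- cumsum(countries$population)

# Same two conditions as in the pruning function.
countries$keep <- countries$position <= max_countries &
                  countries$cumPop < continent_pop * cutoff

print(countries)
```

The cutoff is 0.9 * 36572 = 32914.8. The running sums 22299 and 29157 stay below it, so both real countries are kept, while the invented third row pushes the running sum to 33157 and is dropped.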
We clone the tree. The reason is that data.tree uses reference semantics, and we want to store the original tree, because we might want to play around later with different parameters:

    n2 <- Clone(n, pruneFun = myPruneFun)
    print(n2$Oceania, "population", pruneMethod = "simple", limit = 20)

    ##             levelName population
    ## 1 Oceania                  36572
    ## 2  ¦--Australia            22299
    ## 3  °--Papua New Guinea      6858

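If reference semantics are new to you, this small sketch shows why a plain assignment is not enough to preserve the original tree. The node names are arbitrary, and the variables original, alias and copy are only for illustration.

```r
library(data.tree)

original <- Node$new("world")
original$AddChild("Europe")

# A plain assignment only creates a second name for the same node.
alias <- original
alias$AddChild("Asia")
original$count   # 2: the change made through alias is visible here

# Clone creates an independent copy of the tree.
copy <- Clone(original)
copy$AddChild("Africa")
original$count   # still 2
copy$count       # 3
```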
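Finally, we need to sum the countries that we pruned away into a new “Other” node on each continent:

    n2$Do(function(x) {
      # population of the continent not covered by the remaining countries
      missing <- x$population - sum(sapply(x$children, function(x) x$population))
      other <- x$AddChild("Other")
      other$iso3 <- "OTH"
      other$country <- "Other"
      other$continent <- x$name
      other$GNI <- 0
      other$population <- missing
    },
    # level 2 holds the continents
    filterFun = function(x) x$level == 2
    )
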
Plotting the treemap
In order to plot the treemap, we need to convert the data.tree structure back to a data.frame:

    df <- ToDataFrameTable(n2, "iso3", "country", "continent", "population", "GNI")
    treemap(df,
            index = c("continent", "iso3"),
            vSize = "population",
            vColor = "GNI",
            type = "value")

And here we go: our treemap now shows at most 7 countries per continent, and the smaller countries beyond the 90% cutoff are grouped into “Other”.
If you have enjoyed this example, I recommend you read the package’s vignettes, or have a look at the other data.tree posts in this blog.