Working with tree-based hierarchies using data.tree
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Lately I tried to visualize an hierarchy with Tableau Desktop. The problem was that the hierarchy had a variable depth because it was tree-based. Each row had an id and a parent_id. Normally hierarchies in Tableau are defined by pulling some fields together, such as product category, product group and product id.
Handling tree-based hierarchies seems to be a lot more complex. I found a plugin at https://github.com/tableau/extension-hierarchy-navigator-sandboxed but this only works online.
So I asked myself how I can handle this using R. I found the R-package
data.tree
at https://github.com/gluc/data.tree. I want to describe how I use
this package to preprocess my data.
Some tree-based data
Let’s create some tree-based data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
options(tidyverse.quiet = TRUE) library(tidyverse) geography <- tribble( ~id, ~parent_id, ~area, ~some_additional_data, 1, NA, "Europe", "That's a continent", 2, 1, "Germany", "That's a country", 3, 1, "France", "Oh, yeah, another country", 4, 1, "Denmark", "Oh dear, how many contries do you know", 5, 2, "Hesse", "ah, that's called Bundesland", 6, 5, "Wiesbaden", "and that's the capital of Hesse", 7, 2, "Berlin", "that's the capital of Germany and also a Bundesland", 8, 5, "Main-Kinzig-Kreis", "that's a Landkreis", 9, 8, "Gelnhausen", "and the Landkreis does have a capital called Kreisstadt", 10, 3, "Paris", "The city of Love", 11, 2, "Bavaria", "another Bundesland", 12, 11, "Munich", "and its capital" ) geography |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
## # A tibble: 12 x 4 ## id parent_id area some_additional_data ## <dbl> <dbl> <chr> <chr> ## 1 1 NA Europe That's a continent ## 2 2 1 Germany That's a country ## 3 3 1 France Oh, yeah, another country ## 4 4 1 Denmark Oh dear, how many contries do you know ## 5 5 2 Hesse ah, that's called Bundesland ## 6 6 5 Wiesbaden and that's the capital of Hesse ## 7 7 2 Berlin that's the capital of Germany and also a Bun… ## 8 8 5 Main-Kinzig-Kr… that's a Landkreis ## 9 9 8 Gelnhausen and the Landkreis does have a capital called… ## 10 10 3 Paris The city of Love ## 11 11 2 Bavaria another Bundesland ## 12 12 11 Munich and its capital |
The Task
So the first two columns id
and parent_id
define the tree. How can we convert
this format into a format where the hierarchy consists of n
columns with
the highest level in column Level_1
, the second highest in Level_2
and so on?
The code should not depend on a given level of depth.
data.tree
Let’s take a look at data.tree
:
1 2 3 |
library(data.tree) FromDataFrameNetwork(geography) |
1 2 |
## levelName ## 1 NA |
That doesn’t look promising. But wait, let’s replace NA
by 0
and for further
investigation let’s save id
in another column called my_id
:
1 2 3 4 |
geography %>% replace_na(list(parent_id = 0)) %>% mutate(my_id = id) %>% FromDataFrameNetwork() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
## levelName ## 1 0 ## 2 °--1 ## 3 ¦--2 ## 4 ¦ ¦--5 ## 5 ¦ ¦ ¦--6 ## 6 ¦ ¦ °--8 ## 7 ¦ ¦ °--9 ## 8 ¦ ¦--7 ## 9 ¦ °--11 ## 10 ¦ °--12 ## 11 ¦--3 ## 12 ¦ °--10 ## 13 °--4 |
Oh great, that’s the tree structure we had in mind.
Now let’s build a data.frame
:
1 2 3 4 5 |
geography %>% replace_na(list(parent_id = 0)) %>% mutate(my_id = id) %>% FromDataFrameNetwork() %>% ToDataFrameTree("level", "area", "my_id") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
## levelName level area my_id ## 1 0 1 <NA> NA ## 2 °--1 2 Europe 1 ## 3 ¦--2 3 Germany 2 ## 4 ¦ ¦--5 4 Hesse 5 ## 5 ¦ ¦ ¦--6 5 Wiesbaden 6 ## 6 ¦ ¦ °--8 5 Main-Kinzig-Kreis 8 ## 7 ¦ ¦ °--9 6 Gelnhausen 9 ## 8 ¦ ¦--7 4 Berlin 7 ## 9 ¦ °--11 4 Bavaria 11 ## 10 ¦ °--12 5 Munich 12 ## 11 ¦--3 3 France 3 ## 12 ¦ °--10 4 Paris 10 ## 13 °--4 3 Denmark 4 |
Now we have the level of the hierarchy stored in column level
and the value or
name of the level in area
.
Pivoting this data.frame
should do the rest:
1 2 3 4 5 6 7 8 |
geography %>% replace_na(list(parent_id = 0)) %>% mutate(my_id = id) %>% FromDataFrameNetwork() %>% ToDataFrameTree("level", "area", "my_id") %>% select(-levelName) %>% mutate(level = glue::glue("Level_{level}")) %>% pivot_wider(names_from = level, values_from = area) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
## # A tibble: 13 x 7 ## my_id Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 NA <NA> <NA> <NA> <NA> <NA> <NA> ## 2 1 <NA> Europe <NA> <NA> <NA> <NA> ## 3 2 <NA> <NA> Germany <NA> <NA> <NA> ## 4 5 <NA> <NA> <NA> Hesse <NA> <NA> ## 5 6 <NA> <NA> <NA> <NA> Wiesbaden <NA> ## 6 8 <NA> <NA> <NA> <NA> Main-Kinzig-Kreis <NA> ## 7 9 <NA> <NA> <NA> <NA> <NA> Gelnhausen ## 8 7 <NA> <NA> <NA> Berlin <NA> <NA> ## 9 11 <NA> <NA> <NA> Bavaria <NA> <NA> ## 10 12 <NA> <NA> <NA> <NA> Munich <NA> ## 11 3 <NA> <NA> France <NA> <NA> <NA> ## 12 10 <NA> <NA> <NA> Paris <NA> <NA> ## 13 4 <NA> <NA> Denmark <NA> <NA> <NA> |
As we can see the first level Level_1
is always empty but we will need it later.
Let’s solve this by
reducing each level before pivoting.
1 2 3 4 5 6 7 8 9 10 11 |
geopgraphy_data.frame <- geography %>% replace_na(list(parent_id = 0)) %>% mutate(my_id = id) %>% FromDataFrameNetwork() %>% ToDataFrameTree("level", "area", "my_id") %>% select(-levelName) %>% mutate(level = level -1) %>% mutate(level = glue::glue("Level_{level}")) %>% pivot_wider(names_from = level, values_from = area) geopgraphy_data.frame |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
## # A tibble: 13 x 7 ## my_id Level_0 Level_1 Level_2 Level_3 Level_4 Level_5 ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 NA <NA> <NA> <NA> <NA> <NA> <NA> ## 2 1 <NA> Europe <NA> <NA> <NA> <NA> ## 3 2 <NA> <NA> Germany <NA> <NA> <NA> ## 4 5 <NA> <NA> <NA> Hesse <NA> <NA> ## 5 6 <NA> <NA> <NA> <NA> Wiesbaden <NA> ## 6 8 <NA> <NA> <NA> <NA> Main-Kinzig-Kreis <NA> ## 7 9 <NA> <NA> <NA> <NA> <NA> Gelnhausen ## 8 7 <NA> <NA> <NA> Berlin <NA> <NA> ## 9 11 <NA> <NA> <NA> Bavaria <NA> <NA> ## 10 12 <NA> <NA> <NA> <NA> Munich <NA> ## 11 3 <NA> <NA> France <NA> <NA> <NA> ## 12 10 <NA> <NA> <NA> Paris <NA> <NA> ## 13 4 <NA> <NA> Denmark <NA> <NA> <NA> |
That’s nice.
Filling missing values of the hierarchy levels
Now we need to fill the missing value by using the right value of the line above. But wait, we have to group by each level and do this for each level.
That’s a great example for using purrr::reduce
.
We apply a function fill_level
for each level and put the result as
argument of the next call with next level.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
# group by the parent level of the given level_name and fill the level fill_level <- function(data, level_name) { grouping_level_id <- max(as.numeric(str_extract(level_name, "[0-9]+")) - 1, 0) previous_level <- glue::glue("Level_{grouping_level_id}") data %>% group_by(.data[[previous_level]]) %>% fill({{ level_name }}, .direction = "down") } # get all level names level_names <- data.frame(names = names(geopgraphy_data.frame)) %>% filter(grepl("Level_", names)) %>% pull(names) # use reduce to apply fill_level for each level to the result of the previous # result geopgraphy_data.frame <- reduce(level_names, fill_level, .init = geopgraphy_data.frame) %>% select(starts_with("Level_"), everything()) %>% select(-Level_0) geopgraphy_data.frame |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
## # A tibble: 13 x 6 ## # Groups: Level_4 [4] ## Level_1 Level_2 Level_3 Level_4 Level_5 my_id ## <chr> <chr> <chr> <chr> <chr> <dbl> ## 1 <NA> <NA> <NA> <NA> <NA> NA ## 2 Europe <NA> <NA> <NA> <NA> 1 ## 3 Europe Germany <NA> <NA> <NA> 2 ## 4 Europe Germany Hesse <NA> <NA> 5 ## 5 Europe Germany Hesse Wiesbaden <NA> 6 ## 6 Europe Germany Hesse Main-Kinzig-Kreis <NA> 8 ## 7 Europe Germany Hesse Main-Kinzig-Kreis Gelnhausen 9 ## 8 Europe Germany Berlin <NA> <NA> 7 ## 9 Europe Germany Bavaria <NA> <NA> 11 ## 10 Europe Germany Bavaria Munich <NA> 12 ## 11 Europe France <NA> <NA> <NA> 3 ## 12 Europe France Paris <NA> <NA> 10 ## 13 Europe Denmark <NA> <NA> <NA> 4 |
Adding other fancy columns
Now we can add all the other columns our original data has, currently just
some_additional_data
:
1 2 3 |
geopgraphy_data.frame %>% left_join(geography %>% select(-parent_id, -area), by = c('my_id' = 'id')) %>% filter(! is.na(Level_1)) # get rid of this line with all NAs |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
## # A tibble: 12 x 7 ## # Groups: Level_4 [4] ## Level_1 Level_2 Level_3 Level_4 Level_5 my_id some_additional_data ## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> ## 1 Europe <NA> <NA> <NA> <NA> 1 That's a continent ## 2 Europe Germany <NA> <NA> <NA> 2 That's a country ## 3 Europe Germany Hesse <NA> <NA> 5 ah, that's called Bundesla… ## 4 Europe Germany Hesse Wiesbaden <NA> 6 and that's the capital of … ## 5 Europe Germany Hesse Main-Kinz… <NA> 8 that's a Landkreis ## 6 Europe Germany Hesse Main-Kinz… Gelnhau… 9 and the Landkreis does hav… ## 7 Europe Germany Berlin <NA> <NA> 7 that's the capital of Germ… ## 8 Europe Germany Bavaria <NA> <NA> 11 another Bundesland ## 9 Europe Germany Bavaria Munich <NA> 12 and its capital ## 10 Europe France <NA> <NA> <NA> 3 Oh, yeah, another country ## 11 Europe France Paris <NA> <NA> 10 The city of Love ## 12 Europe Denmark <NA> <NA> <NA> 4 Oh dear, how many contries… |
So now we have a data.frame
we can use in Tableau. Our script doesn’t depend
on the overall depth of the hierarchy or the depth of one special branch.
Maybe there’s a simpler way to build this data.frame
. If so, let me know, please. 🙂
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.