When you work on bigger R projects, there comes a point when you may lose track of how your functions are connected. Or even worse: you inherit a large project and have to figure out what is actually happening! One possible remedy for this problem is a flowchart.
If you started your project with a flowchart, good for you. If you did not, drawing one afterwards can be tedious. Since it is a fairly repetitive task, I had the idea that it could be done automatically by a function. Let's try it on a simple example, which you can find on our git!
What needs to be done?
The goal of a flowchart is to visualise the connections between the user-defined functions (UDFs). First, I scan all project scripts and add all UDFs to one big list. Second, I search all scripts for the function names to find the dependencies. In R, functions are defined by `<- function(){}` or similar, which makes this a simple search. But sometimes you define a function within a function like this:
```r
foo_01 <- function(y) {
  print("Start foo_01")

  # define sub functions
  foo_02 <- function(x) {
    if (x < 100) {
      print(x)
    } else {
      print("over 100")
    }

    foo_03 <- function(x) {
      print(2 * x)
    }

    sapply(1:5, foo_03)
  }

  # main part of foo_01
  foo_02(x = y)
  foo_02(x = 10 * y)
}
```
In this toy example I define a function `foo_01`, in which a second function `foo_02` is defined and called. However, within `foo_02` itself another function, namely `foo_03`, is defined and applied. This pattern is quite common in R functions and is a possible pitfall to look out for.
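The definition search described above can be sketched roughly as follows. This is a minimal illustration and not the code from the post: the regular expression `def_pattern` and the shortened `script` vector are assumptions made for the example. Note that the search finds all three names but says nothing yet about which function is nested in which.

```r
# Shortened toy script, one element per line (as readLines() would return it)
script <- c(
  'foo_01 <- function(y) {',
  '  foo_02 <- function(x){',
  '    foo_03 <- function(x){ print(2 * x) }',
  '    sapply(1:5, foo_03)',
  '  }',
  '  foo_02(x = y)',
  '}'
)

# Hypothetical pattern: a name, then `<-` or `=`, then the keyword `function(`
def_pattern <- "^\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*(<-|=)\\s*function\\s*\\("

# Lines that contain a definition, and the name defined on each of them
def_lines <- grep(def_pattern, script, perl = TRUE)
udf_names <- sub(paste0(def_pattern, ".*$"), "\\1", script[def_lines], perl = TRUE)

udf_names
```

A pattern like this only catches definitions that sit on a single line and start the line.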
To tackle this, I need to separate the sub functions from the main functions to get the right connections. Otherwise I would find a direct connection between `foo_01` and `foo_03` instead of one from `foo_01` to `foo_02` and then on to `foo_03`. My solution to this problem is to count the curly brackets `{}` with respect to their nesting level and thus find the blocks of functions. In the toy example I would get an index like this for the curly brackets:

[Figure: nesting-level index of the curly brackets in the toy example]
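The bracket counting can be sketched in a few lines of base R. This is only one possible way to do it and not necessarily how the original function works. It assumes that no curly brackets appear inside strings or comments, and the variable names are made up for the example.

```r
# Toy example collapsed into one string (comments removed)
code <- 'foo_01 <- function(y) { print("Start foo_01") foo_02 <- function(x){ if (x < 100) { print(x) } else { print("over 100") } foo_03 <- function(x){ print(2 * x) } sapply(1:5, foo_03) } foo_02(x = y) foo_02(x = 10 * y) }'

chars <- strsplit(code, "")[[1]]
pos   <- which(chars %in% c("{", "}"))   # character positions of all brackets
brace <- chars[pos]

# Running depth: +1 for every opening bracket, -1 for every closing one
depth_after <- cumsum(ifelse(brace == "{", 1L, -1L))

# An opening bracket gets the depth it creates,
# a closing bracket gets the depth it closes
level <- ifelse(brace == "{", depth_after, depth_after + 1L)

data.frame(pos, brace, level)

# The block of foo_02 starts at the second bracket and ends at the
# next closing bracket on the same level
open_idx  <- 2
close_idx <- which(brace == "}" & level == level[open_idx] &
                     seq_along(brace) > open_idx)[1]
block <- substr(code, pos[open_idx], pos[close_idx])
```

The opening and closing bracket of a function share the same level, so a block is found by pairing an opening bracket with the next closing bracket on that level.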
Once I find the block, I can just remove it from the main function and add it to my list of functions. With this big list, I can search for the function calls and get a connection matrix, in which each row is the calling function and each column the function being called:
| | foo_01 | foo_02 | foo_03 |
|---|---|---|---|
| foo_01 | 0 | 2 | 0 |
| foo_02 | 0 | 0 | 1 |
| foo_03 | 0 | 0 | 0 |
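A matrix like this can be produced with a plain string search over the separated function bodies. The sketch below assumes the bodies have already been split up and stored in a named character vector; `bodies` and `count_hits` are hypothetical names, and the bodies are simplified versions of the toy example.

```r
# Function bodies after the nested blocks have been removed.
# The print("Start foo_01") line is left out, because a plain string
# search would also count the name inside that string literal.
bodies <- c(
  foo_01 = 'foo_02(x = y); foo_02(x = 10 * y)',
  foo_02 = 'if (x < 100) print(x) else print("over 100"); sapply(1:5, foo_03)',
  foo_03 = 'print(2 * x)'
)

# Count how often a name occurs in a piece of text (pure string search)
count_hits <- function(name, text) {
  hits <- gregexpr(name, text, fixed = TRUE)[[1]]
  sum(hits > 0)   # gregexpr() returns -1 when there is no match
}

# Rows: calling function, columns: called function
fun_names <- names(bodies)
conn <- sapply(fun_names, function(callee) {
  sapply(bodies, count_hits, name = callee)
})

conn
```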
Since I evaluate everything only as a string, I do not get the right number of calls to `foo_03` made through `sapply`: the matrix shows one occurrence, although `sapply(1:5, foo_03)` calls the function five times. In this toy example it might be possible to get the right number, but in a real project the call might be much more complex. Furthermore, my function removes empty lines and comments, just to make things a bit tidier.
A different way to store the network is to use two data sets: one for the nodes and one for the edges. This way, additional information (e.g. size, weights, labels, groupings) can be included, and one can visualise a sophisticated project flowchart with it. One example is the `igraph` network further below.
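Converting the connection matrix into a node table and an edge table takes only a few lines of base R. The column names and the extra node attribute `n_calls_out` are made up for this sketch. They stand in for the kind of additional information mentioned above.

```r
# Connection matrix from the toy example (rows call columns)
fun_names <- c("foo_01", "foo_02", "foo_03")
conn <- matrix(
  c(0, 2, 0,
    0, 0, 1,
    0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(fun_names, fun_names)
)

# Row and column index of every non-zero entry
idx <- which(conn > 0, arr.ind = TRUE)

# One row per connection, with the number of occurrences as weight
edges <- data.frame(
  from   = rownames(conn)[idx[, 1]],
  to     = colnames(conn)[idx[, 2]],
  weight = conn[idx],
  stringsAsFactors = FALSE
)

# One row per function, plus an example attribute
nodes <- data.frame(
  name        = fun_names,
  n_calls_out = unname(rowSums(conn)),
  stringsAsFactors = FALSE
)

edges
nodes
```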
Plotting the flowchart
There are different ways and packages to plot these kinds of flowcharts or networks in R. Popular network packages are `ggnet`, `visNetwork`, `threejs`, `networkD3` and `igraph`. My first try is based on some test scripts and functions and is plotted with `ggnet`. I get the following graph:

[Figure: first version of the function network, plotted with ggnet]
To be honest, it’s not the prettiest graph, but it gets the job done and we can see the connections! Of course, there are a lot more ways to tweak and improve the plots with additional information. Some ideas come to mind:
- Add color to symbolise the folder structure.
- Vary the point size by the number of lines in a function.
- Adjust the line width by the number of calls.
- Make them interactive.
Among others, the `threejs` package makes it possible to build interactive plots. They are fun to play around with, but it is very hard to make them suitable for a project description. For instance, a 3D network is hard to read when its nodes and edges overlap.
So, I adjusted my network function a bit to include some of the ideas from above. I also had to change my underlying test scripts and functions, because there are some more special cases that I want to test and debug. With all these improvements, this is my second version of the network:

[Figure: second version of the function network]

I also made one with `igraph`, in which recursive functions can be plotted.
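To illustrate how a recursive function could show up in such a graph, here is a small sketch with `igraph`. It is not the plotting code behind the figures in this post. The function `foo_04`, which calls itself, is a hypothetical addition to the toy example, and the recursion appears as a non-zero entry on the diagonal of the connection matrix.

```r
# Connection matrix with a hypothetical recursive function foo_04
fun_names <- c("foo_01", "foo_02", "foo_03", "foo_04")
conn <- matrix(
  c(0, 2, 0, 0,
    0, 0, 1, 0,
    0, 0, 0, 0,
    0, 0, 0, 1),   # foo_04 calls itself
  nrow = 4, byrow = TRUE,
  dimnames = list(fun_names, fun_names)
)

if (requireNamespace("igraph", quietly = TRUE)) {
  # Directed graph; the matrix entries become the edge attribute "weight"
  g <- igraph::graph_from_adjacency_matrix(conn, mode = "directed", weighted = TRUE)

  # Edges that start and end at the same node are the recursive calls
  is_recursive <- igraph::which_loop(g)

  # Draw only in an interactive session; the edge width follows the call count
  if (interactive()) {
    plot(g, edge.width = igraph::E(g)$weight, edge.arrow.size = 0.5)
  }
}
```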
Features and problems
So far, I am pleased with the result, but it still lacks some features that would make it a robust and sophisticated function. I hope to address the following points in the near future.
- The function only looks for function calls but not for sourced scripts.
- If the name of a function is contained in another one (e.g. `foo_01` and `foo_01b`), the dependencies are not correct, because I just do a string search.
- Line breaks between `<-` and the function name also slip through my search.
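One possible way around the second point is to match only whole names instead of substrings. The sketch below is a suggestion and not part of the original function. It uses a Perl-style regular expression with lookarounds, and the helper names `count_fixed` and `count_whole` are made up for the example.

```r
body_text <- "a <- foo_01(1); b <- foo_01b(2)"

# Plain string search: also finds foo_01 inside foo_01b
count_fixed <- function(name, text) {
  hits <- gregexpr(name, text, fixed = TRUE)[[1]]
  sum(hits > 0)
}

# Whole-name search: the name must not be preceded or followed by a
# character that is allowed in R names (letters, digits, "." and "_")
count_whole <- function(name, text) {
  # \Q...\E makes the name literal, so a "." in it is not a wildcard
  pattern <- paste0("(?<![A-Za-z0-9._])\\Q", name, "\\E(?![A-Za-z0-9._])")
  hits <- gregexpr(pattern, text, perl = TRUE)[[1]]
  sum(hits > 0)
}

count_fixed("foo_01", body_text)    # counts both foo_01 and foo_01b
count_whole("foo_01", body_text)    # counts only foo_01
count_whole("foo_01b", body_text)
```

The lookarounds are used instead of a simple word boundary because R names may contain dots, which a word boundary would treat as a separator.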
There may well be more special cases that I have missed. If you have an idea or a solution for one of these tasks, feel free to try it yourself and let me know!
About the author
Jakob Gepp
The post Flowcharts of functions first appeared on STATWORX.