Flowcharts of functions

Jakob Gepp

4 years ago

[This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers.]

When you work on bigger R projects, there comes a point where you may lose track of how your functions are connected. Or even worse: you inherit a large project and have to figure out what is actually happening! Flowcharts are one possible remedy.

If you started your project with a flowchart, good for you. If you did not, drawing one afterwards can be a tedious job. Since it is a rather repetitive task, I had the idea that a function could do it automatically. Let's try it on a simple example, which you can find on our git!

What needs to be done?

The goal of a flowchart is to visualise the connections between the user-defined functions (UDFs). First, I scan all project scripts and add every UDF to one big list. Second, I search all scripts for the function names to find the dependencies. In R, functions are defined with name <- function(){} or something similar, which makes this a simple search. But sometimes a function is defined within another function, like this:

foo_01 <- function(y) {
  print("Start foo_01")
  # define sub functions
  foo_02 <- function(x){
    if (x < 100) {
      print(x)
    } else {
      print("over 100")
    }
    foo_03 <- function(x){
      print(2 * x)
    }
    sapply(1:5, foo_03)
  }
  # main part of foo_01
  foo_02(x = y)
  foo_02(x = 10 * y)
}

In this toy example I define a function foo_01 where a second function foo_02 is defined and called. However, within foo_02 itself another function, namely foo_03, is defined and applied. This pattern is quite common in R functions and is a possible pitfall to look out for.
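
The "simple search" for definitions could look roughly like the sketch below. It is not the code from the repository: the pattern def_pattern and the helper find_definitions are assumptions made for illustration, and the toy script is passed in as a character vector instead of being read from files.

```r
# The toy example from above, one element per line.
script <- c(
  'foo_01 <- function(y) {',
  '  print("Start foo_01")',
  '  # define sub functions',
  '  foo_02 <- function(x){',
  '    if (x < 100) {',
  '      print(x)',
  '    } else {',
  '      print("over 100")',
  '    }',
  '    foo_03 <- function(x){',
  '      print(2 * x)',
  '    }',
  '    sapply(1:5, foo_03)',
  '  }',
  '  # main part of foo_01',
  '  foo_02(x = y)',
  '  foo_02(x = 10 * y)',
  '}'
)

# Assumed pattern: a name, then `<-` or `=`, then `function(`.
def_pattern <- paste0(
  "^[[:space:]]*([A-Za-z.][A-Za-z0-9._]*)",
  "[[:space:]]*(<-|=)[[:space:]]*function[[:space:]]*\\("
)

# Hypothetical helper: names of all functions defined in `lines`.
find_definitions <- function(lines) {
  hits <- regmatches(lines, regexec(def_pattern, lines))
  # each hit is character(0) (no match) or c(full match, name, operator)
  unlist(lapply(hits, function(h) if (length(h) > 0) h[2] else NULL))
}

udf_names <- find_definitions(script)
print(udf_names)
```

Note that this search finds foo_02 and foo_03 as well, but it says nothing about where they are nested.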

To tackle this, I need to separate the sub functions from the main functions to get the right connections. Otherwise I would find a direct connection between foo_01 and foo_03 instead of foo_01 to foo_02 and then foo_02 to foo_03. My solution to this problem is to count the curly brackets {} with respect to their nesting level and thus find the blocks of functions. In the toy example I would get an index like this for the curly brackets:
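
A minimal sketch of such a bracket index, under the assumption that no curly brackets appear inside strings or comments (this simple version would miscount them). The helper brace_levels and its output columns are hypothetical names, not taken from the original function.

```r
# The toy example from above, one element per line.
script <- c(
  'foo_01 <- function(y) {',
  '  print("Start foo_01")',
  '  # define sub functions',
  '  foo_02 <- function(x){',
  '    if (x < 100) {',
  '      print(x)',
  '    } else {',
  '      print("over 100")',
  '    }',
  '    foo_03 <- function(x){',
  '      print(2 * x)',
  '    }',
  '    sapply(1:5, foo_03)',
  '  }',
  '  # main part of foo_01',
  '  foo_02(x = y)',
  '  foo_02(x = 10 * y)',
  '}'
)

# Hypothetical helper: one row per curly bracket with its nesting level.
# An opening bracket and its matching closing bracket share the same level.
brace_levels <- function(lines) {
  out <- list()
  level <- 0
  for (i in seq_along(lines)) {
    chars <- strsplit(lines[i], "")[[1]]
    for (ch in chars[chars %in% c("{", "}")]) {
      if (ch == "{") level <- level + 1
      out[[length(out) + 1]] <- data.frame(
        line = i, brace = ch, level = level, stringsAsFactors = FALSE
      )
      if (ch == "}") level <- level - 1
    }
  }
  do.call(rbind, out)
}

idx <- brace_levels(script)
print(idx)

# The block of foo_02 starts at its opening bracket (line 4) and ends at
# the next closing bracket on the same level.
open_row  <- which(idx$line == 4 & idx$brace == "{")
close_row <- which(idx$brace == "}" &
                     idx$level == idx$level[open_row] &
                     seq_len(nrow(idx)) > open_row)[1]
block_lines <- idx$line[c(open_row, close_row)]
print(block_lines)
```

The if/else brackets inside foo_02 sit on the same level as the brackets of foo_03, so the level alone is not enough; the block has to be tied to the line on which the function is defined.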

Once I find the block, I can just remove it from the main function and add it to my list of functions. With this big list, I can search for the function calls and get a connection matrix (rows are the calling functions, columns the called functions):

        foo_02  foo_03
foo_01       2       0
foo_02       0       1
foo_03       0       0

Since I evaluate everything as a string, I do not get the right number of calls for foo_03: sapply(1:5, foo_03) mentions it once but calls it five times. In this toy example it might be possible to get the right number, but in a real project the call might be much more complex. Furthermore, my function removes empty lines and comments, just to make things a bit tidier.
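
The counting step can be sketched as follows. This is an illustration under assumptions, not the original implementation: the list bodies holds the already separated function bodies, abbreviated to the lines that matter here, and count_mentions is a hypothetical helper that does a plain fixed-string search.

```r
# Function bodies after the nested definitions have been cut out
# (abbreviated to the relevant lines).
bodies <- list(
  foo_01 = c("foo_02(x = y)", "foo_02(x = 10 * y)"),
  foo_02 = c("sapply(1:5, foo_03)"),
  foo_03 = c("print(2 * x)")
)

# Hypothetical helper: how often does `name` appear in `lines`?
count_mentions <- function(name, lines) {
  hits <- gregexpr(name, lines, fixed = TRUE)
  # gregexpr() returns -1 for lines without a match
  sum(vapply(hits, function(h) sum(h > 0), integer(1)))
}

fun_names <- names(bodies)

# Rows: calling function, columns: called function.
connections <- sapply(fun_names, function(callee) {
  sapply(fun_names, function(caller) count_mentions(callee, bodies[[caller]]))
})
print(connections)
```

Because this is a pure string search, foo_03 is counted once for foo_02 even though sapply would call it five times at run time.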

A different way to store the network is to use two data sets: one for the nodes and one for the edges. This way additional information (e.g. size, weights, labels, groupings) can be included. One can then visualise a sophisticated project flowchart with this additional information, for instance as an igraph network.
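
As a rough sketch, the connection matrix from above can be turned into such a pair of data sets with base R. The column names (id, calls_out, from, to, weight) are assumptions chosen for illustration.

```r
# Connection matrix of the toy example (rows call columns).
fun_names <- c("foo_01", "foo_02", "foo_03")
connections <- matrix(
  c(0, 2, 0,
    0, 0, 1,
    0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(fun_names, fun_names)
)

# Nodes: one row per function, plus any extra attributes.
nodes <- data.frame(
  id        = fun_names,
  calls_out = unname(rowSums(connections)),  # number of calls made
  stringsAsFactors = FALSE
)

# Edges: one row per non-zero cell of the matrix.
idx <- which(connections > 0, arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(connections)[idx[, 1]],
  to     = colnames(connections)[idx[, 2]],
  weight = connections[idx],
  stringsAsFactors = FALSE
)

print(nodes)
print(edges)
```

Further columns, such as the folder of the script or the number of lines per function, could be added to nodes in the same way.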

Plotting the flowchart

There are different ways and packages to plot this kind of flowchart or network in R. Popular network packages include ggnet, visNetwork, threejs, networkD3 and igraph. My first attempt is based on some test scripts and functions and is plotted with ggnet. I get the following graph:

To be honest, it’s not the prettiest graph, but it gets the job done and we can see the connections! Of course, there are a lot more ways to tweak and improve the plots with additional information. Some ideas come to mind:

Add color to symbolise the folder structure.
Vary the point size by the number of lines in a function.
Adjust the line width by the number of calls.
Make the plots interactive.

Among others, the threejs package makes it possible to build interactive plots. They are fun to play around with, but it is very hard to make them suitable for a project description. For instance, a 3D network is hard to read when its nodes and edges overlap.

So, I adjust my network function a bit to include some of the ideas from above. I also have to change my underlying test scripts and functions, because there are some more special cases, which I want to test and debug. With all these improvements, this is my second version of the network:

I also made one with igraph where recursive functions can be plotted.
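
A minimal sketch of how such an igraph plot might be set up, assuming the edge list format with from, to and weight columns. The function foo_rec is a hypothetical recursive function added for illustration; a recursive call simply shows up as an edge from a node to itself. The plotting part only runs if igraph is installed.

```r
# Edge list with one recursive function (foo_rec calls itself).
edges <- data.frame(
  from   = c("foo_01", "foo_02", "foo_01", "foo_rec"),
  to     = c("foo_02", "foo_03", "foo_rec", "foo_rec"),
  weight = c(2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Recursive calls are the edges that start and end at the same function.
is_recursive <- edges$from == edges$to
print(edges[is_recursive, ])

if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(edges, directed = TRUE)
  # Line width reflects the number of calls; the self-loop marks recursion.
  plot(
    g,
    edge.width         = igraph::E(g)$weight,
    edge.arrow.size    = 0.5,
    vertex.label.color = "black"
  )
}
```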

Features and problems

So far, I am pleased with the result, but some features are still missing that would make it a robust and sophisticated function. I look forward to working through the following list in the near future.

The function only looks for function calls, not for sourced scripts.
If the name of one function is contained in the name of another (e.g. foo_01 and foo_01b), the dependencies are not correct, because I just do a string search.
Line breaks between <- and the function name also slip through my search.
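
Two of these points could be approached with regular expressions, sketched below under assumptions. The helper mentions_function is hypothetical: it only accepts a match if the name is not directly surrounded by characters that are valid in R names. For the line break problem, one option is to join the lines into a single string before searching.

```r
lines <- c("foo_01b(x)", "y <- foo_01(x)", "sapply(1:5, foo_01)")

# A plain string search also hits foo_01b.
naive <- grepl("foo_01", lines, fixed = TRUE)

# Hypothetical helper: match `name` only as a complete identifier.
# \Q...\E treats the name literally; the lookarounds reject neighbouring
# letters, digits, dots and underscores.
mentions_function <- function(name, lines) {
  pattern <- paste0(
    "(?<![A-Za-z0-9._])\\Q", name, "\\E(?![A-Za-z0-9._])"
  )
  grepl(pattern, lines, perl = TRUE)
}

strict <- mentions_function("foo_01", lines)
print(rbind(naive, strict))

# A definition split over two lines is missed line by line ...
def_split   <- c("foo_04 <-", "  function(x) x + 1")
def_pattern <- "foo_04[[:space:]]*<-[[:space:]]*function"
per_line <- grepl(def_pattern, def_split)

# ... but found once the lines are joined, since [[:space:]] matches "\n".
joined <- grepl(def_pattern, paste(def_split, collapse = "\n"))
print(c(per_line = any(per_line), joined = joined))
```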

There may well be more special cases that I have missed. If you have an idea or a solution for one of these tasks, feel free to try it yourself and let me know!

About the author

Jakob Gepp

Jakob is part of the statistics team and is currently very interested in Hadoop and big data. In his spare time he enjoys tinkering with old electrical appliances and plays hockey.

The post Flowcharts of functions first appeared on STATWORX.
