Extracting names of functions defined in a script with treesitter

[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers.]

Coming back from a conference, we might be excited to install and try out the cool things we have heard about. I, going against the stream 🐟, decided to experiment with a tool I did not hear about last week, as I unfortunately missed Davis Vaughan’s talk about treesitter. Nonetheless, I caught the idea that treesitter is a code parser, for R code in particular. The treesitter R package uses the tree-sitter C library. There are awesome applications of treesitter, among which ✨ code search for R on GitHub ✨, but I got to know treesitter a bit by solving a boring use case.

My use case: I won’t copy-paste these function names

Have you noticed this nice error message you get if you try to use dplyr::across() directly?

#> Error in `dplyr::across()`:
#> ! Must only be used inside data-masking verbs like `mutate()`,
#>   `filter()`, and `group_by()`.

Kirill Müller got the idea to offer a similar behaviour for some igraph functions that exist only for use within the square bracket operator. Therefore I had to find all functions defined within this function (whose body has been simplified here):

`[.igraph.vs` <- function(...) {
  # some code
  bla <- function(...) {
    # some code
  }
  blop <- function(...) {
    # some code
  }
}

In the script outline displayed to the right of my script in the RStudio IDE [1], I can see the function names (bla() and blop() in the simplified chunk above), but I cannot copy-paste them from there.

I would know how to extract them with xmlparsedata, but I thought this might be an opportunity to have a look at treesitter. I went through different emotions as a beginner, not all of them positive 😭, but I did get the function names in the end! 💪

Get the whole tree from the script

Below, I dutifully followed the example on the package homepage:

  • load the R language from the treesitter.r package, as far as I understand the only language available at this point for the R package.
  • load the parser for that language.
  • read in the text.
  • parse the text.

A last step is getting the tree root node as node, because I will query the whole script, and you can only query nodes, not trees.

language <- treesitter.r::language()
parser <- treesitter::parser(language)
text <- brio::read_lines("/home/maelle/Documents/rigraph/R/iterators.R") |>
  paste(collapse = "\n")
tree <- treesitter::parser_parse(parser, text)
node <- treesitter::tree_root_node(tree)
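
The same steps also work on a string rather than a file, which is handy for experimenting. Here is a minimal sketch on made-up code: the function names outer, bla and blop are hypothetical stand-ins, not taken from igraph. Printing the S-expression of the root node shows the node types and field names that a query can refer to.

```r
language <- treesitter.r::language()
parser <- treesitter::parser(language)

# Made-up code standing in for the real script (hypothetical names).
text <- "
outer <- function(...) {
  bla <- function(...) 1
  blop <- function(...) 2
}
"

tree <- treesitter::parser_parse(parser, text)
node <- treesitter::tree_root_node(tree)

# Print the tree as an S-expression to inspect node types and fields.
treesitter::node_show_s_expression(node)

# The type of the root node.
root_type <- treesitter::node_type(node)
```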

Find the node of the parent function of interest

The documentation page of treesitter::query() recommends reading the documentation of tree-sitter about the query syntax. That documentation features very useful examples.

Below I am looking for a “binary_operator” whose left-hand side is an identifier that I capture as “name”, and whose right-hand side is a “function_definition”. I capture the whole binary operator as “def”. For some reason, if I did not capture “def”, I got less information about the node. 🤷 I also use a predicate: the name of the function has to be equal to “`[.igraph.vs`”, backticks included.

To find out I was after a β€œfunction_definition”, I parsed a few lines of made-up code to study the resulting tree.

square_bracket_thing <- '
((binary_operator
  lhs: (identifier) @name
  rhs: (function_definition)) @def
  (#eq? @name "`[.igraph.vs`"))
'
square_bracket_query <- treesitter::query(language, square_bracket_thing)

Then I executed the query and extracted the node. In reality this took a bit more trial and error.

square_bracket_thing_captures <- treesitter::query_captures(square_bracket_query, node)
#> $name
#> [1] "def"  "name"
#> $node
#> $node[[1]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs` <- function(x, ..., na_ok = FALSE) {
#>   args <- lazy_dots(..., .follow_symbols = FALSE)
#>   ## If indexing has no argument at all, then we still get one,
#>   ## but it is "empty", a name that is  ""
#>   ## Special case, no argument (but we might get an artificial
#>   ## empty one
#>   if (length(args) < 1 ||
#>     (length(args) == 1 && inherits(args[[1]]$expr, "name") &&
#>       as.character(args[[1]]$expr) == "")) {
#>     return(x)
#>   }
#>   ## Special case: single numeric argument
#>   if (length(args) == 1 && inherits(args[[1]]$expr, "numeric")) {
#>     res <- simple_vs_index(x, args[[1]]$expr, na_ok)
#>     return(add_vses_graph_ref(res, get_vs_graph(x)))
#>   }
#>   ## Special case: single symbol argument, no such attribute
#>   if (length(args) == 1 && inherits(args[[1]]$expr, "name")) {
#>     graph <- get_vs_graph(x)
#>     if (!(as.character(args[[1]]$expr) %in% vertex_attr_names(graph))) {
#>       res <- simple_vs_index(x, lazy_eval(args[[1]]), na_ok)
#> <truncated>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (binary_operator [(479, 0), (637, 1)]
#>   lhs: (identifier [(479, 0), (479, 13)])
#>   operator: "<-" [(479, 14), (479, 16)]
#>   rhs: (function_definition [(479, 17), (637, 1)]
#>     name: "function" [(479, 17), (479, 25)]
#>     parameters: (parameters [(479, 25), (479, 48)]
#>       open: "(" [(479, 25), (479, 26)]
#>       parameter: (parameter [(479, 26), (479, 27)]
#>         name: (identifier [(479, 26), (479, 27)])
#>       )
#>       (comma [(479, 27), (479, 28)])
#>       parameter: (parameter [(479, 29), (479, 32)]
#>         name: (dots [(479, 29), (479, 32)])
#>       )
#>       (comma [(479, 32), (479, 33)])
#>       parameter: (parameter [(479, 34), (479, 47)]
#>         name: (identifier [(479, 34), (479, 39)])
#>         "=" [(479, 40), (479, 41)]
#>         default: (false [(479, 42), (479, 47)])
#>       )
#>       close: ")" [(479, 47), (479, 48)]
#>     )
#>     body: (braced_expression [(479, 49), (637, 1)]
#>       open: "{" [(479, 49), (479, 50)]
#>       body: (binary_operator [(480, 2), (480, 49)]
#> <truncated>
#> $node[[2]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs`
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(479, 0), (479, 13)])
square_fn <- square_bracket_thing_captures$node[[1]]
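
To check a query like this one without the igraph source at hand, it can be run on a small made-up script. The sketch below assumes a hypothetical parent function called outer in place of `[.igraph.vs`, and picks the "def" capture by name rather than by position, so it does not rely on the order of the captures.

```r
language <- treesitter.r::language()
parser <- treesitter::parser(language)

# Made-up script: `outer` is a hypothetical stand-in for `[.igraph.vs`.
text <- "
other <- function() 0
outer <- function(...) {
  bla <- function(...) 1
  blop <- function(...) 2
}
"
tree <- treesitter::parser_parse(parser, text)
node <- treesitter::tree_root_node(tree)

# Same query shape as above, with the predicate adapted to the made-up name.
source <- '
((binary_operator
  lhs: (identifier) @name
  rhs: (function_definition)) @def
  (#eq? @name "outer"))
'
query <- treesitter::query(language, source)
captures <- treesitter::query_captures(query, node)

# Select the node captured as "def" by its capture name.
outer_fn <- captures$node[[which(captures$name == "def")[1]]]
treesitter::node_type(outer_fn)
```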

At this point I was already very proud of my tiny win but I did not have the “children functions” yet!

Find the functions defined within the parent

I first considered creating a complicated nested query but I found no example of that. I did see someone saying on Stack Overflow that they ran a recursive query, and for some reason that gave me the idea of parsing the text of the node, then looking for functions in that text.

The query was simpler: looking for function definitions, only capturing the names on the left hand side.

square_tree <- treesitter::parser_parse(parser, treesitter::node_text(square_fn))
square_node <- treesitter::tree_root_node(square_tree)
kiddos_source <- '
(binary_operator
  lhs: (identifier) @name
  rhs: (function_definition))
'
kiddos_query <- treesitter::query(language, kiddos_source)
square_bracket_thing_captures <- treesitter::query_captures(kiddos_query, square_node)
#> $name
#>  [1] "name" "name" "name" "name" "name" "name" "name" "name" "name" "name"
#> [11] "name" "name" "name" "name"
#> $node
#> $node[[1]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs`
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(0, 0), (0, 13)])
#> $node[[2]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .nei
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(29, 2), (29, 6)])
#> $node[[3]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> nei
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(50, 2), (50, 5)])
#> $node[[4]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .innei
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(53, 2), (53, 8)])
#> $node[[5]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> innei
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(56, 2), (56, 7)])
#> $node[[6]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .outnei
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(59, 2), (59, 9)])
#> $node[[7]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> outnei
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(62, 2), (62, 8)])
#> $node[[8]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .inc
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(65, 2), (65, 6)])
#> $node[[9]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> inc
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(78, 2), (78, 5)])
#> $node[[10]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> adj
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(81, 2), (81, 5)])
#> $node[[11]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .from
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(84, 2), (84, 7)])
#> $node[[12]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> from
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(96, 2), (96, 6)])
#> $node[[13]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .to
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(99, 2), (99, 5)])
#> $node[[14]]
#> <tree_sitter_node>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> to
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(111, 2), (111, 4)])

After that I was able to get the names of the children functions! I was actually only interested in those whose names start with a dot as all the other ones are deprecated anyway.

kiddos_functions <- purrr::map_chr(square_bracket_thing_captures$node, treesitter::node_text)
kiddos_functions[startsWith(kiddos_functions, ".")]
#> [1] ".nei"    ".innei"  ".outnei" ".inc"    ".from"   ".to"
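
As a possible variation, not what I did above: since queries run on nodes, the second query could presumably be run directly on the node captured by the first one, without re-parsing its text. Here is a hedged sketch on made-up code, where outer, .bla and bla are hypothetical names, and where base R's vapply() replaces purrr::map_chr().

```r
language <- treesitter.r::language()
parser <- treesitter::parser(language)

# Made-up script with one dotted and one non-dotted inner function.
text <- "
outer <- function(...) {
  .bla <- function(...) 1
  bla <- function(...) .bla(...)
}
"
root <- treesitter::tree_root_node(treesitter::parser_parse(parser, text))

# First query: capture the parent function definition.
parent_query <- treesitter::query(language, '
((binary_operator
  lhs: (identifier) @name
  rhs: (function_definition)) @def
  (#eq? @name "outer"))
')
parent_captures <- treesitter::query_captures(parent_query, root)
parent_fn <- parent_captures$node[[which(parent_captures$name == "def")[1]]]

# Second query: run on the captured node itself, no second parse.
kiddos_query <- treesitter::query(language, '
(binary_operator
  lhs: (identifier) @name
  rhs: (function_definition))
')
kiddos_captures <- treesitter::query_captures(kiddos_query, parent_fn)
kiddos <- vapply(kiddos_captures$node, treesitter::node_text, character(1))

# Keep only the names starting with a dot.
dotted <- kiddos[startsWith(kiddos, ".")]
```

One difference from re-parsing the text is that the node positions stay relative to the original script, which could matter if the goal were to edit the file afterwards.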

Tada! Now I can go fix the issue I was tasked with.


In this post I report on my first encounter with the treesitter package for parsing R code. Copy-pasting the few function names would surely have been faster, but sometimes you’ve got to sit and learn something new. ☺️

  1. No, I have not installed Positron yet.
