Extracting names of functions defined in a script with treesitter
Coming back from a conference, we might be excited to install and try out the cool things we have heard about. I, going against the stream, decided to experiment with a tool I did not hear about last week, as I unfortunately missed Davis Vaughan's talk about treesitter. Nonetheless, I caught the idea that treesitter is a parser of code, R code in particular. The treesitter R package uses the tree-sitter C library. There are awesome applications of treesitter, among which ✨ code search for R on GitHub ✨, but I got to know treesitter a little by solving a boring use case.
My use case: I won't copy-paste these function names
Have you noticed the nice error message you get if you try to use dplyr::across() directly?
dplyr::across()
#> Error in `dplyr::across()`:
#> ! Must only be used inside data-masking verbs like `mutate()`,
#> `filter()`, and `group_by()`.
Kirill Müller got the idea to offer a similar behaviour for some igraph functions that exist only for use within the square bracket operator. Therefore I had to find all functions defined within this function (whose body has been simplified here):
`[.igraph.vs` <- function(...) {
  # some code
  bla <- function(...) {
    # some code
  }
  blop <- function(...) {
    # some code
  }
}
In the script outline shown on the right of my script in the RStudio IDE[1], I can see the function names (bla() and blop() in the simplified chunk above), but I cannot copy-paste them from there.
I would know how to extract them with xmlparsedata, but I thought this might be an opportunity to have a look at treesitter. I went through different emotions as a beginner, not all of them positive, but I did get the function names in the end!
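For comparison, here is a minimal sketch of the same extraction on the simplified chunk using only base R's parse(), rather than xmlparsedata or treesitter. It is not the approach used in the rest of this post. The helper name is_fn_assignment() is made up for illustration, and the sketch assumes the inner functions are assigned with `<-` directly in the body of the outer function.

```r
# Simplified stand-in for the real function, copied from the chunk above.
code <- '
`[.igraph.vs` <- function(...) {
  bla <- function(...) {
    # some code
  }
  blop <- function(...) {
    # some code
  }
}
'
exprs <- parse(text = code, keep.source = FALSE)

# Hypothetical helper: is this an assignment whose right-hand side
# is a function definition?
is_fn_assignment <- function(x) {
  is.call(x) &&
    identical(x[[1]], as.name("<-")) &&
    is.call(x[[3]]) &&
    identical(x[[3]][[1]], as.name("function"))
}

outer_def <- exprs[[1]]            # the `<-` call defining the outer function
outer_body <- outer_def[[3]][[3]]  # the `{` call that is the function body
inner <- Filter(is_fn_assignment, as.list(outer_body)[-1])
inner_names <- vapply(inner, function(x) as.character(x[[2]]), character(1))
inner_names
```

Unlike a treesitter tree, the expressions returned by parse() drop comments and exact positions, which is one reason to prefer a concrete syntax tree for tooling.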
Get the whole tree from the script
Below, I dutifully followed the example on the package homepage:
- load the R language from the treesitter.r package which, as far as I understand, is the only language available for the treesitter R package at this point.
- load the parser for that language.
- read in the text.
- parse the text.
A last step is getting the tree root node as node, because I will query the whole script, and you can only query nodes, not trees.
language <- treesitter.r::language()
parser <- treesitter::parser(language)
text <- brio::read_lines("/home/maelle/Documents/rigraph/R/iterators.R") |>
paste(collapse = "\n")
tree <- treesitter::parser_parse(parser, text)
node <- treesitter::tree_root_node(tree)
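The same steps can be tried on a made-up snippet before pointing them at a real file. This is a minimal sketch, assuming the treesitter and treesitter.r packages are installed; the snippet defining bla() is invented for illustration. Printing the S-expression of the root node is a quick way to see which node types the grammar produces.

```r
language <- treesitter.r::language()
parser <- treesitter::parser(language)

# Made-up code, standing in for the content of a real script.
text <- "bla <- function(x) {\n  x + 1\n}"

tree <- treesitter::parser_parse(parser, text)
node <- treesitter::tree_root_node(tree)

# Print the structure of the tree, with node types and positions.
treesitter::node_show_s_expression(node)

# The root node spans the whole parsed text.
treesitter::node_text(node)
```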
Find the node of the parent function of interest
The documentation page of treesitter::query() recommends reading the documentation of tree-sitter about the query syntax. That documentation features very useful examples.
Below I am looking for a "binary_operator" whose left-hand side is an identifier that I capture as "name", and whose right-hand side is a "function_definition". The whole "binary_operator" is captured as "def"; for some reason, if I did not capture "def", I got less information about the node. 🤷 I also use a predicate: the name of the function has to be equal to `[.igraph.vs`, backticks included.
To find out I was after a "function_definition", I parsed a few lines of made-up code to study the resulting tree.
square_bracket_thing <- '
(
((binary_operator
lhs: (identifier) @name
rhs: (function_definition)) @def
(#eq? @name "`[.igraph.vs`"))
)
'
square_bracket_query <- treesitter::query(language, square_bracket_thing)
Then I executed the query and extracted the node. In reality this took a bit more trial and error.
square_bracket_thing_captures <- treesitter::query_captures(square_bracket_query, node)
square_bracket_thing_captures
#> $name
#> [1] "def" "name"
#>
#> $node
#> $node[[1]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs` <- function(x, ..., na_ok = FALSE) {
#> args <- lazy_dots(..., .follow_symbols = FALSE)
#>
#> ## If indexing has no argument at all, then we still get one,
#> ## but it is "empty", a name that is ""
#>
#> ## Special case, no argument (but we might get an artificial
#> ## empty one
#> if (length(args) < 1 ||
#> (length(args) == 1 && inherits(args[[1]]$expr, "name") &&
#> as.character(args[[1]]$expr) == "")) {
#> return(x)
#> }
#>
#> ## Special case: single numeric argument
#> if (length(args) == 1 && inherits(args[[1]]$expr, "numeric")) {
#> res <- simple_vs_index(x, args[[1]]$expr, na_ok)
#> return(add_vses_graph_ref(res, get_vs_graph(x)))
#> }
#>
#> ## Special case: single symbol argument, no such attribute
#> if (length(args) == 1 && inherits(args[[1]]$expr, "name")) {
#> graph <- get_vs_graph(x)
#> if (!(as.character(args[[1]]$expr) %in% vertex_attr_names(graph))) {
#> res <- simple_vs_index(x, lazy_eval(args[[1]]), na_ok)
#> <truncated>
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (binary_operator [(479, 0), (637, 1)]
#> lhs: (identifier [(479, 0), (479, 13)])
#> operator: "<-" [(479, 14), (479, 16)]
#> rhs: (function_definition [(479, 17), (637, 1)]
#> name: "function" [(479, 17), (479, 25)]
#> parameters: (parameters [(479, 25), (479, 48)]
#> open: "(" [(479, 25), (479, 26)]
#> parameter: (parameter [(479, 26), (479, 27)]
#> name: (identifier [(479, 26), (479, 27)])
#> )
#> (comma [(479, 27), (479, 28)])
#> parameter: (parameter [(479, 29), (479, 32)]
#> name: (dots [(479, 29), (479, 32)])
#> )
#> (comma [(479, 32), (479, 33)])
#> parameter: (parameter [(479, 34), (479, 47)]
#> name: (identifier [(479, 34), (479, 39)])
#> "=" [(479, 40), (479, 41)]
#> default: (false [(479, 42), (479, 47)])
#> )
#> close: ")" [(479, 47), (479, 48)]
#> )
#> body: (braced_expression [(479, 49), (637, 1)]
#> open: "{" [(479, 49), (479, 50)]
#> body: (binary_operator [(480, 2), (480, 49)]
#> <truncated>
#>
#> $node[[2]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs`
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(479, 0), (479, 13)])
square_fn <- square_bracket_thing_captures$node[[1]]
At this point I was already very proud of my tiny win, but I did not have the "children functions" yet!
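A likely explanation for the "less information" remark above is that treesitter::query_captures() only returns the nodes that carry a capture name, so a query capturing just @name yields the identifier alone. The sketch below, on made-up code, shows that and then climbs back to the whole assignment with treesitter::node_parent(). It assumes the treesitter and treesitter.r packages are installed; the object names are invented for illustration.

```r
language <- treesitter.r::language()
parser <- treesitter::parser(language)
tree <- treesitter::parser_parse(parser, "bla <- function(x) x + 1")
root <- treesitter::tree_root_node(tree)

# Capture only the identifier on the left-hand side.
name_only <- treesitter::query(language, '
  (binary_operator
    lhs: (identifier) @name
    rhs: (function_definition))
')
captures <- treesitter::query_captures(name_only, root)
name_node <- captures$node[[1]]
treesitter::node_text(name_node)

# Walk up one level to recover the whole assignment.
assignment <- treesitter::node_parent(name_node)
treesitter::node_type(assignment)
treesitter::node_text(assignment)
```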
Find the functions defined within the parent
I first considered creating a complicated nested query, but I found no example of that. I did see someone on Stack Overflow saying they did a recursive query, and for some reason that gave me the idea of parsing the text of the node, then looking for functions in that text.
The query was simpler: looking for function definitions, only capturing the names on the left hand side.
square_tree <- treesitter::parser_parse(parser, treesitter::node_text(square_fn))
square_node <- treesitter::tree_root_node(square_tree)
kiddos_source <- '
(
(binary_operator
lhs: (identifier) @name
rhs: (function_definition))
)
'
kiddos_query <- treesitter::query(language, kiddos_source)
square_bracket_thing_captures <- treesitter::query_captures(kiddos_query, square_node)
head(square_bracket_thing_captures)
#> $name
#> [1] "name" "name" "name" "name" "name" "name" "name" "name" "name" "name"
#> [11] "name" "name" "name" "name"
#>
#> $node
#> $node[[1]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> `[.igraph.vs`
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(0, 0), (0, 13)])
#>
#> $node[[2]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .nei
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(29, 2), (29, 6)])
#>
#> $node[[3]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> nei
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(50, 2), (50, 5)])
#>
#> $node[[4]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .innei
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(53, 2), (53, 8)])
#>
#> $node[[5]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> innei
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(56, 2), (56, 7)])
#>
#> $node[[6]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .outnei
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(59, 2), (59, 9)])
#>
#> $node[[7]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> outnei
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(62, 2), (62, 8)])
#>
#> $node[[8]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .inc
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(65, 2), (65, 6)])
#>
#> $node[[9]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> inc
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(78, 2), (78, 5)])
#>
#> $node[[10]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> adj
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(81, 2), (81, 5)])
#>
#> $node[[11]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .from
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(84, 2), (84, 7)])
#>
#> $node[[12]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> from
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(96, 2), (96, 6)])
#>
#> $node[[13]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> .to
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(99, 2), (99, 5)])
#>
#> $node[[14]]
#> <tree_sitter_node>
#>
#> ── Text ────────────────────────────────────────────────────────────────────────
#> to
#>
#> ── S-Expression ────────────────────────────────────────────────────────────────
#> (identifier [(111, 2), (111, 4)])
After that I was able to get the names of the children functions! I was actually only interested in those whose names start with a dot as all the other ones are deprecated anyway.
kiddos_functions <- purrr::map_chr(square_bracket_thing_captures$node, treesitter::node_text)
kiddos_functions[startsWith(kiddos_functions, ".")]
#> [1] ".nei" ".innei" ".outnei" ".inc" ".from" ".to"
Tada! Now I can go fix the issue I was tasked with.
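Re-parsing the text of the node is one way to go. A possible variant, sketched here on made-up code, runs the second query directly on the captured node, since treesitter::query_captures() takes a node as its second argument. This assumes the query only matches within the subtree of that node, which is my understanding of how tree-sitter queries run. The functions outer(), .bla(), blop() and other() are invented for illustration. The "def" capture is also selected by name rather than by position.

```r
language <- treesitter.r::language()
parser <- treesitter::parser(language)

# Made-up script: one parent function with two children, plus an unrelated one.
text <- '
outer <- function(...) {
  .bla <- function(...) 1
  blop <- function(...) 2
}
other <- function() 3
'
tree <- treesitter::parser_parse(parser, text)
root <- treesitter::tree_root_node(tree)

# Find the assignment that defines outer().
outer_query <- treesitter::query(language, '
  ((binary_operator
    lhs: (identifier) @name
    rhs: (function_definition)) @def
    (#eq? @name "outer"))
')
captures <- treesitter::query_captures(outer_query, root)
outer_fn <- captures$node[[which(captures$name == "def")[1]]]

# Query that node directly for function definitions, without re-parsing.
names_query <- treesitter::query(language, '
  (binary_operator
    lhs: (identifier) @name
    rhs: (function_definition))
')
inner <- treesitter::query_captures(names_query, outer_fn)
fn_names <- vapply(inner$node, treesitter::node_text, character(1))

# Drop the parent itself to keep only the children.
setdiff(fn_names, "outer")
```

A side effect of skipping the re-parse is that the positions reported for the child nodes stay relative to the original script rather than to the extracted text.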
Conclusion
In this post I report on my first encounter with the treesitter package for parsing R code. Copy-pasting the few function names would surely have been faster, but sometimes you've got to sit and learn something new. ☺️
[1] No, I have not installed Positron yet.