Coming back from a conference, we might be excited to install and try out the cool things we have heard about. I, going against the stream, decided to experiment with a tool I had not heard about last week, as I unfortunately missed Davis Vaughan's talk about treesitter. Nonetheless, I caught the idea that treesitter is a parser of code, R code in particular. The treesitter R package uses the tree-sitter C library. There are awesome applications of treesitter, among which ✨ code search for R on GitHub ✨, but I got to know treesitter a bit by solving a boring use case.
My use case: I won't copy-paste these function names
Have you noticed this nice error message you get if you try to use dplyr::across() directly?
    dplyr::across()
    #> Error in `dplyr::across()`:
    #> ! Must only be used inside data-masking verbs like `mutate()`,
    #>   `filter()`, and `group_by()`.
Kirill Müller got the idea to offer a similar behaviour for some igraph functions that exist only for use within the square bracket operator. Therefore I had to find all functions defined within this function (whose body has been simplified here):
    `[.igraph.vs` <- function(...) {
      # some code
      bla <- function(...) {
        # some code
      }
      blop <- function(...) {
        # some code
      }
    }
On the script outline I get on the right of my script in the RStudio IDE [1], I can see the function names (that would be bla() and blop() in the simplified chunk above) but I cannot copy-paste them from there.
I would know how to extract them with xmlparsedata but I thought it might be an opportunity for me to have a look at treesitter. I went through different emotions as a beginner, not all of them positive, but I did get the function names in the end! 💪
Get the whole tree from the script
Below, I dutifully followed the example on the package homepage:
- load the R language from the treesitter.r package, as far as I understand the only language available at this point for the R package.
- load the parser for that language.
- read in the text.
- parse the text.
A last step is getting the tree root node as node, because I will query the whole script, and you can only query nodes, not trees.
    language <- treesitter.r::language()
    parser <- treesitter::parser(language)
    text <- brio::read_lines("/home/maelle/Documents/rigraph/R/iterators.R") |>
      paste(collapse = "\n")
    tree <- treesitter::parser_parse(parser, text)
    node <- treesitter::tree_root_node(tree)
Find the node of the parent function of interest
The documentation page of treesitter::query() recommends reading the documentation of tree-sitter about the query syntax. That documentation features very useful examples.
Below I am looking for a “binary_operator” whose left hand side is an identifier that I capture as “name”, and whose right hand side is a “function_definition”; I capture the whole assignment as “def”. For some reason if I did not capture “def” then I got less information about the node. 🤷 I also use a predicate: the name of the function has to be equal to “`[.igraph.vs`”.
To find out I was after a “function_definition”, I parsed a few lines of made-up code to study the resulting tree.
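A minimal sketch of that exploration step, assuming a made-up snippet (the exact lines parsed for this post are not shown, so `made_up` below is hypothetical). Printing a node displays its text and its S-expression, and the node types and field names can also be read off programmatically:

```r
# Parse a tiny made-up snippet to see which node types the R grammar produces.
language <- treesitter.r::language()
parser <- treesitter::parser(language)

# Hypothetical example code, not taken from igraph.
made_up <- "bla <- function(x) {\n  x + 1\n}"

made_up_tree <- treesitter::parser_parse(parser, made_up)
made_up_node <- treesitter::tree_root_node(made_up_tree)

# Printing shows the text and the S-expression with node types and fields.
print(made_up_node)

# The first child of the root is the assignment; its "rhs" field is the function.
assignment <- treesitter::node_children(made_up_node)[[1]]
treesitter::node_type(assignment)

rhs <- treesitter::node_child_by_field_name(assignment, "rhs")
treesitter::node_type(rhs)
```

The node types printed here are the names to use in a query pattern, and the field names (lhs, rhs) are the ones to put before the colons.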
    square_bracket_thing <- '
    (
      ((binary_operator
        lhs: (identifier) @name
        rhs: (function_definition)) @def
      (#eq? @name "`[.igraph.vs`"))
    )
    '
    square_bracket_query <- treesitter::query(language, square_bracket_thing)
Then I executed the query and extracted the node. In reality this took a bit more trial and error.
    square_bracket_thing_captures <- treesitter::query_captures(square_bracket_query, node)
    square_bracket_thing_captures
    #> $name
    #> [1] "def"  "name"
    #>
    #> $node
    #> $node[[1]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> `[.igraph.vs` <- function(x, ..., na_ok = FALSE) {
    #>   args <- lazy_dots(..., .follow_symbols = FALSE)
    #>
    #>   ## If indexing has no argument at all, then we still get one,
    #>   ## but it is "empty", a name that is ""
    #>
    #>   ## Special case, no argument (but we might get an artificial
    #>   ## empty one
    #>   if (length(args) < 1 ||
    #>     (length(args) == 1 && inherits(args[[1]]$expr, "name") &&
    #>       as.character(args[[1]]$expr) == "")) {
    #>     return(x)
    #>   }
    #>
    #>   ## Special case: single numeric argument
    #>   if (length(args) == 1 && inherits(args[[1]]$expr, "numeric")) {
    #>     res <- simple_vs_index(x, args[[1]]$expr, na_ok)
    #>     return(add_vses_graph_ref(res, get_vs_graph(x)))
    #>   }
    #>
    #>   ## Special case: single symbol argument, no such attribute
    #>   if (length(args) == 1 && inherits(args[[1]]$expr, "name")) {
    #>     graph <- get_vs_graph(x)
    #>     if (!(as.character(args[[1]]$expr) %in% vertex_attr_names(graph))) {
    #>       res <- simple_vs_index(x, lazy_eval(args[[1]]), na_ok)
    #> <truncated>
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (binary_operator [(479, 0), (637, 1)]
    #>   lhs: (identifier [(479, 0), (479, 13)])
    #>   operator: "<-" [(479, 14), (479, 16)]
    #>   rhs: (function_definition [(479, 17), (637, 1)]
    #>     name: "function" [(479, 17), (479, 25)]
    #>     parameters: (parameters [(479, 25), (479, 48)]
    #>       open: "(" [(479, 25), (479, 26)]
    #>       parameter: (parameter [(479, 26), (479, 27)]
    #>         name: (identifier [(479, 26), (479, 27)])
    #>       )
    #>       (comma [(479, 27), (479, 28)])
    #>       parameter: (parameter [(479, 29), (479, 32)]
    #>         name: (dots [(479, 29), (479, 32)])
    #>       )
    #>       (comma [(479, 32), (479, 33)])
    #>       parameter: (parameter [(479, 34), (479, 47)]
    #>         name: (identifier [(479, 34), (479, 39)])
    #>         "=" [(479, 40), (479, 41)]
    #>         default: (false [(479, 42), (479, 47)])
    #>       )
    #>       close: ")" [(479, 47), (479, 48)]
    #>     )
    #>     body: (braced_expression [(479, 49), (637, 1)]
    #>       open: "{" [(479, 49), (479, 50)]
    #>       body: (binary_operator [(480, 2), (480, 49)]
    #> <truncated>
    #>
    #> $node[[2]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> `[.igraph.vs`
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(479, 0), (479, 13)])

    square_fn <- square_bracket_thing_captures$node[[1]]
At this point I was already very proud of my tiny win but I did not have the “children functions” yet!
Find the functions defined within the parent
I first considered creating a complicated nested query but I found no example of that. I did see someone on Stack Overflow saying they did a recursive query, and for some reason that gave me the idea of parsing the text of the node, then looking for functions in that text.
The query was simpler: looking for function definitions, only capturing the names on the left hand side.
    square_tree <- treesitter::parser_parse(parser, treesitter::node_text(square_fn))
    square_node <- treesitter::tree_root_node(square_tree)
    kiddos_source <- '
    (
      (binary_operator
        lhs: (identifier) @name
        rhs: (function_definition))
    )
    '
    kiddos_query <- treesitter::query(language, kiddos_source)
    square_bracket_thing_captures <- treesitter::query_captures(kiddos_query, square_node)
    head(square_bracket_thing_captures)
    #> $name
    #>  [1] "name" "name" "name" "name" "name" "name" "name" "name" "name" "name"
    #> [11] "name" "name" "name" "name"
    #>
    #> $node
    #> $node[[1]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> `[.igraph.vs`
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(0, 0), (0, 13)])
    #>
    #> $node[[2]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> .nei
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(29, 2), (29, 6)])
    #>
    #> $node[[3]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> nei
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(50, 2), (50, 5)])
    #>
    #> $node[[4]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> .innei
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(53, 2), (53, 8)])
    #>
    #> $node[[5]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> innei
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(56, 2), (56, 7)])
    #>
    #> $node[[6]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> .outnei
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(59, 2), (59, 9)])
    #>
    #> $node[[7]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> outnei
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(62, 2), (62, 8)])
    #>
    #> $node[[8]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> .inc
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(65, 2), (65, 6)])
    #>
    #> $node[[9]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> inc
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(78, 2), (78, 5)])
    #>
    #> $node[[10]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> adj
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(81, 2), (81, 5)])
    #>
    #> $node[[11]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> .from
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(84, 2), (84, 7)])
    #>
    #> $node[[12]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> from
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(96, 2), (96, 6)])
    #>
    #> $node[[13]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> .to
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(99, 2), (99, 5)])
    #>
    #> $node[[14]]
    #> <tree_sitter_node>
    #>
    #> ── Text ────────────────────────────────────────────────────────────────────────
    #> to
    #>
    #> ── S-Expression ────────────────────────────────────────────────────────────────
    #> (identifier [(111, 2), (111, 4)])
After that I was able to get the names of the children functions! I was actually only interested in those whose names start with a dot as all the other ones are deprecated anyway.
    kiddos_functions <- purrr::map_chr(square_bracket_thing_captures$node, treesitter::node_text)
    kiddos_functions[startsWith(kiddos_functions, ".")]
    #> [1] ".nei"    ".innei"  ".outnei" ".inc"    ".from"   ".to"
Tada! Now I can go fix the issue I was tasked with.
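As a possible shortcut, here is a minimal sketch that skips the re-parsing step by running the second query directly on the captured node, since treesitter::query_captures() takes a node as input. It uses the simplified chunk from earlier plus a hypothetical top-level function `other()`, added only to check that the search stays inside the parent. This is a toy example under those assumptions, not what was run on the igraph source.

```r
language <- treesitter.r::language()
parser <- treesitter::parser(language)

# Simplified parent function, plus a hypothetical unrelated function `other()`.
text <- "
`[.igraph.vs` <- function(...) {
  bla <- function(...) {
    # some code
  }
  blop <- function(...) {
    # some code
  }
}
other <- function() NULL
"

tree <- treesitter::parser_parse(parser, text)
node <- treesitter::tree_root_node(tree)

# Parent query: capture the whole assignment as @def, filter on the name.
parent_source <- '
((binary_operator
  lhs: (identifier) @name
  rhs: (function_definition)) @def
  (#eq? @name "`[.igraph.vs`"))
'
parent_query <- treesitter::query(language, parent_source)
parent_captures <- treesitter::query_captures(parent_query, node)

# Pick the node by capture name rather than by position in the list.
parent_fn <- parent_captures$node[[which(parent_captures$name == "def")]]

# Children query, run on the parent node itself instead of on re-parsed text.
kiddos_source <- '
(binary_operator
  lhs: (identifier) @name
  rhs: (function_definition))
'
kiddos_query <- treesitter::query(language, kiddos_source)
kiddos_captures <- treesitter::query_captures(kiddos_query, parent_fn)

kiddos <- vapply(kiddos_captures$node, treesitter::node_text, character(1))

# The parent assignment matches the pattern too, so drop its name.
kiddos <- setdiff(kiddos, "`[.igraph.vs`")
kiddos
```

Selecting the capture by name avoids relying on the order in which "def" and "name" are returned.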
Conclusion
In this post I report on my first encounter with the treesitter package for parsing R code. Copy-pasting the few function names would surely have been faster, but sometimes you've got to sit and learn something new. ☺️
[1] No, I have not installed Positron yet.