Automate code refactoring with {xmlparsedata} and {brio}

[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Once again a post praising XML. 😇 These are notes from a quite particular use case: what if you want to replace the usage of a function with another one in many scripts, without manual edits and without touching lines that do not contain a call to replace?

The real life example that inspired this post is the replacement of all calls to expect_that(..., equals(...)), like expect_that(a, equals(1)), in igraph tests with expect_equal(). If you’re a newer package developer who grew up with testthat’s third edition, you’ve probably never heard of that cutesy old-school testing style. 😉

Why automate? Where I subjectively justify my choice

As brilliantly explained by XKCD 1205, automation is not necessary worth the time. In the case that motivated this post, automation was worth it because there were many test files, and because being able to regenerate all edits meant we can recreate the changes after merging other edits to the main branch, without any conflict.

Parse the code to XML, detect problematic calls

For any path, we detect function calls to expect_that(). The code is parsed using the parse() function, digested into XML with {xmlparsedata}.

xml <- path |>
  parse(keep.source = TRUE) |>
  xmlparsedata::xml_parse_data(pretty = TRUE) |>
  xml2::read_xml()

deprecated <- xml2::xml_find_all(
  xml,
  ".//SYMBOL_FUNCTION_CALL[text()='expect_that']"
)

The deprecated object contains all the nodes we need to amend.

For info, here’s how code parsed to XML looks like (yes, it is big despite representing two short lines of code):

parse(text = "1+1\nsum(c(2,2))", keep.source = TRUE) |>
  xmlparsedata::xml_parse_data(pretty = TRUE) |>
  cat()
#> <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
#> <exprlist>
#>   <expr line1="1" col1="1" line2="1" col2="3" start="13" end="15">
#>     <expr line1="1" col1="1" line2="1" col2="1" start="13" end="13">
#>       <NUM_CONST line1="1" col1="1" line2="1" col2="1" start="13" end="13">1</NUM_CONST>
#>     </expr>
#>     <OP-PLUS line1="1" col1="2" line2="1" col2="2" start="14" end="14">+</OP-PLUS>
#>     <expr line1="1" col1="3" line2="1" col2="3" start="15" end="15">
#>       <NUM_CONST line1="1" col1="3" line2="1" col2="3" start="15" end="15">1</NUM_CONST>
#>     </expr>
#>   </expr>
#>   <expr line1="2" col1="1" line2="2" col2="11" start="25" end="35">
#>     <expr line1="2" col1="1" line2="2" col2="3" start="25" end="27">
#>       <SYMBOL_FUNCTION_CALL line1="2" col1="1" line2="2" col2="3" start="25" end="27">sum</SYMBOL_FUNCTION_CALL>
#>     </expr>
#>     <OP-LEFT-PAREN line1="2" col1="4" line2="2" col2="4" start="28" end="28">(</OP-LEFT-PAREN>
#>     <expr line1="2" col1="5" line2="2" col2="10" start="29" end="34">
#>       <expr line1="2" col1="5" line2="2" col2="5" start="29" end="29">
#>         <SYMBOL_FUNCTION_CALL line1="2" col1="5" line2="2" col2="5" start="29" end="29">c</SYMBOL_FUNCTION_CALL>
#>       </expr>
#>       <OP-LEFT-PAREN line1="2" col1="6" line2="2" col2="6" start="30" end="30">(</OP-LEFT-PAREN>
#>       <expr line1="2" col1="7" line2="2" col2="7" start="31" end="31">
#>         <NUM_CONST line1="2" col1="7" line2="2" col2="7" start="31" end="31">2</NUM_CONST>
#>       </expr>
#>       <OP-COMMA line1="2" col1="8" line2="2" col2="8" start="32" end="32">,</OP-COMMA>
#>       <expr line1="2" col1="9" line2="2" col2="9" start="33" end="33">
#>         <NUM_CONST line1="2" col1="9" line2="2" col2="9" start="33" end="33">2</NUM_CONST>
#>       </expr>
#>       <OP-RIGHT-PAREN line1="2" col1="10" line2="2" col2="10" start="34" end="34">)</OP-RIGHT-PAREN>
#>     </expr>
#>     <OP-RIGHT-PAREN line1="2" col1="11" line2="2" col2="11" start="35" end="35">)</OP-RIGHT-PAREN>
#>   </expr>
#> </exprlist>

Fix problematic calls in the XML representation

The treat_deprecated() function below tries to find a call to equals() inside the expect_equal(), since we only fix the calls to expect_that() that contain equals(). We return early for these other cutesy expectations, with a warning so we can go look at the scripts and get an idea of what the calls are. They will have to be fixed with another script, or manually, depending on how many of them there are.

For the calls to expect_that() that contain a call to equals(), we

  • replace expect_that() with expect_equal()
  • extract the text inside equals() to put it directly as second argument of expect_equal().

Thus expect_that(a, equals(1)) becomes expect_equals(a, 1).

treat_deprecated <- function(xml, path) {
  siblings <- xml2::xml_parent(xml) |> xml2::xml_siblings()
  equal <- siblings[grepl("equals\\(", xml2::xml_text(siblings))]
  if (length(equal) == 0) {
    cli::cli_alert_warning("WARNING AT {path}.")
    return()
  }
  xml2::xml_text(xml) <- "expect_equal"
  text <- xml2::xml_contents(equal)[[3]] |> xml2::xml_text()
  xml2::xml_remove(xml2::xml_contents(equal))
  xml2::xml_text(equal) <- text
}

Serialize XML to character, write back

We only modify the lines of the script that need to be modified, as it will avoid spurious changes but also avoid figuring out how to serialize the whole HTML.

For each call that was edited in XML, we identify the corresponding start and end lines in the original script. Below is again an example of parsing just to show that each expression has attributes called line1 and line2, the start and end lines.

parse(text = "1+1\n2+2", keep.source = TRUE) |>
    xmlparsedata::xml_parse_data(pretty = TRUE) |>
    xml2::read_xml()
#> {xml_document}
#> <exprlist>
#> [1] <expr line1="1" col1="1" line2="1" col2="3" start="5" end="7">\n  <expr l ...
#> [2] <expr line1="2" col1="1" line2="2" col2="3" start="9" end="11">\n  <expr  ...

There are two cases:

  • the start and end line is the same. We replace the corresponding line with the text of the grand-parent node.
  • the start and end lines are different. We loop over them, for each of them we replace the corresponding line with the text of parent and uncles/aunts nodes.

The choice of parents/siblings might seem a bit arbitrary. I made it work by putting a browser() in my code and figuring out what level of ancestry I had to deal with thanks to random tries. I do not have a particularly good mental model of R code as XML. 😉

For some reason I wrote for loops in the code below, probably because that’s what made sense to me at the time. 🤷

lines <- brio::read_lines(path)

# ...
# code identifying deprecated calls
# ...

purrr::walk(deprecated, treat_deprecated, path = path)

for (deprecated_call in deprecated) {
  
  parent <- xml2::xml_parent(xml2::xml_parent(deprecated_call))
  
  line1 <- as.numeric(xml2::xml_attr(parent, "line1"))
  line2 <- as.numeric(xml2::xml_attr(parent, "line2"))
  
  if (line1 == line2) {
    lines[line1] <- xml2::xml_text(parent)
  } else {
    for (line in line1:line2) {
      siblings <- xml2::xml_children(parent)
      lines[line] <- paste(
        xml2::xml_text(siblings[xml2::xml_attr(siblings, "line1") == line]),
        collapse = ""
      )
    }
  }
  
  
}

brio::write_lines(lines, path)

After all this, we write the lines, modified and not, to the original path. It’s important to first try this on a script and check the diff.

Put it all together

Here’s the all script, including automatic commit generation.

parse_script <- function(path) {

  cli::cli_alert_info("Refactoring {path}.")

  lines <- brio::read_lines(path)

  xml <- path |>
    parse(keep.source = TRUE) |>
    xmlparsedata::xml_parse_data(pretty = TRUE) |>
    xml2::read_xml()

  deprecated <- xml2::xml_find_all(
    xml,
    ".//SYMBOL_FUNCTION_CALL[text()='expect_that']"
  )

  purrr::walk(deprecated, treat_deprecated, path = path)

  for (deprecated_call in deprecated) {

    parent <- xml2::xml_parent(xml2::xml_parent(deprecated_call))

    line1 <- as.numeric(xml2::xml_attr(parent, "line1"))
    line2 <- as.numeric(xml2::xml_attr(parent, "line2"))

    if (line1 == line2) {
      lines[line1] <- xml2::xml_text(parent)
    } else {
      for (line in line1:line2) {
        siblings <- xml2::xml_children(parent)
        lines[line] <- paste(
          xml2::xml_text(siblings[xml2::xml_attr(siblings, "line1") == line]),
          collapse = ""
        )
      }
    }


  }

  brio::write_lines(lines, path)

  if (! (path %in% gert::git_status()[["file"]])) {
    return(invisible(TRUE))
  }

  styler::style_file(path)

  gert::git_add(path)
  gert::git_commit(
    sprintf("refactor: remove deprecated expect_that() from %s", fs::path_file(path))
  )
}

treat_deprecated <- function(xml, path) {
  siblings <- xml2::xml_parent(xml) |> xml2::xml_siblings()
  equal <- siblings[grepl("equals\\(", xml2::xml_text(siblings))]
  if (length(equal) == 0) {
    cli::cli_alert_warning("WARNING AT {path}.")
    return()
  }
  xml2::xml_text(xml) <- "expect_equal"
  text <- xml2::xml_contents(equal)[[3]] |> xml2::xml_text()
  xml2::xml_remove(xml2::xml_contents(equal))
  xml2::xml_text(equal) <- text
}

paths <- fs::dir_ls("tests/testthat", regex = "test-")

purrr::walk(paths, parse_script)

Example PR

Conclusion

In this post I presented a strategy that served me well when refactoring igraph’s test scripts: parsing code to XML, editing it as XML, then writing back the edited lines thanks to the attributes of XML nodes that indicate their original lines in the script.

Other possible approaches include styler’s parsing of code into a table and serialization of that table.

In a more similar approach, which means it might have been wise for me to explore this codebase sooner 😅, the codegrip package uses xmlparsedata and has helpers for finding which lines a node refers to.

Do you sometimes use automatic refactoring (styler, codegrip, etc.), or automatic diagnostics (lintr, pkgcheck, etc.)? Have you written any customization or standalone script to help you with that?

To leave a comment for the author, please follow the link and comment on their blog: Maëlle's R blog on Maëlle Salmon's personal website.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)