[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Once again a post praising XML. 😇 These are notes from a quite particular use case: what if you want to replace the usage of a function with another one in many scripts, without manual edits and without touching lines that do not contain a call to replace?

The real life example that inspired this post is the replacement of all calls to expect_that(..., equals(...)), like expect_that(a, equals(1)), in igraph tests with expect_equal(). If you’re a newer package developer who grew up with testthat’s third edition, you’ve probably never heard of that cutesy old-school testing style. 😉

Why automate? Where I subjectively justify my choice

As brilliantly explained by XKCD 1205, automation is not necessary worth the time. In the case that motivated this post, automation was worth it because there were many test files, and because being able to regenerate all edits meant we can recreate the changes after merging other edits to the main branch, without any conflict.

Parse the code to XML, detect problematic calls

For any path, we detect function calls to expect_that(). The code is parsed using the parse() function, digested into XML with {xmlparsedata}.

xml <- path |>
  parse(keep.source = TRUE) |>
  xmlparsedata::xml_parse_data(pretty = TRUE) |>
  xml2::read_xml()

deprecated <- xml2::xml_find_all(
  xml,
  ".//SYMBOL_FUNCTION_CALL[text()='expect_that']"
)

The deprecated object contains all the nodes we need to amend.

For info, here’s how code parsed to XML looks like (yes, it is big despite representing two short lines of code):

parse(text = "1+1nsum(c(2,2))", keep.source = TRUE) |>
  xmlparsedata::xml_parse_data(pretty = TRUE) |>
  cat()
#> <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
#> <exprlist>
#>   <expr line1="1" col1="1" line2="1" col2="3" start="13" end="15">
#>     <expr line1="1" col1="1" line2="1" col2="1" start="13" end="13">
#>       <NUM_CONST line1="1" col1="1" line2="1" col2="1" start="13" end="13">1</NUM_CONST>
#>     </expr>
#>     <OP-PLUS line1="1" col1="2" line2="1" col2="2" start="14" end="14">+</OP-PLUS>
#>     <expr line1="1" col1="3" line2="1" col2="3" start="15" end="15">
#>       <NUM_CONST line1="1" col1="3" line2="1" col2="3" start="15" end="15">1</NUM_CONST>
#>     </expr>
#>   </expr>
#>   <expr line1="2" col1="1" line2="2" col2="11" start="25" end="35">
#>     <expr line1="2" col1="1" line2="2" col2="3" start="25" end="27">
#>       <SYMBOL_FUNCTION_CALL line1="2" col1="1" line2="2" col2="3" start="25" end="27">sum</SYMBOL_FUNCTION_CALL>
#>     </expr>
#>     <OP-LEFT-PAREN line1="2" col1="4" line2="2" col2="4" start="28" end="28">(</OP-LEFT-PAREN>
#>     <expr line1="2" col1="5" line2="2" col2="10" start="29" end="34">
#>       <expr line1="2" col1="5" line2="2" col2="5" start="29" end="29">
#>         <SYMBOL_FUNCTION_CALL line1="2" col1="5" line2="2" col2="5" start="29" end="29">c</SYMBOL_FUNCTION_CALL>
#>       </expr>
#>       <OP-LEFT-PAREN line1="2" col1="6" line2="2" col2="6" start="30" end="30">(</OP-LEFT-PAREN>
#>       <expr line1="2" col1="7" line2="2" col2="7" start="31" end="31">
#>         <NUM_CONST line1="2" col1="7" line2="2" col2="7" start="31" end="31">2</NUM_CONST>
#>       </expr>
#>       <OP-COMMA line1="2" col1="8" line2="2" col2="8" start="32" end="32">,</OP-COMMA>
#>       <expr line1="2" col1="9" line2="2" col2="9" start="33" end="33">
#>         <NUM_CONST line1="2" col1="9" line2="2" col2="9" start="33" end="33">2</NUM_CONST>
#>       </expr>
#>       <OP-RIGHT-PAREN line1="2" col1="10" line2="2" col2="10" start="34" end="34">)</OP-RIGHT-PAREN>
#>     </expr>
#>     <OP-RIGHT-PAREN line1="2" col1="11" line2="2" col2="11" start="35" end="35">)</OP-RIGHT-PAREN>
#>   </expr>
#> </exprlist>

Fix problematic calls in the XML representation

The treat_deprecated() function below tries to find a call to equals() inside the expect_equal(), since we only fix the calls to expect_that() that contain equals(). We return early for these other cutesy expectations, with a warning so we can go look at the scripts and get an idea of what the calls are. They will have to be fixed with another script, or manually, depending on how many of them there are.

For the calls to expect_that() that contain a call to equals(), we

  • replace expect_that() with expect_equal()
  • extract the text inside equals() to put it directly as second argument of expect_equal().

Thus expect_that(a, equals(1)) becomes expect_equals(a, 1).

treat_deprecated <- function(xml, path) {
  siblings <- xml2::xml_parent(xml) |> xml2::xml_siblings()
  equal <- siblings[grepl("equals(", xml2::xml_text(siblings))]
  if (length(equal) == 0) {
    cli::cli_alert_warning("WARNING AT {path}.")
    return()
  }
  xml2::xml_text(xml) <- "expect_equal"
  text <- xml2::xml_contents(equal)[[3]] |> xml2::xml_text()
  xml2::xml_remove(xml2::xml_contents(equal))
  xml2::xml_text(equal) <- text
}

Serialize XML to character, write back

We only modify the lines of the script that need to be modified, as it will avoid spurious changes but also avoid figuring out how to serialize the whole HTML.

For each call that was edited in XML, we identify the corresponding start and end lines in the original script. Below is again an example of parsing just to show that each expression has attributes called line1 and line2, the start and end lines.

parse(text = "1+1n2+2", keep.source = TRUE) |>
    xmlparsedata::xml_parse_data(pretty = TRUE) |>
    xml2::read_xml()
#> {xml_document}
#> <exprlist>
#> [1] <expr line1="1" col1="1" line2="1" col2="3" start="5" end="7">n  <expr l ...
#> [2] <expr line1="2" col1="1" line2="2" col2="3" start="9" end="11">n  <expr  ...

There are two cases:

  • the start and end line is the same. We replace the corresponding line with the text of the grand-parent node.
  • the start and end lines are different. We loop over them, for each of them we replace the corresponding line with the text of parent and uncles/aunts nodes.

The choice of parents/siblings might seem a bit arbitrary. I made it work by putting a browser() in my code and figuring out what level of ancestry I had to deal with thanks to random tries. I do not have a particularly good mental model of R code as XML. 😉

For some reason I wrote for loops in the code below, probably because that’s what made sense to me at the time. 🤷

lines <- brio::read_lines(path)

# ...
# code identifying deprecated calls
# ...

purrr::walk(deprecated, treat_deprecated, path = path)

for (deprecated_call in deprecated) {

  parent <- xml2::xml_parent(xml2::xml_parent(deprecated_call))

  line1 <- as.numeric(xml2::xml_attr(parent, "line1"))
  line2 <- as.numeric(xml2::xml_attr(parent, "line2"))

  if (line1 == line2) {
    lines[line1] <- xml2::xml_text(parent)
  } else {
    for (line in line1:line2) {
      siblings <- xml2::xml_children(parent)
      lines[line] <- paste(
        xml2::xml_text(siblings[xml2::xml_attr(siblings, "line1") == line]),
        collapse = ""
      )
    }
  }


}

brio::write_lines(lines, path)

After all this, we write the lines, modified and not, to the original path. It’s important to first try this on a script and check the diff.

Put it all together

Here’s the all script, including automatic commit generation.

parse_script <- function(path) {

  cli::cli_alert_info("Refactoring {path}.")

  lines <- brio::read_lines(path)

  xml <- path |>
    parse(keep.source = TRUE) |>
    xmlparsedata::xml_parse_data(pretty = TRUE) |>
    xml2::read_xml()

  deprecated <- xml2::xml_find_all(
    xml,
    ".//SYMBOL_FUNCTION_CALL[text()='expect_that']"
  )

  purrr::walk(deprecated, treat_deprecated, path = path)

  for (deprecated_call in deprecated) {

    parent <- xml2::xml_parent(xml2::xml_parent(deprecated_call))

    line1 <- as.numeric(xml2::xml_attr(parent, "line1"))
    line2 <- as.numeric(xml2::xml_attr(parent, "line2"))

    if (line1 == line2) {
      lines[line1] <- xml2::xml_text(parent)
    } else {
      for (line in line1:line2) {
        siblings <- xml2::xml_children(parent)
        lines[line] <- paste(
          xml2::xml_text(siblings[xml2::xml_attr(siblings, "line1") == line]),
          collapse = ""
        )
      }
    }


  }

  brio::write_lines(lines, path)

  if (! (path %in% gert::git_status()[["file"]])) {
    return(invisible(TRUE))
  }

  styler::style_file(path)

  gert::git_add(path)
  gert::git_commit(
    sprintf("refactor: remove deprecated expect_that() from %s", fs::path_file(path))
  )
}

treat_deprecated <- function(xml, path) {
  siblings <- xml2::xml_parent(xml) |> xml2::xml_siblings()
  equal <- siblings[grepl("equals(", xml2::xml_text(siblings))]
  if (length(equal) == 0) {
    cli::cli_alert_warning("WARNING AT {path}.")
    return()
  }
  xml2::xml_text(xml) <- "expect_equal"
  text <- xml2::xml_contents(equal)[[3]] |> xml2::xml_text()
  xml2::xml_remove(xml2::xml_contents(equal))
  xml2::xml_text(equal) <- text
}

paths <- fs::dir_ls("tests/testthat", regex = "test-")

purrr::walk(paths, parse_script)

Example PR

Conclusion

In this post I presented a strategy that served me well when refactoring igraph’s test scripts: parsing code to XML, editing it as XML, then writing back the edited lines thanks to the attributes of XML nodes that indicate their original lines in the script.

Other possible approaches include styler’s parsing of code into a table and serialization of that table.

In a more similar approach, which means it might have been wise for me to explore this codebase sooner 😅, the codegrip package uses xmlparsedata and has helpers for finding which lines a node refers to.

Do you sometimes use automatic refactoring (styler, codegrip, etc.), or automatic diagnostics (lintr, pkgcheck, etc.)? Have you written any customization or standalone script to help you with that?

To leave a comment for the author, please follow the link and comment on their blog: Maëlle's R blog on Maëlle Salmon's personal website.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.


Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Continue reading: Automate code refactoring with {xmlparsedata} and {brio}

Future Implications of Automating Code Refactoring

Using a combination of xmlparsedata and brio packages, the article presents an innovative approach to automate code refactoring, transforming the tedious process of manual code edits into an efficient automated task. This automated refactoring dives deep into developing XML parses from code, editing the XML representation, and translating it back to the script, significantly easing the process of replacing function calls across scripts.

Next Development Steps: Potential Improvements and Expansions

While the current iteration presents a significant improvement over manual code refactoring, there is room for enhancements and possible expansion into broader areas:

  1. Scaling up for larger codes: The existing approach works well with smaller codes, but it requires testing and possible optimization for larger and more complex codes.
  2. Expanding to other programming languages: Currently, it is designed for R. However, this concept can be extended to other programming languages, broadening the script’s overall utility.
  3. Enhancing Support for mixed function calls: The present design works best when only one function requires replacement. Expanding this to handle multiple function replacements will increase its versatility.
  4. Incorporating error handling and verification: Building in independent checks to ensure the correctly modified code without any loss of functionality will add an extra layer of security and reliability.

Actionable Insights for Developers

For developers involved with frequent refactoring of code, the following actionable advice can enhance productivity and efficiency:

  1. Explore automated code refactoring tools like xmlparsedata and brio – it can significantly speed up the process and lower the chances of manual errors.
  2. Periodically reassess your code refactoring tools and practices, keeping abreast with the newest advancements and improvements in the field.
  3. Extend your code refactoring practices to include automated checks to ensure the modified code’s uncompromised functionality.
  4. Consider broadening your code refactoring practices to adapt to different programming languages, thereby increasing the versatility of your approach.

Read the original article