tinkr: editing Markdown documents using XML tools

rOpenSci - open tools for open science

3 years ago

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Remember our recent post showing that one can wrangle Markdown files programmatically without regex? That tech note showed how to convert Markdown bodies to XML in order to extract information from them. Now, this post goes one step further and presents tinkr, a package for converting .md and .Rmd files to XML, editing them, and… writing them back as Markdown!

General `tinkr` workflow

The goal of tinkr is to convert Markdown files to XML and back to allow their editing with xml2 (XPath!) instead of numerous complicated regular expressions. The XML represents the full Markdown syntax tree (or AST). If new to XPath refer to this great intro. The package offers two functions, to_xml and to_md.

The current workflow for editing .md and .Rmd with tinkr is

use to_xml to obtain XML from Markdown (based on commonmark::markdown_xml and blogdown:::split_yaml_body).
edit the XML using xml2.
use to_md to save back the resulting Markdown (this uses a XSLT stylesheet, and the xslt package).

Maybe there could be shortcuts functions for some operations in 2, maybe not.

The md to XML to md loop on which tinkr is based is slightly lossy because of Markdown syntax redundancy. For instance

lists can be created with either “+”, “-” or “*“. When using tinkr, the md after editing will only use”-” for lists.
Links built like [word][smallref] and bottom [smallref]: URL become [word](URL).
Characters are escaped (e.g. “[” when not for a link).
Block quotes lines all get “>” whereas in the input only the first could have a “>” at the beginning of the first line.
Tables are no longer pretty, i.e. each header is followed by three dashes no matter the length of the longest string in the column. See this cheatsheet.

Such losses make your Markdown file different, and the git diff a bit harder to parse, but should not change the documents your .md or .Rmd is rendered to. If it does, report a bug in the issue tracker! And in any case, as mentioned in the docs of the namer package, such programmatic Markdown editing is best paired with version control.

A solution to not loose your Markdown style, e.g. your preferring “*“ over “-” for lists is to tweak our XSL stylesheet and provide its filepath as stylesheet_path argument to to_md.

If you want to read more technical details, head over to that section. But we’ll show examples first!

A few `tinkr` examples

Since tinkr is still a toddler package, it’s available from GitHub only.

remotes::install_github("ropenscilabs/tinkr")

Use `tinkr` to change header level

We read “example1.md”, change all headers 3 to headers 1, and save it back to md.

# From Markdown to XML
path <- system.file("extdata", "example1.md", package = "tinkr")
(yaml_xml_list <- tinkr::to_xml(path))

## $yaml
##  [1] "---"                                                                                                                                                                                                                                                                             
##  [2] "title: \"What have these birds been studied for? Querying science outputs with R\""                                                                                                                                                                                              
##  [3] "slug: birds-science"                                                                                                                                                                                                                                                             
##  [4] "authors:"                                                                                                                                                                                                                                                                        
##  [5] "  - name: Maëlle Salmon"                                                                                                                                                                                                                                                         
##  [6] "    url: https://masalmon.eu/"                                                                                                                                                                                                                                                   
##  [7] "date: 2018-09-11"                                                                                                                                                                                                                                                                
##  [8] "topicid: 1347"                                                                                                                                                                                                                                                                   
##  [9] "preface: The blog post series corresponds to the material for a talk Maëlle will give at the [Animal Movement Analysis summer school in Radolfzell, Germany on September the 12th](http://animove.org/animove-2019-evening-keynotes/), in a Max Planck Institute of Ornithology."
## [10] "tags:"                                                                                                                                                                                                                                                                           
## [11] "- rebird"                                                                                                                                                                                                                                                                        
## [12] "- birder"                                                                                                                                                                                                                                                                        
## [13] "- fulltext"                                                                                                                                                                                                                                                                      
## [14] "- dataone"                                                                                                                                                                                                                                                                       
## [15] "- EML"                                                                                                                                                                                                                                                                           
## [16] "- literature"                                                                                                                                                                                                                                                                    
## [17] "output:"                                                                                                                                                                                                                                                                         
## [18] "  md_document:"                                                                                                                                                                                                                                                                  
## [19] "    variant: markdown_github"                                                                                                                                                                                                                                                    
## [20] "    preserve_yaml: true"                                                                                                                                                                                                                                                         
## [21] "---"                                                                                                                                                                                                                                                                             
## 
## $body
## {xml_document}
## <document xmlns="http://commonmark.org/xml/1.0">
##  [1] <paragraph>\n  <text xml:space="preserve">In the </text>\n  <link d ...
##  [2] <heading level="3">\n  <text xml:space="preserve">Getting a list of ...
##  [3] <paragraph>\n  <text xml:space="preserve">For more details about th ...
##  [4] <code_block info="r" xml:space="preserve"># polygon for filtering\n ...
##  [5] <paragraph>\n  <text xml:space="preserve">For the sake of simplicit ...
##  [6] <code_block info="r" xml:space="preserve">species &lt;- ebd %&gt;%\ ...
##  [7] <paragraph>\n  <text xml:space="preserve">The species are Carrion C ...
##  [8] <heading level="3">\n  <text xml:space="preserve">Querying the scie ...
##  [9] <paragraph>\n  <text xml:space="preserve">Just like rOpenSci has a  ...
## [10] <paragraph>\n  <text xml:space="preserve">We shall use </text>\n  < ...
## [11] <paragraph>\n  <text xml:space="preserve">We first define a functio ...
## [12] <paragraph>\n  <text xml:space="preserve">We use </text>\n  <code x ...
## [13] <code_block info="r" xml:space="preserve">.get_papers &lt;- functio ...
## [14] <code_block xml:space="preserve">##  [1] "Great spotted cuckoo nest ...
## [15] <paragraph>\n  <text xml:space="preserve">If we were working on a s ...
## [16] <paragraph>\n  <text xml:space="preserve">We then apply this functi ...
## [17] <code_block info="r" xml:space="preserve">get_papers &lt;- ratelimi ...
## [18] <code_block xml:space="preserve">## [1] 522\n</code_block>
## [19] <code_block info="r" xml:space="preserve">all_papers &lt;- unique(a ...
## [20] <code_block xml:space="preserve">## [1] 378\n</code_block>
## ...

library("magrittr")
# transform level 3 headers into level 1 headers
body <- yaml_xml_list$body
body %>%
  xml2::xml_find_all(xpath = './/d1:heading',
                     xml2::xml_ns(.)) %>%
  .[xml2::xml_attr(., "level") == "3"] -> headers3

xml2::xml_set_attr(headers3, "level", 1)

(yaml_xml_list$body <- body)

## {xml_document}
## <document xmlns="http://commonmark.org/xml/1.0">
##  [1] <paragraph>\n  <text xml:space="preserve">In the </text>\n  <link d ...
##  [2] <heading level="1">\n  <text xml:space="preserve">Getting a list of ...
##  [3] <paragraph>\n  <text xml:space="preserve">For more details about th ...
##  [4] <code_block info="r" xml:space="preserve"># polygon for filtering\n ...
##  [5] <paragraph>\n  <text xml:space="preserve">For the sake of simplicit ...
##  [6] <code_block info="r" xml:space="preserve">species &lt;- ebd %&gt;%\ ...
##  [7] <paragraph>\n  <text xml:space="preserve">The species are Carrion C ...
##  [8] <heading level="1">\n  <text xml:space="preserve">Querying the scie ...
##  [9] <paragraph>\n  <text xml:space="preserve">Just like rOpenSci has a  ...
## [10] <paragraph>\n  <text xml:space="preserve">We shall use </text>\n  < ...
## [11] <paragraph>\n  <text xml:space="preserve">We first define a functio ...
## [12] <paragraph>\n  <text xml:space="preserve">We use </text>\n  <code x ...
## [13] <code_block info="r" xml:space="preserve">.get_papers &lt;- functio ...
## [14] <code_block xml:space="preserve">##  [1] "Great spotted cuckoo nest ...
## [15] <paragraph>\n  <text xml:space="preserve">If we were working on a s ...
## [16] <paragraph>\n  <text xml:space="preserve">We then apply this functi ...
## [17] <code_block info="r" xml:space="preserve">get_papers &lt;- ratelimi ...
## [18] <code_block xml:space="preserve">## [1] 522\n</code_block>
## [19] <code_block info="r" xml:space="preserve">all_papers &lt;- unique(a ...
## [20] <code_block xml:space="preserve">## [1] 378\n</code_block>
## ...

# Back to Markdown
tinkr::to_md(yaml_xml_list, "newmd.md")

Use `tinkr` to wrangle chunks

Because to_xml parses chunk options into node attributes, one can use tinkr to programmatically change options. In the example below, we set the “echo” option to FALSE in all chunks (in real life we might prefer deleting all echo options in chunks, and set it to FALSE in the setup chunk).

# from Rmd to XML
path <- system.file("extdata", "example2.Rmd",
                    package = "tinkr")
yaml_xml_list <- tinkr::to_xml(path)
library("magrittr")

# identify code blocks
body <- yaml_xml_list$body
blocks <- body %>%
xml2::xml_find_all(xpath = './/d1:code_block',
                   xml2::xml_ns(.))

# Change echo attribute
xml2::xml_set_attr(blocks, "echo", "FALSE")
blocks

## {xml_nodeset (4)}
## [1] <code_block xml:space="preserve" language="r" name="setup" include=" ...
## [2] <code_block xml:space="preserve" language="r" name="" eval="TRUE" ec ...
## [3] <code_block xml:space="preserve" language="python" name="" fig.cap=" ...
## [4] <code_block xml:space="preserve" language="python" name="" echo="FAL ...

Note that logical chunk options are handled as character. Actual string chunk options such as fig.cap keep their quotes. This is all to make sure the Rmd we write back has valid options.

# Show fig.cap
xml2::xml_attr(blocks, "fig.cap")

## [1] NA                NA                "\"pretty plot\"" NA

# save back to Rmd
yaml_xml_list$body <- body
tinkr::to_md(yaml_xml_list, "newmd.Rmd")

Bonus: use `tinkr::to_xml` to analyse chunk options

Even only tinkr::to_xml on its own can be powerful, thanks to its parsing chunk options to node attributes. In the example below, we find all code chunks of the “R for data science” book by Garrett Grolemund and Hadley Wickham that have the error=TRUE chunk option. It’s the option you can set to show wrong code in your R Markdown document. Let’s find the wrong examples of that book!

Note that the path below is where the clone of the r4ds repo lives on my computer.

book_path <- "C:\\Users\\Maelle\\Documents\\ropensci\\r4ds"

rmds <- fs::dir_ls(book_path, regexp = "*.Rmd")

There are 32 R Markdown documents.

library("magrittr")

get_chunks <- function(xml){
  xml %>%
    xml2::xml_find_all(xpath = './/d1:code_block',
                       xml2::xml_ns(.))
}

get_error_chunks_code <- function(xml){
  xml %>%
    get_chunks()%>%
    .[xml2::xml_has_attr(., "error")] %>%
    # note that logicals are character
    .[xml2::xml_attr(., "error") == "TRUE"] %>%
    xml2::xml_text() %>%
    as.character() }

purrr::map(rmds, tinkr::to_xml) %>%
  purrr::map("body") -> bodies

bodies %>%
  purrr::map(get_chunks) -> all_chunks

length(unlist(all_chunks))

## [1] 1776

bodies %>%
  purrr::map(get_error_chunks_code) %>%
  unlist() -> buggy_code

There are 11 chunks to show errors out of 1776 chunks. Few enough to show all of them (using the results="asis" chunk option!)…

glue::glue("```r\n {unname(buggy_code)} \n ```")

if (c(TRUE, FALSE)) {}

if (NA) {}
 

wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 6:1, na.rm = "foo")
 

issues %>% map_chr(c("pull_request", "html_url"))
 

tibble(x = "e") %>% 
  add_predictions(mod2)
 

mtcars %>% 
  group_by(cyl) %>% 
  summarise(q = quantile(mpg))
 

# Ok, because y and z have the same number of elements in
# every row
df1 <- tribble(
  ~x, ~y,           ~z,
   1, c("a", "b"), 1:2,
   2, "c",           3
)
df1
df1 %>% unnest(y, z)

# Doesn't work because y and z have different number of elements
df2 <- tribble(
  ~x, ~y,           ~z,
   1, "a",         1:2,  
   2, c("b", "c"),   3
)
df2
df2 %>% unnest(y, z)
 

tryCatch(stop("!"), error = function(e) "An error")

stop("!") %>% 
  tryCatch(error = function(e) "An error")
 

filter(flights, month = 1)
 

tibble(x = 1:4, y = 1:2)

tibble(x = 1:4, y = rep(1:2, 2))

tibble(x = 1:4, y = rep(1:2, each = 2))
 

x[c(1, -1)]
 

my_variable <- 10
my_variable

The book explains in more detail why these chunks create errors… or you can use tinkr to extract more information out of the R Markdown source!

If you’re not interested into technical details, hop over the next section and read the conclusion directly!

`tinkr` under the hood

This section features a bit more technical details about to_xml and to_md, for those interested in such stuff, out of curiosity (cool!) or out of a desire to become a contributor (yay!).

From Markdown to XML: `to_xml`

The to_xml function uses internal code from blogdown to split lines of the md between YAML header and Markdown body. The Markdown body is further processed using commonmark::markdown_xml with extensions=TRUE and a homegrown code to transform code chunks info from string ““`{r setup, include=FALSE, eval = TRUE}” to XML node attributes (called “language”, “name”, “include”, “eval” for this example). This allows easier editing of code chunks.

Side-note: This piece of tinkr worked quite well right away but we noticed that the XML conversion lost the alignment attributes of table cells, so I opened an issue in the repo of the github fork of cmark. rOpenSci post-doc hacker Jeroen Ooms helped me find where to open this issue: cmark is the C library wrapped by the R package commonmark. There’s the parent repo commonmark/cmark, but the one wrapped by commonmark is the github/cmark fork because it’s the one supporting GitHub-flavored Markdown extensions such as tables and strike-through text. A new version of the github/cmark library was released by Ashe Connor and commonmark was updated by Jeroen to use this newest version.

to_xml returns a list containing the YAML header as a character vector, and the body as XML. YAML metadata could be edited using the yaml package, which is not the goal of tinkr.

From XML to Markdown: `to_md`

After editing the XML body, one needs to convert the list containing the YAML and the body back to Markdown. to_md does so by:

packing the chunk options back from node attributes to a string.
using the rOpenSci package xslt (bindings to libxslt) for conversion of the body to Markdown, using an XSLT stylesheet as Rosetta stone. In Jeroen’s words, “XSLT is just a mini programming language to run loops over an XML, that is written itself in XML”. Things such as <heading level="2">\n <text>R Markdown</text>\n</heading> thus becomes ## R Markdown
using writeLines on the YAML character vectors and on the body.

Another side-note: The most crucial part here was obtaining an XSLT stylesheet. Jeroen opened an issue in the commonmark/cmark repo and Nick Wellnhofer “whipped up” an XSLT stylesheet. If you followed correctly, the commonmark/cmark library doesn’t support GitHub-flavoured Markdown extensions, so Nick Wellnhofer understandably didn’t include templates for those in his stylesheet. I miraculously was able to write a bit of XSLT myself, and submitted a stylesheet to the github/cmark repo. It imports Nick Wellnhofer’s stylesheet, and defines templates for tables and strike-through text. Copies of these two stylesheets live in tinkr and are therefore available after installing it.

to_md exposes the path to the stylesheet as argument, so you could provide your own in order to preserve your style preferences (e.g. writing lists with “*” rather than “-”)

Future plans

The goal of tinkr was to experiment with the idea of Commonmark/XML conversion, and then to go from there. Now that to_xml and to_md allow the programmatic XML editing of Markdown files, albeit slightly lossy, maybe tinkr could keep growing! Here are two main possible discussion topics with you, dear reader:

Are you an XSL wizard? If so, would you like to help improve the XSLT stylesheet, in particular to prettify tables? Head to the GitHub repository, in particular to this issue, and volunteer!
tinkr was called after the made-up but accepted word “tink” that means “unknit”. tinkr is currently more about tinkering with than tinking R Markdown files, but why not unearth the common discussion of how to collaborate with people who’d rather edit a Google doc/docx than R Markdon source? Could one convert to and from Open Office XML and help track changes for instance?

And of course, to finish on a more down-to-earth note, if you do edit Markdown files with tinkr, please report your experience and what could be improved, in the comments below or the GitHub repo!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

General tinkr workflow

A few tinkr examples

Use tinkr to change header level

Use tinkr to wrangle chunks

Bonus: use tinkr::to_xml to analyse chunk options

tinkr under the hood

From Markdown to XML: to_xml

From XML to Markdown: to_md