Extracting all links from my slidedeck

[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers].

Last week after my useR! talk, someone I had met at the R-Ladies dinner asked me for a list of all the links in my slides. I said I’d prepare it, not because I’m a nice person, but because I knew it’d be a use case where the great tinkr package would shine! 😈

What is tinkr?

tinkr is an R package I created, and that its current maintainer Zhian Kamvar took much further than I ever would have. tinkr can transform Markdown into XML and back.

Under the hood, tinkr uses

  • commonmark for the Markdown-to-XML conversion. CommonMark, in the form of its cmark implementation, is the C library that GitHub for instance uses to display your Markdown comments as HTML. The commonmark package is also what powers Markdown support in roxygen2.
  • xslt for the XML-to-Markdown conversion. XSLT is a templating language for XML.
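
To make the Markdown-to-XML half concrete, here is a minimal sketch that calls commonmark directly, without tinkr. The Markdown snippet is made up for illustration (only the URL comes from the slides), and this is not necessarily how tinkr calls commonmark internally.

```r
# Made-up Markdown, standing in for a slide source
md <- paste(
  "# A slide",
  "",
  "See [my slides](https://user-maelle.netlify.app) for more.",
  sep = "\n"
)

# commonmark turns the Markdown into an XML string...
xml_string <- commonmark::markdown_xml(md)

# ...which xml2 can parse into a document we can query
doc <- xml2::read_xml(xml_string)

# Names of the top-level nodes of the document
xml2::xml_name(xml2::xml_children(doc))
```

tinkr wraps this kind of conversion, keeps track of the YAML header, and adds the XSLT step needed to get back to Markdown.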

Anyway, enough said, let’s go back to today’s use case.

With tinkr I can use XPath, the query language for XML or HTML, to extract links from my slidedeck source. Then I will format them as a list.

First, I create a yarn object from my slidedeck source.

talk_yarn <- tinkr::yarn$new("/home/maelle/Documents/conferences/user2024/index.qmd")
talk_yarn
#> <yarn>
#>   Public:
#>     add_md: function (md, where = 0L) 
#>     body: xml_document, xml_node
#>     clone: function (deep = FALSE) 
#>     get_protected: function (type = NULL) 
#>     head: function (n = 6L, stylesheet_path = stylesheet()) 
#>     initialize: function (path = NULL, encoding = "UTF-8", sourcepos = FALSE, 
#>     md_vec: function (xpath = NULL, stylesheet_path = stylesheet()) 
#>     ns: http://commonmark.org/xml/1.0
#>     path: /home/maelle/Documents/conferences/user2024/index.qmd
#>     protect_curly: function () 
#>     protect_math: function () 
#>     protect_unescaped: function () 
#>     reset: function () 
#>     show: function (lines = TRUE, stylesheet_path = stylesheet()) 
#>     tail: function (n = 6L, stylesheet_path = stylesheet()) 
#>     write: function (path = NULL, stylesheet_path = stylesheet()) 
#>     yaml: --- format:   revealjs:       highlight-style: a11y      ...
#>   Private:
#>     encoding: UTF-8
#>     md_lines: function (path = NULL, stylesheet = NULL) 
#>     sourcepos: FALSE

Then I extract all links.

links <- xml2::xml_find_all(
  talk_yarn$body, 
  xpath = ".//md:link",
  ns = talk_yarn$ns
)
head(links)
#> {xml_nodeset (6)}
#> [1] <link destination="https://user-maelle.netlify.app" title="">\n  <text xm ...
#> [2] <link destination="https://www.pexels.com/photo/old-cargo-ship-on-sea-207 ...
#> [3] <link destination="https://www.pexels.com/photo/the-word-louise-is-spelle ...
#> [4] <link destination="https://www.pexels.com/photo/gray-rotary-telephone-on- ...
#> [5] <link destination="https://www.pexels.com/photo/close-up-photography-of-y ...
#> [6] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
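
Each node in such a set carries the URL in its destination attribute and the link text in its children. Below is a small sketch of pulling both out. It assumes a made-up Markdown snippet parsed with commonmark and xml2 rather than the real yarn object, so the namespace prefix is `d1` (the one xml2 assigns to a default namespace) instead of `md`; the link text "my slides" is hypothetical.

```r
# Made-up Markdown with one regular link and one autolink
md <- paste(
  "Slides: [my slides](https://user-maelle.netlify.app).",
  "",
  "A book: <https://nostarch.com/kill-it-fire>",
  sep = "\n"
)
doc <- xml2::read_xml(commonmark::markdown_xml(md))

# Find all link nodes, using the document's own namespace definitions
links <- xml2::xml_find_all(doc, ".//d1:link", ns = xml2::xml_ns(doc))

# One row per link: URL from the attribute, text from the children
link_info <- data.frame(
  url = xml2::xml_attr(links, "destination"),
  text = xml2::xml_text(links, trim = TRUE)
)
link_info
```

For an autolink like the second one, the text is simply the URL itself.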

I then throw away the links to the great website Pexels, because these are image credits rather than information useful to do R stuff.

links <- purrr::discard(
  links, 
  \(x) startsWith(xml2::xml_attr(x, "destination"), "https://www.pexels")
)
head(links)
#> {xml_nodeset (6)}
#> [1] <link destination="https://user-maelle.netlify.app" title="">\n  <text xm ...
#> [2] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
#> [3] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
#> [4] <link destination="https://www.heltweg.org/posts/who-wrote-this-shit/" ti ...
#> [5] <link destination="https://fosstodon.org/@hadleywickham/11202130903588421 ...
#> [6] <link destination="https://nostarch.com/kill-it-fire" title="">\n  <text  ...

After that I can format the links and display them here using an “asis” chunk. Yes, my slidedeck uses Quarto but this blog is still powered by R Markdown, hugodown to be precise.

I’m using the formatting as an opportunity to only keep distinct links: sometimes I had very similar slides in a row, with repeated information.
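
The formatting step could look roughly like the sketch below, which is an illustration rather than the exact code behind this post. The helper name `format_links()` and the link texts are hypothetical; the inputs are plain character vectors of URLs and texts, such as those one could extract from the link nodes.

```r
# Hypothetical helper: build Markdown list items and drop exact repeats
format_links <- function(url, text) {
  items <- sprintf("* [%s](%s)", text, url)
  unique(items)
}

# Example input with a repeated link (texts are made up)
url <- c(
  "https://user-maelle.netlify.app",
  "https://user-maelle.netlify.app",
  "https://nostarch.com/kill-it-fire"
)
text <- c("slides", "slides", "Kill It with Fire")

formatted <- format_links(url, text)

# In a chunk with results='asis', this prints a Markdown list
cat(formatted, sep = "\n")
```

Deduplicating after formatting means two links are merged only when both their URL and their text match.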

Conclusion

Using tinkr, XPath and sprintf(), I was able to create a list of all the links shared in my useR! slidedeck. Some of them have no text of their own, so the URL serves as the link text; some have text that only makes sense in the context of the paragraph they were part of; others are slightly more informative. At least none of them is a “click here” link. 😅
