Extracting all links from my slidedeck

[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers].

Last week after my useR! talk, someone I had met at the R-Ladies dinner asked me for a list of all the links in my slides. I said I’d prepare it, not because I’m a nice person, but because I knew it’d be a use case where the great tinkr package would shine! 😈

What is tinkr?

tinkr is an R package I created, and that its current maintainer Zhian Kamvar took much further than I ever would have. tinkr can transform Markdown into XML and back.
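To make the round trip concrete, here is a minimal sketch on a throwaway file. The file content, the `example.org` URL and the names `md_path`, `out_path` and `toy_yarn` are made up for illustration; only `yarn$new()`, the `body` field and the `write()` method come from tinkr itself.

```r
# A toy Markdown file (hypothetical content, not the real slidedeck).
md_path <- tempfile(fileext = ".md")
writeLines(
  c(
    "# A toy document",
    "",
    "Some *emphasis* and a [link](https://example.org)."
  ),
  md_path
)

# Markdown -> XML: the parsed document lives in the `body` field.
toy_yarn <- tinkr::yarn$new(md_path)
node_names <- xml2::xml_name(xml2::xml_children(toy_yarn$body))
node_names

# XML -> Markdown: write the (possibly modified) document back to disk.
out_path <- tempfile(fileext = ".md")
toy_yarn$write(out_path)
round_trip <- readLines(out_path)
round_trip
```

The written file should be equivalent Markdown, though not necessarily byte-for-byte identical to the input.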

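Under the hood, tinkr uses the commonmark package to convert Markdown to XML, and the xslt package to convert XML back to Markdown.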

Anyway, enough said, let’s go back to today’s use case.

Extract and format links from index.qmd

With tinkr I can use XPath, the query language for XML or HTML, to extract links from my slidedeck source. Then I will format them as a list.

First, I create a yarn object from my slidedeck source.

talk_yarn <- tinkr::yarn$new("/home/maelle/Documents/conferences/user2024/index.qmd")
talk_yarn
#> <yarn>
#>   Public:
#>     add_md: function (md, where = 0L) 
#>     body: xml_document, xml_node
#>     clone: function (deep = FALSE) 
#>     get_protected: function (type = NULL) 
#>     head: function (n = 6L, stylesheet_path = stylesheet()) 
#>     initialize: function (path = NULL, encoding = "UTF-8", sourcepos = FALSE, 
#>     md_vec: function (xpath = NULL, stylesheet_path = stylesheet()) 
#>     ns: http://commonmark.org/xml/1.0
#>     path: /home/maelle/Documents/conferences/user2024/index.qmd
#>     protect_curly: function () 
#>     protect_math: function () 
#>     protect_unescaped: function () 
#>     reset: function () 
#>     show: function (lines = TRUE, stylesheet_path = stylesheet()) 
#>     tail: function (n = 6L, stylesheet_path = stylesheet()) 
#>     write: function (path = NULL, stylesheet_path = stylesheet()) 
#>     yaml: --- format:   revealjs:       highlight-style: a11y      ...
#>   Private:
#>     encoding: UTF-8
#>     md_lines: function (path = NULL, stylesheet = NULL) 
#>     sourcepos: FALSE

Then I extract all links.

links <- xml2::xml_find_all(
  talk_yarn$body, 
  xpath = ".//md:link",
  ns = talk_yarn$ns
)
head(links)
#> {xml_nodeset (6)}
#> [1] <link destination="https://user-maelle.netlify.app" title="">\n  <text xm ...
#> [2] <link destination="https://www.pexels.com/photo/old-cargo-ship-on-sea-207 ...
#> [3] <link destination="https://www.pexels.com/photo/the-word-louise-is-spelle ...
#> [4] <link destination="https://www.pexels.com/photo/gray-rotary-telephone-on- ...
#> [5] <link destination="https://www.pexels.com/photo/close-up-photography-of-y ...
#> [6] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
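The `md:` prefix and the `ns` argument matter because the nodes live in the commonmark namespace. The sketch below, on a made-up one-line document (the file content and the names `ns_path`, `ns_yarn`, `without_prefix` and `with_prefix` are assumptions for illustration), shows that the same query without the prefix finds nothing.

```r
# A toy document with a single link (hypothetical content).
ns_path <- tempfile(fileext = ".md")
writeLines("Read the [tinkr docs](https://docs.ropensci.org/tinkr/).", ns_path)
ns_yarn <- tinkr::yarn$new(ns_path)

# Unprefixed element names only match nodes outside any namespace,
# so this query comes back empty.
without_prefix <- xml2::xml_find_all(ns_yarn$body, ".//link")
length(without_prefix)

# With the md prefix bound to the commonmark namespace, the link is found.
with_prefix <- xml2::xml_find_all(ns_yarn$body, ".//md:link", ns = ns_yarn$ns)
xml2::xml_attr(with_prefix, "destination")
```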

I then throw away the links to the great website Pexels, because these are image credits rather than information useful to do R stuff.

links <- purrr::discard(
  links, 
  \(x) startsWith(xml2::xml_attr(x, "destination"), "https://www.pexels")
)
head(links)
#> {xml_nodeset (6)}
#> [1] <link destination="https://user-maelle.netlify.app" title="">\n  <text xm ...
#> [2] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
#> [3] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
#> [4] <link destination="https://www.heltweg.org/posts/who-wrote-this-shit/" ti ...
#> [5] <link destination="https://fosstodon.org/@hadleywickham/11202130903588421 ...
#> [6] <link destination="https://nostarch.com/kill-it-fire" title="">\n  <text  ...
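The same filtering could probably also be done inside the XPath query itself, using the XPath 1.0 function `starts-with()` in a predicate. This is an alternative sketch rather than what I ran above; the toy file, the placeholder Pexels URL and the names `filter_path`, `filter_yarn` and `kept` are made up for illustration.

```r
# A toy document with one image-credit link and one useful link
# (hypothetical content; the Pexels URL is a placeholder).
filter_path <- tempfile(fileext = ".md")
writeLines(
  c(
    "Photo from [Pexels](https://www.pexels.com/photo/example/).",
    "",
    "Read the [tinkr docs](https://docs.ropensci.org/tinkr/)."
  ),
  filter_path
)
filter_yarn <- tinkr::yarn$new(filter_path)

# Keep only links whose destination does not start with the Pexels prefix.
kept <- xml2::xml_find_all(
  filter_yarn$body,
  xpath = ".//md:link[not(starts-with(@destination, 'https://www.pexels'))]",
  ns = filter_yarn$ns
)
kept_urls <- xml2::xml_attr(kept, "destination")
kept_urls
```

Filtering in R with `purrr::discard()` is easier to read if you are not used to XPath; the predicate version saves one step.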

After that I can format the links and display them here using an “asis” chunk. Yes, my slidedeck uses Quarto but this blog is still powered by R Markdown, hugodown to be precise.

I’m using the formatting as an opportunity to only keep distinct links: sometimes I had very similar slides in a row, with repeated information.

format_link <- function(link) {
  url <- xml2::xml_attr(link, "destination")
  text <- xml2::xml_text(link)
  sprintf("* [%s](%s)", text, url)
}

formatted_links <- purrr::map_chr(links, format_link)

formatted_links <- unique(formatted_links)

formatted_links |>
  paste(collapse = "\n") |>
  cat()
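To see what the formatting and deduplication steps do, here is a small sketch on a made-up document containing a bare URL and a repeated link. The file content and the names `dedup_path`, `dedup_yarn`, `toy_links` and `toy_items` are assumptions for illustration. It also relies on `xml2::xml_text()` and `xml2::xml_attr()` being vectorised over a nodeset, so no explicit loop is needed here.

```r
# A toy document: one bare URL, and the same link used twice
# (hypothetical content).
dedup_path <- tempfile(fileext = ".md")
writeLines(
  c(
    "See <https://example.org> for details.",
    "",
    "See [the docs](https://example.org/docs) first.",
    "",
    "See [the docs](https://example.org/docs) again."
  ),
  dedup_path
)
dedup_yarn <- tinkr::yarn$new(dedup_path)
toy_links <- xml2::xml_find_all(dedup_yarn$body, ".//md:link", ns = dedup_yarn$ns)

# Build one Markdown list item per link, then drop exact duplicates.
toy_items <- sprintf(
  "* [%s](%s)",
  xml2::xml_text(toy_links),
  xml2::xml_attr(toy_links, "destination")
)
toy_items <- unique(toy_items)
cat(toy_items, sep = "\n")
```

For the bare URL, the link text is the URL itself, so it appears twice in the list item. Note that `unique()` only removes items where both text and URL are identical.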

Conclusion

Using tinkr, XPath and sprintf(), I was able to create a list of all the links shared in my useR! slidedeck. Some of them have no text of their own, so the URL serves as the link text; some have text that only makes sense in the context of the paragraph they were part of; others are slightly more informative. But at least none of them is a “click here” link. 😅
