Migrating a blog to Quarto: Reverse engineering HTML to markdown

David Schoch

1 day ago

[This article was first published on schochastics - all things R, and kindly contributed to R-bloggers].

I migrated my personal webpage to Quarto in July 2022. The only thing I did not migrate was my blog, for two reasons: 1) Quarto was still very new and many of today's features were not available. I already needed to hack some things together for my personal page (which are still in place although Quarto now has the features…). For my blog, I wanted to wait until Quarto was more mature. 2) I feared that it would not be straightforward to migrate all my blog entries from blogdown to Quarto.

Time went by and I kept thinking about the migration, but 2) kept me away from it. Over time though, I realized (as apparently many others did) how broken my Hugo/blogdown/theme setup had become. Every update introduced new issues. Lately, I barely managed to put a post together without some weird hotfixes in the background. My blog had reached the state of FUBAR. So it was time to migrate. Or simply start over.

Migrate or start over?

Pondering the migration, I reached a point where I considered starting over with my blog. Why go through the hassle of migrating all posts? Obviously, I could just leave the blog as is to not break anyone's bookmarks (if they even exist) and simply start a new Quarto blog. This lazy solution seemed compelling for many other reasons. When I started this iteration of my blog in 2017, I didn't know (or care) about “reproducibility”. So, many of my early posts cannot be rerun because the data is lost or hidden on some old hard drive, or because they contain paths that do not resolve anymore. So, without trying, I felt it was highly unlikely that I could rerender all posts in Quarto without considerable effort.

But there was one thing I was willing to try, purely out of technical interest:

Is it possible to convert the html posts back to a raw markdown file?

That way, I would not need to rerun all analyses and would only need to render to html with my new Quarto theme (and probably do some yaml patching).

Preparatory steps

I created a csv file with all existing posts. This was done semiautomatically by scraping my own blog. The only manual work was the categories. My categories on the old blog were fairly random, so I recategorized everything to be more consistent.

Next, I downloaded the html files. This was surprisingly challenging. I tried some automatic approaches (download.file(), the rvest package) but the best (for my approach) was to save the page via CTRL+S in Firefox. This created the html file and a folder containing all asset files. In later steps, this turned out to be very beneficial.¹

Html to markdown

To convert the html files to markdown, I used pandoc, a powerful converter for markup languages.²

A little bit of searching gave me this command to convert html to markdown.

pandoc post.html -t gfm-raw_html -o index.md

-t gfm-raw_html selects GitHub-flavored markdown as the output format with the raw_html extension switched off, which supposedly removes all html tags from the file and returns only raw markdown. It didn’t do so for me, so I wrote a quick Lua filter to help with the remaining tags.

remove-tags.lua

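-- Clear the id attribute of every header
function Header (elem)
  elem.identifier = ""
  return elem
end

-- Unwrap every div: return its content in place of the div itself
function Div (elem)
  return elem.content
end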

So the final pandoc command is

pandoc post.html --lua-filter remove-tags.lua -t gfm-raw_html -o index.md

For me, this produced a clean markdown file of the post that can now essentially be used to rerender the posts with Quarto 🥳!
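With more than one post, the same command can be wrapped in a loop. The following is only a sketch: it assumes that every saved page sits at posts/<folder>/post.html and that remove-tags.lua is in the working directory. convert_posts is a made-up helper name for illustration.

```shell
# Hypothetical helper: convert every <root>/<folder>/post.html to index.md.
# The directory layout is an assumption; adjust the glob to your own setup.
convert_posts() {
  root=$1
  for html in "$root"/*/post.html; do
    # If the glob matched nothing, the pattern itself is returned; skip it.
    [ -e "$html" ] || continue
    out="$(dirname "$html")/index.md"
    pandoc "$html" --lua-filter remove-tags.lua -t gfm-raw_html -o "$out"
  done
}

# Usage: convert_posts posts
```

Each index.md ends up next to the post.html it was generated from, so the asset folder saved by Firefox stays right beside it.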

Clean up

While I did get a raw markdown file out of the html file, there was still some cleaning up to do. For instance, there were some lines at the beginning and the end that needed to be eliminated. Fortunately, it was the same pattern for all posts (first 16 lines and all lines after the line starting with “Tagged”), so sed does the trick.

sed -i '/^Tagged/,$d' index.md
sed -i '1,16d' index.md
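To see what the two deletions do, here is a toy run. The file content is invented for illustration, and clean_post is a hypothetical helper. It combines both expressions in one call and prints the result instead of editing in place with -i, which should behave the same with GNU and BSD sed.

```shell
# Hypothetical helper: drop the first 16 lines and everything from "Tagged" on.
clean_post() {
  sed -e '/^Tagged/,$d' -e '1,16d' "$1"
}

# Toy input: 16 lines of theme residue, a short body, then the footer.
demo=$(mktemp)
{
  i=1
  while [ "$i" -le 16 ]; do
    echo "theme residue $i"
    i=$((i + 1))
  done
  printf '%s\n' 'First line of the post' 'Second line of the post'
  printf '%s\n' 'Tagged: R' 'footer residue'
} > "$demo"

clean_post "$demo"   # prints only the two lines of the post body
```

The line numbers in 1,16d refer to the input file, so the order of the two expressions does not matter here.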

Next up, a yaml header needed to be added. With the csv file, this was also quite straightforward.³ I also injected a line into the post that warns about the automatic conversion and links to the archived original post.

library(stringr)
library(tibble)

# posts: data frame read from the csv file, one row per post
for (i in seq_len(nrow(posts))) {
    file_md <- fs::path("posts", posts$new_folder[i], "index.md")
    post <- readLines(file_md)
    # notice about the automatic conversion, linking to the archived original
    addendum <- c(
        "",
        paste0(
            "*This post was semi automatically converted from blogdown to ",
            "Quarto and may contain errors. The original can be found in the ",
            "[archive](",
            # fixed() matches "blog." literally instead of as a regex
            str_replace(posts$old_link[i], fixed("blog."), "archive."), ").*"
        ),
        ""
    )
    # insert the notice right after the first line of the post
    post <- c(post[1], addendum, post[-1])
    # build the yaml header as one string, then split it into lines
    header <- tibble(
        author = list(name = "David Schoch", orcid = "0000-0003-2952-4812")
    ) |>
        yaml::as.yaml() |>
        paste0("title: \"", posts$post_title[i], "\"\n", ... = _) |>
        paste0(... = _, "date: ", posts$pub_date[i], "\n") |>
        paste0(... = _, "categories: [", posts$category[i], "]\n") |>
        paste0("---\n", ... = _, "---\n") |>
        # turn the author entry into a yaml list item
        str_replace("name:", "- name:") |>
        str_replace("orcid:", "  orcid:") |>
        str_split("\n")
    writeLines(c(header[[1]], post), file_md)
}
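For comparison, the same header can be assembled with a plain here-document. This is a sketch under assumptions: the field order and indentation are inferred from the string replacements above rather than copied from an actual output file, make_header and add_header are hypothetical names, and titles containing double quotes are not escaped.

```shell
# Hypothetical helper: print Quarto front matter for one post.
# Arguments: title, date, comma-separated categories.
make_header() {
  cat <<EOF
---
title: "$1"
author:
  - name: David Schoch
    orcid: 0000-0003-2952-4812
date: $2
categories: [$3]
---
EOF
}

# Hypothetical helper: prepend the header (plus a blank line) to a file.
# Arguments: title, date, categories, path to index.md.
add_header() {
  { make_header "$1" "$2" "$3"; echo; cat "$4"; } > "$4.tmp" &&
    mv "$4.tmp" "$4"
}

# Usage: add_header "My title" 2024-01-02 "R, quarto" posts/my-post/index.md
```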

The last step was to fix the images. This is where the post_files folder from Firefox became very handy. In each markdown file, images were included as ![](post_files/image.png). So all that needed to be done was to copy image.png out of the post_files folder and adjust the path to ![](image.png).
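Sketched in shell, the image fix could look like this. fix_images is a hypothetical helper, and it assumes that image file names contain no spaces or closing parentheses and that every image is referenced as ](post_files/<name>).

```shell
# Hypothetical helper: copy referenced images next to index.md and fix paths.
# Argument: the folder holding index.md and the post_files directory.
fix_images() {
  post_dir=$1
  md="$post_dir/index.md"
  # Extract the file names used in ](post_files/<name>) and copy each one.
  grep -o '](post_files/[^)]*)' "$md" |
    sed -e 's#^](post_files/##' -e 's#)$##' |
    while IFS= read -r img; do
      cp "$post_dir/post_files/$img" "$post_dir/$img"
    done
  # Rewrite ![](post_files/image.png) to ![](image.png).
  sed 's#](post_files/#](#g' "$md" > "$md.tmp" && mv "$md.tmp" "$md"
}

# Usage: fix_images posts/my-post
```

Only files that are actually referenced in the markdown are copied, so stylesheets and scripts saved by the browser stay behind in post_files.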

And those were all the required steps to convert my blogdown blog to a Quarto blog by reverse engineering html to markdown. I am pretty sure that there are still lots of tiny errors⁴, but the main work is done.

Addendum

A full migration script is available on GitHub. This is not going to work out of the box for other blogs, because it depends on the theme the blog uses. I just shared it in case it helps as a reference point for others who are insane enough to migrate their blog like this.

Footnotes

¹ There is for sure a better (and automatic) way to do this. But I gave this project one afternoon/evening, so I was not willing to dig deeper.
² Actually, Quarto is based around it.
³ Although the code is probably unnecessarily complicated.
⁴ I swear I will fix them all! Until then, http://archive.schochastics.net has the original posts.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{schoch2024,
  author = {Schoch, David},
  title = {Migrating a Blog to {Quarto:} {Reverse} Engineering {HTML} to
    Markdown},
  date = {2024-01-02},
  url = {http://blog.schochastics.net/posts/2024-01-02_migrating-to-quarto},
  langid = {en}
}

For attribution, please cite this work as:

Schoch, David. 2024. “Migrating a Blog to Quarto: Reverse Engineering HTML to Markdown.” January 2, 2024. http://blog.schochastics.net/posts/2024-01-02_migrating-to-quarto.
