The knitr and rmarkdown packages are used in conjunction with pandoc to convert R code and figures to a variety of formats, including PDF and Word. Here, I’m exploring how to convert HTML back to markdown format. This post came about when I was searching for a way to convert XML to markdown, which I still haven’t found an easy way to do. Pandoc is not the only way to convert HTML to markdown (see turndown and html2text).
Pandoc is packaged with RStudio; on Windows, the executables are located within Program Files/RStudio/bin/pandoc. The rmarkdown package contains wrapper functions for using pandoc within RStudio.
Here, I am trying to convert an example HTML page back to markdown using the function pandoc_convert. First, pandoc_convert requires an actual file, which means it does not accept a quoted string of HTML code in its input argument.
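Because pandoc_convert works on files, one workaround for an HTML string is to write it to a temporary file first. The following is only a minimal sketch under stated assumptions: the helper name html_string_to_markdown is hypothetical (it is not part of rmarkdown), absolute temporary paths are used for both input and output, and the conversion is skipped when rmarkdown or pandoc cannot be found.

```r
# Hypothetical helper: convert an HTML string to markdown by round-tripping
# through temporary files, because pandoc_convert() expects a file path.
html_string_to_markdown <- function(html, to = "markdown_strict") {
  in_file  <- tempfile(fileext = ".html")
  out_file <- tempfile(fileext = ".md")
  on.exit(unlink(c(in_file, out_file)), add = TRUE)

  # Write the string to disk so pandoc has a real file to read
  writeLines(html, in_file)

  # Assumption: only attempt the conversion when rmarkdown and pandoc exist
  if (!requireNamespace("rmarkdown", quietly = TRUE) ||
      !rmarkdown::pandoc_available()) {
    message("pandoc is not available; returning NULL")
    return(NULL)
  }

  rmarkdown::pandoc_convert(
    normalizePath(in_file),
    from = "html",
    to = to,
    output = out_file
  )

  # Read the converted markdown back in as a character vector
  readLines(out_file)
}

md <- html_string_to_markdown("<p>Be <b>bold</b> in stating your key points.</p>")
if (!is.null(md)) cat(md, sep = "\n")
```

The temporary files are removed when the function exits, so only the returned character vector survives.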
The example HTML:
<html>
<!-- Text between angle brackets is an HTML tag and is not displayed. Most tags, such as the HTML and /HTML tags that surround the contents of a page, come in pairs; some tags, like HR, for a horizontal rule, stand alone. Comments, such as the text you're reading, are not displayed when the Web page is shown. The information between the HEAD and /HEAD tags is not displayed. The information between the BODY and /BODY tags is displayed.-->
<head>
<title>Enter a title, displayed at the top of the window.</title>
</head>
<!-- The information between the BODY and /BODY tags is displayed.-->
<body>
<h1>Enter the main heading, usually the same as the title.</h1>
<p>Be <b>bold</b> in stating your key points. Put them in a list: </p>
<ul>
<li>The first item in your list</li>
<li>The second item; <i>italicize</i> key words</li>
</ul>
<p>Improve your image by including an image. </p>
<p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p>
<p>Add a link to your favorite <a href="https://www.dummies.com/">Web site</a>. Break up your page with a horizontal rule or two. </p>
<hr>
<p>Finally, link to <a href="page2.html">another page</a> in your own Web site.</p>
<!-- And add a copyright notice.-->
<p>© Wiley Publishing, 2011</p>
</body>
</html>
I saved the HTML example as example.html.
html_page <- readLines("../../static/files/example.html")
We can print the object in R.
cat(html_page)
## <html> <!-- Text between angle brackets is an HTML tag and is not displayed. Most tags, such as the HTML and /HTML tags that surround the contents of a page, come in pairs; some tags, like HR, for a horizontal rule, stand alone. Comments, such as the text you're reading, are not displayed when the Web page is shown. The information between the HEAD and /HEAD tags is not displayed. The information between the BODY and /BODY tags is displayed.--> <head> <title>Enter a title, displayed at the top of the window.</title> </head> <!-- The information between the BODY and /BODY tags is displayed.--> <body> <h1>Enter the main heading, usually the same as the title.</h1> <p>Be <b>bold</b> in stating your key points. Put them in a list: </p> <ul> <li>The first item in your list</li> <li>The second item; <i>italicize</i> key words</li> </ul> <p>Improve your image by including an image. </p> <p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p> <p>Add a link to your favorite <a href="https://www.dummies.com/">Web site</a>. Break up your page with a horizontal rule or two. </p> <hr> <p>Finally, link to <a href="page2.html">another page</a> in your own Web site.</p> <!-- And add a copyright notice.--> <p>© Wiley Publishing, 2011</p> </body> </html>
Pandoc can convert between many different formats. For markdown, it supports multiple variants, including the GitHub-flavored variant (for GitHub) and PHP Markdown Extra (the variant used by WordPress sites). The safest variant to pick is markdown_strict, which is the original markdown variant.
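To see how the variants differ in practice, the same HTML can be converted once per variant and the results compared. This is a rough sketch rather than a recommended workflow: the small HTML snippet is made up for illustration, and the GitHub variant name is chosen on the assumption that pandoc 2.0 and later call it gfm while older releases call it markdown_github. Nothing is converted when rmarkdown or pandoc is unavailable.

```r
# Made-up HTML snippet containing a heading and a small table
html <- paste0(
  "<h1>Title</h1>",
  "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
)
in_file <- tempfile(fileext = ".html")
writeLines(html, in_file)

have_pandoc <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

# Assumption: "gfm" exists from pandoc 2.0 onwards; older versions use
# "markdown_github" for the GitHub-flavored variant.
github <- if (have_pandoc && rmarkdown::pandoc_available("2.0")) {
  "gfm"
} else {
  "markdown_github"
}
variants <- c("markdown_strict", github, "markdown_phpextra")

# Convert the same file once per variant and collect the output lines
results <- list()
if (have_pandoc) {
  for (v in variants) {
    out_file <- tempfile(fileext = ".md")
    rmarkdown::pandoc_convert(
      normalizePath(in_file),
      from = "html",
      to = v,
      output = out_file
    )
    results[[v]] <- readLines(out_file)
    unlink(out_file)
  }
}
unlink(in_file)

# Print each variant's output under its name
for (v in names(results)) {
  cat("----", v, "----\n")
  cat(results[[v]], sep = "\n")
  cat("\n")
}
```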
Pandoc requires the file path; in my case, the file is located in a different directory from my working directory.
library(rmarkdown)
file_path <- "../../static/files/example.html"
pandoc_convert(file_path, to = "markdown_strict")

Enter the main heading, usually the same as the title.
======================================================

Be **bold** in stating your key points. Put them in a list:

- The first item in your list
- The second item; *italicize* key words

Improve your image by including an image.

![A Great HTML Resource](http://www.mygifs.com/CoverImage.gif)

Add a link to your favorite [Web site](https://www.dummies.com/). Break up your page with a horizontal rule or two.

------------------------------------------------------------------------

Finally, link to [another page](page2.html) in your own Web site.

© Wiley Publishing, 2011
Notice that heading 1 is formatted with an underline of ==== (the setext style) rather than the leading # (the ATX style) that RMarkdown seems to favor. We can tell pandoc to use # during the conversion by adding an option.
pandoc_convert(file_path, to = "markdown_strict", options = c("--atx-headers"))

# Enter the main heading, usually the same as the title.

Be **bold** in stating your key points. Put them in a list:

- The first item in your list
- The second item; *italicize* key words

Improve your image by including an image.

![A Great HTML Resource](http://www.mygifs.com/CoverImage.gif)

Add a link to your favorite [Web site](https://www.dummies.com/). Break up your page with a horizontal rule or two.

------------------------------------------------------------------------

Finally, link to [another page](page2.html) in your own Web site.

© Wiley Publishing, 2011
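The option name depends on the pandoc version. As far as I can tell, newer pandoc releases replaced --atx-headers with --markdown-headings=atx; the cutoff of 2.11.2 used below is my assumption and worth checking against the pandoc changelog. The helper atx_heading_option is hypothetical and only illustrates picking the option from a version number.

```r
# Hypothetical helper: pick the pandoc option that requests ATX (#) headings.
# Assumption: --atx-headers was superseded by --markdown-headings=atx
# starting with pandoc 2.11.2.
atx_heading_option <- function(version) {
  if (as.numeric_version(version) >= "2.11.2") {
    "--markdown-headings=atx"
  } else {
    "--atx-headers"
  }
}

# The version used in this post maps to the older option
atx_heading_option("1.19.2.1")

# With a local pandoc, the option can be chosen from the detected version
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  opt <- atx_heading_option(rmarkdown::pandoc_version())
  in_file <- tempfile(fileext = ".html")
  writeLines("<h1>Enter the main heading</h1>", in_file)
  rmarkdown::pandoc_convert(
    normalizePath(in_file),
    from = "html",
    to = "markdown_strict",
    options = opt
  )
  unlink(in_file)
}
```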
Right now, the output is being piped to the console. A file can be created instead with:
pandoc_convert(file_path, to = "markdown_strict", output = "example.md")
Pandoc has a multitude of styling extensions for markdown variants, all listed on the manual page.
Pandoc ignores everything enclosed in <!-- -->. When converting from markdown to HTML, these comments are usually placed as-is in the HTML document, but the reverse does not seem to be true: the HTML comments are dropped when converting to markdown.
Lastly, this was tested using pandoc version 1.19.2.1. Pandoc 2.5 was released last month.
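To check which pandoc version your own setup uses, rmarkdown provides pandoc_available and pandoc_version. A small sketch, assuming rmarkdown may or may not be installed (the variable name pandoc_info is just for illustration):

```r
# Collect what rmarkdown can tell us about the local pandoc installation
pandoc_info <- list(available = FALSE, version = NULL, at_least_post_version = NA)

if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  pandoc_info$available <- TRUE
  pandoc_info$version <- rmarkdown::pandoc_version()
  # TRUE when the detected pandoc is at least the version used in this post
  pandoc_info$at_least_post_version <- rmarkdown::pandoc_available("1.19.2.1")
}

str(pandoc_info)
```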