Convert epub to Text for Processing in R

[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]

@RMHoge asked on Twitter how to convert an epub to text for processing in R.

Here’s one way to do that without relying on pandoc (pandoc can easily do this and ships with RStudio, but shelling out for this is cheating 🙂).

We’ll need some help (NOTE that two of these are “GitHub” packages):

library(archive) # devtools::install_github("jimhester/archive"); also needs the libarchive system library
library(hgr) # devtools::install_github("hrbrmstr/hgr")
library(stringi)
library(tidyverse)

We’ll use one of @hadley’s books since it’s O’Reilly and they do epubs well. The archive package lets us treat the epub (which is really just a ZIP file) as a mini-filesystem and embraces “tidy” so we have lovely data frames to work with:

bk_src <- "~/Data/R Packages.epub"

bk <- archive::archive(bk_src)

bk
## # A tibble: 92 x 3
##    path                           size date               
##    <chr>                         <dbl> <dttm>             
##  1 mimetype                        20. 2015-03-24 21:49:16
##  2 OEBPS/assets/cover.png      211616. 2015-06-03 16:16:56
##  3 OEBPS/content.opf            10193. 2015-03-24 21:49:16
##  4 OEBPS/toc.ncx                30037. 2015-03-24 21:49:16
##  5 OEBPS/cover.html               315. 2015-03-24 21:49:16
##  6 OEBPS/titlepage01.html         466. 2015-03-24 21:49:16
##  7 OEBPS/copyright-page01.html   3286. 2015-03-24 21:49:16
##  8 OEBPS/toc01.html             17557. 2015-03-24 21:49:16
##  9 OEBPS/preface01.html         17784. 2015-03-24 21:49:16
## 10 OEBPS/part01.html              444. 2015-03-24 21:49:16
## # ... with 82 more rows
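
Since an epub is just a ZIP file, the same listing can also be sketched with base R alone, for anyone who can't install archive. This is a minimal sketch, not part of the workflow above: the helper names list_epub_entries() and read_epub_entry() are made up for illustration, and the only real calls are utils::unzip() and unz().

```r
# Hypothetical helper: list the entries of an epub (a ZIP archive) with base R.
list_epub_entries <- function(epub_path) {
  entries <- utils::unzip(path.expand(epub_path), list = TRUE)
  data.frame(
    path = entries$Name,
    size = entries$Length,
    date = entries$Date,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: read one entry into a single newline-joined string.
read_epub_entry <- function(epub_path, entry) {
  con <- unz(path.expand(epub_path), entry, encoding = "UTF-8")
  on.exit(close(con))
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

# Usage (not run here, since the path is specific to my machine):
# list_epub_entries("~/Data/R Packages.epub")
# read_epub_entry("~/Data/R Packages.epub", "OEBPS/preface01.html")
```

The result is a plain data frame rather than a tibble, but it has the same path, size and date columns.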

We care not about crufty bits and only want HTML files (NOTE: I use html for the pattern since they can be .xhtml files as well):

filter(bk, stri_detect_fixed(path, "html"))

## # A tibble: 26 x 3
##    path                          size date               
##    <chr>                        <dbl> <dttm>             
##  1 OEBPS/cover.html              315. 2015-03-24 21:49:16
##  2 OEBPS/titlepage01.html        466. 2015-03-24 21:49:16
##  3 OEBPS/copyright-page01.html  3286. 2015-03-24 21:49:16
##  4 OEBPS/toc01.html            17557. 2015-03-24 21:49:16
##  5 OEBPS/preface01.html        17784. 2015-03-24 21:49:16
##  6 OEBPS/part01.html             444. 2015-03-24 21:49:16
##  7 OEBPS/ch01.html             12007. 2015-03-24 21:49:16
##  8 OEBPS/ch02.html             28633. 2015-03-24 21:49:18
##  9 OEBPS/part02.html             454. 2015-03-24 21:49:18
## 10 OEBPS/ch03.html             28629. 2015-03-24 21:49:18
## # ... with 16 more rows
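
Matching the fixed substring "html" works for this book, but it would also keep any path that merely contains those letters. A slightly stricter check, sketched below with base R, matches on the file extension instead. The helper name is_html_entry() and the sample paths are assumptions made up for illustration.

```r
# Hypothetical helper: TRUE for paths ending in .html, .htm, .xhtml or .xhtm.
is_html_entry <- function(path) {
  grepl("\\.x?html?$", path, ignore.case = TRUE)
}

# Made-up sample paths, loosely modeled on the listing above.
paths <- c(
  "mimetype",
  "OEBPS/content.opf",
  "OEBPS/ch01.html",
  "OEBPS/ch02.xhtml",
  "OEBPS/assets/cover.png"
)

html_paths <- paths[is_html_entry(paths)]
print(html_paths)
```

With the tibble from archive, the equivalent would be filter(bk, is_html_entry(path)).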

Let’s read in one file (as a test) and convert it to text and show the first few lines of it:

archive::archive_read(bk, "OEBPS/preface01.html") %>%
  read_lines() %>%
  paste0(collapse = "\n") -> chapter

hgr::clean_text(chapter) %>%
  stri_sub(1, 1000) %>%
  cat()
## Preface
## 
## 
## In This Book
## 
## This book will guide you from being a user of R packages to being a creator of R packages. In , you’ll learn why mastering this skill is so important, and why it’s easier than you think. Next, you’ll learn about the basic structure of a package, and the forms it can take, in . The subsequent chapters go into more detail about each component. They’re roughly organized in order of importance:
## 
## 
##  The most important directory is R/, where your R code lives. A package with just this directory is still a useful package. (And indeed, if you stop reading the book after this chapter, you’ll have still learned some useful new skills.)
##  
##  The DESCRIPTION lets you describe what your package needs to work. If you’re sharing your package, you’ll also use the DESCRIPTION to describe what it does, who can use it (the license), and who to contact if things go wrong.
##  
##  If you want other people (including “future you”!) to understand how to use the functions in your package, you’

hgr::clean_text() uses some XSLT magic to pull text. My jericho package can often do a better job, but it’s rJava-based, so it’s a bit painful for some folks to get running.
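
If neither of those is an option, a rough stand-in can be sketched with the xml2 package from CRAN. To be clear, this is not how hgr::clean_text() works internally; it is a different and cruder technique that just takes the text nodes of the body, and html_to_text() is a made-up name. It assumes xml2 is installed.

```r
library(xml2)

# Hypothetical helper: crude HTML-to-text conversion using xml2.
html_to_text <- function(html) {
  doc <- read_html(html)
  # Drop script and style nodes so their contents don't leak into the text.
  xml_remove(xml_find_all(doc, "//script | //style"))
  txt <- xml_text(xml_find_first(doc, "//body"))
  # Trim the ends and collapse runs of three or more newlines.
  gsub("\n{3,}", "\n\n", trimws(txt))
}

sample_html <- "<html><head><style>p{}</style></head><body><h1>Preface</h1>\n<p>In This Book</p></body></html>"
cat(html_to_text(sample_html))
```

Because xml_text() simply concatenates text nodes, adjacent elements with no whitespace between them run together, so expect to do some cleanup afterwards.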

Now, we’ll convert all the files:

filter(bk, stri_detect_fixed(path, "html")) %>%
  mutate(content = map_chr(path, ~{
    archive::archive_read(bk, .x) %>%
      read_lines() %>%
      paste0(collapse = "\n") %>%
      hgr::clean_text()
  })) %>%
  print(n=27)
## # A tibble: 26 x 4
##    path                          size date                content         
##    <chr>                        <dbl> <dttm>              <chr>           
##  1 OEBPS/cover.html              315. 2015-03-24 21:49:16 Cover           
##  2 OEBPS/titlepage01.html        466. 2015-03-24 21:49:16 "R Packages\n\n…
##  3 OEBPS/copyright-page01.html  3286. 2015-03-24 21:49:16 "R Packages\n\n…
##  4 OEBPS/toc01.html            17557. 2015-03-24 21:49:16 "navPrefaceIn T…
##  5 OEBPS/preface01.html        17784. 2015-03-24 21:49:16 "Preface\n\n\nI…
##  6 OEBPS/part01.html             444. 2015-03-24 21:49:16 Getting Started 
##  7 OEBPS/ch01.html             12007. 2015-03-24 21:49:16 "Introduction\n…
##  8 OEBPS/ch02.html             28633. 2015-03-24 21:49:18 "Package Struct…
##  9 OEBPS/part02.html             454. 2015-03-24 21:49:18 Package Compone…
## 10 OEBPS/ch03.html             28629. 2015-03-24 21:49:18 "R Code\n\nThe …
## 11 OEBPS/ch04.html             31275. 2015-03-24 21:49:18 "Package Metada…
## 12 OEBPS/ch05.html             42089. 2015-03-24 21:49:18 "Object Documen…
## 13 OEBPS/ch06.html             31484. 2015-03-24 21:49:18 "Vignettes: Lon…
## 14 OEBPS/ch07.html             28594. 2015-03-24 21:49:18 "Testing\n\nTes…
## 15 OEBPS/ch08.html             30808. 2015-03-24 21:49:18 "Namespace\n\nT…
## 16 OEBPS/ch09.html             12125. 2015-03-24 21:49:18 "External Data\…
## 17 OEBPS/ch10.html             42013. 2015-03-24 21:49:18 "Compiled Code\…
## 18 OEBPS/ch11.html              8933. 2015-03-24 21:49:18 "Installed File…
## 19 OEBPS/ch12.html              3897. 2015-03-24 21:49:18 "Other Componen…
## 20 OEBPS/part03.html             446. 2015-03-24 21:49:18 Best Practices  
## 21 OEBPS/ch13.html             59493. 2015-03-24 21:49:18 "Git and GitHub…
## 22 OEBPS/ch14.html             44702. 2015-03-24 21:49:18 "Automated Chec…
## 23 OEBPS/ch15.html             39450. 2015-03-24 21:49:18 "Releasing a Pa…
## 24 OEBPS/ix01.html             75277. 2015-03-24 21:49:20 IndexAad hoc te…
## 25 OEBPS/colophon01.html         974. 2015-03-24 21:49:20 "About the Auth…
## 26 OEBPS/colophon02.html        1653. 2015-03-24 21:49:20 "Colophon\n\nTh…
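
From here, the content column can be handed to whatever text-processing you like. If plain text files are what you're after, something like the following base R sketch would write one .txt file per row. The function name write_chapters() and the toy data frame are assumptions for illustration; in practice you would pass in the tibble built above.

```r
# Hypothetical helper: write each row's content to <out_dir>/<basename>.txt.
write_chapters <- function(chapters, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  stems <- tools::file_path_sans_ext(basename(chapters$path))
  out_files <- file.path(out_dir, paste0(stems, ".txt"))
  for (i in seq_along(out_files)) {
    # Write the text as UTF-8 bytes regardless of the session locale.
    writeLines(enc2utf8(chapters$content[i]), out_files[i], useBytes = TRUE)
  }
  invisible(out_files)
}

# Toy stand-in for the converted tibble (made-up content).
chapters <- data.frame(
  path = c("OEBPS/ch01.html", "OEBPS/ch02.html"),
  content = c("Introduction\n\nText", "Package Structure"),
  stringsAsFactors = FALSE
)

out_files <- write_chapters(chapters, file.path(tempdir(), "epub-text"))
```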

I originally wasn’t going to wrap this into a package, since it’s a pretty basic flow that may not require one, but it has since been wrapped into a small package dubbed pubcrawl.

Drop a note in the comments with your hints/workflows on converting epub to plaintext!
