Site icon R-bloggers

Join, split, and compress PDF files with pdftools

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last month we released a new version of pdftools and a new companion package qpdf for working with pdf files in R. This release introduces the ability to perform pdf transformations, such as splitting and combining pages from multiple files. Moreover, the pdf_data() function which was introduced in pdftools 2.0 is now available on all major systems.

Split and Join PDF files

It is now possible to split, join, and compress pdf files with pdftools. For example the pdf_subset() function creates a new pdf file with a selection of the pages from the input file:

# Load pdftools
library(pdftools)

# extract some pages
pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf',
  pages = 1:3, output = "subset.pdf")

# Should say 3
pdf_length("subset.pdf")

Similarly pdf_combine() is used to join several pdf files into one.

# Generate another pdf
pdf("test.pdf")
plot(mtcars)
dev.off()

# Combine them with the other one
pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf")

# Should say 4
pdf_length("joined.pdf")

The split and join features are provided via a new package qpdf which wraps the qpdf C++ library. The main benefit of qpdf is that no external software (such as pdftk) is needed. The qpdf package is entirely self contained and works reliably on all operating systems with zero system dependencies.

Data extraction now available on Linux too

The pdftools 2.0 announcement post from December introduced the new pdf_data() function for extracting individual text boxes from pdf files. However it was noted that this function was not yet available on most Linux distributions because it requires a recent fix from poppler 0.73.

I am happy to say that this should soon work on all major Linux distributions. Ubuntu has upgraded to poppler 0.74 on Ubuntu Disco which was released this week. I also created a PPA for Ubuntu 16.04 (Xenial) and 18.04 (Bionic) with backports of poppler 0.74. This makes it possible to use pdf_data on Ubuntu LTS servers, including Travis:

sudo add-apt-repository ppa:opencpu/poppler
sudo apt-get update
sudo apt-get install libpoppler-cpp-dev

Moreover, the upcoming Fedora 30 will ship with poppler-devel 0.73.

Finally, the upcoming Debian “Buster” release will ship with poppler 0.71, but the Debian maintainers were nice enough to let me backport the required patch from poppler 0.73, so pdf_data() will work on Debian (and hence CRAN) as well!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.