Join, split, and compress PDF files with pdftools
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last month we released a new version of pdftools and a new companion package qpdf for working with pdf files in R. This release introduces the ability to perform pdf transformations, such as splitting and combining pages from multiple files. Moreover, the pdf_data()
function which was introduced in pdftools 2.0 is now available on all major systems.
Split and Join PDF files
It is now possible to split, join, and compress pdf files with pdftools. For example the pdf_subset()
function creates a new pdf file with a selection of the pages from the input file:
# Load pdftools library(pdftools) # extract some pages pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf', pages = 1:3, output = "subset.pdf") # Should say 3 pdf_length("subset.pdf")
Similarly pdf_combine()
is used to join several pdf files into one.
# Generate another pdf pdf("test.pdf") plot(mtcars) dev.off() # Combine them with the other one pdf_combine(c("test.pdf", "subset.pdf"), output = "joined.pdf") # Should say 4 pdf_length("joined.pdf")
The split and join features are provided via a new package qpdf which wraps the qpdf C++ library. The main benefit of qpdf is that no external software (such as pdftk) is needed. The qpdf package is entirely self contained and works reliably on all operating systems with zero system dependencies.
Data extraction now available on Linux too
The pdftools 2.0 announcement post from December introduced the new pdf_data()
function for extracting individual text boxes from pdf files. However it was noted that this function was not yet available on most Linux distributions because it requires a recent fix from poppler 0.73.
I am happy to say that this should soon work on all major Linux distributions. Ubuntu has upgraded to poppler 0.74 on Ubuntu Disco which was released this week. I also created a PPA for Ubuntu 16.04 (Xenial) and 18.04 (Bionic) with backports of poppler 0.74. This makes it possible to use pdf_data
on Ubuntu LTS servers, including Travis:
sudo add-apt-repository ppa:opencpu/poppler sudo apt-get update sudo apt-get install libpoppler-cpp-dev
Moreover, the upcoming Fedora 30 will ship with poppler-devel 0.73.
Finally, the upcoming Debian “Buster” release will ship with poppler 0.71, but the Debian maintainers were nice enough to let me backport the required patch from poppler 0.73, so pdf_data()
will work on Debian (and hence CRAN) as well!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.