This week we released version 1.0 of the rOpenSci pdftools package to CRAN. pdftools provides utilities for extracting text, fonts, attachments and other data from PDF files. It also supports rendering of PDF files into bitmap images.
This release has a few internal enhancements and fixes an annoying bug for landscape PDF pages. The version bump to 1.0 signifies that the package has undergone sufficient testing and the API is stable.
Extracting Text
As described in our previous post, the most common use of pdftools is extracting text from (scientific) articles for searching or indexing. But let's try a somewhat more unusual PDF file this time: a poster.
    library(pdftools)
    url <- "https://www.rstudio.com/wp-content/uploads/2016/02/advancedR.pdf"

    # Display author, editor
    pdf_info(url)
The pdf_info function returns all kinds of metadata from the PDF file. For example, we can read that this PDF was created on 2016-02-12 by Arianne Colton using Acrobat PDFMaker 11 for PowerPoint.
    # Extract text vector
    text <- pdf_text(url)

    # Print text from page 1
    cat(text[1])
The pdf_text function extracts text into an R character vector of length equal to the number of pages in the PDF.
Note how the text is spaced to match the position in the PDF page.
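Because the result is an ordinary character vector, base R string functions are enough to work with it. The sketch below uses a made-up two-page vector named pages as a stand-in for the output of pdf_text (the content is hypothetical, not the actual poster text) to show splitting pages into lines, recovering columns from the layout spacing, and searching for a keyword.

```r
# Hypothetical stand-in for pdf_text() output: one string per page,
# with lines separated by "\n" and columns separated by runs of spaces.
pages <- c(
  "Advanced R\nEnvironments      Data Structures\nSearch path       Vectors",
  "Functions         Subsetting\nClosures          Operators"
)

# Split each page into lines: a list with one character vector per page.
page_lines <- strsplit(pages, "\n", fixed = TRUE)

# Treat two or more consecutive spaces as a column separator.
columns <- strsplit(page_lines[[1]][2], " {2,}")[[1]]

# Indices of the pages that mention a keyword, ignoring case.
hits <- grep("closures", pages, ignore.case = TRUE)

print(lengths(page_lines))  # number of lines on each page
print(columns)
print(hits)
```

Splitting on runs of spaces is only a heuristic: it works when columns are clearly separated in the layout and will fail where column text itself contains wide gaps.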
Rendering PDF
Recent versions of pdftools allow rendering of PDF pages into bitmap images. The pdf_render_page function returns the bitmap as a raw array with dimensions channels * width * height (in pixels).
    library(pdftools)
    bitmap <- pdf_render_page(url, page = 1, dpi = 72)
    dim(bitmap)
    ## 4 1100 850
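The pixel dimensions follow from the page size and the dpi argument: PDF page sizes are measured in points of 1/72 inch, so at 72 dpi one point maps to one pixel. The helper below, bitmap_dims, is a hypothetical illustration and not part of pdftools. It assumes a page of 1100 x 850 points (inferred from the dimensions shown above) and simple rounding, so the real renderer may differ by a pixel.

```r
# Hypothetical helper (not part of pdftools): estimate the dimensions of a
# rendered bitmap from the page size in points and the requested dpi.
bitmap_dims <- function(width_pt, height_pt, dpi = 72, channels = 4) {
  scale <- dpi / 72  # a PDF point is 1/72 inch
  c(
    channels = channels,
    width    = round(width_pt * scale),
    height   = round(height_pt * scale)
  )
}

# Assumed poster size in points, at the default 72 dpi and at double resolution.
print(bitmap_dims(1100, 850))
print(bitmap_dims(1100, 850, dpi = 144))

# One raw byte per channel per pixel, so this is the array size in bytes.
print(prod(bitmap_dims(1100, 850, dpi = 144)))
```

Doubling the dpi quadruples the number of pixels, which is worth keeping in mind before rendering many pages at high resolution.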
From here we can use, for example, the rOpenSci magick package to read the bitmap and manipulate it or export it to various formats.
    library(magick)
    poster <- image_read(bitmap)
    print(poster)
    image_write(poster, "out.png", format = "png")
Or have some fun with the other magick tools 🙂
    # Download dancing banana
    banana <- image_read("https://jeroenooms.github.io/images/banana.gif")
    banana <- image_scale(banana, "300")

    # Combine and flatten frames
    frames <- lapply(banana, function(frame) {
      image_composite(poster, frame, offset = "+70+30")
    })

    # Turn frames into animation
    animation <- image_animate(image_join(frames))
    print(animation)

    # Save as gif
    image_write(animation, "output.gif")