Parse pdf files with R (on a Mac)
[This article was first published on Nicebread » R, and kindly contributed to R-bloggers.]
Inspired by this blog post from theBioBucket, I created a script that parses all PDF files in a directory. Because it relies on the Terminal, it is Mac-specific, but modifications for other systems shouldn't be too hard (as a starting point for Windows, see theBioBucket's script).
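The only platform-specific part is how the external conversion tool (pdftotext, introduced below) gets called, so one way to probe portability up front is to ask R whether the binary is visible on the search path at all. This is just a minimal sketch, not part of the original workflow; Sys.which() is base R and behaves the same on macOS, Linux, and Windows:

# check whether the pdftotext binary is visible to R (cross-platform)
pdftotext_path <- Sys.which("pdftotext")
if (pdftotext_path == "") {
  stop("pdftotext was not found on the PATH - please install it first")
}
print(pdftotext_path)  # full path to the binary that will be used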
First, you have to install the command-line tool pdftotext (a binary can be found on Carsten Blüm's website). Then, run the following script within a directory that contains PDFs:
# helper function: get number of words in a string,
# separated by space, tab, newline, comma, or period
nwords <- function(x) {
  res <- strsplit(as.character(x), "[ \t\n,\\.]+")
  res <- lapply(res, length)
  unlist(res)
}

# sanitize file name for terminal usage (i.e., escape spaces and special characters)
sanitize <- function(str) {
  gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE)
}

# get a list of all files in the current directory
fi <- list.files()
fi2 <- fi[grepl(".pdf", fi)]

## Parse files and do something with the content ...
res <- data.frame()  # keeps records of the calculations

for (f in fi2) {
  print(paste("Parsing", f))
  f2 <- sanitize(f)
  system(paste0("pdftotext ", f2), wait = TRUE)

  # read content of converted txt file
  filetxt <- sub(".pdf", ".txt", f)
  text <- readLines(filetxt, warn = FALSE)

  # adjust encoding of text - you have to know it
  Encoding(text) <- "latin1"

  # Do something with the content - here: get word and character count of the file
  text2 <- paste(text, collapse = "\n")  # collapse lines into one long string
  res <- rbind(res, data.frame(
    filename   = f,
    wc         = nwords(text2),
    cs         = nchar(text2),
    cs.nospace = nchar(gsub("\\s", "", text2))
  ))

  # remove converted text file
  file.remove(filetxt)
}

print(res)
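As a side note on the sanitize() helper: instead of escaping the special characters by hand, one could let R quote the file name. The following variation is only a sketch (it uses the fi2 vector from the script above and assumes a Mac/Unix shell); it relies on base R's system2() together with shQuote() and names the output text file explicitly:

# alternative invocation: let shQuote() handle spaces and special characters,
# so the sanitize() helper is not needed (sketch, Mac/Unix shell assumed)
for (f in fi2) {
  filetxt <- sub("\\.pdf$", ".txt", f)
  system2("pdftotext", args = c(shQuote(f), shQuote(filetxt)), wait = TRUE)
}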
… which gives the following result (wc = word count, cs = character count, cs.nospace = character count without spaces):
> print(res)
   filename                                                      wc     cs cs.nospace
1  Applied_Linear_Regression.pdf                               33697 186280     154404
2  Baron-rpsych.pdf                                            22665 128440     105024
3  bootstrapping regressions.pdf                                6309  34042      27694
4  Ch_multidimensional_scaling.pdf                               718   4632       3908
5  corrgram.pdf                                                 6645  40726      33965
6  eRm - Extended Rach Modeling (Paper).pdf                    11354  65273      53578
7  eRm (Folien).pdf                                              371   1407        886
8  Faraway 2002 - Practical Regression and ANOVA using R.pdf   68777 380902     310037
9  Farnsworth-EconometricsInR.pdf                              20482 125207     101157
10 ggplot_book.pdf                                             10681  65388      53551
11 ggplot2-lattice.pdf                                         18067 118591      93737
12 lavaan_usersguide_0.3-1.pdf                                 12608  64232      52962
13 lme4 - Bootstrapping.pdf                                     2065  11739       9515
14 Mclust.pdf                                                  18191  92180      70848
15 multcomp.pdf                                                 5852  38769      32344
16 OpenMxUserGuide.pdf                                         37320 233817     197571
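If you want a quick visual impression of the results, a horizontal barplot of the word counts is a natural next step. This is only an illustrative sketch on top of the res data frame produced above, using base graphics:

# quick overview: word count per PDF, assuming 'res' from the script above
par(mar = c(4, 18, 1, 1))                  # wide left margin for the long file names
barplot(res$wc, names.arg = res$filename,
        horiz = TRUE, las = 1, cex.names = 0.6,
        xlab = "word count (wc)")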