Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
# Libraries packages <- c("data.table", "rlist", "stringr", "pdftools", "readxl" ) if (length(setdiff(packages,rownames(installed.packages()))) > 0) { install.packages(setdiff(packages, rownames(installed.packages()))) } invisible(lapply(packages, library, character.only = TRUE)) knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')
Introduction
This post is a continuation Tabulizer and pdftools Together as Super-powers – Part 2 where we showed how combining pdftools
and tabulizer
together could lead to better, more scaleable data extraction on a large number of slightly varying pdfs. Although the full process used to extract data from all our 149 PDFs will not shown in this series, to review the steps that Redwall followed for the portion using pdftools
and tabulizer
:
- Load all the pdf_data (metadata) for every page of the 148 PDFs into a list of lists with the metadata data.frame from pdftool’s pdf_data.
- Filter the nested list to keep only the key tables (ie: Statement of Activities, Balance Sheet, Statement of Net Position) using regular expression matching of sub-elements.
- Find the coordinates for the key Tabulizer area parameters using regular expressions and the pdf_data metadata.
- Extract those tables with Tabulizer into data.frames.
- Clean up the raw output especially column headers.
- Extract out key fields (ie: Assigned Net Balance, Unassigned Net Balance and Total Expenditures).
- Compare to manually extracted data from Marc Joffe.
Please see our Github for the code. In this post, we will evaluate how we did with with this first method.
# Load comparison of scraped vs Reason spreadsheet mass_compare <- readxl::read_excel("~/Desktop/David/Projects/pdf_cafr_parse/intermediate_data/mass_compare.xls", col_types = c("skip", "text", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric")) mass_compare <- setDT(mass_compare)
Tabulizer Results Compared to Manual Spreadsheet
In the code below, we show the PDFs where all fields did not match the manually-compiled spreadsheet. It is encouraging that there are only 5 (Boston didn’t match only because its CAFR numbers were rounded to the nearest thousand). We also discovered that our CAFR library didn’t have a 2018 CAFR in some cases, so that was the reason some didn’t return any scraped results. In addition, some CAFR’s (Clinton and Montague) are released as scanned images of printed documents, which don’t work with pdftools
, so we couldn’t get page indices to send to tabulizer
.
# Filter diff != 0 (cases where there wasn't a match) diff <- names(mass_compare)[str_detect(names(mass_compare), "diff_")] # All missing all_missing <- mass_compare[, problem := as.logical(apply(.SD, 1, function(x) all(x != 0))), .SDcols = diff][ ][problem == 1]$muni all_missing [1] "boston" "clinton" "montague" "new_bedford" "sandwich" [6] "weston"
The next table shows cases where one or more variable was not being extracted properly (difference not equal to zero). Of the 720 potential items in the spreadsheets which were successfully parsed, 43 did not match. 94% accuracy seems pretty good at first glance, but an analyst probably couldn’t rely on this without some manual back-up checking. Another possible objective would be to know which formats would have a higher likelihood of failure, and just check those. For example, some tables run over onto a second page, and others have sections with larger indentations. In the next step, we will slice the problematic tables out of the relevant PDF, and run these through textract
to see what happens.
Conclusion
The combination of pdftools
and tabulizer
gave encouraging results, and we are confident that further fine tuning could further reduce the error rate. We also have some ideas about which tables fail most often, so the risk of an unexpected mismatch might be mitigated by manual intervention or getting a back up from Textract
in these cases. In the next post Scraping Failed Tabulizer PDFs with AWS Textract – Part 4, we use the AWS paws
SDK to save an aggregated PDF in S3
and use their machine-learning based Textract
solution to extract the tables to see how their solution does for the difficult cases.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.