Freeing PDF Data to Account for the Unaccounted

hrbrmstr

4 years ago

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.

When I finally jumped on Twitter today, I saw this:

@hrbrmstr Do you have an R rig to convert this large PDF to csv? I tried xpdf and @TabulaPDF but I don't trust the results.

— Matt Stiles (@stiles) July 2, 2018

and went into action and figured I should blog the results as one can never have too many “convert this PDF to usable data” examples.

The Problem

The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US Unaccounted-For” PDF direct link is in the code below) and needed to get the PDF into a usable form and (as you can see if you read through the Twitter thread) both Tabulizer and other tools were introducing sufficient errors that the resultant extracted data was either not complete or trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).

There PDF in question was pretty uniform, save for the first and last pages. Here’s a sample:

Click to view slideshow.

We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.

The Solution

We’ll need some packages and the file itself, so let’s get that bit out of the way first:

library(stringi)
library(pdftools)
library(hrbrthemes)
library(ggpomological)
library(tidyverse)

# grab the PDF text
mia_url <- "http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"
mia_fil <- "~/Data/pmkor_una_all.pdf"
if (!file.exists(mia_fil)) download.file(mia_url, mia_fil)

# read it in
doc <- pdf_text(mia_fil)

Let's look at those three example pages:

cat(doc[[1]])
##                                   Defense POW/MIA Accounting Agency
##                                       Personnel Missing - Korea (PMKOR)
##                                        (Reported for ALL Unaccounted For)
##                                                                                                Total Unaccounted: 7,699
## Name                       Rank/Rate     Branch                           Year State/Territory
## ABBOTT, RICHARD FRANK      M/Sgt         UNITED STATES ARMY               1950 VERMONT
## ABEL, DONALD RAYMOND       Pvt           UNITED STATES ARMY               1950 PENNSYLVANIA
## ...
## AKERS, HERBERT DALE        Cpl           UNITED STATES ARMY               1950 INDIANA
## AKERS, JAMES FRANCIS       Cpl           UNITED STATES MARINE CORPS       1950 VIRGINIA

cat(doc[[2]])
## Name                          Rank/Rate Branch                     Year State/Territory
## AKERS, RICHARD ALLEN          1st Lt    UNITED STATES ARMY         1951 PENNSYLVANIA
## AKI, CLARENCE HALONA          Sgt       UNITED STATES ARMY         1950 HAWAII
...
## AMIDON, DONALD PRENTICE       PFC       UNITED STATES MARINE CORPS 1950 TEXAS
## AMOS, CHARLES GEARL           Cpl       UNITED STATES ARMY         1951 NORTH CAROLINA

cat(doc[[length(doc)]])
## Name                                                Rank/Rate           Branch                                              Year         State/Territory
## ZAVALA, FREDDIE                                     Cpl                 UNITED STATES ARMY                                  1951         CALIFORNIA
## ZAWACKI, FRANK JOHN                                 Sgt                 UNITED STATES ARMY                                  1950         OHIO
## ...
## ZUVER, ROBERT LEONARD                               Pfc                 UNITED STATES ARMY                                  1950         CALIFORNIA
## ZWILLING, LOUIS JOSEPH                              Cpl                 UNITED STATES ARMY                                  1951         ILLINOIS
##                                       This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA).
##                Please visit our web site at http://www.dpaa.mil/Our-Missing/Korean-War-POW-MIA-List/ for updates to this list and other official missing personnel data lists.
## Report Prepared: 06/19/2018 11:25

The poppler library's "layout" mode (which pdftools uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:

there's a uniform header on each page
the "layout" mode returned uniform per-page, fixed-width columns
there's no "special column tweaks" that some folks use to make PDFs more readable by humans

There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:

convert the text to lines
find the header line
find the column start/end positions from the header on the page (since they are different for each page)
reading it in with readr::read_fwf()
remove headers, preamble and epilogue cruft
turn it all into one data frame

# we're going to process each page and read_fwf will complain violently
# when it hits header/footer rows vs data rows and we rly don't need to
# see all those warnings
read_fwf_q <- quietly(read_fwf)

# go through each page
map_df(doc, ~{
  
  stri_split_lines(.x) %>% 
    flatten_chr() -> lines # want the lines from each page
  
  # find the header on the page and get the starting locations for each column
  keep(lines, stri_detect_regex, "^Name") %>% 
    stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>% 
    map(`[`, 1) %>% 
    flatten_int() -> starts
  
  # now get the ending locations; cheating and using `NA` for the last column  
  ends <- c(starts[-1] - 1, NA)

  # since each page has a lovely header and poppler's "layout" mode creates 
  # a surprisingly usable fixed-width table, the core idiom is to find the start/end
  # of each column using the header as a canary
  cols <- fwf_positions(starts, ends, col_names = c("name", "rank", "branch", "year", "state"))

  paste0(lines, collapse="\n") %>%        # turn it into something read_fwf() can read 
    read_fwf_q(col_positions = cols) %>%  # read it!
    .$result %>%                          # need to do this b/c of `quietly()`
    filter(!is.na(name)) %>%              # non-data lines
    filter(name != "Name") %>%            # remove headers from each page
    filter(!stri_detect_regex(name, "^(^This|Please|Report)")) # non-data lines (the last pg footer, rly)
  
}) -> xdf

xdf
## # A tibble: 7,699 x 5
##    name                       rank   branch                  year  state        
##    < chr>                      < chr>  < chr>                   < chr> < chr>        
##  1 ABBOTT, RICHARD FRANK      M/Sgt  UNITED STATES ARMY      1950  VERMONT      
##  2 ABEL, DONALD RAYMOND       Pvt    UNITED STATES ARMY      1950  PENNSYLVANIA 
##  3 ABELE, FRANCIS HOWARD      Sfc    UNITED STATES ARMY      1950  CONNECTICUT  
##  4 ABELES, GEORGE ELLIS       Pvt    UNITED STATES ARMY      1950  CALIFORNIA   
##  5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950  ALABAMA      
##  6 ABREU, MANUEL Jr.          Pfc    UNITED STATES ARMY      1950  MASSACHUSETTS
##  7 ACEVEDO, ISAAC             Sgt    UNITED STATES ARMY      1952  PUERTO RICO  
##  8 ACINELLI, BILL JOSEPH      Pfc    UNITED STATES ARMY      1951  MISSOURI     
##  9 ACKLEY, EDWIN FRANCIS      Pfc    UNITED STATES ARMY      1950  NEW YORK     
## 10 ACKLEY, PHILIP WARREN      Pfc    UNITED STATES ARMY      1950  NEW HAMPSHIRE
## # ... with 7,689 more rows

Now the data is both usable and sobering:

title <- "Defense POW/MIA Accounting Agency Personnel Missing - Korea"
subtitle <- "Reported for ALL Unaccounted For"
caption <-  "Source: http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"

mutate(xdf, year = factor(year)) %>% 
  mutate(branch = stri_trans_totitle(branch)) -> xdf

ordr <- count(xdf, branch, sort=TRUE)

mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>% 
  ggplot(aes(year)) +
  geom_bar(aes(fill = branch), width=0.65) +
  scale_y_comma(name = "# POW/MIA") +
  scale_fill_pomological(name=NULL, ) +
  labs(x = NULL, title = title, subtitle = subtitle) +
  theme_ipsum_rc(grid="Y") +
  theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) +
  theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec"))

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.