Freeing PDF Data to Account for the Unaccounted
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.
When I finally jumped on Twitter today, I saw this:
@hrbrmstr Do you have an R rig to convert this large PDF to csv? I tried xpdf and @TabulaPDF but I don't trust the results.
— Matt Stiles (@stiles) July 2, 2018
and went into action and figured I should blog the results as one can never have too many “convert this PDF to usable data” examples.
The Problem
The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US Unaccounted-For” PDF direct link is in the code below) and needed to get the PDF into a usable form and (as you can see if you read through the Twitter thread) both Tabulizer and other tools were introducing sufficient errors that the resultant extracted data was either not complete or trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).
There PDF in question was pretty uniform, save for the first and last pages. Here’s a sample:
Click to view slideshow.We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.
The Solution
We’ll need some packages and the file itself, so let’s get that bit out of the way first:
library(stringi) library(pdftools) library(hrbrthemes) library(ggpomological) library(tidyverse) # grab the PDF text mia_url <- "http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf" mia_fil <- "~/Data/pmkor_una_all.pdf" if (!file.exists(mia_fil)) download.file(mia_url, mia_fil) # read it in doc <- pdf_text(mia_fil)
Let's look at those three example pages:
cat(doc[[1]]) ## Defense POW/MIA Accounting Agency ## Personnel Missing - Korea (PMKOR) ## (Reported for ALL Unaccounted For) ## Total Unaccounted: 7,699 ## Name Rank/Rate Branch Year State/Territory ## ABBOTT, RICHARD FRANK M/Sgt UNITED STATES ARMY 1950 VERMONT ## ABEL, DONALD RAYMOND Pvt UNITED STATES ARMY 1950 PENNSYLVANIA ## ... ## AKERS, HERBERT DALE Cpl UNITED STATES ARMY 1950 INDIANA ## AKERS, JAMES FRANCIS Cpl UNITED STATES MARINE CORPS 1950 VIRGINIA cat(doc[[2]]) ## Name Rank/Rate Branch Year State/Territory ## AKERS, RICHARD ALLEN 1st Lt UNITED STATES ARMY 1951 PENNSYLVANIA ## AKI, CLARENCE HALONA Sgt UNITED STATES ARMY 1950 HAWAII ... ## AMIDON, DONALD PRENTICE PFC UNITED STATES MARINE CORPS 1950 TEXAS ## AMOS, CHARLES GEARL Cpl UNITED STATES ARMY 1951 NORTH CAROLINA cat(doc[[length(doc)]]) ## Name Rank/Rate Branch Year State/Territory ## ZAVALA, FREDDIE Cpl UNITED STATES ARMY 1951 CALIFORNIA ## ZAWACKI, FRANK JOHN Sgt UNITED STATES ARMY 1950 OHIO ## ... ## ZUVER, ROBERT LEONARD Pfc UNITED STATES ARMY 1950 CALIFORNIA ## ZWILLING, LOUIS JOSEPH Cpl UNITED STATES ARMY 1951 ILLINOIS ## This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA). ## Please visit our web site at http://www.dpaa.mil/Our-Missing/Korean-War-POW-MIA-List/ for updates to this list and other official missing personnel data lists. ## Report Prepared: 06/19/2018 11:25
The poppler
library's "layout" mode (which pdftools
uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:
- there's a uniform header on each page
- the "layout" mode returned uniform per-page, fixed-width columns
- there's no "special column tweaks" that some folks use to make PDFs more readable by humans
There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:
- convert the text to lines
- find the header line
- find the column start/end positions from the header on the page (since they are different for each page)
- reading it in with
readr::read_fwf()
- remove headers, preamble and epilogue cruft
- turn it all into one data frame
# we're going to process each page and read_fwf will complain violently # when it hits header/footer rows vs data rows and we rly don't need to # see all those warnings read_fwf_q <- quietly(read_fwf) # go through each page map_df(doc, ~{ stri_split_lines(.x) %>% flatten_chr() -> lines # want the lines from each page # find the header on the page and get the starting locations for each column keep(lines, stri_detect_regex, "^Name") %>% stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>% map(`[`, 1) %>% flatten_int() -> starts # now get the ending locations; cheating and using `NA` for the last column ends <- c(starts[-1] - 1, NA) # since each page has a lovely header and poppler's "layout" mode creates # a surprisingly usable fixed-width table, the core idiom is to find the start/end # of each column using the header as a canary cols <- fwf_positions(starts, ends, col_names = c("name", "rank", "branch", "year", "state")) paste0(lines, collapse="\n") %>% # turn it into something read_fwf() can read read_fwf_q(col_positions = cols) %>% # read it! .$result %>% # need to do this b/c of `quietly()` filter(!is.na(name)) %>% # non-data lines filter(name != "Name") %>% # remove headers from each page filter(!stri_detect_regex(name, "^(^This|Please|Report)")) # non-data lines (the last pg footer, rly) }) -> xdf xdf ## # A tibble: 7,699 x 5 ## name rank branch year state ## <chr> <chr> <chr> <chr> <chr> ## 1 ABBOTT, RICHARD FRANK M/Sgt UNITED STATES ARMY 1950 VERMONT ## 2 ABEL, DONALD RAYMOND Pvt UNITED STATES ARMY 1950 PENNSYLVANIA ## 3 ABELE, FRANCIS HOWARD Sfc UNITED STATES ARMY 1950 CONNECTICUT ## 4 ABELES, GEORGE ELLIS Pvt UNITED STATES ARMY 1950 CALIFORNIA ## 5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950 ALABAMA ## 6 ABREU, MANUEL Jr. Pfc UNITED STATES ARMY 1950 MASSACHUSETTS ## 7 ACEVEDO, ISAAC Sgt UNITED STATES ARMY 1952 PUERTO RICO ## 8 ACINELLI, BILL JOSEPH Pfc UNITED STATES ARMY 1951 MISSOURI ## 9 ACKLEY, EDWIN FRANCIS Pfc UNITED STATES ARMY 1950 NEW YORK ## 10 ACKLEY, PHILIP WARREN Pfc UNITED STATES ARMY 1950 NEW HAMPSHIRE ## # ... with 7,689 more rows
Now the data is both usable and sobering:
title <- "Defense POW/MIA Accounting Agency Personnel Missing - Korea" subtitle <- "Reported for ALL Unaccounted For" caption <- "Source: http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf" mutate(xdf, year = factor(year)) %>% mutate(branch = stri_trans_totitle(branch)) -> xdf ordr <- count(xdf, branch, sort=TRUE) mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>% ggplot(aes(year)) + geom_bar(aes(fill = branch), width=0.65) + scale_y_comma(name = "# POW/MIA") + scale_fill_pomological(name=NULL, ) + labs(x = NULL, title = title, subtitle = subtitle) + theme_ipsum_rc(grid="Y") + theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) + theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec"))
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.