```r
# Libraries
packages <- 
  c("data.table",
    "stringr",
    "rlist",
    "tabulizer",
    "pdftools",
    "parallel",
    "DT"
    )

if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(
  comment = NA,
  fig.width = 12,
  fig.height = 8,
  out.width = '100%'
)
```
Introduction
This post is a continuation of Parsing of Mass Municipal PDF CAFR’s with Tabulizer, pdftools and AWS Textract – Part 1, dealing with extracting data from PDFs using R. When Redwall discovered pdftools, and its pdf_data() function, which maps out every word on a pdf page by x-y coordinate, we thought that was interesting, but didn’t really know how to use it. We also didn’t have the regular expression skills, and were much more befuddled by nested list structures than we are now.
As for tabulizer, it took about a year before rJava magically started working properly, and even then it wasn’t possible to consistently read a large number of tables of different sizes without cutting off fields in unexpected ways. Only in this Massachusetts pdf scraping project have we realized that, by combining these two packages, it becomes possible to access data reliably from a large number of varied pdf formats.
Our Massachusetts municipal CAFR project provided a perfect opportunity to put all these pieces together. This blog post will be a step-by-step walk-through which will hopefully help others avoid some of the pain we experienced in getting to this point.
Walk Through Plan
To begin with, we had to download the pdfs from the CAFR Library at the Center for Municipal Finance. We won’t show the code to do the download here, but it can be found at reason_pdf_parser.R. In order to do this on the scale that we plan for this project, we had to build nested lists with the pdf metadata of 150 Massachusetts CAFR pdfs. For now, we will just walk through a few key points using a single statement from the Abington, MA 2018 Statement of Net Position, from the CAFR downloaded here.
```r
# Set up pdf and pdf_path to directory
dir <- "/Users/davidlucey/Desktop/David/Projects/mass_munis/data/pdf_cafr/"
city <- "abington"
pdf <- paste0(city, "_2018.pdf", collapse = "")
pdf_path <- paste0(dir, pdf, collapse = "")

# Run pdf_data on Abington CAFR
abington <- pdftools::pdf_data(pdf_path)

# Name each page of list for page index in pdf
names(abington) <- 1:length(abington)

# Look at structure of 2nd element in 92-page nested list
str(abington[[2]])
```

```
Classes 'tbl_df', 'tbl' and 'data.frame':	266 obs. of 6 variables:
 $ width : int  42 19 76 124 58 20 96 19 41 73 ...
 $ height: int  15 15 15 15 15 15 15 15 15 15 ...
 $ x     : int  168 215 238 319 92 154 179 279 302 348 ...
 $ y     : int  72 72 72 72 102 102 102 102 102 102 ...
 $ space : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
 $ text  : chr  "TOWN" "OF" "ABINGTON," "MASSACHUSETTS" ...
```
PDF Tools pdf_data Functionality
The above is a list of data.frames containing metadata on the location of every word on every one of the 92 pages of the Abington 2018 CAFR. The structure of the second page is shown above. But we only need the key financial statements, so we would like to drop the majority of pages which don’t have what we need. For example, we know that page 16 has the Statement of Net Position. We can search for that page using regular expressions, as shown for the variable sonp below.
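Before that, it can help to see what this word-level metadata represents by reassembling it into readable lines. Here is a minimal sketch (our illustration, not code from the original project): group the words on a page by their y coordinate, order them left-to-right by x, and paste them back together.

```r
# Illustration only: reassemble pdf_data word coordinates into text lines
library(data.table)

# Take the second page of the Abington CAFR from the chunk above
page <- as.data.table(abington[[2]])

# Words sharing a y coordinate form a line; order left-to-right by x
page_lines <- page[order(y, x), .(line = paste(text, collapse = " ")), by = y]

# First few reconstructed lines of the page
head(page_lines$line)
```

This by-y grouping is the same trick the page search below uses to match phrases in the page headers.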
See how we extract only the first five unique lines of each page by y, paste those lines back together, then match our regular expression on the text of those lines, looking for the phrase “STATEMENT OF NET POSITION”. However, several pages meet this criterion, such as the Proprietary Funds Statement of Net Position on page 20 (which we don’t want). Most of these other pages can be eliminated by choosing NOT to match the word “FUNDS” by the same process, so notice that we negate our second match with “!”. Hence, our sonp_index comes back as 16, which can be used to filter out the remaining pages.
```r
# Convert elements to data.table
abington <- mclapply(abington, setDT)

# Get index of Abington Statement of Net Position
sonp_index <- 
  which(
    unlist(
      mclapply(abington, function(page) {
        # Reformat top 5 lines by y and look for match to
        # "STATEMENT OF NET POSITION"
        str_detect(
          paste(
            page$text[page$y %in% head(unique(page$y), 5)],
            collapse = " "
          ),
          "STATEMENT OF NET POSITION"
        ) &
          # "&" requires both statements to be TRUE: reformat top 5 lines
          # by y and look for non-match to "FUNDS"
          !str_detect(
            paste(
              page$text[page$y %in% head(unique(page$y), 5)],
              collapse = " "
            ),
            "FUNDS"
          )
      })
    )
  )

# Extract and view Statement of Net Position pdftools pdf_data metadata
sonp <- abington[sonp_index][[1]]
sonp
```

```
     width height   x   y space          text
  1:    32      6 264  73  TRUE     STATEMENT
  2:     7      6 299  73  TRUE            OF
  3:    10      6 308  73  TRUE           NET
  4:    25      6 320  73 FALSE      POSITION
  5:    14      6 287  87  TRUE          JUNE
 ---                                         
337:    71      9 144 745 FALSE Massachusetts
338:    11     11 300 743 FALSE            13
339:    26      9 426 745  TRUE         Basic
340:    43      9 456 745  TRUE     Financial
341:    53      9 502 745 FALSE    Statements
```
Above is the text grid of Abington’s Statement of Net Position as taken by pdftools. Using this metadata, we can begin to put together exact area parameters for Tabula. Even better, we can do it programmatically for a large number of tables. In our experience, this is important because tabulizer’s default lattice method for tabular data can be unpredictable, cutting off fields unexpectedly.
Tabulizer Area Coordinates
Tabulizer specifies pages in a grid of 72 units per inch, so a typical vertical 8.5 x 11 page has dimensions of 612 x 792. This coordinate grid is used to specify the area parameter (top, left, bottom and right). All of Massachusetts’ financial statement tables have a "$" sign in the first and last rows, so those can be used to locate the top or bottom parameters. In addition, all pages containing financial statements have language referring users to the “notes to the financial statements”, usually on the second-to-last line, which could serve as the bottom, or the midpoint between that line and the last "$" (if more room were needed).
```r
# Determine if page is vertical or horizontal
x <- 8.5 * 72
y <- 11 * 72
max_x <- max(sonp$x)
max_y <- max(sonp$y)
orientation <- ifelse(x < max_x, "horizontal", "vertical")

# TOP
# Keys on the first instance of the year "2018"
table_top <- min(sonp$y[str_detect(sonp$text, "2018") & sonp$space == FALSE])

# Find the height in the table_top row
height_top <- unique(sonp$height[sonp$y == table_top])

# Add table_top and height_top to avoid slicing row
top <- table_top + height_top

# BOTTOM
# Table bottom marked by last instance of character "$"
table_bottom <- max(sonp$y[str_detect(sonp$text, "\\$")])

# Height at bottom row of table
height_bottom <- unique(sonp$height[sonp$y == table_bottom])

# Bottom of table
bottom <- table_bottom + height_bottom

# LEFT
# Add some space to leftmost x coordinate to avoid slicing
left <- ifelse(min(sonp$x) - 30 > 0, min(sonp$x) - 30, 1)

# RIGHT
# Find width at maximum "x" coordinate
width_max_x <- max(sonp$width[sonp$x == max_x])

# Add width at maximum "x" plus more space whether vertical or horizontal
right <- max_x + width_max_x + ifelse(orientation == "vertical", 30, 50)

# FINAL AREA PARAMETER FOR TABULIZER AS INTEGER VECTOR
a <- c(top, left, bottom, right)

# Show coordinates
a
```

```
[1]  93  24 681 585
```
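The chunk above anchors the bottom of the table on the last "$" row. For completeness, here is a minimal sketch (our illustration, not the project’s code) of the alternative mentioned earlier: anchoring on the “notes to the financial statements” line, or on the midpoint between it and the last "$" if more room were needed.

```r
# Illustration only: alternative bottom anchor using the
# "notes to the financial statements" line

# Rebuild each text line by y, then look for the phrase case-insensitively
line_text <- sonp[order(y, x), .(line = paste(text, collapse = " ")), by = y]
notes_y <- 
  line_text[str_detect(tolower(line), "notes to the financial statements"), y]

# Last "$" row, as in the chunk above
dollar_y <- max(sonp$y[str_detect(sonp$text, "\\$")])

# Midpoint between the last "$" and the notes line, if found;
# otherwise fall back to the "$" anchor used above
bottom_alt <- 
  if (length(notes_y) > 0) {
    floor((dollar_y + min(notes_y)) / 2)
  } else {
    dollar_y + max(sonp$height[sonp$y == dollar_y])
  }
```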
We give an example for Abington’s Statement of Net Position above, starting with the maximum x and y, and determining the page orientation (i.e. vertical or horizontal). After finding the location of the date line at the top, we walk down a little from there to set the top variable; typically, it is best to leave a little margin between the page header and the top of the table. The bottom of the table is set by adding the height of the bottom line of the table to its y coordinate, and the left parameter is set by taking the smallest x coordinate and reducing it a little for margin of error. We leave a larger margin for the right-most coordinate because this is where we have found the most errors occur, often when the algorithm seems to try to squish the table into the available columns.
In our experience, the most problems come from mis-setting the top and right parameters. Indentation can also confuse the algorithm, and a single column can be split down the middle into two, often at the far right. In the end, we chose parameters of 93 (top), 24 (left), 681 (bottom) and 585 (right).
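As an aside, for a one-off table where scale doesn’t matter, tabulizer can also capture these coordinates interactively instead of programmatically; a minimal sketch:

```r
# Interactive alternative (not suited to our ~150-pdf scale): click and
# drag over the table in the viewer; returns a list of top/left/bottom/right
# coordinates that can be passed straight to extract_tables()
a_manual <- locate_areas(pdf_path, pages = sonp_index)
```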
Tabulizer extract_tables() Function
Below, we run the area parameters derived above through tabulizer. Note that the area parameter, itself an integer vector, must be further wrapped in a list, because not having this structure throws an error. In addition, avoid the half day of wheel-spinning we experienced by specifying guess = F to override the default lattice method; otherwise your area parameter is ignored with no warning. Also, we use the sonp_index integer to specify the page of the pdf. There are several options for output, which all work as expected, but data.frame seems most natural.
```r
# Tabulizer extract_tables output is a list
abington_sonp <- 
  extract_tables(
    pdf_path,
    pages = sonp_index,
    area = list(a),
    guess = F,
    output = "data.frame"
  )

# Extract and print single element from list
abington_sonp <- abington_sonp[[1]]
abington_sonp
```

```
   X
1  
2  
3  ASSETS
4  CURRENT:
5  Cash and cash equivalents................................................ $
6  Receivables, net of allowance for uncollectibles:
7  Real estate and personal property taxes...........................
8  Tax liens.....................................................................
9  Community preservation fund surtax.................................
10 Motor vehicle and other excise taxes.................................
11 User charges...............................................................
12 Departmental and other................................................
13 Intergovernmental.........................................................
14 Community preservation state share.................................
15 Special assessments......................................................
16 Tax foreclosures...............................................................
17 Total current assets...................................................
18 NONCURRENT:
19 Receivables, net of allowance for uncollectibles:
20 Special assessments......................................................
21 Capital assets, nondepreciable.............................................
22 Capital assets, net of accumulated depreciation.....................
23 Total noncurrent assets.............................................
24 TOTAL ASSETS..................................................................
25 
26 DEFERRED OUTFLOWS OF RESOURCES
27 Deferred outflows related to pensions....................................
28 Deferred outflows related to other postemployment benefits......
29 TOTAL DEFERRED OUTFLOWS OF RESOURCES..................
30 LIABILITIES
31 CURRENT:
32 Warrants payable............................................................
33 Accrued payroll...............................................................
34 Health claims payable.........................................................
35 Accrued interest...............................................................
36 Accrued liabilities...............................................................
37 Capital lease obligations...................................................
38 Landfill closure..................................................................
39 Compensated absences......................................................
40 Notes payable..................................................................
41 Bonds payable..................................................................
42 Total current liabilities................................................
43 NONCURRENT:
44 Landfill closure..................................................................
45 Compensated absences......................................................
46 Net pension liability............................................................
47 Net other postemployment benefits liability..............................
48 Bonds payable..................................................................
49 Total noncurrent liabilities..........................................
50 TOTAL LIABILITIES...............................................................
51 DEFERRED INFLOWS OF RESOURCES
52 Deferred inflows related to pensions....................................
53 NET POSITION
54 Net investment in capital assets...............................................
55 Restricted for:
56 Permanent funds:
57 Expendable................................................................
58 Nonexpendable............................................................
59 Gifts and grants...............................................................
60 Community preservation...................................................
61 Unrestricted...........................................................................
62 TOTAL NET POSITION......................................................... $

   X.1          X.2 Primary.Government   X.3   X.4
1   NA Governmental Business-type
2   NA Activities Activities Total
3   NA
4   NA
5   NA 10,392,587 $ 7,449,193 $ 17,841,780
6   NA
7   NA 313,316 - 313,316
8   NA 882,182 34,598 916,780
9   NA 6 ,245 - 6,245
10  NA 387,455 - 387,455
11  NA - 1,930,158 1,930,158
12  NA - 149,296 149,296
13  NA 1,715,882 - 1,715,882
14  NA 70,735 - 70,735
15  NA - 32,137 32,137
16  NA 663,449 - 663,449
17  NA 14,431,851 9,595,382 24,027,233
18  NA
19  NA
20  NA - 10,712 10,712
21  NA 101,526,106 1,614,044 103,140,150
22  NA 26,998,272 31,944,596 58,942,868
23  NA 128,524,378 33,569,352 162,093,730
24  NA 142,956,229 43,164,734 186,120,963
25  NA
26  NA
27  NA 417,711 7,062 424,773
28  NA 1,982,740 19,982 2,002,722
29  NA 2,400,451 27,044 2,427,495
30  NA
31  NA
32  NA 660,037 403,389 1,063,426
33  NA 206,897 75,704 282,601
34  NA 311,064 - 311,064
35  NA 278,817 71,281 350,098
36  NA 50,638 2,247,039 2,297,677
37  NA - 53,845 53,845
38  NA 139,000 - 139,000
39  NA 346,271 26,896 373,167
40  NA 53,168 - 53,168
41  NA 2,462,040 491,136 2,953,176
42  NA 4,507,932 3,369,290 7,877,222
43  NA
44  NA 4,080,000 - 4,080,000
45  NA 881,952 20,919 902,871
46  NA 19,188,882 324,438 19,513,320
47  NA 67,618,712 681,460 68,300,172
48  NA 45,761,763 5,764,229 51,525,992
49  NA 137,531,309 6,791,046 144,322,355
50  NA 142,039,241 10,160,336 152,199,577
51  NA
52  NA 1,399,576 23,663 1,423,239
53  NA
54  NA 82,168,482 27,395,220 109,563,702
55  NA
56  NA
57  NA 99,189 - 99,189
58  NA 69,778 - 69,778
59  NA 1,088,568 - 1,088,568
60  NA 740,211 - 740,211
61  NA (82,248,365) 5,612,559 (76,635,806)
62  NA 1,917,863 $ 33,007,779 $ 34,925,642
```
Clean up
The tabulizer output is still in raw form, with columns sometimes determined by indentation and x values, such as the "$" signs. The numbers are in character form, with commas, and negative numbers are sometimes shown in parentheses, so they need to be parsed into numeric. The item names often have a long series of trailing periods which need to be stripped. The biggest challenge is the column names, which often include only the first row of a multi-row header and need to be rebuilt. This is not a small task and not what we were hoping to illustrate in this post, so we are just showing the output below. Please refer to our Github code for a more complete explanation and solutions to many of these issues.
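To give a flavor of what that clean-up involves, here is a minimal sketch (our simplified illustration; the full solution is in the Github code linked above) of two typical steps: stripping the trailing dot leaders from item names, and parsing accounting-style number strings into numeric.

```r
# Illustration only: simplified versions of two clean-up steps
library(stringr)

# Strip runs of trailing periods (and anything after) from item names
clean_label <- function(x) {
  str_trim(str_remove(x, "\\.{2,}.*$"))
}

# Parse "1,234", "(1,234)" and "-" style strings into numeric
parse_amount <- function(x) {
  x <- str_trim(str_remove_all(x, "[,$]"))
  x[x %in% c("-", "")] <- NA
  negative <- str_detect(x, "^\\(.*\\)$")
  x <- str_remove_all(x, "[()]")
  out <- as.numeric(x)
  out[which(negative)] <- -out[which(negative)]
  out
}

clean_label("Cash and cash equivalents............................ $")
# [1] "Cash and cash equivalents"

parse_amount(c("10,392,587", "(82,248,365)", "-"))
# [1]  10392587 -82248365        NA
```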
Final Product
Though there is still work to be done, the final product of this post is shown above. Single elements could be extracted to form a database, or the output could be saved to csv. The headers, such as ASSETS or LIABILITIES, could be nested. The main point is that, short of XBRL, the data has been set free from the PDF in machine-readable form. Not only that, this general process can be repeated for a large number of slightly differing PDFs with a relatively low error rate, as we will show in the next post, Evaluating Mass Muni CAFR Tabulizer Results – Part 3. In cases where errors do occur, a second layer can be used to run the more challenging PDFs through the AWS Textract SDK. We will show how this is done in our next post.