Evaluating Mass Muni CAFR Textract Results – Part 5


# Libraries
packages <- 
  c("data.table",
    "reticulate",
    "paws.machine.learning",
    "paws.common",
    "keyring",
    "pdftools",
    "listviewer",
    "readxl"
    )

if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')

Introduction

In Evaluating Mass Muni CAFR Tabulizer Results – Part 3, we showed how to use pdftools and tabulizer to subset a group of PDFs, the AWS paws SDK package to store the PDFs in S3, and Textract machine learning to extract a block response object using its “asynchronous” process. Subsequently, we discovered an alternate route: save the desired pages as PNGs and send those page-by-page to AWS to get the same result. As noted in the last post, Scraping Failed Tabulizer PDFs with AWS Textract – Part 4, this method has the added advantage of being free to free-tier AWS users.

In this post, we will show how to do this, and also how to parse the response blocks (shown in the listviewer below), which took us a few days to figure out. As mentioned previously, the response blocks are complicated, with one nested list pertaining to text and another to the coordinates on the page. We attempted to write R code to extract the required data, but in the end, decided to modify AWS’s sample code and call it within RStudio using the reticulate package. This had the added benefit of bringing together several pieces we had been working on.

Convert Selected Page to PNG with pdftools

Below, we show the steps to extract the Balance Sheet (page 27) from the Attleboro CAFR as a PNG with the pdftools pdf_convert() function. At first, we used the default dpi setting of 75, but we found that the resolution was too fuzzy and led to frequent OCR errors on particular letters. These were not eliminated, but were significantly reduced, once we switched to a dpi of 200. It might be beneficial to go even higher, because, for example, the “l” was commonly dropped when it occurred at the end of a word (e.g., “Governmenta” instead of “Governmental”).

# Point to particular muni pdf on disc
pdf <- 
  paste0("/Users/davidlucey/Desktop/David/Projects/mass_munis/data/pdf_cafr/attleboro_2018.pdf")
  
# Create png name for muni
png <- 
  paste0("/Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png")
    
# Convert report to png
pdf_convert(
  pdf, 
  format = "png", 
  pages = 27, 
  filenames = png, 
  dpi = 200)
Converting page 27 to /Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png... done!
[1] "/Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png"

Extract Page with AWS Textract Synchronous Operation

We input our AWS credentials and set up a Textract client in the same way as in the previous post. One difference between the synchronous and asynchronous (demonstrated in the last post) operations is that paws sends the request and gets the response with the single analyze_document() function, instead of start_document_analysis() followed by get_document_analysis(). We also did not need to loop, because each page fit within the maximum of 1,000 blocks and was returned immediately without a second function call. Lastly, we used the Bytes argument to reference the PNG on our local machine instead of pointing Textract to an S3 bucket.
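
Since the svc client itself is not shown in this post, here is a minimal sketch of the setup from the previous post, assuming the AWS keys are stored in keyring under the names shown (adjust to your own credential storage and region):

# Set up the Textract client with paws, pulling credentials from keyring
# (the keyring key names are assumptions; use whatever you stored them under)
svc <- paws.machine.learning::textract(
  config = list(
    credentials = list(
      creds = list(
        access_key_id = keyring::key_get("AWS_ACCESS_KEY_ID"),
        secret_access_key = keyring::key_get("AWS_SECRET_ACCESS_KEY")
      )
    ),
    region = "us-east-1"
  )
)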

# Call Textract for particular muni pdf page
response <-
  svc$analyze_document(
    Document = list(
      Bytes = png
      ),
    FeatureTypes = list(
      "TABLES"
      )
    )
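
Depending on the paws version, the Bytes parameter may expect the raw file contents rather than a file path. If the call above fails, a sketch of the raw-vector alternative:

# Read the PNG into a raw vector and pass that as Bytes instead of a path
img_bytes <- readBin(png, what = "raw", n = file.info(png)$size)
response <- svc$analyze_document(
  Document = list(Bytes = img_bytes),
  FeatureTypes = list("TABLES")
  )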

Description and List View of Textract Response Object

In the chunk below, we show the anatomy of the response object in a subset of blocks. The first 10 tabs in the listviewer below have the PAGE (element 0) and some LINE elements (1-10), which are the parents of WORD and CELL. The TABLE block is shown in element 21, with its 168 children. WORD and CELL blocks are shown in elements 11-20 and 22-30, respectively. CELL and WORD blocks have no children, but can be used to find the location or row-column coordinates of a WORD. We spent a lot of time trying to understand the relationships between the different kinds of objects in order to parse them in R, but in the end, it seemed easier just to use AWS’s pre-built Python parser via reticulate.

# Sample of blocks
listviewer::jsonedit(
  response[["Blocks"]][c(1:10, 161:170, 421:430)]
)
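
For a quicker overview than scrolling through the listviewer, the block types can also be tallied directly in R (a sketch; the exact counts will vary by page):

# Tally the BlockType of every block in the response
table(vapply(response[["Blocks"]], function(b) b[["BlockType"]], character(1)))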

Parsing the Response Object with Reticulated Python

An increasing amount is being written about how to use reticulate to run Python code from RStudio, so we won’t go into great detail about the set-up here. It took us a long time to understand how to configure our environment, but as this case shows, being able to switch back and forth to take advantage of the strengths of the two languages is probably worth it. We used Python 3.7 with miniconda after re-installing Anaconda following the recent upgrade to Catalina, in which Apple re-arranged our whole set-up.

# Choose Python 3.7 miniconda
reticulate::use_condaenv(
  condaenv = "r-miniconda", 
  conda = "/Users/davidlucey/opt/anaconda3/bin/conda", 
  required = TRUE
  )

AWS provides Python code to parse the blocks back into tabular form in [Textract Python Table Parser](https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/python/example_code/textract/textract_python_table_parser.py). That script builds in the call to AWS with the Python boto3 client from the command line, but we didn’t need those functions, so we modified it to take the already-returned response straight from our RStudio environment.

A second problem we encountered was that, when the response object was passed from R into Python, blocks with multiple child Ids were converted to lists, while those with a single child were stored as strings. It took some time, but we patched this in the get_text() function below by converting the strings to single-item lists, so it should now work for any R user. We noted where we made modifications with comments below.

def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}
                        
                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows

def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                if isinstance(relationship['Ids'], str):     # Modified here
                    relationship_ids = [relationship['Ids']]
                else: 
                    relationship_ids = relationship['Ids']
                for child_id in relationship_ids:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                        
    return text

def get_table_csv_results(response):
    blocks = response['Blocks'] # Modified here
    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)
            
    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"
        
    csv = ''
    for index, table in enumerate(table_blocks):
        csv += generate_table_csv(table, blocks_map, index +1)
        csv += '\n\n'
    return csv

def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)
    table_id = 'Table_' + str(table_index)
    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)
    for row_index, cols in rows.items():
        for col_index, text in cols.items():
            csv += '{}'.format(text) + "\t"
        csv += '\n'
    csv += '\n\n\n'
    return csv

# Removed main()

We called our Python get_table_csv_results() function from R via reticulate (as py$get_table_csv_results(), since functions defined in an R Markdown python chunk are exposed to R through the py object) and show the raw parsed text below. We will not show the clean-up of the raw text string here, which still involves some complicated regular expressions, although a minimal sketch of one starting point follows the output below. Please refer to our Github repo for the code we used.

# Call Python function above from R
page <- py$get_table_csv_results(response)

# Print text
cat(page)
Table: Table_1

    General Fund    Governmental Funds  Governmental Funds  
Assets              
Cash and investments -- unrestricted    $ 9,663,898     $ 21,470,155    $ 31,134,053    
Cash and investments - restricted (for stabilization                
purposes)   3,172,925   -   3,172,925   
Receivables:                
Property taxes  1,266,214   --  1,266,214   
Motor vehicle excise    879,694     -   879,694     
Tax liens and foreclosures  2,110,587   --  2,110,587   
User charges    557,376     --  557,376     
Intergovernmenta    160,060     2,728,377   2,888,437   
Other   5,584,192   2,184,952   7,769,144   
Due from ARA    616,048     --  616,048     
Total assets    24,010,994  26,383,484  50,394,478  
Deferred outflows of resources              
None    -   --  -   
Total deferred outflows of resources    -   --  -   
Total assets and deferred outflows of resources     $ 24,010,994    $ 26,383,484    $ 50,394,478    
Liabilities                 
Warrants payable    $ 1,850,514     $ 1,745,496     $ 3,596,010     
Accounts payable and accrued expenses   682,383     311,354     993,737     
Retainage payable   -   164,392     164,392     
Due to federal and state governments    -   19,101  19,101  
Notes payable   -   6,499,066   6,499,066   
Total liabilities   2,532,897   8,739,409   11,272,306  
Deferred inflows of resources               
Property taxes paid in advance  62,260  -   62,260  
Deferred property tax revenues  3,809,278   -   3,809,278   
Deferred user fees and fines    6,141,568   --  6,141,568   
Deferred revenue from ARA   616,048         616,048     
Unearned income     -   8,240   8,240   
Deferred grant income   -   253,192     253,192     
Deferred loan income    -   2,124,915   2,124,915   
Total deferred inflows of resources     10,629,154  2,386,347   13,015,501  
Fund balance                
Nonspendable    -   315,478     315,478     
Restricted  -   7,235,505   7,235,505   
Committed   3,172,925   9,500,776   12,673,701  
Assigned    2,100,746   --  2,100,746   
Unassigned  5,575,272   (1,794,031)     3,781,241   
Total fund balance  10,848,943  15,257,728  26,106,671  
Total liabilities, deferred inflows of resources and fund balance   $ 24,010,994    $ 26,383,484    $ 50,394,478    
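
As a minimal sketch of one way to begin the clean-up (not the full regex pipeline we actually used), the tab-delimited text can be split into lines, the “Table:” header and blank lines dropped, and the remainder read into a data.table:

# Drop the "Table:" header and blank lines, then read the tab-delimited
# remainder; fill = TRUE handles ragged rows, and further regex clean-up
# of values like "$ 9,663,898" and "--" would still be needed
lines <- strsplit(page, "\n")[[1]]
lines <- lines[!grepl("^Table:", lines) & nzchar(trimws(lines))]
balance_sheet <- data.table::fread(
  text = lines,
  sep = "\t",
  header = FALSE,
  fill = TRUE
)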

Comparing the Textract Results

As a refresher from Evaluating Mass Muni CAFR Tabulizer Results – Part 3, we started with 149 Massachusetts municipalities, and 5 PDFs couldn’t be read at all with OCR (because of their formatting). Of the remaining CAFRs, we were able to match all elements of 121 perfectly, and 23 had one or more elements which failed to match. Textract didn’t help much with the PDFs which couldn’t be read by OCR, although we didn’t really try those, because our method relied on pdftools using OCR to find the page index.

Figure 1: Textract fixed almost half of wrongly extracted tabulizer elements

For our Textract trial, we started with the 23 municipalities where there were known problems matching manually extracted data, covering 43 elements. Of the 125 tables requested from Textract (for about $2 in total), all but 4 were returned without problems. Of the challenging elements, we successfully extracted 21 from 12 municipalities. Recall that we were able to accurately extract all but about 6% of the cases we wanted with pdftools and tabulizer, so combined with Textract, we got about 97% right. We didn’t spend much time fine-tuning our regex and cleaning of the raw Textract output, so if this were a repeated process, the results could likely be improved further.

Series summary

As a reminder of the tools demonstrated in this five-part series:

  1. Find the location of a table on a page with regex matching and pdftools (Tabulizer and pdftools Together as Super-powers)

  2. Extract a particular page from a PDF with pdftools (Tabulizer and pdftools Together as Super-powers)

  3. Aggregate multiple PDF pages together (Tabulizer and pdftools Together as Super-powers)

  4. Use pdftools to find exact table parameters and accurately extract a large number of slightly varying tables by combining pdftools and tabulizer (Tabulizer and pdftools Together as Super-powers)

  5. Connect and interact with AWS S3 with the paws SDK (Scraping Failed Tabulizer PDFs with AWS Textract)

  6. Send requests to AWS Textract from S3 and use the bulk asynchronous operation with the paws SDK (Scraping Failed Tabulizer PDFs with AWS Textract)

  7. Convert a PDF page to PNG and send and receive single-page requests to AWS Textract (this post)

  8. Study the anatomy of the AWS JSON response object and use the Python parser from R with reticulate to convert it back into tabular form (this post)

Conclusion

It has been over 10 years since the SEC mandated that company financial statements be reported in XBRL. Marc Joffe has advocated for similar requirements for municipal reporting, a subject which becomes all the more critical at a time when many governments are likely to need emergency funding. After our exploration of almost 150 PDFs, we believe it would be possible for auditors to come close to the efficacy of XBRL with a few mandatory guidelines for PDF formatting.

Although there were 73 problem elements, only a few patterns were common in the large majority of cases:

  • Scanned images of printed documents
  • Tables which ran over onto a second page without repeating the column headers
  • Tables which included a second section on the same page with large indentations
  • Lack of uniformity of header and footer formatting and naming
  • Lack of uniformity in line-item names (e.g., “Total Net Position” vs. “Net Position”)

We now have a dataset which could be used to greatly improve the machine readability of municipal financial statements with a small number of prescribed formatting guidelines.
