Evaluating Mass Muni CAFR Textract Results – Part 5
# Libraries
packages <- c(
  "data.table",
  "reticulate",
  "paws.machine.learning",
  "paws.common",
  "keyring",
  "pdftools",
  "listviewer",
  "readxl"
)

if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(
  comment = NA,
  fig.width = 12,
  fig.height = 8,
  out.width = '100%'
)
Introduction
In Evaluating Mass Muni CAFR Tabulizer Results – Part 3, we showed how to use pdftools and tabulizer to subset a group of PDFs, the AWS paws SDK package to store the PDFs in s3, and Textract machine learning to extract a block response object using its “asynchronous” process. Subsequently, we discovered an alternate route: save the desired pages as PNGs and send those page-by-page to AWS to get the same result. As noted in the last post, Scraping Failed Tabulizer PDFs with AWS Textract – Part 4, this method has the added advantage of being free to free-tier AWS users.
In this post, we will show how to do this, and also how to parse the response blocks (shown in the listviewer below), which took us a few days to figure out. As mentioned previously, the response blocks are complicated, with one nested list pertaining to text and another to the coordinates on the page. We attempted to write R code to extract the required data, but in the end, decided to modify the AWS code, and call it within RStudio using the reticulate package. This had the added benefit of bringing together several pieces we had been working on.
Convert Selected Page to PNG with pdftools
Below, we show the steps to extract the Balance Sheet (page 27) from the Attleboro CAFR as a PNG with the pdftools pdf_convert() function. At first, we used the default dpi setting of 75, but the resolution was too fuzzy and led to frequent OCR errors on particular letters. These were not eliminated, but were significantly reduced, once we switched to a dpi of 200. It seems like it might be beneficial to go even higher, because, for example, the “l” was commonly dropped when it occurred at the end of a word (e.g., “Governmenta” instead of “Governmental”).
# Point to particular muni pdf on disc
pdf <- paste0("/Users/davidlucey/Desktop/David/Projects/mass_munis/data/pdf_cafr/attleboro_2018.pdf")

# Create png name for muni
png <- paste0("/Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png")

# Convert report to png
pdf_convert(
  pdf,
  format = "png",
  pages = 27,
  filenames = png,
  dpi = 200
)

Converting page 27 to /Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png... done!
[1] "/Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png"
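Since resolution turned out to matter so much, it can be worth a quick check of the rendered image size before paying for a Textract call. Below is a minimal sketch of one way to do that; it assumes the png package is installed, which is not among the libraries loaded above.

# Hedged sketch: confirm the rendered PNG dimensions (assumes the png package)
img <- png::readPNG(png)
dim(img)  # rows (height) x columns (width) x color channels; a higher dpi yields more pixels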
Extract Page with AWS Textract Synchronous Operation
We input our AWS credentials and set up a Textract response object in the same way as in the previous post. One difference between synchronous and asynchronous (demonstrated in the last post) is that paws sends the request and gets the response with just the analyze_document() function, instead of start_document_analysis() followed by a second call to get_document_analysis(). We also did not need to loop, because each page fit within the maximum of 1,000 blocks and was returned immediately without a second function call. In addition, we used the Bytes argument to reference the PNG from our local environment, instead of pointing Textract to an s3 bucket.
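We won't repeat the full credential set-up here, but a minimal sketch of the client creation might look like the following; the keyring key names and region are assumptions for illustration, not the exact code from the previous post.

# Hedged sketch: create a Textract client with paws, pulling credentials
# from keyring (key names and region are hypothetical)
svc <- paws.machine.learning::textract(
  config = list(
    credentials = list(
      creds = list(
        access_key_id = keyring::key_get("AWS_ACCESS_KEY_ID"),
        secret_access_key = keyring::key_get("AWS_SECRET_ACCESS_KEY")
      )
    ),
    region = "us-east-1"
  )
)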
# Call Textract for particular muni pdf page
response <- svc$analyze_document(
  Document = list(
    Bytes = png
  ),
  FeatureTypes = list(
    "TABLES"
  )
)
Description and List View of Textract Response Object
In the chunk below, we show the anatomy of the response object in a subset of blocks. The first 10 tabs in the listviewer below have the PAGE (element 0) and some LINE elements (1-10), which are the parents of WORD and CELL. The TABLE block is shown in element 21, with its 168 children. WORD and CELL blocks are shown in lines 11-20 and 22-30, respectively. CELL and WORD blocks have no children, but can be used to find the location or row-column coordinates of a WORD. We spent a lot of time trying to understand the relationships between the different kinds of objects in order to parse them out in R, but in the end, it seemed easier just to use AWS’s pre-built Python parser via reticulate.
# Sample of blocks
listviewer::jsonedit(
  response[["Blocks"]][c(1:10, 161:170, 421:430)]
)
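To give a flavor of what that R traversal looks like, here is a minimal sketch that indexes WORD blocks by Id and reassembles the text of one CELL from its child Ids. The field names follow the Textract block schema, but this is an illustration, not the parser we ultimately used; note that empty cells may have no Relationships at all.

# Hedged sketch: look up WORD text by block Id, then rebuild one CELL's text
blocks <- response$Blocks
word_blocks <- Filter(function(b) b$BlockType == "WORD", blocks)
word_text <- setNames(
  sapply(word_blocks, function(b) b$Text),
  sapply(word_blocks, function(b) b$Id)
)
# Take the first CELL block and collect its CHILD Ids (assumes it has children)
cell <- Filter(function(b) b$BlockType == "CELL", blocks)[[1]]
child_ids <- unlist(
  lapply(cell$Relationships, function(r) if (r$Type == "CHILD") r$Ids)
)
paste(word_text[child_ids], collapse = " ")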
Parsing the Response Object with Reticulated Python
An increasing amount is being written on how to use reticulate to run Python code from RStudio, so we won’t go into great detail about the set-up here. It took us a long time to configure our environment, but as this case shows, being able to switch back and forth to take advantage of the strengths of the two languages is probably worth the effort. We used Python 3.7 with miniconda, after re-installing Anaconda following the recent upgrade to Catalina, in which Apple re-arranged our whole set-up.
# Choose Python 3.7 miniconda
reticulate::use_condaenv(
  condaenv = "r-miniconda",
  conda = "/Users/davidlucey/opt/anaconda3/bin/conda",
  required = TRUE
)
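Before sourcing any Python, it can save debugging time to confirm that reticulate is actually bound to the intended interpreter; this check is our suggestion rather than a step from the original workflow.

# Verify which Python reticulate is bound to
reticulate::py_config()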
AWS provides the Python code to parse the blocks back into tabular form in its Textract Python Table Parser (https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/python/example_code/textract/textract_python_table_parser.py). That script builds in the call to AWS with the Python boto3 client from the command line, but we didn’t need those functions, so we modified it to take the already-returned response object straight from our RStudio environment.
A second problem we encountered was that, when the response object was passed from R to Python, blocks with multiple child Ids were converted to lists, while those with a single child were stored as strings. It took some time, but we patched this in the get_text() function below by converting the strings to single-item lists, so it should now work for any R user. We noted where we made modifications with comments below.
import webbrowser, os
import json
import io
from io import BytesIO
import sys


def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}
                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                if isinstance(relationship['Ids'], str):
                    # Modified here
                    relationship_ids = [relationship['Ids']]
                else:
                    relationship_ids = relationship['Ids']
                for child_id in relationship_ids:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
    return text


def get_table_csv_results(response):
    blocks = response['Blocks']  # Modified here
    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)

    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"

    csv = ''
    for index, table in enumerate(table_blocks):
        csv += generate_table_csv(table, blocks_map, index + 1)
        csv += '\n\n'
    return csv


def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)
    table_id = 'Table_' + str(table_index)

    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)
    for row_index, cols in rows.items():
        for col_index, text in cols.items():
            csv += '{}'.format(text) + "\t"
        csv += '\n'
    csv += '\n\n\n'
    return csv

# Removed main()
We called our Python get_table_csv_results() function from reticulate (as py$get_table_csv_results()) and show the raw parsed text below. We will not show the clean-up of the raw text string here, but this still involves some complicated regular expressions. Please refer to our Github repo for the code we used.
# Call Python function above from R
page <- py$get_table_csv_results(response)

# Print text
cat(page)

Table: Table_1

General Fund	Governmental Funds	Governmental Funds
Assets
Cash and investments -- unrestricted	$ 9,663,898	$ 21,470,155	$ 31,134,053
Cash and investments - restricted (for stabilization purposes)	3,172,925	-	3,172,925
Receivables:
Property taxes	1,266,214	--	1,266,214
Motor vehicle excise	879,694	-	879,694
Tax liens and foreclosures	2,110,587	--	2,110,587
User charges	557,376	--	557,376
Intergovernmenta	160,060	2,728,377	2,888,437
Other	5,584,192	2,184,952	7,769,144
Due from ARA	616,048	--	616,048
Total assets	24,010,994	26,383,484	50,394,478
Deferred outflows of resources
None	-	--	-
Total deferred outflows of resources	-	--	-
Total assets and deferred outflows of resources	$ 24,010,994	$ 26,383,484	$ 50,394,478
Liabilities
Warrants payable	$ 1,850,514	$ 1,745,496	$ 3,596,010
Accounts payable and accrued expenses	682,383	311,354	993,737
Retainage payable	-	164,392	164,392
Due to federal and state governments	-	19,101	19,101
Notes payable	-	6,499,066	6,499,066
Total liabilities	2,532,897	8,739,409	11,272,306
Deferred inflows of resources
Property taxes paid in advance	62,260	-	62,260
Deferred property tax revenues	3,809,278	-	3,809,278
Deferred user fees and fines	6,141,568	--	6,141,568
Deferred revenue from ARA	616,048	616,048
Unearned income	-	8,240	8,240
Deferred grant income	-	253,192	253,192
Deferred loan income	-	2,124,915	2,124,915
Total deferred inflows of resources	10,629,154	2,386,347	13,015,501
Fund balance
Nonspendable	-	315,478	315,478
Restricted	-	7,235,505	7,235,505
Committed	3,172,925	9,500,776	12,673,701
Assigned	2,100,746	--	2,100,746
Unassigned	5,575,272	(1,794,031)	3,781,241
Total fund balance	10,848,943	15,257,728	26,106,671
Total liabilities, deferred inflows of resources and fund balance	$ 24,010,994	$ 26,383,484	$ 50,394,478
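As an illustration of where that clean-up starts, here is a hedged sketch (not the exact code from our repo) that splits the tab-delimited string returned above into a data.table, padding ragged rows with NA. The column names are hypothetical, chosen to match this particular balance sheet.

# Hedged sketch: split the raw Textract text into rows and columns
library(data.table)
lines <- strsplit(page, "\n")[[1]]
lines <- lines[nzchar(lines) & !grepl("^Table:", lines)]  # drop blanks and table labels
cells <- lapply(strsplit(lines, "\t"), `[`, 1:4)  # pad short rows to 4 columns with NA
dt <- as.data.table(do.call(rbind, cells))
setnames(dt, c("line_item", "general_fund", "governmental_funds", "total"))  # hypothetical names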
Comparing the Textract Results
As a refresher from Evaluating Mass Muni CAFR Tabulizer Results – Part 3, we started with 149 Massachusetts municipalities, and 5 PDFs couldn’t be read at all with OCR (because of their formatting). Of the remaining CAFRs, we were able to match all elements of 121 perfectly, and 23 had one or more elements which failed to match. Textract didn’t really help with the PDFs which couldn’t be read by OCR, although we didn’t really try those, because our method relied on pdftools using OCR to get the page index.
For our Textract trial, we started with the 23 municipalities where there were known problems matching manually extracted data for 43 elements. Of the 125 tables requested from Textract (for about $2), all but 4 were returned without problems. Of the challenging elements, we successfully extracted 21 from 12 municipalities. Recall that we were able to accurately extract all but about 6% of the cases we wanted with pdftools and tabulizer, so combined with Textract, we got about 97% right. We didn’t spend much time fine-tuning our regex and cleaning of the raw Textract output, so if this were going to be a repeated process, it could likely be improved further.
Series Summary
As a reminder, the tools demonstrated in this 5-part series:

- Find the location of a table on a page with regex matching and pdftools (Tabulizer and pdftools Together as Super-powers)
- Extract a particular page from a pdf with pdftools (Tabulizer and pdftools Together as Super-powers)
- Aggregate multiple PDF pages together (Tabulizer and pdftools Together as Super-powers)
- Use pdftools to find exact table parameters and accurately extract a large number of slightly varying tables, combining pdftools and tabulizer (Tabulizer and pdftools Together as Super-powers)
- Connect and interact with AWS s3 with the paws SDK (Scraping Failed Tabulizer PDFs with AWS Textract)
- Send requests to AWS Textract from s3 and use the bulk asynchronous operation with the paws SDK (Scraping Failed Tabulizer PDFs with AWS Textract)
- Convert a PDF page to PNG, and send and receive single-page requests to AWS Textract [This post]
- Study the anatomy of the AWS JSON response object, and use the Python parser from R with reticulate to convert it back into tabular form [This post]
Conclusion
It has been over 10 years since the SEC mandated that company financial statements be reported in XBRL. Marc Joffe has advocated for similar requirements for municipal reporting, a subject which becomes all the more critical at a time when many governments are likely to need emergency funding. After our exploration of almost 150 PDFs, we believe it would be possible for auditors to come close to the efficacy of XBRL with a few mandatory guidelines for PDF formatting.
Although there were 73 problem elements, only a few patterns were common in the large majority of cases:
- Scanned images of printed documents
- Tables which ran over onto a second page without repeating the column headers
- Tables which included a second section on the same page with large indentations
- Lack of uniformity of header and footer formatting and naming
- Lack of uniformity in line-item names (e.g., “Total Net Position” vs. “Net Position”)
We now have a dataset which could be used to greatly improve the machine readability of municipal financial statements with a small number of prescribed formatting guidelines.