Site icon R-bloggers

Parsing Mass Municipal PDF CAFRs with Tabulizer, pdftools and AWS Textract – Part 1

[This article was first published on R on Redwall Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
  • # Libraries
    packages <- 
      c("data.table",
        "rlist",
        "stringr",
        "DT",
        "janitor",
        "readxl",
        "xlsx"
        )
    
    if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
      install.packages(setdiff(packages, rownames(installed.packages())))  
    }
    
    invisible(lapply(packages, library, character.only = TRUE))
    
    knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')

    Introduction

    Redwall Analytics had the pleasure of collaborating with Marc Joffe, of Reason Foundation, in its October 2018 post Replicating Yankee Institute Risk Score Over 15 Years for 150 Connecticut towns. This involved taking a well organized public dataset from the State’s website, and analyzing and building an application to view the risk score over time in R. When Marc called to ask if we could report on our blog site about similar analysis for 149 Massachusetts towns, as in Connecticut’s Warning Signs, we jumped at the idea. Naturally, Redwall wanted to replicate the analysis over a longer time period, but this turned out to be more challenging in Massachusetts.

    As Redwall discussed in A Walk Though of Accessing Financial Statements with XBRL in R – Part 1, the SEC has required registered companies to report in XBRL since 2008. Unfortunately, this is not the case for the tens of thousands of municipal borrowers around the US. Connecticut is one of approximately 20 states with a forman municipal monitoring program, so the Office of Policy and Management systematically gathers key financial statement data each year in an annual report and open public databases. Unfortunately, Massachusetts does not have formal monitoring and does not make similar databases available.

    While Massachusetts does maintain the slick-looking Municipal Trends Dashboard, with some key information going back in some cases 20 years, it doesn’t offer the opportunity to download the complete data, and for some reason, excludes a lot of key elements. For example, it includes “Unassigned General Fund Balance”, but not “Assigned General Fund Balance”, so it is impossible to know the complete picture of the Fund Balance of a municipality. There are similar difficulties finding the “Unrestricted Net Position”, which is essential to the Yankee Institute’s analysis. In Connecticut, we have felt a little left behind by Massachusetts’ successful biotech and high-technology industries, and resulting booming real estate markets, so it was a surprise to see how far ahead we are in this regard.

    Project Outlines

    As we worked to replicate Marc’s analysis, the outlines of an interesting project, which might be helpful to others needing to extract data from PDFs, became apparent. Redwall will gather some of the information from the Municipal Trends Dashboard, but then try to extract the missing elements from PDFs containing the cities’ audited financial statements (known as CAFRs), using OCR tools available in R. We will then attempt to use the commercial AWS Textract service for the remaining tough cases using the R paws library. For the purposes of this exercise, Marc’s spreadsheets will represents a kind of “labelled data” for comparison purposes with fields from the extracted PDFs.

    The relevant 149 PDFs for 2018 were downloaded from a library maintained by the Center for Municipal Finance, and Redwall set out on heavy-duty PDF parsing to be described in multiple upcoming posts. In this post, we will show the data required in Table 1 and results in Table 2 of Marc’s analysis, and explore which data fields will be needed to automate the process. (Please note: the data and methodology shown below are Marc’s, displayed here for information purposes. Redwall Analytics does not vouch for the accuracy of the data or methodology, nor make any investment recommendations based on it.)

    After that, we will move on to attempting to extract that data from the PDFs. The first steps of the process to parse these PDFs using two incredible rstat wrappers pdftools and tabulizer. We will also use Amazon’s Textract using the R paws SDK library. While these libraries have been around for a couple of years, we did not find anyone who had put all of these pieces together through the paces like we are about to.

    Because these tools are relatively new, we struggled to find much written on using them to their full individual potentials. We also believe that we have found some interesting methods, making them more powerful by using them in combination. One of the main goals of this series is to begin to repay the debt to all those who shared their wisdom and enabled us to reach the level where we can give back to the community.

    Blogpost Index

    Massachusetts Municipal Vulnerability

    To begin with, we load and show his spreadsheet analysis with his key inputs. Needless to say, it is a big job to gather and extract that much data manually PDF-by-PDF and record in a spreadsheet. In addition, Marc’s approach can only be used at a point in time, and would require a similar amount of effort to repeat for past or future years. If it is possible to reliably automate, even partially, it should help the cause of municipal fiscal transparency. We show the data used in Marc’s analysis in Table 1 below. As with most of our posts, it is possible to scroll over the table columns, adjust the number of rows displayed or search for a particular municipality with the search bar.

  • To leave a comment for the author, please follow the link and comment on their blog: R on Redwall Analytics.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.