Understanding the Contract Documents
In our case, we needed to extract text from a specific table nested somewhere inside each contract. The cells in that table would effectively alternate, making for nice key/value pairs (e.g. “Contract start date” and “January 1, 2020”). However, various cells were row-spanning or column-spanning.
Divide and Conquer
Our starting point was understanding which tools were available, as well as their respective capabilities and limitations. While Amazon Textract was our tool of choice for OCR, we found that Amazon’s tools struggled to organize the text into table cells accurately, especially when the layout involved cells spanning rows and columns. We needed a tool to map the cells and thus delineate where one block of text starts and stops. Since no such tool was available on the market, we developed one in-house: the “Form Extractor.”
Source PDFs
The source documents were multi-page contracts. Unfortunately, the contract layout was not homogeneous, because formatting and legal language had changed over time. Our first step was to break the PDFs into separate pages and generate an image for each page.
Amazon Textract
Our second step was to extract the text from each image with Amazon Textract. If an image was found to contain the required table, we stored the page. All other pages were ignored.
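A minimal sketch of this filtering step follows. It assumes Textract's `detect_document_text` call through boto3. It also assumes the table page can be recognized by a marker phrase, here the “Contract start date” label from the example above. The helper names are hypothetical and chosen for illustration; they are not the project's actual code.

```python
def ocr_page(image_bytes):
    """Send one page image to Amazon Textract and return its raw response."""
    import boto3  # imported lazily so the helpers below work without AWS access

    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": image_bytes})


def page_lines(response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def has_required_table(response, markers=("contract start date",)):
    """Return True if any marker phrase appears in the page text.

    The marker phrases are an assumption; a real pipeline would use
    whatever labels reliably identify the target table.
    """
    text = " ".join(page_lines(response)).lower()
    return any(marker in text for marker in markers)


def keep_table_pages(responses):
    """Given {page_number: textract_response}, keep only pages with the table."""
    return {n: r for n, r in responses.items() if has_required_table(r)}
```

Separating the OCR call from the filtering logic means the page-selection rule can be tuned and tested against saved Textract responses without calling the service again.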
Six Feet Up’s “Form Extractor”
The stored pages, along with the text output provided by Amazon Textract, were then run through our custom-built “Form Extractor.” First, the Form Extractor applied color correction and other types of image repair. Then it looked for lines and constructed what it believed was the layout of the table. Cells were then clustered into forms, and the forms were checked for fiducial values to validate which form had been detected. Finally, a repair process made inferences from the surrounding cells to ensure the form definition was complete. For example, a cell along the bottom row might have borders too faded for the Form Extractor to detect. If the Form Extractor identified cells on two or more sides of an otherwise unidentified region, it could infer that a cell existed and account for it.
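The repair rule can be illustrated with a simplified sketch. It assumes the table has been reduced to a regular grid of detected and undetected positions, which ignores row- and column-spanning cells. The function name and the grid representation are hypothetical, not the Form Extractor's real data model.

```python
def infer_missing_cells(detected, min_sides=2):
    """Return grid positions that were not detected but are probably cells.

    `detected` is a list of rows of booleans: True where a cell was found.
    A gap is inferred to be a cell when at least `min_sides` of its four
    neighbours (up, down, left, right) were detected.
    """
    rows = len(detected)
    cols = len(detected[0]) if rows else 0
    inferred = []
    for r in range(rows):
        for c in range(cols):
            if detected[r][c]:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            sides = sum(
                1
                for nr, nc in neighbours
                if 0 <= nr < rows and 0 <= nc < cols and detected[nr][nc]
            )
            if sides >= min_sides:
                inferred.append((r, c))
    return inferred


# Bottom-middle cell has faded borders and was missed by line detection.
grid = [
    [True, True, True],
    [True, True, True],
    [True, False, True],
]
print(infer_missing_cells(grid))  # [(2, 1)]
```

The check runs in a single pass over the original detections, so an inferred cell never serves as evidence for another inferred cell.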
Final Output
The program ultimately returned a JSON object with the recognized text placed into key/value pairs for further analysis.
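As a rough sketch of this last step, the alternating cells can be paired off and serialized. It assumes the cell texts arrive in reading order as key, value, key, value. The function name is hypothetical, and the sample input reuses the example from the start of this article.

```python
import json


def cells_to_json(cell_texts):
    """Pair alternating cell texts (key, value, key, value, ...) into JSON."""
    cleaned = [text.strip() for text in cell_texts]
    # zip over even- and odd-indexed cells; a trailing unpaired key is dropped.
    pairs = dict(zip(cleaned[0::2], cleaned[1::2]))
    return json.dumps(pairs, indent=2)


cells = ["Contract start date", "January 1, 2020"]
print(cells_to_json(cells))
```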