Computer Vision and Image Processing

Implementation Details

Understanding the Contract Documents

In our case, we needed to extract text from a specific table nested somewhere inside each contract. The cells in that table would effectively alternate, making for nice key/value pairs (e.g. “Contract start date” and “January 1, 2020”). However, various cells were row-spanning or column-spanning.

Divide and Conquer

Our starting point was understanding which tools were available, as well as their respective capabilities and limitations. While Amazon Textract was our tool of choice for OCR, we found that Amazon’s tools struggled to accurately organize the text into table cells, especially when the layout involved cells spanning rows and columns. We needed a tool to map the cells and thus delineate where one block of text starts and stops. In the absence of such a tool on the market, we developed it in house, resulting in a “Form Extractor.”

Source PDFs

The source documents were multi-page contracts. Unfortunately, the contract layout was not homogenous due to formatting and legal language changes over time. Our first step was breaking the PDFs into separate pages in order to generate an image for each page.

Amazon Textract

Our second step was to extract the text from each image. If an image was found to have the required table, we stored the page. All other pages were ignored.

Six Feet Up’s “Form Extractor”

The stored pages, along with the text output provided by Amazon Textract, were then run through our custom-built “Form Extractor.” First, our Form Extractor made adjustments for color correction and various other types of image repair. Then, the Form Extractor looked for lines, and constructed what it believed was the layout of the table. Cells were then clustered into forms, and the forms checked for fiducial values to validate which form had been detected. Finally, a repair process made inferences from the surrounding cells to ensure the form definition was complete. For example, perhaps there was a cell along the bottom row with borders that were too faded for Form Extractor to detect. If Form Extractor identified cells on 2+ sides of an otherwise unidentified region, it could intuit that a cell existed and account for it.

Final Output

The program ultimately returned a JSON object with the text recognition placed into key/value pairs for further analysis.

Computer Vision and Image Processing

Areas of Expertise

Industries

Technology Used

CHALLENGE

Implementation Details

Results

ARE YOU READY TO START YOUR NEXT PROJECT?

Contact Us

HEAR FROM US