AWS textract-trp package issue - cannot extract key-value pair

I'm running the code example below in AWS Lambda on Python 3.8:

import boto3
from trp import Document

# Document
documentName = "employmentapp.png"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
with open(documentName, "rb") as document:
    response = textract.analyze_document(
        Document={
            'Bytes': document.read(),
        },
        FeatureTypes=["FORMS"])

#print(response)

doc = Document(response)

for page in doc.pages:
    # Print fields
    print("Fields:")
    for field in page.form.fields:
        print("Key: {}, Value: {}".format(field.key, field.value))

    # Get field by key
    print("\nGet Field by Key:")
    key = "Phone Number:"
    field = page.form.getFieldByKey(key)
    if field:
        print("Key: {}, Value: {}".format(field.key, field.value))

    # Search fields by key
    print("\nSearch Fields:")
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Key: {}, Value: {}".format(field.key, field.value))

I'm getting this error:

Traceback (most recent call last):
  File "/Users/shimon_zouzout/CloudZoneRepos/Projects/CloudZone/cloudzoneprod_lambdas/billing/BILLING_invoices-email-ocr/tests/queries.py", line 30, in <module>
    doc = Document(response)
  File "/Users/shimon_zouzout/Library/Python/3.9/lib/python/site-packages/trp/__init__.py", line 633, in __init__
    self._parse()
  File "/Users/shimon_zouzout/Library/Python/3.9/lib/python/site-packages/trp/__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
  File "/Users/shimon_zouzout/Library/Python/3.9/lib/python/site-packages/trp/__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "/Users/shimon_zouzout/Library/Python/3.9/lib/python/site-packages/trp/__init__.py", line 530, in _parse
    l = Line(item, blockMap)
  File "/Users/shimon_zouzout/Library/Python/3.9/lib/python/site-packages/trp/__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
KeyError: '73d47382-4f5a-4423-9665-124380736c2a'
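The last frame shows trp looking up a child ID from a LINE block's relationships in its block map and not finding it. A minimal sketch for checking whether the response handed to Document is missing referenced blocks, assuming the usual Textract layout of a "Blocks" list whose entries carry "Id", "BlockType" and optional "Relationships"; the helper name find_missing_child_ids and the sample data are hypothetical, and this only confirms the symptom, not why the block is absent:

```python
def find_missing_child_ids(response):
    """Return (parent_id, parent_type, child_id) for every CHILD reference
    that points at a block not present in response["Blocks"]."""
    blocks = response.get("Blocks", [])
    known_ids = {block["Id"] for block in blocks}
    missing = []
    for block in blocks:
        # "Relationships" may be absent (or None) on leaf blocks such as WORD.
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") != "CHILD":
                continue
            for child_id in relationship.get("Ids", []):
                if child_id not in known_ids:
                    missing.append((block["Id"], block["BlockType"], child_id))
    return missing


# Hand-made sample (not real Textract output): the LINE block refers to
# "word-2", which does not appear in "Blocks".
sample_response = {
    "Blocks": [
        {"Id": "page-1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
        {"Id": "line-1", "BlockType": "LINE",
         "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
        {"Id": "word-1", "BlockType": "WORD", "Text": "Phone"},
    ]
}

for parent_id, parent_type, child_id in find_missing_child_ids(sample_response):
    print("{} block {} references missing child {}".format(parent_type, parent_id, child_id))
```

If this reports any entries for the real response, the dict passed to Document(response) does not contain every block it references.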

Can someone please help? I want to extract key-value pairs from PDF invoices without resorting to hand-written regular expressions.

  • I've added the textract-trp package using an AWS Lambda layer.
  • The same error occurs when I run this code locally.
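For comparison, key-value pairs can also be read straight from the raw blocks without trp. The following is a rough sketch only, assuming the FORMS output uses KEY_VALUE_SET blocks whose "EntityTypes" contain "KEY" or "VALUE", with a "VALUE" relationship from key to value and "CHILD" relationships to WORD blocks; the helper names and sample data are hypothetical, and missing child IDs are skipped rather than raising KeyError:

```python
def get_text(block, block_map):
    """Join the text of a block's WORD children, skipping unknown IDs."""
    parts = []
    for relationship in block.get("Relationships") or []:
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = block_map.get(child_id)
            if child is None:
                continue  # referenced block is not in this response
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(response):
    """Return a dict mapping key text to value text for KEY_VALUE_SET blocks."""
    block_map = {block["Id"]: block for block in response.get("Blocks", [])}
    pairs = {}
    for block in block_map.values():
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for relationship in block.get("Relationships") or []:
            if relationship["Type"] == "VALUE":
                value_text = " ".join(
                    get_text(block_map[value_id], block_map)
                    for value_id in relationship["Ids"]
                    if value_id in block_map
                )
        pairs[get_text(block, block_map)] = value_text
    return pairs


# Hand-made sample (not real Textract output); "w-missing" is deliberately absent.
kv_response = {
    "Blocks": [
        {"Id": "key-1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["value-1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "value-1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w-missing"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Phone"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "555-0100"},
    ]
}

print(extract_key_values(kv_response))
```

Skipping unknown IDs hides the symptom rather than fixing it, so any value assembled this way may be incomplete if the response really is missing blocks.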

Thanks in advance!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow