Creating an OCR pipeline with AWS Textract

June 21, 2024

Recently, my wife was cleaning out her desk and found a bunch of notes she had taken over the years. She hates clutter but hates technology even more, so I was pretty excited when she asked if there was a way I could turn those into digital text and get rid of the analog paper.

Since my wife is one of three people left in the world who write in cursive, I considered this an impossible task. But it was also a chance to build something in AWS, so I couldn't pass it up.

In this post I'll show you what I built, how I built it, and some of the considerations and trade-offs I made through the process. If you prefer watching a video, I've got you covered here.

Also, the code repository for this is available on GitHub.

Requirements

She has numerous notes scattered across different sizes of note cards, post-its, and sheets of paper. She wants them converted into text so she can add them to her note-taking software. She doesn't need them automatically added to the software; she can copy/paste without a problem.

Solution

AWS has Textract as an OCR engine, so I took pictures of some of her notes, fed them into Textract using the AWS CLI, and got back pretty decent results. Now I just needed a pipeline to process images through Textract, because there is no way I'm teaching her the CLI.

I also wasn't going to build some fancy UI because:

I didn't want to spend the time.
I kinda hate web development now. I know there are other ways to build UIs, but I wasn't going to do that either.

She likes email, so I made email the interface, even though I hate email too (the longer I use technology, the more I kinda hate it all).

Thinking about the process and weighing the options, I ended up landing on this basic architecture.

Let's walk through it real quick.

SES receives the incoming email and a receipt action drops it into an S3 bucket under a raw/ prefix.
An EventBridge rule triggers a Lambda function called the EmailParser.
The EmailParser reads the email out of the bucket, parses the metadata, extracts the 'from' address and the attachments, and then writes all the image attachments back to the bucket under an images/ prefix.
The ImageScanner is notified by another EventBridge rule of the new image in the images/ prefix.
Textract does its OCR magic and responds to the ImageScanner with the text from the image. This happens synchronously.
Finally, the ImageScanner sends an email via SES back to the original sender with the original image and the text extracted from it.
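To make the routing in steps 2 and 4 concrete, here is a rough sketch of what the two EventBridge event patterns might look like. This is not the code from the repository: the bucket name and the helper names (objectCreatedPattern, matchesKey) are placeholders I made up for illustration, and matchesKey is only a local stand-in for EventBridge's own prefix matching.

```typescript
// Hypothetical sketch of the event patterns behind the two EventBridge rules.
// Note: the bucket must have EventBridge notifications enabled for S3 to emit these events.
const BUCKET_NAME = 'my-ocr-bucket'; // placeholder, not the real bucket name

interface ObjectCreatedPattern {
  source: string[];
  'detail-type': string[];
  detail: {
    bucket: { name: string[] };
    object: { key: { prefix: string }[] };
  };
}

// Build a pattern that matches S3 "Object Created" events under one key prefix.
function objectCreatedPattern(bucket: string, prefix: string): ObjectCreatedPattern {
  return {
    source: ['aws.s3'],
    'detail-type': ['Object Created'],
    detail: {
      bucket: { name: [bucket] },
      object: { key: [{ prefix: prefix }] },
    },
  };
}

// Step 2: raw emails trigger the EmailParser.
const emailParserPattern = objectCreatedPattern(BUCKET_NAME, 'raw/');
// Step 4: extracted images trigger the ImageScanner.
const imageScannerPattern = objectCreatedPattern(BUCKET_NAME, 'images/');

// Local approximation of prefix matching, useful only for reasoning about routing.
function matchesKey(pattern: ObjectCreatedPattern, bucket: string, key: string): boolean {
  const bucketMatches = pattern.detail.bucket.name.indexOf(bucket) !== -1;
  const keyMatches = pattern.detail.object.key.some((k) => key.indexOf(k.prefix) === 0);
  return bucketMatches && keyMatches;
}
```

Keeping the two prefixes disjoint is what stops the EmailParser from re-triggering itself when it writes images back into the same bucket.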

With that all sketched out I went off and built it. Obviously, I used the CDK and after a few hours I had a working solution. I sent a test email, and about 15 seconds later I got the response.
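The text in that response email has to be assembled from Textract's output. As a rough sketch, assuming the synchronous DetectDocumentText API, the response is a flat list of blocks, and the LINE blocks carry the readable text. The interfaces below are minimal local stand-ins rather than the SDK's types, and linesToText and its minConfidence parameter are hypothetical names of mine, not something from the repository.

```typescript
// Minimal stand-ins for the parts of a Textract response this sketch needs.
// The real SDK types carry many more fields (geometry, relationships, ids).
interface TextractBlock {
  BlockType?: string; // e.g. 'PAGE', 'LINE', 'WORD'
  Text?: string;
  Confidence?: number; // 0-100
}

interface DetectDocumentTextResult {
  Blocks?: TextractBlock[];
}

// Join LINE blocks into plain text, one detected line per output line.
// WORD blocks are skipped because their text is already part of a LINE block.
// minConfidence is a hypothetical knob for dropping low-confidence lines;
// blocks without a Confidence value are kept.
function linesToText(result: DetectDocumentTextResult, minConfidence: number = 0): string {
  return (result.Blocks || [])
    .filter((b) => b.BlockType === 'LINE' && typeof b.Text === 'string')
    .filter((b) => (b.Confidence === undefined ? 100 : b.Confidence) >= minConfidence)
    .map((b) => b.Text as string)
    .join('\n');
}
```

For cursive handwriting I would probably leave minConfidence at 0 and let the reader fix mistakes by hand, since a dropped line is harder to notice than a misread one.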

So, the proof-of-concept works, but I wasn't done yet. There are always two other things you have to consider after "it works". First, "what happens when it doesn't work" and second, "how much will this cost me". If you skip these questions, you're going to have a bad time.

Figuring out when things don't work is easy: just give it to your end users. No matter how much testing YOU do, they'll always find a way to break it.

I excitedly ran to my wife to tell her the good news. But she was on a call, so I had to wait. It was only a few minutes but I was so excited to tell her I had gotten the thing working. I knew she'd be over the moon, reaffirm her love for me, and exude heaps of praise and appreciation for making her life easier.

In reality she seemed pretty unexcited. But that's her personality, she has pretty flat responses to things. Plus, this was still her having to deal with technology and she's never going to be excited about that.

But, nevertheless, I persisted and suggested she email some pictures to the email address I set up. She tapped away on her phone, whelmed at all I had done for her. As she hit 'send' I watched her face intently, looking for that genuine surprise and glee to hit her when she received an email with her pictures almost perfectly translated.

10 seconds went by, nothing. 20... 30... 60 seconds. I knew at this point something had failed.

She looked at me confused. I told her I'd be back and I ran back to my computer.

Hardening the Solution

It only took seconds of digging through the Lambda function logs to see the issue: the function ran out of memory while processing the email. I upped the memory and she tried again. This time a timeout caused the failure (3 seconds wasn't long enough).

I worked through a few issues but ultimately a few small tweaks and she was off and working. I have yet to receive my thank-you card or the congratulatory back rub.

Now the hard work begins. I had unblocked my customer but I needed better monitoring and better resiliency.

Let's go back and take a look at the code again. The first thing I'm going to do is add some alarms on the Lambda function metrics. If any errors occur, any at all, I want to be emailed about it:


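import { Duration } from 'aws-cdk-lib';
import { ComparisonOperator, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';
import { SqsDestination } from 'aws-cdk-lib/aws-lambda-destinations';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { IBucket } from 'aws-cdk-lib/aws-s3';
import { ITopic } from 'aws-cdk-lib/aws-sns';
import { Queue } from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

export class EmailParser extends Construct {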
  constructor(scope: Construct, id: string, props: {
    notificationTopic: ITopic;
    bucket: IBucket;
  }) {
    super(scope, id);
    const emailParser = new NodejsFunction(this, 'Resource', {
      memorySize: 256,
      timeout: Duration.minutes(1),
      environment: {
        BUCKET: props.bucket.bucketName,
      },
      onFailure: new SqsDestination(new Queue(this, 'DeadLetterQueue')),
    });

    emailParser.metricErrors({})
      .createAlarm(this, 'EmailParserErrorAlarm', {
        alarmName: 'EmailParserErrorAlarm',
        alarmDescription: 'Email Parser Error Alarm',
        actionsEnabled: true,
        threshold: 1,
        comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
        evaluationPeriods: 1,
        treatMissingData: TreatMissingData.NOT_BREACHING,
      })
      .addAlarmAction(new SnsAction(props.notificationTopic));

    // ...
  }
}

I repeated this setup for the ImageScanner function as well. Now, if there are any errors, I'll be notified via a topic that has my personal email as the sole subscriber.

Next, I'm going to review the application at each step in the process and try to determine three things:

Can I increase the security of the component?
Can I increase the resilience, the tolerance to failure, of the component?
Can I decrease the cost of the component?

So I start with SES. Can I increase the security? Well, not that I can see. I don't really have options to restrict incoming emails from certain addresses. If you know otherwise, please let me know. Since this is the only external-facing component, it was the only component I looked at from a security perspective (whether this is correct or not is up to you to decide).

Can I increase the resiliency? SES is already a serverless email service and there's nothing I can do to ensure it stays up and running to receive emails; that's all on AWS. But what about the other components besides SES? How about the Lambda functions? If they fail, can I recover from that failure without requiring anything from the customer?

This is definitely a place where we can add some resiliency. I could write the handler to watch for errors and put some messages in a queue for reprocessing, but that wouldn't catch core execution problems, like bad dependency references or something that could cause the entire function to never execute. So, I'll use Lambda Destinations and have any failures drop messages into an SQS queue.


export class EmailParser extends Construct {
  constructor(scope: Construct, id: string, props: {
    notificationTopic: ITopic;
    bucket: IBucket;
  }) {
    super(scope, id);

    const timeoutDuration = Duration.minutes(1);
    const deadLetterQueue = new Queue(this, 'DeadLetterQueue', {
      visibilityTimeout: timeoutDuration,
    });
    const emailParser = new NodejsFunction(this, 'Resource', {
      memorySize: 256,
      timeout: timeoutDuration,
      environment: {
        BUCKET: props.bucket.bucketName,
      },
      onFailure: new SqsDestination(deadLetterQueue),
    });

    deadLetterQueue.grantConsumeMessages(emailParser);
    // ...
  }
}

Additionally, I'll update the handler to be able to read and process these queue messages. If there are errors, I just subscribe the Lambda function to the failure queue and the messages are automatically reprocessed (if you can see the potential issues with this, leave a comment on Twitter). I'll repeat the process for the other Lambda function as well.

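import { EventBridgeEvent, SQSEvent } from 'aws-lambda';

// RedriveEvent (the failure record written by the Lambda destination) and
// processObjectKey are assumed to be defined elsewhere; they are not shown here.
export const handler = async (event: EventBridgeEvent<'Object Created', any> | SQSEvent): Promise<any> => {
  console.log(JSON.stringify(event, null, 2));

  if ('Records' in event) {
    // Redrive path: each SQS record wraps a failed invocation, and the
    // original EventBridge event is carried in requestPayload.
    for (const record of event.Records) {
      const { detail: { object: { key } } } = (JSON.parse(record.body) as RedriveEvent).requestPayload;
      await processObjectKey(key);
    }
  } else {
    // Normal path: an S3 "Object Created" event delivered by EventBridge.
    const { detail: { object: { key } } } = event;
    await processObjectKey(key);
  }
};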
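The handler above leans on a RedriveEvent type that isn't shown. Here is a minimal sketch of what it might look like, based on the invocation record that Lambda writes to an on-failure destination. The interfaces are local stand-ins (in the real handler the event types come from the aws-lambda typings), and extractKeys is a hypothetical helper that isolates the unwrapping logic so it can be tested without AWS.

```typescript
// Local stand-in for the S3 "Object Created" event delivered by EventBridge.
interface ObjectCreatedEvent {
  'detail-type': 'Object Created';
  detail: { bucket: { name: string }; object: { key: string } };
}

// Local stand-ins for the parts of an SQS event used here.
interface SqsRecordLike { body: string; }
interface SqsEventLike { Records: SqsRecordLike[]; }

// Assumed shape of the record Lambda sends to an on-failure destination.
// Only requestPayload (the original event) is actually used below.
interface RedriveEvent {
  version: string;
  timestamp: string;
  requestContext: {
    requestId: string;
    functionArn: string;
    condition: string; // e.g. 'RetriesExhausted'
    approximateInvokeCount: number;
  };
  requestPayload: ObjectCreatedEvent;
  responseContext?: { statusCode: number; executedVersion: string; functionError?: string };
  responsePayload?: unknown;
}

// Return the object keys to process, whichever way the event arrived.
function extractKeys(event: ObjectCreatedEvent | SqsEventLike): string[] {
  if ('Records' in event) {
    // Redrive path: unwrap the original event from each failure record.
    return event.Records.map(
      (r) => (JSON.parse(r.body) as RedriveEvent).requestPayload.detail.object.key,
    );
  }
  // Normal path: a single key straight from EventBridge.
  return [event.detail.object.key];
}
```

Pulling the unwrapping into a pure function like this keeps the handler itself down to a loop over keys, and the failure record's requestContext fields stay available if the handler ever needs to know how many times an event has already been tried.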

Can I decrease the costs? Well, nothing on the incoming side, but on the outgoing side I could potentially batch image analysis results together and send fewer emails. However, considering free tier limits and how few emails I'll actually be sending, I don't see a benefit that justifies the engineering cost. Otherwise, I can use the AWS pricing models to estimate that each email processed will cost me ___ cents:

About $0.18 per email, assuming 5 MB in attachments per email.
About $0.00 for the Lambdas, as free-tier covers almost all of this.
Roughly $0.04 for S3 since S3 storage is super cheap, and most likely $0.00 because of free-tier.

SES is definitely the largest cost and only because of the rather sizable attachments. But overall, still a pretty cheap solution.

That's it! Let me know what you think or if you have any questions. I'm happy to help out where I can.