It's not working because these Python packages are only wrappers around tesseract. You have to compile tesseract on an Amazon Linux instance and copy the binaries and libraries into the zip file of the Lambda function.
1) Start an EC2 instance with 64-bit Amazon Linux;
2) Install dependencies:
sudo yum install gcc gcc-c++ make
sudo yum install autoconf automake
sudo yum install libtool
sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
3) Compile and install leptonica:
cd ~
mkdir leptonica
cd leptonica
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make
sudo make install
4) Compile and install tesseract:
cd ~
mkdir tesseract
cd tesseract
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd tesseract-3.04.01
./autogen.sh
./configure
make
sudo make install
5) Download language traineddata to tessdata:
cd /usr/local/share/tessdata
sudo wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/
At this point you should be able to use tesseract on this EC2 instance. To run it in a Lambda function, you need to copy some files from this instance into the zip file you upload to Lambda. The commands below build a zip file with all the files you need.
6) Zip all the stuff you need to run tesseract on Lambda:
cd ~
mkdir tesseract-lambda
cd tesseract-lambda
cp /usr/local/bin/tesseract .
mkdir lib
cd lib
cp /usr/local/lib/libtesseract.so.3 .
cp /usr/local/lib/liblept.so.5 .
cp /usr/lib64/libpng12.so.0 .
cd ..
mkdir tessdata
cd tessdata
cp /usr/local/share/tessdata/eng.traineddata .
cd ..
zip -r ../tesseract-lambda.zip .
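The handler in step 7 looks for the tesseract binary, lib/ and tessdata/ in the same directory as main.py, so those paths need to sit at the root of the archive and not inside a subfolder. A minimal sketch for checking that before uploading, assuming Python 3 is available; the REQUIRED list and the missing_entries helper are hypothetical names built from the copy commands above:

```python
import zipfile

# Paths the handler expects relative to the archive root. This list is an
# assumption derived from the cp commands in step 6; adjust it if you bundle
# other languages or libraries.
REQUIRED = [
    "tesseract",
    "lib/libtesseract.so.3",
    "lib/liblept.so.5",
    "lib/libpng12.so.0",
    "tessdata/eng.traineddata",
]


def missing_entries(zip_path, required=REQUIRED):
    """Return the required paths that are not present at the archive root."""
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
    return [path for path in required if path not in names]
```

Calling missing_entries('tesseract-lambda.zip') should return an empty list. If it returns every path, the files were probably zipped inside a top-level folder.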
The tesseract-lambda.zip file now has everything Lambda needs to run tesseract. The last step is to add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but it should work.
7) Create a file named main.py, write a Lambda function like the one below and add it to the root of tesseract-lambda.zip:
# Written for a Python 3 runtime.
import os
import subprocess
import urllib.parse

import boto3

# Directory that holds main.py; the tesseract binary, lib/ and tessdata/ sit next to it.
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')

s3 = boto3.client('s3')


def lambda_handler(event, context):
    # Get the bucket and object from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    try:
        print("Bucket: " + bucket)
        print("Key: " + key)
        imgfilepath = '/tmp/image.png'
        # tesseract appends ".txt" to the output base it is given
        outputbase = '/tmp/result'
        resultfilepath = outputbase + '.txt'
        exportfile = key + '.txt'
        print("Export: " + exportfile)
        s3.download_file(bucket, key, imgfilepath)
        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            imgfilepath,
            outputbase,
        )
        try:
            output = subprocess.check_output(command, shell=True)
            print(output.decode('utf8'))
            s3.upload_file(resultfilepath, bucket, exportfile)
        except subprocess.CalledProcessError as e:
            # tesseract exited with a non-zero status
            print(e.output)
    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise e
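The two parts of the handler that are easiest to get wrong are decoding the object key and setting up the environment for the bundled binary. Here is a small sketch of both as standalone helpers, assuming a Python 3 runtime; object_from_event and tesseract_call are hypothetical names, and passing an argument list with an env dict is a swapped-in alternative to the shell string used in main.py:

```python
import os
from urllib.parse import unquote_plus


def object_from_event(event):
    """Return (bucket, key) from an S3 event, with the URL-encoded key decoded."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])


def tesseract_call(script_dir, image_path, output_base):
    """Build the argv list and environment for the bundled tesseract binary.

    tesseract writes its text result to output_base + ".txt".
    """
    argv = [os.path.join(script_dir, "tesseract"), image_path, output_base]
    env = dict(
        os.environ,
        LD_LIBRARY_PATH=os.path.join(script_dir, "lib"),  # bundled .so files
        TESSDATA_PREFIX=script_dir,  # parent directory of tessdata/
    )
    return argv, env
```

Inside the handler this could be used as argv, env = tesseract_call(SCRIPT_DIR, imgfilepath, '/tmp/result') followed by subprocess.check_output(argv, env=env), which avoids going through a shell.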
When creating the AWS Lambda function in the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
IMPORTANT
From time to time things change in AWS Lambda's environment. For example, the current image for the Lambda environment is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts failing with a segmentation fault, run "ldd tesseract" inside the Lambda function and check the output to see which libraries are needed (currently libtesseract.so.3 liblept.so.5 libpng12.so.0).
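A rough sketch of that check from inside the function, assuming ldd is on the PATH in the Lambda environment and a Python 3 runtime; missing_libraries and check_binary are hypothetical helper names:

```python
import os
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(binary_path, lib_dir):
    """Run ldd on the binary with the bundled lib directory on LD_LIBRARY_PATH."""
    env = dict(os.environ, LD_LIBRARY_PATH=lib_dir)
    output = subprocess.check_output(["ldd", binary_path], env=env)
    return missing_libraries(output.decode("utf8"))
```

Printing check_binary(os.path.join(SCRIPT_DIR, 'tesseract'), LIB_DIR) from the handler shows in the logs which libraries still have to be copied into lib/.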
Thanks for the comment, SergioArcos.