- I want to detect the position of characters with OCR
- I don't use it very often, so I want to run it on Lambda
- I want to use it from the Web
And that's what I managed to build.
The repository is here
- Software that performs OCR
- On a Mac it can be installed with brew (v3.04)
- It can output not only the recognized characters but also their positions, in hOCR (html) or tsv format <- important
- See the StackOverflow answer ...
- On Lambda, it works if you upload the standalone binary and the required .so files properly
- subprocess (Python's library for running command-line programs) also works
In other words ... !!
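To make that concrete, here is a minimal sketch of calling a binary bundled in the deployment package from a Lambda handler via subprocess. This is not the final handler (that comes later in this post); the file layout and the `--version` call are just assumptions for illustration:

```python
# Minimal sketch: run a binary shipped inside the Lambda deployment package.
# Assumes the layout built later in this post: the tesseract binary and a
# lib/ directory full of .so files sit next to this handler file.
import os
import subprocess

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

def handler(event, context):
    env = dict(os.environ)
    # Let the dynamic linker find the bundled shared libraries.
    env['LD_LIBRARY_PATH'] = os.path.join(SCRIPT_DIR, 'lib')
    output = subprocess.check_output(
        [os.path.join(SCRIPT_DIR, 'tesseract'), '--version'],
        env=env,
        stderr=subprocess.STDOUT,
    )
    return {'statusCode': 200, 'body': output}
```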
By the way, if you don't build it on Amazon Linux, Pillow (PIL) dies on Lambda with an "invalid ELF header" error.
- With ordinary OCR usage, you just get the recognized characters back as text
- According to the docs, v3.05 supports tsv output
- If you install normally, you get Tesseract v3.04
- To use v3.05 you have to build it by hand, following the StackOverflow answer
This was painful.
So here is how to set it up.
Everything below can be done as the ec2-user.
sudo yum install -y gcc gcc-c++ make
sudo yum install -y autoconf aclocal automake
sudo yum install -y libtool
sudo yum install -y libjpeg-devel libpng-devel libtiff-devel zlib-devel
sudo yum install -y git
On Amazon Linux, the node that yum provides is too old and makes various things (described later) difficult, so install nvm. Note that if you build anywhere other than Amazon Linux, the package will error out on Lambda, so hang in there and do the build here.
$ curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.0/install.sh | bash
$ source ~/.bashrc
$ nvm install v6.9.4
$ nvm alias default v6.9.4
#Check version
$ npm -v
$ node -v
Leptonica is an open-source image-analysis library that Tesseract needs in order to run. You can't use Tesseract v3.05 without upgrading Leptonica beyond the version yum provides, so build it from source.
$ cd ~
$ mkdir leptonica
$ cd leptonica
$ wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
# unzip
$ tar -zxvf leptonica-1.73.tar.gz
$ cd leptonica-1.73
# build
$ ./configure
$ make
$ sudo make install
v3.05 is still under development, so there is no release zip to download; clone the repo and check out the 3.05 branch.
$ cd ~
$ git clone https://github.com/tesseract-ocr/tesseract.git
$ cd tesseract/
$ git checkout -b 3.05 origin/3.05
# initialize
$ ./autogen.sh
# build
$ ./configure
$ make
$ sudo make install
$ cd ~
$ mkdir package
$ cd package
# Copy the tesseract binary and the shared libraries it needs
$ cp /usr/local/bin/tesseract .
$ mkdir lib
$ cd lib
$ cp /usr/local/lib/libtesseract.so.3 .
$ cp /usr/local/lib/liblept.so.5 .
$ cp /lib64/librt.so.1 .
$ cp /lib64/libz.so.1 .
$ cp /usr/lib64/libpng12.so.0 .
$ cp /usr/lib64/libjpeg.so.62 .
$ cp /usr/lib64/libtiff.so.5 .
$ cp /lib64/libpthread.so.0 .
$ cp /usr/lib64/libstdc++.so.6 .
$ cp /lib64/libm.so.6 .
$ cp /lib64/libgcc_s.so.1 .
$ cp /lib64/libc.so.6 .
$ cp /lib64/ld-linux-x86-64.so.2 .
$ cp /usr/lib64/libjbig.so.2.0 .
# Get trained data
$ cd ..
$ mkdir tessdata
$ cd tessdata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
# Make config file
$ mkdir configs
$ echo 'tessedit_create_tsv 1' > configs/tsv
$ cd ../..
$ zip -r package.zip package
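By the way, the list of .so files above is just what this particular build needed. If you want to double-check the dependencies of your own binary, something like this (run on the build machine) does the trick:

```python
# Print the shared-library dependencies of the freshly built tesseract binary,
# so you know which .so files to copy into the package's lib/ directory.
import subprocess

print(subprocess.check_output(['ldd', '/usr/local/bin/tesseract']))
```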
Now you can use it by including this package directory in your Lambda deployment package!
Sorry for the lol.
Here is the result of running it on a sample image (a 1080x1920 screenshot):
level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 1080 1920 -1
2 1 1 0 0 0 29 11 1025 50 -1
3 1 1 1 0 0 29 11 1025 50 -1
4 1 1 1 1 0 29 11 1025 50 -1
5 1 1 1 1 1 29 11 548 50 60 GnAflQflAA
5 1 1 1 1 2 640 15 167 43 58 X-IIZII"
5 1 1 1 1 3 899 14 155 44 89 l11:57
2 1 2 0 0 0 0 0 1080 76 -1
3 1 2 1 0 0 0 0 1080 76 -1
4 1 2 1 1 0 0 0 1080 76 -1
5 1 2 1 1 1 0 0 1080 76 95
2 1 3 0 0 0 192 829 197 66 -1
3 1 3 1 0 0 192 829 197 66 -1
4 1 3 1 1 0 192 829 197 66 -1
5 1 3 1 1 1 192 851 93 44 87 00
5 1 3 1 1 2 336 829 53 66 71 la
2 1 4 0 0 0 122 992 718 109 -1
3 1 4 1 0 0 122 992 718 109 -1
4 1 4 1 1 0 122 992 718 47 -1
5 1 4 1 1 1 122 995 88 44 89 Sign
5 1 4 1 1 2 229 995 31 34 94 in
5 1 4 1 1 3 276 997 40 32 86 to
5 1 4 1 1 4 332 997 64 42 89 get
5 1 4 1 1 5 410 993 66 36 86 the
5 1 4 1 1 6 493 997 104 32 84 most
5 1 4 1 1 7 613 997 66 32 86 out
5 1 4 1 1 8 695 992 41 37 91 of
5 1 4 1 1 9 749 1003 91 36 93 your
4 1 4 1 2 0 122 1065 144 36 -1
5 1 4 1 2 1 122 1065 144 36 87 device.
2 1 5 0 0 0 124 1269 312 46 -1
3 1 5 1 0 0 124 1269 312 46 -1
4 1 5 1 1 0 124 1269 312 46 -1
5 1 5 1 1 1 124 1269 111 36 87 Email
5 1 5 1 1 2 253 1279 40 26 92 or
5 1 5 1 1 3 310 1269 126 46 89 phone
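The columns match the header line: left/top/width/height are pixel coordinates and conf is the confidence (-1 for the page/block/paragraph/line rows that carry no text). As a sketch of how you could use this, not something in the repository, here is one way to turn the TSV into word bounding boxes:

```python
# Parse tesseract's TSV output into a list of word bounding boxes.
# Rows without text (page/block/paragraph/line rows) are skipped.
import csv

def parse_tsv(path):
    boxes = []
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter='\t'):
            text = (row.get('text') or '').strip()
            if not text:
                continue
            boxes.append({
                'text': text,
                'left': int(row['left']),
                'top': int(row['top']),
                'width': int(row['width']),
                'height': int(row['height']),
                'conf': int(row['conf']),
            })
    return boxes

# Example: print every detected word with its position.
for box in parse_tsv('/tmp/result.tsv'):
    print(box)
```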
The source is like this
# import requirements
from PIL import Image
import sys
import pyocr
import pyocr.builders
import urllib
import os
import subprocess
import base64
import json
import boto3
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')
LANG_DIR = os.path.join(SCRIPT_DIR, 'tessdata')
def response(code, body):
    return {
        'statusCode': code,
        'headers': {
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(body),
    }


def handler(event, context):
    # Decode the posted image and run the bundled tesseract binary on it
    try:
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print("No OCR tool found")
            sys.exit(1)
        tool = tools[0]
        print("Will use tool '%s'" % (tool.get_name()))

        request = event['body']
        result_filepath = '/tmp/result'
        img_filepath = '/tmp/image.png'
        with open(img_filepath, 'wb') as fh:
            fh.write(base64.decodestring(request['template']))

        # lib/ holds the copied .so files, tessdata/ the trained data;
        # the 'tsv' config file makes tesseract write result.tsv
        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {} -l eng --oem 0 tsv'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            img_filepath,
            result_filepath
        )
        print(command)
        try:
            output = subprocess.check_output(
                command,
                shell=True,
                stderr=subprocess.STDOUT
            )
            print(output)
            with open(result_filepath + '.tsv', 'rb') as fh:
                print(fh.read())
        except subprocess.CalledProcessError as e:
            return "except:: " + e.output
    except Exception as e:
        print(e)
        raise e
After that, feel free to adjust the serverless.yml in the GitHub repo however you like.
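For completeness, here is a hedged example of calling the deployed endpoint from a client. The URL is a placeholder and the payload shape (a JSON body with a base64-encoded `template` field) simply mirrors what the handler above reads; adjust it to match your actual API Gateway integration:

```python
# Post a base64-encoded PNG to the deployed API.
# The endpoint URL is a placeholder; replace it with your own API Gateway URL.
import base64
import requests

with open('image.png', 'rb') as fh:
    payload = {'template': base64.b64encode(fh.read()).decode('ascii')}

resp = requests.post(
    'https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/dev/ocr',
    json=payload,
)
print(resp.status_code)
print(resp.text)
```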