- I want to detect the position of characters with OCR
- I don't use it very often, so I want to run it on Lambda
- I want to use it from the Web
And that's what I managed to build.
The repository is here
- Software that performs OCR
- On a Mac it can be installed with brew (v3.04)
- It can output not only the recognized characters but also their positions, in hOCR (html) or tsv format <- important
- See the StackOverflow answer ...
- On Lambda, it works if you upload the standalone binary and the required .so files properly
- subprocess (Python's library for running command-line programs) also works
In other words ... !!
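To make that concrete, here is a minimal sketch of calling a binary bundled in the deployment package from a Lambda handler via subprocess. This is not the final handler (that comes later in this post); the file layout and the `--version` call are just assumptions for illustration:

```python
# Minimal sketch: run a binary shipped inside the Lambda deployment package.
# Assumes the layout built later in this post: the tesseract binary and a
# lib/ directory full of .so files sit next to this handler file.
import os
import subprocess

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

def handler(event, context):
    env = dict(os.environ)
    # Let the dynamic linker find the bundled shared libraries.
    env['LD_LIBRARY_PATH'] = os.path.join(SCRIPT_DIR, 'lib')
    output = subprocess.check_output(
        [os.path.join(SCRIPT_DIR, 'tesseract'), '--version'],
        env=env,
        stderr=subprocess.STDOUT,
    )
    return {'statusCode': 200, 'body': output}
```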
By the way, if you don't build it on Amazon Linux, Pillow (PIL) dies on Lambda with an "invalid ELF header" error.
- With ordinary OCR usage, you just get the recognized characters back as text
- According to the docs, v3.05 supports tsv output
- If you install normally, you get Tesseract v3.04
- To use v3.05 you have to build it by hand, following the StackOverflow answer
This was painful.
So here is how to set it up.
Everything below can be done as the ec2-user.
sudo yum install -y gcc gcc-c++ make
sudo yum install -y autoconf aclocal automake
sudo yum install -y libtool
sudo yum install -y libjpeg-devel libpng-devel libtiff-devel zlib-devel
sudo yum install -y git
On Amazon Linux, the node that yum provides is too old and makes various things (described later) difficult, so install nvm. Note that if you build anywhere other than Amazon Linux, the package will error out on Lambda, so hang in there and do the build here.
$ curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.0/install.sh | bash
$ source ~/.bashrc
$ nvm install v6.9.4
$ nvm alias default v6.9.4
#Check version
$ npm -v
$ node -v
Leptonica is an open-source image-analysis library that Tesseract needs in order to run. You can't use Tesseract v3.05 without upgrading Leptonica beyond the version yum provides, so build it from source.
$ cd ~
$ mkdir leptonica
$ cd leptonica
$ wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
# unzip
$ tar -zxvf leptonica-1.73.tar.gz
$ cd leptonica-1.73
# build
$ ./configure
$ make
$ sudo make install
v3.05 is still under development, so there is no release zip to download; clone the repo and check out the 3.05 branch.
$ cd ~
$ git clone https://github.com/tesseract-ocr/tesseract.git
$ cd tesseract/
$ git checkout -b 3.05 origin/3.05
# initialize
$ ./autogen.sh
# build
$ ./configure
$ make
$ sudo make install
$ cd ~
$ mkdir package
$ cd package
# Copy the tesseract binary and the shared libraries it needs
$ cp /usr/local/bin/tesseract .
$ mkdir lib
$ cd lib
$ cp /usr/local/lib/libtesseract.so.3 .
$ cp /usr/local/lib/liblept.so.5 .
$ cp /lib64/librt.so.1 .
$ cp /lib64/libz.so.1 .
$ cp /usr/lib64/libpng12.so.0 .
$ cp /usr/lib64/libjpeg.so.62 .
$ cp /usr/lib64/libtiff.so.5 .
$ cp /lib64/libpthread.so.0 .
$ cp /usr/lib64/libstdc++.so.6 .
$ cp /lib64/libm.so.6 .
$ cp /lib64/libgcc_s.so.1 .
$ cp /lib64/libc.so.6 .
$ cp /lib64/ld-linux-x86-64.so.2 .
$ cp /usr/lib64/libjbig.so.2.0 .
# Get trained data
$ cd ..
$ mkdir tessdata
$ cd tessdata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
# Make config file
$ mkdir configs
$ echo 'tessedit_create_tsv 1' > configs/tsv
$ cd ../..
$ zip -r package.zip package
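By the way, the list of .so files above is just what this particular build needed. If you want to double-check the dependencies of your own binary, something like this (run on the build machine) does the trick:

```python
# Print the shared-library dependencies of the freshly built tesseract binary,
# so you know which .so files to copy into the package's lib/ directory.
import subprocess

print(subprocess.check_output(['ldd', '/usr/local/bin/tesseract']))
```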
Now you can use it by including this package directory in your Lambda deployment package!
Sorry for the lol.
Here is the result of running it on a sample image (a 1080x1920 screenshot):
level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 1080 1920 -1
2 1 1 0 0 0 29 11 1025 50 -1
3 1 1 1 0 0 29 11 1025 50 -1
4 1 1 1 1 0 29 11 1025 50 -1
5 1 1 1 1 1 29 11 548 50 60 GnAflQflAA
5 1 1 1 1 2 640 15 167 43 58 X-IIZII"
5 1 1 1 1 3 899 14 155 44 89 l11:57
2 1 2 0 0 0 0 0 1080 76 -1
3 1 2 1 0 0 0 0 1080 76 -1
4 1 2 1 1 0 0 0 1080 76 -1
5 1 2 1 1 1 0 0 1080 76 95
2 1 3 0 0 0 192 829 197 66 -1
3 1 3 1 0 0 192 829 197 66 -1
4 1 3 1 1 0 192 829 197 66 -1
5 1 3 1 1 1 192 851 93 44 87 00
5 1 3 1 1 2 336 829 53 66 71 la
2 1 4 0 0 0 122 992 718 109 -1
3 1 4 1 0 0 122 992 718 109 -1
4 1 4 1 1 0 122 992 718 47 -1
5 1 4 1 1 1 122 995 88 44 89 Sign
5 1 4 1 1 2 229 995 31 34 94 in
5 1 4 1 1 3 276 997 40 32 86 to
5 1 4 1 1 4 332 997 64 42 89 get
5 1 4 1 1 5 410 993 66 36 86 the
5 1 4 1 1 6 493 997 104 32 84 most
5 1 4 1 1 7 613 997 66 32 86 out
5 1 4 1 1 8 695 992 41 37 91 of
5 1 4 1 1 9 749 1003 91 36 93 your
4 1 4 1 2 0 122 1065 144 36 -1
5 1 4 1 2 1 122 1065 144 36 87 device.
2 1 5 0 0 0 124 1269 312 46 -1
3 1 5 1 0 0 124 1269 312 46 -1
4 1 5 1 1 0 124 1269 312 46 -1
5 1 5 1 1 1 124 1269 111 36 87 Email
5 1 5 1 1 2 253 1279 40 26 92 or
5 1 5 1 1 3 310 1269 126 46 89 phone
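The columns match the header line: left/top/width/height are pixel coordinates and conf is the confidence (-1 for the page/block/paragraph/line rows that carry no text). As a sketch of how you could use this, not something in the repository, here is one way to turn the TSV into word bounding boxes:

```python
# Parse tesseract's TSV output into a list of word bounding boxes.
# Rows without text (page/block/paragraph/line rows) are skipped.
import csv

def parse_tsv(path):
    boxes = []
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter='\t'):
            text = (row.get('text') or '').strip()
            if not text:
                continue
            boxes.append({
                'text': text,
                'left': int(row['left']),
                'top': int(row['top']),
                'width': int(row['width']),
                'height': int(row['height']),
                'conf': int(row['conf']),
            })
    return boxes

# Example: print every detected word with its position.
for box in parse_tsv('/tmp/result.tsv'):
    print(box)
```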
The source is like this
# import requirements
from PIL import Image
import sys
import pyocr
import pyocr.builders
import urllib
import os
import subprocess
import base64
import json
import boto3
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')
LANG_DIR = os.path.join(SCRIPT_DIR, 'tessdata')
def response(code, body):
    return {
        'statusCode': code,
        'headers': {
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(body),
    }


def handler(event, context):
    # Decode the posted image and run the bundled tesseract binary on it
    try:
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print("No OCR tool found")
            sys.exit(1)
        tool = tools[0]
        print("Will use tool '%s'" % (tool.get_name()))

        request = event['body']
        result_filepath = '/tmp/result'
        img_filepath = '/tmp/image.png'
        with open(img_filepath, 'wb') as fh:
            fh.write(base64.decodestring(request['template']))

        # lib/ holds the copied .so files, tessdata/ the trained data;
        # the 'tsv' config file makes tesseract write result.tsv
        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {} -l eng --oem 0 tsv'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            img_filepath,
            result_filepath
        )
        print(command)
        try:
            output = subprocess.check_output(
                command,
                shell=True,
                stderr=subprocess.STDOUT
            )
            print(output)
            with open(result_filepath + '.tsv', 'rb') as fh:
                print(fh.read())
        except subprocess.CalledProcessError as e:
            return "except:: " + e.output
    except Exception as e:
        print(e)
        raise e
After that, feel free to adjust the serverless.yml in the GitHub repo however you like.
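For completeness, here is a hedged example of calling the deployed endpoint from a client. The URL is a placeholder and the payload shape (a JSON body with a base64-encoded `template` field) simply mirrors what the handler above reads; adjust it to match your actual API Gateway integration:

```python
# Post a base64-encoded PNG to the deployed API.
# The endpoint URL is a placeholder; replace it with your own API Gateway URL.
import base64
import requests

with open('image.png', 'rb') as fh:
    payload = {'template': base64.b64encode(fh.read()).decode('ascii')}

resp = requests.post(
    'https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/dev/ocr',
    json=payload,
)
print(resp.status_code)
print(resp.text)
```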