I wanted to apply OCR to a PDF file from the command line, so I wrote a bash script named ocrize to do it.
Before using ocrize, install the required packages with the following command.
$ sudo apt install tesseract-ocr-jpn imagemagick
# To OCR vertically written Japanese PDFs, tesseract-ocr-jpn-vert is also required.
Run ocrize as follows (make it executable with chmod beforehand).
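Before running the script it can help to confirm that the commands it calls are actually on the PATH. The following is a minimal sketch; the helper name missing_commands is made up for illustration. Note that pdfunite is not installed by the apt line above; on Debian and Ubuntu it is normally provided by the poppler-utils package.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "${cmd}" > /dev/null 2>&1 || echo "${cmd}"
    done
}

# The three external commands used in this post.
missing_commands convert tesseract pdfunite
```

If nothing is printed, all three commands were found.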
$ ./ocrize input.pdf
# or
$ ocrize input.pdf # When ocrize is placed in a directory on your PATH, such as /usr/local/bin.
The rest of this post walks through the steps for applying OCR to the PDF file input.pdf on the command line.
Before applying OCR to input.pdf, store the file name in a variable and create a temporary working directory.
$ stem="input" # File name without the extension.
$ dir="${stem}.temp"
$ mkdir ${dir} # Creates the input.temp directory.
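Instead of typing the stem by hand, it can be derived from the file name. This is a small sketch using shell parameter expansion; the helper name stem_of is hypothetical and not part of the original script.

```shell
# Hypothetical helper: print a file name without its last extension.
# "${1%.*}" removes the shortest suffix that matches ".*".
stem_of() {
    printf '%s\n' "${1%.*}"
}

file="input.pdf"
stem=$(stem_of "${file}")
dir="${stem}.temp"
echo "${stem} ${dir}"
```

Because only the last extension is removed, a name such as scan.v2.pdf keeps its inner dot.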
Use ImageMagick's convert command to split input.pdf into image files.
$ convert -density 300 -geometry 1000 ${stem}.pdf ${dir}/${stem}.png
The higher the value of the -density option, the better the image quality, but with higher values you may need to raise ImageMagick's resource limits.
Running the command above splits input.pdf into pages and writes one PNG file per page to the input.temp directory, as shown below.
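To get a feel for why a higher -density costs more resources, here is a rough sketch of the arithmetic. It assumes the density is interpreted as dots per inch and uses an A4 page of about 8.27 x 11.69 inches as an example; the helper name pixels_at_density is made up for illustration.

```shell
# Hypothetical helper: pixels along one edge for a length in inches at a given DPI,
# rounded to the nearest integer.
pixels_at_density() {
    awk -v inches="$1" -v dpi="$2" 'BEGIN { printf "%d\n", inches * dpi + 0.5 }'
}

pixels_at_density 8.27 300    # A4 width at 300 DPI
pixels_at_density 11.69 300   # A4 height at 300 DPI
```

Under these assumptions each page is rasterized at roughly 2481 x 3507 pixels before any resizing, and doubling the density quadruples the pixel count.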
$ ls ${dir}
input-0.png input-1.png input-2.png input-3.png
If the convert command fails with one of the following errors, edit ImageMagick's policy.xml (in my environment, policy.xml was directly under /etc/ImageMagick-6).
Error 1: convert: not authorized
Change the rights attribute of the policy tag whose domain is coder and pattern is PDF so that PDF files can be read and written, as follows.
<policy domain="coder" rights="read|write" pattern="PDF" />
Error 2: convert-im6.q16: cache resources exhausted
ImageMagick has hit its resource limits, so adjust the policy tags whose domain is resource. In my case, I changed the entries whose name is memory and disk.
<policy domain="resource" name="memory" value="1GiB"/>
<policy domain="resource" name="disk" value="2GiB"/>
You can check the current resource limits with the following command.
$ identify -list resource
# or
$ identify -list Resource
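The PDF coder change described above can also be made with sed rather than by hand. This is only a sketch: the helper name allow_pdf_policy is hypothetical, and it prints the modified file to standard output instead of editing policy.xml in place, so the result can be reviewed before it is installed.

```shell
# Hypothetical helper: print a copy of a policy file in which the policy line
# with domain="coder" and pattern="PDF" has rights="read|write".
# Other policy lines are passed through unchanged.
allow_pdf_policy() {
    sed '/domain="coder"/{/pattern="PDF"/s/rights="[^"]*"/rights="read|write"/;}' "$1"
}

# Example (path as in my environment; adjust to yours):
#   allow_pdf_policy /etc/ImageMagick-6/policy.xml > policy.xml.new
#   diff /etc/ImageMagick-6/policy.xml policy.xml.new
```

After checking the diff, the new file would still need to be copied over the original with sudo.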
Apply OCR to each image file in the input.temp directory.
$ N=$(ls ${dir} | grep -c '' | awk '{printf "%d", $1-1}')
$ for n in $(seq 0 ${N}); do tesseract -l jpn+eng ${dir}/${stem}-${n}.png ${dir}/${stem}-${n} pdf; done
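The page count above is obtained by piping ls through grep and awk. As a possible alternative, the sketch below counts the page images with a glob. The helper name last_page_index is made up, and it assumes the images are named <stem>-0.png, <stem>-1.png, and so on, as in the listing above.

```shell
# Hypothetical helper: print the index of the last page image in a directory.
# Prints -1 when no matching image exists.
last_page_index() {
    local dir=$1 stem=$2 count=0 f
    for f in "${dir}/${stem}"-*.png; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "${f}" ]; then
            count=$((count + 1))
        fi
    done
    echo $((count - 1))
}

# Example: N=$(last_page_index "${dir}" "${stem}")
```

Unlike counting every entry that ls prints, this ignores unrelated files that happen to be in the directory.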
The recognition language is specified with the -l option. To recognize Japanese first and then English, specify jpn+eng.
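If a language named in the -l option is not installed, tesseract will fail, so it may be worth checking first. The sketch below compares a language spec against a list of installed languages; the helper name missing_langs is hypothetical. The commented example assumes that tesseract --list-langs prints one header line followed by one language per line, which may differ between versions.

```shell
# Hypothetical helper: print the languages in a spec such as "jpn+eng"
# that do not appear in a newline-separated list of installed languages.
missing_langs() {
    local spec=$1 installed=$2 lang
    echo "${spec}" | tr '+' '\n' | while read -r lang; do
        # -F: fixed string, -x: whole line must match, -q: no output.
        echo "${installed}" | grep -Fqx -e "${lang}" || echo "${lang}"
    done
}

# Example (assumes the header line can be skipped with tail):
#   installed=$(tesseract --list-langs 2>/dev/null | tail -n +2)
#   missing_langs "jpn+eng" "${installed}"
```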
Use the pdfunite command to combine the per-page PDF files into one.
$ pdfunite $(for n in $(seq 0 ${N}); do echo ${dir}/${stem}-${n}.pdf; done) ocrized-${stem}.pdf
The for loop lists the PDF files in numeric page order so that the pages do not get shuffled. After the command runs, ocrized-input.pdf, which is input.pdf with OCR applied, is created in the current directory.
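The reason for generating the names by counting is that a glob such as input-*.pdf expands in lexicographic order, which places input-10.pdf before input-2.pdf once there are more than ten pages. The sketch below shows the counting approach on its own; the helper name page_pdfs is made up for illustration.

```shell
# Hypothetical helper: print per-page PDF paths from page 0 to the given
# last index, one per line, in numeric order.
page_pdfs() {
    local dir=$1 stem=$2 last=$3 n=0
    while [ "${n}" -le "${last}" ]; do
        printf '%s\n' "${dir}/${stem}-${n}.pdf"
        n=$((n + 1))
    done
}

# With 12 pages, input-2.pdf is listed third and input-10.pdf eleventh.
page_pdfs input.temp input 11
```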
Finally, delete the input.temp directory and you're done!
$ rm -r ${dir}
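If a step fails partway through, the temporary directory is left behind. One way to guard against that, sketched here as an assumption rather than something the original script does, is to create the directory with mktemp -d and remove it from an EXIT trap. The wrapper name with_temp_dir is hypothetical.

```shell
# Hypothetical wrapper: run a command with a fresh temporary directory
# appended as its last argument, and remove that directory afterwards.
# The body is a subshell ( ... ), so the EXIT trap fires when it ends,
# whether the command succeeded or not.
with_temp_dir() (
    dir=$(mktemp -d) || exit 1
    trap 'rm -rf "${dir}"' EXIT
    "$@" "${dir}"
)

# Example: with_temp_dir some_function_that_takes_a_directory
```

Note that this uses a randomly named directory from mktemp rather than input.temp in the current directory.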
ocrize
The following bash script combines the steps above. Save it with the file name ocrize and use it.
#!/bin/bash
# ocrize: apply OCR to a PDF file and write the result to ocrized-<input>.
if [ $# -eq 1 ]; then
    file=$1
    # The extension is the text after the last dot.
    ext=$(echo "${file}" | rev | cut -d '.' -f 1 | rev)
    dir=${file}.temp
    if [ -d "${dir}" ]; then
        echo "${dir} already exists. Please remove this directory."
        exit 1
    fi
    if [ "${ext}" = "pdf" ] || [ "${ext}" = "PDF" ]; then
        if [ ! -f "${file}" ]; then
            echo "${file} does not exist."
            exit 1
        fi
        # Strip the 4-character extension (.pdf or .PDF).
        stem=$(echo "${file}" | rev | cut -c 5- | rev)
        mkdir "${dir}"
        echo "1: Converting PDF to PNG."
        convert -density 300 -geometry 1000 "${file}" "${dir}/${stem}.png"
        echo "1: Finished."
        # Index of the last page (pages are numbered from 0).
        N=$(ls "${dir}" | grep -c '' | awk '{printf "%d", $1-1}')
        echo "2: OCRizing."
        for n in $(seq 0 ${N}); do
            # Percentage of pages processed so far.
            p=$(echo "${n} ${N}" | awk '{printf "%5.1f", ($1+1)/($2+1)*100}')
            echo -ne "Progress: [${p} %]\r"
            tesseract -l jpn+eng "${dir}/${stem}-${n}.png" "${dir}/${stem}-${n}" pdf > /dev/null 2>&1
            rm "${dir}/${stem}-${n}.png"
        done
        # Trailing spaces overwrite the rest of the progress line.
        echo "2: Finished.         "
        echo "3: Merging PDF files."
        # Alternative using pdftk:
        #pdftk $(for n in $(seq 0 ${N}); do echo ${dir}/${stem}-${n}.pdf; done) output ocrized-${file}
        pdfunite $(for n in $(seq 0 ${N}); do echo ${dir}/${stem}-${n}.pdf; done) "ocrized-${file}"
        echo "3: Finished."
        rm -r "${dir}"
    else
        echo "Extension must be pdf or PDF."
        exit 1
    fi
else
    echo "Usage: $ ocrize input.pdf"
    exit 1
fi
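The script builds its paths directly from the argument, so it appears to assume that the input file is in the current directory. For example, with docs/input.pdf the output name ocrized-${file} would expand to ocrized-docs/input.pdf. As one piece of handling such paths, the sketch below computes an output path next to the input file; the helper name output_path is hypothetical, and the working directory and stem would need similar treatment.

```shell
# Hypothetical helper: print the output path for an input PDF by putting
# the "ocrized-" prefix on the file name and keeping the directory part.
output_path() {
    printf '%s/ocrized-%s\n' "$(dirname "$1")" "$(basename "$1")"
}

output_path docs/input.pdf
output_path input.pdf
```

For a bare file name, dirname returns ".", so the result still points at the current directory.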
When you run ocrize, the output looks like this.
$ ./ocrize input.pdf
1: Converting PDF to PNG.
convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `input.pdf.temp/input.png' @ warning/png.c/MagickPNGWarningHandler/1654.
1: Finished.
2: OCRizing.
Progress: [ 84.4 %]
A few seconds later,
1: Converting PDF to PNG.
convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `input.pdf.temp/input.png' @ warning/png.c/MagickPNGWarningHandler/1654.
1: Finished.
2: OCRizing.
2: Finished.
3: Merging PDF files.
3: Finished.