Convert PDF files to PNG images, one image file per page. Because this is intended as pre-processing for verifying business-form output, multiple PDF files can be processed in a single run.
--An environment where Python 3 runs. This article uses Python 3.8.1 (Windows 64-bit)
--Poppler. Open-source command-line tools for working with PDFs
--pdf2image. A wrapper module that makes Poppler available from Python
Official site (source): https://poppler.freedesktop.org/
Binaries for Windows are available here. http://blog.alivate.com.au/poppler-windows/
The installation procedure is summarized on this site: http://pdf-file.nnn2.com/?p=863 Be sure to install the language files described in the latter half of that explanation; without them, Japanese file names will be garbled.
OpenCV's imread() and imwrite() cannot handle file names that contain non-ASCII characters, so if you post-process the images with OpenCV-Python, the file names must be changed to ASCII characters. URL-encoding the name, for example base = urllib.parse.quote(pdf_file.stem), works, but the result is not human-readable.
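As a rough illustration of that trade-off, the sketch below URL-encodes a Japanese stem using only the standard library. The sample name is a made-up example, not a file from this article.

```python
from urllib.parse import quote, unquote

# Hypothetical stem of a PDF with a Japanese file name.
stem = "帳票サンプル"

# quote() percent-encodes the UTF-8 bytes, so the result is ASCII-only.
ascii_stem = quote(stem)

# unquote() restores the original name when it is needed again.
restored = unquote(ascii_stem)

print(ascii_stem.isascii())  # True
print(restored == stem)      # True
```

The encoded name is safe to pass to OpenCV, and the original name can be recovered later with unquote().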
If renaming the original data is difficult, there is also a workaround, described in this article: Dealing with problems when handling file paths that include Japanese in Python OpenCV cv2.imread and cv2.imwrite https://qiita.com/SKYS/items/cbde3775e2143cad745
pip install pdf2image
The GitHub repository is here: https://github.com/Belval/pdf2image
pdf2img.py
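import pathlib
import pdf2image

# Input PDFs and output folder, both relative to the current directory
pdf_files = pathlib.Path('in_pdf').glob('*.pdf')
img_dir = pathlib.Path('out_img')
img_dir.mkdir(exist_ok=True)  # create the output folder if it does not exist

for pdf_file in pdf_files:
    base = pdf_file.stem
    # One image per page: grayscale, fitted into a 640-pixel square
    images = pdf2image.convert_from_path(pdf_file, grayscale=True, size=640)
    for index, image in enumerate(images):
        # Page numbers in the output file name start at 1
        image.save(img_dir/pathlib.Path(base + '-{}.png'.format(index + 1)),
                   'png')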
The script is simple: it reads every PDF file in the in_pdf folder under the current directory and writes {PDF filename}-{page}.png to the out_img folder.
Example: Some form.pdf → Some form-1.png, Some form-2.png
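The naming rule can be checked without running Poppler at all. This is a minimal sketch; page_image_path is a hypothetical helper introduced here for illustration and is not part of pdf2image or of the script above.

```python
import pathlib

def page_image_path(pdf_file, index, img_dir):
    """Return the PNG output path for the page at zero-based `index`."""
    # Page numbers in the file name start at 1, hence index + 1.
    return img_dir / "{}-{}.png".format(pdf_file.stem, index + 1)

pdf_file = pathlib.Path("in_pdf") / "Some form.pdf"
img_dir = pathlib.Path("out_img")

# Names for a two-page PDF.
names = [page_image_path(pdf_file, i, img_dir).name for i in range(2)]
print(names)  # ['Some form-1.png', 'Some form-2.png']
```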
The image conversion parameters are set in this call:
images = pdf2image.convert_from_path(pdf_file, grayscale=True, size=640)
--grayscale=True produces grayscale output. With grayscale=False, or if the argument is omitted, the output is in color
--size=n scales the output to fit within an n-pixel square. If size is not specified, the size is calculated from the DPI value
--dpi=n specifies the DPI value (the default is 200 DPI). If size is also specified, size takes priority
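To get a feel for the resulting image sizes, the arithmetic behind dpi and size can be sketched as follows. Both function names are hypothetical, and the rounding here is an assumption: Poppler's own rounding may differ by a pixel.

```python
def pixels_from_dpi(width_pt, height_pt, dpi=200):
    """Approximate pixel size of a page rendered at `dpi` (1 pt = 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)

def fit_in_square(width, height, n):
    """Scale (width, height) so the longer side becomes n, keeping the aspect ratio."""
    scale = n / max(width, height)
    return round(width * scale), round(height * scale)

# An A4 portrait page is roughly 595 x 842 points.
print(pixels_from_dpi(595, 842))     # (1653, 2339)
print(fit_in_square(595, 842, 640))  # (452, 640)
```

At the default 200 DPI an A4 page is far larger than the 640-pixel output used in this article, which is why specifying size keeps the files small.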
Many other settings are available, but these are enough to get started.
To change the image format, replace
image.save(img_dir/pathlib.Path(base + '-{}.png'.format(index + 1)), 'png')
with
image.save(img_dir/pathlib.Path(base + '-{}.jpg'.format(index + 1)), 'jpeg')
and the images will be output in JPEG format.
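If the format needs to be switched often, the extension and the format name could be kept together in one place. This is only a sketch: SAVE_FORMATS and save_target are hypothetical names, and only the two formats mentioned in this article are listed.

```python
import pathlib

# File extension -> format name passed as the second argument of image.save().
SAVE_FORMATS = {"png": "png", "jpg": "jpeg"}

def save_target(img_dir, base, index, ext):
    """Return (path, format) for one page; `ext` must be a key of SAVE_FORMATS."""
    path = img_dir / "{}-{}.{}".format(base, index + 1, ext)
    return path, SAVE_FORMATS[ext]

path, fmt = save_target(pathlib.Path("out_img"), "Some form", 0, "jpg")
print(path.name, fmt)  # Some form-1.jpg jpeg
```

Inside the page loop, the pair would then be used as image.save(path, fmt).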