What you want to do in this article

Convert PDF files to images (PNG), one file per page. Verification of business form output Since pre-processing is assumed, multiple PDF files can be processed together.

Things to prepare

--Environment where Python3 runs. This article uses Python 3.8.1 (Windows 64bit) --Poppler. Open source command line tools for working with PDFs --pdf2 image. Wrapper module that makes Poppler available from Python

Install Poppler

Head family (source) https://poppler.freedesktop.org/

Binaries for Windows are available here. http://blog.alivate.com.au/poppler-windows/

The installation procedure is summarized on this site. http://pdf-file.nnn2.com/?p=863 If you do not include the language file in the latter half of the explanation, the Japanese file name will be garbled, so be sure to include it.

Added on January 6, 2020

Since imread () and imwrite () cannot handle file names other than ascii, it is necessary to change the file name to ascii characters when post-processing using openCV-Python. It's okay to URL-encode it like base = urllib.parse.quote (pdf_file.stem), but it's unreadable by people.

If it is difficult to rename the original data, there is also such a countermeasure. About dealing with problems when handling file paths including Japanese in Python OpenCV cv2.imread and cv2.imwrite https://qiita.com/SKYS/items/cbde3775e2143cad745

Install pdf2image

pip install pdf2image

Click here for Github https://github.com/Belval/pdf2image

Python code

`pdf2img.py`


import pathlib
import pdf2image

pdf_files = pathlib.Path('in_pdf').glob('*.pdf')
img_dir = pathlib.Path('out_img')

for pdf_file in pdf_files:
    base = pdf_file.stem
    images = pdf2image.convert_from_path(pdf_file, grayscale=True, size=640)
    for index, image in enumerate(images):
        image.save(img_dir/pathlib.Path(base + '-{}.png'.format(index + 1)),
                   'png')

What I'm doing is simple: I'm reading a PDF file in the in_pdf folder of the current directory and outputting {PDF filename}-{page} .png to the out_img folder.

Example) Some form.pdf → Some form-1.png Some form-2.png

Image conversion parameters images = pdf2image.convert_from_path(pdf_file, grayscale=True, size=640) You can set it at.

--Grayscale with grayscale = True. Color if you set grayscale = False or omit the specification --Output so that it fits in n pixels square with size = n. Size calculated by DPI value if not specified --Specify a DPI value with dpi = n (default value is 200 DPI). When there is a size specification, that has priority

There are many other settings you can make, but for the time being, this is enough.

The image format is image.save(img_dir/pathlib.Path(base + '-{}.png'.format(index + 1)), 'png') Where image.save(img_dir/pathlib.Path(base + '-{}.jpg'.format(index + 1)), 'jpeg') Then, it will be output in JPEG format.

Convert PDFs to images in bulk with Python