Convert PDF to text on the command line, no Python knowledge required: pdf2txt.py (bundled with pdfminer) and its tuning parameters

This article is the day 18 entry in the Saison Information Systems Advent Calendar 2020.

Let me start with the command itself.

--For those who arrived here after researching on their own, I will cover the parameters of pdf2txt.py first.
--They are described in detail here: https://www.unixuser.org/~euske/python/pdfminer/

python.exe pdf2txt.py [options] -o [OutFilename] [InFilename]

Example:

c:\python38\python.exe c:\python38\Scripts\pdf2txt.py -M 3.0 -o c:\work\OutFile.txt c:\work\InFile.pdf

Explanation

These are the main parameters you may need to adjust to get a good read.

(I read somewhere that the conversion result is often poor when the defaults are left as they are.)

--Put the adjustments in [options]: character spacing (-M), word spacing (-W), line spacing (-L), vertical reading (-V).
--If a parameter is omitted, its default is used. The linked documentation gives M = 1.0, W = 0.2, L = 0.3, with horizontal reading.
--To read vertical text, specify -V.
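For anyone who does want to drive the command from a script, the option list above can be sketched as a small Python helper. This is only an illustration under assumptions: the function names `build_pdf2txt_command` and `run_pdf2txt` are hypothetical, and the only flags used are the ones described in this article (-M, -W, -L, -V, -o). Options left unset are simply omitted so that pdf2txt.py falls back to its own defaults.

```python
import subprocess
import sys


def build_pdf2txt_command(script, in_pdf, out_txt,
                          char_margin=None, word_margin=None,
                          line_margin=None, vertical=False):
    """Build the argument list for one pdf2txt.py run (hypothetical helper).

    Options left as None are omitted, so pdf2txt.py uses its own defaults.
    """
    # Use the same interpreter that runs this script.
    cmd = [sys.executable, str(script)]
    if char_margin is not None:
        cmd += ["-M", str(char_margin)]   # character spacing
    if word_margin is not None:
        cmd += ["-W", str(word_margin)]   # word spacing
    if line_margin is not None:
        cmd += ["-L", str(line_margin)]   # line spacing
    if vertical:
        cmd.append("-V")                  # vertical reading
    cmd += ["-o", str(out_txt), str(in_pdf)]
    return cmd


def run_pdf2txt(cmd):
    """Run the command and report success as True when the exit code is 0."""
    return subprocess.run(cmd).returncode == 0


if __name__ == "__main__":
    # Paths copied from the example in this article; adjust to your machine.
    cmd = build_pdf2txt_command(
        r"c:\python38\Scripts\pdf2txt.py",
        r"c:\work\InFile.pdf",
        r"c:\work\OutFile.txt",
        char_margin=3.0,
    )
    print(cmd)  # pass cmd to run_pdf2txt(cmd) to actually convert
```

Building the command as a list rather than one string avoids quoting problems when a path contains spaces, which is handy when the call is made from a job controller.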

Below is the relevant passage from the linked page, as translated by Google.

These are the parameters used for layout analysis. In a real PDF file, depending on the authoring software, a run of text may be split into several chunks partway through. Text extraction therefore has to splice the chunks back together. In the figure on the linked page, two text chunks that are closer together than char_margin (shown as M) are considered continuous and are grouped into one. Likewise, two lines that are closer together than line_margin (L) are grouped into a text box, which is a rectangular area containing a "cluster" of text. In addition, a blank between words may not be stored as a space character but only indicated by the position of each word, so if the distance between two words is greater than word_margin (W), a space must be inserted as needed. Each value is specified relative to the size of the characters in question, not as an absolute length. The default values are M = 1.0, L = 0.3, and W = 0.2, respectively.
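To make the role of M and W more concrete, here is a deliberately simplified Python sketch of the idea for a single horizontal line. It is not pdfminer's actual algorithm: the `Chunk` class, the `join_line` function, and the sample coordinates are all hypothetical, and real layout analysis also handles line grouping (L), overlap, and vertical text.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    x0: float    # left edge
    x1: float    # right edge
    size: float  # character size, the unit the margins are measured in


def join_line(chunks, char_margin=1.0, word_margin=0.2):
    """Splice chunks on one horizontal line into runs of text.

    gap < char_margin * size  -> the chunk continues the current run
    gap > word_margin * size  -> a space is inserted before the chunk
    """
    runs = []
    prev = None
    for chunk in sorted(chunks, key=lambda c: c.x0):
        if prev is None:
            runs.append(chunk.text)
        else:
            gap = chunk.x0 - prev.x1
            size = max(prev.size, chunk.size)
            if gap < char_margin * size:
                sep = " " if gap > word_margin * size else ""
                runs[-1] += sep + chunk.text
            else:
                # Too far apart: treated as a separate piece of text.
                runs.append(chunk.text)
        prev = chunk
    return runs


if __name__ == "__main__":
    line = [
        Chunk("Hel", 0, 30, 10),
        Chunk("lo", 31, 50, 10),       # gap 1: same word
        Chunk("world", 55, 105, 10),   # gap 5: next word, space inserted
        Chunk("Page", 130, 170, 10),   # gap 25: split off when M = 1.0
    ]
    print(join_line(line))                   # ['Hello world', 'Page']
    print(join_line(line, char_margin=3.0))  # ['Hello world Page']
```

In this toy model, raising M from 1.0 to 3.0 is what pulls the distant chunk back into the same run, which is the kind of effect the -M 3.0 in the example command is aiming for.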

There are other parameters, for example for exporting images, but they are outside the scope of this article. For details, see: https://www.unixuser.org/~euske/python/pdfminer/

Now, back to the overview.

--pdfminer, a Python library, comes with a script called pdf2txt.py that works simply by calling it.
--pdf2txt.py converts PDF to text, but the result may not come out as expected.
--Even then, pdfminer has tuning parameters, so the file may become readable after adjustment.
--There were other write-ups on pdfminer, but I could not find any Japanese-language material on setting these parameters through the handy pdf2txt.py, so I am writing it down here.
--Because it runs as a plain command, it is easy to call from ETL tools such as DataSpider and from job controllers.

Knowledge of Python is not required, but a Python runtime environment is.
Text cannot be extracted from a PDF that consists only of scanned images. To check whether text can be extracted, try a text search in the PDF file: if the search finds matches, the text can be extracted.
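The same check can be approximated after the fact by looking at the converted file. The sketch below is a heuristic under assumptions: `has_extractable_text` is a hypothetical helper, it assumes the output file is UTF-8, and it treats an output with no visible characters (only whitespace and page-break characters) as a sign that the PDF was image-only.

```python
from pathlib import Path


def has_extractable_text(txt_path, min_chars=1):
    """Return True if the converted file holds at least min_chars visible characters."""
    text = Path(txt_path).read_text(encoding="utf-8", errors="replace")
    # Whitespace, newlines and form feeds (page breaks) do not count as text.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars
```

A job that runs the conversion could call this on the output file and route empty results to a different process, for example OCR, instead of passing an empty file downstream.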

Setup steps when you only want to run it as a Python script

No Python development environment is required, only a runtime environment. (I had hoped to just link to an existing guide, but surprisingly could not find one T_T)

--Install the Python runtime (Windows): https://www.python.org/downloads/ From that page, select Windows, choose a final (stable) release, pick the package that includes the installer, then download and install it. The install location is a matter of taste, but I personally like pythonXXX directly under the drive root (XXX is the version, e.g. 391 for 3.9.1).

--Install pdfminer.six with pip. At the command prompt:

pip install pdfminer.six

If there is a pdf2txt.py file under Scripts in the Python folder, it should work.
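If you prefer to check this from Python rather than by browsing folders, a minimal sketch could look like the following. `find_pdf2txt` is a hypothetical helper; it assumes the script was installed into the scripts directory of the interpreter that runs the check, which may not hold for per-user installs or virtual environments created elsewhere.

```python
import sysconfig
from pathlib import Path


def find_pdf2txt(scripts_dir=None):
    """Return the path to pdf2txt.py in the scripts folder, or None if it is absent."""
    if scripts_dir is None:
        # "Scripts" on Windows, "bin" on most other systems.
        scripts_dir = sysconfig.get_path("scripts")
    candidate = Path(scripts_dir) / "pdf2txt.py"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_pdf2txt()
    print(found if found else "pdf2txt.py not found; check the pip install step")
```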

By the way, while writing this article I noticed that the author of this very handy pdfminer appears to be Japanese: Yusuke Shinyama. Thank you very much.

That's all.

If you find a problem with this article or it causes you any inconvenience, I would appreciate it if you let me know. The content of this article is not an official statement of the company; it is written in a personal capacity.