This article is the 18th day article of Saison Information Systems Advent Calendar 2020.
--For those who have researched various things and got here, I will describe the parameters of pdf2txt.py first. ――It was detailed here. https://www.unixuser.org/~euske/python/pdfminer/
python.exe pdf2txt.py [options] -o [OutFilename] [InFilename]
Example:
c:\python38\python.exe c:\python38\Scripts\pdf2txt.py -M 3.0 -o c:\work\OutFile.txt c:\work\InFile.pdf
(Somewhere, I saw a statement that the conversion result is often not good if it is left as standard.)
--Set the adjustment to [options]. Character spacing (M) Word spacing (W) Line spacing (L) Vertical reading (V) --If no parameter is specified, the default value will be adopted. M = 1.0, W = 0.2, L = 0.3. It is a horizontal reading. --If you want to read vertically, specify -V.
These are the parameters used for layout analysis. In a real PDF file, depending on the authoring software, the text part may be split into several chunks during execution. Therefore, text extraction requires splicing text chunks. In the following figure, two text chunks that are closer than char_margin (shown as M) are considered consecutive and are grouped together. Also, two lines that are closer than line_margin (L) are grouped as a text box. This is a rectangular area that contains a "cluster" of text parts. In addition, if the distance between two words is greater than word_margin (W), whitespace between words may not be represented as spaces, so whitespace characters (spaces) must be inserted as needed, but each Indicated by the position of. word. Each value is specified as a ratio of length to the size of each character in question, not as the actual length. The default values are M = 1.0, L = 0.3, and W = 0.2, respectively.
There are other parameters such as image export, but they are omitted in this article. For details, please refer to the following. https://www.unixuser.org/~euske/python/pdfminer/
--One of the Python libraries, pdfminer, has a module called pdf2txt that works if you call it. --You can convert from pdf to text using pdf2txt, but it may not work as expected. ――Even in such a case, pdfminer has parameters for adjustment, so it can be read depending on the adjustment. --There were other descriptions in pdfminer, but when I checked the case of implementing it in convenient pdf2txt.py, I did not find any material in Japanese, so I will describe it. --Since commands can be executed, it is convenient to execute from ETL tools such as DataSpider and job controllers.
No Python development environment is required. An execution environment is required. (I thought I'd do it with a link, but it's not surprising T_T)
--Installation of Python runtime environment (windows) https://www.python.org/downloads/ From Select windows. Select final Select the package with the installer, download and install it. The installation destination is different for each person, but personally I like pythonxxx (version. 391 for 3.9.1) directly under the drive.
--Install pdfminner.six with pip At the command prompt
pip install pdfminer.six
If there is a pdf2txt.py file under Scripts in the python folder, it should work. is.
By the way, as I was writing an article, I noticed that it is a very convenient pdfminer, but the author seems to be Japanese. Yusuke Shinyama. Thank you very much.
that's all
There is a problem with the article. We would appreciate it if you could contact us if you have any inconvenience. The content of the article is not official of the company, but is described as an individual.
Recommended Posts