What to use

A library called PDFMiner. It installs with a single pip command.

pip install pdfminer.six

The reference site had a note about Japanese, but even with a plain pip install, Japanese text was detected properly.

CSV to make

-The date and time the CSV was created goes in the "Update date and time" column.
-The text extracted from the PDF goes in the "Sentence" column.
-The PDF page number goes in the "page number" column.
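
As a rough illustration of that layout, the sketch below writes one row per page with the standard csv module (the script further down uses pandas instead). The helper name build_sample_csv and the sample page texts are made up for illustration and are not part of the original script.

```python
import csv
import datetime
import io


def build_sample_csv(pages, now=None):
    """Return CSV text with one row per page (hypothetical helper)."""
    now = now or datetime.datetime.now()
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Header row: the three columns described above
    writer.writerow(["Update date and time", "Sentence", "page number"])
    # Page numbers start at 1, like cnt in the script
    for number, text in enumerate(pages, start=1):
        writer.writerow([now, text, number])
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder text standing in for real PDF pages
    print(build_sample_csv(["text of page 1", "text of page 2"]))
```

Every row carries the same timestamp, because the column records when the CSV was made rather than anything about the individual page.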

What was made

About 90% of this source comes from the reference site.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

import datetime
import re
import pandas as pd

class converter(object):
  def pdf_to_csv(self,p_d_f):
    df = pd.DataFrame(columns=["Update date and time","Sentence","page number"])

    #PDF text extraction from here
    cnt = 1
    space = re.compile("[ 　]+")  #half-width and full-width spaces

    #The with block closes the PDF even if extraction fails
    with open(p_d_f, 'rb') as fp:
      for page in PDFPage.get_pages(fp):
        #Re-create the buffer for every page so outfp holds only this page's text
        rsrcmgr = PDFResourceManager()
        outfp = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        laparams.detect_vertical = True
        device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        interpreter.process_page(page)
        text = re.sub(space, "", outfp.getvalue())

        #One row per page: extracted text and page number
        df.loc[cnt,["Sentence","page number"]] = [text,cnt]
        cnt += 1

        #Close per page; closing only after the loop fails on a PDF with no pages
        device.close()
        outfp.close()

    #The same creation time is written to every row
    now = datetime.datetime.now()
    df["Update date and time"] = now

    csv_path = p_d_f.replace('.pdf', '.csv')
    df.to_csv(csv_path, encoding='CP932', index=False)

if __name__ == "__main__":

  p_d_f = "Somehow.pdf"
  con=converter()
  con.pdf_to_csv(p_d_f)

The difference from the reference site is that the buffer (outfp) that stores the text extracted from the PDF is re-initialized for each page, at the point where the text is put into the data frame. If it is left as on the reference site, the text of every page keeps piling up in the same buffer. Once the data is in a data frame like this, it should also be easy to add a few more columns.
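
The effect of re-creating the buffer can be seen without a PDF at all. The sketch below is a stand-in, not pdfminer itself: the hypothetical function write_page plays the role of interpreter.process_page and simply appends a page's text to a StringIO.

```python
from io import StringIO


def write_page(outfp, page_text):
    """Stand-in for interpreter.process_page: append one page's text."""
    outfp.write(page_text)


def extract_shared_buffer(pages):
    """One buffer for the whole document: each read includes earlier pages."""
    outfp = StringIO()
    results = []
    for page_text in pages:
        write_page(outfp, page_text)
        results.append(outfp.getvalue())
    outfp.close()
    return results


def extract_fresh_buffer(pages):
    """A new buffer per page: each read contains only that page."""
    results = []
    for page_text in pages:
        outfp = StringIO()
        write_page(outfp, page_text)
        results.append(outfp.getvalue())
        outfp.close()
    return results


if __name__ == "__main__":
    pages = ["page one. ", "page two. "]
    print(extract_shared_buffer(pages))  # second entry also contains page one
    print(extract_fresh_buffer(pages))   # one entry per page, no carry-over
```

With a shared buffer, the row for page N would hold the text of pages 1 through N, which is why the script creates outfp inside the loop.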

Maybe because it is so simple, searching did not turn up a ready-made way to do the CSV conversion, so I am writing this down as a note.

2/24 postscript

For some reason, this got a follow-up.

[Python] Convert PDF text to CSV page by page (2/24 postscript)
