Overview

This tool monitors a specified directory. When you place a PDF file there, it automatically renames the file to the title of the book. GitHub repository: book_maker

Tested OS

macOS Catalina

Requirements

Install Poppler (for the PDF commands)

  $ brew install poppler

Install Tesseract (for OCR)

  $ brew install tesseract
  $ brew install tesseract-lang

Python libraries
- watchdog
- pdf2image
- pyocr
- pyzbar
- pillow
- requests
- python-box

How to use

$ python3 src/watch.py input_path [output_path] [*extensions]
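
The article does not show how watch.py reads these arguments. As a minimal sketch, the command line above could be parsed with argparse as follows. The helper name `parse_args` and both defaults (output_path falling back to input_path, extensions falling back to pdf) are assumptions made for illustration, not the author's code.

```python
import argparse


def parse_args(argv=None):
    """Parse `input_path [output_path] [*extensions]` (hypothetical helper)."""
    parser = argparse.ArgumentParser(description='Watch a directory and rename PDFs.')
    parser.add_argument('input_path', help='directory to monitor')
    parser.add_argument('output_path', nargs='?', default=None,
                        help='destination directory (assumed default: input_path)')
    parser.add_argument('extensions', nargs='*', default=['pdf'],
                        help='file extensions to watch (assumed default: pdf)')
    args = parser.parse_args(argv)
    # Assumption: with no output_path, results go next to the input
    if args.output_path is None:
        args.output_path = args.input_path
    return args


args = parse_args(['scans', 'books', 'pdf'])
print(args.input_path, args.output_path, args.extensions)
```

Falling back to input_path fits the Handler shown later, which creates a timestamped subfolder when the two paths are equal.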

Why I made this

I rushed out and bought a paper cutter and a scanner because I wanted to digitize the large number of books at my parents' house. However, I had often heard that scanning your own books is tedious, so I wrote this program to automate it to some degree.

Workflow

I put it together as the following flow.

1. Specify the directory to monitor and start src/watch.py
2. Place a PDF in the monitored directory
3. Detect the event and get the ISBN code from the contents of the PDF file. There are three ways to get the ISBN code:
   - Get it from the barcode using a shell script
   - Get it from the barcode in Python code
   - Get it from the text in Python code
4. Get the book information from each API based on the ISBN. The APIs used are:
   - Google Books APIs
   - openBD
5. Correct the file name and move the PDF file to the output directory

Monitor a specific directory

I used a library called watchdog to continuously monitor the directory. The following documentation and articles were very helpful for the details of using watchdog. Thank you very much.

- watchdog official documentation
  - API Reference
- Qiita
  - I tried using Watchdog
  - About python watchdog operation
  - Command execution triggered by file update (python version)

To use watchdog, you need a `Handler` and an `Observer`. The Handler describes what to do for each event (create / delete / move / change). This time I define only `on_created`, which is called when a file is created. This `on_created` method overrides the method of the `FileSystemEventHandler` class in `watchdog.events`.

`src/handler/handler.py`


from watchdog.events import PatternMatchingEventHandler

class Handler(PatternMatchingEventHandler):
    def __init__(self, input_path, output_path, patterns=None):
        if patterns is None:
            patterns = ['*.pdf']
        super(Handler, self).__init__(patterns=patterns,
                                      ignore_directories=True,
                                      case_sensitive=False)

    def on_created(self, event):
        # Do something
        pass

This defines a `Handler` class that inherits from `PatternMatchingEventHandler`, which supports pattern matching. With it you can limit which file types trigger events. There is also `RegexMatchingEventHandler`, which accepts regular expression patterns. This time I wanted to process only PDFs, so I set `patterns=['*.pdf']`. I set `ignore_directories=True` to ignore directories, and `case_sensitive=False` so that both `*.pdf` and `*.PDF` are detected.
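
To make the effect of `patterns` and `case_sensitive` concrete, here is a small standard-library sketch of glob matching on file names. It only illustrates the idea and is not how watchdog implements it internally. The helper name `matches` is made up for this example.

```python
from fnmatch import fnmatchcase


def matches(filename, patterns, case_sensitive=False):
    """Return True if filename matches any glob pattern (illustrative helper)."""
    if case_sensitive:
        return any(fnmatchcase(filename, p) for p in patterns)
    # Case-insensitive: compare lower-cased name against lower-cased patterns
    return any(fnmatchcase(filename.lower(), p.lower()) for p in patterns)


print(matches('book.PDF', ['*.pdf']))                       # True
print(matches('book.PDF', ['*.pdf'], case_sensitive=True))  # False
print(matches('notes.txt', ['*.pdf']))                      # False
```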

Next, prepare the `Observer`, which watches the directory and passes events to the Handler.

`src/watch.py`


import time
from watchdog.observers import Observer
from src.handler.handler import Handler


def watch(input_path, output_path, extensions):
    print([f'*.{extension}' for extension in extensions], flush=True)
    event_handler = Handler(input_path=input_path,
                            output_path=output_path,
                            patterns=[f'*.{extension}' for extension in extensions])
    observer = Observer()
    observer.schedule(event_handler, input_path, recursive=False)
    observer.start()
    print('--Start Observer--', flush=True)
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.unschedule_all()
        observer.stop()
        print('--End Observer--', flush=True)
    observer.join()

The Observer object is given the Handler object, the directory to monitor, and whether to monitor subdirectories recursively, all through `observer.schedule()`. `observer.start()` starts monitoring, and the `while` loop with `time.sleep(1)` keeps the process running. When Ctrl + C is pressed, `observer.unschedule_all()` ends all monitoring and detaches the event handlers, and `observer.stop()` tells the thread to stop. Finally, `observer.join()` waits for the thread to finish.

Get the ISBN code from the barcode using the shell

I referred to this blog. Thank you very much.

- I want to read the barcode image from the PDF file of a self-scanned book, get the ISBN, and attach the title obtained from Amazon's API

To get the ISBN code, I first try the barcode. I used pdfinfo, pdfimages, and zbarimg to get the information out of the PDF. pdfinfo gets the total number of pages in the PDF. pdfimages extracts JPEGs from only the first and last pages, based on the page count from pdfinfo. zbarimg reads the ISBN code from the JPEGs that pdfimages generated.

`getISBN.sh`


#!/bin/bash

# Number of pages to check in PDF
PAGE_COUNT=1
# File path
FILE_PATH="$1"

# If the file extension is not pdf
shopt -s nocasematch
if [[ ! $1 =~ .+(\.pdf)$ ]]; then
  exit 1
fi
shopt -u nocasematch

# Delete all .image* generated by pdfimages
rm -f .image*

# Get total count of PDF pages
pages=$(pdfinfo "$FILE_PATH" | grep -E "^Pages" | sed -E "s/^Pages: +//") &&
# Generate JPEG from PDF
pdfimages -j -l "$PAGE_COUNT" "$FILE_PATH" .image_h &&
pdfimages -j -f $((pages - PAGE_COUNT)) "$FILE_PATH" .image_t &&
# Grep ISBN
isbnTitle="$(zbarimg -q .image* | sort | uniq | grep -E '^EAN-13:978' | sed -E 's/^EAN-13://' | sed 's/-//')" &&
# If the ISBN was found, echo the ISBN
[ "$isbnTitle" != "" ] &&
echo "$isbnTitle" && rm -f .image* && exit 0 ||
# Else, exit with error code
rm -f .image* && exit 1

Finally, when the ISBN code is found, `echo "$isbnTitle"` writes it to standard output, where the Python side receives it.

I also did not really understand what `&&` and `||` mean, but the following article was helpful. Thank you very much.

- Convenient but comprehensible control operators `&&` and `||`

Use Python to get the ISBN code

Get from barcode

To read the barcode, I used pdf2image to convert the PDF pages to images and pyzbar to decode the barcodes.

With pdf2image, I generate JPEG images of the last two pages, call `decode()` from pyzbar on each image, and return the first decoded string that matches the ISBN pattern (`^978`).

I used `TemporaryDirectory()` because I wanted the directory that holds the generated images to be temporary.

`src/isbn_from_pdf.py`


import re
import subprocess
import tempfile
from pyzbar.pyzbar import decode
from pdf2image import convert_from_path

# Number of pages before the last one to scan (assumed to match getISBN.sh)
PAGE_COUNT = 1


# Wrapped in a function so that `return` is valid; the function name is illustrative
def isbn_from_barcode(input_path):
    # Ask pdfinfo for the total page count
    cmd = f'echo $(pdfinfo "{input_path}" | grep -E "^Pages" | sed -E "s/^Pages: +//")'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    total_page_count = int(result.stdout.strip())

    with tempfile.TemporaryDirectory() as temp_path:
        # Render the last pages as JPEG images (never start before page 1)
        last_pages = convert_from_path(input_path,
                                       first_page=max(1, total_page_count - PAGE_COUNT),
                                       output_folder=temp_path,
                                       fmt='jpeg')
        # Extract the ISBN from any barcode whose value starts with 978
        for page in last_pages:
            for data in decode(page):
                text = data[0].decode('utf-8', 'ignore')
                if re.match('978', text):
                    return text.replace('-', '')
    return None
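
The page count lookup in the code above pipes pdfinfo through grep and sed in a shell. As an alternative sketch, the same value can be pulled out of pdfinfo's text output in Python with a regular expression. The helper name `parse_page_count` and the sample output string are made up for illustration.

```python
import re


def parse_page_count(pdfinfo_output):
    """Extract the page count from pdfinfo's text output (hypothetical helper)."""
    match = re.search(r'^Pages:\s+(\d+)', pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError('no "Pages:" line in pdfinfo output')
    return int(match.group(1))


# Made-up sample in the shape pdfinfo prints ("Key:   value" per line)
sample = 'Title:          sample\nPages:          212\nEncrypted:      no\n'
print(parse_page_count(sample))  # 212
```

In real use the input would come from something like `subprocess.run(['pdfinfo', input_path], capture_output=True, text=True).stdout`, which also avoids quoting the path for the shell.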

Get from text

Another option is to extract the ISBN code from the last page of the book, which contains information such as the publisher and edition of the book.

I used pyocr to extract the strings from the image. To use pyocr, you need the OCR tool, so you need to install Google's tesseract.

`src/isbn_from_pdf.py`


import re
import sys
import pyocr
import tempfile
import subprocess
import pyocr.builders
from pdf2image import convert_from_path

# Number of pages before the last one to scan (assumed to match getISBN.sh)
PAGE_COUNT = 1


# Wrapped in a function so that `return` is valid; the function name is illustrative
def isbn_from_text(input_path):
    texts = []
    # Ask pdfinfo for the total page count
    cmd = f'echo $(pdfinfo "{input_path}" | grep -E "^Pages" | sed -E "s/^Pages: +//")'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    total_page_count = int(result.stdout.strip())

    with tempfile.TemporaryDirectory() as temp_path:
        # Render the last pages as JPEG images (never start before page 1)
        last_pages = convert_from_path(input_path,
                                       first_page=max(1, total_page_count - PAGE_COUNT),
                                       output_folder=temp_path,
                                       fmt='jpeg')
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print('[ERROR] No OCR tool found.', flush=True)
            sys.exit()

        # Convert each image to a string with the first available OCR tool
        tool = tools[0]
        lang = 'jpn'
        for page in last_pages:
            text = tool.image_to_string(
                page,
                lang=lang,
                builder=pyocr.builders.TextBuilder(tesseract_layout=3)
            )
            texts.append(text)
    # Return the first ISBN that appears right after the "ISBN" label
    for text in texts:
        match = re.search(r'ISBN(978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9])', text)
        if match:
            return match.group(1).replace('-', '')
    return None
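
The pattern above assumes the hyphens always fall in a 4-4 digit grouping, while the group lengths in a printed ISBN vary from book to book. As a hedged sketch, a looser pattern combined with the ISBN-13 check digit test can filter out OCR misreads. The helper names and the sample ISBN are made up for illustration: the number only satisfies the checksum and is not a real book.

```python
import re

# "978" followed by ten more digits, each optionally preceded by a hyphen
ISBN13_PATTERN = re.compile(r'978(?:-?[0-9]){10}')


def is_valid_isbn13(isbn):
    """Return True if the 13-digit string passes the ISBN-13 check digit test."""
    if len(isbn) != 13 or not (isbn.isascii() and isbn.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... and the weighted sum must be a multiple of 10
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
    return total % 10 == 0


def find_isbn13(text):
    """Return the first checksum-valid ISBN-13 in text without hyphens, or None."""
    for match in ISBN13_PATTERN.finditer(text):
        candidate = match.group().replace('-', '')
        if is_valid_isbn13(candidate):
            return candidate
    return None


print(find_isbn13('ISBN978-4-1234-5678-4 C0055'))  # 9784123456784
print(find_isbn13('ISBN978-4-1234-5678-5 C0055'))  # None (check digit does not match)
```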

Get book information from each API

To get the book information, I used Google Books APIs and openBD.

Both return JSON, but the shapes differ. I wanted the code to be as common as possible, so I used a library called Box.

Box lets you write `dict.key.another_key` for what you would normally get with `dict.get('key')` or `dict['key']`. You can still use `dict['key']` as well.

Other features include converting camel-case keys (camelCase) to Python's snake-case naming convention (snake_case), and, when a key contains a space such as `personal thoughts`, letting you access it as `dict.personal_thoughts`.

Below is the code that gets the information from openBD.

`src/bookinfo_from_isbn.py`


import re
import json
import requests
from box import Box

OPENBD_API_URL = 'https://api.openbd.jp/v1/get?isbn={}'

HEADERS = {"content-type": "application/json"}

class BookInfo:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return f'<{self.__class__.__name__}>{json.dumps(self.__dict__, indent=4, ensure_ascii=False)}'


def _format_title(title):
    # Replace full-width brackets and full-width spaces with half-width spaces
    title = re.sub('[（）　]', ' ', title).rstrip()
    # Collapse runs of half-width spaces into one
    return re.sub(' +', ' ', title)


def _format_author(author):
    # Drop everything from the full-width slash onward (for example '／著')
    return re.sub('／.+', '', author)


def book_info_from_openbd(isbn):
    res = requests.get(OPENBD_API_URL.format(isbn), headers=HEADERS)
    if res.status_code == 200:
        # Check the raw entry before wrapping it: Box(...) itself is never None
        data = res.json()[0]
        if data is not None:
            openbd_res = Box(data, camel_killer_box=True, default_box=True, default_box_attr='')
            open_bd_summary = openbd_res.summary
            title = _format_title(open_bd_summary.title)
            author = _format_author(open_bd_summary.author)
            return BookInfo(title=title, author=author)
    else:
        print(f'[WARNING] openBD status code was {res.status_code}', flush=True)

The titles and author names that come back mix full-width and half-width characters, so I prepared a function to normalize each (`_format_title` and `_format_author`). I have not actually cut and scanned any books yet, so these functions will probably need adjusting.

For Box, I set `camel_killer_box=True`, which converts camel-case keys to snake case, and `default_box=True` with `default_box_attr=''`, so that a missing key yields an empty string instead of raising an error.

Correct the file name and move to the appropriate directory

First, at startup, the handler creates the folder that PDFs are moved to after being renamed.

`src/handler/handler.py`


import os
import datetime
from watchdog.events import PatternMatchingEventHandler

class Handler(PatternMatchingEventHandler):
    def __init__(self, input_path, output_path, patterns=None):
        if patterns is None:
            patterns = ['*.pdf']
        super(Handler, self).__init__(patterns=patterns,
                                      ignore_directories=True,
                                      case_sensitive=False)
        self.input_path = input_path
        # If the output_path is equal to input_path, then make a directory named with current time
        if input_path == output_path:
            self.output_path = os.path.join(self.input_path, datetime.datetime.now().strftime('%Y%m%d_%H%M%S'))
        else:
            self.output_path = output_path
        os.makedirs(self.output_path, exist_ok=True)

        # Create tmp directory inside of output directory
        self.tmp_path = os.path.join(self.output_path, 'tmp')
        os.makedirs(self.tmp_path, exist_ok=True)

When the process starts, it creates either a destination folder named with the current date and time or the specified destination folder. It then creates a tmp folder inside the output folder, where PDFs are placed when something goes wrong (the same PDF already exists, the ISBN is not found, or the book information is missing).
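
The choice of destination folder can be isolated as a small pure function, which makes the branch easy to check without touching the file system. This is a sketch that mirrors the condition in `Handler.__init__`. The helper name `resolve_output_path` and the `now` parameter are assumptions added for illustration.

```python
import os
import datetime


def resolve_output_path(input_path, output_path, now=None):
    """Pick the destination folder the same way the Handler does (hypothetical helper)."""
    if now is None:
        now = datetime.datetime.now()
    # Same folder for input and output: use a timestamped subfolder instead
    if input_path == output_path:
        return os.path.join(input_path, now.strftime('%Y%m%d_%H%M%S'))
    return output_path


fixed = datetime.datetime(2020, 1, 2, 3, 4, 5)
print(resolve_output_path('scans', 'scans', fixed))  # scans/20200102_030405 on macOS
print(resolve_output_path('scans', 'books', fixed))  # books
```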

`src/handler/handler.py`


    def __del__(self):
        # Delete the tmp directory, when the directory is empty
        tmp_files_len = len(os.listdir(self.tmp_path))
        if tmp_files_len == 0:
            os.rmdir(self.tmp_path)

        # Delete the output directory, when the directory is empty
        output_files_len = len(os.listdir(self.output_path))
        if output_files_len == 0:
            os.rmdir(self.output_path)

The `__del__` method is written so that, when processing ends, the output folder and the tmp folder are kept if they contain files and deleted if they are empty.
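
Python does not guarantee that `__del__` runs for objects still alive when the interpreter exits, so the same cleanup can also be written as a plain function and called explicitly, for example after `observer.join()`. The sketch below assumes that approach. The helper names `remove_if_empty` and `cleanup` are made up for illustration.

```python
import os
import tempfile


def remove_if_empty(path):
    """Remove path if it is an existing empty directory; return True if removed."""
    if os.path.isdir(path) and not os.listdir(path):
        os.rmdir(path)
        return True
    return False


def cleanup(output_path, tmp_path):
    # tmp lives inside the output folder, so it has to be handled first
    remove_if_empty(tmp_path)
    remove_if_empty(output_path)


with tempfile.TemporaryDirectory() as root:
    output_path = os.path.join(root, 'out')
    tmp_path = os.path.join(output_path, 'tmp')
    os.makedirs(tmp_path)
    cleanup(output_path, tmp_path)
    print(os.path.exists(output_path))  # False: both folders were empty
```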

`src/handler/handler.py`


import shutil
import subprocess
from src.isbn_from_pdf import get_isbn_from_pdf, NoSuchISBNException
from src.bookinfo_from_isbn import book_info_from_google, book_info_from_openbd, NoSuchBookInfoException

    def on_created(self, event):
        print('!Create Event!', flush=True)
        shell_path = os.path.join(os.path.dirname(__file__), '../../getISBN.sh')
        event_src_path = event.src_path
        # Pass the arguments as a list so that paths containing spaces are not split by the shell
        result = subprocess.run([shell_path, event_src_path], capture_output=True, text=True)
        try:
            if result.returncode == 0:
                # Retrieve ISBN from shell
                isbn = result.stdout.strip()
                print(f'ISBN from Shell -> {isbn}', flush=True)
                self._book_info_from_each_api(isbn, event_src_path)

            else:
                # Get ISBN from pdf barcode or text
                isbn = get_isbn_from_pdf(event_src_path)
                print(f'ISBN from Python -> {isbn}', flush=True)
                self._book_info_from_each_api(isbn, event_src_path)

        except (NoSuchISBNException, NoSuchBookInfoException) as e:
            print(e.args[0], flush=True)
            shutil.move(event_src_path, self.tmp_path)
            print(f'Move {os.path.basename(event_src_path)} to {self.tmp_path}', flush=True)

The `on_created` method implements the overall flow described in the workflow.

The shell script is run with `subprocess.run()`, which captures standard output: the script's exit status is available as `result.returncode` and its standard output as `result.stdout`.
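
As a self-contained sketch of this pattern, the snippet below runs a child process, checks its exit status, and returns its standard output. A child Python process stands in for getISBN.sh so that the example runs anywhere. The helper name `run_and_capture` is an assumption. Passing the command as a list is a choice made here so that a file name with spaces reaches the child as a single argument.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command; return its stripped stdout, or None on a non-zero exit status."""
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout.strip()


# Stand-in for getISBN.sh: a child Python process that echoes its first argument
fake_script = [sys.executable, '-c', 'import sys; print(sys.argv[1])']
print(run_and_capture(fake_script + ['my book.pdf']))  # my book.pdf

# A failing command yields None, which is where the Python fallback would take over
print(run_and_capture([sys.executable, '-c', 'import sys; sys.exit(1)']))  # None
```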

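Also, when the ISBN or the book information cannot be retrieved, a dedicated exception (`NoSuchISBNException` or `NoSuchBookInfoException`) is raised, and the PDF is moved to the tmp folder.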

Summary

Thank you for reading this far. I struggled with where the command should be started from and with naming variables and functions, but I managed to get it into a minimal working form. At this stage only PDF is supported, but I would like to support epub as well. I also want it to run on Windows.

If you spot any typos or mistakes, or know a better way to do something, please let me know. Thank you very much.
