This program monitors a specified directory and, when a PDF file is placed there, automatically renames the file to the title of the book. GitHub: book_maker
macOS Catalina
$ brew install poppler
$ brew install tesseract
$ brew install tesseract-lang
$ python3 src/watch.py input_path [output_path] [*extensions]
I rushed to buy a paper cutter and a scanner because I wanted to digitize the large number of books in my parents' house. However, I had often heard that scanning your own books is tedious, so I wrote this program to automate part of the work.
I built it around the following workflow.
I used a library called watchdog to monitor the directory continuously.
The following documentation and articles were very helpful for the details of using watchdog. Thank you very much.
- watchdog official documentation
Qiita
- I tried using Watchdog
- About python watchdog operation
- Command execution triggered by file update (python version)
To use watchdog, you need a `Handler` and an `Observer`.
The Handler describes what to do for each event (create / delete / move / modify). This time, only the `on_created` method, which handles the event raised when a file is created, is defined.
This `on_created` method overrides the method of the `FileSystemEventHandler` class in `watchdog.events`.
src/handler/handler.py
from watchdog.events import PatternMatchingEventHandler


class Handler(PatternMatchingEventHandler):
    def __init__(self, input_path, output_path, patterns=None):
        # Watch only PDF files unless other patterns are given
        if patterns is None:
            patterns = ['*.pdf']
        super(Handler, self).__init__(patterns=patterns,
                                      ignore_directories=True,
                                      case_sensitive=False)

    def on_created(self, event):
        # Do something
        pass
The `Handler` class inherits from `PatternMatchingEventHandler`, which supports pattern matching.
This lets you restrict the types of files that trigger events.
There is also a `RegexMatchingEventHandler` that accepts regular expression patterns.
This time I only wanted to process PDFs, so I set `patterns=['*.pdf']`.
I set `ignore_directories=True` to ignore directories, and `case_sensitive=False` so that both `*.pdf` and `*.PDF` are detected.
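To make the effect of these options concrete, here is a minimal sketch that uses only the standard library's `fnmatch`. It is not watchdog's actual matching code; the helper name `matches_any` is hypothetical and only illustrates how a case-insensitive glob such as `*.pdf` treats different file names.

```python
from fnmatch import fnmatchcase


def matches_any(filename, patterns, case_sensitive=False):
    """Return True if filename matches at least one glob pattern.

    Hypothetical helper for illustration; watchdog's own matcher may differ.
    """
    if not case_sensitive:
        # Compare in lower case so that *.pdf also matches BOOK.PDF
        filename = filename.lower()
        patterns = [p.lower() for p in patterns]
    return any(fnmatchcase(filename, p) for p in patterns)


print(matches_any('book.pdf', ['*.pdf']))        # True
print(matches_any('BOOK.PDF', ['*.pdf']))        # True
print(matches_any('BOOK.PDF', ['*.pdf'], True))  # False
print(matches_any('notes.txt', ['*.pdf']))       # False
```

With case sensitivity turned on, a scanner that writes `BOOK.PDF` would be silently ignored, which is why the case-insensitive setting is the safer choice here.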
Next, prepare the `Observer`, which monitors the directory and passes events to the Handler.
src/watch.py
import time

from watchdog.observers import Observer

from src.handler.handler import Handler


def watch(input_path, output_path, extensions):
    print([f'*.{extension}' for extension in extensions], flush=True)
    event_handler = Handler(input_path=input_path,
                            output_path=output_path,
                            patterns=[f'*.{extension}' for extension in extensions])
    observer = Observer()
    # Watch only input_path itself, not its subdirectories
    observer.schedule(event_handler, input_path, recursive=False)
    observer.start()
    print('--Start Observer--', flush=True)
    try:
        # Keep the main thread alive while the observer thread runs
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.unschedule_all()
        observer.stop()
        print('--End Observer--', flush=True)
    observer.join()
Pass the Handler object, the directory to monitor, and whether to monitor subdirectories recursively to `observer.schedule()` on the created Observer object.
Start monitoring with `observer.start()`, and keep the process running with the `while` loop and `time.sleep(1)`. When Ctrl + C is pressed, `observer.unschedule_all()` ends all monitoring and detaches the event handlers, and `observer.stop()` notifies the thread to stop. Finally, `observer.join()` waits for the thread to finish.
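The start / stop / join sequence is the usual life cycle of a background thread. The following sketch reproduces it with the standard library only, so it runs without watchdog; the `Worker` class is a hypothetical stand-in for the Observer, not watchdog code.

```python
import threading
import time


class Worker(threading.Thread):
    """Hypothetical stand-in for an observer thread (not watchdog code)."""

    def __init__(self):
        super().__init__()
        self._stop_event = threading.Event()
        self.ticks = 0

    def run(self):
        # Keep working until stop() is requested
        while not self._stop_event.is_set():
            self.ticks += 1
            time.sleep(0.01)

    def stop(self):
        # Only signals the thread; it does not wait for it to finish
        self._stop_event.set()


worker = Worker()
worker.start()    # like observer.start()
time.sleep(0.05)  # the main thread stays alive while the worker runs
worker.stop()     # like observer.stop(): request termination
worker.join()     # like observer.join(): wait until the thread has ended
print(worker.is_alive())  # False
```

Stopping only sends the request, so the final `join()` is what guarantees that the thread has really finished before the program exits.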
I referred to this blog. Thank you very much.
To get the ISBN code, first try to read it from the barcode.
I used pdfinfo, pdfimages, and zbarimg to get the information from the PDF.
pdfinfo gets the total number of pages in the PDF.
pdfimages extracts only the pages at the beginning and end of the PDF as JPEG files, based on the page count obtained from pdfinfo.
zbarimg reads the ISBN code from the JPEG files generated by pdfimages.
getISBN.sh
#!/bin/bash
# Number of pages to check in PDF
PAGE_COUNT=1
# File path
FILE_PATH="$1"
# If the file extension is not pdf
shopt -s nocasematch
if [[ ! $1 =~ .+(\.pdf)$ ]]; then
    exit 1
fi
shopt -u nocasematch
# Delete all .image* generated by pdfimages
rm -f .image*
# Get total count of PDF pages
pages=$(pdfinfo "$FILE_PATH" | grep -E "^Pages" | sed -E "s/^Pages: +//") &&
# Generate JPEG from PDF
pdfimages -j -l "$PAGE_COUNT" "$FILE_PATH" .image_h &&
pdfimages -j -f $((pages - PAGE_COUNT)) "$FILE_PATH" .image_t &&
# Grep ISBN
isbnTitle="$(zbarimg -q .image* | sort | uniq | grep -E '^EAN-13:978' | sed -E 's/^EAN-13://' | sed 's/-//')" &&
# If the ISBN was found, echo the ISBN
[ "$isbnTitle" != "" ] &&
echo "$isbnTitle" && rm -f .image* && exit 0 ||
# Else, exit with error code
rm -f .image* && exit 1
Finally, when the ISBN code is obtained, the output of `echo "$isbnTitle"` is received as standard output on the Python side.
I also did not understand the meaning of `&&` and `||` well at first, but the following article was helpful.
Thank you very much.
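The filtering part of the pipeline in getISBN.sh (`sort | uniq | grep | sed`) can also be written in Python. This is a sketch under the assumption that zbarimg prints one `TYPE:DATA` line per detected code, as the script's `grep -E '^EAN-13:978'` implies; the function name `extract_isbns` and the sample output are made up for illustration.

```python
def extract_isbns(zbarimg_output):
    """Return the unique ISBN-13 candidates found in zbarimg-style output."""
    prefix = 'EAN-13:'
    isbns = set()
    for line in zbarimg_output.splitlines():
        line = line.strip()
        # Same filter as grep -E '^EAN-13:978'
        if line.startswith(prefix + '978'):
            # Same as sed -E 's/^EAN-13://', plus hyphen removal
            isbns.add(line[len(prefix):].replace('-', ''))
    # sorted() on a set plays the role of sort | uniq
    return sorted(isbns)


# Made-up sample: one barcode read twice and a second, unrelated barcode
sample = 'EAN-13:9784000000000\nEAN-13:1920000000000\nEAN-13:9784000000000\n'
print(extract_isbns(sample))  # ['9784000000000']
```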
To read the barcode in Python, I used pdf2image to convert the PDF pages to images and pyzbar to decode the barcode.
With pdf2image, JPEG images are generated for the last two pages of the PDF. `decode()` from pyzbar is called on those images, and if a decoded string matches the ISBN pattern (`^978`), it is returned.
I used `TemporaryDirectory()` because I wanted the directory that holds the generated images to be temporary.
src/isbn_from_pdf.py
import re
import tempfile
import subprocess

from pyzbar.pyzbar import decode
from pdf2image import convert_from_path

# Offset from the last page (assumed to match PAGE_COUNT in getISBN.sh)
PAGE_COUNT = 1


# Function name taken from the import in handler.py; the original def line is not shown
def get_isbn_from_pdf(input_path):
    # Barcode part of the function (excerpt)
    # Get the total page count with pdfinfo
    cmd = f'echo $(pdfinfo "{input_path}" | grep -E "^Pages" | sed -E "s/^Pages: +//")'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    total_page_count = int(result.stdout.strip())

    with tempfile.TemporaryDirectory() as temp_path:
        # Convert the last pages of the PDF to JPEG images
        last_pages = convert_from_path(input_path,
                                       first_page=total_page_count - PAGE_COUNT,
                                       output_folder=temp_path,
                                       fmt='jpeg')
        # Extract the ISBN from the barcode
        for page in last_pages:
            decoded_data = decode(page)
            for data in decoded_data:
                if re.match('978', data[0].decode('utf-8', 'ignore')):
                    return data[0].decode('utf-8', 'ignore').replace('-', '')
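The page count is obtained above through a shell pipeline (`grep` and `sed`). As an alternative sketch, the same value can be parsed in Python from the text that pdfinfo prints. The helper `parse_page_count` and the sample text are hypothetical; the only assumption about pdfinfo is that its output contains a line beginning with `Pages:`, which the original pipeline also relies on.

```python
import re


def parse_page_count(pdfinfo_output):
    """Extract the integer after 'Pages:' from pdfinfo-style output."""
    match = re.search(r'^Pages:\s+(\d+)', pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError('no "Pages:" line found')
    return int(match.group(1))


# Made-up sample in the shape of pdfinfo output
sample = 'Title:          scan\nPages:          212\nEncrypted:      no\n'
print(parse_page_count(sample))  # 212
```

In real use the text could come from `subprocess.run(['pdfinfo', path], capture_output=True, text=True).stdout`, which avoids `shell=True` and the quoting problems that file names with spaces can cause.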
Another option is to extract the ISBN code from the page at the end of the book that lists information such as the publisher and the edition.
I used pyocr to extract the strings from the image.
pyocr needs an OCR tool to work, so you have to install Google's tesseract.
src/isbn_from_pdf.py
import re
import sys
import tempfile
import subprocess

import pyocr
import pyocr.builders
from pdf2image import convert_from_path

# Offset from the last page (assumed to match PAGE_COUNT in getISBN.sh)
PAGE_COUNT = 1


# Function name taken from the import in handler.py; the original def line is not shown
def get_isbn_from_pdf(input_path):
    # OCR part of the function (excerpt)
    texts = []
    cmd = f'echo $(pdfinfo "{input_path}" | grep -E "^Pages" | sed -E "s/^Pages: +//")'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    total_page_count = int(result.stdout.strip())

    with tempfile.TemporaryDirectory() as temp_path:
        last_pages = convert_from_path(input_path,
                                       first_page=total_page_count - PAGE_COUNT,
                                       output_folder=temp_path,
                                       fmt='jpeg')
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print('[ERROR] No OCR tool found.', flush=True)
            sys.exit()
        # Convert each image to a string
        tool = tools[0]
        lang = 'jpn'
        for page in last_pages:
            text = tool.image_to_string(
                page,
                lang=lang,
                builder=pyocr.builders.TextBuilder(tesseract_layout=3)
            )
            texts.append(text)
    # Extract the ISBN from the recognized text
    for text in texts:
        if re.search(r'ISBN978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9]', text):
            return re.findall(r'978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9]', text).pop().replace('-', '')
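To see what the two regular expressions do, here is a small self-contained sketch applied to a made-up OCR result. `find_isbn` is a hypothetical name and the sample text is invented; the patterns are the ones used above.

```python
import re

ISBN_LABEL_PATTERN = r'ISBN978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9]'
ISBN_PATTERN = r'978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9]'


def find_isbn(text):
    """Return the last ISBN in text without hyphens, or None if there is none."""
    if re.search(ISBN_LABEL_PATTERN, text):
        return re.findall(ISBN_PATTERN, text).pop().replace('-', '')
    return None


# Invented OCR output for illustration
sample_text = 'Printed in Japan\nISBN978-4-0000-0000-0 C0000'
print(find_isbn(sample_text))     # 9784000000000
print(find_isbn('no isbn here'))  # None
```

Note that the pattern assumes the hyphens split the number into groups of 4 and 4 digits, so an ISBN printed with a different grouping would not match and the pattern may need to be loosened.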
To get the book information, I used the Google Books APIs and openBD.
Both return JSON, but the shapes differ, and I wanted to write code that is as common as possible, so I used a library called Box.
Box lets you access values that you would normally get with `dict.get('key')` or `dict['key']` as `dict.key.another_key`.
You can still use `dict['key']` as well.
Other handy features include converting camelCase keys to snake_case, which is Python's naming convention, and allowing a key that contains a space, such as `personal thoughts`, to be accessed as `dict.personal_thoughts`.
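The idea can be illustrated without installing anything. The following toy class is not Box and does not reproduce its API; `AttrDict` and `camel_to_snake` are hypothetical names, the sample data is made up, and the sketch only mimics dotted access, camelCase-to-snake_case conversion, and a default value for missing keys.

```python
import re


def camel_to_snake(name):
    """Convert camelCase to snake_case, e.g. 'volumeInfo' -> 'volume_info'."""
    return re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name).lower()


class AttrDict(dict):
    """Toy illustration of attribute-style access (not the Box library)."""

    def __init__(self, data, default=''):
        super().__init__()
        self._default = default
        for key, value in data.items():
            # Nested dicts are wrapped so that a.b.c keeps working
            if isinstance(value, dict):
                value = AttrDict(value, default)
            self[camel_to_snake(key).replace(' ', '_')] = value

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails
        if name.startswith('_'):
            raise AttributeError(name)
        return self.get(name, self._default)


response = AttrDict({'volumeInfo': {'title': 'Sample', 'pageCount': 100},
                     'personal thoughts': 'good'})
print(response.volume_info.title)       # Sample
print(response.volume_info.page_count)  # 100
print(response.personal_thoughts)       # good
print(repr(response.missing_key))       # ''
```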
Below is the code that gets the information from `openBD`.
src/bookinfo_from_isbn.py
import re
import json

import requests
from box import Box

OPENBD_API_URL = 'https://api.openbd.jp/v1/get?isbn={}'
HEADERS = {"content-type": "application/json"}


class BookInfo:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return f'<{self.__class__.__name__}>{json.dumps(self.__dict__, indent=4, ensure_ascii=False)}'


def _format_title(title):
    # Replace full-width parentheses and full-width spaces with half-width spaces
    title = re.sub('[（）\u3000]', ' ', title).rstrip()
    # Collapse one or more half-width spaces into one
    return re.sub(' +', ' ', title)


def _format_author(author):
    # Delete the slash and everything after it
    return re.sub('/.+', '', author)


def book_info_from_openbd(isbn):
    res = requests.get(OPENBD_API_URL.format(isbn), headers=HEADERS)
    if res.status_code == 200:
        # The first element may be None when no record exists for the ISBN
        data = res.json()[0]
        if data is not None:
            openbd_res = Box(data, camel_killer_box=True, default_box=True, default_box_attr='')
            open_bd_summary = openbd_res.summary
            title = _format_title(open_bd_summary.title)
            author = _format_author(open_bd_summary.author)
            return BookInfo(title=title, author=author)
    else:
        print(f'[WARNING] openBD status code was {res.status_code}', flush=True)
The title and author information of the retrieved book mix full-width and half-width characters, so I prepared a function to clean up each of them (`_format_title` and `_format_author`).
I have not actually scanned any books and tried this yet, so these functions will probably need to be adjusted.
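As a quick check of what these helpers are meant to do, the sketch below applies them to made-up strings. It assumes that the title pattern targets full-width parentheses and the full-width space (U+3000), as the code comment describes, and that the author string uses a half-width slash; real openBD data may need different patterns. The function names are shortened copies for illustration.

```python
import re


def format_title(title):
    # Replace full-width parentheses and full-width spaces with a half-width space
    title = re.sub('[（）\u3000]', ' ', title).rstrip()
    # Collapse runs of half-width spaces into one
    return re.sub(' +', ' ', title)


def format_author(author):
    # Drop the slash and everything after it
    return re.sub('/.+', '', author)


# Invented sample values
print(format_title('Sample Book（Second Edition）'))  # Sample Book Second Edition
print(format_author('Taro Yamada/author'))            # Taro Yamada
```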
In `Box`, `camel_killer_box=True` converts camel case keys to snake case, and `default_box=True` together with `default_box_attr=''` returns an empty string even when a key has no value.
First, at startup, create the folder that the PDFs will be moved to after they are renamed.
src/handler/handler.py
import os
import datetime

from watchdog.events import PatternMatchingEventHandler


class Handler(PatternMatchingEventHandler):
    def __init__(self, input_path, output_path, patterns=None):
        if patterns is None:
            patterns = ['*.pdf']
        super(Handler, self).__init__(patterns=patterns,
                                      ignore_directories=True,
                                      case_sensitive=False)
        self.input_path = input_path
        # If output_path equals input_path, make a directory named after the current time
        if input_path == output_path:
            self.output_path = os.path.join(self.input_path, datetime.datetime.now().strftime('%Y%m%d_%H%M%S'))
        else:
            self.output_path = output_path
        os.makedirs(self.output_path, exist_ok=True)
        # Create the tmp directory inside the output directory
        self.tmp_path = os.path.join(self.output_path, 'tmp')
        os.makedirs(self.tmp_path, exist_ok=True)
When the process starts, it creates the destination folder: either the specified one or, when the output path is the same as the input path, a folder named after the current date and time. It then creates a tmp folder inside the output folder, where PDFs are placed when an error occurs (a PDF of the same book already exists, the ISBN is not found, or the book information is missing).
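The folder rule can be tried in isolation. The sketch below uses a temporary directory so that nothing is left behind; `resolve_output_path` is a hypothetical helper that mirrors the rule described here, and its `now` parameter exists only to make the result reproducible.

```python
import datetime
import os
import tempfile


def resolve_output_path(input_path, output_path, now=None):
    """Return the output directory for the given input and output paths.

    Hypothetical helper; `now` can be injected to get a fixed folder name.
    """
    if input_path == output_path:
        now = now or datetime.datetime.now()
        return os.path.join(input_path, now.strftime('%Y%m%d_%H%M%S'))
    return output_path


with tempfile.TemporaryDirectory() as watch_dir:
    fixed = datetime.datetime(2020, 1, 2, 3, 4, 5)
    output_dir = resolve_output_path(watch_dir, watch_dir, now=fixed)
    tmp_dir = os.path.join(output_dir, 'tmp')
    # exist_ok=True makes a repeated start-up harmless
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(tmp_dir, exist_ok=True)
    created = os.path.isdir(tmp_dir)
    print(os.path.basename(output_dir))  # 20200102_030405
    print(created)                       # True
```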
src/handler/handler.py
    # (method of the Handler class)
    def __del__(self):
        # Delete the tmp directory when it is empty
        tmp_files_len = len(os.listdir(self.tmp_path))
        if tmp_files_len == 0:
            os.rmdir(self.tmp_path)
        # Delete the output directory when it is empty
        output_files_len = len(os.listdir(self.output_path))
        if output_files_len == 0:
            os.rmdir(self.output_path)
When the process ends, the `__del__` method deletes the tmp folder and the output folder if they are empty, and leaves them in place if they contain files.
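The clean-up rule (delete a directory only when it is empty) can also be tried on its own. This is a sketch with a hypothetical helper, `remove_if_empty`; it is called explicitly here, whereas the Handler relies on `__del__`, whose timing is decided by the interpreter.

```python
import os
import tempfile


def remove_if_empty(path):
    """Remove the directory at path if it has no entries; return True if removed."""
    if os.path.isdir(path) and not os.listdir(path):
        os.rmdir(path)
        return True
    return False


with tempfile.TemporaryDirectory() as base:
    output_dir = os.path.join(base, 'output')
    tmp_dir = os.path.join(output_dir, 'tmp')
    os.makedirs(tmp_dir)
    # Put one file in the output directory and leave tmp empty
    with open(os.path.join(output_dir, 'book.pdf'), 'w') as f:
        f.write('dummy')
    # Check tmp first: an empty tmp would otherwise keep output non-empty
    tmp_removed = remove_if_empty(tmp_dir)
    output_removed = remove_if_empty(output_dir)
    print(tmp_removed, output_removed)  # True False
```

The order matters: the tmp folder lives inside the output folder, so it has to be checked first, as the original method does.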
src/handler/handler.py
import shutil
import subprocess

from src.isbn_from_pdf import get_isbn_from_pdf, NoSuchISBNException
from src.bookinfo_from_isbn import book_info_from_google, book_info_from_openbd, NoSuchBookInfoException

    # (method of the Handler class)
    def on_created(self, event):
        print('!Create Event!', flush=True)
        shell_path = os.path.join(os.path.dirname(__file__), '../../getISBN.sh')
        event_src_path = event.src_path
        # Pass the arguments as a list so that paths containing spaces work
        result = subprocess.run([shell_path, event_src_path], capture_output=True, text=True)
        try:
            if result.returncode == 0:
                # Retrieve ISBN from shell
                isbn = result.stdout.strip()
                print(f'ISBN from Shell -> {isbn}', flush=True)
                self._book_info_from_each_api(isbn, event_src_path)
            else:
                # Get ISBN from pdf barcode or text
                isbn = get_isbn_from_pdf(event_src_path)
                print(f'ISBN from Python -> {isbn}', flush=True)
                self._book_info_from_each_api(isbn, event_src_path)
        except (NoSuchISBNException, NoSuchBookInfoException) as e:
            # On failure, move the PDF to the tmp directory
            print(e.args[0], flush=True)
            shutil.move(event_src_path, self.tmp_path)
            print(f'Move {os.path.basename(event_src_path)} to {self.tmp_path}', flush=True)
The `on_created` method describes the overall flow of the workflow.
The shell script is run with `subprocess.run()` so that its standard output can be captured: the exit status of the script is available as `result.returncode`, and its standard output as `result.stdout`.
Also, when the ISBN or the book information cannot be retrieved, a custom exception (`NoSuchISBNException` or `NoSuchBookInfoException`) is raised and the PDF is moved to the tmp folder.
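The relationship between the exit status and the captured output can be checked with a short sketch. To keep it runnable anywhere, the child process here is the Python interpreter itself, standing in for getISBN.sh, and the ISBN is a made-up value.

```python
import subprocess
import sys

# A tiny child process that stands in for getISBN.sh (made-up ISBN)
ok = subprocess.run([sys.executable, '-c', "print('9784000000000')"],
                    capture_output=True, text=True)
print(ok.returncode)      # 0
print(ok.stdout.strip())  # 9784000000000

# A child process that fails, like the script when no ISBN is found
ng = subprocess.run([sys.executable, '-c', 'import sys; sys.exit(1)'],
                    capture_output=True, text=True)
print(ng.returncode)    # 1
print(repr(ng.stdout))  # ''
```

A non-zero `returncode` with empty output is the case in which the handler falls back to reading the ISBN in Python.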
Thank you for reading this far. I struggled with where to launch the command from and with variable and function names, but I managed to get it into a minimal working form. At this stage only PDF is supported, but I would like to support epub as well. I also want to make it work on Windows.
If you find any typos or mistakes, or know a better way to do something, please let me know. Thank you very much.