Hello! I recently had the chance to use GhostScript from Python, so this time I would like to summarize what I learned. Specifically, this article uses GhostScript to convert PDF files to JPEG images (PDF⇒JPEG conversion). The conversion handles not only single-page PDFs but also multi-page PDFs, which are split into individual pages and each saved as a JPEG image.
GhostScript is an interpreter for the PostScript language and PDF, as you can see on the official website.
Quoted from the official website
An interpreter for the PostScript language and for PDF.
If you search Google for how to do what this article does, pdf2image and PyPDF2 come up as the common solutions. My reason for using GhostScript instead is simple:
there was a PDF that could not be converted with PyPDF2 but could be converted with GhostScript lol
For the GhostScript options, I referred to this site. (The site itself does not use Python.) http://michigawi.blogspot.com/2011/12/ghostscriptpdfjpg.html
Windows 10, Python 3.6.9, GhostScript 9.50
Download GhostScript from the following site. https://www.ghostscript.com/
Choose the download that matches your environment. This article uses the AGPL release for Windows (64 bit).
Run the downloaded installer to set it up.
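After installation it can be handy to check from Python whether the GhostScript executable is reachable. The following is a minimal sketch, not part of the original script: `find_ghostscript` is a hypothetical helper, and it only finds GhostScript if its `bin` folder is on `PATH`, which may not be the case after a default Windows install. The code later in this article hard-codes the full path instead.

```python
import shutil

# Candidate executable names: gswin64c / gswin32c are the Windows console
# builds, gs is the usual name on Linux and macOS.
DEFAULT_NAMES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(names=DEFAULT_NAMES):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If `find_ghostscript()` returns `None`, fall back to the full install path, for example the `C:\Program Files\gs\gs9.50\bin\gswin64c.exe` used in this article.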
As explained in the introduction, this step is needed only for multi-page PDFs; a PDF that has a single page to begin with does not need it. Because the full code is long, only the parts related to splitting are shown here. If you are short on time, see the complete source code later in this article.
As a prerequisite, create an "input" folder in the directory you run the script from (the code uses os.getcwd()), and place the input PDF files there. A directory called "output_tmp" is created automatically, and the page-split PDF files are saved in it.
if __name__ == "__main__":
    # Get the current directory
    current_dir = os.getcwd()
    # Input directories
    indir = current_dir + "\\input\\"
    indir_error = current_dir + "\\input_error\\"
    # Output directories
    outdir = current_dir + "\\output\\"
    outdir_tmp = current_dir + "\\output_tmp\\"
    # Create the working folders if they do not exist
    if not os.path.isdir(outdir):
        os.mkdir(outdir)
    if not os.path.isdir(outdir_tmp):
        os.mkdir(outdir_tmp)
    if not os.path.isdir(indir_error):
        os.mkdir(indir_error)
    # Collect every file in the input directory
    all_images = glob.glob(indir + "*")
    # Process each file
    for img in all_images:
        # File name of the input file (with extension)
        basename = os.path.basename(img)
        print(basename)
        # Split the file name into name and extension
        name, ext = os.path.splitext(basename)
        try:
            #images = convert_from_path(img)
            with open(img, "rb") as pdf_file:  # The variable is called img, but the input is a PDF file.
                source = PdfFileReader(pdf_file)
                num_pages = source.getNumPages()
                # A single-page PDF is copied to output_tmp as is; move on to the next file
                if num_pages == 1:
                    copyfile(img, outdir_tmp + basename)
                    continue
                # Write each page to its own PDF: name_1.pdf, name_2.pdf, ...
                for i in range(0, num_pages):
                    file_object = source.getPage(i)
                    pdf_file_name = outdir_tmp + name + "_" + str(i+1) + ".pdf"
                    pdf_file_writer = PdfFileWriter()
                    with open(pdf_file_name, 'wb') as f:
                        pdf_file_writer.addPage(file_object)
                        pdf_file_writer.write(f)
        except:
            # Files that cannot be read or split are copied to input_error
            print(basename + "*** error ***")
            copyfile(img, indir_error + basename)
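The folder setup and file naming in the code above can be written a little more compactly. This is only a sketch under my own assumptions, not the article's code: `prepare_dirs` and `page_pdf_path` are hypothetical helper names, `os.makedirs(..., exist_ok=True)` replaces the `isdir` checks, and `os.path.join` replaces the hand-written `\\` separators.

```python
import os


def prepare_dirs(base_dir):
    """Create the working folders used in this article and return their paths."""
    dirs = {name: os.path.join(base_dir, name)
            for name in ("input", "input_error", "output", "output_tmp")}
    for key, path in dirs.items():
        if key != "input":  # "input" is prepared by hand in the article
            os.makedirs(path, exist_ok=True)  # no error if it already exists
    return dirs


def page_pdf_path(outdir_tmp, name, page_index):
    """Path of the split PDF for a 0-based page index, e.g. name_1.pdf."""
    return os.path.join(outdir_tmp, "{}_{}.pdf".format(name, page_index + 1))
```

For example, `page_pdf_path(dirs["output_tmp"], "sample", 0)` gives the path of `sample_1.pdf`, matching the naming used by the loop above.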
This step runs GhostScript. The goal is to pick up the page-split PDF files one by one and convert each of them to JPEG. The flow is to define the command first and then execute it with the subprocess module.
    print("***** pdf to jpg *****")
    ######################################
    #
    # pdf2jpeg (GhostScript) related settings
    #
    ######################################
    exe_name = r'C:\Program Files\gs\gs9.50\bin\gswin64c.exe'
    exe_option = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=100 -dQFactor=1.0 -dDisplayFormat=16 -r600"
    exe_outfile = "-sOutputFile="
    # Collect the page-split PDF files in output_tmp
    all_pdfs = glob.glob(outdir_tmp + "*")
    for pdf in all_pdfs:
        # File name of the PDF (with extension)
        basename = os.path.basename(pdf)
        print(basename)
        # Split the file name into name and extension
        name, ext = os.path.splitext(basename)
        # No trailing space here, otherwise it becomes part of the file name
        outputfile = outdir + name + ".jpg"
        cmd_list = list()
        cmd_list.append(exe_name)
        cmd_list.extend(exe_option.split())  # extend, not append: each option is its own argument
        cmd_list.append(exe_outfile + outputfile)
        cmd_list.append(pdf)
        subprocess.call(cmd_list)
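As a variation on the command above, here is a sketch that builds the argument list in a small function. It is not the article's code: `build_gs_command` is a hypothetical name, and it uses GhostScript's `%03d` placeholder in `-sOutputFile` so that GhostScript itself writes one JPEG per page. Whether that can replace the PyPDF2 split step for your PDFs is something you would need to try.

```python
import os


def build_gs_command(gs_exe, pdf_path, out_dir, dpi=600, quality=100):
    """Build the argument list for one GhostScript PDF-to-JPEG run."""
    name, _ext = os.path.splitext(os.path.basename(pdf_path))
    # %03d is expanded by GhostScript to the page number (001, 002, ...)
    out_pattern = os.path.join(out_dir, name + "_%03d.jpg")
    return [
        gs_exe,
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=jpeg",
        "-dJPEGQ={}".format(quality),
        "-r{}".format(dpi),
        "-sOutputFile=" + out_pattern,
        pdf_path,
    ]
```

The returned list can be passed to `subprocess.call` as is. Because the arguments are passed as a list, paths that contain spaces (such as `C:\Program Files\...`) do not need extra quoting.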
import os
import glob
from PyPDF2 import PdfFileReader
from PyPDF2 import PdfFileWriter
import subprocess
from shutil import copyfile
"""
Main
"""
if __name__ == "__main__":
    # Get the current directory
    current_dir = os.getcwd()
    # Input directories
    indir = current_dir + "\\input\\"
    indir_error = current_dir + "\\input_error\\"
    # Output directories
    outdir = current_dir + "\\output\\"
    outdir_tmp = current_dir + "\\output_tmp\\"
    # Create the working folders if they do not exist
    if not os.path.isdir(outdir):
        os.mkdir(outdir)
    if not os.path.isdir(outdir_tmp):
        os.mkdir(outdir_tmp)
    if not os.path.isdir(indir_error):
        os.mkdir(indir_error)
    # Collect every file in the input directory
    all_images = glob.glob(indir + "*")
    # Process each file
    for img in all_images:
        # File name of the input file (with extension)
        basename = os.path.basename(img)
        print(basename)
        # Split the file name into name and extension
        name, ext = os.path.splitext(basename)
        try:
            #images = convert_from_path(img)
            with open(img, "rb") as pdf_file:  # The variable is called img, but the input is a PDF file.
                source = PdfFileReader(pdf_file)
                num_pages = source.getNumPages()
                # A single-page PDF is copied to output_tmp as is; move on to the next file
                if num_pages == 1:
                    copyfile(img, outdir_tmp + basename)
                    continue
                # Write each page to its own PDF: name_1.pdf, name_2.pdf, ...
                for i in range(0, num_pages):
                    file_object = source.getPage(i)
                    pdf_file_name = outdir_tmp + name + "_" + str(i+1) + ".pdf"
                    pdf_file_writer = PdfFileWriter()
                    with open(pdf_file_name, 'wb') as f:
                        pdf_file_writer.addPage(file_object)
                        pdf_file_writer.write(f)
        except:
            # Files that cannot be read or split are copied to input_error
            print(basename + "*** error ***")
            copyfile(img, indir_error + basename)
    print("***** pdf to jpg *****")
    ######################################
    #
    # pdf2jpeg (GhostScript) related settings
    #
    ######################################
    exe_name = r'C:\Program Files\gs\gs9.50\bin\gswin64c.exe'
    exe_option = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=100 -dQFactor=1.0 -dDisplayFormat=16 -r600"
    exe_outfile = "-sOutputFile="
    # Collect the page-split PDF files in output_tmp
    all_pdfs = glob.glob(outdir_tmp + "*")
    for pdf in all_pdfs:
        # File name of the PDF (with extension)
        basename = os.path.basename(pdf)
        print(basename)
        # Split the file name into name and extension
        name, ext = os.path.splitext(basename)
        # No trailing space here, otherwise it becomes part of the file name
        outputfile = outdir + name + ".jpg"
        cmd_list = list()
        cmd_list.append(exe_name)
        cmd_list.extend(exe_option.split())
        cmd_list.append(exe_outfile + outputfile)
        cmd_list.append(pdf)
        subprocess.call(cmd_list)
Because I wrote the source code quickly just to get the job done, it turned out fairly messy lol (not really a laughing matter)
In addition, there were some PDFs that could not be converted to JPEG properly even with GhostScript. The cause is not clear, but from a little research it seems that the fonts are not supported. Depending on the response to this article, I will investigate a little more.
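One way to start narrowing down which PDFs fail, and why, is to capture the messages and exit code of each GhostScript run instead of letting them scroll past. The sketch below is an assumption on my part rather than something tested against the problem PDFs: `run_and_report` is a hypothetical helper, and it assumes GhostScript signals a failed conversion with a non-zero exit code and prints font-related messages to its output.

```python
import subprocess


def run_and_report(cmd):
    """Run a command and return (exit_code, combined stdout/stderr text)."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge error messages into the same text
        universal_newlines=True,    # text mode; also works on Python 3.6
    )
    return result.returncode, result.stdout
```

Calling `run_and_report(cmd_list)` in place of `subprocess.call(cmd_list)` would let you log the output for every PDF whose exit code is not 0 and look for font-related lines there.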
If you have any questions, please leave a comment.