Introduction

Hello! I recently had the chance to drive Ghostscript from Python, so in this article I summarize what I learned. Specifically, I use Ghostscript to convert PDFs to JPEG images (PDF⇒JPEG conversion). The conversion covers not only single-page PDFs but also multi-page ones: a multi-page PDF is split into individual pages, and each page is saved as its own JPEG image.

Why Ghostscript?

As the official website puts it, Ghostscript is an interpreter for the PostScript language and for PDF.

Quoted from the official website

An interpreter for the PostScript language and for PDF.

If you search Google for how to do what this article covers, pdf2image and PyPDF2 come up as the common solutions, but my reason for using Ghostscript is simple.

There was a PDF that could not be converted with PyPDF2 but could be converted with Ghostscript lol

For how to operate Ghostscript itself, I referred to this site. (It does not use Python.) http://michigawi.blogspot.com/2011/12/ghostscriptpdfjpg.html

Development environment

Windows 10
Python 3.6.9
Ghostscript 9.50

Preparation (downloading Ghostscript)

Download Ghostscript from the following site. https://www.ghostscript.com/

Choose the download that matches your environment. For this article I selected the AGPL release for Windows (64 bit).

Run the downloaded file to install it.

There also appears to be a ghostscript package on PyPI that can be installed with pip, but I had not tried it at the time of writing. https://pypi.org/project/ghostscript/
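
After setup, one way to confirm that Python can see the installation is to look for the console executable. The following is a minimal sketch, not part of the original script: the helper name find_ghostscript and its extra_dirs argument are hypothetical, and it assumes the usual executable names (gswin64c or gswin32c on Windows, gs elsewhere).

```python
import shutil

# Console executable names used by the usual installers:
# gswin64c / gswin32c on Windows, gs on other platforms.
CANDIDATE_NAMES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(extra_dirs=()):
    """Return the path of a Ghostscript executable, or None if none is found."""
    # First, look on PATH.
    for name in CANDIDATE_NAMES:
        found = shutil.which(name)
        if found:
            return found
    # Then look in directories supplied by the caller (hypothetical argument),
    # for example the bin folder chosen during setup.
    for directory in extra_dirs:
        for name in CANDIDATE_NAMES:
            found = shutil.which(name, path=str(directory))
            if found:
                return found
    return None


gs_path = find_ghostscript()
print(gs_path if gs_path else "Ghostscript executable not found")
```

If the installer did not add the bin folder to PATH, the folder can be passed in explicitly, for example find_ghostscript([r"C:\Program Files\gs\gs9.50\bin"]).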

What I want to do ①: Split the PDF into pages

As explained in the Introduction, this step is needed only for PDFs with multiple pages; a PDF that already has a single page does not need it. Because the full code is long, only the part related to splitting is shown here. If you are short on time, see the entire source code in the later section.

As a prerequisite, create an "input" folder in the same directory as the source code and put the input PDF files there. The script resolves its folders relative to the current working directory, so run it from that directory. A directory called "output_tmp" is created automatically, and the one-page PDF files are saved in it.
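
The folder layout described here can also be prepared with pathlib. This is only a sketch under my own assumptions: the helper name prepare_dirs is hypothetical, and unlike the script in this article it also creates the "input" folder rather than expecting it to exist.

```python
from pathlib import Path


def prepare_dirs(base_dir):
    """Create the working folders under base_dir and return them by name."""
    base = Path(base_dir)
    names = ("input", "input_error", "output", "output_tmp")
    dirs = {name: base / name for name in names}
    for path in dirs.values():
        # exist_ok=True makes the call a no-op when the folder already exists
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Example: folders under the current working directory.
dirs = prepare_dirs(Path.cwd())
print(dirs["output_tmp"])
```

Passing Path(__file__).resolve().parent instead of Path.cwd() would tie the folders to the location of the script, no matter where it is started from.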


import glob
import os
from shutil import copyfile

from PyPDF2 import PdfFileReader, PdfFileWriter

if __name__ == "__main__":

    # Get the current directory
    current_dir = os.getcwd()
    # Input directory, and a directory for PDFs that fail to split
    indir = current_dir + "\\input\\"
    indir_error = current_dir + "\\input_error\\"

    # Output directories (final JPEGs and intermediate one-page PDFs)
    outdir = current_dir + "\\output\\"
    outdir_tmp = current_dir + "\\output_tmp\\"

    # Create the directories if they do not exist yet
    if not os.path.isdir(outdir):
        os.mkdir(outdir)

    if not os.path.isdir(outdir_tmp):
        os.mkdir(outdir_tmp)

    if not os.path.isdir(indir_error):
        os.mkdir(indir_error)

    # Collect every file in the input directory
    all_images = glob.glob(indir + "*")

    # Process each input file
    for img in all_images:

        # File name of the input file (with extension)
        basename = os.path.basename(img)
        print(basename)

        # Split the file name into stem and extension
        name, ext = os.path.splitext(basename)

        try:
            # The variable is called img, but the input is a PDF file
            with open(img, "rb") as pdf_file:
                source = PdfFileReader(pdf_file)

                num_pages = source.getNumPages()

                # A single-page PDF is copied to output_tmp as-is; move on to the next file
                if num_pages == 1:
                    copyfile(img, outdir_tmp + basename)
                    continue

                # Write each page to its own PDF: <name>_1.pdf, <name>_2.pdf, ...
                for i in range(num_pages):
                    file_object = source.getPage(i)
                    pdf_file_name = outdir_tmp + name + "_" + str(i + 1) + ".pdf"

                    pdf_file_writer = PdfFileWriter()
                    pdf_file_writer.addPage(file_object)

                    with open(pdf_file_name, "wb") as f:
                        pdf_file_writer.write(f)

        except Exception:
            # Anything that cannot be read or split is copied to input_error
            print(basename + " *** error ***")
            copyfile(img, indir_error + basename)
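
The code in this article uses the PyPDF2 names that matched my environment (PdfFileReader, getNumPages, getPage, addPage). Newer releases of PyPDF2 and its successor pypdf use PdfReader, reader.pages and add_page instead. The following is a rough sketch of the same splitting step with those names, assuming the pypdf package is installed. The helper names page_pdf_path and split_pdf are hypothetical, and unlike the script in this article, a single-page PDF also gets the "_1" suffix here.

```python
from pathlib import Path


def page_pdf_path(out_dir, stem, page_index):
    """Path of the one-page PDF for a zero-based page index (suffix is 1-based)."""
    return Path(out_dir) / f"{stem}_{page_index + 1}.pdf"


def split_pdf(pdf_path, out_dir):
    """Write every page of pdf_path to its own PDF and return the new paths."""
    # Imported here so the naming helper works even without pypdf installed.
    # Assumption: the pypdf package (successor of PyPDF2) is available.
    from pypdf import PdfReader, PdfWriter

    pdf_path = Path(pdf_path)
    reader = PdfReader(str(pdf_path))
    written = []
    for index, page in enumerate(reader.pages):
        # One writer per page, so each output file holds exactly one page
        writer = PdfWriter()
        writer.add_page(page)
        target = page_pdf_path(out_dir, pdf_path.stem, index)
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

A call such as split_pdf("input/sample.pdf", "output_tmp") would then produce output_tmp/sample_1.pdf, output_tmp/sample_2.pdf and so on (sample.pdf is a made-up file name).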

What I want to do ②: Convert the PDFs to JPEG images

This part is done by Ghostscript. The PDF files that were split into pages in ① are picked up one by one and converted to JPEG. The flow is to build the command first and then run it with the subprocess module.


    print("***** pdf to jpg *****")

    ######################################
    #
    # Settings for PDF-to-JPEG conversion (Ghostscript)
    #
    ######################################

    exe_name = r'C:\Program Files\gs\gs9.50\bin\gswin64c.exe'
    exe_option = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=100 -dQFactor=1.0 -dDisplayFormat=16 -r600"
    exe_outfile = "-sOutputFile="

    # Collect the one-page PDFs in output_tmp
    all_pdfs = glob.glob(outdir_tmp + "*")

    for pdf in all_pdfs:

        # File name of the PDF (with extension)
        basename = os.path.basename(pdf)
        print(basename)

        # Split the file name into stem and extension
        name, ext = os.path.splitext(basename)

        # No trailing space: each list element is passed to Ghostscript verbatim
        outputfile = outdir + name + ".jpg"

        cmd_list = list()
        cmd_list.append(exe_name)
        cmd_list.extend(exe_option.split())  # extend, not append: each option must be its own element
        cmd_list.append(exe_outfile + outputfile)
        cmd_list.append(pdf)

        subprocess.call(cmd_list)
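
As a side note, Ghostscript can number its own output files: when the -sOutputFile value contains a printf-style page placeholder such as %03d, one file is written per page. The sketch below builds such a command as a list. It is an illustration under my own assumptions, not the method used in this article: the function name build_gs_command and the file paths are hypothetical, and the option set is trimmed down from the one in my script.

```python
def build_gs_command(gs_exe, pdf_path, output_pattern, dpi=600, quality=100):
    """Return the argument list for one Ghostscript PDF-to-JPEG run."""
    return [
        gs_exe,
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=jpeg",
        f"-dJPEGQ={quality}",              # JPEG quality
        f"-r{dpi}",                        # resolution in dpi
        f"-sOutputFile={output_pattern}",  # %03d becomes 001, 002, ...
        pdf_path,
    ]


# Hypothetical paths, for illustration only.
cmd = build_gs_command("gswin64c", r"input\report.pdf", r"output\report_%03d.jpg")
print(cmd)
```

The list could then be handed to subprocess.run(cmd, check=True), which raises an exception when Ghostscript exits with a non-zero code. Because the arguments are passed as a list, paths containing spaces such as "C:\Program Files" need no extra quoting.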

The entire source code


import os
import glob

from PyPDF2 import PdfFileReader
from PyPDF2 import PdfFileWriter

import subprocess
from shutil import copyfile

"""
Main
"""
if __name__ == "__main__":

    # Get the current directory
    current_dir = os.getcwd()
    # Input directory, and a directory for PDFs that fail to split
    indir = current_dir + "\\input\\"
    indir_error = current_dir + "\\input_error\\"

    # Output directories (final JPEGs and intermediate one-page PDFs)
    outdir = current_dir + "\\output\\"
    outdir_tmp = current_dir + "\\output_tmp\\"

    # Create the directories if they do not exist yet
    if not os.path.isdir(outdir):
        os.mkdir(outdir)

    if not os.path.isdir(outdir_tmp):
        os.mkdir(outdir_tmp)

    if not os.path.isdir(indir_error):
        os.mkdir(indir_error)

    # Collect every file in the input directory
    all_images = glob.glob(indir + "*")

    # Process each input file
    for img in all_images:

        # File name of the input file (with extension)
        basename = os.path.basename(img)
        print(basename)

        # Split the file name into stem and extension
        name, ext = os.path.splitext(basename)

        try:
            # The variable is called img, but the input is a PDF file
            with open(img, "rb") as pdf_file:
                source = PdfFileReader(pdf_file)

                num_pages = source.getNumPages()

                # A single-page PDF is copied to output_tmp as-is; move on to the next file
                if num_pages == 1:
                    copyfile(img, outdir_tmp + basename)
                    continue

                # Write each page to its own PDF: <name>_1.pdf, <name>_2.pdf, ...
                for i in range(num_pages):
                    file_object = source.getPage(i)
                    pdf_file_name = outdir_tmp + name + "_" + str(i + 1) + ".pdf"

                    pdf_file_writer = PdfFileWriter()
                    pdf_file_writer.addPage(file_object)

                    with open(pdf_file_name, "wb") as f:
                        pdf_file_writer.write(f)

        except Exception:
            # Anything that cannot be read or split is copied to input_error
            print(basename + " *** error ***")
            copyfile(img, indir_error + basename)

    print("***** pdf to jpg *****")

    ######################################
    #
    # Settings for PDF-to-JPEG conversion (Ghostscript)
    #
    ######################################

    exe_name = r'C:\Program Files\gs\gs9.50\bin\gswin64c.exe'
    exe_option = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=100 -dQFactor=1.0 -dDisplayFormat=16 -r600"
    exe_outfile = "-sOutputFile="

    # Collect the one-page PDFs in output_tmp
    all_pdfs = glob.glob(outdir_tmp + "*")

    for pdf in all_pdfs:

        # File name of the PDF (with extension)
        basename = os.path.basename(pdf)
        print(basename)

        # Split the file name into stem and extension
        name, ext = os.path.splitext(basename)

        # No trailing space: each list element is passed to Ghostscript verbatim
        outputfile = outdir + name + ".jpg"

        cmd_list = list()
        cmd_list.append(exe_name)
        cmd_list.extend(exe_option.split())  # extend, not append: each option must be its own element
        cmd_list.append(exe_outfile + outputfile)
        cmd_list.append(pdf)

        subprocess.call(cmd_list)

In closing

Because I wrote the source code quickly just to get done what I wanted, it turned out rather messy lol (not really a laughing matter).

In addition, there were some PDFs that could not be converted to JPEG properly even with Ghostscript. The cause is not clear, but after a little research it seems that the fonts in those PDFs are not supported. Depending on the response to this article, I will investigate a little more.
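
When a conversion fails like this, the messages Ghostscript prints are usually the first place to look. Here is a minimal sketch that keeps the exit code and the console output of a command for later inspection. The helper name run_and_capture is hypothetical, and a small Python command stands in for the Ghostscript call so that the example runs without Ghostscript installed.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (exit_code, combined stdout/stderr text)."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr into stdout
        universal_newlines=True,    # decode output to str (works on Python 3.6 too)
    )
    return completed.returncode, completed.stdout


# Stand-in command; in practice cmd would be the Ghostscript argument list.
code, log = run_and_capture([sys.executable, "-c", "print('hello')"])
print(code, log)
```

Writing the log next to each problem PDF in input_error would make it easier to compare the failures afterwards.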

If you have any questions, please leave a comment.

In this article, I ran Ghostscript from Python, split PDFs into pages, and converted them to JPEG images.