Extracting PDF text and images with Java

PDF files often carry a lot of valuable content. To make better use of it, you need a tool that can extract the text and images from the PDF. The following shows how to extract text and images from a PDF in Java.

Tool used:

-Free Spire.PDF for Java 2.4.4 (free version)

Importing the jar package:

--Method 1: Download the Free Spire.Pdf for Java package from the official site and unzip it; the Spire.Pdf.jar file is in the lib folder under the extraction path. Add this jar to your project as a dependency in IDEA (Project Structure, Shift + Ctrl + Alt + S) or in Eclipse. The result of importing the jar package is as follows:

image.png

--Method 2: Install from the Maven repository. See the installation guide (https://www.e-iceblue.com/Tutorials/Licensing/How-to-install-Spire.PDF-for-Java-from-Maven-Repository.html).

The test source document is as follows: image.png

Java code examples:

[Example 1] Extracting the text content of a PDF

** Step 1: ** Import the required classes;

import com.spire.pdf.*;
import java.io.FileWriter;

** Step 2: ** Create a PdfDocument instance and load the source PDF file;

// Create a PdfDocument instance
PdfDocument doc = new PdfDocument();
// Load the PDF file
doc.loadFromFile("data/Sample.pdf");

** Step 3: ** Create a StringBuilder as a text buffer, then loop over every page of the PDF document and append each page's extracted text to it;

// Traverse the PDF pages; the index starts at 0 so the first page is included
StringBuilder buffer = new StringBuilder();
for (int i = 0; i < doc.getPages().getCount(); i++) {
    PdfPageBase page = doc.getPages().get(i);
    buffer.append(page.extractText());
}
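The loop appends each page's text directly, so if a page's text does not end with a line break, it may run into the text of the next page. The following is a minimal sketch of one way to keep pages apart, assuming the per-page strings have already been collected (for example from page.extractText()). PageTextJoiner and joinPages are hypothetical names used for illustration, not part of Spire.PDF.

```java
import java.util.Arrays;
import java.util.List;

final class PageTextJoiner {
    // Hypothetical helper: joins per-page text with a line break between pages.
    static String joinPages(List<String> pageTexts) {
        StringBuilder buffer = new StringBuilder();
        for (String pageText : pageTexts) {
            if (pageText == null || pageText.isEmpty()) {
                continue; // skip pages that produced no text
            }
            if (buffer.length() > 0) {
                buffer.append(System.lineSeparator());
            }
            buffer.append(pageText);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Stand-in strings; in the tutorial these would come from each page
        System.out.println(joinPages(Arrays.asList("Text of page 1", "", "Text of page 3")));
    }
}
```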

** Step 4: ** Create a FileWriter, use write() to write the contents of the buffer to the file output/text.txt, then flush and close the writer.

//save text
String fileName = "output/text.txt";
FileWriter writer = new FileWriter(fileName);
writer.write(buffer.toString());
writer.flush();
writer.close();
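The snippet above needs the output folder to exist already, and it leaves the writer open if write() throws. FileWriter(String) also uses the platform default charset. As a sketch using only the standard library, the same step could create the folder first, write with an explicit UTF-8 encoding, and close the writer with try-with-resources. TextFileSaver and save are hypothetical names, not part of Spire.PDF.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TextFileSaver {
    // Hypothetical helper: writes text to a file, creating parent folders as needed.
    static void save(String text, String fileName) throws IOException {
        Path target = Paths.get(fileName);
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent); // no error if the folder already exists
        }
        // try-with-resources flushes and closes the writer even if write() throws
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // In the tutorial, the first argument would be buffer.toString()
        save("extracted text goes here", "output/text.txt");
    }
}
```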

Text extraction result: image.png

[Example 2] Extracting images from a PDF

** Step 1: ** Import the required classes;

import com.spire.pdf.*;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;

** Step 2: ** Create a PdfDocument instance and load the source PDF file;

        // Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        // Load the PDF file
        pdf.loadFromFile("data/Sample.pdf");

** Step 3: ** Loop over each page of the PDF, get the images on that page with the extractImages() method, and save each image in PNG format.

        // Counter used to number the output files
        int index = 0;
        // Loop through the pages
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            // Get the current page
            PdfPageBase page = pdf.getPages().get(i);
            // Extract the images from this page
            for (BufferedImage image : page.extractImages()) {
                // Specify the output file path and name
                File output = new File("output/" + String.format("Image_%d.png", index++));
                // Save the image as a .png file
                ImageIO.write(image, "PNG", output);
            }
        }
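The saving step can be tried on its own, without a PDF. The sketch below uses only the standard library: it creates the output folder if it is missing and checks the boolean returned by ImageIO.write, which is false when no suitable writer is found. ImageSaver and savePng are hypothetical names, and the blank image stands in for one returned by extractImages().

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.imageio.ImageIO;

final class ImageSaver {
    // Hypothetical helper: saves one image as <dir>/Image_<index>.png and returns the file.
    static File savePng(BufferedImage image, File dir, int index) throws IOException {
        Files.createDirectories(dir.toPath()); // make sure the output folder exists
        File output = new File(dir, String.format("Image_%d.png", index));
        // ImageIO.write returns false when no suitable writer is found
        if (!ImageIO.write(image, "PNG", output)) {
            throw new IOException("No PNG writer available for " + output);
        }
        return output;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an extracted image: a blank 10x10 RGB image
        BufferedImage sample = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        System.out.println(savePng(sample, new File("output"), 0).getPath());
    }
}
```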

Image extraction result: image.png image.png image.png image.png
