Extracting PDF text content with Java

In day-to-day work you may need to extract the text contained in a large PDF document. Free Spire.PDF for Java provides a convenient and fast way to do this, and this article introduces the Java code used in the process.

** Basic steps: **

** 1. ** Download the Free Spire.PDF for Java package and unzip it.
** 2. ** Import the Spire.Pdf.jar file from the lib folder into your Java application as a dependency, or install the JAR from the Maven repository (see the pom.xml configuration below).
** 3. ** In your Java application, create a new Java class (named ExtractText here), then enter and run the corresponding Java code.

** Configure the pom.xml file: **

<repositories>
   <repository>
      <id>com.e-iceblue</id>
      <name>e-iceblue</name>
      <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
   </repository>
</repositories>
<dependencies>
   <dependency>
      <groupId>e-iceblue</groupId>
      <artifactId>spire.pdf.free</artifactId>
      <version>2.6.3</version>
   </dependency>
</dependencies>

** The PDF source document is: ** sample.jpg

** Java code: **


import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {

    public static void main(String[] args) {

        // Create a PdfDocument instance
        PdfDocument doc = new PdfDocument();
        // Load the PDF file
        doc.loadFromFile("snow.pdf");

        // Create a StringBuilder to collect the text of all pages
        StringBuilder sb = new StringBuilder();

        // Traverse the PDF pages, extract the text of each page and append it to the StringBuilder
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            PdfPageBase page = doc.getPages().get(i);
            sb.append(page.extractText(true));
        }

        // Write the collected text to a text file; try-with-resources closes the writer even on failure
        try (FileWriter writer = new FileWriter("ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Release the resources held by the document
        doc.close();
    }
}
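
The loop above appends the text of each page back to back, and a FileWriter created without a charset argument relies on the JVM's default charset. As an optional variation, the following minimal sketch shows one way to mark page boundaries and write the result explicitly as UTF-8. It uses only the JDK. The class name PageTextWriter, the methods joinPages and writeUtf8, and the "--- Page N ---" marker format are all hypothetical names chosen for illustration and are not part of Spire.PDF. The sample strings stand in for the values returned by page.extractText(true).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spire.PDF): joins per-page text and writes it as UTF-8.
class PageTextWriter {

    // Prefixes each page's text with a marker line so page boundaries stay visible.
    static String joinPages(List<String> pageTexts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            sb.append("--- Page ").append(i + 1).append(" ---\n");
            // trim() removes leading and trailing whitespace from the page text
            sb.append(pageTexts.get(i).trim()).append("\n");
        }
        return sb.toString();
    }

    // Writes the text with an explicit charset instead of the JVM default.
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the article's program these would come from page.extractText(true)
        List<String> pages = Arrays.asList("First page text", "Second page text");

        // A temporary file is used here so the sketch does not overwrite anything
        Path out = Files.createTempFile("ExtractText", ".txt");
        writeUtf8(out, joinPages(pages));

        // Read the file back to confirm what was written
        System.out.print(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

To combine this with the program above, one could collect each page's extracted text into a List<String> inside the loop instead of appending it directly to the StringBuilder, then pass that list to joinPages.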

** Extract results: ** text.jpg
