[Java] Extracting text content from a PDF in Java

In everyday work, you may need to extract the text from a large PDF document. Free Spire.PDF for Java provides a convenient and fast way to do this, and this post shows the Java code involved.

Basic steps:

1. Download and unzip the Free Spire.PDF for Java package.
2. Add Spire.Pdf.jar from the lib folder to your Java application as a dependency, or install the JAR from the Maven repository (the pom.xml configuration is shown below).
3. In your Java application, create a new Java class (named ExtractText here), enter the Java code, and run it.

Configure the pom.xml file:

<repositories>
   <repository>
      <id>com.e-iceblue</id>
      <name>e-iceblue</name>
      <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
   </repository>
</repositories>
<dependencies>
   <dependency>
      <groupId>e-iceblue</groupId>
      <artifactId>spire.pdf.free</artifactId>
      <version>2.6.3</version>
   </dependency>
</dependencies>

Source PDF document (screenshot): sample.jpg

Java code:


import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {

    public static void main(String[] args) {

        // Create a PdfDocument instance and load the PDF file
        PdfDocument doc = new PdfDocument();
        doc.loadFromFile("Snow.pdf");

        // Collect the extracted text of every page in a StringBuilder
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            PdfPageBase page = doc.getPages().get(i);
            sb.append(page.extractText(true));
        }

        // Write the collected text to a file; try-with-resources closes the writer
        try (FileWriter writer = new FileWriter("ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Release the resources held by the document
        doc.close();
    }
}
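
For a large document, it may be preferable to write each page's text as soon as it is available instead of collecting everything in one StringBuilder, and to set the output encoding explicitly. The following is a minimal sketch of that idea using only the JDK. The class name PageTextWriter, the method writePages, and the "--- Page N ---" marker lines are hypothetical and not part of Spire.PDF; the list of strings stands in for the values that page.extractText(true) would return.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spire.PDF
class PageTextWriter {

    // Writes each page's text to the given writer, preceded by a marker line
    public static void writePages(Iterable<String> pageTexts, Writer out) throws IOException {
        int pageNumber = 1;
        for (String text : pageTexts) {
            out.write("--- Page " + pageNumber + " ---\n");
            out.write(text);
            out.write("\n");
            pageNumber++;
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder strings standing in for the text extracted from each PDF page
        List<String> pages = Arrays.asList("First page text", "Second page text");

        // Write to a temporary file as UTF-8; try-with-resources closes the writer
        Path target = Files.createTempFile("ExtractText", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writePages(pages, writer);
        }
        System.out.println("Wrote " + target);
    }
}
```

In the ExtractText class, the same loop over doc.getPages() could pass each extracted string to the writer directly, so the full text never has to be held in memory at once.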

Extraction result (screenshot): text.jpg