Read text from a PDF file with Java

Overview

Extract the text from a PDF file.

Image data in the PDF is not handled here.

Execution environment

OS: Windows 7
Language: Java

Java preparation

Create a Maven project and add the following dependency to pom.xml:

<dependency>
	<groupId>org.apache.pdfbox</groupId>
	<artifactId>pdfbox</artifactId>
	<version>2.0.8</version>
</dependency>

Implementation

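import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextReader {
    public static void main(String[] args) {
        File file = new File("test.pdf");

        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            // Extract text only
            PDFTextStripper pdfStripper = new PDFTextStripper();

            // Read the text in visual order (from upper left to lower right)
            pdfStripper.setSortByPosition(true);

            // Extract the text from the PDF
            String text = pdfStripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}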

The extracted text includes page numbers and whitespace (half-width spaces, full-width spaces, tabs, and so on), so cleansing it first makes later analysis easier.
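The cleansing step could look something like the sketch below. It is only one possible approach, not part of PDFBox: the class name `TextCleanser` and the method `cleanse` are hypothetical, and treating a line made up only of digits as a page number is an assumption that may not fit every document.

```java
import java.util.ArrayList;
import java.util.List;

public class TextCleanser {

    // Hypothetical helper: normalizes whitespace and drops lines that look like page numbers.
    public static String cleanse(String raw) {
        List<String> kept = new ArrayList<>();
        for (String line : raw.split("\\R")) {
            // Replace runs of tabs, full-width spaces (U+3000) and half-width spaces
            // with a single half-width space, then trim both ends of the line.
            String normalized = line.replaceAll("[\\t\\u3000 ]+", " ").trim();
            if (normalized.isEmpty()) {
                continue; // skip blank lines
            }
            // Assumption: a line that contains only digits is a page number.
            if (normalized.matches("\\d+")) {
                continue;
            }
            kept.add(normalized);
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String sample = "Title\u3000of\tdocument\n\n  body   text  \n12\n";
        System.out.println(cleanse(sample));
    }
}
```

In practice the string returned by `pdfStripper.getText(document)` would be passed to `cleanse`. If the document contains meaningful lines that are purely numeric, the page-number rule would need to be adjusted or removed.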

PDFs with vertically written text are not extracted correctly this way.
