Read a string in a PDF file with Java

Overview

This article shows how to extract text from a PDF file in Java using Apache PDFBox.

Execution environment

OS: Windows 7
Language: Java

Java preparation

Create a Maven project and add the following dependency to pom.xml:

<dependency>
	<groupId>org.apache.pdfbox</groupId>
	<artifactId>pdfbox</artifactId>
	<version>2.0.8</version>
</dependency>

Implementation

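import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextReader {
    public static void main(String[] args) {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File("test.pdf"))) {
            // Extract text only
            PDFTextStripper pdfStripper = new PDFTextStripper();

            // Read the text in visual order (top left to bottom right)
            pdfStripper.setSortByPosition(true);

            // Extract the text from the PDF
            String text = pdfStripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}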

The extracted text includes page numbers and whitespace (half-width spaces, full-width spaces, tabs, and so on), so cleansing it first makes later analysis easier.
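
As a rough illustration of the cleansing step, here is a minimal sketch using only the standard library (Java 8 or later). The class name TextCleanser and the method cleanse are hypothetical names, not part of PDFBox. It assumes that any line consisting only of digits is a page number, which may not hold for every document, so adjust the rule to your PDFs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
public class TextCleanser {

    // Normalizes whitespace and drops blank lines and lines assumed to be page numbers.
    public static String cleanse(String raw) {
        List<String> kept = new ArrayList<>();
        // \R matches any line break (\n, \r\n, \r)
        for (String line : raw.split("\\R")) {
            // Collapse runs of half-width spaces, tabs, and full-width spaces (U+3000)
            // into a single half-width space, then trim both ends
            String normalized = line.replaceAll("[ \\t\\u3000]+", " ").trim();
            if (normalized.isEmpty()) {
                continue; // skip blank lines
            }
            if (normalized.matches("\\d+")) {
                continue; // assumption: a digits-only line is a page number
            }
            kept.add(normalized);
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        // Sample input standing in for the output of PDFTextStripper.getText
        String raw = "Title\u3000of\tthe  document\r\n\r\n  body text \n1\n";
        System.out.println(cleanse(raw));
    }
}
```

In practice, you would pass the string returned by pdfStripper.getText(document) to cleanse before analyzing it. Dropping every digits-only line is a deliberately simple rule: it also removes legitimate numeric lines, such as a table cell holding only a number.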
