Extracting text from a PDF in Java with pdfbox-2.0.8

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToTextMain1 {

	public static void main(String[] args) {

		// Required jars (download: https://pdfbox.apache.org/download.cgi)
		// pdfbox-2.0.8.jar
		// fontbox-2.0.8.jar
		// commons-logging (runtime dependency of PDFBox 2.x)

		String filepath = "file/pdf_ja.pdf";

		// PDDocument is Closeable, so try-with-resources closes the
		// document (and releases the file) even if extraction fails.
		try (PDDocument pdDoc = PDDocument.load(new File(filepath))) {

			PDFTextStripper pdfStripper = new PDFTextStripper();

			// Page numbers are 1-based; the end page is inclusive.
			pdfStripper.setStartPage(1);
			pdfStripper.setEndPage(5);

			String parsedText = pdfStripper.getText(pdDoc);

			System.out.println(parsedText);
		} catch (IOException e) {
			// Thrown by PDDocument.load, the PDFTextStripper constructor
			// and getText.
			e.printStackTrace();
		}
	}

}

--> This is a test.
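
The sample above prints the extracted text to the console. The file name `pdf_ja.pdf` suggests Japanese content, and on some systems the console encoding may not display such text correctly. One possible workaround is to save the string to a UTF-8 file instead. Below is a minimal sketch using only the standard library. The class name `ExtractedTextWriter`, the method `write`, and the output file name are hypothetical and not part of PDFBox; the literal string stands in for the value returned by `pdfStripper.getText(pdDoc)`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of PDFBox): saves extracted text as UTF-8.
final class ExtractedTextWriter {

	// Writes the text to the given path as UTF-8, replacing any existing file.
	static void write(String text, Path out) throws IOException {
		Files.write(out, text.getBytes(StandardCharsets.UTF_8));
	}

	public static void main(String[] args) throws IOException {
		// Stand-in for the String returned by pdfStripper.getText(pdDoc)
		String parsedText = "This is a test.";
		// Hypothetical output location
		write(parsedText, Paths.get("pdf_ja.txt"));
	}
}
```

In the original program, the call `ExtractedTextWriter.write(parsedText, Paths.get("pdf_ja.txt"))` would take the place of `System.out.println(parsedText)` inside the existing `try` block, so the same `catch (IOException e)` also covers the write.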
