Reading text from a PDF file with Java


This article shows how to extract the text from a PDF file using Apache PDFBox.

Execution environment

OS: Windows 7
Language: Java

Java setup

Create a Maven project and add the Apache PDFBox dependency to pom.xml.
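
A dependency entry along the following lines should work. This is a sketch rather than the original snippet: the code in this article uses PDDocument.load(File), which belongs to the PDFBox 2.x API, so a 2.x release is assumed, and the exact version number shown is an assumption.

```xml
<dependencies>
  <!-- Apache PDFBox. The version is an assumption; a 2.x release is assumed
       because the code calls PDDocument.load(File). -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.27</version>
  </dependency>
</dependencies>
```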



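import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// The class name is illustrative; the calls assume the PDFBox 2.x API
public class PdfTextReader {
    public static void main(String[] args) {
        File file = new File("test.pdf");

        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            // Extract only the text
            PDFTextStripper pdfStripper = new PDFTextStripper();

            // Read the text in visual order (from upper left to lower right)
            pdfStripper.setSortByPosition(true);

            // Extract the text from the PDF
            String text = pdfStripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}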

The extracted text includes page numbers and whitespace (half-width spaces, full-width spaces, tabs, and so on), so cleaning it up first makes later analysis easier.
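
As a rough illustration of that cleanup step, here is a minimal sketch in plain Java. The class name TextCleaner and the method clean are hypothetical names introduced for this example. It also assumes that a line consisting only of digits is a page number, which may not hold for every document (a table row containing a single number would be dropped too).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox
public class TextCleaner {

    /** Collapses whitespace runs and drops blank and digit-only lines. */
    public static String clean(String raw) {
        List<String> kept = new ArrayList<>();
        for (String line : raw.split("\\R")) {
            // Replace runs of half-width spaces, tabs and full-width spaces (U+3000)
            // with a single half-width space, then strip the ends
            String s = line.replaceAll("[ \\t\\u3000]+", " ").trim();
            if (s.isEmpty()) {
                continue; // skip blank lines
            }
            if (s.matches("\\d+")) {
                continue; // assumption: a digit-only line is a page number
            }
            kept.add(s);
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String raw = "Title\u3000of  report\t\n\n  Body text \n1\n";
        System.out.println(clean(raw));
    }
}
```

In practice, the string returned by pdfStripper.getText(document) would be passed to clean. If page numbers in your PDFs follow another pattern (for example "- 3 -" or "Page 3"), the digit-only check would need to be adjusted accordingly.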

