Extract the text from a PDF file using Apache PDFBox.
OS: Windows 7
Language: Java
Create a Maven project and add the following dependency to pom.xml:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.8</version>
</dependency>
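import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExtractor {
    public static void main(String[] args) {
        String text = "";
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File("test.pdf"))) {
            // PDFTextStripper extracts only the text content
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // Read the text in visual order (from upper left to lower right)
            pdfStripper.setSortByPosition(true);
            // Extract the text from the PDF
            text = pdfStripper.getText(document);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(text);
    }
}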
The extracted text includes page numbers and whitespace (half-width spaces, full-width spaces, tabs, etc.), so applying a cleansing step first makes later analysis easier.
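One possible cleansing step is sketched below. It is only an illustration, not part of PDFBox: the class name TextCleaner and the method clean are hypothetical, and it assumes that a line consisting only of digits is a page number. That assumption will also drop legitimate numeric-only lines, so adjust the rule to your documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for cleaning text returned by PDFTextStripper.getText
final class TextCleaner {
    // Runs of half-width spaces, full-width spaces (U+3000), and tabs
    private static final Pattern SPACES = Pattern.compile("[ \\u3000\\t]+");
    // Assumption: a line made up only of digits is a page number
    private static final Pattern PAGE_NUMBER = Pattern.compile("\\d+");

    static String clean(String raw) {
        List<String> kept = new ArrayList<>();
        // \R matches any line break (LF, CRLF, ...)
        for (String line : raw.split("\\R")) {
            // Collapse each whitespace run into one half-width space, then trim
            String normalized = SPACES.matcher(line).replaceAll(" ").trim();
            // Skip blank lines and lines that look like page numbers
            if (normalized.isEmpty() || PAGE_NUMBER.matcher(normalized).matches()) {
                continue;
            }
            kept.add(normalized);
        }
        return String.join("\n", kept);
    }
}
```

With the extraction code above, this could be applied as TextCleaner.clean(text) before passing the result on to analysis.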