I tried OCR processing a PDF file with Java

Introduction

I wrote a Java program that runs OCR on the image files embedded in a PDF. At first I tried to implement it in Python, but I couldn't work out the dependencies of the libraries I needed, so I switched to Java, which I'm used to.

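Libraries used

- Apache PDFBox (reads the PDF and extracts the embedded images)
- Tess4J (Java wrapper for the Tesseract OCR engine)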

Points of note

- All of the PDF analysis sample programs I found used so many loops that the source was hard to read, so I wrote this one with map and reduce instead.
- Is there a smarter way to convert an Iterator to a Stream ...?
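
On the Iterator-to-Stream question: when the source object is an Iterable rather than a bare Iterator, passing `iterable.spliterator()` to `StreamSupport.stream` avoids the `Spliterators.spliteratorUnknownSize` wrapper. The sketch below uses a plain `List` as a stand-in for the PDF objects. The names `IterableStreams`, `fromIterable` and `fromIterator` are made up for illustration, and whether the PDFBox types in use are actually Iterable depends on the library version, so treat that as an assumption to check.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class IterableStreams {
    // Shorter form: any Iterable already exposes a Spliterator
    static <T> Stream<T> fromIterable(Iterable<T> iterable) {
        return StreamSupport.stream(iterable.spliterator(), false);
    }

    // Form needed when only an Iterator is available
    static <T> Stream<T> fromIterator(Iterator<T> iterator) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        // Stand-in data; in the OCR program these would be pages or XObject names
        List<String> pages = Arrays.asList("page1", "page2", "page3");
        String a = fromIterable(pages).collect(Collectors.joining(","));
        String b = fromIterator(pages.iterator()).collect(Collectors.joining(","));
        System.out.println(a);           // page1,page2,page3
        System.out.println(a.equals(b)); // true
    }
}
```

Both helpers produce a sequential, ordered stream, so the map and reduce steps work the same way with either one.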

Source

Class to get an image from PDF

PDFMaker.java


package pdf;

import java.io.File;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class PDFMaker {
	public static void main(String[] args) throws Exception {
		// PDF file to read; try-with-resources closes the document when done
		try (PDDocument document = PDDocument.load(new File("C:/python/img/pdf/2017h29a_sc_pm2_qs.pdf"))) {
			// Wrap the page iterator in a sequential Stream
			Stream<PDPage> stream = StreamSupport.stream(Spliterators.spliteratorUnknownSize(
					document.getDocumentCatalog().getPages().iterator(),
					Spliterator.ORDERED), false);

			System.out.println("start");
			// OCR each page and write one entry per page to parse.txt.
			// TRUNCATE_EXISTING keeps leftovers from a longer previous run out of the file.
			Files.write(Paths.get("parse.txt"),
					stream.map(s -> exePDFpage(s)).collect(Collectors.toList()),
					Charset.forName("MS932"),
					StandardOpenOption.CREATE,
					StandardOpenOption.TRUNCATE_EXISTING);
			System.out.println("end");
		}
	}
	
	// OCR every image XObject on one page and concatenate the results
	public static String exePDFpage(PDPage p){
		Stream<COSName>stream = StreamSupport.stream(Spliterators.spliteratorUnknownSize(
				p.getResources().getXObjectNames().iterator(),Spliterator.ORDERED),false);
		// The "" identity makes a page without XObjects yield "" instead of failing on an empty Optional
		return stream.map(s->exeImage(s,p.getResources()))
		.reduce("",(s,v)->s+v);
	}
	
	// Run OCR on the XObject named n if it is an image; otherwise return ""
	public static String exeImage(COSName n,PDResources resources){
		try{
			PDXObject xobject = resources.getXObject(n);
			if(xobject instanceof PDImageXObject){
				// Reuse the object already fetched instead of looking it up again
				PDImageXObject image2 = (PDImageXObject) xobject;
				return PDFtoImg.extractFromPDF(image2.getImage());
			}
			return "";
		}catch(Exception e){
			e.printStackTrace();
			return "";
		}
	}
}
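
When per-image results are concatenated with `reduce`, a page that contains no XObjects produces an empty stream. The one-argument `reduce` then returns an empty `Optional`, and calling `get()` on it throws `NoSuchElementException`. A minimal sketch of the two forms using only the standard library (the class name `ReduceDemo` and the sample strings are illustrative):

```java
import java.util.Optional;
import java.util.stream.Stream;

class ReduceDemo {
    // One-argument reduce: the result is absent when the stream is empty
    static Optional<String> concatOptional(Stream<String> parts) {
        return parts.reduce((s, v) -> s + v);
    }

    // Two-argument reduce: the identity "" is returned for an empty stream
    static String concat(Stream<String> parts) {
        return parts.reduce("", (s, v) -> s + v);
    }

    public static void main(String[] args) {
        System.out.println(concat(Stream.of("abc", "def")));             // abcdef
        System.out.println(concatOptional(Stream.empty()).isPresent()); // false
        System.out.println(concat(Stream.empty()).isEmpty());           // true
    }
}
```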

Class that runs OCR on an image

PDFtoImg.java


package pdf;

import java.awt.image.BufferedImage;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class PDFtoImg {
	// Directory containing the Tesseract language data (e.g. jpn.traineddata)
	private static final String DICTIONARY_PATH = "C:/Users/takayoshi/workspace/PDF/tessdata";

	// Run Japanese OCR on one image; returns "" if recognition fails
	public static String extractFromPDF(BufferedImage img) {
		ITesseract instance = new Tesseract();
		try {
			instance.setLanguage("jpn");
			instance.setDatapath(DICTIONARY_PATH);
			return instance.doOCR(img);
		} catch (TesseractException ex) {
			ex.printStackTrace();
			return "";
		}
	}
}
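
When the recognized text looks wrong, it can help to look at the exact image handed to the OCR engine. The following is a small sketch, assuming it is acceptable to write debug files to disk. `ImageDump` and `save` are hypothetical names, not part of PDFBox or Tess4J, and PNG is chosen only because it is lossless.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class ImageDump {
    // Write the image as PNG and return the file that was written
    static File save(BufferedImage img, File target) throws IOException {
        if (!ImageIO.write(img, "png", target)) {
            throw new IOException("No PNG writer available for this image type");
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Blank stand-in image; in the program above this would be image2.getImage()
        BufferedImage img = new BufferedImage(40, 20, BufferedImage.TYPE_INT_RGB);
        File out = save(img, File.createTempFile("page-image", ".png"));
        // Read the file back to confirm the dimensions survived the round trip
        BufferedImage back = ImageIO.read(out);
        System.out.println(back.getWidth() + "x" + back.getHeight()); // 40x20
    }
}
```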

Impressions

- It takes about 10 to 20 minutes to process one question paper of the Registered Information Security Specialist Examination (SC), one of the Information Technology Engineers Examinations.
- The analysis result for the first page is as follows:

Fall 2017
{D`
Information processing woman all securing supporter examination
No
afternoon=problem
Test time-4:30 ~-6:30 (2 hours)
Notes
-・ The start and end of the test,The supervisor's clock is the standard. Follow the instructions of the supervisor 〟
2~Until there is a signal to start the test,Do not open the question booklet and look inside.
3~Entering the examination number etc. on the answer sheet,Please start after the signal to start the test.
4_The problem is,Please answer according to the table of Fuki.

Considering that the OCR engine is free, I think the Japanese recognition is reasonably good.
