I made a Wrapper that calls KNP from Java

Article content that can be understood in 3 lines

Overview

Deep learning has been sweeping through NLP lately, so some people may say, "I don't use KNP these days." However, when you don't have decent data and can't use the common methods, or when you want to try a rule-based approach before going deep, KNP is still useful: it gives you a lot of information, including case analysis results and a variety of features.

On the other hand, unlike Sudachi, it can't be used as a library, so calling it from a program can be a hassle (especially in languages other than Python, which has pyKNP).

So this time I made a wrapper library that calls KNP (and JUMAN++) from Java: https://github.com/Natsuume/knp4j ~~I wanted to publish it to the Maven repository, but I couldn't make it in time, so I will publish it there soon.~~

~~Also, although I have checked that it works to some extent, I have not written proper tests, so problems may occur.~~

Postscript

The library is now published to Maven Central, so it can be used from Maven, Gradle, etc.

pom.xml


<dependency>
  <groupId>dev.natsuume.knp4j</groupId>
  <artifactId>knp4j</artifactId>
  <version>1.1.3</version>
</dependency>

build.gradle


implementation 'dev.natsuume.knp4j:knp4j:1.1.3'

How to use

This is almost the same as the README.md on GitHub.

Sample.java


// Builder for creating a KnpWrapper
ResultParser<KnpResult> knpResultParser = new KnpResultParser();
KnpWrapperBuilder<KnpResult> knpWrapperBuilder = new KnpWrapperBuilder<>();
KnpWrapper<KnpResult> wrapper = knpWrapperBuilder
    .setJumanCommand(List.of("bash", "-c", "jumanpp")) // JUMAN execution command
    .setKnpCommand(List.of("bash", "-c", "knp -tab -print-num -anaphora")) // KNP execution command (the -tab, -print-num, and -anaphora options are currently required)
    .setJumanMaxNum(1) // Maximum number of JUMAN processes running at the same time
    .setJumanStartNum(1) // Number of JUMAN processes started at initialization
    .setKnpMaxNum(1) // Maximum number of KNP processes running at the same time
    .setKnpStartNum(1) // Number of KNP processes started at initialization
    .setRetryNum(0) // Number of retries when obtaining a result fails
    .setResultParser(knpResultParser) // Parser that converts the output (List<String>) into an arbitrary class
    .start();
var texts = List.of(
    "Test text 1",
    "Test text 2",
    "Test text 3"
);
texts.parallelStream().map(wrapper::analyze)
    .flatMap(List::stream)
    .map(KnpResult::getSurfaceForm)
    .forEach(System.out::println);

Configure everything through KnpWrapperBuilder; the KnpWrapper is only created and started when start() is called. setJumanCommand and setKnpCommand take the same command list you would give to ProcessBuilder. Depending on the environment, the paths to JUMAN and KNP alone may be enough. (In my environment I had to call JUMAN++ and KNP on WSL, so I used commands like those in the example above.)

Except for setResultParser(), the values shown in the example above are the defaults.
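
To make "the same command as given to ProcessBuilder" concrete, here is a minimal sketch that sends one line to an external command and collects its output. It is not knp4j's internal code: `CommandSketch`, `viaBash`, and `runOnce` are hypothetical names, and going through `bash -c` simply mirrors the WSL setup described above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

final class CommandSketch {

  /** Wraps a shell line so that it runs through bash, as in the example above. */
  static List<String> viaBash(String shellLine) {
    return List.of("bash", "-c", shellLine);
  }

  /** Starts the command, sends one line on stdin, and returns every stdout line. */
  static List<String> runOnce(List<String> command, String input)
      throws IOException, InterruptedException {
    Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
    // Closing stdin signals end of input, so the child can finish and exit.
    try (Writer writer =
        new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8)) {
      writer.write(input);
      writer.write('\n');
    }
    List<String> lines;
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
      lines = reader.lines().collect(Collectors.toList());
    }
    process.waitFor();
    return lines;
  }
}
```

With this sketch, `CommandSketch.runOnce(CommandSketch.viaBash("jumanpp"), text)` would return the raw output lines, provided `jumanpp` is on the PATH of that shell. A one-shot call like this starts a new process every time, which is the cost that keeping processes alive avoids.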

Features

Run in multiple processes

Multiple processes are started and reused. The maximum number of processes running at the same time can be set separately for JUMAN and KNP.

JUMAN and KNP also have a server mode, but it is not supported at the moment (support is planned).
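
The idea of starting a few workers up front, creating more on demand up to a limit, and reusing them can be sketched roughly as follows. This is an illustration only, not how knp4j is actually implemented: `WorkerPoolSketch` and its methods are hypothetical, the worker type `W` stands in for a long-lived JUMAN or KNP process, and `maxNum` is assumed to be at least 1.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

final class WorkerPoolSketch<W> {
  private final BlockingQueue<W> idle = new LinkedBlockingQueue<>();
  private final AtomicInteger created = new AtomicInteger();
  private final Supplier<W> factory;
  private final int maxNum;

  WorkerPoolSketch(Supplier<W> factory, int startNum, int maxNum) {
    this.factory = factory;
    this.maxNum = maxNum;
    // Workers started at initialization (compare setKnpStartNum).
    for (int i = 0; i < Math.min(startNum, maxNum); i++) {
      created.incrementAndGet();
      idle.add(factory.get());
    }
  }

  /** Borrows a worker, runs the task on it, and always returns the worker to the pool. */
  <R> R withWorker(Function<W, R> task) {
    W worker = borrow();
    try {
      return task.apply(worker);
    } finally {
      idle.add(worker);
    }
  }

  private W borrow() {
    W worker = idle.poll();
    if (worker != null) {
      return worker;
    }
    // No idle worker: create a new one while still below the limit.
    while (true) {
      int n = created.get();
      if (n >= maxNum) {
        break;
      }
      if (created.compareAndSet(n, n + 1)) {
        return factory.get();
      }
    }
    try {
      return idle.take(); // at the limit: wait until another caller returns a worker
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for a worker", e);
    }
  }

  int createdCount() {
    return created.get();
  }
}
```

Callers from a parallel stream would each block in `withWorker` once `maxNum` workers are busy, so the limit also caps how many analyses run at the same time.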

Re-execution when analysis fails

This is not expected to be needed in normal use, but if an IOException or InterruptedException occurs during processing, the process in which the exception occurred is terminated and the analysis is retried on another process.
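
A minimal sketch of this retry behavior, under the assumption that a retry count of 0 means a single attempt: `RetrySketch`, `Attempt`, and `analyzeWithRetry` are hypothetical names, and the real library may handle the failed process differently.

```java
import java.io.IOException;

final class RetrySketch {

  /** One analysis attempt; it may fail the way process I/O can fail. */
  @FunctionalInterface
  interface Attempt<T> {
    T run() throws IOException, InterruptedException;
  }

  /**
   * Tries once plus up to retryNum more times, and falls back to invalidResult
   * if every attempt fails.
   */
  static <T> T analyzeWithRetry(Attempt<T> attempt, int retryNum, T invalidResult) {
    boolean interrupted = false;
    try {
      for (int i = 0; i <= retryNum; i++) {
        try {
          return attempt.run();
        } catch (IOException e) {
          // A real wrapper would terminate the failed process here and let the
          // next iteration run on a different one.
        } catch (InterruptedException e) {
          interrupted = true; // remember the interrupt, but still retry
        }
      }
      return invalidResult;
    } finally {
      if (interrupted) {
        Thread.currentThread().interrupt(); // restore the flag for the caller
      }
    }
  }
}
```

The fallback value plays the role that getInvalidResult() plays in the ResultParser interface: the caller always gets an object back instead of an exception.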

Result Parser

Any parser that implements the ResultParser interface can be used to convert the output. ResultParser defines the following two methods.

ResultParser.java


public interface ResultParser<OutputT> {

  /**
   * Takes the KNP analysis result as input and returns an arbitrary instance.
   *
   * @param list Knp analysis result
   * @return Instance representing the analysis result
   */
  OutputT parse(List<String> list);

  /**
   * Returns the instance to use when parsing fails.
   *
   * @return Instance to return when parsing fails
   */
  OutputT getInvalidResult();
}
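
As a rough example of a custom parser, the sketch below keeps the raw KNP output as a single string. `RawTextParser` is a hypothetical class, and the interface is repeated locally (without its comments) only so that the sketch compiles on its own; with knp4j on the classpath you would implement the library's ResultParser instead.

```java
import java.util.List;

// Local copy of the interface shown above, so the sketch compiles on its own.
interface ResultParser<OutputT> {
  OutputT parse(List<String> list);

  OutputT getInvalidResult();
}

/** Hypothetical parser that keeps the raw KNP output as one string. */
final class RawTextParser implements ResultParser<String> {

  @Override
  public String parse(List<String> list) {
    // Join the output lines back into the text KNP printed.
    return String.join("\n", list);
  }

  @Override
  public String getInvalidResult() {
    return ""; // an empty string marks a failed analysis in this sketch
  }
}
```

An instance of such a class would presumably be passed to setResultParser on a KnpWrapperBuilder<String>, in the same way KnpResultParser is passed in the sample above.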

getInvalidResult() returns the instance to use when a normal analysis result cannot be obtained. It is used when the retry described above also fails, or when KNP itself fails to analyze the input (KNP fails when the text contains a half-width + or *).
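
One way to sidestep the half-width + and * failure is to replace those characters with their full-width forms before analysis. The sketch below assumes that such a replacement is acceptable for your text and makes no claim about whether knp4j does anything similar itself; `KnpInputSanitizer` and `toFullWidth` are hypothetical names.

```java
final class KnpInputSanitizer {

  /** Replaces half-width '+' and '*' with full-width '＋' (U+FF0B) and '＊' (U+FF0A). */
  static String toFullWidth(String text) {
    return text.replace('+', '\uFF0B').replace('*', '\uFF0A');
  }
}
```

Calling `KnpInputSanitizer.toFullWidth(text)` before passing the text to the wrapper keeps the string length unchanged, so character offsets in the result still line up with the original input.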

Checking whether it is faster than a single process

Using the code below, I changed jumanMaxNum and knpMaxNum and compared the execution times (ms). The test machine has a Ryzen 7 3700X CPU, and the heap size is 32 GB.

In this environment, JUMAN and KNP are called on WSL. (WSL is said to have slow I/O, so it may be a little faster in other environments.)

  public static void main(String[] args) {
    long time = System.currentTimeMillis();

    KnpWrapperBuilder<KnpResult> knpWrapperBuilder = new KnpWrapperBuilder<>();
    int jumanMaxNum = 1;
    int knpMaxNum = 1;
    int textSize = 100;
    KnpWrapper<KnpResult> wrapper =
        knpWrapperBuilder
            .setJumanMaxNum(jumanMaxNum)
            .setKnpMaxNum(knpMaxNum)
            .setResultParser(new KnpResultParser())
            .start();
    var sampleText = "I registered for the Advent calendar on a whim, "
        + "but I don't see any sign of making it in time, so today I can only sleep after working for %d hours.";
    var texts =
        IntStream.range(0, textSize)
            .mapToObj(i -> String.format(sampleText, i))
            .collect(Collectors.toList());
    var results =
        texts
            .parallelStream()
            .map(wrapper::analyze)
            .flatMap(List::stream)
            .collect(Collectors.toList());

    System.out.println("time: " + (System.currentTimeMillis() - time));
    System.exit(0);
  }

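Results

| jumanMaxNum | knpMaxNum | 1st (ms) | 2nd (ms) | 3rd (ms) | 4th (ms) | 5th (ms) | Average (ms) |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 17297 | 17320 | 17241 | 17159 | 17421 | 17287.6 |
| 1 | 5 | 2808 | 2764 | 2858 | 2791 | 2789 | 2802 |
| 5 | 1 | 20334 | 20211 | 19974 | 20037 | 20189 | 20149 |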

For now, this shows that running KNP in multiple processes is faster than running both JUMAN and KNP as single processes. On the other hand, unlike KNP, which is the bottleneck, JUMAN seems to get slower when its number of processes is increased.
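
The averages and the resulting speedup can be recomputed from the five measurements with a few lines of code. The sketch below uses the numbers from the 100-text measurement above; `SpeedupSketch` and `mean` are hypothetical names.

```java
import java.util.Arrays;

final class SpeedupSketch {

  /** Arithmetic mean of the measured times in milliseconds. */
  static double mean(long... timesMs) {
    return Arrays.stream(timesMs).average().orElse(Double.NaN);
  }

  public static void main(String[] args) {
    double single = mean(17297, 17320, 17241, 17159, 17421); // jumanMaxNum=1, knpMaxNum=1
    double knpFive = mean(2808, 2764, 2858, 2791, 2789); // jumanMaxNum=1, knpMaxNum=5
    System.out.printf(
        "single: %.1f ms, KNP x5: %.1f ms, speedup: %.2fx%n",
        single, knpFive, single / knpFive);
  }
}
```

For these numbers the ratio comes out at a little over 6x, which matches the averages of 17287.6 ms and 2802 ms reported above.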

To see how much the results depend on the number of JUMAN and KNP processes, I increased the number of texts from 100 to 500 and additionally measured the following combinations.

| jumanMaxNum | knpMaxNum | 1st (ms) | 2nd (ms) | 3rd (ms) | 4th (ms) | 5th (ms) | Average (ms) |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 27953 | 27590 | 27674 | 27999 | 27669 | 27777 |
| 1 | 10 | 15825 | 16366 | 15118 | 15632 | 14931 | 15574.4 |
| 5 | 10 | 18704 | 17778 | 17355 | 16134 | 17254 | 17445 |
| 10 | 10 | 19514 | 19265 | 20459 | 19891 | 19233 | 19672.4 |
| 1 | 15 | 14533 | 22271 | 14187 | 21838 | 19794 | 18524.6 |
| 5 | 15 | 14149 | 14584 | 14929 | 15709 | 15228 | 14919.8 |
| 10 | 15 | 19313 | 17903 | 15478 | 18219 | 16740 | 17530.6 |
| 1 | 20 | 21620 | 14489 | 21960 | 20456 | 15671 | 18839.2 |
| 5 | 20 | 15899 | 15820 | 15713 | 14720 | 17053 | 15841 |
| 10 | 20 | 18850 | 15850 | 18461 | 18200 | 16357 | 17543.6 |

~~I don't really know.~~ For now, the combination with the fastest average in this environment was 5 JUMAN processes and 15 KNP processes. However, I feel that the differences among the combinations [1, 10], [5, 15], and [5, 20] are within the margin of error.

For reference, here are the results of running the following command in the WSL terminal.

time echo "I registered it on the Advent calendar, but I don't see any sign of it in time, so I have to work for an hour today before I can sleep." | jumanpp | knp -tab -print-num -anaphora

| Result (ms) | 1st | 2nd | 3rd | 4th | 5th | Average |
|---|---|---|---|---|---|---|
| real | 220 | 223 | 234 | 219 | 234 | 226 |
| user | 78 | 78 | 63 | 109 | 94 | 84.4 |
| sys | 125 | 125 | 141 | 78 | 125 | 118.8 |

Bonus

It's fun to watch all the CPU cores spin up. (Image: cpu.png)
