[JAVA] Continued: Published a Web API that lets you use SentencePiece like a morphological analyzer

Introduction

This is a follow-up on the Web API I released the other day, which lets you use SentencePiece like a morphological analyzer. I received various advice from Mr. Kudo, the author of SentencePiece. My previous post contained some incorrect implementations, so I have fixed them. You can use the API for free from here.

API

Sample code

-Java sample

Related articles

-Official GitHub repository
-Commentary by Mr. Kudo
-My blog
-Discussion on Qiita

What I did

Only the differences from the previous post are described here.

First, SentencePiece offers several model types for training. This time I tried Unigram and BPE.

Unigram mode

I received the following comments from Mr. Kudo.

With unigram, if you multiply the log-likelihoods in the vocabulary table by -1 to turn them into integer costs, and turn off unknown-word processing, the result will in principle be the same.

So I did exactly that. More precisely, I multiplied the log-likelihoods in the vocabulary table by -100, rounded them to integer costs, and added the entries to the kuromoji / MeCab dictionary. For unknown words, I decided to run kuromoji in Extended mode. The word cost of unknown words is much higher than that of the entries from the SentencePiece vocabulary table, so the text is (probably) segmented almost entirely with the SentencePiece vocabulary. I use Extended mode so that unknown words are still split off when they appear.
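The cost conversion described above can be sketched as follows. This is only an illustration, not the code behind the API: the class and method names are hypothetical, and it assumes the vocabulary table is a text file with one `piece<TAB>score` pair per line, where the score is the log-likelihood. Writing the costs into an actual kuromoji / MeCab dictionary is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns unigram log-likelihoods into integer word costs.
class UnigramCostSketch {
    // Scale factor from the article: cost = round(-100 * logLikelihood).
    static final double SCALE = -100.0;

    static int toCost(double logLikelihood) {
        return (int) Math.round(SCALE * logLikelihood);
    }

    // Assumes each line looks like "piece<TAB>score"; other lines are skipped.
    static Map<String, Integer> toCosts(Iterable<String> vocabLines) {
        Map<String, Integer> costs = new LinkedHashMap<>();
        for (String line : vocabLines) {
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue;
            }
            costs.put(cols[0], toCost(Double.parseDouble(cols[1])));
        }
        return costs;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        System.out.println(toCosts(Arrays.asList("tokyo\t-3.5", "ing\t-8.25")));
        // prints {tokyo=350, ing=825}
    }
}
```

A likelier piece (log-likelihood closer to 0) gets a lower cost, so a minimum-cost lattice search prefers it, which is what the comment from Mr. Kudo relies on.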

BPE mode

I also received a comment from Mr. Kudo on this.

Splitting with BPE is not too difficult if a naive implementation is enough. Try concatenating two characters, and if the result is in the dictionary, replace the two characters with the new symbol. If there are multiple places that could be replaced, replace them in order of priority (the one registered first has priority). https://ja.wikipedia.org/wiki/%E3%83%90%E3%82%A4%E3%83%88%E5%AF%BE%E7%AC%A6%E5%8F%B7%E5%8C%96 Concatenate two consecutive characters and look the result up in the dictionary. If it is found, join the two characters and treat them as a single character from then on. Repeat until nothing more can be found in the dictionary. The naive implementation scans every pair of consecutive characters each time, so it is O(n^2), but with a heap it is O(n log n).

I didn't know how to do it with a heap, so I implemented it the straightforward way: "Treat the vocabulary table as a list of rules, and apply the rules in order from the top." For example, suppose the rules are defined as follows:
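One way to read the naive procedure in this comment is the loop below. It is a minimal sketch under my own assumptions, not SentencePiece's actual code: the class name is hypothetical, the vocabulary is a plain list whose order is the priority, and the input starts out as single characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the naive O(n^2) merge loop.
class NaiveBpeSketch {
    static List<String> segment(String text, List<String> vocab) {
        // Rank = position in the vocabulary; a smaller rank means higher priority.
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < vocab.size(); i++) {
            rank.putIfAbsent(vocab.get(i), i);
        }
        // Start from single characters (code points keep surrogate pairs intact).
        List<String> symbols = new ArrayList<>();
        text.codePoints().forEach(cp -> symbols.add(new String(Character.toChars(cp))));

        while (true) {
            // Scan every adjacent pair and remember the highest-priority one.
            int bestIndex = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i + 1 < symbols.size(); i++) {
                Integer r = rank.get(symbols.get(i) + symbols.get(i + 1));
                if (r != null && r < bestRank) {
                    bestRank = r;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) {
                return symbols; // no adjacent pair is in the dictionary any more
            }
            // Replace the two symbols with the merged one.
            symbols.set(bestIndex, symbols.get(bestIndex) + symbols.get(bestIndex + 1));
            symbols.remove(bestIndex + 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(segment("abcde", Arrays.asList("ab", "cd", "abcd")));
        // prints [abcd, e]
    }
}
```

Each pass rescans the whole sequence, which is where the O(n^2) comes from. The heap variant would keep the candidate pairs in a priority queue instead of rescanning.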

Ai
up
Ah

If the input is "aiueo", then the output will be "aiueo". As a somewhat extreme example, with the rules in a different order as shown below,

Ai
Ah
up

The output will be "Aieo".
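The "apply the rules in order from the top" idea can be sketched as below. This is my own illustration, not the code running behind the API: the class name is hypothetical, and the rules "ab" and "bc" are made-up stand-ins rather than entries from the real vocabulary table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one pass over the rule list, top to bottom.
class RuleOrderBpeSketch {
    static List<String> segment(String text, List<String> rules) {
        // Start from single characters.
        List<String> symbols = new ArrayList<>();
        text.codePoints().forEach(cp -> symbols.add(new String(Character.toChars(cp))));

        for (String rule : rules) {
            // Merge every adjacent pair whose concatenation equals this rule.
            for (int i = 0; i + 1 < symbols.size(); i++) {
                if ((symbols.get(i) + symbols.get(i + 1)).equals(rule)) {
                    symbols.set(i, rule);
                    symbols.remove(i + 1);
                }
            }
        }
        return symbols;
    }

    public static void main(String[] args) {
        System.out.println(segment("abcd", Arrays.asList("ab", "bc"))); // prints [ab, c, d]
        System.out.println(segment("abcd", Arrays.asList("bc", "ab"))); // prints [a, bc, d]
    }
}
```

Swapping the order of the two rules changes the segmentation, which is why the order of the vocabulary table matters here.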

So how did the result change?

It was almost the same.

In the example from the previous post, the result was the same whether I used Unigram or BPE. I also compared the Unigram and BPE vocabulary tables, and they are fairly close. The Unigram model was probably trained on more data, which would explain the differences from BPE.

In any case, I believe the implementation is now accurate (I think), so you should be able to use it with confidence.

In conclusion

As an aside, Unigram mode uses less memory than BPE mode, so I fed the entire Wikipedia text into training. As a result, the computation ran for 12 days... The electricity bill... So, the result of those 12 days of computation is available for free on Apitore. Please make use of it.
