[JAVA] Released a Web API that lets you use SentencePiece like a morphological analyzer

2017/4/13 postscript: I received advice from Mr. taku910 himself. See the comments section for details. This article will be revised later.

2017/4/18 postscript: The public API has been modified. The Unigram model is still being built (training has not finished even after 5 days).

2017/4/26 postscript: Unigram is now available as well (training finally finished after 12 days). The details are written up in another article.

Introduction

I forced this into a Web API, so please point out any mistakes. Google has released SentencePiece, an approach that has been confirmed to be effective for NMT (Neural Machine Translation). This time, I turned it into a Web API that can be used like a morphological analyzer. It is free to use, so please give it a try.

API

SentencePiece API

Sample code

- Java sample

- Official GitHub repository
- Commentary by Mr. Kudo
- Original article of this post

Implementation details

Briefly, there are three steps:

1. Apply SentencePiece to Japanese Wikipedia articles
2. Format the vocab file output by SentencePiece into the mecab-ipadic dictionary format
3. Compile the whole dictionary with kuromoji

Let me explain in a little more detail.

1. Apply SentencePiece to Japanese Wikipedia articles

The Japanese Wikipedia data I had on hand was a slightly old dump (20160915), but I used it as the source data. I cleaned it up with a few modifications. The steps were:

Convert the dump to text with wp2txt and concatenate everything into one file.
Delete lines of 40 characters or fewer.
Delete "Image" and "File" at the beginning of lines.
Shuffle the lines in the file randomly.
Run SentencePiece on the first 1,700,000 lines (about 600 MB).
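
The cleanup steps above can be sketched in Java roughly as follows. This is only an illustration, not the script that was actually used. The class and method names are hypothetical, the fixed random seed is an addition for reproducibility, and the "Image"/"File" rule is interpreted here as dropping whole lines that start with those words. The sketch also reads the whole file into memory, which suits small inputs rather than a multi-gigabyte dump.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: names, the seed parameter, and the prefix rule are assumptions.
final class CorpusPreprocessor {

    static List<String> prepare(List<String> lines, int maxShortLength, int maxLines, long seed) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Drop lines of maxShortLength characters or fewer (counted in code points).
            if (line.codePointCount(0, line.length()) <= maxShortLength) {
                continue;
            }
            // Assumption: lines starting with these words are dropped entirely.
            if (line.startsWith("Image") || line.startsWith("File")) {
                continue;
            }
            kept.add(line);
        }
        // Shuffle, then keep only as many lines as fit in memory during training.
        Collections.shuffle(kept, new Random(seed));
        return new ArrayList<>(kept.subList(0, Math.min(maxLines, kept.size())));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CorpusPreprocessor <wp2txt output> <training input>
        List<String> lines = Files.readAllLines(Path.of(args[0]), StandardCharsets.UTF_8);
        List<String> result = prepare(lines, 40, 1_700_000, 42L);
        Files.write(Path.of(args[1]), result, StandardCharsets.UTF_8);
    }
}
```

The 40-character threshold and the 1,700,000-line cap are the values given in the list above; everything else is a placeholder.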

The reason for all this is that SentencePiece seems to require a lot of memory, and I could not feed it the full text (about 2 GB) on my PC (16 GB of memory). After a lot of trial and error, about 600 MB seems to be the limit for a 16 GB machine. On the other hand, processing is fast as long as the data fits in memory.

SentencePiece was run with the following command. For the execution environment, I installed Cygwin on Windows 10 and ran it there. For the libraries required by the official README, I installed the Cygwin packages that looked equivalent (some versions did not match, but it worked).

$ ./spm_train --input=input.txt --model_prefix=output --vocab_size=8000 --model_type=bpe

The output is a vocab file and a model file. The vocab file contains 8,000 lines, one word fragment per line.

2. Format the vocab file output by SentencePiece into the mecab-ipadic dictionary format

Next, prepare to publish this as a Web API. The Web API response returns an array of "word fragment" and "word fragment ID" pairs. The "word fragment ID" can be used, for example, to build a one-hot vector for machine learning. Of course, you don't have to use it.
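
As a rough illustration of the one-hot use mentioned above, a word fragment ID can be turned into a vector like this. The class and method names are hypothetical. The sketch assumes the ID layout described later in this article: IDs 1 to 8000 for word fragments and 0 for unknown characters, so the vector has one more slot than the vocabulary size.

```java
// Hypothetical helper, not part of the published API.
final class OneHot {

    // Assumption: valid IDs are 0 (unknown) through vocabSize inclusive.
    static float[] encode(int wid, int vocabSize) {
        if (wid < 0 || wid > vocabSize) {
            throw new IllegalArgumentException("wid out of range: " + wid);
        }
        float[] vector = new float[vocabSize + 1]; // index 0 is reserved for unknown characters
        vector[wid] = 1.0f;
        return vector;
    }
}
```

For example, `OneHot.encode(5578, 8000)` would give an 8,001-element vector with a single 1 at index 5578.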

Now, let's consider the implementation. Apitore, the Web API marketplace I run, is implemented entirely in Java, but SentencePiece is written in C++. My thinking was: "Writing a wrapper is a hassle, and I don't really know how to build a Web API in C++, so I'll have to force it somehow! SentencePiece is like morphological analysis, after all." So I decided to use kuromoji, the Java morphological analyzer I always rely on. kuromoji is essentially a Java counterpart to the famous mecab, and mecab was developed by Mr. Kudo, who also created SentencePiece. It's all connected!

So this time I added the SentencePiece output to the existing kuromoji as a new dictionary. There are a few tricks to this, though. The dictionary format looks like this.

#Surface form,Left context ID,Right context ID,Cost,Part of speech,Part of speech subcategory 1,Part of speech subcategory 2,Part of speech subcategory 3,Conjugation type,Conjugation form,Base form,Reading,Pronunciation
Be done,1,1,1,SPWORD,1,*,*,*,*,*,*,*

The "surface form" is the "word fragment" output by SentencePiece. The key point is to set the "cost" to "1". With a cost of "1", the SentencePiece word fragments will almost certainly be selected during morphological analysis. To be on the safe side, also change every connection cost to 1 in `matrix.def`, which defines the connection costs between contexts. This lets the analyzer "connect SentencePiece word fragments without considering context". Because context no longer matters, the "context ID" can be anything; here it is set to "1".
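
The `matrix.def` change could be sketched like this. It is a minimal illustration, not the procedure actually used. It assumes the usual mecab layout, where the first line holds the two context-table sizes and every other line is a "left ID, right ID, cost" triple separated by whitespace. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that overwrites every connection cost with one fixed value.
final class MatrixFlattener {

    static List<String> flatten(List<String> lines, int cost) {
        List<String> out = new ArrayList<>(lines.size());
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            String[] fields = line.trim().split("\\s+");
            // Assumption: line 0 is the size header; anything that is not a triple is kept as-is.
            if (i == 0 || fields.length != 3) {
                out.add(line);
                continue;
            }
            out.add(fields[0] + " " + fields[1] + " " + cost);
        }
        return out;
    }
}
```

Calling `flatten(lines, 1)` on the lines of `matrix.def` leaves the header untouched and sets the third field of every other line to 1.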

The "word fragment ID" is stored in "Part of speech subcategory 1". It is a unique ID that I assigned to each of the 8,000 word fragments output by SentencePiece (8,000 IDs in total, numbered 1 to 8000). The "part of speech" of SentencePiece fragments is "SPWORD". This part of speech is used to detect words that SentencePiece does not cover. To explain briefly: SentencePiece's word fragments are learned from training data, so naturally it cannot handle characters that never appeared in that data. I decided to detect such unknown characters with the conventional kuromoji. When the part of speech is not "SPWORD" (I expect unknown characters are almost always classified as "unknown word"), the word fragment ID is set to "0". This makes it possible to handle unknown characters.
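
Putting the dictionary rules together, the conversion from vocab file to dictionary rows, and the lookup of the word fragment ID afterwards, might look roughly like this. It is a sketch under assumptions, not the actual conversion code. It assumes each vocab line is a fragment optionally followed by a tab and a score. It also skips fragments containing a comma, since they cannot be written as a bare CSV field, which would shift later IDs. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for building SPWORD dictionary rows and resolving word fragment IDs.
final class SpDictionary {

    static final String POS = "SPWORD";

    // One dictionary row per vocab line: cost and both context IDs fixed to 1, IDs starting at 1.
    static List<String> toDictionaryRows(List<String> vocabLines) {
        List<String> rows = new ArrayList<>();
        int id = 1;
        for (String line : vocabLines) {
            // Assumption: "<fragment>\t<score>"; only the fragment is used.
            String surface = line.split("\t", 2)[0];
            if (surface.isEmpty() || surface.contains(",")) {
                continue; // assumption: skip fragments that cannot be a plain CSV field
            }
            rows.add(surface + ",1,1,1," + POS + "," + id + ",*,*,*,*,*,*,*");
            id++;
        }
        return rows;
    }

    // Word fragment ID for one analyzed token: subcategory 1 for SPWORD, otherwise 0 (unknown).
    static int wordId(String partOfSpeech, String subcategory1) {
        return POS.equals(partOfSpeech) ? Integer.parseInt(subcategory1) : 0;
    }
}
```

`wordId` would be applied to the part-of-speech fields that the analyzer reports for each token, so any token not backed by an SPWORD entry falls through to 0.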

3. Compile the whole dictionary with kuromoji

All that's left is to compile. Build kuromoji as usual. The test code bundled with kuromoji will no longer pass, so delete the tests.

Trying it out

The API is available here. For the preparation needed before calling the API (API registration, access token issuance, running the sample), please refer to here.

The API input/output specification is published here. For reference, the response format is shown below. The input is text.

{
  "endTime": "string",
  "log": "string",
  "processTime": "string",
  "startTime": "string",
  "tokens": [
    {
      "token": "string",
      "wid": 0
    }
  ]
}
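
On the client side, the response shape above could be modeled with Java records like the following. This is only a sketch: the type names and the `wordIds` helper are hypothetical, JSON parsing is left to whatever library you use, and records require Java 16 or later.

```java
import java.util.List;

// One element of the "tokens" array: a word fragment and its ID.
record Token(String token, int wid) {}

// Mirrors the fields of the response format shown above.
record SentencePieceResponse(String endTime, String log, String processTime,
                             String startTime, List<Token> tokens) {

    // Word fragment IDs in input order, e.g. as input to a downstream model.
    int[] wordIds() {
        return tokens.stream().mapToInt(Token::wid).toArray();
    }
}
```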

Let's look at some actual examples. First I entered "I am a cat. I don't have a name yet." The segmentation is indeed a little different from ordinary morphological analysis.

"tokens": [
  {
    "wid": 5578,
    "token": "I"
  },
  {
    "wid": 5386,
    "token": "Employee"
  },
  {
    "wid": 472,
    "token": "Is"
  },
  {
    "wid": 5643,
    "token": "Cat"
  },
  {
    "wid": 11,
    "token": "Is"
  },
  {
    "wid": 3796,
    "token": "。"
  },
  {
    "wid": 2002,
    "token": "name"
  },
  {
    "wid": 472,
    "token": "Is"
  },
  {
    "wid": 1914,
    "token": "yet"
  },
  {
    "wid": 26,
    "token": "Absent"
  },
  {
    "wid": 3796,
    "token": "。"
  }
]

Next, "WRYYYYYYYYYY! The highest one is Aaaa". It was broken up into tiny pieces.

"tokens": [
  {
    "wid": 829,
    "token": "W"
  },
  {
    "wid": 589,
    "token": "R"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 3032,
    "token": "Y"
  },
  {
    "wid": 0,
    "token": "！"
  },
  {
    "wid": 799,
    "token": "Best"
  },
  {
    "wid": 2689,
    "token": "To"
  },
  {
    "wid": 646,
    "token": "Yes"
  },
  {
    "wid": 9,
    "token": "What"
  },
  {
    "wid": 3880,
    "token": "Or"
  },
  {
    "wid": 3888,
    "token": "Tsu"
  },
  {
    "wid": 3914,
    "token": "Is"
  },
  {
    "wid": 1726,
    "token": "A"
  },
  {
    "wid": 1726,
    "token": "A"
  },
  {
    "wid": 1726,
    "token": "A"
  }
]

Finally, I entered "To overcome 'fear' is to 'live'". The segmentation is quite distinctive.

"tokens": [
  {
    "wid": 648,
    "token": "「"
  },
  {
    "wid": 5092,
    "token": "Scary"
  },
  {
    "wid": 5725,
    "token": "Scary"
  },
  {
    "wid": 3846,
    "token": "」"
  },
  {
    "wid": 2163,
    "token": "To"
  },
  {
    "wid": 5711,
    "token": "Katsu"
  },
  {
    "wid": 4840,
    "token": "clothes"
  },
  {
    "wid": 543,
    "token": "To do"
  },
  {
    "wid": 648,
    "token": "「"
  },
  {
    "wid": 2859,
    "token": "Live"
  },
  {
    "wid": 3798,
    "token": "Ru"
  },
  {
    "wid": 3846,
    "token": "」"
  },
  {
    "wid": 12,
    "token": "thing"
  }
]

Conclusion

I turned SentencePiece into a Web API. It can of course be used for translation, and since it works with seq2seq, conversion between standard language and dialects also seems feasible. I'm thinking of using it for polarity classification. Right now I feed Word2Vec results into an RNN + LSTM, but wouldn't it work well to feed the RNN + LSTM one-hot vectors built from SentencePiece word fragments instead?