I made konoha, a library that lets you switch tokenizers with ease

TL;DR

Introducing konoha (formerly tiny_tokenizer), a library for tokenizing sentences. You can use it as shown below.

from konoha import WordTokenizer

# "I'm studying natural language processing"
sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))  # -> [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Kytea')
print(tokenizer.tokenize(sentence))  # -> [自然, 言語, 処理, を, 勉強, し, て, い, ま, す]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))  # -> [▁, 自然, 言語, 処理, を, 勉強, し, ています]

Introduction: What is a tokenizer?

Unlike languages such as English, Japanese has no explicit delimiters at word boundaries. For this reason, analyzing Japanese text first requires dividing each sentence into units of some kind (for example, words). This division can be done at several granularities: into words, into subwords that break words down further, or into the individual characters of the sentence. In this article, the substrings produced by any of these divisions are called tokens. There are various methods for tokenization. Morphological analyzers, which are widely used for analyzing Japanese text, also perform word segmentation. (In addition to word segmentation, morphological analysis also estimates the lemma and the part-of-speech tag of each word.)

For word-level segmentation, there are algorithms such as MeCab, which builds a lattice using a dictionary and then determines the optimal word sequence, and Kytea, which determines word boundaries at the character level. These algorithms may return the same segmentation or different ones. Also, even with the same segmentation algorithm, the unit of a word changes if the part-of-speech system is different.

- IPADic part-of-speech system
- UniDic part-of-speech system
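To make the lattice idea concrete, here is a minimal sketch in plain Python. It is not MeCab's actual algorithm or cost model: the two dictionaries, their costs, and the names segment and UNKNOWN_CHAR_COST are all made up for illustration. It only shows how a dictionary defines the candidate words, how the cheapest path through them is chosen, and how a different dictionary changes the resulting word units.

```python
from typing import Dict, List, Optional, Tuple

# Toy dictionaries (hypothetical): surface form -> cost, lower is preferred.
FINE_DICT: Dict[str, int] = {"自然": 2, "言語": 2, "処理": 2, "を": 1, "勉強": 2}
COARSE_DICT: Dict[str, int] = {**FINE_DICT, "自然言語": 3}

UNKNOWN_CHAR_COST = 5  # assumed fallback cost for a character not in the dictionary


def segment(sentence: str, dictionary: Dict[str, int]) -> List[str]:
    """Return the minimum-cost segmentation of `sentence` under `dictionary`."""
    n = len(sentence)
    # best[i] = (cost of the cheapest path covering sentence[:i],
    #            start index of the last word on that path)
    best: List[Optional[Tuple[int, int]]] = [None] * (n + 1)
    best[0] = (0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            word = sentence[start:end]
            if word in dictionary:
                cost = dictionary[word]
            elif end - start == 1:
                cost = UNKNOWN_CHAR_COST  # unknown single character
            else:
                continue  # not a candidate word
            total = best[start][0] + cost
            if best[end] is None or total < best[end][0]:
                best[end] = (total, start)

    # Walk back from the end of the sentence to recover the words.
    words: List[str] = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(sentence[start:pos])
        pos = start
    return words[::-1]


print(segment("自然言語処理を勉強", FINE_DICT))    # ['自然', '言語', '処理', 'を', '勉強']
print(segment("自然言語処理を勉強", COARSE_DICT))  # ['自然言語', '処理', 'を', '勉強']
```

The same algorithm yields different tokens once the dictionary treats 自然言語 as a single entry, which is the effect described above for different part-of-speech systems.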

Subwords are units obtained by dividing words further. Their effectiveness has been confirmed in neural machine translation. Sentencepiece is well known as a typical subword tokenizer used for Japanese text analysis.

- Detailed description of MeCab (word segmentation is also mentioned)
- Explanation of Kytea's word segmentation method
- Comparison of various part-of-speech systems and tokenizers
- Explanation of Sentencepiece
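As a rough illustration of what "dividing words further" means, the following sketch splits a word by greedy longest match against a small, hand-written subword vocabulary. This is not how Sentencepiece works (Sentencepiece learns its vocabulary from a corpus); the function name split_into_subwords and the vocabulary are hypothetical and exist only for this example.

```python
from typing import List, Set


def split_into_subwords(word: str, vocab: Set[str]) -> List[str]:
    """Greedily take the longest piece found in `vocab`, falling back to one character."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or is a single character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


toy_vocab = {"token", "izer", "ization"}  # hand-written for illustration
print(split_into_subwords("tokenizer", toy_vocab))     # ['token', 'izer']
print(split_into_subwords("tokenization", toy_vocab))  # ['token', 'ization']
print(split_into_subwords("tokens", toy_vocab))        # ['token', 's']
```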

Which tokenizer should you use?

Those who do Japanese text mining probably use MeCab + NEologd most of the time. In research, Kytea is often used. In addition, Ginza, a dependency analysis library that has been talked about recently, uses a morphological analyzer called SudachiPy (a Python implementation of the morphological analyzer Sudachi). As these examples show, various analyzers are used for word-level analysis, and it is difficult to determine which analyzer is best for you.

Furthermore, in recent years, mainly in the context of machine translation, it has been reported that task performance is better when tokens are split into subwords than when a morphological analyzer's word segmentation is used, and subword-based segmentation is often adopted.

Character-level tokenization is characterized by the small number of character types. The number of word types is generally much larger than the number of character types, so tokenizing at the character level reduces the vocabulary size. For example, in research on named entity extraction, there is an approach that adds character-level features to the LSTM features, and many recent studies have adopted it. (The paper listed in the link is old, but it is my favorite paper.)
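The vocabulary-size argument can be checked on a toy example. The word list below is hand-picked so that many words share a few characters; it is an assumption made for illustration, not real corpus data, and on a very small or unrelated set of words the character vocabulary can just as well be the larger one.

```python
# Hand-picked, already segmented words that share a small set of kanji.
words = [
    "大学", "大学生", "大学院", "大学院生",
    "小学校", "小学生", "中学校", "中学生",
    "高校", "高校生", "学校", "学生",
]

word_vocab = set(words)                           # one entry per distinct word
char_vocab = {ch for word in words for ch in word}  # one entry per distinct character

print(len(word_vocab))  # 12
print(len(char_vocab))  # 8
```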

In this situation, in what unit should we tokenize a sentence? As far as I understand, the general answer to this question is "task dependent", and there is no single right answer. For this reason, choices such as "I use MeCab + NEologd because many people use it" or "The paper uses subwords, so I will use subwords for the time being" tend to be made.

In modern natural language processing tasks (especially when using neural networks), there are many things that need as much attention as tokenization (for example: architecture, hidden layer dimensions, optimizer, learning rate, and so on). Against this background, I think the current situation is that the tokenization method is often decided more or less arbitrarily at the very beginning of tackling a problem. But might other tokenization methods be worth trying? I think it is worthwhile to try different tokenization methods, so I developed a library that makes it easy to switch between them. This is the reason konoha was developed.

Various tokenizers and various APIs

Switching tokenizers often comes with a small cost. All of the morphological analyzers and tokenizers mentioned above have Python wrappers, and users can use these tools from Python by installing the wrapper libraries.

However, each wrapper library provides an API with its own idioms. (I think this is natural, because each analyzer and each wrapper library has a different author.) Therefore, if you want to switch between the outputs of multiple analyzers depending on the situation, you need to implement your own layer to absorb the differences between those APIs.
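What such an absorbing layer looks like can be sketched as follows. The two backend classes are stand-ins written for this example, not real wrapper libraries: one returns a single space-separated string and the other returns a list of objects, which imitates the kind of API differences described above. The names SpaceJoinedBackend, ObjectListBackend, ADAPTERS, and tokenize are all hypothetical.

```python
from typing import Callable, Dict, List


class SpaceJoinedBackend:
    """Stand-in for a wrapper whose parse() returns one space-separated string."""

    def parse(self, text: str) -> str:
        return " ".join(text.split())


class Morpheme:
    """Stand-in for a result object that exposes the token text as .surface."""

    def __init__(self, surface: str) -> None:
        self.surface = surface


class ObjectListBackend:
    """Stand-in for a wrapper whose analyze() returns a list of objects."""

    def analyze(self, text: str) -> List[Morpheme]:
        return [Morpheme(ch) for ch in text if not ch.isspace()]


# The adapter layer: every backend is reduced to "str in, list of str out".
ADAPTERS: Dict[str, Callable[[str], List[str]]] = {
    "space_joined": lambda text: SpaceJoinedBackend().parse(text).split(),
    "object_list": lambda text: [m.surface for m in ObjectListBackend().analyze(text)],
}


def tokenize(text: str, backend: str) -> List[str]:
    """Tokenize `text` with the named backend through one unified call."""
    return ADAPTERS[backend](text)


print(tokenize("natural language", "space_joined"))  # ['natural', 'language']
print(tokenize("ab c", "object_list"))               # ['a', 'b', 'c']
```

Calling code only ever sees tokenize, so switching analyzers means changing one string instead of rewriting the code that consumes the tokens.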

Prior work

There is a library called JapaneseTokenizer (GitHub repository: Kensuke-Mitsuzawa / JapaneseTokenizers). Like konoha, JapaneseTokenizer provides wrappers for multiple tokenizers, with an interface that handles multiple morphological analyzers. It also implements many practical functions that are useful for text analysis, such as filtering analysis results by specific part-of-speech tags. It is a very convenient tool for natural language processing that makes use of the results of multiple morphological analyzers.

konoha

In contrast, konoha does not currently provide functions such as part-of-speech filtering. konoha is a library that abstracts the tokenization process of each analyzer. It also provides subword-level and character-level segmentation, which JapaneseTokenizer does not target.

This library is positioned as a wrapper around Python wrappers. With gratitude to everyone who has provided Python wrappers for the analyzers, the purpose of this library is to absorb the differences between the interfaces of those libraries. By using konoha, users can use multiple analyzers through a unified API.

Tokenization

First, here is an example using MeCab. In this example, mecab-ipadic is used as the dictionary. If you are using macOS, please install mecab and mecab-ipadic; if you are using Ubuntu, please install libmecab-dev in addition to these. Operation has not been verified on other distributions. (If mecab and mecab-config can be run and a dictionary is installed, it should work fine.) Alternatively, if you build the Dockerfile in the GitHub repository, you will have an environment that is ready to go.

--Code

from konoha import WordTokenizer

# "I'm studying natural language processing"
sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))

--Output

[自然, 言語, 処理, を, 勉強, し, て, い, ます]

Next, let's use Kytea. Here too, Kytea itself needs to be built beforehand. (Again, please refer to the Dockerfile in the repository.)

--Code

from konoha import WordTokenizer

# "I'm studying natural language processing"
sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('Kytea')
print(tokenizer.tokenize(sentence))

--Output

[自然, 言語, 処理, を, 勉強, し, て, い, ま, す]

If you want to divide a sentence into subwords, you can use Sentencepiece. When using Sentencepiece, you need to specify a model file: pass the path to the model file as model_path.

--Code

from konoha import WordTokenizer

# "I'm studying natural language processing"
sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))

--Output

[▁, 自然, 言語, 処理, を, 勉強, し, ています]

In this way, multiple analyzers can be used in a unified manner simply by changing the value of the argument passed to WordTokenizer. This makes it easy to experiment with different tokenizers during the experimental phase.

Part of speech estimation

Some tokenizers are morphological analyzers. Of the tokenizers currently supported by konoha, MeCab, Kytea, and Sudachi (SudachiPy) are morphological analyzers. For these, an option controls whether the information provided by the morphological analyzer, such as part-of-speech tags, is obtained when tokenizing.

An example of using SudachiPy is shown below.

--Code

from konoha import WordTokenizer

# "I'm studying natural language processing"
sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('Sudachi', mode='A', with_postag=True)
print(tokenizer.tokenize(sentence))

--Output

[自然 (名詞), 言語 (名詞), 処理 (名詞), を (助詞), 勉強 (名詞), し (動詞), て (助詞), い (動詞), ます (助動詞)]

(名詞: noun, 助詞: particle, 動詞: verb, 助動詞: auxiliary verb)

tokenizer.tokenize returns a list of instances of the Token class. The Token class defines the following instance variables. (Excerpt from the docstring of the Token class)

Information that the analyzer does not return is None. For example, token.normalized_form is not None only when SudachiPy is used and with_postag is True. (Here, token is one element of the list of tokens returned by tokenizer.tokenize.)

"""
surface (str)
    surface (original form) of a word
postag (str, default: None)
    part-of-speech tag of a word (optional)
postag2 (str, default: None)
    detailed part-of-speech tag of a word (optional)
postag3 (str, default: None)
    detailed part-of-speech tag of a word (optional)
postag4 (str, default: None)
    detailed part-of-speech tag of a word (optional)
inflection (str, default: None)
    conjugate type of word (optional)
conjugation (str, default: None)
    conjugate type of word (optional)
base_form (str, default: None)
    base form of a word
yomi (str, default: None)
    yomi of a word (optional)
pron (str, default: None)
    pronunciation of a word (optional)
normalized_form (str, default: None)
    normalized form of a word (optional)
    Note that normalized_form is only
    available on SudachiPy
"""

Use your own user dictionary (MeCab)

If you want to use a user dictionary, pass the path to the user dictionary as the user_dictionary_path argument of WordTokenizer.

from konoha import WordTokenizer

# "I'm studying natural language processing"
sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab', user_dictionary_path="path/to/user_dict")
print(tokenizer.tokenize(sentence))

Use your own system dictionary (MeCab)

If you want to use mecab-ipadic-NEologd, or a system dictionary that you retrained yourself on a corpus, you can create a tokenizer by specifying a system dictionary. WordTokenizer accepts an argument named system_dictionary_path; give it the path to the system dictionary you want to use.

from konoha import WordTokenizer

# "I'm studying natural language processing"
sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab', system_dictionary_path="path/to/system_dict")
print(tokenizer.tokenize(sentence))

Summary

In this article, I introduced konoha, a library for using multiple tokenizers through the same interface. With this library, you can easily switch analyzers when you are unsure which one to use at the start of a text analysis project. Also, even when you plan to experiment with MeCab but prior work uses another analyzer, so that you have to write code for that analyzer for comparison, putting konoha in between lets you run the experiments with the same code without hassle. I hope it helps both people who do natural language processing in practice and those who do it in research. Please give it a try if you like. Thank you.


To build Kytea on Ubuntu 18.04, this pull request needs to be applied. Please refer to the Dockerfile in the konoha repository.
