```
pip install JapaneseTokenizer
```
A Makefile is provided in the GitHub repository.
If you cannot use make, install manually; refer to the installation section of the README.
The samples below are written for Python 3.x. For a Python 2.x version, see the [example code](https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers/blob/master/examples/examples.py).
The part-of-speech tag sets are summarized in detail on this page. The tag sets for Juman / Juman++ are also described there, so if you want to do part-of-speech filtering with Juman / Juman++, switch to those tags.
Incidentally, you can also use the neologd dictionary with Juman / Juman++. See this article: "I made a script to make the neologd dictionary usable in juman / juman++".
The only difference between MeCab, Juman / Juman++, and Kytea is the class you call; they all inherit from the same common class.
This section introduces how to use version 1.3.1.
```python
import JapaneseTokenizer

# Select a dictionary type. "neologd", "all", "ipadic", "user", "" can be selected.
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
# Define the parts of speech you want to extract: (noun, proper noun) and (adjective, independent).
pos_condition = [('名詞', '固有名詞'), ('形容詞', '自立')]
# "The Islamic Republic of Iran, commonly known as Iran, is an Islamic republic in
# West Asia and the Middle East. It is also called Persia."
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
# Tokenization, part-of-speech filtering, and listing, all in one line
print(mecab_wrapper.tokenize(sentence).filter(pos_condition).convert_list_object())
```
Then the result looks like this:
```
['イラン・イスラム共和国', 'イラン', '西アジア', '中東', 'イスラム共和制', 'ペルシア', 'ペルシャ']
```
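If you want to see how much the dictionary choice matters, you can run the same pipeline with the plain ipadic dictionary (one of the dictType values listed in the comment above). This is just a sketch for comparison; the exact segmentation depends on the dictionaries installed on your machine.

```python
import JapaneseTokenizer

# Same pipeline as above, but with the plain ipadic dictionary instead of neologd.
# neologd tends to keep long named entities such as 'イラン・イスラム共和国' as one token,
# while ipadic usually splits them into shorter units.
ipadic_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
pos_condition = [('名詞', '固有名詞'), ('形容詞', '自立')]
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
print(ipadic_wrapper.tokenize(sentence).filter(pos_condition).convert_list_object())
```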
It is basically the same as MeCab; only the class you call is different.
For Juman
```python
from JapaneseTokenizer import JumanWrapper

tokenizer_obj = JumanWrapper()
# Define the parts of speech you want to extract: proper noun, place name, organization name, common noun.
pos_condition = [('名詞', '固有名詞'), ('名詞', '地名'), ('名詞', '組織名'), ('名詞', '普通名詞')]
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
# Tokenization, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())
```
```
['イラン', 'イスラム', '共和', '国', '通称', 'イラン', '西', 'アジア', '中東', 'イスラム', '共和', '制', '国家', 'ペルシア', 'ペルシャ']
```
For Juman++
```python
from JapaneseTokenizer import JumanppWrapper

tokenizer_obj = JumanppWrapper()
# Define the parts of speech you want to extract: proper noun, place name, organization name, common noun.
pos_condition = [('名詞', '固有名詞'), ('名詞', '地名'), ('名詞', '組織名'), ('名詞', '普通名詞')]
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
# Tokenization, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())
```
```
['イラン', 'イスラム', '共和国', '通称', 'イラン', '西', 'アジア', '中東', 'イスラム', '共和制', '国家', 'ペルシア', 'ペルシャ']
```
In fact, for text as clean as Wikipedia, Juman and Juman++ do not differ that much. Juman++ is a little slow only on the first call, because it takes time to load the model file into memory. From the second call onward the slowness disappears, because the wrapper reuses the process it keeps running.
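Here is a minimal sketch that makes this warm-up behaviour visible; absolute numbers depend on your machine and on which Juman++ model is installed.

```python
import time
from JapaneseTokenizer import JumanppWrapper

tokenizer_obj = JumanppWrapper()
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"

start = time.time()
tokenizer_obj.tokenize(sentence)
print('1st call: {:.2f} sec'.format(time.time() - start))  # slow: the model file is loaded here

start = time.time()
tokenizer_obj.tokenize(sentence)
print('2nd call: {:.2f} sec'.format(time.time() - start))  # fast: the running process is reused
```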
For Kytea
Everything is the same as with MeCab and Juman except the class name.
```python
from JapaneseTokenizer import KyteaWrapper

tokenizer_obj = KyteaWrapper()
# Define the parts of speech you want to extract: nouns only.
pos_condition = [('名詞',)]
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
# Tokenization, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())
```
Previously, I posted an article in which I built something like a binding wrapper for MeCab and was quite satisfied with myself. At that time I made it just for my own use, which was fine. Afterwards, however, I came to think __"I want people to be able to try out and compare morphological analyzers easily"__, and that is what led me to build this package.
This may be limited to the people around me, but the prevailing attitude seems to be: "Morphological analysis? Just use MeCab for now. Is there anything else?"
Searching Qiita gives 347 hits for mecab, but only 17 for juman and just 3 for kytea.
Certainly, I think MeCab is excellent software. But "I don't know anything else, so MeCab is probably the only option, right?" is a different story, I think.
That is why my first motivation was to make the appeal that __"there are options other than MeCab"__.
Recently, I have been attending a Python community of non-Japanese developers living in Japan.
They are interested in Japanese text processing, but they do not know which morphological analyzer is right for them.
They look things up, but they do not really understand the differences, so they end up saying some rather odd things.
Below is the kind of mysterious logic I have heard so far.
I came to think that this kind of mysterious logic appears because the information is not organized anywhere and the tools cannot easily be compared.
Organizing the information itself is difficult, but a common package can at least make comparison easier; that is why I built this. I also tried to write all the documentation in English, hoping to gather as much information as possible in one place.
I designed the wrappers to share the same structure as much as possible, including the interface. The classes that execute processing and the data classes are all common.
The syntax is designed so that you can write preprocessing code as quickly as possible. The result is an interface that handles tokenization and part-of-speech filtering in a single line.
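Because the structure is shared, switching analyzers only means changing the constructor. Here is a minimal sketch that reuses the classes and calls shown above; note that the ('名詞', '固有名詞') condition happens to exist in both the MeCab and Juman tag sets, while more detailed conditions need analyzer-specific tags.

```python
from JapaneseTokenizer import MecabWrapper, JumanWrapper, JumanppWrapper

sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
pos_condition = [('名詞', '固有名詞')]  # proper nouns; this pair is valid for MeCab and Juman alike

# The call chain is identical for every wrapper; only the constructor differs.
for wrapper in [MecabWrapper(dictType='neologd'), JumanWrapper(), JumanppWrapper()]:
    print(wrapper.__class__.__name__,
          wrapper.tokenize(sentence).filter(pos_condition).convert_list_object())
```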
If you like it, please give the GitHub repository a star ☆ :bow_tone1:
I am also looking for people who want to improve it together. I would like to add support for other analyzers as well, such as RakutenMA or ChaSen...