```
pip install JapaneseTokenizer
```
A Makefile is provided in the GitHub repository.
If you cannot use make, install manually; refer to the installation section of the README.
The samples below are written for Python 3.x. For a Python 2.x version, see the [example code](https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers/blob/master/examples/examples.py).
The part-of-speech tag sets are summarized in detail on this page. The tag sets for Juman / Juman++ are also described there, so if you want to do part-of-speech filtering with Juman / Juman++, switch to those tags.
Incidentally, you can also use the neologd dictionary with Juman / Juman++. See this article: "I made a script to make the neologd dictionary usable in juman / juman++".
The only difference between MeCab, Juman / Juman++, and Kytea is the class you call; they all inherit from the same common class.
This section introduces how to use version 1.3.1.
```python
import JapaneseTokenizer

# Select a dictionary type. "neologd", "all", "ipadic", "user", "" can be selected.
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
# Define the parts of speech you want to extract: (noun, proper noun) and (adjective, independent).
pos_condition = [('名詞', '固有名詞'), ('形容詞', '自立')]
# "The Islamic Republic of Iran, commonly known as Iran, is an Islamic republic in
# West Asia and the Middle East. It is also called Persia."
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
# Tokenization, part-of-speech filtering, and listing, all in one line
print(mecab_wrapper.tokenize(sentence).filter(pos_condition).convert_list_object())
```
Then the result looks like this:
```
['イラン・イスラム共和国', 'イラン', '西アジア', '中東', 'イスラム共和制', 'ペルシア', 'ペルシャ']
```
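If you want to see how much the dictionary choice matters, you can run the same pipeline with the plain ipadic dictionary (one of the dictType values listed in the comment above). This is just a sketch for comparison; the exact segmentation depends on the dictionaries installed on your machine.

```python
import JapaneseTokenizer

# Same pipeline as above, but with the plain ipadic dictionary instead of neologd.
# neologd tends to keep long named entities such as 'イラン・イスラム共和国' as one token,
# while ipadic usually splits them into shorter units.
ipadic_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
pos_condition = [('名詞', '固有名詞'), ('形容詞', '自立')]
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
print(ipadic_wrapper.tokenize(sentence).filter(pos_condition).convert_list_object())
```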
It is basically the same as MeCab; only the class you call is different.
For Juman
```python
from JapaneseTokenizer import JumanWrapper

tokenizer_obj = JumanWrapper()
# Define the parts of speech you want to extract: proper noun, place name, organization name, common noun.
pos_condition = [('名詞', '固有名詞'), ('名詞', '地名'), ('名詞', '組織名'), ('名詞', '普通名詞')]
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
# Tokenization, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())
```
```
['イラン', 'イスラム', '共和', '国', '通称', 'イラン', '西', 'アジア', '中東', 'イスラム', '共和', '制', '国家', 'ペルシア', 'ペルシャ']
```
For Juman++
```python
from JapaneseTokenizer import JumanppWrapper

tokenizer_obj = JumanppWrapper()
# Define the parts of speech you want to extract: proper noun, place name, organization name, common noun.
pos_condition = [('名詞', '固有名詞'), ('名詞', '地名'), ('名詞', '組織名'), ('名詞', '普通名詞')]
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
# Tokenization, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())
```
```
['イラン', 'イスラム', '共和国', '通称', 'イラン', '西', 'アジア', '中東', 'イスラム', '共和制', '国家', 'ペルシア', 'ペルシャ']
```
In fact, for text as clean as Wikipedia, Juman and Juman++ do not differ that much. Juman++ is a little slow only on the first call, because it takes time to load the model file into memory. From the second call onward the slowness disappears, because the wrapper reuses the process it keeps running.
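Here is a minimal sketch that makes this warm-up behaviour visible; absolute numbers depend on your machine and on which Juman++ model is installed.

```python
import time
from JapaneseTokenizer import JumanppWrapper

tokenizer_obj = JumanppWrapper()
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"

start = time.time()
tokenizer_obj.tokenize(sentence)
print('1st call: {:.2f} sec'.format(time.time() - start))  # slow: the model file is loaded here

start = time.time()
tokenizer_obj.tokenize(sentence)
print('2nd call: {:.2f} sec'.format(time.time() - start))  # fast: the running process is reused
```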
For Kytea
Everything is the same as with MeCab and Juman except the class name.
```python
from JapaneseTokenizer import KyteaWrapper

tokenizer_obj = KyteaWrapper()
# Define the parts of speech you want to extract: nouns only.
pos_condition = [('名詞',)]
sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
# Tokenization, part-of-speech filtering, and listing, all in one line
print(tokenizer_obj.tokenize(sentence).filter(pos_condition).convert_list_object())
```
Previously, I posted an article in which I built something like a binding wrapper for MeCab and was quite satisfied with myself. At that time I made it just for my own use, which was fine. Afterwards, however, I came to think __"I want people to be able to try out and compare morphological analyzers easily"__, and that is what led me to build this package.
This may be limited to the people around me, but the prevailing attitude seems to be: "Morphological analysis? Just use MeCab for now. Is there anything else?"
Searching Qiita gives 347 hits for mecab, but only 17 for juman and just 3 for kytea.
Certainly, I think MeCab is excellent software. But "I don't know anything else, so MeCab is probably the only option, right?" is a different story, I think.
That is why my first motivation was to make the appeal that __"there are options other than MeCab"__.
Recently, I have been attending a Python community of non-Japanese developers living in Japan.
They are interested in Japanese text processing, but they do not know which morphological analyzer is right for them.
They look things up, but they do not really understand the differences, so they end up saying some rather odd things.
Below is the kind of mysterious logic I have heard so far.
I came to think that this kind of mysterious logic appears because the information is not organized anywhere and the tools cannot easily be compared.
Organizing the information itself is difficult, but a common package can at least make comparison easier; that is why I built this. I also tried to write all the documentation in English, hoping to gather as much information as possible in one place.
I designed the wrappers to share the same structure as much as possible, including the interface. The classes that execute processing and the data classes are all common.
The syntax is designed so that you can write preprocessing code as quickly as possible. The result is an interface that handles tokenization and part-of-speech filtering in a single line.
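Because the structure is shared, switching analyzers only means changing the constructor. Here is a minimal sketch that reuses the classes and calls shown above; note that the ('名詞', '固有名詞') condition happens to exist in both the MeCab and Juman tag sets, while more detailed conditions need analyzer-specific tags.

```python
from JapaneseTokenizer import MecabWrapper, JumanWrapper, JumanppWrapper

sentence = "イラン・イスラム共和国、通称イランは、西アジア・中東に位置するイスラム共和制国家。ペルシア、ペルシャとも呼ばれる。"
pos_condition = [('名詞', '固有名詞')]  # proper nouns; this pair is valid for MeCab and Juman alike

# The call chain is identical for every wrapper; only the constructor differs.
for wrapper in [MecabWrapper(dictType='neologd'), JumanWrapper(), JumanppWrapper()]:
    print(wrapper.__class__.__name__,
          wrapper.tokenize(sentence).filter(pos_condition).convert_list_object())
```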
If you like it, please give the GitHub repository a star ☆ :bow_tone1:
I am also looking for people who want to improve it together. I would like to add support for other analyzers as well, such as RakutenMA or ChaSen...