First of all, you need MeCab itself installed. I assume the pre-processing craftsmen reading this already have it set up, so I'll skip those instructions.
The wrapper can also call the mecab-neologd dictionary, so it's a good idea to install that as well.
You can install it by cloning the repository and running setup.py:

`git clone [email protected]:Kensuke-Mitsuzawa/JapaneseTokenizers.git`
`python setup.py install`

Or install it directly with pip:

`pip install git+https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers`
The same content is covered in the repository's example, so I'll keep this brief.
Prepare an input sentence:

`sentence = u'Tehran (Persian: تهران; Tehrān / teɦˈrɔːn /, English: Tehran) is the capital of Iran, in West Asia, and the capital of Tehran Province. Population 12,223,598. The metropolitan population reaches 13,413,348.'`
In Python 2.x, the input must be `unicode`.
In Python 3.x, it doesn't matter which type you use.
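For example, in Python 2.x you can coerce a byte string to `unicode` before tokenizing. A minimal sketch using only the standard library:

```python
# Python 2.x only: the wrapper needs unicode input, so decode byte strings.
# (In Python 3.x, str is already unicode and this step is unnecessary.)
if isinstance(sentence, str):            # in Python 2, str means bytes
    sentence = sentence.decode('utf-8')
```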
Specify the OS type.
For anything other than CentOS, `osType = "generic"` is fine. Only CentOS needs `osType = "centos"`.
(This is because the MeCab system command differs only on CentOS. There may be other OSs like that... I have confirmed that it works on Ubuntu and Mac.)
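If you want to pick the flag automatically, a tiny helper like the following works. This is a hypothetical convenience of my own, not something the package ships:

```python
import platform

def guess_os_type():
    # Heuristic: on the Pythons of this era the platform string names the
    # Linux distro, e.g. 'Linux-3.10.0-...-with-centos-7.2'. Only CentOS
    # needs the special value; everything else can use "generic".
    if 'centos' in platform.platform().lower():
        return 'centos'
    return 'generic'

osType = guess_os_type()
```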
Specify the dictionary type, one of:

`dictType = ""`
`dictType = "ipadic"`
`dictType = "neologd"`
Initialize the instance
mecab_wrapper = MecabWrapper(dictType=dictType, osType=osType)
Split the sentence into words.
tokenized_obj = mecab_wrapper.tokenize(sentence=sentence)
By default, words and their parts of speech are returned as pairs of tuples.
tokenized_obj = mecab_wrapper.tokenize(sentence=sentence, return_list=False)
returns a class object instead, so if you want to use the result for further processing, it's better to set this flag.
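Putting the steps so far together, a minimal end-to-end sketch might look like this. The import path is an assumption based on the repository name; check the repository's example for the exact one:

```python
# -*- coding: utf-8 -*-
# Assumption: the package is importable under the repository's name;
# check the repository's own example for the exact import path.
from JapaneseTokenizers import MecabWrapper

sentence = u'Tehran is the capital of Iran, in West Asia.'
osType = "generic"     # "centos" only on CentOS
dictType = "neologd"   # one of "", "ipadic", "neologd"

mecab_wrapper = MecabWrapper(dictType=dictType, osType=osType)

# The default (return_list=True) yields (word, part-of-speech) tuples.
for word, pos in mecab_wrapper.tokenize(sentence=sentence):
    print(word, pos)
```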
Filtering

Stop words are given as a list of strings, like

stopwords = [u'Tehran']

(both `str` and `unicode` are acceptable).

To filter by part of speech, specify a list of part-of-speech tuples: [(part-of-speech tuple)].
Parts of speech can be specified down to 3 levels. For example, in the IPADIC part-of-speech system, if you want noun-proper noun-personal name, write

(u'noun', u'proper noun', u'personal name')

If you only want to go down to noun-proper noun, use

(u'noun', u'proper noun')

Again, both `str` and `unicode` are accepted.
Put the part-of-speech tuples you want to keep in a list.
pos_condition = [(u'noun', u'proper noun'), (u'verb', u'independence')]
Perform filtering.
filtered_obj = mecab_wrapper.filter(
parsed_sentence=tokenized_obj,
pos_condition=pos_condition
)
The return value is, again, the class object mentioned above.
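To see the whole filtering flow in one place, here is a sketch combining the pieces above. The `stopwords` keyword name is my assumption; the walkthrough only shows `parsed_sentence` and `pos_condition` explicitly:

```python
# Whole filtering flow, combining the pieces above.
stopwords = [u'Tehran']
pos_condition = [(u'noun', u'proper noun'), (u'verb', u'independence')]

tokenized_obj = mecab_wrapper.tokenize(sentence=sentence, return_list=False)
filtered_obj = mecab_wrapper.filter(
    parsed_sentence=tokenized_obj,
    pos_condition=pos_condition,
    stopwords=stopwords,  # assumption: this keyword is not shown above
)
print(filtered_obj)  # the class object mentioned above
```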
To briefly summarize my motivation:
I've been in charge of natural language processing for a long time... a saint who does pre-processing day after day, sometimes even pre-processing for other people's research.
But at one point I suddenly realized: __"Aren't I writing the same morphological-analysis code every single time?"__
So, having done the same thing over and over, I packaged just the processes I have used (and will probably keep using) most often.
A similar Python package is natto.
However, with natto I found it inconvenient that I had to write the filtering myself and couldn't add dictionaries, so I made a new one.
To all former and active pre-processing craftsmen: I hope this cuts your work down as much as possible and lets you enjoy NLP.