First of all, you need MeCab itself installed. I assume the pre-processing craftsmen reading this already have it set up, so I'll skip those instructions.
The wrapper can also call the mecab-neologd dictionary, so it's a good idea to install that as well.
You can install it by cloning the repository and running setup.py:

`git clone [email protected]:Kensuke-Mitsuzawa/JapaneseTokenizers.git`
`python setup.py install`

Or install it directly with pip:

`pip install git+https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers`
The same content is covered in the repository's example, so I'll keep this brief.
Prepare an input sentence:

`sentence = u'Tehran (Persian: تهران; Tehrān / teɦˈrɔːn /, English: Tehran) is the capital of Iran, in West Asia, and the capital of Tehran Province. Population 12,223,598. The metropolitan population reaches 13,413,348.'`
In Python 2.x, the input must be `unicode`.
In Python 3.x, it doesn't matter which type you use.
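For example, in Python 2.x you can coerce a byte string to `unicode` before tokenizing. A minimal sketch using only the standard library:

```python
# Python 2.x only: the wrapper needs unicode input, so decode byte strings.
# (In Python 3.x, str is already unicode and this step is unnecessary.)
if isinstance(sentence, str):            # in Python 2, str means bytes
    sentence = sentence.decode('utf-8')
```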
Specify the OS type.
For anything other than CentOS, `osType = "generic"` is fine. Only CentOS needs `osType = "centos"`.
(This is because the MeCab system command differs only on CentOS. There may be other OSs like that... I have confirmed that it works on Ubuntu and Mac.)
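If you want to pick the flag automatically, a tiny helper like the following works. This is a hypothetical convenience of my own, not something the package ships:

```python
import platform

def guess_os_type():
    # Heuristic: on the Pythons of this era the platform string names the
    # Linux distro, e.g. 'Linux-3.10.0-...-with-centos-7.2'. Only CentOS
    # needs the special value; everything else can use "generic".
    if 'centos' in platform.platform().lower():
        return 'centos'
    return 'generic'

osType = guess_os_type()
```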
Specify the dictionary type, one of:

`dictType = ""`
`dictType = "ipadic"`
`dictType = "neologd"`
Initialize the instance
mecab_wrapper = MecabWrapper(dictType=dictType, osType=osType)
Split the sentence into words.
tokenized_obj = mecab_wrapper.tokenize(sentence=sentence)
By default, words and their parts of speech are returned as pairs of tuples.
tokenized_obj = mecab_wrapper.tokenize(sentence=sentence, return_list=False)
returns a class object instead, so if you want to use the result for further processing, it's better to set this flag.
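Putting the steps so far together, a minimal end-to-end sketch might look like this. The import path is an assumption based on the repository name; check the repository's example for the exact one:

```python
# -*- coding: utf-8 -*-
# Assumption: the package is importable under the repository's name;
# check the repository's own example for the exact import path.
from JapaneseTokenizers import MecabWrapper

sentence = u'Tehran is the capital of Iran, in West Asia.'
osType = "generic"     # "centos" only on CentOS
dictType = "neologd"   # one of "", "ipadic", "neologd"

mecab_wrapper = MecabWrapper(dictType=dictType, osType=osType)

# The default (return_list=True) yields (word, part-of-speech) tuples.
for word, pos in mecab_wrapper.tokenize(sentence=sentence):
    print(word, pos)
```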
Filtering

Stop words are given as a list of strings, like

stopwords = [u'Tehran']

(both `str` and `unicode` are acceptable).

To filter by part of speech, specify a list of part-of-speech tuples: [(part-of-speech tuple)].
Parts of speech can be specified down to 3 levels. For example, in the IPADIC part-of-speech system, if you want noun-proper noun-personal name, write

(u'noun', u'proper noun', u'personal name')

If you only want to go down to noun-proper noun, use

(u'noun', u'proper noun')

Again, both `str` and `unicode` are accepted.
Put the part-of-speech tuples you want to keep in a list.
pos_condition = [(u'noun', u'proper noun'), (u'verb', u'independence')]
Perform filtering.
filtered_obj = mecab_wrapper.filter(
parsed_sentence=tokenized_obj,
pos_condition=pos_condition
)
The return value is, again, the class object mentioned above.
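To see the whole filtering flow in one place, here is a sketch combining the pieces above. The `stopwords` keyword name is my assumption; the walkthrough only shows `parsed_sentence` and `pos_condition` explicitly:

```python
# Whole filtering flow, combining the pieces above.
stopwords = [u'Tehran']
pos_condition = [(u'noun', u'proper noun'), (u'verb', u'independence')]

tokenized_obj = mecab_wrapper.tokenize(sentence=sentence, return_list=False)
filtered_obj = mecab_wrapper.filter(
    parsed_sentence=tokenized_obj,
    pos_condition=pos_condition,
    stopwords=stopwords,  # assumption: this keyword is not shown above
)
print(filtered_obj)  # the class object mentioned above
```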
To briefly summarize my motivation:
I've been in charge of natural language processing for a long time... a saint who does pre-processing day after day, sometimes even pre-processing for other people's research.
But at one point I suddenly realized: __"Aren't I writing the same morphological-analysis code every single time?"__
So, having done the same thing over and over, I packaged just the processes I have used (and will probably keep using) most often.
A similar Python package is natto.
However, with natto I found it inconvenient that I had to write the filtering myself and couldn't add dictionaries, so I made a new one.
To all former and active pre-processing craftsmen: I hope this cuts your work down as much as possible and lets you enjoy NLP.