Chinese morphological analysis engine jieba

I tried out jieba using its Python version. [Implementations in other programming languages are also available](https://github.com/fxsjy/jieba#%E5%85%B6%E4%BB%96%E8%AF%AD%E8%A8%80%E5%AE%9E%E7%8E%B0).

Installation

$ pip install jieba

Text segmentation

>>> import jieba
>>> text = "我明天去东京大学上课。早上十点开始。"
# "I will attend a class at the University of Tokyo tomorrow. It starts at 10 o'clock in the morning."

jieba.cut and jieba.cut_for_search return generators, while jieba.lcut and jieba.lcut_for_search return lists.
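The practical difference is that a generator can be iterated only once, while a list can be indexed and reused. Here is a minimal sketch of that behavior. Note that `toy_cut` and `toy_lcut` are hypothetical stand-ins that simply split on whitespace; they are not part of jieba.

```python
def toy_cut(text):
    """Hypothetical stand-in for a generator-returning function like jieba.cut."""
    for token in text.split():
        yield token


def toy_lcut(text):
    """Hypothetical stand-in for a list-returning function like jieba.lcut."""
    return list(toy_cut(text))


segments = toy_cut("a b c")
first_pass = list(segments)   # consumes the generator
second_pass = list(segments)  # nothing is left to yield the second time

tokens = toy_lcut("a b c")    # a list can be indexed and iterated repeatedly
```

So if the tokens are needed more than once, the list-returning variants (or wrapping the generator in `list()`) are the convenient choice.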

Accurate Mode

>>> segments = jieba.cut(text)
>>> list(segments)
['我', '明天', '去', '东京大学', '上课', '。', '早上', '十点', '开始', '。']

>>> segments = jieba.lcut(text)
>>> segments
['我', '明天', '去', '东京大学', '上课', '。', '早上', '十点', '开始', '。']
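According to the jieba README, accurate mode uses dynamic programming to find the most probable segmentation based on word frequency. The following is only a rough sketch of that idea, not jieba's actual code. The `FREQ` table and its numbers are made up for illustration, and the HMM step that jieba applies to unknown words is left out.

```python
import math

# Toy word frequencies (hypothetical numbers, not jieba's dictionary).
FREQ = {"东京": 50, "大学": 80, "东京大学": 30, "学上": 1, "上课": 40,
        "东": 5, "京": 5, "大": 10, "学": 10, "上": 10, "课": 5}
TOTAL = sum(FREQ.values())


def best_segmentation(sentence):
    n = len(sentence)
    # route[i] = (best log-probability of sentence[i:], end index of the first word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Unknown single characters are still allowed, with frequency 1.
            if word in FREQ or j == i + 1:
                score = math.log(FREQ.get(word, 1) / TOTAL) + route[j][0]
                candidates.append((score, j))
        route[i] = max(candidates)
    # Follow the best path from the start of the sentence.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words


result = best_segmentation("东京大学上课")
```

With these toy frequencies, keeping 东京大学 as one word scores higher than splitting it into 东京 and 大学, because each extra word multiplies in another small probability.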

**University of Tokyo** (东京大学) is treated as a single word.

Full Mode

Set cut_all=True.

>>> segments = jieba.cut(text, cut_all=True)
>>> list(segments)
['我', '明天', '去', '东京', '东京大学', '大学', '学上', '上课', '。', '早上', '十点', '开始', '。']

>>> segments = jieba.lcut(text, cut_all=True)
>>> segments
['我', '明天', '去', '东京', '东京大学', '大学', '学上', '上课', '。', '早上', '十点', '开始', '。']
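The README describes full mode as getting all the possible words from the sentence. A minimal sketch of that idea follows, assuming a tiny hypothetical dictionary `WORDS` (jieba's real dictionary and scanning code are more involved). The input is the Chinese for "I will attend a class at the University of Tokyo tomorrow".

```python
# Hypothetical toy dictionary; jieba ships a much larger one.
WORDS = {"东京", "东京大学", "大学", "学上", "上课", "明天"}


def full_mode(sentence):
    tokens = []
    covered_until = -1  # last index covered by an emitted multi-character word
    for start in range(len(sentence)):
        # Every dictionary word of two or more characters starting here.
        ends = [end for end in range(start + 2, len(sentence) + 1)
                if sentence[start:end] in WORDS]
        if ends:
            for end in ends:
                tokens.append(sentence[start:end])
                covered_until = max(covered_until, end - 1)
        elif start > covered_until:
            # No word starts here and no earlier word covers this character.
            tokens.append(sentence[start])
    return tokens


result = full_mode("我明天去东京大学上课")
```

Overlapping words such as 东京, 东京大学, and 大学 are all emitted, which is why full mode returns more tokens than accurate mode.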

Search Engine Mode

>>> segments = jieba.cut_for_search(text)
>>> list(segments)
['我', '明天', '去', '东京', '大学', '东京大学', '上课', '。', '早上', '十点', '开始', '。']

>>> segments = jieba.lcut_for_search(text)
>>> segments
['我', '明天', '去', '东京', '大学', '东京大学', '上课', '。', '早上', '十点', '开始', '。']
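The README says search engine mode builds on accurate mode and cuts long words into shorter ones to raise recall. Here is a sketch of how that could work, assuming the accurate-mode tokens are already available and using a hypothetical toy dictionary `WORDS`. It is an illustration, not jieba's implementation.

```python
# Hypothetical toy dictionary of known short words.
WORDS = {"东京", "大学", "东京大学", "上课", "明天"}


def search_mode(accurate_tokens):
    tokens = []
    for word in accurate_tokens:
        # Emit known 2- and 3-character sub-words before the long word itself.
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    gram = word[i:i + size]
                    if gram in WORDS:
                        tokens.append(gram)
        tokens.append(word)
    return tokens


result = search_mode(["我", "明天", "去", "东京大学", "上课"])
```

A search index built from these tokens would match a query for 东京 or 大学 as well as one for 东京大学.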

Keyword extraction

>>> import jieba.analyse
>>> text = '''
...The progress of globalization is constantly accelerating, the human race is facing the front, and the daily profits are sharply challenged. This is a trivial challenge, each kind of talent demanding power production, dedication dedication, joint conquest this trivial globalization problem. Under the background of this kind of background, he is a talented person who works as a leader, and is reluctantly assigned to the University of Tokyo. Infinite courage after our general courage, wisdom and assignment feeling, direct opposition to this trivial challenge.
...Academic scholarship, academic discipline, scholarship, scholarship, scholarship, scholarship. Opponents of the scholarship on the road. The University of Tokyo's unscrupulous national office, this is a trivial student, a scholar-provided long-term soil, a good place to build a society.
...The University of Tokyo is now in a straightforward position, and it is a unique scholarly point of eastern and western culture, uninterrupted development, an eye-opening world, and a unique flag. The future of the outpost, the future of the prospects, the University of Tokyo's aspirations, and the talented people of each ceremony. University of Tokyo, national world, culture, breakthrough of barriers, new area science research transcendental literary world limit, industry-government-academia collaboration exhibition. This is the first target, the demand for the neck, the excellence, the internationality, and the dual-purpose research student's institute, and the parallel exhibition....The University of Tokyo's decree, the University of Tokyo's power, world peace, humanity and welfare production, timeless offering. The modern social development, the demand for ourselves, the demand for the development of the era, the scholarship research, the new era. At the same time, the system reform is not possible or is not possible. At the same time as the reform of the education of the undergraduate students, the research student's institutional fundamental transformation, the messenger's knowledge, and the independent intentions can be realized. In addition to this, the reform of the personnel system for promoting demand, the equality of men and women, the equality of men and women, the qualitative meeting of the human resources, and the qualitative nature of the human resources. Unreasonable one-problem problem, promotion The above-mentioned reforming premise, the above-mentioned reforming premise, the social credibility, the credibility of the scholarship, the scholarship of the scholarship, the scholarship of the scholarship, and the scholarship of the scholarship.
...The University of Tokyo, which has been constantly in the process of success, has been developed by the University of Tokyo, and has been established by the University of Tokyo.
... '''

The text is taken from the Chinese version of the University of Tokyo President's message.

Extraction by tf-idf value

>>> keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=False, allowPOS=())
>>> keywords
['University of Tokyo', 'Persistent', 'Confidence', 'Science', 'Challenge', 'Human talent', 'Physics', 'Knowledge', 'Graduate School', '爱', 'Science研究', 'Shinshin', 'Promotion', 'Globalization', 'reform', 'Kaken', 'This trivial', 'Powerful', 'Feeling of joy', 'Ritsu']

Looks good. The characters differ a little from Japanese kanji, but the result is mostly readable.
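For reference, the TF-IDF idea behind this kind of ranking can be sketched in a few lines: a word scores high when it is frequent in the text but rare in general. The `IDF` table, `DEFAULT_IDF`, and the token list below are all hypothetical values chosen for illustration; jieba uses its own precomputed IDF data.

```python
from collections import Counter

# Hypothetical IDF values (higher means rarer in general text).
IDF = {"大学": 2.0, "研究": 3.0, "改革": 4.5, "的": 0.1}
DEFAULT_IDF = 3.0  # assumed fallback for words missing from the table


def tfidf_keywords(tokens, top_k=2):
    counts = Counter(tokens)
    total = sum(counts.values())
    # score = term frequency in this text * inverse document frequency
    scores = {word: (count / total) * IDF.get(word, DEFAULT_IDF)
              for word, count in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


tokens = ["大学", "的", "研究", "的", "改革", "大学", "的", "大学"]
result = tfidf_keywords(tokens)
```

Here 的 appears as often as 大学 but ranks last because its IDF is tiny, which is how function words get filtered out.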

Extraction based on TextRank

>>> keywords = jieba.analyse.textrank(text, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
>>> keywords
['Strategy', 'Knowledge', 'Shinshin', 'Science', 'Exhibition', 'demand', 'reform', 'Human talent', 'Promotion', 'Kaken', 'Challenge', 'Actual', 'Area', 'Will', 'society', 'Science研究', 'Mankind', 'culture', 'Physics', 'Courage']
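TextRank ranks words by running a PageRank-style iteration over a graph of words that appear near each other. The sketch below is a simplified version written for illustration; the window size, damping factor, iteration count, and the token list are assumptions, and it skips the part-of-speech filtering that `allowPOS` performs.

```python
from collections import defaultdict


def textrank(tokens, window=5, damping=0.85, iterations=20):
    # Undirected co-occurrence graph: words within `window` positions
    # of each other share a weighted edge.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window + 1]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for word in weights:
            rank = 0.0
            for neighbor, weight in weights[word].items():
                # Each neighbor passes on its score in proportion to edge weight.
                rank += weight / sum(weights[neighbor].values()) * scores[neighbor]
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores
    return sorted(scores, key=scores.get, reverse=True)


tokens = ["大学", "研究", "大学", "改革", "大学", "社会"]
ranked = textrank(tokens, window=1)
```

Because 大学 is adjacent to every other word in this toy input, it ends up with the highest score. Unlike TF-IDF, this needs no external frequency table, only the text itself.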

Other

jieba has many other features, such as customizing the dictionary and tagging parts of speech, so it is best to check the official README for details. The first half of README.md is in Chinese, but the second half (https://github.com/fxsjy/jieba#jieba-1) is an English translation.

The author has nothing to do with the University of Tokyo.
