TermExtract is apparently a module for extracting technical terms from text data. The official site describes it as an
automatic extraction system for technical terms (keywords).
Until now it was provided only as a Perl module, but a beta version for Python was apparently released at the end of last year. I thought it might help as a countermeasure against unknown words when analyzing text, so I tried it out.
Simply download the zip file from the official site, unzip it to a suitable location, and run the following:
python setup.py install
Unfortunately, it does not seem to be installable via pip or conda.
The official description reads:
Receives the morphological analysis result of MeCab, a Japanese morphological analyzer, and returns a list of compound nouns (space-separated single nouns) or a dictionary (compound noun as key, occurrence count of the compound noun as value).
The morphological analysis result should be passed in the following format (the official sample text is shown below).
sample.txt
自然言語処理	名詞,一般,*,*,*,*,dummy,dummy,dummy
（	記号,括弧開,*,*,*,*,（,（,（
り	助動詞,*,*,*,文語,基本形,り,リ,リ
、	記号,読点,*,*,*,*,、,、,、
英	名詞,一般,*,*,*,*,dummy,dummy,dummy
語	名詞,一般,*,*,*,*,dummy,dummy,dummy
：	名詞,サ変接続,*,*,*,*,*
natural	名詞,一般,*,*,*,*,*
language	名詞,一般,*,*,*,*,*
processing	名詞,一般,*,*,*,*,*
・
・
・
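Before loading it, it may help to make the line format concrete: each line is a surface form, a tab, then comma-separated features, where MeCab emits Japanese feature names such as 名詞 (noun). The tiny parser below is my own hypothetical helper for illustration, not part of TermExtract:

```python
def parse_mecab_line(line):
    """Split one line of MeCab-style output into (surface, feature list)."""
    surface, _, features = line.partition("\t")
    return surface, features.split(",")

# Keep only noun (名詞) tokens from a few lines of output
lines = [
    "自然\t名詞,一般,*,*,*,*,*",
    "言語\t名詞,一般,*,*,*,*,*",
    "処理\t名詞,一般,*,*,*,*,*",
    "、\t記号,読点,*,*,*,*,、,、,、",
]
nouns = [s for s, f in map(parse_mecab_line, lines) if f[0] == "名詞"]
print(nouns)  # → ['自然', '言語', '処理']
```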
This is MeCab's output format, one token per line. Load it with the following Python script.
import collections

import termextract.core
import termextract.mecab

# Read the file
tagged_text = open("sample.txt", "r", encoding="utf-8").read()

# Extract compound nouns and compute their importance
frequency = termextract.mecab.cmp_noun_dict(tagged_text)
LR = termextract.core.score_lr(
    frequency,
    ignore_words=termextract.mecab.IGNORE_WORDS,
    lr_mode=1,
    average_rate=1,
)
term_imp = termextract.core.term_importance(frequency, LR)

# Sort by importance in descending order and print
data_collection = collections.Counter(term_imp)
for cmp_noun, value in data_collection.most_common():
    print(termextract.core.modify_agglutinative_lang(cmp_noun), value, sep="\t")
The output looks like this:
Natural language processing 31.843366656181313
they 11.618950038622252
Meaning 10.392304845413264
English 10.059467437463484
Basic technology 9.361389277282864
Statistical natural language processing 9.085602964160698
Analysis 8.485281374238571
・
・
・
The results are reasonable, and compound nouns can be extracted as-is (though there are plenty of obviously odd entries...).
However, **the input must be MeCab's morphological analysis result**, which makes it slightly awkward to use. I felt it would be handier if plain text or pre-tokenized text could be passed in.
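For intuition, the LR score computed by score_lr and term_importance is reportedly based on Nakagawa's FLR measure: a compound noun is scored by its frequency times the geometric mean of the left/right connection counts of its component nouns. The sketch below is my own toy reimplementation under that assumption; the function name flr_scores and the exact counting details are mine, not TermExtract's actual code:

```python
import math
from collections import defaultdict

def flr_scores(compound_freqs):
    """Toy FLR-style scoring: freq(term) * geometric mean of
    (left_connections + 1) * (right_connections + 1) over its nouns."""
    left = defaultdict(int)   # how often each noun has a noun on its left
    right = defaultdict(int)  # how often each noun has a noun on its right
    for nouns, freq in compound_freqs.items():
        for a, b in zip(nouns, nouns[1:]):
            right[a] += freq
            left[b] += freq
    scores = {}
    for nouns, freq in compound_freqs.items():
        prod = 1.0
        for n in nouns:
            prod *= (left[n] + 1) * (right[n] + 1)
        scores[nouns] = freq * prod ** (1 / (2 * len(nouns)))
    return scores

freqs = {("natural", "language", "processing"): 2,
         ("language", "processing"): 3}
scores = flr_scores(freqs)
```

On this toy data, "language processing" scores 3 × 108^(1/4) because "language" connects to the left twice and to the right five times, illustrating how frequently-connecting components boost a term.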
Other extraction methods are also provided. One of them, **Japanese stopword-based term extraction**, is officially described as follows:
Receives plain Japanese text and returns a list of compound nouns (space-separated single nouns) or a dictionary (compound noun as key, occurrence count of the compound noun as value). Compound nouns are cut out by splitting sentences on hiragana and symbols.
In other words, I take it that the text is tokenized using hiragana characters and symbols as delimiters (sorry, I haven't read the details properly...).
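As a rough illustration of that idea (my own sketch with a deliberately simplified delimiter set, not TermExtract's actual implementation), splitting on runs of hiragana and punctuation leaves the kanji/katakana runs behind as candidate terms:

```python
import re

text = "人工知能という言葉は計算機による知的な処理のことである。"
# Split on runs of hiragana (ぁ-ん) and common punctuation,
# keeping only the non-empty fragments in between
candidates = [t for t in re.split(r"[ぁ-ん、。]+", text) if t]
print(candidates)  # → ['人工知能', '言葉', '計算機', '知的', '処理']
```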
This one simply loads plain text.
import collections

import termextract.core
import termextract.japanese_plaintext

# Read the file
text = open("sample.txt", "r", encoding="utf-8").read()

# Extract compound nouns and compute their importance
frequency = termextract.japanese_plaintext.cmp_noun_dict(text)
LR = termextract.core.score_lr(
    frequency,
    ignore_words=termextract.japanese_plaintext.IGNORE_WORDS,
    lr_mode=1,
    average_rate=1,
)
term_imp = termextract.core.term_importance(frequency, LR)

# Sort by importance in descending order and print
data_collection = collections.Counter(term_imp)
for cmp_noun, value in data_collection.most_common():
    print(termextract.core.modify_agglutinative_lang(cmp_noun), value, sep="\t")
The output looks like this:
Artificial intelligence 1226.4288753047445
Human 277.1173032591193
Intelligence 185.75930965317852
Development 88.6373649378885
Awareness 60.00624902367479
Artificial 57.917332434843445
Possible 55.20783921098894
・
・
・
This one seems easier to use.
Above, I introduced two methods: the morphological analysis result method and the stopword method. Let's compare the top 20 scores from each.
Morphological analysis result method:
Intelligence 12.649110640673518
Computational intelligence 5.029733718731742
Fighter 4.7381372205375865
Combat 4.58257569495584
For fighter pilots 4.4406237146062955
Calculator 4.426727678801286
Artificial intelligence 4.355877174692862
Study 4.0
Calculation 4.0
Autopilot 3.9359793425308607
Learning 3.872983346207417
Automatic combat system 3.802742902833608
Artificial intelligence technology 3.7719455481170785
Logical operation 3.7224194364083982
Machine learning 3.6628415014847064
Symbolic AI 3.6342411856642793
Autopilot possible 3.5254687665352296
Logic 3.4641016151377544
Machine 3.4641016151377544
Mechanical calculator 3.413473673690155
Stopword method:
Artificial intelligence 1226.4288753047445
Human 277.1173032591193
Intelligence 185.75930965317852
Development 88.6373649378885
Awareness 60.00624902367479
Artificial 57.917332434843445
Possible 55.20783921098894
Study 51.27978102078589
Learning 49.31317739277511
Japanese Society for Artificial Intelligence 48.855373993311964
Realization 48.748063633179314
Theory 40.51490946041508
Announcement 39.39438441683934
Computational intelligence 35.98098913381863
Possibility 34.82443169313786
Method 34.6517883306879
Use 32.82677759681713
Intellectual 31.52620185751426
Operation 30.582796407248203
Action 30.582796407248203
Appearance 29.146786564179294
At first glance, the morphological analysis result method seems to extract the more important keywords with higher scores.
Could this serve as an easy countermeasure for unknown words that MeCab + NEologd fails to detect? That said, the morphological analysis result method is awkward to use as a module, so it seems you would need to write a thin wrapper around it yourself. Proper validation is also required.
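For reference, the thin wrapper I have in mind might look like the sketch below. The names extract_keywords, tagger, and extractor are all hypothetical; in practice the two callables would be MeCab's tagging step and the termextract pipeline shown earlier:

```python
def extract_keywords(text, tagger, extractor):
    """Hypothetical thin wrapper: tag plain text, then run term extraction.

    tagger:    callable mapping plain text -> MeCab-style tagged text
    extractor: callable mapping tagged text -> {term: importance score}
    """
    tagged = tagger(text)
    scores = extractor(tagged)
    # Return terms sorted by importance, highest first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Stand-in callables just to show the plumbing
dummy_tagger = lambda text: text.upper()
dummy_extractor = lambda tagged: {"term a": 1.0, "term b": 3.0}
print(extract_keywords("some text", dummy_tagger, dummy_extractor))
# → [('term b', 3.0), ('term a', 1.0)]
```

Injecting the tagger and extractor as arguments keeps the plumbing testable without MeCab or TermExtract installed.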