I want to make more use of MeCab than in the previous flow. Word-separation split the text too finely, so this time I will try something like phrase segmentation with MeCab. It's not that I don't know CaboCha and KNP exist (I'll say it again, ~~ yellow B-boy ~~ I want to use MeCab). I hadn't used morphological analysis before, so I will run morphological analysis and attach the attached words (particles and auxiliary verbs) to the preceding word.
To state the conclusion first, I ended up not using MeCab here. Since I only needed the surface form and the part of speech, I decided janome was more concise. ~~ I did try various things with MeCab, followed an article on loading the output into a DataFrame, had a for loop spin strangely and run into a memory error, and generally had a lot of fun ~~ Concatenating the attached words with the code below brought the average length of the divided units to 2.96 characters (versus 2.16 with plain word-separation).
from janome.tokenizer import Tokenizer

with open("./gennama.txt", "r") as f:
    data = f.read()

tokenizer = Tokenizer()
tokens = tokenizer.tokenize(data)
surface_list = []
part_of_speech_list = []
for token in tokens:
    surface_list.append(token.surface)
    part_of_speech_list.append(token.part_of_speech.split(",")[0])

text_data = []
for i in range(len(surface_list)):
    if part_of_speech_list[i] == "記号":  # skip symbols
        continue
    elif part_of_speech_list[i] in ("助詞", "助動詞") and text_data:
        # particles and auxiliary verbs get glued onto the preceding unit
        row = text_data.pop(-1) + surface_list[i]
    else:
        row = surface_list[i]
    text_data.append(row)
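The 2.96 figure above is just the mean number of characters per divided unit; a minimal way to check it (my own addition, assuming the text_data built by the code above) is:

average_length = sum(len(w) for w in text_data) / len(text_data)
print(average_length)  # about 2.96 for my input, versus 2.16 with plain word-separation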
With this, as last time, I ranked which vowel sequences appear in the input data and output the words that contain them. The output is richer than with plain word-separation, but I still don't see any practical value in it. For now N-gram seems the most suitable after all: word order is preserved, and even though the text gets split where it shouldn't be, you can still detect rhymes by joining the pieces back together. Here is the point to notice: convert kanji to kana before taking N-grams. If kanji are converted to their readings, the variety of N-grams should increase. (~~ Kanji-to-kana conversion can be done with kakasi ~~) Come to think of it, when I ran morphological analysis with MeCab earlier, there was a "reading" field. Let's use MeCab.
import MeCab

with open("./gennama.txt", "r") as f:
    data = f.read()

yomi_data = ""
mecab = MeCab.Tagger("-Ochasen")  # ChaSen format: surface \t reading \t base form \t POS ...
lines = mecab.parse(data).split("\n")
for line in lines:
    if line.split("\t")[0] == "EOS":
        break
    else:
        yomi_data += line.split("\t")[1]  # the second column is the katakana reading
"Shitamachi"-> "Shitamachi"-> "iaai", and 4 vowels can be represented by 4 letters. But what about the case of "moment"-> "shunkan"-> "ua"? Two vowels are to be represented by five letters. If you divide it into N characters according to how you read it, you can do things like "Tama / Chi" instead of "Shita / Machi". However, "Shunkan" becomes uselessly long. Then, do you use N-gram after making vowel data? The surface layer cannot be retrieved even if the index is assigned to the data. It cannot be good that the sequence of "aa" in "Shitamachi" is not detected. I can't think of an improvement plan right away, but I want to think about it here.
One idea is to vary how the score is assigned. For example, keep a form that preserves the consonants as well, add points when those also match, and allow several partial scores to be summed. In that case "冷蔵庫 (refrigerator)" could be handled as a pair of representations, one keeping the consonants and one reduced to vowels.
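A rough sketch of what that summed scoring might look like (my own illustration, assuming each word carries a vowel-only form and a consonant-preserving form):

def common_suffix_length(a, b):
    # how many characters the two strings share at the end
    n = 0
    while n < min(len(a), len(b)) and a[-(n + 1)] == b[-(n + 1)]:
        n += 1
    return n

def rhyme_score(word_a, word_b):
    # word_* = (vowel_form, consonant_form); the partial scores are summed
    vowel_score = common_suffix_length(word_a[0], word_b[0])
    consonant_bonus = common_suffix_length(word_a[1], word_b[1])
    return vowel_score + consonant_bonus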
The other idea is to draw a graph with the divided words as nodes and the scores on the edges. I actually tried this back when my only division method was splitting on spaces, but it didn't turn out as expected (maybe betweenness centrality fits this theme?). I will try studying networkx from scratch. (~~ I'm worried about when the next post will be. But while studying search engines I saw PageRank being summed with word-distance scores and thought this might be useful, so I'll study it ~~)
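For reference, a minimal networkx sketch of that idea (the words and scores below are dummy values, not results from my data):

import networkx as nx

# Nodes are divided words, edges carry the rhyme score as a weight.
G = nx.Graph()
scored_pairs = [("word_a", "word_b", 4), ("word_a", "word_c", 1), ("word_b", "word_c", 2)]
for a, b, score in scored_pairs:
    G.add_edge(a, b, weight=score)

# Note: betweenness_centrality treats the weight as a distance,
# while pagerank treats it as a connection strength.
print(nx.betweenness_centrality(G, weight="weight"))
print(nx.pagerank(G, weight="weight"))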