Last time I felt that the way the input data is split needed improvement, so this time I tried several splitting methods. As before, the input data is the lyrics of a certain rapper. For verification, I also have on hand a rhyme theme I had been saving up.
import MeCab

# data holds the lyric text (read from a file, as in the full script later in this post)
mecab = MeCab.Tagger("-Owakati")        # wakati-gaki: split the text into words
mecab_text = mecab.parse(data).split()  # list of segmented words
There are parts where the rhyme is picked up just as it appears in the lyrics, but others are missed: the phrase rhymed on "tens of thousands of yen" is split into "tens of thousands" and "yen" by the word segmentation, so that rhyme is no longer recognized. As an aside, kakasi's conv.do apparently cannot be given a large amount of data at once, so I convert word by word: text_data = [conv.do(text) for text in mecab_text]. Incidentally, after word segmentation the longest word converted to vowels was 8 characters long and the average was 2.16 characters. Word segmentation chops the text too finely, so it is not well suited to this purpose.
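Just as a minimal sketch of how those figures can be checked (this assumes mecab_text and the kakasi converter conv are already set up, as in the full script later in this post):
import re

# Romanize each segmented word with kakasi, then keep only the vowels
text_data = [conv.do(text) for text in mecab_text]
vowel_data = [re.sub(r"[^aeiou]+", "", text) for text in text_data]

# Length statistics of the vowel strings (empty ones are skipped)
lengths = [len(v) for v in vowel_data if v]
print("max:", max(lengths))
print("avg:", round(sum(lengths) / len(lengths), 2))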
~~I was stuck many times before I could even get MeCab working. I had probably been overlooking the relevant articles. Thanks to the one linked here.~~
So what happens if we simply split the text every N characters? I will try N starting from 4.
def make_ngram(words, N):
    ngram = []
    for i in range(len(words) - N + 1):
        # Remove double-byte spaces and line breaks
        row = "".join(words[i:i+N]).replace("\u3000", "").replace("\n", "")
        ngram.append(row)
    return ngram
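As a quick usage sketch: a Python string is itself a sequence of characters, so the raw lyric text can be passed straight in (here data is assumed to hold the lyrics read from the file):
# Every window of N = 5 consecutive characters becomes a rhyme candidate
candidates = make_ngram(data, 5)
print(candidates[:10])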
Empirically, N of 5 or more seems to work well (at 4 or less the scores show no real differences). The rhymes are now detected as they appear in the lyrics, and when I fed in the verification data as a trial, it even picked up a rhyme that had apparently gone unnoticed. Since the text is now cut out in all sorts of ways, I also changed how the pieces are scored.
def make_score_ngram(word_a, word_b):
    score = 0
    # Add i to the score at every suffix length i where the two words match,
    # so longer shared endings are rewarded more heavily
    for i in range(len(word_a)):
        if word_a[-i:] == word_b[-i:]:
            score += i
    return score
Scoring the vowel match from the end of the word makes the output easy to read. For the value of N, len(target_word_vo) (the length of the vowel string of the word we are looking for rhymes for) seems like a good choice. I feel this finally expresses what I want to do.
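To sketch what that looks like in use (target_word and data are assumed to be given, and to_vowel is a hypothetical helper wrapping the kakasi conversion and vowel extraction used elsewhere in this post):
import re

def to_vowel(text):
    # Romanize with kakasi, then keep only the vowels
    return re.sub(r"[^aeiou]+", "", conv.do(text))

target_word_vo = to_vowel(target_word)
N = len(target_word_vo)  # window size tied to the target's vowel length

# Score every N-character chunk of the lyrics against the target word
scored = [(make_score_ngram(to_vowel(chunk), target_word_vo), chunk)
          for chunk in make_ngram(data, N)]
for score, chunk in sorted(scored, reverse=True)[:10]:
    print(score, chunk)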
I went to a lot of trouble getting MeCab working, and I had also worked out my own way of "quantifying the rhyme", so I want to put both to use. Let's combine the two.
In "quantification of rhyme", the part where the vowels match was searched, and the matching length len (word [i: j)
was used as the score. This word [i: j]
has a shape such as "eoi", and if you count the number of occurrences, you should be able to find the most appearing vowel in the input data. The idea is that if you specify a word that includes it in target_word
, you can expect many recommendations. I'm sorry to use the text prepared for word-separation and verification.
from pykakasi import kakasi
import re
from collections import defaultdict
import MeCab

with open("./test.txt", "r", encoding="utf-8") as f:
    data = f.read()

mecab = MeCab.Tagger("-Owakati")
mecab_text = mecab.parse(data).split()

kakasi = kakasi()
kakasi.setMode('H', 'a')  # hiragana -> romaji
kakasi.setMode('K', 'a')  # katakana -> romaji
kakasi.setMode('J', 'a')  # kanji -> romaji
conv = kakasi.getConverter()

text_data = [conv.do(text) for text in mecab_text]
vowel_data = [re.sub(r"[^aeiou]+", "", text) for text in text_data]
dic_vo = {k: v for k, v in enumerate(vowel_data)}
# Keep a dictionary of the words before vowel conversion, looked up by the same index as vowel_data
dic = {k: v for k, v in enumerate(mecab_text)}
# Use defaultdict so new keys need no initialization: {vowel sequence: number of appearances}
dic_rhyme = defaultdict(int)
for word_a in vowel_data:
    for word_b in vowel_data:
        if len(word_a) > len(word_b):
            word_len = len(word_b)
            for i in range(word_len):
                for j in range(word_len + 1):
                    # Only count substrings of 2 or more characters
                    if word_b[i:j] in word_a and not len(word_b[i:j]) < 2:
                        dic_rhyme[word_b[i:j]] += 1
        else:
            word_len = len(word_a)
            for i in range(word_len):
                for j in range(word_len + 1):
                    if word_a[i:j] in word_b and not len(word_a[i:j]) < 2:
                        dic_rhyme[word_a[i:j]] += 1
#Sort in descending order of count
dic_rhyme = sorted(dic_rhyme.items(), key=lambda x:x[1], reverse=True)
print(dic_rhyme)
# Search for words containing the sequence that came out on top in dic_rhyme; here "ai" is used
bool_index = ["ai" in text for text in vowel_data]
for i in range(len(vowel_data)):
    if bool_index[i]:
        print(dic[i])
I was able to obtain the vowel sequences that appear frequently and print out where they are used. However, the words chopped up by the word segmentation made the output hard to interpret; there were probably slightly longer rhymes hiding in there.
I don't feel much need to narrow down target_word this way (I want to choose the thing I most want to say myself), but being able to confirm which vowel sequences are frequent may still be useful. The word-segmentation approach did not work out this time, but I would like to improve it using MeCab (~~which I struggled with many times before I could even use it~~). Also, adopting N-grams made it possible to simplify the "quantification of rhyme", so I will consider whether "rhyme" can be redefined in a somewhat more elaborate way (at the moment, "tsu" is not taken into account).

Still, I took quite a detour. The "quantification of rhyme" was meant to be something I thought through in my own way so that it would fit the input data and not let any rhyme slip through, yet I never noticed that simply slicing the input data in various ways (the wording may differ) could solve it. The basics are important. That said, doesn't it feel like N-grams can produce meaningless Japanese? Well, there are ways of delivering the lines that emphasize the rhyming part. In the end, what matters is to just try things out on simple data first.