This time I cover three things: a fix to last time's code, a use for the result, and a count representation. The word list that treats vowel sequences as words contained duplicates, so I rebuild it so that it contains none. After that, as before, I encode each sentence in the text as a binary vector and display the pairs with high cosine similarity. Then I do the same with a count representation.
from pykakasi import kakasi
import re
import numpy as np
import pandas as pd
import itertools
with open("./test.txt", "r", encoding="utf-8") as f:
    data = f.read()
# Word list: every 2- to 4-letter word that can be made from vowels only (775 entries).
word_list2 = [i[0]+i[1] for i in itertools.product("aiueo", repeat=2)]
word_list3 = [i[0]+i[1]+i[2] for i in itertools.product("aiueo", repeat=3)]
word_list4 = [i[0]+i[1]+i[2]+i[3] for i in itertools.product("aiueo", repeat=4)]
word_list = word_list2 + word_list3 + word_list4
text_data = re.split("\u3000|\n", data)
kakasi = kakasi()
kakasi.setMode('J', 'a')
kakasi.setMode('H', 'a')
kakasi.setMode('K', 'a')
conv = kakasi.getConverter()
vowel_text_list = [conv.do(d) for d in text_data]
vowel_text_list = [re.sub(r"[^aeiou]+","",text) for text in vowel_text_list]
`itertools` is used to build the word list without duplicates. It is also used below, when computing cosine similarity, so that the pair (0,1) and the pair (1,0) are not both calculated.
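As a small illustration of those two uses (a sketch, not part of the script itself): `itertools.product` with `repeat=n` yields every length-n tuple exactly once, so the combined list has 25 + 125 + 625 = 775 unique entries, and `itertools.combinations` yields each unordered index pair once. The helper name `build_word_list` is made up for this example.

```python
import itertools

VOWELS = "aiueo"

def build_word_list(min_len=2, max_len=4):
    """Return every vowel-only string of min_len..max_len letters, each exactly once."""
    words = []
    for n in range(min_len, max_len + 1):
        # product() enumerates each length-n tuple once, so no duplicates arise
        words.extend("".join(p) for p in itertools.product(VOWELS, repeat=n))
    return words

word_list = build_word_list()
# combinations() gives (0, 1) but never (1, 0)
pairs = list(itertools.combinations(range(3), 2))

print(len(word_list), len(set(word_list)))
print(pairs)
```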
df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# Each column is named after a vowel word such as "aa": 1 if it appears in the sentence, otherwise 0.
binali_dic = {}
temp = []
for word in word_list:
    for vowel in vowel_text_list:
        if word in vowel:
            temp.append(1)
        else:
            temp.append(0)
    binali_dic[word] = temp
    temp = []
for k, v in binali_dic.items():
    df[k] = v
The third and subsequent columns indicate whether each vowel sequence treated as a word, such as "aa", appears in the sentence.
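To make the encoding concrete, here is a minimal sketch on two hand-written vowel strings and a shortened word list. Both the strings and the three-entry word list are made up for illustration; the real script uses all 775 entries.

```python
# Hypothetical vowel strings, as they might look after romanization and consonant removal
vowel_texts = ["aiueo", "oaiai"]
# Shortened word list, for illustration only
words = ["ai", "ue", "oo"]

# One row per sentence, one 0/1 entry per vowel word
binary_rows = [[1 if w in text else 0 for w in words] for text in vowel_texts]

for text, row in zip(vowel_texts, binary_rows):
    print(text, row)
```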
# Cosine similarity
def cosine_similarity(v1, v2):
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos_sim
# Take two row indices of df and return the vowel words common to both
def common_vowel(index1, index2):
    idx = df.iloc[index1, 2:].values + df.iloc[index2, 2:].values
    vowel_word = df.columns[2:]
    common_list = [vowel_word[i] for i in range(len(idx)) if idx[i] == 2]
    return common_list
# Cosine similarity ranking: list of (index, index, cos_sim, list of common vowel words)
def cos_sim_ranking(df, threshold):
    ranking = []
    idx = itertools.combinations(df.index, 2)
    for i in idx:
        cos_sim = cosine_similarity(df.iloc[i[0]][2:].values, df.iloc[i[1]][2:].values)
        if cos_sim > threshold:
            com_list = common_vowel(i[0], i[1])
            ranking.append((i[0], i[1], cos_sim, com_list))
    return sorted(ranking, key=lambda x: -x[2])
ranking = cos_sim_ranking(df, 0.4)
for r in ranking:
    print(df["Sentence"][r[0]] + ":" + df["Sentence"][r[1]])
    print("Common vowels:{}".format(r[3]))
    print()
For pairs above the cosine similarity threshold (an arbitrary value), the original sentences and their common vowel sequences are printed in descending order of similarity. The rhyme can be emphasized by inverting the word order of the original sentence so that the common vowels land at the beginning or end.
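For reference, the ranking rests on the cosine similarity v1·v2 / (|v1||v2|). The sketch below computes it for two small made-up binary vectors. The function name `cos_sim_safe` and the decision to return 0.0 when either vector is all zeros (for example, a sentence with no vowels, which would otherwise give a zero denominator) are choices made for this example.

```python
import numpy as np

def cos_sim_safe(v1, v2):
    """Cosine similarity that returns 0.0 instead of dividing by zero."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        # Assumption for this sketch: treat an all-zero vector as "not similar"
        return 0.0
    return float(np.dot(v1, v2) / denom)

a = [1, 1, 0, 1]  # made-up binary vectors
b = [1, 0, 0, 1]
print(cos_sim_safe(a, b))             # 2 / (sqrt(3) * sqrt(2))
print(cos_sim_safe(a, [0, 0, 0, 0]))  # guarded case
```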
df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# Columns are named after vowel words such as "aa"; the value is the number of occurrences.
count_dic = {}
temp = []
for word in word_list:
    for vowel in vowel_text_list:
        temp.append(vowel.count(word))
    count_dic[word] = temp
    temp = []
for k, v in count_dic.items():
    df[k] = v
# Take two row indices of df and return the common vowel words with their number of occurrences
def common_vowel(index1, index2):
    idx = df.iloc[index1, 2:].values + df.iloc[index2, 2:].values
    vowel_word = df.columns[2:]
    common_list = [(vowel_word[i], idx[i]) for i in range(len(idx)) if idx[i] >= 2]
    return common_list
The only differences from the binary version are how the data frame is built and that common_vowel now also returns the number of occurrences. The output differs even when the same threshold is used, and I felt that the count representation, which reflects the number of occurrences, works better.
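A small sketch of why the two representations can rank the same pair differently, using made-up vowel strings and a two-entry word list. Note also that `str.count` counts non-overlapping occurrences, so "aaa" contains "aa" only once by this measure.

```python
import numpy as np

vowel_texts = ["aiaiue", "aiue"]  # hypothetical vowel strings
words = ["ai", "ue"]              # shortened word list, for illustration only

binary = np.array([[1 if w in t else 0 for w in words] for t in vowel_texts], dtype=float)
counts = np.array([[t.count(w) for w in words] for t in vowel_texts], dtype=float)

def cos_sim(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# The binary vectors are identical, while the count vectors differ in direction
print(cos_sim(binary[0], binary[1]))
print(cos_sim(counts[0], counts[1]))
# Non-overlapping counting
print("aaa".count("aa"))
```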
The output for the test data was quite satisfactory. A vowel sequence whose combined count is 2 or more is treated as common, but some reach a count of 2 within a single sentence. This showed that a sentence can rhyme with itself, which was an unexpected bonus. At first I tried to handle the text in units as long as possible, but I remember giving up because I could not capture "a sentence that can rhyme with itself." After that I worried about how to split the text, so it is interesting that I ended up handling the sentences without splitting them at all. That does not mean what I did before was bad, and I am glad I came to see the advantages and disadvantages of each approach. There may still be minor corrections and improvements to make, but I found it interesting to represent a sentence by which vowel sequences it contains, so the "I want to handle the rhyme" series ends here for now.
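The two cases mentioned above (a vowel sequence shared by both sentences versus one repeated inside a single sentence) can be told apart by looking at the two counts separately instead of their sum. The sketch below is one possible way to do that. `split_common` and its return format are hypothetical names for illustration, not part of the script above.

```python
def split_common(counts1, counts2, words):
    """Separate vowel words shared by both sentences from those repeated in only one.

    counts1 and counts2 are per-sentence occurrence counts aligned with words.
    """
    shared = []      # appears at least once in both sentences
    self_rhyme = []  # total of 2 or more, but all occurrences are in one sentence
    for w, c1, c2 in zip(words, counts1, counts2):
        if c1 >= 1 and c2 >= 1:
            shared.append((w, c1 + c2))
        elif c1 + c2 >= 2:
            self_rhyme.append((w, c1 + c2))
    return shared, self_rhyme

words = ["ai", "ue", "oo"]  # shortened word list, for illustration only
shared, self_rhyme = split_common([1, 2, 0], [1, 0, 1], words)
print(shared)
print(self_rhyme)
```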
I would like to try this count representation on lyrics by actual rappers and see whether it reveals anything.