Since I made a graph last time, this time I will try clustering and related analysis. I have also expanded the way rhymes are captured: "ei" is converted to "ee" and "ou" to "oo", and words whose vowels match after this conversion are also treated as rhyming. This is based on katakana English and on Japanese readings that children easily get wrong. The reason I say "e i" rather than the string "ei" is so that "エキ (eki)" does not turn into "エエ (ee)": the "i" and "u" must be the standalone vowels イ and ウ, not the vowel of some other syllable. (So "レイゾウコ (reizouko)" is read as "レーゾーコ (reezooko)".)
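To see why the rule must target the standalone vowel characters, here is a small illustration of the failure mode (my own example, not from the original code):

vowel_skeleton = "ei"  #the vowel skeleton of エキ (e-ki); キ is a full syllable, not the vowel イ
print(vowel_skeleton.replace("ei", "ee"))  #prints "ee", which would wrongly turn エキ into エエ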
import networkx as nx
import matplotlib.pyplot as plt
import community  #python-louvain

#edge_list (weighted rhyme-score edges) and dic carry over from the previous article
G = nx.Graph()
G.add_weighted_edges_from(edge_list)

#Clustering (Louvain community detection)
partition = community.best_partition(G, weight="weight")

#Gather the nodes of each community into a list:
#[[nodes of community 0], [nodes of community 1], ...]
part_sub = [[] for _ in set(partition.values())]
for key in partition.keys():
    part_sub[partition[key]].append(key)

#For each community, pick the node with the highest eigenvector centrality
max_eig_cent_node = []
for part in part_sub:
    G_part = nx.Graph()
    for edge in edge_list:
        if edge[0] in part and edge[1] in part:
            G_part.add_weighted_edges_from([edge])
    centrality = nx.eigenvector_centrality_numpy(G_part, weight="weight")
    max_eig_cent_node.append(max(centrality, key=centrality.get))

print([dic[i] for i in max_eig_cent_node])

#Modularity of the partition
print(community.modularity(partition, G))
Clustering was performed, and the node with the maximum eigenvector centrality was found for each community. If the division is good, setting each of these nodes as target_word should give good results. It may also be worth putting a threshold on the score used for the edge weights, so that the weights become more clearly differentiated.
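As a minimal sketch of that thresholding idea, assuming edge_list holds (node_a, node_b, score) tuples as used above; the cut-off value is a placeholder of mine:

THRESHOLD = 2.0  #placeholder value; tune against the actual score distribution
edge_list = [(a, b, w) for (a, b, w) in edge_list if w >= THRESHOLD]
G = nx.Graph()
G.add_weighted_edges_from(edge_list)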
from pykakasi import kakasi
import re

with open("./gennama.txt", "r", encoding="utf-8") as f:
    data = f.read()

kakasi = kakasi()
kakasi.setMode('J', 'K')  #kanji → katakana
kakasi.setMode('H', 'K')  #hiragana → katakana
conv = kakasi.getConverter()
text_data = conv.do(data)
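For reference, with this setup conv.do maps kanji and hiragana to katakana, so the refrigerator example from the introduction can be checked like this (my own illustration):

print(conv.do("冷蔵庫"))  #expected: レイゾウコ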
#Get the text with the expansions "e イ" → "ee" and "o ウ" → "oo" applied
def expansion(text_data):
    #Record the original length; the extra trailing "i"/"u" introduced below is trimmed back to this length
    text_data_len = len(text_data)
    #Handle runs of イ/ウ (as in いい椅子 or そういう噂), which would otherwise yield empty segments when splitting
    text_data = text_data.replace("イイ", "イi").replace("ウウ", "ウu")
    text_data = text_data.split("イ")
    new_text_data = []
    kakasi.setMode('K', 'a')  #katakana → romaji, to inspect the vowel of the preceding character
    conv = kakasi.getConverter()
    for i in range(len(text_data)):
        if len(text_data[i]) > 0:
            #If the character before イ carries the vowel e, read イ as a long e
            if "e" in conv.do(text_data[i][-1]):
                new_text_data.append(text_data[i] + "e")
            else:
                new_text_data.append(text_data[i] + "i")
    text_data = "".join(new_text_data).split("ウ")
    new_text_data = []
    for i in range(len(text_data)):
        if len(text_data[i]) > 0:
            #If the character before ウ carries the vowel o, read ウ as a long o
            if "o" in conv.do(text_data[i][-1]):
                new_text_data.append(text_data[i] + "o")
            else:
                new_text_data.append(text_data[i] + "u")
    return "".join(new_text_data)[:text_data_len]

print(expansion(text_data))
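As a quick sanity check of the rule from the introduction, my own example rather than the author's:

print(expansion("レイゾウコ"))  #レeゾoコ: イ after an e vowel and ウ after an o vowel become long vowels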
At first my plan was to convert the data to katakana, split it on イ and ウ, and vary the handling according to the vowel of the immediately preceding character, but this gave me some trouble. Unless the data happens to end in イ or ウ, an extra "i" or "u" remains at the end after rejoining. I dealt with that by trimming the result to the length of the input data, but when I printed the output an "i" still remained at the end: I had not expected consecutive runs of イ and ウ, as in いい or そういう. As usual, things do not go smoothly when you actually try them, and you often do not notice these cases until you do.
Next, I will score the matching parts for each of the three representations (the katakana-converted data, the data with only the vowels kept, and the expanded data) to capture the three kinds of match (consonant match, vowel match, full sound match). I judged it unnecessary to look separately at matches of long vowels, nasals, and geminates. In other words, it is time to consolidate what has been done so far. N-grams and splitting on spaces should be taken into account, and how to detect the matching parts is still an open question. I would like to settle on the current best method, prepare some input data, and verify it.
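One possible shape for such a score, as a minimal sketch: the character n-gram matching, the pairing of representations to match types, and the equal weighting are my assumptions, not something fixed in this post.

def shared_ngrams(a, b, n=2):
    #Number of character n-grams the two strings have in common
    def grams(s):
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    return len(grams(a) & grams(b))

def rhyme_score(kana_a, kana_b, vow_a, vow_b, exp_a, exp_b):
    #Sum matches over the katakana form, the vowels-only form,
    #and the long-vowel-expanded form
    return (shared_ngrams(kana_a, kana_b)
            + shared_ngrams(vow_a, vow_b)
            + shared_ngrams(exp_a, exp_b))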