Following the flow so far, let's perform a network analysis: the segments of the input text are the nodes, and vowel matches between segments give the edge weights. The goal is to draw the graph and look at centrality.
```python
from pykakasi import kakasi
import re
from janome.tokenizer import Tokenizer

with open("./gennama.txt", "r") as f:
    data = f.read()

tokenizer = Tokenizer()
tokens = tokenizer.tokenize(data)

surface_list = []
part_of_speech_list = []
for token in tokens:
    surface_list.append(token.surface)
    part_of_speech_list.append(token.part_of_speech.split(",")[0])

# Drop symbols, and attach particles and auxiliary verbs to the preceding segment.
segment_text = []
for i in range(len(surface_list)):
    if part_of_speech_list[i] == "記号":  # symbol
        continue
    elif part_of_speech_list[i] in ("助詞", "助動詞"):  # particle, auxiliary verb
        row = segment_text.pop(-1) + surface_list[i]
    else:
        row = surface_list[i]
    segment_text.append(row)

# Convert each segment to romaji, then keep only the vowels.
kakasi = kakasi()
kakasi.setMode('H', 'a')  # hiragana -> ascii
kakasi.setMode('K', 'a')  # katakana -> ascii
kakasi.setMode('J', 'a')  # kanji -> ascii
conv = kakasi.getConverter()
text_data = [conv.do(text) for text in segment_text]
vowel_data = [re.sub(r"[^aeiou]+", "", text) for text in text_data]
# Index -> vowel string, e.g. {0: "oea"}
dic_vo = {k: v for k, v in enumerate(vowel_data)}
# Index -> original segment before vowel conversion, e.g. {0: "I am"}
dic = {k: v for k, v in enumerate(segment_text)}
```
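For intuition, the vowel-extraction step on its own behaves like this (the romaji string below is made up for illustration, not actual kakasi output):

```python
import re

sample = "kyou wa ii tenki"  # hypothetical romaji text
print(re.sub(r"[^aeiou]+", "", sample))  # -> "ouaiiei"
```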
I reuse what was done in part 3, having judged that N-grams are not a good fit this time. There are as many nodes as there are keys in dic_vo, and two nodes are connected when their vowel strings share a substring; the longer the match, the larger the edge weight. The method is the one built in part 1, except that a node's connection to itself is excluded and only matches of two or more characters count.
```python
# Take dic_vo and build (node, node, score) tuples:
# indices are the nodes, vowel matches are the edges, and the score is the weight.
def create_edge(dic_vo):
    node_len = len(dic_vo)
    edge_list = []
    for i in range(node_len):
        for j in range(node_len):
            score = create_weight(dic_vo[i], dic_vo[j])
            if score != 0:
                edge_list.append((i, j, score))
    return edge_list

def create_weight(word_a, word_b):
    # Identical vowel strings (including a node paired with itself) get no edge.
    if word_a == word_b:
        return 0
    # Take substrings from the shorter string and search for them in the longer one.
    if len(word_a) > len(word_b):
        word_a, word_b = word_b, word_a
    weight = 0
    max_len = len(word_a)
    for i in range(max_len):
        for j in range(i + 2, max_len + 1):  # only substrings of two or more characters
            if word_a[i:j] in word_b:
                weight += len(word_a[i:j])
    return weight

edge_list = create_edge(dic_vo)
```
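As a quick sanity check, here is how the weighting behaves on a couple of made-up vowel strings (the inputs are hypothetical, not taken from the corpus):

```python
# Shared substrings of length >= 2 each add their length to the weight.
print(create_weight("aio", "aiu"))   # "ai" matches -> 2
print(create_weight("oea", "aoea"))  # "oe", "oea", "ea" match -> 2 + 3 + 2 = 7
print(create_weight("oea", "oea"))   # identical strings -> 0 (no self-edge)
```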
After that, draw a graph based on this edge_list. Then find the nodes with the highest eigenvector centrality and betweenness centrality, and display their original text.
```python
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_weighted_edges_from(edge_list)
pos = nx.spring_layout(G)
nx.draw_networkx_edges(G, pos)
plt.show()

# Eigenvector centrality
cent = nx.eigenvector_centrality_numpy(G)
max_cent_node = max(cent, key=cent.get)

# Betweenness centrality (communicability betweenness takes no weight argument)
between_cent = nx.communicability_betweenness_centrality(G)
max_betw_node = max(between_cent, key=between_cent.get)

print("High eigenvector centrality: " + dic[max_cent_node])
print("High betweenness centrality: " + dic[max_betw_node])
```
As expected, the result is the same as the "narrowing down the target_word" step done in part 2. That's natural, since it is doing the same thing, but networkx seems to offer more that can be done with this graph, so I will keep pursuing it.
When scoring, I want to pay attention to "i" and "u": when the preceding vowel is "e" or "o", that is, for "ei" and "ou", convert them to "ee" and "oo" before checking vowel matches. Even in Japanese these are hard to distinguish (think of how loanwords are pronounced), and the sounds can be considered the same. This may be frowned upon in how the rap world treats rhymes, but I will try it. ~~By the way, have you ever heard the real "ABC song"? It's the one that runs "LMNOP" together as "elemenopee". That kind of "rhythm" has been familiar to me since childhood. I will not forget my respect for Japanese rap, but I will try loosening the rhyme a little.~~
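A minimal sketch of that normalization, applied to the vowel strings before matching (the function name is mine, not from the original):

```python
def normalize_long_vowels(vowels):
    # Treat "ei" as "ee" and "ou" as "oo" so they match as the same sound.
    return vowels.replace("ei", "ee").replace("ou", "oo")

vowel_data = [normalize_long_vowels(v) for v in vowel_data]
```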