Let's try something a little different this time. I had been worrying about how to split the text, but it occurred to me that, for matching the vowels "aiueo", I could line up various vowel patterns and compare sentences by whether each pattern appears in them. I will try it based on that idea. In other words, I apply the binary representation, the one that "ignores frequency of appearance and records only whether or not each word appears", to vowel patterns instead of ordinary words.
from pykakasi import kakasi
import re
import numpy as np
import pandas as pd
with open("./gennama.txt", "r", encoding="utf-8") as f:
    data = f.read()
vowel_list = ["a","i","u","e","o"]
# Word list: every 2- to 4-letter word that can be made from vowels alone. 775 types.
word_list = []
for i in vowel_list:
    for j in vowel_list:
        word_list.append(i + j)
        for k in vowel_list:
            word_list.append(i + j + k)
            for l in vowel_list:
                word_list.append(i + j + k + l)
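As a sanity check on the count, the same list can be built with `itertools.product`: 5² + 5³ + 5⁴ = 25 + 125 + 625 = 775 distinct words (this snippet is only an illustration, not part of the original script):

```python
from itertools import product

vowels = "aiueo"

# Every 2-, 3-, and 4-letter string over the five vowels
words = ["".join(p) for n in (2, 3, 4) for p in product(vowels, repeat=n)]

print(len(words))  # 25 + 125 + 625 = 775
```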
text_data = re.split("\u3000|\n", data)
kakasi = kakasi()
kakasi.setMode('J', 'a')
kakasi.setMode('H', 'a')
kakasi.setMode('K', 'a')
conv = kakasi.getConverter()
vowel_text_list = [conv.do(d) for d in text_data]
vowel_text_list = [re.sub(r"[^aeiou]+","",text) for text in vowel_text_list]
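To illustrate the vowel-extraction step in isolation, here is what the regex does to an already-romanized string (the sentence is a made-up example, not a line from gennama.txt; in the script the romaji comes from kakasi):

```python
import re

# Hypothetical romanized sentence standing in for kakasi output
romaji = "kyou wa ii tenki desu"

# Drop every run of characters that is not one of the five vowels
vowels_only = re.sub(r"[^aeiou]+", "", romaji)

print(vowels_only)  # "ouaiieieu"
```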
Other than that, it is simple. There are not that many word types, only 775. From my experience so far, vowel matches of five or more characters are extremely rare, so I limited the words to four characters. Until now I had created various dictionaries, but this time I will summarize everything in a DataFrame.
df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# If the word in the column name (e.g. "aa") appears in the sentence's vowel string, 1; otherwise 0.
binary_dic = {}
temp = []
for word in word_list:
    for vowel in vowel_text_list:
        if word in vowel:
            temp.append(1)
        else:
            temp.append(0)
    binary_dic[word] = temp
    temp = []
for k, v in binary_dic.items():
    df[k] = v
df.to_csv("df_test.csv")
The columns are "Sentence", "vowel", and one column per word: "Sentence" holds the segments into which the original text was split, "vowel" holds each segment converted to vowels only, and each word column holds 1 if that word appears in the segment's vowel string and 0 otherwise.
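A toy version of the same encoding, using two made-up vowel strings in place of the real corpus, shows the shape of the result:

```python
import pandas as pd

# Hypothetical vowel strings standing in for converted sentences
vowel_text_list = ["aiai", "ueo"]
word_list = ["ai", "ue", "aia"]

# One row per sentence, one 0/1 column per word
binary_dic = {
    word: [1 if word in vowel else 0 for vowel in vowel_text_list]
    for word in word_list
}

df = pd.DataFrame(binary_dic)
print(df)  # "ai" and "aia" flag the first sentence, "ue" the second
```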
#Cosine similarity
def cosine_similarity(v1, v2):
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos_sim
print(cosine_similarity(df.iloc[0, 2:].values, df.iloc[3, 2:].values))
For example, this displays the similarity between sentence 0 and sentence 3. You can also take the column sums to quickly see which vowel words appear most often.
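As a concrete check of the formula, here is the cosine similarity of two small binary vectors (made-up values, not rows from the real DataFrame):

```python
import numpy as np

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Two hypothetical appearance vectors that share two flagged words
v1 = np.array([1, 1, 0, 1])
v2 = np.array([1, 0, 0, 1])

print(cosine_similarity(v1, v2))  # 2 / (sqrt(3) * sqrt(2)) ≈ 0.816
```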
Since I focused only on vowels, the vocabulary stayed at 775 words even after considering every 2- to 4-letter combination. Until now I had assumed I needed to split the text before I could work with it, but it turns out there are things I can do with the text as it is. That was a big realization for me, so I wrote this article even though the content is thin. Going forward I will build on the DataFrame created here, for example to see whether I can do more with the similarity between sentences.