Let's try something a little different this time. I had been worrying about how to split the text, but it occurred to me that, for matching the vowels "aiueo", I could line up various vowel patterns and compare sentences by whether each pattern appears in them. I will try it based on that idea. In other words, I apply the binary representation, the one that "ignores frequency of appearance and records only whether or not each word appears", to vowel patterns instead of ordinary words.
from pykakasi import kakasi
import re
import numpy as np
import pandas as pd
with open("./gennama.txt", "r", encoding="utf-8") as f:
    data = f.read()
vowel_list = ["a","i","u","e","o"]
# Word list: every 2- to 4-letter word that can be made from vowels alone. 775 types.
word_list = []
for i in vowel_list:
    for j in vowel_list:
        word_list.append(i + j)
        for k in vowel_list:
            word_list.append(i + j + k)
            for l in vowel_list:
                word_list.append(i + j + k + l)
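As a sanity check on the count, the same list can be built with `itertools.product`: 5² + 5³ + 5⁴ = 25 + 125 + 625 = 775 distinct words (this snippet is only an illustration, not part of the original script):

```python
from itertools import product

vowels = "aiueo"

# Every 2-, 3-, and 4-letter string over the five vowels
words = ["".join(p) for n in (2, 3, 4) for p in product(vowels, repeat=n)]

print(len(words))  # 25 + 125 + 625 = 775
```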
text_data = re.split("\u3000|\n", data)
kakasi = kakasi()
kakasi.setMode('J', 'a')
kakasi.setMode('H', 'a')
kakasi.setMode('K', 'a')
conv = kakasi.getConverter()
vowel_text_list = [conv.do(d) for d in text_data]
vowel_text_list = [re.sub(r"[^aeiou]+","",text) for text in vowel_text_list]
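To illustrate the vowel-extraction step in isolation, here is what the regex does to an already-romanized string (the sentence is a made-up example, not a line from gennama.txt; in the script the romaji comes from kakasi):

```python
import re

# Hypothetical romanized sentence standing in for kakasi output
romaji = "kyou wa ii tenki desu"

# Drop every run of characters that is not one of the five vowels
vowels_only = re.sub(r"[^aeiou]+", "", romaji)

print(vowels_only)  # "ouaiieieu"
```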
Other than that, it is simple. There are not that many word types, only 775. From my experience so far, vowel matches of five or more characters are extremely rare, so I limited the words to four characters. Until now I had created various dictionaries, but this time I will summarize everything in a DataFrame.
df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# If the word in the column name (e.g. "aa") appears in the sentence's vowel string, 1; otherwise 0.
binary_dic = {}
temp = []
for word in word_list:
    for vowel in vowel_text_list:
        if word in vowel:
            temp.append(1)
        else:
            temp.append(0)
    binary_dic[word] = temp
    temp = []
for k, v in binary_dic.items():
    df[k] = v
df.to_csv("df_test.csv")
The columns are "Sentence", "vowel", and one column per word: "Sentence" holds the segments into which the original text was split, "vowel" holds each segment converted to vowels only, and each word column holds 1 if that word appears in the segment's vowel string and 0 otherwise.
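A toy version of the same encoding, using two made-up vowel strings in place of the real corpus, shows the shape of the result:

```python
import pandas as pd

# Hypothetical vowel strings standing in for converted sentences
vowel_text_list = ["aiai", "ueo"]
word_list = ["ai", "ue", "aia"]

# One row per sentence, one 0/1 column per word
binary_dic = {
    word: [1 if word in vowel else 0 for vowel in vowel_text_list]
    for word in word_list
}

df = pd.DataFrame(binary_dic)
print(df)  # "ai" and "aia" flag the first sentence, "ue" the second
```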
#Cosine similarity
def cosine_similarity(v1, v2):
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos_sim
print(cosine_similarity(df.iloc[0, 2:].values, df.iloc[3, 2:].values))
For example, this displays the similarity between sentence 0 and sentence 3. You can also take the column sums to quickly see which vowel words appear most often.
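As a concrete check of the formula, here is the cosine similarity of two small binary vectors (made-up values, not rows from the real DataFrame):

```python
import numpy as np

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Two hypothetical appearance vectors that share two flagged words
v1 = np.array([1, 1, 0, 1])
v2 = np.array([1, 0, 0, 1])

print(cosine_similarity(v1, v2))  # 2 / (sqrt(3) * sqrt(2)) ≈ 0.816
```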
Since I focused only on vowels, the vocabulary stayed at 775 words even after considering every 2- to 4-letter combination. Until now I had assumed I needed to split the text before I could work with it, but it turns out there are things I can do with the text as it is. That was a big realization for me, so I wrote this article even though the content is thin. Going forward I will build on the DataFrame created here, for example to see whether I can do more with the similarity between sentences.