We have summarized the points to note when using the editing distance as the pronunciation similarity between Japanese sentences. We also showed a code example to find the editing distance for kana, vowels, and consonants.
You may want to find the pronunciation similarity between sentences and words when automatically generating puns and mishearing. At this time, I can think of using the edit distance. The edit distance is an index used as an index of the similarity of character strings, and is defined as the minimum number of edits required to edit (insert, delete, replace) one character string one character at a time to make another character string. I will. In Japanese, there is a one-to-one correspondence between katakana notation and pronunciation, so you can find out how close the pronunciation is by finding the editing distance for katakana notation. However, you need to be careful when handling yoon and long vowels. Specifically, the following processing is required.
Also, when it comes to pronunciation similarity, you may want to find an index that focuses only on vowels and consonants.
Based on the above background, we aim to implement the following functions.
macOS Catalina 10.15.7 python 3.8.0
Install the library that calculates the editing distance.
pip install editdistance
Input is a uniquely pronouncable katakana character string that meets the following conditions.
Create a function that divides the input character string that meets the above assumptions into Moura units.
Reference: Separate Japanese (Katakana) in mora units [Python]
import re
#Represent each condition with a regular expression
c1 = '[Ukusutsunufumyuruguzudubupuvu][ヮ yeo]' #Udan + "ヮ/A/I/E/Oh "
c2 = '[Ixishini Himirigi Jijibipi][Nyayo]' #I-dan (excluding "I") + "Ya"/Yu/E/Yo "
c3 = '[Tedde][Yayo]' #"Te/De "+"/I/Yu/Yo "
c4 = '[A-Vu]' #One katakana character (including long vowels)
cond = '('+c1+'|'+c2+'|'+c3+'|'+c4+')'
re_mora = re.compile(cond)
#Returns a list of katakana strings divided into moura units
def mora_wakachi(kana_text):
return re_mora.findall(kana_text)
Of the character strings divided into moura units above, create a function that converts long notes (stretched bars) into vowels that correspond to pronunciation. Convert according to the following rules.
First, in order to determine the "stage" of the immediately preceding element, create a function that converts one unit of moura into a vowel. Then, use that function to create a function that converts long vowels into vowels.
#Convert one unit of received kana into a vowel and return it
#Input is a kana character string equivalent to one unit of moura (length 1 or 2)
def char2vowel(text):
t = text[-1] #Just look at the last letter to convert to a vowel
if t in "Akasatana Hamayarawagazada Bapaya":
return "A"
elif t in "Ikishini Himiligi Jijibipi":
return "I"
elif t in "Ukusutsunufmuyuruguzudubupuvu":
return "C"
elif t in "Ekesetene Hemeregese de Bepee":
return "D"
elif t in "Okosotonohomoyorowogozodobopoyo":
return "Oh"
elif t == "N":
return "N"
elif t == "Tsu":
return "Tsu"
else:
print("no match")
return text
#Input is a list of kana divided by moura
def bar2vowel(kana_list):
output = []
output.append(kana_list[0])
#I don't expect a long vowel to come first, so start the loop from the second element
for i,v in enumerate(kana_list[1:]):
if v == "-":
kana = char2vowel(output[i])#Get the previous element from the output just in case the long vowels are continuous
else:
kana = v
output.append(kana)
return output
Create a function that converts a kana string to the corresponding vowel string. The input is a list of kana divided by moura. However, it is assumed that long vowels have been converted to vowels. Vowels in the output are represented by katakana in line A. In other words
char2vowel
function and replace it mechanically.def kana2vowel(kana_list):
output = []
for v in kana_list:
kana = char2vowel(v)
output.append(kana)
return output
Create a function that converts a kana string to the corresponding consonant string. The input is a list of kana divided by moura. However, it is assumed that long vowels have been converted to vowels. Consonants in the output shall be represented by the corresponding alphabet. In other words
The rules for converting to consonants are based on the table below.
consonant | Kana |
---|---|
a | A line, wa line, ya line |
k | Ka line, Ka line, Qua line |
s | Sa line, Sha line, Su line |
t | Ta line, Tya line, Cha line, Tsa line |
n | Na line, Nya line, Nua line |
h | Ha line, Fa line, Hya line |
m | Ma line, Mya line, Mu line |
y | Not applicable(Integrated into consonant a) |
r | La line, Lya line, Lua line |
w | Not applicable (integrated into consonant a) |
g | Ga line, Ga line, Gua line |
z | The line, Ja line, Ja line, Za line, Za line |
d | Da line, Da line |
b | Ba line, Va line, Bya line, Ba line |
p | Pa line, Pya line, Pu line |
q | Tsu |
N | N |
ky | Not applicable (integrated into consonant k) |
sh | Not applicable(Integrated into consonant s) |
ch | Not applicable (integrated into consonant t) |
ny | Not applicable(Integrated into consonant n) |
hy | Not applicable(Integrated into consonant h) |
f | Not applicable(Integrated into consonant h) |
my | Not applicable(Integrated into consonant m) |
ry | Not applicable(Integrated into consonant r) |
gy | Not applicable(Integrated into consonant g) |
j | Not applicable(Integrated into consonant z) |
by | Not applicable(Integrated into consonant b) |
py | Not applicable(Integrated into consonant p) |
v | Not applicable(Integrated into consonant b) |
In the above table, consonants that are intuitively similar, such as consonants that are not yoon, are integrated. If possible, it is best to define the similarity between each consonant as a real number from 0 to 1 and find the weighted editing distance. In addition, the consonants of A and N are originally correct as silence (sp), but here, in order to distinguish them, the symbols a, N, and q are given.
Based on the above table, first create a function that converts a single kana character to a consonant, and then use it to convert the entire character string to a consonant.
#Convert one unit of received kana into a vowel and return it
#Input is a kana character string equivalent to one unit of moura (length 1 or 2)
def char2vowel(text):
t = text[0] #Just look at the first letter to convert to a consonant
if t in "Aiueoyayuyowawo":
return "a"
elif t in "Kakikukeko":
return "k"
elif t in "Sashisuseso":
return "s"
elif t in "Tachitsuteto":
return "t"
elif t in "Na ni nu ne no":
return "n"
elif t in "Hahifuheho":
return "h"
elif t in "Mamim memo":
return "m"
elif t in "Larilrero":
return "r"
elif t in "Gagigugego":
return "g"
elif t in "Zazizu Zezojizu":
return "z"
elif t in "Dadedo":
return "d"
elif t in "Babib Bebov":
return "b"
elif t in "Papipupepo":
return "p"
elif t == "Tsu":
return "q"
elif t == "N":
return "N"
else:
print("no match")
return text
def kana2consonant(kana_list):
output = []
for v in kana_list:
kana = char2consonant(v)
output.append(kana)
return output
In order to use it differently depending on the situation, we will define some methods to find the editing distance.
The edit distance as defined is calculated by a function called eval in the editdistance library. To use it, just enter two strings as arguments.
import editdistance as ed
kana_list1 = ["Ko ","N","D","Ji","Wa"]
kana_list2 = ["Ko ","N","Ba","N","Wa"]
dist = ed.eval(kana_list1,kana_list2)
print(dist)#Output is 2
You can also find the edit distance when you cannot insert or delete and only replace. This is equivalent to comparing character string A and character string B one character at a time to find the number of different characters. Note that you can't insert and delete, that is, change the string length, so you can only use it when the two strings you want to compare are equal in length.
From the point of view of pronunciation similarity, the appearance of the same sound at the same timing may have a strong effect. If this effect is important, it is advantageous to adopt the edit distance by replacement only. In addition, since the comparison is done in order from the beginning, there is an advantage that the amount of calculation can be saved compared to the general editing distance.
#Find the edit distance by replacement only
def replace_ed(kana_list1, kana_list2):
if len(kana_list1) != (kana_list2): #kana_list1 and kana_Issue a warning when the string length of list2 is different
print("warning: length of kana_list1 is different from one of kana_list2")
dist = 0
for k1,k2 in zip(kana_list1, kana_list2):
if k1 != k2:
dist += 1
return dist
The longer the string, the larger the edit distance tends to be, so if you want to treat it as an absolute similarity, it may be better to standardize by the string length. Below is the code to calculate the standardized value of the general edit distance and the edit distance by replacement only by the character string length.
#General relative edit distance(kana_Normalized with the length of list1)
def relative_ed(kana_list1, kana_list2):
dist = ed.eval(kana_list1, kana_list2)
dist /= len(kana_list1)
return dist
#Relative edit distance by replacement only(kana_Normalized with the length of list1)
def relative_replace_ed(kana_list1, kana_list2):
dist = replace_ed(kana_list1, kana_list2)
dist /= len(kana_list1)
return dist
There is room for consideration as to which of the two character strings to be compared is used as the reference. For example, if you have a specific word A and you want to find the similarity between word A and some word candidates, it seems better to use the length of word A as a reference. You may want to use the average of the lengths of the two words, especially if you don't think of a reference word. This time, I showed the code to normalize by the character string length of the word of the first argument.
By combining the chords up to this point, create a function that calculates the editing distance of kana, vowels, and consonants. The editing distance for vowels can be calculated after converting kana to vowels. The same is true for consonants. From the contents introduced so far (Absolute or relative) x (general or permutation only) x (kana or vowel or consonant) So, there are 12 possible editing distances. I hope you can create the necessary functions according to your purpose. Here, 12 examples will be long, so here is an example of a function that branches the output according to the flag argument. Input is a katakana character string that meets the conditions raised in the "Input" section.
#text1, text2:Katakana character string that can be pronounced uniquely
#is_relative:When False and True, find the absolute and relative editing distances, respectively.
#ed_type: "normal","replace"Find the editing distance by replacement only, which is common when
#kana_type: "kana","vowel","consonant"Find the editing distances for kana, vowels, and consonants, respectively.
def calc_ed(text1, text2, is_relative=False, ed_type="normal", kana_type="kana"):
#Divide the input into Moura units
kana_list1 = mora_wakachi(text1)
kana_list2 = mora_wakachi(text2)
#Convert long vowels to vowels
kana_list1 = bar2vowel(kana_list1)
kana_list2 = bar2vowel(kana_list2)
#kana_Do nothing or convert to vowels or consonants depending on type
if kana_type=="kana":
pass
elif kana_type == "vowel":
kana_list1 = kana2vowel(kana_list1)
kana_list2 = kana2vowel(kana_list2)
elif kana_type == "consonant":
kana_list1 = kana2consonant(kana_list1)
kana_list2 = kana2consonant(kana_list2)
else:
print("warning: kana_type is invalid")
#ed_Find the editing distance according to type
dist = 0
if ed_type == "normal":
dist = ed.eval(kana_list1, kana_list2)
elif ed_type == "replace":
dist = replace_ed(kana_list1, kana_list2)
else:
print("warning: ed_type is invalid")
#is_If relative is True, kana_Normalize with the character string length of list1
if is_relative:
dist /= len(kana_list1)
return dist
The whole code for copy and paste. Classes are provided for easy management.
import re
import editdistance as ed
class EditDistanceUtil():
def __init__(self):
self.re_mora = self.__get_mora_unit_re()
#Get a regular expression pattern to split into moula units
def __get_mora_unit_re(self):
#Represent each condition with a regular expression
c1 = '[Ukusutsunufumyuruguzudubupuvu][ヮ yeo]' #Udan + "ヮ/A/I/E/Oh "
c2 = '[Ixishini Himirigi Jijibipi][Nyayo]' #I-dan (excluding "I") + "Ya"/Yu/E/Yo "
c3 = '[Tedde][Yayo]' #"Te/De "+"/I/Yu/Yo "
c4 = '[A-Vu]' #One katakana character (including long vowels)
cond = '('+c1+'|'+c2+'|'+c3+'|'+c4+')'
re_mora = re.compile(cond)
return re_mora
#Returns a list of katakana strings divided into moura units
def mora_wakachi(self, kana_text):
return self.re_mora.findall(kana_text)
def char2vowel(self, text):
t = text[-1] #Just look at the last letter to convert to a vowel
if t in "Akasatana Hamayarawagazada Bapaya":
return "A"
elif t in "Ikishini Himiligi Jijibipi":
return "I"
elif t in "Ukusutsunufmuyuruguzudubupuvu":
return "C"
elif t in "Ekesetene Hemeregese de Bepee":
return "D"
elif t in "Okosotonohomoyorowogozodobopoyo":
return "Oh"
elif t == "N":
return "N"
elif t == "Tsu":
return "Tsu"
else:
print(text, "no match")
return text
#Convert long vowels to vowels
#Input is a list of kana divided by moura
def bar2vowel(self, kana_list):
output = []
output.append(kana_list[0])
#I don't expect a long vowel to come first, so start the loop from the second element
for i,v in enumerate(kana_list[1:]):
if v == "-":
kana = self.char2vowel(output[i])#Get the previous element from the output just in case the long vowels are continuous
else:
kana = v
output.append(kana)
return output
#Convert kana to vowels
def kana2vowel(self, kana_list):
output = []
for v in kana_list:
kana = self.char2vowel(v)
output.append(kana)
return output
#Converts 1 unit of received kana into consonants and returns
#Input is a kana character string equivalent to one unit of moura (length 1 or 2)
def char2consonant(self, text):
t = text[0] #Just look at the first letter to convert to a consonant
if t in "Aiueoyayuyowawo":
return "a"
elif t in "Kakikukeko":
return "k"
elif t in "Sashisuseso":
return "s"
elif t in "Tachitsuteto":
return "t"
elif t in "Na ni nu ne no":
return "n"
elif t in "Hahifuheho":
return "h"
elif t in "Mamim memo":
return "m"
elif t in "Larilrero":
return "r"
elif t in "Gagigugego":
return "g"
elif t in "Zazizu Zezojizu":
return "z"
elif t in "Dadedo":
return "d"
elif t in "Babib Bebov":
return "b"
elif t in "Papipupepo":
return "p"
elif t == "Tsu":
return "q"
elif t == "N":
return "N"
else:
print(text, "no match")
return text
#Returns a kana string as a consonant
def kana2consonant(self, kana_list):
output = []
for v in kana_list:
kana = self.char2consonant(v)
output.append(kana)
return output
#Find the edit distance by replacement only
def replace_ed(self, kana_list1, kana_list2):
if len(kana_list1) != len(kana_list2): #kana_list1 and kana_Issue a warning when the string length of list2 is different
print("warning: length of kana_list1 is different from one of kana_list2")
dist = 0
for k1,k2 in zip(kana_list1, kana_list2):
if k1 != k2:
dist += 1
return dist
#text1, text2:Katakana character string that can be pronounced uniquely
#is_relative:When False and True, find the absolute and relative editing distances, respectively.
#ed_type: "normal","replace"Find the editing distance by replacement only, which is common when
#kana_type: "kana","vowel","consonant"Find the editing distances for kana, vowels, and consonants, respectively.
def calc_ed(self, text1, text2, is_relative=False, ed_type="normal", kana_type="kana"):
#Divide the input into Moura units
kana_list1 = self.mora_wakachi(text1)
kana_list2 = self.mora_wakachi(text2)
#Convert long vowels to vowels
kana_list1 = self.bar2vowel(kana_list1)
kana_list2 = self.bar2vowel(kana_list2)
#kana_Do nothing or convert to vowels or consonants depending on type
if kana_type=="kana":
pass
elif kana_type == "vowel":
kana_list1 = self.kana2vowel(kana_list1)
kana_list2 = self.kana2vowel(kana_list2)
elif kana_type == "consonant":
kana_list1 = self.kana2consonant(kana_list1)
kana_list2 = self.kana2consonant(kana_list2)
else:
print("warning: kana_type is invalid")
#ed_Find the editing distance according to type
dist = 0
if ed_type == "normal":
dist = ed.eval(kana_list1, kana_list2)
elif ed_type == "replace":
dist = self.replace_ed(kana_list1, kana_list2)
else:
print("warning: ed_type is invalid")
#is_If relative is True, kana_Normalize with the character string length of list1
if is_relative:
dist /= len(kana_list1)
return dist
#Find the editing distance of Kana
def kana_ed(self, text1,text2):
return self.calc_ed(text1, text2, False, "normal", "kana")
#Find the vowel editing distance
def vowel_ed(self, text1,text2):
return self.calc_ed(text1, text2, False, "normal", "vowel")
#Find the editing distance of consonants
def consonant_ed(self, text1,text2):
return self.calc_ed(text1, text2, False, "normal", "consonant")
#Find the edit distance only by replacing the kana
def kana_replace_ed(self, text1,text2):
return self.calc_ed(text1, text2, False, "replace", "kana")
#Find the editing distance by replacing vowels only
def vowel_replace_ed(self, text1,text2):
return self.calc_ed(text1, text2, False, "replace", "vowel")
#Find the editing distance by replacing consonants only
def consonant_replace_ed(self, text1,text2):
return self.calc_ed(text1, text2, False, "replace", "consonant")
#Find the relative editing distance of Kana
def relative_kana_ed(self, text1,text2):
return self.calc_ed(text1, text2, True, "normal", "kana")
#Find the relative editing distance of vowels
def relative_vowel_ed(self, text1,text2):
return self.calc_ed(text1, text2, True, "normal", "vowel")
#Find the relative editing distance of consonants
def relative_consonant_ed(self, text1,text2):
return self.calc_ed(text1, text2, True, "normal", "consonant")
#Find the relative editing distance only by replacing the kana
def relative_kana_replace_ed(self, text1,text2):
return self.calc_ed(text1, text2, True, "replace", "kana")
#Find the relative editing distance by replacing vowels only
def relative_vowel_replace_ed(self, text1,text2):
return self.calc_ed(text1, text2, True, "replace", "vowel")
#Find the relative editing distance by replacing consonants only
def relative_consonant_replace_ed(self, text1,text2):
return self.calc_ed(text1, text2, True, "replace", "consonant")
if __name__=="__main__":
edu = EditDistanceUtil()
text1 = "Hello"
text2 = "Wacontara"
print("normal edit distance")
print(edu.kana_ed(text1,text2))
print("")
print(edu.kana2vowel(text1), edu.kana2vowel(text2))
print(edu.vowel_ed(text1,text2))
print("")
print(edu.kana2consonant(text1), edu.kana2consonant(text2))
print(edu.consonant_ed(text1,text2))
print("")
print("replace edit distance")
print(edu.kana_replace_ed(text1,text2))
print("")
print(edu.kana2vowel(text1), edu.kana2vowel(text2))
print(edu.vowel_replace_ed(text1,text2))
print("")
print(edu.kana2consonant(text1), edu.kana2consonant(text2))
print(edu.consonant_replace_ed(text1,text2))
print("")
print("relative normal edit distance")
print(edu.relative_kana_ed(text1,text2))
print("")
print(edu.kana2vowel(text1), edu.kana2vowel(text2))
print(edu.relative_vowel_ed(text1,text2))
print("")
print(edu.kana2consonant(text1), edu.kana2consonant(text2))
print(edu.relative_consonant_ed(text1,text2))
print("")
print("relative replace edit distance")
print(edu.relative_kana_replace_ed(text1,text2))
print("")
print(edu.kana2vowel(text1), edu.kana2vowel(text2))
print(edu.relative_vowel_replace_ed(text1,text2))
print("")
print(edu.kana2consonant(text1), edu.kana2consonant(text2))
print(edu.relative_consonant_replace_ed(text1,text2))
Recommended Posts