Overview

We have summarized the points to note when using the editing distance as the pronunciation similarity between Japanese sentences. We also showed a code example to find the editing distance for kana, vowels, and consonants.

background

You may want to find the pronunciation similarity between sentences and words when automatically generating puns and mishearing. At this time, I can think of using the edit distance. The edit distance is an index used as an index of the similarity of character strings, and is defined as the minimum number of edits required to edit (insert, delete, replace) one character string one character at a time to make another character string. I will. In Japanese, there is a one-to-one correspondence between katakana notation and pronunciation, so you can find out how close the pronunciation is by finding the editing distance for katakana notation. However, you need to be careful when handling yoon and long vowels. Specifically, the following processing is required.

Since yoon (sha, etc.) is pronounced as one beat, it is treated as one unit instead of two.
Since the pronunciation of long notes (stretched bars) differs depending on the immediately preceding kana (even if they are the same character), convert them to the corresponding kana and then calculate the editing distance.

Also, when it comes to pronunciation similarity, you may want to find an index that focuses only on vowels and consonants.

Based on the above background, we aim to implement the following functions.

Divide the kana string into moura units
Convert long vowels to vowels that correspond to the actual pronunciation
Convert a kana string to the corresponding vowel string
Convert a kana string to the corresponding consonant string

environment

macOS Catalina 10.15.7 python 3.8.0

Installation

Install the library that calculates the editing distance.

pip install editdistance

input

Input is a uniquely pronouncable katakana character string that meets the following conditions.

It is a character string of katakana only.
Does not include strings that start with long notes (stretched bars) or small katakana, or combinations of undefined yoons of general pronunciation.
The particles "ha" and "he" have been corrected to the actual pronunciation sounds ("wa" and "e").

Implementation of preprocessing function

Divide the kana string into moura units

Create a function that divides the input character string that meets the above assumptions into Moura units.

Reference: Separate Japanese (Katakana) in mora units [Python]

import re

#Represent each condition with a regular expression
c1 = '[Ukusutsunufumyuruguzudubupuvu][ヮ yeo]' #Udan + "ヮ/A/I/E/Oh "
c2 = '[Ixishini Himirigi Jijibipi][Nyayo]' #I-dan (excluding "I") + "Ya"/Yu/E/Yo "
c3 = '[Tedde][Yayo]' #"Te/De "+"/I/Yu/Yo "
c4 = '[A-Vu]' #One katakana character (including long vowels)

cond = '('+c1+'|'+c2+'|'+c3+'|'+c4+')'
re_mora = re.compile(cond)

#Returns a list of katakana strings divided into moura units
def mora_wakachi(kana_text):
    return re_mora.findall(kana_text)

Convert long vowels to vowels that correspond to the actual pronunciation

Of the character strings divided into moura units above, create a function that converts long notes (stretched bars) into vowels that correspond to pronunciation. Convert according to the following rules.

When the immediately preceding element is A/I/U/E/O, change the long sound to A/I/U/E/O.
When the previous element is a kana (tsu/n) that does not have a vowel, make the long sound the same as the previous element.
Do not convert otherwise (if the input assumptions are met, the "otherwise" situation should not occur)

First, in order to determine the "stage" of the immediately preceding element, create a function that converts one unit of moura into a vowel. Then, use that function to create a function that converts long vowels into vowels.

#Convert one unit of received kana into a vowel and return it
#Input is a kana character string equivalent to one unit of moura (length 1 or 2)
def char2vowel(text):
  t = text[-1] #Just look at the last letter to convert to a vowel
  if t in "Akasatana Hamayarawagazada Bapaya":
    return "A"
  elif t in "Ikishini Himiligi Jijibipi":
    return "I"
  elif t in "Ukusutsunufmuyuruguzudubupuvu":
    return "C"
  elif t in "Ekesetene Hemeregese de Bepee":
    return "D"
  elif t in "Okosotonohomoyorowogozodobopoyo":
    return "Oh"
  elif t == "N":
    return "N"
  elif t == "Tsu":
    return "Tsu"
  else:
    print("no match")
    return text

#Input is a list of kana divided by moura
def bar2vowel(kana_list):
  output = []
  output.append(kana_list[0])
  #I don't expect a long vowel to come first, so start the loop from the second element
  for i,v in enumerate(kana_list[1:]):
    if v == "-":
      kana = char2vowel(output[i])#Get the previous element from the output just in case the long vowels are continuous
    else:
      kana = v
    output.append(kana)
  return output

Convert a kana string to the corresponding vowel string

Create a function that converts a kana string to the corresponding vowel string. The input is a list of kana divided by moura. However, it is assumed that long vowels have been converted to vowels. Vowels in the output are represented by katakana in line A. In other words

Input: ["ko", "n", "d", "chi", "wa"]
Output: ["O", "N", "I", "I", "A"] Aim for conversion like. All you have to do is use the char2vowel function and replace it mechanically.

def kana2vowel(kana_list):
  output = []
  for v in kana_list:
    kana = char2vowel(v)
    output.append(kana)
  return output

Convert a kana string to the corresponding consonant string

Create a function that converts a kana string to the corresponding consonant string. The input is a list of kana divided by moura. However, it is assumed that long vowels have been converted to vowels. Consonants in the output shall be represented by the corresponding alphabet. In other words

Input: ["ko", "n", "d", "chi", "wa"]
Output: ["k", "N", "n", "t", "a"] Aim for conversion like.

The rules for converting to consonants are based on the table below.

consonant	Kana
a	A line, wa line, ya line
k	Ka line, Ka line, Qua line
s	Sa line, Sha line, Su line
t	Ta line, Tya line, Cha line, Tsa line
n	Na line, Nya line, Nua line
h	Ha line, Fa line, Hya line
m	Ma line, Mya line, Mu line
y	Not applicable(Integrated into consonant a)
r	La line, Lya line, Lua line
w	Not applicable (integrated into consonant a)
g	Ga line, Ga line, Gua line
z	The line, Ja line, Ja line, Za line, Za line
d	Da line, Da line
b	Ba line, Va line, Bya line, Ba line
p	Pa line, Pya line, Pu line
q	Tsu
N	N
ky	Not applicable (integrated into consonant k)
sh	Not applicable(Integrated into consonant s)
ch	Not applicable (integrated into consonant t)
ny	Not applicable(Integrated into consonant n)
hy	Not applicable(Integrated into consonant h)
f	Not applicable(Integrated into consonant h)
my	Not applicable(Integrated into consonant m)
ry	Not applicable(Integrated into consonant r)
gy	Not applicable(Integrated into consonant g)
j	Not applicable(Integrated into consonant z)
by	Not applicable(Integrated into consonant b)
py	Not applicable(Integrated into consonant p)
v	Not applicable(Integrated into consonant b)

In the above table, consonants that are intuitively similar, such as consonants that are not yoon, are integrated. If possible, it is best to define the similarity between each consonant as a real number from 0 to 1 and find the weighted editing distance. In addition, the consonants of A and N are originally correct as silence (sp), but here, in order to distinguish them, the symbols a, N, and q are given.

Based on the above table, first create a function that converts a single kana character to a consonant, and then use it to convert the entire character string to a consonant.

#Convert one unit of received kana into a vowel and return it
#Input is a kana character string equivalent to one unit of moura (length 1 or 2)
def char2vowel(text):
  t = text[0] #Just look at the first letter to convert to a consonant
  if t in "Aiueoyayuyowawo":
    return "a"
  elif t in "Kakikukeko":
    return "k"
  elif t in "Sashisuseso":
    return "s"
  elif t in "Tachitsuteto":
    return "t"
  elif t in "Na ni nu ne no":
    return "n"
  elif t in "Hahifuheho":
    return "h"
  elif t in "Mamim memo":
    return "m"
  elif t in "Larilrero":
    return "r"
  elif t in "Gagigugego":
    return "g"
  elif t in "Zazizu Zezojizu":
    return "z"
  elif t in "Dadedo":
    return "d"
  elif t in "Babib Bebov":
    return "b"
  elif t in "Papipupepo":
    return "p"
  elif t == "Tsu":
    return "q"
  elif t == "N":
    return "N"
  else:
    print("no match")
    return text

def kana2consonant(kana_list):
  output = []
  for v in kana_list:
    kana = char2consonant(v)
    output.append(kana)
  return output

Edit distance calculation

In order to use it differently depending on the situation, we will define some methods to find the editing distance.

General editing distance

The edit distance as defined is calculated by a function called eval in the editdistance library. To use it, just enter two strings as arguments.

import editdistance as ed
kana_list1 = ["Ko ","N","D","Ji","Wa"]
kana_list2 = ["Ko ","N","Ba","N","Wa"]
dist = ed.eval(kana_list1,kana_list2)
print(dist)#Output is 2

Editing distance by replacement only

You can also find the edit distance when you cannot insert or delete and only replace. This is equivalent to comparing character string A and character string B one character at a time to find the number of different characters. Note that you can't insert and delete, that is, change the string length, so you can only use it when the two strings you want to compare are equal in length.

From the point of view of pronunciation similarity, the appearance of the same sound at the same timing may have a strong effect. If this effect is important, it is advantageous to adopt the edit distance by replacement only. In addition, since the comparison is done in order from the beginning, there is an advantage that the amount of calculation can be saved compared to the general editing distance.

#Find the edit distance by replacement only
def replace_ed(kana_list1, kana_list2):
  if len(kana_list1) != (kana_list2): #kana_list1 and kana_Issue a warning when the string length of list2 is different
    print("warning: length of kana_list1 is different from one of kana_list2")
  dist = 0
  for k1,k2 in zip(kana_list1, kana_list2):
    if k1 != k2:
      dist += 1
  return dist

Relative edit distance

The longer the string, the larger the edit distance tends to be, so if you want to treat it as an absolute similarity, it may be better to standardize by the string length. Below is the code to calculate the standardized value of the general edit distance and the edit distance by replacement only by the character string length.

#General relative edit distance(kana_Normalized with the length of list1)
def relative_ed(kana_list1, kana_list2):
  dist = ed.eval(kana_list1, kana_list2)
  dist /= len(kana_list1)
  return dist

#Relative edit distance by replacement only(kana_Normalized with the length of list1)
def relative_replace_ed(kana_list1, kana_list2):
  dist = replace_ed(kana_list1, kana_list2)
  dist /= len(kana_list1)
  return dist

There is room for consideration as to which of the two character strings to be compared is used as the reference. For example, if you have a specific word A and you want to find the similarity between word A and some word candidates, it seems better to use the length of word A as a reference. You may want to use the average of the lengths of the two words, especially if you don't think of a reference word. This time, I showed the code to normalize by the character string length of the word of the first argument.

Calculation of editing distances for kana, vowels, and consonants

By combining the chords up to this point, create a function that calculates the editing distance of kana, vowels, and consonants. The editing distance for vowels can be calculated after converting kana to vowels. The same is true for consonants. From the contents introduced so far (Absolute or relative) x (general or permutation only) x (kana or vowel or consonant) So, there are 12 possible editing distances. I hope you can create the necessary functions according to your purpose. Here, 12 examples will be long, so here is an example of a function that branches the output according to the flag argument. Input is a katakana character string that meets the conditions raised in the "Input" section.

#text1, text2:Katakana character string that can be pronounced uniquely
#is_relative:When False and True, find the absolute and relative editing distances, respectively.
#ed_type: "normal","replace"Find the editing distance by replacement only, which is common when
#kana_type: "kana","vowel","consonant"Find the editing distances for kana, vowels, and consonants, respectively.
def calc_ed(text1, text2, is_relative=False, ed_type="normal", kana_type="kana"):
  #Divide the input into Moura units
  kana_list1 = mora_wakachi(text1)
  kana_list2 = mora_wakachi(text2)

  #Convert long vowels to vowels
  kana_list1 = bar2vowel(kana_list1)
  kana_list2 = bar2vowel(kana_list2)

  #kana_Do nothing or convert to vowels or consonants depending on type
  if kana_type=="kana":
    pass
  elif kana_type == "vowel":
    kana_list1 = kana2vowel(kana_list1)
    kana_list2 = kana2vowel(kana_list2)
  elif kana_type == "consonant":
    kana_list1 = kana2consonant(kana_list1)
    kana_list2 = kana2consonant(kana_list2)
  else:
    print("warning: kana_type is invalid")

  #ed_Find the editing distance according to type
  dist = 0
  if ed_type == "normal":
    dist = ed.eval(kana_list1, kana_list2)
  elif ed_type == "replace": 
    dist = replace_ed(kana_list1, kana_list2)
  else:
    print("warning: ed_type is invalid")

  #is_If relative is True, kana_Normalize with the character string length of list1
  if is_relative: 
    dist /= len(kana_list1)

  return dist

Whole code

The whole code for copy and paste. Classes are provided for easy management.

import re
import editdistance as ed

class EditDistanceUtil():
  def __init__(self):
    self.re_mora = self.__get_mora_unit_re()

  #Get a regular expression pattern to split into moula units
  def __get_mora_unit_re(self):
    #Represent each condition with a regular expression
    c1 = '[Ukusutsunufumyuruguzudubupuvu][ヮ yeo]' #Udan + "ヮ/A/I/E/Oh "
    c2 = '[Ixishini Himirigi Jijibipi][Nyayo]' #I-dan (excluding "I") + "Ya"/Yu/E/Yo "
    c3 = '[Tedde][Yayo]' #"Te/De "+"/I/Yu/Yo "
    c4 = '[A-Vu]' #One katakana character (including long vowels)

    cond = '('+c1+'|'+c2+'|'+c3+'|'+c4+')'
    re_mora = re.compile(cond)
    return re_mora
  
  #Returns a list of katakana strings divided into moura units
  def mora_wakachi(self, kana_text):
    return self.re_mora.findall(kana_text)

  def char2vowel(self, text):
    t = text[-1] #Just look at the last letter to convert to a vowel
    if t in "Akasatana Hamayarawagazada Bapaya":
      return "A"
    elif t in "Ikishini Himiligi Jijibipi":
      return "I"
    elif t in "Ukusutsunufmuyuruguzudubupuvu":
      return "C"
    elif t in "Ekesetene Hemeregese de Bepee":
      return "D"
    elif t in "Okosotonohomoyorowogozodobopoyo":
      return "Oh"
    elif t == "N":
      return "N"
    elif t == "Tsu":
      return "Tsu"
    else:
      print(text, "no match")
      return text

  #Convert long vowels to vowels
  #Input is a list of kana divided by moura
  def bar2vowel(self, kana_list):
    output = []
    output.append(kana_list[0])
    #I don't expect a long vowel to come first, so start the loop from the second element
    for i,v in enumerate(kana_list[1:]):
      if v == "-":
        kana = self.char2vowel(output[i])#Get the previous element from the output just in case the long vowels are continuous
      else:
        kana = v
      output.append(kana)
    return output

  #Convert kana to vowels
  def kana2vowel(self, kana_list):
    output = []
    for v in kana_list:
      kana = self.char2vowel(v)
      output.append(kana)
    return output

  #Converts 1 unit of received kana into consonants and returns
  #Input is a kana character string equivalent to one unit of moura (length 1 or 2)
  def char2consonant(self, text):
    t = text[0] #Just look at the first letter to convert to a consonant
    if t in "Aiueoyayuyowawo":
      return "a"
    elif t in "Kakikukeko":
      return "k"
    elif t in "Sashisuseso":
      return "s"
    elif t in "Tachitsuteto":
      return "t"
    elif t in "Na ni nu ne no":
      return "n"
    elif t in "Hahifuheho":
      return "h"
    elif t in "Mamim memo":
      return "m"
    elif t in "Larilrero":
      return "r"
    elif t in "Gagigugego":
      return "g"
    elif t in "Zazizu Zezojizu":
      return "z"
    elif t in "Dadedo":
      return "d"
    elif t in "Babib Bebov":
      return "b"
    elif t in "Papipupepo":
      return "p"
    elif t == "Tsu":
      return "q"
    elif t == "N":
      return "N"
    else:
      print(text, "no match")
      return text
  #Returns a kana string as a consonant
  def kana2consonant(self, kana_list):
    output = []
    for v in kana_list:
      kana = self.char2consonant(v)
      output.append(kana)
    return output
  
  #Find the edit distance by replacement only
  def replace_ed(self, kana_list1, kana_list2):
    if len(kana_list1) != len(kana_list2): #kana_list1 and kana_Issue a warning when the string length of list2 is different
      print("warning: length of kana_list1 is different from one of kana_list2")
    dist = 0
    for k1,k2 in zip(kana_list1, kana_list2):
      if k1 != k2:
        dist += 1
    return dist
        
  #text1, text2:Katakana character string that can be pronounced uniquely
  #is_relative:When False and True, find the absolute and relative editing distances, respectively.
  #ed_type: "normal","replace"Find the editing distance by replacement only, which is common when
  #kana_type: "kana","vowel","consonant"Find the editing distances for kana, vowels, and consonants, respectively.
  def calc_ed(self, text1, text2, is_relative=False, ed_type="normal", kana_type="kana"):
    #Divide the input into Moura units
    kana_list1 = self.mora_wakachi(text1)
    kana_list2 = self.mora_wakachi(text2)
  
    #Convert long vowels to vowels
    kana_list1 = self.bar2vowel(kana_list1)
    kana_list2 = self.bar2vowel(kana_list2)
  
    #kana_Do nothing or convert to vowels or consonants depending on type
    if kana_type=="kana":
      pass
    elif kana_type == "vowel":
      kana_list1 = self.kana2vowel(kana_list1)
      kana_list2 = self.kana2vowel(kana_list2)
    elif kana_type == "consonant":
      kana_list1 = self.kana2consonant(kana_list1)
      kana_list2 = self.kana2consonant(kana_list2)
    else:
      print("warning: kana_type is invalid")
  
    #ed_Find the editing distance according to type
    dist = 0
    if ed_type == "normal":
      dist = ed.eval(kana_list1, kana_list2)
    elif ed_type == "replace": 
      dist = self.replace_ed(kana_list1, kana_list2)
    else:
      print("warning: ed_type is invalid")
  
    #is_If relative is True, kana_Normalize with the character string length of list1
    if is_relative: 
      dist /= len(kana_list1)  
    return dist  
    
  #Find the editing distance of Kana
  def kana_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "normal", "kana")
  
  #Find the vowel editing distance
  def vowel_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "normal", "vowel")
  
  #Find the editing distance of consonants
  def consonant_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "normal", "consonant")    
    
  #Find the edit distance only by replacing the kana
  def kana_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "replace", "kana")
  
  #Find the editing distance by replacing vowels only
  def vowel_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "replace", "vowel")
  
  #Find the editing distance by replacing consonants only
  def consonant_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "replace", "consonant")    
    
  #Find the relative editing distance of Kana
  def relative_kana_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "normal", "kana")
  
  #Find the relative editing distance of vowels
  def relative_vowel_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "normal", "vowel")
  
  #Find the relative editing distance of consonants
  def relative_consonant_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "normal", "consonant")    
    
  #Find the relative editing distance only by replacing the kana
  def relative_kana_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "replace", "kana")
  
  #Find the relative editing distance by replacing vowels only
  def relative_vowel_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "replace", "vowel")
  
  #Find the relative editing distance by replacing consonants only
  def relative_consonant_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "replace", "consonant")    
  
if __name__=="__main__":
  edu = EditDistanceUtil()
  text1 = "Hello"
  text2 = "Wacontara"

  print("normal edit distance")
  print(edu.kana_ed(text1,text2))
  print("")
  print(edu.kana2vowel(text1), edu.kana2vowel(text2))
  print(edu.vowel_ed(text1,text2))
  print("")
  print(edu.kana2consonant(text1), edu.kana2consonant(text2))
  print(edu.consonant_ed(text1,text2))
  
  print("")
  print("replace edit distance")
  print(edu.kana_replace_ed(text1,text2))
  print("")
  print(edu.kana2vowel(text1), edu.kana2vowel(text2))
  print(edu.vowel_replace_ed(text1,text2))
  print("")
  print(edu.kana2consonant(text1), edu.kana2consonant(text2))
  print(edu.consonant_replace_ed(text1,text2))
  
  print("")
  print("relative normal edit distance")
  print(edu.relative_kana_ed(text1,text2))
  print("")
  print(edu.kana2vowel(text1), edu.kana2vowel(text2))
  print(edu.relative_vowel_ed(text1,text2))
  print("")
  print(edu.kana2consonant(text1), edu.kana2consonant(text2))
  print(edu.relative_consonant_ed(text1,text2))
  
  print("")
  print("relative replace edit distance")
  print(edu.relative_kana_replace_ed(text1,text2))
  print("")
  print(edu.kana2vowel(text1), edu.kana2vowel(text2))
  print(edu.relative_vowel_replace_ed(text1,text2))
  print("")
  print(edu.kana2consonant(text1), edu.kana2consonant(text2))
  print(edu.relative_consonant_replace_ed(text1,text2))

Calculate the editing distance as an index of pronunciation similarity [python]