** Levenshtein distance ** indicates how different the two strings are. Also known as the edit distance. A single character ** Insert / Delete / Replace ** defines the minimum number of steps required to transform one string into the other. Reference: [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%AC%E3%83%BC%E3%83%99%E3%83%B3%E3%82%B7%E3 % 83% A5% E3% 82% BF% E3% 82% A4% E3% 83% B3% E8% B7% 9D% E9% 9B% A2)
The environment uses Google Colaboratory. The Python version is below.
import platform
print("python " + platform.python_version())
# python 3.6.9
In addition, the library for calculating the Levenshtein distance must be installed in advance.
pip install python-Levenshtein
Now let's write the code. First, import the library for calculating the Levenshtein distance.
import Levenshtein
As a trial, let's calculate the Levenshtein distance between "Rievenstein" and "Levenshtein".
#Calculate the Levenshtein distance
Levenshtein.distance('Rievenstein', 'Levenshtein')
# 3
The result was 3. this is,
-** Delete ** (** R ** ievenstein → ievenstein) -** Replace ** (** i ** evenstein → ** L ** evenstein) -** Insert ** (Levenstein → Levens h </ strong> tein)
This is because I performed the operation three times.
You can check what kind of operation you performed below.
Levenshtein.editops('Rievenstein', 'Levenshtein')
# [('delete', 0, 0), ('replace', 1, 0), ('insert', 7, 6)]
The content displayed here is (Operation performed, what character in the first string, what character in the second string) is. In other words
--Removed the 0th letter (R) of Rievenstein --Replaced the first letter (i) of Rievenstein with the 0th letter (L) of Levenshtein --Insert the 6th letter (h) of Levenshtein at the position of the 7th letter (t) of Rievenstein
It means that.
You can also calculate the similarity of how similar you are, not how many times you have performed the operation.
Levenshtein.ratio('Rievenstein', 'Levenshtein')
# 0.8181818181818182
If there are several strings, it is also possible to extract the string that is the median of them.
Levenshtein.median(['Rievenstein',
'Levenshtein',
'Revenshtein',
'Lievenstein',
'Levenshtain',
'Levennshtein'])
# 'Levenshtein'
The Hamming distance is the number of different characters in the corresponding positions in two strings of ** equal number of characters **. Reference: [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%9F%E3%83%B3%E3%82%B0%E8%B7%9D%E9 % 9B% A2)
This can also be calculated using the same library.
Levenshtein.hamming('Rievenstein', 'Levenshtein')
# 7
Hamming distance compares the strings of Rievenstein and Levenshtein only with each other in the same place. It will be as follows.
-** R ** ievenstein and ** L ** even shtein → different --R i </ strong> evenstein and L e </ strong> venshtein → different --Ri e </ strong> venstein and Le v </ strong> enshtein → different
Looking at it in this way, the Hamming distance is output as 7 because the 7 characters except the last "tein" are different.
The Hamming distance must be ** equal in number of characters. ** An error will occur if the number of characters is different.
There is also the Jaro-Winkler Distance, which measures the partial match between two strings. The closer it is to 1, the more similar it is, and the closer it is to 0, the more different it is. It seems to be used to judge spelling mistakes.
Levenshtein.jaro_winkler('Rievenstein', 'Levenshtein')
# 0.8787878787878789
I used Python to calculate the Levenshtein distance. I also calculated the Hamming distance and the Jaro Winkler distance as other editing distances. By all means, please calculate the editing distance of various similar character strings and calculate the similarity.
Please refer to the following for the details of the library used this time. https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-median
Recommended Posts