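For background on edit distance, the article by Naoya Ito is a little old but still helpful. Simply put, edit distance expresses the closeness of two strings as a number: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.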
Reference: Levenshtein Distance-Naoya's Hatena Diary
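To make the idea concrete, here is a minimal pure-Python sketch of the standard dynamic-programming calculation. It is only an illustration and not the implementation used inside python-Levenshtein; the function name `levenshtein` is my own choice.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between two sequences."""
    # Keep the shorter sequence as b so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("python", "python"))   # 0
```

Only two rows of the table are kept at a time, so memory use grows with the length of the shorter string rather than with the product of both lengths.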
There is a package called python-Levenshtein, so let's install it.
$ sudo pip install python-Levenshtein
Let's write a script like this and save it as levenshtein.py.
#!/usr/bin/env python
# coding: utf8
import Levenshtein
# In the original post these are two Japanese names that differ by one character
string1 = "Yasuji Inoue"
string2 = "Yasuji Inoue"
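# Python 2: decode the UTF-8 byte strings to unicode so that the
# distance is counted in characters rather than bytes
string1 = string1.decode('utf-8')
string2 = string2.decode('utf-8')
print Levenshtein.distance(string1, string2)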
$ python levenshtein.py
1
Japanese works too. The two names differ by a single character, and substituting that one character makes the strings identical, so the edit distance is 1.
I'm bad at typing the word python; before I notice, it has become pyhton. The edit distance between pyhton and python is 2. (Swapping the two adjacent letters would make them identical, but Levenshtein distance has no swap operation, so the typo counts as two substitutions.)
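The pyhton example can be checked with a small sketch. The function below is hypothetical (the name `edit_distance` and the `transpositions` flag are mine, not part of python-Levenshtein). With the flag off it computes plain Levenshtein distance; with the flag on it also allows swapping two adjacent characters as a single edit, a variant usually called optimal string alignment distance.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance, optionally counting an adjacent swap as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Optional extra rule: the last two characters are swapped.
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("pyhton", "python"))                       # 2
print(edit_distance("pyhton", "python", transpositions=True))  # 1
```

If swapped-letter typos like this one are the main concern, a metric that treats a transposition as one edit may rank candidates more naturally than plain Levenshtein distance.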
According to the documentation, the package can also calculate the Jaro-Winkler distance and other metrics.
If you register a MySQL stored function like the one below, you can write ORDER BY LEVENSHTEIN(title, "Hogehoge"). This is convenient because the results come back in order of how close each title is to the given string. However, an index cannot be used for this, so the query has to scan every record and becomes quite heavy when the number of records is large.
https://github.com/fza/mysql-doctrine-levenshtein-function
PHP-[Multibyte support] Find the Levenshtein distance-Qiita
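The same ranking idea, and the reason it gets heavy, can be sketched in Python. This is only an analogy for the ORDER BY query, not how MySQL executes it; the titles are made-up sample data and the compact recursive `levenshtein` helper is my own, suitable for short strings only.

```python
from functools import lru_cache


def levenshtein(a, b):
    """Levenshtein distance via memoized recursion (short strings only)."""
    @lru_cache(maxsize=None)
    def dist(i, j):
        # Distance between a[:i] and b[:j].
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(dist(i - 1, j) + 1,         # deletion
                   dist(i, j - 1) + 1,         # insertion
                   dist(i - 1, j - 1) + cost)  # substitution or match
    return dist(len(a), len(b))


# Hypothetical rows standing in for the title column.
titles = ["Fugafuga", "Hoge", "Hogehage", "Hogehoge"]
query = "Hogehoge"

# The distance must be computed for every row before sorting,
# which is why no index can help and the cost grows with the table.
ranked = sorted(titles, key=lambda t: levenshtein(t, query))
print(ranked)  # ['Hogehoge', 'Hogehage', 'Hoge', 'Fugafuga']
```

On a large table it may be worth narrowing the candidates first with a condition that can use an index, and applying the distance only to what remains.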