I calculated "Levenshtein distance" using Python

What is the Levenshtein distance?

** Levenshtein distance ** indicates how different the two strings are. Also known as the edit distance. A single character ** Insert / Delete / Replace ** defines the minimum number of steps required to transform one string into the other. Reference: [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%AC%E3%83%BC%E3%83%99%E3%83%B3%E3%82%B7%E3 % 83% A5% E3% 82% BF% E3% 82% A4% E3% 83% B3% E8% B7% 9D% E9% 9B% A2)

Preparation

The environment uses Google Colaboratory. The Python version is below.

import platform
print("python " + platform.python_version())
# python 3.6.9

In addition, the library for calculating the Levenshtein distance must be installed in advance.

pip install python-Levenshtein

Let's calculate the Levenshtein distance

Now let's write the code. First, import the library for calculating the Levenshtein distance.

import Levenshtein

As a trial, let's calculate the Levenshtein distance between "Rievenstein" and "Levenshtein".

#Calculate the Levenshtein distance
Levenshtein.distance('Rievenstein', 'Levenshtein')
# 3

The result was 3. this is,

-** Delete ** (** R ** ievenstein → ievenstein) -** Replace ** (** i ** evenstein → ** L ** evenstein) -** Insert ** (Levenstein → Levens h tein)

This is because I performed the operation three times.

You can check what kind of operation you performed below.

Levenshtein.editops('Rievenstein', 'Levenshtein') # [('delete', 0, 0), ('replace', 1, 0), ('insert', 7, 6)]

The content displayed here is (Operation performed, what character in the first string, what character in the second string) is. In other words

--Removed the 0th letter (R) of Rievenstein --Replaced the first letter (i) of Rievenstein with the 0th letter (L) of Levenshtein --Insert the 6th letter (h) of Levenshtein at the position of the 7th letter (t) of Rievenstein

It means that.

You can also calculate the similarity of how similar you are, not how many times you have performed the operation.

Levenshtein.ratio('Rievenstein', 'Levenshtein') # 0.8181818181818182

If there are several strings, it is also possible to extract the string that is the median of them.

Levenshtein.median(['Rievenstein', 'Levenshtein', 'Revenshtein', 'Lievenstein', 'Levenshtain', 'Levennshtein']) # 'Levenshtein'

Other edit distances

Hamming distance

The Hamming distance is the number of different characters in the corresponding positions in two strings of ** equal number of characters **. Reference: [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%9F%E3%83%B3%E3%82%B0%E8%B7%9D%E9 % 9B% A2)

This can also be calculated using the same library.

Levenshtein.hamming('Rievenstein', 'Levenshtein') # 7

Hamming distance compares the strings of Rievenstein and Levenshtein only with each other in the same place. It will be as follows.

-** R ** ievenstein and ** L ** even shtein → different --R i evenstein and L e venshtein → different --Ri e venstein and Le v enshtein → different

… --Revenstei n and Levenshtei n → equal

Looking at it in this way, the Hamming distance is output as 7 because the 7 characters except the last "tein" are different.

The Hamming distance must be ** equal in number of characters. ** An error will occur if the number of characters is different.

Jaro Winkler Distance

There is also the Jaro-Winkler Distance, which measures the partial match between two strings. The closer it is to 1, the more similar it is, and the closer it is to 0, the more different it is. It seems to be used to judge spelling mistakes.

Levenshtein.jaro_winkler('Rievenstein', 'Levenshtein') # 0.8787878787878789

Summary

I used Python to calculate the Levenshtein distance. I also calculated the Hamming distance and the Jaro Winkler distance as other editing distances. By all means, please calculate the editing distance of various similar character strings and calculate the similarity.

Please refer to the following for the details of the library used this time. https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-median

Recommended Posts
I calculated "Levenshtein distance" using Python

[Python] I tried using OpenPose

I made a Line-bot using Python!

Find the Levenshtein Distance with python

I tried using Thonny (Python / IDE)

[Python] I tried using YOLO v3

I tried using Bayesian Optimization in Python

I tried using UnityCloudBuild API from Python

I analyzed cowrie (honeypot) using python pandas

I started python

Start using Python

Scraping using Python

vprof --I tried using the profiler for Python

I tried web scraping using python and selenium

I tried object detection using Python and OpenCV

I want to email from Gmail using Python.

I tried using mecab with python2.7, ruby2.3, php7

I tried reading a CSV file using Python

I tried using the Datetime module by Python

I tried using parameterized

What I learned about AI / machine learning using Python (1)

I tried using mimesis

Operate Redmine using Python Redmine

I tried using anytree

Fibonacci sequence using Python

I made a login / logout process using Python Bottle.

Data analysis using Python 0

I tried using aiomysql

I tried using TradeWave (BitCoin system trading in Python)

I tried using Summpy

I tried Python> autopep8

I tried using coturn

I tried using Pipenv

I tried using matplotlib

I tried using "Anvil".

I tried using Hubot

Data cleaning using Python

I tried using ESPCN

I tried using openpyxl

Using Python #external packages

I implemented Python Logging

I tried using Ipython

I tried using PyCaret

What I learned about AI / machine learning using Python (3)

I tried using cron

WiringPi-SPI communication using Python

Age calculation using python

Relearn Python (Algorithm I)

I tried using ngrok

I tried using face_recognition

[Python] I tried running a local server using flask

I tried drawing a pseudo fractal figure using Python

I tried using Jupyter

Search Twitter using Python

I tried using Heapq

Name identification using python

Notes using Python subprocesses

I tried using doctest

I tried Python> decorator

I tried using Python (3) instead of a scientific calculator

I tried using folium