Name identification using Python

Means of name identification

Use the Levenshtein distance, also known as the edit distance. It is the minimum number of single-character insertions, deletions, and substitutions required to transform one string into another.

For example, consider going from cat to cut. It is easy to think this takes 2 steps (cat → ct → cut: delete `a`, then insert `u`), but because substitution is allowed, the Levenshtein distance is `1`.
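The definition above can be sketched as a small dynamic-programming routine. This is a minimal pure-Python illustration, not the implementation used by the `Levenshtein` package; the function name `levenshtein` is an assumption made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("cat", "cut"))        # 1: substitute a with u
print(levenshtein("oniku_a", "oniku"))  # 2: delete "_" and "a"
```

The experiments in this article call `Levenshtein.distance`, which should give the same values for these inputs.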

For more information, see [Wiki](https://en.wikipedia.org/wiki/%E3%83%AC%E3%83%BC%E3%83%99%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3%E8%B7%9D%E9%9B%A2). In this tutorial, we will run a few experiments and then build the final name identification program.

Experiment 0. Let's use "Levenshtein distance"

`test0`


import Levenshtein

target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
cates = ['kome']

for val in target:
    nearL_flag = False
    for cate in cates:
        if Levenshtein.distance(val, cate) < 1:  # a distance below 1 means val exactly matches an existing category
            nearL_flag = True
    if not nearL_flag:
        cates.append(val)
cates

`output`


['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']

In this program, any value whose Levenshtein distance from every existing category is 1 or more is added to cates as a new category. We will build on this. First, gradually raise the reference distance and observe how the output changes.

Experiment 1. Increase the reference Levenshtein distance (tolerate ambiguity)

`test1`


import Levenshtein

target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']

for i in range(10):
    cates = ['kome']
    for val in target:
        nearL_flag = False
        for cate in cates:
            if Levenshtein.distance(val, cate) < i:
                nearL_flag = True
        if not nearL_flag:
            cates.append(val)
    print(i, cates)

`output`



0 ['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
1 ['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
2 ['kome', 'oniku_a', 'yasai_a']
3 ['kome', 'oniku_a', 'yasai_a']
4 ['kome', 'oniku_a', 'yasai_a']
5 ['kome', 'oniku_a', 'yasai_a']
6 ['kome', 'oniku_a', 'yasai_b']
7 ['kome', 'yasai_a']
8 ['kome']
9 ['kome']

This output shows that more ambiguity is tolerated as the reference Levenshtein distance increases. Take the question of whether 'oniku_a' and 'oniku_b' belong to the same category: with a small reference distance they are treated as different, and with a large one they are treated as the same.
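The points at which the output changes follow from a handful of pairwise distances. The sketch below prints them; `levenshtein` is a hypothetical pure-Python stand-in for `Levenshtein.distance`, included only so the snippet runs without the package.

```python
def levenshtein(a, b):
    # Stand-in for Levenshtein.distance (an assumption for this sketch).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


pairs = [
    ("oniku_a", "oniku_b"),  # siblings differ by one character
    ("oniku_a", "kome"),
    ("yasai_a", "kome"),
    ("oniku_a", "yasai_a"),
    ("oniku_a", "yasai_b"),
]
distances = {pair: levenshtein(*pair) for pair in pairs}
for (a, b), d in distances.items():
    print(a, b, d)
# oniku_a oniku_b 1
# oniku_a kome 6
# yasai_a kome 7
# oniku_a yasai_a 5
# oniku_a yasai_b 6
```

Because the comparison is a strict `<`, siblings merge from i = 2, 'oniku_a' is absorbed by 'kome' from i = 7, and 'yasai_a' from i = 8. At i = 6, 'yasai_a' (distance 5 from 'oniku_a') is absorbed while 'yasai_b' (distance 6) is not, which is why 'yasai_b' appears in that row.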

The output also suggests that ['kome', 'oniku', 'yasai'] would be a better initial list of categories than ['kome'].

Experiment 2. Change the initial value to `['kome','oniku','yasai']` and try to identify the name.

`test2`


import Levenshtein

target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']

for i in range(10):
    cates = ['kome', 'oniku', 'yasai']
    for val in target:
        nearL_flag = False
        for cate in cates:
            if Levenshtein.distance(val, cate) < i:
                nearL_flag = True
        if not nearL_flag:
            cates.append(val)
    print(i, cates)

`output`


0 ['kome', 'oniku', 'yasai', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
1 ['kome', 'oniku', 'yasai', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
2 ['kome', 'oniku', 'yasai', 'oniku_a', 'yasai_a']
3 ['kome', 'oniku', 'yasai']
4 ['kome', 'oniku', 'yasai']
5 ['kome', 'oniku', 'yasai']
6 ['kome', 'oniku', 'yasai']
7 ['kome', 'oniku', 'yasai']
8 ['kome', 'oniku', 'yasai']
9 ['kome', 'oniku', 'yasai']

This shows that, for this target and the initial cates of ['kome', 'oniku', 'yasai'], the names can be identified with a reference Levenshtein distance of less than 3.
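The same threshold can also be derived directly instead of read off the table: take each target's distance to its nearest initial category and look at the largest one. This is a sketch under the assumption that `levenshtein` (a hypothetical pure-Python stand-in) matches `Levenshtein.distance`.

```python
def levenshtein(a, b):
    # Stand-in for Levenshtein.distance (an assumption for this sketch).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
cates = ['kome', 'oniku', 'yasai']

# Distance from each target to its closest initial category.
nearest = {val: min(levenshtein(val, cate) for cate in cates) for val in target}
worst = max(nearest.values())

print(worst)      # 2: every target is within 2 edits of some category
print(worst + 1)  # 3: smallest i for which "distance < i" absorbs everything
```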

Finally, here is the name identification program.

Name identification program

`nayose`


import Levenshtein

target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
cates = ['kome', 'oniku', 'yasai']

nayose = []

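for val in target:
    minL = float('inf')      # smallest distance found so far
    afterNayose = 'dummy'    # placeholder, replaced by the closest category
    for cate in cates:
        tmp_distance = Levenshtein.distance(val, cate)
        if tmp_distance < minL:  # strict <, so the first category wins a tie
            minL = tmp_distance
            afterNayose = cate
    nayose.append(afterNayose)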

[(before, after, Levenshtein.distance(before, after)) for before, after in zip(target, nayose)]

`output`


[('oniku_a', 'oniku', 2),
 ('oniku_b', 'oniku', 2),
 ('oniku_c', 'oniku', 2),
 ('yasai_a', 'yasai', 2),
 ('yasai_b', 'yasai', 2),
 ('yasai_c', 'yasai', 2)]

Each tuple in the output reads (before, after, Levenshtein distance between before and after). This confirms that every Levenshtein distance is less than `3` and that the names are identified correctly.
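The program above always picks the nearest category, so a name unrelated to every category would still be assigned to one. One possible variation, sketched below, combines the nearest-category search with the cutoff of 3 found in Experiment 2. The names `assign_category` and `max_distance` are hypothetical, and `levenshtein` is a pure-Python stand-in for `Levenshtein.distance`.

```python
def levenshtein(a, b):
    # Stand-in for Levenshtein.distance (an assumption for this sketch).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def assign_category(name, cates, max_distance=3):
    """Return the closest category, or None if none is closer than max_distance."""
    # min() keeps the first category on a tie, like the strict < in the loop above.
    best = min(cates, key=lambda cate: levenshtein(name, cate))
    return best if levenshtein(name, best) < max_distance else None


cates = ['kome', 'oniku', 'yasai']
print(assign_category('oniku_a', cates))   # oniku
print(assign_category('sakana_a', cates))  # None
```

Returning `None` leaves it to the caller to decide whether an unmatched name becomes a new category or is reviewed by hand.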
