This article uses the Levenshtein distance, also known as the edit distance. It is the minimum number of single-character insertions, deletions, and substitutions required to transform one string into another.
For example, to get from cat to cut, you might think of
cat
→ c t
→ cut
which is two steps: delete "a", then insert "u". It is tempting to conclude that the distance is 2, but substitution is also allowed, so replacing "a" with "u" takes a single step and the Levenshtein distance is 1.
For more information, see [Wiki](https://en.wikipedia.org/wiki/%E3%83%AC%E3%83%BC%E3%83%99%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3%E8%B7%9D%E9%9B%A2). In this tutorial, we will work up to name identification through a few experiments.
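To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming computation. The function name `levenshtein` is hypothetical and is not part of the `Levenshtein` package used in the rest of this article; it is only meant to show where the value 1 for cat and cut comes from.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("cat", "cut"))        # 1: substitute "a" with "u"
print(levenshtein("oniku_a", "oniku"))  # 2: delete "_" and "a"
```

Only two rows of the table are kept at a time, which is enough because each cell depends only on the current and previous row.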
test0
import Levenshtein
target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
cates = ['kome']
for val in target:
    nearL_flag = False
    for cate in cates:
        # A distance below 1 means an exact match with an existing category.
        if Levenshtein.distance(val, cate) < 1:
            nearL_flag = True
    # Anything that differs even slightly from every category becomes a new category.
    if not nearL_flag:
        cates.append(val)
cates
output
['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
This program adds to `cates` every value whose Levenshtein distance from all existing categories is 1 or more.
We will build on this program. First, let's gradually increase the threshold distance and observe what happens to the output.
test1
import Levenshtein
target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
for i in range(10):
    cates = ['kome']
    for val in target:
        nearL_flag = False
        for cate in cates:
            if Levenshtein.distance(val, cate) < i:
                nearL_flag = True
        if not nearL_flag:
            cates.append(val)
    print(i, cates)
output
0 ['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
1 ['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
2 ['kome', 'oniku_a', 'yasai_a']
3 ['kome', 'oniku_a', 'yasai_a']
4 ['kome', 'oniku_a', 'yasai_a']
5 ['kome', 'oniku_a', 'yasai_a']
6 ['kome', 'oniku_a', 'yasai_b']
7 ['kome', 'yasai_a']
8 ['kome']
9 ['kome']
This output shows that more ambiguity is tolerated as the threshold Levenshtein distance increases.
Take the question of whether 'oniku_a' and 'oniku_b' belong to the same category.
If the threshold is small, they are considered different.
If the threshold is large, they are considered close enough to be treated as the same.
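The effect of the threshold can also be sketched as a reusable function. The names `collect_categories` and `distance` below are hypothetical helpers introduced for illustration; `distance` is a pure-Python stand-in for `Levenshtein.distance` so that the block runs without the package installed.

```python
def distance(a, b):
    # Pure-Python stand-in for Levenshtein.distance (an assumption of this sketch).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def collect_categories(values, seeds, threshold):
    """Add a value as a new category unless some category is closer than threshold."""
    cates = list(seeds)  # copy so the caller's list is not modified
    for val in values:
        if not any(distance(val, cate) < threshold for cate in cates):
            cates.append(val)
    return cates


target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
for threshold in (1, 2, 7):
    print(threshold, collect_categories(target, ['kome'], threshold))
```

Note that the result depends on the order of `values`: a value is compared against the categories collected so far, including ones appended earlier in the same pass.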
The output of test1 also suggests that the initial list of categories should be `['kome', 'oniku', 'yasai']` instead of `['kome']`.
Let's set `cates` to `['kome', 'oniku', 'yasai']` and try to identify the names again.
test2
import Levenshtein
target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
for i in range(10):
    cates = ['kome', 'oniku', 'yasai']
    for val in target:
        nearL_flag = False
        for cate in cates:
            if Levenshtein.distance(val, cate) < i:
                nearL_flag = True
        if not nearL_flag:
            cates.append(val)
    print(i, cates)
output
0 ['kome', 'oniku', 'yasai', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
1 ['kome', 'oniku', 'yasai', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
2 ['kome', 'oniku', 'yasai', 'oniku_a', 'yasai_a']
3 ['kome', 'oniku', 'yasai']
4 ['kome', 'oniku', 'yasai']
5 ['kome', 'oniku', 'yasai']
6 ['kome', 'oniku', 'yasai']
7 ['kome', 'oniku', 'yasai']
8 ['kome', 'oniku', 'yasai']
9 ['kome', 'oniku', 'yasai']
From the above, for this `target` and with `cates` initialized to `['kome', 'oniku', 'yasai']`, the names can be identified with a threshold of Levenshtein distance less than 3.
Finally, here is the name identification (nayose) program.
nayose
import Levenshtein
target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
cates = ['kome', 'oniku', 'yasai']
nayose = []
for val in target:
    # Track the closest category seen so far; start from infinity so the
    # first comparison always wins, however long the strings are.
    minL = float('inf')
    afterNayose = 'dummy'
    for cate in cates:
        tmp_distance = Levenshtein.distance(val, cate)
        if tmp_distance < minL:
            minL = tmp_distance
            afterNayose = cate
    nayose.append(afterNayose)
# Show each value, the category it was mapped to, and the distance between them.
[(before, after, Levenshtein.distance(before, after)) for before, after in zip(target, nayose)]
output
[('oniku_a', 'oniku', 2),
('oniku_b', 'oniku', 2),
('oniku_c', 'oniku', 2),
('yasai_a', 'yasai', 2),
('yasai_b', 'yasai', 2),
('yasai_c', 'yasai', 2)]
Each tuple in the output reads (`before`, `after`, Levenshtein distance between `before` and `after`). Every distance is less than 3, which confirms that the names were identified as expected.
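The nearest-category search and the threshold from the earlier experiments can be combined, so that a value far from every category is reported instead of being forced into the closest one. This is only a sketch under assumptions: `nearest_category` and its `max_distance` parameter are hypothetical, the value 'sakana' is made up for illustration, and `distance` is a pure-Python stand-in for `Levenshtein.distance`.

```python
def distance(a, b):
    # Pure-Python stand-in for Levenshtein.distance (an assumption of this sketch).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def nearest_category(value, categories, max_distance=None):
    """Return (category, distance) for the closest category.

    If max_distance is given and the closest distance is not below it,
    return (None, distance) to signal that nothing matched.
    """
    # min() keeps the first category on ties, like the strict "<" loop above.
    best = min(categories, key=lambda cate: distance(value, cate))
    d = distance(value, best)
    if max_distance is not None and d >= max_distance:
        return None, d
    return best, d


cates = ['kome', 'oniku', 'yasai']
for val in ['oniku_a', 'yasai_c', 'sakana']:
    print(val, nearest_category(val, cates, max_distance=3))
```

With `max_distance=3`, the values from `target` still map to their categories at distance 2, while an unrelated string yields `None`. An empty `categories` list makes `min()` raise `ValueError`, which this sketch does not handle.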