TL;DR
- Generate pairs of synonym group id and representative word from Sudachi's synonym dictionary (synonyms.txt)
- Use the generated pairs to make the synonym dictionary easy to use from sudachipy
- As an example, normalize text with the synonym dictionary after word separation
I wanted to use synonyms alongside Sudachi's morphological analysis for tasks such as extracting information from text and calculating text similarity, but Sudachi's synonym dictionary cannot be used directly from sudachipy. Since the approach is simple, this post makes the synonym dictionary easy to use with sudachipy. The goal here is **normalization after morphological analysis**: after word separation, synonyms are aligned to the same headword. Synonym expansion is therefore out of scope.
According to the documentation, Sudachi's synonym dictionary is described as follows:
"Synonym information is added to the words registered in the Sudachi dictionary. It is released under the same license as the Sudachi dictionary."
The source of the synonym dictionary is published as a text file. https://github.com/WorksApplications/SudachiDict/blob/develop/src/main/text/synonyms.txt
If you use sudachi with python, you can install it using pip.
pip install sudachipy sudachidict_core
You can get morphemes as follows.
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.B
token = tokenizer_obj.tokenize("食べ", mode)[0]  # "食べ" is a conjugated form of the verb "to eat"
token.surface() # => '食べ'
token.dictionary_form() # => '食べる'
token.reading_form() # => 'タベ'
token.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
In addition, sudachi can normalize characters.
token.normalized_form()
Now for the main topic.
Here is a partial excerpt of the synonym file.
000001,1,0,1,0,0,0,(),ambiguous,,
000001,1,0,1,0,0,2,(),Ambiguous,,
000001,1,0,2,0,0,0,(),Unclear,,
000001,1,0,3,0,0,0,(),Ayafuya,,
000001,1,0,4,0,0,0,(),Obscure,,
000001,1,0,5,0,0,0,(),Uncertain,,

000002,1,0,1,0,0,0,(),Destination,,
000002,1,0,1,0,0,2,(),address,,
000002,1,0,1,0,0,2,(),destination,,
000002,1,0,2,0,0,0,(),Destination,,
000002,1,0,3,0,0,0,(),Destination,,
000002,1,0,4,0,0,0,(),Delivery address,,
000002,1,0,5,0,0,0,(),Shipping address,,
000002,1,0,6,0,0,0,(),Shipping address,,
Each line describes one word, synonym groups are separated by blank lines, and the columns are as follows.
0 :Group number
1 :Noun/predicate flag (optional)
2 :Expansion control flag (optional)
3 :Lexeme number within the group (optional)
4 :Word form type within the same lexeme (optional)
5 :Abbreviation information within the same word form (optional)
6 :Spelling variation information within the same word form (optional)
7 :Field information (optional)
8 :Headword
9 :Reserved
10 :Reserved
See the documentation (https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md#%E5%90%8C%E7%BE%A9%E8%AA%9E%E8%BE%9E%E6%9B%B8%E3%82%BD%E3%83%BC%E3%82%B9-%E3%83%95%E3%82%A9%E3%83%BC%E3%83%9E%E3%83%83%E3%83%88) for a detailed description.
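As a rough illustration of the column layout above, the following sketch parses one line of the file into named fields. The class name `SynonymRow`, its field names, and the helper `parse_row` are hypothetical and chosen here for readability; they are not part of Sudachi or sudachipy. Optional columns that are left empty are mapped to `None`, which is an assumption of this sketch rather than documented behavior.

```python
import csv
from typing import NamedTuple, Optional


class SynonymRow(NamedTuple):
    """One line of synonyms.txt; field names are illustrative, not official."""
    group_id: str                     # column 0, kept as the 6-digit string
    lexeme_no: Optional[int]          # column 3
    spelling_variant: Optional[int]   # column 6 (0 marks the representative spelling)
    headword: str                     # column 8


def _to_int(value: str) -> Optional[int]:
    # Optional columns may be empty; treat an empty string as "not given".
    return int(value) if value else None


def parse_row(line: str) -> SynonymRow:
    # csv.reader accepts any iterable of strings, so a one-element list works.
    cols = next(csv.reader([line]))
    return SynonymRow(cols[0], _to_int(cols[3]), _to_int(cols[6]), cols[8])


row = parse_row("000001,1,0,1,0,0,2,(),Ambiguous,,")
print(row.group_id, row.lexeme_no, row.spelling_variant, row.headword)
```

Only the columns used later in this post are kept; the others can be added to the tuple in the same way if needed.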
Three of these columns matter here:
- 0: Group number
- 3: Lexeme number within the group
- 6: Spelling variation information within the same word form
The group number is a 6-digit number used to manage and identify synonym groups in the source. The lexeme number within the group is the management number of a lexeme in that group, assigned serially starting from "1". The spelling variation information indicates how spellings relate to each other among words that share the same abbreviation form (that is, words whose values in columns 3, 4, and 5 are identical). A value of 0 marks the representative spelling for that form.
I haven't checked every entry, but synonyms.txt appears to list each value in ascending order. That is, **each synonym group begins with the representative spelling of one of its lexemes**. If we further assume that lexeme number 1 is the representative lexeme of its synonym group, **the first word of each synonym group can be treated as the representative word of that group**.
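Since the ordering is an assumption rather than something the post verifies, a small check like the following could be run against the file. This is only a sketch: the helper names `groups` and `first_row_is_representative` are hypothetical, and the in-memory sample stands in for the real synonyms.txt. It tests only that the first row of each group has lexeme number 1 (column 3) and spelling variation 0 (column 6).

```python
import csv
import io


def groups(lines):
    """Yield each synonym group as a list of rows, splitting on blank lines."""
    current = []
    for row in csv.reader(lines):
        if not row:  # csv.reader returns [] for a blank line
            if current:
                yield current
            current = []
        else:
            current.append(row)
    if current:  # last group when the input does not end with a blank line
        yield current


def first_row_is_representative(group):
    first = group[0]
    return first[3] == "1" and first[6] == "0"


# Tiny stand-in for synonyms.txt; replace with open("synonyms.txt", encoding="utf-8").
sample = io.StringIO(
    "000001,1,0,1,0,0,0,(),ambiguous,,\n"
    "000001,1,0,1,0,0,2,(),Ambiguous,,\n"
    "\n"
    "000002,1,0,1,0,0,0,(),Destination,,\n"
)
violations = [g[0][0] for g in groups(sample) if not first_row_is_representative(g)]
print(violations)  # group ids whose first row breaks the assumption
```

An empty list means the assumption held for the input that was checked; any group ids printed would need a different rule for choosing the representative word.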
Follow this rule to create a combination of group number and synonym group headings.
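import csv

# Read the synonym source; csv.reader returns an empty list for each blank line.
with open("synonyms.txt", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    data = [r for r in reader]

output_data = []
synonym_set = []
synonym_group_id = None
for line in data:
    if not line:
        # A blank line closes the current group; its first headword is the representative.
        if synonym_set:
            output_data.append([synonym_group_id, synonym_set[0]])
        synonym_set = []
        continue
    synonym_group_id = line[0]
    synonym_set.append(line[8])

# Flush the last group in case the file does not end with a blank line.
if synonym_set:
    output_data.append([synonym_group_id, synonym_set[0]])

# Write to the same file name that the lookup code reads.
with open("synonym_base.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(output_data)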
You cannot get the synonym dictionary with sudachipy, but you can get the id of the synonym group to which the token corresponds.
token.synonym_group_ids()
# => [1]
The representative word is looked up in the pairs generated earlier, using the synonym group id obtained here. Note that the group id in synonyms.txt is a 6-digit string, whereas the id returned by sudachipy is an int.
import csv

with open('synonym_base.csv', "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    # Convert the 6-digit string id to int so it matches sudachipy's ids.
    data = [[int(r[0]), r[1]] for r in reader]

synonyms = dict(data)

synonym_group_ids = token.synonym_group_ids()
if synonym_group_ids:
    # There can be more than one id; take the first one for now
    surface = synonyms[synonym_group_ids[0]]
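Because a token can belong to more than one synonym group, picking the first id is only one possible policy. The sketch below instead returns every candidate and leaves the choice to the caller. The function name `representatives` and the sample mapping are hypothetical; ids that are missing from the mapping are skipped, on the assumption that the CSV and the installed dictionary may not always match.

```python
def representatives(group_ids, synonyms):
    """Return the representative word for every known group id, in the given order."""
    return [synonyms[g] for g in group_ids if g in synonyms]


# Made-up mapping in the same shape as the dict built from synonym_base.csv.
synonyms = {1: "ambiguous", 2: "destination"}

# 99 is not in the mapping, so it is silently skipped.
print(representatives([2, 1, 99], synonyms))  # => ['destination', 'ambiguous']
```

With a real token, the first argument would be `token.synonym_group_ids()`; an empty result then signals that the caller should fall back to another form of the token.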
Use the generated synonym data to normalize the word-separated text.
fetch_synonym_surface returns the representative word of the synonym group if the token has a synonym, or the token's normalized form if it does not.
import csv

with open('synonym_base.csv', "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    data = [[int(r[0]), r[1]] for r in reader]

synonyms = dict(data)

def fetch_synonym_surface(token):
    synonym_group_ids = token.synonym_group_ids()
    if synonym_group_ids:
        # There can be more than one id; take the first one for now
        surface = synonyms[synonym_group_ids[0]]
    else:
        surface = token.normalized_form()
    return surface
The following compares the code and its results.
from sudachipy import tokenizer
from sudachipy import dictionary

def wakati(sentence):
    # 1: surface forms as they appear in the text
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return " ".join([m.surface() for m in tokenizer_obj.tokenize(sentence, mode)])

def wakati_normalized(sentence):
    # 2: Sudachi's normalized forms
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return " ".join([m.normalized_form() for m in tokenizer_obj.tokenize(sentence, mode)])

def wakati_synonym_normalized(sentence):
    # 3: representative synonym where available, otherwise the normalized form
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C
    return " ".join([fetch_synonym_surface(m) for m in tokenizer_obj.tokenize(sentence, mode)])
sentence = "Adobe is a company that produces money, which is American money"
print("1:", wakati(sentence))
print("2:", wakati_normalized(sentence))
print("3:", wakati_synonym_normalized(sentence))
1:Adobe is a company that produces money, which is American money
2:Adobe is American money, a money-producing company
3:Adobe Systems is the money of the United States, a money-producing company
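The three functions above differ only in how each token is rendered, and each call builds a new dictionary. One possible refactoring, sketched here under that observation, is a small factory that takes the tokenizing call and the per-token renderer as arguments. The name `make_wakati` is hypothetical, and the demonstration uses `str.split` and `str.upper` as stand-ins so that the block runs without sudachipy installed.

```python
from typing import Callable, Iterable


def make_wakati(
    tokenize: Callable[[str], Iterable],
    render: Callable[[object], str],
) -> Callable[[str], str]:
    """Build a word-separation function from a tokenizer call and a per-token renderer."""
    def wakati(sentence: str) -> str:
        return " ".join(render(m) for m in tokenize(sentence))
    return wakati


# Stand-in tokenizer for demonstration only: splits on whitespace.
# With sudachipy, the tokenizer would be created once and reused, e.g.
#   tokenizer_obj = dictionary.Dictionary().create()
#   tokenize = lambda s: tokenizer_obj.tokenize(s, tokenizer.Tokenizer.SplitMode.C)
# and fetch_synonym_surface (or lambda m: m.surface()) would be passed as `render`.
upper_wakati = make_wakati(str.split, str.upper)
print(upper_wakati("adobe makes software"))  # => ADOBE MAKES SOFTWARE
```

Creating the tokenizer once outside the function avoids reloading the dictionary on every call, which matters when many sentences are processed.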