On November 11, 2019, Works Applications, the developer of the morphological analyzer Sudachi, released a synonym dictionary!
A synonym dictionary is mainly used in document retrieval and chatbots to absorb spelling and notation variants of the same word.
In this post, I examine the contents of the dictionary using the Python library pandas.
$ wget https://raw.githubusercontent.com/WorksApplications/SudachiDict/develop/src/main/text/synonyms.txt
According to the documentation, the file is in CSV format. From here on I'll write a Python script.
import pandas as pd

df = pd.read_csv("synonyms.txt", skip_blank_lines=True,
                 names=('group_id', 'type', 'expand', 'vocab_id',
                        'relation', 'abbreviation', 'spelling', 'domain',
                        'surface', 'reserve1', 'reserve2'))
Because the CSV contains blank lines, skip_blank_lines is set. The column names passed to names are arbitrary labels I made up.
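To see what these two options do without downloading the dictionary, here is a minimal sketch that reads a tiny in-memory sample. The rows in SAMPLE are made up for illustration (they are not real dictionary entries) and only mimic the 11-column layout; COLUMNS and sample_df are names introduced for this example.

```python
import io

import pandas as pd

# Made-up rows in an 11-column layout; NOT real dictionary data.
# The empty line in the middle stands in for the blank lines in the file.
SAMPLE = """\
000001,1,0,1,0,0,0,(),alpha,,

000001,1,0,2,0,0,0,(),beta,,
000002,1,0,1,0,0,0,(),gamma,,
"""

# Same arbitrary labels as in the script above.
COLUMNS = ['group_id', 'type', 'expand', 'vocab_id',
           'relation', 'abbreviation', 'spelling', 'domain',
           'surface', 'reserve1', 'reserve2']

# Passing names means the first line is treated as data, not as a header.
sample_df = pd.read_csv(io.StringIO(SAMPLE), skip_blank_lines=True, names=COLUMNS)

print(sample_df[['group_id', 'domain', 'surface']])
```

The blank line does not produce a row, so sample_df has three rows. Note that group_id is parsed as an integer here, so the leading zeros are dropped; passing dtype={'group_id': str} to read_csv would keep them if that matters for your use.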
For now, let's create a function that displays every row of df belonging to a group that contains a matching headword.
def search_synonyms(word):
    # For each row whose headword matches, print all rows in the same synonym group
    for row in df[df.surface == word].itertuples():
        print(df[df.group_id == row.group_id].loc[:, ['group_id', 'domain', 'surface']])
Synonyms are grouped by group number (group_id), so it should be enough to take every row that has the same group number as a row whose headword (surface) matches word!
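The same lookup can also be written so that it returns the groups instead of printing them. The following is only a sketch under assumptions: find_synonym_groups and the toy DataFrame are hypothetical names and data introduced here, not part of the dictionary or of Sudachi.

```python
import pandas as pd


def find_synonym_groups(df, word):
    """Return one DataFrame per synonym group that contains `word`."""
    # group_id of every row whose headword equals the query;
    # unique() keeps each group once, in order of first appearance.
    group_ids = df.loc[df['surface'] == word, 'group_id'].unique()
    # For each matching group, collect all rows sharing that group_id.
    return [df.loc[df['group_id'] == gid, ['group_id', 'domain', 'surface']]
            for gid in group_ids]


# Toy data for illustration only; not taken from the dictionary.
toy = pd.DataFrame({
    'group_id': [1, 1, 2, 2, 3],
    'domain': ['()', '()', '()', '()', '()'],
    'surface': ['apple', 'ringo', 'apple', 'Apple Inc.', 'pear'],
})

for group in find_synonym_groups(toy, 'apple'):
    print(group)
```

Because 'apple' appears in groups 1 and 2, two tables are printed, one per group. Using unique() also means a group is listed only once even if the same headword occurs in it more than once.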
For example, running search_synonyms('giant') produces output like this:
      group_id    domain         surface
5662      3895  (Sports)  Yomiuri Giants
5663      3895  (Sports)           Giant
5664      3895  (Sports)         Yomiuri
5665      3895  (Sports)          Giants
5666      3895  (Sports)  Yomiuri Giants
5667      3895  (Sports)          Giants
5668      3895  (Sports)          Giants
       group_id domain surface
31690     16305  (Man)    巨人
31691     16305  (Man)   Giant
31692     16305  (Man)   giant
I was able to retrieve the various notations of both the baseball team "Giants" and the common noun "giant"!
In this post I took a look at Sudachi's synonym dictionary. Both Sudachi itself and the synonym dictionary are expected to keep being updated, so keep an eye on future releases!