I read the Sudachi synonym dictionary with Pandas and searched for synonyms

On November 11, 2019, Works Applications, the developer of the morphological analyzer Sudachi, released a synonym dictionary!

The synonym dictionary is mainly used in document retrieval and chatbots to absorb spelling variations, that is, different ways of writing the same word.
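As a rough illustration of what "absorbing" variations means, here is a minimal sketch that maps every variant in a synonym group to one representative form before searching. The groups, the `CANONICAL` mapping, and the `normalize` function are all made up for this example; this is not how Sudachi itself applies the dictionary.

```python
# Hypothetical synonym groups; the first entry of each group is treated
# as the representative form (an assumption made for this sketch).
SYNONYM_GROUPS = [
    ["color", "colour"],
    ["e-mail", "email", "mail"],
]

# Map every variant (including the representative itself) to the representative.
CANONICAL = {
    variant: group[0]
    for group in SYNONYM_GROUPS
    for variant in group
}


def normalize(tokens):
    """Replace each token with its representative form, if it has one."""
    return [CANONICAL.get(token, token) for token in tokens]


query = normalize(["colour", "email", "settings"])
print(query)
```

If both the indexed documents and the query are normalized this way, a search for "colour" can also hit documents that say "color".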

This time, I examined the contents of this dictionary using the Python library Pandas.

Environment

Download Sudachi Synonyms Dictionary

$ wget https://raw.githubusercontent.com/WorksApplications/SudachiDict/develop/src/main/text/synonyms.txt

Read

According to the documentation, the file is in CSV format! From here on I will write a Python script.

import pandas as pd

# synonyms.txt has no header row, so the column names are supplied explicitly.
df = pd.read_csv("synonyms.txt", skip_blank_lines=True,
                 names=('group_id', 'type', 'expand', 'vocab_id',
                        'relation', 'abbreviation', 'spelling', 'domain',
                        'surface', 'reserve1', 'reserve2'))

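Because the CSV contains blank lines, I set skip_blank_lines=True (this is also the pandas default, made explicit here). The column names passed to names are ones I made up myself.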
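To try the loading step without downloading anything, here is a minimal sketch that reads a small in-memory sample instead of the real file. The sample rows are invented for illustration and only imitate the 11-column layout with a blank line between groups; they are not taken from the actual dictionary.

```python
import io

import pandas as pd

# Hypothetical sample: two groups separated by a blank line, 11 fields per row.
SAMPLE = (
    "000001,1,0,1,0,0,0,(),apple,,\n"
    "000001,1,0,1,0,0,1,(),Apple,,\n"
    "\n"
    "000002,1,0,1,0,0,0,(),pear,,\n"
)

COLUMNS = ('group_id', 'type', 'expand', 'vocab_id',
           'relation', 'abbreviation', 'spelling', 'domain',
           'surface', 'reserve1', 'reserve2')

# The blank line is dropped, so only the three data rows are loaded.
sample_df = pd.read_csv(io.StringIO(SAMPLE), skip_blank_lines=True,
                        names=COLUMNS)

print(len(sample_df))
print(sample_df[['group_id', 'surface']])
```

Note that pandas infers group_id as an integer here, so the leading zeros in the sample are not preserved.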

Search

For now, let's write a function that prints every row of each group that contains a headword matching the given word.

def search_synonyms(word):
    # For each row whose surface matches, print the whole synonym group.
    for row in df[df.surface == word].itertuples():
        print(df[df.group_id == row.group_id].loc[:, ['group_id', 'domain', 'surface']])

Synonyms share a group number (group_id), so it should be enough to find the rows whose headword (surface) matches word and then pull every row with the same group number!
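The same idea can also be written so that it returns a DataFrame instead of printing, which is easier to reuse. This is only a sketch: the function name `find_synonym_rows` and the toy data are hypothetical, and the DataFrame is passed in as an argument rather than read from a global.

```python
import pandas as pd


def find_synonym_rows(df, word):
    """Return all rows belonging to any group that contains `word` as a surface."""
    # Collect the group numbers of every row whose surface matches.
    group_ids = df.loc[df['surface'] == word, 'group_id'].unique()
    # Keep every row that belongs to one of those groups.
    matched = df['group_id'].isin(group_ids)
    return df.loc[matched, ['group_id', 'domain', 'surface']]


# Hypothetical toy data, not taken from the real dictionary.
toy_df = pd.DataFrame({
    'group_id': [1, 1, 2, 2, 3],
    'domain': ['()', '()', '()', '()', '()'],
    'surface': ['car', 'automobile', 'car', 'carriage', 'pear'],
})

result = find_synonym_rows(toy_df, 'car')
print(result)
```

Because `isin` takes all matching group numbers at once, a word that appears in several groups yields one combined result, and a word that is not in the data yields an empty DataFrame.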

Example

For example, running search_synonyms('giant') prints the following!

      group_id    domain         surface
5662      3895  (Sports)  Yomiuri Giants
5663      3895  (Sports)           Giant
5664      3895  (Sports)         Yomiuri
5665      3895  (Sports)          Giants
5666      3895  (Sports)  Yomiuri Giants
5667      3895  (Sports)          Giants
5668      3895  (Sports)          Giants
       group_id domain surface
31690     16305  (Man)      巨人
31691     16305  (Man)   Giant
31692     16305  (Man)   giant

I was able to fetch the various spellings of both the team name "Giants" and the common noun "giant"!

Summary

This time I took a look at Sudachi's synonym dictionary. Both Sudachi itself and this synonym dictionary are expected to keep being updated, so keep an eye on future releases!
