WordNet structure and synonym search

Introduction

I needed to get synonyms for my program that weakens Japanese. A search suggested that WordNet was what I needed, so I looked into it.

What is WordNet

Japanese WordNet is "a Japanese concept dictionary in which individual concepts are grouped into units called 'synsets', which are semantically linked to other synsets" (quoted from the provider's site). I think its main use is synonym search. I have not yet investigated other uses.

WordNet structure

WordNet is available from the provider's site in several forms. Here I look at the one called "Japanese Wordnet and English WordNet in an sqlite3 database".

This WordNet database contains 11 tables.
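One way to check the table list yourself is to query SQLite's built-in sqlite_master table. The following is a minimal sketch: `list_tables` is a hypothetical helper name, and the file name wnjpn.db in the usage comment is an assumption about where the downloaded database was saved.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    cur = conn.execute(
        "select name from sqlite_master where type='table' order by name"
    )
    return [row[0] for row in cur]

# Usage (assumption: the database file is saved as wnjpn.db):
#   conn = sqlite3.connect("wnjpn.db")
#   print(list_tables(conn))
```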

Of these, the minimum set of tables needed to get synonyms for a particular word is word, synset, and sense. However, those alone carry no information about meaning, so I also included synset_def. The data model excerpted from these four tables is as follows.

ER_WordNet_noKeys.png

The idea is that a word word_i belongs to a concept synset_j, and the sense table is what links them. As an example, the synset and synset_def output for the word "warm" is as follows.

wordnet.png
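Output like this could be produced by a query along the following lines. This is only a sketch and not necessarily how the figure was generated. The function name `get_synset_defs_for_lemma` is hypothetical, and I am assuming that the synset table has a `name` column and that the definition column of synset_def is called `def`.

```python
import sqlite3

def get_synset_defs_for_lemma(conn, lemma, lang="jpn"):
    """Return (synset id, synset name, definition) for each synset containing lemma."""
    sql = """
        select sy.synset, sy.name, d.def
        from word w
        join sense s on s.wordid = w.wordid
        join synset sy on sy.synset = s.synset
        -- keep only definitions written in the requested language
        join synset_def d on d.synset = sy.synset and d.lang = ?
        where w.lemma = ?
        order by sy.synset
    """
    return conn.execute(sql, (lang, lemma)).fetchall()
```

Each returned row pairs one concept with one definition, so a word with several meanings yields several rows.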

Acquisition of synonyms

Now I will explain the procedure and the code for getting a list of synonyms using WordNet. The processing flow is as follows.

  1. Get the wordid of the target word
  2. Get the sense of the synset to which the wordid belongs
  3. Get words that belong to synset as synonyms

The overall code is below.

def search_synonyms(lemma, lang="jpn"):
    synonym_list = []
    # 1.Get the word id of a word
    wobj = get_word(lemma)
    if wobj:
        word = wobj[0]
        # 2.Get the sense of the synset to which the wordid belongs
        senses = get_senses(word, lang)
        for s in senses:
            # 3.Get words that belong to synset as synonyms
            synonyms = get_words_from_synset(s.synset, word, lang)
            for syn in synonyms:
                if syn.lemma not in synonym_list:
                    synonym_list.append(syn.lemma)
    else:
        print(f"No synonyms were found for '{lemma}'.")
    
    return synonym_list

Below, I describe each of steps 1 to 3.

1. Get the wordid of the target word

The function `get_word(lemma)`, which gets the wordid of the target word, is as follows. Note that it returns the entire Word object rather than the wordid alone, for the sake of readability and extensibility.

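import sqlite3
from collections import namedtuple

# Assumption: the downloaded sqlite3 database is saved as wnjpn.db
conn = sqlite3.connect("wnjpn.db")
Word = namedtuple('Word', 'wordid lang lemma pron pos')

def get_word(lemma):
    cur = conn.execute("select * from word where lemma=?", (lemma,))
    return [Word(*row) for row in cur]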

2. Get the sense of the synset to which the wordid belongs

The function `get_senses(word, lang)`, which gets the senses for a word (by its wordid), is as follows.

Sense = namedtuple('Sense', 'synset wordid lang rank lexid freq src')

def get_senses(word, lang):
    cur = conn.execute("select * from sense where wordid=? and lang=?", (word.wordid, lang))
    return [Sense(*row) for row in cur]

The language restriction (lang="jpn") may only be necessary in the next step, but I have included it here as well for the time being.

3. Get words that belong to synset as synonyms

The function `get_words_from_synset(synset, word, lang)`, which gets the words belonging to a synset, is as follows.

def get_words_from_synset(synset, word, lang):
    cur = conn.execute("select * from word where wordid in (select wordid from sense where synset=? and lang=?) and wordid<>?;", (synset, lang, word.wordid))
    return [Word(*row) for row in cur]

The final condition `wordid<>?` (bound to word.wordid) is there to exclude the target word itself. There are several other ways this SQL could be written.
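As one example of another way to write it, the subquery can be replaced with a join between sense and word. This is a sketch rather than a recommended replacement: the name `get_words_from_synset_join` is hypothetical, and unlike the functions above it takes the connection and the wordid as explicit arguments so that it is self-contained.

```python
import sqlite3
from collections import namedtuple

Word = namedtuple('Word', 'wordid lang lemma pron pos')

def get_words_from_synset_join(conn, synset, wordid, lang="jpn"):
    """Return the words of the given language in synset, excluding wordid itself."""
    sql = """
        select w.wordid, w.lang, w.lemma, w.pron, w.pos
        from sense s
        join word w on w.wordid = s.wordid
        where s.synset = ? and s.lang = ? and w.wordid <> ?
    """
    return [Word(*row) for row in conn.execute(sql, (synset, lang, wordid))]
```

Both forms should return the same words. Which one reads better is largely a matter of taste.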


Bonus

Steps 1 to 3 are enough to get synonyms, but if you want to see which concept each synonym shares with the target word, you can also get `synset_def`.

SynsetDef = namedtuple('SynsetDef', 'synset lang defi sid')
# def is a reserved word in Python and cannot be used as a field name, so the field is named defi.

def get_synset_def_from_synset(synset, lang):
    cur = conn.execute("select * from synset_def where synset=? and lang=?", (synset, lang))
    return [SynsetDef(*row) for row in cur]
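To show how the definitions and the synonyms fit together, here is a minimal sketch that groups the synonyms of a word by synset and attaches each synset's definitions. The function name `synonyms_with_definitions` and the shape of the returned dictionary are my own choices for illustration, and the connection is passed in explicitly so that the block stands on its own.

```python
import sqlite3

def synonyms_with_definitions(conn, lemma, lang="jpn"):
    """Map each synset containing lemma to its definitions and its other member words."""
    result = {}
    # Find every (synset, wordid) pair that the target lemma takes part in
    pairs = conn.execute(
        "select s.synset, s.wordid from sense s "
        "join word w on w.wordid = s.wordid "
        "where w.lemma = ? and s.lang = ?",
        (lemma, lang),
    ).fetchall()
    for synset, wordid in pairs:
        # Definitions of this concept in the requested language
        defs = [row[0] for row in conn.execute(
            "select def from synset_def where synset = ? and lang = ?",
            (synset, lang),
        )]
        # Other words in the same concept, excluding the target word itself
        lemmas = [row[0] for row in conn.execute(
            "select w.lemma from sense s "
            "join word w on w.wordid = s.wordid "
            "where s.synset = ? and s.lang = ? and w.wordid <> ? "
            "order by w.lemma",
            (synset, lang, wordid),
        )]
        result[synset] = {"definitions": defs, "synonyms": lemmas}
    return result
```

Grouping by synset makes it easier to see which meaning of the target word each synonym corresponds to.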

Conclusion

I'm sorry that there is nothing new here, but I hope it helps. That's all.
