I implemented similar-sentence search with Doc2Vec, so I will introduce the implementation here.
In order for a computer to process natural language, human language must first be converted into values that a computer can handle. [Word2Vec] is one method for vectorizing the meaning of words. The linked article explains the details very clearly, but roughly speaking, each word is represented by the n words that appear before and after it. Because of this, words like "dog" and "cat", which are used in similar contexts, can be considered to have similar "meanings". Doc2Vec is an application of Word2Vec that vectorizes whole sentences.
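As a rough illustration (separate from this article's pipeline), gensim's Word2Vec learns such vectors from plain lists of words. The toy corpus below is made up, and the calls use the same older gensim API as the rest of this article:

from gensim.models import Word2Vec

# Tiny made-up corpus in which 'dog' and 'cat' appear in identical contexts
toy_corpus = [
    ['i', 'walk', 'my', 'dog', 'every', 'morning'],
    ['i', 'walk', 'my', 'cat', 'every', 'morning'],
    ['the', 'dog', 'sleeps', 'on', 'the', 'sofa'],
    ['the', 'cat', 'sleeps', 'on', 'the', 'sofa'],
]
w2v = Word2Vec(toy_corpus, size=50, min_count=1)

# 'cat' should rank near the top of the words similar to 'dog'
print(w2v.most_similar('dog', topn=3))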
This time, I will implement the following two functions using Doc2Vec:

- Search for similar sentences
- Search for similar words
As a sample, I used texts from Aozora Bunko. The code used in this article is [published on GitHub][GitHub]. (I also zipped up the texts used for training, but note that the file is large.)
Make sure that MeCab and gensim are installed and usable (both are imported below). The processing follows this flow:

1. Get all files in the corpus directory
2. Read each file and split it into words
3. Train the Doc2Vec model
4. Search for similar sentences / similar words
import os
import sys
import MeCab
import collections
from gensim import models
from gensim.models.doc2vec import LabeledSentence
First, import the required libraries.
def get_all_files(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            yield os.path.join(root, file)
Gets all files under the given directory.
def read_document(path):
    # Aozora Bunko texts are distributed in Shift-JIS encoding
    with open(path, 'r', encoding='sjis', errors='ignore') as f:
        return f.read()
def trim_doc(doc):
    lines = doc.splitlines()
    valid_lines = []
    is_valid = False
    horizontal_rule_cnt = 0
    break_cnt = 0
    for line in lines:
        # The Aozora Bunko header is delimited by two horizontal rules;
        # the body starts after the second one
        if horizontal_rule_cnt < 2 and '-----' in line:
            horizontal_rule_cnt += 1
            is_valid = horizontal_rule_cnt == 2
            continue
        if not(is_valid):
            continue
        # Three consecutive blank lines mark the start of the trailing notes
        if line == '':
            break_cnt += 1
            is_valid = break_cnt != 3
            continue
        break_cnt = 0
        valid_lines.append(line)
    return ''.join(valid_lines)
The processing here will depend on the target text. This time, I stripped out the explanatory sections that appear before and after the body. It is unclear how much this affects accuracy in the first place.
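As an illustration, here is roughly the Aozora Bunko layout that trim_doc assumes; the sample text below is made up:

sample = '\n'.join([
    'タイトル',
    '著者名',
    '-------------------------------------------------------',
    '【テキスト中に現れる記号について】',
    '-------------------------------------------------------',
    '本文の一行目。',
    '本文の二行目。',
    '',
    '',
    '',
    '底本：...',
])
print(trim_doc(sample))  # -> 本文の一行目。本文の二行目。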
def split_into_words(doc, name=''):
    mecab = MeCab.Tagger("-Ochasen")
    valid_doc = trim_doc(doc)
    lines = mecab.parse(valid_doc).splitlines()
    words = []
    for line in lines:
        chunks = line.split('\t')
        # chunks[3] is the part of speech; keep verbs (動詞), adjectives (形容詞),
        # and nouns (名詞) other than numerals (名詞-数)
        if len(chunks) > 3 and (chunks[3].startswith('動詞')
                or chunks[3].startswith('形容詞')
                or (chunks[3].startswith('名詞') and not chunks[3].startswith('名詞-数'))):
            words.append(chunks[0])
    return LabeledSentence(words=words, tags=[name])
def corpus_to_sentences(corpus):
    docs = [read_document(x) for x in corpus]
    for idx, (doc, name) in enumerate(zip(docs, corpus)):
        sys.stdout.write('\rPreprocessing {} / {}'.format(idx, len(corpus)))
        yield split_into_words(doc, name)
This reads the text from each file and breaks it down into words. To improve accuracy, it apparently works well to use only nouns for training; this time, I used verbs, adjectives, and nouns (excluding numerals).
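For reference, MeCab's -Ochasen output is tab-separated, with the part of speech in the fourth column, which is why the code above inspects chunks[3]. A quick check (the output shown is approximate and depends on the installed dictionary):

mecab = MeCab.Tagger('-Ochasen')
print(mecab.parse('犬が三匹走る'))
# 犬      イヌ    犬      名詞-一般
# が      ガ      が      助詞-格助詞-一般
# 三      サン    三      名詞-数
# 匹      ヒキ    匹      名詞-接尾-助数詞
# 走る    ハシル  走る    動詞-自立    五段・ラ行    基本形
# EOS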
# Training ends once at least 94 of the 100 sampled documents rank
# themselves as their own most similar document (see the explanation below)
PASSING_PRECISION = 94

def train(sentences):
    model = models.Doc2Vec(size=400, alpha=0.0015, sample=1e-4, min_count=1, workers=4)
    model.build_vocab(sentences)
    for x in range(30):
        print(x)
        model.train(sentences)
        ranks = []
        for doc_id in range(100):
            inferred_vector = model.infer_vector(sentences[doc_id].words)
            sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
            rank = [docid for docid, sim in sims].index(sentences[doc_id].tags[0])
            ranks.append(rank)
        print(collections.Counter(ranks))
        if collections.Counter(ranks)[0] >= PASSING_PRECISION:
            break
    return model
The training parameters are set in the models.Doc2Vec call:
- alpha: The higher it is, the faster training converges, but if it is too high, training diverges. The lower it is, the higher the accuracy, but the slower the convergence.
- sample: Words that appear too often are likely to be meaningless and may be ignored; this sets that threshold.
- min_count: Conversely, words that appear too rarely may not be appropriate for describing a sentence and may be ignored. This time, however, I kept all words (min_count=1).
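Annotated, the constructor call above reads as follows (the comments restate the notes above):

model = models.Doc2Vec(
    size=400,      # dimensionality of the document vectors
    alpha=0.0015,  # learning rate
    sample=1e-4,   # downsampling threshold for overly frequent words
    min_count=1,   # keep every word, no matter how rare
    workers=4,     # number of worker threads
)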
for x in range(30):
    print(x)
    model.train(sentences)
    ranks = []
    for doc_id in range(100):
        inferred_vector = model.infer_vector(sentences[doc_id].words)
        sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
        rank = [docid for docid, sim in sims].index(sentences[doc_id].tags[0])
        ranks.append(rank)
    print(collections.Counter(ranks))
    if collections.Counter(ranks)[0] >= PASSING_PRECISION:
        break
return model
This part trains and evaluates. For the evaluation, I take 100 of the training documents, search for their most similar documents, and count how often a document's own vector comes back as the most similar one. Training ends once this happens at least 94 times out of 100. (The accuracy did not improve any further after a few more iterations.)
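To make the stopping condition concrete: ranks holds, for each of the 100 documents, the position at which the document retrieved itself, so collections.Counter(ranks)[0] is the number of documents whose own vector came back first. A toy example:

ranks = [0, 0, 1, 0, 5]
print(collections.Counter(ranks))     # Counter({0: 3, 1: 1, 5: 1})
print(collections.Counter(ranks)[0])  # 3 -> three documents ranked themselves first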
model.save(OUTPUT_MODEL)
OUTPUT_MODEL holds the output path, e.g. 'doc2vec.model' to match the load call below.
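Putting the pieces together, the whole training step might look like the sketch below (the corpus directory name is a placeholder; see the GitHub repository for the actual script):

corpus = list(get_all_files('aozora'))         # placeholder directory name
sentences = list(corpus_to_sentences(corpus))  # a list, since train() indexes into it
model = train(sentences)
model.save('doc2vec.model')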
model = models.Doc2Vec.load('doc2vec.model')

def search_similar_texts(words):
    x = model.infer_vector(words)
    most_similar_texts = model.docvecs.most_similar([x])
    for similar_text in most_similar_texts:
        print(similar_text[0])
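For example, to query with an arbitrary document, you can reuse read_document and split_into_words to produce the word list (the file name here is hypothetical):

words = split_into_words(read_document('query.txt')).words
search_similar_texts(words)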
Since Doc2Vec also vectorizes words at the same time (the Word2Vec part), I also tried searching for similar words.
def search_similar_words(words):
    for word in words:
        print()
        print(word + ':')
        for result in model.most_similar(positive=word, topn=10):
            print(result[0])
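A usage sketch (the query words are arbitrary examples):

search_similar_words(['犬', '猫'])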
Documents that are already in the training set can also be searched directly by their tag. Since each document was tagged with its file path in split_into_words, the path serves as the key.

model = models.Doc2Vec.load('doc2vec.model')

def search_similar_texts(path):
    most_similar_texts = model.docvecs.most_similar(path)
    for similar_text in most_similar_texts:
        print(similar_text[0])
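So it can be called with one of the file paths used during training (the path here is hypothetical):

search_similar_texts('aozora/sample.txt')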
That is the implementation of similar-sentence search with Doc2Vec. I hope you find it helpful.
Finally, here is an error that occurred in my environment, along with the fix.

Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.

This was resolved by updating numpy:

conda update numpy
References:

- [Word2Vec: The amazing power of word vectors that surprises the inventor][Word2Vec]
- [How Doc2Vec works and document similarity calculation tutorial using gensim][Tutorial]
- [What happens if you do machine learning with a pixiv novel (learned model data is distributed)][pixiv]
- [Use TensorFlow to check the difference in movement depending on the learning rate][Learning rate]
- [models.doc2vec – Deep learning with paragraph2vec][doc2vec]

[Word2Vec]: https://deepage.net/bigdata/machine_learning/2016/09/02/word2vec_power_of_word_vector.html
[GitHub]: https://github.com/Foo-x/doc2vec-sample
[Tutorial]: https://deepage.net/machine_learning/2017/01/08/doc2vec.html
[pixiv]: http://inside.pixiv.net/entry/2016/09/13/161454
[Learning rate]: http://qiita.com/isaac-otao/items/6d44fdc0cfc8fed53657
[doc2vec]: https://radimrehurek.com/gensim/models/doc2vec.html