I implemented similar-sentence search with Doc2Vec, so I will introduce the implementation here.
In order for a computer to process natural language, human language must first be converted into values that a computer can handle. [Word2Vec] is one method for vectorizing the meaning of words. The linked article explains the details very clearly, but roughly speaking, each word is represented by the n words that appear before and after it. Because of this, words like "dog" and "cat", which are used in similar contexts, can be considered to have similar "meanings". Doc2Vec is an application of Word2Vec that vectorizes whole sentences.
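As a rough illustration (separate from this article's pipeline), gensim's Word2Vec learns such vectors from plain lists of words. The toy corpus below is made up, and the calls use the same older gensim API as the rest of this article:

from gensim.models import Word2Vec

# Tiny made-up corpus in which 'dog' and 'cat' appear in identical contexts
toy_corpus = [
    ['i', 'walk', 'my', 'dog', 'every', 'morning'],
    ['i', 'walk', 'my', 'cat', 'every', 'morning'],
    ['the', 'dog', 'sleeps', 'on', 'the', 'sofa'],
    ['the', 'cat', 'sleeps', 'on', 'the', 'sofa'],
]
w2v = Word2Vec(toy_corpus, size=50, min_count=1)

# 'cat' should rank near the top of the words similar to 'dog'
print(w2v.most_similar('dog', topn=3))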
This time, I will implement the following two functions using Doc2Vec:

- Search for similar sentences
- Search for similar words
As a sample, I used texts from Aozora Bunko. The code used in this article is [published on GitHub][GitHub]. (I also zipped up the texts used for training, but note that the file is large.)
Make sure that MeCab and gensim are installed and usable (both are imported below). The processing follows this flow:

1. Get all files in the corpus directory
2. Read each file and split it into words
3. Train the Doc2Vec model
4. Search for similar sentences / similar words
import os
import sys
import MeCab
import collections
from gensim import models
from gensim.models.doc2vec import LabeledSentence
First, import the required libraries.
def get_all_files(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            yield os.path.join(root, file)
Gets all files under the given directory.
def read_document(path):
    # Aozora Bunko texts are distributed in Shift-JIS encoding
    with open(path, 'r', encoding='sjis', errors='ignore') as f:
        return f.read()
def trim_doc(doc):
    lines = doc.splitlines()
    valid_lines = []
    is_valid = False
    horizontal_rule_cnt = 0
    break_cnt = 0
    for line in lines:
        # The Aozora Bunko header is delimited by two horizontal rules;
        # the body starts after the second one
        if horizontal_rule_cnt < 2 and '-----' in line:
            horizontal_rule_cnt += 1
            is_valid = horizontal_rule_cnt == 2
            continue
        if not(is_valid):
            continue
        # Three consecutive blank lines mark the start of the trailing notes
        if line == '':
            break_cnt += 1
            is_valid = break_cnt != 3
            continue
        break_cnt = 0
        valid_lines.append(line)
    return ''.join(valid_lines)
The processing here will depend on the target text. This time, I stripped out the explanatory sections that appear before and after the body. It is unclear how much this affects accuracy in the first place.
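As an illustration, here is roughly the Aozora Bunko layout that trim_doc assumes; the sample text below is made up:

sample = '\n'.join([
    'タイトル',
    '著者名',
    '-------------------------------------------------------',
    '【テキスト中に現れる記号について】',
    '-------------------------------------------------------',
    '本文の一行目。',
    '本文の二行目。',
    '',
    '',
    '',
    '底本：...',
])
print(trim_doc(sample))  # -> 本文の一行目。本文の二行目。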
def split_into_words(doc, name=''):
    mecab = MeCab.Tagger("-Ochasen")
    valid_doc = trim_doc(doc)
    lines = mecab.parse(valid_doc).splitlines()
    words = []
    for line in lines:
        chunks = line.split('\t')
        # chunks[3] is the part of speech; keep verbs (動詞), adjectives (形容詞),
        # and nouns (名詞) other than numerals (名詞-数)
        if len(chunks) > 3 and (chunks[3].startswith('動詞')
                or chunks[3].startswith('形容詞')
                or (chunks[3].startswith('名詞') and not chunks[3].startswith('名詞-数'))):
            words.append(chunks[0])
    return LabeledSentence(words=words, tags=[name])
def corpus_to_sentences(corpus):
    docs = [read_document(x) for x in corpus]
    for idx, (doc, name) in enumerate(zip(docs, corpus)):
        sys.stdout.write('\rPreprocessing {} / {}'.format(idx, len(corpus)))
        yield split_into_words(doc, name)
This reads the text from each file and breaks it down into words. To improve accuracy, it apparently works well to use only nouns for training; this time, I used verbs, adjectives, and nouns (excluding numerals).
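For reference, MeCab's -Ochasen output is tab-separated, with the part of speech in the fourth column, which is why the code above inspects chunks[3]. A quick check (the output shown is approximate and depends on the installed dictionary):

mecab = MeCab.Tagger('-Ochasen')
print(mecab.parse('犬が三匹走る'))
# 犬      イヌ    犬      名詞-一般
# が      ガ      が      助詞-格助詞-一般
# 三      サン    三      名詞-数
# 匹      ヒキ    匹      名詞-接尾-助数詞
# 走る    ハシル  走る    動詞-自立    五段・ラ行    基本形
# EOS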
# Training ends once at least 94 of the 100 sampled documents rank
# themselves as their own most similar document (see the explanation below)
PASSING_PRECISION = 94

def train(sentences):
    model = models.Doc2Vec(size=400, alpha=0.0015, sample=1e-4, min_count=1, workers=4)
    model.build_vocab(sentences)
    for x in range(30):
        print(x)
        model.train(sentences)
        ranks = []
        for doc_id in range(100):
            inferred_vector = model.infer_vector(sentences[doc_id].words)
            sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
            rank = [docid for docid, sim in sims].index(sentences[doc_id].tags[0])
            ranks.append(rank)
        print(collections.Counter(ranks))
        if collections.Counter(ranks)[0] >= PASSING_PRECISION:
            break
    return model
The training parameters are set in the models.Doc2Vec call:
- alpha: The higher it is, the faster training converges, but if it is too high, training diverges. The lower it is, the higher the accuracy, but the slower the convergence.
- sample: Words that appear too often are likely to be meaningless and may be ignored; this sets that threshold.
- min_count: Conversely, words that appear too rarely may not be appropriate for describing a sentence and may be ignored. This time, however, I kept all words (min_count=1).
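Annotated, the constructor call above reads as follows (the comments restate the notes above):

model = models.Doc2Vec(
    size=400,      # dimensionality of the document vectors
    alpha=0.0015,  # learning rate
    sample=1e-4,   # downsampling threshold for overly frequent words
    min_count=1,   # keep every word, no matter how rare
    workers=4,     # number of worker threads
)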
for x in range(30):
    print(x)
    model.train(sentences)
    ranks = []
    for doc_id in range(100):
        inferred_vector = model.infer_vector(sentences[doc_id].words)
        sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
        rank = [docid for docid, sim in sims].index(sentences[doc_id].tags[0])
        ranks.append(rank)
    print(collections.Counter(ranks))
    if collections.Counter(ranks)[0] >= PASSING_PRECISION:
        break
return model
This part trains and evaluates. For the evaluation, I take 100 of the training documents, search for their most similar documents, and count how often a document's own vector comes back as the most similar one. Training ends once this happens at least 94 times out of 100. (The accuracy did not improve any further after a few more iterations.)
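To make the stopping condition concrete: ranks holds, for each of the 100 documents, the position at which the document retrieved itself, so collections.Counter(ranks)[0] is the number of documents whose own vector came back first. A toy example:

ranks = [0, 0, 1, 0, 5]
print(collections.Counter(ranks))     # Counter({0: 3, 1: 1, 5: 1})
print(collections.Counter(ranks)[0])  # 3 -> three documents ranked themselves first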
model.save(OUTPUT_MODEL)
OUTPUT_MODEL holds the output path, e.g. 'doc2vec.model' to match the load call below.
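Putting the pieces together, the whole training step might look like the sketch below (the corpus directory name is a placeholder; see the GitHub repository for the actual script):

corpus = list(get_all_files('aozora'))         # placeholder directory name
sentences = list(corpus_to_sentences(corpus))  # a list, since train() indexes into it
model = train(sentences)
model.save('doc2vec.model')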
model = models.Doc2Vec.load('doc2vec.model')

def search_similar_texts(words):
    x = model.infer_vector(words)
    most_similar_texts = model.docvecs.most_similar([x])
    for similar_text in most_similar_texts:
        print(similar_text[0])
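For example, to query with an arbitrary document, you can reuse read_document and split_into_words to produce the word list (the file name here is hypothetical):

words = split_into_words(read_document('query.txt')).words
search_similar_texts(words)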
Since Doc2Vec also vectorizes words at the same time (the Word2Vec part), I also tried searching for similar words.
def search_similar_words(words):
    for word in words:
        print()
        print(word + ':')
        for result in model.most_similar(positive=word, topn=10):
            print(result[0])
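A usage sketch (the query words are arbitrary examples):

search_similar_words(['犬', '猫'])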
Documents that are already in the training set can also be searched directly by their tag. Since each document was tagged with its file path in split_into_words, the path serves as the key.

model = models.Doc2Vec.load('doc2vec.model')

def search_similar_texts(path):
    most_similar_texts = model.docvecs.most_similar(path)
    for similar_text in most_similar_texts:
        print(similar_text[0])
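So it can be called with one of the file paths used during training (the path here is hypothetical):

search_similar_texts('aozora/sample.txt')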
That is the implementation of similar-sentence search with Doc2Vec. I hope you find it helpful.
Finally, here is an error that occurred in my environment, along with the fix.

Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.

This was resolved by updating numpy:

conda update numpy
References:

- [Word2Vec: The amazing power of word vectors that surprises the inventor][Word2Vec]
- [How Doc2Vec works and document similarity calculation tutorial using gensim][Tutorial]
- [What happens if you do machine learning with a pixiv novel (learned model data is distributed)][pixiv]
- [Use TensorFlow to check the difference in movement depending on the learning rate][Learning rate]
- [models.doc2vec – Deep learning with paragraph2vec][doc2vec]

[Word2Vec]: https://deepage.net/bigdata/machine_learning/2016/09/02/word2vec_power_of_word_vector.html
[GitHub]: https://github.com/Foo-x/doc2vec-sample
[Tutorial]: https://deepage.net/machine_learning/2017/01/08/doc2vec.html
[pixiv]: http://inside.pixiv.net/entry/2016/09/13/161454
[Learning rate]: http://qiita.com/isaac-otao/items/6d44fdc0cfc8fed53657
[doc2vec]: https://radimrehurek.com/gensim/models/doc2vec.html