You can use Doc2Vec and the like to calculate similarity between sentences, but it is a bit of a hassle because you have to train a dedicated model from scratch. If you only need a certain degree of accuracy, it may be more versatile and easier to use an existing Word2Vec model as-is.
So I calculated the similarity between two sentences as the cosine similarity of the averaged feature vectors of the words contained in each sentence.
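To illustrate the basic idea before getting into the actual setup, here is a minimal toy sketch: the sentence vector is the mean of its word vectors, and sentence similarity is the cosine of the two means. The 4-dimensional vectors and vocabulary below are made-up placeholders, not from the real model used later.

```python
import numpy as np

# Made-up word vectors, just to show the mean-pooling + cosine idea.
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0, 0.2]),
    "dog": np.array([0.8, 0.2, 0.1, 0.1]),
    "car": np.array([0.0, 0.9, 0.8, 0.1]),
}

def toy_sentence_vector(words):
    # Average the word vectors to get a single sentence vector.
    return np.mean([toy_vectors[w] for w in words], axis=0)

def toy_cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(toy_cosine(toy_sentence_vector(["cat", "dog"]),
                 toy_sentence_vector(["dog", "car"])))
```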
# OS
macOS Sierra
# Python (using Anaconda)
Python : Python 3.5.3 :: Anaconda custom (x86_64)
pip : 9.0.1 from /Users/username/anaconda/lib/python3.5/site-packages (python 3.5)
It didn't work well with Python 3.6, so I am using the Python 3.5 version of Anaconda ([Anaconda3 4.2.0](https://repo.continuum.io/archive/Anaconda3-4.2.0-MacOSX-x86_64.pkg)).
Generating a dictionary from a corpus took too long on my MacBook Air, so I used a published pretrained model instead (see "The trained fastText model has been released" in the references). This time I use a model (model_neologd.vec) trained with fastText on Wikipedia text tokenized by MeCab with the NEologd dictionary. (Number of dimensions: 300)
```python
import gensim

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('model/model_neologd.vec', binary=False)
```
(The file is close to 1 GB, so loading takes tens of seconds.)
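If the load time becomes annoying, one workaround (my own suggestion, not part of the original setup) is to save the loaded KeyedVectors once in gensim's native format and reload that file instead, which is usually much faster than re-parsing the text .vec file. The output path below is just an example.

```python
# One-time conversion: save in gensim's native format for faster reloading.
word2vec_model.save('model/model_neologd.kv')

# From then on, load the native file instead of the .vec text file.
word2vec_model = gensim.models.KeyedVectors.load('model/model_neologd.kv')
```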
By using this model, you can perform semantic calculations of words using feature vectors.
```python
import pprint

pprint.pprint(word2vec_model.most_similar(positive=['woman', 'King'], negative=['Man']))
# => [('Queen', 0.7062159180641174),
#     ('Royal family', 0.6530475616455078),
#     ('Royal', 0.6122198104858398),
#     ('Crown prince', 0.6098779439926147),
#     ('Royal family', 0.6084121465682983),
#     ('princess', 0.6005773544311523),
#     ('Queen', 0.5964134335517883),
#     ('king', 0.593998908996582),
#     ('Monarch', 0.5929002165794373),
#     ('Royal palace', 0.5772185325622559)]

# Similarity between two individual words can be calculated with model.similarity
pprint.pprint(word2vec_model.similarity('King', 'Queen'))
# => 0.74155587641044496

pprint.pprint(word2vec_model.similarity('King', 'ramen'))
# => 0.036460763469822188
```
The results look about right.
Use MeCab to tokenize natural-language text into space-separated words. Specify mecab-ipadic-neologd, the same dictionary used to build the pretrained model, and have MeCab output in word-separated (wakati) form.
```python
import MeCab

mecab = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati")
mecab.parse("He had an upset stomach yesterday")
# => 'He had an upset stomach yesterday \n'
```
The tokenized text is separated by spaces. A newline is included at the end, so it needs to be removed in the implementation. (Incidentally, MeCab was installed via mecab-python3. As of May 2017 it did not seem to work properly with Python 3.6, which is why I had to use Python 3.5.)
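For example, the wakati output can be turned into a clean token list like this (just a small sketch; this helper is not part of the code below):

```python
def tokenize(sentence):
    # -Owakati output is space-separated tokens ending in ' \n', so strip and split.
    return mecab.parse(sentence).strip().split()

tokenize("He had an upset stomach yesterday")
# => a list of word tokens
```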
In this approach, the average of the feature vectors of the words in a sentence is used as the feature vector of the sentence itself, so we define a function for that.
```python
import numpy as np

def avg_feature_vector(sentence, model, num_features):
    # MeCab's word-separated output ends with ' \n', so remove it before splitting
    words = mecab.parse(sentence).replace(' \n', '').split()
    # Container for the sentence's feature vector
    feature_vec = np.zeros((num_features,), dtype="float32")
    for word in words:
        feature_vec = np.add(feature_vec, model[word])
    if len(words) > 0:
        feature_vec = np.divide(feature_vec, len(words))
    return feature_vec
```
This just averages the feature vectors of the individual words. (Since the pretrained model has 300 dimensions, pass 300 as num_features.)
```python
avg_feature_vector("He had an upset stomach yesterday", word2vec_model, 300)
# => array([ 6.39975071e-03, -6.38077855e-02, -1.41418248e-01,
#           -2.01289997e-01,  1.76049918e-01,  1.99666247e-02,
#                  :                :                :
#           -7.54096806e-02, -5.46530560e-02, -9.14395228e-02,
#           -2.21335635e-01,  3.34903784e-02,  1.81226760e-01], dtype=float32)
```
When executed, it outputs a 300-dimensional feature vector.
Next, we use the function above to calculate the cosine similarity between the average vectors of two sentences.
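Incidentally, rather than hard-coding 300, the dimensionality can likely be read off the loaded model itself (an aside, not part of the original code):

```python
# KeyedVectors exposes the embedding dimensionality; for the model used here
# this should print 300.
print(word2vec_model.vector_size)
```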
```python
from scipy import spatial

def sentence_similarity(sentence_1, sentence_2):
    # The Word2Vec model used here has 300-dimensional feature vectors, so num_features is 300
    num_features = 300
    sentence_1_avg_vector = avg_feature_vector(sentence_1, word2vec_model, num_features)
    sentence_2_avg_vector = avg_feature_vector(sentence_2, word2vec_model, num_features)
    # Cosine similarity is 1 minus the cosine distance between the vectors
    return 1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector)
```
With this function, you can easily calculate the similarity between two sentences. (The value ranges from 0 to 1; the closer to 1, the more similar the sentences.)
```python
result = sentence_similarity(
    "He ate spicy ramen yesterday and got an upset stomach",
    "Yesterday, I ate spicy Chinese food and got an upset stomach"
)
print(result)
# => 0.973996032475
```
```python
result = sentence_similarity(
    "It's no good ... I have to do something quickly ...",
    "We will deliver carefully selected job information"
)
print(result)
# => 0.608137464334
```
It produced plausible-looking values!
**Similarity comes out high for long sentences.** Because we compare averages of the word vectors, the averages of long sentences tend not to differ much, so even unrelated sentences end up with a high similarity.
```python
result = sentence_similarity(
    "It's finally in the story of this story. At last, other educators would come to this point where they shouldn't push forward, but I'm sure they'll misunderstand it, and I'm content with it to some extent.",
    "Even if I'm sick, it's like a good day. Thinking to Gauche as a mouse, your face squeezed the Doremifa's late breath and the next raccoon dog cello, and the difference between them is quite different."
)
print(result)
# => 0.878950984671
```
Even where it works, comparing sentences of around 10 words seems to be the limit.
**It cannot handle unknown words.** Since a feature vector cannot be produced for an unknown word that is not registered in the trained model, some workaround seems necessary, such as substituting the average feature vector of the other words for the unknown word. (In that case, however, unknown words often carry much of the sentence's meaning, so the similarity accuracy drops.)
```python
>>> result = sentence_similarity(
...     "Referral adoption has become popular in recent years",
...     "The era of new graduate recruitment is over"
... )
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "<stdin>", line 5, in sentence_similarity
  File "<stdin>", line 6, in avg_feature_vector
  File "/Users/username/anaconda/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 574, in __getitem__
    return self.word_vec(words)
  File "/Users/username/anaconda/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 273, in word_vec
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'Referral' not in vocabulary"
```
Here the word 'Referral' cannot be found in the model's vocabulary, so the lookup fails.
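As a minimal workaround (my own sketch, not the original code), you could simply skip words that are not in the model's vocabulary when averaging, although that discards whatever meaning those words carried:

```python
def avg_feature_vector_skip_oov(sentence, model, num_features):
    # Same as avg_feature_vector, but words missing from the model's
    # vocabulary are skipped instead of raising a KeyError.
    words = mecab.parse(sentence).replace(' \n', '').split()
    feature_vec = np.zeros((num_features,), dtype="float32")
    known = 0
    for word in words:
        if word in model:  # gensim KeyedVectors supports membership tests
            feature_vec = np.add(feature_vec, model[word])
            known += 1
    if known > 0:
        feature_vec = np.divide(feature_vec, known)
    return feature_vec
```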
Since the method itself is simple, the cases where it can be used are probably fairly limited. Conversely, if you only need to handle short sentences, this approach seems to give reasonable accuracy. If you want to compute sentence similarity seriously, the more straightforward approach is probably to use something like Doc2Vec and prepare a corpus suited to your purpose for the model itself.
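For reference, training a Doc2Vec model with gensim would look roughly like the sketch below. The corpus, tokens, and parameters are placeholders, not from this article, and parameter names vary slightly across gensim versions (newer versions use `vector_size` and `epochs`, older ones `size` and `iter`).

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy import spatial

# Placeholder corpus: each document is a list of tokens (e.g. produced by MeCab).
corpus = [
    TaggedDocument(words=["token_a", "token_b"], tags=[0]),
    TaggedDocument(words=["token_c", "token_d"], tags=[1]),
]

doc2vec_model = Doc2Vec(corpus, vector_size=300, min_count=1, epochs=20)

# Infer vectors for new (tokenized) sentences and compare them.
v1 = doc2vec_model.infer_vector(["token_a", "token_b"])
v2 = doc2vec_model.infer_vector(["token_c", "token_d"])
print(1 - spatial.distance.cosine(v1, v2))
```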
Here is the full code for reference:

```python
import gensim
import MeCab
import numpy as np
from scipy import spatial

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('model/model_neologd.vec', binary=False)
mecab = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati")

# Average the feature vectors of the words used in the sentence
def avg_feature_vector(sentence, model, num_features):
    # MeCab's word-separated output ends with ' \n', so remove it before splitting
    words = mecab.parse(sentence).replace(' \n', '').split()
    # Container for the sentence's feature vector
    feature_vec = np.zeros((num_features,), dtype="float32")
    for word in words:
        feature_vec = np.add(feature_vec, model[word])
    if len(words) > 0:
        feature_vec = np.divide(feature_vec, len(words))
    return feature_vec

# Calculate the similarity between two sentences
def sentence_similarity(sentence_1, sentence_2):
    # The Word2Vec model used here has 300-dimensional feature vectors, so num_features is 300
    num_features = 300
    sentence_1_avg_vector = avg_feature_vector(sentence_1, word2vec_model, num_features)
    sentence_2_avg_vector = avg_feature_vector(sentence_2, word2vec_model, num_features)
    # Cosine similarity is 1 minus the cosine distance between the vectors
    return 1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector)

result = sentence_similarity(
    "He ate spicy ramen yesterday and got an upset stomach",
    "Yesterday, I ate spicy Chinese food and got an upset stomach"
)
print(result)
# => 0.973996032475
```
# References
- The trained fastText model has been released
- Which is better, cosine similarity or Doc2Vec?