The official gensim documentation covers most of this, but there isn't much material on it in Japanese, so I'll summarize the basics I use most often, aimed at beginners.
pip install gensim
The import style varies from site to site, but personally I'm most comfortable writing it like this:
#coding: UTF-8
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

f = open('Training data.txt', 'r')  # Text data: words separated by spaces, one document per line

# Split each document into words and build a list, roughly [([word1, word2, word3], doc_id), ...]
# words: list of the words contained in the document (duplicates allowed)
# tags: document identifier (given as a list; one document can have multiple tags)
trainings = [TaggedDocument(words=data.split(), tags=[i]) for i, data in enumerate(f)]
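To make the expected structure concrete, here is a minimal sketch with two made-up one-line documents (the words here are hypothetical, not from my data):

from gensim.models.doc2vec import TaggedDocument

# Two hypothetical documents, already space-tokenized
lines = ["the book was great", "slow start but strong finish"]
trainings = [TaggedDocument(words=line.split(), tags=[i]) for i, line in enumerate(lines)]
print(trainings[0])
# TaggedDocument(words=['the', 'book', 'was', 'great'], tags=[0])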
Incidentally, what I trained on this time is 12 million reviews from Bookmeter, collected by scraping. Since the data exceeds 1 GB, it can be hard to fit in memory depending on your PC.
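If the corpus doesn't fit in memory, one option is to stream it instead of building the whole list up front. Doc2Vec iterates over the corpus more than once, so a plain generator won't do; a small restartable iterable class works. A minimal sketch, reusing the same filename as above:

from gensim.models.doc2vec import TaggedDocument

class CorpusStream(object):
    """Yields one TaggedDocument per line without loading the whole file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):  # restartable, so Doc2Vec can make multiple passes
        with open(self.path, 'r') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

trainings = CorpusStream('Training data.txt')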
# Train the model (more on the parameters later)
# (note: in gensim 4+, size= was renamed to vector_size=)
m = Doc2Vec(documents=trainings, dm=1, size=300, window=8, min_count=10, workers=4)

# Save the model
m.save("model/doc2vec.model")

# Load the model (if you already have a trained model, you can start from here)
m = Doc2Vec.load('model/doc2vec.model')
Note that it may take a long time depending on the size of the training data.
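Training is silent by default; if you want to watch its progress, gensim reports through the standard logging module. A small sketch, to be enabled before calling Doc2Vec:

import logging

# gensim logs training progress (epochs, words/sec, etc.) at INFO level
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)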
# The argument is a document id (tag)
print(m.docvecs.most_similar(0))
# Returns the top 10 (document id, similarity) pairs most similar to document 0
>> [(55893, 0.6868613362312317), (85550, 0.6866280436515808), (80831, 0.6864551305770874), (61463, 0.6863148212432861), (72602, 0.6847503185272217), (56876, 0.6835699081420898), (80847, 0.6832736134529114), (92838, 0.6829516291618347), (24495, 0.6820268630981445), (45589, 0.679581880569458)]
print(m.docvecs.similarity(1, 307))
# Similarity between document 1 and document 307
>> 0.279532733106
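For reference, `similarity` is just the cosine of the two stored document vectors, which you can also access directly with `m.docvecs[id]`. A quick sketch verifying the number above with numpy:

import numpy as np

v1, v2 = m.docvecs[1], m.docvecs[307]
# Cosine similarity computed by hand; should match m.docvecs.similarity(1, 307)
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))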
# As an example, compute the similarity of some pairs of the four new documents below
# (the word lists are tokenized Japanese review words, shown here in rough English translation)
doc_words1 = ["last", "Deployment" ,"early" ,"other" ,"the work", "impact", "receive" ,"Behind the back" ,"Tsukuri", "trick" ,"Every time" ,"thing", "Take off your hat", "To do", "Read", "Cheap" ,"Me" ,"Mystery"]
doc_words2 = [ "Initiation love", "Similarly" ,"last", "A few lines", "Plot twist", "Go", "Time", "Time", "various", "scene", "To do" ,"To be", "Foreshadowing" ,"Sprinkle", "らTo be" ,"Is", "thing", "notice"]
doc_words3 = ["last", "Deployment" ,"early" ,"other" ,"the work", "impact", "receive" ,"Behind the back" ,"Tsukuri","Mystery"]
doc_words4 = ["Unique", "View of the world", "Everyday" ,"Leave","Calm down","Time","Read","Book"]
print "1-2 sim"
sim_value = m.docvecs.similarity_unseen_docs(m, doc_words1, doc_words2, alpha=1, min_alpha=0.0001, steps=5)
print sim_value
print "1-3 sim"
print m.docvecs.similarity_unseen_docs(m, doc_words1, doc_words3, alpha=1, min_alpha=0.0001, steps=5)
print "1-4 sim"
print m.docvecs.similarity_unseen_docs(m, doc_words1, doc_words4, alpha=1, min_alpha=0.0001, steps=5)
print "2-3 sim"
print m.docvecs.similarity_unseen_docs(m, doc_words2, doc_words3, alpha=1, min_alpha=0.0001, steps=5)
>> 1-2 sim
0.10429317017
1-3 sim
0.472984922936
1-4 sim
-0.02320307339
2-3 sim
0.228117846023
Even to a human reader, pairs 1-3 and 2-3 clearly look similar while pair 1-4 does not, so the similarity scores line up with intuition fairly well.
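One caveat worth checking on your own model: similarity for unseen documents is computed by inferring vectors on the fly, which starts from random initialization, so repeated calls can return slightly different values. Increasing steps (renamed epochs in newer gensim) tends to stabilize the result:

# Inference is stochastic: the same pair can score slightly differently each call
for _ in range(3):
    print(m.docvecs.similarity_unseen_docs(m, doc_words1, doc_words3,
                                           alpha=1, min_alpha=0.0001, steps=50))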
newvec = m.infer_vector(doc_words1)
print(newvec)
>> [ 1.19107231e-01 -4.06390838e-02 -2.55129002e-02  1.16982162e-01
 -1.47758834e-02  1.07912444e-01 -4.76960577e-02 -9.73785818e-02
# ... (omitted)
 -1.61364377e-02 -9.76370368e-03  4.98018935e-02 -8.88026431e-02
  1.34409174e-01 -1.01136886e-01 -4.24979888e-02  7.16169327e-02]
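The inferred vector can also be fed back into most_similar to find which trained documents a brand-new document resembles. A minimal sketch:

# Pass the inferred vector in a list to search the trained document vectors
newvec = m.infer_vector(doc_words1)
print(m.docvecs.most_similar([newvec], topn=5))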
- Adjusting the parameters when training the model
- What can it be applied to?
Also, regarding the doc2vec (Paragraph Vector) algorithm itself, I found a good explanation on the blog of the Kitayama Lab at Kogakuin University: [Algorithm of doc2vec (Paragraph Vector)](https://kitayamalab.wordpress.com/2016/12/10/algorithm of doc2vecparagraph-vector-/)