The other day, I took part in a Python study session hosted by Team Zet Co., Ltd. The theme this time was "text sentiment analysis using word2vec". To be honest, it was a reckless theme for someone like me who first touched Python a week ago, but hoping to get a feel for how the grammar I'd been studying is actually put to use, I jumped in and signed up just one day before the event.
Anyway, I'll leave the introduction at that and get into the main subject.
word2vec is a neural network model (machine learning) that analyzes words. Simply put, it turns words into vectors and assigns them weights. (For more information, refer to here.)
This time I used the word2vec model published by White Goat Corporation.
How to install word2vec is described here.
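(If that link doesn't help: the word2vec implementation used throughout this article comes with the gensim library, so assuming you have pip, installation is just the following.)

pip install gensim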
First of all, let's make sure that words really are vectorized by word2vec. Type the following code in an environment where word2vec is set up.
sample.py
from gensim.models.word2vec import Word2Vec

# Load the pretrained model (the file name is an example; adjust it to your download)
model = Word2Vec.load("word2vec.gensim.model")
print(len(model.wv["Love"]))
print(model.wv["Love"])
Then the following result is returned:

50
array([ 0.09289702, -0.16302316, -0.08176763, -0.29827002, 0.05170078, 0.07736144, -0.06452437, 0.19822665, -0.11941547, -0.11159643, 0.03224859, 0.03042056, -0.09065174, -0.1677992 , -0.19054233, 0.10354111, 0.02630192, -0.06666993, -0.06296805, 0.00500843, 0.26934028, 0.05273635, 0.0192258 , 0.2924312 , -0.23919497, 0.02317964, -0.21278766, -0.01392282, 0.24962738, 0.11264788, 0.05772769, 0.20941015, -0.01239212, -0.1256235 , -0.19794041, 0.1267719 , -0.12306885, 0.01006295, 0.08548331, -0.08936502, -0.05429656, -0.09757583, 0.10338967, 0.13714872, 0.23966707, 0.02216845, 0.02270923, 0.32569838, -0.0311841 , -0.00150117], dtype=float32)

The "50" is the output of len(): the word "Love" is represented as a 50-dimensional vector, and the array shows the elements it is composed of.
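Incidentally, if you want to check the dimensionality for the whole model rather than counting one word's elements, gensim's KeyedVectors exposes it directly; a one-liner against the same model:

sample.py
# The dimensionality of every word vector in this model
print(model.wv.vector_size)  # -> 50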
Next, let's extract words that are similar to a given word.
sample.py
# Extract words similar to the keyword
sim_do = model.wv.most_similar(positive=["Girlfriend"], topn=30)
# The result is a list of (word, score) pairs, so format it for readability
print(*[" ".join([v, "{:.5f}".format(s)]) for v, s in sim_do], sep="\n")
When you run it, the top 30 come back; the first few look like:

Herself 0.82959
Molly 0.82547
He 0.82406
Sylvia 0.80452
Charlie 0.80336
Lover 0.80197

so you can extract words with similar meanings. The number to the right of each word quantifies how similar that word is to "Girlfriend".
Also, when you want to know how similar two words are, you can do this:
# Quantify how similar w1 and w2 are
similarity = model.wv.similarity(w1="Apple", w2="Strawberry")
print(similarity)

similarity = model.wv.similarity(w1="Apple", w2="Aomori")
print(similarity)

similarity = model.wv.similarity(w1="Apple", w2="Anpanman")
print(similarity)
Then

0.79041845
0.30861858
0.45321244

will be returned. This quantifies how similar the words w1 and w2 are. Say "apple" and I suspect most people (in Japan, anyway) immediately answer "Aomori!", the prefecture famous for its apples; yet the model decided that "Anpanman" is more similar to "Apple" than "Aomori" is, so you can see this model is not perfect yet.
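For reference, model.wv.similarity is nothing more than the cosine similarity of the two word vectors, so you can reproduce it by hand with numpy. A minimal sketch (the cosine helper is just for illustration):

sample.py
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product of the two vectors after normalization
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Should print (almost) the same value as model.wv.similarity(w1="Apple", w2="Strawberry")
print(cosine(model.wv["Apple"], model.wv["Strawberry"]))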
Now then, let's consider the famous proposition:

"King" - "Man" + "Woman" = "Queen" ???
sample.py
# Words in positive pull the result toward them; words in negative push it away
sim_do = model.wv.most_similar(positive=["King", "Woman"], negative=["Man"], topn=5)
print(*[" ".join([v, "{:.5f}".format(s)]) for v, s in sim_do], sep="\n")
The result is…

Princess 0.85313
Bride 0.83918
Beast 0.83155
Witch 0.82982
Maiden 0.82356
Nothing exactly matched "Queen", but we got answers in a very similar neighborhood.
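Under the hood, most_similar is essentially doing the vector arithmetic in the proposition (gensim normalizes the vectors first, so results can differ slightly). As a sketch, you can build the query vector yourself and look it up with gensim's similar_by_vector; note that, unlike most_similar, this variant may also return the input words themselves:

sample.py
# Build the query vector by hand: King - Man + Woman
vec = model.wv["King"] - model.wv["Man"] + model.wv["Woman"]

# Find the words whose vectors are closest to the hand-built query
for word, score in model.wv.similar_by_vector(vec, topn=5):
    print(word, "{:.5f}".format(score))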
By the way, up to this point we have only compared words, but it is also possible to quantify what kind of emotions a whole sentence contains.
sample.py
import numpy as np
from janome.tokenizer import Tokenizer  # Japanese morphological analyzer the original code relies on

# model is the word2vec model loaded earlier
t = Tokenizer()

# Enter your favorite sentence here.
s = ""

# One row per content word: [happy, pleasant, sad, excitement]
x = np.empty((0, 4), float)
for token in t.tokenize(s):
    # Only score content words: nouns and adjectives
    if token.part_of_speech.split(',')[0] in ("noun", "adjective"):
        print(token.surface)
        similarity1 = model.wv.similarity(w1=token.surface, w2="happy")
        similarity2 = model.wv.similarity(w1=token.surface, w2="pleasant")
        similarity3 = model.wv.similarity(w1=token.surface, w2="sad")
        similarity4 = model.wv.similarity(w1=token.surface, w2="excitement")
        x = np.append(x, np.array([[similarity1, similarity2, similarity3, similarity4]]), axis=0)

print("-" * 30)
means = np.mean(x, axis=0)
print(means)
print("happy:{0}".format(means[0]))
print("pleasant:{0}".format(means[1]))
print("sad:{0}".format(means[2]))
print("excitement:{0}".format(means[3]))
Enter your favorite sentence in the variable s. As an example, let's put in a romantic one: "I proposed at a restaurant with a view of the night view." The result is
Night view
Restaurant
propose
------------------------------
[0.29473324 0.44027831 0.27123818 0.20060815]
happy:0.29473323623339337
pleasant:0.4402783115704854
sad:0.27123818174004555
excitement:0.20060815351704755
is what comes out. So this system judges "I proposed at a restaurant with a view of the night view." to be, above all, a "pleasant" sentence. (The larger the number, the stronger that emotion.)
Then, as another example, let's put in a sentence that gives off a thoroughly negative aura: "A pistol murder occurred in a prison at midnight." This time we get
Midnight
prison
Handgun
murder
Incident
------------------------------
[-0.00661952 0.01671012 0.12141706 0.23172273]
happy:-0.006619524117559195
pleasant:0.01671011543367058
sad:0.12141705807298422
excitement:0.2317227303981781
As you can see, the values can in fact be negative. And certainly, I don't feel even a millimeter of happiness in that sentence.
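If you want to reuse this experiment, the loop above folds naturally into a function. A minimal sketch under the same assumptions (Janome for tokenizing, the gensim model from earlier; emotion_scores is just a name I made up):

sample.py
import numpy as np
from janome.tokenizer import Tokenizer

EMOTION_WORDS = ["happy", "pleasant", "sad", "excitement"]

def emotion_scores(model, s):
    # Average each noun/adjective's similarity to the four emotion words
    t = Tokenizer()
    rows = [[model.wv.similarity(w1=token.surface, w2=w) for w in EMOTION_WORDS]
            for token in t.tokenize(s)
            if token.part_of_speech.split(',')[0] in ("noun", "adjective")]
    return dict(zip(EMOTION_WORDS, np.mean(np.array(rows), axis=0)))

print(emotion_scores(model, "I proposed at a restaurant with a view of the night view."))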
It's a wonderful era when we can analyze the sentiment of sentences this easily. I am deeply grateful to Team Zet for giving me such a useful place to learn.