This is a memo of the points where I stumbled in Chapter 4 while working through "Deep Learning from Scratch ❷ --- Natural Language Processing".
The execution environment is macOS Catalina + Anaconda 2019.10, with Python 3.7.4. For details, see Chapter 1 of this memo.
This chapter speeds up the word2vec CBOW model built in Chapter 3.
The first improvement is between the input layer and the hidden layer. This part plays the role of an embedding that converts words into distributed representations, but the MatMul layer does a lot of wasted work here, so it is replaced with an Embedding layer.
The Embedding layer itself is simple, but the part of the backpropagation implementation that adds into $dW$ when idx contains duplicate values can be a bit confusing. The book takes this up in Figure 4-5 and then leaves the explanation to the reader: "Let's think about why we add."
So I thought it through by comparing it with the backpropagation of the MatMul layer, since the Embedding layer has to produce exactly the same result as the MatMul layer it replaces.
First, convert $idx$ in Figure 4-5 back into the one-hot $x$ that the MatMul layer would receive:
\begin{align}
idx &=
\begin{pmatrix}
0\\
2\\
0\\
4\\
\end{pmatrix}\\
\\
x &=
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0\\
\end{pmatrix}
\end{align}
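Just as a quick aside (my own snippet, not from the book), this conversion from $idx$ back to the one-hot $x$ can be checked in NumPy:

```python
import numpy as np

idx = np.array([0, 2, 0, 4])
x = np.eye(7, dtype=int)[idx]  # one-hot rows for a vocabulary of 7 words
print(x)
# [[1 0 0 0 0 0 0]
#  [0 0 1 0 0 0 0]
#  [1 0 0 0 0 0 0]
#  [0 0 0 0 1 0 0]]
```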
The backpropagation formula for the MatMul layer is $\frac{\partial L}{\partial W} = x^T \frac{\partial L}{\partial y}$ (see page 33), which in the notation of Figure 4-5 becomes $dW = x^T dh$. Plugging the $x$ and $dh$ of Figure 4-5 into this formula and computing $dW$ gives the following. I wanted to write $dh$ exactly as in Figure 4-5, but I cannot reproduce the shades of ● used in the book, so here I write the rows as $●$, $◆$, $a$ and $b$ instead.
\begin{align}
dW &= x^Tdh\\
\\
\begin{pmatrix}
? & ? & ? \\
○ & ○ & ○ \\
●_1 & ●_2 & ●_3 \\
○ & ○ & ○ \\
◆_1 & ◆_2 & ◆_3 \\
○ & ○ & ○ \\
○ & ○ & ○ \\
\end{pmatrix}
&=
\begin{pmatrix}
1 & 0 & 1 & 0\\
0 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
\end{pmatrix}
\begin{pmatrix}
a_1 & a_2 & a_3 \\
●_1 & ●_2 & ●_3 \\
b_1 & b_2 & b_3 \\
◆_1 & ◆_2 & ◆_3 \\
\end{pmatrix}\\
\end{align}
Carrying out the calculation, the second row of $dh$ ($●_1 ●_2 ●_3$) and the fourth row ($◆_1 ◆_2 ◆_3$) are simply copied into the corresponding rows of $dW$, but the row marked ? becomes the sum of the first row ($a$) and the third row ($b$) of $dh$:
\begin{align}
\begin{pmatrix}
a_1 + b_1 & a_2 + b_2 & a_3 + b_3 \\
○ & ○ & ○ \\
●_1 & ●_2 & ●_3 \\
○ & ○ & ○ \\
◆_1 & ◆_2 & ◆_3 \\
○ & ○ & ○ \\
○ & ○ & ○ \\
\end{pmatrix}
&=
\begin{pmatrix}
1 & 0 & 1 & 0\\
0 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
\end{pmatrix}
\begin{pmatrix}
a_1 & a_2 & a_3 \\
●_1 & ●_2 & ●_3 \\
b_1 & b_2 & b_3 \\
◆_1 & ◆_2 & ◆_3 \\
\end{pmatrix}
\end{align}
In other words, the first and third rows of $dh$ are added together. The Embedding layer has to reproduce exactly this MatMul calculation, which is why the rows of $dh$ must be added (rather than simply assigned) when idx contains duplicates.
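To convince myself, I also checked this numerically. The snippet below is my own sketch (not code from the book): scatter-adding the rows of $dh$ with `np.add.at`, the way an Embedding layer's backward pass does, gives the same $dW$ as the MatMul-style $x^T dh$.

```python
import numpy as np

# idx and dh as in Figure 4-5 (the rows of dh written as a, ●, b, ◆ above)
idx = np.array([0, 2, 0, 4])
dh = np.array([[1., 1., 1.],    # a
               [2., 2., 2.],    # ●
               [3., 3., 3.],    # b
               [4., 4., 4.]])   # ◆

# MatMul-style gradient: dW = x^T dh, with x the one-hot version of idx
x = np.eye(7)[idx]
dW_matmul = x.T @ dh

# Embedding-style gradient: scatter-add the rows of dh into dW at idx
dW_embedding = np.zeros((7, 3))
np.add.at(dW_embedding, idx, dh)   # rows for the duplicated index 0 are summed

print(np.allclose(dW_matmul, dW_embedding))  # True: row 0 of dW is a + b
```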
Next is the improvement between the hidden layer and the output layer. Negative Sampling's idea of boldly cutting down the amount of learning by using only a handful of negative examples is interesting.
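Incidentally, to make sure I understood how the "wrong answer" words are chosen, I wrote the small sketch below. It is my own toy version, not the book's UnigramSampler; it only assumes what the book describes, namely sampling from the unigram distribution raised to the 0.75 power so that rare words get a slightly better chance.

```python
import numpy as np

def sample_negatives(corpus, target, sample_size, power=0.75):
    """Draw negative word IDs for each positive example in target."""
    vocab_size = int(corpus.max()) + 1
    counts = np.bincount(corpus, minlength=vocab_size).astype(np.float64)
    p = np.power(counts, power)            # dampened unigram distribution
    p /= p.sum()

    batch_size = target.shape[0]
    negatives = np.zeros((batch_size, sample_size), dtype=np.int32)
    for i in range(batch_size):
        p_i = p.copy()
        p_i[target[i]] = 0                 # never sample the positive example itself
        p_i /= p_i.sum()
        negatives[i] = np.random.choice(vocab_size, size=sample_size,
                                        replace=False, p=p_i)
    return negatives

# tiny usage example
corpus = np.array([0, 1, 2, 3, 4, 1, 2, 3, 2, 3, 3])
target = np.array([1, 3, 0])
print(sample_negatives(corpus, target, sample_size=2))
```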
I did not stumble over anything major here, but the book omits the explanation of the Embedding Dot layer's backpropagation, saying "it is not a difficult problem, so think it through for yourself," so let me summarize it briefly.
If you cut out just the Embedding Dot layer part of Figure 4-12, it looks like the following.
What the dot node does is multiply the corresponding elements and then add up the results. So we can think about its backpropagation by decomposing it into a multiplication node (see "1.3.4.1 Multiplication node" in Chapter 1) and a Sum node (see "1.3.4.4 Sum node" in Chapter 1). That gives the following form; the blue annotations are the backpropagation.
Folding this back into the earlier dot-node diagram, it looks like this:
Implementing it exactly as in this figure would work, but as it stands the shape of `dout` does not match `h` and `target_W`, so NumPy's `*` cannot take the element-wise product. Therefore we first align the shapes with `dout.reshape(dout.shape[0], 1)` and then compute the product. Implemented this way, the code ends up matching `EmbeddingDot.backward()` in the book.
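Putting the pieces together, my reading of the Embedding Dot layer ends up as the sketch below. This is my own reconstruction along the lines of the book's code, with a minimal Embedding layer included so that the snippet stands on its own.

```python
import numpy as np

class Embedding:
    # Minimal Embedding layer: forward picks rows of W, backward scatter-adds them.
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None

    def forward(self, idx):
        W, = self.params
        self.idx = idx
        return W[idx]

    def backward(self, dout):
        dW, = self.grads
        dW[...] = 0
        np.add.at(dW, self.idx, dout)  # add, because idx may contain duplicates
        return None


class EmbeddingDot:
    def __init__(self, W):
        self.embed = Embedding(W)
        self.params = self.embed.params
        self.grads = self.embed.grads
        self.cache = None

    def forward(self, h, idx):
        target_W = self.embed.forward(idx)     # pick the rows for idx
        out = np.sum(target_W * h, axis=1)     # element-wise product, then sum
        self.cache = (h, target_W)
        return out

    def backward(self, dout):
        h, target_W = self.cache
        dout = dout.reshape(dout.shape[0], 1)  # align shapes for broadcasting
        dtarget_W = dout * h                   # gradient flowing into the picked rows
        self.embed.backward(dtarget_W)
        dh = dout * target_W                   # gradient flowing back to h
        return dh


# tiny usage example
W = np.arange(21, dtype=np.float64).reshape(7, 3)
layer = EmbeddingDot(W)
h = np.ones((4, 3))
out = layer.forward(h, np.array([0, 3, 1, 6]))
dh = layer.backward(np.ones(4))
print(out.shape, dh.shape)  # (4,) (4, 3)
```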
There was nothing in particular to stumble over in the training implementation. The book uses the PTB corpus, but I prefer Japanese after all, so as in Chapter 2 I trained on the pre-tokenized text from Aozora Bunko.
To fetch the corpus, I use my modified `dataset/aozorabunko.py` instead of `dataset/ptb.py`. The source and how it works are described in the [Chapter 2 memo "Improvement of Count-Based Method"](https://qiita.com/segavvy/items/52feabbf7867020e117d#24-Improvement of Count-Based Method), so please refer to that.
`ch04/train.py` has also been changed to use the Aozora Bunko corpus, as shown below. The changed lines are marked with ★ in the comments.
ch04/train.py
# coding: utf-8
import sys
sys.path.append('..')
from common import config
# To run on GPU, uncomment the line below (requires cupy)
# ===============================================
# config.GPU = True
# ===============================================
from common.np import *
import pickle
from common.trainer import Trainer
from common.optimizer import Adam
from cbow import CBOW
from skip_gram import SkipGram
from common.util import create_contexts_target, to_cpu, to_gpu
from dataset import aozorabunko  # ★ Changed to use the Aozora Bunko corpus

# Hyperparameter settings
window_size = 5
hidden_size = 100
batch_size = 100
max_epoch = 10

# Load the data
corpus, word_to_id, id_to_word = aozorabunko.load_data('train')  # ★ Changed the corpus
vocab_size = len(word_to_id)

contexts, target = create_contexts_target(corpus, window_size)
if config.GPU:
    contexts, target = to_gpu(contexts), to_gpu(target)

# Build the model, optimizer and trainer
model = CBOW(vocab_size, hidden_size, window_size, corpus)
# model = SkipGram(vocab_size, hidden_size, window_size, corpus)
optimizer = Adam()
trainer = Trainer(model, optimizer)

# Start training
trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

# Save the data needed for later use
word_vecs = model.word_vecs
if config.GPU:
    word_vecs = to_cpu(word_vecs)
params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl'  # or 'skipgram_params.pkl'
with open(pkl_file, 'wb') as f:
    pickle.dump(params, f, -1)
Incidentally, training took about 8 hours in my environment.
Next comes checking the results. I changed `ch04/eval.py` a little so that various words can be tried from standard input. The changed parts are marked with ★.
ch04/eval.py
# coding: utf-8
import sys
sys.path.append('..')
from common.util import most_similar, analogy
import pickle

pkl_file = 'cbow_params.pkl'
# pkl_file = 'skipgram_params.pkl'

with open(pkl_file, 'rb') as f:
    params = pickle.load(f)
    word_vecs = params['word_vecs']
    word_to_id = params['word_to_id']
    id_to_word = params['id_to_word']

# most similar task  ★ Changed the query to come from standard input
while True:
    query = input('\n[similar] query? ')
    if not query:
        break
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

# analogy task  ★ Changed the query to come from standard input
print('-'*50)
while True:
    query = input('\n[analogy] query? (3 words) ')
    if not query:
        break
    a, b, c = query.split()
    analogy(a, b, c, word_to_id, id_to_word, word_vecs)
Below are the results of various trials.
First, a check of similar words. For comparison, I also list the count-based results from Chapter 2. The CBOW window size is 5 in the book's code, but I also tried 2, the same value as the count-based method.
Similar words | Count-based from Chapter 2 (window size: 2) | CBOW (window size: 5) | CBOW (window size: 2) |
---|---|---|---|
you | Wife: 0.6728986501693726 wife: 0.6299399137496948 K: 0.6205178499221802 father: 0.5986840128898621 I: 0.5941839814186096 | you: 0.7080078125 Wife: 0.6748046875 wife: 0.64990234375 The handmaiden: 0.63330078125 I: 0.62646484375 | Wife: 0.7373046875 you: 0.7236328125 wife: 0.68505859375 The person: 0.677734375 teacher: 0.666015625 |
Year | Anti: 0.8162745237350464 hundred: 0.8051895499229431 Minutes: 0.7906433939933777 Eight: 0.7857747077941895 Circle: 0.7682645320892334 | Circle: 0.78515625 Minutes: 0.7744140625 Year: 0.720703125 century: 0.70751953125 30:30: 0.70361328125 | Tsubo: 0.71923828125 Meter: 0.70947265625 Minutes: 0.7080078125 Minutesの: 0.7060546875 Seconds: 0.69091796875 |
car | door: 0.6294019222259521 Door: 0.6016885638237 Automobile: 0.5859153270721436 gate: 0.5726617574691772 curtain: 0.5608214139938354 | Upper body: 0.74658203125 Warehouse: 0.744140625 Western-style building: 0.7353515625 Stairs: 0.7216796875 door: 0.71484375 | Stairs: 0.72216796875 Automobile: 0.7216796875 cave: 0.716796875 underground: 0.7138671875 door: 0.71142578125 |
Toyota | Toyota is not found | Toyota is not found | Toyota is not found |
Morning | night: 0.7267987132072449 Around: 0.660172164440155 Noon: 0.6085118055343628 evening: 0.6021789908409119 Next time: 0.6002975106239319 | evening: 0.65576171875 Kunimoto: 0.65576171875 the first: 0.65087890625 The Emperor's Birthday: 0.6494140625 Next: 0.64501953125 | evening: 0.68115234375 Noon: 0.66796875 Last night: 0.6640625 night: 0.64453125 Inside the gate: 0.61376953125 |
school | Tokyo: 0.6504884958267212 Higher: 0.6290650367736816 Junior high school: 0.5801640748977661 University: 0.5742003917694092 Boarding house: 0.5358142852783203 | University: 0.81201171875 Boarding house: 0.732421875 Sumita: 0.7275390625 student: 0.68212890625 Junior high school: 0.6767578125 | Junior high school: 0.69677734375 University: 0.68701171875 recently: 0.6611328125 Tokyo: 0.65869140625 here: 0.65771484375 |
Zashiki | Study: 0.6603355407714844 Sou side: 0.6362787485122681 Room: 0.6142982244491577 room: 0.6024710536003113 kitchen: 0.6014574766159058 | floor: 0.77685546875 desk: 0.76513671875 threshold: 0.76513671875 Main hall: 0.744140625 Entrance: 0.73681640625 | desk: 0.69970703125 floor: 0.68603515625 椽: 0.6796875 Study: 0.6748046875 Zoshigaya: 0.6708984375 |
kimono | Beard: 0.5216895937919617 black: 0.5200990438461304 clothes: 0.5096032619476318 洋clothes: 0.48781922459602356 hat: 0.4869200587272644 | Avoid: 0.68896484375 cold sweat: 0.6875 Awaken: 0.67138671875 underwear: 0.6708984375 Which means: 0.662109375 | Costume: 0.68359375 Sightseeing: 0.68212890625 cotton: 0.6787109375 Play: 0.66259765625 Inkstone: 0.65966796875 |
I | master: 0.6372452974319458 Extra: 0.5826579332351685 Kaneda: 0.4684762954711914 they: 0.4676626920700073 Labyrinth: 0.4615904688835144 | master: 0.7861328125 they: 0.7490234375 Extra: 0.71923828125 Cat: 0.71728515625 Inevitable: 0.69287109375 | master: 0.80517578125 they: 0.6982421875 Cat: 0.6962890625 wife: 0.6923828125 Lessing: 0.6611328125 |
Criminal | Phantom: 0.6609077453613281 Thieves: 0.6374931931495667 Member: 0.6308270692825317 that person: 0.6046633720397949 Dive: 0.5931873917579651 | Next time: 0.7841796875 boss: 0.75439453125 that person: 0.74462890625 jewelry: 0.74169921875 eagle, I: 0.73779296875 | Fish fishing: 0.77392578125 that person: 0.74072265625 Coming soon: 0.7392578125 Light balloon: 0.7021484375 Intractable disease: 0.70166015625 |
order | Talk: 0.6200630068778992 Consultation: 0.5290789604187012 Busy: 0.5178924202919006 Kindness: 0.5033778548240662 Lecture: 0.4894390106201172 | Reminder: 0.6279296875 Appraisal: 0.61279296875 graduate: 0.611328125 General meeting: 0.6103515625 luxury: 0.607421875 | Consultation: 0.65087890625 advice: 0.63330078125 Appraisal: 0.62451171875 Resignation: 0.61474609375 Proposal: 0.61474609375 |
Gunless gun | Obsolete: 0.7266454696655273 Old-fashioned: 0.6771457195281982 saw: 0.6735808849334717 Nose breath: 0.6516652703285217 ignorance: 0.650424063205719 | Creed: 0.7353515625 Top sorting: 0.7294921875 Protagonist: 0.693359375 Born: 0.68603515625 For sale: 0.68603515625 | position: 0.724609375 At hand: 0.71630859375 Road next: 0.71142578125 Face: 0.70458984375 Subject: 0.69921875 |
Cat | amen: 0.6659030318260193 Nobume: 0.5759447813034058 Ink: 0.5374482870101929 Status: 0.5352671146392822 usually: 0.5205280780792236 | Wisdom: 0.728515625 I: 0.71728515625 Picture: 0.70751953125 dyspepsia: 0.67431640625 Gluttony: 0.66796875 | I: 0.6962890625 Junior high school: 0.6513671875 love: 0.64306640625 they: 0.63818359375 Pig: 0.6357421875 |
Liquor | book: 0.5834404230117798 tea: 0.469807893037796 Rest: 0.4605821967124939 Eat: 0.44864168763160706 rod: 0.4349029064178467 | Drink: 0.6728515625 quarrel: 0.6689453125 food: 0.66259765625 Yamakoshi: 0.646484375 Soba: 0.64599609375 | Violin: 0.63232421875 Monthly salary: 0.630859375 medicine: 0.59521484375 Grenade: 0.59521484375 Kira: 0.5947265625 |
cuisine | Skein: 0.5380040407180786 Sign: 0.5214874744415283 original: 0.5175281763076782 Law: 0.5082278847694397 Shop: 0.5001937747001648 | Hall: 0.68896484375 History: 0.615234375 novel: 0.59912109375 Literature: 0.5947265625 take: 0.59033203125 | magazine: 0.666015625 Booth: 0.65625 Blacksmith: 0.61376953125 musics: 0.6123046875 Kimono: 0.6083984375 |
Just as in Chapter 2, the results are fairly muddled, and I cannot say either method is clearly superior. The fact that "Cat" shows up among the neighbors of "I" reveals the bias of the corpus: since it uses only works by Soseki Natsume, Kenji Miyazawa, and Ranpo Edogawa, the corpus is presumably just too small.
Next is the analogy task.
Analogy problem | CBOW (window size: 5) | CBOW (window size: 2) |
---|---|---|
Man:king=woman:? | Nu: 5.25390625 Absent: 4.2890625 Zu: 4.21875 Ruru: 3.98828125 shit: 3.845703125 | Big bird: 3.4375 Every moment: 3.052734375 back gate: 2.9140625 Kage: 2.912109375 Floor pillar: 2.873046875 |
body:face=Automobile:? | Cop: 6.5 door: 5.83984375 Two people: 5.5625 Inspector: 5.53515625 Chief: 5.4765625 | door: 3.85546875 hole: 3.646484375 Lamp: 3.640625 Inspector: 3.638671875 shoulder: 3.6328125 |
go:come=speak:? | To tell: 4.6640625 eleven: 4.546875 Thirteen: 4.51171875 listen: 4.25 ask: 4.16796875 | listen: 4.3359375 Regrettable: 4.14453125 Miya: 4.11328125 Say: 3.671875 eleven: 3.55078125 |
food:Eat=book:? | Have: 4.3671875 Ask: 4.19140625 popularity: 4.1328125 Mountain road: 4.06640625 receive: 3.857421875 | Prompt: 3.51171875 Go: 3.357421875 Say: 3.2265625 listen: 3.2265625 Start to get slimy: 3.17578125 |
summer:hot=winter:? | Accumulate: 5.23828125 Teru: 4.171875 come: 4.10546875 Everywhere: 4.05859375 Go: 3.978515625 | eleven: 4.29296875 Finished: 3.853515625 Thirteen: 3.771484375 Become: 3.66015625 Bad: 3.66015625 |
Right off the bat, the first problem gives results that, while low in score, are rather distressing. A shortage of training data can be a scary thing. I feel I have glimpsed part of the background behind the recent demand for "explainable AI."
The other results are also pretty ragged, but the correct answers just barely appear for "body:face = Automobile:?" and "go:come = speak:?". The appearance of "Cop" and "Inspector" alongside "Automobile" is probably Ranpo Edogawa's influence.
In the end I probably should have just used the Japanese Wikipedia, but I am a contrarian :sweat: Plenty of people have tried Wikipedia, so if you are interested, try searching for "wikipedia Japanese corpus".
The book explains negative/positive classification of emails as an example of transfer learning, but with the knowledge up to this chapter, even though individual words can be turned into fixed-length vectors, sentences such as emails cannot, so we cannot tackle that kind of task yet.
Also, regarding the quality of the distributed representations: for Japanese, the quality of the prior word segmentation seems to matter a great deal. Several Japanese distributed-representation models have been released, but if you want to use them for transfer learning, I suspect you are required to use the same word-segmentation mechanism (logic, dictionary contents, parameters, and so on). If that is the case, does it mean transfer learning cannot easily be applied to tasks involving technical terms specific to an industry or an individual company? Japanese is a lot of work.
The first half of this book is finally over, but considering that Chapter 1 was a review of the first volume, I may still be only about a third of the way through. The road ahead looks long...
That's all for this chapter. If you notice any mistakes, I would be grateful if you could point them out.