Deep Learning / Deep Learning from Zero 2 Chapter 4 Memo

1. First of all

I'm reading the masterpiece **"Deep Learning from Zero 2"**. This is my memo for Chapter 4. To run the code, download the entire code base from GitHub and use Jupyter Notebook in the ch04 directory.

2. High-speed CBOW model

The theme of Chapter 4 is to speed up the word2vec CBOW model implemented in Chapter 3 and turn it into a practical model. Let's run ch04/train.py and go through its contents in order.

The dataset is the Penn Treebank: the vocabulary size is 10,000 and the training corpus contains about 900,000 words.

import sys
sys.path.append('..')
from common import config
# To run on the GPU, uncomment the line below (requires cupy)
# ===============================================
# config.GPU = True
# ===============================================
from common.np import *
import pickle
from common.trainer import Trainer
from common.optimizer import Adam
from cbow import CBOW
from skip_gram import SkipGram
from common.util import create_contexts_target, to_cpu, to_gpu
from dataset import ptb

#Hyperparameter settings
window_size = 5
hidden_size = 100
batch_size = 100
max_epoch = 10

#Data reading
corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)

#Get context and target
contexts, target = create_contexts_target(corpus, window_size)
if config.GPU:
    contexts, target = to_gpu(contexts), to_gpu(target)

#Network construction
model = CBOW(vocab_size, hidden_size, window_size, corpus)

# Train and plot the loss curve
optimizer = Adam()
trainer = Trainer(model, optimizer)
trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

#Save the data you need for later use
word_vecs = model.word_vecs
if config.GPU:
    word_vecs = to_cpu(word_vecs)
params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl'  # or 'skipgram_params.pkl'
with open(pkl_file, 'wb') as f:
    pickle.dump(params, f, -1)

(Figure: loss curve during training)

The loss fell steadily. Now for the key points. Let's look at the CBOW class in cbow.py, which is used in the network construction step.

# --------------- from cbow.py ---------------
class CBOW:
    def __init__(self, vocab_size, hidden_size, window_size, corpus):
        V, H = vocab_size, hidden_size

        #Weight initialization
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(V, H).astype('f')

        #Layer generation
        self.in_layers = []
        for i in range(2 * window_size):
            layer = Embedding(W_in)  #Use Embedding layer
            self.in_layers.append(layer)
        self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)

        #List all weights and gradients
        layers = self.in_layers + [self.ns_loss]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        #Store the distributed representation of words in a member variable
        self.word_vecs = W_in

One key to the speed-up is the adoption of the **Embedding layer**. Take a look at common/layers.py.

3. Embedding layer

# --------------- from common/layers.py --------------
class Embedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None

    def forward(self, idx):
        W, = self.params
        self.idx = idx
        out = W[idx]  # Extract the rows specified by idx
        return out

    def backward(self, dout):
        dW, = self.grads
        dW[...] = 0
        if GPU:
            np.scatter_add(dW, self.idx, dout)
        else:
            np.add.at(dW, self.idx, dout)  #Add data to the row specified by idx
        return None

(Figure: the Embedding layer extracts rows of the weight matrix)

In Chapter 3, the **MatMul layer** was used to compute the product of the input vector and the weight matrix. But the input is a one-hot vector, so that product simply **picks out one row of the weight matrix $W_{in}$**; all we actually need to do is extract the row specified by the word ID. This is the **Embedding layer**.
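As a quick sanity check, here is a minimal sketch (toy sizes, not code from the book) showing that multiplying a one-hot vector by $W_{in}$ and simply indexing the corresponding row give the same result:

import numpy as np

V, H = 7, 3                      # toy vocabulary size and hidden size
W_in = 0.01 * np.random.randn(V, H)
idx = 2                          # word ID

# Chapter 3 style: one-hot vector times the weight matrix (MatMul)
one_hot = np.zeros(V)
one_hot[idx] = 1
out_matmul = one_hot @ W_in

# Chapter 4 style: Embedding, just pick out the row
out_embed = W_in[idx]

print(np.allclose(out_matmul, out_embed))  # True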

With this layer, backpropagation only needs to write the gradient coming from downstream into the corresponding rows of the weight gradient. However, in mini-batch learning several samples may happen to point to the same row, so instead of overwriting, **the gradients are added**.
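Here is a small sketch (toy numbers, not from the book) of why addition matters when the same index appears twice in a mini-batch:

import numpy as np

dW = np.zeros((4, 3))
idx = np.array([0, 2, 0])      # row 0 appears twice in the mini-batch
dout = np.ones((3, 3))

np.add.at(dW, idx, dout)       # accumulate: row 0 receives the sum of both gradients
print(dW[0])                   # [2. 2. 2.]

dW[...] = 0
dW[idx] = dout                 # naive assignment: the second write overwrites the first
print(dW[0])                   # [1. 1. 1.]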

4. Negative Sampling

The second key to the speed-up is **Negative Sampling**. As in Chapter 3, it is unrealistic to compute a Softmax over an output as large as the vocabulary. So what should we do? The answer is to **approximate the multi-class classification problem with a binary classification problem**.
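Concretely, instead of a 10,000-way Softmax, the score of a single word is passed through a sigmoid and scored with binary cross-entropy. A minimal sketch of the idea (toy vectors, not the book's implementation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.random.randn(100)          # hidden vector for one sample
w_target = np.random.randn(100)   # output-side vector of one word

score = np.dot(h, w_target)       # a single scalar instead of a vocabulary-sized output
p = sigmoid(score)                # probability that this word is the correct target
loss_if_positive = -np.log(p)       # binary cross-entropy with label 1
loss_if_negative = -np.log(1 - p)   # binary cross-entropy with label 0 (sampled negatives)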

Take a look at class NegativeSamplingLoss in negative_sampling_layer.py.

# ------------- from negative_sampling_layer.py --------------
class NegativeSamplingLoss:
    def __init__(self, W, corpus, power=0.75, sample_size=5):
        self.sample_size = sample_size
        self.sampler = UnigramSampler(corpus, power, sample_size)
        self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
        self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]

        self.params, self.grads = [], []
        for layer in self.embed_dot_layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, h, target):
        batch_size = target.shape[0]
        negative_sample = self.sampler.get_negative_sample(target)

        #Positive forward
        score = self.embed_dot_layers[0].forward(h, target)
        correct_label = np.ones(batch_size, dtype=np.int32)
        loss = self.loss_layers[0].forward(score, correct_label)

        #Negative forward
        negative_label = np.zeros(batch_size, dtype=np.int32)
        for i in range(self.sample_size):
            negative_target = negative_sample[:, i]
            score = self.embed_dot_layers[1 + i].forward(h, negative_target)
            loss += self.loss_layers[1 + i].forward(score, negative_label)

        return loss

    def backward(self, dout=1):
        dh = 0
        for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
            dscore = l0.backward(dout)
            dh += l1.backward(dscore)

        return dh

To approximate the multi-class classification with a binary classification, we first make the probability that the word between you (0) and goodbye (2) is say (1) as close to 1 as possible (positive example). But this alone is not enough.

We therefore also require that randomly sampled words such as hello (5) and i (4) are judged incorrect with as high a probability as possible (negative examples).

This technique is called **Negative Sampling**. The number of negative examples to sample is sample_size = 5 in the code, and UnigramSampler draws them according to the word-frequency distribution raised to the power of 0.75.
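As a rough sketch of what UnigramSampler does (simplified: unlike the book's implementation it does not exclude the positive target from the candidates), sampling with probabilities proportional to counts raised to the 0.75 power looks like this:

import numpy as np
from collections import Counter

corpus = np.array([0, 1, 2, 3, 4, 1, 2, 2])   # toy corpus of word IDs
counts = Counter(corpus.tolist())

vocab_size = len(counts)
p = np.array([counts[i] for i in range(vocab_size)], dtype=np.float64)
p = p ** 0.75                                 # dampen frequent words, lift rare ones
p /= p.sum()

sample_size = 3
negative_sample = np.random.choice(vocab_size, size=sample_size, replace=False, p=p)
print(negative_sample)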

The `EmbeddingDot` layer appears in NegativeSamplingLoss, so let's look at it as well. It is also in `negative_sampling_layer.py`.

# ------------- from negative_sampling_layer.py --------------
class EmbeddingDot:
    def __init__(self, W):
        self.embed = Embedding(W)
        self.params = self.embed.params
        self.grads = self.embed.grads
        self.cache = None

    def forward(self, h, idx):
        target_W = self.embed.forward(idx)
        out = np.sum(target_W * h, axis=1)

        self.cache = (h, target_W)
        return out

    def backward(self, dout):
        h, target_W = self.cache
        dout = dout.reshape(dout.shape[0], 1)

        dtarget_W = dout * h
        self.embed.backward(dtarget_W)
        dh = dout * target_W
        return dh

To support mini-batches, the sum np.sum(target_W * h, axis=1) is taken at the end, so the forward pass works even when idx and h hold multiple samples: it computes the dot product row by row.
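A tiny numeric sketch (made-up values) of that row-wise dot product:

import numpy as np

h = np.array([[1., 2., 3.],
              [4., 5., 6.]])          # hidden vectors for a mini-batch of 2
target_W = np.array([[1., 0., 1.],
                     [0., 1., 0.]])   # embedded rows for the 2 target IDs

out = np.sum(target_W * h, axis=1)    # per-sample dot products
print(out)                            # [4. 5.]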

5. Model evaluation

We ran ch04/train.py earlier, so the learned parameters are stored in cbow_params.pkl. In `eval.py` we use them to check whether we obtained a good distributed representation of words.

import sys
sys.path.append('..')
from common.util import most_similar, analogy
import pickle

pkl_file = 'cbow_params.pkl'  #File name specification

#Reading each parameter
with open(pkl_file, 'rb') as f:  
    params = pickle.load(f)  
    word_vecs = params['word_vecs']
    word_to_id = params['word_to_id']
    id_to_word = params['id_to_word']

# most similar task
querys = ['you']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

(Figure: most_similar results for the query "you")

First, we check word similarity using the most_similar method (common/util.py). The word closest to "you" is "we", followed by other personal pronouns such as "i", "they", and "your". This is the result of computing the similarity between word vectors with the cosine similarity:

$$\mathrm{similarity}(\boldsymbol{x}, \boldsymbol{y}) = \frac{\boldsymbol{x} \cdot \boldsymbol{y}}{\|\boldsymbol{x}\|\,\|\boldsymbol{y}\|} = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}$$
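As a minimal sketch, the formula above can be coded as follows (essentially what cos_similarity in common/util.py does, with a small eps for numerical safety):

import numpy as np

def cos_similarity(x, y, eps=1e-8):
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)   # normalize x
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)   # normalize y
    return np.dot(nx, ny)

# e.g. similarity between the vectors for "you" and "we"
# cos_similarity(word_vecs[word_to_id['you']], word_vecs[word_to_id['we']])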

# analogy task
analogy('king', 'man', 'queen',  word_to_id, id_to_word, word_vecs)

(Figure: output of the analogy task)

Now let's check the famous **king - man + woman = queen** problem using the `analogy` method (common/util.py). Sure enough, it works.

This solves the task of finding the **word x** such that the **"king → x" vector** is as close as possible to the **"man → woman" vector**.
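Under the hood this is a nearest-neighbour search in the word-vector space. Below is a rough sketch of the idea (simplified relative to analogy in common/util.py, which for example also excludes the query words from the ranking): the query vector is built by vector arithmetic and compared with every word vector by cosine similarity.

import numpy as np

def simple_analogy(a, b, c, word_to_id, id_to_word, word_vecs, top=5):
    # "a : b = c : ?"  ->  query = vec(b) - vec(a) + vec(c)
    query = (word_vecs[word_to_id[b]]
             - word_vecs[word_to_id[a]]
             + word_vecs[word_to_id[c]])
    query = query / (np.linalg.norm(query) + 1e-8)

    # cosine similarity of the query against every word vector
    normed = word_vecs / (np.linalg.norm(word_vecs, axis=1, keepdims=True) + 1e-8)
    similarity = normed @ query

    for i in (-similarity).argsort()[:top]:
        print(id_to_word[int(i)], similarity[int(i)])

# simple_analogy('king', 'man', 'queen', word_to_id, id_to_word, word_vecs)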
