I'm reading the masterpiece **Deep Learning from Scratch 2**. This post is my memo for Chapter 4. To run the code, download the whole repository from GitHub and use Jupyter Notebook inside ch04.
The theme of Chapter 4 is to speed up the word2vec CBOW model implemented in Chapter 3 and turn it into a practical model. Let's run ch04/train.py and walk through its contents in order.
The dataset is the Penn Treebank: the vocabulary contains 10,000 words and the training corpus is about 900,000 words long.
```python
import sys
sys.path.append('..')
from common import config
# To run on the GPU, uncomment the line below (requires cupy)
# ===============================================
# config.GPU = True
# ===============================================
from common.np import *
import pickle
from common.trainer import Trainer
from common.optimizer import Adam
from cbow import CBOW
from skip_gram import SkipGram
from common.util import create_contexts_target, to_cpu, to_gpu
from dataset import ptb

# Hyperparameter settings
window_size = 5
hidden_size = 100
batch_size = 100
max_epoch = 10

# Load the data
corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)

# Get contexts and targets
contexts, target = create_contexts_target(corpus, window_size)
if config.GPU:
    contexts, target = to_gpu(contexts), to_gpu(target)

# Build the network
model = CBOW(vocab_size, hidden_size, window_size, corpus)

# Train and plot the loss curve
optimizer = Adam()
trainer = Trainer(model, optimizer)
trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

# Save the data needed for later use
word_vecs = model.word_vecs
if config.GPU:
    word_vecs = to_cpu(word_vecs)
params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl'  # or 'skipgram_params.pkl'
with open(pkl_file, 'wb') as f:
    pickle.dump(params, f, -1)
```
The loss decreases steadily. Now for the main points. Let's look at `class CBOW` in cbow.py, which is used in the network-construction step above.
```python
# --------------- from cbow.py ---------------
class CBOW:
    def __init__(self, vocab_size, hidden_size, window_size, corpus):
        V, H = vocab_size, hidden_size

        # Initialize the weights
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(V, H).astype('f')

        # Create the layers
        self.in_layers = []
        for i in range(2 * window_size):
            layer = Embedding(W_in)  # use the Embedding layer
            self.in_layers.append(layer)
        self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)

        # Collect all weights and gradients into lists
        layers = self.in_layers + [self.ns_loss]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # Keep the distributed representation of words as a member variable
        self.word_vecs = W_in
```
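The excerpt above shows only `__init__`. For orientation, here is a minimal sketch of what the `forward`/`backward` methods of this class roughly look like, assuming the `Embedding` and `NegativeSamplingLoss` interfaces quoted below: the hidden vector is the average of the 2 * window_size context embeddings, and it is passed to the negative-sampling loss.

```python
    # Sketch only: methods continuing class CBOW above
    def forward(self, contexts, target):
        h = 0
        for i, layer in enumerate(self.in_layers):
            h += layer.forward(contexts[:, i])  # contexts: (batch, 2 * window_size)
        h *= 1 / len(self.in_layers)            # average the context embeddings
        loss = self.ns_loss.forward(h, target)
        return loss

    def backward(self, dout=1):
        dout = self.ns_loss.backward(dout)
        dout *= 1 / len(self.in_layers)
        for layer in self.in_layers:
            layer.backward(dout)
        return None
```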
The first key to the speedup is the adoption of the **Embedding layer**. Take a look at common/layers.py.
```python
# --------------- from common/layers.py ---------------
class Embedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None

    def forward(self, idx):
        W, = self.params
        self.idx = idx
        out = W[idx]  # output the rows specified by idx
        return out

    def backward(self, dout):
        dW, = self.grads
        dW[...] = 0
        if GPU:
            np.scatter_add(dW, self.idx, dout)
        else:
            np.add.at(dW, self.idx, dout)  # add dout to the rows specified by idx
        return None
```
In Chapter 3, the **MatMul layer** was used to take the product of the input vector and the weight matrix. But that input is a one-hot vector, so the product simply picks out one row of the weight matrix $W_{in}$; all we actually need to do is specify that row. This is the **Embedding layer**.
Backpropagation then only has to write the upstream gradient into the corresponding rows. However, in mini-batch learning several examples may happen to hit the same row, so instead of overwriting the row, the **gradients are added** to it.
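A small sketch (my own, not from the book) to confirm both points: extracting a row with fancy indexing equals the one-hot times matrix product, and duplicate indices in a mini-batch must be accumulated with `np.add.at` rather than assigned.

```python
import numpy as np

V, H = 7, 3
W = np.arange(V * H).reshape(V, H).astype('f')

# Extracting a row equals multiplying a one-hot vector by W
idx = 2
one_hot = np.zeros(V, dtype='f')
one_hot[idx] = 1.0
assert np.allclose(one_hot @ W, W[idx])

# In a mini-batch the same row may appear more than once;
# the gradients have to be accumulated, not assigned
idx_batch = np.array([0, 2, 0, 4])
dout = np.ones((4, H), dtype='f')
dW = np.zeros_like(W)
np.add.at(dW, idx_batch, dout)  # row 0 receives the sum of two gradients
print(dW[0])                    # -> [2. 2. 2.]
```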
**4. Negative Sampling**
The second key to the speedup is **Negative Sampling**. As in Chapter 3, running a Softmax over an output the size of the vocabulary is unrealistic. So what should we do? The answer is to **approximate the multi-class classification problem with a binary classification problem**.
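Binary classification here means scoring one candidate word, squashing the score with a sigmoid, and taking the cross-entropy loss against a 0/1 label, which is essentially what the `SigmoidWithLoss` layers used below compute. A minimal sketch of that computation (my own illustration, not the book's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_with_loss(score, label):
    """score: raw scores (batch,), label: 0/1 targets (batch,)."""
    y = sigmoid(score)
    eps = 1e-7  # avoid log(0)
    loss = -np.mean(label * np.log(y + eps) + (1 - label) * np.log(1 - y + eps))
    dscore = (y - label) / len(score)  # gradient of the mean loss w.r.t. score
    return loss, dscore

loss, dscore = sigmoid_with_loss(np.array([2.0, -1.0]), np.array([1, 0]))
```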
Take a look at `class NegativeSamplingLoss` in negative_sampling_layer.py.
```python
# ------------- from negative_sampling_layer.py -------------
class NegativeSamplingLoss:
    def __init__(self, W, corpus, power=0.75, sample_size=5):
        self.sample_size = sample_size
        self.sampler = UnigramSampler(corpus, power, sample_size)
        self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
        self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]

        self.params, self.grads = [], []
        for layer in self.embed_dot_layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, h, target):
        batch_size = target.shape[0]
        negative_sample = self.sampler.get_negative_sample(target)

        # Forward pass for the positive example
        score = self.embed_dot_layers[0].forward(h, target)
        correct_label = np.ones(batch_size, dtype=np.int32)
        loss = self.loss_layers[0].forward(score, correct_label)

        # Forward pass for the negative examples
        negative_label = np.zeros(batch_size, dtype=np.int32)
        for i in range(self.sample_size):
            negative_target = negative_sample[:, i]
            score = self.embed_dot_layers[1 + i].forward(h, negative_target)
            loss += self.loss_layers[1 + i].forward(score, negative_label)

        return loss

    def backward(self, dout=1):
        dh = 0
        for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
            dscore = l0.backward(dout)
            dh += l1.backward(dscore)
        return dh
```
To approximate the multi-class classification as a binary classification, we first make the probability that 'say' (1) is the word between 'you' (0) and 'goodbye' (2) as high as possible (the positive example). But that alone is not enough.
So, in addition, we make the probability that randomly chosen words such as 'hello' (5) or 'i' (4) are judged incorrect as high as possible (the negative examples).
This technique is called **Negative Sampling**. The number of negative examples to draw is sample_size = 5 in the code.
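The negative examples are not drawn uniformly: the `UnigramSampler` created above samples words with probability proportional to their corpus frequency raised to the power 0.75, which slightly raises the chance of rare words. A rough sketch of that sampling idea (illustrative only, with my own function name; the book's UnigramSampler precomputes the distribution and handles the positive target separately):

```python
import numpy as np
from collections import Counter

def sample_negatives(corpus, batch_targets, sample_size=5, power=0.75):
    # Probability proportional to count(word) ** 0.75
    counts = Counter(corpus.tolist())
    vocab_size = int(max(corpus)) + 1
    p = np.zeros(vocab_size)
    for word_id, c in counts.items():
        p[word_id] = c
    p = np.power(p, power)
    p /= p.sum()
    # Draw sample_size negative word IDs for each target in the batch
    return np.array([np.random.choice(vocab_size, size=sample_size,
                                      replace=False, p=p)
                     for _ in batch_targets])

corpus = np.array([0, 1, 2, 3, 4, 5, 6, 1, 2, 2, 3, 1])
negatives = sample_negatives(corpus, batch_targets=np.array([1, 2]))
```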
At this point `EmbeddingDot` appears (as `embed_dot_layers`), so let's look at that class as well. It also lives in negative_sampling_layer.py.
```python
# ------------- from negative_sampling_layer.py -------------
class EmbeddingDot:
    def __init__(self, W):
        self.embed = Embedding(W)
        self.params = self.embed.params
        self.grads = self.embed.grads
        self.cache = None

    def forward(self, h, idx):
        target_W = self.embed.forward(idx)
        out = np.sum(target_W * h, axis=1)
        self.cache = (h, target_W)
        return out

    def backward(self, dout):
        h, target_W = self.cache
        dout = dout.reshape(dout.shape[0], 1)

        dtarget_W = dout * h
        self.embed.backward(dtarget_W)
        dh = dout * target_W
        return dh
```
To support mini-batches, the element-wise product target_W * h is summed along axis 1 at the end, so the score can be computed even when idx and h hold multiple entries at once.
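A tiny numeric illustration (values made up) of that axis=1 sum: each row of `target_W` is dotted with the corresponding row of `h`, giving one score per example in the batch.

```python
import numpy as np

h = np.array([[1., 2., 3.],
              [4., 5., 6.]])          # hidden vectors for a batch of 2
target_W = np.array([[0., 1., 0.],
                     [1., 0., 1.]])   # embedding rows selected by idx

scores = np.sum(target_W * h, axis=1)  # row-wise dot products
print(scores)                          # -> [ 2. 10.]
```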
We already ran ch04/train.py at the start, so the learned parameters are saved in cbow_params.pkl. Let's use them in eval.py to check whether we obtained good distributed representations of words.
```python
import sys
sys.path.append('..')
from common.util import most_similar, analogy
import pickle

pkl_file = 'cbow_params.pkl'  # file name to load

# Read the saved parameters
with open(pkl_file, 'rb') as f:
    params = pickle.load(f)

word_vecs = params['word_vecs']
word_to_id = params['word_to_id']
id_to_word = params['id_to_word']

# most similar task
querys = ['you']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
```
First, check word similarity with the `most_similar` method (common/util.py). The words closest to 'you' are 'we', then 'i', 'they', and 'your': personal pronouns line up at the top. This is the result of computing, for every word, the cosine similarity with the query vector.
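For reference, the cosine similarity between two vectors $\mathbf{x}$ and $\mathbf{y}$, the measure that `most_similar` ranks words by, is

$$
\mathrm{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}
$$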
```python
# analogy task
analogy('king', 'man', 'queen', word_to_id, id_to_word, word_vecs)
```
Now let's check the famous **king - man + woman = queen** problem with the `analogy` method (common/util.py). And it really does come out that way.
This solves the task of finding the **word x** such that the **"king → x" vector** is as close as possible to the **"man → woman" vector**.
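In vector terms this means forming vec('king') - vec('man') + vec('woman') and looking for the word whose vector is most similar to it by cosine similarity. A minimal sketch of that idea (my own illustration; the actual `analogy` in common/util.py may differ in argument order and details):

```python
import numpy as np

def analogy_sketch(a, b, c, word_to_id, id_to_word, word_vecs, top=5):
    """Find x such that 'a -> x' is close to 'b -> c' (e.g. king -> ? as man -> woman)."""
    query = (word_vecs[word_to_id[a]]
             - word_vecs[word_to_id[b]]
             + word_vecs[word_to_id[c]])
    # Cosine similarity between the query and every word vector
    norms = np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(query)
    similarity = word_vecs @ query / (norms + 1e-8)
    for word_id in (-similarity).argsort()[:top]:
        print(id_to_word[word_id], similarity[word_id])

# analogy_sketch('king', 'man', 'woman', word_to_id, id_to_word, word_vecs)
# should rank 'queen' near the top if the embeddings are good.
```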