I'm reading the masterpiece **Deep Learning from Scratch 2**. This post is my memo for Chapter 4. To run the code, download the whole repository from GitHub and use Jupyter Notebook inside ch04.
The theme of Chapter 4 is to speed up the word2vec CBOW model implemented in Chapter 3 and turn it into a practical model. Let's run ch04/train.py and walk through its contents in order.
The dataset is the Penn Treebank: the vocabulary contains 10,000 words and the training corpus is about 900,000 words long.
```python
import sys
sys.path.append('..')
from common import config
# To run on the GPU, uncomment the line below (requires cupy)
# ===============================================
# config.GPU = True
# ===============================================
from common.np import *
import pickle
from common.trainer import Trainer
from common.optimizer import Adam
from cbow import CBOW
from skip_gram import SkipGram
from common.util import create_contexts_target, to_cpu, to_gpu
from dataset import ptb

# Hyperparameter settings
window_size = 5
hidden_size = 100
batch_size = 100
max_epoch = 10

# Load the data
corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)

# Get contexts and targets
contexts, target = create_contexts_target(corpus, window_size)
if config.GPU:
    contexts, target = to_gpu(contexts), to_gpu(target)

# Build the network
model = CBOW(vocab_size, hidden_size, window_size, corpus)

# Train and plot the loss curve
optimizer = Adam()
trainer = Trainer(model, optimizer)
trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

# Save the data needed for later use
word_vecs = model.word_vecs
if config.GPU:
    word_vecs = to_cpu(word_vecs)
params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl'  # or 'skipgram_params.pkl'
with open(pkl_file, 'wb') as f:
    pickle.dump(params, f, -1)
```
The loss decreases steadily. Now for the main points. Let's look at `class CBOW` in cbow.py, which is used in the network-construction step above.
```python
# --------------- from cbow.py ---------------
class CBOW:
    def __init__(self, vocab_size, hidden_size, window_size, corpus):
        V, H = vocab_size, hidden_size

        # Initialize the weights
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(V, H).astype('f')

        # Create the layers
        self.in_layers = []
        for i in range(2 * window_size):
            layer = Embedding(W_in)  # use the Embedding layer
            self.in_layers.append(layer)
        self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)

        # Collect all weights and gradients into lists
        layers = self.in_layers + [self.ns_loss]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # Keep the distributed representation of words as a member variable
        self.word_vecs = W_in
```
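The excerpt above shows only `__init__`. For orientation, here is a minimal sketch of what the `forward`/`backward` methods of this class roughly look like, assuming the `Embedding` and `NegativeSamplingLoss` interfaces quoted below: the hidden vector is the average of the 2 * window_size context embeddings, and it is passed to the negative-sampling loss.

```python
    # Sketch only: methods continuing class CBOW above
    def forward(self, contexts, target):
        h = 0
        for i, layer in enumerate(self.in_layers):
            h += layer.forward(contexts[:, i])  # contexts: (batch, 2 * window_size)
        h *= 1 / len(self.in_layers)            # average the context embeddings
        loss = self.ns_loss.forward(h, target)
        return loss

    def backward(self, dout=1):
        dout = self.ns_loss.backward(dout)
        dout *= 1 / len(self.in_layers)
        for layer in self.in_layers:
            layer.backward(dout)
        return None
```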
The first key to the speedup is the adoption of the **Embedding layer**. Take a look at common/layers.py.
```python
# --------------- from common/layers.py ---------------
class Embedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None

    def forward(self, idx):
        W, = self.params
        self.idx = idx
        out = W[idx]  # output the rows specified by idx
        return out

    def backward(self, dout):
        dW, = self.grads
        dW[...] = 0
        if GPU:
            np.scatter_add(dW, self.idx, dout)
        else:
            np.add.at(dW, self.idx, dout)  # add dout to the rows specified by idx
        return None
```
In Chapter 3, the **MatMul layer** was used to take the product of the input vector and the weight matrix. But that input is a one-hot vector, so the product simply picks out one row of the weight matrix $W_{in}$; all we actually need to do is specify that row. This is the **Embedding layer**.
Backpropagation then only has to write the upstream gradient into the corresponding rows. However, in mini-batch learning several examples may happen to hit the same row, so instead of overwriting the row, the **gradients are added** to it.
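A small sketch (my own, not from the book) to confirm both points: extracting a row with fancy indexing equals the one-hot times matrix product, and duplicate indices in a mini-batch must be accumulated with `np.add.at` rather than assigned.

```python
import numpy as np

V, H = 7, 3
W = np.arange(V * H).reshape(V, H).astype('f')

# Extracting a row equals multiplying a one-hot vector by W
idx = 2
one_hot = np.zeros(V, dtype='f')
one_hot[idx] = 1.0
assert np.allclose(one_hot @ W, W[idx])

# In a mini-batch the same row may appear more than once;
# the gradients have to be accumulated, not assigned
idx_batch = np.array([0, 2, 0, 4])
dout = np.ones((4, H), dtype='f')
dW = np.zeros_like(W)
np.add.at(dW, idx_batch, dout)  # row 0 receives the sum of two gradients
print(dW[0])                    # -> [2. 2. 2.]
```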
**4. Negative Sampling**
The second key to the speedup is **Negative Sampling**. As in Chapter 3, running a Softmax over an output the size of the vocabulary is unrealistic. So what should we do? The answer is to **approximate the multi-class classification problem with a binary classification problem**.
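Binary classification here means scoring one candidate word, squashing the score with a sigmoid, and taking the cross-entropy loss against a 0/1 label, which is essentially what the `SigmoidWithLoss` layers used below compute. A minimal sketch of that computation (my own illustration, not the book's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_with_loss(score, label):
    """score: raw scores (batch,), label: 0/1 targets (batch,)."""
    y = sigmoid(score)
    eps = 1e-7  # avoid log(0)
    loss = -np.mean(label * np.log(y + eps) + (1 - label) * np.log(1 - y + eps))
    dscore = (y - label) / len(score)  # gradient of the mean loss w.r.t. score
    return loss, dscore

loss, dscore = sigmoid_with_loss(np.array([2.0, -1.0]), np.array([1, 0]))
```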
Take a look at `class NegativeSamplingLoss` in negative_sampling_layer.py.
```python
# ------------- from negative_sampling_layer.py -------------
class NegativeSamplingLoss:
    def __init__(self, W, corpus, power=0.75, sample_size=5):
        self.sample_size = sample_size
        self.sampler = UnigramSampler(corpus, power, sample_size)
        self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
        self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]

        self.params, self.grads = [], []
        for layer in self.embed_dot_layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, h, target):
        batch_size = target.shape[0]
        negative_sample = self.sampler.get_negative_sample(target)

        # Forward pass for the positive example
        score = self.embed_dot_layers[0].forward(h, target)
        correct_label = np.ones(batch_size, dtype=np.int32)
        loss = self.loss_layers[0].forward(score, correct_label)

        # Forward pass for the negative examples
        negative_label = np.zeros(batch_size, dtype=np.int32)
        for i in range(self.sample_size):
            negative_target = negative_sample[:, i]
            score = self.embed_dot_layers[1 + i].forward(h, negative_target)
            loss += self.loss_layers[1 + i].forward(score, negative_label)

        return loss

    def backward(self, dout=1):
        dh = 0
        for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
            dscore = l0.backward(dout)
            dh += l1.backward(dscore)
        return dh
```
To approximate the multi-class classification as a binary classification, we first make the probability that 'say' (1) is the word between 'you' (0) and 'goodbye' (2) as high as possible (the positive example). But that alone is not enough.
So, in addition, we make the probability that randomly chosen words such as 'hello' (5) or 'i' (4) are judged incorrect as high as possible (the negative examples).
This technique is called **Negative Sampling**. The number of negative examples to draw is sample_size = 5 in the code.
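The negative examples are not drawn uniformly: the `UnigramSampler` created above samples words with probability proportional to their corpus frequency raised to the power 0.75, which slightly raises the chance of rare words. A rough sketch of that sampling idea (illustrative only, with my own function name; the book's UnigramSampler precomputes the distribution and handles the positive target separately):

```python
import numpy as np
from collections import Counter

def sample_negatives(corpus, batch_targets, sample_size=5, power=0.75):
    # Probability proportional to count(word) ** 0.75
    counts = Counter(corpus.tolist())
    vocab_size = int(max(corpus)) + 1
    p = np.zeros(vocab_size)
    for word_id, c in counts.items():
        p[word_id] = c
    p = np.power(p, power)
    p /= p.sum()
    # Draw sample_size negative word IDs for each target in the batch
    return np.array([np.random.choice(vocab_size, size=sample_size,
                                      replace=False, p=p)
                     for _ in batch_targets])

corpus = np.array([0, 1, 2, 3, 4, 5, 6, 1, 2, 2, 3, 1])
negatives = sample_negatives(corpus, batch_targets=np.array([1, 2]))
```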
At this point `EmbeddingDot` appears (as `embed_dot_layers`), so let's look at that class as well. It also lives in negative_sampling_layer.py.
```python
# ------------- from negative_sampling_layer.py -------------
class EmbeddingDot:
    def __init__(self, W):
        self.embed = Embedding(W)
        self.params = self.embed.params
        self.grads = self.embed.grads
        self.cache = None

    def forward(self, h, idx):
        target_W = self.embed.forward(idx)
        out = np.sum(target_W * h, axis=1)
        self.cache = (h, target_W)
        return out

    def backward(self, dout):
        h, target_W = self.cache
        dout = dout.reshape(dout.shape[0], 1)

        dtarget_W = dout * h
        self.embed.backward(dtarget_W)
        dh = dout * target_W
        return dh
```
To support mini-batches, the element-wise product target_W * h is summed along axis 1 at the end, so the score can be computed even when idx and h hold multiple entries at once.
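A tiny numeric illustration (values made up) of that axis=1 sum: each row of `target_W` is dotted with the corresponding row of `h`, giving one score per example in the batch.

```python
import numpy as np

h = np.array([[1., 2., 3.],
              [4., 5., 6.]])          # hidden vectors for a batch of 2
target_W = np.array([[0., 1., 0.],
                     [1., 0., 1.]])   # embedding rows selected by idx

scores = np.sum(target_W * h, axis=1)  # row-wise dot products
print(scores)                          # -> [ 2. 10.]
```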
We already ran ch04/train.py at the start, so the learned parameters are saved in cbow_params.pkl. Let's use them in eval.py to check whether we obtained good distributed representations of words.
```python
import sys
sys.path.append('..')
from common.util import most_similar, analogy
import pickle

pkl_file = 'cbow_params.pkl'  # file name to load

# Read the saved parameters
with open(pkl_file, 'rb') as f:
    params = pickle.load(f)

word_vecs = params['word_vecs']
word_to_id = params['word_to_id']
id_to_word = params['id_to_word']

# most similar task
querys = ['you']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
```
First, check word similarity with the `most_similar` method (common/util.py). The words closest to 'you' are 'we', then 'i', 'they', and 'your': personal pronouns line up at the top. This is the result of computing, for every word, the cosine similarity with the query vector.
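For reference, the cosine similarity between two vectors $\mathbf{x}$ and $\mathbf{y}$, the measure that `most_similar` ranks words by, is

$$
\mathrm{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}
$$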
```python
# analogy task
analogy('king', 'man', 'queen', word_to_id, id_to_word, word_vecs)
```
Now let's check the famous **king - man + woman = queen** problem with the `analogy` method (common/util.py). And it really does come out that way.
This solves the task of finding the **word x** such that the **"king → x" vector** is as close as possible to the **"man → woman" vector**.
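In vector terms this means forming vec('king') - vec('man') + vec('woman') and looking for the word whose vector is most similar to it by cosine similarity. A minimal sketch of that idea (my own illustration; the actual `analogy` in common/util.py may differ in argument order and details):

```python
import numpy as np

def analogy_sketch(a, b, c, word_to_id, id_to_word, word_vecs, top=5):
    """Find x such that 'a -> x' is close to 'b -> c' (e.g. king -> ? as man -> woman)."""
    query = (word_vecs[word_to_id[a]]
             - word_vecs[word_to_id[b]]
             + word_vecs[word_to_id[c]])
    # Cosine similarity between the query and every word vector
    norms = np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(query)
    similarity = word_vecs @ query / (norms + 1e-8)
    for word_id in (-similarity).argsort()[:top]:
        print(id_to_word[word_id], similarity[word_id])

# analogy_sketch('king', 'man', 'woman', word_to_id, id_to_word, word_vecs)
# should rank 'queen' near the top if the embeddings are good.
```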