This article is the Day 11 entry of the Ibaraki University Advent Calendar 2019.
Implement Word2Vec with PyTorch.
Word2Vec
When I looked into building Word2Vec, most of the articles I found used gensim, and there were few that implemented Word2Vec with PyTorch, so I decided to write this post. Since there are already many articles explaining Word2Vec itself, I will keep the explanation brief.
Skip-gram
Skip-gram maximizes the probability of the surrounding words $\{w_{t-1}, w_{t+1}\}$ given the input word $w_{t}$. The objective function to be minimized is therefore as follows.
\begin{align}
E
&= -\log p( w_{t-1}, w_{t+1} \mid w_{t} ) \\
&= -\log p(w_{t-1} \mid w_{t})\, p(w_{t+1} \mid w_{t}) \\
&= -\log \prod_{i}\frac{\exp({v'}_{w_{i}}^{T} v_{w_{t}})}{\sum_{j}\exp({v'}_{w_{j}}^{T} v_{w_{t}})}
\end{align}
Here the numerator only involves the words inside the window, but the denominator requires a sum over the entire vocabulary. Since computing that is impractical, we approximate it with negative sampling.
The words drawn by negative sampling are chosen according to word frequency (raised to the 3/4 power), as described in reference [1]. The code looks like this:
import itertools
from collections import Counter

import numpy as np

def sample_negative(sample_size):
    # corpus: list of tokenized sentences, defined elsewhere
    prob = {}
    word2cnt = dict(Counter(list(itertools.chain.from_iterable(corpus))))
    # Raise counts to the 3/4 power, as in reference [1]
    pow_sum = sum([v**0.75 for v in word2cnt.values()])
    for word in word2cnt:
        prob[word] = word2cnt[word]**0.75 / pow_sum
    words = np.array(list(word2cnt.keys()))
    while True:
        word_list = []
        # Draw sample_size negative words from the unigram^(3/4) distribution
        sampled_index = np.array(np.random.multinomial(sample_size, list(prob.values())))
        for index, count in enumerate(sampled_index):
            for _ in range(count):
                word_list.append(words[index])
        yield word_list
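For illustration, here is a small usage sketch of the generator above. The toy corpus and the sample size of 5 are my own assumptions, not part of the original article.

# Hypothetical usage sketch
corpus = [["i", "like", "natural", "language", "processing"],
          ["word2vec", "learns", "word", "vectors"]]
neg_sampler = sample_negative(5)
negatives = next(neg_sampler)  # e.g. ['word', 'like', ...]: 5 words drawn by frequency^0.75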
One way to feed a word into the model is a one-hot vector, but its dimension grows with the vocabulary size, so instead each word is converted to a dense vector with an Embedding layer and passed to the encoder and decoder. The score is the inner product of the word vectors, passed through the log-sigmoid function. The formula is as follows.
L = \sum_{w_{O} \in \mathrm{context}} \log \sigma({v'}_{w_{O}}^{T} v_{w_{I}}) + \sum_{w_{k} \in \mathrm{neg}} \log \sigma(-{v'}_{w_{k}}^{T} v_{w_{I}})
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, V, H):
        super(SkipGram, self).__init__()
        # V: vocabulary size, H: embedding dimension
        self.encode_embed = nn.Embedding(V, H)
        self.decode_embed = nn.Embedding(V, H)
        # Encoder weights start with small random values, decoder weights with zeros
        self.encode_embed.weight.data.uniform_(-0.5/H, 0.5/H)
        self.decode_embed.weight.data.uniform_(0.0, 0.0)

    def forward(self, contexts, center, neg_target):
        embed_ctx = self.encode_embed(contexts)           # (batch, window, H)
        embed_center = self.decode_embed(center)          # (batch, H)
        neg_embed_center = self.encode_embed(neg_target)  # (num_neg, H)

        # Inner product
        ## Positive examples
        score = torch.matmul(embed_ctx, torch.t(embed_center))
        score = torch.sum(score, dim=2).view(1, -1)
        log_target = F.logsigmoid(score)
        ## Negative examples
        neg_score = torch.matmul(embed_ctx, torch.t(neg_embed_center))
        neg_score = -torch.sum(neg_score, dim=2).view(1, -1)
        log_neg_target = F.logsigmoid(neg_score)

        return -1 * (torch.mean(log_target) + torch.mean(log_neg_target))
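As a quick check that the forward pass runs, the following sketch feeds dummy index tensors through the model. The vocabulary size, embedding dimension, and tensor shapes are placeholder values I chose, not from the original article.

V, H = 100, 20  # placeholder vocabulary size and embedding dimension
model = SkipGram(V, H)
contexts = torch.randint(0, V, (4, 2))   # (batch, window) context word ids
center = torch.randint(0, V, (4,))       # (batch,) center word ids
neg_target = torch.randint(0, V, (5,))   # (num_neg,) negative sample ids
loss = model(contexts, center, neg_target)
print(loss.item())  # scalar loss value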
It seems to be common to use separate Embedding layers for the encoder and decoder. Since the objective is a maximization problem, the loss is multiplied by minus one.
The overall accuracy is not great, and the learning rate and scheduler need to be set appropriately.
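For reference, here is a minimal training-loop sketch with an optimizer and scheduler. The Adam optimizer, the StepLR schedule, and the helper names num_epochs, batches, and word2index are assumptions for illustration, not the author's actual setup.

import torch.optim as optim

model = SkipGram(V, H)
optimizer = optim.Adam(model.parameters(), lr=1e-3)                        # assumed learning rate
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # assumed schedule
neg_sampler = sample_negative(5)

for epoch in range(num_epochs):                  # num_epochs: assumed
    for contexts, center in batches:             # LongTensor batches built from the corpus (assumed)
        neg_words = next(neg_sampler)
        neg_target = torch.LongTensor([word2index[w] for w in neg_words])  # word2index: assumed mapping
        optimizer.zero_grad()
        loss = model(contexts, center, neg_target)
        loss.backward()
        optimizer.step()
    scheduler.step()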
I haven't cleaned up the code yet, so I will publish the full code after organizing it.
[1] Distributed Representations of Words and Phrases and their Compositionality
[2] word2vec Parameter Learning Explained