Summary of sharpening and smoothing of probabilities, with an application to simple document generation

I have recently been reading a paper on molecular design [1], and since it discusses the rescaling of probabilities, I summarize the topic here.

Probability rescaling is used for sharpening and smoothing discrete probability distributions.

Probability sharpening is the operation of making high probabilities higher and low probabilities lower. This brings the discrete probability distribution closer to a one-hot distribution (1 for one element and 0 for all the others).

Probability smoothing, on the other hand, flattens the probability distribution, bringing it closer to a uniform distribution.

(Figure sample.png: sharpening and smoothing of a discrete probability distribution)

Sharpening and smoothing are used to adjust the probability distribution when generating samples. In natural-language document generation, sharpening produces more common sentences, while smoothing produces a wider variety of sentences.

This idea is used not only in natural language processing but also in various places such as unsupervised clustering and molecular design generation.

Here, I will introduce two rescaling methods and try them with a simple sentence generation problem.

Thermal Rescaling

This method uses the Boltzmann distribution from statistical mechanics. Let q_i be the probability distribution before rescaling and p_i the probability distribution after rescaling; Thermal Rescaling is then defined as follows.

p_i = \frac{\exp(\frac{q_i}{T})}{\sum_{j}\exp(\frac{q_j}{T})}

Here T is a temperature parameter: the distribution is sharpened when T is small and smoothed when T is large. The changes in the probability distribution under Thermal Rescaling are shown below.

(Figure example_thermal.png: the effect of Thermal Rescaling for several values of T)

The features of Thermal Rescaling are the following three points.

  1. As T approaches 0, it approaches the One-Hot probability distribution "v".
  2. As T increases, it approaches the uniform distribution "u".
  3. There is not always a temperature parameter "T" for which the rescaled distribution matches the distribution before rescaling.

Related to feature 3, it is important to note that elements with probability 0 are not preserved (element 3 in the figure above). This can break constraints of the probability distribution that were assumed before rescaling, as the sketch below illustrates.
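As a quick check (a minimal sketch; the values are only illustrative), an element with probability 0 receives a nonzero probability after Thermal Rescaling:

import numpy as np

# Distribution with a zero element (element 3)
q = np.array([0.1, 0.05, 0.6, 0, 0.2, 0.05])
# Large T = strong smoothing
T = 10.0
p = np.exp(q / T) / np.sum(np.exp(q / T))
# p[3] is no longer 0, because exp(0 / T) = 1 contributes to the numerator
print(p[3])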

The following shows the KL divergence between the rescaled distribution and (A) the original distribution q, (B) the uniform distribution u, and (C) the one-hot distribution v.

(Figure KL_thermal.png: KL divergence to q, u, and v as a function of T)

As T approaches 0, the KL divergence to the one-hot distribution v approaches 0, and as T increases, the KL divergence to the uniform distribution u approaches 0. On the other hand, the KL divergence to the original distribution q never reaches 0.

It is barely worth calling code, but with numpy it looks like this:

import numpy as np

# Original discrete probability distribution
q = np.array([0.1, 0.05, 0.6, 0, 0.2, 0.05])
# Temperature parameter
T = 0.05
# Rescaling
p = np.exp(q / T) / np.sum(np.exp(q / T))
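To reproduce the trend in the KL divergence figure above, the three curves can be computed directly. This is a rough sketch; it assumes the divergence directions D(q||p), D(u||p), and D(v||p), which match the limiting behavior described in the text.

# KL divergence D(a||b); terms with a_i = 0 contribute 0
def kl(a, b):
    m = a > 0
    return np.sum(a[m] * np.log(a[m] / b[m]))

q = np.array([0.1, 0.05, 0.6, 0, 0.2, 0.05])
u = np.ones(len(q)) / len(q)          # uniform distribution
v = (q == q.max()).astype(float)      # one-hot distribution
for T in [0.01, 0.05, 0.2, 1.0, 10.0]:
    p = np.exp(q / T) / np.sum(np.exp(q / T))
    print(T, kl(q, p), kl(u, p), kl(v, p))

As T shrinks, kl(v, p) goes to 0; as T grows, kl(u, p) goes to 0; kl(q, p) never reaches 0.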

Freezing function

The Freezing function is a rescaling method that raises the probabilities to a power. This idea is also used in the unsupervised learning method Deep Embedded Clustering (DEC) [2]. Let q_i be the probability distribution before rescaling and p_i the probability distribution after rescaling; the Freezing function is defined as follows.

p_i = \frac{q_i^{\frac{1}{T}}}{\sum_{j}q_j^{\frac{1}{T}}}

As with Thermal Rescaling, T is a temperature parameter: the distribution is sharpened when T is small and smoothed when T is large. The changes in the probability distribution under the Freezing function are shown below.

(Figure example_freesing.png: the effect of the Freezing function for several values of T)

The features of the Freezing function are the following three points.

  1. As T approaches 0, it approaches the One-Hot probability distribution "v".
  2. As T increases, it does not necessarily approach the uniform distribution "u".
  3. When T = 1, it matches the probability distribution before rescaling.

Unlike Thermal Rescaling, elements with probability 0 are preserved. Therefore, if the distribution contains an element with probability 0, smoothing will never yield a uniform distribution, as the following sketch shows.
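A minimal sketch of this behavior (the values are only illustrative): the zero element stays exactly 0 no matter how large T is, so the distribution can never become uniform.

# Freezing: an element with probability 0 stays 0, even for large T
q = np.array([0.1, 0.05, 0.6, 0, 0.2, 0.05])
T = 100.0  # strong smoothing
p = q**(1 / T) / np.sum(q**(1 / T))
print(p)                   # p[3] is still exactly 0
print(p.max() - p.min())   # the gap never closes, so p never equals u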

The following shows the KL divergence between the rescaled distribution and (A) the original distribution q, (B) the uniform distribution u, and (C) the one-hot distribution v. (D(u||p) cannot be computed because p contains zeros, so D(p||u) was computed instead.)

(Figure KL_freesing.png: KL divergence to q, u, and v as a function of T)

As with Thermal Rescaling, the KL divergence to the one-hot distribution v approaches 0 as T approaches 0, but the KL divergence to the uniform distribution u does not approach 0 as T increases. On the other hand, at T = 1 the KL divergence to the original distribution q is 0.

Again, it is barely worth calling code, but with numpy it looks like this:

# Original discrete probability distribution
q = np.array([0.1, 0.05, 0.6, 0, 0.2, 0.05])
# Temperature parameter
T = 0.2
# Rescaling
p = q**(1/T) / np.sum(q**(1/T))
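A quick check of feature 3 (a minimal sketch): at T = 1 the Freezing function returns the original distribution unchanged, which is why the KL divergence to q is 0 there.

# T = 1 leaves q unchanged
p1 = q**(1 / 1.0) / np.sum(q**(1 / 1.0))
print(np.allclose(p1, q))  # True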

Application to document generation

The following three sentences are used for learning.

I have a pen. 
I have a dog. 
I buy a pen.

For the sentence generation model, a simple 1-gram model (a first-order Markov chain over adjacent word pairs) is used. Generating 10 sentences with this model gives the following.

 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a dog [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I buy a dog [EoS]
 [SoS] I buy a dog [EoS]
 [SoS] I have a pen [EoS]

Reasonably varied (if somewhat messy) sentences are generated; note that "I buy a dog", which is not in the training data, also appears, because the model only looks at adjacent word pairs.

First, the sentences sharpened with the Freezing function are as follows (T = 0.1).

 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]

Since the training sentences most often use "have" after "I" and "pen" after "a", sharpening causes only this single sentence to be generated.

Next, the sentences smoothed with the Freezing function are as follows (T = 2.0).

 [SoS] I have a dog [EoS]
 [SoS] I buy a dog [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a pen [EoS]
 [SoS] I have a dog [EoS]
 [SoS] I buy a dog [EoS]
 [SoS] I buy a pen [EoS]
 [SoS] I buy a pen [EoS]
 [SoS] I have a dog [EoS]

The number of samples that use "buy" after "I" increases, but the sentence structure is not broken.

Finally, the sentences smoothed with Thermal Rescaling are as follows (T = 1.0).

 [SoS] a I have dog I have I [EoS]
 [SoS] [EoS]
 [SoS] have [SoS] I buy [EoS]
 [SoS] a a [EoS]
 [SoS] [SoS] [SoS] [EoS]
 [SoS] [EoS]
 [SoS] a buy a dog dog [SoS] I dog pen pen pen buy pen [EoS]
 [SoS] dog buy a I pen I have buy I buy a dog [EoS]
 [SoS] dog buy buy I dog a pen have dog pen [EoS]
 [SoS] I [SoS] buy dog [SoS] a pen pen [EoS]

The sentence structure is broken. This is because Thermal Rescaling does not preserve elements with probability 0.

The code for sentence generation is shown below.

import numpy as np

corpus = "I have a pen. I have a dog. I buy a pen."
# Split into sentences
sentences = corpus.split('.')
# Remove empty sentences and add [SoS] / [EoS]
sentences = ['[SoS] ' + s + ' [EoS]' for s in sentences if s != '']
# Add the [EoS] -> [EoS] transition
sentences.append("[EoS] [EoS]")
# Remove double spaces
sentences = [s.replace('  ', ' ') for s in sentences]
# Split into words
words = []
for s in sentences:
    words.append(s.split(' '))
# Create the word list
word_list = ['[SoS]'] + list(set(sum(words, [])) - set(['[SoS]', '[EoS]'])) + ['[EoS]']
# Map words to element numbers
num_list = np.arange(len(word_list))
word_dict = dict(zip(word_list, num_list))
# Create the transition probabilities
A = np.zeros((len(word_list), len(word_list)))
for s in words:
    for i in range(len(s) - 1):
        # Count transitions
        A[word_dict[s[i + 1]], word_dict[s[i]]] += 1
A = A / A.sum(axis=0)
# Document generation with the original distribution
sentences_g = []

for i in range(10):
    # Start with [SoS]
    w = [0]
    while True:
        # Probability distribution of the next word
        q = A[:, w[-1]]
        # Sample and append
        w.append(np.argmax(np.random.multinomial(1, q)))
        # Stop at [EoS]
        if w[-1] == len(word_list) - 1:
            break
    # Convert to a string
    s = ''
    for idx in w:
        s = s + ' ' + word_list[idx]

    # Add the document
    sentences_g.append(s)
# Display
for s in sentences_g:
    print(s)

Code for sharpening with the Freezing function:

# Sharpening by the Freezing function: T = 0.1
T = 0.1
sentences_g = []
for i in range(10):
    w = [0]
    while True:
        q = A[:, w[-1]]
        # Rescaling
        p = q**(1/T) / np.sum(q**(1/T))
        w.append(np.argmax(np.random.multinomial(1, p)))
        if w[-1] == len(word_list) - 1:
            break
    s = ''
    for idx in w:
        s = s + ' ' + word_list[idx]
    sentences_g.append(s)
for s in sentences_g:
    print(s)

Code for smoothing with Thermal Rescaling:

# Smoothing by Thermal Rescaling: T = 1.0
T = 1.0
sentences_g = []
for i in range(10):
    w = [0]
    while True:
        q = A[:, w[-1]]
        # Rescaling
        p = np.exp(q / T) / np.sum(np.exp(q / T))
        w.append(np.argmax(np.random.multinomial(1, p)))
        if w[-1] == len(word_list) - 1:
            break
    s = ''
    for idx in w:
        s = s + ' ' + word_list[idx]

    sentences_g.append(s)
for s in sentences_g:
    print(s)
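To see where the broken structure comes from, you can inspect a single column of the transition matrix before and after Thermal Rescaling. This is a small check reusing the variables A, word_dict, and word_list defined in the generation code above.

# Transition probabilities from "[SoS]" before and after Thermal Rescaling
q = A[:, word_dict['[SoS]']]
p = np.exp(q / T) / np.sum(np.exp(q / T))
for word, qi, pi in zip(word_list, q, p):
    print(word, qi, pi)
# Words that never follow "[SoS]" in the training data (probability 0)
# now receive exp(0 / T) / sum(...) > 0, so ungrammatical transitions occur.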

Rescaling with a bias

Sometimes we want to rescale so that particular elements are more likely to appear. In that case, the Gumbel-Softmax function [3] is an option.

Let q_i be the probability distribution before rescaling and p_i the probability distribution after rescaling; the Gumbel-Softmax function is defined as follows.

p_i = \frac{\exp(\frac{\log(q_i)+ g_i}{T})}{\sum_{j}\exp(\frac{\log(q_j)+ g_j}{T})}

Here T is a temperature parameter and, as with the previous rescaling methods, the distribution is sharpened when T is small and smoothed when T is large. The g_i are parameters that determine the priority of elements during rescaling; in the paper, g_i is sampled as follows.

g_i = -\log (-\log u),\quad u\sim {\rm Uniform}(0, 1)
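As a minimal numpy sketch (this is my own illustration, not code from the paper), the Gumbel-Softmax rescaling could look like the following. Note that log(0) is -inf, so elements with probability 0 keep probability 0 after the softmax.

# Original discrete probability distribution
q = np.array([0.1, 0.05, 0.6, 0, 0.2, 0.05])
# Temperature parameter
T = 0.5
# Gumbel noise: g_i = -log(-log u), u ~ Uniform(0, 1)
u = np.random.uniform(size=len(q))
g = -np.log(-np.log(u))
# Gumbel-Softmax rescaling (log(0) = -inf, so zero elements stay at 0)
with np.errstate(divide='ignore'):
    logits = (np.log(q) + g) / T
p = np.exp(logits) / np.sum(np.exp(logits))

Because g is resampled on every call, repeated rescaling favors different elements on different draws, which is what makes this a "biased" rescaling.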

There are likely other ways to bias the rescaling besides the Gumbel-Softmax function; choose one according to the situation.

Summary

If you want to preserve elements with probability 0, use the Freezing function; if you do not want to preserve them, use Thermal Rescaling.

Code: https://github.com/yuji0001/2020GenerativeModel

Author Yuji Okamoto: [email protected]

References

[1] Elton, D. C., Boukouvalas, Z., Fuge, M. D., & Chung, P. W. (2019). Deep learning for molecular design: a review of the state of the art. Molecular Systems Design and Engineering, Vol. 4, pp. 828–849. https://doi.org/10.1039/c9me00039a

[2] Xie, J., Girshick, R., & Farhadi, A. (2015). Unsupervised Deep Embedding for Clustering Analysis. http://arxiv.org/abs/1511.06335

[3] Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel-Softmax. 5th International Conference on Learning Representations, ICLR 2017.
