Tips for handling variable length inputs in deep learning frameworks

Introduction

Variable-length matrices come up in several recurring patterns in natural language processing. I feel like I reimplement them from scratch every time, so I am summarizing them here as a memo.

In this article I show implementations in Chainer and Tensorflow, the two frameworks I use most often. (Note: I did not copy and paste production code. I reimplemented everything from scratch for this post, so it has not been tested.)

Things to keep in mind about Chainer

I think Chainer recommends managing variable-length data as a list of Variable objects rather than as a Variable plus a length array, which is the approach used below. See, for example, L.NStepLSTM and [F.pad_sequence](https://docs.chainer.org/en/stable/reference/generated/chainer.functions.pad_sequence.html).

Note

The code below is based on the assumption that the following imports have been made.

Chainer


import chainer
import chainer.functions as F
import numpy as np

Tensorflow


import tensorflow as tf
import numpy as np


sess = tf.InteractiveSession()

Main text

Padding

To take advantage of GPU and CPU parallelism, many deep learning frameworks do not directly support computation on variable-length matrices. Instead, padding is used: a matrix sized for the longest sequence is allocated, and the positions beyond each sequence's length are filled with a suitable value.

This step is usually done by yourself at data-creation time, not by the deep learning framework.

X = [np.array([1, 2]),
     np.array([11, 12, 13, 14]),
     np.array([21])]

# int32, assuming the values are word IDs
x = np.zeros([3, 4], dtype=np.int32)

for i, xi in enumerate(X):
    x[i, :len(xi)] = xi[:]

print x
# [[ 1  2  0  0]
#  [11 12 13 14]
#  [21  0  0  0]]

When using Chainer's L.EmbedID, it is better to pad with -1 instead of 0 and use L.EmbedID(..., ignore_label=-1), so that padded positions yield zero vectors and ID 0 stays available for a real word.
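The padding loop above can be wrapped in a small helper that also returns the lengths needed later for masking. This is only a sketch: the name pad_sequences and its pad_value argument are made up for illustration and are not part of Chainer or Tensorflow.

```python
import numpy as np


def pad_sequences(seqs, pad_value=0, dtype=np.int32):
    """Pad a list of 1-D arrays to a (batch, max_len) array and return lengths.

    Hypothetical helper for illustration only.
    """
    lengths = np.array([len(s) for s in seqs], dtype=np.int32)
    padded = np.full((len(seqs), int(lengths.max())), pad_value, dtype=dtype)
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s
    return padded, lengths


X = [np.array([1, 2]), np.array([11, 12, 13, 14]), np.array([21])]

# -1 padding, as suggested for L.EmbedID(..., ignore_label=-1)
x_padded, x_lengths = pad_sequences(X, pad_value=-1)
print(x_padded)
print(x_lengths)
```

Here x_padded has shape (3, 4) with -1 in every padded slot, and x_lengths holds 2, 4 and 1.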

Masking

When doing sum pooling and similar operations, the positions outside the sequence length that were created by padding are masked with 0 (but do not be overconfident in masking; see the appendix). This can be implemented with where.

mask.png (Where the mask is True, the left-hand value is used; where it is False, the right-hand value is used, which is what performs the masking.)

If you write this process step by step:

Chainer


x = chainer.Variable(np.arange(1, 7).reshape(2, 3))
print x
# variable([[1 2 3]
#           [4 5 6]])

length = np.array([3, 2], dtype=np.int32)
print length
# [3 2]

xp = chainer.cuda.get_array_module(x.data)
mask = xp.tile(xp.arange(x.shape[-1]).reshape(1, -1), (x.shape[0], 1))
print mask
# [[0 1 2]
#  [0 1 2]]

mask = mask < length.reshape(-1, 1)
print mask
# [[ True  True  True]
#  [ True  True False]]

padding = xp.zeros(x.shape, dtype=x.dtype)
print padding
# [[0 0 0]
#  [0 0 0]]

z = F.where(mask, x, padding)
print z
# variable([[1 2 3]
#           [4 5 0]])

sequence_mask is convenient in Tensorflow.

Tensorflow


x = tf.constant(np.arange(1, 7).reshape(2, 3).astype(np.float32))
length = tf.constant(np.array([3, 2], dtype=np.int32))

mask = tf.sequence_mask(length, tf.shape(x)[-1])
padding = tf.fill(tf.shape(x), 0.0)
z = tf.where(mask, x, padding)
print z.eval()
# [[ 1.  2.  3.]
#  [ 4.  5.  0.]]
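As a framework-independent check of the same idea, here is a minimal NumPy sketch of masked sum pooling and mean pooling over the sequence axis. The variable names are chosen for illustration; mean pooling divides by the true length rather than by the padded width.

```python
import numpy as np

x = np.arange(1, 7, dtype=np.float32).reshape(2, 3)
length = np.array([3, 2], dtype=np.int32)

# mask[i, t] is True while t is inside sequence i
mask = np.arange(x.shape[-1]) < length[:, None]

# Replace padded positions with 0 so they do not contribute to the sum
z = np.where(mask, x, 0.0)

sum_pooled = z.sum(axis=-1)        # sums: 6.0 and 9.0
mean_pooled = sum_pooled / length  # means: 2.0 and 4.5
print(sum_pooled)
print(mean_pooled)
```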

A Chainer version (or rather, a NumPy version) of sequence_mask

Chainer


def sequence_mask(length, max_num=None):
    # length: numpy or cupy integer array holding the sequence lengths
    xp = chainer.cuda.get_array_module(length)
    if max_num is None:
        max_num = int(xp.max(length))
    # positions 0..max_num-1 placed on a new trailing axis
    perms = xp.arange(max_num).reshape([1] * length.ndim + [-1])
    # add a trailing axis so that length broadcasts against perms
    length = length.reshape(list(length.shape) + [1])
    return perms < length
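To sanity-check a mask helper like this without Chainer installed, a NumPy-only sketch is handy. The name sequence_mask_np is made up for this example; it relies only on broadcasting a position index against the lengths.

```python
import numpy as np


def sequence_mask_np(length, max_num=None):
    """NumPy-only sketch: mask[..., t] is True where t < length[...]."""
    length = np.asarray(length)
    if max_num is None:
        max_num = int(length.max())
    # arange has shape (max_num,), length[..., None] has shape (..., 1)
    return np.arange(max_num) < length[..., None]


mask_1d = sequence_mask_np(np.array([2, 3, 1]))
print(mask_1d)
# rows: [T T F], [T T T], [T F F]

# Lengths with more than one batch axis also work
mask_2d = sequence_mask_np(np.array([[1, 2], [0, 3]]), max_num=3)
print(mask_2d.shape)
```

A length of 0 simply produces an all-False row, which is worth remembering when the mask is later fed into a softmax.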

Reshape

Since deep learning often handles rank-2 matrices of shape mini-batch size x features, many frameworks provide functions that take such matrices as input. To benefit from these functions, a mini-batch size x sequence length x features tensor is reshaped into a (mini-batch size * sequence length) x features rank-2 matrix before processing.

reshape_1.png

However, this wastes computation on padded positions when the matrix is relatively sparse, that is, when many sequences are much shorter than the maximum length. You can reduce the work by using indexing to pick out only the valid rows. (I have not measured it, but if the matrix is not sparse, the extra memory allocation may cost more than it saves, so be careful.)

reshape_2.png
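The gather, compute, scatter idea can be sketched in plain NumPy before looking at the framework versions. This is an illustrative sketch with the same toy shapes as the examples that follow, not the code of either framework.

```python
import numpy as np

x = np.arange(18, dtype=np.float32).reshape(3, 3, 2)  # (batch, time, feature)
length = np.array([2, 3, 1], dtype=np.int32)
w = np.ones((2, 3), dtype=np.float32)

mask = np.arange(x.shape[1]) < length[:, None]        # (batch, time)

# Gather: keep only the valid time steps, giving an (n_valid, feature) matrix
x_valid = x[mask]

# Compute: the matrix product runs on n_valid rows instead of batch * time rows
y_valid = x_valid.dot(w)

# Scatter: write the results back; padded positions stay 0
y = np.zeros(mask.shape + (y_valid.shape[-1],), dtype=y_valid.dtype)
y[mask] = y_valid
print(y.shape)
```

In this toy case 6 of the 9 rows are valid, so a third of the matrix product is skipped.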

In the case of Chainer, this can be implemented as follows.

reshape_4.png

Chainer


# WARNING: I have not checked it in case of rank != 3

x = chainer.Variable(np.arange(18).astype(np.float32).reshape(3, 3, 2))
length = np.array([2, 3, 1], dtype=np.int32)
w = chainer.Variable(np.ones([2, 3], dtype=np.float32))

# sequence_mask is mentioned above
mask = sequence_mask(length, x.shape[length.ndim])
print mask
# [[ True  True False]
#  [ True  True  True]
#  [ True False False]]

x_reshaped = F.get_item(x, mask)
print x_reshaped
# [[  0.   1.]
#  [  2.   3.]
#  [  6.   7.]
#  [  8.   9.]
#  [ 10.  11.]
#  [ 12.  13.]]

y_reshaped = F.matmul(x_reshaped, w)
print y_reshaped
# [[  1.   1.   1.]
#  [  5.   5.   5.]
#  [ 13.  13.  13.]
#  [ 17.  17.  17.]
#  [ 21.  21.  21.]
#  [ 25.  25.  25.]]

pad_shape = [[0, 0] for _ in range(y_reshaped.ndim)]
pad_shape[length.ndim - 1][1] = 1
y_reshaped = F.pad(y_reshaped, pad_shape, 'constant', constant_values=0.)
print y_reshaped
# variable([[  1.,   1.,   1.],
#           [  5.,   5.,   5.],
#           [ 13.,  13.,  13.],
#           [ 17.,  17.,  17.],
#           [ 21.,  21.,  21.],
#           [ 25.,  25.,  25.],
#           [  0.,   0.,   0.]])


idx_size = np.prod(mask.shape)
inv_idx = np.ones([idx_size], dtype=np.int32) * -1
inv_idx[np.nonzero(mask.flat)[0]] = np.arange(x_reshaped.shape[0]).astype(np.int32)
print inv_idx
# [ 0  1 -1  2  3  4  5 -1 -1]

y = F.reshape(F.get_item(y_reshaped, inv_idx), list(x.shape[:length.ndim + 1]) + [-1])
print y
# [[[  1.   1.   1.]
#   [  5.   5.   5.]
#   [  0.   0.   0.]]
# 
#  [[ 13.  13.  13.]
#   [ 17.  17.  17.]
#   [ 21.  21.  21.]]
# 
#  [[ 25.  25.  25.]
#   [  0.   0.   0.]
#   [  0.   0.   0.]]]

In the case of Tensorflow, the same processing can be implemented as follows.

reshape_3.png

Tensorflow


# WARNING: I have not checked it in case of rank != 3
x = tf.constant(np.arange(18).astype(np.float32).reshape(3, 3, 2))
length = tf.constant(np.array([2, 3, 1], dtype=np.int32))
w = tf.constant(np.ones([2, 3], dtype=np.float32))

mask = tf.sequence_mask(length, tf.shape(x)[tf.rank(length)])
print mask.eval()
# [[ True  True False]
#  [ True  True  True]
#  [ True False False]]

x_reshaped = tf.boolean_mask(x, mask)
print x_reshaped.eval()
# [[  0.   1.]
#  [  2.   3.]
#  [  6.   7.]
#  [  8.   9.]
#  [ 10.  11.]
#  [ 12.  13.]]

y_reshaped = tf.matmul(x_reshaped, w)
print y_reshaped.eval()
# [[  1.   1.   1.]
#  [  5.   5.   5.]
#  [ 13.  13.  13.]
#  [ 17.  17.  17.]
#  [ 21.  21.  21.]
#  [ 25.  25.  25.]]

idx = tf.to_int32(tf.where(mask))
print idx.eval()
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]
#  [1 2]
#  [2 0]]

shape = tf.concat([tf.shape(x)[:-1], tf.shape(y_reshaped)[-1:]], 0)
print shape.eval()
# [3 3 3]

y = tf.scatter_nd(idx, y_reshaped, shape)
print y.eval()
# [[[  1.   1.   1.]
#   [  5.   5.   5.]
#   [  0.   0.   0.]]
# 
#  [[ 13.  13.  13.]
#   [ 17.  17.  17.]
#   [ 21.  21.  21.]]
# 
#  [[ 25.  25.  25.]
#   [  0.   0.   0.]
#   [  0.   0.   0.]]]

Implementation of Softmax

Consider applying a softmax over the last dimension of a given matrix (the sequence axis in the examples below). Such situations occur in the permutation probability distribution of ListNet and in the computation of attention.

Softmax formula: $ y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $

x = np.random.random([2, 3]).astype(np.float32)
# array([[ 0.44715771,  0.85983515,  0.08915455],
#        [ 0.02465274,  0.63411605,  0.01340247]], dtype=float32)

length = np.array([3, 2], dtype=np.int32)

I want to calculate Softmax using only the blue area as shown in the figure below.

masked_softmax.png

Note that simply applying a 0 mask before or after the softmax does not work.

Chainer


# Bad example 1: fill the padded position with 0 before softmax
x_ = np.copy(x)
x_[1, 2] = 0.
print F.softmax(x_)
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.26211682,  0.48214924,  0.25573397]])

# Bad example 2: zero out the padded position after softmax
y = F.softmax(x)
y.data[1, 2] = 0.
print y
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.26121548,  0.48049128,  0.0       ]])
# The second row does not sum to 1.0, so this is clearly wrong

The reasons are simple. In bad example 1, $ \exp(0) = 1 \neq 0 $, so the padded position still receives probability mass. In bad example 2, x[1, 2] still contributes to the denominator.

In the softmax computation, masking is instead done by exploiting $ \exp(-\infty) = 0 $: the padded positions are filled with -inf before the softmax.
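Before the framework versions, here is a NumPy-only sketch of the same trick, using the values of x printed above. The name masked_softmax_np is made up for this example. Note one assumption: every length must be at least 1, because a row that is entirely -inf produces nan.

```python
import numpy as np


def masked_softmax_np(x, length):
    """NumPy sketch: softmax over the last axis using only the first length[i] items."""
    mask = np.arange(x.shape[-1]) < length[..., None]
    # Padded positions become -inf, so exp() maps them to exactly 0
    z = np.where(mask, x, -np.inf)
    # Subtract the row maximum for numerical stability (assumes length >= 1)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


x = np.array([[0.44715771, 0.85983515, 0.08915455],
              [0.02465274, 0.63411605, 0.01340247]], dtype=np.float32)
length = np.array([3, 2], dtype=np.int32)

y = masked_softmax_np(x, length)
print(y)
print(y.sum(axis=-1))  # each row sums to 1
```

The padded entry y[1, 2] comes out as exactly 0 and the second row still sums to 1, which is what both bad examples failed to achieve.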

Chainer


def masked_softmax(x, length):
    """
    Softmax over the last dimension of x, ignoring padded positions.

    Args:
         x (chainer.Variable): Values to be passed to softmax
         length (numpy.ndarray or cupy.ndarray):
             Number of valid items along the last dimension of x
    """
    assert x.ndim - 1 == length.ndim
    xp = chainer.cuda.get_array_module(x.data)
    x_shape = x.shape
    x = F.reshape(x, (-1, x_shape[-1]))
    # mask: (B, T)
    mask = xp.tile(xp.arange(x.shape[-1]).reshape(1, -1), (x.shape[0], 1))
    mask = mask < length.reshape(-1, 1)
    padding = xp.ones(x.shape, dtype=x.dtype) * -np.inf
    z = F.where(mask, x, padding)
    return F.reshape(F.softmax(z), x_shape)


print masked_softmax(chainer.Variable(x), length)
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.35218161,  0.64781839,  0.        ]])

Tensorflow


def masked_softmax(x, length):
    """
    Softmax over the last dimension of x, ignoring padded positions.

    Args:
         x (tf.Tensor): Values to be passed to softmax
         length (tf.Tensor): Number of valid items along the last dimension of x
    """
    mask = tf.sequence_mask(length, tf.shape(x)[-1])
    padding = tf.fill(tf.shape(x), -np.inf)
    z = tf.where(mask, x, padding)
    return tf.nn.softmax(z, dim=-1)


print masked_softmax(
    tf.constant(x),
    tf.constant(length)).eval()
# [[ 0.31153342,  0.47068265,  0.21778394],
#  [ 0.35218161,  0.64781839,  0.        ]]

Appendix:

Don't be overconfident in masking

In deep learning frameworks, when a division by zero occurs, the gradient becomes inf or nan even if you use where. Therefore the idea "I can do an unstable computation as long as I mask it afterwards" does not work.

Suppose there is a network given by the following equations.

$$
e = f_0(x) \\
w = f_1(e)
$$

By the chain rule, its gradient is $ \frac{\partial w}{\partial x} = \frac{\partial w}{\partial e}\frac{\partial e}{\partial x} $.

In automatic differentiation, this is implemented roughly as follows.

x.grad = e.grad * g(f_0, e, x)

Here, g(f_0, e, x) is the partial derivative computed from $ f_0 $ and its input and output. In other words, no matter what gradient e.grad arrives from upstream (even 0, since 0 * inf evaluates to nan), if the partial derivative of $ f_0 $ is inf or nan, x.grad also becomes inf or nan. Trying this with Tensorflow and Chainer:

Tensorflow


sess = tf.InteractiveSession()

x = tf.constant(0.0)

t = x
e = 1. / x
w = tf.where(True, t, e)

print w.eval()  # 0.0
print tf.gradients(w, x)[0].eval()  # nan

Chainer


x = chainer.Variable(np.array([0.0], dtype=np.float32))
t = x
e = 1. / x
w = chainer.functions.where(np.array([True]), t, e)

w.grad = np.array([1.0], np.float32)
w.backward(retain_grad=True)

print w  # 0.
print x.grad  # nan
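The nan can be reproduced by writing the backward step of e = 1 / x by hand in NumPy. The second half shows one common workaround, which is my own addition and not from the experiments above: make the input safe before the unstable operation instead of masking its output. The names use_t and x_safe are hypothetical.

```python
import numpy as np

x = np.float32(0.0)

with np.errstate(divide="ignore", invalid="ignore"):
    # where() selected t, so the gradient flowing into e is 0
    e_grad = np.float32(0.0)
    # Local derivative of e = 1 / x is -1 / x**2, which is -inf at x = 0
    local = -np.float32(1.0) / (x * x)
    # Chain rule: 0 * -inf is nan, so the mask does not protect x.grad
    x_grad_from_e = e_grad * local

print(np.isnan(x_grad_from_e))  # True

# Workaround sketch (assumption): feed a dummy value into the unused branch
use_t = True
x_safe = np.where(use_t, np.float32(1.0), x)
local_safe = -np.float32(1.0) / (x_safe * x_safe)
print(np.isfinite(e_grad * local_safe))  # True
```

With the safe input, the local derivative is finite, so multiplying it by the zero upstream gradient gives 0 rather than nan.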
