There are two main ways to reduce class bias (imbalance) in a dataset when training a classifier. This time, I summarize what I was curious about when adopting the second method, weighting the loss for each class, using Chainer.
Specifically, the softmax_cross_entropy function has an argument called class_weight; by setting it, you can change how strongly each class is learned. For example, in two-class classification, class '1' can be learned twice as strongly as class '0'. But what does it actually mean to learn "twice as strongly" when you give a class double the weight? I was curious, so I looked into it.
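For context, here is roughly what passing class_weight looks like (a minimal sketch of my own; the array values are just an example):

```python
import numpy as np
import chainer.functions as F

# logits for a batch of 3 samples and 2 classes, with their labels
x = np.array([[2.0, 0.5], [0.3, 1.2], [1.0, 1.0]], dtype=np.float32)
t = np.array([0, 1, 1], dtype=np.int32)

# learn class '1' twice as strongly as class '0'
cw = np.array([1.0, 2.0], dtype=np.float32)

loss = F.softmax_cross_entropy(x, t, class_weight=cw)
```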
First, let's read Chainer's [documentation](https://docs.chainer.org/en/stable/reference/generated/chainer.functions.softmax_cross_entropy.html#chainer.functions.softmax_cross_entropy).
> `chainer.functions.softmax_cross_entropy(x, t, normalize=True, cache_score=True, class_weight=None, ignore_label=-1, reduce='mean')`
>
> ...
>
> **class_weight** (ndarray or ndarray) – An array that contains constant weights that will be multiplied with the loss values along with the second dimension. The shape of this array should be `(x.shape[1],)`. If this is not `None`, each class weight `class_weight[i]` is actually multiplied to `y[:, i]` that is the corresponding log-softmax output of `x` and has the same shape as `x` before calculating the actual loss value.
In other words, it seems that the $\log(\mathrm{Softmax}(x))$ computed before the loss is multiplied by class_weight, broadcast to match the shape of x. I see.
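To make the description concrete, here is a rough numpy re-implementation of what the documentation describes (my own sketch for illustration, not Chainer's code; it assumes a 2-D x and reduce='mean'):

```python
import numpy as np

def weighted_softmax_cross_entropy(x, t, class_weight):
    # log-softmax along the class axis (axis 1)
    log_y = x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    # multiply column i by class_weight[i], as the documentation says
    log_y = log_y * class_weight.reshape(1, -1)
    # pick the (weighted) log-probability of the correct class and average
    return -log_y[np.arange(len(t)), t].mean()

# with the example used later in this post this gives about 2.6265
print(weighted_softmax_cross_entropy(
    np.array([[1.0, 0.0]]), np.array([1]), np.array([1.0, 2.0])))
```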
Now, let's see at what stage class_weight is actually multiplied in. First, let's look at the forward_cpu function in softmax_cross_entropy.py.
#### **`chainer/functions/loss/softmax_cross_entropy.py`**
```python
def forward_cpu(self, inputs):
    x, t = inputs
    if chainer.is_debug():
        self._check_input_values(x, t)
    log_y = log_softmax._log_softmax(x, self.use_cudnn)
    if self.cache_score:
        self.y = numpy.exp(log_y)
    if self.class_weight is not None:
        shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
        log_y *= _broadcast_to(self.class_weight.reshape(shape), x.shape)
    log_yd = numpy.rollaxis(log_y, 1)
    log_yd = log_yd.reshape(len(log_yd), -1)
    log_p = log_yd[numpy.maximum(t.ravel(), 0), numpy.arange(t.size)]
    log_p *= (t.ravel() != self.ignore_label)
    if self.reduce == 'mean':
        # deal with the case where the SoftmaxCrossEntropy is
        # unpickled from the old version
        if self.normalize:
            count = (t != self.ignore_label).sum()
        else:
            count = len(x)
        self._coeff = 1.0 / max(count, 1)
        y = log_p.sum(keepdims=True) * (-self._coeff)
        return y.reshape(()),
    else:
        return -log_p.reshape(t.shape),
```
The point to note is the line where class_weight is broadcast and multiplied into the computed $\log(\mathrm{Softmax}(x))$:

```python
log_y *= _broadcast_to(self.class_weight.reshape(shape), x.shape)
```
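To see what this broadcast does for an ordinary 2-D input of shape (batchsize, n_class), here is a small illustration of my own (shape evaluates to [1, -1] in that case):

```python
import numpy as np

cw = np.array([1, 2], dtype=np.float32)
shape = [1 if d != 1 else -1 for d in range(2)]   # -> [1, -1] for a 2-D x
w = np.broadcast_to(cw.reshape(shape), (3, 2))    # rows of [1., 2.] repeated 3 times
# log_y *= w then scales column 0 by 1 and column 1 by 2 for every sample
```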
That is, in the cross-entropy loss $L = -\sum_{k} t_{k} \log(\mathrm{Softmax}(y_{k}))$, the weight is multiplied into $\log(\mathrm{Softmax}(y_{k}))$ before the sum is taken.
Here, $k$ is the index over the classes.
In other words, the formula becomes $L = -\sum_{k} t_{k} \, \mathrm{ClassWeight}_{k} \log(\mathrm{Softmax}(y_{k}))$.
This is exactly what the documentation says.
To see this, let's experiment interactively.
```python
>> import numpy as np
>> import chainer
>> x = np.array([[1, 0]]).astype(np.float32)
>> t = np.array([1]).astype(np.int32)
>> # train class '1' with twice the weight
>> cw = np.array([1, 2]).astype(np.float32)
>> sce_nonweight = chainer.functions.loss.softmax_cross_entropy.SoftmaxCrossEntropy()
>> sce_withweight = chainer.functions.loss.softmax_cross_entropy.SoftmaxCrossEntropy(class_weight=cw)
>> loss_nonweight = sce_nonweight(x, t)
>> loss_withweight = sce_withweight(x, t)
>> loss_nonweight.data
array(1.31326162815094, dtype=float32)
>> loss_withweight.data
array(2.62652325630188, dtype=float32)
```
You can see that the Loss value is doubled.
So what we have seen so far is that the weight given in class_weight is reflected directly, in proportion, in the output loss value.
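Just to double-check by hand (my own quick calculation, not part of the original experiment): for x = [[1, 0]] and t = [1], the unweighted loss is $-\log\frac{e^{0}}{e^{1}+e^{0}} \approx 1.3133$, and the weight of 2 simply doubles it:

```python
import numpy as np

# unweighted loss for x = [[1, 0]], t = [1]
print(-np.log(np.exp(0) / (np.exp(1) + np.exp(0))))      # ≈ 1.3133
# class '1' weighted by 2 simply doubles it
print(-2 * np.log(np.exp(0) / (np.exp(1) + np.exp(0))))  # ≈ 2.6265
```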
# Let's look at the effect on backpropagation
So what is the impact on learning, that is, on backpropagation?
What we want to check here is the value of $y - t$, which is what softmax_cross_entropy backpropagates.
Presumably $y - t$ simply gets multiplied by the weight; a quick derivation of why is sketched right after this, and then let's verify it against Chainer's implementation.
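Here is that quick derivation (standard calculus, my own note; $w_{k}$ stands for $\mathrm{ClassWeight}_{k}$, $p = \mathrm{Softmax}(x)$ is the network output, and $t$ is the correct class):

$$
L = -\,w_{t}\log p_{t},\qquad p_{k} = \frac{e^{x_{k}}}{\sum_{j} e^{x_{j}}}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x_{k}} = w_{t}\,\bigl(p_{k} - \delta_{kt}\bigr),
$$

i.e. the usual $y - t$ (softmax output minus the one-hot target) simply gets scaled by the weight of the correct class.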
#### **`chainer/functions/loss/softmax_cross_entropy.py`**
```python
def backward_cpu(self, inputs, grad_outputs):
    x, t = inputs
    gloss = grad_outputs[0]
    if hasattr(self, 'y'):
        y = self.y.copy()
    else:
        y = log_softmax._log_softmax(x, self.use_cudnn)
        numpy.exp(y, out=y)
    if y.ndim == 2:
        gx = y
        gx[numpy.arange(len(t)), numpy.maximum(t, 0)] -= 1
        if self.class_weight is not None:
            shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
            c = _broadcast_to(self.class_weight.reshape(shape), x.shape)
            c = c[numpy.arange(len(t)), numpy.maximum(t, 0)]
            gx *= _broadcast_to(numpy.expand_dims(c, 1), gx.shape)
        gx *= (t != self.ignore_label).reshape((len(t), 1))
    else:
        # in the case where y.ndim is higher than 2,
        # we think that a current implementation is inefficient
        # because it yields two provisional arrays for indexing.
        n_unit = t.size // len(t)
        gx = y.reshape(y.shape[0], y.shape[1], -1)
        fst_index = numpy.arange(t.size) // n_unit
        trd_index = numpy.arange(t.size) % n_unit
        gx[fst_index, numpy.maximum(t.ravel(), 0), trd_index] -= 1
        if self.class_weight is not None:
            shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
            c = _broadcast_to(self.class_weight.reshape(shape), x.shape)
            c = c.reshape(gx.shape)
            c = c[fst_index, numpy.maximum(t.ravel(), 0), trd_index]
            c = c.reshape(y.shape[0], 1, -1)
            gx *= _broadcast_to(c, gx.shape)
        gx *= (t != self.ignore_label).reshape((len(t), 1, -1))
        gx = gx.reshape(y.shape)
    if self.reduce == 'mean':
        gx *= gloss * self._coeff
    else:
        gx *= gloss[:, None]
    return gx, None
```
Here, the branch for y.ndim == 2 is the one that computes $y - t$ in our case, and you can see that class_weight is broadcast and multiplied into the backpropagated value, just as expected.
You can also see that gloss is multiplied in at the end. gloss is grad_outputs[0], which corresponds to the grad attribute of the output Variable. Its initial value can be inspected, so let's look at it.
```python
>> loss_nonweight.backward()
>> loss_withweight.backward()
>> loss_nonweight.grad
array(1.0, dtype=float32)
>> loss_withweight.grad
array(1.0, dtype=float32)
```
Of course, it would have been a problem otherwise: the very first backpropagated value is $\frac{\partial L}{\partial L} = 1$, so this result looks correct.
Incidentally, besides gloss there is a coefficient _coeff that is also multiplied in, but this is just the reciprocal of the batch size used during mini-batch training (that is, the member used for averaging), and in this case it is 1. Note that _coeff is also multiplied in when the loss itself is computed.
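As a small aside to illustrate that averaging (a sketch of my own, relying on the default normalize=True): with a batch of two identical samples, the summed loss is multiplied by _coeff = 1/2, so the reported mean loss stays at about 1.3133.

```python
import numpy as np
import chainer

# batch of two identical samples; the mean loss equals the single-sample loss
x2 = np.array([[1, 0], [1, 0]]).astype(np.float32)
t2 = np.array([1, 1]).astype(np.int32)
loss2 = chainer.functions.softmax_cross_entropy(x2, t2)
```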
So, as expected, the weight defined in class_weight is directly tied to learning. The next experiment is a bit contrived, but let's try it.
```python
>> sce_nonweight.backward_cpu((x, t), [loss_nonweight.grad])
(array([[ 0.7310586, -0.7310586]], dtype=float32), None)
>> sce_withweight.backward_cpu((x, t), [loss_withweight.grad])
(array([[ 1.4621172, -1.4621172]], dtype=float32), None)
```
Since `chainer.functions.softmax(x).data` is `array([[ 0.7310586, 0.26894143]], dtype=float32)`, you can see that the unweighted backpropagated value is indeed $y - t$.
And we can confirm that the backpropagated value in the weighted case is properly doubled as well. Great.
In conclusion, we found that the weight given to the class_weight argument of softmax_cross_entropy, as implemented in Chainer, is reflected proportionally both in the loss value and in the backpropagated gradient.
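As a practical note to close (my own example, not part of the experiment above): one common way to choose class_weight is the inverse class frequency of the training labels, so that rare classes are learned more strongly.

```python
import numpy as np
import chainer.functions as F

labels = np.array([0, 0, 0, 0, 1], dtype=np.int32)     # imbalanced labels
counts = np.bincount(labels, minlength=2).astype(np.float32)
cw = counts.sum() / (len(counts) * counts)             # -> about [0.625, 2.5]

x = np.random.randn(5, 2).astype(np.float32)           # dummy logits
loss = F.softmax_cross_entropy(x, labels, class_weight=cw)
```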
I don't know who will find this useful, but I hope it helps someone. I would appreciate it if you could point out anything that is wrong.