I examined the class_weight argument of Chainer's softmax_cross_entropy function.

Introduction

There are two main ways to reduce the effect of class imbalance in a dataset when training a classifier.

This post summarizes what I looked into when adopting the second of them, weighting the loss for each class, with Chainer.

Motivation

Specifically, the softmax_cross_entropy function has an argument called class_weight, and by controlling it you can change how strongly each class is learned. For example, in two-class classification, class '1' can be learned twice as strongly as class '0'. But when you give a class double the weight, what does "learning it twice as strongly" actually mean? I was curious, so I looked it up.
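As a minimal sketch of what this looks like in practice (my own example; the class_weight argument is taken from the documentation quoted below):

```python
import numpy as np
import chainer.functions as F

# Logits for a batch of 4 samples over 2 classes, and their labels.
x = np.array([[ 2.0, -1.0],
              [ 0.5,  0.3],
              [-1.0,  1.5],
              [ 0.2, -0.2]], dtype=np.float32)
t = np.array([0, 1, 1, 0], dtype=np.int32)

# Weight class '1' twice as strongly as class '0'.
cw = np.array([1.0, 2.0], dtype=np.float32)

loss_plain = F.softmax_cross_entropy(x, t)
loss_weighted = F.softmax_cross_entropy(x, t, class_weight=cw)
print(loss_plain.data, loss_weighted.data)
```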

Execution environment

Let's see the effect on Loss

First, let's read Chainer's [documentation](https://docs.chainer.org/en/stable/reference/generated/chainer.functions.softmax_cross_entropy.html#chainer.functions.softmax_cross_entropy).

chainer.functions.softmax_cross_entropy(x, t, normalize=True, cache_score=True, class_weight=None, ignore_label=-1, reduce='mean')

...

class_weight (ndarray or ndarray) – An array that contains constant weights that will be multiplied with the loss values along with the second dimension. The shape of this array should be (x.shape[1],). If this is not None, each class weight class_weight[i] is actually multiplied to y[:, i] that is the corresponding log-softmax output of x and has the same shape as x before calculating the actual loss value.

In other words, it seems that $\log(\mathrm{Softmax}(x))$, which is computed before the loss, is multiplied by class_weight broadcast to the shape of x. I see.

Now, let's see at what stage the class_weight is actually multiplied. First, let's take a look at the forward function in softmax_cross_entropy.py.

#### **`chainer/functions/loss/softmax_cross_entropy.py`**
```python
    def forward_cpu(self, inputs):
        x, t = inputs
        if chainer.is_debug():
            self._check_input_values(x, t)

        log_y = log_softmax._log_softmax(x, self.use_cudnn)
        if self.cache_score:
            self.y = numpy.exp(log_y)
        if self.class_weight is not None:
            shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
            log_y *= _broadcast_to(self.class_weight.reshape(shape), x.shape)
        log_yd = numpy.rollaxis(log_y, 1)
        log_yd = log_yd.reshape(len(log_yd), -1)
        log_p = log_yd[numpy.maximum(t.ravel(), 0), numpy.arange(t.size)]

        log_p *= (t.ravel() != self.ignore_label)
        if self.reduce == 'mean':
            # deal with the case where the SoftmaxCrossEntropy is
            # unpickled from the old version
            if self.normalize:
                count = (t != self.ignore_label).sum()
            else:
                count = len(x)
            self._coeff = 1.0 / max(count, 1)

            y = log_p.sum(keepdims=True) * (-self._coeff)
            return y.reshape(()),
        else:
            return -log_p.reshape(t.shape),
```

Note the line where class_weight, reshaped and broadcast to the shape of x, is multiplied into the result of $\log(\mathrm{Softmax}(x))$:

```python
log_y *= _broadcast_to(self.class_weight.reshape(shape), x.shape)
```


That is, before the sum in the cross-entropy error $L = -\sum_{k} t_{k} \log(\mathrm{Softmax}(y_{k}))$ is taken, each $\log(\mathrm{Softmax}(y_{k}))$ is multiplied by its class weight, where $k$ indexes the classes.
In other words, the loss becomes $L = -\sum_{k} t_{k}\,\mathrm{ClassWeight}_{k} \log(\mathrm{Softmax}(y_{k}))$.
It is just as documented.
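To make the reshape and broadcast in that line concrete, here is a plain-numpy illustration (the shapes here are my own example, not Chainer code):

```python
import numpy as np

# class_weight has shape (x.shape[1],); the reshape below turns it into a
# row vector so it can be broadcast over the batch axis, as in forward_cpu.
cw = np.array([1.0, 2.0], dtype=np.float32)
x_shape = (3, 2)                                             # (batch, classes)
shape = [1 if d != 1 else -1 for d in range(len(x_shape))]   # -> [1, -1]
print(np.broadcast_to(cw.reshape(shape), x_shape))
# [[1. 2.]
#  [1. 2.]
#  [1. 2.]]
```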

To confirm this effect on the loss value, let's experiment interactively.

```python
>>> import numpy as np
>>> import chainer
>>> x = np.array([[1, 0]]).astype(np.float32)
>>> t = np.array([1]).astype(np.int32)
>>> # train class '1' with twice the weight
>>> cw = np.array([1, 2]).astype(np.float32)
>>> sce_nonweight = chainer.functions.loss.softmax_cross_entropy.SoftmaxCrossEntropy()
>>> sce_withweight = chainer.functions.loss.softmax_cross_entropy.SoftmaxCrossEntropy(class_weight=cw)
>>> loss_nonweight = sce_nonweight(x, t)
>>> loss_withweight = sce_withweight(x, t)
>>> loss_nonweight.data
array(1.31326162815094, dtype=float32)
>>> loss_withweight.data
array(2.62652325630188, dtype=float32)
```


 You can see that the Loss value is doubled.
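As a sanity check, the same numbers can be reproduced with plain numpy (my own re-derivation of the formula above, not Chainer code):

```python
import numpy as np

x = np.array([[1.0, 0.0]], dtype=np.float32)
cw = np.array([1.0, 2.0], dtype=np.float32)
t = 1                                                        # the true class

log_y = x - np.log(np.exp(x).sum(axis=1, keepdims=True))     # log(Softmax(x))
print(-log_y[0, t])           # ~1.3132617, matches loss_nonweight
print(-(cw * log_y)[0, t])    # ~2.6265235, matches loss_withweight
```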

So far, then, we have confirmed that the weight given in class_weight is reflected directly in the output loss value.

Let's see the effect on backpropagation

So what is the effect on learning, that is, on backpropagation?
What we want to see here is the value backpropagated from softmax_cross_entropy, which without weighting is $y - t$.
Presumably that value is simply multiplied by the weight, but let's check Chainer's implementation to make sure.


#### **`chainer/functions/loss/softmax_cross_entropy.py`**
```python

    def backward_cpu(self, inputs, grad_outputs):
        x, t = inputs
        gloss = grad_outputs[0]
        if hasattr(self, 'y'):
            y = self.y.copy()
        else:
            y = log_softmax._log_softmax(x, self.use_cudnn)
            numpy.exp(y, out=y)
        if y.ndim == 2:
            gx = y
            gx[numpy.arange(len(t)), numpy.maximum(t, 0)] -= 1
            if self.class_weight is not None:
                shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
                c = _broadcast_to(self.class_weight.reshape(shape), x.shape)
                c = c[numpy.arange(len(t)), numpy.maximum(t, 0)]
                gx *= _broadcast_to(numpy.expand_dims(c, 1), gx.shape)
            gx *= (t != self.ignore_label).reshape((len(t), 1))
        else:
            # in the case where y.ndim is higher than 2,
            # we think that a current implementation is inefficient
            # because it yields two provisional arrays for indexing.
            n_unit = t.size // len(t)
            gx = y.reshape(y.shape[0], y.shape[1], -1)
            fst_index = numpy.arange(t.size) // n_unit
            trd_index = numpy.arange(t.size) % n_unit
            gx[fst_index, numpy.maximum(t.ravel(), 0), trd_index] -= 1
            if self.class_weight is not None:
                shape = [1 if d != 1 else -1 for d in six.moves.range(x.ndim)]
                c = _broadcast_to(self.class_weight.reshape(shape), x.shape)
                c = c.reshape(gx.shape)
                c = c[fst_index, numpy.maximum(t.ravel(), 0), trd_index]
                c = c.reshape(y.shape[0], 1, -1)
                gx *= _broadcast_to(c, gx.shape)
            gx *= (t != self.ignore_label).reshape((len(t), 1, -1))
            gx = gx.reshape(y.shape)
        if self.reduce == 'mean':
            gx *= gloss * self._coeff
        else:
            gx *= gloss[:, None]
        return gx, None
```

Here, the `if y.ndim == 2:` branch computes $y - t$, and you can see that class_weight is broadcast and multiplied into the backpropagated value, just as expected.

You can also see that gloss is multiplied in at the end. gloss is grad_outputs[0], that is, the grad member of the output Variable. Its initial value can be checked, so let's look at it.

```python
>>> loss_nonweight.backward()
>>> loss_withweight.backward()
>>> loss_nonweight.grad
array(1.0, dtype=float32)
>>> loss_withweight.grad
array(1.0, dtype=float32)
```

Of course it would be a problem if it were anything else: the initial backpropagated value is $\frac{\partial L}{\partial L} = 1$, so this result looks correct.

Also, besides gloss, the value is multiplied by _coeff, but this is just the reciprocal of the batch size used during mini-batch training (that is, the factor for averaging), and in this case it is 1. Incidentally, _coeff is also multiplied in when the loss itself is computed.
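As a quick illustration of _coeff (a small check I added, assuming the defaults normalize=True and reduce='mean' shown in the signature above):

```python
import numpy as np
import chainer.functions as F

# Two identical samples: with reduce='mean' the summed loss is multiplied by
# _coeff = 1 / count, i.e. divided by the batch size of 2.
xb = np.array([[1.0, 0.0],
               [1.0, 0.0]], dtype=np.float32)
tb = np.array([1, 1], dtype=np.int32)
print(F.softmax_cross_entropy(xb, tb).data)   # ~1.3132617, not the sum 2.6265
```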

The point is that the weight defined by class_weight affects learning directly, as expected. The next experiment is a little forced, but let's call backward_cpu by hand.

```python
>>> sce_nonweight.backward_cpu((x, t), [loss_nonweight.grad])
(array([[ 0.7310586, -0.7310586]], dtype=float32), None)
>>> sce_withweight.backward_cpu((x, t), [loss_withweight.grad])
(array([[ 1.4621172, -1.4621172]], dtype=float32), None)
```

Since `chainer.functions.softmax(x).data` is `array([[ 0.7310586, 0.26894143]], dtype=float32)`, you can see that the backpropagated value is indeed $y - t$. And the backpropagated value is properly doubled when the weight is applied. Congratulations.
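For completeness, the same gradients can be reproduced by hand with numpy (my own check; gloss and _coeff are both 1 here, so they are omitted):

```python
import numpy as np

x = np.array([[1.0, 0.0]], dtype=np.float32)
t = np.array([1], dtype=np.int32)
cw = np.array([1.0, 2.0], dtype=np.float32)

y = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)   # Softmax(x)
onehot = np.eye(2, dtype=np.float32)[t]

print(y - onehot)                     # [[ 0.7310586 -0.7310586]]
print(cw[t][:, None] * (y - onehot))  # [[ 1.4621172 -1.4621172]]
```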

In conclusion, the weight given in class_weight is reflected proportionally in the backpropagated value as well.

Conclusion

The argument class_weight of softmax_cross_entropy as implemented in Chainer:

- multiplies the log-softmax output $\log(\mathrm{Softmax}(x))$ of each class by its weight, so the loss contributed by a class with weight $w$ becomes $w$ times larger, and
- multiplies the backpropagated value $y - t$ by the same factor, so that class is learned correspondingly more strongly.

That is what I found out.

I am not sure who will find this useful, but I hope it helps someone. I would appreciate it if you could point out anything that is wrong.
