I tried to implement ListNet of rank learning with Chainer

Introduction

We will implement ListNet, which is a learning to rank method, in Chainer!

This article is the 7th day of Chainer Advent Calendar 2016.

Explanation of the method

First of all, regarding rank learning, sz_dr wrote a great article on the 5th day of Advent Calender, so please check it out. Please look. In a nutshell, for those who don't have the time, "learn ordering under supervised conditions when there are multiple data in a set (query) and they are given a relative scale." It's a problem. The difference from normal supervised learning is that labels do not take absolute numbers between queries.

Difference from RankNet

I think RankNet is the first thing that many people think of when it comes to neural network + rank learning. In fact, there are multiple methods for formulating rank learning, with the difference that RankNet is a pairwise method and ListNet is a listwise method.

Pairwise Listwise
pairwise_small.png listwise_small.png
Perform by sampling innumerable pairs from one query Learn for each query

The image is taken from DSIRNLP # 1 Ranking Learning Beginning

ListNet basic idea

ListNet is based on the idea of Permutation probability distribution (PPD). As shown in the figure below, PPD is a probability distribution of the likelihood of each permutation of data. PPD can be calculated from the score for each data. Given a certain order G, the order PPD is given by the following equation.

P(\pi|\mathbf{s}) = \prod_{j}^n{\frac{exp(s_j)}{\sum_{i=j}^{n}{exp(s_i)}}}

However, $ s_j \ in \ mathbf {s} $ indicates the score of the data $ j $, $ n $ indicates the number of data per query, and $ \ pi $ indicates the order. This formula becomes the following formula when written in the example of $ n = 3 $.

P(\pi) = \frac{exp(s_0)}{exp(s_0) + exp(s_1) + exp(s_2)}\frac{exp(s_1)}{exp(s_1) + exp(s_2)}\frac{exp(s_2)}{exp(s_2)}

Permutation Probability Loss (PPL) takes the cross entropy of two PPDs [^ 1],

PPL = -\sum{P(\pi|\bar{\mathbf{s}}) \log P(\pi|\mathbf{s})}

PPL has the advantage that it can be learned purely from the order, compared to the case where the teacher data is trained by square loss by adding numbers such as (1.0, 0.9, 0.8, ..) at equal intervals.

permutation_probability_small.png

Image from Yan Liu, Learning to Rank: from Pairwise Approach to Listwise Approach Quote.

However, if you calculate the probabilities of all the order, the amount of calculation will be $ L! $, And it will not be possible to calculate in practice. Therefore, in general, the calculation amount of $ L! / (L-k)! $ Is set by considering only the top $ k $ lineup [^ 2].

P(\pi|\mathbf{s}) = \prod_{j}^k{\frac{exp(s_j)}{\sum_{i=j}^{n}{exp(s_i)}}}

For example, if $ k = 1 $, which is also used in the paper, PPL will be the same formula as softmax.

ListNet can be implemented with any differentiable function to calculate the score for each data just by using PPL. The original paper implements score calculation with Feed forward NN. ..

Implementation

First is the implementation of PPL ($ k = 1 $).


def permutation_probability_loss(x, t, length):
    length = length.reshape(-1, 1)
    # log_p: (batch size, n)
    log_p_x = x - F.broadcast_to(F.expand_dims(F.logsumexp(x, axis=1), 1), x.shape)
    # p_t: (batch size, n)
    p_t = F.softmax(t)

    # loss normalized over all instances
    loss = p_t * log_p_x
    mask = np.tile(np.arange(x.shape[1]).reshape(1, -1), (x.shape[0],  1)) < length
    mask = chainer.Variable(mask)
    padding = chainer.Variable(np.zeros(x.shape, dtype=x.dtype))
    loss = F.where(mask, loss, padding)

    return -F.sum(loss / length) / p_t.shape[0]

This time, I implemented it on the assumption that L is variable. By masking unnecessary data with length, data of different lengths can be sent to the same batch. As Pitfall, if you calculate the formula according to the paper, the calculation of division and exp becomes unstable, so you need to use logsumexp which is also prepared in chainer.

The score was calculated by Perceptron, which receives the feature quantity (B, L, M) and outputs the score (B, L). Since the data in the list are independent of each other, we modified (B, L, M) to (B, L * M) and finally implemented it in the original form.

class ListNet(chainer.Chain):
    def __init__(self, input_size, n_units, dropout):
        super(ListNet, self).__init__(
            l1=L.Linear(input_size, n_units),
            l2=L.Linear(n_units, 1))

        self.add_persistent("_dropout", dropout)

    def __call__(self, x, train=True):
        s = list(x.shape)
        n_tokens = np.prod(s[:-1])
        x = F.reshape(x, (n_tokens, -1))

        if self._dropout > 0.:
            x = F.dropout(x, self._dropout, train=train)
        o_1 = F.relu(self.l1(x))
        if self._dropout > 0.:
            o_1 = F.dropout(o_1, self._dropout, train=train)

        # o_2: (N*M, 1)
        o_2 = self.l2(o_1)

        return F.reshape(o_2, s[:-1])

All implementations can be found on github.

Experiment

The dataset used was MQ2007 from LETOR 4.0. Since I didn't have time, the parameters were fixed feeling, and I didn't devise anything other than early stopping.

The evaluation target is mean average precision (MAP).

result

TRAIN: 0.4693
DEV:   0.4767
TEST:  0.4877

(Corrected on August 29, 2017: See comments)

~~ Original paper ~~ Results by the authors with the same data [^ 3].

TRAIN: 0.4526
DEV:   0.4790
TEST:  0.4884

~~ Well, there is a lot of difference. In the original paper, "Perceptron for score calculation" was used, and it was written only with a certain degree of particle size, so there may have been some ingenuity. ~~ The performance is about the same. Instead of tuning less, I was successful using a technique that didn't exist at the time (i.e. Adam optimizer, dropout, ReLU activation). (Corrected on August 29, 2017)

The biggest advantage of ListNet is that it's fast anyway. I haven't actually benchmarked it, but I think it's about 100 times faster than RankNet.

At the end

I used Tensorflow and chainer half and half this year, but I think chainer is overwhelmingly faster in prototype speed & debugging. Also, cupy is really good! Due to the number of users, there are few implementation examples, and I feel that Tensorflow has been a little lately, so I would like to expose it to the excitement.


change history:

[^ 1]: The figure shows KL divergence distance, but it seems that cross entropy is actually used. Since $ P (\ pi | \ bar {\ mathbf {s}}) $ is fixed, it seems that the optimization will be the same after all. [^ 2]: After all, you're sampling from the list! Without the slogan. Moreover, it is usually done with k = 1 ... [^ 3]: I remember that the benchmark results were published at https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/, but ~~ As of August 27, 2017, the results were not disclosed. ~~ When I emailed the author, he republished.

Recommended Posts

I tried to implement ListNet of rank learning with Chainer
I tried to implement Autoencoder with TensorFlow
I tried to implement CVAE with PyTorch
I tried to implement reading Dataset with PyTorch
I tried to implement deep learning that is not deep with only NumPy
I tried to implement PCANet
I tried to implement StarGAN (1)
I tried to learn the sin function with chainer
I tried to move machine learning (ObjectDetection) with TouchDesigner
I tried to extract features with SIFT of OpenCV
I tried to implement and learn DCGAN with PyTorch
I tried to implement an artificial perceptron with python
I tried to implement time series prediction with GBDT
I tried to implement Grad-CAM with keras and tensorflow
I tried to implement SSD with PyTorch now (Dataset)
I tried to implement automatic proof of sequence calculation
I tried to implement Cifar10 with SONY Deep Learning library NNabla [Nippon Hurray]
I tried to implement a volume moving average with Quantx
I tried to find the entropy of the image with python
I tried to implement Deep VQE
I tried to find the average of the sequence with TensorFlow
I tried to implement anomaly detection by sparse structure learning
I tried machine learning with liblinear
I tried to implement breakout (deception avoidance type) with Quantx
I tried to implement adversarial validation
Mayungo's Python Learning Episode 3: I tried to print numbers with print
I tried to implement hierarchical clustering
I tried to implement Harry Potter sort hat with CNN
[Machine learning] I tried to summarize the theory of Adaboost
I tried to implement Realness GAN
I tried to implement Perceptron Part 1 [Deep Learning from scratch]
I tried to implement blackjack of card game in Python
I tried learning LightGBM with Yellowbrick
I tried to divide with a deep learning language model
I tried to implement SSD with PyTorch now (model edition)
I tried to make Othello AI that I learned 7.2 million hands by deep learning with Chainer
I tried to make deep learning scalable with Spark × Keras × Docker
I tried to create a list of prime numbers with python
I tried to fix "I tried stochastic simulation of bingo game with Python"
I tried to implement sentence classification by Self Attention with PyTorch
I tried to expand the size of the logical volume with LVM
I tried running the DNN part of OpenPose with Chainer CPU
I tried to improve the efficiency of daily work with Python
I tried to automatically collect images of Kanna Hashimoto with Python! !!
I tried to make a mechanism of exclusive control with Go
I tried to implement PLSA in Python
I tried to implement permutation in Python
I tried to visualize AutoEncoder with TensorFlow
I tried to get started with Hy
I tried to implement PLSA in Python 2
I tried to implement ADALINE in Python
I tried to implement PPO in Python
I tried to solve TSP with QAOA
I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Introduction ~
I tried to make Othello AI with tensorflow without understanding the theory of machine learning ~ Implementation ~
I tried to implement a blockchain that actually works with about 170 lines
[Deep Learning from scratch] I tried to implement sigmoid layer and Relu layer.
765 I tried to identify the three professional families by CNN (with Chainer 2.0.0)
Mayungo's Python Learning Episode 2: I tried to put out characters with variables
I tried to get the authentication code of Qiita API with Python.
[Reinforcement learning] Finally surpassed humans! ?? I tried to explain / implement Agent57 (Keras-RL)