I tried to implement ListNet of rank learning with Chainer

Introduction

We will implement ListNet, which is a learning to rank method, in Chainer!

This article is the 7th day of Chainer Advent Calendar 2016.

Explanation of the method

First of all, regarding rank learning, sz_dr wrote a great article on the 5th day of Advent Calender, so please check it out. Please look. In a nutshell, for those who don't have the time, "learn ordering under supervised conditions when there are multiple data in a set (query) and they are given a relative scale." It's a problem. The difference from normal supervised learning is that labels do not take absolute numbers between queries.

Difference from RankNet

I think RankNet is the first thing that many people think of when it comes to neural network + rank learning. In fact, there are multiple methods for formulating rank learning, with the difference that RankNet is a pairwise method and ListNet is a listwise method.

Pairwise	Listwise

Perform by sampling innumerable pairs from one query	Learn for each query

The image is taken from DSIRNLP # 1 Ranking Learning Beginning

ListNet basic idea

ListNet is based on the idea of Permutation probability distribution (PPD). As shown in the figure below, PPD is a probability distribution of the likelihood of each permutation of data. PPD can be calculated from the score for each data. Given a certain order G, the order PPD is given by the following equation.

P(\pi|\mathbf{s}) = \prod_{j}^n{\frac{exp(s_j)}{\sum_{i=j}^{n}{exp(s_i)}}}

However, $ s_j \ in \ mathbf {s} $ indicates the score of the data $ j $, $ n $ indicates the number of data per query, and $ \ pi $ indicates the order. This formula becomes the following formula when written in the example of $ n = 3 $.

P(\pi) = \frac{exp(s_0)}{exp(s_0) + exp(s_1) + exp(s_2)}\frac{exp(s_1)}{exp(s_1) + exp(s_2)}\frac{exp(s_2)}{exp(s_2)}

Permutation Probability Loss (PPL) takes the cross entropy of two PPDs [^ 1],

PPL = -\sum{P(\pi|\bar{\mathbf{s}}) \log P(\pi|\mathbf{s})}

PPL has the advantage that it can be learned purely from the order, compared to the case where the teacher data is trained by square loss by adding numbers such as (1.0, 0.9, 0.8, ..) at equal intervals.

Image from Yan Liu, Learning to Rank: from Pairwise Approach to Listwise Approach Quote.

However, if you calculate the probabilities of all the order, the amount of calculation will be $ L! $, And it will not be possible to calculate in practice. Therefore, in general, the calculation amount of $ L! / (L-k)! $ Is set by considering only the top $ k $ lineup [^ 2].

P(\pi|\mathbf{s}) = \prod_{j}^k{\frac{exp(s_j)}{\sum_{i=j}^{n}{exp(s_i)}}}

For example, if $ k = 1 $, which is also used in the paper, PPL will be the same formula as softmax.

ListNet can be implemented with any differentiable function to calculate the score for each data just by using PPL. The original paper implements score calculation with Feed forward NN. ..

Implementation

First is the implementation of PPL ($ k = 1 $).


def permutation_probability_loss(x, t, length):
    length = length.reshape(-1, 1)
    # log_p: (batch size, n)
    log_p_x = x - F.broadcast_to(F.expand_dims(F.logsumexp(x, axis=1), 1), x.shape)
    # p_t: (batch size, n)
    p_t = F.softmax(t)

    # loss normalized over all instances
    loss = p_t * log_p_x
    mask = np.tile(np.arange(x.shape[1]).reshape(1, -1), (x.shape[0],  1)) < length
    mask = chainer.Variable(mask)
    padding = chainer.Variable(np.zeros(x.shape, dtype=x.dtype))
    loss = F.where(mask, loss, padding)

    return -F.sum(loss / length) / p_t.shape[0]

This time, I implemented it on the assumption that L is variable. By masking unnecessary data with length, data of different lengths can be sent to the same batch. As Pitfall, if you calculate the formula according to the paper, the calculation of division and exp becomes unstable, so you need to use logsumexp which is also prepared in chainer.

The score was calculated by Perceptron, which receives the feature quantity (B, L, M) and outputs the score (B, L). Since the data in the list are independent of each other, we modified (B, L, M) to (B, L * M) and finally implemented it in the original form.

class ListNet(chainer.Chain):
    def __init__(self, input_size, n_units, dropout):
        super(ListNet, self).__init__(
            l1=L.Linear(input_size, n_units),
            l2=L.Linear(n_units, 1))

        self.add_persistent("_dropout", dropout)

    def __call__(self, x, train=True):
        s = list(x.shape)
        n_tokens = np.prod(s[:-1])
        x = F.reshape(x, (n_tokens, -1))

        if self._dropout > 0.:
            x = F.dropout(x, self._dropout, train=train)
        o_1 = F.relu(self.l1(x))
        if self._dropout > 0.:
            o_1 = F.dropout(o_1, self._dropout, train=train)

        # o_2: (N*M, 1)
        o_2 = self.l2(o_1)

        return F.reshape(o_2, s[:-1])

All implementations can be found on github.

Experiment

The dataset used was MQ2007 from LETOR 4.0. Since I didn't have time, the parameters were fixed feeling, and I didn't devise anything other than early stopping.

The evaluation target is mean average precision (MAP).

result

TRAIN: 0.4693
DEV:   0.4767
TEST:  0.4877

(Corrected on August 29, 2017: See comments)

~~ Original paper ~~ Results by the authors with the same data [^ 3].

TRAIN: 0.4526
DEV:   0.4790
TEST:  0.4884

~~ Well, there is a lot of difference. In the original paper, "Perceptron for score calculation" was used, and it was written only with a certain degree of particle size, so there may have been some ingenuity. ~~ The performance is about the same. Instead of tuning less, I was successful using a technique that didn't exist at the time (i.e. Adam optimizer, dropout, ReLU activation). (Corrected on August 29, 2017)

The biggest advantage of ListNet is that it's fast anyway. I haven't actually benchmarked it, but I think it's about 100 times faster than RankNet.

At the end

I used Tensorflow and chainer half and half this year, but I think chainer is overwhelmingly faster in prototype speed & debugging. Also, cupy is really good! Due to the number of users, there are few implementation examples, and I feel that Tensorflow has been a little lately, so I would like to expose it to the excitement.

change history:

2016/12/7: It was too rough, so I added the formula.
2017/8/27: List Net benchmark acquisition source is listed.
2017/8/29: The experimental results have been updated. See comments for details. I also posted a link to the repository.
2017/12/22: Added because the author republished the benchmark.

[^ 1]: The figure shows KL divergence distance, but it seems that cross entropy is actually used. Since $ P (\ pi | \ bar {\ mathbf {s}}) $ is fixed, it seems that the optimization will be the same after all. [^ 2]: After all, you're sampling from the list! Without the slogan. Moreover, it is usually done with k = 1 ... [^ 3]: I remember that the benchmark results were published at https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/, but ~~ As of August 27, 2017, the results were not disclosed. ~~ When I emailed the author, he republished.