What is this article?

Within the framework of machine learning, there is rank learning (Learning to Rank). RankNet, which uses a neural network, is one of the rank learning models. In this article, I will talk about implementing RankNet using Chainer.

Looking at the prediction results, the implementation seems to be correct, but if there are any mistakes such as how to use Chainer, I would be grateful if you could point out.

What is rank learning in the first place

In rank learning, a set of $ ({\ mathbf x}, y) = $ (features, goodness of fit) is learned. We will learn a function $ f $ that returns a large value for items with high goodness of fit and a small value for items with low goodness of fit.

Now suppose two items $ A $ and $ B $ are represented as follows.

A=(\mathbf{x}_A, y_A) \\
B=(\mathbf{x}_B, y_B)

Also assume that $ A $ has a greater goodness of fit than $ B $. In other words, suppose $ y_A> y_B $ holds. At this time, the function $ f $ learns so that the following relationship holds.

f(\mathbf{x}_A)>f(\mathbf{x}_B)

It feels like something is wrong, so love live! I will explain using. First of all, love live! Let's add the goodness of fit (☆, ☆☆, ☆☆☆) to the character of. The larger the number of ☆, the higher the goodness of fit (favorite).

If you learn the following pairs, you should get the function $ f $ like this.

When $ f ({\ bf x}) = {\ bf w} ^ T {\ bf x} $, the weight $ {\ bf w} $ is orthogonally projected on the features as shown in the figure below. It is estimated that the points are arranged in order of goodness of fit. In the actual problem, there would be no $ {\ bf w} $ that would be perfectly ordered in order of fit, but ... (⇒ linear separability)

The function $ f $ obtained as a result of learning can be used to predict the ranking of characters that have not yet been fitted.

Here, I wrote "learning pairs", but in fact, there are roughly three methods for rank learning, and it corresponds to the method called pairwise. There are other methods such as pointwise and listwise, but please refer to the following materials for these methods.

DSIRNLP # 1 Beginning of ranking learning
[I implemented rank learning with Confidence Weighted] (http://www.slideshare.net/tkng/confidence-weighted)

About RankNet

RankNet is a model of rank learning based on a neural network and is a pairwise method. The prediction model itself is the same as a normal neural network, but the design of the cost function is different. However, the idea is the same as the cross entropy function that is often used as a cost function.

As above, suppose the items $ A $ and $ B $ are represented as follows, and $ A $ is better fit than $ B $.

A=(\mathbf{x}_A, y_A) \\
B=(\mathbf{x}_B, y_B)

The scores of $ A $ and $ B $ by the function $ f $ are calculated as follows.

s_A=f(\mathbf{x}_A) \\
s_B=f(\mathbf{x}_B)

The probability that $ A $ is ranked higher than $ B $ output by RankNet is expressed as follows using $ s_A $ and $ s_B $.

P_{AB}=\frac{1}{1+e^{-\sigma(s_A-s_B)}}

Intuitively, we can see that as $ s_A $ increases, $ P_ {AB} $ approaches 1, and as $ s_B $ increases, $ P_ {AB} $ approaches 0.

Using $ P_ {AB} $, the cost function is expressed as follows.

C_{AB}=-\bar{P}_{AB}\log{P_{AB}}-(1-\bar{P}_{AB})\log(1-P_{AB})

Here, $ \ bar {P} _ {AB} $ is the known probability that $ A $ will be ranked higher than $ B $. In References, the settings were as follows.

\bar{P}_{AB}=\begin{cases}
1 & (A \succ B) \\
0 & (A \prec B) \\
\frac{1}{2} & (otherwise)
\end{cases}

$ C_ {AB} $ is just a cross-entropy function, and the farther $ P_ {AB} $ is $ \ bar {P} _ {AB} $, the larger the value. RankNet learns to minimize this cost function.

Implementation of RankNet on Chainer

Now, let's implement RankNet with Chainer. The loss function softmax_cross_entropy () is implemented in Chainer, but softmax_cross_entropy () cannot be used because it can only receive int32 type as the target variable. ($ \ Bar {P} _ {AB} $ can be $ \ frac {1} {2} $.)

Therefore, I have to implement the cost function $ C_ {AB} $ by myself, but I can learn without implementing ** backward () which is implemented in other loss functions *. *… I wonder if it is automatically differentiated numerically if backward () is not implemented, I would be grateful if you could comment on it.

The following is an implementation of RankNet. As a neural network model, a simple multi-layer perceptron of [input layer]-> [hidden layer]-> [hidden layer]-> [output layer] is used.

`net.py`


from chainer import Chain
import chainer.functions as F
import chainer.links as L


class MLP(Chain):

    def __init__(self, n_in, n_hidden):
        super(MLP, self).__init__(
            l1=L.Linear(n_in, n_hidden),
            l2=L.Linear(n_hidden, n_hidden),
            l3=L.Linear(n_hidden, 1)
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)


class RankNet(Chain):

    def __init__(self, predictor):
        super(RankNet, self).__init__(predictor=predictor)

    def __call__(self, x_i, x_j, t_i, t_j):
        s_i = self.predictor(x_i)
        s_j = self.predictor(x_j)
        s_diff = s_i - s_j
        if t_i.data > t_j.data:
            S_ij = 1
        elif t_i.data < t_j.data:
            S_ij = -1
        else:
            S_ij = 0
        self.loss = (1 - S_ij) * s_diff / 2. + \
            F.math.exponential.Log()(1 + F.math.exponential.Exp()(-s_diff))
        return self.loss

Let's apply the implemented RankNet to artificial data. The artificial data was generated as follows.

X=({\bf x}_1, {\bf x}_2, \dots, {\bf x}_{1000})^{\rm T} \\
{\bf y}=(y_1, y_2, \dots, y_{1000}) \\
{\bf w} = {\mathcal N}({\bf 0}, {\bf 1}) \\
y_i = uniform(1,5) \\
{\bf x}_i={\mathcal N}(0, 5) + y_i{\bf w}

1000 pieces of 50-dimensional data, goodness of fit is either [1,2,3,4,5]. When plotted in PCA space, the figure is as follows.

Somehow, the goodness of fit seems to be higher toward the upper left of the figure, and smaller toward the lower right.

Below is the learning and evaluation code. The training uses the stochastic gradient descent method, and the training data is randomly sampled each time. NDCG @ 100 was used to evaluate the prediction accuracy.

`train_toy.py`


import numpy as np
from chainer import Variable, optimizers
from sklearn.cross_validation import train_test_split
import net
import matplotlib.pyplot as plt
import seaborn as sns


def make_dataset(n_dim, n_rank, n_sample, sigma):
    ys = np.random.random_integers(n_rank, size=n_sample)
    w = np.random.randn(n_dim)
    X = [sigma * np.random.randn(n_dim) + w * y for y in ys]
    X = np.array(X).astype(np.float32)
    ys = np.reshape(np.array(ys), (-1, 1))
    return X, ys


def ndcg(y_true, y_score, k=100):
    y_true = y_true.ravel()
    y_score = y_score.ravel()
    y_true_sorted = sorted(y_true, reverse=True)
    ideal_dcg = 0
    for i in range(k):
        ideal_dcg += (2 ** y_true_sorted[i] - 1.) / np.log2(i + 2)
    dcg = 0
    argsort_indices = np.argsort(y_score)[::-1]
    for i in range(k):
        dcg += (2 ** y_true[argsort_indices[i]] - 1.) / np.log2(i + 2)
    ndcg = dcg / ideal_dcg
    return ndcg

if __name__ == '__main__':
    np.random.seed(0)
    n_dim = 50
    n_rank = 5
    n_sample = 1000
    sigma = 5.
    X, ys = make_dataset(n_dim, n_rank, n_sample, sigma)
    X_train, X_test, y_train, y_test = train_test_split(X, ys, test_size=0.33)

    n_iter = 2000
    n_hidden = 20
    loss_step = 50
    N_train = np.shape(X_train)[0]

    model = net.RankNet(net.MLP(n_dim, n_hidden))
    optimizer = optimizers.Adam()
    optimizer.setup(model)

    N_train = np.shape(X_train)[0]
    train_ndcgs = []
    test_ndcgs = []
    for step in range(n_iter):
        i, j = np.random.randint(N_train, size=2)
        x_i = Variable(X_train[i].reshape(1, -1))
        x_j = Variable(X_train[j].reshape(1, -1))
        y_i = Variable(y_train[i])
        y_j = Variable(y_train[j])
        model.zerograds()
        loss = model(x_i, x_j, y_i, y_j)
        loss.backward()
        optimizer.update()
        if (step + 1) % loss_step == 0:
            train_score = model.predictor(Variable(X_train))
            test_score = model.predictor(Variable(X_test))
            train_ndcg = ndcg(y_train, train_score.data)
            test_ndcg = ndcg(y_test, test_score.data)
            train_ndcgs.append(train_ndcg)
            test_ndcgs.append(test_ndcg)
            print("step: {0}".format(step + 1))
            print("NDCG@100 | train: {0}, test: {1}".format(train_ndcg, test_ndcg))

    plt.plot(train_ndcgs, label="Train")
    plt.plot(test_ndcgs, label="Test")
    xx = np.linspace(0, n_iter / loss_step, num=n_iter / loss_step + 1)
    labels = np.arange(loss_step, n_iter + 1, loss_step)
    plt.xticks(xx, labels, rotation=45)
    plt.legend(loc="best")
    plt.xlabel("step")
    plt.ylabel("NDCG@10")
    plt.ylim(0, 1.1)
    plt.tight_layout()
    plt.savefig("result.pdf")

Since the training data and the test data are generated from the same distribution, it is natural to say that it is natural, but you can see how the evaluation values increase together.

Other

I want to make something interesting

Rank learning using neural network (Implementation of RankNet by Chainer)