Implemented SimRank in Python

Preface

SimRank is a method of calculating (recursively) the similarity between nodes in a graph.

It has the feature that the similarity can be calculated without having a common adjacent node.

Papers Easy-to-understand explanation in Japanese Wikipedia

By the way, the second author of the SimRank paper is a super-famous DB researcher.

Implementation

In the paper, there was a method to calculate SimRank for bipartite graph, so I implemented it as well.

`simrank.py`


import numpy as np

def simrank(G,C,n,t=10):
    S = np.identity(n)
    I = np.identity(n)
    G = normalize(G)
    i = 1
    while True:
        S = C * np.dot(np.dot(G.T,S),G) + (1-C) * I
        for j in range(n):
            S[j][j] = 1
        if i >= t:
            break
        i += 1
    return S

def bipertite_simrank(G,C,n,m,diag_1=True,t=10):
    S1 = np.identity(n)
    S2 = np.identity(m)
    I1 = np.identity(n)
    I2 = np.identity(m)
    G_T = normalize(G.T)
    G = normalize(G)
    i = 1
    while True:
        S2 = C * np.dot(np.dot(G.T,S1),G) + (1-C) * I2
        for j in range(m):
            S2[j][j] = 1
        S1 = C * np.dot(np.dot(G_T.T,S2),G_T) + (1-C) * I1
        for j in range(n):
            S1[j][j] = 1
        if i >= t:
            break
        i += 1
    return (S1,S2)

def normalize(G):
    s = G.sum(0)
    return G/s

if __name__ == '__main__':
    C = 0.8

    """ univ """
    n = 5
    G = np.zeros((n,n))
    G[0][1] = 1
    G[0][2] = 1
    G[1][3] = 1
    G[2][4] = 1
    G[3][0] = 1
    G[4][2] = 1

    S = simrank(G,C,n)
    print S

    print '=================='

    """ cake """
    n = 2
    m = 4
    G = np.zeros((n,m))
    G[0][0] = 1
    G[0][1] = 1
    G[0][2] = 1
    G[1][1] = 1
    G[1][2] = 1
    G[1][3] = 1

    S1, S2 = bipertite_simrank(G,C,n,m)
    print S1
    print S2

As a matter of fact, just calculate and update the matrix product.

The value of C is said to be about 0.8 in the paper, but it seems that about 0.6 is good in subsequent research (source is Wikipedia).

Also, the number of iterations is enough in the paper, but it seems that it is better to do more in subsequent research (source is Wikipedia).

result

>> python simrank.py
[[ 1.          0.          0.1321943   0.          0.032768  ]
 [ 0.          1.          0.4131072   0.          0.10575544]
 [ 0.1321943   0.4131072   1.          0.04096     0.08687124]
 [ 0.          0.          0.04096     1.          0.33048576]
 [ 0.032768    0.10575544  0.08687124  0.33048576  1.        ]]
==================
[[ 1.          0.54658196]
 [ 0.54658196  1.        ]]
[[ 1.          0.61863088  0.61863088  0.43726175]
 [ 0.61863088  1.          0.61863088  0.61863088]
 [ 0.61863088  0.61863088  1.          0.61863088]
 [ 0.43726175  0.61863088  0.61863088  1.        ]]

It matches the results in the paper (Figure 1, Figure 2), so I'm sure it matches!