1.First of all

This article is ** a summary of network analysis (community extraction) by python ** and a simple usage of NetworkX, which is a convenient library for network analysis **. ** Implement spectral clustering with numpy ** as an algorithm for community extraction.

As a personal motivation, read this article (Derive Probabilistic Matrix Factorization and Implement it in Edward). Book (Relationship data learning) I started reading it because I thought it would be interesting, so it's like a summary.

2. Environment construction

2.1. Implementation environment this time

Python 3.6.0 :: Anaconda 4.3.1 (x86_64)
numpy 1.11.3

2.2. Installation

pip install networkx==1.9.1

Install networkx 1.9.1. The latest version is 1.11 but when I use it I get `NetworkXError: cannot tokenize'graph' at (2, 1)` when loading data. When I looked it up, there was a person who said, "I got it, but when I downgraded to 1.9.1, I could use it," so I will learn from it.

3. About data

3.1. Source of data

Here has published a number of trial data related to network analysis.

Les Miserables: coappearance network of characters in the novel Les Miserables. Please cite D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA (1993).

Is used.

3.2. Data content

This data seems to be the data of people who appeared together in the same scene in the novel of Les Miserables. The number of times they appear together becomes the edge value, and the node becomes a character.

It's interesting data, I was wondering if there was such data, but after finishing writing the code, I found this article by Mr. TJO, who is famous in the twitter area.

Play with UCI machine learning repository data (etc.) (2): Person correlation diagram of "Les Miserables"

I am trying different methods for the same community extraction task. If you read this article, you will be able to understand what kind of data this Les Misérables data is and what you can understand by extracting the community.

After reading this article, I thought that the network analysis library was better in R than in python. Or should I use cytoscape etc.? (Or I just don't know anything else)

3.3. Data format

`.The data with the extension gml is in the zip of the above link. gml is an abbreviation for Graph Modeling Language, which is a storage format for graph structure data.`



 The graph node and edge information is stored in the following structure.

```sh
$ cat lesmis.gml
Creator "Mark Newman on Fri Jul 21 12:44:53 2006"
graph
[
  node
  [
    id 0
    label "Myriel"
  ]
  node
  [
    id 1
    label "Napoleon"
  ]
  
  ...
  
   edge
  [
    source 14
    target 11
    value 1
  ]
  edge
  [
    source 15
    target 11
    value 1

3.4. Reading data

The .gml data can be read into the graph object defined by NetworkX with the NetworkX read_gml function.

import networkx as nx
data_path = "./data/lesmis/lesmis.gml"
G = nx.read_gml(data_path)

If you want to make this an array of numpy,

X = np.array(nx.to_numpy_matrix(G))

You can do it like this. The return value of `` `to_numpy_matrix``` is numpy matrix, but since ndarray is easier to use personally, I changed it to ndarray as above.

By the way, in this spectral clustering, it is assumed that the matrix of the graph is a symmetric matrix. As you can see by looking at the data, NetworkX

nx.is_directed(G)
# False

You can check whether the graph is a directed graph (not a symmetric matrix) or an undirected graph (symmetric matrix) with the function. It is an undirected graph because it is False by checking whether it is directed or not.

4. Code

4.1. Overview

here, --Spectral clustering based on non-normalized graph Laplacian --Spectral clustering based on drunken normalization graph Laplacian Two are implemented.

In the method based on the denormalized graph Laplacian, when clustering is executed, it tends to be classified into one node and the other, and the clustering is subtle, so the improved version is based on the normalized graph Laplacian. There are different types of normalized graph Laplacian depending on the normalization method, and the drunken normalized graph Laplacian is implemented.

4.2. Definition of classes and functions

Define the class that actually performs spectral clustering and the function required to plot the execution result.

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.cluster import KMeans

class graph_clustering():
    def __init__(self, K, n_clusters, lap_kind):
        """
        K : int
        n_clusters : int
        lap_kind : None or str
            Ex. "rw"
        """
        self.lap_kind = lap_kind
        self.K = K
        self.n_clusters = n_clusters

    def _calc_g_laplacian(self, X):
        D = np.diag(np.sum(X, axis=0))
        L = D - X
        self.L = L
        self.D = D
        return L

    def _calc_rw_g_laplacian(self, X):
        L = self._calc_g_laplacian(X)
        Lrw = np.dot(np.linalg.inv(self.D), L)
        self.L = Lrw
        return Lrw

    def _get_K_eigen_vec(self, Lam, V):
        sorted_index = np.argsort(Lam.real)
        Lam_K = Lam[sorted_index][0:self.K]
        V_K = V[sorted_index][0:self.K]

        self.Lam_K = Lam_K
        self.V_K = V_K

        return Lam_K, V_K

    def _Kmeans_V_K(self, V_K):
        kmeans = KMeans(n_clusters=self.n_clusters, random_state=0).fit(V_K.T.real)
        clusters = kmeans.labels_
        self.clusters = clusters
        return clusters

    def laplacian_clustering(self, X):
        """
        X : ndarray
            a matrix representation of undirected graph
        """
        if self.lap_kind is None:
            L = self._calc_g_laplacian(X)
        elif self.lap_kind == "rw":
            L = self._calc_rw_g_laplacian(X)
        else: 
            raise NotImplementedError

        Lam, V = np.linalg.eig(L)
        Lam_K, V_K = self._get_K_eigen_vec(Lam, V)
        clusters = self._Kmeans_V_K(V_K)
        return clusters

# for plot
def get_labels_dic(G):
    """
    G : networkx.classes.graph.Graph
    """
    labels_dic = { key:G.node[key]['label'] for key in G.node.keys() }
    return labels_dic

def get_labels_list(G):
    labels_list = np.array([ G.node[key]['label'] for key in G.node.keys() ])
    return labels_list

def split_labels(labels_list, clusters):
    """
    labels_list : list
        return value of get_labels_list function.
    clusters : ndarray
        return value of graph_clustering.laplacian_clustering
    """
    class_uniq = np.unique(clusters)
    node_index_split= []
    labels_split = []
    index = np.arange(clusters.shape[0])
    for class_elem in class_uniq:
        node_index_split.append(list(index[clusters == class_elem]))
        labels_split.append(list(labels_list[clusters == class_elem]))
    return node_index_split, labels_split

def plot_clusters(G, node_index_split):
    """
    G : networkx.classes.graph.Graph
    node_index_split : list (2-dim list)
        return value of split_labels function.
    """
    labels_dic = get_labels_dic(G)
    pos = nx.spring_layout(G)
    colors = [ cm.jet(x) for x in np.arange(0, 1, 1/len(node_index_split)) ]
    for i, nodes in enumerate(node_index_split):
        nx.draw_networkx(G, node_color=colors[i], labels=labels_dic, nodelist=nodes, pos=pos, font_color="r")
    plt.show()

A detailed explanation of the code can be postponed, and spectral clustering and the resulting plot can be done as follows.

4.3. Move the code

data_path = "./data/lesmis/lesmis.gml"
G = nx.read_gml(data_path)
X = np.array(nx.to_numpy_matrix(G))

GClus = graph_clustering(K=15, n_clusters=5, lap_kind="rw")
clusters = GClus.laplacian_clustering(X)
labels_list = get_labels_list(G)
node_index_split, labels_split = split_labels(labels_list, clusters)
plot_clusters(G, node_index_split)

I use it like that.

graph_clustering(k=15, n_clusters=5, lap_kind="rw")The argument of iskHow many eigenvectorsn_clustersHow many clusters should be divided intolap_kindGives the type of spectral clustering algorithm. lap_kind=noneDenormalized withlap_kind="rw"Perform clustering of the drunken normalization graph Laplacian with.

4.4. About the code to plot

The plot part needed some ingenuity. I'm plotting with the `nx.draw_networkx``` function in plot_clusters () `, but I didn't have the option to plot different colors for each identified cluster, so this way I'm plotting.

nx.draw_networkx(G, node_color=colors[i], labels=labels_dic, nodelist=nodes, pos=pos, font_color="r")

Since I was able to plot a part of the whole graph with this argument `nodelist```, divide the node for each cluster and plot while changing the color with ``` node_color = colors [i] doing. Also, since this overwrites the plot more and more, if you do not specify the position with the ``pos``` argument, the position will change every time and it will be messed up.

5. Execution result

5.1. Denormalized Graph Laplacian

gclus = graph_clustering(k=15, n_clusters=5, lap_kind=none)In the case of. It's true that one node is one cluster, and it doesn't feel like a cluster at all.

5.2. Drunk walk normalization graph Laplacian

gclus = graph_clustering(k=15, n_clusters=5, lap_kind="rw")In the case of. It looks a little like a cluster.

5.3. Cluster breakdown

labels_split[0]
['MmeMagloire', 'CountessDeLo', 'Geborand']

labels_split[1]
['Myriel',
 'Napoleon',
 'MlleBaptistine',
 'Champtercier',
 'Cravatte',
 'Count',
 'OldMan',
 'Labarre',
 'Valjean',
 'MmeDeR',
 'Isabeau',
 'Gervais',
 'Tholomyes',
 'Blacheville',
 'Favourite',
 'Dahlia',
 'Zephine',
 'MmeThenardier',
 'Cosette',
 'Javert',
 'Fauchelevent',
 'Bamatabois',
 'Perpetue',
 'Woman1',
 'Judge',
 'Champmathieu',
 'Brevet',
 'Chenildieu',
 'Cochepaille',
 'Pontmercy',
 'Boulatruelle',
 'Eponine',
 'Anzelma',
 'Woman2',
 'MotherInnocent',
 'Gribier',
 'Jondrette',
 'Gavroche',
 'Gillenormand',
 'Magnon',
 'MlleGillenormand',
 'MmePontmercy',
 'MlleVaubois',
 'LtGillenormand',
 'Marius',
 'BaronessT',
 'Mabeuf',
 'Enjolras',
 'Courfeyrac',
 'Bahorel',
 'Bossuet',
 'MotherPlutarch',
 'Gueulemer',
 'Babet',
 'Montparnasse',
 'Toussaint',
 'Child2',
 'MmeHucheloup']
 
labels_split[2]
['Combeferre', 'Prouvaire', 'Joly', 'Grantaire', 'Claquesous', 'Brujon']

labels_split[3]
['Marguerite',
 'Listolier',
 'Fameuil',
 'Fantine',
 'Thenardier',
 'Simplice',
 'MmeBurgon',
 'Feuilly',
 'Child1']
 
labels_split[4]
['Scaufflaire']

What kind of cluster was divided into is like this. I only know Les Misérables in movies and I only remember Henri Fortin, so I'm not sure. Henri Fortin is in the largest cluster, I can only say.

6. At the end

--As for the theoretical story, in my current knowledge, I just copy and paste Relationship data learning, which is troublesome and meaningless, so I omit it. Then refer to this book. It's a good book. ――By the way, another clustering by symmetric normalization Laplacian appears in this book. There is no clustering, but the symmetric normalized Laplacian calculation has a function defined in NetworkX, [here](https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.linalg. laplacianmatrix.normalized_laplacian_matrix.html # networkx.linalg.laplacianmatrix.normalized_laplacian_matrix) for details. ――When performing clustering, we focus on the one with the smallest eigenvalue, which is fresh and interesting. Focus on the larger one, such as PCA.

Relationship data learning with numpy and NetworkX (spectral clustering)