This article is ** a summary of network analysis (community extraction) by python ** and a simple usage of NetworkX, which is a convenient library for network analysis **. ** Implement spectral clustering with numpy ** as an algorithm for community extraction.
As a personal motivation, read this article (Derive Probabilistic Matrix Factorization and Implement it in Edward). Book (Relationship data learning) I started reading it because I thought it would be interesting, so it's like a summary.
pip install networkx==1.9.1
Install networkx 1.9.1. The latest version is 1.11 but when I use it I get `NetworkXError: cannot tokenize'graph' at (2, 1)`
when loading data. When I looked it up, there was a person who said, "I got it, but when I downgraded to 1.9.1, I could use it," so I will learn from it.
Here has published a number of trial data related to network analysis.
Les Miserables: coappearance network of characters in the novel Les Miserables. Please cite D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA (1993).
Is used.
This data seems to be the data of people who appeared together in the same scene in the novel of Les Miserables. The number of times they appear together becomes the edge value, and the node becomes a character.
It's interesting data, I was wondering if there was such data, but after finishing writing the code, I found this article by Mr. TJO, who is famous in the twitter area.
I am trying different methods for the same community extraction task. If you read this article, you will be able to understand what kind of data this Les Misérables data is and what you can understand by extracting the community.
After reading this article, I thought that the network analysis library was better in R than in python. Or should I use cytoscape etc.? (Or I just don't know anything else)
.The data with the extension gml is in the zip of the above link. gml is an abbreviation for Graph Modeling Language, which is a storage format for graph structure data.
The graph node and edge information is stored in the following structure.
```sh
$ cat lesmis.gml
Creator "Mark Newman on Fri Jul 21 12:44:53 2006"
graph
[
node
[
id 0
label "Myriel"
]
node
[
id 1
label "Napoleon"
]
...
edge
[
source 14
target 11
value 1
]
edge
[
source 15
target 11
value 1
The .gml
data can be read into the graph object defined by NetworkX with the NetworkX read_gml
function.
import networkx as nx
data_path = "./data/lesmis/lesmis.gml"
G = nx.read_gml(data_path)
If you want to make this an array of numpy,
X = np.array(nx.to_numpy_matrix(G))
You can do it like this. The return value of `` `to_numpy_matrix``` is numpy matrix, but since ndarray is easier to use personally, I changed it to ndarray as above.
By the way, in this spectral clustering, it is assumed that the matrix of the graph is a symmetric matrix. As you can see by looking at the data, NetworkX
nx.is_directed(G)
# False
You can check whether the graph is a directed graph (not a symmetric matrix) or an undirected graph (symmetric matrix) with the function. It is an undirected graph because it is False by checking whether it is directed or not.
here, --Spectral clustering based on non-normalized graph Laplacian --Spectral clustering based on drunken normalization graph Laplacian Two are implemented.
In the method based on the denormalized graph Laplacian, when clustering is executed, it tends to be classified into one node and the other, and the clustering is subtle, so the improved version is based on the normalized graph Laplacian. There are different types of normalized graph Laplacian depending on the normalization method, and the drunken normalized graph Laplacian is implemented.
Define the class that actually performs spectral clustering and the function required to plot the execution result.
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.cluster import KMeans
class graph_clustering():
def __init__(self, K, n_clusters, lap_kind):
"""
K : int
n_clusters : int
lap_kind : None or str
Ex. "rw"
"""
self.lap_kind = lap_kind
self.K = K
self.n_clusters = n_clusters
def _calc_g_laplacian(self, X):
D = np.diag(np.sum(X, axis=0))
L = D - X
self.L = L
self.D = D
return L
def _calc_rw_g_laplacian(self, X):
L = self._calc_g_laplacian(X)
Lrw = np.dot(np.linalg.inv(self.D), L)
self.L = Lrw
return Lrw
def _get_K_eigen_vec(self, Lam, V):
sorted_index = np.argsort(Lam.real)
Lam_K = Lam[sorted_index][0:self.K]
V_K = V[sorted_index][0:self.K]
self.Lam_K = Lam_K
self.V_K = V_K
return Lam_K, V_K
def _Kmeans_V_K(self, V_K):
kmeans = KMeans(n_clusters=self.n_clusters, random_state=0).fit(V_K.T.real)
clusters = kmeans.labels_
self.clusters = clusters
return clusters
def laplacian_clustering(self, X):
"""
X : ndarray
a matrix representation of undirected graph
"""
if self.lap_kind is None:
L = self._calc_g_laplacian(X)
elif self.lap_kind == "rw":
L = self._calc_rw_g_laplacian(X)
else:
raise NotImplementedError
Lam, V = np.linalg.eig(L)
Lam_K, V_K = self._get_K_eigen_vec(Lam, V)
clusters = self._Kmeans_V_K(V_K)
return clusters
# for plot
def get_labels_dic(G):
"""
G : networkx.classes.graph.Graph
"""
labels_dic = { key:G.node[key]['label'] for key in G.node.keys() }
return labels_dic
def get_labels_list(G):
labels_list = np.array([ G.node[key]['label'] for key in G.node.keys() ])
return labels_list
def split_labels(labels_list, clusters):
"""
labels_list : list
return value of get_labels_list function.
clusters : ndarray
return value of graph_clustering.laplacian_clustering
"""
class_uniq = np.unique(clusters)
node_index_split= []
labels_split = []
index = np.arange(clusters.shape[0])
for class_elem in class_uniq:
node_index_split.append(list(index[clusters == class_elem]))
labels_split.append(list(labels_list[clusters == class_elem]))
return node_index_split, labels_split
def plot_clusters(G, node_index_split):
"""
G : networkx.classes.graph.Graph
node_index_split : list (2-dim list)
return value of split_labels function.
"""
labels_dic = get_labels_dic(G)
pos = nx.spring_layout(G)
colors = [ cm.jet(x) for x in np.arange(0, 1, 1/len(node_index_split)) ]
for i, nodes in enumerate(node_index_split):
nx.draw_networkx(G, node_color=colors[i], labels=labels_dic, nodelist=nodes, pos=pos, font_color="r")
plt.show()
A detailed explanation of the code can be postponed, and spectral clustering and the resulting plot can be done as follows.
data_path = "./data/lesmis/lesmis.gml"
G = nx.read_gml(data_path)
X = np.array(nx.to_numpy_matrix(G))
GClus = graph_clustering(K=15, n_clusters=5, lap_kind="rw")
clusters = GClus.laplacian_clustering(X)
labels_list = get_labels_list(G)
node_index_split, labels_split = split_labels(labels_list, clusters)
plot_clusters(G, node_index_split)
I use it like that.
graph_clustering(k=15, n_clusters=5, lap_kind="rw")
The argument of isk
How many eigenvectorsn_clusters
How many clusters should be divided intolap_kind
Gives the type of spectral clustering algorithm.
lap_kind=none
Denormalized withlap_kind="rw"
Perform clustering of the drunken normalization graph Laplacian with.
The plot part needed some ingenuity. I'm plotting with the `nx.draw_networkx``` function in
plot_clusters ()
`, but I didn't have the option to plot different colors for each identified cluster, so this way I'm plotting.
nx.draw_networkx(G, node_color=colors[i], labels=labels_dic, nodelist=nodes, pos=pos, font_color="r")
Since I was able to plot a part of the whole graph with this argument `nodelist```, divide the node for each cluster and plot while changing the color with ``` node_color = colors [i]
doing. Also, since this overwrites the plot more and more, if you do not specify the position with the ``
pos``` argument, the position will change every time and it will be messed up.
gclus = graph_clustering(k=15, n_clusters=5, lap_kind=none)
In the case of.
It's true that one node is one cluster, and it doesn't feel like a cluster at all.
gclus = graph_clustering(k=15, n_clusters=5, lap_kind="rw")
In the case of.
It looks a little like a cluster.
labels_split[0]
['MmeMagloire', 'CountessDeLo', 'Geborand']
labels_split[1]
['Myriel',
'Napoleon',
'MlleBaptistine',
'Champtercier',
'Cravatte',
'Count',
'OldMan',
'Labarre',
'Valjean',
'MmeDeR',
'Isabeau',
'Gervais',
'Tholomyes',
'Blacheville',
'Favourite',
'Dahlia',
'Zephine',
'MmeThenardier',
'Cosette',
'Javert',
'Fauchelevent',
'Bamatabois',
'Perpetue',
'Woman1',
'Judge',
'Champmathieu',
'Brevet',
'Chenildieu',
'Cochepaille',
'Pontmercy',
'Boulatruelle',
'Eponine',
'Anzelma',
'Woman2',
'MotherInnocent',
'Gribier',
'Jondrette',
'Gavroche',
'Gillenormand',
'Magnon',
'MlleGillenormand',
'MmePontmercy',
'MlleVaubois',
'LtGillenormand',
'Marius',
'BaronessT',
'Mabeuf',
'Enjolras',
'Courfeyrac',
'Bahorel',
'Bossuet',
'MotherPlutarch',
'Gueulemer',
'Babet',
'Montparnasse',
'Toussaint',
'Child2',
'MmeHucheloup']
labels_split[2]
['Combeferre', 'Prouvaire', 'Joly', 'Grantaire', 'Claquesous', 'Brujon']
labels_split[3]
['Marguerite',
'Listolier',
'Fameuil',
'Fantine',
'Thenardier',
'Simplice',
'MmeBurgon',
'Feuilly',
'Child1']
labels_split[4]
['Scaufflaire']
What kind of cluster was divided into is like this. I only know Les Misérables in movies and I only remember Henri Fortin, so I'm not sure. Henri Fortin is in the largest cluster, I can only say.
--As for the theoretical story, in my current knowledge, I just copy and paste Relationship data learning, which is troublesome and meaningless, so I omit it. Then refer to this book. It's a good book. ――By the way, another clustering by symmetric normalization Laplacian appears in this book. There is no clustering, but the symmetric normalized Laplacian calculation has a function defined in NetworkX, [here](https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.linalg. laplacianmatrix.normalized_laplacian_matrix.html # networkx.linalg.laplacianmatrix.normalized_laplacian_matrix) for details. ――When performing clustering, we focus on the one with the smallest eigenvalue, which is fresh and interesting. Focus on the larger one, such as PCA.
Recommended Posts