Clustering and visualization using Python and CytoScape

Overview

Clustering is a simple data analysis method that makes it easy to obtain useful results. Not only data that originally has a network structure, but also data that does not have a network structure can be networked and clustered by defining a distance function. This entry introduces how to perform clustering and visualize the results.

What is a network?

What is a network structure in the first place? Generally, it is data composed of nodes and edges. The edges may or may not be directional A graph with a direction is called a directed graph, and a graph without a direction is called an undirected graph. Some edges have weights and some do not, and some are called weighted graphs.

What is clustering?

To divide a set of data into several groups (subsets). Divide each subset so that it has some common characteristics In network analysis, it is sometimes called Community Detection. When analyzing data, the method of calculating each relationship, networking it, clustering it, and visualizing it as a group is often used to understand the structure of the data. This is a useful method that makes it easy to obtain knowledge.

About the algorithm introduced in this entry

The algorithm introduced this time is an algorithm aimed at dividing the network to maximize the modularity. Modularity is an indicator of how dense a network is than a random network. In a 2010 comparative paper, it received the best evaluation for community extraction from networks (http://arxiv.org/abs/0906.0612).

A well-known clustering algorithm is K-means. Since K-means reflects the closeness of each node, it can be said that it is an algorithm that considers only the first-order connection. The algorithm introduced this time clusters the network. The difference from K-means

No need to determine the number of clusters in advance
Can reflect higher-order connections

That is the point.

About the library

Method paper

Fast unfolding of communities in large networks, Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Renaud Lefebvre, Journal of Statistical Mechanics: Theory and Experiment 2008(10)

Installation

From the following Bitbucket

https://bitbucket.org/taynaud/python-louvain

Sample code

Visualize a typical data set of a social network called karate_club.

`karate_community.py`


import community
import networkx as nx
import matplotlib.pyplot as plt
G = nx.karate_club_graph()
partition = community.best_partition(G)
size = float(len(set(partition.values())))
pos = nx.spring_layout(G)
count = 0.
for com in set(partition.values()):
	count += 1.
	list_nodes = [nodes for nodes in partition.keys() if partition[nodes] == com]
	nx.draw_networkx_nodes(G, pos, list_nodes, node_size=20, node_color = str(count/size) )
plt.show()

kobito.1392165578.802329.png

Use graph visualization tools

As mentioned above, it is possible to draw a graph only with Python, When the network becomes large, it is often convenient to be able to perform interactive operations by drawing, such as using only the largest connected graph or erasing edges below a certain weight. There are several graph visualization tools, but this time we will introduce how to reflect the clustering results in the visualization tool using CytoScape as an example.

The library introduced this time is expressed based on NetworkX. The network represented by NetworkX can be written out in a data format that can be linked with various visualization software.

Let's write it out in GraphML format this time

How to write out community information?

Nodes can have various attribute information. It can be given by passing a dict with the node ID as the key. This allows you to keep data on how each node is clustered. Export this in GML format.

import community
import networkx as nx
G = nx.karate_club_graph()
partition = community.best_partition(G)
labels = dict([(i, str(i)) for i in xrange(nx.number_of_nodes(G))])
nx.set_node_attributes(G, 'label', labels)
nx.set_node_attributes(G, 'community', partition)
nx.write_gml(G, "community.gml")

Read data using CytoScape

kobito.1419932553.584279.png

Specify a file from From Network File in the dialog after startup and read it. (Or File-> import-> Network-> File)

kobito.1419939918.525853.png

I'm sure you're not sure, but this is because you didn't specify the layout. First, let's specify the display algorithm. If you select Layout-> yFiles Layouts-> Organic, the display will be as follows.

kobito.1419940057.684839.png

It was displayed like that.

Add clustering results to visualization

Next, the result of clustering is reflected in this graph. This time, let's change the color of the node according to the cluster. Use the Style menu of the Control Panel to change the color and size of the graph. If the Control Panel is not displayed, display it with View-> Show Contorol Panel.

kobito.1419940252.444640.png

Use Fill Color to change the color. You can decide under what conditions the settings are assigned on the map. Click on the Map of the item you want to change

kobito.1419941288.020166.png

Such items are displayed. Set the conditional data items in Column. Items set in set_node_attributes can be selected. Let's select community

スクリーンショット 2014-12-30 21.23.14.png

Select the conditions under which the definition is changed in Mapping Type. There are 3 types

--Continuous Mapping: Make the value continuous and change the size and color gradation depending on the size. --Discrete Mapping: Set the value as a discrete value and set the color for each value. --PassThrough Mapping: The value is converted to data as it is. For example, if you specify red, it will turn red.

This time, the number of clusters is as small as 4, and the size of the value is meaningless, so select Discrete Mapping. When you select it, the following pull-down menu will appear, so select and specify the color.

kobito.1419951614.192048.png

With this setting, the color of the first graph changes like this. I was able to visualize the results of clustering.

kobito.1419951735.280271.png

Summary

In this entry, we introduced the flow from network clustering to visualization. Network analysis is relatively easy, but it is a method that makes it easy to obtain interesting results, so if you are interested, please try it!