Create a partial correlation matrix and draw an independence graph

I will introduce how to draw an independence graph with graphviz.

Partial correlation matrix and independence graph

A correlation between two variables can be observed for two reasons:

- There is a causal relationship between them.
- There is a common factor that has a causal relationship with both of them.

The partial correlation is the correlation coefficient obtained after removing the latter effect, and the independence graph connects the variables that have a high partial correlation with each other. See the following for details.

Derivation of the meaning and formula of the partial correlation coefficient https://mathtrain.jp/partialcor
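
As a small illustration of the second case, here is a sketch with made-up data in which a common factor `z` drives both `x` and `y`. The variable names, the sample size, and the seed are my own assumptions. It uses the standard first-order partial correlation formula, so it only covers the case of a single control variable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data: z is a common factor that drives both x and y;
# x and y have no direct link to each other.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]

# First-order partial correlation of x and y, controlling for z
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(f"correlation(x, y)             = {r_xy:.2f}")
print(f"partial correlation(x, y | z) = {r_xy_z:.2f}")
```

The plain correlation should come out clearly positive, while the partial correlation should be close to zero, because the only link between `x` and `y` here is `z`.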

1. Install graphviz

I have not verified these steps on a clean environment, but the following should work.

Install graphviz's python wrapper with pip

`terminal`


pip install graphviz

Install Graphviz itself and make it available from Jupyter Notebook

`terminal`


conda install -c conda-forge python-graphviz

2. How to draw a graph

You can define a node with node() and a connection between nodes with edge(), as shown below. When render() is executed, the graphviz source code is first written to a file, and the graph is then exported as png or pdf based on it. With cleanup=True, the source file is deleted after the image file has been exported. The examples below export png.
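
To see the intermediate source without rendering anything, you can print the `source` attribute of the graph object. This is a minimal sketch; the node names `a` and `b` are placeholders of my own, and it only needs the Python package, not the Graphviz executables.

```python
from graphviz import Graph

g = Graph(format='png')
g.node('a')
g.node('b')
g.edge('a', 'b')

# g.source holds the DOT text that render() would write to disk
print(g.source)
```

Reading the DOT text is a quick way to check that the nodes and edges are what you expect before producing an image.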

Undirected graph

`python`


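from graphviz import Graph
from IPython.display import display  # Jupyter provides display() by default; imported here to be explicit
from PIL import Image  # needed for Image.open() below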

g = Graph(format='png')

g.node('1')
g.node('2')
g.node('3')
g.edge('1', '2')
g.edge('2', '3')
g.edge('3', '1')

g.render(filename='../test', format='png', cleanup=True, directory=None)
display(Image.open('../test.png'))


Directed graph

`python`


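from graphviz import Digraph
from IPython.display import display  # Jupyter provides display() by default; imported here to be explicit
from PIL import Image  # needed for Image.open() below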

dg = Digraph(format='png')

dg.node('1')
dg.node('2')
dg.node('3')
dg.edge('1', '2')  # 1 -> 2
dg.edge('2', '3')  # 2 -> 3
dg.edge('3', '1')  # 3 -> 1

dg.render(filename='../test', format='png', cleanup=True, directory=None)
display(Image.open('../test.png'))


3. Data preparation

This time I use the iris dataset as sample data.

`python`


import numpy as np
import pandas as pd
from sklearn import datasets
import seaborn as sns

iris = datasets.load_iris()
df = pd.DataFrame(np.hstack([iris.data, iris.target.reshape(-1, 1)]), 
                  columns=iris.feature_names + ['label'])
sns.pairplot(df, hue='label')


4. Creating a correlation matrix

`python`


import matplotlib.pyplot as plt

cm = pd.DataFrame(np.corrcoef(df.T), columns=df.columns, index=df.columns)

sns.heatmap(cm, annot=True, square=True, vmin=-1, vmax=1, fmt=".2f", cmap="RdBu")
plt.savefig("cor.png")  # separate file name so the partial correlation plot does not overwrite it
plt.show()
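
As a side note, pandas can produce the same matrix directly with `DataFrame.corr()`, which computes Pearson correlations by default. The sketch below compares the two on a small random frame; `toy` and its column names are made up for illustration and stand in for the iris DataFrame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical toy frame, standing in for the iris DataFrame
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])

# Same construction as in the post: np.corrcoef on the transposed data
cm_numpy = pd.DataFrame(np.corrcoef(toy.to_numpy().T),
                        columns=toy.columns, index=toy.columns)
cm_pandas = toy.corr()  # Pearson correlation by default

# Compare the two matrices element by element
print(np.allclose(cm_numpy.to_numpy(), cm_pandas.to_numpy()))
```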


5. Creating a partial correlation matrix

I borrowed this code from the Hatena Blog "Hashikure Engineer Mocking notes".

There seems to be a more careful approach that tests each correlation and does not subtract the ones that are not significant, but here every variable is subtracted uniformly.

`python`


import scipy.linalg  # import the submodule explicitly; a plain "import scipy" may not load it

def cor2pcor(R):
    inv_cor = scipy.linalg.inv(R)
    rows = inv_cor.shape[0]
    regu_1 = 1 / np.sqrt(np.diag(inv_cor))
    regu_2 = np.repeat(regu_1, rows).reshape(rows, rows)
    pcor = (-inv_cor) * regu_1 * regu_2
    np.fill_diagonal(pcor, 1)
    return pcor

pcor = pd.DataFrame(cor2pcor(cm), columns=cm.columns, index=cm.index)

sns.heatmap(pcor, annot=True, square=True, vmin=-1, vmax=1, fmt=".2f", cmap="RdBu")
plt.savefig("pcor.png")
plt.show()
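
One way to gain confidence in the inverse-matrix calculation is to compare it with the definition of partial correlation: regress the two variables on all the others and correlate the residuals. The following is a sketch on made-up data. The function names and the random data are my own assumptions, and `np.outer` is used as a compact stand-in for the repeat/reshape step.

```python
import numpy as np

def pcor_from_precision(R):
    """Partial correlations from the inverse of a correlation matrix R."""
    P = np.linalg.inv(R)
    d = 1.0 / np.sqrt(np.diag(P))
    pcor = -P * np.outer(d, d)  # scale entry (i, j) by d[i] * d[j]
    np.fill_diagonal(pcor, 1.0)
    return pcor

def pcor_from_residuals(X, i, j):
    """Correlate the residuals of columns i and j after regressing each on the other columns."""
    others = [k for k in range(X.shape[1]) if k not in (i, j)]
    Z = np.column_stack([np.ones(len(X)), X[:, others]])  # intercept + control variables
    res_i = X[:, i] - Z @ np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
    res_j = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    return np.corrcoef(res_i, res_j)[0, 1]

# Hypothetical data with some dependence between the columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]
X[:, 2] += 0.5 * X[:, 1]

pcor_a = pcor_from_precision(np.corrcoef(X.T))
pcor_b = pcor_from_residuals(X, 0, 1)
print(pcor_a[0, 1], pcor_b)  # the two values should agree up to rounding error
```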


6. Draw a graph

Draw an undirected graph by connecting the pairs of variables whose (partial) correlation coefficient has an absolute value larger than a threshold chosen as appropriate.
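
When choosing the threshold, it can help to list the pairs that pass it together with their coefficients before drawing anything. This is a sketch that does not need graphviz; `list_edges` and the 3x3 `demo` matrix are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

def list_edges(mat, threshold):
    """Return (row, col, value) for each off-diagonal pair with |value| > threshold."""
    rows, cols = np.triu_indices(mat.shape[0], k=1)  # upper triangle, no diagonal
    vals = mat.to_numpy()[rows, cols]
    keep = np.abs(vals) > threshold
    return [(mat.index[i], mat.columns[j], float(v))
            for i, j, v in zip(rows[keep], cols[keep], vals[keep])]

# Hypothetical 3x3 matrix, made up for illustration only
demo = pd.DataFrame([[1.0, 0.8, 0.1],
                     [0.8, 1.0, -0.5],
                     [0.1, -0.5, 1.0]],
                    index=list("abc"), columns=list("abc"))

for a, b, v in list_edges(demo, 0.3):
    print(f"{a} -- {b}: {v:+.2f}")
```

Taking only the upper triangle keeps each pair once and skips the diagonal.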

`python`


from graphviz import Graph
from PIL import Image

def draw_graph(cm, threshold):
    edges = np.where(np.abs(cm) > threshold)
    edges = [[cm.index[i], cm.index[j]] for i, j in zip(edges[0], edges[1]) if i > j]

    g = Graph(format='png')
    for k in range(cm.shape[0]):
        g.node(cm.index[k])

    for i, j in edges:
        g.edge(j, i)

    g.render(filename='../test', format='png', cleanup=True, directory=None)
    display(Image.open('../test.png'))

threshold = 0.3
draw_graph(cm, threshold)
draw_graph(pcor, threshold)

Graph made from correlation matrix


Graph made from partial correlation matrix


Summary

The partial correlation coefficients are low, so it is hard to draw a firm conclusion from this alone. If the result is correct, though, the sepal length and width correlate only with the petal length and width, and not directly with the iris species. Drawing a graph makes the picture easier to grasp than reading the correlation matrix.

Give it a try.