Try multivariable correlation analysis using Graphical lasso at explosive speed

Motivation

When analyzing data, I sometimes run into situations like the following:

- I want to know the relationships between many kinds of data
- The data is fairly noisy, but I still want to compare it with other data
- I want to exclude spurious correlations
- I want to work under the assumption that only a few variables are truly related
- I just want to try some method quickly

In cases like these, let's just go ahead and try **Graphical lasso**.

What is Graphical lasso?

Roughly speaking, it is a method for estimating and drawing the relationships between variables as a graph. Since the method is based on the multivariate Gaussian distribution, I feel it can be applied in quite a variety of situations. For more information, [this book](https://www.amazon.co.jp/%E7%95%B0%E5%B8%B8%E6%A4%9C%E7%9F%A5%E3%81%A8%E5%A4%89%E5%8C%96%E6%A4%9C%E7%9F%A5-%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%97%E3%83%AD%E3%83%95%E3%82%A7%E3%83%83%E3%82%B7%E3%83%A7%E3%83%8A%E3%83%AB%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA-%E4%BA%95%E6%89%8B-%E5%89%9B/dp/4061529080) explains it in a very easy-to-understand way. If you are interested in the theory, please pick it up.
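
For reference, the optimization behind Graphical lasso is usually written as L1-regularized maximum-likelihood estimation of the precision matrix of a multivariate Gaussian. A standard textbook form (the notation here is mine, not from the original post) is:

```math
\hat{\Theta} = \operatorname*{arg\,max}_{\Theta \succ 0} \left\{ \log \det \Theta - \mathrm{tr}(S\Theta) - \lambda \lVert \Theta \rVert_{1} \right\}
```

where \(S\) is the sample covariance matrix and \(\lambda \ge 0\) controls sparsity. Entries of \(\hat{\Theta}\) that are shrunk exactly to zero correspond to pairs of variables that are conditionally independent given all the others, which is what produces a sparse graph.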

Implementation

The program implemented this time converts the covariance matrix estimated by Graphical lasso into a correlation matrix and draws it as a graph. There is still room for improvement, but I think it can serve as a useful guide when moving data analysis forward.

Test data preparation


```python
# Prepare test data.
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
feature_names = boston.feature_names
```
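
Note that `load_boston` has been removed from recent scikit-learn releases (1.2 and later). If it is not available in your environment, a minimal workaround is to load the raw file directly, assuming the CMU mirror referenced in scikit-learn's own deprecation notice is still reachable:

```python
# Fallback for scikit-learn >= 1.2, where load_boston was removed.
# Assumes the CMU mirror of the Boston housing data is still available.
import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record is split across two physical lines in the raw file.
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
feature_names = np.array([
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
])
```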

Main processing


```python
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.covariance import GraphicalLassoCV
import igraph as ig

# Standardize each feature (zero mean, unit variance).
X = stats.zscore(X, axis=0)

# Run GraphicalLassoCV.
model = GraphicalLassoCV(alphas=4, cv=5)
model.fit(X)

# Generate the graph data.
graph_data = glasso_graph_make(model, feature_names, threshold=0.2)

# Display the graph.
graph_data
```
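
If you want to check what the cross-validation actually selected before plotting, the fitted model exposes the chosen regularization strength and the estimated matrices (attribute names as documented for `GraphicalLassoCV`). A quick inspection, not part of the original post, could look like this:

```python
# Inspect the fitted model before plotting.
print("Selected alpha:", model.alpha_)              # regularization strength chosen by CV
print("Covariance shape:", model.covariance_.shape)

# Count how many off-diagonal entries of the precision matrix were driven to zero.
precision = model.precision_
off_diag = precision[~np.eye(precision.shape[0], dtype=bool)]
print("Zero off-diagonal entries:", np.sum(np.isclose(off_diag, 0.0)))
```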

Graph generation function


```python
def glasso_graph_make(model, feature_names, threshold):
    # Get the estimated covariance matrix.
    # -> Reference: https://scikit-learn.org/stable/modules/generated/sklearn.covariance.GraphicalLassoCV.html
    covariance_matrix = model.covariance_

    # Convert the covariance matrix to a correlation matrix.
    diagonal = np.sqrt(covariance_matrix.diagonal())
    correlation_matrix = ((covariance_matrix.T / diagonal).T) / diagonal

    # Zero out the diagonal so self-correlations do not become edges.
    correlation_matrix_diag_zero = correlation_matrix - np.diag(np.diag(correlation_matrix))
    df_graph_data = pd.DataFrame(index=feature_names, columns=feature_names,
                                 data=correlation_matrix_diag_zero.tolist())

    # Graph generation preparation
    graph_data = ig.Graph()
    graph_data.add_vertices(len(feature_names))
    graph_data.vs["feature_names"] = feature_names
    graph_data.vs["label"] = graph_data.vs["feature_names"]
    visual_style = {}
    edge_width_list = []
    edge_color_list = []

    # Add an edge for every pair whose correlation exceeds the threshold.
    for target_index in range(len(df_graph_data.index)):
        for target_column in range(len(df_graph_data.columns)):
            if target_column >= target_index:
                element = df_graph_data.iloc[target_index, target_column]
                if abs(element) >= threshold:
                    graph_data.add_edges([(target_index, target_column)])
                    edge_width_list.append(abs(element) * 10)
                    # Positive correlations in red, negative in blue.
                    edge_color_list.append("red" if element > 0 else "blue")

    visual_style["edge_width"] = edge_width_list
    visual_style["edge_color"] = edge_color_list

    return ig.plot(graph_data, **visual_style, vertex_size=50, bbox=(500, 500),
                   vertex_color="skyblue", layout="circle", margin=50)
```
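
As a side note, the object returned by `ig.plot` (with python-igraph's Cairo plotting backend) can also be written to disk, so the generated figure can be saved as well. A minimal sketch, with an arbitrary example file name:

```python
# Save the figure returned by glasso_graph_make to disk.
# Assumes the Cairo plotting backend; the file name is just an example.
graph_data = glasso_graph_make(model, feature_names, threshold=0.2)
graph_data.save("correlation_graph.png")
```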

Result

(Figure: data.png — the generated correlation graph)

Probably because the threshold is set fairly low, the graph picks up the direct correlations more or less as they are. Let's look at the pair that appears to have the strongest correlation: RAD (index of accessibility to radial highways) and TAX (full-value property-tax rate per $10,000). In other words, property tax tends to be high where highways are easily accessible. To be honest, I can't conclude much from this data alone, but the result does not seem unreasonable.
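
To double-check which pairs dominate numerically rather than reading it off the figure, the correlation matrix can be recomputed from the fitted model and the largest off-diagonal entries listed. A small sketch, not part of the original post:

```python
# List the strongest off-diagonal correlations from the fitted model.
diagonal = np.sqrt(model.covariance_.diagonal())
correlation = (model.covariance_.T / diagonal).T / diagonal

pairs = []
n = len(feature_names)
for i in range(n):
    for j in range(i + 1, n):
        pairs.append((feature_names[i], feature_names[j], correlation[i, j]))

# Sort by absolute correlation and show the top five pairs.
for name_i, name_j, value in sorted(pairs, key=lambda p: -abs(p[2]))[:5]:
    print(f"{name_i} - {name_j}: {value:+.3f}")
```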

Finally

Having an environment where you can quickly try things like this is really valuable. As for the results, they might have been easier to interpret with data that is a bit larger and noisier; I wonder if such a dataset exists somewhere.

This time, I used Graphical lasso to graph the relationships between variables, but beyond that there are also change-detection methods that focus on the graph structure, so I feel it is a technique that is still well worth studying.

Notes and disclaimer

The content of this article is my personal opinion, not the official view of the organization I belong to. Neither the author nor that organization accepts any responsibility for problems that may occur to users or third parties as a result of applying the contents of this article.
