(Continued) Try other distance functions with k-means in scikit-learn

What this article covers

In the previous article (http://qiita.com/Kensuke-Mitsuzawa/items/67c17f6bda8a1097740c), I showed how to make k-means use distance functions other than the Euclidean distance.

Euclidean distance is fine, of course, but there are cases where another distance function clusters the data better, so why not give it a try? That is the gist of it.

By the way, last time I ended with the conclusion, "I tried a distance function based on the Pearson correlation coefficient, but it was way too slow!"

That is not practical, so I tried various things.

Try 1: Use the cdist method

scipy has a function called cdist (it lives in scipy.spatial.distance).

It is a very handy method: many similarity measures, converted into distance measures, are already implemented in it.
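For example, a built-in metric can be used just by passing its name. A minimal sketch with hypothetical toy arrays (not from the original article):

from scipy.spatial.distance import cdist
import numpy

# Hypothetical toy data: two row vectors in A, one in B
A = numpy.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
B = numpy.array([[1.0, 0.0, 1.0]])

print(cdist(A, B, 'cosine'))  # pairwise cosine distances, shape (2, 1)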

What's more, if the measure you want is not implemented, you can define it yourself.

So let's define our own distance function for the Pearson correlation coefficient.

#Distance function using cdist
import numpy
import scipy.stats
from scipy.spatial.distance import cdist

def special_operation(X, Y):
    # Pearson correlation coefficient between the 1-d vectors X and Y
    coefficient = scipy.stats.pearsonr(X, Y)[0]
    #coefficient = numpy.corrcoef(X, Y)[0][1]

    # Turn the correlation into a distance-like value
    if coefficient < 0:
        return abs(coefficient)
    else:
        return 1 - coefficient

def special_pearsonr_corrcoef(X, Y):
    # Pairwise distance matrix between the rows of X and the rows of Y
    return cdist(X, Y, special_operation)

special_operation() is a function that takes the vectors X and Y and returns a float.

Wrapping it with cdist() gives you, in no time, a method that computes the distance between every pair of data points.
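For instance, calling it on a small array returns the full pairwise distance matrix; a quick sketch (the toy array is my own, not the article's features):

# Hypothetical toy features: 3 samples x 3 dimensions
features_demo = numpy.array([[1.0, 2.0, 3.0],
                             [2.0, 4.0, 6.1],
                             [3.0, 1.0, 0.5]])

D = special_pearsonr_corrcoef(features_demo, features_demo)
print(D.shape)  # (3, 3): distance from every row to every other row
print(D[0, 0])  # ~0.0, since a row is perfectly correlated with itself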

As before, overwrite scikit-learn's Euclidean distance function with a monkey patch.

# pearson distance with numpy cdist
import time
from sklearn.cluster import KMeans
from sklearn.cluster import k_means_

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    return special_pearsonr_corrcoef(X, Y)

# Replace the Euclidean distance used internally by KMeans
k_means_.euclidean_distances = new_euclidean_distances

start = time.time()
kmeans_model = KMeans(n_clusters=3, random_state=10, init='random').fit(features)  # features: data from the previous article
print(kmeans_model.labels_)
elapsed_time = time.time() - start
print ("Special Coeficient k-means:{0}".format(elapsed_time))
print('='*40)

The results are shown later.

Try 2: Rewrite it in Cython

Reimplement with Cython for even faster speeds.

I thought, "Isn't it still slow if I call scipy even with cython?", So I copy and paste the relevant part of scipy to C.

I also do type definitions to do as much as I can.

Compile and when it's done, monkey patch again.

# -*- coding: utf-8 -*-
#cython: boundscheck=False
#cython: wraparound=False
__author__ = 'kensuke-mi'

import numpy as np
cimport numpy as np
cimport cython
from itertools import combinations
from scipy.stats import pearsonr
import time
import scipy.special as special
from scipy.special import betainc

ctypedef np.float64_t FLOAT_t


def _chk_asarray(np.ndarray[FLOAT_t, ndim=1] a, axis):
    if axis is None:
        a = np.ravel(a)
        outaxis = 0
    else:
        a = np.asarray(a)
        outaxis = axis

    if a.ndim == 0:
        a = np.atleast_1d(a)

    return a, outaxis


def ss(np.ndarray[FLOAT_t, ndim=1] a, int axis=0):
    """
    Squares each element of the input array, and returns the sum(s) of that.
    Parameters
    ----------
    a : array_like
        Input array.
    axis : int or None, optional
        Axis along which to calculate. Default is 0. If None, compute over
        the whole array `a`.
    Returns
    -------
    ss : ndarray
        The sum along the given axis for (a**2).
    See also
    --------
    square_of_sums : The square(s) of the sum(s) (the opposite of `ss`).
    Examples
    --------
    >>> from scipy import stats
    >>> a = np.array([1., 2., 5.])
    >>> stats.ss(a)
    30.0
    And calculating along an axis:
    >>> b = np.array([[1., 2., 5.], [2., 5., 6.]])
    >>> stats.ss(b, axis=1)
    array([ 30., 65.])
    """
    cdef np.ndarray[FLOAT_t, ndim=1] new_a
    cdef int new_axis

    new_a, new_axis = _chk_asarray(a, axis)
    return np.sum(new_a * new_a, new_axis)

def betai(FLOAT_t a, FLOAT_t b, FLOAT_t x):
    """
    Returns the incomplete beta function.
    I_x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt)
    where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma
    function of a.
    The standard broadcasting rules apply to a, b, and x.
    Parameters
    ----------
    a : array_like or float > 0
    b : array_like or float > 0
    x : array_like or float
        x will be clipped to be no greater than 1.0 .
    Returns
    -------
    betai : ndarray
        Incomplete beta function.
    """
    cdef FLOAT_t typed_x = np.asarray(x)
    cdef FLOAT_t betai_x = np.where(typed_x < 1.0, typed_x, 1.0)  # if x > 1 then return 1.0
    return betainc(a, b, betai_x)


def pearsonr(np.ndarray[FLOAT_t, ndim=1] x, np.ndarray[FLOAT_t, ndim=1] y):
    """
    Calculates a Pearson correlation coefficient and the p-value for testing
    non-correlation.
    The Pearson correlation coefficient measures the linear relationship
    between two datasets. Strictly speaking, Pearson's correlation requires
    that each dataset be normally distributed. Like other correlation
    coefficients, this one varies between -1 and +1 with 0 implying no
    correlation. Correlations of -1 or +1 imply an exact linear
    relationship. Positive correlations imply that as x increases, so does
    y. Negative correlations imply that as x increases, y decreases.
    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Pearson correlation at least as extreme
    as the one computed from these datasets. The p-values are not entirely
    reliable but are probably reasonable for datasets larger than 500 or so.
    Parameters
    ----------
    x : (N,) array_like
        Input
    y : (N,) array_like
        Input
    Returns
    -------
    (Pearson's correlation coefficient,
     2-tailed p-value)
    References
    ----------
    http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
    """
    # x and y should have same length.
    cdef np.ndarray[FLOAT_t, ndim=1] typed_x = np.asarray(x)
    cdef np.ndarray[FLOAT_t, ndim=1] typed_y = np.asarray(y)
    cdef int n = len(typed_x)
    cdef float mx = typed_x.mean()
    cdef float my = typed_y.mean()
    cdef np.ndarray[FLOAT_t, ndim=1] xm, ym
    cdef FLOAT_t r_den, r_num, r, prob, t_squared
    cdef int df


    xm, ym = typed_x - mx, typed_y - my
    r_num = np.add.reduce(xm * ym)
    r_den = np.sqrt(ss(xm) * ss(ym))
    r = r_num / r_den

    # Presumably, if abs(r) > 1, then it is only some small artifact of floating
    # point arithmetic.
    r = max(min(r, 1.0), -1.0)
    df = n - 2
    if abs(r) == 1.0:
        prob = 0.0
    else:
        t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
        prob = betai(0.5*df, 0.5, df/(df+t_squared))

    return r, prob

cdef special_pearsonr(np.ndarray[FLOAT_t, ndim=1] X, np.ndarray[FLOAT_t, ndim=1] Y):
    cdef float pearson_coefficient
    pearsonr_value = pearsonr(X, Y)
    pearson_coefficient = pearsonr_value[0]
    
    if pearson_coefficient < 0:
        return abs(pearson_coefficient)
    else:
        return 1 - pearson_coefficient


def copy_inverse_index(row_col_data_tuple):
    return (row_col_data_tuple[1], row_col_data_tuple[0], row_col_data_tuple[2])


def sub_pearsonr(np.ndarray X, int row_index_1, int row_index_2):
    cdef float pearsonr_distance = special_pearsonr(X[row_index_1], X[row_index_2])

    return (row_index_1, row_index_2, pearsonr_distance)


cdef XY_both_2dim(np.ndarray[FLOAT_t, ndim=2] X, np.ndarray[FLOAT_t, ndim=2] Y):
    cdef int row = X.shape[0]
    cdef int col = Y.shape[0]
    cdef np.ndarray[FLOAT_t, ndim=2] pearsonr_divergence_set = np.zeros([row, col])
    cdef float pearson_value = 0.0 

    for x_index in xrange(row):
        x_array = X[x_index]
        for y_index in xrange(col):
            y_array = Y[y_index]
            pearson_value = special_pearsonr(x_array, y_array)
            pearsonr_divergence_set[x_index, y_index] = pearson_value 
    
    return pearsonr_divergence_set


def pearsonr_distances(X, Y, Y_norm_squared=None, squared=False): 
    #start = time.time()
    
    cdef np.ndarray[FLOAT_t, ndim=2] pearsonr_divergence_set = XY_both_2dim(X, Y)
    #elapsed_time = time.time() - start
    #print ("Pearsonr(Cython version) k-means:{0}".format(elapsed_time))

    return pearsonr_divergence_set
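For reference, this is a minimal sketch of how the module could be compiled and patched in. The module name cython_pearsonr and the setup.py layout are my assumptions; they are not from the original article.

# setup.py -- minimal build script (hypothetical module name: cython_pearsonr)
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize("cython_pearsonr.pyx"),
    include_dirs=[numpy.get_include()]
)

Build it with python setup.py build_ext --inplace, then swap the compiled distance in just like before:

# Monkey-patch KMeans with the compiled Cython distance (sketch)
from sklearn.cluster import KMeans, k_means_
import cython_pearsonr  # hypothetical compiled module

k_means_.euclidean_distances = cython_pearsonr.pearsonr_distances
kmeans_model = KMeans(n_clusters=3, random_state=10, init='random').fit(features)  # features: data from the previous article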

Try 3: Use other distance functions

As introduced last time, this table shows the results of performing K-means clustering with various distance functions.

[Image: table of K-means clustering results with various distance functions - 41094b24-3bcb-0d04-93ab-edc54eb59501.png]

Looking at this, both cosine similarity and Jaccard similarity (or rather, the distance functions derived from them) give good results.

There is no need to stick with the Pearson correlation coefficient; if another measure gives good results, it is worth trying.

Both of these are already implemented as metrics in scipy's cdist, so you can use them right away just by passing the metric name.

Cosine distance

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    return cdist(X, Y, 'cosine')

Jaccard distance

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    return cdist(X, Y, 'jaccard')

I measured the results

The only thing I measured was processing time. I did not check clustering accuracy, because the data is unlabeled. (The purpose this time was simply to motivate the use of the method shown in the paper.)
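If a rough, label-free sanity check is wanted, scikit-learn's silhouette_score could be used; this is only a possible sketch, not something the original experiment did:

from sklearn.metrics import silhouette_score

# Higher is better; the metric should roughly match the distance used for clustering
print(silhouette_score(features, kmeans_model.labels_, metric='cosine'))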

The data is the same as in the previous article. Let's look at them in order.

First, the normal Euclidean distance. The output shows the cluster labels and the execution time.

[1 1 0 1 1 2 1 1 2 1 1 1 2 1 1 0 2 1 0 0 2 0 2 0 1 1 1 0 2 2 2 2 2 2 0 2 2]
Normal k-means:0.00853204727173
========================================

Cosine distance

[2 2 0 2 2 1 2 2 1 0 2 2 1 0 2 0 1 2 0 0 1 0 1 0 2 2 2 0 1 0 1 1 1 1 0 1 1]
Cosine k-means:0.0154190063477
========================================

Jaccard distance

[0 0 0 0 1 0 0 0 2 0 0 0 1 0 0 0 0 0 2 2 0 2 0 2 0 0 0 0 2 2 0 0 0 0 0 0 0]
jaccard k-means:0.0656130313873

Pearson correlation coefficient using cdist

[0 0 0 1 1 2 1 0 2 1 0 2 2 0 0 1 1 0 1 0 2 0 2 0 2 2 1 1 2 1 1 2 2 2 1 2 2]
Special Coeficient k-means:9.39525008202
========================================

Cythonized Pearson correlation coefficient

[0 0 0 1 1 2 1 0 2 1 0 2 2 0 0 1 1 0 1 0 2 0 2 0 2 2 1 1 2 1 1 2 2 2 1 2 2]
Pearsonr k-means:9.40655493736
========================================

Clearly, the execution time with the Pearson correlation coefficient is hopeless. In terms of speed (fastest to slowest), the result was:

Euclidean < cosine < Jaccard < (insurmountable wall) < Pearson

The clustering labels from the Jaccard distance are quite different from the others ...

Perhaps because the input data this time has very few features (a situation quite different from document clustering), its result differs noticeably from the others.

I would like to verify this point more thoroughly on another occasion.

Summary

To use other distance functions with scikit-learn's k-means, monkey-patch euclidean_distances in sklearn.cluster.k_means_ with a function built on scipy's cdist. The built-in cosine and Jaccard metrics are fast and ready to use, while a custom Pearson-based distance works but is far too slow, even when rewritten in Cython.
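In short, the pattern looks like this (a recap sketch of the approach used above):

from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans, k_means_

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    # Any metric name or custom callable accepted by cdist can go here
    return cdist(X, Y, 'cosine')

# Swap scikit-learn's internal Euclidean distance for the new one
k_means_.euclidean_distances = new_euclidean_distances
labels = KMeans(n_clusters=3, random_state=10, init='random').fit(features).labels_  # features: data from the previous article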
