Define your own distance function for k-means in scikit-learn

What this article covers

As you probably know, k-means is a method for partitioning data into clusters.

It appears in so many textbooks that I won't explain it in detail here; as for the fact that it is still so widely used... (self-censored).

By the way, I think videos like this one and tools like this one are a nice way to get a feel for how k-means behaves.

Now, k-means needs a "distance function" to measure the distance between each centroid and each data point.

Textbooks usually choose "Euclidean distance", but is that really the right choice?

This question is answered clearly in the paper: Anna Huang, "Similarity Measures for Text Document Clustering", 2008.

Take a look at Table 4, borrowed from the paper. The evaluation metric is entropy, so lower values mean better clustering.

(Table 4 from the paper: entropy scores for each similarity measure)

You can see that Euclidean distance is not the winner for text clustering tasks.

Naturally, that makes you want to try other distance functions.

Let's do it with scikit-learn

To get straight to the conclusion: scikit-learn has no such option.

Sounds like a pain to verify? Not really.

You can tell just by looking at the scikit-learn source code.

    closest_dist_sq = euclidean_distances(
        centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms,
        squared=True)
    current_pot = closest_dist_sq.sum()

This snippet is from the k-means code, and you can see that the Euclidean distance is hard-coded.

In other words, there is no hook for changing the distance function via an option.

But why isn't there one? Looking at the issues on GitHub, there is a comment to the effect of "because it is difficult to generalize to other distance functions".

Still, I wish they would at least offer it as a hidden option...

So how do we use a different distance function?!

There seem to be several ways.

  1. Write k-means yourself. The method is described on this stack-overflow page (http://stackoverflow.com/questions/5529625/is-it-possible-to-specify-your-own-distance-function-using-scikit-learn-k-means).
  2. Use the k-centroid method. Apparently it can be used with scikit-learn, though I haven't checked.
  3. Use a monkey patch. This is what I'll do this time.

What is a monkey patch? In short: "overwrite a method at runtime so that it does whatever you want".
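As a toy illustration of the general idea (not the k-means patch itself; always_zero is just a made-up stand-in), replacing a module-level function changes what every caller of that function sees:

import math

def always_zero(x):
    return 0.0

# Overwrite the library function at runtime
math.sqrt = always_zero
print(math.sqrt(16))   # -> 0.0; every caller of math.sqrt now gets the patched version

Any code that looks the function up through the module afterwards gets the patched version, which is exactly the property we exploit below.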

This time, we will overwrite the Euclidean distance function used inside k-means. This page describes how to do it.

Let's actually do it.

The data I'll use is borrowed from here. It's not text data, but it's cute...

Let's see the behavior of scikit-learn's Euclidean distance

The Euclidean distance function that k-means calls is this one.

Its behavior can be roughly divided into three cases:

  1. When Y is None
  2. When X is a one-sample array and Y is an array of multiple samples
  3. When X is a 2-d array of multiple samples and Y is an array of multiple samples

In every case, it returns an array of the form

array([
    [distance between X and Y]
])

so a new distance function needs to follow the same format.
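Here is a small sanity-check sketch, with made-up toy arrays, showing the shapes that sklearn.metrics.pairwise.euclidean_distances returns in each of the three cases:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Case 1: Y is None -> pairwise distances within X, shape (2, 2)
print(euclidean_distances(X))

# Case 2: one sample vs. multiple samples -> shape (1, 3)
print(euclidean_distances(X[0:1], Y))

# Case 3: multiple samples vs. multiple samples -> shape (2, 3)
print(euclidean_distances(X, Y))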

Implement the Pearson correlation coefficient distance function

Let's implement the Pearson-correlation-based measure introduced in the paper. In Python, the Pearson correlation coefficient is available in scipy.

However, as noted in the paper, the correlation coefficient by itself is not a distance function, so the formula needs a small transformation:

special_pearsonr(X, Y) = 1 - pearsonr(X, Y)    if pearsonr(X, Y) >= 0
                       = |pearsonr(X, Y)|      otherwise

Now let's implement it, keeping in mind the three cases handled by the Euclidean distance function.

import numpy
import scipy.stats
import scipy.sparse
from itertools import combinations


def special_pearsonr(X, Y):
    pearsonr_value = scipy.stats.pearsonr(X, Y)
    if pearsonr_value[0] < 0:
        return abs(pearsonr_value[0])
    else:
        return 1 - pearsonr_value[0]


def sub_pearsonr(X, row_index_1, row_index_2):
    pearsonr_distance = special_pearsonr(X[row_index_1], X[row_index_2])
    return (row_index_1, row_index_2, pearsonr_distance)


def copy_inverse_index(row_col_data_tuple):
    # (row, col, distance) -> (col, row, distance); the distance matrix is symmetric
    return (row_col_data_tuple[1], row_col_data_tuple[0], row_col_data_tuple[2])


def pearsonr_distances(X, Y=None):
    if Y is None:
        # Only X is given and X is a 2-d array
        row_combinations = list(combinations(range(0, len(X)), 2))
        pearsonr_set = [sub_pearsonr(X, index_set_tuple[0], index_set_tuple[1]) for index_set_tuple in row_combinations]
        matrix_source_data = pearsonr_set + list(map(copy_inverse_index, pearsonr_set))

        row = [t[0] for t in matrix_source_data]
        col = [t[1] for t in matrix_source_data]
        data = [t[2] for t in matrix_source_data]

        pearsonr_matrix = scipy.sparse.csr_matrix((data, (row, col)), shape=(X.shape[0], X.shape[0]))
        return pearsonr_matrix.toarray()

    elif len(X.shape) == 1 and len(Y.shape) == 2:
        # X is one sample and Y is multiple samples
        # The return value is a 1-sample array of distances
        pearsonr_set = numpy.array([special_pearsonr(X, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])

        return pearsonr_set

    elif len(X.shape) == 2 and len(Y.shape) == 2:
        # For each row X[i], compute the distances to all rows of Y
        pearsonr_x_and_all_y = lambda x, Y: numpy.array([special_pearsonr(x, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        pearsonr_divergence_set = numpy.array([pearsonr_x_and_all_y(X[x_i], Y) for x_i in range(0, X.shape[0])])
        return pearsonr_divergence_set
    else:
        raise Exception("Unexpected input shapes")

pearsonr_distances is the distance function that k-means will call.

Now, let's monkey-patch the Euclidean distance function.

import time
from sklearn.cluster import KMeans
from sklearn.cluster import k_means_

start = time.time()

# To keep the call signature compatible, define a function with the same parameters as the Euclidean distance function
def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    return pearsonr_distances(X, Y)

# Overwrite here!
k_means_.euclidean_distances = new_euclidean_distances
kmeans_model = KMeans(n_clusters=3, random_state=10, init='random').fit(features)
print(kmeans_model.labels_)
elapsed_time = time.time() - start
print("Pearsonr k-means:{0}".format(elapsed_time))

Since I wanted to see the execution time, I measured it as well.

I forgot to mention the data; it is the following:

features = numpy.array([
        [  80,  85, 100 ],
        [  96, 100, 100 ],
        [  54,  83,  98 ],
        [  80,  98,  98 ],
        [  90,  92,  91 ],
        [  84,  78,  82 ],
        [  79, 100,  96 ],
        [  88,  92,  92 ],
        [  98,  73,  72 ],
        [  75,  84,  85 ],
        [  92, 100,  96 ],
        [  96,  92,  90 ],
        [  99,  76,  91 ],
        [  75,  82,  88 ],
        [  90,  94,  94 ],
        [  54,  84,  87 ],
        [  92,  89,  62 ],
        [  88,  94,  97 ],
        [  42,  99,  80 ],
        [  70,  98,  70 ],
        [  94,  78,  83 ],
        [  52,  73,  87 ],
        [  94,  88,  72 ],
        [  70,  73,  80 ],
        [  95,  84,  90 ],
        [  95,  88,  84 ],
        [  75,  97,  89 ],
        [  49,  81,  86 ],
        [  83,  72,  80 ],
        [  75,  73,  88 ],
        [  79,  82,  76 ],
        [ 100,  77,  89 ],
        [  88,  63,  79 ],
        [ 100,  50,  86 ],
        [  55,  96,  84 ],
        [  92,  74,  77 ],
        [  97,  50,  73 ],
])

When I run it, the output is:

[0 0 0 1 1 2 1 0 2 1 0 2 2 0 0 1 1 0 1 0 2 0 2 0 2 2 1 1 2 1 1 2 2 2 1 2 2]
Pearsonr k-means:11.0138161182

I confirmed that it works properly. Nice!

For comparison, the output of normal k-means with Euclidean distance is:

[1 1 0 1 1 2 1 1 2 1 1 1 2 1 1 0 2 1 0 0 2 0 2 0 1 1 1 0 2 2 2 2 2 2 0 2 2]
========================================
Normal k-means:0.016499042511

You can see that some of the data points end up in different clusters (compare where the 0s and 1s fall).

As for the execution time... unfortunately, there is a gap of more than 600x...

Given how many more Python-level function calls are involved, that's to be expected to some degree. With a more careful implementation there's a good chance it could be made a fair bit faster. (Volunteers to speed it up are welcome.)
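For example, here is a minimal sketch (untested against the version above, so treat it as an idea rather than a drop-in replacement) of how the pairwise Pearson distances could be computed with vectorized NumPy operations instead of calling scipy.stats.pearsonr in a Python loop:

import numpy as np

def pearsonr_distance_matrix(X, Y):
    # Sketch: vectorized pairwise Pearson-based distances
    # (assumes no constant rows, whose centered norm would be zero).
    X = np.atleast_2d(X).astype(float)
    Y = np.atleast_2d(Y).astype(float)
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Yn = Yc / np.linalg.norm(Yc, axis=1, keepdims=True)
    r = Xn.dot(Yn.T)                  # correlation matrix, shape (len(X), len(Y))
    # Same transformation as special_pearsonr, applied element-wise
    return np.where(r >= 0, 1.0 - r, np.abs(r))

Something along these lines could replace the len(X.shape) == 2 branch of pearsonr_distances, and with small tweaks the other branches as well.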

I haven't used it for document classification, so I can't show the difference here, but since the paper demonstrates it, the Pearson correlation coefficient should give good results (it should...).

Summary

  * scikit-learn's k-means hard-codes the Euclidean distance; there is no option to change it.
  * You can still swap in your own distance function by monkey-patching k_means_.euclidean_distances.
  * The Pearson-based distance works, but the pure-Python implementation is far slower than the built-in Euclidean path.

By the way ...

I also implemented the averaged Kullback-Leibler distance. Unfortunately, it stops with an error.

When measuring the distance between X and Y, negative elements inevitably appear, the distance becomes inf, and the initialization fails before clustering even starts. (A likely reason: for dense input, scikit-learn's k-means subtracts the column means from the data before clustering, so negative values show up even though the original features are all positive.)

I'm not sure how to handle the negative values; the paper doesn't mention it.

import numpy
import scipy.stats
import scipy.sparse
from itertools import combinations
from sklearn.preprocessing import normalize


def averaged_kullback_leibler(X, Y):
    #norm_X = normalize(X=X, norm='l1')[0]
    #norm_Y = normalize(X=Y, norm='l1')[0]
    norm_X = X
    norm_Y = Y

    pie_1 = norm_X / (norm_X + norm_Y)
    pie_2 = norm_Y / (norm_X + norm_Y)
    M = (pie_1 * norm_X) + (pie_2 * norm_Y)

    # Note: pie_1 and pie_2 are arrays while scipy.stats.entropy returns a scalar,
    # so this returns an element-wise array rather than a single scalar distance.
    KLD_averaged = (pie_1 * scipy.stats.entropy(pk=norm_X, qk=M)) + (pie_2 * scipy.stats.entropy(pk=norm_Y, qk=M))

    return KLD_averaged


def sub_KL_divergence(X, row_index_1, row_index_2):
    #kl_distance = scipy.stats.entropy(pk=X[row_index_1], qk=X[row_index_2])
    kl_distance = averaged_kullback_leibler(X[row_index_1], X[row_index_2])
    return (row_index_1, row_index_2, kl_distance)


def copy_inverse_index(row_col_data_tuple):
    # (row, col, distance) -> (col, row, distance); the distance matrix is symmetric
    return (row_col_data_tuple[1], row_col_data_tuple[0], row_col_data_tuple[2])


def KLD_distances(X, Y=None):
    if Y is None:
        # Only X is given and X is a 2-d array
        row_combinations = list(combinations(range(0, len(X)), 2))
        kl_divergence_set = [sub_KL_divergence(X, index_set_tuple[0], index_set_tuple[1]) for index_set_tuple in row_combinations]
        matrix_source_data = kl_divergence_set + list(map(copy_inverse_index, kl_divergence_set))

        row = [t[0] for t in matrix_source_data]
        col = [t[1] for t in matrix_source_data]
        data = [t[2] for t in matrix_source_data]

        kl_distance_matrix = scipy.sparse.csr_matrix((data, (row, col)), shape=(X.shape[0], X.shape[0]))
        return kl_distance_matrix.toarray()
    elif len(X.shape) == 1 and len(Y.shape) == 2:
        # X is one sample and Y is multiple samples
        # The return value is a 1-sample array of distances
        kl_divergence_set = numpy.array([averaged_kullback_leibler(X, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])

        return kl_divergence_set
    elif len(X.shape) == 2 and len(Y.shape) == 2:
        # For each row X[i], compute the distances to all rows of Y
        kld_x_and_all_y = lambda x, Y: numpy.array([averaged_kullback_leibler(x, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        kl_divergence_set = numpy.array([kld_x_and_all_y(X[x_i], Y) for x_i in range(0, X.shape[0])])
        return kl_divergence_set
    else:
        raise Exception("Unexpected input shapes")
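As for the negative values themselves, one possible workaround (my own guess, not something taken from the paper) would be to clip and smooth each vector into a proper probability distribution before computing the divergence:

import numpy as np

def to_probability(v, epsilon=1e-9):
    # Hypothetical helper: clip negative entries, add a small smoothing term,
    # and normalize so the vector sums to 1.
    v = np.clip(np.asarray(v, dtype=float), 0.0, None) + epsilon
    return v / v.sum()

Calling a hypothetical helper like this on X and Y at the top of averaged_kullback_leibler would at least keep the divergence finite, though whether the result still matches the measure defined in the paper is another question.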
