There are many ways to calculate how similar one piece of data is to another. http://wikiwiki.jp/cattail/?%CE%E0%BB%F7%C5%D9%A4%C8%B5%F7%CE%A5
Among them, I implement the Euclidean distance, Pearson's product-moment correlation coefficient, and the Jaccard coefficient in Python. Reference: Programming Collective Intelligence, Chapter 2. https://www.oreilly.co.jp/books/9784873113647/
The Euclidean distance is the ordinary distance familiar from junior high and high school math. In two or three dimensions it can be drawn in a diagram and visualized, but beyond that there is no picture to imagine; still, the calculation is exactly the same as in three or fewer dimensions.
a = (a_1, a_2, a_3, \dots, a_n), \quad b = (b_1, b_2, b_3, \dots, b_n)
Given two data points like these, the Euclidean distance d between a and b is
d = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2 + \dots + (a_n - b_n)^2}
Left as it is, this returns a distance, but I want a value that reads more naturally as a degree of similarity: something between 0 and 1, where more similar data gives a value closer to 1. Adding 1 to d and taking the reciprocal gives such a value and also prevents a division-by-zero error.
1/(1 + d)
The following is an implementation of this in Python.
recommendation.py
```python
import math


def sim_distance(prefs, person1, person2):
    # Items that both person1 and person2 have rated
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # If there are no items rated by both, the similarity is 0
    if len(si) == 0:
        return 0
    # Squared difference for each shared item
    squares = [(prefs[person1][item] - prefs[person2][item]) ** 2 for item in si]
    distance = math.sqrt(sum(squares))
    return 1 / (1 + distance)
```
Let's compute the similarity using the following data. critics stores several movies and each person's rating of those movies on a 5-point scale.
```python
critics = {
    'Lisa Rose': {
        'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.5,
        'Just My Luck': 3.0,
        'Superman Returns': 3.5,
        'The Night Listener': 3.0,
        'You, Me and Dupree': 2.5,
    },
    'Gene Seymour': {
        'Lady in the Water': 3.0,
        'Snakes on a Plane': 3.5,
        'Just My Luck': 1.5,
        'Superman Returns': 5.0,
        'The Night Listener': 3.0,
        'You, Me and Dupree': 3.5,
    },
    'Michael Phillips': {
        'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.0,
        'Superman Returns': 3.5,
        'The Night Listener': 4.0,
    },
    'Claudia Puig': {
        'Snakes on a Plane': 3.5,
        'Just My Luck': 3.0,
        'The Night Listener': 4.5,
        'Superman Returns': 4.0,
        'You, Me and Dupree': 2.5,
    },
    'Mick LaSalle': {
        'Lady in the Water': 3.0,
        'Snakes on a Plane': 4.0,
        'Just My Luck': 2.0,
        'Superman Returns': 3.0,
        'The Night Listener': 3.0,
        'You, Me and Dupree': 2.0,
    },
    'Jack Matthews': {
        'Lady in the Water': 3.0,
        'Snakes on a Plane': 4.0,
        'The Night Listener': 3.0,
        'Superman Returns': 5.0,
        'You, Me and Dupree': 3.5,
    },
    'Toby': {
        'Snakes on a Plane': 4.5,
        'You, Me and Dupree': 1.0,
        'Superman Returns': 4.0,
    },
}
```
```
$ python
Python 3.5.1 (default, Nov 7 2016, 22:30:16)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import recommendation
>>> recommendation.sim_distance(critics, 'Lisa Rose', 'Gene Seymour')
0.29429805508554946
```
If the data is not normalized, simply taking the Euclidean distance gives questionable results. For example, suppose Mr. A and Mr. B have similar tastes in movies; in that case I want the similarity between them to come out high. Suppose their ratings of movies X, Y, and Z are as follows.
|        | Movie X | Movie Y | Movie Z |
|--------|---------|---------|---------|
| Mr. A  | 3       | 1.5     | 3.5     |
| Mr. B  | 4       | 2       | 5       |
Their tastes look similar, but Mr. A rates rather harshly while Mr. B rates generously. Computing this with the sim_distance above gives a similarity of only 0.348. When the ratings are biased or padded like this, the Euclidean distance cannot capture the similarity in taste.
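For reference, here is that number checked in the interpreter. The dictionary just re-enters the table above as 'Asan' and 'Bsan'; the same data is reused for the Pearson example later.

```
>>> data = {'Asan': {'X': 3.0, 'Y': 1.5, 'Z': 3.5}, 'Bsan': {'X': 4.0, 'Y': 2.0, 'Z': 5.0}}
>>> round(recommendation.sim_distance(data, 'Asan', 'Bsan'), 3)
0.348
```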
Pearson's product-moment correlation coefficient is used in such cases. It quantifies the correlation between the data rather than the raw distance.
\frac{ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) } { \sqrt{ \sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{ \sum_{i=1}^{n} (Y_i - \bar{Y})^2} }
The bar over a variable denotes its mean. It is hard to see what is going on from the formula alone, but the numerator is the covariance and the denominator is the product of the standard deviations of the two data sets (up to factors of 1/n, which cancel). Can it be thought of as computing the cosine similarity of the mean-centered data? (I'm not entirely sure...)
Reference: http://mathtrain.jp/correlation http://aoki2.si.gunma-u.ac.jp/lecture/Soukan/pearson.html http://d.hatena.ne.jp/sleepy_yoshi/20110325/p1
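As a rough check of that cosine-similarity reading, here is a small sketch assuming numpy is available; the vectors are Mr. A's and Mr. B's ratings from the table above.

```python
import numpy as np

x = np.array([3.0, 1.5, 3.5])  # Mr. A's ratings
y = np.array([4.0, 2.0, 5.0])  # Mr. B's ratings

# Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# Cosine similarity of the mean-centered vectors
xc, yc = x - x.mean(), y - y.mean()
cos = xc.dot(yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(round(r, 6), round(cos, 6))  # both print 0.995871
```

Both come out the same, which also matches the value sim_pearson returns below.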
Here is my implementation in Python.
```python
def sim_pearson(prefs, person1, person2):
    # Items that both person1 and person2 have rated
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    n = len(si)
    if n == 0:
        return 0
    # Mean rating of each person over the shared items
    mean1 = sum([prefs[person1][item] for item in si]) / n
    mean2 = sum([prefs[person2][item] for item in si]) / n
    # Square roots of the sums of squared deviations
    # (the 1/n factors cancel against the numerator, so they are omitted)
    dev1 = math.sqrt(sum([(prefs[person1][item] - mean1) ** 2 for item in si]))
    dev2 = math.sqrt(sum([(prefs[person2][item] - mean2) ** 2 for item in si]))
    # Sum of the products of deviations
    covariance = sum([(prefs[person1][item] - mean1) * (prefs[person2][item] - mean2) for item in si])
    if dev1 * dev2 == 0:
        return 0
    return covariance / (dev1 * dev2)
```
```
>>> data = {'Asan': {'X': 3.0, 'Y': 1.5, 'Z': 3.5}, 'Bsan': {'X': 4.0, 'Y': 2.0, 'Z': 5.0}}
>>> recommendation.sim_pearson(data, 'Asan', 'Bsan')
0.9958705948858225
```
A much higher value came out than with the Euclidean-distance similarity. At that point I thought Pearson's product-moment correlation coefficient was the strongest choice, but it only works well when the relationship looks linear on a scatter plot, the data being compared should be roughly normally distributed, and outliers will drag the value around, so the conditions have to be satisfied to some extent.
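The outlier sensitivity is easy to see with made-up data. A quick sketch; the dictionaries below are hypothetical and have nothing to do with the movie data:

```
>>> base = {'A': {'p': 1.0, 'q': 2.0, 'r': 3.0, 's': 4.0, 't': 5.0}, 'B': {'p': 2.0, 'q': 4.0, 'r': 6.0, 's': 8.0, 't': 10.0}}
>>> round(recommendation.sim_pearson(base, 'A', 'B'), 3)
1.0
>>> outlier = {'A': {'p': 1.0, 'q': 2.0, 'r': 3.0, 's': 4.0, 't': 5.0}, 'B': {'p': 2.0, 'q': 4.0, 'r': 6.0, 's': 8.0, 't': 100.0}}
>>> round(recommendation.sim_pearson(outlier, 'A', 'B'), 3)
0.743
```

A single inflated rating pulls a perfect correlation down to about 0.74.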
Programming Collective Intelligence implements the same Pearson product-moment correlation coefficient as follows.
```python
from math import sqrt


def sim_pearson(prefs, p1, p2):
    '''
    Returns the Pearson correlation coefficient for p1 and p2.
    '''
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    # If there are no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Sum calculations
    n = len(si)
    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Calculate r (Pearson score)
    num = pSum - sum1 * sum2 / n
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r
```
https://github.com/arthur-e/Programming-Collective-Intelligence/blob/master/chapter2/recommendations.py
When I read this code in the book, I couldn't see how the formula above turns into this kind of implementation, so I implemented the formula directly as shown earlier. When I checked against scipy.stats.pearsonr, my implementation and the book's code returned the same value, so it doesn't matter which one you use, but I couldn't immediately see how to rearrange the formula into the book's form; if you know a clean way, please let me know.
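That said, expanding the sums of deviations does seem to reproduce the book's expressions (a sketch of the algebra, not taken from the book):

\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}

\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}

The first identity corresponds to num = pSum - sum1 * sum2 / n, and the second, applied to each person, gives the two factors inside sqrt in den.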
The Jaccard coefficient calculates the similarity between sets.
J( A, B ) = \frac { \mid A \cap B \mid } { \mid A \cup B \mid } = \frac { \mid A \cap B \mid } { |A| + |B| - \mid A \cap B \mid }
https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance
It is used, for example, when you want to calculate the similarity between sentences: extract the words used in sentence A and the words used in sentence B, and compute the value from their intersection and union. The more words the two have in common, the higher the similarity.
```python
def sim_jaccard(prefs, a, b):
    # Items that appear in both a and b (the intersection)
    si = {}
    for item in prefs[a]:
        if item in prefs[b]:
            si[item] = 1
    n = len(si)
    if n == 0:
        return 0
    # |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)
    len_a = len(prefs[a])
    len_b = len(prefs[b])
    return n / (len_a + len_b - n)
```
```
>>> data = {'machine-learning': ['DNN', 'python', 'chainer', 'scikit-learn'], 'python-waf': ['python', 'django', 'flask', 'pyenv']}
>>> recommendation.sim_jaccard(data, 'machine-learning', 'python-waf')
0.14285714285714285
```