Notes for implementing simple collaborative filtering in Python

About this article

Let's write a simple collaborative filtering algorithm in Python. Collaborative filtering is the mechanism behind the familiar "people who see this also see this" recommendations.

It's a very simple algorithm, so it isn't suitable for real-world use as is, but it makes it easy to understand how collaborative filtering algorithms work.

If you write the code in this article yourself, you'll see that the logic behind **"people who see this also see this"** is not that esoteric as a concept.

Useful site for studying collaborative filtering

The code used in this article is based on this site. Those who are comfortable reading English may read the original site.

Here are some other sites that are useful for studying the concepts behind recommender systems. The Coursera lecture is especially recommended.

- Coursera: "Intro to Recommender Systems"
- kamishima.net: Recommender system algorithms
- gihyo.co.jp: Basics of Information Recommendation Systems
- SlideShare: Recommender system construction using collaborative filtering

The basic concept of collaborative filtering

Consider an algorithm that recommends movies to a user A. What the algorithm does can be simplified as follows.

Step ① Calculate the **degree of similarity** between user A and each of the other users.
↓
Step ② From the movies that the other users have watched, extract the set of movies that user A has not seen yet.
↓
Step ③ From that set, return a list of the most highly recommended movies.
In this selection, a movie's score carries more weight the more similar to user A the users who watched it are.
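
The three steps can be sketched with toy data as follows. This is only an illustration: the users, the titles, and the fixed values in `toy_similarity` are made up for this sketch, and they stand in for the real data and the real similarity calculation.

```python
# Toy ratings; every name and number here is hypothetical.
toy_ratings = {
    "A": {"Movie 1": 4.0},
    "B": {"Movie 1": 4.5, "Movie 2": 5.0},
    "C": {"Movie 1": 1.0, "Movie 3": 5.0},
}

# Step 1: similarity between user A and each other user.
# Fixed values stand in for a real similarity calculation.
toy_similarity = {"B": 0.9, "C": 0.2}

# Step 2: movies the other users have watched that A has not.
seen_by_a = set(toy_ratings["A"])
candidates = set()
for other in toy_similarity:
    candidates |= set(toy_ratings[other]) - seen_by_a

# Step 3: weight each candidate's score by the similarity of the user who rated it.
weighted = {}
for other, sim in toy_similarity.items():
    for movie in candidates & set(toy_ratings[other]):
        weighted[movie] = weighted.get(movie, 0.0) + sim * toy_ratings[other][movie]

# Highest weighted score first
ranking = sorted(weighted, key=weighted.get, reverse=True)
print(ranking)  # ['Movie 2', 'Movie 3']
```

Both candidates received a score of 5.0, but "Movie 2" ranks first because it was rated by the user who is more similar to A.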

Preparation

Importing the required packages

`package.py`


from math import sqrt

Data preparation

The data used here contains the movies watched by some movie lovers and the results of their reviews (scores) in a dictionary format.

`dataset.py`


dataset={
 'Lisa Rose': {
 'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'Superman Returns': 3.5,'You, Me and Dupree': 2.5, 'The Night Listener': 3.0
  },
 'Gene Seymour': {
 'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5
  },
 'Michael Phillips': {
 'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0, 'Superman Returns': 3.5, 'The Night Listener': 4.0
  },
 'Claudia Puig': {
 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'The Night Listener': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 2.5
  },
 'Mick LaSalle': {
 'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 2.0
  },
 'Jack Matthews': {
 'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5
  },
 'Toby': {
 'Snakes on a Plane':4.5, 'You, Me and Dupree':1.0, 'Superman Returns':4.0
  }
}
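
A nested dictionary like this is read as `dataset[user][movie]`. The sketch below uses a two-user excerpt of the data above (the name `sample` is introduced only for this illustration) to show the kinds of lookups and set operations that come in handy with this structure.

```python
# A two-user excerpt of the dataset above; `sample` is a name used only in this sketch.
sample = {
    'Michael Phillips': {
        'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
        'Superman Returns': 3.5, 'The Night Listener': 4.0
    },
    'Toby': {
        'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0
    }
}

# Score that one user gave to one movie
score = sample['Toby']['Snakes on a Plane']

# Movies both users have scored (intersection of the dictionary keys)
common = set(sample['Toby']) & set(sample['Michael Phillips'])

# Movies Michael Phillips has scored but Toby has not (set difference)
unseen_by_toby = set(sample['Michael Phillips']) - set(sample['Toby'])

print(score)                   # 4.5
print(sorted(common))          # ['Snakes on a Plane', 'Superman Returns']
print(sorted(unseen_by_toby))  # ['Lady in the Water', 'The Night Listener']
```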

"Calculation of similarity" between users

In collaborative filtering, the similarity between users is calculated first. The key design decision here is how to define **"whether two users are similar or dissimilar"**.

There are countless possible definitions, and the choice is up to the designer. Here we define it so that the more closely two users score the same movies, the higher their similarity.

In this case, a function that calculates the similarity between users (person1, person2) can be implemented as follows:

`similarity.py`


def get_similairty(person1, person2):

  # Get the set of movies that both users have scored
  set_person1 = set(dataset[person1].keys())
  set_person2 = set(dataset[person2].keys())
  set_both = set_person1.intersection(set_person2)

  if len(set_both)==0: # If they have no movies in common, the similarity is 0
    return 0

  list_destance = []

  for item in set_both:
    # Square of the difference between the two review scores for the same movie
    # The larger this value, the more the two users' tastes differ, i.e. the less similar they are
    distance = pow(dataset[person1][item]-dataset[person2][item], 2)
    list_destance.append(distance)

  return 1/(1+sqrt(sum(list_destance))) # Reciprocal of (1 + distance): the larger the distance, the lower the similarity

Here, similarity is defined as follows. Similarity = `1 / (1 + sqrt(sum(list_destance)))` ... (1)

Note that `sum(list_destance)` is the square of the distance between the two users in the review score space. The larger this distance, the less similar the users are, so taking the reciprocal as in (1) gives a degree of similarity. When the distance is 0, the similarity is 1, and when the distance is extremely large, the similarity approaches 0.
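
To see how formula (1) behaves, the following sketch evaluates it for a few distances. The helper name `similarity_from_distance` is an assumption made for this illustration and is not part of the article's code.

```python
def similarity_from_distance(distance):
    # Formula (1) with the distance already computed: 1 / (1 + distance)
    return 1 / (1 + distance)

# Distance 0 gives 1.0, distance 1 gives 0.5, and large distances approach 0
for d in [0, 1, 10, 1000]:
    print(d, similarity_from_distance(d))
```

Adding 1 to the denominator also keeps the division defined when two users give identical scores and the distance is exactly 0.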

get_similairty('Lisa Rose','Jack Matthews')
0.3405424265831667
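
This value can be checked by hand. The sketch below repeats the calculation for the five movies that both Lisa Rose and Jack Matthews scored, with the scores copied from the dataset; the variable names are chosen just for this illustration.

```python
from math import sqrt

# (Lisa Rose's score, Jack Matthews's score) for each movie both have scored
pairs = {
    'Lady in the Water': (2.5, 3.0),
    'Snakes on a Plane': (3.5, 4.0),
    'Superman Returns': (3.5, 5.0),
    'You, Me and Dupree': (2.5, 3.5),
    'The Night Listener': (3.0, 3.0),
}

# Squared differences: 0.25, 0.25, 2.25, 1.0, 0.0
sum_of_squares = sum((a - b) ** 2 for a, b in pairs.values())
similarity = 1 / (1 + sqrt(sum_of_squares))

print(sum_of_squares)  # 3.75
print(similarity)      # approximately 0.3405
```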

Implement the recommendation function

The design and implementation ideas behind the recommendation are written in the comments.

`recomend.py`


def get_recommend(person, top_N):

  totals = {} ; simSums = {} # Containers for the recommendation scores

  # Get the list of users other than the target user and loop over them
  # -> for each of them, compute the similarity to the target user and the
  #    recommendation score of each movie the target user has not seen yet
  # (a list comprehension is used because dict.keys() has no remove() in Python 3)
  list_others = [name for name in dataset if name != person]

  for other in list_others:
    # Get the set of movies that the target user has not seen yet
    set_other = set(dataset[other]); set_person = set(dataset[person])
    set_new_movie = set_other.difference(set_person)

    # Similarity between the target user and this user (sim is a number from 0 to 1)
    sim = get_similairty(person, other)

    if sim == 0: # No movies in common: skip, which also avoids dividing by 0 below
      continue

    # Loop over the movies the target user has not seen yet
    for item in set_new_movie:

      # Accumulate "similarity x review score" over all users as the recommendation score
      totals.setdefault(item,0)
      totals[item] += dataset[other][item]*sim

      # Also accumulate the similarities, so the score above can be divided by their sum
      simSums.setdefault(item,0)
      simSums[item] += sim

  rankings = [(total/simSums[item],item) for item,total in totals.items()]
  rankings.sort()
  rankings.reverse()

  return [i[1] for i in rankings][:top_N]
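
Dividing by `simSums[item]` turns the weighted sum into a weighted average, so a movie is not ranked higher merely because more users have scored it. A minimal sketch with made-up numbers (the titles, similarities, and scores below are hypothetical and not taken from the dataset):

```python
# Hypothetical (similarity, score) pairs from the users who scored each movie
ratings = {
    'Movie X': [(0.5, 3.0), (0.5, 3.0), (0.5, 3.0)],  # three users, middling score
    'Movie Y': [(0.5, 5.0)],                          # one user, top score
}

weighted_sum = {}
weighted_avg = {}
for movie, pairs in ratings.items():
    total = sum(sim * score for sim, score in pairs)  # plays the role of totals[item]
    sim_sum = sum(sim for sim, _ in pairs)            # plays the role of simSums[item]
    weighted_sum[movie] = total
    weighted_avg[movie] = total / sim_sum

print(weighted_sum)  # {'Movie X': 4.5, 'Movie Y': 2.5}
print(weighted_avg)  # {'Movie X': 3.0, 'Movie Y': 5.0}
```

Without the division, "Movie X" would rank first simply because three users scored it; after the division, "Movie Y" ranks first because its score is higher.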

Result

get_recommend('Toby',2)

['The Night Listener', 'Lady in the Water']