Let's build the "People who bought this product also bought these products" feature that often appears on Amazon.
This is generally called a recommendation feature. There are two main ways to implement recommendations: "collaborative filtering" and "content-based filtering".
In content-based filtering, attribute tags are attached to each product in advance. For example, to recommend products for "The Old Man and the Sea" (Hemingway), you could tag each book with an author attribute, so that other books written by Hemingway are displayed as recommendations.
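The tagging idea can be sketched in a few lines. This is only an illustration: the `CATALOG` dictionary and the `recommend_by_attribute` helper are hypothetical names, and a real catalog would carry many more attributes than the author.

```python
# Hypothetical catalog: each product is tagged with an "author" attribute in advance.
CATALOG = {
    "The Old Man and the Sea": {"author": "Hemingway"},
    "A Farewell to Arms": {"author": "Hemingway"},
    "The Sun Also Rises": {"author": "Hemingway"},
    "The Great Gatsby": {"author": "Fitzgerald"},
}


def recommend_by_attribute(title, attribute="author"):
    """Return the other products that share the given attribute value with `title`."""
    value = CATALOG[title][attribute]
    return sorted(
        other
        for other, tags in CATALOG.items()
        if other != title and tags.get(attribute) == value
    )


print(recommend_by_attribute("The Old Man and the Sea"))
# ['A Farewell to Arms', 'The Sun Also Rises']
```

Note that no purchase history is needed here; the recommendation depends only on the tags.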
Collaborative filtering recommends products that were bought by other customers who also bought this product.
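In its simplest form, this amounts to counting co-purchases. The sketch below assumes a purchase history keyed by customer ID (`PURCHASES`, which mirrors the sample data used later in this article, regrouped by customer) and a hypothetical `also_bought` helper. It uses raw counts rather than a similarity score.

```python
from collections import Counter

# Purchase history: customer ID -> set of purchased products.
PURCHASES = {
    1: {"X", "B"},
    2: {"A", "B", "C"},
    3: {"X", "B", "C", "D"},
    4: {"A", "C", "E"},
    5: {"X", "A"},
    6: {"E"},
    7: {"C", "E"},
}


def also_bought(product):
    """Count how often each other product was bought by buyers of `product`."""
    counts = Counter()
    for basket in PURCHASES.values():
        if product in basket:
            counts.update(basket - {product})
    return counts


print(also_bought("X").most_common(1))
# [('B', 2)]
```

Raw counts favor products that are popular overall, which is one reason to normalize them into a similarity score such as the Jaccard index.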
This time, we will implement "collaborative filtering".
Redis is a key-value store (KVS). Here we use the Redis Sorted Set.
To install Redis, see:
MacPorts: http://blog.katsuma.tv/2010/03/start_redis.html
Homebrew: http://qiita.com/items/3d2a2fc683ae19302071
Calculating the recommended products on every request is not realistic in terms of computation cost, so they need to be calculated in advance and **recorded in a form that is easy to retrieve**. (Anything that makes recording and retrieval easy will do; it does not have to be Redis.)
A Sorted Set is a collection in which every member has a score, and which is kept sorted by score automatically (on the Redis side) as data is added.
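To make that behavior concrete without a running server, here is a toy in-memory stand-in. `MiniSortedSet` is a hypothetical class written for illustration, not part of Redis or redis-py. It only mimics the ordering of `ZADD` and `ZREVRANGE` for non-negative indexes.

```python
class MiniSortedSet:
    """Toy in-memory stand-in for a Redis Sorted Set (illustration only)."""

    def __init__(self):
        self._scores = {}  # member -> score; each member appears only once

    def zadd(self, member, score):
        # Adding an existing member simply updates its score.
        self._scores[member] = score

    def zrevrange(self, start, stop):
        # Highest score first; members with equal scores come out in
        # reverse lexicographic order, as they do for ZREVRANGE in Redis.
        ordered = sorted(
            self._scores, key=lambda m: (self._scores[m], m), reverse=True
        )
        return ordered[start:stop + 1]  # like Redis, `stop` is inclusive


s = MiniSortedSet()
s.zadd("A", 0.2)
s.zadd("D", 0.33)
s.zadd("B", 0.5)
print(s.zrevrange(0, 2))
# ['B', 'D', 'A']
```

The insertion order does not matter: reading the top N members always returns them by descending score, which is exactly what a "top recommendations" lookup needs.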
Recommendations can be implemented once the similarity of each product to product X can be obtained as a numeric value.
There are many similarity measures, but the Jaccard index is a common choice. In the sample data below, the Jaccard index of product A with respect to product X is 1/5. The numerator 1 is the number of customers who purchased both product X and product A, that is, the size of the intersection. The denominator 5 is the total number of customers who purchased either product X or product A, that is, the size of the union.
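Written as a formula, with the buyer sets X = {1, 3, 5} and A = {2, 4, 5} taken from the sample data (the symbol J is just notation chosen here):

```latex
J(X, A) = \frac{|X \cap A|}{|X \cup A|}
        = \frac{|\{5\}|}{|\{1, 2, 3, 4, 5\}|}
        = \frac{1}{5} = 0.2
```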
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import unicode_literals


def jaccard(e1, e2):
    """
    Calculate the Jaccard index
    :param e1: list of int
    :param e2: list of int
    :rtype: float
    """
    set_e1 = set(e1)
    set_e2 = set(e2)
    return float(len(set_e1 & set_e2)) / float(len(set_e1 | set_e2))


def get_key(k):
    # Redis key of the Sorted Set that holds the similarities for product k
    return 'JACCARD:PRODUCT:{}'.format(k)

# IDs of the customers who purchased each product
# (e.g. product X was purchased by customers 1, 3 and 5)
product_x = [1, 3, 5]
product_a = [2, 4, 5]
product_b = [1, 2, 3]
product_c = [2, 3, 4, 7]
product_d = [3]
product_e = [4, 6, 7]

# Product data
products = {
    'X': product_x,
    'A': product_a,
    'B': product_b,
    'C': product_c,
    'D': product_d,
    'E': product_e,
}

# redis
import redis

# decode_responses=True makes the client return str instead of bytes on Python 3
r = redis.Redis(host='localhost', port=6379, db=10, decode_responses=True)

# Calculate the Jaccard index and record it in the Redis Sorted Set for each product
for key in products:
    base_customers = products[key]
    for key2 in products:
        if key == key2:
            continue
        target_customers = products[key2]
        # Calculate the Jaccard index
        j = jaccard(base_customers, target_customers)
        # Record in the Redis Sorted Set
        # (redis-py 3.0 and later take a {member: score} mapping)
        r.zadd(get_key(key), {key2: j})

# Example 1: people who bought product X also bought these products.
print(r.zrevrange(get_key('X'), 0, 2))
# > ['B', 'D', 'A']

# Example 2: people who bought product E also bought these products.
print(r.zrevrange(get_key('E'), 0, 2))
# > ['C', 'A', 'X']

Products B, D, and A are recommended for those who bought product X. Checking the scores, their similarities are 0.5, 0.33, and 0.2, respectively, so the recommendations look correct.
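Those values can be double-checked without Redis. The following is a small sketch, assuming Python 3 (where `/` is true division), that recomputes the similarity of every product to X from the same sample data and sorts by score.

```python
def jaccard(e1, e2):
    """Jaccard index of two collections of customer IDs."""
    s1, s2 = set(e1), set(e2)
    return len(s1 & s2) / len(s1 | s2)


# Same sample data as above: product -> IDs of customers who bought it.
products = {
    "X": [1, 3, 5],
    "A": [2, 4, 5],
    "B": [1, 2, 3],
    "C": [2, 3, 4, 7],
    "D": [3],
    "E": [4, 6, 7],
}

# Similarity of every other product to X, rounded for display.
scores = {
    name: round(jaccard(products["X"], buyers), 2)
    for name, buyers in products.items()
    if name != "X"
}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
# [('B', 0.5), ('D', 0.33), ('A', 0.2), ('C', 0.17), ('E', 0.0)]
```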
As the number of customers and products increases, the amount of computation explodes and this approach stops being practical, because every product is compared with every other product.
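To get a rough sense of the growth: the nested loop in the script calls `jaccard` once for every ordered pair of distinct products. The helper below (`pair_comparisons`, a hypothetical name) just evaluates that count, and each call additionally costs time proportional to the number of customers involved.

```python
def pair_comparisons(n_products):
    """Number of jaccard() calls made by a nested loop over all ordered pairs."""
    return n_products * (n_products - 1)


for n in (6, 1000, 1000000):
    print(n, pair_comparisons(n))
# 6 30
# 1000 999000
# 1000000 999999000000
```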
One way to address this is to build an inverted index, as described in Amazon's paper: http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
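As a rough sketch of the idea (not the paper's exact algorithm): index the data by customer, then for each product visit only the products that share at least one buyer with it. Pairs with no common customer have a Jaccard index of 0 and are never touched. The names `by_customer` and `similar_products` are hypothetical, and Python 3 is assumed.

```python
from collections import Counter, defaultdict

# Same sample data as above: product -> IDs of customers who bought it.
products = {
    "X": [1, 3, 5],
    "A": [2, 4, 5],
    "B": [1, 2, 3],
    "C": [2, 3, 4, 7],
    "D": [3],
    "E": [4, 6, 7],
}

# Inverted index: customer ID -> products that customer bought.
by_customer = defaultdict(set)
for product, customers in products.items():
    for customer in customers:
        by_customer[customer].add(product)


def similar_products(product):
    """Jaccard scores for only those products that share a buyer with `product`."""
    buyers = set(products[product])
    overlap = Counter()  # other product -> number of shared customers
    for customer in buyers:
        for other in by_customer[customer]:
            if other != product:
                overlap[other] += 1
    # |A or B| = |A| + |B| - |A and B|, so the union never has to be built.
    return {
        other: shared / (len(buyers) + len(set(products[other])) - shared)
        for other, shared in overlap.items()
    }


print(sorted(similar_products("E").items()))
# [('A', 0.2), ('C', 0.4)]
```

For product E, only A and C are scored. X, B, and D share no customer with E, so they are skipped instead of being stored with a score of 0.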