Recommendation engines are often used behind web APIs, where they must return results quickly while maintaining accuracy. So in this article, when implementing a recommendation feature based on movie viewing history, we use principal component analysis and K-means clustering to preserve as much recommendation accuracy as possible while keeping the computation fast enough for API use.
The recommendation logic used here is simple item-based collaborative filtering: for a selected movie, we extract and recommend movies with similar user ratings, as sketched below.
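As a minimal sketch of that idea (the rating vectors below are made up for illustration; this is not the MovieLens data used later):

```python
import numpy as np

# Toy sketch of item-based collaborative filtering:
# each row is one movie's ratings from four users.
ratings = np.array([
    [5, 3, 4, 4],  # movie A
    [5, 3, 4, 5],  # movie B: rated much like movie A
    [1, 5, 2, 1],  # movie C: rated very differently
], dtype=float)

# Distance between rating vectors; the smaller, the more similar.
target = ratings[0]
distances = np.linalg.norm(ratings - target, axis=1)
print(distances)  # movie B comes out closest to movie A
```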
Principal component analysis (PCA) is a method that extracts only the principal components of high-dimensional vectors, reducing their dimensionality and thus the amount of data. Since the number of dimensions after PCA is decided in advance, the amount of data per vector is fixed, so the cost of each comparison stays constant even if the original vectors become enormous.
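A quick shape-only illustration with random data (the sizes here are arbitrary, chosen just to show the reduction):

```python
import numpy as np
from sklearn.decomposition import PCA

# 300 items rated by 2,000 users; reduce each item vector to 128 components.
X = np.random.rand(300, 2000)
pca = PCA(n_components=128)
X_reduced = pca.fit_transform(X)
print(X.shape, '->', X_reduced.shape)  # (300, 2000) -> (300, 128)
```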
Furthermore, by roughly grouping the movies with K-means clustering, we limit the candidates to be evaluated, reduce the number of comparisons, and aim for faster responses.
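A toy illustration of that speed-up with random vectors (sizes and cluster count are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster 200 random 16-dim item vectors into 3 groups, then compare
# an item only against the members of its own cluster.
X = np.random.rand(200, 16)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

target = 0  # index of the item we want neighbours for
same_cluster = np.where(labels == labels[target])[0]
print('comparisons cut from', len(X) - 1, 'to', len(same_cluster) - 1)
```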
This time, we will use MovieLens, a free movie rating dataset: https://grouplens.org/datasets/movielens/100k/
The idea: movies whose user ratings resemble those of a movie you like should also be movies you like! But comparing against every single movie would take too long, so we first form groups of roughly similar movies (clustering) and extract only the rating information that seems to matter most (principal component analysis) before comparing.
For example, if you have rating data for 10,000 movies, data along the lines of "everyone has watched this" or "almost nobody has watched this" is not very informative, so it can be excluded from the comparison. We narrow things down to around 100 components that capture the characteristics well. (Strictly speaking, that is not quite how it works, but this is roughly the idea.)
Download ml-100k.zip from https://grouplens.org/datasets/movielens/100k/ and unzip it. It contains several files, which are explained in the README. This time we mainly use u1.base.
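Per the dataset README, each line of u1.base is tab-separated: user id, item id, rating (1-5), timestamp. A quick peek to confirm:

```python
# Print the first few lines of u1.base to see the tab-separated format:
# user id, item id, rating (1-5), timestamp.
with open('u1.base') as f:
    for _ in range(3):
        print(f.readline().rstrip())
```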
The program below walks through the process, with comments at each step.
recommend.py
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Read the TSV-format rating data: user ID, movie ID, rating
# (u1.base has no header row)
datum = np.loadtxt("u1.base", delimiter="\t", usecols=(0, 1, 2))
# Collect the user IDs and movie IDs that appear in the data
user_ids = []
movie_ids = []
for row in datum:
    user_ids.append(row[0])
    movie_ids.append(row[1])
user_ids = sorted(set(user_ids))
# Keep movie_ids sorted so its indices line up with the dataset rows built below
movie_ids = sorted(set(movie_ids))
# Organize the rating data per movie ID
vectors = {}
for movie_id in movie_ids:
    vectors[movie_id] = {}
    for user_id in user_ids:
        # Unwatched movies default to a rating of -1
        vectors[movie_id][user_id] = -1

# Store each user's actual rating in the vectors
for row in datum:
    vectors[row[1]][row[0]] = row[2]
# Flatten into a list of rating vectors, one row per movie
dataset = []
for movie_id in movie_ids:
    temp_data = []
    for user_id in sorted(vectors[movie_id]):
        temp_data.append(vectors[movie_id][user_id])
    dataset.append(temp_data)
# Classify the movies into 3 clusters with K-means
predict = KMeans(n_clusters=3).fit_predict(dataset)

# Number of dimensions to keep after principal component analysis
DIMENSION_NUM = 128

# Principal component analysis
pca = PCA(n_components=DIMENSION_NUM)
dataset = pca.fit_transform(dataset)
print('Cumulative contribution ratio: {0}'.format(sum(pca.explained_variance_ratio_)))
# Find movies similar to movie ID 1
MOVIE_ID = 1
# Get the cluster that movie ID 1 belongs to
CLUSTER_ID = predict[movie_ids.index(MOVIE_ID)]

distance_data = {}
for index in range(len(predict)):
    # Compare vector distances only within the same cluster
    if predict[index] == CLUSTER_ID:
        distance = np.linalg.norm(dataset[index] - dataset[movie_ids.index(MOVIE_ID)])
        distance_data[movie_ids[index]] = distance

# Print the candidates in order of vector distance (closest first)
print(sorted(distance_data.items(), key=lambda x: x[1]))
```
```
Cumulative contribution ratio: 0.7248119795849713
[(1.0, 0.0), (121.0, 67.0315681132561), (117.0, 69.90161652852805), (405.0, 71.07049485275981), (151.0, 71.39559068741323), (118.0, 72.04600188124728), (222.0, 72.78595965661094), (181.0, 74.18442192660996), (742.0, 76.10520742268852), (28.0, 76.27732956739469), (237.0, 76.31850794116573), (25.0, 76.82773190547944), (7.0, 76.96541606511116), (125.0, 77.07961442692692), (95.0, 77.42577990621398), (257.0, 77.87452368514414), (50.0, 78.80566867021435), (111.0, 78.9631520879044), (15.0, 78.97825600046046), (69.0, 79.22663656944697), (588.0, 79.64989759225082), (82.0, 80.23718315576053), (71.0, 80.26936193506091), (79.0, 81.02025503780014).....
```
The movie with ID = 1 was Toy Story, and the closest movie, ID = 121, was Independence Day. A cumulative contribution ratio of 0.72 means that the retained principal components preserve about 72% of the information in the original data. The results feel about right!
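For reference, the titles come from u.item in the same archive. Per the dataset README it is pipe-separated with the movie ID and title as the first two fields, so a sketch like this can resolve the IDs:

```python
# Resolve movie IDs to titles using u.item from the same archive.
# It is pipe-separated and latin-1 encoded; ID and title come first.
titles = {}
with open('u.item', encoding='latin-1') as f:
    for line in f:
        movie_id, title = line.split('|')[:2]
        titles[int(movie_id)] = title
print(titles[1], '/', titles[121])  # e.g. Toy Story (1995) / Independence Day
```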
This time, a single script went through data shaping, clustering, principal component analysis, and comparison. In production, you would store the PCA-transformed vectors in a database ahead of time and perform only the comparison on each request.
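A minimal sketch of that split, continuing from the variables in recommend.py above; the file name model.pkl and the function name are illustrative, not from any particular framework:

```python
import pickle
import numpy as np

# Offline batch step (appended to recommend.py): persist the PCA-reduced
# vectors and cluster labels. 'model.pkl' is just an illustrative name.
with open('model.pkl', 'wb') as f:
    pickle.dump({'movie_ids': movie_ids, 'vectors': dataset, 'clusters': predict}, f)

# Request-time step (a separate process): load once at startup, then each
# lookup is only a within-cluster distance comparison.
with open('model.pkl', 'rb') as f:
    _model = pickle.load(f)

def similar_movies(movie_id, top_n=10):
    idx = _model['movie_ids'].index(movie_id)
    candidates = np.where(_model['clusters'] == _model['clusters'][idx])[0]
    dists = np.linalg.norm(_model['vectors'][candidates] - _model['vectors'][idx], axis=1)
    return [_model['movie_ids'][i] for i in candidates[np.argsort(dists)][:top_n]]
```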
In addition, since movie data and rating data grow daily, you would pick an appropriate interval and re-run the clustering and principal component analysis as a batch job.
When turning this into an API, we balance recommendation accuracy against response speed by tuning the number of principal components, the number of clusters, API caching, and so on. At first you can judge this *recommendation accuracy* simply by whether the results feel right; combining actual behavioral data such as CTR and CVR, or applying deep learning on top of that, would bring it closer to modern machine-learning-style recommendation.
Next time, I will write about building the API itself with Python and Flask.