Among the different kinds of recommendation systems, I would like to implement content-based filtering, which makes recommendations based only on item characteristics. (See this article for the types of recommendation systems.)
Content-based filtering is a method of making recommendations based on item characteristics. It calculates and presents items that are highly similar to the items in the user's browsing or purchase history.
The implementation proceeds in the following flow.
To calculate the similarity, first convert the item features (words and sentences) into feature vectors. There are several vectorization methods, such as One-Hot Encoding and TF-IDF, but this time we will use One-Hot Encoding because the item features are word data.
Once the items are vectorized, the next step is to calculate the similarity. There are several ways to do this, but this time we will use the commonly used cosine similarity.
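As a rough illustration of the vectorization step, here is a minimal sketch of one-hot encoding on made-up data. The item names, genre lists, and the helper `one_hot` are all hypothetical and are not part of the dataset used later. Because an item can have several genres, each vector may contain more than one 1.

```python
import numpy as np

# Hypothetical toy items; the genre lists are made up for illustration.
items = {
    "item_a": ["Drama", "Romance"],
    "item_b": ["Action", "Drama"],
    "item_c": ["Comedy"],
}

# Fixed, sorted vocabulary so that every item uses the same column order.
vocab = sorted({g for genres in items.values() for g in genres})


def one_hot(genres, vocab):
    """Return a 0/1 vector with a 1 for every vocabulary word in `genres`."""
    present = set(genres)
    return np.array([1 if word in present else 0 for word in vocab])


vectors = {name: one_hot(genres, vocab) for name, genres in items.items()}

print(vocab)
for name, vec in vectors.items():
    print(name, vec)
```

Sorting the vocabulary keeps the column order stable between runs, which a plain `set` does not guarantee.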
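For example, to calculate the similarity of items $ x, y $, let their feature vectors be $ x = (x_1, \dots, x_n) $ and $ y = (y_1, \dots, y_n) $. The cosine similarity is then
$$ \cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} $$
For one-hot vectors this value lies between 0 and 1, and it is closer to 1 the more features the two items share.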
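As a small numeric check of cosine similarity, the sketch below computes it for two made-up 0/1 vectors. The vectors and the helper name `cosine_similarity` are assumptions for illustration only. It assumes that neither vector is all zeros.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both assumed to be non-zero)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Made-up feature vectors: the two items share two of their features.
x = [1, 1, 0, 1]
y = [1, 0, 0, 1]

sim_xy = cosine_similarity(x, y)
print(sim_xy)  # dot product 2 divided by sqrt(3) * sqrt(2)
```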
Let's implement this using data from Kaggle. First, check the data. (This time only the genre column is used as the item feature, so the type, episodes, rating, and members columns are dropped to make the data easier to read. The code also runs without dropping them.)
code
import pandas as pd
import numpy as np
# Read the data
anime_data = pd.read_csv("anime.csv")
# Check the number of rows
print("The number of data:", len(anime_data.anime_id))
# Drop unused columns
anime_data = anime_data.drop(columns=['type', 'episodes', 'rating', 'members'])
# Check the contents of the data
anime_data.head()
The data is as follows.
Execution result
The number of data: 12294
 | anime_id | name | genre |
---|---|---|---|
0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural |
1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... |
2 | 28977 | Gintama | Action, Comedy, Historical, Parody, Samurai, S... |
3 | 9253 | Steins;Gate | Sci-Fi, Thriller |
4 | 9969 | Gintama' | Action, Comedy, Historical, Parody, Samurai, S... |
Next, vectorize the items. The genre data consists of words, so use One-Hot Encoding to turn it into feature vectors. Since the genre column of anime_data contains comma-separated genre names, build a list of genre names with the following code. Every genre is added to genre_col, and then set() is applied so that duplicate elements are removed.
code
genres = anime_data['genre'].map(lambda x: x.split(',')).to_list()
genre_col = list()
for i in genres:
    genre_col.extend(i)
genre_col = list(set(genre_col))
# Check the number of genre names
print(len(genre_col))
Execution result
#Genre name column length
83
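One thing to watch for: `split(',')` keeps the space that follows each comma, so the same genre can end up both as `'Drama'` and as `' Drama'` and be treated as two different columns. Here is a minimal sketch of a variant that strips the whitespace. It uses made-up strings rather than the real file. With this variant the number of genre columns on the real data would probably differ from the count shown above.

```python
# Made-up genre strings in the same comma-separated format as the genre column.
genre_strings = [
    "Drama, Romance, School",
    "Action, Drama",
    "Romance, Comedy",
]

# Splitting on ',' alone keeps the leading space of every genre but the first.
raw_split = [s.split(',') for s in genre_strings]

# Stripping each piece makes 'Drama' and ' Drama' the same genre.
clean_split = [[g.strip() for g in s.split(',')] for s in genre_strings]

raw_vocab = sorted({g for row in raw_split for g in row})
clean_vocab = sorted({g for row in clean_split for g in row})

print(len(raw_vocab), raw_vocab)
print(len(clean_vocab), clean_vocab)
```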
Use the list of genre names to build the one-hot representation of each item's genres. Each item's row is built as row_list and appended to rows. Finally, create a DataFrame and store it in genre_df.
code
# One-Hot Encoding
rows = list()
for index, row in enumerate(genres):
    # Start from an all-zero vector with one entry per genre
    row_list = np.array([0] * len(genre_col))
    # Positions of this item's genres in genre_col
    index_list = [genre_col.index(item) for item in row]
    row_list[index_list] = 1
    rows.append(list(row_list))
genre_df = pd.DataFrame(rows, columns=genre_col)
one_hot_data = pd.concat([anime_data, genre_df], axis=1)
The output of one_hot_data, which also includes the anime id and name to make it easier to read, looks like this.
 | anime_id | name | Kids | Game | Psychological | Fantasy | Space | School |
---|---|---|---|---|---|---|---|---|
0 | 32281 | Kimi no Na wa. | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5114 | Fullmetal Alchemist | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 28977 | Gintama | 0 | 0 | 0 | 0 | 0 | 0 |
(12294 rows × 86 columns; only some of the genre columns are shown)
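As an aside, pandas can produce the same kind of 0/1 genre columns directly from a delimited string column with `Series.str.get_dummies`. The sketch below uses a made-up three-row frame (`toy`) instead of the real data and assumes every genre string uses ", " (comma plus one space) as the separator.

```python
import pandas as pd

# Made-up frame with the same three columns as anime_data after the drop.
toy = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "genre": ["Drama, Romance", "Action, Drama", "Comedy"],
})

# One 0/1 column per distinct genre, split on ", ".
genre_dummies = toy["genre"].str.get_dummies(sep=", ")

# Put the id and name next to the genre columns for readability.
toy_one_hot = pd.concat([toy[["anime_id", "name"]], genre_dummies], axis=1)
print(toy_one_hot)
```

Because the separator includes the space, this approach does not create separate columns for names that differ only by a leading space.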
Next, calculate the cosine similarity between every pair of items and store the result as a similarity matrix.
code
# Create an array from the one-hot columns
item_vectors = np.array(one_hot_data[genre_col])
# Norm of each row vector
norm = np.linalg.norm(item_vectors, axis=1)
# Create the similarity matrix using the cosine similarity formula
sim_mat = np.dot(item_vectors, item_vectors.T) / np.outer(norm, norm)
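To see what the similarity matrix looks like, here is a minimal sketch on a tiny made-up array. The function name `cosine_similarity_matrix` and the toy vectors are assumptions. Unlike the code above, this sketch also guards against items with no features at all, whose norm is 0 and would otherwise cause a division by zero.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1)
    # Replace zero norms by 1 so all-zero rows stay all-zero instead of NaN.
    safe_norms = np.where(norms == 0, 1.0, norms)
    normalized = vectors / safe_norms[:, np.newaxis]
    return normalized @ normalized.T


# Made-up one-hot vectors; the last item has no features.
toy_vectors = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
])

toy_sim = cosine_similarity_matrix(toy_vectors)
print(np.round(toy_sim, 3))
```

Normalizing each row first and then taking the matrix product gives the same values as dividing the dot products by the outer product of the norms.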
With sim_mat as it is, it is hard to tell which row corresponds to which anime, so create a dictionary that maps each anime_id to its row index.
code
itemindex = dict()
for num, item_id in enumerate(one_hot_data.anime_id):
    itemindex[item_id] = num
itemindex
Execution result
{32281: 0,
5114: 1,
28977: 2,
9253: 3,
9969: 4,
32935: 5,
11061: 6,
Now let's extract highly similar items from the similarity matrix. Here, the 10 items most similar to "Kimi no Na wa." (Your Name, anime_id: 32281) are displayed.
code
# Look up the row index for the given anime_id and store it in row_num
row_num = itemindex[32281]
# Extract the top 10 from row row_num of the similarity matrix (position 0 is skipped as the item itself)
top10_index = np.argsort(sim_mat[row_num])[::-1][1:11]
top10_index
Execution result
array([6394, 5805, 208, 1959, 504, 1494, 2300, 1201, 5127, 1436])
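Skipping the first position of the sorted result assumes that the item itself is ranked first. When other items have exactly the same genres, they tie with it, and the item itself could in principle land elsewhere in the order. The sketch below is one possible way to exclude the item explicitly. The function name `top_k_similar` and the toy matrix are assumptions for illustration.

```python
import numpy as np


def top_k_similar(sim_matrix, row, k):
    """Indices of the k items most similar to `row`, never including `row`."""
    scores = np.asarray(sim_matrix[row], dtype=float).copy()
    # Make sure the item itself can never be selected.
    scores[row] = -np.inf
    # Sort by descending score; the stable sort keeps ties in index order.
    order = np.argsort(-scores, kind="stable")
    return order[:k]


# Made-up similarity matrix in which items 0 and 1 are identical.
toy_sim = np.array([
    [1.0, 1.0, 0.5, 0.0],
    [1.0, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
])

top2 = top_k_similar(toy_sim, row=1, k=2)
print(top2)
```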
Look up the anime_id that corresponds to each index in top10_index.
code
rec_id = list()
for search_index in top10_index:
    # Find the anime_id whose row index matches search_index
    for anime_id, index in itemindex.items():
        if index == search_index:
            rec_id.append(anime_id)
rec_id
Execution result
[546, 547, 28725, 713, 6351, 20903, 12175, 10067, 1607, 8481]
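The nested loop scans the whole dictionary once for each recommended index. As a possible alternative, the ids can be kept in an array in row order and indexed directly. The sketch below uses made-up values (`toy_ids`, `toy_top_index`) rather than the real results.

```python
import numpy as np

# Made-up ids in the same order as the rows of the similarity matrix.
toy_ids = [32281, 5114, 28977, 9253]

# Forward lookup (anime_id -> row index), built in one expression.
toy_itemindex = {item_id: num for num, item_id in enumerate(toy_ids)}

# Reverse lookup (row index -> anime_id) is just array indexing.
index_to_id = np.array(toy_ids)
toy_top_index = np.array([3, 1])  # hypothetical result of the top-k step
toy_rec_id = index_to_id[toy_top_index].tolist()

print(toy_rec_id)
```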
Finally, display the highly similar items that correspond to the obtained anime_id values.
code
anime_data.query("anime_id == @rec_id")
 | anime_id | name | genre |
---|---|---|---|
208 | 28725 | Kokoro ga Sakebitagatterunda. | Drama, Romance, School |
504 | 6351 | Clannad: After Story - Mou Hitotsu no Sekai | Drama, Romance, School |
1201 | 10067 | Angel Beats!: Another Epilogue | Drama, School, Supernatural |
1436 | 8481 | "Bungaku Shoujo" Memoire | Drama, Romance, School |
1494 | 20903 | Harmonie | Drama, School, Supernatural |
1959 | 713 | Air Movie | Drama, Romance, Supernatural |
2300 | 12175 | Koi to Senkyo to Chocolate | Drama, Romance, School |
5127 | 1607 | Venus Versus Virus | Drama, Romance, Supernatural |
5805 | 547 | Wind: A Breath of Heart OVA | Drama, Romance, School, Supernatural |
6394 | 546 | Wind: A Breath of Heart (TV) | Drama, Romance, School, Supernatural |
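Note that the table above is listed in the original row order of the data, not in order of similarity. If the ranking matters, one option is to index by anime_id and select the ids in ranked order. This is a sketch on a made-up frame. The names `toy` and `toy_rec_id` are hypothetical.

```python
import pandas as pd

# Made-up data; toy_rec_id plays the role of rec_id, most similar first.
toy = pd.DataFrame({
    "anime_id": [10, 20, 30, 40],
    "name": ["A", "B", "C", "D"],
})
toy_rec_id = [40, 10, 30]

# .loc with a list of labels returns the rows in the order of that list.
ranked = toy.set_index("anime_id").loc[toy_rec_id].reset_index()
print(ranked)
```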
I implemented One-Hot Encoding by hand this time, but it is easier with Category Encoders. This article explains it clearly, so I would like to introduce it.
There are other ways to vectorize items and calculate similarity, and other ways to write the code, so I would like to try them next. This time, I referred to this article. The same site also has many easy-to-understand articles on topics other than recommendation.