Continuing from the previous article, Implementing item-based collaborative filtering in Python (using MovieLens as an example), let's look at content-based filtering.
Content-based filtering is a general term for systems that recommend items by matching a user's tastes against the characteristics of the items.
As long as a system recommends items based on the features of their content, it can be called content-based filtering, so there is no single fixed algorithm (at least as I understand it).
Example 1: For movies, an item's features might be {actors, genre, director, country}, and a user's features could be expressed as the average of the features of their favorite items.
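As a minimal sketch of this idea (the movie names and binary features below are made up for illustration):

import numpy as np

# Hypothetical item features: [action, comedy, sci-fi, director_x]
items = {
    'Movie A': np.array([1, 0, 1, 1]),
    'Movie B': np.array([0, 1, 0, 0]),
    'Movie C': np.array([1, 0, 1, 0]),
}

# The user's features are the average of their favorite items' features
favorites = ['Movie A', 'Movie C']
user_vec = np.mean([items[name] for name in favorites], axis=0)

# Score each item by cosine similarity to the user vector
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {name: cosine(user_vec, vec) for name, vec in items.items()}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))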
Example 2: When recommending articles on a news site, the words that appear in each document can be used. For example, an item's features are expressed as the TF-IDF values of its words, and a user's features as the average of their favorite items (= articles).
This time, assuming Example 2, I would like to implement a method for recommending documents based on their content.
We use scikit-learn's The 20 newsgroups text dataset.
It contains posts from 20 newsgroup categories such as "comp.graphics" and "rec.sport.baseball".
>>> from sklearn.datasets import fetch_20newsgroups
>>> newsgroups_train = fetch_20newsgroups(subset='train')
>>> newsgroups_train.data[0]
"From: [email protected] (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"
We will build a system that takes word-occurrence counts (bag of words) as input and returns documents with high similarity (in this case, newsgroup posts).
from gensim import corpora, models, similarities
from sklearn.datasets import fetch_20newsgroups
from collections import defaultdict
import nltk  # nltk.download('stopwords') is required once beforehand
import re

def create_dictionary_and_corpus(documents):
    texts = [tokens(document) for document in documents]

    # Count how often each token appears across all documents
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    # Discard tokens that occur only once across all documents
    texts = [[token for token in text if frequency[token] > 1] for text in texts]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    return dictionary, corpus

def tokens(document):
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    stopwords = nltk.corpus.stopwords.words('english')
    # Discard stop words and symbols, and strip trailing commas/periods
    tokens = [re.sub(r'[,\.]$', '', word) for word in document.lower().split() if word not in stopwords + symbols]
    return tokens

# Download the 20 newsgroups data
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Create a dictionary and corpus from the first 100 documents
dictionary, corpus = create_dictionary_and_corpus(newsgroups_train.data[0:100])
For details on how the dictionary and corpus are handled, see the gensim tutorial.
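To get a feel for these objects, you can inspect them interactively (the exact contents depend on the data):

print(len(dictionary))    # vocabulary size
print(corpus[0][:5])      # first five (token_id, count) pairs of document 0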
def create_model_and_index(corpus):
    tfidf = models.TfidfModel(corpus)
    index = similarities.MatrixSimilarity(tfidf[corpus])
    return tfidf, index

model, index = create_model_and_index(corpus)
We build a TF-IDF model from the corpus and index the TF-IDF vectors of all documents. For more information, see Topics and Transformations.
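As a quick sanity check, you can transform any bag-of-words vector with the model and look at the resulting (token_id, weight) pairs (the actual values depend on the corpus):

# TF-IDF weights of document 0 (first five entries)
print(model[corpus[0]][:5])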
Let's try using the first training document (index: 0) as input. If the recommender system is built correctly, the document itself should come back as the most similar one.
bow = dictionary.doc2bow(tokens(newsgroups_train.data[0]))
vec_tfidf = model[bow]
sims = index[vec_tfidf]
sims = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)

for i in range(3):
    doc_id = sims[i][0]
    similarity = round(sims[i][1] * 100, 0)
    print(doc_id, similarity)
As expected, doc_id: 0 was recommended with 100% similarity.
0 100.0
17 12.0
84 11.0
Note that the similarity is computed in the line

sims = index[vec_tfidf]

See the Similarity Queries tutorial for details.
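index[vec_tfidf] returns the cosine similarity between the query's TF-IDF vector and every indexed document. MatrixSimilarity keeps the whole index in RAM; for corpora that do not fit in memory, gensim also provides similarities.Similarity, which shards the index to disk. A minimal sketch (the output prefix './sim_index' is an arbitrary choice of mine):

from gensim import similarities

# Disk-backed, sharded similarity index ('./sim_index' is the shard file prefix)
disk_index = similarities.Similarity('./sim_index', model[corpus], num_features=len(dictionary))
print(disk_index[vec_tfidf][:5])    # same kind of similarity scores as above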
This time we used a training document as input, but if you express a user's favorite documents as a bag of words and feed that in, you get a personalized document recommendation system (although it will probably require some ingenuity in practice).
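For example, here is a minimal sketch of that idea, assuming the user's profile is simply the merged bag of words of their favorite documents (the favorite IDs below are made up):

# Hypothetical favorites (document IDs within the 100 indexed documents)
favorite_ids = [17, 84]

# Merge the favorites' tokens into a single user profile
profile_tokens = []
for doc_id in favorite_ids:
    profile_tokens.extend(tokens(newsgroups_train.data[doc_id]))

profile_bow = dictionary.doc2bow(profile_tokens)
sims = index[model[profile_bow]]

# Rank all documents, skipping the ones the user already likes
ranked = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)
recommendations = [(doc_id, sim) for doc_id, sim in ranked if doc_id not in favorite_ids]
print(recommendations[:3])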
Finally, here is the combined code.
from gensim import corpora, models, similarities
from sklearn.datasets import fetch_20newsgroups
from collections import defaultdict
import nltk
import re

def create_dictionary_and_corpus(documents):
    texts = [tokens(document) for document in documents]

    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    texts = [[token for token in text if frequency[token] > 1] for text in texts]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    return dictionary, corpus

def tokens(document):
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    stopwords = nltk.corpus.stopwords.words('english')
    tokens = [re.sub(r'[,\.]$', '', word) for word in document.lower().split() if word not in stopwords + symbols]
    return tokens

def create_model_and_index(corpus):
    tfidf = models.TfidfModel(corpus)
    index = similarities.MatrixSimilarity(tfidf[corpus])
    return tfidf, index

# Use 100 samples to build the dictionary and corpus
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
dictionary, corpus = create_dictionary_and_corpus(newsgroups_train.data[0:100])

# Create the TF-IDF model and its index
model, index = create_model_and_index(corpus)

# System evaluation
bow = dictionary.doc2bow(tokens(newsgroups_train.data[0]))
vec_tfidf = model[bow]
sims = index[vec_tfidf]
sims = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)

for i in range(3):
    doc_id = sims[i][0]
    similarity = round(sims[i][1] * 100, 0)
    print(doc_id, similarity)