Continuing from the previous article, Implementing item-based collaborative filtering in Python (using MovieLens as an example), let's look at content-based filtering.
Content-based filtering is a general term for systems that recommend items by matching a user's tastes against the characteristics of the items.
As long as a system recommends items based on the features of their content, it can be called content-based filtering, so there is no single fixed algorithm (at least as I understand it).
Example 1: For movies, an item's features might be {actors, genre, director, country}, and a user's features could be expressed as the average of the features of their favorite items.
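As a minimal sketch of this idea (the movie names and binary features below are made up for illustration):

import numpy as np

# Hypothetical item features: [action, comedy, sci-fi, director_x]
items = {
    'Movie A': np.array([1, 0, 1, 1]),
    'Movie B': np.array([0, 1, 0, 0]),
    'Movie C': np.array([1, 0, 1, 0]),
}

# The user's features are the average of their favorite items' features
favorites = ['Movie A', 'Movie C']
user_vec = np.mean([items[name] for name in favorites], axis=0)

# Score each item by cosine similarity to the user vector
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {name: cosine(user_vec, vec) for name, vec in items.items()}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))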
Example 2: When recommending articles on a news site, the words that appear in each document can be used. For example, an item's features are expressed as the TF-IDF values of its words, and a user's features as the average of their favorite items (= articles).
This time, assuming Example 2, I would like to implement a method for recommending documents based on their content.
We use scikit-learn's The 20 newsgroups text dataset.
It contains posts from 20 newsgroup categories such as "comp.graphics" and "rec.sport.baseball".
>>> from sklearn.datasets import fetch_20newsgroups
>>> newsgroups_train = fetch_20newsgroups(subset='train')
>>> newsgroups_train.data[0]
"From: [email protected] (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"
We will build a system that takes word-occurrence counts (bag of words) as input and returns documents with high similarity (in this case, newsgroup posts).
from gensim import corpora, models, similarities
from sklearn.datasets import fetch_20newsgroups
from collections import defaultdict
import nltk  # nltk.download('stopwords') is required once beforehand
import re

def create_dictionary_and_corpus(documents):
    texts = [tokens(document) for document in documents]

    # Count how often each token appears across all documents
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    # Discard tokens that occur only once across all documents
    texts = [[token for token in text if frequency[token] > 1] for text in texts]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    return dictionary, corpus

def tokens(document):
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    stopwords = nltk.corpus.stopwords.words('english')
    # Discard stop words and symbols, and strip trailing commas/periods
    tokens = [re.sub(r'[,\.]$', '', word) for word in document.lower().split() if word not in stopwords + symbols]
    return tokens

# Download the 20 newsgroups data
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Create a dictionary and corpus from the first 100 documents
dictionary, corpus = create_dictionary_and_corpus(newsgroups_train.data[0:100])
For details on how the dictionary and corpus are handled, see the gensim tutorial.
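To get a feel for these objects, you can inspect them interactively (the exact contents depend on the data):

print(len(dictionary))    # vocabulary size
print(corpus[0][:5])      # first five (token_id, count) pairs of document 0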
def create_model_and_index(corpus):
    tfidf = models.TfidfModel(corpus)
    index = similarities.MatrixSimilarity(tfidf[corpus])
    return tfidf, index

model, index = create_model_and_index(corpus)
We build a TF-IDF model from the corpus and index the TF-IDF vectors of all documents. For more information, see Topics and Transformations.
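As a quick sanity check, you can transform any bag-of-words vector with the model and look at the resulting (token_id, weight) pairs (the actual values depend on the corpus):

# TF-IDF weights of document 0 (first five entries)
print(model[corpus[0]][:5])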
Let's try using the first training document (index: 0) as input. If the recommender system is built correctly, the document itself should come back as the most similar one.
bow = dictionary.doc2bow(tokens(newsgroups_train.data[0]))
vec_tfidf = model[bow]
sims = index[vec_tfidf]
sims = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)

for i in range(3):
    doc_id = sims[i][0]
    similarity = round(sims[i][1] * 100, 0)
    print(doc_id, similarity)
As expected, doc_id: 0 was recommended with 100% similarity.
0 100.0
17 12.0
84 11.0
Note that the similarity is computed in the line

sims = index[vec_tfidf]

See the Similarity Queries tutorial for details.
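index[vec_tfidf] returns the cosine similarity between the query's TF-IDF vector and every indexed document. MatrixSimilarity keeps the whole index in RAM; for corpora that do not fit in memory, gensim also provides similarities.Similarity, which shards the index to disk. A minimal sketch (the output prefix './sim_index' is an arbitrary choice of mine):

from gensim import similarities

# Disk-backed, sharded similarity index ('./sim_index' is the shard file prefix)
disk_index = similarities.Similarity('./sim_index', model[corpus], num_features=len(dictionary))
print(disk_index[vec_tfidf][:5])    # same kind of similarity scores as above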
This time we used a training document as input, but if you express a user's favorite documents as a bag of words and feed that in, you get a personalized document recommendation system (although it will probably require some ingenuity in practice).
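For example, here is a minimal sketch of that idea, assuming the user's profile is simply the merged bag of words of their favorite documents (the favorite IDs below are made up):

# Hypothetical favorites (document IDs within the 100 indexed documents)
favorite_ids = [17, 84]

# Merge the favorites' tokens into a single user profile
profile_tokens = []
for doc_id in favorite_ids:
    profile_tokens.extend(tokens(newsgroups_train.data[doc_id]))

profile_bow = dictionary.doc2bow(profile_tokens)
sims = index[model[profile_bow]]

# Rank all documents, skipping the ones the user already likes
ranked = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)
recommendations = [(doc_id, sim) for doc_id, sim in ranked if doc_id not in favorite_ids]
print(recommendations[:3])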
Finally, here is the combined code.
from gensim import corpora, models, similarities
from sklearn.datasets import fetch_20newsgroups
from collections import defaultdict
import nltk
import re

def create_dictionary_and_corpus(documents):
    texts = [tokens(document) for document in documents]

    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    texts = [[token for token in text if frequency[token] > 1] for text in texts]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    return dictionary, corpus

def tokens(document):
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    stopwords = nltk.corpus.stopwords.words('english')
    tokens = [re.sub(r'[,\.]$', '', word) for word in document.lower().split() if word not in stopwords + symbols]
    return tokens

def create_model_and_index(corpus):
    tfidf = models.TfidfModel(corpus)
    index = similarities.MatrixSimilarity(tfidf[corpus])
    return tfidf, index

# Use 100 samples to build the dictionary and corpus
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
dictionary, corpus = create_dictionary_and_corpus(newsgroups_train.data[0:100])

# Create the TF-IDF model and its index
model, index = create_model_and_index(corpus)

# System evaluation
bow = dictionary.doc2bow(tokens(newsgroups_train.data[0]))
vec_tfidf = model[bow]
sims = index[vec_tfidf]
sims = sorted(enumerate(sims), key=lambda item: item[1], reverse=True)

for i in range(3):
    doc_id = sims[i][0]
    similarity = round(sims[i][1] * 100, 0)
    print(doc_id, similarity)