I'm Kubota from NTT DoCoMo, and this is my second post.
Do you know the technique called Item2Vec? Item2Vec applies Word2Vec, which learns distributed representations of words from sentences, to recommender systems. Concretely, for recommendation on an e-commerce site, each item plays the role of a Word2Vec "word" and the set of items a user has evaluated plays the role of a "sentence"; distributed representations of the items are learned, and recommendations are made based on the similarity between items.
Because it is easy to implement, there are plenty of "I tried it" articles, but there are some points to watch out for when you actually apply it to a recommender system.
Item2Vec is easy to implement with gensim, a topic modelling library. You can train a model simply by feeding it a text file (here, item_buskets.txt) in which each line is the set of items one user has evaluated, with the items separated by spaces, as in the following example. The parameters will be explained later. It's really easy!
```python
from gensim.models import word2vec

# Each line of item_buskets.txt is one user's item set, space-separated
sentences = word2vec.LineSentence('item_buskets.txt')
model = word2vec.Word2Vec(sentences)  # trains with gensim's default parameters
```
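Once the model is trained, similar items can be looked up directly from the learned embeddings. A minimal sketch, where 'item_42' is a hypothetical ID that would have to appear in item_buskets.txt:

```python
# Items closest to 'item_42' in the learned embedding space
for item_id, similarity in model.wv.most_similar('item_42', topn=5):
    print(item_id, similarity)
```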
As we have seen, Item2Vec is easy to implement with gensim. However, gensim was originally built with natural language processing in mind, so when you apply it to recommender systems, whose problem setting is different, you need to adjust it to match that setting.
**Dataset differences**
The corpus handled by Word2Vec consists of sentences, whose structure is constrained by grammar, whereas the item sets handled by Item2Vec, such as e-commerce purchase histories, are determined by user behavior and the nature of the items. The two kinds of datasets can therefore have quite different properties.
**Differences in application fields**
Natural language processing, where Word2Vec is applied, and recommender systems, where Item2Vec is applied, are different fields to begin with. In natural language processing, obtaining accurate distributed representations of frequent words that appear in many sentences may indirectly affect accuracy, but it is not decisive. In recommender systems, by contrast, obtaining accurate distributed representations of items purchased by many users has a decisive effect on the conversion rate that the system is trying to improve.
Given these differences, we can hypothesize that the optimal hyperparameters for Word2Vec and Item2Vec will differ.
A paper that tested this hypothesis is *Word2vec applied to Recommendation: Hyperparameters Matter*, presented at RecSys 2018. The experimental settings and evaluation results below are quoted from this paper.
**Datasets**
The popularity distribution varies considerably across datasets. The 30Music dataset (from last.fm) and the Deezer dataset (from Deezer) both come from music streaming services, and both show a large gap between popular and unpopular songs. The Click-Stream dataset shows a similarly skewed popularity distribution. The E-commerce dataset, on the other hand, has a gentler curve than the other two.
**Architecture**
The paper uses skip-gram with negative sampling (SGNS).
**Problem setting**
The task is to predict a user's next action from their past action history. Concretely, the SGNS model is trained on the first $(n-1)$ items of each user's history, and the $n$-th item is predicted and evaluated.
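As a sketch of this split (with an illustrative helper and item IDs, not code from the paper):

```python
def leave_last_out(user_items):
    """Split one user's item sequence: the first n-1 items are used for
    training, and the n-th (last) item is the prediction target."""
    return user_items[:-1], user_items[-1]

train_items, target_item = leave_last_out(['i1', 'i3', 'i7', 'i2'])
# train_items == ['i1', 'i3', 'i7'], target_item == 'i2'
```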
**Evaluation metric: Hit Ratio @ K (HR@K)**
A list of K items is generated per user; a user scores 1 if their $n$-th item is included in the list and 0 otherwise, and the sum over users is divided by the number of users. The larger K is, the larger HR@K becomes.
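For concreteness, HR@K can be computed with a small helper like the following (an illustrative sketch, not code from the paper; `recommended_lists` and `target_items` are hypothetical names):

```python
def hit_ratio_at_k(recommended_lists, target_items):
    """HR@K: recommended_lists[u] is the top-K list for user u, and
    target_items[u] is that user's held-out n-th item."""
    hits = sum(1 for recs, target in zip(recommended_lists, target_items)
               if target in recs)
    return hits / len(target_items)
```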
**Evaluation metric: Normalized Discounted Cumulative Gain @ K (NDCG@K)**
NDCG is a ranking metric: it evaluates at which position in the presented list of K items the predicted $n$-th item was actually hit. The higher the value, the better the ranking.
$$
NDCG@K = \left\{
\begin{array}{ll}
\frac{1}{\log_{2} (j+1)} & (\text{if the } j\text{-th predicted item is correct}) \\
0 & (\text{otherwise})
\end{array}
\right.
$$
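Following this formula, a per-user NDCG@K can be sketched as below (illustrative code, assuming a single relevant item per user as in this problem setting); averaging over users gives the reported score:

```python
import math

def ndcg_at_k(recommended_list, target_item):
    """NDCG@K for one user with a single relevant item: 1 / log2(j + 1)
    if the target appears at rank j (1-indexed) in the top-K list, else 0."""
    if target_item in recommended_list:
        j = recommended_list.index(target_item) + 1
        return 1.0 / math.log2(j + 1)
    return 0.0
```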
The paper searches over the following parameters and evaluates their effect.
| Parameter | Corresponding option in gensim's Word2Vec |
|---|---|
| **window size $L$** | window |
| **epochs $n$** | iter |
| **sub-sampling parameter $t$** | sample |
| **negative sampling distribution parameter $\alpha$** | ns_exponent |
| embedding size | size |
| number of negative samples | negative |
| learning rate | alpha, min_alpha |
The ones you are probably least familiar with are $t$ and $\alpha$. The sub-sampling parameter $t$ controls the downsampling of high-frequency words. In natural language processing, high-frequency words such as "a" and "the" are downsampled because they carry little information compared with low-frequency words. In the recommender setting, however, popular items (the counterpart of high-frequency words) can strongly affect recommendation accuracy, so it is easy to see why this parameter is likely to be influential.
Next, the negative sampling distribution parameter $\alpha$ changes the shape of the distribution that negative samples are drawn from. The gensim default is 0.75. With $\alpha = 1$, sampling is proportional to word frequency; with $\alpha = 0$, sampling is uniform; and with negative values, infrequent words become easier to sample.
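Putting the table together with these two parameters, a training call exposing all four influential parameters might look like the sketch below. The values are placeholders, not the paper's optima (which differ per dataset), and the keyword names follow the pre-4.0 gensim API used in the table (in gensim 4+, size and iter were renamed to vector_size and epochs):

```python
from gensim.models import word2vec

sentences = word2vec.LineSentence('item_buskets.txt')
model = word2vec.Word2Vec(
    sentences,
    sg=1,             # skip-gram (the paper's SGNS architecture)
    negative=5,       # number of negative samples
    window=3,         # window size L (placeholder value)
    iter=50,          # epochs n (placeholder value)
    sample=1e-4,      # sub-sampling parameter t (placeholder value)
    ns_exponent=0.5,  # negative sampling distribution parameter alpha
)                     # (placeholder; gensim's default is 0.75)
```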
The paper investigates all the parameters in the table above, but the parameters other than the four shown in bold turned out to have little effect on performance, so those four are evaluated in detail.
The figure below shows the paper's evaluation results. For now, it is enough to look at just two rows: Item2Vec trained with gensim's default parameters ("Out-of-the-box SGNS" in the table) and Item2Vec trained with the four parameters set to their optimal values ("Fully optimised SGNS" in the table).
On the music datasets (the 30Music and Deezer datasets), where the gap between popular and unpopular items was large, the optimized model performs roughly twice as well as the default! On the Click-Stream dataset, accuracy improves by roughly a factor of 10, which is remarkable.
The paper also shows, on the 30Music dataset, the relationship between the negative sampling distribution parameter $\alpha$ (ns_exponent in gensim) and accuracy. You can see that gensim's default of 0.75 is not the optimal value. Incidentally, it was on the basis of this paper's results that ns_exponent, corresponding to $\alpha$, was added as an option to gensim.
This has been an introduction to a paper on setting hyperparameters to match the problem setting. Since ○○2Vec methods are quite popular, it may be interesting to explore which parameters are worth optimizing in your own setting.