Implementing a latent feature model with the IBP (Indian buffet process)
The goal of this post is to implement a latent feature model using the IBP. First, let me explain the latent feature model (a linear factor model) with a familiar recommendation example. Suppose you have a matrix X with people along the rows and movies along the columns, where entry (i, j) is person i's rating of movie j. By decomposing this well as X = UV, as in the figure below, you can recover latent features, for example adventure, mystery, and romance (this is just an image). Here U expresses how much each person likes each of those components, and V expresses which components each movie has. (In an actual recommender system this is used to predict missing values.) The same idea appears elsewhere: in natural language processing the matrix consists of documents and words, and in speech processing the same kind of model is used in the context of sound source separation.
But how many latent features should there be? With a deterministic model you can pick the number of latent features by hand, but with a Bayesian model it becomes difficult to decide how many there are. The motivation, then, is to use an infinite-dimensional stochastic process (the IBP) to determine the number of latent features automatically. This may be hard to grasp at first, so the figure below shows the result of actually predicting Z. The left is the true Z and the right is the Z predicted by Gibbs sampling. At first glance they may look different, but since the model is exchangeable in the positions (columns) of the features, you can see that Z is recovered after 33 steps.
In this article, I will give a slightly more formal explanation and then move on to the implementation. After that, as theory, I will touch on the beta process, which is closely related to the IBP. Finally, I will write about application examples.
You can implement this just by following Dr. Sato's nonparametric Bayes book [1] as it is. It is easy to understand, so I recommend reading it. The code is here: https://github.com/Ma-sa-ue/practice/tree/master/machine%20learning(python)/nonparabayes
Indian buffet process
As the name suggests, it models how customers serve themselves at an Indian buffet restaurant. A sample from it looks like the figure below.
You can also get the above by generating a matrix consisting of 0s and 1s as follows.
Intuitively, it describes how people take their favorite dishes in turn: people are the rows and dishes are the columns. Customers come in one after another; each customer takes an existing dish with probability proportional to its popularity, and then asks for a few new dishes of their own. $\alpha$ is a hyperparameter controlling the degree of dispersion: the larger it is, the more different dishes appear. Stated this way it looks like exchangeability might be a problem, but the important point is that the IBP is in fact the distribution over (equivalence classes of) binary matrices obtained from a beta-Bernoulli generative model.
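A minimal sketch of this generative process (one-parameter IBP; the function name and defaults are mine, not from the linked repo):

```python
import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    """Draw a binary matrix Z from the one-parameter IBP(alpha).

    Rows are customers (people), columns are dishes (features).
    """
    rng = np.random.default_rng() if rng is None else rng
    dish_counts = []   # dish_counts[k] = how many customers took dish k so far
    rows = []
    for i in range(1, n_customers + 1):
        # existing dish k is taken with probability m_k / i
        row = [1 if rng.random() < m / i else 0 for m in dish_counts]
        # each customer also asks for Poisson(alpha / i) brand-new dishes
        n_new = rng.poisson(alpha / i)
        row += [1] * n_new
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * n_new
        rows.append(row)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(20, alpha=3.0, rng=np.random.default_rng(0))
print(Z.shape)  # a larger alpha gives more columns (dishes) on average
```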
When $x_{k}$ is a D-dimensional vector
First, consider the following situation: in the figure above, the left is Y, the middle is Z, and the right is X, and note that only Y is observed. The policy is to Gibbs-sample Z and X in turn from their conditional distributions. Repeating the Gibbs sweeps gives the result below; since the features are exchangeable, you can see that they are predicted well. (With real data it is rarely this clean.)
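As a toy sketch of this linear-Gaussian observation model (variable names and sizes are mine; the actual sampler from the book also resamples Z entry by entry and the number of columns, which is omitted here), we can generate synthetic Y and do the conjugate update for X given Z:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, D = 50, 4, 36          # objects, latent features, observed dimensions
sigma_x, sigma_y = 1.0, 0.3  # prior std of feature values, observation noise std

Z_true = (rng.random((N, K)) < 0.5).astype(float)   # binary ownership matrix (in the full model Z ~ IBP)
X_true = sigma_x * rng.standard_normal((K, D))      # latent feature values
Y = Z_true @ X_true + sigma_y * rng.standard_normal((N, D))  # observations

# One Gibbs-style step for X given Z and Y (conjugate Gaussian update):
# posterior mean M = (Z'Z + (sigma_y/sigma_x)^2 I)^{-1} Z'Y
precision = Z_true.T @ Z_true + (sigma_y / sigma_x) ** 2 * np.eye(K)
M = np.linalg.solve(precision, Z_true.T @ Y)
print(np.abs(M - X_true).max())  # posterior mean should be close to the truth
```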
I will write about sampling from the beta process and the relationship between the IBP and the beta process. For details, see [4] and [6].
Lévy process
The Lévy process is a stochastic process with stationary, independent increments. It is a fairly general class and includes various stochastic processes. The Lévy–Itô decomposition splits a Lévy process into a Brownian motion with drift and a jump process; the jump process is the main subject here. Jump processes, i.e. processes that move by jumps, include the Poisson process, the beta process, and the gamma process. As a major premise, a Lévy process that only jumps upward (increases monotonically) corresponds to a random measure, and a Lévy process is characterized by its Lévy measure. This correspondence between the Lévy measure and the random measure (Lévy process) is important when you want to sample: you can construct the random measure (Lévy process) from a Poisson process whose mean measure is the Lévy measure.
Poisson process
Consider simulating from a Poisson process. There are basically two ways. The first is to sample arrivals one after another, as if people come in order. The second is to first decide how many people come in total and then place them according to the intensity function (mean measure). The latter is the one used when sampling from the beta process. See [7] and [8] for details.
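A small sketch of both methods for a homogeneous Poisson process on [0, T] (names and constants are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
rate, T = 2.0, 10.0   # constant intensity on [0, T]

# Method 1: arrivals come one after another (exponential waiting times)
times, t = [], 0.0
while True:
    t += rng.exponential(1.0 / rate)
    if t > T:
        break
    times.append(t)

# Method 2: decide the total number first, then scatter the points.
# N ~ Poisson(mean measure of [0, T]); locations are i.i.d. from the normalized intensity.
n = rng.poisson(rate * T)
points = np.sort(rng.uniform(0.0, T, size=n))

print(len(times), n)   # both counts fluctuate around rate * T = 20
```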
Beta process
Consider the beta process on $\Omega$.
In the beta process, it holds (infinitesimally) that $B(d\omega) \sim \mathrm{Beta}(c\,\mu(d\omega),\, c(1-\mu(d\omega)))$.
At this time, with $B_{0}$ the base measure, the Lévy measure corresponding to the beta process is $\nu(d\omega, dp) = c\, p^{-1}(1-p)^{c-1}\, dp\, B_{0}(d\omega)$.
Following the above, the algorithm looks like this:
For example, if $\mu_{k}$ is uniform, $\Omega = [0,1]$, and $\int_{\Omega} \mu_{k}(d\omega) = \gamma$, you can draw the following graph.
Also, if you actually sample several times with $\gamma = 10.0$, the average of $B(\Omega)$ comes out to about 10.
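A minimal sketch of such a sampler, assuming the round-based construction obtained from the IBP connection (round i contributes Poisson($\gamma c/(c+i-1)$) atoms with Beta($1, c+i-1$) weights) and a uniform base measure on [0, 1]; the truncation level and names are mine:

```python
import numpy as np

def sample_beta_process(gamma, c=1.0, n_rounds=1000, rng=None):
    """Truncated draw from a beta process on Omega = [0, 1] with uniform base measure."""
    rng = np.random.default_rng() if rng is None else rng
    locations, weights = [np.empty(0)], [np.empty(0)]
    for i in range(1, n_rounds + 1):
        k = rng.poisson(gamma * c / (c + i - 1))       # number of new atoms this round
        if k > 0:
            locations.append(rng.uniform(0.0, 1.0, size=k))  # omega ~ B_0 / gamma (uniform here)
            weights.append(rng.beta(1.0, c + i - 1, size=k)) # atom weights
    return np.concatenate(locations), np.concatenate(weights)

gamma = 10.0
totals = []
for seed in range(20):
    _, w = sample_beta_process(gamma, rng=np.random.default_rng(seed))
    totals.append(w.sum())   # B(Omega) for one draw
print(np.mean(totals))       # fluctuates around gamma = 10
```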
Details are described in [4]. In that paper, the sampling method for the beta process is also derived from its relation with the IBP.
According to de Finetti's theorem, the exchangeable observation sequence $Z_{1}, \ldots, Z_{n}$ can be written as conditionally i.i.d. draws given a latent random measure $B$, i.e. $p(Z_{1}, \ldots, Z_{n}) = \int \prod_{i=1}^{n} p(Z_{i} \mid B)\, dP(B)$; for the IBP, this de Finetti mixing measure is the beta process (each $Z_{i}$ given $B$ is a Bernoulli process).
The IBP, which is a one-parameter model, can be further generalized by giving the beta-Bernoulli distribution two parameters. This corresponds to varying both $c$ and $\gamma$ in the beta process. See the sketch below.
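As a sketch, assuming the usual two-parameter version (customer i takes existing dish k with probability $m_k/(\beta+i-1)$ and orders Poisson($\alpha\beta/(\beta+i-1)$) new dishes; $\beta = 1$ recovers the one-parameter IBP), the sampler above changes only slightly:

```python
import numpy as np

def sample_ibp_two_param(n_customers, alpha, beta, rng=None):
    """Two-parameter IBP sketch: beta controls how strongly customers share dishes."""
    rng = np.random.default_rng() if rng is None else rng
    dish_counts, rows = [], []
    for i in range(1, n_customers + 1):
        row = [1 if rng.random() < m / (beta + i - 1) else 0 for m in dish_counts]
        n_new = rng.poisson(alpha * beta / (beta + i - 1))
        row += [1] * n_new
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * n_new
        rows.append(row)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z
```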
These are application examples of latent feature models via matrix factorization rather than of the IBP specifically, but I list a variety of them. This is a light summary of the review [3] by Griffiths.
Linear factor models such as probabilistic PCA, factor analysis and sparse coding are summarized below. http://qiita.com/masasora/items/431afb71eae60a1c826b
There are already many pages ([10], [11], [12]) that summarize the literature well, but I will explain briefly. There are roughly three approaches to recommender systems. The first is to use basic statistics as they are, for example simply recommending stores or products that already sell well. The second is collaborative filtering, which comes in user-based and item-based variants. In the user-based version, to predict a person A's rating of a product B, you take the ratings of B by people similar to A and use them to predict A's rating of B. The third is matrix factorization. The most basic version decomposes X by SVD as $X = U \Sigma V^{T}$; in reality, however, there are missing values and negative ratings would be strange, so this is extended in many ways (at which point it is no longer plain SVD). Although it depends on the application, the third approach is generally the strongest in terms of raw prediction performance, and you can also use the decomposed matrices to find similar users and similar movies. Returning to the main subject, the IBP can be used in this third approach, as in the sketch below.
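A minimal numpy sketch of the plain SVD variant described above, on made-up ratings (here the zeros are treated as observed values; handling genuinely missing entries needs the extensions just mentioned):

```python
import numpy as np

# toy ratings matrix: rows = users, columns = movies
X = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# rank-2 truncated SVD: X ~= U_k diag(s_k) Vt_k
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(X_hat, 1))     # low-rank reconstruction of the ratings

# user-user similarity from the low-rank factors
user_factors = U[:, :k] * s[:k]
print(np.round(user_factors @ user_factors.T, 1))  # users with similar taste get large inner products
```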
A typical application in audio processing is the sound source separation problem. Note that in this case the prior on X is not Gaussian; for example, a Laplace or t distribution is used.
The first application is the decomposition of a matrix (Z) representing which genes are expressed in which cells: $z_{ik}$ indicates whether gene i is expressed in cell k. Another application is the study of protein interactions: $z_{ik}$ indicates whether protein $i$ belongs to complex $k$. You can find similar proteins using the decomposed matrix (that is, the larger $\sum_{k} z_{ik} z_{jk}$ is, the more similar proteins i and j are), as in the toy sketch below.
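Since that similarity is just an inner product of rows of Z, a toy sketch (with made-up data) is:

```python
import numpy as np

rng = np.random.default_rng(3)
Z = (rng.random((6, 4)) < 0.4).astype(int)   # z[i, k] = 1 if protein i belongs to complex k (toy data)

# similarity(i, j) = sum_k z_ik * z_jk = number of complexes shared by proteins i and j
similarity = Z @ Z.T
print(similarity)
```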