I recently studied nonparametric Bayes, so I am organizing my thoughts here. I am not a specialist, so I would appreciate it if you could point out any mistakes.
Nonparametric Bayes simply means infinite-dimensional Bayesian models. (It does not mean there are no parameters.) The theory itself has been around for quite some time, and it seems to have been used in the machine learning context since around the 2000s.
For example, when clustering you usually have to specify the number of mixture components. In nonparametric Bayes, however, once you allow an infinite number of clusters, the number of components is determined automatically. Concretely, it looks like the gif below.
Points with the same color belong to the same cluster. The number of clusters was never specified, yet the data are clustered nicely. This time the goal is to implement this infinite mixture model. The code is available at https://github.com/Ma-sa-ue/practice/tree/master/machine%20learning(python)/nonparabayes. Besides nonparametric clustering, there are other applications in machine learning as well. The advantages (which overlap somewhat) are as follows.
As I said, the goal is to implement an infinite mixture model. To that end, we first explain the Dirichlet process, the stick-breaking process, and the Chinese restaurant process, which are the building blocks of nonparametric Bayes. It is important to see that these three are not separate things but different perspectives on the same object. We then rearrange them in the context of the mixture model and implement the infinite mixture model. After that, I summarize the theoretical background. I am not in a position to say anything grand, but when I studied this I could not understand it from any single document, so I recommend consulting several references and implementing things yourself.
This is a figure taken from Professor Ueda's tutorial (Reference 1). Understanding this figure is one of the goals.
First, let me describe the finite mixture model before taking the limit to infinity. It can be represented by a graphical model like (a). Concretely, the generative process is as follows.
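As a sketch of the standard finite-mixture generative process (using the symbols described below; the symmetric Dirichlet prior on $\pi$ is the one that appears later in the text):

$$
\pi \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K), \qquad
\theta_{k} \sim G_{0} \ (k = 1, \dots, K), \qquad
z_{i} \sim \mathrm{Categorical}(\pi), \qquad
x_{i} \sim f(\cdot \mid \theta_{z_{i}}).
$$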
Here $x = (x_{1}, \dots, x_{n})$ are the observed values and $z = (z_{1}, \dots, z_{n})$ are the labels. $\pi$ is the "die" used to draw the labels (the mixing proportions). For a Gaussian mixture, $f$ is the Gaussian density and $\theta = (\theta_{1}, \dots, \theta_{K})$ are the means of the $K$ classes. $G_{0}$ is the prior on the parameters; for a Gaussian mixture with a conjugate prior, it is also Gaussian.
We only observe $x$, but what we really want is the true $z$ (the labels). In the finite mixture model this can basically be obtained by Gibbs sampling of $z$ and $\theta$. By taking $K$ to infinity one could also build a method for an infinite mixture model; however, if you try to implement this "as is", you find that there are infinitely many classes and it is impossible.
The Dirichlet distribution is the multivariate generalization of the Beta distribution, and it often appears in Bayesian statistics as the prior of the multinomial distribution. The important point here is that a sample from a $K$-dimensional Dirichlet distribution sums to 1. Consider $G = \sum_{k=1}^{K} \pi_{k} \delta_{\theta_{k}}$ (with $\pi \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K)$ and $\theta_{k} \sim H$, where $H$ plays the role of $G_{0}$ above). Then in the mixture model $p(x) = \int p(x \mid \theta)\, dG(\theta)$ (the label $z$ has been marginalized out). With this, the graphical model of the finite mixture takes the form (b). The Dirichlet process arises from the desire to sample such a $G$ when $K$ goes to infinity.
Dirichlet Process
The original definition of the Dirichlet process is given in the theory section. A Dirichlet process is written $DP(\alpha, H)$. As mentioned above, a draw from a Dirichlet process is almost surely discrete and has the form $G = \sum_{k=1}^{\infty} \pi_{k} \delta_{\theta_{k}}$ (an infinite sum). To sample from the Dirichlet process we use the stick-breaking process.
Stick-Breaking Process
To sample from the Dirichlet process, draw $v_{k} \sim \mathrm{Beta}(1, \alpha)$ and $\theta_{k} \sim H$, and set $\pi_{k} = v_{k} \prod_{j=1}^{k-1} (1 - v_{j})$, so that $G = \sum_{k=1}^{\infty} \pi_{k} \delta_{\theta_{k}}$. The construction is consistent in that the $\pi_{k}$ sum to 1. It is called the stick-breaking process because it is like repeatedly breaking pieces off a stick. A sampling example is shown below.
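As a rough illustration, here is a minimal truncated stick-breaking sampler (my own sketch, not the repository's code; the truncation level `K_trunc` and the standard-normal base measure are illustrative choices):

```python
import numpy as np

def stick_breaking(alpha, K_trunc=100, rng=None):
    """Truncated stick-breaking sample: returns weights pi_k and atoms theta_k."""
    rng = np.random.default_rng() if rng is None else rng
    # v_k ~ Beta(1, alpha): the fraction broken off the remaining stick
    v = rng.beta(1.0, alpha, size=K_trunc)
    # pi_k = v_k * prod_{j<k} (1 - v_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = v * remaining
    # atoms drawn from the base measure H (here: standard normal, just as an example)
    theta = rng.normal(0.0, 1.0, size=K_trunc)
    return pi, theta

pi, theta = stick_breaking(alpha=2.0)
print(pi[:5], pi.sum())  # weights decay quickly; the sum approaches 1 as K_trunc grows
```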
Of course, $G$ is an infinite sum and cannot be constructed exactly, so this is a (truncated) approximation. You can see that the larger $\alpha$ is, the more pieces there are. Another important property used in the actual implementation is the following.
Chinese Restaurant Process (CRP)
The CRP is a distribution over partitions of the cluster indices. I will simply introduce it top-down. Write the whole set as $[n] = \{1, 2, \dots, n\}$ and write a partition of it as $\rho$. For example, when the whole set is $[4]$, possible partitions $\rho$ include $[[1], [2,3], [4]]$ and $[[1,2], [3,4]]$. The following way of building a partition is the CRP. Imagine people entering a Chinese restaurant one after another and sitting down at tables. When $n$ people are already seated, the $(n+1)$-th person sits as follows:
P(sit at an existing table $c$) $= \frac{n_{c}}{\alpha + n}$, P(sit at a new table) $= \frac{\alpha}{\alpha + n}$
Continuing this process, the resulting distribution over partitions is the CRP. (Here $\alpha$ is a constant and $n_{c}$ is the number of people already sitting at table $c$.)
For example, suppose 4 people are already seated, one alone and three together, so the partition is $[[1], [2,3,4]]$. Consider how the fifth person enters. With $\alpha = 1$, the probability of sitting at the first table is $\frac{1}{1+4}$, the probability of sitting at the second table is $\frac{3}{1+4}$, and the probability of sitting at a third (new) table is $\frac{1}{1+4}$.
Keep in mind the rich-get-richer property: a table where many people are already sitting is more likely to attract the next customer.
Also note that building a partition is exactly the same as clustering (labeling).
For example, label each person in the example above by their table number. The partition $[[1], [2,3,4]]$ corresponds to the labels $[1, 2, 2, 2]$, and if the fifth person sits at a new table the labels become $[1, 2, 2, 2, 3]$.
Also note that as $\alpha$ grows, customers spread more thinly over more tables.
Now let's actually implement it. Basically you can implement the above as is, but you may not be sure whether your implementation is correct. In that case you can do a rough sanity check by looking at the relationship between the number of customers and the number of tables.
The relationship is, approximately, $E[\text{number of tables}] \approx \alpha \log(1 + n/\alpha)$ for $n$ customers. In the plot, the x-axis is the number of customers and the y-axis is the number of tables; the red points are sampled values and the blue line is this approximate expected value. You can see that the samples behave as expected.
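A minimal sketch of this check (my own code, not the repository's; the $\alpha \log(1 + n/\alpha)$ curve is the standard approximation for the expected number of tables):

```python
import numpy as np

def crp_tables(n_customers, alpha, rng=None):
    """Simulate a CRP and return the number of occupied tables after each customer."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []              # counts[c] = number of customers at table c
    history = []
    for n in range(n_customers):
        # probability of each existing table ~ n_c, of a new table ~ alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)     # open a new table
        else:
            counts[table] += 1
        history.append(len(counts))
    return np.array(history)

alpha, n = 2.0, 1000
tables = crp_tables(n, alpha)
expected = alpha * np.log(1.0 + np.arange(1, n + 1) / alpha)
print(tables[-1], expected[-1])  # sampled vs. approximate expected number of tables
```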
Now let's reorganize the figure from earlier in light of what we have covered so far.
(a) is as described earlier: the learning strategy is Gibbs sampling of $z$ and $\theta$.
(b) is also basically as described earlier. Since the Dirichlet process is hard to handle directly, in practice one works with (c). In this formulation the labels $z$ are not computed explicitly; instead $\theta$ is computed, by Gibbs sampling over $\theta$. Note that this requires sampling from $G$, which cannot be done in the strict sense, and that in principle you would need to prepare infinitely many $\theta$ and $\pi$.
If you think about it, though, you only need to keep track of the clusters that actually contain data points. This is (d). To see this, first consider the posterior distribution of $z$ in the finite mixture model (a). A simple calculation (marginalizing out $\pi$) gives
$P(z_{i} = c \mid z_{-i}, \alpha, K) = \frac{n_{c} + \alpha/K}{n - 1 + \alpha}$ (where $n_{c}$ is the number of $z_{j}$ assigned to cluster $c$, $K$ is the number of clusters, and $z_{-i}$ denotes all labels except $z_{i}$).
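For reference, here is a sketch of that "simple calculation", assuming the symmetric Dirichlet prior $\pi \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K)$ introduced earlier:

$$
P(z_{i} = c \mid z_{-i}, \alpha, K)
= \int P(z_{i} = c \mid \pi)\, P(\pi \mid z_{-i})\, d\pi
= \mathbb{E}[\pi_{c} \mid z_{-i}]
= \frac{n_{c} + \alpha/K}{(n - 1) + \alpha},
$$

since $\pi \mid z_{-i} \sim \mathrm{Dirichlet}(\alpha/K + n_{1}, \dots, \alpha/K + n_{K})$ and the mean of a Dirichlet component is its parameter divided by the sum of the parameters.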
Taking $K \to \infty$, the numerator becomes $n_{c}$, and this is exactly the CRP with $z_{i}$ as the customer and the cluster as the table. Written out in formulas, this is model (d).
(The CRP was introduced as a distribution over partitions, but note that making a partition of $[n]$ is the same as labeling the $z_{i}$: interpret it as drawing a partition $\rho$ from the CRP and labeling $z$ accordingly.)
Finally, let's move on to the implementation. Basically we use MCMC, and the policy is to apply MCMC faithfully to the graphical models above. The implementation follows Neal's paper (Reference 5), which gives a clear account of both the background and how to implement it, so I recommend reading it.
First, we implement based on (d) in the figure, updating $z$ and $\theta$ one coordinate at a time. The first idea that comes to mind is Gibbs sampling of each $z_{i}$ from its full conditional.
Gibbs sampling $z$ directly from this full conditional is not very good: computing the transition probability requires an integral (over $\theta$ for a new cluster), and since the conditional is unnormalized we would also have to compute the normalizing constant. So recall that Gibbs sampling is just a Metropolis-Hastings algorithm whose acceptance rate is 1, and note that the prior conditional $P(z_{i} \mid z_{-i})$ is easy to compute. If we use $P(z_{i} \mid z_{-i})$ as the proposal distribution instead, the transition becomes cheap to compute. One might worry that the integral still appears in the acceptance rate, but this is avoided by drawing a single sample (of $\theta$ for a proposed new cluster). In other words, the acceptance rate $\alpha(z_{i}^{*}, z_{i})$ is
given by the ratio of the likelihood terms, $\min\!\left(1, \frac{f(x_{i} \mid \theta_{z_{i}^{*}})}{f(x_{i} \mid \theta_{z_{i}})}\right)$. The algorithm can be organized as follows.
You can see that it looks like K-means. The update of $\theta$ is a draw from its posterior distribution; since we use a Gaussian prior here, the posterior is also Gaussian. It is also recommended to set $R$ (the number of repetitions of the $z_{i}$ update) to 3 or more, because with $R = 1$ new clusters rarely gather members. The actual implementation looks like the following.
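A rough sketch of this algorithm in Python (my own code, not the repository's implementation; I assume a 1-D Gaussian likelihood with known variance `sigma2` and a Gaussian prior `N(prior_mu, prior_var)` on the cluster means, in the spirit of Neal's MH-within-Gibbs schemes):

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def dp_mixture_sampler(x, alpha=1.0, sigma2=1.0, prior_mu=0.0, prior_var=10.0,
                       n_iter=100, R=3, seed=0):
    """MH-within-Gibbs sampler for a 1-D DP Gaussian mixture with known variance.

    x        : 1-D numpy array of observations
    z[i]     : cluster label of data point i
    theta[c] : mean of cluster c (prior N(prior_mu, prior_var))
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    z = np.zeros(n, dtype=int)                  # start with everyone in one cluster
    theta = {0: rng.normal(prior_mu, np.sqrt(prior_var))}

    for _ in range(n_iter):
        # --- update each label z_i by Metropolis-Hastings, proposal = CRP conditional ---
        for i in range(n):
            for _ in range(R):                  # repeat R times so new clusters can grow
                counts = {c: 0 for c in theta}
                for j in range(n):
                    if j != i:
                        counts[z[j]] += 1
                clusters = [c for c in counts if counts[c] > 0]
                # existing cluster with prob ~ n_c, new cluster with prob ~ alpha
                weights = np.array([counts[c] for c in clusters] + [alpha], dtype=float)
                weights /= weights.sum()
                k = rng.choice(len(weights), p=weights)
                if k == len(clusters):          # proposed a new cluster: draw theta* from prior
                    cand, theta_star = max(theta) + 1, rng.normal(prior_mu, np.sqrt(prior_var))
                else:
                    cand, theta_star = clusters[k], theta[clusters[k]]
                # acceptance rate = ratio of the likelihood terms
                log_a = log_normal_pdf(x[i], theta_star, sigma2) - \
                        log_normal_pdf(x[i], theta[z[i]], sigma2)
                if np.log(rng.uniform()) < log_a:
                    theta[cand] = theta_star
                    z[i] = cand
        theta = {c: theta[c] for c in set(z.tolist())}   # drop empty clusters
        # --- update theta_c from its Gaussian posterior (conjugate, known variance) ---
        for c in theta:
            xc = x[z == c]
            post_var = 1.0 / (1.0 / prior_var + len(xc) / sigma2)
            post_mu = post_var * (prior_mu / prior_var + xc.sum() / sigma2)
            theta[c] = rng.normal(post_mu, np.sqrt(post_var))
    return z, theta

# toy usage: three well-separated Gaussian clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 50), rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
z, theta = dp_mixture_sampler(x, alpha=1.0, n_iter=50)
print(len(theta), "clusters found")
```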
It is better not to make $\alpha$ too large, so that too many clusters are not created (around 1?). In the first gif the data separated cleanly into four clusters, but this run came out more complicated.
Next, we implement based on (b) and (c) in the figure. The idea is to sample from the posterior distribution of $\theta$ using the following formula.
Of course we cannot truly sample from $G$, so this is a pseudo-sampling. Note that the subscript of $\theta_{i}$ now refers to samples (data points), not clusters. As in the CRP implementation, instead of plain Gibbs sampling we use $P(\theta_{i} \mid \theta_{-i})$ as the proposal distribution. The acceptance rate $\alpha(\theta_{i}^{*}, \theta_{i})$ is
again the ratio of the likelihood terms. Organizing this, we simply sample $\theta_{i}$ ($i = 1, \dots, n$) in turn with the above acceptance rate. The result is shown below.
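A minimal sketch of one sweep of this $\theta$-based update (again my own illustration, assuming a 1-D Gaussian likelihood with known variance `sigma2`; `prior_mu` and `prior_var` parameterize the base measure, and `theta` is an array with one entry per data point):

```python
import numpy as np

def theta_sweep(x, theta, alpha, sigma2, prior_mu, prior_var, rng):
    """One MH sweep over theta_i (one parameter per data point).

    Proposal: theta_i* ~ P(theta_i | theta_{-i})  (copy another theta_j or draw fresh),
    acceptance rate: likelihood ratio f(x_i | theta_i*) / f(x_i | theta_i).
    """
    n = len(x)
    for i in range(n):
        if rng.uniform() < alpha / (n - 1 + alpha):
            # fresh draw from the base measure
            theta_star = rng.normal(prior_mu, np.sqrt(prior_var))
        else:
            # copy one of the other theta_j uniformly at random
            j = rng.choice([k for k in range(n) if k != i])
            theta_star = theta[j]
        log_a = (-(x[i] - theta_star) ** 2 + (x[i] - theta[i]) ** 2) / (2.0 * sigma2)
        if np.log(rng.uniform()) < log_a:
            theta[i] = theta_star
    return theta
```

Points that end up sharing the same value of $\theta$ form one cluster; repeating this sweep gives the behavior shown in the gif.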
In the gif, $\alpha = 1$. The left panel shows $\theta$ and the right panel is color-coded by the resulting clusters. It is not completely separated into four clusters, but with more steps it separates a bit better.
Here I write down the theory for "machine learning" people. If you want the theory for "statistics" people, you should read Professor Kato's tutorial (Reference 9). (I have not read it myself.)
De Finetti's Theorem
First, let me write about De Finetti's theorem, which justifies the Bayesian approach.
The observations in the CRP treated above are exchangeable: given random variables $X_{1}, X_{2}, \dots, X_{n}, \dots$, permuting the indices does not change the joint probability. In mathematical terms, De Finetti's theorem claims that for an (infinitely) exchangeable sequence there exists a random distribution $G$ such that $p(x_{1}, \dots, x_{n}) = \int \prod_{i=1}^{n} p(x_{i} \mid G)\, dP(G)$. This theorem justifies the Bayesian approach. In the infinite mixture model above, $P(\theta)$ was such a random distribution ($G = \sum_{k} \pi_{k} \delta_{\theta_{k}}$).
The Dirichlet process is defined as follows: a random measure $G$ on a probability space follows $DP(\alpha, H)$ if, for every finite partition $A_{1}, \dots, A_{n}$ of the space, $(G(A_{1}), \dots, G(A_{n}))$ follows a Dirichlet distribution with parameters $(\alpha H(A_{1}), \dots, \alpha H(A_{n}))$. It is hard to build intuition from this definition, but it is known that $G$ is almost surely discrete and can be represented as in the SBP.
Generally, given a random variable $X$ (a measurable function from $\Omega$ to $\mathbb{R}$), what we usually work with is the induced measure on $\mathbb{R}$, i.e. the distribution of $X$; the Dirichlet process is analogous, except that each realization $G$ is itself a measure.
From the definition above, the Dirichlet process seems hard to grasp, but fortunately it is known that it can be represented by the SBP and sampled that way. That representation is what was used in the implementation above.
Below I note a few things that seem important but were not covered above.
Pitman-Yor Process
Since we went to the trouble of covering the CRP, let me also touch on the Pitman-Yor process (PYP). The PYP is a slight modification of the CRP. In the same setting as the CRP, with a discount parameter $0 \le d < 1$ and $t$ occupied tables, the seating probabilities are
P(sit at an existing table $c$) $= \frac{n_{c} - d}{\alpha + n}$, P(sit at a new table) $= \frac{\alpha + d\,t}{\alpha + n}$.
If we plot the number of tables (on a log10 scale) for $d = 0.9$ (green), $d = 0.5$ (blue), and $d = 0.1$ (yellow), it looks like the following.
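A rough simulation of this (my own sketch, based on the seating rule written above; `alpha` is the concentration and `d` the discount):

```python
import numpy as np

def pyp_table_growth(n_customers, alpha, d, rng=None):
    """Simulate Pitman-Yor seating and return the number of tables after each customer."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []
    history = []
    for n in range(n_customers):
        t = len(counts)
        # existing table c: (n_c - d) / (alpha + n); new table: (alpha + d*t) / (alpha + n)
        probs = np.array([c - d for c in counts] + [alpha + d * t], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == t:
            counts.append(1)
        else:
            counts[k] += 1
        history.append(len(counts))
    return np.array(history)

for d in (0.1, 0.5, 0.9):
    print(d, pyp_table_growth(10000, alpha=1.0, d=d)[-1])  # more tables for larger d
```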
Looking at such a plot, you can see that the number of tables increases as $d$ increases. $d$ reduces the number of people per table, and at the same time the number of tables and the number of customers per table come to follow a power law. (As you may know from books on networks, there is an empirical observation that cluster counts and cluster sizes in the real world often follow power laws.) Because it fits real data well, the Pitman-Yor process is also used in natural language processing and image processing.
As mentioned in the theory section, the Dirichlet process makes the observations exchangeable and homogeneous (they come from the same base distribution). This homogeneity is used implicitly, for example, in the Gibbs sampling of the infinite mixture model. Unfortunately, not every phenomenon in the world comes from a single homogeneous base distribution. In such cases we use the hierarchical Dirichlet process, which draws a sample from a Dirichlet process and uses it as the base distribution of another Dirichlet process.
Of course, inference can also be done with variational Bayes, but MCMC is easier (to derive an algorithm for).
There are other important topics I did not touch on this time: the infinite HMM, the Indian buffet process, Polya trees, and hierarchical Dirichlet processes. I am still studying these, and if I end up implementing them I may write another article.