I recently studied nonparametric Bayes, so I am organizing my thoughts here. I am not a specialist, so I would appreciate it if you could point out any mistakes.
Nonparametric Bayes simply means infinite-dimensional Bayesian models. (It does not mean there are no parameters.) The theory itself has been around for quite some time, and it seems to have been used in the machine learning context since around the 2000s.
For example, when clustering you usually have to specify the number of mixture components. In nonparametric Bayes, however, once you allow an infinite number of clusters, the number of components is determined automatically. Concretely, it looks like the gif below.
Points with the same color belong to the same cluster. The number of clusters was never specified, yet the data are clustered nicely. This time the goal is to implement this infinite mixture model. The code is available at https://github.com/Ma-sa-ue/practice/tree/master/machine%20learning(python)/nonparabayes. Besides nonparametric clustering, there are other applications in machine learning as well. The advantages (which overlap somewhat) are as follows.
As I said, the goal is to implement an infinite mixture model. To that end, we first explain the Dirichlet process, the stick-breaking process, and the Chinese restaurant process, which are the building blocks of nonparametric Bayes. It is important to see that these three are not separate things but different perspectives on the same object. We then rearrange them in the context of the mixture model and implement the infinite mixture model. After that, I summarize the theoretical background. I am not in a position to say anything grand, but when I studied this I could not understand it from any single document, so I recommend consulting several references and implementing things yourself.
This is a figure taken from Professor Ueda's tutorial (Reference 1). Understanding this figure is one of the goals.
First, let me describe the finite mixture model before taking the limit to infinity. It can be represented by a graphical model like (a). Concretely, the generative process is as follows.
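As a sketch of the standard finite-mixture generative process (using the symbols described below; the symmetric Dirichlet prior on $\pi$ is the one that appears later in the text):

$$
\pi \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K), \qquad
\theta_{k} \sim G_{0} \ (k = 1, \dots, K), \qquad
z_{i} \sim \mathrm{Categorical}(\pi), \qquad
x_{i} \sim f(\cdot \mid \theta_{z_{i}}).
$$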
Here $x = (x_{1}, \dots, x_{n})$ are the observed values and $z = (z_{1}, \dots, z_{n})$ are the labels. $\pi$ is the "die" used to draw the labels (the mixing proportions). For a Gaussian mixture, $f$ is the Gaussian density and $\theta = (\theta_{1}, \dots, \theta_{K})$ are the means of the $K$ classes. $G_{0}$ is the prior on the parameters; for a Gaussian mixture with a conjugate prior, it is also Gaussian.
We only observe $x$, but what we really want is the true $z$ (the labels). In the finite mixture model this can basically be obtained by Gibbs sampling of $z$ and $\theta$. By taking $K$ to infinity one could also build a method for an infinite mixture model; however, if you try to implement this "as is", you find that there are infinitely many classes and it is impossible.
The Dirichlet distribution is the multivariate generalization of the Beta distribution, and it often appears in Bayesian statistics as the prior of the multinomial distribution. The important point here is that a sample from a $K$-dimensional Dirichlet distribution sums to 1. Consider $G = \sum_{k=1}^{K} \pi_{k} \delta_{\theta_{k}}$ (with $\pi \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K)$ and $\theta_{k} \sim H$, where $H$ plays the role of $G_{0}$ above). Then in the mixture model $p(x) = \int p(x \mid \theta)\, dG(\theta)$ (the label $z$ has been marginalized out). With this, the graphical model of the finite mixture takes the form (b). The Dirichlet process arises from the desire to sample such a $G$ when $K$ goes to infinity.
Dirichlet Process
The original definition of the Dirichlet process is given in the theory section. A Dirichlet process is written $DP(\alpha, H)$. As mentioned above, a draw from a Dirichlet process is almost surely discrete and has the form $G = \sum_{k=1}^{\infty} \pi_{k} \delta_{\theta_{k}}$ (an infinite sum). To sample from the Dirichlet process we use the stick-breaking process.
Stick-Breaking Process
To sample from the Dirichlet process, draw $v_{k} \sim \mathrm{Beta}(1, \alpha)$ and $\theta_{k} \sim H$, and set $\pi_{k} = v_{k} \prod_{j=1}^{k-1} (1 - v_{j})$, so that $G = \sum_{k=1}^{\infty} \pi_{k} \delta_{\theta_{k}}$. The construction is consistent in that the $\pi_{k}$ sum to 1. It is called the stick-breaking process because it is like repeatedly breaking pieces off a stick. A sampling example is shown below.
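As a rough illustration, here is a minimal truncated stick-breaking sampler (my own sketch, not the repository's code; the truncation level `K_trunc` and the standard-normal base measure are illustrative choices):

```python
import numpy as np

def stick_breaking(alpha, K_trunc=100, rng=None):
    """Truncated stick-breaking sample: returns weights pi_k and atoms theta_k."""
    rng = np.random.default_rng() if rng is None else rng
    # v_k ~ Beta(1, alpha): the fraction broken off the remaining stick
    v = rng.beta(1.0, alpha, size=K_trunc)
    # pi_k = v_k * prod_{j<k} (1 - v_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = v * remaining
    # atoms drawn from the base measure H (here: standard normal, just as an example)
    theta = rng.normal(0.0, 1.0, size=K_trunc)
    return pi, theta

pi, theta = stick_breaking(alpha=2.0)
print(pi[:5], pi.sum())  # weights decay quickly; the sum approaches 1 as K_trunc grows
```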
Of course, $G$ is an infinite sum and cannot be constructed exactly, so this is a (truncated) approximation. You can see that the larger $\alpha$ is, the more pieces there are. Another important property used in the actual implementation is the following.
Chinese Restaurant Process (CRP)
The CRP is a distribution over partitions of the cluster indices. I will simply introduce it top-down. Write the whole set as $[n] = \{1, 2, \dots, n\}$ and write a partition of it as $\rho$. For example, when the whole set is $[4]$, possible partitions $\rho$ include $[[1], [2,3], [4]]$ and $[[1,2], [3,4]]$. The following way of building a partition is the CRP. Imagine people entering a Chinese restaurant one after another and sitting down at tables. When $n$ people are already seated, the $(n+1)$-th person sits as follows:
P(sit at an existing table $c$) $= \frac{n_{c}}{\alpha + n}$, P(sit at a new table) $= \frac{\alpha}{\alpha + n}$
Continuing this process, the resulting distribution over partitions is the CRP. (Here $\alpha$ is a constant and $n_{c}$ is the number of people already sitting at table $c$.)
For example, suppose 4 people are already seated, one alone and three together, so the partition is $[[1], [2,3,4]]$. Consider how the fifth person enters. With $\alpha = 1$, the probability of sitting at the first table is $\frac{1}{1+4}$, the probability of sitting at the second table is $\frac{3}{1+4}$, and the probability of sitting at a third (new) table is $\frac{1}{1+4}$.
Keep in mind the rich-get-richer property: a table where many people are already sitting is more likely to attract the next customer.
Also note that building a partition is exactly the same as clustering (labeling).
For example, label each person in the example above by their table number. The partition $[[1], [2,3,4]]$ corresponds to the labels $[1, 2, 2, 2]$, and if the fifth person sits at a new table the labels become $[1, 2, 2, 2, 3]$.
Also note that as $\alpha$ grows, customers spread more thinly over more tables.
Now let's actually implement it. Basically you can implement the above as is, but you may not be sure whether your implementation is correct. In that case you can do a rough sanity check by looking at the relationship between the number of customers and the number of tables.
The relationship is, approximately, $E[\text{number of tables}] \approx \alpha \log(1 + n/\alpha)$ for $n$ customers. In the plot, the x-axis is the number of customers and the y-axis is the number of tables; the red points are sampled values and the blue line is this approximate expected value. You can see that the samples behave as expected.
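A minimal sketch of this check (my own code, not the repository's; the $\alpha \log(1 + n/\alpha)$ curve is the standard approximation for the expected number of tables):

```python
import numpy as np

def crp_tables(n_customers, alpha, rng=None):
    """Simulate a CRP and return the number of occupied tables after each customer."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []              # counts[c] = number of customers at table c
    history = []
    for n in range(n_customers):
        # probability of each existing table ~ n_c, of a new table ~ alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)     # open a new table
        else:
            counts[table] += 1
        history.append(len(counts))
    return np.array(history)

alpha, n = 2.0, 1000
tables = crp_tables(n, alpha)
expected = alpha * np.log(1.0 + np.arange(1, n + 1) / alpha)
print(tables[-1], expected[-1])  # sampled vs. approximate expected number of tables
```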
Now let's reorganize the figure from earlier in light of what we have covered so far.
(a) is as described earlier: the learning strategy is Gibbs sampling of $z$ and $\theta$.
(b) is also basically as described earlier. Since the Dirichlet process is hard to handle directly, in practice one works with (c). In this formulation the labels $z$ are not computed explicitly; instead $\theta$ is computed, by Gibbs sampling over $\theta$. Note that this requires sampling from $G$, which cannot be done in the strict sense, and that in principle you would need to prepare infinitely many $\theta$ and $\pi$.
If you think about it, though, you only need to keep track of the clusters that actually contain data points. This is (d). To see this, first consider the posterior distribution of $z$ in the finite mixture model (a). A simple calculation (marginalizing out $\pi$) gives
$P(z_{i} = c \mid z_{-i}, \alpha, K) = \frac{n_{c} + \alpha/K}{n - 1 + \alpha}$ (where $n_{c}$ is the number of $z_{j}$ assigned to cluster $c$, $K$ is the number of clusters, and $z_{-i}$ denotes all labels except $z_{i}$).
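For reference, here is a sketch of that "simple calculation", assuming the symmetric Dirichlet prior $\pi \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K)$ introduced earlier:

$$
P(z_{i} = c \mid z_{-i}, \alpha, K)
= \int P(z_{i} = c \mid \pi)\, P(\pi \mid z_{-i})\, d\pi
= \mathbb{E}[\pi_{c} \mid z_{-i}]
= \frac{n_{c} + \alpha/K}{(n - 1) + \alpha},
$$

since $\pi \mid z_{-i} \sim \mathrm{Dirichlet}(\alpha/K + n_{1}, \dots, \alpha/K + n_{K})$ and the mean of a Dirichlet component is its parameter divided by the sum of the parameters.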
Taking $K \to \infty$, the numerator becomes $n_{c}$, and this is exactly the CRP with $z_{i}$ as the customer and the cluster as the table. Written out in formulas, this is model (d).
(The CRP was introduced as a distribution over partitions, but note that making a partition of $[n]$ is the same as labeling the $z_{i}$: interpret it as drawing a partition $\rho$ from the CRP and labeling $z$ accordingly.)
Finally, let's move on to the implementation. Basically we use MCMC, and the policy is to apply MCMC faithfully to the graphical models above. The implementation follows Neal's paper (Reference 5), which gives a clear account of both the background and how to implement it, so I recommend reading it.
First, we implement based on (d) in the figure, updating $z$ and $\theta$ one coordinate at a time. The first idea that comes to mind is Gibbs sampling of each $z_{i}$ from its full conditional.
Gibbs sampling $z$ directly from this full conditional is not very good: computing the transition probability requires an integral (over $\theta$ for a new cluster), and since the conditional is unnormalized we would also have to compute the normalizing constant. So recall that Gibbs sampling is just a Metropolis-Hastings algorithm whose acceptance rate is 1, and note that the prior conditional $P(z_{i} \mid z_{-i})$ is easy to compute. If we use $P(z_{i} \mid z_{-i})$ as the proposal distribution instead, the transition becomes cheap to compute. One might worry that the integral still appears in the acceptance rate, but this is avoided by drawing a single sample (of $\theta$ for a proposed new cluster). In other words, the acceptance rate $\alpha(z_{i}^{*}, z_{i})$ is
given by the ratio of the likelihood terms, $\min\!\left(1, \frac{f(x_{i} \mid \theta_{z_{i}^{*}})}{f(x_{i} \mid \theta_{z_{i}})}\right)$. The algorithm can be organized as follows.
You can see that it looks like K-means. The update of $\theta$ is a draw from its posterior distribution; since we use a Gaussian prior here, the posterior is also Gaussian. It is also recommended to set $R$ (the number of repetitions of the $z_{i}$ update) to 3 or more, because with $R = 1$ new clusters rarely gather members. The actual implementation looks like the following.
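A rough sketch of this algorithm in Python (my own code, not the repository's implementation; I assume a 1-D Gaussian likelihood with known variance `sigma2` and a Gaussian prior `N(prior_mu, prior_var)` on the cluster means, in the spirit of Neal's MH-within-Gibbs schemes):

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def dp_mixture_sampler(x, alpha=1.0, sigma2=1.0, prior_mu=0.0, prior_var=10.0,
                       n_iter=100, R=3, seed=0):
    """MH-within-Gibbs sampler for a 1-D DP Gaussian mixture with known variance.

    x        : 1-D numpy array of observations
    z[i]     : cluster label of data point i
    theta[c] : mean of cluster c (prior N(prior_mu, prior_var))
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    z = np.zeros(n, dtype=int)                  # start with everyone in one cluster
    theta = {0: rng.normal(prior_mu, np.sqrt(prior_var))}

    for _ in range(n_iter):
        # --- update each label z_i by Metropolis-Hastings, proposal = CRP conditional ---
        for i in range(n):
            for _ in range(R):                  # repeat R times so new clusters can grow
                counts = {c: 0 for c in theta}
                for j in range(n):
                    if j != i:
                        counts[z[j]] += 1
                clusters = [c for c in counts if counts[c] > 0]
                # existing cluster with prob ~ n_c, new cluster with prob ~ alpha
                weights = np.array([counts[c] for c in clusters] + [alpha], dtype=float)
                weights /= weights.sum()
                k = rng.choice(len(weights), p=weights)
                if k == len(clusters):          # proposed a new cluster: draw theta* from prior
                    cand, theta_star = max(theta) + 1, rng.normal(prior_mu, np.sqrt(prior_var))
                else:
                    cand, theta_star = clusters[k], theta[clusters[k]]
                # acceptance rate = ratio of the likelihood terms
                log_a = log_normal_pdf(x[i], theta_star, sigma2) - \
                        log_normal_pdf(x[i], theta[z[i]], sigma2)
                if np.log(rng.uniform()) < log_a:
                    theta[cand] = theta_star
                    z[i] = cand
        theta = {c: theta[c] for c in set(z.tolist())}   # drop empty clusters
        # --- update theta_c from its Gaussian posterior (conjugate, known variance) ---
        for c in theta:
            xc = x[z == c]
            post_var = 1.0 / (1.0 / prior_var + len(xc) / sigma2)
            post_mu = post_var * (prior_mu / prior_var + xc.sum() / sigma2)
            theta[c] = rng.normal(post_mu, np.sqrt(post_var))
    return z, theta

# toy usage: three well-separated Gaussian clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 50), rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
z, theta = dp_mixture_sampler(x, alpha=1.0, n_iter=50)
print(len(theta), "clusters found")
```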
It is better not to make $\alpha$ too large, so that too many clusters are not created (around 1?). In the first gif the data separated cleanly into four clusters, but this run came out more complicated.
Next, we implement based on (b) and (c) in the figure. The idea is to sample from the posterior distribution of $\theta$ using the following formula.
Of course we cannot truly sample from $G$, so this is a pseudo-sampling. Note that the subscript of $\theta_{i}$ now refers to samples (data points), not clusters. As in the CRP implementation, instead of plain Gibbs sampling we use $P(\theta_{i} \mid \theta_{-i})$ as the proposal distribution. The acceptance rate $\alpha(\theta_{i}^{*}, \theta_{i})$ is
again the ratio of the likelihood terms. Organizing this, we simply sample $\theta_{i}$ ($i = 1, \dots, n$) in turn with the above acceptance rate. The result is shown below.
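A minimal sketch of one sweep of this $\theta$-based update (again my own illustration, assuming a 1-D Gaussian likelihood with known variance `sigma2`; `prior_mu` and `prior_var` parameterize the base measure, and `theta` is an array with one entry per data point):

```python
import numpy as np

def theta_sweep(x, theta, alpha, sigma2, prior_mu, prior_var, rng):
    """One MH sweep over theta_i (one parameter per data point).

    Proposal: theta_i* ~ P(theta_i | theta_{-i})  (copy another theta_j or draw fresh),
    acceptance rate: likelihood ratio f(x_i | theta_i*) / f(x_i | theta_i).
    """
    n = len(x)
    for i in range(n):
        if rng.uniform() < alpha / (n - 1 + alpha):
            # fresh draw from the base measure
            theta_star = rng.normal(prior_mu, np.sqrt(prior_var))
        else:
            # copy one of the other theta_j uniformly at random
            j = rng.choice([k for k in range(n) if k != i])
            theta_star = theta[j]
        log_a = (-(x[i] - theta_star) ** 2 + (x[i] - theta[i]) ** 2) / (2.0 * sigma2)
        if np.log(rng.uniform()) < log_a:
            theta[i] = theta_star
    return theta
```

Points that end up sharing the same value of $\theta$ form one cluster; repeating this sweep gives the behavior shown in the gif.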
In the gif, $\alpha = 1$. The left panel shows $\theta$ and the right panel is color-coded by the resulting clusters. It is not completely separated into four clusters, but with more steps it separates a bit better.
Here I write down the theory for "machine learning" people. If you want the theory for "statistics" people, you should read Professor Kato's tutorial (Reference 9). (I have not read it myself.)
De Finetti's Theorem
First, let me write about De Finetti's theorem, which justifies the Bayesian approach.
The observations in the CRP treated above are exchangeable: given random variables $X_{1}, X_{2}, \dots, X_{n}, \dots$, permuting the indices does not change the joint probability. In mathematical terms, De Finetti's theorem claims that for an (infinitely) exchangeable sequence there exists a random distribution $G$ such that $p(x_{1}, \dots, x_{n}) = \int \prod_{i=1}^{n} p(x_{i} \mid G)\, dP(G)$. This theorem justifies the Bayesian approach. In the infinite mixture model above, $P(\theta)$ was such a random distribution ($G = \sum_{k} \pi_{k} \delta_{\theta_{k}}$).
The Dirichlet process is defined as follows: a random measure $G$ on a probability space follows $DP(\alpha, H)$ if, for every finite partition $A_{1}, \dots, A_{n}$ of the space, $(G(A_{1}), \dots, G(A_{n}))$ follows a Dirichlet distribution with parameters $(\alpha H(A_{1}), \dots, \alpha H(A_{n}))$. It is hard to build intuition from this definition, but it is known that $G$ is almost surely discrete and can be represented as in the SBP.
Generally, given a random variable $X$ (a measurable function from $\Omega$ to $\mathbb{R}$), what we usually work with is the induced measure on $\mathbb{R}$, i.e. the distribution of $X$; the Dirichlet process is analogous, except that each realization $G$ is itself a measure.
From the definition above, the Dirichlet process seems hard to grasp, but fortunately it is known that it can be represented by the SBP and sampled that way. That representation is what was used in the implementation above.
Below I note a few things that seem important but were not covered above.
Pitman-Yor Process
Since we went to the trouble of covering the CRP, let me also touch on the Pitman-Yor process (PYP). The PYP is a slight modification of the CRP. In the same setting as the CRP, with a discount parameter $0 \le d < 1$ and $t$ occupied tables, the seating probabilities are
P(sit at an existing table $c$) $= \frac{n_{c} - d}{\alpha + n}$, P(sit at a new table) $= \frac{\alpha + d\,t}{\alpha + n}$.
If we plot the number of tables (on a log10 scale) for $d = 0.9$ (green), $d = 0.5$ (blue), and $d = 0.1$ (yellow), it looks like the following.
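A rough simulation of this (my own sketch, based on the seating rule written above; `alpha` is the concentration and `d` the discount):

```python
import numpy as np

def pyp_table_growth(n_customers, alpha, d, rng=None):
    """Simulate Pitman-Yor seating and return the number of tables after each customer."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []
    history = []
    for n in range(n_customers):
        t = len(counts)
        # existing table c: (n_c - d) / (alpha + n); new table: (alpha + d*t) / (alpha + n)
        probs = np.array([c - d for c in counts] + [alpha + d * t], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == t:
            counts.append(1)
        else:
            counts[k] += 1
        history.append(len(counts))
    return np.array(history)

for d in (0.1, 0.5, 0.9):
    print(d, pyp_table_growth(10000, alpha=1.0, d=d)[-1])  # more tables for larger d
```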
Looking at such a plot, you can see that the number of tables increases as $d$ increases. $d$ reduces the number of people per table, and at the same time the number of tables and the number of customers per table come to follow a power law. (As you may know from books on networks, there is an empirical observation that cluster counts and cluster sizes in the real world often follow power laws.) Because it fits real data well, the Pitman-Yor process is also used in natural language processing and image processing.
As mentioned in the theory section, the Dirichlet process makes the observations exchangeable and homogeneous (they come from the same base distribution). This homogeneity is used implicitly, for example, in the Gibbs sampling of the infinite mixture model. Unfortunately, not every phenomenon in the world comes from a single homogeneous base distribution. In such cases we use the hierarchical Dirichlet process, which draws a sample from a Dirichlet process and uses it as the base distribution of another Dirichlet process.
Of course, inference can also be done with variational Bayes, but MCMC is easier (to derive an algorithm for).
There are other important topics I did not touch on this time: the infinite HMM, the Indian buffet process, Polya trees, and hierarchical Dirichlet processes. I am still studying these, and if I end up implementing them I may write another article.