HMM parameter estimation implementation in Python

Introduction

I implemented HMM parameter estimation in Python, using ["Continued: Easy-to-Understand Pattern Recognition (An Introduction to Unsupervised Learning)" by Kenichiro Ishii](https://www.amazon.co.jp/dp/427421530X) as the textbook.

Structure of this article

This article introduces the hidden Markov model, derives the forward and backward algorithms for evaluation and the Baum-Welch algorithm for parameter estimation, adds scaling for numerical stability, and finally shows a Python implementation and its results.

Hidden Markov Model

A process in which the current state is determined stochastically by the state one step earlier is said to have the **Markov property**, and a process satisfying this property is called a **Markov process**. Stock prices, speech signals, and language are examples of such processes. A Markov model in which only the sequence of output symbols can be observed, while the sequence of states cannot, is called a **hidden Markov model** (HMM).

Dice throw

We explain the HMM using dice throwing as an example. Suppose there are $c$ types of dice $\omega_1, \omega_2, \cdots, \omega_c$, each with $m$ possible faces $v_1, v_2, \cdots, v_m$, and a die is drawn and thrown $n$ times. Let $s_t$ denote the type of die thrown at time $t$, and let $x_t$ denote the face observed. Then $s_t$ and $x_t$ satisfy the following.

\begin{eqnarray}
\left\{\begin{array}{ll}
s_t \in \bigl\{\omega_1, \omega_2, \cdots, \omega_c\bigr\} \\
x_t \in \bigl\{v_1, v_2, \cdots, v_m\bigr\} \\
\end{array} \right.
\end{eqnarray}

The die $s_t$ drawn at time $t$ is determined stochastically by the die $s_{t - 1}$ drawn at time $(t - 1)$; this probability is denoted $a(s_{t - 1}, s_t)$. When the die drawn at time $(t - 1)$ is $\omega_i$, the probability that the die drawn at time $t$ is $\omega_j$ is expressed as follows.

a(\omega_i, \omega_j) = P(s_t = \omega_j \ | \ s_{t - 1} = \omega_i) \ \ \ \ \ (i, j = 1, 2, \cdots, c) \tag{1}

The probability that $x_t$ is observed when the die $s_t$ is thrown at time $t$ is denoted $b(s_t, x_t)$. The probability that $v_k$ is observed when the die $\omega_j$ is thrown is expressed as follows.

b(\omega_j, v_k) = P(x_t = v_k \ | \ s_t = \omega_j) \ \ \ \ \ (j = 1, 2, \cdots, c) \ \ (k = 1, 2, \cdots, m) \tag{2}

For brevity, $a(\omega_i, \omega_j)$ is abbreviated as $a_{ij}$ and $b(\omega_j, v_k)$ as $b_{jk}$; $a_{ij}$ is called the **transition probability** and $b_{jk}$ the **output probability**. In an HMM, the observer sees only the **rolled faces** (= observations) produced by throwing the dice, and cannot see which **type of die** (= state) produced each face.
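As a concrete sketch of this setup, the generative process can be simulated. The two-dice parameters `A`, `B`, `rho` below are made up for illustration (they are not the article's experiment); an observer would see only `x`, never `s`:

```python
import numpy as np

# Hypothetical example parameters: c = 2 dice, m = 2 faces.
A = np.array([[0.9, 0.1],    # a_ij = P(s_t = omega_j | s_{t-1} = omega_i)
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],    # b_jk = P(x_t = v_k | s_t = omega_j)
              [0.1, 0.9]])
rho = np.array([1.0, 0.0])   # rho_i = P(s_1 = omega_i)

def sample_hmm(A, B, rho, n, seed=0):
    """Draw a state sequence s and an observation sequence x of length n."""
    rng = np.random.default_rng(seed)
    c, m = B.shape
    s = np.empty(n, dtype=int)
    x = np.empty(n, dtype=int)
    s[0] = rng.choice(c, p=rho)               # initial state
    x[0] = rng.choice(m, p=B[s[0]])           # first observed face
    for t in range(1, n):
        s[t] = rng.choice(c, p=A[s[t - 1]])   # transition by a_ij
        x[t] = rng.choice(m, p=B[s[t]])       # emission by b_jk
    return s, x

s, x = sample_hmm(A, B, rho, n=10)
```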

Arrangement of symbols

The symbols used to describe the Markov model are summarized below.

| Symbol | Meaning |
|---|---|
| $c$ | number of states (types of dice) |
| $m$ | number of output symbols (faces) |
| $n$ | length of the sequence |
| $s_t$ | state at time $t$ |
| $x_t$ | output symbol observed at time $t$ |
| $a_{ij}$ | transition probability $P(s_t = \omega_j \mid s_{t - 1} = \omega_i)$; collected in $\boldsymbol A$ |
| $b_{jk}$ | output probability $P(x_t = v_k \mid s_t = \omega_j)$; collected in $\boldsymbol B$ |
| $\rho_i$ | initial state probability $P(s_1 = \omega_i)$; collected in $\boldsymbol \rho$ |

Evaluation

When the parameters $\boldsymbol A, \boldsymbol B, \boldsymbol \rho$ are known, computing the probability $P(\boldsymbol x)$ of obtaining the observation sequence $\boldsymbol x$ is called **evaluation**.

\begin{align}
P(\boldsymbol x) &= \sum_{\boldsymbol s} P(\boldsymbol x, \boldsymbol s) \tag{3} \\
&= \sum_{s_1} \sum_{s_2} \cdots \sum_{s_n} P(\boldsymbol x, s_1s_2 \cdots s_n) \\
P(\boldsymbol x, \boldsymbol s) &= \prod_{t = 1}^n a(s_{t - 1}, s_t)b(s_t, x_t) \tag{4}
\end{align}

If both $\boldsymbol x$ and $\boldsymbol s$ could be observed simultaneously, $P(\boldsymbol x, \boldsymbol s)$ would be easy to compute. When only $\boldsymbol x$ is observable, however, marginalizing over all possible state sequences $\boldsymbol s$ requires summing $c^n$ terms, which quickly becomes enormous. The **forward algorithm** and **backward algorithm** are introduced below as efficient alternatives.
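To make the cost of equation $(3)$ concrete, here is a brute-force sketch that enumerates all $c^n$ state sequences, reading the $t = 1$ factor of equation $(4)$ as $\rho_{s_1} b(s_1, x_1)$. It is usable only for tiny $n$:

```python
import itertools
import numpy as np

def naive_likelihood(A, B, rho, x):
    """P(x) by summing P(x, s) over all c**n state sequences (eqs. (3), (4))."""
    c, n = A.shape[0], len(x)
    total = 0.0
    for s in itertools.product(range(c), repeat=n):
        p = rho[s[0]] * B[s[0], x[0]]                  # t = 1 factor
        for t in range(1, n):
            p *= A[s[t - 1], s[t]] * B[s[t], x[t]]     # a(s_{t-1}, s_t) b(s_t, x_t)
        total += p
    return total
```

As a sanity check, these probabilities summed over every possible observation sequence of length $n$ add up to $1$.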

forward algorithm

The probability that the observation sequence $\boldsymbol x$ is obtained and the die $\omega_i$ is drawn at time $t$ factorizes as follows.

\begin{align}
& P(\boldsymbol x, s_t = \omega_i) \\
&= P(x_1x_2 \cdots x_t, s_t = \omega_i) \cdot P(x_{t + 1}x_{t + 2} \cdots x_n \ | \ s_t = \omega_i) \\
&= \alpha_t(i) \cdot \beta_t(i) \tag{5}
\end{align}

$\alpha_t(i)$ in equation $(5)$ is the probability that $x_1x_2 \cdots x_t$ is observed and the die $\omega_i$ is drawn at time $t$. $\alpha_t(i)$ can be computed recursively as follows.

\begin{align}
& \alpha_t(j) = \Biggl[\sum_{i = 1}^c \alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \tag{6} \\
& (t = 2, 3, \cdots, n) \ \ (j = 1, 2, \cdots, c)
\end{align}

Here, $\alpha_1(i)$ is given by:

\begin{align}
\alpha_1(i) &= P(x_1, s_1 = \omega_i) \\
&= \rho_i b(\omega_i, x_1) \ \ \ \ \ (i = 1, 2, \cdots, c) \tag{7}
\end{align}

The following equation is obtained by setting $ t = n $ in the equation $ (6) $.

\alpha_n(i) = P(x_1x_2 \cdots x_n, s_n = \omega_i) \tag{8}

The following equation is obtained by marginalizing the equation $ (8) $ with respect to $ s_n $.

\begin{align}
P(\boldsymbol x) &= P(x_1x_2 \cdots x_n) \\
&= \sum_{i = 1}^c P(x_1x_2 \cdots x_n, s_n = \omega_i) \\
&= \sum_{i = 1}^c \alpha_n(i) \tag{9}
\end{align}

Based on the above, the procedure of the forward algorithm is summarized below.

< forward algorithm >

  1. Initialization: $\alpha_1(i) = \rho_i b(\omega_i, x_1) \ \ (i = 1, 2, \cdots, c)$
  2. Recursive calculation: $\alpha_t(j) = \Bigl[\sum_{i = 1}^c \alpha_{t - 1}(i) a_{ij} \Bigr]b(\omega_j, x_t) \ \ (t = 2, 3, \cdots, n) \ (j = 1, 2, \cdots, c)$
  3. Probability calculation: $P(\boldsymbol x) = \sum_{i = 1}^c \alpha_n(i)$
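These three steps might look like the following in Python, assuming `A`, `B`, `rho` are NumPy arrays as in equations $(1)$, $(2)$ and `x` is a sequence of symbol indices (a sketch, not the article's exact code):

```python
import numpy as np

def forward(A, B, rho, x):
    """Forward algorithm: alpha (n x c) and P(x) = sum_i alpha_n(i)."""
    n, c = len(x), A.shape[0]
    alpha = np.zeros((n, c))
    alpha[0] = rho * B[:, x[0]]                       # 1. initialization (eq. (7))
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]    # 2. recursion (eq. (6))
    return alpha, alpha[-1].sum()                     # 3. P(x) (eq. (9))
```

Each recursion step costs $O(c^2)$, so the whole computation is $O(nc^2)$ instead of $O(c^n)$.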

backward algorithm

$\beta_t(i)$ in equation $(5)$ is the probability that the subsequent observations are $x_{t + 1}x_{t + 2} \cdots x_n$, given that the die $\omega_i$ is drawn at time $t$. $\beta_t(i)$ can be computed recursively as follows.

\begin{align}
& \beta_t(i) = \sum_{j = 1}^c a_{ij}b(\omega_j, x_{t + 1})\beta_{t + 1}(j) \tag{10} \\
& (t = 1, 2, \cdots, (n - 1)) \ \ (i = 1, 2, \cdots, c)
\end{align}

Here, $\beta_n(i)$ is given by:

\beta_n(i) = 1 \ \ \ \ \ (i = 1, 2, \cdots, c) \tag{11}

The following equation is obtained by setting $ t = 1 $ in the equation $ (10) $.

\beta_1(i) = P(x_2x_3 \cdots x_n \ | \ s_1 = \omega_i) \tag{12}

From equation $(12)$, $P(x_1x_2 \cdots x_n, s_1 = \omega_i)$ can be expressed as follows.

\begin{align}
& P(x_1x_2 \cdots x_n, s_1 = \omega_i) \\
&= P(s_1 = \omega_i)P(x_1 \ | \ s_1 = \omega_i)P(x_2x_3 \cdots x_n \ | \ s_1 = \omega_i) \\
&= \rho_i b(\omega_i, x_1) \beta_1(i) \tag{13}
\end{align}

The following equation is obtained by marginalizing equation $(13)$ with respect to $s_1$.

\begin{align}
P(\boldsymbol x) &= P(x_1x_2 \cdots x_n) \\
&= \sum_{i = 1}^c P(x_1x_2 \cdots x_n, s_1 = \omega_i) \\
&= \sum_{i = 1}^c \rho_i b(\omega_i, x_1) \beta_1(i) \tag{14}
\end{align}

Based on the above, the procedure of the backward algorithm is summarized below.

< backward algorithm >

  1. Initialization: $\beta_n(i) = 1 \ \ (i = 1, 2, \cdots, c)$
  2. Recursive calculation: $\beta_t(i) = \sum_{j = 1}^c a_{ij}b(\omega_j, x_{t + 1})\beta_{t + 1}(j) \ \ (t = 1, 2, \cdots, (n - 1)) \ (i = 1, 2, \cdots, c)$
  3. Probability calculation: $P(\boldsymbol x) = \sum_{i = 1}^c \rho_i b(\omega_i, x_1) \beta_1(i)$
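A matching sketch of the backward pass, under the same array conventions (NumPy arrays `A`, `B`, `rho` and a sequence `x` of symbol indices):

```python
import numpy as np

def backward(A, B, rho, x):
    """Backward algorithm: beta (n x c) and P(x) via eq. (14)."""
    n, c = len(x), A.shape[0]
    beta = np.zeros((n, c))
    beta[-1] = 1.0                                     # 1. beta_n(i) = 1 (eq. (11))
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])   # 2. recursion (eq. (10))
    return beta, (rho * B[:, x[0]] * beta[0]).sum()    # 3. P(x) (eq. (14))
```

For the same inputs, the forward and backward algorithms return the same $P(\boldsymbol x)$, which is a useful consistency check on an implementation.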

Estimation

When the parameters $\boldsymbol A, \boldsymbol B, \boldsymbol \rho$ are unknown, obtaining them from the observation sequence $\boldsymbol x$ is called **estimation**. In preparation for deriving the estimation updates, we define $\gamma_t(i)$ and $\xi_t(i, j)$ as follows.

\begin{align}
\gamma_t(i) &\overset{\rm{def}}{=} P(s_t = \omega_i \ | \ \boldsymbol x) \tag{15} \\
\xi_t(i, j) &\overset{\rm{def}}{=} P(s_t = \omega_i, s_{t + 1} = \omega_j \ | \ \boldsymbol x) \tag{16}
\end{align}

In addition, the following relationship holds between $\gamma_t(i)$ and $\xi_t(i, j)$.

\begin{align}
\sum_{j = 1}^c \xi_t(i, j) &= \sum_{j = 1}^c P(s_t = \omega_i, s_{t + 1} = \omega_j \ | \ \boldsymbol x) \\
&= P(s_t = \omega_i \ | \ \boldsymbol x) \\
&= \gamma_t(i) \tag{17}
\end{align}

Here, $P(\boldsymbol x, s_t = \omega_i, s_{t + 1} = \omega_j)$ can be factorized as follows.

\begin{align}
&P(\boldsymbol x, s_t = \omega_i, s_{t + 1} = \omega_j) \\
&= P(x_1 \cdots x_t, s_t = \omega_i) \cdot P(s_{t + 1} = \omega_j \ | \ s_t = \omega_i) \cdot P(x_{t + 1} \ | \ s_{t + 1} = \omega_j) \cdot P(x_{t + 2} \cdots x_n \ | \ s_{t + 1} = \omega_j) \\
&= \alpha_t(i)a_{ij}b(\omega_j, x_{t + 1})\beta_{t + 1}(j) \tag{18}
\end{align}

From the above, the following equation can be derived.

\begin{align}
\gamma_t(i) &= P(s_t = \omega_i \ | \ \boldsymbol x) \\
&= P(\boldsymbol x, s_t = \omega_i) \bigl/ P(\boldsymbol x) \\
&= \alpha_t(i)\beta_t(i) \Bigl/ \sum_{i = 1}^c \alpha_n(i) \tag{19} \\
\xi_t(i, j) &= P(s_t = \omega_i, s_{t + 1} = \omega_j \ | \ \boldsymbol x) \\
&= P(\boldsymbol x, s_t = \omega_i, s_{t + 1} = \omega_j) \bigl/ P(\boldsymbol x) \\
&= \alpha_t(i)a_{ij}b(\omega_j, x_{t + 1})\beta_{t + 1}(j) \Bigl/ \sum_{i = 1}^c \alpha_n(i) \tag{20}
\end{align}
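Given `alpha` and `beta` from the two passes (each of shape $(n, c)$), equations $(19)$ and $(20)$ reduce to a few vectorized lines. This is a sketch; `x` is assumed to be an integer array of symbol indices:

```python
import numpy as np

def posteriors(A, B, alpha, beta, x):
    """gamma_t(i) (eq. (19)) and xi_t(i, j) (eq. (20)) from alpha and beta."""
    px = alpha[-1].sum()                      # P(x), eq. (9)
    gamma = alpha * beta / px                 # shape (n, c)
    # xi has shape (n - 1, c, c): xi[t, i, j] pairs time t with time t + 1
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, x[1:]].T[:, None, :] * beta[1:, None, :]) / px
    return gamma, xi
```

Equation $(17)$ then holds numerically: `xi.sum(axis=2)` reproduces `gamma[:-1]`.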

Baum-Welch algorithm

Based on the above results, the HMM parameters are estimated by maximum likelihood. The method described here is called the **Baum-Welch algorithm**. The parameters $\boldsymbol A, \boldsymbol B, \boldsymbol \rho$ are collected as follows.

\boldsymbol \theta = (\boldsymbol A, \boldsymbol B, \boldsymbol \rho) \tag{21}

Applying the EM algorithm, we maximize the following $Q$ function.

\begin{align}
Q(\boldsymbol \theta^0, \boldsymbol \theta) &= \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \log P(\boldsymbol x, \boldsymbol s; \boldsymbol \theta) \\
&= \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \bigl\{\log P(\boldsymbol s; \boldsymbol \theta) + \log P(\boldsymbol x \ | \ \boldsymbol s; \boldsymbol \theta)\bigr\} \\
&= \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \bigl\{\log P(s_1) + \sum_{t = 1}^{n - 1} \log a(s_t, s_{t + 1}) + \sum_{t = 1}^n \log b(s_t, x_t) \bigr\} \tag{22}
\end{align}

Here, we define the following components of the $Q$ function.

\begin{align}
Q(\boldsymbol \theta^0, \boldsymbol \rho) &\overset{\rm{def}}{=} \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \log P(s_1) \tag{23} \\
Q(\boldsymbol \theta^0, \boldsymbol A) &\overset{\rm{def}}{=} \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \sum_{t = 1}^{n - 1} \log a(s_t, s_{t + 1}) \tag{24} \\
Q(\boldsymbol \theta^0, \boldsymbol B) &\overset{\rm{def}}{=} \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \sum_{t = 1}^n \log b(s_t, x_t) \tag{25}
\end{align}

Maximizing equations $(23), (24), (25)$ with respect to $\boldsymbol \rho, \boldsymbol A, \boldsymbol B$ respectively yields the following parameter estimates (derivation omitted). Here $\delta(x_t, v_k)$ is the Kronecker delta, equal to $1$ when $x_t = v_k$ and $0$ otherwise.

\begin{align}
\hat{a}_{ij} &= \sum_{t = 1}^{n - 1} \xi_t(i, j) \Bigl/ \sum_{t = 1}^{n - 1} \gamma_t(i) \tag{26} \\
\hat{b}_{jk} &= \sum_{t = 1}^n \delta(x_t, v_k) \gamma_t(j) \Bigl/ \sum_{t = 1}^n \gamma_t(j) \tag{27} \\
\hat{\rho}_i &= \gamma_1(i) \tag{28}
\end{align}

The procedure of the Baum-Welch algorithm is summarized below.

< Baum-Welch algorithm >

  1. Initialization: set appropriate initial values for the parameters $a_{ij}, b_{jk}, \rho_i$
  2. Recursive calculation: compute $\hat a_{ij}, \hat b_{jk}, \hat \rho_i$ from equations $(26), (27), (28)$; the $\alpha_t(i), \beta_t(i)$ needed for $\gamma_t(i), \xi_t(i, j)$ are obtained with the forward algorithm and backward algorithm
  3. Parameter update: update the parameters as $a_{ij} \leftarrow \hat a_{ij}, \ b_{jk} \leftarrow \hat b_{jk}, \ \rho_i \leftarrow \hat \rho_i$
  4. Termination: if the increase in log-likelihood falls below a preset threshold, the algorithm is considered to have converged and the updates stop
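Putting the four steps together, here is a compact sketch. It omits scaling, so it is only suitable for short sequences; the random initialization and the uniform initial $\rho$ are illustrative choices, not prescribed by the algorithm:

```python
import numpy as np

def baum_welch(x, c, m, n_iter=100, tol=1e-6, seed=0):
    """Baum-Welch sketch (no scaling, so only suitable for short sequences)."""
    x = np.asarray(x)
    n = len(x)
    rng = np.random.default_rng(seed)
    # 1. initialization: random row-stochastic matrices
    A = rng.random((c, c)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((c, m)); B /= B.sum(axis=1, keepdims=True)
    rho = np.full(c, 1.0 / c)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: forward and backward passes, then gamma and xi
        alpha = np.zeros((n, c)); beta = np.zeros((n, c))
        alpha[0] = rho * B[:, x[0]]
        for t in range(1, n):
            alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        beta[-1] = 1.0
        for t in range(n - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        px = alpha[-1].sum()
        gamma = alpha * beta / px
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * B[:, x[1:]].T[:, None, :] * beta[1:, None, :]) / px
        # 3. M-step: equations (26), (27), (28)
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[x == k].sum(axis=0) for k in range(m)], axis=1)
        B /= gamma.sum(axis=0)[:, None]
        rho = gamma[0]
        # 4. termination: stop when the log-likelihood gain drops below tol
        ll = np.log(px)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return A, B, rho
```

The M-step leaves `A`, `B`, and `rho` properly normalized because of equation $(17)$, so the updated rows always sum to one.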

scaling

The Baum-Welch algorithm uses the forward and backward algorithms to compute $\alpha_t(i)$ and $\beta_t(i)$. For long sequences, these probabilities underflow and can no longer be represented. **Scaling** is introduced as a countermeasure: define the new quantities $\hat\alpha_t(i)$ and $c_t$ as follows.

\begin{align}
& \hat \alpha_t(i) \overset{\rm{def}}{=} P(s_t = \omega_i \ | \ x_1 \cdots x_t) = \frac{P(x_1 \cdots x_t, s_t = \omega_i)}{P(x_1 \cdots x_t)} \tag{29} \\
& c_t = P(x_t \ | \ x_1 \cdots x_{t - 1}) \tag{30}
\end{align}

Here, $P(x_1 \cdots x_t)$ can be expressed as follows.

\begin{align}
P(x_1 \cdots x_t) &= P(x_1) \cdot P(x_2 \ | \ x_1) \cdot \cdots \cdot P(x_{t - 1} \ | \ x_1 \cdots x_{t - 2}) \cdot P(x_t \ | \ x_1 \cdots x_{t - 1}) \\
&= c_1 \cdot c_2 \cdot \cdots \cdot c_{t - 1} \cdot c_t \\
&= \prod_{m = 1}^t c_m \tag{31}
\end{align}

From this, the following relationship holds.

\begin{align}
\hat \alpha_t(i) &= \frac{\alpha_t(i)}{P(x_1 \cdots x_t)} \\
&= \frac{\alpha_t(i)}{\prod_{m = 1}^t c_m} \\
&= \frac{\alpha_t(i)}{c_t \cdot \prod_{m = 1}^{t - 1} c_m} \\
&= \frac{\alpha_t(i)}{c_t \cdot P(x_1 \cdots x_{t - 1})} \tag{32}
\end{align}

The expression $ (6) $ can be rewritten as follows.

\begin{align}
\alpha_t(j) &= \Biggl[\sum_{i = 1}^c \alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \\
&= \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) P(x_1 \cdots x_{t - 1}) a_{ij} \Biggr]b(\omega_j, x_t) \\
\frac{\alpha_t(j)}{P(x_1 \cdots x_{t - 1})} &= \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \\
c_t \hat\alpha_t(j)&= \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \tag{33}
\end{align}

Summing equation $(33)$ over $j$ and using $\sum_{j = 1}^c \hat\alpha_t(j) = 1$, $c_t$ can be computed as follows.

\begin{align}
\sum_{j = 1}^c c_t \hat\alpha_t(j) &= \sum_{j = 1}^c \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \\
c_t &= \sum_{j = 1}^c \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \tag{34}
\end{align}

Similarly, $\hat\beta_t(i)$ satisfies the following recursion, reusing the $c_t$ obtained in the forward pass.

c_{t + 1} \hat\beta_t(i) = \sum_{j = 1}^c a_{ij}b(\omega_j, x_{t + 1})\hat\beta_{t + 1}(j) \tag{35}

Using $\hat\alpha_t(i)$ and $\hat\beta_t(i)$, the quantities $\gamma_t(i)$ and $\xi_t(i, j)$ can be rewritten as follows.

\begin{align}
\gamma_t(i) &= \hat\alpha_t(i)\hat\beta_t(i) \tag{36} \\
\xi_t(i, j) &= \frac{\hat\alpha_t(i)a_{ij}b(\omega_j, x_{t + 1})\hat\beta_{t + 1}(j)}{c_{t + 1}} \tag{37}
\end{align}

The parameters are updated by substituting the equations $ (36), (37) $ into the equations $ (26), (27), (28) $.
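The scaled recursions $(33)$ through $(35)$ might be implemented as follows. This sketch returns $\log P(\boldsymbol x) = \sum_t \log c_t$ from equation $(31)$, which stays finite where the unscaled $P(\boldsymbol x)$ would underflow:

```python
import numpy as np

def scaled_forward_backward(A, B, rho, x):
    """Scaled forward/backward passes: alpha_hat, beta_hat, c_t, log P(x)."""
    n, c = len(x), A.shape[0]
    alpha_hat = np.zeros((n, c)); beta_hat = np.zeros((n, c)); cs = np.zeros(n)
    a = rho * B[:, x[0]]
    cs[0] = a.sum()                          # c_1 = P(x_1)
    alpha_hat[0] = a / cs[0]
    for t in range(1, n):
        a = (alpha_hat[t - 1] @ A) * B[:, x[t]]
        cs[t] = a.sum()                      # eq. (34)
        alpha_hat[t] = a / cs[t]             # eq. (33)
    beta_hat[-1] = 1.0
    for t in range(n - 2, -1, -1):
        beta_hat[t] = (A @ (B[:, x[t + 1]] * beta_hat[t + 1])) / cs[t + 1]  # eq. (35)
    # eq. (31): log P(x) = sum_t log c_t, computed without underflow
    return alpha_hat, beta_hat, cs, np.log(cs).sum()
```

By equation $(36)$, the product `alpha_hat * beta_hat` gives $\gamma_t(i)$ directly, so each of its rows sums to one.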

Implementation

I implemented parameter estimation with the Baum-Welch algorithm in Python. The implementation code can be found here. Following the dice example, there are $3$ states ($c = 3$): $\omega_1, \omega_2, \omega_3$, and $2$ output symbols ($m = 2$): $v_1, v_2$. This corresponds to choosing one of $3$ different dice, throwing it, and observing only whether the roll is odd ($v_1$) or even ($v_2$).

Results

The true parameters are shown below.

\boldsymbol A = \begin{pmatrix}
0.1 & 0.7 & 0.2 \\
0.2 & 0.1 & 0.7 \\
0.7 & 0.2 & 0.1
\end{pmatrix}, \ \ \  
\boldsymbol B = \begin{pmatrix}
0.9 & 0.1 \\
0.6 & 0.4 \\
0.1 & 0.9
\end{pmatrix},  \ \ \ 
\boldsymbol \rho = \begin{pmatrix}
1.0 & 0.0 & 0.0
\end{pmatrix}

The initial values for estimation were set as follows.

\boldsymbol A = \begin{pmatrix}
0.15 & 0.60 & 0.25 \\
0.25 & 0.15 & 0.60 \\
0.60 & 0.25 & 0.15
\end{pmatrix}, \ \ \  
\boldsymbol B = \begin{pmatrix}
0.5 & 0.5 \\
0.5 & 0.5 \\
0.5 & 0.5
\end{pmatrix},  \ \ \ 
\boldsymbol \rho = \begin{pmatrix}
1.0 & 0.0 & 0.0
\end{pmatrix}

The estimation result is as follows.

\boldsymbol A = \begin{pmatrix}
0.11918790 & 0.63863200 & 0.24218010 \\
0.13359797 & 0.07275209 & 0.79364994 \\
0.72917405 & 0.18715812 & 0.08366783
\end{pmatrix}, \ \ \  
\boldsymbol B = \begin{pmatrix}
0.92503797 & 0.07496203 \\
0.56298958 & 0.43701042 \\
0.14059588 & 0.85940412
\end{pmatrix}

The progression of the log-likelihood during training is shown below.

(Figure logP.png: log-likelihood over iterations)

Finally, the convergence of $b_{11}, b_{21}, b_{31}$ is shown below. The thin black lines represent the true values.

(Figure B1.png: convergence of $b_{11}, b_{21}, b_{31}$)

Conclusion

I was able to implement HMM parameter estimation in Python. If you find any mistakes, please point them out.
