Introduction

Implemented HMM parameter estimation in python. I used ["Continued / Easy-to-understand pattern recognition"](https://www.amazon.co.jp/ Continued / Easy-to-understand pattern recognition-Introduction to unsupervised learning --- Ishii-Kenichiro / dp / 427421530X) as a textbook.

Structure of this article

Hidden Markov Model
Dice throw
Arrangement of symbols
Rating
- forward algorithm
- backward algorithm
Estimate
- Baum-Welch algorithm
Scaling
Implementation
Result
in conclusion

Hidden Markov Model

A characteristic in which the current state is stochastically determined depending on the state one time ago is called ** Markov property **. The process of satisfying Markov property is called ** Markov process **. Examples of such events include stock prices, voice signals, and languages. In the Markov model, a model in which only the series of output symbols can be observed and the series of states cannot be observed. It is called ** Hidden Markov Model **.

Dice throw

HMM will be explained using dice throwing as an example. Consider throwing $ c $ seed dice with $ m $ seeds $ v_1, v_2, \ cdots, v_m $ $ \ omega_1, \ omega_2, \ cdots, \ omega_c $ $ n $ times, The type of dice thrown at the $ t $ time is represented by $ s_t $, and the number of dice thrown and observed is represented by $ x_t $. $ s_t $ and $ x_t $ are expressed as follows.

\begin{eqnarray}
\left\{\begin{array}{ll}
s_t \in \bigl\{\omega_1, \omega_2, \cdots, \omega_c\bigr\} \\
x_t \in \bigl\{v_1, v_2, \cdots, v_m\bigr\} \\
\end{array} \right.
\end{eqnarray}

$ t $ Dice to be taken out $ s_t $ is stochastically determined by the dice $ s_ {t-1} $ to be taken out for $ (t -1) $ The probability is represented by $ a (s_ {t -1}, s_t) $. When the $ (t -1) $ dice taken out is $ \ omega_i $, the probability that the dice taken out $ t $ is $ \ omega_j $ is It is expressed as follows.

a(\omega_i, \omega_j) = P(s_t = \omega_j \ | \ s_{t - 1} = \omega_i) \ \ \ \ \ (i, j = 1, 2, \cdots, c) \tag{1}

The probability that $ x_t $ will be observed by throwing the dice $ s_t $ at the $ t $ th time is represented by $ b (s_t, x_t) $. The probability that $ v_k $ will be observed by throwing the dice $ \ omega_j $ is expressed as follows.

b(\omega_j, v_k) = P(x_t = v_k \ | \ s_t = \omega_j) \ \ \ \ \ (j = 1, 2, \cdots, c) \ \ (k = 1, 2, \cdots, m) \tag{2}

For simplicity, $ a (\ omega_i, \ omega_j) $ is abbreviated as $ a_ {ij} $, $ b (\ omega_j, v_k) $ is abbreviated as $ b_ {jk} $, and $ a_ {ij} $ is abbreviated as *. * Transition probability ** and $ b_ {jk} $ are called ** output probability **. With HMM, you can get information on the ** output eyes ** (= observation results) by throwing the dice, Information on the ** type of dice ** (= state) that output the roll cannot be obtained.

Arrangement of symbols

The symbols for expressing the Markov model are summarized below.

$ n $: Number of observations
$ c $: Number of states
$ m $: Number of output symbols
$ \ omega_i $: $ i $ th state
$ v_k $: $ k $ th output symbol
$ s_t \ in \ {\ omega_1, \ omega_2, \ cdots, \ omega_c \} $: State at time $ t $
$ x_t \ in \ {v_1, v_2, \ cdots, v_m \} $: Observation result at time point $ t $
$ \ boldsymbol s $: State series
$ \ boldsymbol x $: Observation symbol series
$ a_ {ij} $: Transition probability from state $ \ omega_i $ to state $ \ omega_j $
$ b_ {jk} $: Probability of outputting $ v_k $ in state $ \ omega_j $
$ \ rho_i $: Probability that the initial state $ (t = 1) $ is $ \ omega_i $
$ \ boldsymbol A $: A matrix of $ c \ times c $ with $ a_ {ij} $ as the $ (i, j) $ component
$ \ boldsymbol B $: A matrix of $ c \ times m $ with $ b_ {jk} $ as the $ (j, k) $ component
$ \ boldsymbol \ rho = (\ rho_1, \ rho_2, \ cdots, \ rho_c) $: $ c $ dimensional vector with $ \ rho_i $ as a component

Evaluation

When the parameters $ \ boldsymbol A, \ boldsymbol B, \ boldsymbol \ rho $ are known Finding the probability $ P (\ boldsymbol x) $ to obtain the observation result $ \ boldsymbol x $ is called ** evaluation **.

\begin{align}
P(\boldsymbol x) &= \sum_{\boldsymbol s} P(\boldsymbol x, \boldsymbol s) \tag{3} \\
&= \sum_{s_1} \sum_{s_2} \cdots \sum_{s_n} P(\boldsymbol x, s_1s_2 \cdots s_n) \\
P(\boldsymbol x, \boldsymbol s) &= \prod_{t = 1}^n a(s_{t - 1}, s_t)b(s_t, x_t) \tag{4}
\end{align}

If both $ \ boldsymbol x $ and $ \ boldsymbol s $ can be observed at the same time, $ P (\ boldsymbol x, \ boldsymbol s) $ can be easily calculated. On the other hand, if only $ \ boldsymbol x $ can be observed, the complexity of marginalization for all possible states $ \ boldsymbol s $ will be enormous. We will introduce ** forward algorithm ** and ** backward algorithm ** as algorithms for efficient calculation.

forward algorithm

The probability that the observation result $ \ boldsymbol x $ is obtained and the dice $ \ omega_i $ is taken out at the $ t $ th time is as follows.

\begin{align}
& P(\boldsymbol x, s_t = \omega_i) \\
&= P(x_1x_2 \cdots x_t, s_t = \omega_i) \cdot P(x_{t + 1}x_{t + 2} \cdots x_n \ | \ s_t = \omega_i) \\
&= \alpha_t(i) \cdot \beta_t(i) \tag{5}
\end{align}

$ \ Alpha_t (i) $ in the formula $ (5) $ is the probability that $ x_1x_2 \ cdots x_t $ is obtained and the dice $ \ omega_i $ is taken out at the $ t $ th time. $ \ Alpha_t (i) $ can be calculated recursively as follows.

\begin{align}
& \alpha_t(j) = \Biggl[\sum_{i = 1}^c \alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \tag{6} \\
& (t = 2, 3, \cdots, n) \ \ (j = 1, 2, \cdots, c)
\end{align}

However, $ \ alpha_1 (i) $ is as follows.

\begin{align}
\alpha_1(i) &= P(x_1, s_1 = \omega_i) \\
&= \rho_i b(\omega_i, x_1) \ \ \ \ \ (i = 1, 2, \cdots, c) \tag{7}
\end{align}

The following equation is obtained by setting $ t = n $ in the equation $ (6) $.

\alpha_n(i) = P(x_1x_2 \cdots x_n, s_n = \omega_i) \tag{8}

The following equation is obtained by marginalizing the equation $ (8) $ with respect to $ s_n $.

\begin{align}
P(\boldsymbol x) &= P(x_1x_2 \cdots x_n) \\
&= \sum_{i = 1}^c P(x_1x_2 \cdots x_n, s_n = \omega_i) \\
&= \sum_{i = 1}^c \alpha_n(i) \tag{9}
\end{align}

Based on the above, the processing procedure of the forward algorithm is shown below. < forward algorithm >

Initialization \ \ \ \ \alpha_1(i) = \rho_i b(\omega_i, x_1) \ \ \ \ \ (i = 1, 2, \cdots, c)
Recursive calculation \ \ \ \ \alpha_t(j) = \Biggl[\sum_{i = 1}^c \alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t)
Probability calculation \ \ \ \ P(\boldsymbol x) = \sum_{i = 1}^c \alpha_n(i)

backward algorithm

$ \ Beta_t (i) $ in the formula $ (5) $ is subject to the condition that the dice $ \ omega_i $ is taken out at the $ t $ th time. It is the probability that the observation result after that will be $ x_ {t + 1} x_ {t + 2} \ cdots x_n $. $ \ Beta_t (i) $ can be calculated recursively as follows.

\begin{align}
& \beta_t(i) = \sum_{j = 1}^c a_{ij}b(\omega_j, x_{t + 1})\beta_{t + 1}(j) \tag{10} \\
& (t = 1, 2, \cdots, (n - 1)) \ \ (i = 1, 2, \cdots, c)
\end{align}

However, $ \ beta_n (i) $ is as follows.

\beta_n(i) = 1 \ \ \ \ \ (i = 1, 2, \cdots, c) \tag{11}

The following equation is obtained by setting $ t = 1 $ in the equation $ (10) $.

\beta_1(i) = P(x_2x_3 \cdots x_n \ | \ s_1 = \omega_i) \tag{12}

From the formula $ (12) $, $ P (x_1x_2 \ cdots x_n, s_1 = \ omega_i) $ can be expressed as follows.

\begin{align}
& P(x_1x_2 \cdots x_n, s_1 = \omega_i) \\
&= P(s_1 = \omega_i)P(x_1 \ | \ s_1 = \omega_i)P(x_2x_3 \cdots x_n \ | \ s_1 = \omega_i) \\
&= \rho_i b(\omega_i, x_1) \beta_1(i) \tag{13}
\end{align}

The following equation is obtained by marginalizing the equation $ (13) $ with respect to $ s_n $.

\begin{align}
P(\boldsymbol x) &= P(x_1x_2 \cdots x_n) \\
&= \sum_{i = 1}^c P(x_1x_2 \cdots x_n, s_1 = \omega_i) \\
&= \sum_{i = 1}^c \rho_i b(\omega_i, x_1) \beta_1(i) \tag{14}
\end{align}

Based on the above, the processing procedure of the backward algorithm is shown below. < backward algorithm >

Initialization \ \ \ \ \beta_n(i) = 1 \ \ \ \ \ (i = 1, 2, \cdots, c)
Recursive calculation \ \ \ \ \beta_t(i) = \sum_{j = 1}^c a_{ij}b(\omega_j, x_{t + 1})\beta_{t + 1}(j) \ \ \ \ \ (t = 1, 2, \cdots, (n - 1)) \ \ (i = 1, 2, \cdots, c)
Probability calculation \ \ \ \ P(\boldsymbol x) = \sum_{i = 1}^c \rho_i b(\omega_i, x_1) \beta_1(i)

Estimate

When the parameters $ \ boldsymbol A, \ boldsymbol B, \ boldsymbol \ rho $ are unknown Obtaining an unknown parameter from the observation result $ \ boldsymbol x $ is called ** estimation **. In preparation for deriving the solution of the estimation, we define $ \ gamma_t (i) $ and $ \ xi_t (i, j) $ as follows.

\begin{align}
\gamma_t(i) &\overset{\rm{def}}{=} P(s_t = \omega_i \ | \ \boldsymbol x) \tag{15} \\
\xi_t(i) &\overset{\rm{def}}{=} P(s_t = \omega_i, s_{t + 1} = \omega_j \ | \ \boldsymbol x) \tag{16}
\end{align}

In addition, the following relationship holds for $ \ gamma_t (i) $ and $ \ xi_t (i, j) $.

\begin{align}
\sum_{j = 1}^c \xi_t(i, j) &= \sum_{j = 1}^c P(s_t = \omega_i, s_{t + 1} = \omega_j \ | \ \boldsymbol x) \\
&= P(s_t = \omega_i \ | \ \boldsymbol x) \\
&= \gamma_t(i) \tag{17}
\end{align}

Here, $ P (\ boldsymbol x, s_t = \ omega_i, s_ {t + 1} = \ omega_j) $ is expressed as follows.

\begin{align}
&P(\boldsymbol x, s_t = \omega_i, s_{t + 1} = \omega_j) \\
&= P(x_1 \cdots x_t, s_t = \omega_i) \cdot P(s_{t + 1} = \omega_j \ | \ s_t = \omega_i) \cdot P(x_{t + 1} \ | \ s_{t + 1} = \omega_j) \cdot P(x_{t + 2} \cdots x_n \ | \ s_{t + 1} = \omega_j) \\
&= \alpha_t(i)a_{ij}b(\omega_j, x_{t + 1})\beta_{t + 1}(j) \tag{18}
\end{align}

From the above, the following equation can be derived.

\begin{align}
\gamma_t(i) &= P(s_t = \omega_i \ | \ \boldsymbol x) \\
&= P(\boldsymbol x, s_t = \omega_i) \bigl/ P(\boldsymbol x) \\
&= \alpha_t(i)\beta_t(i) \Bigl/ \sum_{i = 1}^c \alpha_n(i) \tag{19} \\
\xi_t(i, j) &= P(s_t = \omega_i, s_{t + 1} = \omega_j \ | \ \boldsymbol x) \\
&= P(\boldsymbol x, s_t = \omega_i, s_{t + 1} = \omega_j) \bigl/ P(\boldsymbol x) \\
&= \alpha_t(i)a_{ij}b(\omega_j, x_{t + 1})\beta_{t + 1}(j) \Bigl/ \sum_{i = 1}^c \alpha_n(i) \tag{20}
\end{align}

Baum-Welch algorithm

Based on the above results, the HMM parameters are estimated using maximum likelihood estimation. The estimation method explained this time is called ** Baum-Welch algorithm **. The parameters $ \ boldsymbol A, \ boldsymbol B, \ boldsymbol \ rho $ are summarized as follows.

\boldsymbol \theta = (\boldsymbol A, \boldsymbol B, \boldsymbol \rho) \tag{21}

Apply the EM algorithm and maximize the following $ Q $ function.

\begin{align}
Q(\boldsymbol \theta^0, \boldsymbol \theta) &= \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \log P(\boldsymbol x, \boldsymbol s; \boldsymbol \theta) \\
&= \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \bigl\{\log P(\boldsymbol s; \boldsymbol \theta) + \log P(\boldsymbol x \ | \ \boldsymbol s; \boldsymbol \theta)\bigr\} \\
&= \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \bigl\{\log P(s_1) + \sum_{t = 1}^{n - 1} \log a(s_t, s_{t + 1}) + \sum_{t = 1}^n \log b(s_t, x_t) \bigr\} \tag{22}
\end{align}

Here, it is defined as follows.

\begin{align}
Q(\boldsymbol \theta^0, \boldsymbol \rho) &\overset{\rm{def}}{=} \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \log P(s_1) \tag{23} \\
Q(\boldsymbol \theta^0, \boldsymbol A) &\overset{\rm{def}}{=} \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \sum_{t = 1}^{n - 1} \log a(s_t, s_{t + 1}) \tag{24} \\
Q(\boldsymbol \theta^0, \boldsymbol B) &\overset{\rm{def}}{=} \sum_{\boldsymbol s} P(\boldsymbol s \ | \ \boldsymbol x; \boldsymbol \theta^0) \sum_{t = 1}^n \log b(s_t, x_t) \tag{25}
\end{align}

Maximize the expressions $ (23), (24), (25) $ for $ \ boldsymbol \ rho, \ boldsymbol A, \ boldsymbol B $, respectively. The estimated parameters are as follows. (Derivation is omitted)

\begin{align}
\hat{a}_{ij} &= \sum_{t = 1}^{n - 1} \xi_t(i, j) \Bigl/ \sum_{t = 1}^{n - 1} \gamma_t(i) \tag{26} \\
\hat{b}_{jk} &= \sum_{t = 1}^n \delta(x_t, v_k) \gamma_t(i) \Bigl/ \sum_{t = 1}^n \gamma_t(i) \tag{27} \\
\hat{\rho}_i &= \gamma_1(i) \tag{28}
\end{align}

The processing procedure of the Baum-Welch algorithm is shown below. < Baum-Welch algorithm >

Initialization Set appropriate initial values for the $ \ \ \ $ parameters $ a_ {ij}, b_ {jk}, \ rho_i $
Recursive calculation Calculate $ \ hat a_ {ij}, \ hat b_ {jk}, \ hat \ rho_i $ from the $ \ \ \ $ formula $ (26), (27), (28) $ $ \ \ \ $ However, $ \ alpha_t (i), \ beta_t (i) $ for calculating $ \ gamma_t (i), \ xi_t (i, j) $ are calculated by forward alogorithm and backward alogorithm.
Parameter update Update parameters with $ \ \ \ \ a_ {ij} \ leftarrow \ hat a_ {ij}, b_ {jk} \ leftarrow \ hat b_ {jk}, \ rho_i \ leftarrow \ hat \ rho_i $
End judgment $ \ \ \ $ If the log-likelihood increment is less than or equal to the preset threshold, it is considered to have converged and the update ends.

scaling

The Baum-Welch algorithm uses the forward algorithm and backward algorithm to calculate $ \ alpha_t (i) and \ beta_t (i) $. If the series becomes long, underflow will occur and it will not be possible to calculate $ \ alpha_t (i) and \ beta_t (i) $. We will introduce ** scaling ** as a countermeasure against underflow. Introduce new $ \ hat \ alpha_t (i), c_t $.

\begin{align}
& \hat \alpha_t(i) \overset{\rm{def}}{=} P(s_t = \omega_i \ | \ x_1 \cdots x_t) = \frac{P(x_1 \cdots x_t, s_t = \omega_i)}{P(x1 \cdots x_t)} \tag{29} \\
& c_t = P(x_t \ | \ x_1 \cdots x_{t - 1}) \tag{30}
\end{align}

Here, $ P (x_1 \ cdots x_t) $ is expressed as follows.

\begin{align}
P(x_1 \cdots x_t) &= P(x_1) \cdot P(x_2 \ | \ x_1) \cdot \cdots \cdot P(x_{t - 1} \ | \ x_1 \cdots x_{t - 2}) \cdot P(x_t \ | \ x_1 \cdots x_{t - 1}) \\
&= c_1 \cdot c_2 \cdot \cdots \cdot c_{t - 1} \cdot c_t \\
&= \prod_{m = 1}^t c_m \tag{31}
\end{align}

From this, the following relationship holds.

\begin{align}
\hat \alpha_t(i) &= \frac{\alpha_t(i)}{P(x_1 \cdots x_t)} \\
&= \frac{\alpha_t(i)}{\prod_{m = 1}^t c_m} \\
&= \frac{\alpha_t(i)}{c_t \cdot \prod_{m = 1}^{t - 1} c_m} \\
&= \frac{\alpha_t(i)}{c_t \cdot P(x_1 \cdots x_{t - 1})} \tag{32}
\end{align}

The expression $ (6) $ can be rewritten as follows.

\begin{align}
\alpha_t(j) &= \Biggl[\sum_{i = 1}^c \alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \\
&= \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) P(x_1 \cdots x_{t - 1}) a_{ij} \Biggr]b(\omega_j, x_t) \\
\frac{\alpha_t(j)}{P(x_1 \cdots x_{t - 1})} &= \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \\
c_t \hat\alpha_t(j)&= \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \tag{33}
\end{align}

From the formula $ (33) $, $ c_t $ can be calculated as follows.

\begin{align}
\sum_{j = 1}^c c_t \hat\alpha_t(j) &= \sum_{j = 1}^c \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \\
c_t &= \sum_{j = 1}^c \Biggl[\sum_{i = 1}^c \hat\alpha_{t - 1}(i) a_{ij} \Biggr]b(\omega_j, x_t) \tag{34}
\end{align}

Also, $ \ hat \ beta_t (i) $ is expressed as follows using $ c_t $ obtained by the forward algorithm.

c_{t + 1} \hat\beta_t(i) = \sum_{j = 1}^c a_{ij}b(\omega_j, x_{t + 1})\hat\beta_{t + 1}(j) \tag{35}

With $ \ hat \ alpha_t (i), \ hat \ beta_t (i) $, $ \ gamma_t (i), \ xi_t (i, j) $ can be rewritten as follows.

\begin{align}
\gamma_t(i) &= \hat\alpha_t(i)\hat\beta_t(i) \tag{36} \\
\xi_t(i) &= \frac{\hat\alpha_t(i)a_{ij}b(\omega_j, x_{t + 1})\hat\beta_{t + 1}(j)}{c_{t + 1}} \tag{37}
\end{align}

The parameters are updated by substituting the equations $ (36), (37) $ into the equations $ (26), (27), (28) $.

Implementation

Implemented parameter estimation using Baum-Welch algorithm in python. The implementation code can be found here. Based on dice throwing, the number of states is $ 3 $ species ($ c = 3 $) of $ \ omega_1, \ omega_2, \ omega_3 $, The output symbol is the $ 2 $ species ($ m = 2 $) of $ v_1 and v_2 $. This is equivalent to selecting and throwing dice of different $ 3 $ types and observing odd ($ v_1 ) and even ( v_2 $) rolls.

result

The true parameters are shown below.

\boldsymbol A = \begin{pmatrix}
0.1 & 0.7 & 0.2 \\
0.2 & 0.1 & 0.7 \\
0.7 & 0.2 & 0.1
\end{pmatrix}, \ \ \  
\boldsymbol B = \begin{pmatrix}
0.9 & 0.1 \\
0.6 & 0.4 \\
0.1 & 0.9
\end{pmatrix},  \ \ \ 
\boldsymbol \rho = \begin{pmatrix}
1.0 & 0.0 & 0.0
\end{pmatrix}

The initial values for estimation were set as follows.

\boldsymbol A = \begin{pmatrix}
0.15 & 0.60 & 0.25 \\
0.25 & 0.15 & 0.60 \\
0.60 & 0.25 & 0.15
\end{pmatrix}, \ \ \  
\boldsymbol B = \begin{pmatrix}
0.5 & 0.5 \\
0.5 & 0.5 \\
0.5 & 0.5
\end{pmatrix},  \ \ \ 
\boldsymbol \rho = \begin{pmatrix}
1.0 & 0.0 & 0.0
\end{pmatrix}

The estimation result is as follows.

\boldsymbol A = \begin{pmatrix}
0.11918790 & 0.63863200 & 0.24218010 \\
0.13359797 & 0.07275209 & 0.79364994 \\
0.72917405 & 0.18715812 & 0.08366783
\end{pmatrix}, \ \ \  
\boldsymbol B = \begin{pmatrix}
0.92503797 & 0.07496203 \\
0.56298958 & 0.43701042 \\
0.14059588 & 0.85940412
\end{pmatrix}

The process of increasing the log-likelihood is as follows.

Finally, the convergence process of $ b_ {11}, b_ {21}, b_ {31} $ is shown below. The thin black line represents the true value.

in conclusion

I was able to implement HMM parameter estimation in python. If you have any mistakes, please point them out.

HMM parameter estimation implementation in python