select_max_ucb_child()
A method that selects the child node maximizing the UCB value, the formula introduced in Chapter 4. The UCB value combines two terms, Q(St, a) and U(St, a), where St is the state of the current node (that is, the board position) and a is a candidate move.
Q(St, a) is the expected-value term: the action value of move a in state St. In the book, it is computed as the sum of the win rates of child node a divided by the number of visits to child node a.
U(St, a) is the bonus term. It gives preference to moves that have been searched fewer times. It also uses the move probability P(s, a) obtained from the policy network, so that promising moves are searched preferentially.
Cpuct: a constant that adjusts the weight of the bonus term.
P(s, a): the move probability predicted by the policy network.
N(s, a): the number of visits to action a in state s. In the book, 1 is added to it, presumably to keep the denominator from becoming 0 when the number of visits is 0.
√ΣN(s, b): the square root of the total number of visits over all actions b in state s.
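Putting the terms described here together, the UCB value can be written out as below. This is a reconstruction from the term descriptions, not a copy of the book's figure, and it follows the usual PUCT form. The symbol W(s, a) for the summed win rates of child a is a name introduced here for illustration only.

```latex
\begin{align}
a_t &= \arg\max_a \bigl( Q(s_t, a) + U(s_t, a) \bigr) \\
Q(s, a) &= \frac{W(s, a)}{N(s, a)} \\
U(s, a) &= c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}
\end{align}
```

Because N(s, a) sits in the denominator of U, the bonus shrinks as a move is visited more often, while a large P(s, a) keeps the bonus high for moves the policy network considers promising.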
Illustration of what select_max_ucb_child() actually does
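To make the selection concrete, here is a minimal sketch of the calculation in plain Python. It is not the book's implementation: the argument names (`child_win`, `child_visits`, `priors`), the value `C_PUCT = 1.0`, and the choice of 0.0 as Q for an unvisited child are all assumptions made for illustration.

```python
import math

C_PUCT = 1.0  # assumed value; the actual constant may differ


def select_max_ucb_child(child_win, child_visits, priors, c_puct=C_PUCT):
    """Return the index of the child with the largest Q + U.

    child_win    -- summed win rates per child (hypothetical name)
    child_visits -- visit counts N(s, a) per child
    priors       -- policy-network move probabilities P(s, a)
    """
    # sqrt of the total visit count over all children: sqrt(sum_b N(s, b))
    sqrt_total = math.sqrt(sum(child_visits))

    best_index, best_ucb = 0, -math.inf
    for i, (win, n, p) in enumerate(zip(child_win, child_visits, priors)):
        # Q: mean win rate; 0.0 for an unvisited child is an assumption
        q = win / n if n > 0 else 0.0
        # U: bonus term; the +1 keeps the denominator nonzero when n == 0
        u = c_puct * p * sqrt_total / (1 + n)
        if q + u > best_ucb:
            best_index, best_ucb = i, q + u
    return best_index


# Three candidate moves; the third has not been visited yet.
chosen = select_max_ucb_child(
    child_win=[6.0, 1.5, 0.0],
    child_visits=[10, 3, 0],
    priors=[0.5, 0.2, 0.3],
)
print(chosen)  # 2
```

In this example the unvisited third move is chosen: its Q is 0, but with no visits its bonus term (about 1.08) outweighs the totals of the other two moves (about 0.76 and 0.68). Once it has accumulated visits, the bonus shrinks and the move with the best win rate tends to be selected again.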