This post reviews Q-learning, one of the fundamental methods of reinforcement learning.
Q-learning is a form of TD (temporal difference) learning, one family of reinforcement learning methods, in which the Q value (state-action value) is updated every time the agent acts. Let $s_t$ be the state at time $t$, $a_t$ the action, and $r_t$ the reward obtained by taking action $a_t$ in state $s_t$. The Q value $Q(s_t, a_t)$ is the value of taking action $a_t$ in state $s_t$, and it is updated after each action by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right),$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
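In code, this update is a single assignment on a Q table. The following is a minimal sketch; the helper name, its arguments, and the default values of $\alpha$ and $\gamma$ are illustrative assumptions of mine, not something from the references.

import numpy as np

# Minimal sketch of one tabular Q-learning step (hypothetical helper for
# illustration, not code from the references).
# Q is an (n_states, n_actions) array; s, a, r, s_next follow the notation above.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * np.max(Q[s_next])  # r_t + gamma * max_a' Q(s_{t+1}, a')
    Q[s, a] += alpha * (td_target - Q[s, a])   # move Q(s_t, a_t) toward the target
    return Q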
As a concrete example, consider the two-armed bandit problem. Suppose there is a pile A from which an integer between 0 and 2000 is drawn uniformly at random, and a pile B from which an integer between 0 and 1000 is drawn uniformly at random. Each draw takes a number from one of the piles, and we build a model that learns to select the pile likely to yield the larger number (reward). Let $Q_t(A)$ and $Q_t(B)$ be the values of pile A and pile B at update count $t$ (note that the update count $t$ is distinct from the time $t$ in the theory above), with both initial values set to 0. At each update, a pile is selected with a probability generated from the values, and the value of the selected pile is updated with the number that was drawn. Writing the number drawn from the selected pile as $r_t$, the selected pile (say A) is updated by

$$Q_{t+1}(A) = Q_t(A) + \alpha \left( r_t - Q_t(A) \right),$$

while the value of the other pile is carried over unchanged. Since there are no state transitions here, the $\gamma \max_a Q$ term drops out. The selection probabilities $P_t(A), P_t(B)$ are given by the softmax (Boltzmann) rule

$$P_t(A) = \frac{e^{\beta Q_t(A)}}{e^{\beta Q_t(A)} + e^{\beta Q_t(B)}}, \qquad P_t(B) = \frac{e^{\beta Q_t(B)}}{e^{\beta Q_t(A)} + e^{\beta Q_t(B)}}.$$
import numpy as np
import matplotlib.pyplot as plt
# Fix the random seed for reproducibility
np.random.seed(71)
# Total number of draws
N = 100
# Pile A: integers 0 to 2000
pileA = np.arange(2001)
# Pile B: integers 0 to 1000
pileB = np.arange(1001)
# Parameters: learning rate alpha, softmax inverse temperature beta
alpha = 0.05
beta = 0.004
"""
variable
Q_A: (Every hour)Value of mountain A
Q_A: (Every hour)Value of mountain B
P_A: (Every hour)Probability of drawing mountain A
P_B: (Every hour)Probability of drawing mountain B
select_pile: (Every hour)Selected mountain
"""
Q_A = [0]
Q_B = [0]
P_A = []
P_B = []
select_pile = []
for i in range(N):
    # Softmax (Boltzmann) selection probabilities from the current values
    P_A.append(np.exp(beta * Q_A[i]) / (np.exp(beta * Q_A[i]) + np.exp(beta * Q_B[i])))
    P_B.append(np.exp(beta * Q_B[i]) / (np.exp(beta * Q_A[i]) + np.exp(beta * Q_B[i])))
    if P_A[i] >= np.random.rand():
        # Pile A selected: update Q_A with the drawn number, carry Q_B over
        select_pile.append(1)
        Q_A.append(Q_A[i] + alpha * (np.random.choice(pileA, 1)[0] - Q_A[i]))
        Q_B.append(Q_B[i])
    else:
        # Pile B selected: update Q_B with the drawn number, carry Q_A over
        select_pile.append(0)
        Q_A.append(Q_A[i])
        Q_B.append(Q_B[i] + alpha * (np.random.choice(pileB, 1)[0] - Q_B[i]))
# Time steps
t = np.arange(N)
plt.scatter(t, Q_A[1:], c="blue", marker="o", label="A")
plt.scatter(t, 1000 * np.ones(N), c="black", marker=".", s=3)  # expected value of pile A
plt.scatter(t, Q_B[1:], c="red", marker="o", label="B")
plt.scatter(t, 500 * np.ones(N), c="black", marker=".", s=3)   # expected value of pile B
plt.title("Q")
plt.legend()
plt.savefig("value.png")
plt.show()
plt.scatter(t, select_pile, c="green", marker="s", s=10)  # 1 = pile A selected, 0 = pile B
plt.plot(t, P_A, color="blue", marker="o", label="A")
plt.plot(t, P_B, color="red", marker="o", label="B")
plt.title("prob")
plt.legend()
plt.savefig("prob.png")
plt.show()
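Running this, $Q_t(A)$ and $Q_t(B)$ should climb toward 1000 and 500 respectively, the expected values of the uniform draws (the black dotted reference lines in the first plot), and as $Q_t(A)$ pulls ahead, the softmax probability $P_t(A)$ should gradually rise above $P_t(B)$ in the second plot.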
I have not included an example of a system with actual time evolution (state transitions) here, because the code gets long. If you search for terms like "cart pole q learning", many articles come up, so please take a look at those if you are interested.
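Just to convey the shape of such code, here is a minimal sketch of tabular Q-learning on a toy chain environment. The environment itself and all names and parameter values are my own assumptions for illustration, not taken from the references.

import numpy as np

np.random.seed(71)

# Hypothetical toy environment: a chain of 6 states; action 0 moves left,
# action 1 moves right; reaching the rightmost state yields reward 1 and
# ends the episode.
N_STATES, N_ACTIONS = 6, 2
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        if np.random.rand() < epsilon:
            a = np.random.randint(N_ACTIONS)  # explore
        else:
            a = np.random.choice(np.flatnonzero(Q[s] == Q[s].max()))  # greedy, ties broken at random
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: move Q[s, a] toward r + gamma * max_a' Q[s_next, a']
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)  # Q[:, 1] (move right) should end up larger than Q[:, 0] in every state

Unlike the bandit above, the $\gamma \max_a Q(s_{t+1}, a)$ term is active here and propagates the terminal reward backwards through the chain.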
Theory: Sutton, Richard S., and Andrew G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.
Problem taken from: Mathematical Model for Data Analysis.