This post reviews Q-learning, one of the fundamental methods of reinforcement learning.
Q-learning is a form of TD (temporal difference) learning, one family of reinforcement learning methods, in which the Q value (state-action value) is updated every time the agent acts. Let $s_t$ be the state at time $t$, $a_t$ the action, and $r_t$ the reward obtained by taking action $a_t$ in state $s_t$. The Q value $Q(s_t, a_t)$ is the value of taking action $a_t$ in state $s_t$, and it is updated after each action by

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right),$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
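In code, this update is a single assignment on a Q table. The following is a minimal sketch; the helper name, its arguments, and the default values of $\alpha$ and $\gamma$ are illustrative assumptions of mine, not something from the references.

import numpy as np

# Minimal sketch of one tabular Q-learning step (hypothetical helper for
# illustration, not code from the references).
# Q is an (n_states, n_actions) array; s, a, r, s_next follow the notation above.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * np.max(Q[s_next])  # r_t + gamma * max_a' Q(s_{t+1}, a')
    Q[s, a] += alpha * (td_target - Q[s, a])   # move Q(s_t, a_t) toward the target
    return Q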
As a concrete example, consider the two-armed bandit problem. Suppose there is a pile A from which an integer between 0 and 2000 is drawn uniformly at random, and a pile B from which an integer between 0 and 1000 is drawn uniformly at random. Each draw takes a number from one of the piles, and we build a model that learns to select the pile likely to yield the larger number (reward). Let $Q_t(A)$ and $Q_t(B)$ be the values of pile A and pile B at update count $t$ (note that the update count $t$ is distinct from the time $t$ in the theory above), with both initial values set to 0. At each update, a pile is selected with a probability generated from the values, and the value of the selected pile is updated with the number that was drawn. Writing the number drawn from the selected pile as $r_t$, the selected pile (say A) is updated by

$$Q_{t+1}(A) = Q_t(A) + \alpha \left( r_t - Q_t(A) \right),$$

while the value of the other pile is carried over unchanged. Since there are no state transitions here, the $\gamma \max_a Q$ term drops out. The selection probabilities $P_t(A), P_t(B)$ are given by the softmax (Boltzmann) rule

$$P_t(A) = \frac{e^{\beta Q_t(A)}}{e^{\beta Q_t(A)} + e^{\beta Q_t(B)}}, \qquad P_t(B) = \frac{e^{\beta Q_t(B)}}{e^{\beta Q_t(A)} + e^{\beta Q_t(B)}}.$$
import numpy as np
import matplotlib.pyplot as plt
# Fix the random seed for reproducibility
np.random.seed(71)
# Total number of draws
N = 100
# Pile A: integers 0 to 2000
pileA = np.arange(2001)
# Pile B: integers 0 to 1000
pileB = np.arange(1001)
# Parameters: learning rate alpha, softmax inverse temperature beta
alpha = 0.05
beta = 0.004
"""
variable
Q_A: (Every hour)Value of mountain A
Q_A: (Every hour)Value of mountain B
P_A: (Every hour)Probability of drawing mountain A
P_B: (Every hour)Probability of drawing mountain B
select_pile: (Every hour)Selected mountain
"""
Q_A = [0]
Q_B = [0]
P_A = []
P_B = []
select_pile = []
for i in range(N):
    # Softmax (Boltzmann) selection probabilities from the current values
    P_A.append(np.exp(beta * Q_A[i]) / (np.exp(beta * Q_A[i]) + np.exp(beta * Q_B[i])))
    P_B.append(np.exp(beta * Q_B[i]) / (np.exp(beta * Q_A[i]) + np.exp(beta * Q_B[i])))
    if P_A[i] >= np.random.rand():
        # Pile A selected: update Q_A with the drawn number, carry Q_B over
        select_pile.append(1)
        Q_A.append(Q_A[i] + alpha * (np.random.choice(pileA, 1)[0] - Q_A[i]))
        Q_B.append(Q_B[i])
    else:
        # Pile B selected: update Q_B with the drawn number, carry Q_A over
        select_pile.append(0)
        Q_A.append(Q_A[i])
        Q_B.append(Q_B[i] + alpha * (np.random.choice(pileB, 1)[0] - Q_B[i]))
# Time steps
t = np.arange(N)
plt.scatter(t, Q_A[1:], c="blue", marker="o", label="A")
plt.scatter(t, 1000 * np.ones(N), c="black", marker=".", s=3)  # expected value of pile A
plt.scatter(t, Q_B[1:], c="red", marker="o", label="B")
plt.scatter(t, 500 * np.ones(N), c="black", marker=".", s=3)   # expected value of pile B
plt.title("Q")
plt.legend()
plt.savefig("value.png")
plt.show()
plt.scatter(t, select_pile, c="green", marker="s", s=10)  # 1 = pile A selected, 0 = pile B
plt.plot(t, P_A, color="blue", marker="o", label="A")
plt.plot(t, P_B, color="red", marker="o", label="B")
plt.title("prob")
plt.legend()
plt.savefig("prob.png")
plt.show()
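Running this, $Q_t(A)$ and $Q_t(B)$ should climb toward 1000 and 500 respectively, the expected values of the uniform draws (the black dotted reference lines in the first plot), and as $Q_t(A)$ pulls ahead, the softmax probability $P_t(A)$ should gradually rise above $P_t(B)$ in the second plot.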
I have not included an example of a system with actual time evolution (state transitions) here, because the code gets long. If you search for terms like "cart pole q learning", many articles come up, so please take a look at those if you are interested.
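Just to convey the shape of such code, here is a minimal sketch of tabular Q-learning on a toy chain environment. The environment itself and all names and parameter values are my own assumptions for illustration, not taken from the references.

import numpy as np

np.random.seed(71)

# Hypothetical toy environment: a chain of 6 states; action 0 moves left,
# action 1 moves right; reaching the rightmost state yields reward 1 and
# ends the episode.
N_STATES, N_ACTIONS = 6, 2
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        if np.random.rand() < epsilon:
            a = np.random.randint(N_ACTIONS)  # explore
        else:
            a = np.random.choice(np.flatnonzero(Q[s] == Q[s].max()))  # greedy, ties broken at random
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: move Q[s, a] toward r + gamma * max_a' Q[s_next, a']
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)  # Q[:, 1] (move right) should end up larger than Q[:, 0] in every state

Unlike the bandit above, the $\gamma \max_a Q(s_{t+1}, a)$ term is active here and propagates the terminal reward backwards through the chain.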
Theory: Sutton, Richard S., and Andrew G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.
Problem taken from: Mathematical Model for Data Analysis.