I usually work on reinforcement learning and inverse reinforcement learning. Recently I learned about `numpy.einsum`, found it interesting, and tried it out in an implementation, so I am writing it up as material for the Advent calendar.
As prerequisites, I will briefly outline the Markov decision process and the value iteration method.
A Markov decision process (MDP) is a Markov process (a process with the Markov property, where the next state depends only on the current state) extended with actions, i.e., decision-making. There are many detailed explanations online, so I will omit the details here.
Value Iteration is a type of algorithm known as dynamic programming. The value $V(s)$ of each state $s$ is found by the following procedure (a minimal code sketch follows the list).
- Initialize $V$, $Q$
- Repeat the following updates for all $s$, $a$ until convergence:
  - $Q(s, a) \leftarrow R(s) + \gamma \sum_{s^{\prime}} P(s^{\prime} \mid s, a) V(s^{\prime})$
  - $V(s) \leftarrow \max_{a} Q(s, a)$
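As a minimal sketch of this procedure (for illustration only; the function name `value_iteration` and the convergence threshold `eps` are my own choices here, not taken from the benchmark script below), the updates can be written with plain loops:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    """P: (S, A, S') transition probabilities, R: (S,) rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    while True:
        V_old = V.copy()
        for s in range(n_states):
            for a in range(n_actions):
                # Q(s, a) = R(s) + gamma * sum_{s'} P(s'|s, a) V(s')
                Q[s, a] = R[s] + gamma * np.dot(P[s, a, :], V_old)
            V[s] = np.max(Q[s, :])
        if np.max(np.abs(V - V_old)) < eps:  # stop once the value function stabilizes
            return V, Q
```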
`numpy.einsum` makes it easy to implement matrix and tensor calculations using Einstein notation. In this case, the $\sum_{s^{\prime}} P(s^{\prime} \mid s, a) V(s^{\prime})$ term on the right-hand side of the update above can be written as follows:

```python
np.einsum("ijk,k->ij", P, V)
```
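Here `i` corresponds to $s$, `j` to $a$, and `k` to $s^{\prime}$, so the result has shape (number of states, number of actions). As a quick sanity check (a standalone sketch with randomly generated `P` and `V`, not part of the benchmark), the einsum expression agrees with an explicit loop and with `np.tensordot`:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((6, 4, 6))   # (S, A, S'), arbitrary values just for this check
V = rng.random(6)           # (S,)

ev_einsum = np.einsum("ijk,k->ij", P, V)

# the same contraction as an explicit loop over states and actions
ev_loop = np.array([[np.dot(P[s, a, :], V) for a in range(4)] for s in range(6)])

# and as a tensordot over the last axis of P and the only axis of V
ev_tensordot = np.tensordot(P, V, axes=([2], [0]))

assert np.allclose(ev_einsum, ev_loop)
assert np.allclose(ev_einsum, ev_tensordot)
```

Because einsum performs the whole contraction in one vectorized call, the per-state Python loop disappears, which is where the speedup measured below comes from.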
I experimented with Gridworld, a basic benchmark problem used in introductory reinforcement learning. I compared how the computation time of the einsum implementation and of an ordinary loop implementation changes as the length of one side of the grid is increased.
vi.py
```python
# -*- coding: utf-8 -*-
import numpy as np
from makeP import makeProb
import time

for statenum in range(2, 101):
    list_einsum = []
    list_loop = []
    for ite in range(100):
        print(".", end="", flush=True)
        x_size = statenum
        y_size = statenum
        n_states = int(x_size * y_size)
        P = makeProb(x_size, y_size)   # (S, A, S') transition probabilities
        R = np.ones(n_states) * -0.01  # small step cost everywhere
        R[int(n_states - 1)] = 1       # reward of 1 in the last (goal) state
        Q = np.zeros((n_states, 4))
        V = np.zeros(n_states)
        delta = np.inf
        eps = 0.0
        gamma = 0.95

        # value iteration with einsum
        t1 = time.time()
        for _ in range(10000):
            Q = R[:, None] + gamma * np.einsum("ijk,k->ij", P, V)
            V = np.max(Q, axis=1)
        t2 = time.time()
        list_einsum.append(t2 - t1)

        # value iteration with plain loops
        Q = np.zeros((n_states, 4))
        V = np.zeros(n_states)
        t1 = time.time()
        for _ in range(10000):
            for s in range(n_states):
                for a in range(4):
                    Q[s][a] = R[s] + gamma * np.dot(P[s, a, :], V)
                V[s] = np.max(Q[s, :])
        t2 = time.time()
        list_loop.append(t2 - t1)
    print("")
    ar_einsum = np.array(list_einsum)
    ar_loop = np.array(list_loop)
    print("EINSUM: ", "size: ", statenum, "mean: ", np.mean(ar_einsum), "median: ", np.median(ar_einsum), "stdev: ", np.std(ar_einsum))
    print("LOOP : ", "size: ", statenum, "mean: ", np.mean(ar_loop), "median: ", np.median(ar_loop), "stdev: ", np.std(ar_loop))
```
makeP.py
```python
# -*- coding: utf-8 -*-
import numpy as np

def makeProb(x_size, y_size, n_action=4):
    # make transition prob
    # actions: 0 = up, 1 = down, 2 = left, 3 = right
    n_states = x_size * y_size
    # S, A, S'
    P = np.zeros((n_states, n_action, n_states))
    for s in range(n_states):
        # left is wall
        if s % x_size == 0:
            if s == 0:
                P[0][1][x_size] = 1
                P[0][3][1] = 1
            elif s > (x_size * (y_size - 1) - 1):
                P[s][0][s - x_size] = 1
                P[s][3][s + 1] = 1
            else:
                P[s][0][s - x_size] = 1
                P[s][1][s + x_size] = 1
                P[s][3][s + 1] = 1
        # right is wall
        elif (s + 1) % x_size == 0:
            if s == (x_size * y_size - 1):
                P[s][0][s - x_size] = 1
                P[s][2][s - 1] = 1
            elif s < x_size:
                P[s][1][s + x_size] = 1
                P[s][2][s - 1] = 1
            else:
                P[s][0][s - x_size] = 1
                P[s][1][s + x_size] = 1
                P[s][2][s - 1] = 1
        # upper is wall
        elif s < x_size and s % x_size != 0 and (s + 1) % x_size != 0:
            P[s][1][s + x_size] = 1
            P[s][2][s - 1] = 1
            P[s][3][s + 1] = 1
        # lower is wall
        elif s > (x_size * (y_size - 1) - 1) and s % x_size != 0 and (s + 1) % x_size != 0:
            P[s][0][s - x_size] = 1
            P[s][2][s - 1] = 1
            P[s][3][s + 1] = 1
        else:
            P[s][0][s - x_size] = 1
            P[s][1][s + x_size] = 1
            P[s][2][s - 1] = 1
            P[s][3][s + 1] = 1
    return P
```
An overwhelming victory for einsum.
What makes me happy about this is that, for example, the MDP that has to be solved in the inner loop of inverse reinforcement learning can now be solved quickly.