I usually work on reinforcement learning and inverse reinforcement learning. Recently I learned about `numpy.einsum`, found it interesting, and tried it out in an implementation, so I am writing it up as material for the Advent calendar.
As prerequisites, I will briefly outline the Markov decision process and the value iteration method.
A Markov decision process (MDP) is a Markov process (a process with the Markov property, where the next state depends only on the current state) extended with actions, i.e., decision-making. There are many detailed explanations online, so I will omit the details here.
Value Iteration is a type of algorithm known as dynamic programming. The value $V(s)$ of each state $s$ is found by the following procedure (a minimal code sketch follows the list).
- Initialize $V$, $Q$
- Repeat the following updates for all $s$, $a$ until convergence:
  - $Q(s, a) \leftarrow R(s) + \gamma \sum_{s^{\prime}} P(s^{\prime} \mid s, a) V(s^{\prime})$
  - $V(s) \leftarrow \max_{a} Q(s, a)$
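As a minimal sketch of this procedure (for illustration only; the function name `value_iteration` and the convergence threshold `eps` are my own choices here, not taken from the benchmark script below), the updates can be written with plain loops:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    """P: (S, A, S') transition probabilities, R: (S,) rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    while True:
        V_old = V.copy()
        for s in range(n_states):
            for a in range(n_actions):
                # Q(s, a) = R(s) + gamma * sum_{s'} P(s'|s, a) V(s')
                Q[s, a] = R[s] + gamma * np.dot(P[s, a, :], V_old)
            V[s] = np.max(Q[s, :])
        if np.max(np.abs(V - V_old)) < eps:  # stop once the value function stabilizes
            return V, Q
```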
`numpy.einsum` makes it easy to implement matrix and tensor calculations using Einstein notation. In this case, the $\sum_{s^{\prime}} P(s^{\prime} \mid s, a) V(s^{\prime})$ term on the right-hand side of the update above can be written as follows:

```python
np.einsum("ijk,k->ij", P, V)
```
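Here `i` corresponds to $s$, `j` to $a$, and `k` to $s^{\prime}$, so the result has shape (number of states, number of actions). As a quick sanity check (a standalone sketch with randomly generated `P` and `V`, not part of the benchmark), the einsum expression agrees with an explicit loop and with `np.tensordot`:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((6, 4, 6))   # (S, A, S'), arbitrary values just for this check
V = rng.random(6)           # (S,)

ev_einsum = np.einsum("ijk,k->ij", P, V)

# the same contraction as an explicit loop over states and actions
ev_loop = np.array([[np.dot(P[s, a, :], V) for a in range(4)] for s in range(6)])

# and as a tensordot over the last axis of P and the only axis of V
ev_tensordot = np.tensordot(P, V, axes=([2], [0]))

assert np.allclose(ev_einsum, ev_loop)
assert np.allclose(ev_einsum, ev_tensordot)
```

Because einsum performs the whole contraction in one vectorized call, the per-state Python loop disappears, which is where the speedup measured below comes from.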
I experimented with Gridworld, a basic benchmark problem used in introductory reinforcement learning. I compared how the computation time of the einsum implementation and of an ordinary loop implementation changes as the length of one side of the grid is increased.
vi.py
```python
# -*- coding: utf-8 -*-
import numpy as np
from makeP import makeProb
import time

for statenum in range(2, 101):
    list_einsum = []
    list_loop = []
    for ite in range(100):
        print(".", end="", flush=True)
        x_size = statenum
        y_size = statenum
        n_states = int(x_size * y_size)
        P = makeProb(x_size, y_size)   # (S, A, S') transition probabilities
        R = np.ones(n_states) * -0.01  # small step cost everywhere
        R[int(n_states - 1)] = 1       # reward of 1 in the last (goal) state
        Q = np.zeros((n_states, 4))
        V = np.zeros(n_states)
        delta = np.inf
        eps = 0.0
        gamma = 0.95

        # value iteration with einsum
        t1 = time.time()
        for _ in range(10000):
            Q = R[:, None] + gamma * np.einsum("ijk,k->ij", P, V)
            V = np.max(Q, axis=1)
        t2 = time.time()
        list_einsum.append(t2 - t1)

        # value iteration with plain loops
        Q = np.zeros((n_states, 4))
        V = np.zeros(n_states)
        t1 = time.time()
        for _ in range(10000):
            for s in range(n_states):
                for a in range(4):
                    Q[s][a] = R[s] + gamma * np.dot(P[s, a, :], V)
                V[s] = np.max(Q[s, :])
        t2 = time.time()
        list_loop.append(t2 - t1)
    print("")
    ar_einsum = np.array(list_einsum)
    ar_loop = np.array(list_loop)
    print("EINSUM: ", "size: ", statenum, "mean: ", np.mean(ar_einsum), "median: ", np.median(ar_einsum), "stdev: ", np.std(ar_einsum))
    print("LOOP : ", "size: ", statenum, "mean: ", np.mean(ar_loop), "median: ", np.median(ar_loop), "stdev: ", np.std(ar_loop))
```
makeP.py
```python
# -*- coding: utf-8 -*-
import numpy as np

def makeProb(x_size, y_size, n_action=4):
    # make transition prob
    # actions: 0 = up, 1 = down, 2 = left, 3 = right
    n_states = x_size * y_size
    # S, A, S'
    P = np.zeros((n_states, n_action, n_states))
    for s in range(n_states):
        # left is wall
        if s % x_size == 0:
            if s == 0:
                P[0][1][x_size] = 1
                P[0][3][1] = 1
            elif s > (x_size * (y_size - 1) - 1):
                P[s][0][s - x_size] = 1
                P[s][3][s + 1] = 1
            else:
                P[s][0][s - x_size] = 1
                P[s][1][s + x_size] = 1
                P[s][3][s + 1] = 1
        # right is wall
        elif (s + 1) % x_size == 0:
            if s == (x_size * y_size - 1):
                P[s][0][s - x_size] = 1
                P[s][2][s - 1] = 1
            elif s < x_size:
                P[s][1][s + x_size] = 1
                P[s][2][s - 1] = 1
            else:
                P[s][0][s - x_size] = 1
                P[s][1][s + x_size] = 1
                P[s][2][s - 1] = 1
        # upper is wall
        elif s < x_size and s % x_size != 0 and (s + 1) % x_size != 0:
            P[s][1][s + x_size] = 1
            P[s][2][s - 1] = 1
            P[s][3][s + 1] = 1
        # lower is wall
        elif s > (x_size * (y_size - 1) - 1) and s % x_size != 0 and (s + 1) % x_size != 0:
            P[s][0][s - x_size] = 1
            P[s][2][s - 1] = 1
            P[s][3][s + 1] = 1
        else:
            P[s][0][s - x_size] = 1
            P[s][1][s + x_size] = 1
            P[s][2][s - 1] = 1
            P[s][3][s + 1] = 1
    return P
```
An overwhelming victory for einsum.
What makes me happy about this is that, for example, the MDP that has to be solved in the inner loop of inverse reinforcement learning can now be solved quickly.