Einsum implementation of value iteration

I usually work on reinforcement learning and inverse reinforcement learning. I recently learned about numpy.einsum, found it interesting, and tried implementing value iteration with it, so I am using that as material for this Advent calendar post.

Prior knowledge

This section outlines the prerequisites: Markov decision processes and value iteration.

Markov decision process

A Markov decision process (MDP) extends a Markov process (a process with the Markov property, in which the next state depends only on the current state) by adding actions, i.e. decision-making. Formally, an MDP is given by a state set $S$, an action set $A$, transition probabilities $P(s^{\prime} \mid s, a)$, a reward $R(s)$, and a discount factor $\gamma$. There are many detailed explanations online, so I will not go further here.

Value iteration

Value iteration is a type of algorithm called dynamic programming. It finds the value of each state $s$ by the following procedure.

- Initialize $V$ and $Q$
- Repeat until convergence, for every $s$ and $a$:
  - $Q(s, a) = R(s) + \gamma \sum_{s^{\prime}} P(s^{\prime} \mid s, a) V(s^{\prime})$
  - $V(s) = \max_{a} Q(s, a)$
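To make the procedure concrete, here is a minimal sketch on a tiny hypothetical two-state, two-action MDP (the transition probabilities and rewards below are made up for illustration):

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] is a transition probability, R[s] a state reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 (actions 0, 1)
    [[0.0, 1.0], [1.0, 0.0]],   # transitions from state 1 (actions 0, 1)
])
R = np.array([0.0, 1.0])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman backup: Q(s,a) = R(s) + gamma * sum_{s'} P(s'|s,a) V(s')
    Q = R[:, None] + gamma * np.einsum("ijk,k->ij", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the values have converged
        V = V_new
        break
    V = V_new

print(V)
```

Since state 1 carries all the reward, its converged value comes out higher than that of state 0.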

Implementation by einsum

numpy.einsum makes it easy to implement matrix and tensor computations using Einstein summation notation. Here, the term $\sum_{s^{\prime}} P(s^{\prime} \mid s, a) V(s^{\prime})$ on the right-hand side of the equation above can be written as follows.


np.einsum("ijk,k->ij", P, V)
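As a quick sanity check (illustrative, using hypothetical random data), this contraction agrees with the explicit sum over $s^{\prime}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 4
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize into transition probabilities
V = rng.random(n_states)

EV = np.einsum("ijk,k->ij", P, V)          # shape (n_states, n_actions)

# Explicit double loop computing the same expectation over s'
EV_loop = np.zeros((n_states, n_actions))
for s in range(n_states):
    for a in range(n_actions):
        EV_loop[s, a] = np.dot(P[s, a, :], V)

print(np.allclose(EV, EV_loop))  # True
```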

Code

I experimented with Gridworld, a standard introductory problem in reinforcement learning, and compared how the computation time of the einsum implementation and an ordinary loop implementation changes as the length of one side of the grid increases.

vi.py



# -*- coding: utf-8 -*-

import numpy as np
from makeP import makeProb
import time

for statenum in range(2, 101):

    list_einsum = []
    list_loop = []

    for ite in range(100):
        print(".", end="", flush=True)
        x_size = statenum
        y_size = statenum

        n_states = int(x_size * y_size)

        P = makeProb(x_size, y_size)

        R = np.ones(n_states)*-0.01
        R[int(n_states-1)] = 1

        Q = np.zeros((n_states, 4))
        V = np.zeros(n_states)


        gamma = 0.95

        t1 = time.time()
        for _ in range(10000):
            Q = R[:, None] + gamma * np.einsum("ijk,k->ij", P, V)
            V = np.max(Q, axis=1)
        t2 = time.time()
        list_einsum.append(t2-t1)

        Q = np.zeros((n_states, 4))
        V = np.zeros(n_states)

        t1 = time.time()
        for _ in range(10000):
            for s in range(n_states):
                for a in range(4):
                    Q[s, a] = R[s] + gamma * np.dot(P[s, a, :], V)
                V[s] = np.max(Q[s, :])
        t2 = time.time()
        list_loop.append(t2-t1)
    print("")
    ar_einsum = np.array(list_einsum)
    ar_loop = np.array(list_loop)

    print("EINSUM: ", "size: ", statenum, "mean: ", np.mean(ar_einsum), "median: ", np.median(ar_einsum), "stdev: ", np.std(ar_einsum))
    print("LOOP  : ", "size: ", statenum, "mean: ", np.mean(ar_loop),   "median: ", np.median(ar_loop),   "stdev: ", np.std(ar_loop))

makeP.py


# -*- coding: utf-8 -*-

import numpy as np

def makeProb(x_size, y_size, n_action=4):
    # make transition prob

    n_states = x_size*y_size
    # S, A, S'
    P = np.zeros(( n_states, n_action, n_states ))



    for s in range(n_states):

        #left is wall
        if s%x_size == 0:
            if s == 0:
                P[0][1][x_size] = 1   # down neighbor is s + x_size
                P[0][3][1] = 1
            elif s > (x_size*(y_size-1)-1):
                P[s][0][s-x_size] = 1
                P[s][3][s+1] = 1
            else:
                P[s][0][s-x_size] = 1
                P[s][1][s+x_size] = 1
                P[s][3][s+1] = 1

        #right is wall
        elif (s+1)%x_size == 0:
            if s == (x_size*y_size-1):
                P[s][0][s-x_size] = 1
                P[s][2][s-1] = 1
            elif s < x_size:
                P[s][1][s+x_size] = 1
                P[s][2][s-1]=1
            else:
                P[s][0][s-x_size] = 1
                P[s][1][s+x_size] = 1
                P[s][2][s-1] = 1

        #upper is wall
        elif s < x_size and s%x_size!=0 and (s+1)%x_size !=0 :
            P[s][1][s+x_size] = 1
            P[s][2][s-1] = 1
            P[s][3][s+1] = 1

        #lower is wall
        elif s > (x_size*(y_size-1)-1) and s%x_size!=0 and (s+1)%x_size !=0:
            P[s][0][s-x_size] = 1
            P[s][2][s-1] = 1
            P[s][3][s+1] = 1

        else:
            P[s][0][s-x_size] = 1
            P[s][1][s+x_size] = 1
            P[s][2][s-1] = 1
            P[s][3][s+1] = 1

    return P
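The wall checks above all rely on the row-major numbering of grid states. A small helper (hypothetical, not part of the original code) makes the convention explicit, assuming actions 0 to 3 mean up, down, left, and right:

```python
def neighbors(s, x_size, y_size):
    """Return {action: next_state} for the in-bounds moves from state s,
    where states are numbered row by row (s = y * x_size + x)."""
    x, y = s % x_size, s // x_size
    moves = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right
    out = {}
    for a, (dx, dy) in moves.items():
        nx, ny = x + dx, y + dy
        if 0 <= nx < x_size and 0 <= ny < y_size:
            out[a] = ny * x_size + nx
    return out

print(neighbors(0, 5, 5))   # {1: 5, 3: 1} -- matches the entries set for s == 0
```

On a 5x5 grid this reproduces exactly the transitions makeProb assigns: from the top-left corner only "down" (to state 5) and "right" (to state 1) are possible.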

Result

The einsum implementation wins overwhelmingly.

(Figure: computing time.png — computation time of the einsum and loop implementations as the grid size grows)

What makes me happy is that, for example, the inner loop of inverse reinforcement learning can now be solved quickly.
