The other day I published "[Reinforcement learning] Easy high-speed implementation of Ape-X!"; this time I will write a beginner-oriented article about Experience Replay, which that implementation builds on.
(If you search for "Experience Replay" on the Internet, you will find many articles implementing it from scratch in Python (1, 2, 3, 4, etc.), so here I want to show that it can be used more easily.)
cpprb is a library I develop for Experience Replay in reinforcement learning.
1.1.1 Linux/Windows
You can install the binary as-is from PyPI.
pip install cpprb
1.1.2 macOS
Unfortunately, cpprb cannot be compiled with clang, which is used by default, so you need to install gcc via Homebrew or MacPorts and compile it by hand at install time.
Replace /path/to/g++ with the path of your installed g++.
CC=/path/to/g++ CXX=/path/to/g++ pip install cpprb
Reference: Installation procedure on the official website
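After installation, you can do a quick sanity check by creating a tiny buffer and adding/sampling one transition (a minimal example; the buffer layout here is arbitrary).

import numpy as np
from cpprb import ReplayBuffer

# Create a tiny buffer, store one dummy transition, and sample it back
rb = ReplayBuffer(16, env_dict={"obs": {"shape": 3}, "rew": {}, "done": {}})
rb.add(obs=np.zeros(3), rew=0.0, done=0)
print(rb.sample(1))  # dict of np.ndarray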
Experience Replay is a method in which the transitions an agent obtains by exploring the environment are not passed to the neural network for training as they come, but are first stored temporarily in a buffer and then sampled randomly for training.
It is known to reduce the learning instability caused by the autocorrelation inherent in consecutive transitions, and it is widely used in off-policy reinforcement learning.
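To illustrate the idea, a from-scratch version can be sketched in a few lines with the standard library; this is only a conceptual sketch (with dummy transitions), and the cpprb version below is what I actually recommend.

import random
from collections import deque

buffer = deque(maxlen=10000)  # temporarily store transitions

# store dummy (obs, act, rew, next_obs, done) transitions
for t in range(100):
    buffer.append((t, 0, 1.0, t + 1, False))

# train on a randomly drawn mini-batch instead of the latest transitions
batch = random.sample(buffer, k=32)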
Below is sample code for Experience Replay using cpprb. The neural-network model implementation, visualization, model saving, and so on are not included in this sample. (Replace the MockModel part with your own implementation.)
import numpy as np
import gym
from cpprb import ReplayBuffer
n_training_step = int(1e+4)
buffer_size = int(1e+6)
batch_size = 32
env = gym.make("CartPole-v1")
class MockModel:
    # Implement a model such as DQN here
    def __init__(self):
        pass

    def get_action(self, obs):
        return env.action_space.sample()

    def train(self, sample):
        pass
model = MockModel()
obs_shape = 4
act_dim = 1
rb = ReplayBuffer(buffer_size,
                  env_dict={"obs": {"shape": obs_shape},
                            "act": {"shape": act_dim},
                            "rew": {},
                            "next_obs": {"shape": obs_shape},
                            "done": {}})
# Specify what to save in dict format. "shape" and "dtype" can be specified.
# The default is {"shape": 1, "dtype": np.single}.
obs = env.reset()
for i in range(n_training_step):
    act = model.get_action(obs)
    next_obs, rew, done, _ = env.step(act)

    # Pass the transition as keyword arguments
    rb.add(obs=obs, act=act, rew=rew, next_obs=next_obs, done=done)

    if done:
        rb.on_episode_end()
        obs = env.reset()
    else:
        obs = next_obs

    sample = rb.sample(batch_size)
    # Randomly sampled transitions are returned in dict[str, np.ndarray] format
    model.train(sample)
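For reference, the returned sample is a dict whose values are batched np.ndarray, so with the env_dict above and batch_size = 32 you should see shapes roughly like the following (worth checking against your own setup).

print(sample["obs"].shape)       # (32, 4)
print(sample["next_obs"].shape)  # (32, 4)
print(sample["act"].shape)       # (32, 1)
print(sample["rew"].shape)       # (32, 1)
print(sample["done"].shape)      # (32, 1)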
Prioritized Experience Replay is an extension of Experience Replay in which transitions with a large TD error are sampled with higher priority.
A detailed explanation is omitted in this article, but the following articles and sites cover it.
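For reference, the core formulas of the original paper can be sketched in a few lines of NumPy; this is only an illustration of the math with made-up priorities, not cpprb's internal implementation.

import numpy as np

priorities = np.array([0.1, 0.5, 2.0, 0.8])  # e.g. |TD error| + small epsilon (made-up values)
alpha, beta = 0.4, 0.4

probs = priorities ** alpha
probs /= probs.sum()                  # P(i) = p_i^alpha / sum_k p_k^alpha

weights = (len(priorities) * probs) ** (-beta)
weights /= weights.max()              # importance-sampling weights, normalized by the maximum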
Like the Experience Replay sample, this sample does not include the neural-network model implementation, visualization, or model saving.
Using a Segment Tree has been proposed to run Prioritized Experience Replay at high speed, but it tends to be buggy when implemented on your own and slow when implemented in pure Python. (In cpprb, the Segment Tree is implemented in C++ and is fast.)
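For reference, a minimal (and unoptimized) sum Segment Tree in pure Python looks roughly like this; it is only a sketch of the data structure, and cpprb's actual C++ implementation differs in detail.

import numpy as np

class SumTree:
    # Sum Segment Tree: O(log N) priority update and prefix-sum sampling
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity)  # leaves at [capacity, 2*capacity), internal nodes hold sums

    def update(self, idx, priority):
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:  # propagate the new sum up to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, value):
        # walk down from the root, descending into the subtree that contains `value`
        i = 1
        while i < self.capacity:
            left = 2 * i
            if value <= self.tree[left]:
                i = left
            else:
                value -= self.tree[left]
                i = left + 1
        return i - self.capacity  # index of the sampled transition

# Usage: update priorities, then draw a value uniformly in [0, total priority)
tree = SumTree(8)
for i, p in enumerate([0.1, 0.5, 2.0, 0.8]):
    tree.update(i, p)
idx = tree.sample(np.random.uniform(0.0, tree.tree[1]))

The cpprb sample code for Prioritized Experience Replay follows.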
import numpy as np
import gym
from cpprb import PrioritizedReplayBuffer
n_training_step = int(1e+4)
buffer_size = int(1e+6)
batch_size = 32
env = gym.make("CartPole-v1")
class MockModel:
    # Implement a model such as DQN here
    def __init__(self):
        pass

    def get_action(self, obs):
        return env.action_space.sample()

    def train(self, sample):
        pass

    def compute_abs_TD(self, sample):
        return 0
model = MockModel()
obs_shape = 4
act_dim = 1
rb = PrioritizedReplayBuffer(buffer_size,
                             env_dict={"obs": {"shape": obs_shape},
                                       "act": {"shape": act_dim},
                                       "rew": {},
                                       "next_obs": {"shape": obs_shape},
                                       "done": {}},
                             alpha=0.4)
obs = env.reset()
for i in range(n_training_step):
    act = model.get_action(obs)
    next_obs, rew, done, _ = env.step(act)

    # You can also specify the priority directly when adding to the buffer.
    # If not specified, the highest priority is used.
    rb.add(obs=obs, act=act, rew=rew, next_obs=next_obs, done=done)

    if done:
        rb.on_episode_end()
        obs = env.reset()
    else:
        obs = next_obs

    sample = rb.sample(batch_size, beta=0.4)
    # In addition to the transitions specified in the constructor,
    # "indexes" and "weights" are included in the dict as np.ndarray
    model.train(sample)

    abs_TD = model.compute_abs_TD(sample)
    rb.update_priorities(sample["indexes"], abs_TD)
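As a side note, in the original paper beta is annealed from its initial value toward 1 over the course of training rather than kept fixed at 0.4. Since beta is just an argument of sample(), a simple linear schedule inside the training loop is enough (an illustrative schedule, not part of the sample above).

beta_start, beta_end = 0.4, 1.0
# inside the training loop, compute the annealed beta for step i
beta = beta_start + (beta_end - beta_start) * i / n_training_step
sample = rb.sample(batch_size, beta=beta)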
We have opened GitHub Discussions as a user forum, so if you have any questions about cpprb, please ask there.