The other day I published "[Reinforcement learning] Easy high-speed implementation of Ape-X!"; this time I will write a beginner-oriented article about Experience Replay, which that implementation builds on.
(If you search for "Experience Replay" on the Internet, you will find many articles implementing it from scratch in Python (1, 2, 3, 4, etc.), so here I want to show that it can be used more easily.)
cpprb is a library I develop for Experience Replay in reinforcement learning.
1.1.1 Linux/Windows
You can install the binary as-is from PyPI.
pip install cpprb
1.1.2 macOS
Unfortunately, cpprb cannot be compiled with clang, which is used by default, so you need to install gcc via Homebrew or MacPorts and compile it by hand at install time.
Replace /path/to/g++ with the path of your installed g++.
CC=/path/to/g++ CXX=/path/to/g++ pip install cpprb
Reference: Installation procedure on the official website
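After installation, you can do a quick sanity check by creating a tiny buffer and adding/sampling one transition (a minimal example; the buffer layout here is arbitrary).

import numpy as np
from cpprb import ReplayBuffer

# Create a tiny buffer, store one dummy transition, and sample it back
rb = ReplayBuffer(16, env_dict={"obs": {"shape": 3}, "rew": {}, "done": {}})
rb.add(obs=np.zeros(3), rew=0.0, done=0)
print(rb.sample(1))  # dict of np.ndarray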
Experience Replay is a method in which the transitions an agent obtains by exploring the environment are not passed to the neural network for training as they come, but are first stored temporarily in a buffer and then sampled randomly for training.
It is known to reduce the learning instability caused by the autocorrelation inherent in consecutive transitions, and it is widely used in off-policy reinforcement learning.
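To illustrate the idea, a from-scratch version can be sketched in a few lines with the standard library; this is only a conceptual sketch (with dummy transitions), and the cpprb version below is what I actually recommend.

import random
from collections import deque

buffer = deque(maxlen=10000)  # temporarily store transitions

# store dummy (obs, act, rew, next_obs, done) transitions
for t in range(100):
    buffer.append((t, 0, 1.0, t + 1, False))

# train on a randomly drawn mini-batch instead of the latest transitions
batch = random.sample(buffer, k=32)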
Below is sample code for Experience Replay using cpprb. The neural-network model implementation, visualization, model saving, and so on are not included in this sample. (Replace the MockModel part with your own implementation.)
import numpy as np
import gym
from cpprb import ReplayBuffer
n_training_step = int(1e+4)
buffer_size = int(1e+6)
batch_size = 32
env = gym.make("CartPole-v1")
class MockModel:
    # Implement a model such as DQN here
    def __init__(self):
        pass

    def get_action(self, obs):
        return env.action_space.sample()

    def train(self, sample):
        pass
model = MockModel()
obs_shape = 4
act_dim = 1
rb = ReplayBuffer(buffer_size,
                  env_dict={"obs": {"shape": obs_shape},
                            "act": {"shape": act_dim},
                            "rew": {},
                            "next_obs": {"shape": obs_shape},
                            "done": {}})
# Specify what to save in dict format. "shape" and "dtype" can be specified.
# The default is {"shape": 1, "dtype": np.single}.
obs = env.reset()
for i in range(n_training_step):
    act = model.get_action(obs)
    next_obs, rew, done, _ = env.step(act)

    # Pass the transition as keyword arguments
    rb.add(obs=obs, act=act, rew=rew, next_obs=next_obs, done=done)

    if done:
        rb.on_episode_end()
        obs = env.reset()
    else:
        obs = next_obs

    sample = rb.sample(batch_size)
    # Randomly sampled transitions are returned in dict[str, np.ndarray] format
    model.train(sample)
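For reference, the returned sample is a dict whose values are batched np.ndarray, so with the env_dict above and batch_size = 32 you should see shapes roughly like the following (worth checking against your own setup).

print(sample["obs"].shape)       # (32, 4)
print(sample["next_obs"].shape)  # (32, 4)
print(sample["act"].shape)       # (32, 1)
print(sample["rew"].shape)       # (32, 1)
print(sample["done"].shape)      # (32, 1)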
Prioritized Experience Replay is an extension of Experience Replay in which transitions with a large TD error are sampled with higher priority.
A detailed explanation is omitted in this article, but the following articles and sites cover it.
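For reference, the core formulas of the original paper can be sketched in a few lines of NumPy; this is only an illustration of the math with made-up priorities, not cpprb's internal implementation.

import numpy as np

priorities = np.array([0.1, 0.5, 2.0, 0.8])  # e.g. |TD error| + small epsilon (made-up values)
alpha, beta = 0.4, 0.4

probs = priorities ** alpha
probs /= probs.sum()                  # P(i) = p_i^alpha / sum_k p_k^alpha

weights = (len(priorities) * probs) ** (-beta)
weights /= weights.max()              # importance-sampling weights, normalized by the maximum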
Like the Experience Replay sample, this sample does not include the neural-network model implementation, visualization, or model saving.
Using a Segment Tree has been proposed to run Prioritized Experience Replay at high speed, but it tends to be buggy when implemented on your own and slow when implemented in pure Python. (In cpprb, the Segment Tree is implemented in C++ and is fast.)
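For reference, a minimal (and unoptimized) sum Segment Tree in pure Python looks roughly like this; it is only a sketch of the data structure, and cpprb's actual C++ implementation differs in detail.

import numpy as np

class SumTree:
    # Sum Segment Tree: O(log N) priority update and prefix-sum sampling
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity)  # leaves at [capacity, 2*capacity), internal nodes hold sums

    def update(self, idx, priority):
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:  # propagate the new sum up to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, value):
        # walk down from the root, descending into the subtree that contains `value`
        i = 1
        while i < self.capacity:
            left = 2 * i
            if value <= self.tree[left]:
                i = left
            else:
                value -= self.tree[left]
                i = left + 1
        return i - self.capacity  # index of the sampled transition

# Usage: update priorities, then draw a value uniformly in [0, total priority)
tree = SumTree(8)
for i, p in enumerate([0.1, 0.5, 2.0, 0.8]):
    tree.update(i, p)
idx = tree.sample(np.random.uniform(0.0, tree.tree[1]))

The cpprb sample code for Prioritized Experience Replay follows.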
import numpy as np
import gym
from cpprb import PrioritizedReplayBuffer
n_training_step = int(1e+4)
buffer_size = int(1e+6)
batch_size = 32
env = gym.make("CartPole-v1")
class MockModel:
    # Implement a model such as DQN here
    def __init__(self):
        pass

    def get_action(self, obs):
        return env.action_space.sample()

    def train(self, sample):
        pass

    def compute_abs_TD(self, sample):
        return 0
model = MockModel()
obs_shape = 4
act_dim = 1
rb = PrioritizedReplayBuffer(buffer_size,
                             env_dict={"obs": {"shape": obs_shape},
                                       "act": {"shape": act_dim},
                                       "rew": {},
                                       "next_obs": {"shape": obs_shape},
                                       "done": {}},
                             alpha=0.4)
obs = env.reset()
for i in range(n_training_step):
    act = model.get_action(obs)
    next_obs, rew, done, _ = env.step(act)

    # You can also specify the priority directly when adding to the buffer.
    # If not specified, the highest priority is used.
    rb.add(obs=obs, act=act, rew=rew, next_obs=next_obs, done=done)

    if done:
        rb.on_episode_end()
        obs = env.reset()
    else:
        obs = next_obs

    sample = rb.sample(batch_size, beta=0.4)
    # In addition to the transitions specified in the constructor,
    # "indexes" and "weights" are included in the dict as np.ndarray
    model.train(sample)

    abs_TD = model.compute_abs_TD(sample)
    rb.update_priorities(sample["indexes"], abs_TD)
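As a side note, in the original paper beta is annealed from its initial value toward 1 over the course of training rather than kept fixed at 0.4. Since beta is just an argument of sample(), a simple linear schedule inside the training loop is enough (an illustrative schedule, not part of the sample above).

beta_start, beta_end = 0.4, 1.0
# inside the training loop, compute the annealed beta for step i
beta = beta_start + (beta_end - beta_start) * i / n_training_step
sample = rb.sample(batch_size, beta=beta)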
We have opened GitHub Discussions as a user forum, so if you have any questions about cpprb, please ask there.