I modified the sample from the ChainerRL Quickstart Guide slightly and switched the model to a CNN. The learning target is Atari's "Pong-v0".
I also referred to this article: Try using chainerRL.
Due to my limited knowledge of Linux, Python, and reinforcement learning, I can't tell whether it is learning properly, but I have confirmed that it runs. If you notice any mistakes or have advice, please let me know.
OS: ubuntu 16.04
python: 3.6.0
chainer: 1.21.0
There are two main changes, described below. skimage is used to grayscale and resize the game screen.
train.py
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np
import datetime
from skimage.color import rgb2gray
from skimage.transform import resize
This part is unchanged from the Quickstart.
train.py
env = gym.make('Pong-v0')
obs = env.reset()
env.render()
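As a quick check of my own (not part of the original script), you can print the raw observation shape and the number of actions before any preprocessing; the Pong screen should come back as a 210x160 RGB array, and Pong-v0 should expose 6 discrete actions.

print(obs.shape)           # raw screen, expected (210, 160, 3)
print(obs.dtype)           # uint8 pixel values
print(env.action_space.n)  # number of discrete actions (6 for Pong-v0)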
I don't know how to design the network properly, so I'm simply reusing an old CNN model as-is.
train.py
class QFunction(chainer.Chain):

    def __init__(self, n_history=1, n_action=6):
        super().__init__(
            l1=L.Convolution2D(n_history, 32, ksize=8, stride=4, nobias=False, wscale=np.sqrt(2)),
            l2=L.Convolution2D(32, 64, ksize=3, stride=2, nobias=False, wscale=np.sqrt(2)),
            l3=L.Convolution2D(64, 64, ksize=3, stride=1, nobias=False, wscale=np.sqrt(2)),
            l4=L.Linear(3136, 512, wscale=np.sqrt(2)),  # 3136 = 64 * 7 * 7 for an 80x80 input
            out=L.Linear(512, n_action, initialW=np.zeros((n_action, 512), dtype=np.float32))
        )

    def __call__(self, x, test=False):
        s = chainer.Variable(x)
        h1 = F.relu(self.l1(s))
        h2 = F.relu(self.l2(h1))
        h3 = F.relu(self.l3(h2))
        h4 = F.relu(self.l4(h3))  # L.Linear flattens the conv feature maps automatically
        h5 = self.out(h4)
        return chainerrl.action_value.DiscreteActionValue(h5)
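As a rough sanity check of my own (not from the original article), the 3136 in l4 comes from the conv output size for an 80x80 grayscale input: each conv layer shrinks the feature map to (in - ksize) // stride + 1 pixels per side.

# My own sketch: derive l4's input size for an 80x80 input.
size = 80
for ksize, stride in [(8, 4), (3, 2), (3, 1)]:  # l1, l2, l3
    size = (size - ksize) // stride + 1         # 80 -> 19 -> 9 -> 7
print(64 * size * size)                         # 3136 = 64 * 7 * 7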
The same goes for this part; I haven't studied it enough. n_history means the number of input channels; since the screen is grayscaled this time, the channel count is 1.
train.py
n_action = env.action_space.n
n_history=1
q_func = QFunction(n_history, n_action)
I changed the replay buffer capacity from 10 ** 6 (in the Quickstart) to 10 ** 4.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)
gamma = 0.95
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
epsilon=0.3, random_action_func=env.action_space.sample)
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 4)
phi = lambda x: x.astype(np.float32, copy=False)
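A small check of my own (not in the original): skimage's rgb2gray and resize return float64 arrays, and phi converts observations to the float32 that Chainer expects at training time.

# My own check: phi converts skimage's float64 frames to float32 for Chainer.
frame = resize(rgb2gray(env.reset()), (80, 80))
print(frame.dtype)       # float64 from skimage
print(phi(frame).dtype)  # float32 after phi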
train.py
agent = chainerrl.agents.DoubleDQN(
q_func, optimizer, replay_buffer, gamma, explorer,
minibatch_size=4, replay_start_size=500, update_frequency=1,
target_update_frequency=100, phi=phi)
last_time = datetime.datetime.now()
n_episodes = 1000
for i in range(1, n_episodes + 1):
    obs = resize(rgb2gray(env.reset()), (80, 80))
    obs = obs[np.newaxis, :, :]
    reward = 0
    done = False
    R = 0
    while not done:
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        obs = resize(rgb2gray(obs), (80, 80))
        obs = obs[np.newaxis, :, :]
        if reward != 0:
            R += reward
    elapsed_time = datetime.datetime.now() - last_time
    print('episode:', i, '/', n_episodes,
          'reward:', R,
          'minutes:', elapsed_time.seconds / 60)
    last_time = datetime.datetime.now()
    if i % 100 == 0:
        filename = 'agent_Breakout' + str(i)
        agent.save(filename)
    agent.stop_episode_and_train(obs, reward, done)
print('Finished.')
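After training, the saved agent can be loaded back and run greedily (without exploration or training). This is a minimal sketch of my own, adapted from the Quickstart's test loop; 'agent_Breakout1000' is just the directory name the save above would produce at episode 1000.

# My own sketch: load a saved agent and run one greedy test episode.
agent.load('agent_Breakout1000')  # directory written by agent.save above
obs = resize(rgb2gray(env.reset()), (80, 80))[np.newaxis, :, :]
done = False
R = 0
while not done:
    env.render()
    action = agent.act(obs)       # greedy action, no exploration or training
    obs, reward, done, _ = env.step(action)
    obs = resize(rgb2gray(obs), (80, 80))[np.newaxis, :, :]
    R += reward
agent.stop_episode()
print('test reward:', R)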
The main changes are these two lines. The first grayscales the screen and resizes it to 80x80. The second adds a channel axis so the observation has the shape that Convolution2D expects.
obs = resize(rgb2gray(env.reset()),(80,80))
obs = obs[np.newaxis, :, :]
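To make the shape change concrete (my own illustration, not from the original), the raw frame goes from (210, 160, 3) to (80, 80) after grayscaling and resizing, and then to (1, 80, 80) once the channel axis is added.

# My own illustration of the shape changes during preprocessing.
raw = env.reset()
print(raw.shape)                     # (210, 160, 3) RGB screen
gray = resize(rgb2gray(raw), (80, 80))
print(gray.shape)                    # (80, 80) grayscale
print(gray[np.newaxis, :, :].shape)  # (1, 80, 80): (channel, height, width)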
I used a laptop with 8 GB of memory; with the capacity left at 10 ** 6 and without grayscaling, the process was killed around episode 300. I don't know which of the two changes mattered, but together they fixed it.
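As a rough back-of-the-envelope estimate of my own (assuming each stored transition keeps two 80x80 float64 frames, which is what the replay buffer would hold with this preprocessing), the buffer size roughly explains the out-of-memory kill:

# My own rough estimate of replay buffer memory, under the assumption above.
bytes_per_frame = 80 * 80 * 8               # skimage output is float64
bytes_per_transition = 2 * bytes_per_frame  # obs and next obs
for capacity in (10 ** 4, 10 ** 6):
    print(capacity, bytes_per_transition * capacity / 1e9, 'GB')
# roughly 1 GB at 10 ** 4 versus about 100 GB at 10 ** 6, which cannot fit in 8 GB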
For roughly the first 200 episodes of training, the agent keeps conceding 21 points in a row; after 1000 episodes it managed to score about 5 points per game. Training 1000 episodes took a whole day.
I'm posting this because it may be helpful for beginners. If you notice any mistakes or things that could be improved, please let me know.