I modified the sample from the ChainerRL Quickstart Guide slightly and switched the model to a CNN. The learning target is Atari's "Pong-v0".
I also referred to this article: Try using chainerRL.
Due to my limited knowledge of Linux, Python, and reinforcement learning, I can't tell whether it is learning properly, but I have confirmed that it runs. If you notice any mistakes or have advice, please let me know.
OS: ubuntu 16.04
python: 3.6.0
chainer: 1.21.0
There are two main changes, described below. skimage is used to grayscale and resize the game screen.
train.py
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np
import datetime
from skimage.color import rgb2gray
from skimage.transform import resize
This part is unchanged from the Quickstart.
train.py
env = gym.make('Pong-v0')
obs = env.reset()
env.render()
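As a quick check of my own (not part of the original script), you can print the raw observation shape and the number of actions before any preprocessing; the Pong screen should come back as a 210x160 RGB array, and Pong-v0 should expose 6 discrete actions.

print(obs.shape)           # raw screen, expected (210, 160, 3)
print(obs.dtype)           # uint8 pixel values
print(env.action_space.n)  # number of discrete actions (6 for Pong-v0)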
I don't know how to design the network properly, so I'm simply reusing an old CNN model as-is.
train.py
class QFunction(chainer.Chain):

    def __init__(self, n_history=1, n_action=6):
        super().__init__(
            l1=L.Convolution2D(n_history, 32, ksize=8, stride=4, nobias=False, wscale=np.sqrt(2)),
            l2=L.Convolution2D(32, 64, ksize=3, stride=2, nobias=False, wscale=np.sqrt(2)),
            l3=L.Convolution2D(64, 64, ksize=3, stride=1, nobias=False, wscale=np.sqrt(2)),
            l4=L.Linear(3136, 512, wscale=np.sqrt(2)),  # 3136 = 64 * 7 * 7 for an 80x80 input
            out=L.Linear(512, n_action, initialW=np.zeros((n_action, 512), dtype=np.float32))
        )

    def __call__(self, x, test=False):
        s = chainer.Variable(x)
        h1 = F.relu(self.l1(s))
        h2 = F.relu(self.l2(h1))
        h3 = F.relu(self.l3(h2))
        h4 = F.relu(self.l4(h3))  # L.Linear flattens the conv feature maps automatically
        h5 = self.out(h4)
        return chainerrl.action_value.DiscreteActionValue(h5)
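As a rough sanity check of my own (not from the original article), the 3136 in l4 comes from the conv output size for an 80x80 grayscale input: each conv layer shrinks the feature map to (in - ksize) // stride + 1 pixels per side.

# My own sketch: derive l4's input size for an 80x80 input.
size = 80
for ksize, stride in [(8, 4), (3, 2), (3, 1)]:  # l1, l2, l3
    size = (size - ksize) // stride + 1         # 80 -> 19 -> 9 -> 7
print(64 * size * size)                         # 3136 = 64 * 7 * 7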
The same goes for this part; I haven't studied it enough. n_history means the number of input channels; since the screen is grayscaled this time, the channel count is 1.
train.py
n_action = env.action_space.n
n_history=1
q_func = QFunction(n_history, n_action)
I changed the replay buffer capacity from 10 ** 6 (in the Quickstart) to 10 ** 4.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)
gamma = 0.95
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
epsilon=0.3, random_action_func=env.action_space.sample)
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 4)
phi = lambda x: x.astype(np.float32, copy=False)
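A small check of my own (not in the original): skimage's rgb2gray and resize return float64 arrays, and phi converts observations to the float32 that Chainer expects at training time.

# My own check: phi converts skimage's float64 frames to float32 for Chainer.
frame = resize(rgb2gray(env.reset()), (80, 80))
print(frame.dtype)       # float64 from skimage
print(phi(frame).dtype)  # float32 after phi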
train.py
agent = chainerrl.agents.DoubleDQN(
q_func, optimizer, replay_buffer, gamma, explorer,
minibatch_size=4, replay_start_size=500, update_frequency=1,
target_update_frequency=100, phi=phi)
last_time = datetime.datetime.now()
n_episodes = 1000
for i in range(1, n_episodes + 1):
    obs = resize(rgb2gray(env.reset()), (80, 80))
    obs = obs[np.newaxis, :, :]
    reward = 0
    done = False
    R = 0
    while not done:
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        obs = resize(rgb2gray(obs), (80, 80))
        obs = obs[np.newaxis, :, :]
        if reward != 0:
            R += reward
    elapsed_time = datetime.datetime.now() - last_time
    print('episode:', i, '/', n_episodes,
          'reward:', R,
          'minutes:', elapsed_time.seconds / 60)
    last_time = datetime.datetime.now()
    if i % 100 == 0:
        filename = 'agent_Breakout' + str(i)
        agent.save(filename)
    agent.stop_episode_and_train(obs, reward, done)
print('Finished.')
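After training, the saved agent can be loaded back and run greedily (without exploration or training). This is a minimal sketch of my own, adapted from the Quickstart's test loop; 'agent_Breakout1000' is just the directory name the save above would produce at episode 1000.

# My own sketch: load a saved agent and run one greedy test episode.
agent.load('agent_Breakout1000')  # directory written by agent.save above
obs = resize(rgb2gray(env.reset()), (80, 80))[np.newaxis, :, :]
done = False
R = 0
while not done:
    env.render()
    action = agent.act(obs)       # greedy action, no exploration or training
    obs, reward, done, _ = env.step(action)
    obs = resize(rgb2gray(obs), (80, 80))[np.newaxis, :, :]
    R += reward
agent.stop_episode()
print('test reward:', R)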
The main changes are these two lines. The first grayscales the screen and resizes it to 80x80. The second adds a channel axis so the observation has the shape that Convolution2D expects.
obs = resize(rgb2gray(env.reset()),(80,80))
obs = obs[np.newaxis, :, :]
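To make the shape change concrete (my own illustration, not from the original), the raw frame goes from (210, 160, 3) to (80, 80) after grayscaling and resizing, and then to (1, 80, 80) once the channel axis is added.

# My own illustration of the shape changes during preprocessing.
raw = env.reset()
print(raw.shape)                     # (210, 160, 3) RGB screen
gray = resize(rgb2gray(raw), (80, 80))
print(gray.shape)                    # (80, 80) grayscale
print(gray[np.newaxis, :, :].shape)  # (1, 80, 80): (channel, height, width)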
I used a laptop with 8 GB of memory; with the capacity left at 10 ** 6 and without grayscaling, the process was killed around episode 300. I don't know which of the two changes mattered, but together they fixed it.
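As a rough back-of-the-envelope estimate of my own (assuming each stored transition keeps two 80x80 float64 frames, which is what the replay buffer would hold with this preprocessing), the buffer size roughly explains the out-of-memory kill:

# My own rough estimate of replay buffer memory, under the assumption above.
bytes_per_frame = 80 * 80 * 8               # skimage output is float64
bytes_per_transition = 2 * bytes_per_frame  # obs and next obs
for capacity in (10 ** 4, 10 ** 6):
    print(capacity, bytes_per_transition * capacity / 1e9, 'GB')
# roughly 1 GB at 10 ** 4 versus about 100 GB at 10 ** 6, which cannot fit in 8 GB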
For roughly the first 200 episodes of training, the agent keeps conceding 21 points in a row; after 1000 episodes it managed to score about 5 points per game. Training 1000 episodes took a whole day.
I'm posting this because it may be helpful for beginners. If you notice any mistakes or things that could be improved, please let me know.