Try running a CNN with ChainerRL


I modified the sample from the ChainerRL Quickstart Guide slightly so that it runs a CNN. The learning target is Atari's "Pong-v0".

I also referred to the article "Try using chainerRL".

Because I lack knowledge of Linux, Python, and reinforcement learning, I can't tell whether the agent is learning properly, but I have confirmed that the code runs. Please point out any mistakes or give me advice.


OS: Ubuntu 16.04
Python: 3.6.0
Chainer: 1.21.0

Package import

The main change is the last two imports below. I use them to convert the game screen to grayscale and to resize it.

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np
import datetime
from skimage.color import rgb2gray
from skimage.transform import resize

Game selection etc.

I haven't changed this area.

env = gym.make('Pong-v0')
obs = env.reset()

Agent settings, etc.

I don't know how best to configure this, so I'm simply reusing an older CNN model as it is.

class QFunction(chainer.Chain):
    def __init__(self, n_history=1, n_action=6):
        # Register the layers with the Chain so that their parameters are trained
        super().__init__(
            l1=L.Convolution2D(n_history, 32, ksize=8, stride=4, nobias=False, wscale=np.sqrt(2)),
            l2=L.Convolution2D(32, 64, ksize=3, stride=2, nobias=False, wscale=np.sqrt(2)),
            l3=L.Convolution2D(64, 64, ksize=3, stride=1, nobias=False, wscale=np.sqrt(2)),
            l4=L.Linear(3136, 512, wscale=np.sqrt(2)),
            out=L.Linear(512, n_action, initialW=np.zeros((n_action, 512), dtype=np.float32))
        )

    def __call__(self, x, test=False):
        s = chainer.Variable(x)
        h1 = F.relu(self.l1(s))
        h2 = F.relu(self.l2(h1))
        h3 = F.relu(self.l3(h2))
        h4 = F.relu(self.l4(h3))
        h5 = self.out(h4)
        return chainerrl.action_value.DiscreteActionValue(h5)

The same applies here; I haven't studied this enough. n_history is used to mean the number of input channels. This time I converted the screen to grayscale, so the number of channels is 1.
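
As a quick check of why l4 takes 3136 inputs, the sketch below applies the usual output-size formula for a convolution without padding to an 80x80 input. The helper name conv_out is hypothetical, and I am assuming the default padding of 0.

```python
def conv_out(size, ksize, stride, pad=0):
    """Output length along one axis of a convolution (floor division)."""
    return (size + 2 * pad - ksize) // stride + 1

size = 80  # height and width of the resized game screen
# (ksize, stride) of l1, l2 and l3 in QFunction
for ksize, stride in [(8, 4), (3, 2), (3, 1)]:
    size = conv_out(size, ksize, stride)

# l3 has 64 output channels, so the flattened feature vector has this many elements
n_features = 64 * size * size
print(size, n_features)
```

If you change the input size or the kernel settings, the first argument of l4 has to be changed to match.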

n_history = 1  # grayscale input, so a single channel
n_action = env.action_space.n
q_func = QFunction(n_history, n_action)

Optimizer settings, etc.

I changed the replay buffer capacity from 10 ** 6 to 10 ** 4.

optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)  # attach the optimizer to the Q-function's parameters

gamma = 0.95

explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 4)

phi = lambda x: x.astype(np.float32, copy=False)
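
For readers who are new to the explorer setting, here is a minimal sketch of what constant epsilon-greedy selection means: with probability epsilon a random action is taken, otherwise the action with the highest Q-value. This is only an illustration in plain Python, not ChainerRL's implementation; the function epsilon_greedy and the Q-values are made up.

```python
import random

def epsilon_greedy(q_values, epsilon, random_action_func, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return random_action_func()
    # Index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [0.1, 0.5, -0.2, 0.0, 0.3, 0.2]  # made-up Q-values for 6 actions

actions = [
    epsilon_greedy(q_values, 0.3, lambda: rng.randrange(6), rng)
    for _ in range(1000)
]
# Share of steps on which the greedy action (index 1) was chosen
greedy_share = actions.count(1) / len(actions)
print(greedy_share)
```

With epsilon=0.3 the greedy action is chosen on roughly three quarters of the steps, because a random draw can also land on it.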

Game progress etc.

agent = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    minibatch_size=4, replay_start_size=500, update_frequency=1,
    target_update_frequency=100, phi=phi)

last_time = datetime.datetime.now()
n_episodes = 1000
for i in range(1, n_episodes + 1):
    # Grayscale, resize to 80x80, then add a channel axis: (1, 80, 80)
    obs = resize(rgb2gray(env.reset()), (80, 80))
    obs = obs[np.newaxis, :, :]

    reward = 0
    done = False
    R = 0  # total reward for this episode

    while not done:
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        obs = resize(rgb2gray(obs), (80, 80))
        obs = obs[np.newaxis, :, :]

        if reward != 0:
            R += reward

    # Time spent on this episode
    elapsed_time = datetime.datetime.now() - last_time
    print('episode:', i, '/', n_episodes,
          'reward:', R,
          'minutes:', elapsed_time.seconds/60)
    last_time = datetime.datetime.now()

    if i % 100 == 0:
        filename = 'agent_Breakout' + str(i)
        agent.save(filename)  # save a snapshot of the agent every 100 episodes

    agent.stop_episode_and_train(obs, reward, done)
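
One small note on the timing printout: elapsed_time.seconds only holds the seconds part of a timedelta and ignores whole days. That is fine for a single episode, but if you measure the whole run, total_seconds() is probably what you want. The dates below are made up for illustration.

```python
import datetime

start = datetime.datetime(2017, 1, 1, 0, 0, 0)
end = start + datetime.timedelta(days=1, minutes=5)
elapsed = end - start

minutes_wrapped = elapsed.seconds / 60        # days are dropped
minutes_total = elapsed.total_seconds() / 60  # days are included
print(minutes_wrapped, minutes_total)
```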

The main changes are the following two lines. The first line converts the screen to grayscale and resizes it. The second line adds a leading channel axis so that the array has the shape Convolution2D expects.

obs = resize(rgb2gray(env.reset()),(80,80))
obs = obs[np.newaxis, :, :]
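
To show how the shape changes in these two lines, here is a NumPy-only sketch. It is a stand-in and not what skimage does internally: the luminance weights are my assumption about rgb2gray, the downsampling is plain nearest-neighbour rather than interpolation, and the function name preprocess and the dummy frame are hypothetical.

```python
import numpy as np

def preprocess(frame, out_size=(80, 80)):
    """Rough stand-in for resize(rgb2gray(frame), (80, 80)) plus the channel axis."""
    # Luminance-weighted grayscale, scaled to [0, 1]: (H, W, 3) -> (H, W)
    weights = np.array([0.2125, 0.7154, 0.0721])
    gray = frame.astype(np.float64) @ weights / 255.0
    # Nearest-neighbour downsampling: pick evenly spaced rows and columns
    rows = np.arange(out_size[0]) * gray.shape[0] // out_size[0]
    cols = np.arange(out_size[1]) * gray.shape[1] // out_size[1]
    small = gray[rows][:, cols]
    # Add the leading channel axis: (80, 80) -> (1, 80, 80)
    return small[np.newaxis, :, :]

frame = np.zeros((210, 160, 3), dtype=np.uint8)  # dummy frame with the Atari screen shape
obs = preprocess(frame)
print(obs.shape, obs.dtype)
```

The agent then applies phi to cast the array to float32 before it reaches the network.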

I used a laptop with 8GB of memory. When I set the capacity to 10 ** 6 and did not convert to grayscale, the process was killed at around 300 episodes. I don't know which of the two changes was effective, but with both of them applied the problem went away.
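
A rough back-of-the-envelope estimate suggests why memory could run out. The sketch assumes that each stored transition keeps one observation array, that a raw Atari frame is 210x160x3 uint8, and that the resized grayscale frame is stored as 1x80x80 float64. It ignores all other overhead, and buffer_bytes is a hypothetical helper.

```python
def buffer_bytes(capacity, shape, bytes_per_value):
    """Approximate size of a full replay buffer holding one array per entry."""
    n_values = 1
    for dim in shape:
        n_values *= dim
    return capacity * n_values * bytes_per_value

GB = 1024 ** 3
raw = buffer_bytes(10 ** 6, (210, 160, 3), 1)  # RGB uint8 frames, capacity 10 ** 6
small = buffer_bytes(10 ** 4, (1, 80, 80), 8)  # grayscale float64 frames, capacity 10 ** 4
print(round(raw / GB, 1), 'GB vs', round(small / GB, 2), 'GB')
```

Under these assumptions the full-size buffer would need far more than 8GB if it ever filled up, while the reduced one stays well under 1GB.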

At around 200 episodes, games were still ending with 21 points scored in a row. By 1000 episodes I was getting about 5 points per game. Training for 1000 episodes takes a whole day.

I'm posting this because I think it may be helpful for beginners. If you find any mistakes or points that could be improved, please let me know.

