I want DQN Puniki to hit a home run
Deep Q-Network (DQN)
I found Deep Q-Network (DQN), which combines deep learning and reinforcement learning to learn behavior patterns, interesting, so I implemented it. I'm publishing it here because I got some results.
The source code is available below.
https://github.com/dsanno/chainer-dqn
The following article explains DQN in detail:
History of DQN + Deep Q-Network written in Chainer
Learning target
The game DQN will learn this time is Winnie the Pooh's Home Run Derby!. (Note: sound plays at the linked page.)
The goal is to have Aniki, also known as Puniki, hit as many home runs as possible.
The reasons for choosing this game are as follows:
- Simple rules
You hit the ball thrown by the pitcher with a log and clear the stage by hitting the specified number of home runs.
- Easy to determine rewards
The result of each pitch is displayed as an image such as "Home run" or "Strike", so rewards can be assigned according to which image appears.
Another reason was that the game is extremely difficult for humans, but I never got far enough for it to become difficult. (Reference: [Nico Nico Pedia](http://dic.nicovideo.jp/a/%E3%81%8F%E3%81%BE%E3%81%AE%E3%83%97%E3%83%BC%E3%81%95%E3%82%93%E3%81%AE%E3%83%9B%E3%83%BC%E3%83%A0%E3%83%A9%E3%83%B3%E3%83%80%E3%83%BC%E3%83%93%E3%83%BC!))
Development environment
- Windows 10
- Chainer 1.5.1
- PyAutoGUI is used for screen capture and mouse control (a capture/control sketch follows this list)
- GeForce GTX970
With a CPU, training takes too long and does not work well.
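For reference, a minimal sketch of the PyAutoGUI side (capturing the game area and moving the pointer / pressing the button) might look like the following. The window coordinates and helper names are assumptions for illustration, not taken from the repository.

```python
import pyautogui

# Assumed position of the 600 x 450 px game area on the desktop.
GAME_LEFT, GAME_TOP = 100, 100
GAME_WIDTH, GAME_HEIGHT = 600, 450


def capture_game_screen():
    # pyautogui.screenshot returns a PIL image of the given region.
    return pyautogui.screenshot(
        region=(GAME_LEFT, GAME_TOP, GAME_WIDTH, GAME_HEIGHT))


def move_pointer(x, y):
    # x, y are relative to the game area.
    pyautogui.moveTo(GAME_LEFT + x, GAME_TOP + y)


def set_button(pressed):
    # Hold or release the mouse button depending on the chosen action.
    if pressed:
        pyautogui.mouseDown()
    else:
        pyautogui.mouseUp()
```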
Neural network configuration
- The input is 150 x 112 x 3ch pixel data. The game screen is 600 x 450 px, but the captured image is scaled down to 1/4 of its width and height before being fed to the network.
- The output is a vector of action evaluation values (Q-values).
The length of the vector equals the number of action patterns.
This time the Y coordinate of the pointer is fixed and the X coordinate takes one of 33 positions.
Combined with the two button states (ON / OFF), there are 33 x 2 = 66 action patterns in total.
- The hidden layers are 3 convolutional layers, 1 LSTM layer, and 1 fully connected layer (a sketch in Chainer follows this list).
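A minimal Chainer sketch of a network with this shape might look like the code below. The filter sizes, strides, and hidden size are assumptions for illustration; the actual values are in the repository linked above.

```python
import chainer
import chainer.functions as F
import chainer.links as L


class QNet(chainer.Chain):
    """3 convolutional layers -> LSTM -> fully connected layer.

    Layer sizes here are guesses, not the values used in the repository.
    """

    def __init__(self, n_actions=66):
        super(QNet, self).__init__(
            conv1=L.Convolution2D(3, 32, ksize=8, stride=4),
            conv2=L.Convolution2D(32, 64, ksize=4, stride=2),
            conv3=L.Convolution2D(64, 64, ksize=3, stride=1),
            # 9600 = 64 channels x 10 x 15 feature map for a 112 x 150 input
            # under the (assumed) convolution settings above.
            lstm=L.LSTM(9600, 512),
            fc=L.Linear(512, n_actions),
        )

    def __call__(self, x):
        # x: (batch, 3, 112, 150) pixel data, one frame per call.
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.conv3(h))
        h = self.lstm(h)   # the LSTM carries its state across frames
        return self.fc(h)  # one evaluation value per action pattern

    def reset_state(self):
        self.lstm.reset_state()
```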
About play
- On the title screen, pitcher selection screen, and so on, predetermined positions are simply clicked.
- During an at-bat, the screen is captured every 100 ms and used as the input image (hereafter, this 100 ms interval is called a "frame"). The action pattern with the highest evaluation value becomes the action for the next frame.
- A reward is given when one of the following results is detected; the reward for every other frame is 0.
Under the game's rules, fouls and ordinary hits are treated as failures just like strikes, but since making contact is still better than missing the ball, their rewards are slightly higher than the strike reward.
- Home run: 100
- Strike: -100
- Foul: -90
- Hit: -80
- Random actions are taken using the following 3 patterns.
In this game a random action lasting only one frame has little effect, so random actions are taken for 10 to 30 consecutive frames (see the sketch after this list).
- Only the pointer position is random
- Only the button state is random
- Both the pointer position and the button state are random
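A rough sketch of the action selection described above is shown below: the greedy action is normally taken, and with some probability a random span of 10 to 30 frames starts, in which the pointer position, the button state, or both are randomized. The exploration probability, the action encoding, and the reward table names are assumptions on my part.

```python
import random

import numpy as np

# Rewards assigned when a result image is detected (values from the list above);
# every other frame gets a reward of 0.
REWARDS = {'home_run': 100, 'strike': -100, 'foul': -90, 'hit': -80}

N_POSITIONS = 33              # X coordinates for the pointer (Y is fixed)
N_ACTIONS = N_POSITIONS * 2   # x 2 button states = 66 action patterns
EPSILON = 0.05                # probability of starting a random span (assumed value)


class ActionSelector(object):
    """Greedy selection, with random actions held for 10-30 consecutive frames."""

    def __init__(self):
        self.random_frames_left = 0
        self.random_mode = 'both'

    def __call__(self, q_values):
        # q_values: evaluation values for the 66 action patterns of this frame.
        x_step, button = divmod(int(np.argmax(q_values)), 2)
        if self.random_frames_left == 0 and random.random() < EPSILON:
            # Start a random span; randomize the pointer position, the button
            # state, or both (the three patterns above).
            self.random_frames_left = random.randint(10, 30)
            self.random_mode = random.choice(['pointer', 'button', 'both'])
        if self.random_frames_left > 0:
            self.random_frames_left -= 1
            if self.random_mode in ('pointer', 'both'):
                x_step = random.randrange(N_POSITIONS)
            if self.random_mode in ('button', 'both'):
                button = random.randrange(2)
        # Encode the action index as (pointer step, button state).
        return x_step * 2 + button
```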
About learning
- Learning runs in a separate thread, in parallel with play.
- To train the LSTM, the parameters are updated on sequences of consecutive frames, following the procedure below (a code sketch follows this list).
The sequence length m was gradually increased in the range of 4 to 32.
- Randomly select a frame n
- Feed the network the input of frame n
- Feed the input of frame n + 1, take the maximum evaluation value of frame n + 1, and use it to update the parameters for frame n
- Continue in the same way for the intermediate frames
- Feed the input of frame n + m, take the maximum evaluation value of frame n + m, and use it to update the parameters for frame n + m - 1
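In other words, each frame in the sequence is updated toward the usual Q-learning target r + gamma * max_a Q(next frame, a), while the LSTM state is carried forward through the m frames. A rough sketch of the loss computation, assuming this data layout and no separate target network, might look like this:

```python
import numpy as np
from chainer import Variable
import chainer.functions as F

GAMMA = 0.98  # discount factor (see the settings below)


def compute_loss(model, frames, actions, rewards, m):
    """Accumulate the Q-learning loss over m consecutive frames.

    frames[i], actions[i], rewards[i] hold the minibatch data for frame n + i;
    this layout and the helper names are assumptions for illustration.
    Terminal-frame handling is omitted for brevity.
    """
    model.reset_state()
    loss = 0
    q = model(Variable(frames[0]))                 # evaluation values for frame n
    for i in range(m):
        q_next = model(Variable(frames[i + 1]))    # evaluation values for frame n + i + 1
        # Target: observed reward plus the discounted maximum value of the next frame.
        max_next = np.max(q_next.data, axis=1)
        target = q.data.copy()
        rows = np.arange(len(actions[i]))
        target[rows, actions[i]] = rewards[i] + GAMMA * max_next
        # Only the taken actions differ between q and target, so only they
        # contribute to the gradient.
        loss += F.mean_squared_error(q, Variable(target))
        q = q_next
    return loss
```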
Settings
- Minibatch size: 64
- Discount factor (gamma): 0.98
- Optimizer: AdaDelta
AdaDelta is used in reference [2], so I adopted it.
The parameters are the defaults: rho = 0.95, eps = 1e-06.
- The gradient L2 norm is limited to 0.1 with chainer.optimizer.GradientClipping()
Without this limit the Q-values grew too large and learning did not stabilize. (A setup sketch follows this list.)
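Putting those settings together, the optimizer setup in Chainer looks roughly like this (QNet is the network class sketched earlier; this is a sketch, not the exact code in the repository):

```python
import chainer
from chainer import optimizers

model = QNet()  # the network class sketched in the earlier code block

# AdaDelta with the default parameters, as in the settings above.
optimizer = optimizers.AdaDelta(rho=0.95, eps=1e-06)
optimizer.setup(model)

# Limit the L2 norm of the gradient to 0.1 so the Q-values do not blow up.
optimizer.add_hook(chainer.optimizer.GradientClipping(0.1))

# One update step, given a loss computed as in the earlier sketch:
#   model.cleargrads()   # model.zerograds() in Chainer 1.5.x
#   loss.backward()
#   optimizer.update()
```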
Learning results
After training on Stage 1 for about 10 hours, it could almost clear Stage 1.
I uploaded a video of the play below.
Random actions were disabled while recording.
https://youtu.be/J4V6ZveYFUM
After further training that included the other stages, I confirmed that Stage 3 was sometimes cleared by a fluke.
References
- [1] V. Mnih et al., "Playing Atari with Deep Reinforcement Learning"
http://arxiv.org/abs/1312.5602
- [2] M. Hausknecht, P. Stone, "Deep Recurrent Q-Learning for Partially Observable MDPs"
http://arxiv.org/abs/1507.06527
- [3] History of DQN + Deep Q-Network written in Chainer