DQN's background + I wrote a Deep Q-Network in Chainer

Introduction

I didn't set out to build DQN in the first place; I wanted to use Chainer for other purposes, so I wrote DQN as practice, and ~~thought I might as well publish it~~ **have now published it**. Since I'm publishing it anyway, I'll take this opportunity to organize the knowledge I have related to it (Q-learning + function approximation).

If you only read the news articles and the Nature and NIPS papers, you might think "the DQN that DeepMind/Google made is amazing!!!" However, considering the historical background of reinforcement learning, you can see that it is a technology born rather naturally as reinforcement learning + deep learning. (Playing ATARI games with better-than-human performance was visually easy to grasp $\Leftrightarrow$ the problem setting was also good.)

In this article, taking the following two papers from NIPS and Nature,

・V. Mnih *et al.*, "Playing Atari with deep reinforcement learning"   http://arxiv.org/pdf/1312.5602.pdf
・V. Mnih *et al.*, "Human-level control through deep reinforcement learning"   http://www.nature.com/nature/journal/v518/n7540/abs/nature14236.html

I will explain the development of reinforcement learning leading up to them, focusing as much as possible on the relevant parts.

This is a bit of a geeky aside, but the lead author, Mnih, studied under Dr. Szepesvári, famous for stochastic planning including UCT, and under Dr. Hinton, the well-known godfather of neural networks, so you could say DQN was born from a very natural lineage.

So, what is reinforcement learning anyway?

Some people may not be familiar with reinforcement learning itself. In Bishop's "PRML" and, more recently, Murphy's "Machine Learning", reinforcement learning is treated along the lines of "this book covers many areas of machine learning, except reinforcement learning."

I think this is due to the special problem setting of reinforcement learning: its theory is what you get when you put control theory (optimal control / dynamic programming) and machine learning in a shaker and mix them. Here, without going into that theory, I will explain as simply as possible "what is reinforcement learning?"

Reinforcement learning: environment and agents

Reinforcement learning is a machine learning method in which an "agent" placed in an "environment" acquires an optimal policy (a rule for deciding actions) through interaction with the environment.

According to Wikipedia, "Reinforcement learning is a type of machine learning that deals with the problem of an agent in an environment observing its current state and deciding what action to take." [^1]

A picture is worth a thousand words; schematically it looks like the figure below.
RL.png

Reinforcement learning progresses by repeating the following simple steps.

  1. The agent receives an observation $o$ (or the environment's state $s$ directly) from the environment and, based on its policy $\pi$, returns an action $a$ to the environment.
  2. The environment transitions to the next state $s'$ based on the action $a$ received from the agent and the current state $s$, and, based on that transition, returns to the agent the next observation $o'$ and the reward $r$, a single number (a scalar) indicating how good the previous action was.
  3. Time advances: $t \leftarrow t + 1$

Here $\leftarrow$ denotes assignment. The reward is usually given as a function of the state, the action, and the next state: $r = r(s, a, s')$. Under the rules of reinforcement learning we cannot manipulate the environment directly; only the agent is ours to control. This interaction between the environment and the agent, and this constraint, are what make reinforcement learning peculiar.

For simplicity, we only describe the situation where the state of the environment is passed directly to the agent as its observation ($o = s$) [^2]. In the framework of reinforcement learning, the optimal policy for the agent is the policy that maximizes the sum of rewards obtainable from the present time into the infinite future when actions are decided according to that policy:

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots

In other words, this is the objective function of reinforcement learning. $t$ is the current time and $r_i$ is the reward the agent receives at time $i$. $\gamma$ is a reinforcement-learning parameter called the discount rate. It usually takes a continuous value between $0$ and strictly less than $1$, and is a constant introduced so that $R_t$ stays finite for any reward setting that does not involve infinite or infinitesimal rewards. It is usually set close to $1$, e.g. $\gamma = 0.99$.
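As a concrete illustration, here is a minimal sketch in plain Python of how the discounted return $R_t$ would be computed for a finite episode (the reward list and the printed example are my own made-up values, not from the papers):

```python
# Minimal sketch: discounted return R_t for a finite list of rewards.
# rewards[k] is assumed to hold r_{t+k+1}; gamma is the discount rate.

def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite episode."""
    R = 0.0
    for k, r in enumerate(rewards):
        R += (gamma ** k) * r
    return R

# Example: three steps of reward 1 with gamma = 0.99
print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.9801 = 2.9701
```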

When the environment's state is passed directly to the agent as its observation ($o = s$), as assumed here, it is theoretically known that there always exists at least one optimal policy that can be represented by a ***deterministic function that outputs an action***, $\pi^*(s)$ [^3]. I won't go into the details, but all of the learning methods introduced in the following sections are alike in that they search for a deterministic policy that maximizes the above objective function.

When DQN learns ATARI games, the state is taken to be made up of the game screens (observations) of the most recent 4 steps.
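For illustration, a minimal sketch of such 4-frame stacking might look like the following (NumPy only; the 84x84 frame size and the function names here are my own assumptions for this sketch, not taken from the DQN code):

```python
import numpy as np
from collections import deque

# Minimal sketch: a "state" made of the last 4 game screens (observations).
frame_buffer = deque(maxlen=4)  # oldest frame drops out automatically

def observe(frame):
    """Append the newest frame and return the stacked 4-frame state."""
    frame_buffer.append(frame)
    while len(frame_buffer) < 4:           # pad at the start of an episode
        frame_buffer.append(frame)
    return np.stack(frame_buffer, axis=0)  # shape: (4, 84, 84)

state = observe(np.zeros((84, 84), dtype=np.float32))
print(state.shape)  # (4, 84, 84)
```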

Before Deep Q-Network Was Born

Here we look at the background from which Deep Q-Network was born: Q-learning, function approximation, Experience Replay, and Neural Fitted Q Iteration, the ingredients that make up the Deep Q-Network.

Q-learning

Q-learning is a classic reinforcement learning algorithm, and perhaps the most widely known one. All Q-learning-based methods learn the optimal policy by approximating a function called the **optimal action value function**.

For each state-action pair $(s, a)$ there is one optimal action value: the expected value of the sum of rewards (that is, of the objective function $R$) obtained if you take action $a$ in state $s$ and then follow the optimal policy forever after. The function that gives this value for every state-action pair $(s, a)$ is called the optimal action value function and is written $Q^*(s, a)$.

It is a confusing definition, but roughly speaking, this value represents the **merit** of the agent taking action $a$ in state $s$. The important point about the optimal action value function is that between this function and (one of) the optimal deterministic policies $\pi^*(s)$ there is the relationship

\pi^*(s) = \rm{argmax}_a Q^*(s,a)

In other words, in Q-learning, obtaining the optimal action value function and obtaining the optimal policy mean almost the same thing [^4].

In actual Q-learning, we create a table function $Q(s, a)$ over every state $s$ and action $a$, initialize all of its elements to arbitrary values, and then, every time we obtain a data tuple $(s, a, r, s')$, update it with the following formula:

Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \Bigl(r + \gamma \ \max_{a'} Q(s',a')\Bigr)

Here $r$ is the reward received from the environment after selecting action $a$ in state $s$, and $s'$ is the state received at the next time step after taking action $a$ in state $s$. $\alpha$ is the learning rate; a small value less than $1$, such as $0.1$ or $0.01$, is usually used. This algorithm pulls $Q(s, a)$ a little closer each time to $r + \gamma \max_{a'} Q(s', a')$, a target that contains the function itself [^5].

The notation above may not be the most common in reinforcement learning (especially for those who learned it from the Sutton & Barto book), but it explicitly shows what the value function $Q(s, a)$ is being pulled toward. Naturally, this update is equivalent to the more popular notation:

Q(s,a) \leftarrow Q(s,a) + \alpha \Bigl(r + \gamma\ \max_{a'} Q(s',a') - Q(s,a)\Bigr)
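As a concrete sketch, tabular Q-learning with this update can be written in a few lines of Python (the environment interface `reset()`/`step()`/`sample_action()` and the epsilon-greedy exploration are my own assumptions for illustration, not part of the papers):

```python
import numpy as np

# Minimal sketch of tabular Q-learning, assuming small discrete state/action
# spaces and a hypothetical environment whose step(a) returns (s_next, r, done).
def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))   # table Q(s, a), arbitrary initialization
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = env.sample_action()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```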

The biggest feature of Q-learning is that, **as long as samples $(s, a, r, s')$ can be obtained infinitely often for every pair $(s, a)$, the above algorithm converges to the optimal value function $Q^*(s, a)$ no matter in what order the samples are given** [^6]. This property will be fully exploited in Experience Replay, introduced later.

I said we would create a table function $Q(s, a)$ for all states and actions, but if the state is something insanely high-dimensional such as a sequence of images, as in DQN, the table obviously does not fit in memory, and we cannot collect samples for every entry either. Even a $10 \times 10$ binary image has an utterly unmanageable number of states, $2^{100} \approx 10^{30}$. Therefore, when dealing with such very high-dimensional states, we use function approximation for $Q(s, a)$.

Not yet Deep Q-Network: Q-learning + function approximation

So let us consider approximating the function $Q(s, a)$. We represent $Q(s, a)$ by some parameter vector $\theta$ and write the approximated function as $Q_\theta(s, a)$.

To perform learning equivalent to Q-learning with this function, the following gradient algorithm is classically used:

\theta \leftarrow \theta - \alpha \nabla_\theta L_\theta

Here $L_\theta$ is the error function below; the update itself is the same as an ordinary gradient-method update.

L_\theta = \frac{1}{2}\Bigl( \rm{target} - Q_\theta(s,a)\Bigr)^2

By the way, if we knew the true action value $Q^*(s, a)$, we could set $\rm{target} = Q^*(s, a)$ and this would be pure supervised learning; but in reinforcement learning the teacher signal $Q^*(s, a)$ is not given. So, just as in Q-learning with a table function,

 \rm{target} = r + \gamma \max_{a'} Q_\theta (s',a')

is used as the teacher signal at that moment. When differentiating the error function, we do not differentiate through the teacher signal; we differentiate only so as to bring $Q_\theta(s, a)$ closer to it. In other words,

\begin{align}
\nabla_\theta L_\theta &=-\Bigl(\rm{target} - Q_\theta(s,a)\Bigr) \nabla_\theta Q_\theta(s,a)\\
&=-\Bigl(r + \gamma \max_{a'} Q_\theta (s',a') - Q_\theta(s,a)\Bigr) \nabla_\theta Q_\theta(s,a)
\end{align}

This is the gradient we actually use.

As a result, Q-learning with a function approximator becomes:

\theta \leftarrow \theta + \alpha \Bigl(r + \gamma \max_{a'} Q_\theta (s',a') - Q_\theta(s,a)\Bigr) \nabla_\theta Q_\theta(s,a)
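A minimal sketch of this update with a linear function approximator (NumPy only; the feature function `phi(s, a)` is a placeholder I am assuming) looks like this. Note that the target is computed as a plain number before the gradient step, so no gradient flows through it, which is exactly the caution raised next:

```python
import numpy as np

# Minimal sketch: one Q-learning update for a linear approximator
# Q_theta(s, a) = theta . phi(s, a), where phi is an assumed feature function.
def q_update(theta, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    q_next = max(float(theta @ phi(s_next, a2)) for a2 in actions)
    target = r + gamma * q_next               # treated as a constant (no gradient)
    td_error = target - float(theta @ phi(s, a))
    # For a linear approximator, grad_theta Q_theta(s, a) = phi(s, a)
    return theta + alpha * td_error * phi(s, a)
```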

This is an expression very similar to the second form of the tabular Q-learning update shown above.

One thing to be careful about in Q-learning with function approximation is that **$\max_{a'} Q_\theta(s', a')$, which appears inside the teacher signal, must not be differentiated when computing the derivative of the error function**. This kind of mistake is unlikely when you differentiate by hand and write the result into code, but if you use the automatic differentiation that is so widespread these days, you can end up differentiating through the teacher signal without realizing it (and it is hard to notice, because both versions behave the same way early in training). (I did it once myself.)

If you train with the teacher signal included in the differentiation, the algorithm ends up with different properties from the Q-learning with a function approximator described here.

Use neural networks for function approximation

Q-learning with nonlinear function approximators such as multi-layer neural networks was pursued energetically by Long-Ji Lin during the second neural network boom, when error backpropagation had been developed and reinforcement learning was starting to mature theoretically. Experience Replay is the method he published in the paper "Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching".

As an aside, this is a very interesting paper that trains agents trying to survive in a complex, dynamic simulated environment.

With a multi-layer neural network (though "multi-layer" here means about 3 or 4 fully connected layers), learning is inevitably slow, and simply raising the learning rate makes it diverge. As a way of speeding it up, Experience Replay exploits the property of Q-learning mentioned above.

What is Experience Replay? It records all the samples $(s, a, r, s')$ the agent has experienced (or the most recent finite number of them) and reuses each sample many times in the Q-learning-with-function-approximation updates described above. The idea is that, since tabular Q-learning makes progress no matter how the order of the samples is shuffled, the same property can be expected from Q-learning with a function approximator.
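A minimal sketch of such a replay memory in plain Python might look like the following (the capacity and batch size here are arbitrary choices for illustration, not the papers' values):

```python
import random
from collections import deque

# Minimal sketch of an experience replay memory.
class ReplayMemory:
    def __init__(self, capacity=10**5):
        self.buffer = deque(maxlen=capacity)   # old samples drop out automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Q-learning tolerates arbitrary sample order, so uniform sampling is fine.
        return random.sample(self.buffer, batch_size)
```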

An algorithm that pushes Experience Replay to its limit is Dr. Riedmiller's Neural Fitted Q Iteration [^7]. Unlike the online reinforcement learning considered so far, in which the agent collects and learns from data by itself, it assumes that a sufficient amount of sample data $(s, a, r, s')$ is given in advance, and considers learning the optimal policy from that data alone, without adding any more samples; in other words, it performs batch reinforcement learning.

The pseudo code for the Neural Fitted Q Iteration algorithm is as follows.

Input: data $D$ consisting of many samples $(s, a, r, s')$
Output: trained Q function $Q_{\theta_N}(s, a)$

  1. $k = 0$
  2. Initialize the neural network parameters $\theta_0$
  3. Repeat the following $N-1$ times:
  4. Extract $M$ samples from the data $D$
  5. From the $M$ samples, create the teacher signals ${\rm{target}}^t = r_t + \gamma \max_{a'} Q_{\theta_k}(s_t', a')$ and the inputs $(s_t, a_t)$ ($t = 1, 2, ..., M$)
  6. Using the created teacher signals ${\rm{target}}^t$ and inputs $(s_t, a_t)$, minimize the error $L_\theta$ between $Q_\theta(s, a)$ and the teacher signals with some gradient method. When learning has converged and certain conditions are met, take the result as $\theta_{k+1}$
  7. $k \leftarrow k + 1$
  8. Return $Q_{\theta_N}(s, a)$

The important point in Neural Fitted Q Iteration is that, in the gradient-based optimization performed at each iteration, the teacher signals are always generated from the same fixed parameters $\theta_k$ as $r_t + \gamma \max_{a'} Q_{\theta_k}(s_t', a')$. Unlike the online method, **each optimization is therefore pure supervised learning**, so by construction the learning is expected to be very stable.
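To show the overall loop, here is a hedged sketch of this procedure in Python (the `q_model` object with its `predict` and supervised `fit` methods is a placeholder standing in for "minimize $L_\theta$ with some gradient method"; none of these names come from the papers):

```python
import random

# Hedged sketch of Neural Fitted Q Iteration.
# q_model is assumed to offer predict(s, a) and a supervised fit(inputs, targets)
# that runs regression to convergence; D is a list of (s, a, r, s_next) samples.
def neural_fitted_q(q_model, D, actions, N=10, M=1000, gamma=0.99):
    for k in range(N - 1):
        batch = random.sample(D, M)            # step 4: extract M samples
        inputs, targets = [], []
        for s, a, r, s_next in batch:
            # step 5: teacher signal computed from the *fixed* parameters theta_k
            target = r + gamma * max(q_model.predict(s_next, a2) for a2 in actions)
            inputs.append((s, a))
            targets.append(target)
        # step 6: plain supervised regression toward the fixed targets -> theta_{k+1}
        q_model.fit(inputs, targets)
    return q_model
```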

Neural Fitted Q Iteration is an algorithm that extracts information from a fixed amount of data, but as a more advanced variant, the Growing Batch idea has been proposed: after learning to some extent, the agent is run in the environment and the batch data is extended afterwards [^8]. At this point the method is already almost the same as that of the Nature paper.

Riedmiller's team uses this batch reinforcement learning as its main method and applies it to robot behavior learning. Dr. Riedmiller also appears to be at DeepMind now, which shows just how surprisingly many reinforcement learning researchers have gathered there.

And Deep Q-Network: Q-learning + deep learning

The question "What is the new technology of DQN?" Is simply "I applied the deep learning technology to the above function approximation of reinforcement learning."

Specifically, both the NIPS and the Nature papers use a convolutional neural network for this structure, together with recently developed optimization methods such as RMSProp that help avoid getting stuck in poor local solutions. These are essentially the new points in DQN. (Addendum: I think the DQN structure of feeding only the state into the network and outputting the estimated values of all actions in that state with a single forward pass is also new, but, sorry, ~~I can't be sure of that~~.) (Addendum: Regarding the structure mentioned in the addendum above, there was in fact a precedent: by 2009 at the latest, Shibata & Kono had built Q-learning with a similarly structured 5-layer fully connected neural network and applied reinforcement learning with image input to an AIBO behavior-learning task [^9].)

The NIPS paper "Playing Atari with deep reinforcement learning", where DQN first appeared, uses Experience Replay on a large scale in combination with mini-batch training. However, it is still an online reinforcement learning algorithm, and therefore unstable.

In contrast, the Nature paper "Human-level control through deep reinforcement learning" achieves stability by applying Neural Fitted Q Iteration + Growing Batch, that is, an online but "almost batch" style of reinforcement learning.

This, as far as I know, is the origin of DQN.

I wrote DQN in Python + Chainer (code released)

So, as Chainer practice, I implemented DQN. ~~It will be released on GitHub.~~ $\rightarrow$ **Published.** https://github.com/ugo-nama-kun/DQN-chainer

**Addendum 1**: Both the Nature and the NIPS versions of the code have been released. Running them requires RLglue and the RLglue Python codec, which are often used for benchmarking in reinforcement learning research, the Arcade Learning Environment, which connects the ATARI emulator to RLglue, and the Python packages NumPy, SciPy and, of course, Chainer. Since computation is assumed to run on a GPU, an NVIDIA GPU is required. Operation has been confirmed only on Ubuntu 14.04 LTS.

**Addendum 2**: The number of transitions stored for DQN's Experience Replay is set to 1/10 of the original ($10^6 \rightarrow 10^5$). This was simply because I wanted to scale it down to run on my small machine. The replay-history implementation is a straightforward rendering of the algorithm described in the Nature paper, so I think it is inefficient. At least on Pong, I have confirmed that learning works without problems.

**Addendum 3**: All operation checks were done with Pong, one of the ATARI games. See readme.txt for how to train on other games. Because of the scale-down, the expected performance may not be reached.

I made both the NIPS version and the Nature version, but the NIPS version has many parameters that are not specified in the paper, and the Nature version converges with a better feel. Think of the NIPS version as a bonus.

The learning results of the Nature version are posted on YouTube $\downarrow$ https://www.youtube.com/watch?v=N813o-Xb6S8

Postscript: Training of the NIPS version also succeeded (although the resulting behavior is almost the same), so I posted it too $\downarrow$ https://www.youtube.com/watch?v=ez_JEHMUpng

Some impressions of Chainer [^10]: having once implemented DQN in Theano and posted that video on YouTube, I found that, compared with the Theano-based version, it is far easier in Chainer to see where an error is occurring. There is also a lot of freedom, so I definitely recommend it if you want to combine standard neural network components (using a convolutional neural network, stacking layers, and so on).

On the other hand, Theano remains attractive because, when you want to try a new algorithm, you can speed it up on the GPU with almost the same operations no matter what it is. For example, when I feel like "I'm going to develop another, better variant of something like an LSTM cell, and I want to optimize it!", I think I would still need a general tool like Theano. In that sense, it seems suited to cutting-edge research and academic (or otherwise more specialized) situations. That was my impression.

Trivia: How did the name DQN come about?

This story is completely ridiculous.

Calling a yankee-like (delinquent) person "DQN" is, I think, a rather odd phenomenon, but in any case it feels quite miraculous that this algorithm ended up matching it as Deep Q-Network = DQN.

Digging into the components of D-Q-N: D is deep learning, Q is Q-learning, and N is neural network.

First, "Deep" is the trend of the moment (names of the form "Deep something" have probably exploded since the paper "A fast learning algorithm for deep belief nets" by Dr. Hinton et al.). And what is the Q of Q-learning? According to a comment by Dr. Barto, one of the founders of reinforcement learning, this "Q" is (I believe) the Q of "Quality" [^11].

Next, the name Q-Network does not seem to be common in reinforcement learning; even in papers, the usual phrasing is something like "a neural network was used as the function approximator in Q-learning". The closest expression is "Q-net", which the aforementioned Lin used in his 1993 paper, and I suspect this is the origin of what is now being called Q-Network.

note:

[^1]: Reinforcement learning on Wikipedia (Japanese):
https://ja.wikipedia.org/wiki/%E5%BC%B7%E5%8C%96%E5%AD%A6%E7%BF%92

[^2]: Such a situation (though a few more assumptions are actually involved) is described as "assuming the environment can be modeled as a Markov Decision Process (MDP)". When the state is not passed directly to the agent, the setting is called a Partially Observable Markov Decision Process (POMDP), and reinforcement learning in such environments has in recent years come to be called Partially Observable Reinforcement Learning (PORL). The ATARI game-learning setting naturally falls under PORL, since the agent only receives the game screen as input, but learning is performed on the assumption that the state consists of the game screens of the most recent 4 steps.

[^3]: For more on this point, see M. Puterman's "Markov decision processes" and O. Sigaud and O. Buffet's "Markov decision processes in artificial intelligence".

[^4]: More precisely, "in an environment that can be described as a Markov decision process".

[^5]: See the following page for the full form of the algorithm:
http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node65.html

[^6]: Theoretically, the learning rate $\alpha$ must be gradually decayed with each update. However, that is not very realistic when you want to run a reinforcement learning algorithm for a long time, or when you want it to keep adapting to a somewhat dynamic environment. For details on the convergence conditions, see Dr. Melo's "Convergence of Q-learning: a simple proof":
http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf

[^10]: Chainer has been updated frequently since its release and new features keep being added, so this is only my impression at the time of writing (July 2015).

[^11]: The source is of the take-it-with-a-grain-of-salt variety:
http://www.quora.com/How-was-the-term-Q-learning-coined
