In this article, we try to solve the inverted pendulum swing-up problem with a Deep Q-Network, commonly known as DQN, which combines Q-learning, a reinforcement learning method, with deep neural networks.
The "inverted pendulum swing-up problem" is such a problem setting this time.
A motor is fixed in mid-air, and one end of a rod is attached to the motor shaft. The rod is an idealized one with its mass concentrated at the center, rigidity $\infty$, and zero thickness. In the initial state, the rod hangs straight down under gravity. The task is to swing the pendulum up from this state and hold it still in the inverted position. In good old control engineering, you would have to handle this with a controller that includes nonlinear elements, for example by designing two separate controllers for swing-up and for stabilization and switching between them. Well, I've never actually done it myself, but so I hear.
This time, the motor can only rotate to the right or left with a constant torque. Also, to make things a little trickier, the motor torque is not very large, so even if you keep turning in one direction from the initial state, you cannot overcome gravity and cannot swing up. Below is an animation of falling into exactly that trap. The torque is applied to the right the whole time, but as the rod approaches horizontal, the contribution of gravitational acceleration in the angular direction increases, so the rod is pushed back and just oscillates.
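The original simulation code is not shown in this article, but just to fix ideas, here is a minimal sketch of the kind of rod-pendulum dynamics described above. The physical constants, integration scheme, time step, and torque limit are all placeholder assumptions of mine, not values from the original experiment.

```python
import numpy as np

# Minimal sketch of the rod-pendulum dynamics (assumed values, not the original code).
# theta = 0 means hanging straight down; a uniform rod pivoted at one end has I = m*l^2/3.
M, L, G, DT = 1.0, 1.0, 9.8, 0.02   # mass [kg], length [m], gravity, time step [s]
MAX_TORQUE = 1.5                    # deliberately too weak to swing up in one push
I = M * L**2 / 3.0

def step(theta, omega, torque):
    """Advance one time step with semi-implicit Euler integration."""
    torque = np.clip(torque, -MAX_TORQUE, MAX_TORQUE)
    alpha = (-M * G * (L / 2.0) * np.sin(theta) + torque) / I
    omega += alpha * DT
    theta += omega * DT
    return theta, omega

def tip_height(theta):
    """Height of the rod tip relative to the motor axis (used later for the reward)."""
    return -L * np.cos(theta)
```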
As for DQN itself, the excellent article linked here covers it in detail, so in this article I will mainly explain the results and the ideas around the implementation.
The agent (here, the controller of the motor) takes actions (commands the direction of motor rotation) on the environment (the motor and rod), and learns the optimal policy while receiving a reward and some observation in return.
For the reward, we used the following function $r(h)$ of the height $h$ of the rod tip as seen from the motor: the higher the tip, the better.
r(h) = \begin{cases}
5h & \mathrm{if}\ h \ge 0 \\
h & \mathrm{if}\ h < 0
\end{cases}
I biased the positive side, but that may have been unnecessary. For the observation, the ATARI example feeds in the screen image directly, but this time I used the pendulum angle itself as input. The angle history over 4 simulation steps is assumed to be available as a sequence.
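In code, the reward and the 4-step angle observation might look roughly like the following sketch. The function names and the way the history is padded are my own assumptions, not the original implementation.

```python
from collections import deque
import numpy as np

def reward(h):
    """Reward r(h): the positive side is weighted 5x, as in the formula above."""
    return 5.0 * h if h >= 0 else h

# The observation fed to the net is assumed to be the last 4 pendulum angles.
N_HISTORY = 4
angle_history = deque(maxlen=N_HISTORY)

def observe(theta):
    """Append the newest angle and return the 4-step angle sequence as the state."""
    angle_history.append(theta)
    while len(angle_history) < N_HISTORY:   # pad at the start of an episode
        angle_history.appendleft(theta)
    return np.array(angle_history, dtype=np.float32)
```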
Below is a plot of the learning progress. The horizontal axis is the number of trials, and the vertical axis is the total score obtained in each trial. The blue dots are the individual results, and the red line traces the running high score.
The results oscillate a lot, probably because the system is nonlinear and multimodal; even after convergence the score swings between positive and negative, but the high score grows steadily. Let's look at the learning process below.
At first, it is stuck in the trap described above.
Around the 120th trial, it notices that it can swing up by rocking back and forth, but it cannot come to rest afterwards.
By the 6950th trial, it seems to have grasped the knack of coming to rest after swinging up. Almost there! Keep going!
By the 7640th trial, the goal is almost achieved.
This is the best result over 30,000 trials. The motion is a little jittery, but apparently it is better to complete the first swing-up in the shortest possible time.
I'm a little surprised that it works better than I expected. For the last example, here is a time series plot of the tip height and the profile of the control input to the motor. The strategy is completely different between swinging up and holding still, and you can see that both have been learned.
The DQN article introduced earlier also shows an implementation, but this time I reinvented the wheel and reimplemented it for my own understanding. I put my implementation, which follows the original paper, here.
What I couldn't fully understand just by reading the paper was how to structure the core deep net and how to update it, so I will explain that part.
The deep neural network $Q$ is a net that, given a sequence of state observations as input, outputs the action value of each action. In this case, the action value is a vector that expresses "how desirable it is to turn the motor right or left in the situation indicated by the input angle sequence".
Of course, in the early stages of learning this net is random, so it returns random results. By updating it with the procedure explained below, it grows into a good net whose actions earn a large total reward in the future.
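The article text does not say which framework the original implementation used, so purely as an illustration, here is a minimal PyTorch sketch of such a net: the 4-angle state goes in, and two action values (turn right / turn left) come out. The layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

N_HISTORY = 4   # 4-step angle sequence as input
N_ACTIONS = 2   # action values for "turn right" and "turn left"

# A small fully connected net: state sequence in, action-value vector out.
q_net = nn.Sequential(
    nn.Linear(N_HISTORY, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)

state = torch.zeros(1, N_HISTORY)   # e.g. all angles 0 (hanging down)
print(q_net(state))                 # tensor of shape (1, 2): Q(s) for each action
```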
Suppose that taking action $a_t$ in state $s_t$ yields reward $r_t$ and transitions to $s_{t+1}$. At this time, in the action value vector $Q(s_t)$, which is the output of $Q$ for input $s_t$, only the entry corresponding to $a_t$ is rewritten according to the following formula to create the target $y_t$ (the symbols have been changed from the original paper).
y_t = Q(s_t) \\
y_t[a_t] \leftarrow r_t + \gamma \max Q(s_{t+1})
Then the net weights are updated one step so that $Q(s_t)$ moves closer to this $y_t$. In other words, the reward is added to the maximum action value obtainable on the next move, discounted by a factor $\gamma$. Ideally the action value should be determined from the sum of rewards up to the end of the episode, but that would take too much computation time, so only the next step is used. The claim is that if you repeat this update procedure long enough, the action values come to reflect the total future reward from each state. Whether that is strictly true I cannot say, but that is at least how I understand it.
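Continuing the PyTorch sketch above (assuming the `q_net` defined there), one learning step on a single transition $(s_t, a_t, r_t, s_{t+1})$ might look like the following. The discount rate, learning rate, and choice of `Adam` are placeholder assumptions, and as in the update described here, no separate target network is used.

```python
import torch

GAMMA = 0.99
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

def update_step(s_t, a_t, r_t, s_t1):
    """One gradient step pulling Q(s_t) toward the target y_t described above."""
    q_values = q_net(s_t)                               # Q(s_t), shape (1, N_ACTIONS)
    with torch.no_grad():
        y_t = q_values.clone()
        y_t[0, a_t] = r_t + GAMMA * q_net(s_t1).max()   # rewrite only the a_t entry
    loss = loss_fn(q_values, y_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```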
I did not do much trial and error on the architecture of the deep net itself, but I think you can't go far wrong by making it deeper. Whether adding Dropout, Batch Normalization, and so on to improve generalization actually helps probably depends on the problem.
Best Experience Replay
My implementation basically follows the paper, but there is one point I devised myself, so let me explain it.
Apparently it is better for the "state / action / reward" tuples used for learning not to be correlated with each other. For that, a technique called Experience Replay (ER) is important; it seems to be one of DQN's biggest points. It stores past experiences and learns from random batches sampled from them.
One trial is called an episode, and in the original paper every episode is stored in its entirety. To gather new experience, each episode is run with a scheme called $\epsilon$-greedy: with probability $\epsilon$ a random action is chosen, otherwise the net-based (greedy) action is taken. At the beginning of learning $\epsilon$ is large, so learning starts from almost random movements.
Once it has learned enough, good results can be obtained with the greedy action alone, so I occasionally run a fully greedy episode to check how it performs.
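Here is a minimal sketch of the $\epsilon$-greedy action selection described above, again assuming the PyTorch `q_net` from earlier. The annealing schedule is a made-up example, not the one actually used.

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions=2):
    """With probability epsilon take a random action, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore
    with torch.no_grad():
        return int(q_net(state).argmax())    # exploit: best action under current Q

def epsilon_at(trial, start=1.0, end=0.05, decay_trials=5000):
    """Example schedule: epsilon starts large and shrinks as trials accumulate."""
    return max(end, start - (start - end) * trial / decay_trials)
```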
I had always been a bit skeptical about this part of the DQN paper, and when I tried it, sure enough, especially in the early stages the memory fills up with episodes that are clearly not worth remembering, and the few good experiences get pushed out; after all, the behavior at that point is almost completely random. Of course you can learn from failures too, but I think it is better to learn from good experiences as models. So this time, without distinguishing between $\epsilon$-greedy episodes and greedy episodes, I decided to prioritize episodes that achieved a good score (total reward over the episode) when storing them. I keep the all-time best 100 episodes, and only episodes whose score makes it into that ranking are stored. However, even an episode that does not make the ranking is stored with a 1% probability.
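Here is a minimal sketch of what such a Best ER memory could look like. The data layout and details such as how long the 1%-lucky episodes are kept are my own assumptions; the original code may differ.

```python
import heapq
import random

class BestER:
    """Keep the top-N episodes by total reward, plus a 1% chance for any other episode."""
    def __init__(self, n_best=100, lucky_prob=0.01):
        self.n_best = n_best
        self.lucky_prob = lucky_prob
        self.ranking = []      # min-heap of (score, id, transitions)
        self.lucky = []        # episodes kept by the 1% rule
        self._counter = 0

    def add_episode(self, transitions, score):
        self._counter += 1
        entry = (score, self._counter, transitions)
        if len(self.ranking) < self.n_best:
            heapq.heappush(self.ranking, entry)
        elif score > self.ranking[0][0]:
            heapq.heapreplace(self.ranking, entry)   # evict the current worst
        elif random.random() < self.lucky_prob:
            self.lucky.append(transitions)

    def sample(self, batch_size):
        """Random minibatch of transitions drawn across the stored episodes."""
        pool = [t for _, _, ep in self.ranking for t in ep]
        pool += [t for ep in self.lucky for t in ep]
        return random.sample(pool, min(batch_size, len(pool)))
```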
Let's call this method Best ER. Starting from states initialized with the same seed, I plotted the difference in convergence between Best ER and plain ER.
It looks quite effective. Due to limited computing resources, I haven't tried it on a wide range of cases.
With an earlier version that did not incorporate any new experiences for a while, I found that after convergence it enjoyed a golden age for a while and then declined, as shown below.
The same thing happened when I changed the random seed. I haven't investigated the cause carefully, but my guess is that it starts once the high-scoring results of the matured greedy episodes have filled the ranking, so that the data become too correlated and lose diversity. However, the improved version does not seem to reach that golden-age peak either, so I think there is still room for improvement.
Also, I think Best ER can get complacent with small successes and then fail to cope with changes in the environment, because it stays trapped in past glories. It feels like a life lesson I'm tempted to add to my personal philosophy, but whether it matters depends on the task.
Perhaps it is a general tendency of DQN, but it is hard to tell whether learning is progressing, so it is essential to save promising candidates along the way. Also, as is often the case with numerical optimization of multimodal functions, my impression is that the time it takes to produce results depends heavily on the initial values. Still, trying this case several times, you can get close to the best score after around 10,000 trials, which took about an hour on my PC (MacBook Pro, Core i5 2.5 GHz x 2 cores).
I was finally able to try DQN, which I had always wanted to do. I also proposed a method I call Best ER to speed up convergence. Implementing it yourself really deepens your understanding.
I am surprised to get such a result without any tuning or controller switching on the controller side. There are various things I would still like to try, such as making the motor command continuous, a double pendulum, observing with images instead of angles, or doing it with real hardware and a webcam, but I will stop here due to time constraints.
This time I ran everything on the CPU, but for trial and error in this field the CPU really is the limiting factor. I'm looking forward to the recently announced Cloud Machine Learning. I can also hear people quietly telling me to just buy a GPU.