I have been studying machine learning and statistics. Reinforcement learning was one of the topics I had never studied properly, so I deepened my understanding by implementing it. The reference material is as follows.
[Deep Learning Textbook: Deep Learning G Test (Generalist) Official Text](https://www.amazon.co.jp/%E6%B7%B1%E5%B1%A4%E5%AD%A6%E7%BF%92%E6%95%99%E7%A7%91%E6%9B%B8-%E3%83%87%E3%82%A3%E3%83%BC%E3%83%97%E3%83%A9%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0-G%E6%A4%9C%E5%AE%9A-%E3%82%B8%E3%82%A7%E3%83%8D%E3%83%A9%E3%83%AA%E3%82%B9%E3%83%88-%E5%85%AC%E5%BC%8F%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88/dp/4798157554)
Reinforcement learning (RL) is a machine learning method in which a system learns optimal control through its own trial and error.
AlphaGo is a famous example.
In terms of classification, it is positioned as **distinct from both supervised and unsupervised learning**. You might think of it as unsupervised learning since there are no labels, but it is treated as a separate category.
The figure below is a schematic diagram. Given an environment, the model learns to optimize the actions it takes so as to maximize the target reward (score). What counts as an action depends on the game you want it to learn.
Research in this area was active in the 1990s, but it was difficult to decide how to represent the "state" and how to connect that "state" to an "action", and the momentum of research apparently declined in the 2000s.
In response to this issue, in 2013 DeepMind released a model that learned to play Breakout through **reinforcement learning combined with a convolutional neural network (CNN)**, and it received a great response.
The method used there is called DQN (Deep Q-Network) because it combines **Q-learning**, a reinforcement learning method, with deep learning. **However, this article uses plain Q-learning.** I hope to try an implementation with DQN next time.
This time we will deal with **CartPole**, which is to reinforcement learning what MNIST is to image classification. A pole is mounted on a cart that moves in one dimension (left and right), and the task is to learn how to move the cart so that the pole does not fall over.
It's very simple. **The state consists of the following four parameters**:

- Cart position
- Cart velocity
- Pole angle
- Pole angular velocity

On the other hand, the **action** is one of the following two:

- Push the cart to the left (0)
- Push the cart to the right (1)
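These can also be checked directly from the Gym environment object. The following is just a quick confirmation sketch using the same `CartPole-v0` environment that appears in the code later in this article:

```python
import gym

env = gym.make('CartPole-v0')

# Observation space: a Box of 4 continuous values
# (cart position, cart velocity, pole angle, pole angular velocity)
print(env.observation_space.shape)  # (4,)

# Action space: 2 discrete actions (0 = push left, 1 = push right)
print(env.action_space.n)           # 2
```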
A reward of 1 is given for each time step as long as the pole does not fall. One episode is defined as the period until the end of one game.
First, let's see what happens if we simply pick 0 or 1 at random with `np.random.choice([0, 1])`, without learning anything.
```python
import gym
import numpy as np

env = gym.make('CartPole-v0')

goal_average_steps = 195          # Clear condition: average of 195 steps or more
max_number_of_steps = 200         # Maximum number of steps per episode
num_consecutive_iterations = 100  # Number of episodes used for the running average
num_episodes = 200                # Total number of episodes
last_time_steps = np.zeros(num_consecutive_iterations)

for episode in range(num_episodes):
    # Environment initialization
    observation = env.reset()
    episode_reward = 0
    for t in range(max_number_of_steps):
        # Drawing CartPole
        env.render()
        # Random choice of action
        action = np.random.choice([0, 1])
        # Take action and get feedback
        observation, reward, done, info = env.step(action)
        episode_reward += reward
        if done:
            print('%d Episode finished after %d time steps / mean %f' % (episode, t + 1,
                  last_time_steps.mean()))
            last_time_steps = np.hstack((last_time_steps[1:], [episode_reward]))
            break
    # Success if the mean of the last 100 episodes is 195 steps or higher
    if last_time_steps.mean() >= goal_average_steps:
        print('Episode %d train agent successfully!' % episode)
        break
```
Then it will be displayed as follows.
```
185 Episode finished after 21 time steps / mean 21.350000
186 Episode finished after 23 time steps / mean 21.390000
187 Episode finished after 22 time steps / mean 21.510000
188 Episode finished after 39 time steps / mean 21.420000
189 Episode finished after 13 time steps / mean 21.320000
190 Episode finished after 9 time steps / mean 21.160000
191 Episode finished after 26 time steps / mean 20.980000
192 Episode finished after 17 time steps / mean 21.100000
193 Episode finished after 94 time steps / mean 21.120000
194 Episode finished after 15 time steps / mean 21.870000
195 Episode finished after 26 time steps / mean 21.880000
196 Episode finished after 13 time steps / mean 21.970000
197 Episode finished after 13 time steps / mean 21.940000
198 Episode finished after 31 time steps / mean 21.760000
199 Episode finished after 23 time steps / mean 21.950000
```
You can see that each episode ends after only a few dozen time steps. Since the agent is not learning anything, the number of steps it survives does not increase.
Now, let's train it with Q-learning.
The key idea is to define an index for selecting the next action to take in a given state. This value is called the Q value. In the program it is held as a matrix called `q_table`, structured as follows.
In this case, each of the four cart and pole state variables is discretized into four bins, giving $4^4 = 256$ states. Combined with the $2$ left/right actions, this forms a $256 \times 2$ matrix.
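The code below uses a `digitize_state` function and a `q_table` that are not shown in this section (they are in the full program linked at the end). For reference, here is a minimal sketch of how they could look, assuming each observation variable is clipped to a rough range and split into four bins; the bin ranges here are my own assumptions, not values taken from the original program:

```python
import numpy as np

num_digitized = 4  # Number of bins per state variable (assumed)

# Q table: 4^4 = 256 discretized states x 2 actions, randomly initialized
q_table = np.random.uniform(low=-1, high=1, size=(num_digitized**4, 2))

def bins(clip_min, clip_max, num):
    # Interior bin edges for np.digitize
    return np.linspace(clip_min, clip_max, num + 1)[1:-1]

def digitize_state(observation):
    # Map the 4 continuous observation values to a single index in [0, 255]
    cart_pos, cart_v, pole_angle, pole_v = observation
    digitized = [
        np.digitize(cart_pos,   bins=bins(-2.4, 2.4, num_digitized)),
        np.digitize(cart_v,     bins=bins(-3.0, 3.0, num_digitized)),
        np.digitize(pole_angle, bins=bins(-0.5, 0.5, num_digitized)),
        np.digitize(pole_v,     bins=bins(-2.0, 2.0, num_digitized)),
    ]
    return sum(x * (num_digitized**i) for i, x in enumerate(digitized))
```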
We update this Q value after each action. Actions with larger Q values are more likely to be selected, so raising the Q values of actions that lead to higher rewards gradually improves the behavior.
The Q value is updated as follows:

$$
Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\left(R(s, a) + \gamma \max_{a'} Q(s', a')\right)
$$

Here, $s$ is the current state, $a$ is the action taken, $s'$ is the next state, $a'$ is an action in the next state, $R(s, a)$ is the reward, $\alpha$ is the learning rate, and $\gamma$ is the discount rate.
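As a concrete example with the values used in the code below ($\alpha = 0.2$, $\gamma = 0.99$) and made-up Q values: if $Q(s, a) = 0.5$, the reward is $1$, and $\max_{a'} Q(s', a') = 0.6$, then

$$
Q(s, a) \leftarrow 0.8 \times 0.5 + 0.2 \times (1 + 0.99 \times 0.6) = 0.4 + 0.2 \times 1.594 \approx 0.72
$$

so the Q value of this state-action pair rises from 0.5 to about 0.72.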
When this is implemented, it will be as follows.
```python
def get_action(state, action, observation, reward):
    next_state = digitize_state(observation)
    next_action = np.argmax(q_table[next_state])
    # Update Q table
    alpha = 0.2   # Learning rate
    gamma = 0.99  # Discount rate
    q_table[state, action] = (1 - alpha) * q_table[state, action] +\
        alpha * (reward + gamma * q_table[next_state, next_action])
    return next_action, next_state

for episode in range(num_episodes):
    # Environment initialization
    observation = env.reset()
    state = digitize_state(observation)
    action = np.argmax(q_table[state])
    episode_reward = 0
    for t in range(max_number_of_steps):
        # Drawing CartPole
        # env.render()
        # Take action and get feedback
        observation, reward, done, info = env.step(action)
        # Choose the next action and update the Q table
        action, state = get_action(state, action, observation, reward)
        episode_reward += reward
        if done:
            print('%d Episode finished after %d time steps / mean %f' % (episode, t + 1,
                  last_time_steps.mean()))
            last_time_steps = np.hstack((last_time_steps[1:], [episode_reward]))
            break
    # Success if the mean of the last 100 episodes is 195 steps or higher
    if last_time_steps.mean() >= goal_average_steps:
        print('Episode %d train agent successfully!' % episode)
        break
```
With this, learning actually converges in about 100 episodes and the agent succeeds.
In this article I went over the outline of reinforcement learning and played with CartPole. Next, I would also like to take on DQN.
The full program is below. https://github.com/Fumio-eisan/RL20200527