I have been studying machine learning and statistics. Reinforcement learning was one of the topics I had never studied properly, so I deepened my understanding by implementing it. The reference material is as follows.
[Deep Learning Textbook: Deep Learning G Test (Generalist) Official Text](https://www.amazon.co.jp/%E6%B7%B1%E5%B1%A4%E5%AD%A6%E7%BF%92%E6%95%99%E7%A7%91%E6%9B%B8-%E3%83%87%E3%82%A3%E3%83%BC%E3%83%97%E3%83%A9%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0-G%E6%A4%9C%E5%AE%9A-%E3%82%B8%E3%82%A7%E3%83%8D%E3%83%A9%E3%83%AA%E3%82%B9%E3%83%88-%E5%85%AC%E5%BC%8F%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88/dp/4798157554)
Reinforcement learning (RL) is a machine learning method in which a system learns optimal control through its own trial and error.
AlphaGo is a famous example.
In terms of classification, it is positioned as **distinct from both supervised and unsupervised learning**. You might think of it as unsupervised learning since there are no labels, but it is treated as a separate category.
The figure below is a schematic diagram. Given an environment, the model learns to optimize the actions it takes so as to maximize the target reward (score). What counts as an action depends on the game you want it to learn.
Research in this area was active in the 1990s, but it was difficult to decide how to represent the "state" and how to connect that "state" to an "action", and the momentum of research apparently declined in the 2000s.
In response to this issue, in 2013 DeepMind released a model that learned to play Breakout through **reinforcement learning combined with a convolutional neural network (CNN)**, and it received a great response.
The method used there is called DQN (Deep Q-Network) because it combines **Q-learning**, a reinforcement learning method, with deep learning. **However, this article uses plain Q-learning.** I hope to try an implementation with DQN next time.
This time we will deal with **CartPole**, which is to reinforcement learning what MNIST is to image classification. A pole is mounted on a cart that moves in one dimension (left and right), and the task is to learn how to move the cart so that the pole does not fall over.
It's very simple. **The state consists of the following four parameters**:

- Cart position
- Cart velocity
- Pole angle
- Pole angular velocity

On the other hand, the **action** is one of the following two:

- Push the cart to the left (0)
- Push the cart to the right (1)
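These can also be checked directly from the Gym environment object. The following is just a quick confirmation sketch using the same `CartPole-v0` environment that appears in the code later in this article:

```python
import gym

env = gym.make('CartPole-v0')

# Observation space: a Box of 4 continuous values
# (cart position, cart velocity, pole angle, pole angular velocity)
print(env.observation_space.shape)  # (4,)

# Action space: 2 discrete actions (0 = push left, 1 = push right)
print(env.action_space.n)           # 2
```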
A reward of 1 is given for each time step as long as the pole does not fall. One episode is defined as the period until the end of one game.
First, let's see what happens if we simply pick 0 or 1 at random with `np.random.choice([0, 1])`, without learning anything.
```python
import gym
import numpy as np

env = gym.make('CartPole-v0')

goal_average_steps = 195          # Clear condition: average of 195 steps or more
max_number_of_steps = 200         # Maximum number of steps per episode
num_consecutive_iterations = 100  # Number of episodes used for the running average
num_episodes = 200                # Total number of episodes
last_time_steps = np.zeros(num_consecutive_iterations)

for episode in range(num_episodes):
    # Environment initialization
    observation = env.reset()
    episode_reward = 0
    for t in range(max_number_of_steps):
        # Drawing CartPole
        env.render()
        # Random choice of action
        action = np.random.choice([0, 1])
        # Take action and get feedback
        observation, reward, done, info = env.step(action)
        episode_reward += reward
        if done:
            print('%d Episode finished after %d time steps / mean %f' % (episode, t + 1,
                  last_time_steps.mean()))
            last_time_steps = np.hstack((last_time_steps[1:], [episode_reward]))
            break
    # Success if the mean of the last 100 episodes is 195 steps or higher
    if last_time_steps.mean() >= goal_average_steps:
        print('Episode %d train agent successfully!' % episode)
        break
```
Then it will be displayed as follows.
```
185 Episode finished after 21 time steps / mean 21.350000
186 Episode finished after 23 time steps / mean 21.390000
187 Episode finished after 22 time steps / mean 21.510000
188 Episode finished after 39 time steps / mean 21.420000
189 Episode finished after 13 time steps / mean 21.320000
190 Episode finished after 9 time steps / mean 21.160000
191 Episode finished after 26 time steps / mean 20.980000
192 Episode finished after 17 time steps / mean 21.100000
193 Episode finished after 94 time steps / mean 21.120000
194 Episode finished after 15 time steps / mean 21.870000
195 Episode finished after 26 time steps / mean 21.880000
196 Episode finished after 13 time steps / mean 21.970000
197 Episode finished after 13 time steps / mean 21.940000
198 Episode finished after 31 time steps / mean 21.760000
199 Episode finished after 23 time steps / mean 21.950000
```
You can see that each episode ends after only a few dozen time steps. Since the agent is not learning anything, the number of steps it survives does not increase.
Now, let's train it with Q-learning.
The key idea is to define an index for selecting the next action to take in a given state. This value is called the Q value. In the program it is held as a matrix called `q_table`, structured as follows.
In this case, each of the four cart and pole state variables is discretized into four bins, giving $4^4 = 256$ states. Combined with the $2$ left/right actions, this forms a $256 \times 2$ matrix.
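The code below uses a `digitize_state` function and a `q_table` that are not shown in this section (they are in the full program linked at the end). For reference, here is a minimal sketch of how they could look, assuming each observation variable is clipped to a rough range and split into four bins; the bin ranges here are my own assumptions, not values taken from the original program:

```python
import numpy as np

num_digitized = 4  # Number of bins per state variable (assumed)

# Q table: 4^4 = 256 discretized states x 2 actions, randomly initialized
q_table = np.random.uniform(low=-1, high=1, size=(num_digitized**4, 2))

def bins(clip_min, clip_max, num):
    # Interior bin edges for np.digitize
    return np.linspace(clip_min, clip_max, num + 1)[1:-1]

def digitize_state(observation):
    # Map the 4 continuous observation values to a single index in [0, 255]
    cart_pos, cart_v, pole_angle, pole_v = observation
    digitized = [
        np.digitize(cart_pos,   bins=bins(-2.4, 2.4, num_digitized)),
        np.digitize(cart_v,     bins=bins(-3.0, 3.0, num_digitized)),
        np.digitize(pole_angle, bins=bins(-0.5, 0.5, num_digitized)),
        np.digitize(pole_v,     bins=bins(-2.0, 2.0, num_digitized)),
    ]
    return sum(x * (num_digitized**i) for i, x in enumerate(digitized))
```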
We update this Q value after each action. Actions with larger Q values are more likely to be selected, so raising the Q values of actions that lead to higher rewards gradually improves the behavior.
The Q value is updated as follows:

$$
Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\left(R(s, a) + \gamma \max_{a'} Q(s', a')\right)
$$

Here, $s$ is the current state, $a$ is the action taken, $s'$ is the next state, $a'$ is an action in the next state, $R(s, a)$ is the reward, $\alpha$ is the learning rate, and $\gamma$ is the discount rate.
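As a concrete example with the values used in the code below ($\alpha = 0.2$, $\gamma = 0.99$) and made-up Q values: if $Q(s, a) = 0.5$, the reward is $1$, and $\max_{a'} Q(s', a') = 0.6$, then

$$
Q(s, a) \leftarrow 0.8 \times 0.5 + 0.2 \times (1 + 0.99 \times 0.6) = 0.4 + 0.2 \times 1.594 \approx 0.72
$$

so the Q value of this state-action pair rises from 0.5 to about 0.72.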
When this is implemented, it will be as follows.
```python
def get_action(state, action, observation, reward):
    next_state = digitize_state(observation)
    next_action = np.argmax(q_table[next_state])
    # Update Q table
    alpha = 0.2   # Learning rate
    gamma = 0.99  # Discount rate
    q_table[state, action] = (1 - alpha) * q_table[state, action] +\
        alpha * (reward + gamma * q_table[next_state, next_action])
    return next_action, next_state

for episode in range(num_episodes):
    # Environment initialization
    observation = env.reset()
    state = digitize_state(observation)
    action = np.argmax(q_table[state])
    episode_reward = 0
    for t in range(max_number_of_steps):
        # Drawing CartPole
        # env.render()
        # Take action and get feedback
        observation, reward, done, info = env.step(action)
        # Choose the next action and update the Q table
        action, state = get_action(state, action, observation, reward)
        episode_reward += reward
        if done:
            print('%d Episode finished after %d time steps / mean %f' % (episode, t + 1,
                  last_time_steps.mean()))
            last_time_steps = np.hstack((last_time_steps[1:], [episode_reward]))
            break
    # Success if the mean of the last 100 episodes is 195 steps or higher
    if last_time_steps.mean() >= goal_average_steps:
        print('Episode %d train agent successfully!' % episode)
        break
```
With this, learning actually converges in about 100 episodes and the agent succeeds.
In this article I went over the outline of reinforcement learning and played with CartPole. Next, I would also like to take on DQN.
The full program is below. https://github.com/Fumio-eisan/RL20200527