I learned the basics of reinforcement learning and played with Cart Pole (implementing simple Q Learning)

Introduction

I have been studying machine learning and statistics, but reinforcement learning was an area I had not covered well, so I deepened my understanding by implementing it. The materials I referred to are as follows.

Reinforcement learning overview

Reinforcement learning (RL) is a machine learning method in which a system learns optimal control by itself through trial and error.

AlphaGo is a famous example.

In terms of classification, it is positioned as **different from both supervised and unsupervised learning**. It resembles unsupervised learning in that no labeled answers are given, but the use of a reward signal sets it apart.

The figure below is a schematic diagram. Within a given environment, the model learns to optimize the actions it takes so as to maximize the target reward (score).

[Figure: schematic diagram of reinforcement learning]

What this "behavior" and "state" actually are depends on the game you want the agent to learn.

Research itself was conducted actively in the 1990s, but it was difficult to decide how to represent the "state" and how to connect that state to an "action", and the momentum of research seems to have declined in the 2000s.

Deep reinforcement learning

In response to this issue, in 2013 DeepMind released results in which **reinforcement learning combined with a convolutional neural network (CNN)** learned to play Breakout, and it attracted a great deal of attention.

The method used there is called DQN (Deep Q-Network) because it combines the reinforcement learning method **Q-learning** with deep learning. **However, this article uses plain Q-learning.** I hope to try an implementation with DQN next time.

Implement Cart Pole

This time we will work with **Cart Pole**, which is to reinforcement learning roughly what MNIST is to image classification. A pole stands on a cart that moves in one dimension (left and right), and the task is to learn how to move the cart so that the pole does not fall over.

[Figure: the Cart Pole environment]

It is very simple. **The state consists of the following four parameters:**

- cart position
- cart velocity
- pole angle
- pole angular velocity

On the other hand, **the action is one of the following two:**

- 0: push the cart to the left
- 1: push the cart to the right

A reward of 1 is given for every time step the pole stays up. One episode is the period until one game ends (at most 200 steps here), and we treat learning as successful when the average number of steps over the last 100 episodes reaches 195 or more.
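
These states and actions can be checked directly from the gym environment. The snippet below is just a quick check and is not part of the original program:

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)   # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # Discrete(2): 0 = push cart to the left, 1 = push cart to the right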

First, just move it randomly

Let's start by choosing an action of 0 or 1 at random with np.random.choice([0, 1]), without thinking about anything. What happens then?

import gym
import numpy as np

env = gym.make('CartPole-v0')

goal_average_steps = 195
max_number_of_steps = 200
num_consecutive_iterations = 100
num_episodes = 200
last_time_steps = np.zeros(num_consecutive_iterations)

for episode in range(num_episodes):
    #Environment initialization
    observation = env.reset()

    episode_reward = 0
    for t in range(max_number_of_steps):
        #Drawing CartPole
        env.render()
        if env.viewer == None:
            env.render()
        #Random choice of action
        action = np.random.choice([0, 1])

        #Take action and get feedback
        observation, reward, done, info = env.step(action)
        episode_reward += reward

        if done:
            print('%d Episode finished after %d time steps / mean %f' % (episode, t + 1,
                last_time_steps.mean()))
            last_time_steps = np.hstack((last_time_steps[1:], [episode_reward]))
            break

    if (last_time_steps.mean() >= goal_average_steps): #Success if the last 100 episodes are 195 or higher
        print('Episode %d train agent successfully!' % episode)
        break

Running this produces output like the following.

185 Episode finished after 21 time steps / mean 21.350000
186 Episode finished after 23 time steps / mean 21.390000
187 Episode finished after 22 time steps / mean 21.510000
188 Episode finished after 39 time steps / mean 21.420000
189 Episode finished after 13 time steps / mean 21.320000
190 Episode finished after 9 time steps / mean 21.160000
191 Episode finished after 26 time steps / mean 20.980000
192 Episode finished after 17 time steps / mean 21.100000
193 Episode finished after 94 time steps / mean 21.120000
194 Episode finished after 15 time steps / mean 21.870000
195 Episode finished after 26 time steps / mean 21.880000
196 Episode finished after 13 time steps / mean 21.970000
197 Episode finished after 13 time steps / mean 21.940000
198 Episode finished after 31 time steps / mean 21.760000
199 Episode finished after 23 time steps / mean 21.950000

You can see that each episode ends after only a few dozen time steps. The count does not increase because nothing is being learned.

Learn with Q Learning

Now, let's learn with Q Learning.

The key point is to define an index for choosing the next action to take in a given state. This value is called the Q value. In the program it is held as q_table, a matrix with the following structure.

[Figure: structure of the Q table]

In this case, each of the four cart and pole observations described above is divided into four bins, giving $4^4 = 256$ states; together with the two actions (left and right), this forms a $256 \times 2$ matrix.

This Q value is updated after every action. The larger the Q value, the more likely the corresponding action is to be chosen, so raising the Q values in a way that increases the reward improves the behavior.
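
The code later in this article uses a q_table and a digitize_state function that are not shown in the excerpt (the author's versions are in the full program linked at the end). A minimal sketch, assuming each observation is clipped to an illustrative range and split into four bins, could look like this:

import numpy as np

# Q table: 4^4 = 256 discretized states x 2 actions, initialized with small random values
q_table = np.random.uniform(low=-1, high=1, size=(4 ** 4, 2))

def bins(clip_min, clip_max, num):
    # Boundaries that split [clip_min, clip_max] into `num` bins
    return np.linspace(clip_min, clip_max, num + 1)[1:-1]

def digitize_state(observation):
    # Convert the four continuous observations into a single integer in [0, 255].
    # The clipping ranges below are assumptions for illustration.
    cart_pos, cart_v, pole_angle, pole_v = observation
    digitized = [
        np.digitize(cart_pos,   bins=bins(-2.4, 2.4, 4)),
        np.digitize(cart_v,     bins=bins(-3.0, 3.0, 4)),
        np.digitize(pole_angle, bins=bins(-0.5, 0.5, 4)),
        np.digitize(pole_v,     bins=bins(-2.0, 2.0, 4)),
    ]
    return sum(x * (4 ** i) for i, x in enumerate(digitized))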

The Q value is updated according to:

Q(s, a) \leftarrow (1 - \alpha)Q(s, a) + \alpha \left( R(s, a) + \gamma \max_{a'} Q(s', a') \right)

Here, $s$ is the current state, $a$ is the action taken, $s'$ is the next state after the action, $R(s, a)$ is the reward obtained, $\alpha$ is the learning rate (0.2 in the code below), and $\gamma$ is the discount rate (0.99 in the code below).
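
As a concrete example with illustrative numbers: if the current $Q(s, a) = 0.5$, the reward is 1, and $\max_{a'} Q(s', a') = 0.6$, then with $\alpha = 0.2$ and $\gamma = 0.99$ the update gives $Q(s, a) = 0.8 \times 0.5 + 0.2 \times (1 + 0.99 \times 0.6) \approx 0.72$, i.e. the stored value moves partway toward the reward-backed target.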

Implemented in code (using the q_table and the digitize_state discretization sketched above), it looks like this.


def get_action(state, action, observation, reward):
    next_state = digitize_state(observation)
    next_action = np.argmax(q_table[next_state])

    #Update Q table
    alpha = 0.2
    gamma = 0.99
    q_table[state, action] = (1 - alpha) * q_table[state, action] +\
            alpha * (reward + gamma * q_table[next_state, next_action])

    return next_action, next_state

for episode in range(num_episodes):
    #Environment initialization
    observation = env.reset()

    state = digitize_state(observation)
    action = np.argmax(q_table[state])

    episode_reward = 0
    for t in range(max_number_of_steps):
        #Drawing CartPole (commented out to speed up learning)
        #env.render()

        #Take action and get feedback from the environment
        observation, reward, done, info = env.step(action)

        #Choose the next action and update the Q table
        action, state = get_action(state, action, observation, reward)
        episode_reward += reward

        if done:
            print('%d Episode finished after %d time steps / mean %f' % (episode, t + 1,
                last_time_steps.mean()))
            last_time_steps = np.hstack((last_time_steps[1:], [episode_reward]))
            break

    if (last_time_steps.mean() >= goal_average_steps): #Success if the last 100 episodes average 195 steps or more
        print('Episode %d train agent successfully!' % episode)
        break

With this, the learning actually converges in about 100 episodes and succeeds.
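
As a quick check after training (not part of the original excerpt), you can watch the learned greedy policy by following the argmax of the Q table without updating it:

# Run one episode with the learned greedy policy and render it
observation = env.reset()
state = digitize_state(observation)
for t in range(max_number_of_steps):
    env.render()
    action = np.argmax(q_table[state])
    observation, reward, done, info = env.step(action)
    state = digitize_state(observation)
    if done:
        print('Greedy run lasted %d time steps' % (t + 1))
        break
env.close()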

Conclusion

I went over an outline of reinforcement learning and played with Cart Pole. Next I would like to take on DQN.

The full program is below. https://github.com/Fumio-eisan/RL20200527
