Reinforcement learning is fun, isn't it? This time I played with "Mountain Car" in the Python toolkit "OpenAI Gym", so let me introduce it. By the way, I use Google Colab.
I referred to this article quite a bit: Introduction to OpenAI Gym.
First, a quick review of Q-learning as the learning method. If you already know all this, feel free to skip ahead.
In Q-learning, $Q\left(s_{t}, a_{t}\right)$ is called the state-action value, and it represents the value of taking action $a_{t}$ in a given state $s_{t}$. The subscript $t$ used here does not mean time; it simply labels one particular state. The value here is not the immediate reward you receive when the state changes, but the cumulative reward you receive by the time the episode is played out to the end.
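Written out as a formula (a standard definition added here for reference, not notation taken from the referenced article; $\gamma$ is the discount factor, set to 0.99 in the code later, and $r$ is the reward received at each step), the quantity being estimated is roughly the expected cumulative reward from that point on:

$$
Q(s_{t}, a_{t}) = \mathbb{E}\left[\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \,\middle|\, s_{t}, a_{t} \right]
$$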
Therefore, as a policy, in a given state $s_{t}$ you should select the action $a$ that achieves $\max_{a \in A_{t}} Q(s_{t}, a)$.
In general, the update of the state-action value is expressed as follows.
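Concretely, the standard tabular Q-learning update (written here so that it matches the learn() method further down; $\alpha$ is the learning rate and $\gamma$ the discount factor) is:

$$
Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_{t}, a_{t}) \right)
$$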
Q-learning repeats episodes (games), updating this state-action value function each time, in order to find the optimal policy.
In this environment, the game ends when the car reaches the flag on the right. Until it does, every action you take yields a reward of -1. If the goal has not been reached after 200 actions, the episode ends anyway, so in that case you finish with a total reward of -200. In other words, the point of reinforcement learning here is to earn a total reward greater than -200.
Actions are limited to three: 0 (push left), 1 (do nothing), 2 (push right); a quick check of this appears right after the import block below.
I installed various packages in Google Colab while describing the setup, but in the end the only one you really need is gym.
#gym installation
$ pip install gym
# Not required (recommended when using Colab)
$ apt update
$ apt install xvfb
$ apt-get -qq -y install libcusparse8.0 libnvrtc8.0 libnvtoolsext1 > /dev/null
$ ln -snf /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so.8.0 /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so
$ apt-get -qq -y install xvfb freeglut3-dev ffmpeg > /dev/null
$ pip install pyglet
$ pip install pyopengl
$ pip install pyvirtualdisplay
Some of these are not strictly needed, so pick whichever ones you need.
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML
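As a quick sanity check of the reward structure and action space described above (a minimal sketch assuming the classic gym API, where reset() returns an observation and step() returns a 4-tuple):

import gym

env = gym.make('MountainCar-v0')
print(env.action_space)       # Discrete(3): 0 = push left, 1 = no push, 2 = push right
print(env.observation_space)  # Box(2,): [position, velocity]

# A purely random policy almost never reaches the flag, so the episode
# times out after 200 steps and the total reward comes out to -200.
obs = env.reset()
total, done = 0, False
while not done:
    obs, reward, done, _ = env.step(env.action_space.sample())
    total += reward
print(total)  # typically -200.0
env.close()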
For the available actions and how to use the environment, I referred to GitHub: GitHub MountainCar.
class Q:
    def __init__(self, env):
        self.env = env
        self.env_low = self.env.observation_space.low    # minimum position and velocity
        self.env_high = self.env.observation_space.high  # maximum position and velocity
        self.env_dx = (self.env_high - self.env_low) / 40  # divide each dimension into 40 bins
        self.q_table = np.zeros((40, 40, 3))

    def get_status(self, _observation):
        # discretize the continuous observation into (position index, velocity index)
        position = int((_observation[0] - self.env_low[0]) / self.env_dx[0])
        velocity = int((_observation[1] - self.env_low[1]) / self.env_dx[1])
        return position, velocity

    def policy(self, s, epsilon=0.1):
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.random() <= epsilon:
            return np.random.randint(3)
        else:
            p, v = self.get_status(s)
            if self.q_table[p][v][0] == 0 and self.q_table[p][v][1] == 0 and self.q_table[p][v][2] == 0:
                # no estimate yet for this state, so act randomly
                return np.random.randint(3)
            else:
                return np.argmax(self.q_table[p][v])

    def learn(self, time=5000, alpha=0.4, gamma=0.99):
        log = []
        for j in range(time):
            total = 0
            s = self.env.reset()
            done = False
            while not done:
                a = self.policy(s)
                next_s, reward, done, _ = self.env.step(a)
                total += reward
                # target: reward plus the discounted best value of the next state
                p, v = self.get_status(next_s)
                G = reward + gamma * max(self.q_table[p][v])
                p, v = self.get_status(s)
                self.q_table[p][v][a] += alpha * (G - self.q_table[p][v][a])
                s = next_s
            log.append(total)
            if j % 100 == 0:
                print(str(j) + " ===total reward=== : " + str(total))
        return plt.plot(log)

    def show(self):
        # render one episode with the learned policy (on Colab this needs a virtual display)
        s = self.env.reset()
        img = plt.imshow(self.env.render('rgb_array'))
        done = False
        while not done:
            p, v = self.get_status(s)
            s, _, done, _ = self.env.step(self.policy(s))
            display.clear_output(wait=True)
            img.set_data(self.env.render('rgb_array'))
            plt.axis('off')
            display.display(plt.gcf())
        self.env.close()
The state $s$ consists of the car's current position (position) and its current speed (velocity). Since both take continuous values, the get_status function discretizes them into a (40, 40) grid. q_table stores the state-action values; its shape is the (40, 40) state grid times the three possible actions, i.e. (40, 40, 3), as in the small sketch below.
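As a small illustration of the discretization (a minimal sketch using MountainCar's actual observation bounds, position in [-1.2, 0.6] and velocity in [-0.07, 0.07]; the sample observations are hand-picked):

import numpy as np

env_low = np.array([-1.2, -0.07])   # observation_space.low
env_high = np.array([0.6, 0.07])    # observation_space.high
env_dx = (env_high - env_low) / 40  # width of one bin per dimension

def get_status(obs):
    position = int((obs[0] - env_low[0]) / env_dx[0])
    velocity = int((obs[1] - env_low[1]) / env_dx[1])
    return position, velocity

print(get_status([-0.5, 0.01]))   # -> (15, 22): near the bottom of the valley, small rightward velocity
print(get_status([-1.2, -0.07]))  # -> (0, 0): the lowest corner of the grid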
env = gym.make('MountainCar-v0')
agent = Q(env)
agent.learn()
After about 5,000 training episodes, the car reaches the goal more and more often.
By the way, if you want to watch the animation on Google Colab, running
from IPython import display
from pyvirtualdisplay import Display
import matplotlib.pyplot as plt
d = Display()  # start a virtual display so render() works on a headless Colab VM
d.start()
agent.show()
will display it inline.
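By the way, the Monitor wrapper imported at the top can also save the episode as an mp4 instead of drawing it frame by frame. A sketch under the assumption of an older gym version where gym.wrappers.Monitor still exists (the ./video directory name is arbitrary):

env = Monitor(gym.make('MountainCar-v0'), './video', force=True)
agent.env = env   # let the trained agent act in the recorded environment
agent.show()      # frames are written to ./video as an mp4 while the episode plays

# embed the recorded mp4 in the notebook, using the glob / io / base64 / HTML imports from the top
mp4 = glob.glob('./video/*.mp4')[-1]
video = io.open(mp4, 'rb').read()
HTML('<video controls><source src="data:video/mp4;base64,'
     + base64.b64encode(video).decode() + '" type="video/mp4"></video>')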
The explanation this time turned out rather rough, so I will keep updating it little by little. I will continue to share things as I play with Gym more. See you!