For some reason, I wanted to solve a maze with reinforcement learning.
Before that, to study the basics, I bought the book "Reinforcement Learning with Python" (Machine Learning Startup Series) by Takahiro Kubo.
It is a fairly easy-to-understand book, but I still did not understand it very well on a first read...
I remembered hearing long ago that writing things out helps them sink in, so I have written up the contents of this book as I understood them. (There are a lot of quotations...)
There may be some mistakes, so I would appreciate it if you could point them out.
Speaking of reinforcement learning, AlphaGo and AlphaZero are famous, aren't they?
As a simpler example, there is the classic case of training a reinforcement learning agent on the Breakout game. DeepMind has released a video, so please take a look: https://www.youtube.com/watch?v=TmPfTpjtdgg
Did you get the rough idea? It is fine to think of it as learning driven by rewards.
- There is a reward (= correct answer) for each action, which makes it somewhat similar to supervised learning.
- Actions are evaluated from the perspective of maximizing the "sum of rewards".
- One run of the environment from start to finish is one episode (Episode).
- **Purpose of reinforcement learning: maximize the total reward obtained in one episode.**
- **What the reinforcement learning model learns:**
  1. How to evaluate actions
  2. How to choose actions based on that evaluation (the strategy)
In reinforcement learning, the given environment is assumed to follow a certain rule. That rule (property) is called the Markov property: "The transition destination state depends only on the immediately preceding state and the action taken there. The reward depends on the immediately preceding state and the transition destination."
Put simply: the transition destination depends only on the previous state and the action taken there → $T(s, a)$; the reward depends on the previous state and the transition destination → $R(s, s')$. Read on below for details.
The "transition destination" can be understood as the "next state".
An environment with the Markov property is called a Markov Decision Process (MDP). The four components of an MDP are as follows (a small code sketch of these pieces, together with the strategy, follows after this list):

- $s$: State
- $a$: Action
- $T$: Transition function (the probability of state transitions). A function that takes a state and an action as arguments and outputs the transition destination (next state) and the transition probability $P_{a}$.
  ※ Input: state $s$, action $a$ / Transition function: $T\left(s, a\right)$ / Output: transition destination $s'$, transition probability $P_{a}\left(s, s'\right)$
- $R$: Reward function (immediate reward). A function that takes the state and the transition destination as arguments and outputs a reward (in some formulations it also takes the action as an argument).
  ※ Input: state $s$, transition destination (next state) $s'$ / Reward function: $R\left(s, s'\right)$ / Output: immediate reward $r$

Besides these four components, there is the strategy:

- $\pi$: Strategy (Policy). A function that receives a state and outputs an action. Something that moves according to the strategy is called an agent.
- Learning in reinforcement learning means adjusting the parameters of the strategy so that it outputs appropriate actions according to the state.
- The strategy is the model in reinforcement learning.
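To make these pieces concrete, here is a minimal Python sketch of a toy two-state MDP. Everything in it (the names `transition_func`, `reward_func`, `policy`, the states, and the probabilities) is my own illustration, not code from the book:

```python
import random

# Hypothetical two-state MDP ("start" and "goal"); the states, probabilities,
# and rewards are toy choices for illustration only.

def transition_func(state, action):
    """T(s, a): returns {next_state: probability} for the given state and action."""
    if state == "start" and action == "go":
        return {"goal": 0.8, "start": 0.2}  # the move can fail with probability 0.2
    return {state: 1.0}                     # otherwise: stay where you are

def reward_func(state, next_state):
    """R(s, s'): immediate reward, determined by the previous state and the transition destination."""
    return 1.0 if next_state == "goal" else 0.0

def policy(state):
    """pi(s): the strategy, a function from state to action (here it always tries to move)."""
    return "go"

# One step of interaction: the agent (the thing acting on the strategy) chooses an
# action, the environment samples the next state from T and returns the reward from R.
s = "start"
a = policy(s)
probs = transition_func(s, a)
s_next = random.choices(list(probs), weights=list(probs.values()))[0]
print(s, a, "->", s_next, "reward:", reward_func(s, s_next))
```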
Reinforcement learning seeks to maximize the sum of rewards, so let me explain the formula for the sum of rewards.
The sum of rewards in an MDP is the sum of the immediate rewards.
If the episode ends at time $T$, the sum of rewards $G_{t}$ at time $t$ is defined as:
G_{t}:=r_{t+1}+r_{t+2}+r_{t+3}+\ldots +r_{T}
In practice, rewards further in the future are multiplied by a discount rate $\gamma$ so that they count for less, giving the discounted sum:

G_{t}:=r_{t+1}+\gamma r_{t+2}+\gamma^{2}r_{t+3}+\ldots+\gamma^{T-t-1}r_{T}=\sum^{T-t-1}_{k=0}\gamma^{k}r_{t+k+1}
- The discount rate $\gamma$ is between 0 and 1.
- The exponent on the discount rate grows for rewards at later times, so the further in the future a reward is, the more it is discounted.
- A value obtained in the future, discounted back by the discount rate, is called its present value.
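As a small illustration (my own, not from the book), the discounted sum can be computed directly from a list of immediate rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_{k=0}^{T-t-1} gamma**k * r_{t+k+1}.
    `rewards` holds the immediate rewards r_{t+1}, ..., r_T received after time t."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 1 that arrives three steps in the future is worth 0.9**2 = 0.81 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```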
When the above formula is expressed as a recursive formula:
- A recursive formula means that the quantity being defined, $G$, appears again (as $G_{t+1}$) in the expression that defines $G_{t}$.
G_{t}:=r_{t+1}+\gamma r_{t+2}+\gamma^{2}r_{t+3}+\ldots+\gamma^{T-t-1}r_{T}
=r_{t+1}+\gamma\left(r_{t+2}+\gamma r_{t+3}+\ldots+\gamma^{T-t-2}r_{T}\right)
=r_{t+1}+\gamma G_{t+1}
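The recursive form gives the same number as the direct sum. A minimal sketch, reusing the toy reward list from above:

```python
def discounted_return_recursive(rewards, gamma=0.9):
    """Recursive form: G_t = r_{t+1} + gamma * G_{t+1} (and 0 once the episode is over)."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

print(discounted_return_recursive([0.0, 0.0, 1.0]))  # 0.81, same as the direct sum
```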
- What is the expected reward (value) $G_{t}$?
- This $G_{t}$, the estimated "sum of rewards", is called the **expected reward** or the **value**. (The term "value" is used in the explanation below.)
- Calculating the value is called **value approximation**.
- This value evaluation corresponds to the **"how to evaluate actions"** part, one of the two things that reinforcement learning learns.
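As a loose illustration of value approximation (my own sketch, not the book's method): since future rewards are only observed by actually running episodes, one simple way to estimate a value is to average the discounted returns over many sampled episodes. The `sample_episode` helper below is hypothetical:

```python
import random

def estimate_value(sample_episode, gamma=0.9, n_episodes=1000):
    """Estimate the value as the average discounted return over many sampled episodes.
    `sample_episode` is any function that plays one episode and returns the
    list of immediate rewards r_1, ..., r_T."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode()
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))
    return total / n_episodes

# Toy episode generator: a reward of 1 arrives after a random number of steps.
def sample_episode():
    steps = random.randint(1, 5)
    return [0.0] * (steps - 1) + [1.0]

print(estimate_value(sample_episode))  # roughly the average of 0.9**(steps - 1)
```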
Applying the MDP components to the maze problem (a toy implementation follows this list):

- State $s$: the current position (coordinates, square, cell, etc.).
- Action $a$: move up / down / left / right (since it is a maze, there are only four directions).
- Transition function $T$: a function that receives the state $s$ and the action $a$ and returns the cells that can be moved to and the probability of moving to each (the transition probability $P_{a}$).
- Immediate reward $R$: a function that receives the state $s$ and the transition destination $s'$ and returns the reward $r$.
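Putting those correspondences into code, here is a minimal toy maze environment. The 4x4 layout, the `step` interface, the success probability, and the reward values are my own assumptions, not from the book:

```python
import random

class MazeEnv:
    """A tiny 4x4 grid maze following the MDP components above."""
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self):
        self.goal = (3, 3)
        self.state = (0, 0)

    def transition(self, state, action):
        """T(s, a): the intended move succeeds with probability 0.9, otherwise stay put."""
        dr, dc = self.ACTIONS[action]
        r, c = state[0] + dr, state[1] + dc
        nxt = (r, c) if 0 <= r < 4 and 0 <= c < 4 else state  # walls block movement
        return {nxt: 0.9, state: 0.1}

    def reward(self, state, next_state):
        """R(s, s'): 1 for reaching the goal, a small penalty per step otherwise."""
        return 1.0 if next_state == self.goal else -0.04

    def step(self, action):
        """Sample the next state from T, return it with the immediate reward from R."""
        probs = self.transition(self.state, action)
        next_state = random.choices(list(probs), weights=list(probs.values()))[0]
        r = self.reward(self.state, next_state)
        self.state = next_state
        return next_state, r, next_state == self.goal

# Run one episode with a random strategy and add up the rewards.
env = MazeEnv()
done, total = False, 0.0
while not done:
    _, r, done = env.step(random.choice(list(MazeEnv.ACTIONS)))
    total += r
print("sum of rewards in the episode:", total)
```

With an environment like this in place, learning amounts to replacing the random action choice with a strategy whose parameters are adjusted from the rewards, as described above.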
To recap the key points:

- One run of the environment from start to finish is one episode (Episode).
- **Purpose of reinforcement learning: maximize the total reward obtained in one episode.**
- What the reinforcement learning model learns: how to evaluate actions, and how to choose actions based on that evaluation (the strategy).
- The four components of an MDP are the state $s$, the action $a$, the transition function $T$ (which takes a state and an action and outputs the transition destination and the transition probability $P_{a}$), and the reward function $R$ (which takes the state and the transition destination, and in some formulations the action, and outputs the immediate reward).
- $\pi$ is the strategy: a function that receives a state and outputs an action. Something that moves according to the strategy is called an agent.
- $G_{t}$, the "sum of rewards", is called the expected reward or the value. Calculating the value is called value approximation, and this value evaluation is the "how to evaluate actions" part, one of the two things reinforcement learning learns.
References:

- [Machine Learning Startup Series: Reinforcement Learning with Python, From Introduction to Practice](https://www.amazon.co.jp/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%82%B9%E3%82%BF%E3%83%BC%E3%83%88%E3%82%A2%E3%83%83%E3%83%97%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA-Python%E3%81%A7%E5%AD%A6%E3%81%B6%E5%BC%B7%E5%8C%96%E5%AD%A6%E7%BF%92-%E5%85%A5%E9%96%80%E3%81%8B%E3%82%89%E5%AE%9F%E8%B7%B5%E3%81%BE%E3%81%A7-KS%E6%83%85%E5%A0%B1%E7%A7%91%E5%AD%A6%E5%B0%82%E9%96%80%E6%9B%B8-%E4%B9%85%E4%BF%9D/dp/4065142989) by Takahiro Kubo
- Math Webmemo: converts handwritten formulas to LaTeX! Seriously recommended!