For some reason, I wanted to solve a maze with reinforcement learning.
Before that, to study the basics, I bought the book "Reinforcement Learning with Python" (Machine Learning Startup Series) by Takahiro Kubo.
It is a fairly easy-to-understand book, but I still did not understand it very well on a first read...
I remembered hearing long ago that writing things out helps them sink in, so I have written up the contents of this book as I understood them. (There are a lot of quotations...)
There may be some mistakes, so I would appreciate it if you could point them out.
Speaking of reinforcement learning, AlphaGo and AlphaZero are famous, aren't they?
As a simpler example, there is the classic case of training a reinforcement learning agent on the Breakout game. DeepMind has released a video, so please take a look: https://www.youtube.com/watch?v=TmPfTpjtdgg
Did you get the rough idea? It is fine to think of it as learning driven by rewards.
- There is a reward (= correct answer) for each action, which makes it somewhat similar to supervised learning.
- Actions are evaluated from the perspective of maximizing the "sum of rewards".
- One run of the environment from start to finish is one episode (Episode).
- **Purpose of reinforcement learning: maximize the total reward obtained in one episode.**
- **What the reinforcement learning model learns:**
  1. How to evaluate actions
  2. How to choose actions based on that evaluation (the strategy)
In reinforcement learning, the given environment is assumed to follow a certain rule. That rule (property) is called the Markov property: "The transition destination state depends only on the immediately preceding state and the action taken there. The reward depends on the immediately preceding state and the transition destination."
Put simply: the transition destination depends only on the previous state and the action taken there → $T(s, a)$; the reward depends on the previous state and the transition destination → $R(s, s')$. Read on below for details.
The "transition destination" can be understood as the "next state".
An environment with the Markov property is called a Markov Decision Process (MDP). The four components of an MDP are as follows (a small code sketch of these pieces, together with the strategy, follows after this list):

- $s$: State
- $a$: Action
- $T$: Transition function (the probability of state transitions). A function that takes a state and an action as arguments and outputs the transition destination (next state) and the transition probability $P_{a}$.
  ※ Input: state $s$, action $a$ / Transition function: $T\left(s, a\right)$ / Output: transition destination $s'$, transition probability $P_{a}\left(s, s'\right)$
- $R$: Reward function (immediate reward). A function that takes the state and the transition destination as arguments and outputs a reward (in some formulations it also takes the action as an argument).
  ※ Input: state $s$, transition destination (next state) $s'$ / Reward function: $R\left(s, s'\right)$ / Output: immediate reward $r$

Besides these four components, there is the strategy:

- $\pi$: Strategy (Policy). A function that receives a state and outputs an action. Something that moves according to the strategy is called an agent.
- Learning in reinforcement learning means adjusting the parameters of the strategy so that it outputs appropriate actions according to the state.
- The strategy is the model in reinforcement learning.
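To make these pieces concrete, here is a minimal Python sketch of a toy two-state MDP. Everything in it (the names `transition_func`, `reward_func`, `policy`, the states, and the probabilities) is my own illustration, not code from the book:

```python
import random

# Hypothetical two-state MDP ("start" and "goal"); the states, probabilities,
# and rewards are toy choices for illustration only.

def transition_func(state, action):
    """T(s, a): returns {next_state: probability} for the given state and action."""
    if state == "start" and action == "go":
        return {"goal": 0.8, "start": 0.2}  # the move can fail with probability 0.2
    return {state: 1.0}                     # otherwise: stay where you are

def reward_func(state, next_state):
    """R(s, s'): immediate reward, determined by the previous state and the transition destination."""
    return 1.0 if next_state == "goal" else 0.0

def policy(state):
    """pi(s): the strategy, a function from state to action (here it always tries to move)."""
    return "go"

# One step of interaction: the agent (the thing acting on the strategy) chooses an
# action, the environment samples the next state from T and returns the reward from R.
s = "start"
a = policy(s)
probs = transition_func(s, a)
s_next = random.choices(list(probs), weights=list(probs.values()))[0]
print(s, a, "->", s_next, "reward:", reward_func(s, s_next))
```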
Reinforcement learning seeks to maximize the sum of rewards, so let me explain the formula for the sum of rewards.
The sum of rewards in an MDP is the sum of the immediate rewards.
If the episode ends at time $T$, the sum of rewards $G_{t}$ at time $t$ is defined as:
G_{t}:=r_{t+1}+r_{t+2}+r_{t+3}+\ldots +r_{T}
In practice, rewards further in the future are multiplied by a discount rate $\gamma$ so that they count for less, giving the discounted sum:

G_{t}:=r_{t+1}+\gamma r_{t+2}+\gamma^{2}r_{t+3}+\ldots+\gamma^{T-t-1}r_{T}=\sum^{T-t-1}_{k=0}\gamma^{k}r_{t+k+1}
- The discount rate $\gamma$ is between 0 and 1.
- The exponent on the discount rate grows for rewards at later times, so the further in the future a reward is, the more it is discounted.
- A value obtained in the future, discounted back by the discount rate, is called its present value.
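As a small illustration (my own, not from the book), the discounted sum can be computed directly from a list of immediate rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_{k=0}^{T-t-1} gamma**k * r_{t+k+1}.
    `rewards` holds the immediate rewards r_{t+1}, ..., r_T received after time t."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 1 that arrives three steps in the future is worth 0.9**2 = 0.81 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```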
When the above formula is expressed as a recursive formula:
- A recursive formula means that the quantity being defined, $G$, appears again (as $G_{t+1}$) in the expression that defines $G_{t}$.
G_{t}:=r_{t+1}+\gamma r_{t+2}+\gamma^{2}r_{t+3}+\ldots+\gamma^{T-t-1}r_{T}
=r_{t+1}+\gamma\left(r_{t+2}+\gamma r_{t+3}+\ldots+\gamma^{T-t-2}r_{T}\right)
=r_{t+1}+\gamma G_{t+1}
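The recursive form gives the same number as the direct sum. A minimal sketch, reusing the toy reward list from above:

```python
def discounted_return_recursive(rewards, gamma=0.9):
    """Recursive form: G_t = r_{t+1} + gamma * G_{t+1} (and 0 once the episode is over)."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

print(discounted_return_recursive([0.0, 0.0, 1.0]))  # 0.81, same as the direct sum
```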
- What is the expected reward (value) $G_{t}$?
- This $G_{t}$, the estimated "sum of rewards", is called the **expected reward** or the **value**. (The term "value" is used in the explanation below.)
- Calculating the value is called **value approximation**.
- This value evaluation corresponds to the **"how to evaluate actions"** part, one of the two things that reinforcement learning learns.
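As a loose illustration of value approximation (my own sketch, not the book's method): since future rewards are only observed by actually running episodes, one simple way to estimate a value is to average the discounted returns over many sampled episodes. The `sample_episode` helper below is hypothetical:

```python
import random

def estimate_value(sample_episode, gamma=0.9, n_episodes=1000):
    """Estimate the value as the average discounted return over many sampled episodes.
    `sample_episode` is any function that plays one episode and returns the
    list of immediate rewards r_1, ..., r_T."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode()
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))
    return total / n_episodes

# Toy episode generator: a reward of 1 arrives after a random number of steps.
def sample_episode():
    steps = random.randint(1, 5)
    return [0.0] * (steps - 1) + [1.0]

print(estimate_value(sample_episode))  # roughly the average of 0.9**(steps - 1)
```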
Applying the MDP components to the maze problem (a toy implementation follows this list):

- State $s$: the current position (coordinates, square, cell, etc.).
- Action $a$: move up / down / left / right (since it is a maze, there are only four directions).
- Transition function $T$: a function that receives the state $s$ and the action $a$ and returns the cells that can be moved to and the probability of moving to each (the transition probability $P_{a}$).
- Immediate reward $R$: a function that receives the state $s$ and the transition destination $s'$ and returns the reward $r$.
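Putting those correspondences into code, here is a minimal toy maze environment. The 4x4 layout, the `step` interface, the success probability, and the reward values are my own assumptions, not from the book:

```python
import random

class MazeEnv:
    """A tiny 4x4 grid maze following the MDP components above."""
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self):
        self.goal = (3, 3)
        self.state = (0, 0)

    def transition(self, state, action):
        """T(s, a): the intended move succeeds with probability 0.9, otherwise stay put."""
        dr, dc = self.ACTIONS[action]
        r, c = state[0] + dr, state[1] + dc
        nxt = (r, c) if 0 <= r < 4 and 0 <= c < 4 else state  # walls block movement
        return {nxt: 0.9, state: 0.1}

    def reward(self, state, next_state):
        """R(s, s'): 1 for reaching the goal, a small penalty per step otherwise."""
        return 1.0 if next_state == self.goal else -0.04

    def step(self, action):
        """Sample the next state from T, return it with the immediate reward from R."""
        probs = self.transition(self.state, action)
        next_state = random.choices(list(probs), weights=list(probs.values()))[0]
        r = self.reward(self.state, next_state)
        self.state = next_state
        return next_state, r, next_state == self.goal

# Run one episode with a random strategy and add up the rewards.
env = MazeEnv()
done, total = False, 0.0
while not done:
    _, r, done = env.step(random.choice(list(MazeEnv.ACTIONS)))
    total += r
print("sum of rewards in the episode:", total)
```

With an environment like this in place, learning amounts to replacing the random action choice with a strategy whose parameters are adjusted from the rewards, as described above.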
To recap the key points:

- One run of the environment from start to finish is one episode (Episode).
- **Purpose of reinforcement learning: maximize the total reward obtained in one episode.**
- What the reinforcement learning model learns: how to evaluate actions, and how to choose actions based on that evaluation (the strategy).
- The four components of an MDP are the state $s$, the action $a$, the transition function $T$ (which takes a state and an action and outputs the transition destination and the transition probability $P_{a}$), and the reward function $R$ (which takes the state and the transition destination, and in some formulations the action, and outputs the immediate reward).
- $\pi$ is the strategy: a function that receives a state and outputs an action. Something that moves according to the strategy is called an agent.
- $G_{t}$, the "sum of rewards", is called the expected reward or the value. Calculating the value is called value approximation, and this value evaluation is the "how to evaluate actions" part, one of the two things reinforcement learning learns.
References:

- [Machine Learning Startup Series: Reinforcement Learning with Python, From Introduction to Practice](https://www.amazon.co.jp/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%82%B9%E3%82%BF%E3%83%BC%E3%83%88%E3%82%A2%E3%83%83%E3%83%97%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA-Python%E3%81%A7%E5%AD%A6%E3%81%B6%E5%BC%B7%E5%8C%96%E5%AD%A6%E7%BF%92-%E5%85%A5%E9%96%80%E3%81%8B%E3%82%89%E5%AE%9F%E8%B7%B5%E3%81%BE%E3%81%A7-KS%E6%83%85%E5%A0%B1%E7%A7%91%E5%AD%A6%E5%B0%82%E9%96%80%E6%9B%B8-%E4%B9%85%E4%BF%9D/dp/4065142989) by Takahiro Kubo
- Math Webmemo: converts handwritten formulas to LaTeX! Seriously recommended!