Aidemy 2020/11/21
Hello, this is Yope! I come from a purely humanities background, but I became interested in the possibilities of AI, so I enrolled at the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, and I am summarizing it on Qiita. I am very happy that many people have read my previous summary articles. Thank you! This is my first post on deep reinforcement learning. Nice to meet you.
What to learn this time
・(Review) Reinforcement learning
・Reinforcement learning methods
・DQN
Review of reinforcement learning
・Reinforcement learning is one method of machine learning.
・The components of reinforcement learning are as follows: the __agent__, which is the subject that acts; the __environment__, which is the target of the action; the __action__, which acts on the environment; and the __state__, which is each element of the environment that changes as a result. In addition, the __reward__ is the evaluation obtained immediately from an action, and the __return__ is the total amount of reward finally obtained.
・The purpose of reinforcement learning is to maximize this return.
・As a model of reinforcement learning, the __policy__ by which the agent selects actions is expressed as __"input the current state of the environment"__ and __"output an action"__. The action chosen is the one expected to give a higher reward.
・Regarding this "higher reward": if all rewards were known in advance, the action with the highest reward could simply be selected, but in reality they are rarely given beforehand. In such a case, it is necessary to gather information by trying actions that have never been selected, which is called __"exploration"__. After gathering information in this way, it is advisable to select the action estimated to be the most rewarding, which is called __"exploitation"__. (A small sketch of the agent-environment loop follows below.)
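To make the relationship between agent, environment, state, action, reward, and return concrete, here is a minimal sketch of an interaction loop. The `SimpleEnv` environment and the random policy are hypothetical placeholders for illustration only, not part of any particular library.

```python
import random

class SimpleEnv:
    """Hypothetical 1-D environment: the agent moves left/right and is rewarded at position 5."""
    def __init__(self):
        self.state = 0
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):          # action: -1 (left) or +1 (right)
        self.state += action
        reward = 1.0 if self.state == 5 else 0.0
        done = self.state == 5
        return self.state, reward, done

env = SimpleEnv()
state = env.reset()
total_return = 0.0                   # the "return" is the total reward obtained
for t in range(20):
    action = random.choice([-1, 1])  # a (random) policy: state in, action out
    state, reward, done = env.step(action)
    total_return += reward
    if done:
        break
print("return:", total_return)
```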
・With respect to the question of how to carry out the exploration and exploitation described above, it is important to adopt a policy suited to the problem.
・For example, when the expected values of all rewards are known, the best choice is the __"greedy method"__, which selects only the action with the highest expected value.
・However, as mentioned above, there are in general few cases where all rewards are known, so it is necessary to sometimes select another action even when its known reward is small. One such policy is the __"ε-greedy method"__. This explores with probability ε and exploits with probability 1-ε. By reducing the value of ε as the number of trials increases, the exploitation rate rises and the search can be carried out efficiently. (A small code sketch follows below.)
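A minimal sketch of ε-greedy action selection. The Q-value table and the decay schedule here are illustrative assumptions.

```python
import random

q_values = {0: 0.2, 1: 0.5, 2: 0.1}   # assumed expected rewards for three actions

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action), otherwise exploit (best action)."""
    if random.random() < epsilon:
        return random.choice(list(q_values.keys()))   # exploration
    return max(q_values, key=q_values.get)            # exploitation

epsilon = 1.0
for step in range(100):
    action = epsilon_greedy(q_values, epsilon)
    epsilon = max(0.01, epsilon * 0.95)   # decay epsilon so exploitation increases over time
```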
・The ε-greedy method was a policy that selects actions with a certain probability. A similar policy is __"Boltzmann selection"__.
・Boltzmann selection is called this way because the selection probability follows the __Boltzmann distribution__ shown below.
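For reference, the Boltzmann (softmax) selection probability is commonly written as follows, where Q(s, a) is the action value and T is the temperature (this notation is assumed here):

```math
P(a_i \mid s) = \frac{\exp\bigl(Q(s, a_i)/T\bigr)}{\sum_{j} \exp\bigl(Q(s, a_j)/T\bigr)}
```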
・In this formula, __T__ is called the __temperature function__, and it is a __function that converges to 0 with the passage of time__. In the limit __T → ∞__, all actions are selected with the same probability, and in the limit __T → 0__, the action with the maximum expected reward becomes easier to select.
・In other words, since T is large at the beginning, the action selection is nearly random, but as T approaches 0 with the passage of time, the selection comes to behave like the greedy method. (See the sketch below.)
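A minimal sketch of Boltzmann selection, assuming a small list of Q-values and a simple exponential temperature decay (both are illustrative assumptions):

```python
import math
import random

def boltzmann_select(q_values, temperature):
    """Select an action index with probability proportional to exp(Q / T)."""
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

q_values = [0.2, 0.5, 0.1]       # assumed action values
temperature = 10.0               # large T at first -> nearly uniform (random) selection
for step in range(100):
    action = boltzmann_select(q_values, temperature)
    temperature = max(0.01, temperature * 0.95)   # T -> 0 over time -> greedy-like selection
```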
DQN
・DQN expresses the __Q function__ of Q-learning with __deep learning__. The __Q function__ is the __"action value function"__, and __Q-learning__ is a reinforcement learning algorithm that estimates it.
・The __action value function__ is a function that takes __"state s and action a"__ as input and outputs the expected value of the reward obtained when the optimal policy is followed thereafter. In Q-learning, the function is updated little by little (adjusted by the learning rate) using the __difference__ between the current action value and the sum of the reward obtained by performing an action and the action value of the best action available in the next state. (A sketch of this update follows below.)
・Normally, the state s and the action a are represented by a __table function__ covering all combinations, but depending on the problem, the number of combinations can become enormous.
・In such a case, DQN approximates this Q function with a function learned by deep learning.
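A minimal sketch of the tabular Q-learning update described above. The learning rate, discount factor, and the example transition are illustrative assumptions.

```python
from collections import defaultdict

# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
# alpha: learning rate, gamma: discount factor (values below are illustrative)
ALPHA, GAMMA = 0.1, 0.99
ACTIONS = [0, 1]

Q = defaultdict(float)   # table function: one entry per (state, action) combination

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) toward the TD target by the learning rate."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_error = reward + GAMMA * best_next - Q[(state, action)]   # the "difference"
    Q[(state, action)] += ALPHA * td_error

# example: a single observed transition (s=0, a=1, r=1.0, s'=1)
q_update(0, 1, 1.0, 1)
```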
・The characteristics of __DQN__ are as follows; see the next chapter for details. (A rough sketch of the target calculation and reward clipping follows below.)
・__Experience Replay__: shuffle the time series of the data to deal with time-series correlation.
・__Target Network__: the error from the target value (treated as the correct answer) is calculated, and the model is adjusted so that it approaches that target. A batch is created randomly from the data and __batch learning__ is performed.
・__CNN__: images are filtered and converted by __convolution__.
・__Clipping__: the reward is set to -1 if it is negative, +1 if it is positive, and 0 if there is none.
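A rough sketch of reward clipping and the target-network idea, assuming a placeholder `QNetwork` class (the class and its `predict` method are hypothetical stand-ins, not a real library API):

```python
import copy

def clip_reward(reward):
    """Reward clipping: negative -> -1, positive -> +1, none/zero -> 0."""
    return (reward > 0) - (reward < 0)

class QNetwork:
    """Placeholder for a deep Q-network."""
    def predict(self, state):
        return [0.0, 0.0]            # dummy Q-values for two actions

q_net = QNetwork()
target_net = copy.deepcopy(q_net)    # frozen copy used to compute the target ("correct answer")

def td_target(reward, next_state, gamma=0.99):
    """Target value computed with the frozen target network, using the clipped reward."""
    return clip_reward(reward) + gamma * max(target_net.predict(next_state))

# every N steps the target network is synced with the learned network, e.g.:
# target_net = copy.deepcopy(q_net)
```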
Experience Replay
・For example, the inputs obtained while the agent plays a game have a __time-series__ nature. Since time-series inputs are strongly correlated with each other, using them for learning as they are biases the learning result and makes convergence poor. The solution to this is called Experience Replay. This is a method in which states, actions, and rewards are recorded as they are observed (all of them, or up to a fixed number), and are later retrieved __at random__ for learning. (A sketch of a replay buffer follows below.)
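A minimal sketch of a replay buffer. The capacity, batch size, and dummy transitions are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Record transitions (state, action, reward, next_state) up to a fixed capacity,
    then sample them at random to break the time-series correlation."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest entries are dropped automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)   # random minibatch, not in time order

buffer = ReplayBuffer()
for t in range(100):
    buffer.add(t, 0, 0.0, t + 1)      # dummy transitions just to illustrate the interface
batch = buffer.sample(32)
```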
Summary
・In reinforcement learning, __exploration__ and __exploitation__ are performed in order to maximize the total __return__. How to do this is the policy.
・As such a policy, the __"greedy method"__ is effective when the expected values of the rewards are known. It selects only the action with the highest expected value.
・The __"ε-greedy method"__ handles the case where not all expected reward values are known. This policy explores with probability ε and exploits with probability 1-ε.
・A similar policy is __Boltzmann selection__. Because actions are selected according to the Boltzmann distribution using a temperature function T that converges to 0 over time, actions are chosen randomly at first, but over time the action with the highest expected value comes to be selected.
・DQN expresses the __Q function (action value function)__ with __deep learning__. The action value function calculates the expected value of the reward from the input state s and action a, but this method is used because the number of combinations of s and a becomes enormous if all of them are expressed with a table function.
・One of the features of DQN is __"Experience Replay"__. This randomly retrieves recorded states, actions, and rewards in order to remove the __time-series__ nature of the input data.
That's all for this time. Thank you for reading this far.