Aidemy 2020/11/15
Hello, this is Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I studied at the AI-specialized school "Aidemy". I would like to share the knowledge I gained there, and I am summarizing it on Qiita. I am very happy that many people have read the previous summary articles. Thank you! This is my first post on reinforcement learning. Nice to meet you.
What to learn this time
・What is reinforcement learning?
・Agents, environments, and rewards
・Policies for reinforcement learning
・Machine learning can be broadly divided into __supervised learning__, __unsupervised learning__, and __reinforcement learning__.
・Of these, the __reinforcement learning__ we study this time is a method aimed at __discovering the optimal behavior under given conditions__. In the case of a game, for example, reinforcement learning can discover how to win.
・Reinforcement learning is carried out on the premise that an __agent (player)__ and an __environment (stage)__ interact with each other.
・When the agent is in state s and takes action a on the environment, the environment returns a reward R as an evaluation of that action, and the agent transitions to the next state s'. The task proceeds by repeating this cycle.
・For example, in Super Mario, the agent "Mario" takes the action "avoid the enemy (by jumping)" in an environment of "go forward and aim for the goal, but the game ends if you hit an enemy". The environment returns the reward "the right to go further (the game does not end there)". By repeating this, Mario advances toward the goal.
・In reinforcement learning, it is not enough to maximize the on-the-spot __immediate reward__; the __return__, which also includes the __delayed rewards__ obtained afterward, must be maximized.
・In the Mario example, taking a coin is worthless on the spot, but if you recognize the value of collecting 100 coins to gain an extra life, then it can be said that acting so as to take as many coins as possible maximizes the return.
・The __N-armed bandit problem__ is a classic example used in reinforcement learning: "There are N slot machines, each paying a reward of 1 on a hit and 0 on a miss. The hit probability differs from machine to machine, but the user cannot see it. If the user can pull one machine per trial, how can the average reward per trial be maximized?"
・The obvious idea is to "pull the machine with the highest probability", but that machine first has to be found (guessed) through trials. That is the crux of this problem.
・In the following, we will learn about reinforcement learning using this example.
・More specifically, an __agent__ is "something that decides actions in the environment and influences the environment".
・In the N-armed bandit problem, the agent is the user, who decides which machine to pull, receives a reward, and makes the next decision.
・The rule by which the agent makes its next decision from the rewards obtained so far is called a __policy__. For example, if the policy in this problem were "always 1", the agent would always aim for a machine whose reward is 1.
・The purpose this time is to find the optimal policy for this agent.
・The following code is a function __randomselect()__ that implements the policy "choose a machine at random". It returns __slot_num__, the number of the machine to pull.
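Since the original code appears only as a screenshot, here is a minimal sketch of such a function; the names (randomselect, slot_num) follow the text above, and the number of machines (5) is an assumption.

```python
import numpy as np

def randomselect():
    """Choose one of the machines uniformly at random."""
    slot_num = np.random.randint(0, 5)  # 5 machines, numbered 0-4 (assumed)
    return slot_num
```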
・The __environment__ is the target on which the agent acts. Its role is to receive the action, observe the situation, send a reward back to the agent, and advance time by one step.
・The environment in this problem is "the process in which, when the agent pulls a certain machine, it hits or misses according to that machine's probability".
-The following code is the function __ "environment ()" __ that defines the environment. __ "coins_p" __ stores the probability of hitting each unit in a NumPy array, and "results" is __ "np.random.binomial (n, p)" __, the number of trials n, and the probability p binomial Use the distribution function to store the result of subtracting slots. This time, the number of trials is one and the probability is "coins_p", so you can specify __ (1, coins_p) __. As the environment function, the result of the passed "band_number" should be returned, so this is stored in "result" and returned.
・The __reward__ is an index that evaluates how desirable the agent's series of actions is.
・In this problem, the value (0 or 1) obtained by pulling a slot is the reward as it is. This is the __immediate reward__.
-The following code is the function __ "reward ()" __ that defines the reward. The result obtained by the environment function earlier is stored in "result", and this is stored in the "time" th, which indicates the current number of trials of the array "record" passed as an argument. -Also, in the list __ "results" __ which is __ [[(number of units), (number of selections), (total reward), (total reward ÷ number of selections)], ...] __ , Store each result.
・Furthermore, using these functions (or passing values yourself), decide "which machine to pull", "whether it hit or missed", "how many times to try", and so on, output the number of trials and results for each machine, and plot the transition of the average reward, as in the sketch below.
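A sketch of the whole experiment with the random policy; the total number of trials (10,000) and the plotting details are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

times = 10000                                   # total number of trials (assumed)
record = np.zeros(times)
results = [[i, 0, 0, 0.0] for i in range(5)]    # [machine, count, total, average]

for time in range(times):
    slot_num = randomselect()                   # policy: pick a machine at random
    record, results = reward(record, results, slot_num, time)

print(results)

# Transition of the average reward over the trials.
plt.plot(np.cumsum(record) / (np.arange(times) + 1))
plt.xlabel("trial")
plt.ylabel("average reward")
plt.show()
```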
・Result
・Looking at this result, since "which machine to choose" is completely random (1/5 each), the number of times each machine is selected is roughly equal, and the observed hit probabilities are almost the same as the set values. Also, as the graph shows, the average reward stays at around 0.5.
・Now that the agent, the environment, and the reward have been defined, the next step is to think about a __policy__ for "how to raise the average reward".
・As a basic policy, since there is no information at first, there is no choice but to select machines at random. After some trials, once the hit probability of each machine can be estimated, you find the machine with the highest probability and keep choosing it.
・Collecting information by repeating trials in this way is called __search__ (exploration), and finally continuing to select the machine judged best is called __use__ (exploitation).
・The simplest method for carrying out this policy is the __greedy method__. It simply and straightforwardly "selects the machine with the highest expected value based on the results so far".
・First, perform the search. Decide in advance to try each machine n times, and do this for all machines. Once this is done, calculate the expected reward $u_{i}$ of each machine at that point and select the largest one.
・The expected value $u_{i}$ is calculated as "(total reward of machine i) ÷ (number of trials of machine i)".
・Next, we define this method and run it using the environment and reward functions created in the previous sections.
・The greedy method can be defined as "keep slot_num on the same machine until it reaches the maximum number of search trials" and "once all machines have reached the maximum number of search trials, set slot_num to the machine with the highest expected value at that point".
・For the latter, the reason the code wraps the values in __np.array()__ is to use __argmax()__ to find the index of the maximum expected value, as in the sketch below.
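A minimal sketch of the greedy selection described above; the original code is a screenshot, so the structure below is an assumption built on the same results layout.

```python
import numpy as np

def greedy_select(results, slot_num, n):
    """Greedy policy: search every machine n times, then exploit the best one."""
    counts = np.array([row[1] for row in results])
    if np.any(counts < n):
        # Search phase: stay on the current machine until it reaches n trials,
        # then move on to the machine with the fewest trials so far.
        if results[slot_num][1] < n:
            return slot_num
        return int(np.argmin(counts))
    # Use phase: all machines reached n trials, pick the highest average reward.
    averages = np.array([row[3] for row in results])
    return int(np.argmax(averages))
```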
・Running the greedy method with concrete values
・Result![Screenshot 2020-11-11 12.20.26.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/d2b53fb9-583e-7d14-ce60-f265a667b2be.png)
・As this result shows, every machine is tried up to the search limit __n = 100__ given in advance, and from that point on, all remaining trials (9,500) are spent on __machine 4__, which had the highest expected value.
・In the greedy method of the previous section, the smaller n is, that is, the fewer trials are used for the search, the higher the chance of choosing the wrong machine. Conversely, if n is increased, the chance of choosing the optimal machine rises, but the search consumes a large number of trials, which increases waste.
・The __ε-greedy method__ can solve this problem. By interleaving search and use, it reduces the cost of the search while also reducing the risk of continuing to choose the wrong machine.
・The flow of the ε-greedy method is as follows.
① If there is a machine that has not been selected yet, select it
② With probability ε, select a machine at random from all machines (search)
③ With probability 1-ε, select the machine with the highest average reward (expected value) so far (use)
・In other words, with this method each trial searches with probability ε and uses with probability 1-ε, as in the sketch below.
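A minimal sketch of the flow ①-③ (the original definition and execution are in the screenshots below); the results layout is the same assumption as before.

```python
import numpy as np

def epsilon_greedy_select(results, epsilon):
    """ε-greedy policy: search with probability ε, use with probability 1 - ε."""
    counts = np.array([row[1] for row in results])
    if np.any(counts == 0):
        # (1) A machine that has never been selected is tried first.
        return int(np.argmin(counts))
    if np.random.rand() < epsilon:
        # (2) With probability ε, select a machine at random (search).
        return np.random.randint(len(results))
    # (3) With probability 1 - ε, select the best average reward so far (use).
    averages = np.array([row[3] for row in results])
    return int(np.argmax(averages))
```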
・Definition part of the ε-greedy method![Screenshot 2020-11-12 10.08.14.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/f62db320-5545-1e0d-e810-d73fc3a99221.png)
・Execution part![Screenshot 2020-11-12 10.10.43.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/d09711b3-1335-9165-36ec-9cd56a863bd8.png)
・Result![Screenshot 2020-11-12 10.10.29.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/df18123b-4a82-53a6-7eab-852e42fe26d5.png)
・In the greedy method, suppose there are machines A and B that hit with probabilities 0.4 and 0.6, and the expected value of machine A is initially estimated high, say 0.8. Machine A is used at first, but through trials its expected value is recognized to be lower than thought (it converges to 0.4), and the method can move on to machine B.
・On the other hand, if the expected value of machine B is initially estimated too low, say 0.2, then machine B is never tried, that is, only machine A keeps being selected, and the mistake is never noticed.
・A method that reduces this risk is the __optimistic initial value method__, which "estimates the expected value higher (optimistically) when it is uncertain".
・The concrete idea is that, before learning, we compute the expected value under the assumption that the maximum reward has already been obtained K times from each machine.
・Therefore, when using this method, the __maximum reward value (r_sup)__ and the __number of times the maximum is assumed (K)__ must be specified.
・The expected value of each machine using these two is calculated by $\frac{R(N) + Kr_{sup}}{N + K}$, where __R(N)__ is the total reward actually obtained (stored in the third column of results) and N is the number of trials actually performed (stored in the second column of results). The code then looks like the sketch below.
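A minimal sketch of the selection rule of the optimistic initial value method under the same results layout; r_sup is the maximum reward (1 in this problem) and K is the assumed number of optimistic hits.

```python
import numpy as np

def optimistic_select(results, r_sup, K):
    """Optimistic initial values: act as if each machine had already paid
    the maximum reward r_sup K times, then pick the best estimate."""
    estimates = []
    for _, n_i, total_reward, _ in results:
        # (R(N) + K * r_sup) / (N + K)
        estimates.append((total_reward + K * r_sup) / (n_i + K))
    return int(np.argmax(estimates))
```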
・Execution part![Screenshot 2020-11-12 11.02.52.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/f47baedd-9afe-5c32-1a76-f0236ffaef22.png)
・Result![Screenshot 2020-11-12 11.02.38.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/6992afc7-4fc9-beaa-767b-bfa559a8a530.png)
・A drawback of the ε-greedy method is that, during the search, it always tries every machine a fixed number of times, even a machine that is clearly the worst. For example, if there is a machine with a 20% hit probability that pays "1" on a hit and "-100" on a miss, pulling it is predictably a very bad action, yet the ε-greedy method cannot avoid searching it.
・The __soft-max method__ can solve this. Its idea is that "actions that appear to have high value are chosen easily, and actions that appear to have low value are chosen rarely"; in other words, each action is __weighted__.
・The weight is calculated with the formula $\frac{\exp(Q_{i}/\tau)}{\sum_{j}\exp(Q_{j}/\tau)}$, where $Q_i$ is the expected reward and $\tau$ (tau) is a parameter; the smaller $\tau$ is, the stronger this weighting becomes.
・The flow of the soft-max method is: if there is no data yet, assume all rewards are "1"; calculate the selection probability of each machine with the formula above; select a machine based on it; obtain a reward and update the reward values; and repeat. A sketch of the corresponding code follows.
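A minimal sketch of the soft-max selection; the names (q, tau, probability) follow the explanation below, and the results layout is the same assumption as before.

```python
import numpy as np

def softmax_select(results, tau):
    """Soft-max policy: choose machine i with probability
    exp(q_i / tau) / sum_j exp(q_j / tau)."""
    counts = np.array([row[1] for row in results])
    q = np.array([row[3] for row in results], dtype=float)  # average reward so far
    if np.all(counts == 0):
        # No data yet: treat every expected reward as 1.
        q = np.ones(len(results))
    probability = np.exp(q / tau) / np.sum(np.exp(q / tau))
    return int(np.random.choice(len(results), p=probability))
```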
・In the code above, first calculate the expected reward __q__ of each machine, taken from the fourth column of results. If there is no data yet, the expected reward "q" is set to "1", that is, an array whose values are all 1 is created with __np.ones()__.
・The selection probability __probability__ is calculated with the formula above from "q" and the passed "tau". Then, with "probability" as the probabilities and the machine numbers as the candidates, the selection is made with __np.random.choice(array, p=probability)__.
・Execution part![Screenshot 2020-11-12 12.01.40.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/90b343bf-f6b6-9786-e3c9-2528b0771d39.png)
・Looking at the results above, machines are generally selected in proportion to their hit probability. In this trial, however, machine 1 never hit at all, so it was treated as "a machine with probability 0" and was never selected afterward.
・The __UCB1 algorithm__ is an improved version of the __optimistic initial value method__. Specifically, by judging both "how often the machine has hit (success rate)" and "how much we know about the machine (how much the data could vary by chance)", it actively searches machines that have not been searched much, and, once enough data has been collected, selects the machine with the highest winning probability.
・As the flow, first calculate the difference __R__ between the maximum and minimum reward. Then, if there is a machine that has not been selected yet, select it, and from the results obtained so far calculate the expected reward (success rate) $u_{i}$ of each machine.
・Also, using these, calculate the magnitude of the data variation due to chance, $x_{i} = R\sqrt{\frac{2\log T}{N}}$ (T is the total number of trials so far and N is the number of trials of that machine), and select the machine for which $u_{i} + x_{i}$ is the largest. A sketch of the code follows.
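A minimal sketch of UCB1 selection under the same results layout; R is the reward range (1 - 0 = 1 in this problem).

```python
import math
import numpy as np

def ucb1_select(results, R=1.0):
    """UCB1: success rate plus an exploration bonus for rarely tried machines."""
    counts = np.array([row[1] for row in results])
    if np.any(counts == 0):
        # A machine that has not been selected yet is tried first.
        return int(np.argmin(counts))
    times = counts.sum()  # total number of trials so far
    scores = []
    for _, n_i, total_reward, _ in results:
        u_i = total_reward / n_i                        # success rate
        x_i = R * math.sqrt(2 * math.log(times) / n_i)  # variation due to chance
        scores.append(u_i + x_i)
    return int(np.argmax(scores))
```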
-For the above code, "a machine that has not been selected yet" can be rephrased as having 0 in the first column (number of trials) of results. In addition, the "total number of trials so far (times)" may be obtained by adding all of these parts with "sum ()". -The magnitude of data variation due to chance of each unit "xi" can be described according to the above formula, but the root is __ "math.sqrt ()" __, __ "math.log ()" __ Each log can be expressed with.
・Execution part![Screenshot 2020-11-12 13.35.58.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/ba96a21e-3d0e-e34b-480f-3dc74fba84fa.png)
・In this result, every machine is tried at least the minimum number of times, but machines that appear to have a low probability are not tried much, so waste is reduced.
・So far we have seen five methods: the __greedy method__, the __ε-greedy method__, the __soft-max method__, the __optimistic initial value method__, and the __UCB1 algorithm__. The important thing is to use them according to the problem.
・As a premise of reinforcement learning, __search and use are in a trade-off relationship__. If the number of searches is increased, trials on machines other than the optimal one increase and so does waste; if the number of uses is increased, the risk of never having found the best machine increases.
・For example, when the total number of trials is small, methods with a high search ratio such as the __ε-greedy method__ tend to achieve a high reward rate because they find the optimal machine quickly. On the other hand, when the number of trials is large, the waste of searching becomes significant, and the __UCB1 algorithm__, which searches more cleverly and uses more, tends to give a higher return.
・If all the probabilities were known from the beginning, the agent would only have to keep pulling the best machine, that is, no search would be necessary. The difference between the reward in that ideal case and the reward obtained with the five methods above is called the __regret__.
・Of the five methods, the one that minimizes the regret is the __UCB1 algorithm__.
・A diagram comparing the results of each method (for the N-armed bandit problem)
・Reinforcement learning is a method aimed at discovering the optimal behavior under given conditions.
・In reinforcement learning there is an actor, the __agent__. In the __N-armed bandit problem__, it takes the action of "choosing a machine".
・There is also the __environment__ on which the agent acts. The environment in this case is "the process of returning whether the chosen machine hits or misses".
・The __reward__ is an index that evaluates the desirability of the agent's behavior. The purpose of reinforcement learning is to maximize this reward.
・Methods for maximizing the reward include the __greedy method__, the __ε-greedy method__, the __soft-max method__, the __optimistic initial value method__, and the __UCB1 algorithm__. In each method, a __search__ is first performed to estimate the probability of each machine, and then __use__ selects the optimal machine based on the result.
That's all for this time. Thank you for reading to the end.