I would like to write a series of articles digesting machine learning in my own way, starting from the basics and gradually moving on to more advanced content. The format is: the theory behind it ⇨ code. Questions are welcome.
Reinforcement learning, unlike so-called supervised learning, is an approach that seeks an optimal solution without being given an "answer" for a pre-prepared data set. For example, suppose there are two blackjack tables in front of you (here, the winning percentage changes depending on the dealer). Of course, if you keep playing at the table with the higher winning percentage, you can maximize your profit. However, at the beginning you do not know how much each table will return, so you "learn" which one is better by playing repeatedly. This is the basic idea of reinforcement learning.
Here, two important competing forces, known as the Explore-Exploit dilemma, come into play:

- Exploit (~Greedy): the drive to maximize profit based on what is currently known
- Explore (~Epsilon): the drive to "search" for greater profit at the expense of maximizing immediate profit

Next, let's look at one concrete model, the Epsilon-Greedy Algorithm.
Epsilon-Greedy Algorithm
To put it simply, this is a strategy that "basically chooses the option with the higher return (Greedy), but occasionally (with a small probability Epsilon) changes its mind and chooses randomly." In the blackjack example above, suppose you play a total of N times, and you first win against the male dealer (M). At that point the return, i.e. the estimated winning percentage against M (MW), is 1, while the winning percentage against the female dealer (W) is unknown; if that estimate (WW) starts at 0, the next table you play at will again be M. Moreover, after this second play, MW > WW regardless of whether you won or lost. In this way, if you only ever play against the dealer with the highest estimated winning percentage at that moment, i.e. against M (completely Greedy), you will keep playing against a single dealer forever without ever learning the winning percentage against W. This is where the second ingredient comes in: with probability Epsilon (usually 5 to 10%), you choose randomly from all the options. With this, for a sufficiently large N no dealer is left with an un-updated winning percentage, and you can estimate all of them.
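Before the full implementation below, here is a minimal sketch of just the decision rule, to make the idea concrete (the names eps and estimates are placeholders I am introducing here; estimates stands for the current win-rate estimates against each dealer):

```python
import numpy as np

eps = 0.05
estimates = [0.0, 0.0]  # hypothetical current win-rate estimates for dealers M and W

if np.random.random() < eps:             # Explore: with probability eps, pick a dealer at random
    j = np.random.choice(len(estimates))
else:                                    # Exploit: otherwise pick the dealer with the best estimate
    j = int(np.argmax(estimates))
```

The full version of this rule appears inside the experiment() function further down.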
Now, what matters here is Epsilon, the "probability of playing randomly." If this is a large value, say 50%, you play randomly half the time, so you play roughly the same number of games against both dealers and learn both winning percentages early on. At the same time, the Greedy strategy is relatively weakened, so the total return becomes smaller. What if Epsilon is too small? Then, on the contrary, you settle early on the dealer who seems to have the higher winning percentage (which is not always actually correct) and only play against that dealer. In other words, balance is important (as in life). By the way, my life runs on a small Epsilon (just an aside).
Let's implement this next.
The code is written in Python 3.x. First, import the libraries and define the parameters to be used.
Bandit_Epsilon_simulation.py
```python
import numpy as np
import matplotlib.pyplot as plt

Num_try = 1000
Eps = 0.05
Bandit_prob = [0.1, 0.45, 0.75]
```
Here, the numbers in Bandit_prob are the "true winning rates" for each dealer. The goal of this algorithm is to discover them through learning, so note that this is information the player is not supposed to know in advance.
Next, implement the core algorithm.
Bandit_Epsilon_simulation.py
```python
class Bandit():
    def __init__(self, p):  # p: true winning rate - pretend we do not know it
        self.p = p
        self.p_estimate = 0
        self.N = 0

    def pull(self):  # returns 1 (win) with probability p
        return np.random.random() < self.p  # np.random.random() is uniform on [0, 1)

    def update(self, x):
        self.N += 1
        self.p_estimate = (1 / self.N) * (x + self.p_estimate * (self.N - 1))
        # using the incremental mean:
        # <p>_N * N = x_N + <p>_(N-1) * (N-1)


def experiment():
    bandits = [Bandit(p) for p in Bandit_prob]  # we pretend not to know Bandit_prob
    rewards = np.zeros(Num_try)
    num_explored = 0
    num_exploited = 0
    num_opt = 0
    opt_j = np.argmax([b.p for b in bandits])  # index of the truly best bandit

    for i in range(Num_try):
        # follow the epsilon-greedy algorithm
        if np.random.random() < Eps:  # explore
            num_explored += 1
            # j is the index of the chosen bandit
            j = np.random.choice(len(bandits))
        else:  # exploit (greedy)
            num_exploited += 1
            j = np.argmax([k.p_estimate for k in bandits])
        if j == opt_j:
            num_opt += 1

        x = bandits[j].pull()  # did the chosen bandit pay off this time?
        rewards[i] = x
        bandits[j].update(x)

    return bandits, rewards, num_explored, num_exploited, opt_j
```
The result of each game (win: 1, lose: 0) is returned by the pull method of the Bandit class. Based on that result, the estimated winning percentage against the chosen dealer is refreshed by update after every game. Note that the winning percentage here is the estimate computed from the player's side as games are played, which is different from the true rate the dealer actually has (learning brings the estimate closer to it).
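A small side note on update: the formula in the comment is just the incremental form of the sample mean. If $x_N$ is the result of the $N$-th game (1 or 0) and $\hat{p}_N$ is the estimate after $N$ games, then

$$
\hat{p}_N = \frac{x_N + (N-1)\,\hat{p}_{N-1}}{N}
$$

which is exactly what p_estimate computes, without having to store every past result.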
Now let's plot this for several values of Epsilon. The horizontal axis is the number of plays, the vertical axis is the win rate, and the black horizontal line marks the maximum win rate of 75%.
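The plotting code itself is not shown above, so the following is only a minimal sketch of how such a figure could be produced, assuming experiment() is run once per Epsilon value and the cumulative win rate is plotted; the list eps_list and the loop that rebinds the global Eps are my additions for illustration.

```python
# Minimal plotting sketch (not the original code): run the experiment for
# several Epsilon values and plot the cumulative win rate over the plays.
eps_list = [0.01, 0.05, 0.1, 0.5]  # hypothetical Epsilon values to compare

plt.figure()
for Eps in eps_list:  # rebinds the global Eps that experiment() reads
    bandits, rewards, num_explored, num_exploited, opt_j = experiment()
    cum_win_rate = np.cumsum(rewards) / (np.arange(Num_try) + 1)
    plt.plot(cum_win_rate, label="Eps = {}".format(Eps))

plt.axhline(y=max(Bandit_prob), color="black")  # the 75% line
plt.xlabel("number of plays")
plt.ylabel("win rate")
plt.legend()
plt.show()
```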
This plot is packed with insights: how quickly each curve reaches its plateau, how high that plateau is, and why it does not reach 75%. Next time, let's work through these points one by one.
- Reinforcement learning is a model that adjusts the probability distribution over choices based on intermediate returns (rewards), with no "answer" given.
- The Epsilon-Greedy model is one representative example.
- The behavior changes significantly depending on the size of Epsilon.
References
- Udemy: Artificial Intelligence: Reinforcement Learning in Python