I implemented the $n$-armed bandit task in Python, using the textbook ["Reinforcement Learning"](http://www.amazon.co.jp/dp/4627826613) by Richard S. Sutton as a reference.
Reinforcement learning learns which action to choose in order to maximize reward. In supervised learning, the correct action to select is given directly; in reinforcement learning, the agent selects an action according to a certain policy, and the value of that action is evaluated and updated using the reward obtained from it. By acting according to the learned policy, the agent can maximize its reward.
Below, I briefly explain the components of reinforcement learning.
The agent senses the environment and selects an action based on the sensed state and its value function. The environment feeds back a reward for the action, and this reward is used to update the value of that action.
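A minimal sketch of this interaction loop is shown below. The `Agent` and `Environment` interfaces here are hypothetical and only for illustration; the actual bandit implementation appears later in this article.

```python
# Sketch of the agent-environment loop described above.
# `agent` and `environment` are hypothetical objects, not from the code below.
def run(agent, environment, n_steps):
    state = environment.observe()                  # agent senses the environment
    for _ in range(n_steps):
        action = agent.select(state)               # choose an action from the value function
        reward, state = environment.step(action)   # environment feeds back a reward
        agent.update(action, reward)               # update the estimated value of the action
```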
A slot machine with $n$ levers is called an $n$-armed bandit. You select one of the $n$ levers and pull it to receive a reward. First, for each action $a$, a true value $Q^{\ast}(a)$ is generated from a Gaussian distribution with mean $0$ and variance $1$. Then, the reward obtained from the bandit machine for action $a$ is generated from a Gaussian distribution with mean $Q^{\ast}(a)$ and variance $1$. Multiple bandit machines are created, and pulling one lever on every machine is counted as one play. The goal is to maximize the reward by learning over many plays.
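A rough sketch of this reward-generation process for a single machine (the names here are mine, not from the code below):

```python
import numpy

n_action = 10                                       # number of levers on the machine
q_star = numpy.random.normal(0.0, 1.0, n_action)    # true value Q*(a) for each lever

def pull(a):
    """Reward for pulling lever a: Gaussian with mean Q*(a) and variance 1."""
    return numpy.random.normal(q_star[a], 1.0)
```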
By averaging the rewards obtained when an action is selected, the value of that action can be estimated. This is called the sample-average method. Let the true value of action $a$ be $Q^{\ast}(a)$ and its estimate at the $t$-th play be $Q_t(a)$. If action $a$ has been selected $k_a$ times up to the $t$-th play and the rewards received for those selections were $r_{1}, r_{2}, \ldots, r_{k_a}$, the value of action $a$ at the $t$-th play is defined as follows.
$$
Q_{t}(a) = \frac{r_{1} + r_{2} + \cdots + r_{k_a}}{k_a}
$$
As $k_a \rightarrow \infty$, $Q_t(a) \rightarrow Q^{\ast}(a)$ by the law of large numbers.
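The implementation below keeps this average incrementally instead of storing all past rewards. A minimal sketch of that update (variable names are mine):

```python
def update_sample_average(q, count, reward):
    """Incremental form of the sample average: Q_new = Q + (r - Q) / (count + 1)."""
    count += 1
    q += (reward - q) / count
    return q, count

# Averaging rewards 1.0, 0.0, 2.0 gives Q = 1.0.
q, count = 0.0, 0
for r in [1.0, 0.0, 2.0]:
    q, count = update_sample_average(q, count, r)
print(q)  # 1.0
```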
The greedy method and the $\varepsilon$-greedy method are explained as action-selection rules. The greedy method selects the action with the highest estimated value: at the $t$-th play, it chooses a greedy action $a^{\ast}$ such that $Q_t(a^{\ast}) = \max_{a} Q_t(a)$. The $\varepsilon$-greedy method selects a random action with probability $\varepsilon$ and a greedy action with probability $1 - \varepsilon$. The greedy method always chooses the action currently estimated to be best and never tries anything else, so it cannot discover an untried action even if that action is actually better than the current best. The $\varepsilon$-greedy method, on the other hand, basically selects actions greedily but occasionally acts at random with probability $\varepsilon$, which gives it a chance to discover actions that are better than the current best.
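A minimal sketch of $\varepsilon$-greedy selection over a table of estimated values (this mirrors the selection logic in the implementation below, but with names of my own):

```python
import numpy

def select_epsilon_greedy(q, epsilon):
    """Return a random action with probability epsilon, otherwise a greedy one.
    Ties among maximal estimates are broken at random."""
    if numpy.random.rand() < epsilon:
        return numpy.random.randint(len(q))
    best = numpy.flatnonzero(q == q.max())
    return numpy.random.choice(best)
```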
I implemented it as follows. I prepared $2000$ bandit machines, each with $10$ levers, and learned over $2000$ plays. Here, one play means pulling one lever on each of the $2000$ machines. (The results differ slightly from the textbook because I changed some of its rules.)
bandit.py
```python
import numpy
from matplotlib import pyplot
import random
import sys


class Bandit:

    # constructor
    def __init__(self, n_machine, n_action):
        for i in range(n_action):
            _qstar = numpy.random.normal(0.0, 1.0)
            _machine = numpy.random.normal(_qstar, 1.0, n_machine).reshape((-1, 1))
            if i == 0:
                self.machine = _machine
            else:
                self.machine = numpy.hstack((self.machine, _machine))

    # public method
    def play(self, n_play, epsilon):
        self.q = numpy.zeros(self.machine.shape)
        self.q_count = numpy.zeros(self.machine.shape)
        average_reward = numpy.zeros(n_play)
        n_machine = self.machine.shape[0]
        for _p in range(n_play):
            total = 0.0
            for mac_index in range(n_machine):
                act_index = self.__select_action(mac_index, epsilon)
                reward = self.machine[mac_index, act_index]
                total += reward
                self.__update_qtable(reward, mac_index, act_index)
            average_reward[_p] = total / n_machine
            self.__display(_p, average_reward[_p])
        return average_reward

    # private method
    def __select_action(self, mac_index, epsilon):
        if numpy.random.rand() > epsilon:
            act_index = self.__select_greedy_action(mac_index)
        else:
            act_index = self.__select_random_action()
        return act_index

    def __select_greedy_action(self, mac_index):
        # break ties among equally valued actions at random
        q_row = self.q[mac_index, :]
        indexes = numpy.flatnonzero(q_row == q_row.max())
        return numpy.random.choice(indexes)

    def __select_random_action(self):
        return numpy.random.randint(self.machine.shape[1])

    def __update_qtable(self, reward, mac_index, act_index):
        # incremental sample average of the rewards observed for this action
        _q = self.q[mac_index, act_index]
        self.q_count[mac_index, act_index] += 1
        self.q[mac_index, act_index] = _q + (reward - _q) / self.q_count[mac_index, act_index]

    def __display(self, play, average_reward):
        if (play + 1) % 100 == 0:
            print('play: %d, average reward: %f' % (play + 1, average_reward))
```
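Note that in this implementation the reward for each (machine, action) pair is drawn once in the constructor and reused on every pull, rather than being re-sampled from a Gaussian with mean $Q^{\ast}(a)$ on every pull; this is presumably one of the rule changes from the textbook mentioned above.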
main.py
```python
from bandit import *


if __name__ == '__main__':
    # command-line parameters (-d: display the plot, -s: save the plot)
    param = sys.argv

    # init
    n_machine = 2000
    n_action = 10
    n_play = 2000
    epsilon = [0.0, 0.01, 0.1, 1.0]

    # draw init
    margin = 5
    color = ['b', 'g', 'r', 'k']
    pyplot.figure(figsize=(8, 6))
    pyplot.xlim(-margin, n_play + margin)

    # bandit machine
    bandit = Bandit(n_machine, n_action)

    # play
    for i in range(len(epsilon)):
        print('play count: %d, epsilon: %.2f' % (n_play, epsilon[i]))
        average_reward = bandit.play(n_play, epsilon[i])
        _label = 'e = %.2f' % epsilon[i]
        pyplot.plot(numpy.arange(n_play), average_reward, color=color[i], label=_label)
        print('!!!finish!!!\n')

    # save and show
    if '-d' in param or '-s' in param:
        pyplot.legend(loc='center right')
    if '-s' in param:
        pyplot.savefig('bandit2.png')
    if '-d' in param:
        pyplot.show()
```
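With the flags the script checks for, running it as `python main.py -d -s` should both display the plot and save it as `bandit2.png`.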
The following results were obtained. With $\varepsilon = 0.0$, a greedy action is always chosen, so once an action with an estimated value of $0$ or more is found, that action is selected forever; the average reward therefore stays nearly constant. With $\varepsilon = 1.0$, every action is random, so the average reward stays within a narrow range no matter how many plays are made. With $\varepsilon = 0.1$, there is a 10% chance of acting randomly and a 90% chance of choosing a greedy action; it finds the best actions the fastest, but because it still acts randomly 10% of the time even after learning, the average reward levels off at around $1.8$. With $\varepsilon = 0.01$, there is a 1% chance of acting randomly and a 99% chance of choosing a greedy action; it converges more slowly than $\varepsilon = 0.1$, but from around play $1000$ onward its performance is the best.
I implemented the $n$-armed bandit task in Python and confirmed how the learning process changes with $\varepsilon$.