I implemented the $n$-armed bandit task in Python, using the textbook ["Reinforcement Learning"](http://www.amazon.co.jp/dp/4627826613) by Richard S. Sutton as a reference.
Reinforcement learning learns which action to choose in order to maximize reward. In supervised learning, the correct action to select is given directly; in reinforcement learning, the agent selects an action according to a certain policy, and the value of that action is evaluated and updated using the reward obtained from it. By acting according to the learned policy, the agent can maximize its reward.
Below, I briefly explain the components of reinforcement learning.
The agent senses the environment and selects an action based on the sensed state and its value function. The environment feeds back a reward for the action, and this reward is used to update the value of that action.
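A minimal sketch of this interaction loop is shown below. The `Agent` and `Environment` interfaces here are hypothetical and only for illustration; the actual bandit implementation appears later in this article.

```python
# Sketch of the agent-environment loop described above.
# `agent` and `environment` are hypothetical objects, not from the code below.
def run(agent, environment, n_steps):
    state = environment.observe()                  # agent senses the environment
    for _ in range(n_steps):
        action = agent.select(state)               # choose an action from the value function
        reward, state = environment.step(action)   # environment feeds back a reward
        agent.update(action, reward)               # update the estimated value of the action
```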
A slot machine with $n$ levers is called an $n$-armed bandit. You select one of the $n$ levers and pull it to receive a reward. First, for each action $a$, a true value $Q^{\ast}(a)$ is generated from a Gaussian distribution with mean $0$ and variance $1$. Then, the reward obtained from the bandit machine for action $a$ is generated from a Gaussian distribution with mean $Q^{\ast}(a)$ and variance $1$. Multiple bandit machines are created, and pulling one lever on every machine is counted as one play. The goal is to maximize the reward by learning over many plays.
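A rough sketch of this reward-generation process for a single machine (the names here are mine, not from the code below):

```python
import numpy

n_action = 10                                       # number of levers on the machine
q_star = numpy.random.normal(0.0, 1.0, n_action)    # true value Q*(a) for each lever

def pull(a):
    """Reward for pulling lever a: Gaussian with mean Q*(a) and variance 1."""
    return numpy.random.normal(q_star[a], 1.0)
```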
By averaging the rewards obtained when an action is selected, the value of that action can be estimated. This is called the sample-average method. Let the true value of action $a$ be $Q^{\ast}(a)$ and its estimate at the $t$-th play be $Q_t(a)$. If action $a$ has been selected $k_a$ times up to the $t$-th play and the rewards received for those selections were $r_{1}, r_{2}, \ldots, r_{k_a}$, the value of action $a$ at the $t$-th play is defined as follows.
$$
Q_{t}(a) = \frac{r_{1} + r_{2} + \cdots + r_{k_a}}{k_a}
$$
As $k_a \rightarrow \infty$, $Q_t(a) \rightarrow Q^{\ast}(a)$ by the law of large numbers.
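The implementation below keeps this average incrementally instead of storing all past rewards. A minimal sketch of that update (variable names are mine):

```python
def update_sample_average(q, count, reward):
    """Incremental form of the sample average: Q_new = Q + (r - Q) / (count + 1)."""
    count += 1
    q += (reward - q) / count
    return q, count

# Averaging rewards 1.0, 0.0, 2.0 gives Q = 1.0.
q, count = 0.0, 0
for r in [1.0, 0.0, 2.0]:
    q, count = update_sample_average(q, count, r)
print(q)  # 1.0
```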
The greedy method and the $\varepsilon$-greedy method are explained as action-selection rules. The greedy method selects the action with the highest estimated value: at the $t$-th play, it chooses a greedy action $a^{\ast}$ such that $Q_t(a^{\ast}) = \max_{a} Q_t(a)$. The $\varepsilon$-greedy method selects a random action with probability $\varepsilon$ and a greedy action with probability $1 - \varepsilon$. The greedy method always chooses the action currently estimated to be best and never tries anything else, so it cannot discover an untried action even if that action is actually better than the current best. The $\varepsilon$-greedy method, on the other hand, basically selects actions greedily but occasionally acts at random with probability $\varepsilon$, which gives it a chance to discover actions that are better than the current best.
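A minimal sketch of $\varepsilon$-greedy selection over a table of estimated values (this mirrors the selection logic in the implementation below, but with names of my own):

```python
import numpy

def select_epsilon_greedy(q, epsilon):
    """Return a random action with probability epsilon, otherwise a greedy one.
    Ties among maximal estimates are broken at random."""
    if numpy.random.rand() < epsilon:
        return numpy.random.randint(len(q))
    best = numpy.flatnonzero(q == q.max())
    return numpy.random.choice(best)
```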
I implemented it as follows. I prepared $2000$ bandit machines, each with $10$ levers, and learned over $2000$ plays. Here, one play means pulling one lever on each of the $2000$ machines. (The results differ slightly from the textbook because I changed some of its rules.)
bandit.py
```python
import numpy
from matplotlib import pyplot
import random
import sys


class Bandit:

    # constructor
    def __init__(self, n_machine, n_action):
        for i in range(n_action):
            _qstar = numpy.random.normal(0.0, 1.0)
            _machine = numpy.random.normal(_qstar, 1.0, n_machine).reshape((-1, 1))
            if i == 0:
                self.machine = _machine
            else:
                self.machine = numpy.hstack((self.machine, _machine))

    # public method
    def play(self, n_play, epsilon):
        self.q = numpy.zeros(self.machine.shape)
        self.q_count = numpy.zeros(self.machine.shape)
        average_reward = numpy.zeros(n_play)
        n_machine = self.machine.shape[0]
        for _p in range(n_play):
            total = 0.0
            for mac_index in range(n_machine):
                act_index = self.__select_action(mac_index, epsilon)
                reward = self.machine[mac_index, act_index]
                total += reward
                self.__update_qtable(reward, mac_index, act_index)
            average_reward[_p] = total / n_machine
            self.__display(_p, average_reward[_p])
        return average_reward

    # private method
    def __select_action(self, mac_index, epsilon):
        if numpy.random.rand() > epsilon:
            act_index = self.__select_greedy_action(mac_index)
        else:
            act_index = self.__select_random_action()
        return act_index

    def __select_greedy_action(self, mac_index):
        # break ties among equally valued actions at random
        q_row = self.q[mac_index, :]
        indexes = numpy.flatnonzero(q_row == q_row.max())
        return numpy.random.choice(indexes)

    def __select_random_action(self):
        return numpy.random.randint(self.machine.shape[1])

    def __update_qtable(self, reward, mac_index, act_index):
        # incremental sample average of the rewards observed for this action
        _q = self.q[mac_index, act_index]
        self.q_count[mac_index, act_index] += 1
        self.q[mac_index, act_index] = _q + (reward - _q) / self.q_count[mac_index, act_index]

    def __display(self, play, average_reward):
        if (play + 1) % 100 == 0:
            print('play: %d, average reward: %f' % (play + 1, average_reward))
```
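Note that in this implementation the reward for each (machine, action) pair is drawn once in the constructor and reused on every pull, rather than being re-sampled from a Gaussian with mean $Q^{\ast}(a)$ on every pull; this is presumably one of the rule changes from the textbook mentioned above.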
main.py
```python
from bandit import *


if __name__ == '__main__':
    # command-line parameters (-d: display the plot, -s: save the plot)
    param = sys.argv

    # init
    n_machine = 2000
    n_action = 10
    n_play = 2000
    epsilon = [0.0, 0.01, 0.1, 1.0]

    # draw init
    margin = 5
    color = ['b', 'g', 'r', 'k']
    pyplot.figure(figsize=(8, 6))
    pyplot.xlim(-margin, n_play + margin)

    # bandit machine
    bandit = Bandit(n_machine, n_action)

    # play
    for i in range(len(epsilon)):
        print('play count: %d, epsilon: %.2f' % (n_play, epsilon[i]))
        average_reward = bandit.play(n_play, epsilon[i])
        _label = 'e = %.2f' % epsilon[i]
        pyplot.plot(numpy.arange(n_play), average_reward, color=color[i], label=_label)
        print('!!!finish!!!\n')

    # save and show
    if '-d' in param or '-s' in param:
        pyplot.legend(loc='center right')
    if '-s' in param:
        pyplot.savefig('bandit2.png')
    if '-d' in param:
        pyplot.show()
```
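With the flags the script checks for, running it as `python main.py -d -s` should both display the plot and save it as `bandit2.png`.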
The following results were obtained. With $\varepsilon = 0.0$, a greedy action is always chosen, so once an action with an estimated value of $0$ or more is found, that action is selected forever; the average reward therefore stays nearly constant. With $\varepsilon = 1.0$, every action is random, so the average reward stays within a narrow range no matter how many plays are made. With $\varepsilon = 0.1$, there is a 10% chance of acting randomly and a 90% chance of choosing a greedy action; it finds the best actions the fastest, but because it still acts randomly 10% of the time even after learning, the average reward levels off at around $1.8$. With $\varepsilon = 0.01$, there is a 1% chance of acting randomly and a 99% chance of choosing a greedy action; it converges more slowly than $\varepsilon = 0.1$, but from around play $1000$ onward its performance is the best.
I implemented the $n$-armed bandit task in Python and confirmed how the learning process changes with $\varepsilon$.