This time I worked through a reinforcement learning tutorial, so I am writing it up here as a memo.
For example, think of a new salesman learning how to sell:
Agent: the new salesman
Environment: the customer
Action: the sales pitch that the new salesman makes
State: the observation of the customer's reaction to the pitch
Reward: whether the customer's purchasing motivation has increased
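As a purely illustrative sketch of this interaction loop (everything below is made up for the analogy and is not part of the tutorial), it looks roughly like this:

import random

# Toy sketch of the agent-environment loop in the salesman analogy.
def observe_customer():
    # State: the customer's reaction, as far as the salesman can observe it.
    return random.choice(['interested', 'neutral', 'bored'])

def motivation_increased(state, action):
    # Reward: did the customer's purchasing motivation increase?
    return 1.0 if state == 'interested' and action == 'pitch' else 0.0

for step in range(3):
    state = observe_customer()                   # observation
    action = random.choice(['pitch', 'listen'])  # the salesman's action
    reward = motivation_increased(state, action)
    print(step, state, action, reward)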
Since a novice salesman has no sales experience, he cannot tell whether the reward, that is, whether the customer's purchasing motivation has actually increased, is accurate. In addition, a novice salesman cannot accurately grasp the customer's reaction to the pitch.
Reinforcement learning in highly uncertain situations like this, where there is no teacher data and the state cannot be observed accurately, is formulated as a POMDP (partially observable Markov decision process).
Please refer to the following for a detailed explanation (Source: NTT Communication Science Laboratories, Yasuhiro Minami) http://www.lai.kyutech.ac.jp/sig-slud/SLUD63-minami-POMDP-tutorial.pdf
The tutorial below instead uses an MDP, which assumes that the observed state is correct.
MDP (Markov decision process) http://www.orsj.or.jp/~wiki/wiki/index.php/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E6%B1%BA%E5%AE%9A%E9%81%8E%E7%A8%8B
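As a toy illustration (my own made-up example, not from the tutorial or the link above), an MDP is nothing more than states, actions, transition probabilities, and rewards:

# Hypothetical two-state MDP.
# transitions[(state, action)] = list of (next_state, probability) pairs
transitions = {
    ('start', 'move'): [('goal', 0.9), ('start', 0.1)],
    ('start', 'stay'): [('start', 1.0)],
    ('goal', 'move'):  [('goal', 1.0)],
    ('goal', 'stay'):  [('goal', 1.0)],
}
# rewards[(state, action)] = immediate reward (everything else is 0)
rewards = {('start', 'move'): 1.0}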
Please refer to the following for how to get started with PyBrain.
https://github.com/pybrain/pybrain/blob/master/docs/documentation.pdf
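If PyBrain is not installed yet, it can usually be obtained with pip (pip install pybrain); note that it also depends on scipy and numpy.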
Import the required libraries.
from scipy import *
import sys, time
from pybrain.rl.environments.mazes import Maze, MDPMazeTask
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.agents import LearningAgent
from pybrain.rl.learners import Q, SARSA
from pybrain.rl.experiments import Experiment
from pybrain.rl.environments import Task
Get ready for visualization.
import pylab
pylab.gray()  # use a grayscale colormap
pylab.ion()   # interactive mode, so the plot updates during the training loop
Since the goal of the tutorial is to solve a maze, we define the following maze structure (1 = wall, 0 = free cell).
structure = array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [1, 0, 0, 1, 0, 0, 0, 0, 1],
                   [1, 0, 0, 1, 0, 0, 1, 0, 1],
                   [1, 0, 0, 1, 0, 0, 1, 0, 1],
                   [1, 0, 0, 1, 0, 1, 1, 0, 1],
                   [1, 0, 0, 0, 0, 0, 1, 0, 1],
                   [1, 1, 1, 1, 1, 1, 1, 0, 1],
                   [1, 0, 0, 0, 0, 0, 0, 0, 1],
                   [1, 1, 1, 1, 1, 1, 1, 1, 1]])
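As an optional sanity check (my own addition, not part of the tutorial), the layout can be displayed with pylab before training starts:

# Quick look at the maze layout defined above (1 = wall, 0 = free cell).
pylab.pcolor(structure)
pylab.draw()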
Define the maze as an environment, passing in the maze structure defined above and the position of the goal, (7, 7).
environment = Maze(structure, (7, 7))
Next, define the agent's controller: a table of action values with 81 states and 4 actions, and initialize all of its entries.
81 states: because the maze structure is a 9x9 grid
4 actions: because the agent can move up, down, left, or right
Two interfaces are available for defining action values: ActionValueTable and ActionValueNetwork.
ActionValueTable: used when the actions are discrete
ActionValueNetwork: used when the actions are continuous
controller = ActionValueTable(81, 4)
controller.initialize(1.)
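As a rough illustration of what this table holds (my own sketch; the state index convention row * 9 + column is my assumption about how the maze cells are numbered):

# One row per state (81 maze cells), one column per action (4 moves).
state = 5 * 9 + 3  # hypothetical cell at row 5, column 3
print(controller.params.reshape(81, 4)[state])  # four action values, all 1.0 right after initialize(1.)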
Next, define how the agent learns. Here Q-learning is used, so the agent's actions are optimized for reward.
learner = Q()
agent = LearningAgent(controller, learner)
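Q() implements standard Q-learning, which nudges Q(s, a) toward r + gamma * max over a' of Q(s', a'). Since SARSA is already imported above, the on-policy variant should work as a drop-in replacement (my own note, not part of the tutorial):

# Alternative: use SARSA instead of Q-learning.
learner = SARSA()
agent = LearningAgent(controller, learner)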
Define the task that connects the agent to the environment.
task = MDPMazeTask(environment)
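As far as I understand from the PyBrain source, MDPMazeTask is what supplies the observation (the agent's current cell) and the reward at each step, with a reward of 1 given when the goal cell is reached and 0 otherwise.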
The code below runs the actual reinforcement learning loop: each iteration performs 100 interactions with the environment, lets the agent learn from them, and then plots the learned value of each maze cell.
experiment = Experiment(task, agent)

while True:
    experiment.doInteractions(100)  # run 100 agent-environment interactions
    agent.learn()                   # update the action-value table from them
    agent.reset()                   # clear the agent's stored history

    # Plot the maximum action value of each of the 81 cells as a 9x9 heat map.
    pylab.pcolor(controller.params.reshape(81, 4).max(1).reshape(9, 9))
    pylab.draw()
    pylab.show()
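The loop above never terminates; as a small variation (my own sketch, not part of the tutorial), you can run a fixed number of iterations instead and then read the greedy action for every cell straight out of the learned table:

# Finite version of the training loop: 100 iterations of 100 interactions each.
for i in range(100):
    experiment.doInteractions(100)
    agent.learn()
    agent.reset()

# Greedy action index (0-3) for each cell, laid back out on the 9x9 maze.
policy = controller.params.reshape(81, 4).argmax(axis=1).reshape(9, 9)
print(policy)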
That's it.