I work with AI/machine learning regularly, but until now I had studied mostly supervised learning and barely touched unsupervised learning or reinforcement learning. Then yesterday I came across this article, a great hands-on introduction to reinforcement learning, and decided to try it for myself.
While reading the article and looking for a theme of my own, I noticed that the agent's movement looked like "this thing walks like it's drunk." The article's theme is reinforcement learning to find the optimal route from a start point to a goal: the agent moves in the intended direction with 80% probability, slips to the left with 10% probability, and slips to the right with 10% probability.
That made me think, "couldn't this reproduce drunken behavior?", so I experimented with the hands-on code. Please read on while recalling your hazy memories of izakaya nights before COVID. This other article is also very helpful, so please have a look.
The code is almost the same as in the article, with a few modifications to introduce a drunkenness variable, berobero. For the full definitions of the methods, see the GitHub repository linked in the article above; only the modified parts are shown here.
# Abstract class
class MDP:
    # MDP: Markov decision process
    # Drunkenness berobero is added as an argument
    def __init__(self, init, actlist, terminals, gamma=.9, berobero=0.1):
        # init: initial state
        # actlist: available actions
        # terminals: terminal states
        # gamma: discount factor
        self.init = init
        self.actlist = actlist
        self.terminals = terminals

# Concrete class
class GridMDP(MDP):
    # Drunkenness berobero is added as an argument
    def __init__(self, grid, terminals, init=(0, 0), gamma=.9, berobero=0.1):
        # grid is a matrix that defines the field
        grid.reverse()  # because we want row 0 on bottom, not on top
        MDP.__init__(self, init, actlist=orientations,
                     terminals=terminals, gamma=gamma, berobero=berobero)
        self.grid = grid
        self.berobero = berobero
        self.rows = len(grid)
        self.cols = len(grid[0])

    # Returns the list of (transition probability, next state) pairs
    # berobero=0 is sober, 0.1 is tipsy, 0.3 is hammered
    # At berobero=0.5 the forward probability drops to zero, so the agent
    # is guaranteed to crab-walk: picture someone completely wasted
    def T(self, state, action):
        if action is None:
            return [(0.0, state)]
        else:
            return [(1 - 2 * self.berobero, self.go(state, action)),
                    (self.berobero, self.go(state, turn_right(action))),
                    (self.berobero, self.go(state, turn_left(action)))]
The agent looks for a way to maximize its reward as it moves from the start point to the goal, but with some probability it moves in a direction other than the one it intends: it slips left with probability berobero, slips right with probability berobero, and moves in the intended direction with probability 1 - 2 * berobero. So berobero = 0 is sober, 0.1 is tipsy, and 0.3 is hammered. At berobero = 0.5 the agent can never move forward and is guaranteed to crab-walk, so please picture someone completely wasted.
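To make those numbers concrete, here is a minimal standalone sketch (my own illustration, separate from the article's AIMA-based code) that prints how the movement distribution spreads out as berobero grows:

# Standalone illustration of the drunken transition model
def drunk_transition(berobero):
    # Returns (probability, relative direction) pairs for a single action
    return [(1 - 2 * berobero, "forward"),
            (berobero, "right"),
            (berobero, "left")]

for b in (0.0, 0.1, 0.3, 0.5):
    print(f"berobero={b}: {drunk_transition(b)}")
# berobero=0.0: all probability mass on "forward" (sober)
# berobero=0.5: no mass on "forward" at all (guaranteed crab walk)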
When people actually get drunk, don't they tend to reach for a wall and move along it? In a tatami room I would be afraid of having to say "sorry, excuse me" to everyone I bump into, so my mental image is heading for the wall by the shortest route and then following it. I would like to simulate that here.
First, let's verify with the example from the article. Consider a layout where the store has a single pillar and one person sitting near the exit. The goal is to find the optimal route to the exit without bumping into that person.
If you're drunk, you don't want to risk bumping into the person by heading right along the bottom. So intuitively, it seems best to go up until you hit the top wall and then follow it to the right.
loss is the reward received for entering a square, so setting it to -0.5 makes the room charge "-0.5 every time you move." The more negative the reward, the more the agent favors short routes. It seems you could use loss to model something like "if you don't reach the exit soon, you'll throw up" (I won't do that this time).
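Just to illustrate that idea (a sketch only; I did not actually run this variant), the change would be a single value:

# Sketch only: stronger time pressure via a more negative step reward
loss = -0.5  # "-0.5 every time you move": hurry to the exit
grid = [
    [loss, loss, loss, +1],
    [loss, None, loss, -1],
    [loss, loss, loss, loss]
]
hurried = GridMDP(grid, terminals=[(3, 2), (3, 1)], berobero=0.1)
pi = best_policy(hurried, value_iteration(hurried, .01))
print_table(hurried.to_arrows(pi))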
# Original pattern
loss = 0
grid = [
    [loss, loss, loss, +1],
    [loss, None, loss, -1],
    [loss, loss, loss, loss]
]
# Coordinates count from the bottom-left: (3, 2) is the exit (+1),
# (3, 1) is the person (-1)
sequential_decision_environment = GridMDP(grid, terminals=[(3, 2), (3, 1)], berobero=0.1)
pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .01))
print_table(sequential_decision_environment.to_arrows(pi))
Here is the result for berobero = 0.1. The output shows one arrow per square, answering "which direction is best to move from here?" In this case the result is "the best route is to go up from the start point and then head right," which seems intuitively sound.
Here is the result for berobero = 0.3. Interestingly, the policy turns left first; I sense a firm will not to drift right no matter what. It is also amusing that just before the goal the policy points up, so that a slip means "right if lucky, left at worst."
And here is berobero = 0.5. I can feel a strong resolve: "I'm too drunk... I can only move sideways..." The agent faces right to crab-walk upward, then faces up to crab-walk sideways without any trouble. There is almost a sense of enlightenment in how it turns its inability to move forward to its advantage.
Now, at last, the practical part. Consider escaping from a back seat past four tables of drinking parties. Treating the four tables as four groups of people, let's see whether the agent squeezes through the middle with an "excuse me" or takes the long way around along the wall.
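The layout only appears as an image in the original article, so the grid below is my own guess at what it might look like (treating the four tables as impassable None cells; the actual layout may differ):

# Hypothetical four-table layout (an assumption; the article's grid may differ)
loss = 0
grid = [
    [loss, loss, loss, loss, +1],    # exit in the far corner
    [loss, None, loss, None, loss],  # two tables
    [loss, loss, loss, loss, loss],  # aisle through the middle
    [loss, None, loss, None, loss],  # two more tables
    [loss, loss, loss, loss, loss]   # start from the back seat at (0, 0)
]
izakaya = GridMDP(grid, terminals=[(4, 4)], berobero=0.1)
pi = best_policy(izakaya, value_iteration(izakaya, .01))
print_table(izakaya.to_arrows(pi))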
Here is the result for berobero = 0.1. In this case it's the "excuse me" pattern, cutting straight through the middle. Not that drunk yet, so squeezing past people seems fine.
Here is berobero = 0.3. This!! I feel a strong will to stick to the wall!! When you're drunk, just follow the wall and get out safely!!
I did wonder why the same thing doesn't happen at berobero = 0.1. Since loss = 0, a detour should carry no step penalty, or so I thought. (My guess is the discount factor: with gamma = 0.9, every extra step multiplies the eventual +1 by 0.9, so longer routes are not actually free.) If you know the real answer, I'd be grateful if you could teach me!
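A quick sanity check of that guess (my own arithmetic, not from the article): with gamma = 0.9, a +1 reached later is worth strictly less, so even at loss = 0 a detour has a cost:

# With gamma = 0.9, a +1 reached after n steps is worth gamma ** n
gamma = 0.9
for n in (6, 10):  # illustrative step counts: direct route vs detour
    print(f"{n} steps: discounted value = {gamma ** n:.3f}")
# 6 steps: discounted value = 0.531
# 10 steps: discounted value = 0.349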
Here is berobero = 0.5. There is something funny about crab-walking everywhere because it simply cannot move forward. It looks like someone so plastered that they sense the danger and have paradoxically become calm about it.
The detour carried no risk earlier, so this time let's add a little. Both the shortcut and the detour now require an "excuse me" past someone, but the detour corridor is wider, so you can keep to the far side and proceed safely to the right. Is there really an izakaya like this?
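Again, the layout is only shown as an image in the original, so the grid below is my own guess at one with this property (the -1 cells are the people; the actual grid surely differs):

# Hypothetical layout with risk (an assumption, not the article's actual grid).
# Both the shortcut (bottom) and the inner detour lane pass right next to a
# person (-1), but the detour is two lanes wide, so the outer lane is safe.
loss = 0
grid = [
    [loss, loss, loss, loss, loss],  # outer detour lane along the top wall
    [loss, loss, loss, loss, +1],    # inner detour lane; exit at (4, 2)
    [loss, None, -1,   None, loss],  # tables, with one person at (2, 1)
    [loss, loss, loss, loss, loss]   # shortcut corridor; start at (0, 0)
]
risky = GridMDP(grid, terminals=[(4, 2), (2, 1)], berobero=0.1)
pi = best_policy(risky, value_iteration(risky, .01))
print_table(risky.to_arrows(pi))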
Here is the result for berobero = 0.1. In this case it's the "excuse me" pattern, heading right past the person.
Why does the policy at the start point, of all places, point downward? It feels like a misanthropic lament: "no, I don't even want to go to the exit."
Here is berobero = 0.3, a pattern of safe choices. At the square just above the None obstacle the policy points left; I sense a firm determination to "never go right."
And here is berobero = 0.5. Having mastered the crab walk, the agent moves in mysterious ways that seem to ignore the crowd entirely.
Since the goal this time was a reinforcement learning hands-on, my examination of the conditions is admittedly still shallow. In particular, I set loss = 0 throughout, so I would like to see how the behavior changes when that varies. Also, I used value iteration as the model this time, but I would like to implement Q-learning as well.
And let's all enjoy alcohol in moderation!!! Thank you for reading to the end! An LGTM would be much appreciated.