I work with AI/machine learning regularly, but until now I had studied mostly supervised learning and barely touched unsupervised learning or reinforcement learning. Then yesterday I came across this article, a great hands-on introduction to reinforcement learning, and decided to try it for myself.
While reading the article and looking for a theme of my own, I noticed that the agent's movement looked like "this thing walks like it's drunk." The article's theme is reinforcement learning to find the optimal route from a start point to a goal: the agent moves in the intended direction with 80% probability, slips to the left with 10% probability, and slips to the right with 10% probability.
That made me think, "couldn't this reproduce drunken behavior?", so I experimented with the hands-on code. Please read on while recalling your hazy memories of izakaya nights before COVID. This other article is also very helpful, so please have a look.
The code is almost the same as in the article, with a few modifications to introduce a drunkenness variable, berobero. For the full definitions of the methods, see the GitHub repository linked in the article above; only the modified parts are shown here.
# Abstract class
class MDP:
    # MDP: Markov decision process
    # Drunkenness berobero is added as an argument
    def __init__(self, init, actlist, terminals, gamma=.9, berobero=0.1):
        # init: initial state
        # actlist: available actions
        # terminals: terminal states
        # gamma: discount factor
        self.init = init
        self.actlist = actlist
        self.terminals = terminals

# Concrete class
class GridMDP(MDP):
    # Drunkenness berobero is added as an argument
    def __init__(self, grid, terminals, init=(0, 0), gamma=.9, berobero=0.1):
        # grid is a matrix that defines the field
        grid.reverse()  # because we want row 0 on bottom, not on top
        MDP.__init__(self, init, actlist=orientations,
                     terminals=terminals, gamma=gamma, berobero=berobero)
        self.grid = grid
        self.berobero = berobero
        self.rows = len(grid)
        self.cols = len(grid[0])

    # Returns the list of (transition probability, next state) pairs
    # berobero=0 is sober, 0.1 is tipsy, 0.3 is hammered
    # At berobero=0.5 the forward probability drops to zero, so the agent
    # is guaranteed to crab-walk: picture someone completely wasted
    def T(self, state, action):
        if action is None:
            return [(0.0, state)]
        else:
            return [(1 - 2 * self.berobero, self.go(state, action)),
                    (self.berobero, self.go(state, turn_right(action))),
                    (self.berobero, self.go(state, turn_left(action)))]
The agent looks for a way to maximize its reward as it moves from the start point to the goal, but with some probability it moves in a direction other than the one it intends: it slips left with probability berobero, slips right with probability berobero, and moves in the intended direction with probability 1 - 2 * berobero. So berobero = 0 is sober, 0.1 is tipsy, and 0.3 is hammered. At berobero = 0.5 the agent can never move forward and is guaranteed to crab-walk, so please picture someone completely wasted.
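To make those numbers concrete, here is a minimal standalone sketch (my own illustration, separate from the article's AIMA-based code) that prints how the movement distribution spreads out as berobero grows:

# Standalone illustration of the drunken transition model
def drunk_transition(berobero):
    # Returns (probability, relative direction) pairs for a single action
    return [(1 - 2 * berobero, "forward"),
            (berobero, "right"),
            (berobero, "left")]

for b in (0.0, 0.1, 0.3, 0.5):
    print(f"berobero={b}: {drunk_transition(b)}")
# berobero=0.0: all probability mass on "forward" (sober)
# berobero=0.5: no mass on "forward" at all (guaranteed crab walk)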
When people actually get drunk, don't they tend to reach for a wall and move along it? In a tatami room I would be afraid of having to say "sorry, excuse me" to everyone I bump into, so my mental image is heading for the wall by the shortest route and then following it. I would like to simulate that here.
First, let's verify with the example from the article. Consider a layout where the store has a single pillar and one person sitting near the exit. The goal is to find the optimal route to the exit without bumping into that person.
If you're drunk, you don't want to risk bumping into the person by heading right along the bottom. So intuitively, it seems best to go up until you hit the top wall and then follow it to the right.
loss is the reward received for entering a square, so setting it to -0.5 makes the room charge "-0.5 every time you move." The more negative the reward, the more the agent favors short routes. It seems you could use loss to model something like "if you don't reach the exit soon, you'll throw up" (I won't do that this time).
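Just to illustrate that idea (a sketch only; I did not actually run this variant), the change would be a single value:

# Sketch only: stronger time pressure via a more negative step reward
loss = -0.5  # "-0.5 every time you move": hurry to the exit
grid = [
    [loss, loss, loss, +1],
    [loss, None, loss, -1],
    [loss, loss, loss, loss]
]
hurried = GridMDP(grid, terminals=[(3, 2), (3, 1)], berobero=0.1)
pi = best_policy(hurried, value_iteration(hurried, .01))
print_table(hurried.to_arrows(pi))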
# Original pattern
loss = 0
grid = [
    [loss, loss, loss, +1],
    [loss, None, loss, -1],
    [loss, loss, loss, loss]
]
# Coordinates count from the bottom-left: (3, 2) is the exit (+1),
# (3, 1) is the person (-1)
sequential_decision_environment = GridMDP(grid, terminals=[(3, 2), (3, 1)], berobero=0.1)
pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .01))
print_table(sequential_decision_environment.to_arrows(pi))
Here is the result for berobero = 0.1. The output shows one arrow per square, answering "which direction is best to move from here?" In this case the result is "the best route is to go up from the start point and then head right," which seems intuitively sound.
Here is the result for berobero = 0.3. Interestingly, the policy turns left first; I sense a firm will not to drift right no matter what. It is also amusing that just before the goal the policy points up, so that a slip means "right if lucky, left at worst."
And here is berobero = 0.5. I can feel a strong resolve: "I'm too drunk... I can only move sideways..." The agent faces right to crab-walk upward, then faces up to crab-walk sideways without any trouble. There is almost a sense of enlightenment in how it turns its inability to move forward to its advantage.
Now, at last, the practical part. Consider escaping from a back seat past four tables of drinking parties. Treating the four tables as four groups of people, let's see whether the agent squeezes through the middle with an "excuse me" or takes the long way around along the wall.
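The layout only appears as an image in the original article, so the grid below is my own guess at what it might look like (treating the four tables as impassable None cells; the actual layout may differ):

# Hypothetical four-table layout (an assumption; the article's grid may differ)
loss = 0
grid = [
    [loss, loss, loss, loss, +1],    # exit in the far corner
    [loss, None, loss, None, loss],  # two tables
    [loss, loss, loss, loss, loss],  # aisle through the middle
    [loss, None, loss, None, loss],  # two more tables
    [loss, loss, loss, loss, loss]   # start from the back seat at (0, 0)
]
izakaya = GridMDP(grid, terminals=[(4, 4)], berobero=0.1)
pi = best_policy(izakaya, value_iteration(izakaya, .01))
print_table(izakaya.to_arrows(pi))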
Here is the result for berobero = 0.1. In this case it's the "excuse me" pattern, cutting straight through the middle. Not that drunk yet, so squeezing past people seems fine.
Here is berobero = 0.3. This!! I feel a strong will to stick to the wall!! When you're drunk, just follow the wall and get out safely!!
I did wonder why the same thing doesn't happen at berobero = 0.1. Since loss = 0, a detour should carry no step penalty, or so I thought. (My guess is the discount factor: with gamma = 0.9, every extra step multiplies the eventual +1 by 0.9, so longer routes are not actually free.) If you know the real answer, I'd be grateful if you could teach me!
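A quick sanity check of that guess (my own arithmetic, not from the article): with gamma = 0.9, a +1 reached later is worth strictly less, so even at loss = 0 a detour has a cost:

# With gamma = 0.9, a +1 reached after n steps is worth gamma ** n
gamma = 0.9
for n in (6, 10):  # illustrative step counts: direct route vs detour
    print(f"{n} steps: discounted value = {gamma ** n:.3f}")
# 6 steps: discounted value = 0.531
# 10 steps: discounted value = 0.349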
Here is berobero = 0.5. There is something funny about crab-walking everywhere because it simply cannot move forward. It looks like someone so plastered that they sense the danger and have paradoxically become calm about it.
The detour carried no risk earlier, so this time let's add a little. Both the shortcut and the detour now require an "excuse me" past someone, but the detour corridor is wider, so you can keep to the far side and proceed safely to the right. Is there really an izakaya like this?
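Again, the layout is only shown as an image in the original, so the grid below is my own guess at one with this property (the -1 cells are the people; the actual grid surely differs):

# Hypothetical layout with risk (an assumption, not the article's actual grid).
# Both the shortcut (bottom) and the inner detour lane pass right next to a
# person (-1), but the detour is two lanes wide, so the outer lane is safe.
loss = 0
grid = [
    [loss, loss, loss, loss, loss],  # outer detour lane along the top wall
    [loss, loss, loss, loss, +1],    # inner detour lane; exit at (4, 2)
    [loss, None, -1,   None, loss],  # tables, with one person at (2, 1)
    [loss, loss, loss, loss, loss]   # shortcut corridor; start at (0, 0)
]
risky = GridMDP(grid, terminals=[(4, 2), (2, 1)], berobero=0.1)
pi = best_policy(risky, value_iteration(risky, .01))
print_table(risky.to_arrows(pi))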
Here is the result for berobero = 0.1. In this case it's the "excuse me" pattern, heading right past the person.
Why does the policy at the start point, of all places, point downward? It feels like a misanthropic lament: "no, I don't even want to go to the exit."
Here is berobero = 0.3, a pattern of safe choices. At the square just above the None obstacle the policy points left; I sense a firm determination to "never go right."
And here is berobero = 0.5. Having mastered the crab walk, the agent moves in mysterious ways that seem to ignore the crowd entirely.
Since the goal this time was a reinforcement learning hands-on, my examination of the conditions is admittedly still shallow. In particular, I set loss = 0 throughout, so I would like to see how the behavior changes when that varies. Also, I used value iteration as the model this time, but I would like to implement Q-learning as well.
And let's all enjoy alcohol in moderation!!! Thank you for reading to the end! An LGTM would be much appreciated.