Over spring break I was doing independent research applying reinforcement learning to safety engineering. It does not look like it will reach a finished form, so here I present what I have so far, together with the background.
The maze problem is a classic example in reinforcement learning. For the implementation, I referred to an existing example.
Problem description: the aim is to learn a route from S (Start) to G (Goal) in a maze like the one shown in the figure. Here we use Q-learning with the ε-greedy method.
In Q-learning, the movement options (actions) available in each cell are defined, and an action value is maintained for each of them. When an action is taken in a cell, the action value of that cell-action pair is moved a certain fraction toward the (discounted) maximum action value of the destination cell. In addition, the Goal carries a reward; when the agent reaches the Goal, the action value is updated with that reward and one episode of solving the maze ends.
The update formula for the action value Q, in its basic form, is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left\{ r + \gamma \left( \max_{a'} Q(s', a') - Q(s, a) \right) \right\}$$

where $Q$ is the action value, $s$ is the state (the agent's position, i.e. the cell), $a$ is the action, $\alpha$ is the learning rate, $\gamma$ is the time discount rate, and $r$ is the reward. $s'$ and $a'$ denote the state and action one step ahead (that is, the action value of the chosen destination).
As the formula shows, the update in Q-learning is driven by the difference from the action value Q of the next cell. If every cell starts with a uniform value of 0, the agent simply wanders randomly from Start, and the action values only begin to change once it happens to reach the Goal. That is exactly what happens in practice: at first the search is random, and once a reward is obtained the action values gradually propagate from the Goal and its neighbouring cells back toward Start.
On the other hand, this formula has a drawback: once a chain of action values from Start to Goal has been established, the agent will always follow only that route (all other cells remain at 0, so actions leading into them are never selected). Therefore, when choosing an action, the agent acts randomly with a certain probability ε, and ε is gradually decreased over time. That, in brief, is the ε-greedy method.
In safety engineering, fire is one of the major research subjects. Regarding evacuation in the event of a fire, a variety of experimental and computational studies have been reported in previous work; one example is the simulation of fire evacuation behaviour using a maze. On the computational side, several classes of model (for example cellular automata) are frequently discussed, but in each case the calculation relies on detailed knowledge of the environment and on strongly environment-dependent rules. As a result, many of these studies are discussed together with verification against actual fire cases, and questions remain about their ability to predict evacuation behaviour.
What I consider here, then, is applying the Q-learning maze problem to compute the optimal route from the outbreak of a fire to evacuation, and to treat it quantitatively. One advantage of this approach is that the parameters set in advance depend little on the environment. It is also expected to be highly extensible, since behaviour can be changed through the rewards and actions rather than by adding rules to the algorithm.
The quickest explanation is to show the code.
plain_Q_model.py
import numpy as np
import matplotlib.pyplot as plt
#decide the direction to move by Q table and random probability epsilon
#and convert it as movement on the map.
def decide_action(status, maze_row, q_table, maze_map, epsilon):
direction=["up", "right", "down", "left"]
if(np.random.rand() < epsilon):
#at random choice
direction_to=np.random.choice(direction, p=maze_map[status])
else:
#direction is selected at max Q value direction
direction_to=direction[np.nanargmax(q_table[status])]
#convert direction to map information at matrix
if(direction_to=="up"):
status=status-maze_row
action=0
elif(direction_to=="right"):
status=status+1
action=1
elif(direction_to=="down"):
status=status+maze_row
action=2
elif(direction_to=="left"):
status=status-1
action=3
return status, action
#q_learning function to update the Q_value
def q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward):
    #setting of reward
    if(next_status==goal):
        r=reward
    else:
        r=0
    #update formula of Q_value
    q_table[status][action]+=alfa*(r+gamma*(np.nanmax(q_table[next_status])-q_table[status][action]))
    return q_table
#solve and update the maze at once
def goal_once(start, goal, maze_row, maze_map, q_table, alfa, gamma, reward, epsilon):
    flag=True
    history_move=[]
    #initialize
    status=start
    #solve maze until it gets the goal
    while(flag):
        next_status, action=decide_action(status, maze_row, q_table, maze_map, epsilon)
        q_table=q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward)
        #record the history of action
        history_move.append([status, action])
        #flag of goal
        if(next_status==goal):flag=False
        #status update
        status=next_status
    return q_table, history_move
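#movement probability vectors: entries are [up, right, down, left] (keys w, d, s, a),
#normalized so that each allowed direction is drawn with equal probability during random moves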
move_0=np.array([0, 0, 0, 0])
move_w=np.array([1, 0, 0, 0])
move_d=np.array([0, 1, 0, 0])
move_s=np.array([0, 0, 1, 0])
move_a=np.array([0, 0, 0, 1])
move_wd=np.array([1, 1, 0, 0])/2
move_ws=np.array([1, 0, 1, 0])/2
move_wa=np.array([1, 0, 0, 1])/2
move_ds=np.array([0, 1, 1, 0])/2
move_da=np.array([0, 1, 0, 1])/2
move_sa=np.array([0, 0, 1, 1])/2
move_wds=np.array([1, 1, 1, 0])/3
move_wda=np.array([1, 1, 0, 1])/3
move_wsa=np.array([1, 0, 1, 1])/3
move_dsa=np.array([0, 1, 1, 1])/3
move_wdsa=np.array([1, 1, 1, 1])/4
###input form###
maze_map=np.array([move_ds, move_dsa, move_dsa, move_dsa, move_sa, \
move_wds, move_wdsa, move_wdsa, move_wdsa, move_wsa,\
move_wds, move_wdsa, move_wdsa, move_wdsa, move_wsa,\
move_wd, move_wda, move_wda, move_wda, move_wa])
q_table=np.where(maze_map==0.0, np.nan, 0)
maze_row=5
maze_columns=4
start=0
goal=19
reward=1
time_of_episode=100
alfa = 0.10 #Learning rate
gamma = 0.92 #Time discount rate
epsilon = 0.99 #initial value of ε for the ε-greedy method
probability_reduce_rate=1.04
###input form end###
history_move=[]
size_of_epi=[]
episode = 1
flag=True
while(flag):
    q_table, history_move=goal_once(start, goal, maze_row, maze_map, q_table, alfa, gamma, reward, epsilon)
    print("this is episode: {0:5d}, steps to goal are {1:5d}".format(episode, len(history_move)))
    size_of_epi.append(len(history_move))
    if(time_of_episode==episode):
        flag=False
        episode=episode-1
    episode+=1
    epsilon=epsilon/probability_reduce_rate
q_table=q_table/np.nanmax(q_table)
direction=["↑", "→", "↓", "←"]
for i in range(len(history_move)):
    print(direction[history_move[i][1]])
plt.plot(size_of_epi)
plt.show()
q_table[goal]=1.0
np.set_printoptions(suppress=True, precision=4, floatmode='maxprec')
#print(maze_map)
print(q_table)
q_table_max=np.zeros(goal+1)
for i in range(len(q_table)):
    q_table_max[i]=np.nanmax(q_table[i])
print(q_table_max.reshape(maze_columns, maze_row))
q_table_max=q_table_max.reshape(maze_columns, maze_row)
plt.figure(figsize=(10, 5))
plt.imshow(q_table_max, interpolation='nearest', vmin=0 ,cmap='YlOrRd')
plt.colorbar()
plt.show()
From the calculation results, it can be seen that the action value propagates very cleanly through the maze, which is solved from the upper left to the lower right, and a route to the Goal is formed.
Let us extend this to simulate learning evacuation behaviour in the event of a fire. The model above only places a reward on the Goal; as a simple extension, we prepare a cell representing a fire and give a negative reward when the agent enters it. Reaching a cell with a negative reward also ends the episode. In other words, the agent starts from Start with no knowledge and aims for the Goal while "dying" many times along the way.
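As a rough sketch of this extension (the original code for this part is not shown; the idea is simply that goal and reward become lists, e.g. a placeholder goal=[19, 7] with reward=[1, -1] for a fire in cell 7, and that goal_once ends the episode at any terminal cell), the episode loop could look like this:

#extension sketch: goal and reward are lists, e.g. goal=[19, 7], reward=[1, -1] (placeholder values)
def goal_once(start, goal, maze_row, maze_map, q_table, alfa, gamma, reward, epsilon):
    flag=True
    history_move=[]
    #initialize
    status=start
    #run until any terminal cell (Goal or fire) is reached
    while(flag):
        next_status, action=decide_action(status, maze_row, q_table, maze_map, epsilon)
        #q_learning is assumed here to take goal and reward as lists
        #(add the pesimism argument if using the version shown later)
        q_table=q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward)
        #record the history of action
        history_move.append([status, action])
        #the episode ends at any terminal cell
        if(next_status in goal):flag=False
        #status update
        status=next_status
    return q_table, history_move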
This extended calculation produced a very interesting result. Let us look at the Q-learning update formula again:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left\{ r + \gamma \left( \max_{a'} Q(s', a') - Q(s, a) \right) \right\}$$

Every update reflects the maximum action value of the destination cell, so even if a negative reward is present, it is ignored by the max everywhere except in the cells immediately adjacent to it. The Q-learning update is, so to speak, an optimistic estimate of the action value: put simply, the agent takes the best route even if a large negative reward lies right next to it. Even if flames are burning in the neighbouring cell, the agent is computed to pass by as long as that is the shortest route, which is a problem.
Now consider a slightly more pessimistic agent. Introducing a parameter $p$ (pessimism), the update formula is modified as

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left\{ r + \gamma \left( \frac{\max_{a'} Q(s', a') + p \min_{a'} Q(s', a')}{1 + p} - Q(s, a) \right) \right\}$$

In other words, the minimum action value at the destination is also taken into account at a certain rate. With pessimism = 0 this reduces to the original update formula, and as pessimism → ∞ the agent evaluates and learns only from the lowest action value of each cell. Concretely, the q_learning function, which currently reads:
q_learning
#q_learning function to update the Q_value
def q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward, pesimism):
    r=0
    #setting of reward
    for i in range(len(goal)):
        if(next_status==goal[i]):
            r=reward[i]
    #update formula of Q_value
    q_table[status][action]+=alfa*(r+gamma*(np.nanmax(q_table[next_status])-q_table[status][action]))
    return q_table
is rewritten as follows:
q_learning
#q_learning function to update the Q_value
def q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward, pesimism):
    r=0
    #setting of reward
    for i in range(len(goal)):
        if(next_status==goal[i]):
            r=reward[i]
    #update formula of Q_value: mix the max and min of the destination action values
    q_table[status][action]+=alfa*(r+gamma*(np.nanmax(q_table[next_status])/(1+pesimism)\
        +np.nanmin(q_table[next_status])*pesimism/(1+pesimism)-q_table[status][action]))
    return q_table
Running the calculation with this change gives the following result.
It can be seen that a route that properly avoids the location of the fire is selected.
Based on these findings, I expanded the scale of the maze, learned the route to the Goal starting from every cell of the maze, and averaged the resulting action values to make a heat map of the whole maze with respect to the Goal. The heat map itself has no essential meaning, but it is useful for showing clearly how much risk arises in the event of a fire and how the route selection changes.
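As a minimal sketch of this averaging (run_episodes below is a hypothetical wrapper around the episode loop from the script above, training a fresh q_table from a given start cell; it is not part of the original code, and goal is assumed to be the list form used in the extension):

#sketch: average the per-cell maximum action value over runs started from every cell
q_sum=np.zeros(len(maze_map))
n_runs=0
for start_cell in range(len(maze_map)):
    if(start_cell in goal):continue  #skip terminal cells (Goal and fire)
    q_trained=run_episodes(start_cell)  #hypothetical wrapper around the episode loop above
    q_sum+=np.array([np.nanmax(q_trained[i]) for i in range(len(q_trained))])
    n_runs+=1
q_mean=(q_sum/n_runs).reshape(maze_columns, maze_row)
plt.figure(figsize=(10, 5))
plt.imshow(q_mean, interpolation='nearest', vmin=0, cmap='YlOrRd')
plt.colorbar()
plt.show()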
The Goal is at the center of the top edge.
Without a fire:
With a fire:
It can be seen that when a fire breaks out on the right side of the floor, the action values of the right-hand area drop significantly. Conversely, of the two routes in the lower centre, the left-hand passage is valued more highly and is preferred. Even with the fire treated as just a single negative-reward cell, the results were quite interesting.
Although I announced this as independent research at the beginning, the following problems remain, and the work has not yet taken a concrete form.
Needless to say, the flow of people in a fire is a multi-agent problem, and that needs to be addressed. Actual evacuation behaviour also has characteristic features (some are described below), which must be taken into account; by comparison, this program stops at a modification of Q-learning.
In evacuation during a fire, the fire itself changes significantly over time, and the biggest factor affecting evacuation behaviour is smoke. The lateral spread speed of smoke is said to be roughly human walking speed, but its diffusion characteristics, and the speed and direction of its movement, need further examination. Evolving the negative-reward area turn by turn seems like a workable algorithmic approach, but fire evacuation experiments have pointed out that smoke also obstructs vision and slows walking speed, and it is unclear whether such effects can be expressed by negative rewards alone.
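Purely as an illustration of that idea (this function is not part of the original code, and the actual spread speed and direction would have to reflect the smoke properties discussed above), the set of negative-reward cells could be grown to its neighbours once per turn:

#sketch: expand the set of fire (negative-reward) cells to their reachable neighbours
#offsets follow the same [up, right, down, left] convention as decide_action
def spread_fire(fire_cells, maze_map, maze_row):
    new_cells=set(fire_cells)
    for c in fire_cells:
        for prob, offset in zip(maze_map[c], [-maze_row, 1, maze_row, -1]):
            #only spread through directions that are open in the maze
            if(prob>0):
                new_cells.add(c+offset)
    return new_cells

Calling something like this once per step (or per episode) would make the penalised region grow over time.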
As mentioned above, human evacuation behaviour shows characteristic patterns in experiments: people follow influential leaders, or simply go along with the crowd for the time being. Such behaviour occurs even when an exit is congested and heading for another exit would be more efficient. The problem is that the optimal route learned by reinforcement learning is not necessarily optimal once human irrationality and instinct are taken into account. Some people panic, and the biggest question is how much of this irrational behaviour can be folded into the reward, and whether the result is really optimal for human behaviour.
Ultimately, I think it would be interesting if the calculation could be run quickly on a server and the optimal evacuation route presented to each evacuee. The fact that the algorithm depends less on the environment than conventional cellular-automaton methods seems advantageous for that kind of extensibility. Has something like this already been put to practical use in related areas, such as urban transport or station transfers?
I would appreciate it if you could let me know if you have any useful literature.