Over spring break I was doing independent research applying reinforcement learning to safety engineering. It does not look like it will reach a finished form, so here I present what I have so far, together with the background.
The maze problem is a classic example in reinforcement learning. For the implementation, I referred to an existing example.
Problem description: the aim is to learn a route from S (Start) to G (Goal) in a maze like the one shown in the figure. Here we use Q-learning with the ε-greedy method.
In Q-learning, the movement options (actions) available in each cell are defined, and an action value is maintained for each of them. When an action is taken in a cell, the action value of that cell-action pair is moved a certain fraction toward the (discounted) maximum action value of the destination cell. In addition, the Goal carries a reward; when the agent reaches the Goal, the action value is updated with that reward and one episode of solving the maze ends.
The update formula for the action value Q, in its basic form, is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left\{ r + \gamma \left( \max_{a'} Q(s', a') - Q(s, a) \right) \right\}$$

where $Q$ is the action value, $s$ is the state (the agent's position, i.e. the cell), $a$ is the action, $\alpha$ is the learning rate, $\gamma$ is the time discount rate, and $r$ is the reward. $s'$ and $a'$ denote the state and action one step ahead (that is, the action value of the chosen destination).
As the formula shows, the update in Q-learning is driven by the difference from the action value Q of the next cell. If every cell starts with a uniform value of 0, the agent simply wanders randomly from Start, and the action values only begin to change once it happens to reach the Goal. That is exactly what happens in practice: at first the search is random, and once a reward is obtained the action values gradually propagate from the Goal and its neighbouring cells back toward Start.
On the other hand, this formula has a drawback: once a chain of action values from Start to Goal has been established, the agent will always follow only that route (all other cells remain at 0, so actions leading into them are never selected). Therefore, when choosing an action, the agent acts randomly with a certain probability ε, and ε is gradually decreased over time. That, in brief, is the ε-greedy method.
In safety engineering, fire is one of the major research subjects. Regarding evacuation in the event of a fire, a variety of experimental and computational studies have been reported in previous work; one example is the simulation of fire evacuation behaviour using a maze. On the computational side, several classes of model (for example cellular automata) are frequently discussed, but in each case the calculation relies on detailed knowledge of the environment and on strongly environment-dependent rules. As a result, many of these studies are discussed together with verification against actual fire cases, and questions remain about their ability to predict evacuation behaviour.
What I consider here, then, is applying the Q-learning maze problem to compute the optimal route from the outbreak of a fire to evacuation, and to treat it quantitatively. One advantage of this approach is that the parameters set in advance depend little on the environment. It is also expected to be highly extensible, since behaviour can be changed through the rewards and actions rather than by adding rules to the algorithm.
The quickest explanation is to show the code.
plain_Q_model.py
import numpy as np
import matplotlib.pyplot as plt
#decide the direction to move by Q table and random probability epsilon
#and convert it as movement on the map.
def decide_action(status, maze_row, q_table, maze_map, epsilon):
direction=["up", "right", "down", "left"]
if(np.random.rand() < epsilon):
#at random choice
direction_to=np.random.choice(direction, p=maze_map[status])
else:
#direction is selected at max Q value direction
direction_to=direction[np.nanargmax(q_table[status])]
#convert direction to map information at matrix
if(direction_to=="up"):
status=status-maze_row
action=0
elif(direction_to=="right"):
status=status+1
action=1
elif(direction_to=="down"):
status=status+maze_row
action=2
elif(direction_to=="left"):
status=status-1
action=3
return status, action
#q_learning function to update the Q_value
def q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward):
    #setting of reward
    if(next_status==goal):
        r=reward
    else:
        r=0
    #update formula of Q_value
    q_table[status][action]+=alfa*(r+gamma*(np.nanmax(q_table[next_status])-q_table[status][action]))
    return q_table
#solve and update the maze at once
def goal_once(start, goal, maze_row, maze_map, q_table, alfa, gamma, reward, epsilon):
    flag=True
    history_move=[]
    #initialize
    status=start
    #solve maze until it gets the goal
    while(flag):
        next_status, action=decide_action(status, maze_row, q_table, maze_map, epsilon)
        q_table=q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward)
        #record the history of action
        history_move.append([status, action])
        #flag of goal
        if(next_status==goal):flag=False
        #status update
        status=next_status
    return q_table, history_move
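#movement probability vectors: entries are [up, right, down, left] (keys w, d, s, a),
#normalized so that each allowed direction is drawn with equal probability during random moves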
move_0=np.array([0, 0, 0, 0])
move_w=np.array([1, 0, 0, 0])
move_d=np.array([0, 1, 0, 0])
move_s=np.array([0, 0, 1, 0])
move_a=np.array([0, 0, 0, 1])
move_wd=np.array([1, 1, 0, 0])/2
move_ws=np.array([1, 0, 1, 0])/2
move_wa=np.array([1, 0, 0, 1])/2
move_ds=np.array([0, 1, 1, 0])/2
move_da=np.array([0, 1, 0, 1])/2
move_sa=np.array([0, 0, 1, 1])/2
move_wds=np.array([1, 1, 1, 0])/3
move_wda=np.array([1, 1, 0, 1])/3
move_wsa=np.array([1, 0, 1, 1])/3
move_dsa=np.array([0, 1, 1, 1])/3
move_wdsa=np.array([1, 1, 1, 1])/4
###input form###
maze_map=np.array([move_ds, move_dsa, move_dsa, move_dsa, move_sa, \
move_wds, move_wdsa, move_wdsa, move_wdsa, move_wsa,\
move_wds, move_wdsa, move_wdsa, move_wdsa, move_wsa,\
move_wd, move_wda, move_wda, move_wda, move_wa])
q_table=np.where(maze_map==0.0, np.nan, 0)
maze_row=5
maze_columns=4
start=0
goal=19
reward=1
time_of_episode=100
alfa = 0.10 #Learning rate
gamma = 0.92 #Time discount rate
epsilon = 0.99 #initial value of ε for the ε-greedy method
probability_reduce_rate=1.04
###input form end###
history_move=[]
size_of_epi=[]
episode = 1
flag=True
while(flag):
    q_table, history_move=goal_once(start, goal, maze_row, maze_map, q_table, alfa, gamma, reward, epsilon)
    print("this is episode: {0:5d}, steps to goal are {1:5d}".format(episode, len(history_move)))
    size_of_epi.append(len(history_move))
    if(time_of_episode==episode):
        flag=False
        episode=episode-1
    episode+=1
    epsilon=epsilon/probability_reduce_rate
q_table=q_table/np.nanmax(q_table)
direction=["↑", "→", "↓", "←"]
for i in range(len(history_move)):
    print(direction[history_move[i][1]])
plt.plot(size_of_epi)
plt.show()
q_table[goal]=1.0
np.set_printoptions(suppress=True, precision=4, floatmode='maxprec')
#print(maze_map)
print(q_table)
q_table_max=np.zeros(goal+1)
for i in range(len(q_table)):
    q_table_max[i]=np.nanmax(q_table[i])
print(q_table_max.reshape(maze_columns, maze_row))
q_table_max=q_table_max.reshape(maze_columns, maze_row)
plt.figure(figsize=(10, 5))
plt.imshow(q_table_max, interpolation='nearest', vmin=0 ,cmap='YlOrRd')
plt.colorbar()
plt.show()
From the calculation results, it can be seen that the action value propagates very cleanly through the maze, which is solved from the upper left to the lower right, and a route to the Goal is formed.
Let us extend this to simulate learning evacuation behaviour in the event of a fire. The model above only places a reward on the Goal; as a simple extension, we prepare a cell representing a fire and give a negative reward when the agent enters it. Reaching a cell with a negative reward also ends the episode. In other words, the agent starts from Start with no knowledge and aims for the Goal while "dying" many times along the way.
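As a rough sketch of this extension (the original code for this part is not shown; the idea is simply that goal and reward become lists, e.g. a placeholder goal=[19, 7] with reward=[1, -1] for a fire in cell 7, and that goal_once ends the episode at any terminal cell), the episode loop could look like this:

#extension sketch: goal and reward are lists, e.g. goal=[19, 7], reward=[1, -1] (placeholder values)
def goal_once(start, goal, maze_row, maze_map, q_table, alfa, gamma, reward, epsilon):
    flag=True
    history_move=[]
    #initialize
    status=start
    #run until any terminal cell (Goal or fire) is reached
    while(flag):
        next_status, action=decide_action(status, maze_row, q_table, maze_map, epsilon)
        #q_learning is assumed here to take goal and reward as lists
        #(add the pesimism argument if using the version shown later)
        q_table=q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward)
        #record the history of action
        history_move.append([status, action])
        #the episode ends at any terminal cell
        if(next_status in goal):flag=False
        #status update
        status=next_status
    return q_table, history_move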
This extended calculation produced a very interesting result. Let us look at the Q-learning update formula again:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left\{ r + \gamma \left( \max_{a'} Q(s', a') - Q(s, a) \right) \right\}$$

Every update reflects the maximum action value of the destination cell, so even if a negative reward is present, it is ignored by the max everywhere except in the cells immediately adjacent to it. The Q-learning update is, so to speak, an optimistic estimate of the action value: put simply, the agent takes the best route even if a large negative reward lies right next to it. Even if flames are burning in the neighbouring cell, the agent is computed to pass by as long as that is the shortest route, which is a problem.
Now consider a slightly more pessimistic agent. Introducing a parameter $p$ (pessimism), the update formula is modified as

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left\{ r + \gamma \left( \frac{\max_{a'} Q(s', a') + p \min_{a'} Q(s', a')}{1 + p} - Q(s, a) \right) \right\}$$

In other words, the minimum action value at the destination is also taken into account at a certain rate. With pessimism = 0 this reduces to the original update formula, and as pessimism → ∞ the agent evaluates and learns only from the lowest action value of each cell. Concretely, the q_learning function, which currently reads:
q_learning
#q_learning function to update the Q_value
def q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward, pesimism):
    r=0
    #setting of reward
    for i in range(len(goal)):
        if(next_status==goal[i]):
            r=reward[i]
    #update formula of Q_value
    q_table[status][action]+=alfa*(r+gamma*(np.nanmax(q_table[next_status])-q_table[status][action]))
    return q_table
is rewritten as follows:
q_learning
#q_learning function to update the Q_value
def q_learning(status, action, next_status, goal, q_table, alfa, gamma, reward, pesimism):
    r=0
    #setting of reward
    for i in range(len(goal)):
        if(next_status==goal[i]):
            r=reward[i]
    #update formula of Q_value: mix the max and min of the destination action values
    q_table[status][action]+=alfa*(r+gamma*(np.nanmax(q_table[next_status])/(1+pesimism)\
        +np.nanmin(q_table[next_status])*pesimism/(1+pesimism)-q_table[status][action]))
    return q_table
Running the calculation with this change gives the following result.
It can be seen that a route that properly avoids the location of the fire is selected.
Based on these findings, I expanded the scale of the maze, learned the route to the Goal starting from every cell of the maze, and averaged the resulting action values to make a heat map of the whole maze with respect to the Goal. The heat map itself has no essential meaning, but it is useful for showing clearly how much risk arises in the event of a fire and how the route selection changes.
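As a minimal sketch of this averaging (run_episodes below is a hypothetical wrapper around the episode loop from the script above, training a fresh q_table from a given start cell; it is not part of the original code, and goal is assumed to be the list form used in the extension):

#sketch: average the per-cell maximum action value over runs started from every cell
q_sum=np.zeros(len(maze_map))
n_runs=0
for start_cell in range(len(maze_map)):
    if(start_cell in goal):continue  #skip terminal cells (Goal and fire)
    q_trained=run_episodes(start_cell)  #hypothetical wrapper around the episode loop above
    q_sum+=np.array([np.nanmax(q_trained[i]) for i in range(len(q_trained))])
    n_runs+=1
q_mean=(q_sum/n_runs).reshape(maze_columns, maze_row)
plt.figure(figsize=(10, 5))
plt.imshow(q_mean, interpolation='nearest', vmin=0, cmap='YlOrRd')
plt.colorbar()
plt.show()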
The Goal is at the center of the top edge.
Without a fire:
With a fire:
It can be seen that when a fire breaks out on the right side of the floor, the action values of the right-hand area drop significantly. Conversely, of the two routes in the lower centre, the left-hand passage is valued more highly and is preferred. Even with the fire treated as just a single negative-reward cell, the results were quite interesting.
Although I announced this as independent research at the beginning, the following problems remain, and the work has not yet taken a concrete form.
Needless to say, the flow of people in a fire is a multi-agent problem, and that needs to be addressed. Actual evacuation behaviour also has characteristic features (some are described below), which must be taken into account; by comparison, this program stops at a modification of Q-learning.
In evacuation during a fire, the fire itself changes significantly over time, and the biggest factor affecting evacuation behaviour is smoke. The lateral spread speed of smoke is said to be roughly human walking speed, but its diffusion characteristics, and the speed and direction of its movement, need further examination. Evolving the negative-reward area turn by turn seems like a workable algorithmic approach, but fire evacuation experiments have pointed out that smoke also obstructs vision and slows walking speed, and it is unclear whether such effects can be expressed by negative rewards alone.
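Purely as an illustration of that idea (this function is not part of the original code, and the actual spread speed and direction would have to reflect the smoke properties discussed above), the set of negative-reward cells could be grown to its neighbours once per turn:

#sketch: expand the set of fire (negative-reward) cells to their reachable neighbours
#offsets follow the same [up, right, down, left] convention as decide_action
def spread_fire(fire_cells, maze_map, maze_row):
    new_cells=set(fire_cells)
    for c in fire_cells:
        for prob, offset in zip(maze_map[c], [-maze_row, 1, maze_row, -1]):
            #only spread through directions that are open in the maze
            if(prob>0):
                new_cells.add(c+offset)
    return new_cells

Calling something like this once per step (or per episode) would make the penalised region grow over time.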
As mentioned above, human evacuation behaviour shows characteristic patterns in experiments: people follow influential leaders, or simply go along with the crowd for the time being. Such behaviour occurs even when an exit is congested and heading for another exit would be more efficient. The problem is that the optimal route learned by reinforcement learning is not necessarily optimal once human irrationality and instinct are taken into account. Some people panic, and the biggest question is how much of this irrational behaviour can be folded into the reward, and whether the result is really optimal for human behaviour.
Ultimately, I think it would be interesting if the calculation could be run quickly on a server and the optimal evacuation route presented to each evacuee. The fact that the algorithm depends less on the environment than conventional cellular-automaton methods seems advantageous for that kind of extensibility. Has something like this already been put to practical use in related areas, such as urban transport or station transfers?
I would appreciate it if you could let me know if you have any useful literature.