I tried solving a maze of my own making with a Deep Q-Network (DQN).
The program is here. https://github.com/shibuiwilliam/maze_solver
DQN is a reinforcement-learning method that uses a neural network to select the optimal action. The following is a good reference on reinforcement learning and neural networks.
- Reinforcement learning from zero to deep - Qiita
Reinforcement learning is a technique used in games and robot control. It models a loop in which a player (also called an agent) takes an action in a situation (state), the situation changes, and the player receives a reward for that action. By repeating actions against situations, the player aims to maximize the reward.
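As a minimal, self-contained sketch of this loop (a toy 1-D corridor, not the maze or the code from the repository):

```python
import random

# A toy environment: a 1-D corridor with the goal at position 4.
# This is purely illustrative and not taken from the maze_solver repository.
GOAL = 4

def step(state, action):
    """Apply an action (-1 or +1) and return (next_state, reward, done)."""
    next_state = max(0, state + action)
    if next_state == GOAL:
        return next_state, 50, True   # reaching the goal gives a large reward
    return next_state, -1, False      # every other move costs a point

state, total_reward, done = 0, 0, False
while not done:
    action = random.choice([-1, 1])            # a random policy, just to show the loop
    state, reward, done = step(state, action)  # the situation changes and a reward comes back
    total_reward += reward                     # the player tries to maximize this
print("total reward:", total_reward)
```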
This time the situation is a maze of my own making. The maze is generated automatically by the program and looks like this.

The maze is 10x10: S is the starting point and the tile worth 50 is the goal. It is surrounded by a wall (#), and only the numbered tiles (-1 or 0) inside can be passed through; tiles marked # inside the wall are also impassable. The numbers on the tiles are points (rewards, in reinforcement-learning terms): -1 subtracts a point, 0 leaves the score unchanged, and the goal is worth 50 points. The player aims to maximize points while reaching the goal.
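For illustration, a smaller hypothetical maze in the same format might look like this (this is not the actual generated maze):

```
 #  #  #  #  #  #  #
 #  S  0 -1  0  0  #
 #  0  #  0 -1  0  #
 # -1  0  0  #  0  #
 #  0 -1  0  0 50  #
 #  #  #  #  #  #  #
```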
This time I implemented the following three approaches for the player to reach the goal: the A-star algorithm, Q-learning, and DQN. An outline of each follows.
A-star

This explanation of the A-star algorithm is very easy to understand: Search the maze with Python
Simply put, it measures the distance from the starting point to the current location plus the distance from the current location to the goal, and chooses the route with the smallest total. In this maze each distance is calculated as the Euclidean distance, the formula that appears in the Pythagorean theorem (the three-square theorem).
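For two points $(x_1, y_1)$ and $(x_2, y_2)$ this distance is

$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$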
The A-star algorithm solves the maze by prioritizing the shortest distance; it has no concept of points or rewards. Under this maze's rules it therefore solves the maze quickly but tends to earn a low reward.
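The repository has its own implementation; as an independent illustration, here is a minimal A-star sketch on a small hypothetical grid (0 = passable, 1 = wall), with unit step costs and the Euclidean distance as the heuristic:

```python
import heapq, math

# A minimal A-star sketch (not the maze_solver repo's implementation).
# 0 = passable tile, 1 = wall; the grid below is a small hypothetical example.
grid = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
start, goal = (0, 0), (3, 3)

def heuristic(a, b):
    # Euclidean distance from the current location to the goal
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def astar(grid, start, goal):
    # priority queue entries: (f = g + h, g = steps so far, node, path)
    open_set = [(heuristic(start, goal), 0, start, [start])]
    visited = set()
    while open_set:
        f, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            y, x = node[0] + dy, node[1] + dx
            if 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] == 0:
                heapq.heappush(open_set,
                               (g + 1 + heuristic((y, x), goal), g + 1, (y, x), path + [(y, x)]))
    return None

print(astar(grid, start, goal))
```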
Here is the path from the start to the goal when solved with A-star; the route taken is shown in blue.

It takes the shortest route through this maze, but the reward is low at 44 points: it aims for the shortest distance regardless of any -1 tiles.
Q-learning

Next is Q-learning. References: Q-learning - Wikipedia, Reinforcement learning starting with Python - Qiita.

Q-learning is a general-purpose method that can be used for more than maze search. It is a kind of machine learning: a model is built using the situation and one's action as training data and the reward as the target variable, producing a prediction of the reward for an action taken in a given situation. The reward estimate is updated with the following formula (the standard Q-learning update rule):
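$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$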
$Q(s, a)$ is the expected reward for taking action $a$ in situation $s$. $\alpha$ is the learning rate and $\gamma$ is the reward discount rate. $Q(s', a')$ is the expected reward for taking action $a'$ in the situation $s'$ that follows $s$; $\max$ takes the largest of these expected values, which is then multiplied by the discount rate $\gamma$. $r$ is the reward for taking action $a$ in situation $s$.

The key feature of Q-learning is that repeatedly applying this formula brings the expected reward $Q(s, a)$ and the target $r + \gamma \max Q(s', a')$ closer together (so that eventually $Q(s, a) \leftarrow Q(s, a) + 0$, i.e. the update converges). In Q-learning you first collect combinations of state, action, and reward by acting a sufficiently large number of times. With these as input, the formula above is applied repeatedly to derive an expected reward $Q(s, a)$ that matches the actual environment.
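As a concrete sketch of this update (independent of the repository's Q-learning code), here is tabular Q-learning on the same toy corridor used earlier:

```python
import random
from collections import defaultdict

# A generic tabular sketch of the Q-learning update above, reusing the toy
# 1-D corridor with the goal at position 4. Not the maze_solver repo's code.
GOAL, ALPHA, GAMMA, EPSILON = 4, 0.1, 0.9, 0.1
Q = defaultdict(float)  # Q[(state, action)] -> expected reward

def step(state, action):
    next_state = max(0, state + action)
    return (next_state, 50, True) if next_state == GOAL else (next_state, -1, False)

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit the current Q table, sometimes explore
        if random.random() < EPSILON:
            action = random.choice([-1, 1])
        else:
            action = max([-1, 1], key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = 0 if done else max(Q[(next_state, a)] for a in (-1, 1))
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# learned expected rewards for the "move right" action at each visited state
print({k: round(v, 1) for k, v in Q.items() if k[1] == 1})
```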
Q-learning aims to maximize the reward. In this maze the maximum reward is at the goal, so solving the maze with Q-learning heads quickly to the goal while avoiding the -1 tiles. Here is how it looks.

It passes over 0 tiles as much as possible and reaches the goal over the shortest distance. The reward is 48.0, higher than A-star.
DQN

DQN is a combination of Q-learning and deep learning. A neural network is built with the situation $s$ and action $a$ as input variables and the $r + \gamma \max Q(s', a')$ that appears in Q-learning as the target variable. By approximating $r + \gamma \max Q(s', a')$ from $s$ and $a$ so as to minimize the error, the network derives the action $a$ that maximizes the reward $r$ for the state $s$.
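Written as a loss, the network parameters $\theta$ are trained to minimize the squared error between the prediction and the Q-learning target. This is a minimal form that matches the code below, which uses a single network rather than a separate target network:

$L(\theta) = \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right)^2$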
What makes DQN different from an ordinary neural network is that the program generates its own training data. Training a neural network requires teacher data, but in DQN the action $a$ taken in a state $s$ and its reward $r$ are accumulated by actually acting. Combinations of $s$, $a$, and $r$ are recorded as the agent repeats actions, and learning runs once the memory has accumulated past a certain size. This is called Experience Replay: the agent learns from its own recorded combinations of $s$, $a$, $r$, builds a prediction model, acts again to collect more combinations of $s$, $a$, $r$, and revises the prediction model. This is the paper that proposed Experience Replay: https://pdfs.semanticscholar.org/9cd8/193a66cf53143cbba6ccb0c7b9c2ebf2452b.pdf This Japanese explanation is also easy to follow: History of DQN + Deep Q-Network written in Chainer - Qiita. To quote a part:
input: data $D$ consisting of many samples ($s$, $a$, $r$, $s'$)
output: the trained Q function $Q_{\theta_N}(s, a)$
The neural network is written in Keras. The model is the one defined in build_model below.
Here I will only include the DQN part of the program; the full program is at https://github.com/shibuiwilliam/maze_solver

Below is the agent.
It consists of the neural network model, a memory for storing states, actions and rewards, action selection, and Experience Replay.
The class variable `epsilon` is the probability of taking a random action. This value decays by `e_decay` as training progresses, and `e_min` is the minimum value of `epsilon`. Eventually the agent takes the optimal action according to the prediction model, but at the beginning it acts randomly to collect samples.
```python
from collections import deque
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import RMSprop


class DQN_Solver:
    def __init__(self, state_size, action_size):
        self.state_size = state_size        # list size of state
        self.action_size = action_size      # list size of action
        self.memory = deque(maxlen=100000)  # memory space
        self.gamma = 0.9                    # discount rate
        self.epsilon = 1.0                  # probability of choosing a random action instead of the best one
        self.e_decay = 0.9999               # epsilon decay rate
        self.e_min = 0.01                   # minimum rate of epsilon
        self.learning_rate = 0.0001         # learning rate of the neural network
        self.model = self.build_model()     # model
        self.model.summary()                # model summary

    # neural network model
    def build_model(self):
        model = Sequential()
        model.add(Dense(128, input_shape=(2, 2), activation='tanh'))
        model.add(Flatten())
        model.add(Dense(128, activation='tanh'))
        model.add(Dense(128, activation='tanh'))
        model.add(Dense(1, activation='linear'))
        model.compile(loss="mse", optimizer=RMSprop(lr=self.learning_rate))
        return model

    # remember state, action, its reward, next state and next possible actions. done is a boolean for reaching the goal
    def remember_memory(self, state, action, reward, next_state, next_movables, done):
        self.memory.append((state, action, reward, next_state, next_movables, done))

    # choose an action depending on epsilon
    def choose_action(self, state, movables):
        if self.epsilon >= random.random():
            # randomly choose an action
            return random.choice(movables)
        else:
            # choose the best action from model.predict()
            return self.choose_best_action(state, movables)

    # choose the best action to maximize the reward expectation
    def choose_best_action(self, state, movables):
        best_actions = []
        max_act_value = -100
        for a in movables:
            np_action = np.array([[state, a]])
            act_value = self.model.predict(np_action)
            if act_value > max_act_value:
                best_actions = [a,]
                max_act_value = act_value
            elif act_value == max_act_value:
                best_actions.append(a)
        return random.choice(best_actions)

    # experience replay: train the model from memorized states, actions and rewards
    def replay_experience(self, batch_size):
        batch_size = min(batch_size, len(self.memory))
        minibatch = random.sample(self.memory, batch_size)
        X = []
        Y = []
        for i in range(batch_size):
            state, action, reward, next_state, next_movables, done = minibatch[i]
            input_action = [state, action]
            if done:
                target_f = reward
            else:
                # target = reward + gamma * max Q(s', a') over the next possible actions
                next_rewards = []
                for next_a in next_movables:
                    np_next_s_a = np.array([[next_state, next_a]])
                    next_rewards.append(self.model.predict(np_next_s_a))
                np_n_r_max = np.amax(np.array(next_rewards))
                target_f = reward + self.gamma * np_n_r_max
            X.append(input_action)
            Y.append(target_f)
        np_X = np.array(X)
        np_Y = np.array([Y]).T
        self.model.fit(np_X, np_Y, epochs=1, verbose=0)
        if self.epsilon > self.e_min:
            self.epsilon *= self.e_decay
```
In Experience Replay, the current state and action, the reward, and the next state with its possible actions are passed to the neural network for training, which approximates the expected value of the current action.
Here is the model training. `episodes` is the number of training episodes and `times` is the number of samplings (search steps) in one episode. Sampling followed by training is repeated 1000 x 20000 times, but if the goal is reached partway through, that episode's sampling is cut short.

The number of samplings and episodes must be adjusted together with the values of `epsilon` and `e_decay`. If the sampling count (`times`) is too low, there is not enough data for Experience Replay; if the number of episodes (`episodes`) is too low, the neural network cannot train the model. `epsilon` decays every episode according to `e_decay`, so if it does not become small enough, the agent keeps acting randomly during sampling no matter how good the model is, and the model cannot be evaluated.

These values are adjusted to the size of the maze. At 10x10 a model could be trained with the numbers below, but for a larger maze, training ends before enough sampling has been done and the resulting model is biased.
```python
# train the model
state_size = 2
action_size = 2
dql_solver = DQN_Solver(state_size, action_size)
# number of episodes to run training
episodes = 20000
# number of times to sample the combination of state, action and reward in one episode
times = 1000

for e in range(episodes):
    state = start_point
    score = 0
    for time in range(times):
        movables = maze_field.get_actions(state)
        action = dql_solver.choose_action(state, movables)
        reward, done = maze_field.get_val(action)
        score = score + reward
        next_state = action
        next_movables = maze_field.get_actions(next_state)
        dql_solver.remember_memory(state, action, reward, next_state, next_movables, done)
        if done or time == (times - 1):
            if e % 500 == 0:
                print("episode: {}/{}, score: {}, e: {:.2} \t @ {}"
                      .format(e, episodes, score, dql_solver.epsilon, time))
            break
        state = next_state
    # run experience replay after sampling states, actions and rewards for the defined number of times
    dql_solver.replay_experience(32)
```
Below are the training results, printed every 500 episodes. score is the reward for the episode, e is epsilon, and @ is the number of sampling steps. A small @ means the goal was reached early in the sampling.
```
episode: 0/20000, score: 5.0, e: 1.0 @ 100
episode: 500/20000, score: 19.0, e: 0.95 @ 86
episode: 1000/20000, score: 14.0, e: 0.9 @ 82
episode: 1500/20000, score: 24.0, e: 0.86 @ 56
episode: 2000/20000, score: -9.0, e: 0.82 @ 138
episode: 2500/20000, score: 26.0, e: 0.78 @ 76
episode: 3000/20000, score: 20.0, e: 0.74 @ 84
episode: 3500/20000, score: 42.0, e: 0.7 @ 24
episode: 4000/20000, score: 44.0, e: 0.67 @ 22
episode: 4500/20000, score: 45.0, e: 0.64 @ 24
episode: 5000/20000, score: 46.0, e: 0.61 @ 36
episode: 5500/20000, score: 33.0, e: 0.58 @ 32
episode: 6000/20000, score: 43.0, e: 0.55 @ 32
episode: 6500/20000, score: 48.0, e: 0.52 @ 24
episode: 7000/20000, score: 38.0, e: 0.5 @ 34
episode: 7500/20000, score: 45.0, e: 0.47 @ 28
episode: 8000/20000, score: 43.0, e: 0.45 @ 42
episode: 8500/20000, score: 43.0, e: 0.43 @ 24
episode: 9000/20000, score: 45.0, e: 0.41 @ 28
episode: 9500/20000, score: 45.0, e: 0.39 @ 28
episode: 10000/20000, score: 48.0, e: 0.37 @ 20
episode: 10500/20000, score: 44.0, e: 0.35 @ 22
episode: 11000/20000, score: 46.0, e: 0.33 @ 24
episode: 11500/20000, score: 45.0, e: 0.32 @ 22
episode: 12000/20000, score: 48.0, e: 0.3 @ 26
episode: 12500/20000, score: 45.0, e: 0.29 @ 18
episode: 13000/20000, score: 47.0, e: 0.27 @ 18
episode: 13500/20000, score: 41.0, e: 0.26 @ 24
episode: 14000/20000, score: 47.0, e: 0.25 @ 16
episode: 14500/20000, score: 47.0, e: 0.23 @ 14
episode: 15000/20000, score: 47.0, e: 0.22 @ 14
episode: 15500/20000, score: 48.0, e: 0.21 @ 14
episode: 16000/20000, score: 46.0, e: 0.2 @ 18
episode: 16500/20000, score: 44.0, e: 0.19 @ 20
episode: 17000/20000, score: 48.0, e: 0.18 @ 14
episode: 17500/20000, score: 45.0, e: 0.17 @ 20
episode: 18000/20000, score: 48.0, e: 0.17 @ 18
episode: 18500/20000, score: 48.0, e: 0.16 @ 20
episode: 19000/20000, score: 46.0, e: 0.15 @ 18
episode: 19500/20000, score: 48.0, e: 0.14 @ 16
```
As the episodes progress, the score rises, epsilon decays, and the number of sampling steps decreases. You can see the behavior converging toward a good model.

Let's try solving the maze with the trained model.
```python
# try the model
state = start_point
score = 0
steps = 0
while True:
    steps += 1
    movables = maze_field.get_actions(state)
    action = dql_solver.choose_best_action(state, movables)
    print("current state: {0} -> action: {1} ".format(state, action))
    reward, done = maze_field.get_val(action)
    maze_field.display(state)
    score = score + reward
    state = action
    print("current step: {0} \t score: {1}\n".format(steps, score))
    if done:
        maze_field.display(action)
        print("goal!")
        break
```
The result is the same as with Q-learning. They optimize for the same objective, so a maze of this size may simply not be enough to show a difference.

I wanted to try DQN, and ended up comparing maze-search algorithms along the way. It was fun to see DQN work. The hardest part was writing a maze program that every algorithm could use; it still sometimes generates unsolvable mazes (the goal gets walled off), but I was tired and postponed fixing that. Tuning the DQN parameters also takes time. I'm currently testing on a 20x20 maze and will post again when that is done.