After reading the following articles, I thought DQN (Deep Q-Network) looked interesting. AlphaGo, which has been a hot topic recently, is also an extension of DQN... isn't it? (I don't really understand.)

- History of DQN + Deep Q-Network written in Chainer
- DQN (Deep Q Network) learning with an inverted pendulum
- Playing with machine learning with Chainer: Can an addition game be reinforcement-learned with Chainer?

So I tried to implement it with TensorFlow... (-_-;) I'm not really sure what I'm doing. To be honest, I'm attempting this without a good grasp of the theory or the math, and there seem to be very few TensorFlow examples out there. For now I've tried to imitate what I found, so I would appreciate comments pointing out any misunderstandings or corrections. Comments like "this part is correct" or "this part is fine" are also very helpful.
Other referenced site: Deep-Q learning Pong with Tensorflow and PyGame. I mainly referred to the source code in the top half of that page.
Consider the following game.

- There is a number line from 0 to 100.
- The program starts at 0 and moves toward 100.
- At each step the program has two choices: it can move +1 or +2.
- If the position it lands on is a multiple of 2, it receives a reward of +1; if it is a multiple of 8, it receives a reward of -1 (a penalty).

Let's train it a number of times and see how the program behaves.
- TensorFlow 0.7
- Ubuntu 14.04
- GCE vCPU x8 instance
The full source code is at the bottom; here I'll explain it piece by piece.
def inference(x_ph):
    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_HIDDEN1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.matmul(x_ph, weights) + biases
    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.matmul(hidden1, weights) + biases
    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases
    return y
There are two hidden layers with 100 units each; those numbers are fairly arbitrary (I'm adjusting them as I experiment). The input is a single number indicating the current position. There are two outputs: the expected rewards for moving +1 and +2 (I think). I saw many examples that zero-initialize the weights, but that didn't work here, so I used random initialization. (Is that a problem?) With ReLU as the activation function it didn't learn well, while connecting the layers with no activation function at all worked somewhat better, so I left the activation out. (Is that a problem?)
def loss(y, y_ph):
    return tf.reduce_mean(tf.nn.l2_loss((y - y_ph)))
The loss in the references is the squared error divided by two, so I implemented it with the equivalent API (tf.nn.l2_loss).
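For reference, tf.nn.l2_loss(t) computes sum(t ** 2) / 2, i.e. exactly "squared and halved". A quick sanity check of that with NumPy (the numbers here are made up):

import numpy as np

diff = np.array([1.0, -2.0, 3.0])   # pretend this is y - y_ph
manual = np.sum(diff ** 2) / 2.0    # (1 + 4 + 9) / 2 = 7.0
# tf.nn.l2_loss(tf.constant(diff)) evaluates to the same 7.0; the
# tf.reduce_mean wrapper above is then effectively a no-op, since
# l2_loss already returns a scalar.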
def getNextPositionReward(choice_position):
    if choice_position % 8 == 0:
        next_position_reward = -1.
    elif choice_position % 2 == 0:
        next_position_reward = 1.
    else:
        next_position_reward = 0.
    return next_position_reward
A function that returns a penalty if the next position is a multiple of 8 and a reward if it is a multiple of 2.
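A quick worked check of the rule (the branch order matters: every multiple of 8 is also a multiple of 2, and the penalty should win):

print(getNextPositionReward(8.))   # -1.0  multiple of 8 -> penalty
print(getNextPositionReward(4.))   #  1.0  multiple of 2 (but not 8) -> reward
print(getNextPositionReward(5.))   #  0.0  neither -> no reward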
def getNextPosition(position, action_reward1, action_reward2):
    # With probability RANDOM_FACTOR, explore: pick +1 or +2 at random.
    if random.random() < RANDOM_FACTOR:
        if random.randint(0, 1) == 0:
            next_position = position + 1
        else:
            next_position = position + 2
    # Otherwise exploit: pick the move with the higher estimated value.
    else:
        if action_reward1 > action_reward2:
            next_position = position + 1
        else:
            next_position = position + 2
    return next_position
This part compares the two reward estimates and decides whether to advance by +1 or +2. During training I mix in a certain amount of randomness (RANDOM_FACTOR) so that it occasionally picks a move at random instead of the greedy one.
for i in range(REPEAT_TIMES):
    position = 0.
    position_history = []
    reward_history = []
    while(True):
        if position >= GOAL:
            break
        # The two candidate next positions (+1 and +2).
        choice1_position = position + 1.
        choice2_position = position + 2.
        # Immediate reward at each candidate position.
        next_position1_reward = getNextPositionReward(choice1_position)
        next_position2_reward = getNextPositionReward(choice2_position)
        # The network's predicted rewards from each candidate position.
        reward1 = sess.run(y, feed_dict={x_ph: [[choice1_position]]})[0]
        reward2 = sess.run(y, feed_dict={x_ph: [[choice2_position]]})[0]
        # Training target: immediate reward + discounted best prediction.
        action_reward1 = next_position1_reward + GAMMA * np.max(reward1)
        action_reward2 = next_position2_reward + GAMMA * np.max(reward2)
        position_history.append([position])
        reward_history.append([action_reward1, action_reward2])
        position = getNextPosition(position, action_reward1, action_reward2)
    sess.run(train_step, feed_dict={x_ph: position_history, y_ph: reward_history})
The training part (excerpt). There are two options; I compare their values and pick the one with the higher value. Each value is the sum of the reward (certainly) obtained at the next position and the discounted maximum of the rewards the network predicts will (probably) be obtained after that. I also collect the current position and the pair of values into lists and use them as supervised-learning targets. This is repeated about 1000 times. ⇒ This is the part I'm worried about; I suspect I'm making a big mistake here.
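In other words, the training target for each action is a one-step, Q-learning-style value: the immediate reward at the next position plus the discounted maximum of the network's prediction from there. A small sketch with made-up numbers (the network output here is purely hypothetical):

GAMMA = 0.8
next_position1_reward = 1.0     # e.g. the +1 move lands on a multiple of 2
reward1 = [0.5, 1.2]            # hypothetical network output at that next position
action_reward1 = next_position1_reward + GAMMA * max(reward1)
print(action_reward1)           # 1.0 + 0.8 * 1.2 = 1.96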
Let's take a look at the trajectory of how it actually moved after training.
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]
It just returns even numbers, and it lands on every multiple of 8 perfectly! It seems to be chasing the positive rewards but making no attempt to avoid the negative ones. I tried various things such as making the penalty larger, and the trajectory above does change, so the output isn't stuck at a fixed value, but it never moved the way I wanted... By the way, the loss did converge.
import tensorflow as tf
import numpy as np
import random

# definition
NUM_IMPUT = 1
NUM_HIDDEN1 = 100
NUM_HIDDEN2 = 100
NUM_OUTPUT = 2
LEARNING_RATE = 0.1
REPEAT_TIMES = 100
GOAL = 100
LOG_DIR = "tf_log"
GAMMA = 0.8
stddev = 0.01
RANDOM_FACTOR = 0.1

def inference(x_ph):
    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_HIDDEN1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.matmul(x_ph, weights) + biases
    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.matmul(hidden1, weights) + biases
    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases
    return y

def loss(y, y_ph):
    return tf.reduce_mean(tf.nn.l2_loss((y - y_ph)))

def optimize(loss):
    optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
    train_step = optimizer.minimize(loss)
    return train_step

def getNextPositionReward(choice_position):
    if choice_position % 8 == 0:
        next_position_reward = -1.
    elif choice_position % 2 == 0:
        next_position_reward = 1.
    else:
        next_position_reward = 0.
    return next_position_reward

def getNextPosition(position, action_reward1, action_reward2):
    if random.random() < RANDOM_FACTOR:
        if random.randint(0, 1) == 0:
            next_position = position + 1
        else:
            next_position = position + 2
    else:
        if action_reward1 > action_reward2:
            next_position = position + 1
        else:
            next_position = position + 2
    return next_position

if __name__ == "__main__":
    x_ph = tf.placeholder(tf.float32, [None, NUM_IMPUT])
    y_ph = tf.placeholder(tf.float32, [None, NUM_OUTPUT])

    y = inference(x_ph)
    loss = loss(y, y_ph)
    tf.scalar_summary("Loss", loss)
    train_step = optimize(loss)

    sess = tf.Session()
    summary_op = tf.merge_all_summaries()
    init = tf.initialize_all_variables()
    sess.run(init)
    summary_writer = tf.train.SummaryWriter(LOG_DIR, graph_def=sess.graph_def)

    for i in range(REPEAT_TIMES):
        position = 0.
        position_history = []
        reward_history = []
        while(True):
            if position >= GOAL:
                break
            choice1_position = position + 1.
            choice2_position = position + 2.
            next_position1_reward = getNextPositionReward(choice1_position)
            next_position2_reward = getNextPositionReward(choice2_position)
            reward1 = sess.run(y, feed_dict={x_ph: [[choice1_position]]})[0]
            reward2 = sess.run(y, feed_dict={x_ph: [[choice2_position]]})[0]
            action_reward1 = next_position1_reward + GAMMA * np.max(reward1)
            action_reward2 = next_position2_reward + GAMMA * np.max(reward2)
            position_history.append([position])
            reward_history.append([action_reward1, action_reward2])
            position = getNextPosition(position, action_reward1, action_reward2)
        sess.run(train_step, feed_dict={x_ph: position_history, y_ph: reward_history})
        summary_str = sess.run(summary_op, feed_dict={x_ph: position_history, y_ph: reward_history})
        summary_writer.add_summary(summary_str, i)
        if i % 10 == 0:
            print "Count: " + str(i)

    # TEST
    position = 0
    position_history = []
    while(True):
        if position >= GOAL:
            break
        position_history.append(position)
        rewards = sess.run(y, feed_dict={x_ph: [[position]]})[0]
        choice = np.argmax(rewards)
        if choice == 0:
            position += 1
        else:
            position += 2
    print position_history
I look forward to your advice and criticism.
dsanno gave me some advice in the comments. Thank you very much. I'll try it out.
With this problem setting, I think you can learn even without a hidden layer, with a single embedding_lookup layer that has 100 input values (a one-hot vector representing the current position) and 2 output values.
I see, I see... I don't understand embedding_lookup yet, so I'll set that aside, change the input to a one-hot vector, and try it without a hidden layer.
def inference(x_ph):
    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(x_ph, weights) + biases
    return y
Below is a function to create a one-hot vector.
def onehot(idx):
    idx = int(idx)
    array = np.zeros(GOAL)
    array[idx] = 1.
    return array
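The rest of the wiring isn't shown above, so treat this as an outline of the remaining changes rather than the exact code: the input width becomes GOAL instead of 1, and everything that used to feed [[position]] now feeds the one-hot vector.

NUM_IMPUT = GOAL   # 100 input units, one per position on the number line
x_ph = tf.placeholder(tf.float32, [None, NUM_IMPUT])
y = inference(x_ph)

# ...and wherever the script previously fed the raw position, e.g.:
#     rewards = sess.run(y, feed_dict={x_ph: [[position]]})[0]
# it now feeds the one-hot vector instead:
#     rewards = sess.run(y, feed_dict={x_ph: [onehot(position)]})[0]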
[0, 2, 4, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 22, 23, 25, 26, 28, 30, 32, 34, 36, 38, 39, 41, 42, 44, 46, 47, 49, 50, 52, 53, 54, 55, 57, 58, 60, 62, 63, 65, 66, 68, 70, 71, 73, 74, 76, 78, 79, 81, 82, 84, 86, 88, 90, 92, 94, 95, 97, 98, 99]
That looks more like it. It's not perfect, but it does seem to be trying to avoid multiples of 8 while landing on multiples of 2 as much as possible.
With ReLU there is no upper limit on the output, which seems to be a poor fit here, so use an activation function with an upper bound, such as tanh or relu6.
I tried this with 100 and 100 hidden-layer units, keeping the single-number input. The result is not much different from having no activation function.
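The change itself is small. A sketch of what adding a bounded activation looks like, using tanh as an example (assuming everything else in inference() stays the same):

with tf.name_scope('hidden1'):
    weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_HIDDEN1], stddev=stddev), name='weights')
    biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
    # Bounded activation instead of a plain linear layer.
    hidden1 = tf.tanh(tf.matmul(x_ph, weights) + biases)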
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]
If you can assume the solution is periodic, use tf.sin as the activation function (for example, sin for the first layer and relu for the second). I tried this with 100 and 100 hidden-layer units, keeping the single-number input.
def inference(x_ph):
    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.zeros([NUM_IMPUT, NUM_HIDDEN1], dtype=tf.float32), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.sin(tf.matmul(x_ph, weights) + biases)
    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)
    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases
    return y
[0, 2, 4, 6, 8, 9, 10, 12, 14, 15, 17, 18, 20, 22, 23, 25, 26, 28, 29, 30, 31, 33, 34, 36, 38, 39, 41, 43, 44, 46, 47, 49, 50, 51, 53, 55, 57, 58, 60, 62, 63, 64, 66, 68, 69, 71, 73, 74, 76, 78, 79, 81, 82, 83, 84, 85, 87, 89, 90, 92, 94, 95, 97, 98]
It still steps on the first 8, but it does seem to be doing its best here as well.
I tweaked it a little and adjusted the hidden-layer units to 500 and 100.
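In terms of the constants at the top of the script, that adjustment is just the following (the other small corrections are not shown here):

NUM_HIDDEN1 = 500
NUM_HIDDEN2 = 100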
[0, 2, 4, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 22, 23, 25, 26, 28, 30, 31, 33, 34, 36, 38, 39, 41, 42, 44, 46, 47, 49, 50, 52, 54, 55, 57, 58, 60, 62, 63, 65, 66, 68, 70, 71, 73, 74, 76, 78, 79, 81, 82, 84, 86, 87, 89, 90, 92, 94, 95, 97, 98]
Is this perfect? Using sin() never even occurred to me. Thank you again, dsanno.
When I heard "artificial intelligence," I had the illusion that if I just threw a problem at it, it would figure everything out by itself, but I realized that the creator still has to think about the characteristics of the input data and the model.