** * Addition ** There was already a post with similar content.

Good evening. This is Regent.

Most of you who read this article will know FizzBuzz. If you are a programmer, you may have output from 1 to 100 according to the rules of FizzBuzz. This time I will try to do that FizzBuzz "let AI do it without writing code". The created program is published on GitHub.

[What is FizzBuzz](What is #FizzBuzz)
[Organize game elements](#Organize game elements)
[Building a game environment](#Building a game environment)
[Building a learning model](#Building a learning model)
[Build learning memory](#Build learning memory)
[Build Learning Agent](#Build Learning Agent)
[Try to learn](# Let's learn)
Result

What is FizzBuzz?

If you know what it is, please skip it.

Take a look at Wikipedia.

Fizz Buzz (also known as Fizz Buzz, Bizz Buzz or Buzz) is a word game that takes place during long-distance drives and drinking parties in the English-speaking world.

Simply put, when listing numbers in order, ** Fizz ** instead of numbers when multiples of 3, ** Buzz ** when multiples of 5, ** FizzBuzz in both cases. It's a game called **. It is Nabeats of the world.

Organize the elements of the game

The time is 2020, and it's obsolete for humans to play FizzBuzz anymore. Let's prepare only the rules of the game by humans and let the machine play the rest.

The method used this time is ** reinforcement learning **. It is a technique often used for AI in games, and is also used for AI in Go and Shogi.

I won't write about reinforcement learning here. If you are interested, please check it out for yourself.

When humans play FizzBuzz, they can actually give an answer if they only know the current number, but since it has no AI taste, this time ** the answer for the past 15 turns ** is now Give to AI as ** status ** of.

Building a game environment

First, build the game environment for FizzBuzz. Specifically, the following functions are implemented.

Current game status, ** score ** calculation
Transition to the next turn

The ** score ** here indicates how good each answer (number, FIzz, Buzz, FizzBuzz) to the current situation is to AI. AI will grow with the goal of continuing to score this score.

`FizzBuzz.py`


from random import randint
from numpy import array

class FizzBuzz:
    def __init__(self, start, end):
        self.start = start #The beginning
        self.end   = end   #the end

    def reset(self): #Initialization
        self.turn = self.start #Current numbers
        self.state = []        #Current state

        #Set initial status(Last 15 turns)
        for i in range(self.turn - 15, self.turn):
            if i % 15 == 0:
                self.state.append(3)
            elif i % 3 == 0:
                self.state.append(1)
            elif i % 5 == 0:
                self.state.append(2)
            else:
                self.state.append(0)

    #Whether you learned enough
    def is_learned(self):
        return self.turn == self.end

    #Turn transition
    def step(self, action, verbose=False):
        if verbose:
            print(self.turn, [self.turn, 'Fizz', 'Buzz', 'FizzBuzz'][action])

        reward   = 0     #score
        terminal = False #Game end judgment

        self.state = self.state[1:] + [action]

        if action == 1:   # Fizz
            if self.turn % 3 == 0 and self.turn % 5 != 0:
                reward   = 1
                terminal = False
            else:
                reward   = -1
                terminal = True
        elif action == 2: # Buzz
            if self.turn % 5 == 0 and self.turn % 3 != 0:
                reward   = 1
                terminal = False
            else:
                reward   = -1
                terminal = True
        elif action == 3: # FizzBuzz
            if self.turn % 15 == 0:
                reward   = 1
                terminal = False
            else:
                reward   = -1
                terminal = True
        else:             # Number
            if self.turn % 3 != 0 and self.turn % 5 != 0:
                reward   = 1
                terminal = False
            else:
                reward   = -1
                terminal = True

        if self.turn == self.end:
            terminal = True

        self.turn += 1

        return array(self.state), reward, terminal

    def random_step(self): #For initialization
        return array(self.state), 0, False

After initializing with reset (), learning is repeated while transitioning with step (). Also, I wrote is_learned () to judge the end so that learning can be automatically rounded up when the game reaches the last number.

Building a learning model

Next, create a ** model ** that is actually used for learning. This time, we will use a neural network with a simpler intermediate layer.

`model.py`


from keras.models import Sequential
from keras.layers import Dense, Reshape
from keras.optimizers import Adam
import numpy as np

class Model:
    def __init__(self):
        learning_rate = 0.01 #Learning rate
        state_size    = 15   #Input size(Current state)
        action_size   = 4    #Output size(0, 1, 2, 3)
        hidden_size   = 16   #The size of the hidden layer

        self.model = Sequential()
        self.model.add(Dense(hidden_size, activation='relu', input_dim=state_size))
        self.model.add(Dense(action_size, activation='softmax'))
        self.optimizer = Adam(lr=learning_rate)
        self.model.compile(loss='mse', optimizer=self.optimizer)
        self.model.summary()

    #Functions to actually learn
    def replay(self, memory, batch_size, gamma, target_model):
        inputs     = np.zeros((batch_size, 15))
        outputs    = np.zeros((batch_size, 4))
        mini_batch = memory.sample(batch_size)

        for i, (state, action, reward, next_state) in enumerate(mini_batch):
            inputs[i:i + 1] = state
            target          = reward

            if not (next_state == np.zeros(state.shape)).all():
                q = self.model.predict(next_state.reshape(1, 15))[0].argmax()
                next_action = np.argmax(q) #Select the action with the highest Q value as the next action
                target = reward + gamma * target_model.model.predict(
                    next_state.reshape(1, 15)
                )[0][next_action] #Actual reward

            #Correct the current expected value and let it learn
            outputs[i] = self.model.predict(state.reshape(1, 15))
            outputs[i][action.argmax()] = target

        self.model.fit(inputs, outputs, epochs=1, verbose=0)

Use replay () when actually learning. If training is performed every turn, the model will be affected by the time series of data, so a certain amount of data will be stored in the ** memory ** to be implemented later, and randomly extracted from it for training. I will.

Building learning memory

Implement a memory to store the data used for learning.

`memory.py`


from collections import deque
import numpy as np

class Memory:
    def __init__(self):
        self.buffer = deque()

    #Stores the current situation, how it worked, what happened as a result, and the reward for that action
    def add(self, exp):
        self.buffer.append(exp)

    #Extract data stored at random
    def sample(self, batch_size):
        indice = np.random.choice(np.arange(len(self.buffer)), size=batch_size, replace=False)
        return [self.buffer[i] for i in indice]

Build a learning agent

A person (?) Who actually decides an action during learning and learns based on the result is called a ** agent **.

`agent.py`


import numpy as np
from keras.utils.np_utils import to_categorical

class Agent:
    #Choose an action
    def get_action(self, state, epoch, main_model):
        epsilon = 0.001 + 0.9 / (1.0 + epoch)

        if epsilon < np.random.uniform(0, 1):
            action = main_model.model.predict(state.reshape(1, 15))[0].argmax()
        else: #Random operation with a certain probability
            action = np.random.choice([0, 1, 2, 3])

        return to_categorical(action, 4)

Select the action with get_action (). Basically, the behavior that the model expects is selected, but by performing random movements with a certain probability, it is possible to make a so-called ** adventure ** and discover a new good move.

Let's actually learn

Let's use the people implemented above to learn.

`train.py`


from fizzbuzz import FizzBuzz
from model import Model
from memory import Memory
from agent import Agent

def evaluate(env):
    env.reset()
    state, _, finished = env.random_step()
    while not finished:
        action = agent.get_action(state, N_EPOCHS, main_model)
        next_state, _, finished = env.step(action.argmax(), verbose=True)
        state = next_state

if __name__ == '__main__':
    N_EPOCHS = 5000 #Maximum number of trainings
    S_BATCH  = 4    #Batch size
    GAMMA    = 0.99 #Reward reduction rate over time

    env = FizzBuzz(1, 1000) #Learning environment

    main_model   = Model()
    target_model = Model()

    memory = Memory() #memory
    agent  = Agent()  #Agent

    learned_flag = False #Whether learning is complete

    for epoch in range(N_EPOCHS):
        if learned_flag:
            break

        print('Epoch: {}'.format(epoch + 1))

        #Initial state setting
        env.reset()
        state, reward, finished = env.random_step() 
        target_model.model.set_weights(main_model.model.get_weights())

        while not finished:
            action = agent.get_action(state, epoch, main_model)
            learned_flag = env.is_learned()
            next_state, reward, finished = env.step(action.argmax())

            memory.add((state, action, reward, next_state))

            state = next_state

            if len(memory.buffer) > S_BATCH:
                main_model.replay(memory, S_BATCH, GAMMA, target_model)

            target_model.model.set_weights(main_model.model.get_weights())

    env = FizzBuzz(1, 100)
    evaluate(env)

We also implemented evaluate () to evaluate at the end of learning. You can check it while actually outputting FizzBuzz.

result

It is a learning result.

Epoch: 70
Epoch: 71
Epoch: 72
1 1, 2 2, 3 Fizz, 4 4, 5 Buzz, 6 Fizz, 7 7, 8 8, 9 Fizz, 10 Buzz, 11 11, 12 Fizz, 13 13, 14 14,
15 FizzBuzz, 16 16, 17 17, 18 Fizz, 19 19, 20 Buzz, 21 Fizz, 22 22, 23 23, 24 Fizz, 25 Buzz, 26 26,
27 Fizz, 28 28, 29 29, 30 FizzBuzz, 31 31, 32 32, 33 Fizz, 34 34, 35 Buzz, 36 Fizz, 37 37, 38 38,
39 Fizz, 40 Buzz, 41 41, 42 Fizz, 43 43, 44 44, 45 FizzBuzz, 46 46, 47 47, 48 Fizz, 49 49, 50 Buzz,
51 Fizz, 52 52, 53 53, 54 Fizz, 55 Buzz, 56 56, 57 Fizz, 58 58, 59 59, 60 FizzBuzz, 61 61, 62 62,
63 Fizz, 64 64, 65 Buzz, 66 Fizz, 67 67, 68 68, 69 Fizz, 70 Buzz, 71 71, 72 Fizz, 73 73, 74 74,
75 FizzBuzz, 76 76, 77 77, 78 Fizz, 79 79, 80 Buzz, 81 Fizz, 82 82, 83 83, 84 Fizz, 85 Buzz, 86 86,
87 Fizz, 88 88, 89 89, 90 FizzBuzz, 91 91, 92 92, 93 Fizz, 94 94, 95 Buzz, 96 Fizz, 97 FizzBuzz,
98 98, 99 Fizz, 100 Buzz

After 72 laps of learning, I was able to reach 1 to 100.

Let's try the result from 4000.

4000 Buzz, 4001 4001, 4002 Fizz, 4003 4003, 4004 4004, 4005 FizzBuzz, 4006 4006, 4007 4007, 4008 Fizz,
4009 4009, 4010 Buzz, 4011 Fizz, 4012 4012, 4013 4013, 4014 Fizz, 4015 Buzz, 4016 4016, 4017 Fizz,
4018 4018, 4019 4019, 4020 FizzBuzz, 4021 4021, 4022 4022, 4023 Fizz, 4024 4024, 4025 Buzz, 4026 Fizz,
4027 4027, 4028 4028, 4029 Fizz, 4030 Buzz, 4031 4031, 4032 Fizz, 4033 4033, 4034 4034, 4035 FizzBuzz,
4036 4036, 4037 4037, 4038 Fizz, 4039 4039, 4040 Buzz, 4041 Fizz, 4042 4042, 4043 4043, 4044 Fizz,
4045 Buzz, 4046 4046, 4047 Fizz, 4048 4048, 4049 4049, 4050 FizzBuzz, 4051 4051, 4052 4052, 4053 Fizz,
4054 4054, 4055 Buzz, 4056 Fizz, 4057 4057, 4058 4058, 4059 Fizz, 4060 Buzz, 4061 4061, 4062 Fizz,
4063 4063, 4064 4064, 4065 FizzBuzz, 4066 4066, 4067 4067, 4068 Fizz, 4069 4069, 4070 Buzz, 4071 Fizz,
4072 4072, 4073 4073, 4074 Fizz, 4075 Buzz, 4076 4076, 4077 Fizz, 4078 4078, 4079 4079, 4080 FizzBuzz,
4081 4081, 4082 4082, 4083 Fizz, 4084 4084, 4085 Buzz, 4086 Fizz, 4087 4087, 4088 4088, 4089 Fizz,
4090 Buzz, 4091 4091, 4092 Fizz, 4093 4093, 4094 4094, 4095 FizzBuzz, 4096 4096, 4097 4097, 4098 Fizz,
4099 4099, 4100 Buzz, 4101 Fizz, 4102 4102, 4103 4103, 4104 Fizz, 4105 Buzz, 4106 4106, 4107 Fizz,
4108 4108, 4109 4109, 4110 FizzBuzz, 4111 4111, 4112 4112, 4113 Fizz, 4114 4114, 4115 Buzz, 4116 Fizz,
4117 4117, 4118 4118, 4119 Fizz, 4120 Buzz, 4121 4121, 4122 Fizz, 4123 4123, 4124 4124, 4125 FizzBuzz,
4126 4126, 4127 4127, 4128 Fizz, 4129 4129, 4130 Buzz, 4131 Fizz, 4132 4132, 4133 4133, 4134 Fizz,
4135 Buzz, 4136 4136, 4137 Fizz, 4138 4138, 4139 4139, 4140 FizzBuzz, 4141 4141, 4142 4142, 4143 Fizz,
4144 4144, 4145 Buzz, 4146 Fizz, 4147 4147, 4148 4148, 4149 Fizz, 4150 Buzz, 4151 4151, 4152 Fizz,
4153 4153, 4154 4154, 4155 FizzBuzz, 4156 4156, 4157 4157, 4158 Fizz, 4159 4159, 4160 Buzz, 4161 Fizz,
4162 4162, 4163 4163, 4164 Fizz, 4165 Buzz, 4166 4166, 4167 Fizz, 4168 4168, 4169 4169, 4170 FizzBuzz,
4171 4171, 4172 4172, 4173 Fizz, 4174 Fizz

In the 174th 4174, the output of ** 4174 ** was mistaken for ** Fizz **. There should be only about 15 types of states, so I wonder why I made a mistake ...

By the way, if you add the argument verbose = True to step () on the 43rd line of train.py, you can see how it learns while gradually increasing the number of turns. cute.

If you like, please follow us on Twitter. See ya.

Let the machine "learn" the rules of FizzBuzz

table of contents