** * Addition ** There was already a post with similar content.
Good evening. This is Regent.
Most of you who read this article will know FizzBuzz. If you are a programmer, you may have output from 1 to 100 according to the rules of FizzBuzz. This time I will try to do that FizzBuzz "let AI do it without writing code". The created program is published on GitHub.
If you know what it is, please skip it.
Take a look at Wikipedia.
Fizz Buzz (also known as Fizz Buzz, Bizz Buzz or Buzz) is a word game that takes place during long-distance drives and drinking parties in the English-speaking world.
Simply put, when listing numbers in order, ** Fizz ** instead of numbers when multiples of 3, ** Buzz ** when multiples of 5, ** FizzBuzz in both cases. It's a game called **. It is Nabeats of the world.
The time is 2020, and it's obsolete for humans to play FizzBuzz anymore. Let's prepare only the rules of the game by humans and let the machine play the rest.
The method used this time is ** reinforcement learning **. It is a technique often used for AI in games, and is also used for AI in Go and Shogi.
I won't write about reinforcement learning here. If you are interested, please check it out for yourself.
When humans play FizzBuzz, they can actually give an answer if they only know the current number, but since it has no AI taste, this time ** the answer for the past 15 turns ** is now Give to AI as ** status ** of.
First, build the game environment for FizzBuzz. Specifically, the following functions are implemented.
The ** score ** here indicates how good each answer (number, FIzz, Buzz, FizzBuzz) to the current situation is to AI. AI will grow with the goal of continuing to score this score.
FizzBuzz.py
from random import randint
from numpy import array
class FizzBuzz:
def __init__(self, start, end):
self.start = start #The beginning
self.end = end #the end
def reset(self): #Initialization
self.turn = self.start #Current numbers
self.state = [] #Current state
#Set initial status(Last 15 turns)
for i in range(self.turn - 15, self.turn):
if i % 15 == 0:
self.state.append(3)
elif i % 3 == 0:
self.state.append(1)
elif i % 5 == 0:
self.state.append(2)
else:
self.state.append(0)
#Whether you learned enough
def is_learned(self):
return self.turn == self.end
#Turn transition
def step(self, action, verbose=False):
if verbose:
print(self.turn, [self.turn, 'Fizz', 'Buzz', 'FizzBuzz'][action])
reward = 0 #score
terminal = False #Game end judgment
self.state = self.state[1:] + [action]
if action == 1: # Fizz
if self.turn % 3 == 0 and self.turn % 5 != 0:
reward = 1
terminal = False
else:
reward = -1
terminal = True
elif action == 2: # Buzz
if self.turn % 5 == 0 and self.turn % 3 != 0:
reward = 1
terminal = False
else:
reward = -1
terminal = True
elif action == 3: # FizzBuzz
if self.turn % 15 == 0:
reward = 1
terminal = False
else:
reward = -1
terminal = True
else: # Number
if self.turn % 3 != 0 and self.turn % 5 != 0:
reward = 1
terminal = False
else:
reward = -1
terminal = True
if self.turn == self.end:
terminal = True
self.turn += 1
return array(self.state), reward, terminal
def random_step(self): #For initialization
return array(self.state), 0, False
After initializing with reset (), learning is repeated while transitioning with step (). Also, I wrote is_learned () to judge the end so that learning can be automatically rounded up when the game reaches the last number.
Next, create a ** model ** that is actually used for learning. This time, we will use a neural network with a simpler intermediate layer.
model.py
from keras.models import Sequential
from keras.layers import Dense, Reshape
from keras.optimizers import Adam
import numpy as np
class Model:
def __init__(self):
learning_rate = 0.01 #Learning rate
state_size = 15 #Input size(Current state)
action_size = 4 #Output size(0, 1, 2, 3)
hidden_size = 16 #The size of the hidden layer
self.model = Sequential()
self.model.add(Dense(hidden_size, activation='relu', input_dim=state_size))
self.model.add(Dense(action_size, activation='softmax'))
self.optimizer = Adam(lr=learning_rate)
self.model.compile(loss='mse', optimizer=self.optimizer)
self.model.summary()
#Functions to actually learn
def replay(self, memory, batch_size, gamma, target_model):
inputs = np.zeros((batch_size, 15))
outputs = np.zeros((batch_size, 4))
mini_batch = memory.sample(batch_size)
for i, (state, action, reward, next_state) in enumerate(mini_batch):
inputs[i:i + 1] = state
target = reward
if not (next_state == np.zeros(state.shape)).all():
q = self.model.predict(next_state.reshape(1, 15))[0].argmax()
next_action = np.argmax(q) #Select the action with the highest Q value as the next action
target = reward + gamma * target_model.model.predict(
next_state.reshape(1, 15)
)[0][next_action] #Actual reward
#Correct the current expected value and let it learn
outputs[i] = self.model.predict(state.reshape(1, 15))
outputs[i][action.argmax()] = target
self.model.fit(inputs, outputs, epochs=1, verbose=0)
Use replay () when actually learning. If training is performed every turn, the model will be affected by the time series of data, so a certain amount of data will be stored in the ** memory ** to be implemented later, and randomly extracted from it for training. I will.
Implement a memory to store the data used for learning.
memory.py
from collections import deque
import numpy as np
class Memory:
def __init__(self):
self.buffer = deque()
#Stores the current situation, how it worked, what happened as a result, and the reward for that action
def add(self, exp):
self.buffer.append(exp)
#Extract data stored at random
def sample(self, batch_size):
indice = np.random.choice(np.arange(len(self.buffer)), size=batch_size, replace=False)
return [self.buffer[i] for i in indice]
A person (?) Who actually decides an action during learning and learns based on the result is called a ** agent **.
agent.py
import numpy as np
from keras.utils.np_utils import to_categorical
class Agent:
#Choose an action
def get_action(self, state, epoch, main_model):
epsilon = 0.001 + 0.9 / (1.0 + epoch)
if epsilon < np.random.uniform(0, 1):
action = main_model.model.predict(state.reshape(1, 15))[0].argmax()
else: #Random operation with a certain probability
action = np.random.choice([0, 1, 2, 3])
return to_categorical(action, 4)
Select the action with get_action (). Basically, the behavior that the model expects is selected, but by performing random movements with a certain probability, it is possible to make a so-called ** adventure ** and discover a new good move.
Let's use the people implemented above to learn.
train.py
from fizzbuzz import FizzBuzz
from model import Model
from memory import Memory
from agent import Agent
def evaluate(env):
env.reset()
state, _, finished = env.random_step()
while not finished:
action = agent.get_action(state, N_EPOCHS, main_model)
next_state, _, finished = env.step(action.argmax(), verbose=True)
state = next_state
if __name__ == '__main__':
N_EPOCHS = 5000 #Maximum number of trainings
S_BATCH = 4 #Batch size
GAMMA = 0.99 #Reward reduction rate over time
env = FizzBuzz(1, 1000) #Learning environment
main_model = Model()
target_model = Model()
memory = Memory() #memory
agent = Agent() #Agent
learned_flag = False #Whether learning is complete
for epoch in range(N_EPOCHS):
if learned_flag:
break
print('Epoch: {}'.format(epoch + 1))
#Initial state setting
env.reset()
state, reward, finished = env.random_step()
target_model.model.set_weights(main_model.model.get_weights())
while not finished:
action = agent.get_action(state, epoch, main_model)
learned_flag = env.is_learned()
next_state, reward, finished = env.step(action.argmax())
memory.add((state, action, reward, next_state))
state = next_state
if len(memory.buffer) > S_BATCH:
main_model.replay(memory, S_BATCH, GAMMA, target_model)
target_model.model.set_weights(main_model.model.get_weights())
env = FizzBuzz(1, 100)
evaluate(env)
We also implemented evaluate () to evaluate at the end of learning. You can check it while actually outputting FizzBuzz.
It is a learning result.
Epoch: 70
Epoch: 71
Epoch: 72
1 1, 2 2, 3 Fizz, 4 4, 5 Buzz, 6 Fizz, 7 7, 8 8, 9 Fizz, 10 Buzz, 11 11, 12 Fizz, 13 13, 14 14,
15 FizzBuzz, 16 16, 17 17, 18 Fizz, 19 19, 20 Buzz, 21 Fizz, 22 22, 23 23, 24 Fizz, 25 Buzz, 26 26,
27 Fizz, 28 28, 29 29, 30 FizzBuzz, 31 31, 32 32, 33 Fizz, 34 34, 35 Buzz, 36 Fizz, 37 37, 38 38,
39 Fizz, 40 Buzz, 41 41, 42 Fizz, 43 43, 44 44, 45 FizzBuzz, 46 46, 47 47, 48 Fizz, 49 49, 50 Buzz,
51 Fizz, 52 52, 53 53, 54 Fizz, 55 Buzz, 56 56, 57 Fizz, 58 58, 59 59, 60 FizzBuzz, 61 61, 62 62,
63 Fizz, 64 64, 65 Buzz, 66 Fizz, 67 67, 68 68, 69 Fizz, 70 Buzz, 71 71, 72 Fizz, 73 73, 74 74,
75 FizzBuzz, 76 76, 77 77, 78 Fizz, 79 79, 80 Buzz, 81 Fizz, 82 82, 83 83, 84 Fizz, 85 Buzz, 86 86,
87 Fizz, 88 88, 89 89, 90 FizzBuzz, 91 91, 92 92, 93 Fizz, 94 94, 95 Buzz, 96 Fizz, 97 FizzBuzz,
98 98, 99 Fizz, 100 Buzz
After 72 laps of learning, I was able to reach 1 to 100.
Let's try the result from 4000.
4000 Buzz, 4001 4001, 4002 Fizz, 4003 4003, 4004 4004, 4005 FizzBuzz, 4006 4006, 4007 4007, 4008 Fizz,
4009 4009, 4010 Buzz, 4011 Fizz, 4012 4012, 4013 4013, 4014 Fizz, 4015 Buzz, 4016 4016, 4017 Fizz,
4018 4018, 4019 4019, 4020 FizzBuzz, 4021 4021, 4022 4022, 4023 Fizz, 4024 4024, 4025 Buzz, 4026 Fizz,
4027 4027, 4028 4028, 4029 Fizz, 4030 Buzz, 4031 4031, 4032 Fizz, 4033 4033, 4034 4034, 4035 FizzBuzz,
4036 4036, 4037 4037, 4038 Fizz, 4039 4039, 4040 Buzz, 4041 Fizz, 4042 4042, 4043 4043, 4044 Fizz,
4045 Buzz, 4046 4046, 4047 Fizz, 4048 4048, 4049 4049, 4050 FizzBuzz, 4051 4051, 4052 4052, 4053 Fizz,
4054 4054, 4055 Buzz, 4056 Fizz, 4057 4057, 4058 4058, 4059 Fizz, 4060 Buzz, 4061 4061, 4062 Fizz,
4063 4063, 4064 4064, 4065 FizzBuzz, 4066 4066, 4067 4067, 4068 Fizz, 4069 4069, 4070 Buzz, 4071 Fizz,
4072 4072, 4073 4073, 4074 Fizz, 4075 Buzz, 4076 4076, 4077 Fizz, 4078 4078, 4079 4079, 4080 FizzBuzz,
4081 4081, 4082 4082, 4083 Fizz, 4084 4084, 4085 Buzz, 4086 Fizz, 4087 4087, 4088 4088, 4089 Fizz,
4090 Buzz, 4091 4091, 4092 Fizz, 4093 4093, 4094 4094, 4095 FizzBuzz, 4096 4096, 4097 4097, 4098 Fizz,
4099 4099, 4100 Buzz, 4101 Fizz, 4102 4102, 4103 4103, 4104 Fizz, 4105 Buzz, 4106 4106, 4107 Fizz,
4108 4108, 4109 4109, 4110 FizzBuzz, 4111 4111, 4112 4112, 4113 Fizz, 4114 4114, 4115 Buzz, 4116 Fizz,
4117 4117, 4118 4118, 4119 Fizz, 4120 Buzz, 4121 4121, 4122 Fizz, 4123 4123, 4124 4124, 4125 FizzBuzz,
4126 4126, 4127 4127, 4128 Fizz, 4129 4129, 4130 Buzz, 4131 Fizz, 4132 4132, 4133 4133, 4134 Fizz,
4135 Buzz, 4136 4136, 4137 Fizz, 4138 4138, 4139 4139, 4140 FizzBuzz, 4141 4141, 4142 4142, 4143 Fizz,
4144 4144, 4145 Buzz, 4146 Fizz, 4147 4147, 4148 4148, 4149 Fizz, 4150 Buzz, 4151 4151, 4152 Fizz,
4153 4153, 4154 4154, 4155 FizzBuzz, 4156 4156, 4157 4157, 4158 Fizz, 4159 4159, 4160 Buzz, 4161 Fizz,
4162 4162, 4163 4163, 4164 Fizz, 4165 Buzz, 4166 4166, 4167 Fizz, 4168 4168, 4169 4169, 4170 FizzBuzz,
4171 4171, 4172 4172, 4173 Fizz, 4174 Fizz
In the 174th 4174, the output of ** 4174 ** was mistaken for ** Fizz **. There should be only about 15 types of states, so I wonder why I made a mistake ...
By the way, if you add the argument verbose = True to step () on the 43rd line of train.py, you can see how it learns while gradually increasing the number of turns. cute.
If you like, please follow us on Twitter. See ya.
Recommended Posts