As practice, I'm using PyTorch to train an agent on my own Minesweeper implementation. There's a lot I don't understand yet, so these are notes taken while I study. I'm writing this as a rough memo first and will tidy it up later if needed.
The goal is to reliably clear the beginner level of the (formerly) standard Windows Minesweeper. For now, I'm aiming for a win rate of about 90%.
I copied the DQN from here. That alone isn't enough, so here is a brief description.
The network uses a sequential model.
The number of neurons in the input layer (state $s$) [...] SIZE_MAG.
For now, I figure the layer sizes should scale with the number of squares on the board (a rough choice).
ReLU is used as the activation function, and Adam (learning rate 0.001) is used as the optimization method.
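The following is a minimal sketch of such a network, not the original code. The text only confirms a sequential model, ReLU, Adam with a learning rate of 0.001, and the SIZE_MAG parameter. The layer widths (one input and one output per square, hidden width equal to the number of squares times SIZE_MAG) and the use of two hidden layers are assumptions made for illustration.

```python
import torch
from torch import nn, optim

BOARD_H, BOARD_W = 6, 6          # board size used in the first experiment
SIZE_MAG = 8                     # multiplier from the parameter list
n_squares = BOARD_H * BOARD_W    # one input per square (assumption)
n_hidden = n_squares * SIZE_MAG  # hidden width scaled by board size (assumption)

# Sequential model with ReLU activations; the number of hidden layers is an assumption.
model = nn.Sequential(
    nn.Linear(n_squares, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_squares),  # one Q-value per square that could be opened
)

# Adam with the learning rate stated in the text.
optimizer = optim.Adam(model.parameters(), lr=0.001)

# A batch holding one flattened board yields one Q-value per square.
state = torch.zeros(1, n_squares)
q_values = model(state)
```

With a 6x6 board this gives 36 inputs, 288 units per hidden layer, and 36 outputs, so the network grows automatically when the board size changes.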
The Minesweeper implementation is my own. I thought about going all out and reading the game from screen captures, but that isn't the point here. The game's algorithm is omitted.
The rewards are as follows.
| Variable | Condition |
|---|---|
| reward_win | Game cleared |
| reward_failed | Game failed |
| reward_miss | Tried to open a square that is already open |
First, I set the board size to 6x6 and the number of mines to 5 to see whether it can learn at all.
param
GAMMA = 0.99
NUM_EPISODES = 1000
CAPACITY = 10000
BATCH_SIZE = 200
SIZE_MAG = 8
reward_failed = -100
reward_win = 100
reward_miss = -1
It wins very rarely, and those wins look like luck. The error blew up to about four digits within roughly 2000 steps. Hmm... Judging from the plain reward sum, it keeps trying to open squares that are already open. Is the reward the problem?
So I revised the rewards.
reward
reward_failed = -100
reward_win = 100
reward_miss = -10
reward_open = 1
reward_open is a reward given when you open a new square.
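As a sketch of how these four rewards could be assigned after each action: the function name and its boolean arguments are hypothetical and only stand in for whatever the game implementation reports. The check order (already open, then mine, then clear) is also an assumption.

```python
# Reward values from the revised settings above.
reward_failed = -100
reward_win = 100
reward_miss = -10
reward_open = 1

def compute_reward(already_open, hit_mine, cleared):
    """Return the reward for one action (hypothetical helper, not the original code)."""
    if already_open:
        return reward_miss    # tried to open a square that is already open
    if hit_mine:
        return reward_failed  # game failure
    if cleared:
        return reward_win     # game clear
    return reward_open        # opened a new, safe square

# Example: opening a fresh safe square mid-game.
r = compute_reward(already_open=False, hit_mine=False, cleared=False)
```

Keeping the values as module-level variables makes it easy to rerun the experiment with a different reward scale.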
The error was calmer than before, but it kept oscillating around 10.
I tweaked various things, but the oscillation and divergence didn't stop. Watching its behavior, it still tries to open squares that are already open. Time to introduce a fixed target Q-network...
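The idea behind a fixed target Q-network can be sketched without a deep learning library: the TD target is computed from a frozen copy of the Q-function, and that copy is only synchronized every so often. Everything below is illustrative; the table-based stand-in for the network, the dummy transition, and the value of TARGET_UPDATE are assumptions, not the original code.

```python
import copy

GAMMA = 0.99        # discount factor from the parameter list
TARGET_UPDATE = 10  # sync interval in steps (hypothetical value)

class TabularQ:
    """Tiny stand-in for a Q-network: a table of Q-values per state and action."""
    def __init__(self, n_states, n_actions):
        self.table = [[0.0] * n_actions for _ in range(n_states)]

def td_target(target_q, reward, next_state, done):
    """Bootstrap from the frozen target copy, not from the table being trained."""
    if done:
        return reward
    return reward + GAMMA * max(target_q.table[next_state])

main_q = TabularQ(n_states=4, n_actions=2)
target_q = copy.deepcopy(main_q)

for step in range(1, 26):
    # Dummy transition: state 0, action 0, reward 1, next state 1, not terminal.
    y = td_target(target_q, reward=1.0, next_state=1, done=False)
    main_q.table[0][0] += 0.5 * (y - main_q.table[0][0])  # update only the main table
    main_q.table[1][1] += 0.1                             # other values drift as well
    if step % TARGET_UPDATE == 0:
        target_q = copy.deepcopy(main_q)                  # periodic hard sync
```

Between syncs the target stays fixed even though the main table keeps changing, which is what damps the feedback loop. In PyTorch the sync step is commonly written as `target_net.load_state_dict(main_net.state_dict())`, where `target_net` and `main_net` are placeholder names for the two models.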
I considered the following possibilities.
As an experiment, I made the game end when an already open square is selected, and the error itself dropped below 1. Lowering ε (initial value 0.5 → 0.2) made the error even smaller (about 0.01 to 0.1). However, the problem with how mini-batches are built is still unsolved, so I will implement Prioritized Experience Replay. The code is used as-is.
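For readers unfamiliar with Prioritized Experience Replay, here is a simplified sketch of the proportional variant: transitions with a larger TD error are drawn more often when building a mini-batch. This is not the code used in this memo. The class and method names are hypothetical, and the priority exponent and importance-sampling weights of the full method are left out to keep it short.

```python
import random

class PrioritizedMemory:
    """Proportional prioritized replay (simplified sketch; names are hypothetical)."""
    def __init__(self, capacity, eps=1e-4):
        self.capacity = capacity
        self.eps = eps          # keeps every priority strictly positive
        self.transitions = []
        self.priorities = []
        self.index = 0

    def push(self, transition, td_error):
        priority = abs(td_error) + self.eps
        if len(self.transitions) < self.capacity:
            self.transitions.append(transition)
            self.priorities.append(priority)
        else:
            # Overwrite the oldest entry once the buffer is full.
            self.transitions[self.index] = transition
            self.priorities[self.index] = priority
        self.index = (self.index + 1) % self.capacity

    def sample(self, batch_size):
        # Draw indices with probability proportional to priority (with replacement).
        idxs = random.choices(range(len(self.transitions)),
                              weights=self.priorities, k=batch_size)
        return idxs, [self.transitions[i] for i in idxs]

    def update(self, idxs, td_errors):
        # Refresh priorities after the sampled transitions have been re-evaluated.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + self.eps

# Usage: the rare, high-error transition dominates the sampled batch.
memory = PrioritizedMemory(capacity=10000)
memory.push(("s0", 3, -100, None), td_error=50.0)
memory.push(("s1", 7, 1, "s2"), td_error=0.1)
idxs, batch = memory.sample(batch_size=4)
```

The point for Minesweeper is that wins and losses are rare compared with ordinary moves, so uniform sampling can leave them underrepresented in a mini-batch.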
I tried it, but it still didn't work. Well, there was a coding mistake (the target Q-network was still at its initial values...)
So I fixed that, and it still didn't work. Are the rewards too large?
reward
reward_failed = -1
reward_win = 1
reward_miss = -1
reward_open = 1
It was no good after all...
Until now, I've been training on a different board every episode. Can it learn a single fixed board...? → It can. It takes about 200 to 300 episodes for the win rate to reach 90%. It's cute how the win rate jumps to 90% as soon as it clears the board once.
Then what if the board changes once every 150 episodes? → It can't win at all. It seems to be dragged down by past training data.
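The three board schedules tried so far (a new board every episode, one fixed board, and a new board every 150 episodes) can be expressed with a seeded mine layout. This is only a sketch: the helper names and the use of a seed to pin the layout are assumptions, since the game code is not shown here.

```python
import random

BOARD_H, BOARD_W, NUM_MINES = 6, 6, 5  # board settings from the first experiment

def make_mines(seed):
    """Return the set of mine positions for a given seed (hypothetical helper)."""
    rng = random.Random(seed)
    squares = [(r, c) for r in range(BOARD_H) for c in range(BOARD_W)]
    return set(rng.sample(squares, NUM_MINES))

def board_seed(episode, interval):
    """Map an episode number to a board seed.

    interval=None keeps one fixed board, interval=1 changes the board every
    episode, and interval=150 changes it once every 150 episodes.
    """
    if interval is None:
        return 0
    return episode // interval

# Episodes 0..149 share one layout; episode 150 starts a new one.
mines_a = make_mines(board_seed(149, interval=150))
mines_b = make_mines(board_seed(150, interval=150))
```

Because the layout depends only on the seed, the same board can be replayed exactly, which makes the fixed-board and changing-board runs directly comparable.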
Well then, let's go back to changing the board every time!