While studying Python and reinforcement learning, I tried to learn a strategy for blackjack. There is a well-known probability-based strategy called the basic strategy, and the goal is to see how close reinforcement learning can get to it.
Here is how I will proceed: this time I use Q-Learning, one of the basic reinforcement learning algorithms.
The file structure is as follows. The reinforcement learning code created this time is `q-learning_blackjack.py`; the other files were created in "Register in OpenAI Gym environment".
```
├─ q-learning_blackjack.py
└─ myenv
    ├─ __init__.py          ---> Calls BlackJackEnv
    └─ env
        ├─ __init__.py      ---> Indicates where BlackJackEnv is located
        ├─ blackjack.py     ---> The blackjack game itself
        └─ blackjack_env.py ---> BlackJackEnv class that inherits OpenAI Gym's gym.Env
```
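For reference, the registration in `myenv/__init__.py` typically looks something like the sketch below (a minimal example; the `id` and `entry_point` are assumptions based on this file layout, and the actual code is in "Register in OpenAI Gym environment"):

```python
# myenv/__init__.py -- minimal sketch of registering the custom environment
# (id and entry_point are assumptions based on this article's file layout)
from gym.envs.registration import register

register(
    id='BlackJack-v0',
    entry_point='myenv.env:BlackJackEnv',
)
```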
`self.Q` is a table that holds the Q values and is updated as learning progresses; I will call it the Q table here.
For each **state** (player's points, dealer's points, whether the player holds an Ace, whether the player has already hit), the Q table stores the value of taking each of the player's actions: **Stand**, **Hit**, **Double Down**, and **Surrender**.
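For illustration, a single entry of the Q table might look like the following (a purely hypothetical example; the exact state encoding and the action ordering are assumptions about the environment):

```python
# key   = (player points, dealer points, player holds an Ace, player has already hit)
# value = [Stand, Hit, Double Down, Surrender]   <- assumed action order
self.Q[(16, 10, False, True)] = [-0.42, -0.38, -0.55, -0.50]
```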
The `policy` method selects an action with the ε-greedy method: with probability `epsilon` it picks an action at random, and with probability `1 - epsilon` it picks the action with the highest value in the Q table.
Agent class
```python
import numpy as np
import matplotlib.pyplot as plt


class Agent():

    def __init__(self, epsilon):
        self.Q = {}                # Q table
        self.epsilon = epsilon     # exploration rate for ε-greedy
        self.reward_log = []

    def policy(self, state, actions):
        # ε-greedy: explore with probability epsilon, otherwise exploit the Q table
        if np.random.random() < self.epsilon:
            return np.random.randint(len(actions))
        else:
            if state in self.Q and sum(self.Q[state]) != 0:
                return np.argmax(self.Q[state])
            else:
                return np.random.randint(len(actions))

    def init_log(self):
        self.reward_log = []

    def log(self, reward):
        self.reward_log.append(reward)

    def show_reward_log(self, interval=100, episode=-1):
        if episode > 0:
            # Print the mean and standard deviation of the latest rewards
            rewards = self.reward_log[-interval:]
            mean = np.round(np.mean(rewards), 3)
            std = np.round(np.std(rewards), 3)
            print("At Episode {} average reward is {} (+/-{}).".format(episode, mean, std))
        else:
            # Plot the reward history aggregated per interval
            indices = list(range(0, len(self.reward_log), interval))
            means = []
            stds = []
            for i in indices:
                rewards = self.reward_log[i:(i + interval)]
                means.append(np.mean(rewards))
                stds.append(np.std(rewards))
            means = np.array(means)
            stds = np.array(stds)
            plt.figure()
            plt.title("Reward History")
            plt.xlabel("episode")
            plt.ylabel("reward")
            plt.grid()
            plt.fill_between(indices, means - stds, means + stds, alpha=0.2, color="g")
            plt.plot(indices, means, "o-", color="g",
                     label="Rewards for each {} episode".format(interval))
            plt.legend(loc="best")
            plt.savefig("Reward_History.png")
            plt.show()
```
The `QLearningAgent` class inherits from the `Agent` class created above.
The `learn` method is where the learning happens; one episode corresponds to one blackjack game.
`a = self.policy(s, actions)` selects an action according to the current state, and `n_state, reward, done, info = env.step(a)` executes that action and observes the resulting state and reward. The `step` function is implemented as described in "Register in OpenAI Gym environment".
The following three lines of code correspond to the Q-Learning update rule:

$$
Q(s_t, a_t) \leftarrow (1-\alpha)\,Q(s_t, a_t) + \alpha\left(r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\right)
$$

Here $\gamma$ (`gamma`) is the discount rate, a parameter controlling how strongly future value is discounted, and $\alpha$ (`learning_rate`) is the learning rate, which controls how strongly each update moves the Q value.
Q-Learning formula
```python
gain = reward + gamma * max(self.Q[n_state])
estimated = self.Q[s][a]
self.Q[s][a] += learning_rate * (gain - estimated)
```
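The incremental form used in the code is algebraically identical to the update rule above, since

$$
Q(s, a) + \alpha\,(\mathrm{gain} - Q(s, a)) = (1-\alpha)\,Q(s, a) + \alpha\,\mathrm{gain}
$$

where `gain` is the estimated return $r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$.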
QLearningAgent class
```python
from collections import defaultdict


class QLearningAgent(Agent):

    def __init__(self, epsilon=0.1):
        super().__init__(epsilon)

    def learn(self, env, episode_count=1000, gamma=0.9,
              learning_rate=0.1, render=False, report_interval=5000):
        self.init_log()
        actions = list(range(env.action_space.n))
        # Q table: each state maps to a list of action values, initialized to 0
        self.Q = defaultdict(lambda: [0] * len(actions))
        for e in range(episode_count):
            s = env.reset()
            done = False
            reward_history = []
            while not done:
                if render:
                    env.render()
                a = self.policy(s, actions)
                n_state, reward, done, info = env.step(a)
                reward_history.append(reward)
                # Q-Learning update
                gain = reward + gamma * max(self.Q[n_state])
                estimated = self.Q[s][a]
                self.Q[s][a] += learning_rate * (gain - estimated)
                s = n_state
            else:
                # while-else: runs when the episode finishes without a break
                self.log(sum(reward_history))
            if e != 0 and e % report_interval == 0:
                self.show_reward_log(episode=e, interval=50)
        env.close()
```
The self-made blackjack environment is loaded with `env = gym.make('BlackJack-v0')`. For how to create and register it, please refer to "Blackjack implementation" and "Register in OpenAI Gym environment".
I created a `save_Q` method to save the Q table and a `show_reward_log` method to display the reward history.
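The body of `save_Q` is not shown above; a minimal sketch, assuming the Q table is simply pickled to a file (the file name and approach are my own assumptions), could be:

```python
import pickle

def save_Q(self, path="Q_table.pkl"):
    # Hypothetical implementation: persist the learned Q table to disk.
    # dict() drops the defaultdict's lambda default factory, which cannot be pickled.
    with open(path, "wb") as f:
        pickle.dump(dict(self.Q), f)
```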
train function and execution part
```python
import gym
import myenv  # assumed: importing myenv registers BlackJack-v0 with Gym

def train():
    agent = QLearningAgent()
    env = gym.make('BlackJack-v0')
    agent.learn(env, episode_count=50000, report_interval=1000)
    agent.save_Q()
    agent.show_reward_log(interval=500)

if __name__ == "__main__":
    train()
```
The learning results are as follows. The horizontal axis is the episode number and the vertical axis is the reward. The green line is the average reward over each 500 episodes, and the green band is the standard deviation over those episodes. The curve is almost flat after about 20,000 episodes, and it is a bit sad that the average reward is still below 0 even after 50,000 episodes of training.
Now let's compare the learned Q table with the basic strategy. For each state in the Q table, I extract the action that maximizes the Q value and build strategy tables for hard hands and soft hands in the same format as the basic strategy.
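A minimal sketch of this extraction, assuming the state tuple and the Stand/Hit/Double Down/Surrender action ordering described earlier (both are assumptions about the environment), might look like this:

```python
import numpy as np

ACTION_LABELS = ["S", "H", "D", "R"]  # assumed order: Stand, Hit, Double Down, Surrender

def extract_strategy(Q):
    # For each state, take the letter of the action with the highest Q value.
    return {state: ACTION_LABELS[int(np.argmax(q_values))]
            for state, q_values in Q.items()}
```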
In the comparison, the left column is the strategy learned by Q-Learning and the right column is the basic strategy. The upper row shows hard hands (no Ace in the hand) and the lower row shows soft hands (an Ace in the hand). The rows of each table represent the player's points and the columns the dealer's points; the letters indicate the action the player should take.
Split is not implemented in this self-made blackjack, so actions are assigned even for the hard hand of 4 points (2, 2) and the soft hand of 12 points (A, A), which would normally be split candidates.
Comparing the learned result with the basic strategy, the overall tendency is the same: Hit when the player's points are low and Stand when they are high. At least that much I wanted it to learn. Looking at the details, however, the learned strategy tends to Hit on a soft 19. A hit cannot bust a soft 19, but simply standing is already a strong position, so this is a spot it failed to learn well. I wonder why. There are also fewer Double Downs and more Surrenders, which suggests a tendency to avoid risk and try to minimize losses.
I then played 100 games × 1,000 trials with both the learned Q table and the basic strategy. Betting \$100 on each game, I calculated the average chips won per 100-game set, repeated 1,000 times. The histogram of the average winnings is shown below.
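The evaluation code is not shown in this article; a rough sketch of how such a simulation could be run against the learned Q table (the environment id, the greedy fallback action, and the assumption that rewards are expressed in units of the bet are all mine) is:

```python
import numpy as np
import gym
import myenv  # assumed: importing myenv registers BlackJack-v0 with Gym

def evaluate(agent, n_trials=1000, n_games=100, bet=100):
    # Hypothetical evaluation: greedy play over the learned Q table.
    env = gym.make('BlackJack-v0')
    averages = []
    for _ in range(n_trials):
        total = 0.0
        for _ in range(n_games):
            s = env.reset()
            done = False
            while not done:
                # Act greedily; fall back to action 0 for unseen states
                a = int(np.argmax(agent.Q[s])) if s in agent.Q else 0
                s, reward, done, info = env.step(a)
                total += reward * bet  # assumes reward is in units of the bet
        averages.append(total / n_games)
    env.close()
    return averages
```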
The further the distribution lies to the right, the better the result, and the basic strategy comes out ahead. The average winnings were \$-8.2 with the Q table and \$-3.2 with the basic strategy. The highest averages came from the basic strategy, but so did the lowest: the basic strategy has the wider distribution. The Q-table distribution is narrower, probably because of its fewer Double Downs and more Surrenders.
Over these three steps, I created my own blackjack environment and tried to learn a strategy through reinforcement learning. The result did not beat the basic strategy, but I was able to deepen my understanding of reinforcement learning and programming. There is still room to improve the learning. Since the environment is my own, I can also expose information such as the expected value of the cards remaining in the deck. It is a bit of a cheat, but I would like to experiment with it. (Of course, it could not be used at a casino...)