In this article, we will travel from Sapporo in Hokkaido to Naha in Okinawa, using reinforcement learning.
This trip through Japan has two goals: Goal 1 is to reach the destination, Naha, starting from Sapporo; Goal 2 is to find the shortest route that does so. The rules are that we may only move between adjacent prefectural capitals (cities directly connected in the data below), and that visiting the same city twice ends the game.
Yes, that's right: the shortest route can easily be found with a graph search method. But here, the purpose is to start learning about reinforcement learning.
As in "Basics of Graph Theory" and "Basics of Graph Theory with matplotlib animation", we use the following data. First, download the latitude/longitude data of the prefectural capital locations.
import urllib.request
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/location.txt'
urllib.request.urlretrieve(url, 'location.txt') #Download data
('location.txt', <http.client.HTTPMessage at 0x110e78ba8>)
Load the downloaded data.
import pandas as pd
location = pd.read_csv('location.txt').values
pd.DataFrame(location)
|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Sapporo | 43.0642 | 141.347 |
| 1 | Aomori | 40.8244 | 140.74 |
| 2 | Morioka | 39.7036 | 141.153 |
| 3 | Sendai | 38.2689 | 140.872 |
| ... | ... | ... | ... |
| 44 | Miyazaki | 31.9111 | 131.424 |
| 45 | Kagoshima | 31.5603 | 130.558 |
| 46 | Naha | 26.2125 | 127.681 |

47 rows × 3 columns
Let's visualize the loaded data.
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(11, 9))
plt.scatter(location[:, 2], location[:, 1])
for city, y, x in location:
plt.text(x, y, city, alpha=0.5, size=12)
plt.grid()
Using Google Maps, I looked up how many hours it takes to walk between neighboring prefectural capitals (taking a ferry where walking is impossible). Download this data as well.
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/walk.txt'
urllib.request.urlretrieve(url, 'walk.txt')
('walk.txt', <http.client.HTTPMessage at 0x1220b8a58>)
Load the downloaded data.
walk = pd.read_csv('walk.txt', sep='\s+')
walk
|  | Town1 | Town2 | Hour | Ferry |
|---|---|---|---|---|
| 0 | Sapporo | Aomori | 55 | True |
| 1 | Akita | Aomori | 36 | NaN |
| 2 | Akita | Sendai | 45 | NaN |
| 3 | Akita | Niigata | 52 | NaN |
| ... | ... | ... | ... | ... |
| 96 | Nagasaki | Kumamoto | 18 | NaN |
| 97 | Nagasaki | Saga | 20 | NaN |
| 98 | Kagoshima | Naha | 26 | NaN |

99 rows × 4 columns
Hmm? I just noticed that the Kagoshima-Naha section is not marked as a ferry... well, it doesn't affect this analysis. Let's visualize the relationships between cities obtained here.
import numpy as np

plt.figure(figsize=(11, 9))
for city, y, x in location:
plt.text(x, y, city, alpha=0.4, size=8)
for w in walk.values:
t1 = np.where(location[:, 0] == w[0])[0][0]
t2 = np.where(location[:, 0] == w[1])[0][0]
n1, y1, x1 = location[t1]
n2, y2, x2 = location[t2]
plt.plot([x1, x2], [y1, y2])
plt.scatter(location[:, 2], location[:, 1])
plt.grid()
Each city where a prefectural office is located is called a "vertex", a line connecting two cities is called an "edge", and two vertices directly connected by an edge are said to be "adjacent". We have data on walking time between cities, but for simplicity the subsequent analysis uses only the "edge" information, not the walking-time data.
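As a quick check in these terms, the data loaded above can be summarized as follows (a minimal sketch; it only counts the rows of the two tables):

print(len(location), 'vertices (cities)')  # 47
print(len(walk), 'edges (walkable or ferry connections)')  # 99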
Let's compute the distance matrix between the prefectural capitals from their latitudes and longitudes.
import numpy as np
from scipy.spatial import distance
dist_mat = distance.cdist(location[:, 1:], location[:, 1:], metric='euclidean') #Euclidean distance
Create a function to visualize the matrix with a color map.
import pandas as pd
def visualize_matrix(matrix):
plt.figure(figsize=(12, 10))
plt.imshow(matrix, interpolation='nearest', cmap=plt.cm.coolwarm)
plt.colorbar()
tick_marks = np.arange(len(matrix))
plt.xticks(tick_marks, matrix.columns, rotation=90)
plt.yticks(tick_marks, matrix.columns)
plt.tight_layout()
Visualizing the obtained distance matrix:
visualize_matrix(pd.DataFrame(dist_mat, columns=location[:, 0]))
This distance matrix is used for two purposes in this analysis: to measure the straight-line distance from the current city to the goal, and to measure the total length of the route traveled so far.
In reinforcement learning, the learning "brain" is called the agent. The agent knows which actions it can take, and adjusts the weights used to choose among those actions through learning.
The variable theta (θ) represents the actions that can be taken in each state; in graph-theory terms, it encodes the edges (connections between vertices). Let's compute it.
import numpy as np
theta_zero = np.full((len(location), len(location)), np.nan)
for w in walk.values:
i = np.where(location[:, 0] == w[0])[0][0]
j = np.where(location[:, 0] == w[1])[0][0]
theta_zero[i, j] = 1
theta_zero[j, i] = 1
With the above, `theta_zero[i]` gives the cities adjacent to city $ i $.
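For example (an illustrative check; index 12 is Tokyo, as used later in this article), the adjacent cities can be listed like this:

i = 12  # Tokyo
adjacent = [location[j, 0] for j in np.where(theta_zero[i] == 1)[0]]
print(adjacent)  # names of the cities directly connected to Tokyo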
The variable pi (π) represents the probability of each action being taken; in graph-theory terms, it corresponds to edge weights. In reinforcement learning, this `pi` is called the policy, and it is what we optimize.
First, as the initial value, consider the case where one of the available options is selected at random with equal probability.
def normalize_pi(theta):
[m, n] = theta.shape
pi = np.zeros((m, n))
for i in range(0, m):
pi[i, :] = theta[i, :] / np.nansum(theta[i, :])
return np.nan_to_num(pi)
In this way we obtain `pi_zero`, the initial value of `pi`.
pi_zero = normalize_pi(theta_zero)
Let's visualize `pi_zero`.
visualize_matrix(pd.DataFrame(pi_zero, columns=location[:, 0]))
Note that `pi_zero` is not a symmetric matrix. For example, there is an edge from Sapporo to Aomori and an edge from Aomori to Sapporo, but their weights are not the same: in this graph there is only one option for the next city from Sapporo, while there are several options from Aomori.
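A quick way to see this asymmetry (illustrative; indices 0 and 1 are Sapporo and Aomori in the location table):

print(pi_zero[0, 1])  # Sapporo -> Aomori: Sapporo's only option, so 1.0
print(pi_zero[1, 0])  # Aomori -> Sapporo: one of several options, so less than 1.0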
Reinforcement learning treats "actions" and "states" separately: selecting an action in one state causes a transition to another state.
The example covered in this article is a special case where actions and states need not be distinguished. The "state" is the current city, the "action" is the choice of the next city, and that next city becomes the next state. The code is simpler because actions and states do not have to be told apart.
First, let's attempt the trip through Japan with the policy `pi` still equal to `pi_zero`. This is a random walk: at each city, the next city is chosen at random.
Create a function that chooses the next city according to the probabilities given by policy `pi`.
def get_next(town, pi):
return np.random.choice(range(len(pi)), p=pi[town, :])
For example, according to `pi_zero`, the next city after Tokyo (No. 12) is calculated as follows.
get_next(12, pi_zero)
13
Of course, the result will change with each run because it is decided randomly.
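For instance, sampling a few times in a row shows the variation (illustrative):

print([get_next(12, pi_zero) for _ in range(5)])  # five independent draws of "the next city after Tokyo"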
Now, let's search by random walk.
def explore(pi, start=0, goal=46): #Start is 0 (Sapporo), goal is 46 (Naha)
route = [start] #List to put travel history
town = start
while True:
next_town = get_next(town, pi) #Determine the next city
if next_town in route: #If you visit the same city twice
break #Game over
route.append(next_town) #Add the next city to the history
if next_town == goal: #Arrive at the goal
break #Congratulations
else: #Not a goal yet
town = next_town #Set "next city" as "current position"
return route
Start a random walk according to pi_zero
route = explore(pi_zero)
route
[0, 1, 2, 3, 4, 5]
It's random, so of course the result changes with each run. If the city numbers are hard to interpret, converting them to city names makes the route easier to picture:
[location[i, 0] for i in route]
['Sapporo', 'Aomori', 'Morioka', 'Sendai', 'Akita', 'Yamagata']
In this example, the walk went from Sapporo to Yamagata and then the game ended because an already-visited city was chosen as the next destination.
By the way, can we really get from Sapporo to Naha with such a random walk?
Let's create a way to evaluate the quality of a search result: the straight-line distance from the current location to the destination (Naha), and the total distance traveled from the starting point (Sapporo) to the current location.
def evaluate(route, goal=46):
dist_goal = dist_mat[route[-1], goal]
len_route = sum([dist_mat[route[i], route[i + 1]] for i in range(len(route) - 1)])
return [dist_goal, len_route]
`dist_goal` is the straight-line distance from the current location to the destination (Naha), and `len_route` is the total distance traveled from the starting point (Sapporo) to the current location.
Here is an example run.
route = explore(pi_zero)
[location[i, 0] for i in route], evaluate(route)
(['Sapporo', 'Aomori'], [19.597025248636594, 2.3205099949149073])
In this example, the game ended because the walk went from Sapporo to Aomori and then tried to return from Aomori to Sapporo. The distance from the current location (Aomori) to the destination (Naha) is 19.597025248636594, and the distance traveled from the starting point (Sapporo) to the current location (Aomori) is 2.3205099949149073.
The goal of this trip through Japan is first to find routes where `dist_goal` is zero, and then, among those, the one with the smallest `len_route`. A short `len_route` by itself is not worth anything unless the goal has been reached.
Since we will play this game of moving from Sapporo to Naha many times, let's create a function that judges whether a given result is the best one so far.
best_dist_goal = 1000000 #Initialize an impossibly large value
best_len_route = 1000000 #Initialize an impossibly large value
def is_best_ever(): # relies on the globals best_dist_goal, best_len_route, dist_goal, len_route
    if best_dist_goal >= dist_goal: # no worse than the best dist_goal so far
        if best_dist_goal > dist_goal: # strictly better dist_goal than the best so far
            return True # best ever
        elif best_len_route > len_route: # same dist_goal, but shorter len_route than the best so far
            return True # best ever
        else:
            return False # not the best
    else:
        return False # not the best
Now that we're ready, we'll repeat the random walk search 50,000 times.
%%time #Measure execution time
pi = pi_zero.copy() #Initialization of pi
theta = theta_zero.copy() #initialization of theta
best_dist_goal = 1000000 #Initialize an impossibly large value
best_len_route = 1000000 #Initialize an impossibly large value
dist_goal_history0 = [] # dist_history of goal
len_route_history0 = [] # len_route history
best_route0 = [] #Best route
for itera in range(50000): #Repeat 50,000 times
route = explore(pi) #search
dist_goal, len_route = evaluate(route) #Evaluation
dist_goal_history0.append(dist_goal) #Record
len_route_history0.append(len_route) #Record
if is_best_ever(): #If the best of the past
best_dist_goal = dist_goal #Best dist_goal
best_len_route = len_route #Best len_route
best_route0 = route #Best route
CPU times: user 10 s, sys: 118 ms, total: 10.2 s
Wall time: 11.8 s
Create functions to visualize the results. First, a function that plots the distributions of the obtained `dist_goal` and `len_route` values.
def draw_histgrams(dist_goal_history, len_route_history):
plt.figure(figsize=(max(len_route_history) / 4, max(dist_goal_history) / 4))
x_max = max(dist_goal_history + len_route_history)
ax = plt.subplot(2,1,1)
ax.hist(dist_goal_history, label='dist_goal', bins=20)
ax.set_xlim([0, x_max])
ax.grid()
ax.legend()
ax = plt.subplot(2,1,2)
ax.hist(len_route_history, label='len_route', bins=20)
ax.set_xlim([0, x_max])
ax.grid()
ax.legend()
draw_histgrams(dist_goal_history0, len_route_history0)
Starting from Sapporo, it's hard to reach Naha. If you're lucky, you may get there. You can see that there are many cases where the game is over at the beginning of the search.
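To put a number on "if you're lucky" (an illustrative check; the same expression works for the later experiments as well), we can compute the fraction of the 50,000 random walks that actually ended at Naha:

print(np.mean(np.array(dist_goal_history0) == 0))  # fraction of searches with dist_goal == 0, i.e. that reached Naha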
Next, let's look at the relationship between `dist_goal` and `len_route`.
def draw_scatter(dist_goal_history, len_route_history):
plt.figure(figsize=(max(len_route_history) / 4, max(dist_goal_history) / 4))
plt.scatter(len_route_history, dist_goal_history, alpha=0.5)
plt.ylabel('dist_goal')
plt.xlabel('len_route')
plt.ylim([0, max(dist_goal_history) + 1])
plt.xlim([0, max(len_route_history) + 1])
plt.grid()
plt.show()
draw_scatter(dist_goal_history0, len_route_history0)
You can see that Naha was never reached, and that even when the game ends at the same city, the distance traveled up to that point varies (different routes were taken).
Next, let's see how `dist_goal` and `len_route` changed over the course of the search.
def visualize_history(history, interval=50, window=50):
plt.plot(range(0, len(history), interval), #Overall maximum
[np.array(history)[:i].max() for i in range(len(history)) if (i%interval) == 1], label='max')
plt.plot(range(window, len(history)+window, interval),
[np.array(history)[i:i + window].mean() for i in range(len(history)) if (i%interval) == 1],
label='mean(recent)') #Average of recent interval times
plt.plot(range(0, len(history), interval),
[np.array(history)[:i].mean() for i in range(len(history)) if (i%interval) == 1], label='mean') #Overall average
plt.plot(range(0, len(history), interval),
[np.array(history)[:i].min() for i in range(len(history)) if (i%interval) == 1], label='min') #Overall minimum
plt.legend()
plt.grid()
plt.show()
visualize_history(dist_goal_history0)
`dist_goal` occasionally sets a new minimum, but it never reaches Naha, and no matter how many times the search is repeated the average does not improve. That is only natural, since nothing is being learned.
visualize_history(len_route_history0)
Similarly, `len_route` occasionally sets a new maximum, but Naha is never reached, and the average does not improve no matter how many times the search is repeated. Again, that is only natural, since nothing is being learned.
Finally, let's display the best route found in this random walk.
def draw_route(route):
plt.figure(figsize=(11, 9))
for i in route:
plt.text(location[i, 2], location[i, 1], location[i, 0], alpha=0.4, size=12)
plt.grid()
plt.plot([location[i][2] for i in route], [location[i][1] for i in route])
plt.ylabel('latitude')
plt.xlabel('longitude')
plt.show()
draw_route(best_route0)
Despite all that effort, the best route only reached Saga. With luck a random walk may occasionally reach Naha, but nothing is ever learned.
Now it's time to start reinforcement learning. Reinforcement learning methods can be broadly divided into policy gradient methods and value iteration methods. Let's start with the policy gradient method.
A route obtained by the search can be evaluated by how close its end point is to the goal. We update policy `pi` so that the closer a route ends to the goal, the more likely the edges on that route are to be picked in future explorations.
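Written as a formula (my reading of the update implemented below), for every consecutive pair of cities $ (i, j) $ on the explored route:

$ \pi_{ij} \leftarrow \pi_{ij} + \dfrac{\eta}{d(\mathrm{end}, \mathrm{goal}) + 1} $

where $ d(\mathrm{end}, \mathrm{goal}) $ is the straight-line distance from the route's end point to Naha; each row of $ \pi $ is then renormalized so that it sums to 1.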
def update_pi(pi, route, goal=46, eta=0.1): #Eta (η,eta) is the learning rate
new_pi = pi.copy() #Copy numpy array
for i in range(len(route) - 1): #For each side on the route
town1 = route[i] #City of origin of side i
town2 = route[i + 1] #City at the end of side i
new_pi[town1, town2] += eta / (dist_mat[route[-1], goal] + 1)
#The closer the end point of the route is to the goal, the higher the score
return normalize_pi(new_pi) #Update pi
`normalize_pi` is applied at the end so that each row again sums to 1.0.
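A quick sanity check (illustrative; `[0, 1, 2]` is just a hypothetical short route Sapporo → Aomori → Morioka):

pi_test = update_pi(pi_zero, [0, 1, 2])
print(pi_test.sum(axis=1)[:5])  # each row with outgoing edges should sum to (approximately) 1.0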
Now let's start the search.
%%time
pi = pi_zero.copy()
best_dist_goal = 1000000
best_len_route = 1000000
dist_goal_history1 = []
len_route_history1 = []
best_route1 = []
for itera in range(50000):
route = explore(pi)
dist_goal, len_route = evaluate(route)
dist_goal_history1.append(dist_goal)
len_route_history1.append(len_route)
pi = update_pi(pi, route)
if is_best_ever():
best_dist_goal = dist_goal
best_len_route = len_route
best_route1 = route
CPU times: user 59.9 s, sys: 340 ms, total: 1min
Wall time: 1min 1s
Below is an example of the results. They vary from run to run, and in some cases the destination Naha may not be reached even after 50,000 learning iterations.
The resulting distributions of `dist_goal` and `len_route` are very different from those of the random walk. The number of arrivals at the destination Naha has increased overwhelmingly.
draw_histgrams(dist_goal_history1, len_route_history1)
The relationship between `dist_goal` and `len_route` is as follows.
draw_scatter(dist_goal_history1, len_route_history1)
The history of `dist_goal`. The destination Naha was reached much earlier than with the random walk, and after that it became easier and easier to reach.
visualize_history(dist_goal_history1)
The first 5,000 iterations of the `dist_goal` history.
visualize_history(dist_goal_history1[:5000])
Similarly, the history of `len_route`.
visualize_history(len_route_history1)
The first 5,000 iterations of the `len_route` history.
visualize_history(len_route_history1[:5000])
Here is the best route found. You can see that it is very close to the shortest route, with one regrettable detour: it goes from Fukuoka to Kumamoto via Saga, when it should have gone directly from Fukuoka to Kumamoto.
draw_route(best_route1)
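To read the best route as city names, the same conversion as earlier can be used (illustrative):

print([location[i, 0] for i in best_route1])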
Save the resulting policy `pi` under the name `pi_pg`.
pi_pg = pi.copy()
Let's visualize the policy `pi_pg`.
visualize_matrix(pd.DataFrame(pi_pg, columns=location[:, 0]))
The critically important edges show up in red. For example, once you arrive in Kagoshima, the only sensible next city is Naha; choosing anything else (Kumamoto or Miyazaki) is pointless. The learned policy represents that clearly.
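We can check this directly (illustrative; index 45 is Kagoshima in the location table):

kagoshima = 45
for j in np.where(theta_zero[kagoshima] == 1)[0]:
    print(location[j, 0], round(pi_pg[kagoshima, j], 3))  # learned probability of each next city from Kagoshima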
You can check how it has changed compared to the initial value in this way.
visualize_matrix(pd.DataFrame(pi_pg - pi_zero, columns=location[:, 0]))
The places that aren't important haven't changed much. For example, the Shikoku and Kanto regions do not seem to be very important.
In contrast to the policy gradient method above, the epsilon-greedy method introduced here and the Q-learning introduced next are called value iteration methods.
In value iteration methods, the quantity corresponding to the policy is called "Q" rather than "π". Whereas `pi` started from perfectly equal probabilities, `Q` starts from unequal random values.
def generate_Q(theta): #Generate unequal random probabilities from the choices you can take
[m, n] = theta.shape
Q = np.zeros((m, n))
rand = np.random.rand(m, n)
for i in range(m):
for j in range(n):
if theta[i, j] == 1:
Q[i, j] = rand[i, j]
return normalize_pi(Q)
Generate the initial value `Q_zero` of the non-uniform policy `Q`.
Q_zero = generate_Q(theta_zero)
Visualize the generated `Q_zero`.
visualize_matrix(pd.DataFrame(Q_zero, columns=location[:, 0]))
This is the difference between `Q_zero` and `pi_zero`.
visualize_matrix(pd.DataFrame(Q_zero - pi_zero, columns=location[:, 0]))
In the policy gradient method, `get_next` selected the next action at random according to the probabilities given by policy `pi`. For the value iteration methods, `get_next` is rewritten as follows:
def get_next(town, Q, epsilon=0.1):
if np.random.rand() < epsilon:
return np.random.choice(range(len(Q)), p=Q[town, :])
else:
return np.nanargmax(Q[town, :])
In other words, a new parameter `epsilon` is introduced: with probability `epsilon` the action is chosen at random according to the probabilities in `Q`, and otherwise the action with the maximum `Q` value is chosen. Initially `epsilon` is set to a large value, so random choices are frequent; it is then gradually reduced so that random choices become rarer.
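To illustrate the kind of schedule meant here (an assumption matching the search loop used later in this article: start at 0.5 and multiply by 0.99 each iteration):

eps = 0.5
print([round(eps * 0.99 ** k, 4) for k in (0, 100, 1000, 5000)])  # epsilon after 0, 100, 1000, 5000 iterations

Since this decays essentially to zero, the `explore_epsilon_greedy` function defined below clamps it with `min_epsilon`.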
We also introduce the concept of reward: reaching the goal earns +1 point, failing (visiting the same city twice, which ends the game) earns -1 point, and any other move earns 0 points.
In the method below, called `sarsa`, when the `Q` value of an action (an edge on the route) is updated, the `Q` value of the next action (the next edge on the route) is also taken into account, multiplied by the time discount rate `gamma`.
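For reference, this corresponds to the standard SARSA update rule (the code below additionally renormalizes each row with `normalize_pi`), where the state $ s $ is the previous city, the action $ a $ is the current city, and the next action $ a' $ is the next city:

$ Q(s, a) \leftarrow Q(s, a) + \eta \, \bigl( r + \gamma \, Q(s', a') - Q(s, a) \bigr) $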
def sarsa(town, Q, prev_t, next_t, reward, eta=0.1, gamma=0.9, goal=46):
if reward == 1: #dist_mat[town, goal] == 0:
Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town])
elif reward >= 0:
Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town] + gamma * Q[town, next_t])
else:
Q[prev_t, town] = Q[prev_t, town] - eta * Q[prev_t, town]
return normalize_pi(Q)
In this way, the reward earned at the destination propagates back and influences the choice of actions leading toward it. Likewise, the negative reward at game over discourages the past actions that led to it.
def explore_epsilon_greedy(Q, epsilon=0.1, eta=0.1, gamma=0.9, start=0, goal=46, min_epsilon=0.01):
epsilon = max(epsilon, min_epsilon)
route = [start]
town = get_next(start, Q, epsilon)
prev_t = start
while True:
if town in route:
reward = -1
Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
break
elif town == goal:
reward = 1
route.append(town)
Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
break
else:
reward = 0
route.append(town)
next_t = get_next(town, Q, epsilon)
Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
prev_t = town
town = next_t
return [route, Q]
Now let's start the search.
%%time
eta = 0.1 #Learning rate
gamma = 0.9 #Time discount rate
epsilon = 0.5
Q = Q_zero.copy()
best_dist_goal = 1000000
best_len_route = 1000000
best_route2 = []
dist_goal_history2 = []
len_route_history2 = []
for itera in range(50000):
epsilon = epsilon * 0.99
route, Q = explore_epsilon_greedy(Q, epsilon, eta, gamma)
dist_goal, len_route = evaluate(route)
dist_goal_history2.append(dist_goal)
len_route_history2.append(len_route)
if is_best_ever():
best_dist_goal = dist_goal
best_len_route = len_route
best_route2 = route
CPU times: user 7min 50s, sys: 948 ms, total: 7min 50s
Wall time: 7min 53s
This is an example of the search results by the Epsilon-Greedy method. In fact, the results aren't very stable and can change quite a bit each time you run them (try it out).
First, the distributions of `dist_goal` and `len_route`:
draw_histgrams(dist_goal_history2, len_route_history2)
The relationship between `dist_goal` and `len_route`:
draw_scatter(dist_goal_history2, len_route_history2)
The history of `dist_goal`. Increasing the learning rate `eta` makes convergence faster but increases the likelihood of falling into a local solution; decreasing it makes convergence slower but local solutions less likely.
visualize_history(dist_goal_history2)
The first 5,000 iterations of the `dist_goal` history:
visualize_history(dist_goal_history2[:5000])
History of len_route
visualize_history(len_route_history2)
The first 5,000 iterations of the `len_route` history:
visualize_history(len_route_history2[:5000])
The shortest path obtained by the epsilon-greedy method. Again, the result differs somewhat from the true shortest path; compare it with the result obtained by the policy gradient method.
draw_route(best_route2)
In the policy gradient method, the distance to the destination Naha was used to update the `pi` values, so the learning worked toward both "Goal 1: reach the goal" and "Goal 2: find the shortest route" at the same time. As a result, no matter how many times you run it, you get a result close to the shortest path.
With this epsilon-greedy method, the reward only reflects whether the goal was reached and whether the game ended (a city was visited twice). The agent therefore learns "Goal 1: reach the goal" but not "Goal 2: find the shortest route". A result close to the shortest path may happen to appear, but in practice the learning often converges to a route far from the shortest one (try it).
Think about how the epsilon-greedy method could be made to learn "Goal 2: find the shortest route" as well (this is not covered here).
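One possible direction, shown only as a hypothetical sketch and not something implemented in this article: when the goal is reached, give a reward that grows as the route gets shorter, instead of a flat +1 (the function name and the `scale` parameter are my own assumptions).

def length_shaped_reward(route, scale=100.0):
    # Hypothetical reward: larger for shorter successful routes; intermediate steps stay at 0
    dist_goal, len_route = evaluate(route)
    if dist_goal == 0:                    # the route ends at Naha
        return scale / (len_route + 1.0)  # shorter routes earn a larger reward
    return 0.0

Plugging something like this into `explore_epsilon_greedy` in place of the fixed reward of 1 would let the reward itself carry information about route length.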
Save the `Q` values after training with the epsilon-greedy method as `Q_eg`.
Q_eg = Q.copy()
Its value is as follows:
visualize_matrix(pd.DataFrame(Q_eg, columns=location[:, 0]))
Let's compare it with the `pi_pg` obtained by the policy gradient method.
"Q-learning" </ b> is famous as another method of value iterative method. It's basically similar to the Epsilon-Greedy method, except that it doesn't include the randomness that occurs when choosing the "next action". It is said that the convergence will be faster by that amount.
However, in my experience running the code below several times, convergence is sometimes faster but not always, and it often ends up in a local solution where the destination Naha is never reached.
def Q_learning(town, Q, prev_t, reward, eta=0.1, gamma=0.9, goal=46):
if reward == 1: #dist_mat[town, goal] == 0:
Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town])
elif reward >= 0:
Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town] + gamma * np.nanmax(Q[town, :]))
else:
Q[prev_t, town] = Q[prev_t, town] - eta * Q[prev_t, town]
return normalize_pi(Q)
def explore_Q_learning(Q, epsilon=0.1, eta=0.1, gamma=0.9, start=0, goal=46):
prev_t = start
route = [start]
town = get_next(start, Q, epsilon)
while True:
if town in route:
reward = -1
Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
break
elif town == goal:
reward = 1
route.append(town)
Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
break
else:
reward = 0
dist_goal, len_route = evaluate(route)
if best_dist_goal > dist_goal:
reward = 1
route.append(town)
next_t = get_next(town, Q, epsilon)
Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
prev_t = town
town = next_t
return [route, Q]
%%time
eta = 0.1 #Learning rate
gamma = 0.9 #Time discount rate
epsilon = 0.5
Q = Q_zero.copy()
best_dist_goal = 1000000
best_len_route = 1000000
best_route3 = []
dist_goal_history3 = []
len_route_history3 = []
for itera in range(50000):
epsilon = epsilon * 0.99
route, Q = explore_Q_learning(Q, epsilon, eta, gamma)
dist_goal, len_route = evaluate(route)
dist_goal_history3.append(dist_goal)
len_route_history3.append(len_route)
if is_best_ever():
best_dist_goal = dist_goal
best_len_route = len_route
best_route3 = route
CPU times: user 9min 50s, sys: 1.41 s, total: 9min 52s
Wall time: 9min 54s
This is an example of search results by Q-learning. As with the Epsilon-Greedy method, the results are not very stable and can vary considerably from run to run (try it).
The distributions of `dist_goal` and `len_route`:
draw_histgrams(dist_goal_history3, len_route_history3)
The relationship between `dist_goal` and `len_route`:
draw_scatter(dist_goal_history3, len_route_history3)
History of dist_goal
visualize_history(dist_goal_history3)
The first 5,000 iterations of the `dist_goal` history:
visualize_history(dist_goal_history3[:5000])
History of len_route
visualize_history(len_route_history3)
The first 5,000 iterations of the `len_route` history:
visualize_history(len_route_history3[:5000])
Shortest path obtained by Q-learning
draw_route(best_route3)
As this example shows, the result is not the true shortest path. The reason, as discussed for the epsilon-greedy method, lies in the reward design: the agent learns "Goal 1: reach the goal" but not "Goal 2: find the shortest route".
Also, as noted above, in my experience Q-learning may converge faster, but in many cases it fails to reach the destination Naha even after 50,000 iterations, and the "shortest path obtained by Q-learning" tends to be longer than the one obtained by the epsilon-greedy method. I suspect this is because the randomness around the next action is suppressed: once a route that happens to reach the goal has been found, even if it is long, there are fewer opportunities to change it than with the epsilon-greedy method.
Q_qlearn = Q.copy()
visualize_matrix(pd.DataFrame(Q_qlearn, columns=location[:, 0]))
And that's all for reinforcement learning. Okinawa really is far away.
There is also deep reinforcement learning, which combines this with deep learning, but that is a topic for another time. Ciao.