In this article, we will travel from Sapporo in Hokkaido to Naha in Okinawa, using reinforcement learning.
This trip through Japan has two goals: Goal 1 is to reach the destination, Naha, starting from Sapporo; Goal 2 is to find the shortest route that does so. The rules are that we may only move between adjacent prefectural capitals (cities directly connected in the data below), and that visiting the same city twice ends the game.
Yes, that's right: the shortest route can easily be found with a graph search method. But here, the purpose is to start learning about reinforcement learning.
As in "Basics of Graph Theory" and "Basics of Graph Theory with matplotlib animation", we use the following data. First, download the latitude/longitude data of the prefectural capital locations.
import urllib.request
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/location.txt'
urllib.request.urlretrieve(url, 'location.txt') #Download data
('location.txt', <http.client.HTTPMessage at 0x110e78ba8>)
Load the downloaded data.
import pandas as pd
location = pd.read_csv('location.txt').values
pd.DataFrame(location)
|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Sapporo | 43.0642 | 141.347 |
| 1 | Aomori | 40.8244 | 140.74 |
| 2 | Morioka | 39.7036 | 141.153 |
| 3 | Sendai | 38.2689 | 140.872 |
| ... | ... | ... | ... |
| 44 | Miyazaki | 31.9111 | 131.424 |
| 45 | Kagoshima | 31.5603 | 130.558 |
| 46 | Naha | 26.2125 | 127.681 |

47 rows × 3 columns
Let's visualize the loaded data.
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(11, 9))
plt.scatter(location[:, 2], location[:, 1])
for city, y, x in location:
plt.text(x, y, city, alpha=0.5, size=12)
plt.grid()
Using Google Maps, I looked up how many hours it takes to walk between neighboring prefectural capitals (taking a ferry where walking is impossible). Download this data as well.
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/walk.txt'
urllib.request.urlretrieve(url, 'walk.txt')
('walk.txt', <http.client.HTTPMessage at 0x1220b8a58>)
Load the downloaded data.
walk = pd.read_csv('walk.txt', sep='\s+')
walk
|  | Town1 | Town2 | Hour | Ferry |
|---|---|---|---|---|
| 0 | Sapporo | Aomori | 55 | True |
| 1 | Akita | Aomori | 36 | NaN |
| 2 | Akita | Sendai | 45 | NaN |
| 3 | Akita | Niigata | 52 | NaN |
| ... | ... | ... | ... | ... |
| 96 | Nagasaki | Kumamoto | 18 | NaN |
| 97 | Nagasaki | Saga | 20 | NaN |
| 98 | Kagoshima | Naha | 26 | NaN |

99 rows × 4 columns
Hmm? I just noticed that the Kagoshima-Naha section is not marked as a ferry... well, it doesn't affect this analysis. Let's visualize the relationships between cities obtained here.
import numpy as np

plt.figure(figsize=(11, 9))
for city, y, x in location:
plt.text(x, y, city, alpha=0.4, size=8)
for w in walk.values:
t1 = np.where(location[:, 0] == w[0])[0][0]
t2 = np.where(location[:, 0] == w[1])[0][0]
n1, y1, x1 = location[t1]
n2, y2, x2 = location[t2]
plt.plot([x1, x2], [y1, y2])
plt.scatter(location[:, 2], location[:, 1])
plt.grid()
Each city where a prefectural office is located is called a "vertex", a line connecting two cities is called an "edge", and two vertices directly connected by an edge are said to be "adjacent". We have data on walking time between cities, but for simplicity the subsequent analysis uses only the "edge" information, not the walking-time data.
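As a quick check in these terms, the data loaded above can be summarized as follows (a minimal sketch; it only counts the rows of the two tables):

print(len(location), 'vertices (cities)')  # 47
print(len(walk), 'edges (walkable or ferry connections)')  # 99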
Let's compute the distance matrix between the prefectural capitals from their latitudes and longitudes.
import numpy as np
from scipy.spatial import distance
dist_mat = distance.cdist(location[:, 1:], location[:, 1:], metric='euclidean') #Euclidean distance
Create a function to visualize the matrix with a color map.
import pandas as pd
def visualize_matrix(matrix):
plt.figure(figsize=(12, 10))
plt.imshow(matrix, interpolation='nearest', cmap=plt.cm.coolwarm)
plt.colorbar()
tick_marks = np.arange(len(matrix))
plt.xticks(tick_marks, matrix.columns, rotation=90)
plt.yticks(tick_marks, matrix.columns)
plt.tight_layout()
Visualizing the obtained distance matrix:
visualize_matrix(pd.DataFrame(dist_mat, columns=location[:, 0]))
This distance matrix is used for two purposes in this analysis: to measure the straight-line distance from the current city to the goal, and to measure the total length of the route traveled so far.
In reinforcement learning, the learning "brain" is called the agent. The agent knows which actions it can take, and adjusts the weights used to choose among those actions through learning.
The variable theta (θ) represents the actions that can be taken in each state; in graph-theory terms, it encodes the edges (connections between vertices). Let's compute it.
import numpy as np
theta_zero = np.full((len(location), len(location)), np.nan)
for w in walk.values:
i = np.where(location[:, 0] == w[0])[0][0]
j = np.where(location[:, 0] == w[1])[0][0]
theta_zero[i, j] = 1
theta_zero[j, i] = 1
With the above, `theta_zero[i]` gives the cities adjacent to city $ i $.
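For example (an illustrative check; index 12 is Tokyo, as used later in this article), the adjacent cities can be listed like this:

i = 12  # Tokyo
adjacent = [location[j, 0] for j in np.where(theta_zero[i] == 1)[0]]
print(adjacent)  # names of the cities directly connected to Tokyo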
The variable pi (π) represents the probability of each action being taken; in graph-theory terms, it corresponds to edge weights. In reinforcement learning, this `pi` is called the policy, and it is what we optimize.
First, as the initial value, consider the case where one of the available options is selected at random with equal probability.
def normalize_pi(theta):
[m, n] = theta.shape
pi = np.zeros((m, n))
for i in range(0, m):
pi[i, :] = theta[i, :] / np.nansum(theta[i, :])
return np.nan_to_num(pi)
In this way we obtain `pi_zero`, the initial value of `pi`.
pi_zero = normalize_pi(theta_zero)
Let's visualize `pi_zero`.
visualize_matrix(pd.DataFrame(pi_zero, columns=location[:, 0]))
Note that `pi_zero` is not a symmetric matrix. For example, there is an edge from Sapporo to Aomori and an edge from Aomori to Sapporo, but their weights are not the same: in this graph there is only one option for the next city from Sapporo, while there are several options from Aomori.
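A quick way to see this asymmetry (illustrative; indices 0 and 1 are Sapporo and Aomori in the location table):

print(pi_zero[0, 1])  # Sapporo -> Aomori: Sapporo's only option, so 1.0
print(pi_zero[1, 0])  # Aomori -> Sapporo: one of several options, so less than 1.0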
Reinforcement learning treats "actions" and "states" separately: selecting an action in one state causes a transition to another state.
The example covered in this article is a special case where actions and states need not be distinguished. The "state" is the current city, the "action" is the choice of the next city, and that next city becomes the next state. The code is simpler because actions and states do not have to be told apart.
First, let's attempt the trip through Japan with the policy `pi` still equal to `pi_zero`. This is a random walk: at each city, the next city is chosen at random.
Create a function that chooses the next city according to the probabilities given by policy `pi`.
def get_next(town, pi):
return np.random.choice(range(len(pi)), p=pi[town, :])
For example, according to `pi_zero`, the next city after Tokyo (No. 12) is calculated as follows.
get_next(12, pi_zero)
13
Of course, the result will change with each run because it is decided randomly.
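For instance, sampling a few times in a row shows the variation (illustrative):

print([get_next(12, pi_zero) for _ in range(5)])  # five independent draws of "the next city after Tokyo"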
Now, let's search by random walk.
def explore(pi, start=0, goal=46): #Start is 0 (Sapporo), goal is 46 (Naha)
route = [start] #List to put travel history
town = start
while True:
next_town = get_next(town, pi) #Determine the next city
if next_town in route: #If you visit the same city twice
break #Game over
route.append(next_town) #Add the next city to the history
if next_town == goal: #Arrive at the goal
break #Congratulations
else: #Not a goal yet
town = next_town #Set "next city" as "current position"
return route
Start a random walk according to pi_zero
route = explore(pi_zero)
route
[0, 1, 2, 3, 4, 5]
It's random, so of course the result changes with each run. If the city numbers are hard to interpret, converting them to city names makes the route easier to picture:
[location[i, 0] for i in route]
['Sapporo', 'Aomori', 'Morioka', 'Sendai', 'Akita', 'Yamagata']
In this example, the walk went from Sapporo to Yamagata and then the game ended because an already-visited city was chosen as the next destination.
By the way, can we really get from Sapporo to Naha with such a random walk?
Let's create a way to evaluate the quality of a search result: the straight-line distance from the current location to the destination (Naha), and the total distance traveled from the starting point (Sapporo) to the current location.
def evaluate(route, goal=46):
dist_goal = dist_mat[route[-1], goal]
len_route = sum([dist_mat[route[i], route[i + 1]] for i in range(len(route) - 1)])
return [dist_goal, len_route]
`dist_goal` is the straight-line distance from the current location to the destination (Naha), and `len_route` is the total distance traveled from the starting point (Sapporo) to the current location.
Here is an example run.
route = explore(pi_zero)
[location[i, 0] for i in route], evaluate(route)
(['Sapporo', 'Aomori'], [19.597025248636594, 2.3205099949149073])
In this example, the game ended because the walk went from Sapporo to Aomori and then tried to return from Aomori to Sapporo. The distance from the current location (Aomori) to the destination (Naha) is 19.597025248636594, and the distance traveled from the starting point (Sapporo) to the current location (Aomori) is 2.3205099949149073.
The goal of this trip through Japan is first to find routes where `dist_goal` is zero, and then, among those, the one with the smallest `len_route`. A short `len_route` by itself is not worth anything unless the goal has been reached.
Since we will play this game of moving from Sapporo to Naha many times, let's create a function that judges whether a given result is the best one so far.
best_dist_goal = 1000000 #Initialize an impossibly large value
best_len_route = 1000000 #Initialize an impossibly large value
def is_best_ever(): # relies on the globals best_dist_goal, best_len_route, dist_goal, len_route
    if best_dist_goal >= dist_goal: # no worse than the best dist_goal so far
        if best_dist_goal > dist_goal: # strictly better dist_goal than the best so far
            return True # best ever
        elif best_len_route > len_route: # same dist_goal, but shorter len_route than the best so far
            return True # best ever
        else:
            return False # not the best
    else:
        return False # not the best
Now that we're ready, we'll repeat the random walk search 50,000 times.
%%time #Measure execution time
pi = pi_zero.copy() #Initialization of pi
theta = theta_zero.copy() #initialization of theta
best_dist_goal = 1000000 #Initialize an impossibly large value
best_len_route = 1000000 #Initialize an impossibly large value
dist_goal_history0 = [] # dist_history of goal
len_route_history0 = [] # len_route history
best_route0 = [] #Best route
for itera in range(50000): #Repeat 50,000 times
route = explore(pi) #search
dist_goal, len_route = evaluate(route) #Evaluation
dist_goal_history0.append(dist_goal) #Record
len_route_history0.append(len_route) #Record
if is_best_ever(): #If the best of the past
best_dist_goal = dist_goal #Best dist_goal
best_len_route = len_route #Best len_route
best_route0 = route #Best route
CPU times: user 10 s, sys: 118 ms, total: 10.2 s
Wall time: 11.8 s
Create functions to visualize the results. First, a function that plots the distributions of the obtained `dist_goal` and `len_route` values.
def draw_histgrams(dist_goal_history, len_route_history):
plt.figure(figsize=(max(len_route_history) / 4, max(dist_goal_history) / 4))
x_max = max(dist_goal_history + len_route_history)
ax = plt.subplot(2,1,1)
ax.hist(dist_goal_history, label='dist_goal', bins=20)
ax.set_xlim([0, x_max])
ax.grid()
ax.legend()
ax = plt.subplot(2,1,2)
ax.hist(len_route_history, label='len_route', bins=20)
ax.set_xlim([0, x_max])
ax.grid()
ax.legend()
draw_histgrams(dist_goal_history0, len_route_history0)
Starting from Sapporo, it's hard to reach Naha. If you're lucky, you may get there. You can see that there are many cases where the game is over at the beginning of the search.
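To put a number on "if you're lucky" (an illustrative check; the same expression works for the later experiments as well), we can compute the fraction of the 50,000 random walks that actually ended at Naha:

print(np.mean(np.array(dist_goal_history0) == 0))  # fraction of searches with dist_goal == 0, i.e. that reached Naha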
Next, let's look at the relationship between `dist_goal` and `len_route`.
def draw_scatter(dist_goal_history, len_route_history):
plt.figure(figsize=(max(len_route_history) / 4, max(dist_goal_history) / 4))
plt.scatter(len_route_history, dist_goal_history, alpha=0.5)
plt.ylabel('dist_goal')
plt.xlabel('len_route')
plt.ylim([0, max(dist_goal_history) + 1])
plt.xlim([0, max(len_route_history) + 1])
plt.grid()
plt.show()
draw_scatter(dist_goal_history0, len_route_history0)
You can see that Naha was never reached, and that even when the game ends at the same city, the distance traveled up to that point varies (different routes were taken).
Next, let's see how `dist_goal` and `len_route` changed over the course of the search.
def visualize_history(history, interval=50, window=50):
plt.plot(range(0, len(history), interval), #Overall maximum
[np.array(history)[:i].max() for i in range(len(history)) if (i%interval) == 1], label='max')
plt.plot(range(window, len(history)+window, interval),
[np.array(history)[i:i + window].mean() for i in range(len(history)) if (i%interval) == 1],
label='mean(recent)') #Average of recent interval times
plt.plot(range(0, len(history), interval),
[np.array(history)[:i].mean() for i in range(len(history)) if (i%interval) == 1], label='mean') #Overall average
plt.plot(range(0, len(history), interval),
[np.array(history)[:i].min() for i in range(len(history)) if (i%interval) == 1], label='min') #Overall minimum
plt.legend()
plt.grid()
plt.show()
visualize_history(dist_goal_history0)
`dist_goal` occasionally sets a new minimum, but it never reaches Naha, and no matter how many times the search is repeated the average does not improve. That is only natural, since nothing is being learned.
visualize_history(len_route_history0)
Similarly, `len_route` occasionally sets a new maximum, but Naha is never reached, and the average does not improve no matter how many times the search is repeated. Again, that is only natural, since nothing is being learned.
Finally, let's display the best route found in this random walk.
def draw_route(route):
plt.figure(figsize=(11, 9))
for i in route:
plt.text(location[i, 2], location[i, 1], location[i, 0], alpha=0.4, size=12)
plt.grid()
plt.plot([location[i][2] for i in route], [location[i][1] for i in route])
plt.ylabel('latitude')
plt.xlabel('longitude')
plt.show()
draw_route(best_route0)
Despite all that effort, the best route only reached Saga. With luck a random walk may occasionally reach Naha, but nothing is ever learned.
Now it's time to start reinforcement learning. Reinforcement learning methods can be broadly divided into policy gradient methods and value iteration methods. Let's start with the policy gradient method.
A route obtained by the search can be evaluated by how close its end point is to the goal. We update policy `pi` so that the closer a route ends to the goal, the more likely the edges on that route are to be picked in future explorations.
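Written as a formula (my reading of the update implemented below), for every consecutive pair of cities $ (i, j) $ on the explored route:

$ \pi_{ij} \leftarrow \pi_{ij} + \dfrac{\eta}{d(\mathrm{end}, \mathrm{goal}) + 1} $

where $ d(\mathrm{end}, \mathrm{goal}) $ is the straight-line distance from the route's end point to Naha; each row of $ \pi $ is then renormalized so that it sums to 1.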
def update_pi(pi, route, goal=46, eta=0.1): #Eta (η,eta) is the learning rate
new_pi = pi.copy() #Copy numpy array
for i in range(len(route) - 1): #For each side on the route
town1 = route[i] #City of origin of side i
town2 = route[i + 1] #City at the end of side i
new_pi[town1, town2] += eta / (dist_mat[route[-1], goal] + 1)
#The closer the end point of the route is to the goal, the higher the score
return normalize_pi(new_pi) #Update pi
`normalize_pi` is applied at the end so that each row again sums to 1.0.
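A quick sanity check (illustrative; `[0, 1, 2]` is just a hypothetical short route Sapporo → Aomori → Morioka):

pi_test = update_pi(pi_zero, [0, 1, 2])
print(pi_test.sum(axis=1)[:5])  # each row with outgoing edges should sum to (approximately) 1.0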
Now let's start the search.
%%time
pi = pi_zero.copy()
best_dist_goal = 1000000
best_len_route = 1000000
dist_goal_history1 = []
len_route_history1 = []
best_route1 = []
for itera in range(50000):
route = explore(pi)
dist_goal, len_route = evaluate(route)
dist_goal_history1.append(dist_goal)
len_route_history1.append(len_route)
pi = update_pi(pi, route)
if is_best_ever():
best_dist_goal = dist_goal
best_len_route = len_route
best_route1 = route
CPU times: user 59.9 s, sys: 340 ms, total: 1min
Wall time: 1min 1s
Below is an example of the results. They vary from run to run, and in some cases the destination Naha may not be reached even after 50,000 learning iterations.
The resulting distributions of `dist_goal` and `len_route` are very different from those of the random walk. The number of arrivals at the destination Naha has increased overwhelmingly.
draw_histgrams(dist_goal_history1, len_route_history1)
The relationship between `dist_goal` and `len_route` is as follows.
draw_scatter(dist_goal_history1, len_route_history1)
The history of `dist_goal`. The destination Naha was reached much earlier than with the random walk, and after that it became easier and easier to reach.
visualize_history(dist_goal_history1)
The first 5,000 iterations of the `dist_goal` history.
visualize_history(dist_goal_history1[:5000])
Similarly, the history of `len_route`.
visualize_history(len_route_history1)
The first 5,000 iterations of the `len_route` history.
visualize_history(len_route_history1[:5000])
Here is the best route found. You can see that it is very close to the shortest route, with one regrettable detour: it goes from Fukuoka to Kumamoto via Saga, when it should have gone directly from Fukuoka to Kumamoto.
draw_route(best_route1)
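To read the best route as city names, the same conversion as earlier can be used (illustrative):

print([location[i, 0] for i in best_route1])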
Save the resulting policy `pi` under the name `pi_pg`.
pi_pg = pi.copy()
Let's visualize the policy `pi_pg`.
visualize_matrix(pd.DataFrame(pi_pg, columns=location[:, 0]))
The critically important edges show up in red. For example, once you arrive in Kagoshima, the only sensible next city is Naha; choosing anything else (Kumamoto or Miyazaki) is pointless. The learned policy represents that clearly.
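We can check this directly (illustrative; index 45 is Kagoshima in the location table):

kagoshima = 45
for j in np.where(theta_zero[kagoshima] == 1)[0]:
    print(location[j, 0], round(pi_pg[kagoshima, j], 3))  # learned probability of each next city from Kagoshima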
You can check how it has changed compared to the initial value in this way.
visualize_matrix(pd.DataFrame(pi_pg - pi_zero, columns=location[:, 0]))
The places that aren't important haven't changed much. For example, the Shikoku and Kanto regions do not seem to be very important.
In contrast to the policy gradient method above, the epsilon-greedy method introduced here and the Q-learning introduced next are called value iteration methods.
In value iteration methods, the quantity corresponding to the policy is called "Q" rather than "π". Whereas `pi` started from perfectly equal probabilities, `Q` starts from unequal random values.
def generate_Q(theta): #Generate unequal random probabilities from the choices you can take
[m, n] = theta.shape
Q = np.zeros((m, n))
rand = np.random.rand(m, n)
for i in range(m):
for j in range(n):
if theta[i, j] == 1:
Q[i, j] = rand[i, j]
return normalize_pi(Q)
Generate the initial value `Q_zero` of the non-uniform policy `Q`.
Q_zero = generate_Q(theta_zero)
Visualize the generated `Q_zero`.
visualize_matrix(pd.DataFrame(Q_zero, columns=location[:, 0]))
This is the difference between `Q_zero` and `pi_zero`.
visualize_matrix(pd.DataFrame(Q_zero - pi_zero, columns=location[:, 0]))
In the policy gradient method, `get_next` selected the next action at random according to the probabilities given by policy `pi`. For the value iteration methods, `get_next` is rewritten as follows:
def get_next(town, Q, epsilon=0.1):
if np.random.rand() < epsilon:
return np.random.choice(range(len(Q)), p=Q[town, :])
else:
return np.nanargmax(Q[town, :])
In other words, a new parameter `epsilon` is introduced: with probability `epsilon` the action is chosen at random according to the probabilities in `Q`, and otherwise the action with the maximum `Q` value is chosen. Initially `epsilon` is set to a large value, so random choices are frequent; it is then gradually reduced so that random choices become rarer.
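To illustrate the kind of schedule meant here (an assumption matching the search loop used later in this article: start at 0.5 and multiply by 0.99 each iteration):

eps = 0.5
print([round(eps * 0.99 ** k, 4) for k in (0, 100, 1000, 5000)])  # epsilon after 0, 100, 1000, 5000 iterations

Since this decays essentially to zero, the `explore_epsilon_greedy` function defined below clamps it with `min_epsilon`.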
We also introduce the concept of reward: reaching the goal earns +1 point, failing (visiting the same city twice, which ends the game) earns -1 point, and any other move earns 0 points.
In the method below, called `sarsa`, when the `Q` value of an action (an edge on the route) is updated, the `Q` value of the next action (the next edge on the route) is also taken into account, multiplied by the time discount rate `gamma`.
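For reference, this corresponds to the standard SARSA update rule (the code below additionally renormalizes each row with `normalize_pi`), where the state $ s $ is the previous city, the action $ a $ is the current city, and the next action $ a' $ is the next city:

$ Q(s, a) \leftarrow Q(s, a) + \eta \, \bigl( r + \gamma \, Q(s', a') - Q(s, a) \bigr) $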
def sarsa(town, Q, prev_t, next_t, reward, eta=0.1, gamma=0.9, goal=46):
if reward == 1: #dist_mat[town, goal] == 0:
Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town])
elif reward >= 0:
Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town] + gamma * Q[town, next_t])
else:
Q[prev_t, town] = Q[prev_t, town] - eta * Q[prev_t, town]
return normalize_pi(Q)
In this way, the reward earned at the destination propagates back and influences the choice of actions leading toward it. Likewise, the negative reward at game over discourages the past actions that led to it.
def explore_epsilon_greedy(Q, epsilon=0.1, eta=0.1, gamma=0.9, start=0, goal=46, min_epsilon=0.01):
epsilon = max(epsilon, min_epsilon)
route = [start]
town = get_next(start, Q, epsilon)
prev_t = start
while True:
if town in route:
reward = -1
Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
break
elif town == goal:
reward = 1
route.append(town)
Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
break
else:
reward = 0
route.append(town)
next_t = get_next(town, Q, epsilon)
Q = sarsa(town, Q, prev_t, next_t, reward, eta, gamma)
prev_t = town
town = next_t
return [route, Q]
Now let's start the search.
%%time
eta = 0.1 #Learning rate
gamma = 0.9 #Time discount rate
epsilon = 0.5
Q = Q_zero.copy()
best_dist_goal = 1000000
best_len_route = 1000000
best_route2 = []
dist_goal_history2 = []
len_route_history2 = []
for itera in range(50000):
epsilon = epsilon * 0.99
route, Q = explore_epsilon_greedy(Q, epsilon, eta, gamma)
dist_goal, len_route = evaluate(route)
dist_goal_history2.append(dist_goal)
len_route_history2.append(len_route)
if is_best_ever():
best_dist_goal = dist_goal
best_len_route = len_route
best_route2 = route
CPU times: user 7min 50s, sys: 948 ms, total: 7min 50s
Wall time: 7min 53s
This is an example of the search results by the Epsilon-Greedy method. In fact, the results aren't very stable and can change quite a bit each time you run them (try it out).
First, the distributions of `dist_goal` and `len_route`:
draw_histgrams(dist_goal_history2, len_route_history2)
The relationship between `dist_goal` and `len_route`:
draw_scatter(dist_goal_history2, len_route_history2)
The history of `dist_goal`. Increasing the learning rate `eta` makes convergence faster but increases the likelihood of falling into a local solution; decreasing it makes convergence slower but local solutions less likely.
visualize_history(dist_goal_history2)
The first 5,000 iterations of the `dist_goal` history:
visualize_history(dist_goal_history2[:5000])
History of len_route
visualize_history(len_route_history2)
The first 5,000 iterations of the `len_route` history:
visualize_history(len_route_history2[:5000])
The shortest path obtained by the epsilon-greedy method. Again, the result differs somewhat from the true shortest path; compare it with the result obtained by the policy gradient method.
draw_route(best_route2)
In the policy gradient method, the distance to the destination Naha was used to update the `pi` values, so the learning worked toward both "Goal 1: reach the goal" and "Goal 2: find the shortest route" at the same time. As a result, no matter how many times you run it, you get a result close to the shortest path.
With this epsilon-greedy method, the reward only reflects whether the goal was reached and whether the game ended (a city was visited twice). The agent therefore learns "Goal 1: reach the goal" but not "Goal 2: find the shortest route". A result close to the shortest path may happen to appear, but in practice the learning often converges to a route far from the shortest one (try it).
Think about how the epsilon-greedy method could be made to learn "Goal 2: find the shortest route" as well (this is not covered here).
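One possible direction, shown only as a hypothetical sketch and not something implemented in this article: when the goal is reached, give a reward that grows as the route gets shorter, instead of a flat +1 (the function name and the `scale` parameter are my own assumptions).

def length_shaped_reward(route, scale=100.0):
    # Hypothetical reward: larger for shorter successful routes; intermediate steps stay at 0
    dist_goal, len_route = evaluate(route)
    if dist_goal == 0:                    # the route ends at Naha
        return scale / (len_route + 1.0)  # shorter routes earn a larger reward
    return 0.0

Plugging something like this into `explore_epsilon_greedy` in place of the fixed reward of 1 would let the reward itself carry information about route length.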
Save the `Q` values after training with the epsilon-greedy method as `Q_eg`.
Q_eg = Q.copy()
Its value is as follows:
visualize_matrix(pd.DataFrame(Q_eg, columns=location[:, 0]))
Let's compare it with the `pi_pg` obtained by the policy gradient method.
"Q-learning" </ b> is famous as another method of value iterative method. It's basically similar to the Epsilon-Greedy method, except that it doesn't include the randomness that occurs when choosing the "next action". It is said that the convergence will be faster by that amount.
However, in my experience running the code below several times, convergence is sometimes faster but not always, and it often ends up in a local solution where the destination Naha is never reached.
def Q_learning(town, Q, prev_t, reward, eta=0.1, gamma=0.9, goal=46):
if reward == 1: #dist_mat[town, goal] == 0:
Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town])
elif reward >= 0:
Q[prev_t, town] = Q[prev_t, town] + eta * (reward - Q[prev_t, town] + gamma * np.nanmax(Q[town, :]))
else:
Q[prev_t, town] = Q[prev_t, town] - eta * Q[prev_t, town]
return normalize_pi(Q)
def explore_Q_learning(Q, epsilon=0.1, eta=0.1, gamma=0.9, start=0, goal=46):
prev_t = start
route = [start]
town = get_next(start, Q, epsilon)
while True:
if town in route:
reward = -1
Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
break
elif town == goal:
reward = 1
route.append(town)
Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
break
else:
reward = 0
dist_goal, len_route = evaluate(route)
if best_dist_goal > dist_goal:
reward = 1
route.append(town)
next_t = get_next(town, Q, epsilon)
Q = Q_learning(town, Q, prev_t, reward, eta, gamma)
prev_t = town
town = next_t
return [route, Q]
%%time
eta = 0.1 #Learning rate
gamma = 0.9 #Time discount rate
epsilon = 0.5
Q = Q_zero.copy()
best_dist_goal = 1000000
best_len_route = 1000000
best_route3 = []
dist_goal_history3 = []
len_route_history3 = []
for itera in range(50000):
epsilon = epsilon * 0.99
route, Q = explore_Q_learning(Q, epsilon, eta, gamma)
dist_goal, len_route = evaluate(route)
dist_goal_history3.append(dist_goal)
len_route_history3.append(len_route)
if is_best_ever():
best_dist_goal = dist_goal
best_len_route = len_route
best_route3 = route
CPU times: user 9min 50s, sys: 1.41 s, total: 9min 52s
Wall time: 9min 54s
This is an example of search results by Q-learning. As with the Epsilon-Greedy method, the results are not very stable and can vary considerably from run to run (try it).
The distributions of `dist_goal` and `len_route`:
draw_histgrams(dist_goal_history3, len_route_history3)
The relationship between `dist_goal` and `len_route`:
draw_scatter(dist_goal_history3, len_route_history3)
History of dist_goal
visualize_history(dist_goal_history3)
The first 5,000 iterations of the `dist_goal` history:
visualize_history(dist_goal_history3[:5000])
History of len_route
visualize_history(len_route_history3)
The first 5,000 iterations of the `len_route` history:
visualize_history(len_route_history3[:5000])
Shortest path obtained by Q-learning
draw_route(best_route3)
As this example shows, the result is not the true shortest path. The reason, as discussed for the epsilon-greedy method, lies in the reward design: the agent learns "Goal 1: reach the goal" but not "Goal 2: find the shortest route".
Also, as noted above, in my experience Q-learning may converge faster, but in many cases it fails to reach the destination Naha even after 50,000 iterations, and the "shortest path obtained by Q-learning" tends to be longer than the one obtained by the epsilon-greedy method. I suspect this is because the randomness around the next action is suppressed: once a route that happens to reach the goal has been found, even if it is long, there are fewer opportunities to change it than with the epsilon-greedy method.
Q_qlearn = Q.copy()
visualize_matrix(pd.DataFrame(Q_qlearn, columns=location[:, 0]))
And that's all for reinforcement learning. Okinawa really is far away.
There is also deep reinforcement learning, which combines this with deep learning, but that is a topic for another time. Ciao.