Saddle point search using the gradient method

1. What is a saddle point?

A saddle point is a point where the gradient is zero and the function takes a minimum along one direction and a maximum along another. It is an important theme in many fields, as represented by GANs in machine learning and transition states in chemical reaction path searches. A GAN consists of a generator that produces data that looks just like the real thing and a discriminator that distinguishes real data from generated data; the discriminator is trained to maximize the objective while the generator is trained to minimize it, so training seeks a saddle point. In a chemical reaction, stable raw materials pass through saddle points on the potential energy surface and change into more stable compounds. In this article, we search for saddle points using the gradient method. The method works well for simple models but behaves somewhat unstably for complex models (e.g. quantum chemistry), so I will update the article as soon as it can be improved. If you find this article interesting, an LGTM would be appreciated!

2. Definition of saddle point

The saddle point is defined as follows. Let the objective function be $y = f(\boldsymbol{x})$ with $\boldsymbol{x} = (x_1, x_2, \cdots, x_n)^\mathrm{T}$, let $\boldsymbol{b} = (b_1, b_2, \cdots, b_n)^\mathrm{T}$ be the unit vector corresponding to the minimum direction, and let $\boldsymbol{c} = (c_1, c_2, \cdots, c_n)^\mathrm{T}$ be the unit vector corresponding to the maximum direction. A saddle point is a point where the gradient of $f$ is $0$, $y = f(\boldsymbol{x} + \boldsymbol{b}t)$ takes a minimum at $t = 0$ with respect to $t$, and $y = f(\boldsymbol{x} + \boldsymbol{c}t)$ takes a maximum at $t = 0$ with respect to $t$. Expressed as formulas:

At the saddle point $\boldsymbol{x} = \boldsymbol{x}_0$,

\left.\frac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}\right |_{\boldsymbol{x}=\boldsymbol{x}_0}= 0\\

\left.\frac{\partial^2 f(\boldsymbol{x}+\boldsymbol{b}t)}{\partial t^2}\right |_{\boldsymbol{x}=\boldsymbol{x}_0, t=0} > 0\\

\left.\frac{\partial^2 f(\boldsymbol{x}+\boldsymbol{c}t)}{\partial t^2}\right |_{\boldsymbol{x}=\boldsymbol{x}_0, t=0} < 0

hold.

Consider $y = x_1^2 - x_2^2$ as an example. At $x_1 = 0$, $x_2 = 0$ the gradient is $0$; with $\boldsymbol{b} = (1, 0)^\mathrm{T}$, $y = (0 + 1 \cdot t)^2 - (0 + 0 \cdot t)^2 = t^2$ takes a minimum at $t = 0$, and with $\boldsymbol{c} = (0, 1)^\mathrm{T}$, $y = (0 + 0 \cdot t)^2 - (0 + 1 \cdot t)^2 = -t^2$ takes a maximum at $t = 0$. Therefore this point is a saddle point.
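As a quick sanity check, the following minimal sketch (assuming only NumPy; the variable names are my own and not part of the code in section 7) evaluates these three conditions numerically at the origin with central finite differences:

import numpy as np

def f(x):
    return x[0]**2 - x[1]**2

x0 = np.array([0.0, 0.0])
b = np.array([1.0, 0.0])   # candidate minimum direction
c = np.array([0.0, 1.0])   # candidate maximum direction
delta = 1.0e-4

# Gradient by central differences: approximately zero at the saddle point
grad = np.array([(f(x0 + delta*e) - f(x0 - delta*e)) / (2.0*delta) for e in np.eye(2)])

# Second derivatives along b and c: positive along b, negative along c
d2_b = (f(x0 + delta*b) - 2.0*f(x0) + f(x0 - delta*b)) / delta**2
d2_c = (f(x0 + delta*c) - 2.0*f(x0) + f(x0 - delta*c)) / delta**2

print(grad)        # ~ [0, 0]
print(d2_b, d2_c)  # ~ 2.0, -2.0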

3. Saddle point search

This method consists of two steps: initialization and search. The initialization finds appropriate initial values for $\boldsymbol{b}$ and $\boldsymbol{c}$. The search then moves $\boldsymbol{x}$ toward the saddle point according to the gradient.

3.1 Initialization

As defined above, $\boldsymbol{b}$ corresponds to the minimum direction and $\boldsymbol{c}$ corresponds to the maximum direction. First, determine $\boldsymbol{b}$ so that the second derivative of $y = f(\boldsymbol{x} + \boldsymbol{b}t)$ with respect to $t$ is maximized. For simplicity, we use an iterative steepest-descent/ascent update. Expressed as formulas:

\mathrm{grad}\ \boldsymbol{b}_1 \leftarrow \frac{1}{\delta} \left( \nabla f \left(\boldsymbol{x}+\delta \boldsymbol{b} \right) - \nabla f \left(\boldsymbol{x}-\delta \boldsymbol{b} \right) \right)\\

\mathrm{grad}\ \boldsymbol{b}_2 \leftarrow \mathrm{grad}\ \boldsymbol{b}_1 - \left( \boldsymbol{b} \cdot \mathrm{grad}\ \boldsymbol{b}_1 \right)\boldsymbol{b}\\

\boldsymbol{b} \leftarrow \boldsymbol{b} + \epsilon_1 \ \mathrm{grad}\ \boldsymbol{b}_2\\

\boldsymbol{b} \leftarrow \frac{\boldsymbol{b}}{\mathrm{norm} \left(\boldsymbol{b}\right)}\\

Here $\delta$ is a small step and $\epsilon_1$ is the learning rate. The first equation approximates, with a finite difference of $\nabla f$, the gradient with respect to $\boldsymbol{b}$ of the second directional derivative $\partial^2 f(\boldsymbol{x}+\boldsymbol{b}t)/\partial t^2 = \boldsymbol{b}^\mathrm{T} \nabla^2 f\, \boldsymbol{b}$ (up to a constant factor this is the Hessian-vector product $\nabla^2 f\, \boldsymbol{b}$); the formula was arranged this way so that only the gradient of $f$ is needed. Using this directly as the update amount is not appropriate for the unit vector $\boldsymbol{b}$, so the second equation projects out the component along $\boldsymbol{b}$, converting it into an update along the unit sphere. The third equation updates $\boldsymbol{b}$ toward the maximum of the second derivative, and the fourth equation normalizes $\boldsymbol{b}$ back to a unit vector. Similarly, determine $\boldsymbol{c}$ so that the second derivative of $y = f(\boldsymbol{x} + \boldsymbol{c}t)$ with respect to $t$ is minimized. Expressed as formulas:

\mathrm{grad}\ \boldsymbol{c}_1 \leftarrow \frac{1}{\delta} \left( \nabla f \left(\boldsymbol{x}+\delta \boldsymbol{c} \right) - \nabla f \left(\boldsymbol{x}-\delta \boldsymbol{c} \right) \right)\\

\mathrm{grad}\ \boldsymbol{c}_2 \leftarrow \mathrm{grad}\ \boldsymbol{c}_1 - \left( \boldsymbol{c} \cdot \mathrm{grad}\ \boldsymbol{c}_1 \right)\boldsymbol{c}\\

\boldsymbol{c} \leftarrow \boldsymbol{c} - \epsilon_1 \ \mathrm{grad}\ \boldsymbol{c}_2\\

\boldsymbol{c} \leftarrow \frac{\boldsymbol{c}}{\mathrm{norm} \left(\boldsymbol{c}\right)}

The difference from $\boldsymbol{b}$ is that the third equation updates $\boldsymbol{c}$ in the direction that minimizes the second derivative. The norms of $\mathrm{grad}\ \boldsymbol{b}_2$ and $\mathrm{grad}\ \boldsymbol{c}_2$ can be used as convergence tests.
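Putting the four update equations together, a minimal sketch of the initialization loop for $\boldsymbol{b}$ could look like the following (the loop for $\boldsymbol{c}$ is identical except for the sign of the update); the names `grad_f`, `x`, `delta`, and `eps1` are illustrative and assume the example function $y = x_1^2 - x_2^2$ from section 2:

import numpy as np

def grad_f(x):
    # Gradient of the example function y = x1^2 - x2^2
    return np.array([2.0*x[0], -2.0*x[1]])

x = np.array([-2.0, -1.0])   # fixed point around which b is initialized
delta, eps1 = 1.0e-3, 0.1

b = np.random.rand(2)
b /= np.linalg.norm(b)

for _ in range(1000):
    # grad b_1: finite difference of the gradient of f along b
    grad_b1 = (grad_f(x + delta*b) - grad_f(x - delta*b)) / delta
    # grad b_2: remove the component along b (update tangent to the unit sphere)
    grad_b2 = grad_b1 - np.dot(b, grad_b1)*b
    # ascend to maximize the curvature along b, then renormalize
    b = b + eps1*grad_b2
    b /= np.linalg.norm(b)
    if np.linalg.norm(grad_b2) < 1.0e-8:
        break

print(b)   # converges to approximately (±1, 0) for this function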

3.2 Saddle point search

The search uses $\mathrm{grad}\ \boldsymbol{b}_2$ and $\mathrm{grad}\ \boldsymbol{c}_2$: there is a minimum in the direction of $-\mathrm{grad}\ \boldsymbol{b}_2$ and a maximum in the direction of $\mathrm{grad}\ \boldsymbol{c}_2$, so updating $\boldsymbol{x}$ with their combination moves it toward the saddle point. Expressed as a formula:

\boldsymbol{x} \leftarrow \boldsymbol{x} + \epsilon_2 \ \left( -\mathrm{grad}\ \boldsymbol{b}_2 +\mathrm{grad}\ \boldsymbol{c}_2 \right)

Because updating $\boldsymbol{x}$ changes $\mathrm{grad}\ \boldsymbol{b}_2$ and $\mathrm{grad}\ \boldsymbol{c}_2$, the update of $\boldsymbol{b}$ and $\boldsymbol{c}$ performed during initialization must also be repeated at every search step.
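For reference, the implementation in section 7 realizes one search step with the directional slopes of $f$ along $\boldsymbol{b}$ and $\boldsymbol{c}$ (descend along $\boldsymbol{b}$, ascend along $\boldsymbol{c}$), computed there by finite differences of $f$. A minimal sketch of a single step, reusing `grad_f` and `x` from the sketch above and assuming already initialized direction vectors `b`, `c` and a search learning rate `eps2`, is:

# One search step (sketch); grad_f and x come from the initialization sketch above
b = np.array([1.0, 0.0])   # minimum direction (from initialization)
c = np.array([0.0, 1.0])   # maximum direction (from initialization)
eps2 = 0.1

g = grad_f(x)
slope_b = np.dot(g, b)     # df/dt along b
slope_c = np.dot(g, c)     # df/dt along c

# descend along b (toward the minimum) and ascend along c (toward the maximum)
x = x - eps2*slope_b*b + eps2*slope_c*c

# after moving x, b and c must be re-updated as described above

Iterating this step together with the $\boldsymbol{b}$, $\boldsymbol{c}$ updates until both slopes are close to zero corresponds to the convergence test used in the source code below.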

4. Implementation

The saddle points of several functions were derived with the above algorithm. The update rule is plain steepest descent/ascent, and everything was run in Python. In the figures, the red line is the minimum-direction vector and the green line is the maximum-direction vector.

・ Function $y = x_1^2 - x_2^2$; initial value $x_1 = -2$, $x_2 = -1$; small step $\delta = 0.001$; learning rates $\epsilon_1 = 0.1$, $\epsilon_2 = 0.1$. Result: saddle point $x_1 = -0.0003$, $x_2 = -0.001$, $\boldsymbol{b} = (1, 0)^\mathrm{T}$, $\boldsymbol{c} = (0, 1)^\mathrm{T}$

Initialization

Saddle point search

・ Function $y = x_1^2 + x_2^3 - x_2$; initial value $x_1 = -2$, $x_2 = -0.5$; small step $\delta = 0.05$; learning rates $\epsilon_1 = 0.1$, $\epsilon_2 = 0.1$. Result: saddle point $x_1 = -0.0011$, $x_2 = -0.5774$, $\boldsymbol{b} = (1, 0)^\mathrm{T}$, $\boldsymbol{c} = (0, 1)^\mathrm{T}$

Initialization

Saddle point search
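For reference, this result can be checked analytically. Setting the gradient to zero gives $\partial f/\partial x_1 = 2x_1 = 0$ and $\partial f/\partial x_2 = 3x_2^2 - 1 = 0$, i.e. $x_1 = 0$ and $x_2 = \pm 1/\sqrt{3} \approx \pm 0.5774$. The Hessian is $\mathrm{diag}(2,\ 6x_2)$, which is indefinite only at $x_2 = -1/\sqrt{3}$, so that point is the saddle point, in agreement with the numerical result above.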

5. Challenges

As improvements, since this is a gradient method, the conjugate gradient method, the Wegstein method, and the various optimizers used in machine learning (e.g. Momentum, RMSProp, Adam, RAdam) could be used in place of plain steepest descent. The remaining problem is stability. I had hoped the method could be used for the saddle point searches needed in chemical reaction prediction, but it converged to strange structures. Also, depending on the initial value, it may not converge to the saddle point. An example is shown below; it differs from the previous example only in the initial value.

・ Function $y = x_1^2 + x_2^3 - x_2$; initial value $x_1 = -2$, $x_2 = -0.5$; small step $\delta = 0.05$; learning rates $\epsilon_1 = 0.1$, $\epsilon_2 = 0.1$

Initialization

Saddle point search

In this example, $\boldsymbol{b}$ and $\boldsymbol{c}$ were initialized so that the $x_2$ direction is treated as the minimum direction and the $x_1$ direction as the maximum direction. As a result, the search descends toward the minimum of the cubic term while climbing the slope of the quadratic term.

6. Conclusion

In this article, we searched for saddle points using the gradient method. A two-variable function was used for visibility, but the method can be applied to functions of any number of variables. If you have questions, I will reply in the comments. Feel free to comment if you have requests regarding the formula derivations or the source code.

7. Source code (quadratic function)

The function, initial values, and learning rate are set in the "Calculation 1 Initialization" block. You can also swap the contents of the SDG class for other optimizers (a momentum-based sketch is included right after the SDG class below as one example).

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from datetime import datetime
np.set_printoptions(precision=4, floatmode='maxprec')

class SDG:
    def __init__(self, learning_rate):
        self.learning_rate = learning_rate
        
    def solve(self, coord, gradient):
        return coord - self.learning_rate*gradient
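
# --- Optional sketch, not part of the original article: a momentum-based
# --- drop-in replacement for SDG. It keeps the same solve() interface,
# --- so an instance of Momentum can be used wherever SDG is constructed.
class Momentum:
    def __init__(self, learning_rate, beta=0.9):
        self.learning_rate = learning_rate
        self.beta = beta              # momentum coefficient (assumed value)
        self.velocity = None          # running velocity, created lazily

    def solve(self, coord, gradient):
        # Accumulate an exponentially decaying velocity, then step with it
        if self.velocity is None:
            self.velocity = np.zeros_like(gradient)
        self.velocity = self.beta*self.velocity - self.learning_rate*gradient
        return coord + self.velocity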

class SaddlePoint:
    
    def __init__(self, function, gradient, coordinate, delta=1.0e-3,
                 learning_rate=0.5, max_iteration=1000,
                 judgment1=1.0e-8, judgment2=1.0e-8, log_span=100):
        """
Initialization constructor
        function --Objective function
        differential --First derivative of the function
        coordinates --Initial coordinates
        delta --Small value
        learning_rate --Learning rate
        judgment --Convergence test 1
        judgment --Convergence test 2
        log_span --Log display interval
        """
        self.function = function
        self.gradient = gradient
        self.coordinate = coordinate.copy()
        self.dim = len(coordinate)
        self.delta = delta
        self.learning_rate = learning_rate
        self.max_iteration = max_iteration
        self.judgment1 = judgment1
        self.judgment2 = judgment2
        self.log_span = log_span
        
        #Basis vector
        self.b = np.random.rand(self.dim)
        self.b /= np.linalg.norm(self.b)
        self.c = np.random.rand(self.dim)
        self.c /= np.linalg.norm(self.c)
        
        # SDG
        self.sdg_b_init = SDG(learning_rate)
        self.sdg_c_init = SDG(learning_rate)
        self.sdg_b_solv = SDG(learning_rate)
        self.sdg_c_solv = SDG(learning_rate)
        self.sdg_a_solv = SDG(learning_rate)
        
    def initialize(self):
        """
The function to initialize. Appropriate b,Determine c
Return value--Minimum value direction vector b,Maxima direction vector c
        """
        #Slope
        gradient = self.gradient
        #Coordinates
        coordinate = self.coordinate
        #Basis vector
        b = self.b.copy()
        c = self.c.copy()
        #Update amount
        diff_b = np.zeros_like(b)
        diff_c = np.zeros_like(c)
        #Learning rate
        learning_rate = self.learning_rate
        #Small value
        delta = self.delta
        #Standardization
        norm = np.linalg.norm
        #Judgment
        judgement1 = self.judgment1
        #Log interval
        log_span = self.log_span
        # SDG
        sdg_b = self.sdg_b_init
        sdg_c = self.sdg_c_init
        
        z, _ = gradient(coordinate)
        
        print("-----Initialization of b has started.-----")
        for i in range(self.max_iteration):
            #First derivative
            z_b1, grad_b1 = gradient(coordinate + delta*b)
            z_b2, grad_b2 = gradient(coordinate - delta*b)
            #Change amount calculation
            nabla_b = (grad_b1 - grad_b2)/delta
            grad_b = nabla_b - (np.dot(b, nabla_b))*b
            #update
            b = sdg_b.solve(b, -grad_b)
            #Standardization
            b /= norm(b)
            #Convergence test
            error = np.linalg.norm(grad_b)
            if i%log_span == 0:
                print("Iteration = {}, Error = {}".format(i, error))
            if error < judgement1:
                print("Converged! Iteration = {}, Error = {}".format(i, error))
                break
        self.b = b.copy()
        
        print()
        print("-----Initialization of c has started.-----")
        for i in range(self.max_iteration):
            #Gradient calculation
            z_c1, grad_c1 = gradient(coordinate + delta*c)
            z_c2, grad_c2 = gradient(coordinate - delta*c)
            #Change amount calculation
            nabla_c = (grad_c1 - grad_c2)/delta
            grad_c = nabla_c - (np.dot(c, nabla_c))*c
            #update
            c = sdg_c.solve(c, grad_c)
            #Standardization
            c /= norm(c)
            #Convergence test
            error = np.linalg.norm(grad_c)
            if i%log_span == 0:
                print("Iteration = {}, Error = {}".format(i, error))
            if error < judgement1:
                print("Converged! Iteration = {}, Error = {}".format(i, error))
                break
        self.c = c.copy()
        
        print()
        print("Result")
        print("b = {}".format(self.b))
        print("c = {}".format(self.c))
        print()
        return self.b, self.c
    
    def solve(self):
        """
Search for saddle points
Return value--Saddle point coordinates coordinate,Minimum value direction vector b,Maxima direction vector c
        """
        #Slope
        gradient = self.gradient
        #Coordinates, a collection of coordinates
        coordinate = self.coordinate.copy()
        coordinate_Array = coordinate.copy()
        #Basis vector
        b = self.b.copy()
        c = self.c.copy()
        #Update amount
        diff_b = np.zeros_like(b)
        diff_c = np.zeros_like(c)
        #Learning rate
        learning_rate = self.learning_rate
        #Small value
        delta = self.delta
        #Standardization
        norm = np.linalg.norm
        #Judgment
        judgement1 = self.judgment1
        judgement2 = self.judgment2
        #Log interval
        log_span = self.log_span
        # SDG
        sdg_a = self.sdg_a_solv
        sdg_b = self.sdg_b_solv
        sdg_c = self.sdg_c_solv
        
        print("-----Saddle-point solver has started.-----")
        for i in range(self.max_iteration):
            #First derivative
            z_b1, grad_b1 = gradient(coordinate + delta*b)
            z_b2, grad_b2 = gradient(coordinate - delta*b)
            z_c1, grad_c1 = gradient(coordinate + delta*c)
            z_c2, grad_c2 = gradient(coordinate - delta*c)
            grad_through_b = (z_b1-z_b2) / (2.0*delta)
            grad_through_c = (z_c1-z_c2) / (2.0*delta)
            #Second derivative
            z, _ = gradient(coordinate)
            grad2_through_b = (z_b1-2.0*z+z_b2) / delta**2.0
            grad2_through_c = (z_c1-2.0*z+z_c2) / delta**2.0
            
            #update
#            coordinate = sdg_a.solve(coordinate,
#                                     grad_through_b*b/(np.linalg.norm(grad_through_b)+np.linalg.norm(grad2_through_b))
#                                     -grad_through_c*c/(np.linalg.norm(grad_through_c)+np.linalg.norm(grad2_through_c)))
            coordinate = sdg_a.solve(coordinate, grad_through_b*b - grad_through_c*c)
            coordinate_Array = np.vstack([coordinate_Array, coordinate])
            #Convergence test
            error_coordinate = np.linalg.norm(grad_through_b**2 + grad_through_c**2)
            
            # b,update of c
            nabla_b = -(grad_b1 - grad_b2)/delta
            grad_b = nabla_b - (np.dot(b, nabla_b))*b
            #update
            b = sdg_b.solve(b, grad_b)
            #Standardization
            b /= norm(b)
            #Convergence test
            error_b = np.linalg.norm(grad_b)
            
            nabla_c = (grad_c1 - grad_c2)/delta
            grad_c = nabla_c - (np.dot(c, nabla_c))*c
            #update
            c = sdg_c.solve(c, grad_c)
            #Standardization
            c /= norm(c)
            #Convergence test
            error_c = np.linalg.norm(grad_c)
            
            if i%log_span == 0:
                print("B converged! Iteration = {}, Error = {}".format(i, error_b))
                print("C converged! Iteration = {}, Error = {}".format(i, error_c))
                print("Iteration = {}, Error = {}".format(i, error_coordinate))
                print()
            if error_coordinate < judgement2:
                print("Converged! Iteration = {}, Error = {}".format(i, error_coordinate))
                break
        
        self.coordinate = coordinate.copy()
        self.b = b.copy()
        self.c = c.copy()
        
        print()
        print("Result")
        print("coordinate = {}".format(self.coordinate))
        print("b = {}".format(self.b))
        print("c = {}".format(self.c))
        print()
        return self.coordinate, coordinate_Array, self.b, self.c
# =============================================================================
#Calculation 1 Initialization
# =============================================================================
def f(x):
    #function
    return x[0]**2 - x[1]**2

def gradient_f(x):
    #First derivative of the function
    return f(x), np.array([2*x[0], -2*x[1]])

x_init = np.array([-2.0, -1.0], dtype="float")

saddlePoint = SaddlePoint(f, gradient_f, x_init, delta=1e-3,
                          learning_rate=0.1, max_iteration=100,
                          judgment1=1.0e-5, judgment2=1.0e-5, log_span=1)

b, c = saddlePoint.initialize() #Initialization

# =============================================================================
#Graph drawing(2D)
# =============================================================================
t = np.linspace(-1.0, 1.0, 100)
tb = np.linspace(x_init-1.0*b, x_init+1.0*b, 100)
tc = np.linspace(x_init-1.0*c, x_init+1.0*c, 100)
fb = f(tb.T)
fc = f(tc.T)
plt.xlabel("t")
plt.ylabel("z")
plt.plot(t, fb, c="red")
plt.plot(t, fc, c="green")
plt.savefig("file/" + str(datetime.now().strftime("%Y_%m_%d_%H_%M_%S")) + "_2d1.png ", dpi=300)
plt.show()

# =============================================================================
#Graph drawing(3D)
# =============================================================================
#Wire frame
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
X = np.linspace(-2.5, 2.5, 50)
Y = np.linspace(-2.5, 2.5, 50)
X, Y = np.meshgrid(X, Y)
Z = f([X,Y])

#Basis vector
width = 1.0
bt = np.linspace(x_init-width*b,x_init+width*b,10)
bz = f(bt.T)
ct = np.linspace(x_init-width*c,x_init+width*c,10)
cz = f(ct.T)

#Trajectory
zArray = f(x_init)
#display
ax.plot_wireframe(X,Y,Z,color="gray",linewidth=0.2) #Wire frame
ax.plot(bt[:,0],bt[:,1],bz,color="red") #Minimal
ax.plot(ct[:,0],ct[:,1],cz,color="green") #maximum
ax.scatter(x_init[0],x_init[1],f(x_init),color="blue") #Trajectory
plt.savefig("file/" + str(datetime.now().strftime("%Y_%m_%d_%H_%M_%S")) + "_3d1.png ", dpi=300)
plt.show()

# =============================================================================
#Calculation 2 Saddle point search
# =============================================================================
x, xArray, b, c = saddlePoint.solve() #Saddle point calculation

# =============================================================================
#Graph drawing(2D)
# =============================================================================
t = np.linspace(-1.0, 1.0, 100)
tb = np.linspace(x-1.0*b, x+1.0*b, 100)
tc = np.linspace(x-1.0*c, x+1.0*c, 100)
fb = f(tb.T)
fc = f(tc.T)
plt.xlabel("t")
plt.ylabel("z")
plt.plot(t, fb, c="red")
plt.plot(t, fc, c="green")
plt.savefig("file/" + str(datetime.now().strftime("%Y_%m_%d_%H_%M_%S")) + "_2d2.png ", dpi=300)
plt.show()

# =============================================================================
#Graph drawing(2D)
# =============================================================================
#Wire frame
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
X = np.linspace(-2.5, 2.5, 50)
Y = np.linspace(-2.5, 2.5, 50)
X, Y = np.meshgrid(X, Y)
Z = f([X,Y])

#Basis vector
width = 1.0
bt = np.linspace(x-width*b,x+width*b,10)
bz = f(bt.T)
ct = np.linspace(x-width*c,x+width*c,10)
cz = f(ct.T)

#Trajectory
zArray = f(xArray.T)

#display
ax.plot_wireframe(X,Y,Z,color="gray",linewidth=0.2) #Wire frame
ax.plot(bt[:,0],bt[:,1],bz,color="red") #Minimal
ax.plot(ct[:,0],ct[:,1],cz,color="green") #maximum
ax.scatter(xArray[:,0],xArray[:,1],zArray,color="blue") #Trajectory
ax.text(xArray[0,0],xArray[0,1],zArray[0], "start")
ax.text(xArray[-1,0],xArray[-1,1],zArray[-1], "goal")
plt.savefig("file/" + str(datetime.now().strftime("%Y_%m_%d_%H_%M_%S")) + "_3d2.png ", dpi=300)
plt.show()
