Last time: University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (15) https://github.com/legacyworld/sklearn-basic
YouTube commentary: 8th lecture (1), around the 27-minute mark. I could not reproduce the result explained in the lecture, probably because my implementation of the steepest descent method is flawed. I also tried ridge regression and other variations, but the result did not change much.
Mathematically, it implements the following:
E(w) = -\frac{1}{N}\sum_{n=1}^{N}\left\{t_n\,\log\hat{t}_n + (1-t_n)\,\log(1-\hat{t}_n)\right\}\\
\frac{\partial E(w)}{\partial w} = X^T(\hat{t}-t)\\
w \leftarrow w - \eta X^T(\hat{t}-t)
In the iris data, $N = 150$, and $w$ is 5-dimensional once the intercept is added. $E(w)$ is divided by $N$ because otherwise the initial cost does not match the one shown in the lecture. Even so, only the step size 0.1 diverged; everything else converged properly. I suspect something is wrong, but I am not sure where.
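As a sanity check on the formulas above, here is a minimal sketch of the steepest descent loop in plain NumPy, separate from the assignment script further below. It assumes X is an N x d design matrix whose first column is all ones and t holds 0/1 labels; eta and n_iter are arbitrary illustrative values.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, t, eta=0.01, n_iter=1000):
    # X: (N, d) design matrix including the intercept column, t: (N,) labels in {0, 1}
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        t_hat = sigmoid(X @ w)                                            # predicted probabilities
        cost = -np.mean(t * np.log(t_hat) + (1 - t) * np.log(1 - t_hat))  # E(w) with the 1/N factor
        w = w - eta * X.T @ (t_hat - t)                                   # w <- w - eta * X^T (t_hat - t)
    return w, cost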
\nabla\nabla E(w) = X^TRX,\quad R = \mathrm{diag}\bigl(\hat{t}_n(1-\hat{t}_n)\bigr)\ (\text{an } N\times N \text{ diagonal matrix})\\
w \leftarrow w-(X^TRX)^{-1}X^T(\hat{t}-t)
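The Newton update can be sketched the same way (again an illustration with the same shape assumptions as the sketch above, not the assignment code; np.linalg.solve is used rather than forming the inverse explicitly):

import numpy as np

def newton_logistic(X, t, n_iter=10):
    # X: (N, d) design matrix including the intercept column, t: (N,) labels in {0, 1}
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        t_hat = 1 / (1 + np.exp(-X @ w))
        R = np.diag(t_hat * (1 - t_hat))                 # N x N diagonal matrix
        H = X.T @ R @ X                                   # Hessian X^T R X, d x d
        w = w - np.linalg.solve(H, X.T @ (t_hat - t))     # w <- w - (X^T R X)^{-1} X^T (t_hat - t)
    return w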
Newton's method comes out roughly right, but the final cost is unfortunately quite different from the lecture's. It is also hard to interpret why only three weights are shown in the lecture's results.
Here is the source code. Since it reuses the source code from Exercise 4.3, the model is wrapped in a BaseEstimator subclass, but that has no particular significance here.
python:Homework_7.3.py
# Exercise 7.3: Comparison of gradient descent and Newton's method in logistic regression
# YouTube commentary: 8th lecture (1), around the 27-minute mark
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
import statsmodels.api as sm
from sklearn.datasets import load_iris

iris = load_iris()

class MyEstimator(BaseEstimator):
    def __init__(self, ep, eta):
        self.ep = ep
        self.eta = eta
        self.loss = []

    def fit(self, X, y, f):
        m = len(y)
        loss = []
        diff = 10**(10)
        ep = self.ep
        # Number of features (including the intercept)
        dim = X.T.shape[1]
        # Initial value of beta
        beta = np.zeros(dim).reshape(-1, 1)
        eta = self.eta
        while abs(diff) > ep:
            t_hat = self.sigmoid(beta.T, X)
            loss.append(-(1/m)*np.sum(y*np.log(t_hat) + (1-y)*np.log(1-t_hat)))
            # Steepest descent method
            if f == "GD":
                beta = beta - eta*np.dot(X, (t_hat-y).reshape(-1, 1))
            # Newton's method
            else:
                # N x N diagonal matrix
                R = np.diag((t_hat*(1-t_hat))[0])
                # Hessian matrix
                H = np.dot(np.dot(X, R), X.T)
                beta = beta - np.dot(np.linalg.inv(H), np.dot(X, (t_hat-y).reshape(-1, 1)))
            if len(loss) > 1:
                diff = loss[len(loss)-1] - loss[len(loss)-2]
                if diff > 0:
                    break
        self.loss = loss
        self.coef_ = beta
        return self

    def sigmoid(self, w, x):
        return 1/(1+np.exp(-np.dot(w, x)))

# Plots
fig = plt.figure(figsize=(20, 10))
ax = [fig.add_subplot(3, 3, i+1) for i in range(9)]
# Only consider whether the sample is virginica or not
target = 2
X = iris.data
y = iris.target
# Set y to 0 if it is not 2 (not virginica), 1 if it is
y[np.where(np.not_equal(y, target))] = 0
y[np.where(np.equal(y, target))] = 1
scaler = preprocessing.StandardScaler()
X_fit = scaler.fit_transform(X)
X_fit = sm.add_constant(X_fit).T  # Add a column of ones for the intercept, then transpose
epsilon = 10 ** (-8)
# Steepest descent method
eta_list = [0.1, 0.01, 0.008, 0.006, 0.004, 0.003, 0.002, 0.001, 0.0005]
for index, eta in enumerate(eta_list):
    myest = MyEstimator(epsilon, eta)
    myest.fit(X_fit, y, "GD")
    ax[index].plot(myest.loss)
    ax[index].set_title(f"Optimization with Gradient Descent\nStepsize = {eta}\nIterations:{len(myest.loss)}; Initial Cost is:{myest.loss[0]:.3f}; Final Cost is:{myest.loss[-1]:.6f}")
plt.tight_layout()
plt.savefig("7.3GD.png")
# Newton's method
myest.fit(X_fit, y, "newton")
plt.clf()
plt.plot(myest.loss)
plt.title(f"Optimization with Newton Method\nInitial Cost is:{myest.loss[0]:.3f}; Final Cost is:{myest.loss[-1]:.6f}")
plt.savefig("7.3Newton.png")
# Results from sklearn's LogisticRegression
X_fit = scaler.fit_transform(X)
# penalty='none' disables regularization (newer scikit-learn versions spell this penalty=None)
clf = LogisticRegression(penalty='none')
clf.fit(X_fit, y)
print(f"accuracy_score = {metrics.accuracy_score(clf.predict(X_fit), y)}")
print(f"coef = {clf.coef_} intercept = {clf.intercept_}")
In the lecture, step sizes down to 0.003 diverged and the minimum was reached at 0.002, but my result was completely different.
The final cost is an order of magnitude smaller than in the lecture, but the number of iterations is about the same, so it does not seem to be badly wrong.
accuracy_score = 0.9866666666666667
coef = [[-2.03446841 -2.90222851 16.58947002 13.89172352]] intercept = [-20.10133936]
The obtained parameters are as follows. For the steepest descent method, the result shown is the one with the smallest final cost (step size = 0.01).
Steepest descent: (w_0,w_1,w_2,w_3,w_4) = (-18.73438888, -1.97839772, -2.69938233, 15.54339092, 12.96694841)\\
Newton's method: (w_0,w_1,w_2,w_3,w_4) = (-20.1018028, -2.03454941, -2.90225059, 16.59009858, 13.89184339)\\
sklearn: (w_0,w_1,w_2,w_3,w_4) = (-20.10133936, -2.03446841, -2.90222851, 16.58947002, 13.89172352)
Newton's method is certainly fast, but it requires computing an inverse matrix, so as the number of dimensions or the number of samples grows, does it eventually give way to stochastic gradient descent?
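As a rough point of comparison (my own addition, not part of the lecture), scikit-learn's SGDClassifier fits the same model with stochastic gradient descent when loss='log_loss' is chosen (older versions call this loss 'log'). Note that the default L2 regularization (alpha) means the coefficients will not exactly match the unregularized results above.

from sklearn.linear_model import SGDClassifier
from sklearn import preprocessing
from sklearn.datasets import load_iris

iris = load_iris()
X = preprocessing.StandardScaler().fit_transform(iris.data)
y = (iris.target == 2).astype(int)  # 1 if virginica, 0 otherwise, as above

# Stochastic gradient descent on the logistic-regression loss
sgd = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-8, random_state=0)
sgd.fit(X, y)
print(f"coef = {sgd.coef_} intercept = {sgd.intercept_}")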
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (1)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (2)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (3)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (4)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (5)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (6)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (7) Make your own steepest descent method
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (8) Make your own stochastic steepest descent method
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (9)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (10)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (11)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (12)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (13)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (14)
https://github.com/legacyworld/sklearn-basic
https://ocw.tsukuba.ac.jp/course/systeminformation/machine_learning/