Step by step, I will study the theory, a Python implementation, and an analysis with scikit-learn of the algorithms previously taken up in "Classification of Machine Learning". I am writing this for my own learning, so please overlook any mistakes.
This time the topic is **logistic regression**. Although it is written as "regression," logistic regression is, like the perceptron, an algorithm for binary classification.
I referred to the following sites. Thank you very much.
Starting with the theory of logistic regression, let's first derive the activation function, the **sigmoid function**.
Since logistic regression is a binary classifier, consider two classes $C_1$ and $C_2$. The probability $P(C_1)$ of class $C_1$ and the probability $P(C_2)$ of class $C_2$ sum to 1.
By **Bayes' theorem**, the probability of class $C_1$ given the data $\boldsymbol{x}$ is
\begin{align}
P(C_1|\boldsymbol{x})&=\frac{P(\boldsymbol{x}|C_1)P(C_1)}{P(\boldsymbol{x})} \\
&= \frac{P(\boldsymbol{x}|C_1)P(C_1)}{P(\boldsymbol{x}|C_1)P(C_1)+P(\boldsymbol{x}|C_2)P(C_2)} \\
&= \frac{1}{1+\frac{P(\boldsymbol{x}|C_2)P(C_2)}{P(\boldsymbol{x}|C_1)P(C_1)}} \\
&= \frac{1}{1+\exp(-\ln\frac{P(\boldsymbol{x}|C_1)P(C_1)}{P(\boldsymbol{x}|C_2)P(C_2)})} \\
&= \frac{1}{1+\exp(-a)} = \sigma(a)
\end{align}
where $a=\ln\frac{P(\boldsymbol{x}|C_1)P(C_1)}{P(\boldsymbol{x}|C_2)P(C_2)}$. This $\sigma(a)$ is called the **sigmoid function**. As shown below, the sigmoid function takes values between 0 and 1, which makes it a convenient function for expressing a probability.
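The figure from the original is not reproduced here, but a quick plot of the sigmoid (a minimal sketch assuming NumPy and matplotlib) looks like this:

import numpy as np
import matplotlib.pyplot as plt

# The sigmoid squashes any real input into the interval (0, 1).
a = np.linspace(-8, 8, 200)
sigma = 1.0 / (1.0 + np.exp(-a))

plt.plot(a, sigma)
plt.xlabel("a")
plt.ylabel("sigma(a)")
plt.grid(True)
plt.show()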
Using the given data $\boldsymbol{x}=(x_0, x_1, \cdots, x_{n-1})$ and the teacher labels $\boldsymbol{t}=(t_0, t_1, \cdots, t_{n-1})$, we model the probability of class $C_1$ as
L(\boldsymbol{x})=\frac{1}{1+\exp(-\boldsymbol{w}^T\boldsymbol{x})}
and optimize its parameter vector $\boldsymbol{w}$. Writing $p_i=L(x_i)$ for the predicted probability of sample $i$, and noting that each label $t_i$ is either 0 or 1, the probability of observing $t_i$ is $P(t_i|x_i)=p_i^{t_i}(1-p_i)^{1-t_i}$.
Applying this to all the data, and assuming the samples are independent,
\begin{align}
P(\boldsymbol{t}|\boldsymbol{x})&=P(t_0|x_0)P(t_1|x_1)\cdots P(t_{n-1}|x_{n-1}) \\
&=\prod_{i=0}^{n-1}P(t_i|x_i) \\
&=\prod_{i=0}^{n-1}p_i^{t_i}(1-p_i)^{1-t_i}
\end{align}
Taking the logarithm of both sides,
\log P(\boldsymbol{t}|\boldsymbol{x}) = \sum_{i=0}^{n-1}\{t_i\log p_i+(1-t_i)\log (1-p_i)\}
This is called the **log-likelihood**. Maximizing the log-likelihood is equivalent to minimizing its negative, so we flip the sign (and divide by $n$ to take the mean):
E(\boldsymbol{x}) = -\frac{1}{n}\log P(\boldsymbol{t}|\boldsymbol{x}) = \frac{1}{n}\sum_{i=0}^{n-1}\{-t_i\log p_i-(1-t_i)\log (1-p_i)\}
This $E$ is called the **cross-entropy error function**. Since we will need it later, the gradient of $E$ with respect to $\boldsymbol{w}$ is
\frac{\partial{E}}{\partial{\boldsymbol{w}}}=\frac{1}{n}\sum_{i=0}^{n-1}(p_i-t_i)x_i
(Explanation omitted)
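As a quick sketch of the omitted step, it follows from the sigmoid's derivative $\sigma'(a)=\sigma(a)(1-\sigma(a))$, which gives $\frac{\partial p_i}{\partial \boldsymbol{w}}=p_i(1-p_i)x_i$:
\begin{align}
\frac{\partial{E}}{\partial{\boldsymbol{w}}}&=-\frac{1}{n}\sum_{i=0}^{n-1}\left\{\frac{t_i}{p_i}-\frac{1-t_i}{1-p_i}\right\}\frac{\partial p_i}{\partial \boldsymbol{w}} \\
&=-\frac{1}{n}\sum_{i=0}^{n-1}\frac{t_i-p_i}{p_i(1-p_i)}\,p_i(1-p_i)x_i \\
&=\frac{1}{n}\sum_{i=0}^{n-1}(p_i-t_i)x_i
\end{align}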
Now, to minimize the cross-entropy error function, we use a gradient method as before. The steepest descent method or stochastic gradient descent would also work, but this time we use the **conjugate gradient method**. The details are left to [Wikipedia: Conjugate gradient method](https://ja.wikipedia.org/wiki/%E5%85%B1%E5%BD%B9%E5%8B%BE%E9%85%8D%E6%B3%95); in short, it converges faster than steepest descent and does not require a learning rate to be set. Implementing it in Python ourselves would be a hassle (!), so we use the library function [scipy.optimize.fmin_cg](https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.optimize.fmin_cg.html).
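To illustrate the fmin_cg interface, here is a minimal sketch on a toy quadratic (not part of the original article): it takes the objective, an initial point, and optionally a gradient via fprime.

import numpy as np
from scipy import optimize

# Toy objective f(w) = (w0 - 1)^2 + (w1 + 2)^2, minimized at (1, -2)
def f(w):
    return (w[0] - 1.0)**2 + (w[1] + 2.0)**2

def grad_f(w):
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

w_init = np.zeros(2)                                  # starting point
w_opt = optimize.fmin_cg(f, w_init, fprime=grad_f)    # conjugate gradient minimization
print(w_opt)                                          # should be close to [1, -2]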
We now implement a LogisticRegression class using the theory above. Since fmin_cg gives better results when a gradient function is supplied, we pass grad_cross_entropy_loss via fprime.
import numpy as np
from scipy import optimize

class LogisticRegression:
    def __init__(self):
        self.w = np.array([])

    def sigmoid(self, a):
        return 1.0 / (1 + np.exp(-a))

    def cross_entropy_loss(self, w, *args):
        def safe_log(x, minval=0.0000000001):
            # Clip the argument to avoid log(0)
            return np.log(x.clip(min=minval))
        t, x = args
        loss = 0
        for i in range(len(t)):
            ti = (t[i] + 1) / 2          # map labels {-1, 1} to {0, 1}
            h = self.sigmoid(w.T @ x[i])
            loss += -ti * safe_log(h) - (1 - ti) * safe_log(1 - h)
        return loss / len(t)

    def grad_cross_entropy_loss(self, w, *args):
        t, x = args
        grad = np.zeros_like(w)
        for i in range(len(t)):
            ti = (t[i] + 1) / 2          # map labels {-1, 1} to {0, 1}
            h = self.sigmoid(w.T @ x[i])
            grad += (h - ti) * x[i]
        return grad / len(t)

    def fit(self, x, y):
        w0 = np.ones(len(x[0]) + 1)                   # initial weights (bias + features)
        x = np.hstack([np.ones((len(x), 1)), x])      # prepend a bias column of ones
        self.w = optimize.fmin_cg(self.cross_entropy_loss, w0,
                                  fprime=self.grad_cross_entropy_loss, args=(y, x))

    @property
    def w_(self):
        return self.w
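The class above only learns and stores the weights; the original code has no predict method. A hypothetical helper (my own addition, not part of the article) could classify samples from the learned $\boldsymbol{w}$:

import numpy as np

# Hypothetical helper (not in the original): classify samples using the learned weights
def predict_labels(w, x):
    xb = np.hstack([np.ones((len(x), 1)), x])   # prepend the same bias column as in fit
    p = 1.0 / (1 + np.exp(-(xb @ w)))           # sigmoid of the linear score
    return np.where(p >= 0.5, 1, -1)            # threshold at 0.5 -> labels {1, -1}

After fitting, predict_labels(model.w_, x) could then be compared against the true labels y.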
Let's use this class to classify the iris data and also draw the decision boundary, which is the line $\boldsymbol{w}^T\boldsymbol{x}=0$. The two classes are relabeled as 1 and -1 here, and the loss and gradient code above maps them back to 1 and 0.
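With the bias column of ones prepended, $\boldsymbol{w}^T\boldsymbol{x}=w_0+w_1x+w_2y$, so the boundary plotted in the code below is
y = -\frac{w_1}{w_2}x-\frac{w_0}{w_2}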
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# df_iris is assumed to be the iris dataset loaded as a DataFrame with a 'target' column
# (prepared in the earlier articles of this series).
df = df_iris[df_iris['target']!='setosa']
df = df.drop(df.columns[[1,2]], axis=1)
df['target'] = df['target'].map({'versicolor':1, 'virginica':-1})

# Draw the graph
fig, ax = plt.subplots()
x1 = df_iris[df_iris['target']=='versicolor'].iloc[:,3].values
y1 = df_iris[df_iris['target']=='versicolor'].iloc[:,0].values
x2 = df_iris[df_iris['target']=='virginica'].iloc[:,3].values
y2 = df_iris[df_iris['target']=='virginica'].iloc[:,0].values

# Standardize each feature
xs = StandardScaler()
ys = StandardScaler()
xs.fit(np.append(x1,x2).reshape(-1, 1))
ys.fit(np.append(y1,y2).reshape(-1, 1))
x1s = xs.transform(x1.reshape(-1, 1))
x2s = xs.transform(x2.reshape(-1, 1))
y1s = ys.transform(y1.reshape(-1, 1))
y2s = ys.transform(y2.reshape(-1, 1))

x = np.concatenate([np.concatenate([x1s, y1s], axis=1), np.concatenate([x2s, y2s], axis=1)])
y = df['target'].values

model = LogisticRegression()
model.fit(x, y)

ax.scatter(x1s, y1s, color='red', marker='o', label='versicolor')
ax.scatter(x2s, y2s, color='blue', marker='s', label='virginica')
ax.set_xlabel("petal width (cm)")
ax.set_ylabel("sepal length (cm)")

# Draw the classification boundary w0 + w1*x + w2*y = 0
w = model.w_
x_fig = np.linspace(-2.,2.,100)
y_fig = [-w[1]/w[2]*xi - w[0]/w[2] for xi in x_fig]
ax.plot(x_fig, y_fig)
ax.set_ylim(-2.5,2.5)
ax.legend()
print(w)
plt.show()
Optimization terminated successfully.
Current function value: 0.166434
Iterations: 12
Function evaluations: 41
Gradient evaluations: 41
[-0.57247091 -5.42865492 -0.20202263]
The two classes are separated fairly cleanly.
scikit-learn also provides a LogisticRegression class, so the code is almost the same as above. Note that in scikit-learn, C is the inverse of the regularization strength; C=100 means only weak regularization.
from sklearn.linear_model import LogisticRegression

df = df_iris[df_iris['target']!='setosa']
df = df.drop(df.columns[[1,2]], axis=1)
df['target'] = df['target'].map({'versicolor':1, 'virginica':-1})

# Draw the graph
fig, ax = plt.subplots()
x1 = df_iris[df_iris['target']=='versicolor'].iloc[:,3].values
y1 = df_iris[df_iris['target']=='versicolor'].iloc[:,0].values
x2 = df_iris[df_iris['target']=='virginica'].iloc[:,3].values
y2 = df_iris[df_iris['target']=='virginica'].iloc[:,0].values

# Standardize each feature
xs = StandardScaler()
ys = StandardScaler()
xs.fit(np.append(x1,x2).reshape(-1, 1))
ys.fit(np.append(y1,y2).reshape(-1, 1))
x1s = xs.transform(x1.reshape(-1, 1))
x2s = xs.transform(x2.reshape(-1, 1))
y1s = ys.transform(y1.reshape(-1, 1))
y2s = ys.transform(y2.reshape(-1, 1))

x = np.concatenate([np.concatenate([x1s, y1s], axis=1), np.concatenate([x2s, y2s], axis=1)])
y = df['target'].values

model = LogisticRegression(C=100)   # large C = weak regularization
model.fit(x, y)

ax.scatter(x1s, y1s, color='red', marker='o', label='versicolor')
ax.scatter(x2s, y2s, color='blue', marker='s', label='virginica')
ax.set_xlabel("petal width (cm)")
ax.set_ylabel("sepal length (cm)")

# Draw the classification boundary
w = model.coef_[0]
x_fig = np.linspace(-2.,2.,100)
y_fig = [-w[0]/w[1]*xi - model.intercept_[0]/w[1] for xi in x_fig]
ax.plot(x_fig, y_fig)
ax.set_ylim(-2.5,2.5)
ax.legend()
plt.show()
This also classifies the data nicely.
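As a quick sanity check (not in the original article), one could also look at the fitted scikit-learn model's accuracy and predictions:

# Evaluate the fitted scikit-learn model on the training data
print(model.score(x, y))            # mean training accuracy
print(model.predict(x[:5]))         # predicted labels for the first five samples
print(model.predict_proba(x[:5]))   # class probabilities for the same samples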
We have now summarized logistic regression, which is (I believe) relatively important in the world of machine learning. From around here, the theory starts to get more difficult.