Step by step, I will study the theory, a Python implementation, and an analysis with scikit-learn of the algorithms previously taken up in "Classification of Machine Learning". I am writing this for my own learning, so please overlook any mistakes.
This time the topic is **logistic regression**. Although it is written as "regression," logistic regression is, like the perceptron, an algorithm for binary classification.
I referred to the following sites. Thank you very much.
Starting with the theory of logistic regression, let's first derive the activation function, the **sigmoid function**.
Since logistic regression is a binary classifier, consider two classes $C_1$ and $C_2$. The probability $P(C_1)$ of class $C_1$ and the probability $P(C_2)$ of class $C_2$ sum to 1.
By **Bayes' theorem**, the probability of class $C_1$ given the data $\boldsymbol{x}$ is
\begin{align}
P(C_1|\boldsymbol{x})&=\frac{P(\boldsymbol{x}|C_1)P(C_1)}{P(\boldsymbol{x})} \\
&= \frac{P(\boldsymbol{x}|C_1)P(C_1)}{P(\boldsymbol{x}|C_1)P(C_1)+P(\boldsymbol{x}|C_2)P(C_2)} \\
&= \frac{1}{1+\frac{P(\boldsymbol{x}|C_2)P(C_2)}{P(\boldsymbol{x}|C_1)P(C_1)}} \\
&= \frac{1}{1+\exp(-\ln\frac{P(\boldsymbol{x}|C_1)P(C_1)}{P(\boldsymbol{x}|C_2)P(C_2)})} \\
&= \frac{1}{1+\exp(-a)} = \sigma(a)
\end{align}
where $a=\ln\frac{P(\boldsymbol{x}|C_1)P(C_1)}{P(\boldsymbol{x}|C_2)P(C_2)}$. This $\sigma(a)$ is called the **sigmoid function**. As shown below, the sigmoid function takes values between 0 and 1, which makes it a convenient function for expressing a probability.
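The figure from the original is not reproduced here, but a quick plot of the sigmoid (a minimal sketch assuming NumPy and matplotlib) looks like this:

import numpy as np
import matplotlib.pyplot as plt

# The sigmoid squashes any real input into the interval (0, 1).
a = np.linspace(-8, 8, 200)
sigma = 1.0 / (1.0 + np.exp(-a))

plt.plot(a, sigma)
plt.xlabel("a")
plt.ylabel("sigma(a)")
plt.grid(True)
plt.show()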
Using the given data $\boldsymbol{x}=(x_0, x_1, \cdots, x_{n-1})$ and the teacher labels $\boldsymbol{t}=(t_0, t_1, \cdots, t_{n-1})$, we model the probability of class $C_1$ as
L(\boldsymbol{x})=\frac{1}{1+\exp(-\boldsymbol{w}^T\boldsymbol{x})}
and optimize its parameter vector $\boldsymbol{w}$. Writing $p_i=L(x_i)$ for the predicted probability of sample $i$, and noting that each label $t_i$ is either 0 or 1, the probability of observing $t_i$ is $P(t_i|x_i)=p_i^{t_i}(1-p_i)^{1-t_i}$.
Applying this to all the data, and assuming the samples are independent,
\begin{align}
P(\boldsymbol{t}|\boldsymbol{x})&=P(t_0|x_0)P(t_1|x_1)\cdots P(t_{n-1}|x_{n-1}) \\
&=\prod_{i=0}^{n-1}P(t_i|x_i) \\
&=\prod_{i=0}^{n-1}p_i^{t_i}(1-p_i)^{1-t_i}
\end{align}
Taking the logarithm of both sides,
\log P(\boldsymbol{t}|\boldsymbol{x}) = \sum_{i=0}^{n-1}\{t_i\log p_i+(1-t_i)\log (1-p_i)\}
This is called the **log-likelihood**. Maximizing the log-likelihood is equivalent to minimizing its negative, so we flip the sign (and divide by $n$ to take the mean):
E(\boldsymbol{x}) = -\frac{1}{n}\log P(\boldsymbol{t}|\boldsymbol{x}) = \frac{1}{n}\sum_{i=0}^{n-1}\{-t_i\log p_i-(1-t_i)\log (1-p_i)\}
This $E$ is called the **cross-entropy error function**. Since we will need it later, the gradient of $E$ with respect to $\boldsymbol{w}$ is
\frac{\partial{E}}{\partial{\boldsymbol{w}}}=\frac{1}{n}\sum_{i=0}^{n-1}(p_i-t_i)x_i
(Explanation omitted)
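As a quick sketch of the omitted step, it follows from the sigmoid's derivative $\sigma'(a)=\sigma(a)(1-\sigma(a))$, which gives $\frac{\partial p_i}{\partial \boldsymbol{w}}=p_i(1-p_i)x_i$:
\begin{align}
\frac{\partial{E}}{\partial{\boldsymbol{w}}}&=-\frac{1}{n}\sum_{i=0}^{n-1}\left\{\frac{t_i}{p_i}-\frac{1-t_i}{1-p_i}\right\}\frac{\partial p_i}{\partial \boldsymbol{w}} \\
&=-\frac{1}{n}\sum_{i=0}^{n-1}\frac{t_i-p_i}{p_i(1-p_i)}\,p_i(1-p_i)x_i \\
&=\frac{1}{n}\sum_{i=0}^{n-1}(p_i-t_i)x_i
\end{align}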
Now, to minimize the cross-entropy error function, we use a gradient method as before. The steepest descent method or stochastic gradient descent would also work, but this time we use the **conjugate gradient method**. The details are left to [Wikipedia: Conjugate gradient method](https://ja.wikipedia.org/wiki/%E5%85%B1%E5%BD%B9%E5%8B%BE%E9%85%8D%E6%B3%95); in short, it converges faster than steepest descent and does not require a learning rate to be set. Implementing it in Python ourselves would be a hassle (!), so we use the library function [scipy.optimize.fmin_cg](https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.optimize.fmin_cg.html).
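To illustrate the fmin_cg interface, here is a minimal sketch on a toy quadratic (not part of the original article): it takes the objective, an initial point, and optionally a gradient via fprime.

import numpy as np
from scipy import optimize

# Toy objective f(w) = (w0 - 1)^2 + (w1 + 2)^2, minimized at (1, -2)
def f(w):
    return (w[0] - 1.0)**2 + (w[1] + 2.0)**2

def grad_f(w):
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

w_init = np.zeros(2)                                  # starting point
w_opt = optimize.fmin_cg(f, w_init, fprime=grad_f)    # conjugate gradient minimization
print(w_opt)                                          # should be close to [1, -2]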
We now implement a LogisticRegression class using the theory above. Since fmin_cg gives better results when a gradient function is supplied, we pass grad_cross_entropy_loss via fprime.
import numpy as np
from scipy import optimize

class LogisticRegression:
    def __init__(self):
        self.w = np.array([])

    def sigmoid(self, a):
        return 1.0 / (1 + np.exp(-a))

    def cross_entropy_loss(self, w, *args):
        def safe_log(x, minval=0.0000000001):
            # Clip the argument to avoid log(0)
            return np.log(x.clip(min=minval))
        t, x = args
        loss = 0
        for i in range(len(t)):
            ti = (t[i] + 1) / 2          # map labels {-1, 1} to {0, 1}
            h = self.sigmoid(w.T @ x[i])
            loss += -ti * safe_log(h) - (1 - ti) * safe_log(1 - h)
        return loss / len(t)

    def grad_cross_entropy_loss(self, w, *args):
        t, x = args
        grad = np.zeros_like(w)
        for i in range(len(t)):
            ti = (t[i] + 1) / 2          # map labels {-1, 1} to {0, 1}
            h = self.sigmoid(w.T @ x[i])
            grad += (h - ti) * x[i]
        return grad / len(t)

    def fit(self, x, y):
        w0 = np.ones(len(x[0]) + 1)                   # initial weights (bias + features)
        x = np.hstack([np.ones((len(x), 1)), x])      # prepend a bias column of ones
        self.w = optimize.fmin_cg(self.cross_entropy_loss, w0,
                                  fprime=self.grad_cross_entropy_loss, args=(y, x))

    @property
    def w_(self):
        return self.w
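The class above only learns and stores the weights; the original code has no predict method. A hypothetical helper (my own addition, not part of the article) could classify samples from the learned $\boldsymbol{w}$:

import numpy as np

# Hypothetical helper (not in the original): classify samples using the learned weights
def predict_labels(w, x):
    xb = np.hstack([np.ones((len(x), 1)), x])   # prepend the same bias column as in fit
    p = 1.0 / (1 + np.exp(-(xb @ w)))           # sigmoid of the linear score
    return np.where(p >= 0.5, 1, -1)            # threshold at 0.5 -> labels {1, -1}

After fitting, predict_labels(model.w_, x) could then be compared against the true labels y.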
Let's use this class to classify the iris data and also draw the decision boundary, which is the line $\boldsymbol{w}^T\boldsymbol{x}=0$. The two classes are relabeled as 1 and -1 here, and the loss and gradient code above maps them back to 1 and 0.
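With the bias column of ones prepended, $\boldsymbol{w}^T\boldsymbol{x}=w_0+w_1x+w_2y$, so the boundary plotted in the code below is
y = -\frac{w_1}{w_2}x-\frac{w_0}{w_2}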
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# df_iris is assumed to be the iris dataset loaded as a DataFrame with a 'target' column
# (prepared in the earlier articles of this series).
df = df_iris[df_iris['target']!='setosa']
df = df.drop(df.columns[[1,2]], axis=1)
df['target'] = df['target'].map({'versicolor':1, 'virginica':-1})

# Draw the graph
fig, ax = plt.subplots()
x1 = df_iris[df_iris['target']=='versicolor'].iloc[:,3].values
y1 = df_iris[df_iris['target']=='versicolor'].iloc[:,0].values
x2 = df_iris[df_iris['target']=='virginica'].iloc[:,3].values
y2 = df_iris[df_iris['target']=='virginica'].iloc[:,0].values

# Standardize each feature
xs = StandardScaler()
ys = StandardScaler()
xs.fit(np.append(x1,x2).reshape(-1, 1))
ys.fit(np.append(y1,y2).reshape(-1, 1))
x1s = xs.transform(x1.reshape(-1, 1))
x2s = xs.transform(x2.reshape(-1, 1))
y1s = ys.transform(y1.reshape(-1, 1))
y2s = ys.transform(y2.reshape(-1, 1))

x = np.concatenate([np.concatenate([x1s, y1s], axis=1), np.concatenate([x2s, y2s], axis=1)])
y = df['target'].values

model = LogisticRegression()
model.fit(x, y)

ax.scatter(x1s, y1s, color='red', marker='o', label='versicolor')
ax.scatter(x2s, y2s, color='blue', marker='s', label='virginica')
ax.set_xlabel("petal width (cm)")
ax.set_ylabel("sepal length (cm)")

# Draw the classification boundary w0 + w1*x + w2*y = 0
w = model.w_
x_fig = np.linspace(-2.,2.,100)
y_fig = [-w[1]/w[2]*xi - w[0]/w[2] for xi in x_fig]
ax.plot(x_fig, y_fig)
ax.set_ylim(-2.5,2.5)
ax.legend()
print(w)
plt.show()
Optimization terminated successfully.
Current function value: 0.166434
Iterations: 12
Function evaluations: 41
Gradient evaluations: 41
[-0.57247091 -5.42865492 -0.20202263]
The two classes are separated fairly cleanly.
scikit-learn also provides a LogisticRegression class, so the code is almost the same as above. Note that in scikit-learn, C is the inverse of the regularization strength; C=100 means only weak regularization.
from sklearn.linear_model import LogisticRegression

df = df_iris[df_iris['target']!='setosa']
df = df.drop(df.columns[[1,2]], axis=1)
df['target'] = df['target'].map({'versicolor':1, 'virginica':-1})

# Draw the graph
fig, ax = plt.subplots()
x1 = df_iris[df_iris['target']=='versicolor'].iloc[:,3].values
y1 = df_iris[df_iris['target']=='versicolor'].iloc[:,0].values
x2 = df_iris[df_iris['target']=='virginica'].iloc[:,3].values
y2 = df_iris[df_iris['target']=='virginica'].iloc[:,0].values

# Standardize each feature
xs = StandardScaler()
ys = StandardScaler()
xs.fit(np.append(x1,x2).reshape(-1, 1))
ys.fit(np.append(y1,y2).reshape(-1, 1))
x1s = xs.transform(x1.reshape(-1, 1))
x2s = xs.transform(x2.reshape(-1, 1))
y1s = ys.transform(y1.reshape(-1, 1))
y2s = ys.transform(y2.reshape(-1, 1))

x = np.concatenate([np.concatenate([x1s, y1s], axis=1), np.concatenate([x2s, y2s], axis=1)])
y = df['target'].values

model = LogisticRegression(C=100)   # large C = weak regularization
model.fit(x, y)

ax.scatter(x1s, y1s, color='red', marker='o', label='versicolor')
ax.scatter(x2s, y2s, color='blue', marker='s', label='virginica')
ax.set_xlabel("petal width (cm)")
ax.set_ylabel("sepal length (cm)")

# Draw the classification boundary
w = model.coef_[0]
x_fig = np.linspace(-2.,2.,100)
y_fig = [-w[0]/w[1]*xi - model.intercept_[0]/w[1] for xi in x_fig]
ax.plot(x_fig, y_fig)
ax.set_ylim(-2.5,2.5)
ax.legend()
plt.show()
This also classifies the data nicely.
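As a quick sanity check (not in the original article), one could also look at the fitted scikit-learn model's accuracy and predictions:

# Evaluate the fitted scikit-learn model on the training data
print(model.score(x, y))            # mean training accuracy
print(model.predict(x[:5]))         # predicted labels for the first five samples
print(model.predict_proba(x[:5]))   # class probabilities for the same samples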
We have now summarized logistic regression, which is (I believe) relatively important in the world of machine learning. From around here, the theory starts to get more difficult.