Step by step, I will study the theory, the implementation in Python, and the analysis using scikit-learn of the algorithms taken up in "Classification of Machine Learning". I am writing this for my own learning, so please overlook any mistakes.
Last time, I extended two-class classification to multi-class classification. This time I will actually implement it in Python.
I referred to the following sites. Thank you very much.
I would like to extend the logistic regression implemented before to multiple classes. I will try two approaches: One-vs-Rest and softmax (multinomial logistic regression).
The iris dataset is used for classification. It has four features (sepal_length, sepal_width, petal_length, petal_width), and the task is to classify samples into three classes (setosa, versicolor, virginica).
Below, for clarity, we implement the classification using only sepal_length and sepal_width.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_iris
sns.set()
iris = sns.load_dataset("iris")
ax = sns.scatterplot(x=iris.sepal_length, y=iris.sepal_width,
                     hue=iris.species, style=iris.species)
One-vs-Rest
One-vs-Rest builds a two-class classifier for each class label and finally adopts the most plausible prediction. Since logistic regression outputs a probability, the class of the classifier with the highest probability is adopted.
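Writing the same idea as a formula, with $\boldsymbol{w}_k$ denoting the weights learned by the classifier for class $k$ and $\sigma$ the sigmoid function, the predicted class is

$$
\hat{y} = \mathop{\rm arg\,max}_{k}\ \sigma(\boldsymbol{w}_k^T \boldsymbol{x}), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}
$$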
We use a `LogisticRegression` class, a slightly modified version of the logistic regression code from last time. I added a `predict_proba` method, because One-vs-Rest decides which class to adopt based on the predicted probability.
from scipy import optimize

class LogisticRegression:
    def __init__(self):
        self.w = None

    def sigmoid(self, a):
        return 1.0 / (1 + np.exp(-a))

    def predict_proba(self, x):
        # Prepend the bias term, then return the probability of the positive class
        x = np.hstack([1, x])
        return self.sigmoid(self.w.T @ x)

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else -1

    def cross_entropy_loss(self, w, *args):
        def safe_log(x, minval=0.0000000001):
            # Clip to avoid log(0)
            return np.log(x.clip(min=minval))

        t, x = args
        loss = 0
        for i in range(len(t)):
            ti = 1 if t[i] > 0 else 0
            h = self.sigmoid(w.T @ x[i])
            loss += -ti * safe_log(h) - (1 - ti) * safe_log(1 - h)
        return loss / len(t)

    def grad_cross_entropy_loss(self, w, *args):
        t, x = args
        grad = np.zeros_like(w)
        for i in range(len(t)):
            ti = 1 if t[i] > 0 else 0
            h = self.sigmoid(w.T @ x[i])
            grad += (h - ti) * x[i]
        return grad / len(t)

    def fit(self, x, y):
        # Initial weights (including bias), optimized with conjugate gradient
        w0 = np.ones(len(x[0]) + 1)
        x = np.hstack([np.ones((len(x), 1)), x])
        self.w = optimize.fmin_cg(self.cross_entropy_loss, w0,
                                  fprime=self.grad_cross_entropy_loss, args=(y, x))

    @property
    def w_(self):
        return self.w
Next, implement the One-vs-Rest class. I also implemented an `accuracy_score` method to compute the accuracy, since I will use it later to compare algorithms.
from sklearn.metrics import accuracy_score

class OneVsRest:
    def __init__(self, classifier, labels):
        self.classifier = classifier
        self.labels = labels
        # One binary classifier per class label
        self.classifiers = [classifier() for _ in range(len(self.labels))]

    def fit(self, x, y):
        y = np.array(y)
        for i in range(len(self.labels)):
            # Relabel: 1 for the current class, 0 for all the others
            y_ = np.where(y == self.labels[i], 1, 0)
            self.classifiers[i].fit(x, y_)

    def predict(self, x):
        # Adopt the class whose classifier outputs the highest probability
        probas = [self.classifiers[i].predict_proba(x) for i in range(len(self.labels))]
        return np.argmax(probas)

    def accuracy_score(self, x, y):
        pred = [self.labels[self.predict(i)] for i in x]
        acc = accuracy_score(y, pred)
        return acc
Now let's actually classify the data shown earlier.
model = OneVsRest(LogisticRegression, np.unique(iris.species))
x = iris[['sepal_length', 'sepal_width']].values
y = iris.species
model.fit(x, y)
print("accuracy_score: {}".format(model.accuracy_score(x,y)))
accuracy_score: 0.8066666666666666
An accuracy of about 81% is not very good. Let's visualize how the data was classified.
We use matplotlib's `contourf` method for visualization, coloring each grid point according to the class it is classified into.
from matplotlib.colors import ListedColormap

x_min = iris.sepal_length.min()
x_max = iris.sepal_length.max()
y_min = iris.sepal_width.min()
y_max = iris.sepal_width.max()

x = np.linspace(x_min, x_max, 100)
y = np.linspace(y_min, y_max, 100)

# Predict the class for every point on the grid
data = []
for i in range(len(y)):
    data.append([model.predict([x[j], y[i]]) for j in range(len(x))])

xx, yy = np.meshgrid(x, y)
cmap = ListedColormap(('blue', 'orange', 'green'))
plt.contourf(xx, yy, data, alpha=0.25, cmap=cmap)
ax = sns.scatterplot(x=iris.sepal_length, y=iris.sepal_width,
                     hue=iris.species, style=iris.species)
plt.show()
As you can see, setosa is classified properly, but the remaining two classes are mixed together, which is why the accuracy is a little low. For now, this will do.
Next, let's implement a `LogisticRegressionMulti` class that does softmax-based logistic regression.
The cross-entropy error is used as the loss function, and the parameters are learned by plain (steepest-descent) gradient descent. I'm sorry it is a rather rough implementation.
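For reference, these are the formulas the class below implements: with one-hot targets $t_{nk}$ (the `y` passed to the class), softmax outputs $p_{nk}$, and $N$ samples,

$$
p_{nk} = \frac{\exp(\boldsymbol{w}_k^T \boldsymbol{x}_n)}{\sum_{j}\exp(\boldsymbol{w}_j^T \boldsymbol{x}_n)}, \qquad
E(W) = -\frac{1}{N}\sum_{n}\sum_{k} t_{nk}\log p_{nk}, \qquad
\nabla_W E = \frac{1}{N}\, X^T (P - T)
$$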
from sklearn.metrics import accuracy_score

class LogisticRegressionMulti:
    def __init__(self, labels, n_iter=1000, eta=0.01):
        self.w = None
        self.labels = labels
        self.n_iter = n_iter
        self.eta = eta
        self.loss = np.array([])

    def softmax(self, a):
        if a.ndim == 1:
            return np.exp(a) / np.sum(np.exp(a))
        else:
            return np.exp(a) / np.sum(np.exp(a), axis=1)[:, np.newaxis]

    def cross_entropy_loss(self, w, *args):
        x, y = args

        def safe_log(x, minval=0.0000000001):
            # Clip to avoid log(0)
            return np.log(x.clip(min=minval))

        p = self.softmax(x @ w)
        loss = -np.sum(y * safe_log(p))
        return loss / len(x)

    def grad_cross_entropy_loss(self, w, *args):
        x, y = args
        p = self.softmax(x @ w)
        grad = -(x.T @ (y - p))
        return grad / len(x)

    def fit(self, x, y):
        # One weight vector per class, plus a bias term
        self.w = np.ones((len(x[0]) + 1, len(self.labels)))
        x = np.hstack([np.ones((len(x), 1)), x])
        # Full-batch gradient descent, recording the loss at every step
        for i in range(self.n_iter):
            self.loss = np.append(self.loss, self.cross_entropy_loss(self.w, x, y))
            grad = self.grad_cross_entropy_loss(self.w, x, y)
            self.w -= self.eta * grad

    def predict(self, x):
        x = np.hstack([1, x])
        return np.argmax(self.softmax(x @ self.w))

    def accuracy_score(self, x, y):
        pred = [self.predict(i) for i in x]
        y_ = np.argmax(y, axis=1)
        acc = accuracy_score(y_, pred)
        return acc

    @property
    def loss_(self):
        return self.loss
The input to `LogisticRegressionMulti` uses one-hot encoded labels. This is easy with pandas' `get_dummies`. (I realized only afterwards that I should have called `get_dummies` inside the class.)
model = LogisticRegressionMulti(np.unique(iris.species), n_iter=10000, eta=0.1)
x = iris[['sepal_length', 'sepal_width']].values
y = pd.get_dummies(iris['species']).values
model.fit(x, y)
print("accuracy_score: {}".format(model.accuracy_score(x, y)))
accuracy_score: 0.8266666666666667
The accuracy is about 83%. Looking at the loss history, it appears to have converged, so this is about as far as it goes.
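(The loss-history plot itself is not reproduced here. Since the class keeps the per-iteration loss in `loss_`, a minimal sketch like the following can be used to check convergence.)

plt.plot(model.loss_)
plt.xlabel("iteration")
plt.ylabel("cross entropy loss")
plt.show()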
Also, let's color the classification regions in the same way as before.
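The plotting code for this figure is omitted; a minimal sketch that reuses `x_min`, `x_max`, `y_min`, `y_max`, and `cmap` from the One-vs-Rest plot (note that `LogisticRegressionMulti.predict` already returns a class index) would be:

x = np.linspace(x_min, x_max, 100)
y = np.linspace(y_min, y_max, 100)
# Predict the class index for every point on the grid
data = []
for i in range(len(y)):
    data.append([model.predict([x[j], y[i]]) for j in range(len(x))])
xx, yy = np.meshgrid(x, y)
plt.contourf(xx, yy, data, alpha=0.25, cmap=cmap)
ax = sns.scatterplot(x=iris.sepal_length, y=iris.sepal_width,
                     hue=iris.species, style=iris.species)
plt.show()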
Finally, using all four features, let's compare the classifiers built this time with scikit-learn's LogisticRegression class. The results are summarized in the table below (a sketch of the comparison code follows the table).
| Method | accuracy_score |
| --- | --- |
| OneVsRest | 0.98 |
| LogisticRegressionMulti | 0.98 |
| sklearn LogisticRegression | 0.973 |
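The code for this comparison is not included here. A minimal sketch, assuming everything is trained and evaluated on the full dataset with the same settings as above (the alias `SklearnLogisticRegression` and the `model_*` variable names are mine, introduced to avoid clashing with our own `LogisticRegression` class), might look like this:

from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression

x = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
y = iris.species

# One-vs-Rest with the LogisticRegression class defined above
model_ovr = OneVsRest(LogisticRegression, np.unique(iris.species))
model_ovr.fit(x, y)
print("OneVsRest:", model_ovr.accuracy_score(x, y))

# Softmax version (needs one-hot labels)
model_multi = LogisticRegressionMulti(np.unique(iris.species), n_iter=10000, eta=0.1)
y_onehot = pd.get_dummies(iris.species).values
model_multi.fit(x, y_onehot)
print("LogisticRegressionMulti:", model_multi.accuracy_score(x, y_onehot))

# scikit-learn's implementation (default settings), evaluated on the same data
model_sk = SklearnLogisticRegression()
model_sk.fit(x, y)
print("sklearn LogisticRegression:", model_sk.score(x, y))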
Even with this implementation, it seems we can get a good score, at least for classifying irises.
In this article, I implemented multi-class classification using logistic regression. I think other classifiers can be extended to multiple classes in a similar way. In particular, softmax-based multi-class classification is widely used in neural networks, so understanding the theory here should prove useful later.