In part 3, we will try some simple machine learning.
Machine learning is divided into three types.
Type | Data | Purpose | Examples |
---|---|---|---|
Supervised learning | Labeled | Predicting outcomes and future values | Email filter, stock price forecast |
Unsupervised learning | Unlabeled | Finding hidden structure in the data | Customer segmentation, anomaly detection |
Reinforcement learning | Decision-making process | Learning a series of actions | AlphaGo, robot training |
Even in machine learning, you rarely get good results just by creating a learning model and feeding it data. The specific workflow is as follows.
By repeating this process, a suitable learning model is obtained.
In this seminar, we will cover steps 3 to 5. However, since step 3 was mostly done in part 2, we will focus on steps 4 and 5 for the models introduced from here on.
What exactly is machine learning? Warren McCulloch and Walter Pitts wondered whether they could imitate the human brain to design artificial intelligence. They therefore published a simplified model of the neuron, the smallest unit of the human brain, known as the McCulloch-Pitts (MCP) neuron.
Neurons receive chemical and electrical signals in the brain and generate an output signal when the accumulated signals exceed a certain threshold. McCulloch and Pitts regarded this as a logic gate with binary output and designed the MCP neuron.
Some years later, Frank Rosenblatt devised an algorithm that automatically learns the optimal weight coefficients, which are then multiplied with the input signals to determine whether the neuron fires (exceeds the threshold). This was the beginning of the "classification problem" in "supervised learning".
The artificial neuron is defined as follows. Let $\boldsymbol{w}$ be the weight vector for the input signals $\boldsymbol{x}$. The total input $z$ is the linear combination of each input $x_i$ and its weight $w_i$.
$$z = \sum_{i=1}^{N} w_i x_i = \boldsymbol{w}^T\boldsymbol{x}$$
The output of binary classification takes one of two values: 1 (positive class) and -1 (negative class).
Define a decision function $\phi$ that assigns the positive class if the total input $z$ is greater than or equal to the threshold $\theta$, and the negative class otherwise.
$$\phi(z) = \begin{cases} 1 & (z \geq \theta) \\ -1 & (z < \theta) \end{cases}$$
Here, for simplicity, we move the threshold $\theta$ to the left-hand side and define $w_0 = -\theta$ and $x_0 = 1$, so that $z = w_0 x_0 + w_1 x_1 + \dots + w_N x_N$. The negative threshold $-\theta$ is then called the bias unit.
$$\phi(z) = \begin{cases} 1 & (z \geq 0) \\ -1 & (z < 0) \end{cases}$$
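As a small check of the bias-unit trick, the sketch below evaluates the decision function once with an explicit threshold and once with the threshold folded into $w_0$. The weights, threshold, and sample points are hypothetical values chosen only for illustration.

```python
import numpy as np

def phi_threshold(x, w, theta):
    """Decision function with an explicit threshold theta."""
    z = np.dot(w, x)
    return 1 if z >= theta else -1

def phi_bias(x, w, theta):
    """Same decision with theta folded in as w_0 = -theta, x_0 = 1."""
    w_ext = np.concatenate(([-theta], w))   # prepend the bias unit w_0
    x_ext = np.concatenate(([1.0], x))      # prepend the constant input x_0
    z = np.dot(w_ext, x_ext)
    return 1 if z >= 0.0 else -1

w = np.array([0.5, -0.2])   # hypothetical weights
theta = 0.3                 # hypothetical threshold
samples = np.array([[1.0, 0.5], [0.2, 0.4], [2.0, 3.0], [0.1, 1.0]])

labels_threshold = [phi_threshold(x, w, theta) for x in samples]
labels_bias = [phi_bias(x, w, theta) for x in samples]
print(labels_threshold, labels_bias)
```

Both forms should give the same labels, which is why the rest of the text can compare $z$ with 0 instead of with $\theta$.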
In machine learning, learning means updating the weights. The procedure is: (1) initialize the weights, and (2) for each training sample, compute the output value and update the weights.
The weight $w_j$ for an input $x_j$ is updated as follows:
$$w_j := w_j + \Delta w_j$$
$$\Delta w_j = \eta(y^{(i)}-\hat{y}^{(i)})x_j$$
Here $\eta$ is the learning rate; the larger this value, the greater the impact each training sample has on the weight update. The difference $(y^{(i)}-\hat{y}^{(i)})$ between the correct class label and the predicted label is called the error.
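To make the update rule concrete, here is a minimal sketch of a single update step on one misclassified sample. The learning rate, initial weights, and sample are hypothetical values; `x[0]` plays the role of the constant input $x_0 = 1$.

```python
import numpy as np

def predict(w, x):
    """Decision function: 1 if w^T x >= 0, otherwise -1."""
    return 1 if np.dot(w, x) >= 0.0 else -1

eta = 0.1                          # learning rate (hypothetical value)
w = np.zeros(3)                    # w[0] is the bias unit w_0
x = np.array([1.0, 2.0, -1.0])     # x[0] = 1 is the constant input x_0
y_true = -1                        # correct class label

y_hat = predict(w, x)              # z = 0, so the prediction is 1 (wrong)
error = y_true - y_hat             # error = -2
delta_w = eta * error * x          # delta_w_j = eta * (y - y_hat) * x_j
w_new = w + delta_w

print(w_new, predict(w_new, x))
```

When the prediction is already correct the error is 0, so the weights are left unchanged; only misclassified samples move the weights.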
Now let's implement the perceptron.
import numpy as np

class Perceptron(object):
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        # Learning rate
        self.eta = eta
        # Number of training epochs
        self.n_iter = n_iter
        # Random seed used to initialize the weights
        self.random_state = random_state

    def fit(self, X, y):
        # Random number generator
        rgen = np.random.RandomState(self.random_state)
        # Step 1: initialize the weights (w_[0] is the bias unit)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        # Number of misclassifications in each epoch
        self.errors_ = []
        # Repeat for the number of epochs
        for _ in range(self.n_iter):
            # Reset the error count
            errors = 0
            # Step 2: process each training sample
            for xi, target in zip(X, y):
                # Compute the output value and delta_w
                delta_w = self.eta * (target - self.predict(xi))
                # Update the weights
                self.w_[1:] += delta_w * xi
                self.w_[0] += delta_w
                errors += int(delta_w != 0.0)
            self.errors_.append(errors)
        return self

    # Total input
    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    # Decision function
    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, -1)
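Let's use this perceptron to classify the Iris data used in part 2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
# Use the first 100 samples (two classes) and encode the labels as -1 / 1
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)
# Features: sepal length (column 0) and petal length (column 2)
X = df.iloc[0:100, [0, 2]].values
ppn = Perceptron(eta=0.01, n_iter=10)
ppn.fit(X, y)
# Plot the number of misclassifications per epoch
plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Number of errors')
plt.show()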
After training for 10 epochs like this, the misclassifications eventually disappeared. The model learned the dataset by updating the weights.
**Supplement** Plotting the result gives the following decision regions.
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    # Markers and colors, one per class
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # Build a grid that covers the range of both features
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    # Predict the class of every grid point and draw the regions
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    # Plot the samples of each class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8,
                    c=colors[idx], marker=markers[idx], label=cl,
                    edgecolor='black')

plot_decision_regions(X, y, classifier=ppn)
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')
plt.show()
Challenges
Let's see how the model's learning changes when you change the learning rate $\eta$ and the number of epochs.
Scikit-learn
Scikit-learn is a library that contains many simple classification algorithms. It also includes the perceptron implemented above. Let's try it out.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
import numpy as np
# Load the Iris data and select the features to use
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
# Split into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
# Create the perceptron
ppn = Perceptron(n_iter_no_change=500, eta0=0.1, random_state=1)
# Fit the model to the training data
ppn.fit(X_train, y_train)
# Predict on the test data with the trained model
y_predict = ppn.predict(X_test)
# Show the number of misclassifications
print('Misclassified samples: %d' % (y_test != y_predict).sum())
# Show the accuracy
print('Accuracy: %.2f' % ppn.score(X_test, y_test))
Misclassified samples: 7
Accuracy: 0.84
Challenges
Use the plot_decision_regions function from the supplement above to plot the decision regions for this model.
What do the decision regions look like? The perceptron is a model that performs linear separation, so it is not suitable for datasets that cannot be linearly separated. So let's look at an algorithm called logistic regression. Logistic regression uses the odds ratio.
$$\frac{p}{1-p}$$
The odds ratio expresses how likely an event is, where $p$ is the probability of the event you want to predict. The natural logarithm of the odds ratio is called the logit function.
$$\mathrm{logit}(p)=\log\frac{p}{1-p}$$
Using this function, the following relationship is assumed between the features and the log odds.
$$\mathrm{logit}(p(y=1|x))=\boldsymbol{w}^T \boldsymbol{x}$$
Here $p(y=1|x)$ is the conditional probability that the sample belongs to class 1 ($y = 1$) given the features $x$.
What we want to predict is the probability that the sample belongs to a specific class, so we take the inverse of the logit function:
$$p(y=1|x) = \phi(z) = \mathrm{logit}^{-1}(\boldsymbol{w}^T \boldsymbol{x})=\frac{1}{1+e^{-z}}$$
In other words, the logistic sigmoid function (commonly just called the sigmoid function) is inserted before the decision function. The weights are also updated using the output of the logistic sigmoid as the predicted value. A function that sits between the total input and the decision function in this way is called an **activation function**. The perceptron updated its weights by comparing the output of the decision function with the true value, so the difference used in the weight update was a discrete value. With the activation function inserted, the predicted value $\hat{y}$ used in the weight update becomes a continuous value.
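The relationship between the logit function and the sigmoid function can be checked numerically. The sketch below is only an illustration: the probabilities and total inputs are arbitrary values, and the class labels 1 and 0 with a 0.5 cut-off are an assumption made for this example.

```python
import numpy as np

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic sigmoid, the inverse of the logit function."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)             # probabilities -> log odds
p_back = sigmoid(z)      # log odds -> probabilities again
print(z, p_back)

# The continuous output is turned into a class label by a decision
# function: p >= 0.5 is equivalent to z >= 0.
z_samples = np.array([-2.0, 0.0, 3.0])   # arbitrary total inputs
labels = np.where(sigmoid(z_samples) >= 0.5, 1, 0)
print(labels)
```

Because the sigmoid output lies strictly between 0 and 1, it can be read as a probability before the decision function turns it into a label.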
Let's actually use logistic regression.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=100.0,random_state=1)
Challenges
Run the above code, then train logistic regression on the Iris data. Also plot the decision regions after training.
With the perceptron, the goal was to minimize the misclassification rate. In contrast, the goal of support vector machines (SVMs) is to maximize the **margin**. The margin is the distance between the decision boundary and the training samples closest to it. Those training samples are called **support vectors**. Adjusting the decision boundary to maximize the margin produces a classifier with a robust decision boundary.
Let $y = f(\boldsymbol{x})$ be the classification function that classifies an input $\boldsymbol{x}$ into two classes. Suppose we have $n$ training samples $(\boldsymbol{x_1}, y_1), (\boldsymbol{x_2}, y_2), \dots, (\boldsymbol{x_n}, y_n)$. We define the linear classifier $f(\boldsymbol{x}) = \mathrm{sgn}[\boldsymbol{w} \bullet \boldsymbol{x} + b]$, which returns 1 when $\boldsymbol{w} \bullet \boldsymbol{x} + b \geq 0$ and -1 when $\boldsymbol{w} \bullet \boldsymbol{x} + b < 0$. The set of points where $\boldsymbol{w} \bullet \boldsymbol{x} + b = 0$ is called a hyperplane. The distance between the hyperplane and the closest sample $\boldsymbol{x_i}$ (a support vector) is
$$\frac{|\boldsymbol{w}\bullet\boldsymbol{x_i}+b|}{||\boldsymbol{w}||}$$
The formula for maximizing the margin is as follows.
$$\max_{\boldsymbol{w},b}\min_{i}\left\{\frac{|\boldsymbol{w}\bullet\boldsymbol{x_i}+b|}{||\boldsymbol{w}||}\right\}$$
Since the hyperplane does not change when $\boldsymbol{w}$ and $b$ are multiplied by a constant, we can assume that for all samples
$$y_i(\boldsymbol{w}\bullet\boldsymbol{x_i}+b) \geq 1$$
with equality for the samples closest to the hyperplane. In that case, the margin is
$$\min_{i}\left\{\frac{|\boldsymbol{w}\bullet\boldsymbol{x_i}+b|}{||\boldsymbol{w}||}\right\}=\frac{1}{||\boldsymbol{w}||}$$
In other words, maximizing the margin is equivalent to minimizing $||\boldsymbol{w}||$.
This maximum margin classification is valid only when linear separation is possible. In other cases, we will introduce a slack variable as a workaround.
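The distance formula and the rescaling argument used in the margin derivation can be checked numerically. This is a minimal sketch on a hypothetical, linearly separable toy dataset with a hand-picked hyperplane; it does not train an SVM.

```python
import numpy as np

# Hypothetical toy data and a hand-picked separating hyperplane x1 + x2 - 2 = 0
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = -2.0

def distances(X, w, b):
    """Distance of each sample to the hyperplane w.x + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

margin = distances(X, w, b).min()

# Rescale (w, b) so that the closest samples satisfy y_i (w.x_i + b) = 1.
scale = 1.0 / np.min(y * (X @ w + b))
w_c, b_c = scale * w, scale * b

print(margin, 1.0 / np.linalg.norm(w_c))
```

After rescaling, the margin computed from the distances and the value $1/||\boldsymbol{w}||$ agree, and the hyperplane itself is unchanged.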
We introduce a slack variable $\xi$ into the linear constraint
$$y_i(\boldsymbol{w}\bullet\boldsymbol{x_i}+b) \geq 1$$
This permits samples that violate the constraint, at a cost, and so also handles data that is not linearly separable.
$$y_i(\boldsymbol{w}\bullet\boldsymbol{x_i}+b) \geq 1-\xi^{(i)}$$
With the slack variables, maximum margin classification becomes the minimization of the following objective.
$$\frac{1}{2}||\boldsymbol{w}||^2+C\left(\sum_{i}\xi^{(i)}\right)$$
The variable $C$ controls the penalty for misclassification. The larger $C$ is, the larger the penalty and the narrower the margin. This type of maximum margin classification is called soft-margin classification. The maximum margin classification before the introduction of $\xi$ is called hard-margin classification.
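The role of $C$ can be seen by evaluating the soft-margin objective directly. The sketch below assumes the usual convention that each slack is non-negative and takes the smallest value that satisfies its constraint, $\xi^{(i)} = \max(0, 1 - y_i(\boldsymbol{w}\bullet\boldsymbol{x_i}+b))$. The data, weights, and values of $C$ are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return 1/2 ||w||^2 + C * sum(xi) and the slack values xi."""
    # Smallest non-negative slack satisfying y_i (w.x_i + b) >= 1 - xi_i
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * xi.sum(), xi

# Hypothetical data: the last two samples violate the margin constraint
X = np.array([[2.0, 2.0], [0.0, 0.0], [1.5, 0.0], [3.0, 0.0]])
y = np.array([1, -1, -1, -1])
w = np.array([1.0, 1.0])
b = -2.0

obj_small_c, xi = soft_margin_objective(w, b, X, y, C=1.0)
obj_large_c, _ = soft_margin_objective(w, b, X, y, C=10.0)
print(xi, obj_small_c, obj_large_c)
```

For the same violations, a larger $C$ makes the objective grow faster, so the optimizer is pushed toward boundaries with fewer violations and a narrower margin.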
This time, we will use scikit-learn's SVC (support vector classifier), which performs soft-margin classification.
from sklearn.svm import SVC
svm = SVC(kernel='linear',C = 1.0,random_state=1)
svm.fit(X_train,y_train)
Challenges
Run the above code, then train the SVC on the Iris data. Also plot the decision regions after training.
An effective method for nonlinear classification problems is **kernelization**. SVMs are easier to kernelize than other classification algorithms, which is one reason the SVM is a popular classification method. Let's look at an example of nonlinear data for which kernelization is effective.
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0,
                       X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)
plt.scatter(X_xor[y_xor == 1, 0],
            X_xor[y_xor == 1, 1],
            c='c', marker='x',
            label='1')
plt.scatter(X_xor[y_xor == -1, 0],
            X_xor[y_xor == -1, 1],
            c='m',
            marker='*',
            label='-1')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Data like this cannot be separated with a single straight line. However, projecting the data into a higher dimension changes how it looks.
Projection function: $\phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_1 x_2)$
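A quick numerical check shows why this projection helps. The sketch below regenerates XOR data with the same seed as above, applies the projection, and classifies using only the sign of the third coordinate $z_3 = x_1 x_2$. The helper name `project` is hypothetical.

```python
import numpy as np

rng = np.random.RandomState(1)
X_xor = rng.randn(200, 2)
y_xor = np.where(np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0), 1, -1)

def project(X):
    """phi(x1, x2) = (x1, x2, x1 * x2)."""
    return np.column_stack((X[:, 0], X[:, 1], X[:, 0] * X[:, 1]))

Z = project(X_xor)

# In the projected space the plane z3 = 0 separates the classes:
# the XOR label is 1 exactly when x1 and x2 have opposite signs.
y_from_plane = np.where(Z[:, 2] < 0, 1, -1)
accuracy = np.mean(y_from_plane == y_xor)
print(accuracy)
```

A flat plane in the three-dimensional space corresponds to a nonlinear boundary in the original two dimensions, which is the idea that kernelization exploits.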
Challenges
Let's plot the data in 3D as $z = \phi(x, y)$.
np.random.seed(1)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0,
                       X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)
# Train an SVM with the RBF kernel on the XOR data
svm = SVC(kernel='rbf', random_state=1, gamma=0.10, C=10.0)
svm.fit(X_xor, y_xor)
plot_decision_regions(X_xor, y_xor,
                      classifier=svm)
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
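# X_combined / y_combined are assumed to be the training and test data stacked together
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
svm = SVC(kernel='rbf', random_state=1, gamma=0.20, C=10.0)
svm.fit(X_train, y_train)
plot_decision_regions(X_combined, y_combined, classifier=svm)
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()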
Challenges
Let's check how the decision area changes by changing the value of gamma
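As background for this challenge, the RBF kernel measures the similarity of two points as $k(\boldsymbol{x}, \boldsymbol{x}') = \exp(-\gamma ||\boldsymbol{x} - \boldsymbol{x}'||^2)$. The sketch below evaluates this formula by hand for a few values of gamma; the points and the gamma values other than those used above are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([0.0, 0.0])
x_near = np.array([0.5, 0.0])   # a nearby point
x_far = np.array([3.0, 0.0])    # a distant point

for gamma in (0.1, 0.2, 10.0):
    print(gamma, rbf_kernel(x, x_near, gamma), rbf_kernel(x, x_far, gamma))
```

With a large gamma the similarity drops to almost zero even for fairly close points, so each training sample influences only a small neighborhood and the decision boundary tends to wrap tightly around the samples.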
The decision tree classifier is an effective model when interpretability matters. It classifies data by making decisions based on a series of questions. To explain it, we first introduce the measure called **information gain**. Information gain is the reduction in the variability of the elements in each subset when a set is split. The decision tree grows a tree with many leaves by repeated conditional branching until there is no more information gain. If the tree is grown completely on the training data, it easily overfits (the model adapts too closely to the training data only), so its growth is stopped at a certain depth. This is called **pruning**.
The objective function of the decision tree learning algorithm is as follows.
$$IG(D_p,f)=\boldsymbol{I}(D_p)-\sum^{m}_{j=1}\frac{N_j}{N_p}\boldsymbol{I}(D_j)$$
Here $f$ is the feature used for the split, $D_p$ is the parent dataset, $N_p$ is the total number of samples at the parent node, $D_j$ is the $j$th child dataset, $N_j$ is the total number of samples at the $j$th child node, $m$ is the number of child nodes, and $\boldsymbol{I}$ is the impurity. In other words, the information gain is the difference between the impurity of the parent node and the weighted sum of the impurities of the child nodes. Impurity is a measure that quantifies how mixed the samples of different classes are at a node. The three most commonly used impurity measures are:
- Gini impurity
- Entropy
- Classification error

In the Gini impurity, $p(i|t)$ denotes the proportion of samples at a particular node $t$ that belong to class $i$. Gini impurity can therefore be understood as a criterion for minimizing the probability of misclassification.
$$I_G(t)=\sum^c_{i=1}p(i|t)(1-p(i|t)) = 1-\sum^c_{i=1}p(i|t)^2$$
The decision tree learns which values to split the sample data on so that the conditional branches produce fewer misclassifications.
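The Gini impurity and the information gain can be computed directly from these definitions. The sketch below compares two hypothetical ways of splitting a parent node that holds 40 samples of each of two classes; the class counts are made up for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_i p(i|t)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children, impurity=gini):
    """IG = I(D_p) - sum_j (N_j / N_p) * I(D_j)."""
    n_p = len(parent)
    weighted = sum(len(c) / n_p * impurity(c) for c in children)
    return impurity(parent) - weighted

parent = np.array([0] * 40 + [1] * 40)

# Split A: both children are still mixed (30/10 and 10/30)
split_a = (np.array([0] * 30 + [1] * 10), np.array([0] * 10 + [1] * 30))
# Split B: one child is pure (20/0), the other is mixed (20/40)
split_b = (np.array([0] * 20 + [1] * 40), np.array([0] * 20))

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(gini(parent), ig_a, ig_b)
```

Under the Gini criterion, split B has the larger information gain because it produces a completely pure child node, so a tree using this criterion would prefer it.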
Let's classify the Iris data with a decision tree classifier with a decision tree depth of 4.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='gini',max_depth=4,random_state=1)
Challenges
Run the above code, then train the decision tree on the Iris data. Also plot the decision regions after training. Then try changing the depth of the decision tree (max_depth) and see how the decision regions change.
The random forest algorithm aims to create a model with higher generalization performance (one that also copes with unseen data) by averaging multiple deep decision trees. Combining multiple algorithms in this way is called **ensemble learning**. It is also common to combine different kinds of models.
Random forest is performed by the following procedure.
Now let's classify the Iris data with a random forest.
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=25,criterion='gini',random_state=1,n_jobs=2)
Challenges
Run the above code, then train the random forest on the Iris data. Also plot the decision regions after training.
The k-nearest neighbor method is an example of lazy learning: instead of learning a model from the training dataset, it memorizes the dataset and classifies directly from it. The k-nearest neighbor algorithm takes the following steps:
The advantage of this method is that it needs no training and can go straight to the classification stage. Note, however, that if the training data is very large, the computational cost of classification becomes enormous.
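As a rough illustration, the method can be sketched in a few lines of NumPy. This follows the commonly described procedure (compute distances, pick the k nearest samples, take a majority vote) and is an assumption about the steps, not scikit-learn's implementation. The function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5, p=2):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Minkowski distance to every training sample (p=2 is the Euclidean distance)
    dists = np.sum(np.abs(X_train - x_new) ** p, axis=1) ** (1.0 / p)
    # Indices of the k closest samples
    nearest = np.argsort(dists)[:k]
    # Majority vote; on a tie, the smallest label wins
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.0, 4.9]), k=3)
print(pred_a, pred_b)
```

Note that every prediction scans the whole training set, which is where the computational cost mentioned above comes from.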
Let's actually classify using the k-nearest neighbor method.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5,p=2,metric='minkowski')
Challenges
Run the above code, then classify the Iris data with the k-nearest neighbor method. Also plot the decision regions.
Model | Advantage |
---|---|
Logistic regression | Can predict the probability that an event will occur |
SVM | Can handle nonlinear problems with the kernel trick |
Decision tree | Easy to interpret |
Random forest | Requires little parameter tuning; overfits less than a single decision tree |
k-nearest neighbor method | No training required |
Challenges
Let's use the Titanic data to predict whether a passenger survived! The columns to use this time are 'PassengerId', 'Age', 'Pclass', 'Sex', and 'FamilySize'.