Even in machine learning, it is difficult to automate the entire learning process; there are times when you have to adjust the model manually.
Hyperparameters are the parameters of a machine learning model that a person has to adjust by hand rather than having the model learn them.
Hyperparameters vary depending on the method selected, so we will explain them model by model.
Adjusting hyperparameters is called tuning. As for how to adjust them, besides setting a value directly on the model, there is also a way to search for the best value by specifying a range of hyperparameter values.
In scikit-learn, parameters are tuned by passing values as arguments when building the model. If you do not pass any arguments, the default values defined for each model are used as they are.
The code looks like this:
#Tuning method using a fictitious model Classifier as an example
model = Classifier(param1=1.0, param2=True, param3="linear")
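For the search-based approach, scikit-learn also provides utilities such as GridSearchCV. A minimal sketch (the data here is synthetic and only for illustration):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=100, random_state=42)
#Search a range of C values with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_)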
Logistic regression has a parameter called C.
C is an indicator of how strictly the decision boundary learned by the model penalizes misclassification of the training data.
The higher the value of C, the harder the model tries to learn a boundary that classifies the training data perfectly. However, this means overfitting to the training data, and when you make predictions on data other than the training data, the accuracy rate will often decrease.
Decreasing the value of C makes the model more tolerant of misclassifying the training data. By allowing some classification errors, the boundary becomes less affected by outliers, making it easier to obtain a generalized boundary. On data with few outliers, however, the boundary may not be identified well, and if C is made extremely small the boundary likewise cannot be learned well.
The default value of C in scikit-learn's logistic regression model is 1.0.
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
%matplotlib inline
#Data generation
X, y = make_classification(
n_samples=1250, n_features=4, n_informative=2, n_redundant=2, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)
#Set the range of C values to try (here 1e-5, 1e-4, 1e-3, 0.01, 0.1, 1, 10, 100, 1000, 10000)
C_list = [10 ** i for i in range(-5, 5)]
#Prepare empty lists for drawing a graph
train_accuracy = []
test_accuracy = []
#Train a model for each value of C and record the training and test accuracies
for C in C_list:
    model = LogisticRegression(C=C, random_state=42)  #take each C from C_list
    model.fit(train_X, train_y)
    train_accuracy.append(model.score(train_X, train_y))
    test_accuracy.append(model.score(test_X, test_y))
#Graph preparation
#semilogx() puts the x-axis on a logarithmic scale (powers of 10)
plt.semilogx(C_list, train_accuracy, label="accuracy of train_data")
plt.semilogx(C_list, test_accuracy, label="accuracy of test_data")
plt.title("accuracy by changing C")
plt.xlabel("C")
plt.ylabel("accuracy")
plt.legend()
plt.show()
Whereas C was the tolerance for classification errors, penalty represents a penalty on model complexity.
penalty can take two values, "l1" and "l2". Basically it is fine to select "l2", but there are cases where selecting "l1" gives the results you want.
l1: a penalty that generalizes the decision boundary by reducing the number of data features the model relies on (it can shrink some weights exactly to zero).
l2: a penalty that generalizes the decision boundary by reducing the magnitude of the weights over the data as a whole.
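A minimal sketch of the difference (synthetic data; note that "l1" requires a solver that supports it, such as "liblinear" or "saga"):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, n_features=10, n_informative=2, random_state=42)
model_l1 = LogisticRegression(penalty="l1", solver="liblinear", random_state=42).fit(X, y)
model_l2 = LogisticRegression(penalty="l2", random_state=42).fit(X, y)
#With "l1", some coefficients are typically driven exactly to zero
print((model_l1.coef_ == 0).sum(), (model_l2.coef_ == 0).sum())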
multi_class is a parameter that determines how the model behaves when performing multiclass classification.
Logistic regression provides two values, "ovr" and "multinomial".
ovr: suitable for problems where each class is answered with the binary value "belongs / does not belong".
multinomial: also takes into account the probability of being classified into each class, so it is suitable for problems that deal not only with "belongs / does not belong" but also with "how likely it is to belong".
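A minimal sketch of both settings (note: recent scikit-learn versions deprecate the multi_class argument, so this reflects the older API):
from sklearn.linear_model import LogisticRegression
#"ovr" fits one binary classifier per class; "liblinear" supports only this scheme
model_ovr = LogisticRegression(multi_class="ovr", solver="liblinear")
#"multinomial" optimizes the class probabilities jointly (use e.g. "lbfgs")
model_multi = LogisticRegression(multi_class="multinomial", solver="lbfgs")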
The model processes the data in a random order during training, and random_state is the parameter that controls that order.
In the case of a logistic regression model, depending on the data, the boundary can change significantly with the processing order.
Also, by fixing the value of random_state, the learning result on the same data can be reproduced.
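For example (a sketch reusing train_X, train_y, test_X, test_y from the code above):
from sklearn.linear_model import LogisticRegression
#Two models built with the same fixed random_state learn the same result
model_a = LogisticRegression(random_state=42).fit(train_X, train_y)
model_b = LogisticRegression(random_state=42).fit(train_X, train_y)
print(model_a.score(test_X, test_y) == model_b.score(test_X, test_y))  #True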
Similar to logistic regression, SVM also defines C, the tolerance for classification errors, as a parameter. Its usage is the same as in logistic regression.
Compared to logistic regression, the labels an SVM predicts fluctuate more with C: because the SVM algorithm seeks a more generalized boundary, raising or lowering the error tolerance changes the support vectors, so the accuracy rate moves up and down more than in logistic regression.
In the linear SVM model, the default value of C is also 1.0.
The LinearSVC module is used.
from sklearn.svm import LinearSVC
#Build a model of linear SVM
model = LinearSVC(C=1.0, random_state=2)  #C defaults to 1.0
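As a sketch, C can also be swept for LinearSVC just as in the logistic regression example (train_X, train_y, test_X, test_y reused from above):
from sklearn.svm import LinearSVC
#Compare test accuracy over the same range of C values as before
for C in [10 ** i for i in range(-5, 5)]:
    model = LinearSVC(C=C, random_state=42)
    model.fit(train_X, train_y)
    print(C, model.score(test_X, test_y))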
Like logistic regression, linear SVM has a penalty parameter. The values that can be set are likewise "l1" and "l2".
multi_class is a parameter that determines how the model behaves when performing multiclass classification. Linear SVM provides two values, "ovr" and "crammer_singer". Basically, "ovr" is lighter and gives better results.
random_state is used to fix the result (by fixing the random numbers), but in SVM it is also involved in determining the support vectors. Keep in mind that the boundary eventually learned will be about the same, but with slight differences.
When dealing with data that is not linearly separable, use the SVM module called SVC. The parameter C exists in SVC just as it does in logistic regression and linear SVM.
In non-linear SVM, C likewise adjusts the penalty, controlling how much classification error is tolerated during learning.
The parameter kernel is a particularly important parameter in non-linear SVM: it specifies the function that transforms the received data to make it easier to classify.
It can take five values: linear, rbf, poly, sigmoid, and precomputed. The default is rbf.
linear
A linear SVM, almost the same as LinearSVC. Use LinearSVC unless you have a specific reason not to.
rbf, poly
These map the data into a higher-dimensional space, like a projection into 3-D. Since rbf often gives a relatively high accuracy rate compared to the others, the default rbf is usually used.
precomputed
Used when the data has already been transformed by preprocessing, i.e. a precomputed kernel matrix is passed instead of raw features.
sigmoid
Performs processing similar to the logistic regression model.
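A minimal sketch of building SVC with different kernels (train_X, train_y, test_X, test_y reused from the earlier example):
from sklearn.svm import SVC
#rbf is the default kernel; poly additionally takes a degree argument
model_rbf = SVC(kernel="rbf", random_state=42).fit(train_X, train_y)
model_poly = SVC(kernel="poly", degree=3, random_state=42).fit(train_X, train_y)
print(model_rbf.score(test_X, test_y), model_poly.score(test_X, test_y))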
decision_function_shape is, so to speak, SVC's counterpart to the multi_class parameter.
Two values are available, ovo and ovr.
ovo
Makes a pair out of every two classes, performs binary classification on each pair, and decides the class a sample belongs to by majority vote.
The amount of calculation is large, and operation may become heavy depending on the amount of data.
ovr
Classifies each class against all the others and decides the class a sample belongs to by majority vote.
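A sketch of the practical difference, visible in the shape of decision_function's output (with 4 classes, "ovr" returns one score per class, "ovo" one per class pair):
from sklearn.svm import SVC
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_classes=4, n_informative=3,
                           n_clusters_per_class=1, random_state=42)
clf_ovr = SVC(decision_function_shape="ovr").fit(X, y)
clf_ovo = SVC(decision_function_shape="ovo").fit(X, y)
print(clf_ovr.decision_function(X).shape)  #(200, 4): one score per class
print(clf_ovo.decision_function(X).shape)  #(200, 6): one score per class pair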
random_state is a parameter related to the order in which the data is processed. To be able to reproduce prediction results, it is recommended to fix it at the training stage.
In actual machine learning work, there is also a way to specify a generator that produces the random numbers. The code for specifying a generator is as follows.
import numpy as np
from sklearn.svm import SVC
#Build a random number generator (fixing the seed keeps results reproducible)
random_state = np.random.RandomState(42)
#Build an SVM model with the generator passed as random_state
model = SVC(random_state=random_state)