In neural network models, there are a number of so-called hyperparameters that must be set when configuring the network.
There are many hyperparameters, and if they are not set properly the model will not train correctly. It is therefore necessary to choose suitable hyperparameters when building a new model.
Let's look at the hyperparameters in actual code.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential, load_model
from keras import optimizers
from keras.utils.np_utils import to_categorical
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 784)[:6000]
X_test = X_test.reshape(X_test.shape[0], 784)[:1000]
y_train = to_categorical(y_train)[:6000]
y_test = to_categorical(y_test)[:1000]
model = Sequential()
model.add(Dense(256, input_dim=784))
#Hyperparameters: activation function
model.add(Activation("sigmoid"))
#Hyperparameters: Number of hidden layers, number of hidden layer units
model.add(Dense(128))
model.add(Activation("sigmoid"))
#Hyperparameters: Dropout rate (rate)
model.add(Dropout(rate=0.5))
model.add(Dense(10))
model.add(Activation("softmax"))
#Hyperparameters: Learning rate (lr)
sgd = optimizers.SGD(lr=0.01)
#Hyperparameters: optimizer
#Hyperparameters: Error function (loss)
model.compile(optimizer=sgd, loss="categorical_crossentropy", metrics=["accuracy"])
#Hyperparameters: Batch size (batch_size)
#Hyperparameters: Number of epochs (epochs)
model.fit(X_train, y_train, batch_size=32, epochs=10, verbose=1)
score = model.evaluate(X_test, y_test, verbose=0)
print("evaluate loss: {0[0]}\nevaluate acc: {0[1]}".format(score))
The number of hidden layers between the input layer and the output layer, and the number of units in each hidden layer, can be specified freely. Increasing them makes it possible to represent a wider variety of functions.
However, when there are many hidden layers, adjusting the weights becomes more difficult and learning slows down. When there are many units, features of low importance get picked up and overfitting (a state with low generalization performance) occurs more easily. Therefore, rather than blindly increasing these numbers, it is necessary to set values appropriate for the task.
Consider the network structure based on precedent, for example by referring to similar implementation examples.
As an example, let's confirm the effect of the hidden-layer structure on the training of the model, and predict which of the following three models will be the most accurate.
A: one fully connected hidden layer with 256 units and one fully connected hidden layer with 128 units
B: one fully connected hidden layer with 256 units and three fully connected hidden layers with 128 units
C: one fully connected hidden layer with 256 units and one fully connected hidden layer with 1568 units
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential, load_model
from keras import optimizers
from keras.utils.np_utils import to_categorical
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 784)[:6000]
X_test = X_test.reshape(X_test.shape[0], 784)[:1000]
y_train = to_categorical(y_train)[:6000]
y_test = to_categorical(y_test)[:1000]
model = Sequential()
model.add(Dense(256, input_dim=784))
model.add(Activation("sigmoid"))
def funcA():
    model.add(Dense(128))
    model.add(Activation("sigmoid"))
def funcB():
    model.add(Dense(128))
    model.add(Activation("sigmoid"))
    model.add(Dense(128))
    model.add(Activation("sigmoid"))
    model.add(Dense(128))
    model.add(Activation("sigmoid"))
def funcC():
    model.add(Dense(1568))
    model.add(Activation("sigmoid"))
#Choose one of the A, B and C models and comment out the other two.
#---------------------------
funcA()
#funcB()
#funcC()
#---------------------------
model.add(Dropout(rate=0.5))
model.add(Dense(10))
model.add(Activation("softmax"))
sgd = optimizers.SGD(lr=0.1)
model.compile(optimizer=sgd, loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=32, epochs=3, verbose=1)
score = model.evaluate(X_test, y_test, verbose=0)
print("evaluate loss: {0[0]}\nevaluate acc: {0[1]}".format(score))
#The correct answer is funcA().
Dropout is one of the methods used to prevent overfitting to the training data and improve the accuracy of the model.
With dropout, neurons are randomly dropped (their outputs are overwritten with 0) while training is repeated. As a result, the neural network learns more general features without depending on specific neurons.
Dropout is written as follows.
model.add(Dropout(rate=0.5))
#rate is the fraction of units to drop.
Both the position and the rate of dropout are hyperparameters.
Let's implement dropout and bring the accuracy rates on the training data and the test data closer to each other.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential, load_model
from keras import optimizers
from keras.utils.np_utils import to_categorical
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 784)[:6000]
X_test = X_test.reshape(X_test.shape[0], 784)[:1000]
y_train = to_categorical(y_train)[:6000]
y_test = to_categorical(y_test)[:1000]
model = Sequential()
model.add(Dense(256, input_dim=784))
model.add(Activation("sigmoid"))
model.add(Dense(128))
model.add(Activation("sigmoid"))
# ---------------------------
#Here is the code for the dropout
model.add(Dropout(rate=0.5))
# ---------------------------
model.add(Dense(10))
model.add(Activation("softmax"))
sgd = optimizers.SGD(lr=0.1)
model.compile(optimizer=sgd, loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train, batch_size=32, epochs=5, verbose=1, validation_data=(X_test, y_test))
#acc, val_acc plot
plt.plot(history.history["acc"], label="acc", ls="-", marker="o")
plt.plot(history.history["val_acc"], label="val_acc", ls="-", marker="x")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(loc="best")
plt.show()
The activation function is a function applied after a fully connected layer (or similar layer), and it corresponds to the firing of a neuron.
A fully connected layer only applies a linear transformation to its input; the activation function adds non-linearity.
Without an activation function, data that cannot be separated by a single straight line, as in the figure below (data that is not linearly separable), cannot be classified.
By adding non-linearity through the activation function, even data that is not linearly separable can be classified if the model is trained properly.
There are several types of activation functions, so it is important to choose an appropriate one.
The sigmoid function is one of the activation functions. It maps the input value to a value between 0 and 1. The formula is as follows:
sigmoid(x) = 1 / (1 + e^(-x))
The sigmoid function equals 1/2 at x = 0, takes values between 0 and 1/2 when x < 0, and values between 1/2 and 1 when x > 0.
The blue graph is the sigmoid function and the orange graph is the derivative of the sigmoid function.
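As a quick check, here is a minimal NumPy/matplotlib sketch (illustration only, not part of the exercise code; the helper names are just for this example) that plots the sigmoid function and its derivative, reproducing the two curves described above.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # maps any real input to the range (0, 1)
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

x = np.linspace(-6, 6, 200)
plt.plot(x, sigmoid(x), label="sigmoid")          # blue curve
plt.plot(x, sigmoid_grad(x), label="derivative")  # orange curve
plt.xlabel("x")
plt.legend(loc="best")
plt.show()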
ReLU
Another activation function that is often used is ReLU (the ramp function).
This function outputs 0 if the input value is less than 0, and outputs the input value itself if it is 0 or more.
The definition of ReLU (Rectified Linear Unit) is as follows:
ReLU(x) = max(0, x)
In other words, y = 0 when x < 0 and y = x when x >= 0.
The blue graph represents the ReLU and the orange graph represents the derivative of the ReLU.
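Similarly, a minimal sketch (illustration only) that plots ReLU and its derivative, assuming the derivative at x = 0 is taken to be 0:
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    # 0 for x < 0, x for x >= 0
    return np.maximum(0, x)

def relu_grad(x):
    # derivative: 0 for x < 0, 1 for x > 0 (taken as 0 at x = 0 here)
    return (x > 0).astype(float)

x = np.linspace(-6, 6, 200)
plt.plot(x, relu(x), label="ReLU")            # blue curve
plt.plot(x, relu_grad(x), label="derivative") # orange curve
plt.xlabel("x")
plt.legend(loc="best")
plt.show()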
A function that evaluates the difference between the output data and the teacher data during training is called a loss function (or error function).
It is possible to evaluate a model using the accuracy rate as an index, but that alone does not show detailed results, such as which individual outputs were correct or incorrect. That is why the loss function is used to measure the difference between each output and its teacher data.
There are many types of loss functions; squared error and cross-entropy error are examples commonly used in machine learning.
We will look at these two in more detail below.
In addition, a technique called error backpropagation is used to compute the derivatives of the loss function efficiently; based on these derivatives, the weights of each layer are updated so as to minimize the difference between the output data and the teacher data.
Mean squared error is an error function that, like the least-squares method, is often used in fields such as statistics. It is written with the following formula:
E = (1/N) * Σ (y_i - t_i)^2
Here y_i is the predicted label and t_i is the correct label. Since mean squared error is well suited to evaluating continuous values, it is mainly used as the error function of regression models.
Since 1/N is a constant determined by the number of data points, it is sometimes omitted, and the sum of the squared errors is simply used as the error function.
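For reference, a small NumPy sketch (illustration only, with made-up toy values) that computes the mean squared error for a regression-style output:
import numpy as np

def mean_squared_error(y, t):
    # average of the squared differences between predictions y and targets t
    return np.mean((y - t) ** 2)

y = np.array([2.5, 0.0, 2.1, 7.8])   # predicted values
t = np.array([3.0, -0.5, 2.0, 7.5])  # correct values
print(mean_squared_error(y, t))      # 0.15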
Cross-entropy error is specialized for evaluating classification, so it is mainly used as the error function of classification models. For predicted probabilities y_i and teacher labels t_i, the formula is as follows:
E = -Σ t_i * log(y_i)
Since the teacher labels are one-hot (1 for the correct class and 0 for all others), every term whose label is not 1 becomes 0; in effect, only the error of the correct label, -log(y_correct), is calculated.
For example, if the predicted probability of the correct class is 0.7, the cross-entropy error is -log(0.7) ≈ 0.36.
In other words, the closer y_i is to 1, the closer log(y_i) is to 0, so the error becomes smaller; conversely, the closer y_i is to 0, the closer log(y_i) is to -∞, and the larger the error becomes.
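A minimal NumPy sketch of this calculation (illustration only), assuming a one-hot teacher label:
import numpy as np

def cross_entropy_error(y, t):
    # -sum(t * log(y)); a tiny constant avoids log(0)
    return -np.sum(t * np.log(y + 1e-7))

t = np.array([0, 0, 1, 0])          # one-hot correct label (class 2)
y = np.array([0.1, 0.1, 0.7, 0.1])  # predicted probabilities
print(cross_entropy_error(y, t))    # about 0.357, i.e. -log(0.7)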
Based on the gradients obtained by differentiating the loss function, the direction and amount of each weight update are determined. At that time, an optimization function (optimizer) is used to decide how the learning rate, the number of epochs, the amount of past weight updates, and so on are reflected in the weight update.
The optimizer is a hyperparameter that must be chosen by a human. There are several types of optimizers, and choosing an unsuitable one can adversely affect learning, so be careful to select one appropriate for the problem.
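In Keras, the optimizer is passed to model.compile(). As a hedged illustration (these particular settings are examples only, not values prescribed by this exercise, and the snippet assumes the model built earlier in this section), swapping plain SGD for another optimizer such as Adam looks like this:
from keras import optimizers

# plain SGD with a fixed learning rate, as used in this section
sgd = optimizers.SGD(lr=0.01)

# an alternative optimizer; Adam adapts the step size per parameter
adam = optimizers.Adam(lr=0.001)

# pass whichever optimizer you want to compare to compile()
model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])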
The learning rate is a hyperparameter that determines how much the weights of each layer are changed in a single update.
The figure below illustrates a minimization problem and the effect of the learning rate; the point at the upper right is the initial value.
1. The learning rate is too low, and the updates make almost no progress.
2. With an appropriate learning rate, the value converges in a small number of updates.
3. The value still converges, but the updates are wasteful because the step is large.
4. The learning rate is too high, and the values diverge (the updates overshoot and the value keeps growing).
In other words, to train the model properly, it is necessary to set a learning rate appropriate for the loss function.
#Learning rate setting
global lr
lr = #Parameters
I've shown an example below, so please adjust and check it.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential, load_model
from keras import optimizers
from keras.utils.np_utils import to_categorical
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 784)[:6000]
X_test = X_test.reshape(X_test.shape[0], 784)[:1000]
y_train = to_categorical(y_train)[:6000]
y_test = to_categorical(y_test)[:1000]
model = Sequential()
model.add(Dense(256, input_dim=784))
model.add(Activation("sigmoid"))
model.add(Dense(128))
model.add(Activation("sigmoid"))
model.add(Dropout(rate=0.5))
model.add(Dense(10))
model.add(Activation("softmax"))
def funcA():
    global lr
    lr = 0.01
def funcB():
    global lr
    lr = 0.1
def funcC():
    global lr
    lr = 1.0
#Choose one of the three and comment out the other two lines to compare the changes.
#---------------------------
#funcA()
funcB()
#funcC()
#---------------------------
sgd = optimizers.SGD(lr=lr)
model.compile(optimizer=sgd, loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=32, epochs=3, verbose=1)
score = model.evaluate(X_test, y_test, verbose=0)
print("evaluate loss: {0[0]}\nevaluate acc: {0[1]}".format(score))
The number of data samples fed into the model at one time is called the batch size, and it is also one of the hyperparameters.
When multiple samples are passed at once, the model computes the loss and the gradient of the loss function for each sample, and then updates the weights just once, based on the average of those gradients.
Updating the weights with multiple samples in this way reduces the influence of biased data, and the computation time can also be shortened through parallel computation.
On the other hand, large weight updates become less likely to occur, and the model may become optimized for only part of the data,
getting stuck in a state (a local solution) where it is not optimized for the data as a whole.
To avoid this, the batch size is adjusted: for example, it is increased when there is a lot of irregular data and decreased when there is little.
Online learning (stochastic gradient descent): training with the batch size set to 1
Batch learning (the steepest descent method): training with the batch size set to the total number of data samples
Mini-batch learning: training with a small batch size somewhere in between
#Batch size adjustment
global batch_size
batch_size = #Parameters
Here is an example code.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential, load_model
from keras import optimizers
from keras.utils.np_utils import to_categorical
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 784)[:6000]
X_test = X_test.reshape(X_test.shape[0], 784)[:1000]
y_train = to_categorical(y_train)[:6000]
y_test = to_categorical(y_test)[:1000]
model = Sequential()
model.add(Dense(256, input_dim=784))
model.add(Activation("sigmoid"))
model.add(Dense(128))
model.add(Activation("sigmoid"))
model.add(Dropout(rate=0.5))
model.add(Dense(10))
model.add(Activation("softmax"))
sgd = optimizers.SGD(lr=0.1)
model.compile(optimizer=sgd, loss="categorical_crossentropy", metrics=["accuracy"])
def funcA():
    global batch_size
    batch_size = 16
def funcB():
    global batch_size
    batch_size = 32
def funcC():
    global batch_size
    batch_size = 64
#Choose one of the three, comment out the other two lines, and compare the batch_size values.
#---------------------------
#funcA()
#funcB()
funcC()
#---------------------------
model.fit(X_train, y_train, batch_size=batch_size, epochs=3, verbose=1)
score = model.evaluate(X_test, y_test, verbose=0)
print("evaluate loss: {0[0]}\nevaluate acc: {0[1]}".format(score))
Generally, in deep learning, training is iterative: the model is trained repeatedly on the same training data. The number of passes is called the number of epochs, and it is also a hyperparameter.
Setting a large number of epochs does not mean that the accuracy of the model will keep improving.
If an appropriate number of epochs is not set, not only does the accuracy stop increasing partway through, but because training keeps trying to minimize the loss function, it can also cause overfitting.
It is therefore important to set an appropriate number of epochs and stop training at the right time.
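One common way to stop at the right time is Keras's EarlyStopping callback, which halts training when a monitored metric stops improving. A minimal sketch (the patience value is just an example, not part of this exercise, and the snippet assumes the model and data prepared in the code below):
from keras.callbacks import EarlyStopping

# stop when the validation loss has not improved for 2 consecutive epochs
early_stopping = EarlyStopping(monitor="val_loss", patience=2)

model.fit(X_train, y_train, batch_size=32, epochs=60, verbose=1,
          validation_data=(X_test, y_test), callbacks=[early_stopping])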
#Number of epochs
global epochs
epochs = #Parameters
Here is an example code.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential, load_model
from keras import optimizers
from keras.utils.np_utils import to_categorical
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 784)[:1500]
X_test = X_test.reshape(X_test.shape[0], 784)[:6000]
y_train = to_categorical(y_train)[:1500]
y_test = to_categorical(y_test)[:6000]
model = Sequential()
model.add(Dense(256, input_dim=784))
model.add(Activation("sigmoid"))
model.add(Dense(128))
model.add(Activation("sigmoid"))
#I won't use Dropout this time.
#model.add(Dropout(rate=0.5))
model.add(Dense(10))
model.add(Activation("softmax"))
sgd = optimizers.SGD(lr=0.1)
model.compile(optimizer=sgd, loss="categorical_crossentropy", metrics=["accuracy"])
def funcA():
    global epochs
    epochs = 5
def funcB():
    global epochs
    epochs = 10
def funcC():
    global epochs
    epochs = 60
#Choose one of the three and comment out the other two lines to determine the number of epochs.
#---------------------------
#funcA()
funcB()
#funcC()
#---------------------------
history = model.fit(X_train, y_train, batch_size=32, epochs=epochs, verbose=1, validation_data=(X_test, y_test))
#acc, val_acc plot
plt.plot(history.history["acc"], label="acc", ls="-", marker="o")
plt.plot(history.history["val_acc"], label="val_acc", ls="-", marker="x")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(loc="best")
plt.show()
score = model.evaluate(X_test, y_test, verbose=0)
print("evaluate loss: {0[0]}\nevaluate acc: {0[1]}".format(score))