It's been about two weeks since I started learning deep learning. What I've learned is about to start spilling out of my head, so I'd like to write it down in an organized way before it does. Starting with this post, I'll build a DNN over several installments. This one is the forward propagation edition.
We will build a network that determines whether an image is a cat (1) or not (0).
209 images are used as training data and 50 images as test data. Each image is 64 * 64 pixels with 3 RGB channels.
Number of training examples : 209
Number of testing examples : 50
Each image is of size : 64 * 64 * 3
In addition, the ratio of positive (cat) to negative (non-cat) examples is as follows.
Number of cat-images of training data : 72 / 209
Number of cat-images of test data : 33 / 50
We will build a four-layer network with 12288 input nodes (64 * 64 pixels * 3 RGB channels) and one output node. I intended to draw the connections between all the nodes, but gave up because it was too painful to do in PowerPoint.
Dimensions of each layer : [12288, 20, 7, 5, 1]
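As an aside on where 12288 comes from: each 64 * 64 image has 3 RGB channels, and 64 * 64 * 3 = 12288. Below is a minimal sketch of the flattening step, assuming the training images are stored in a NumPy array of shape (209, 64, 64, 3) (the variable names here are hypothetical).

import numpy as np

# Hypothetical array of training images: (209 images, 64 px, 64 px, 3 RGB channels)
images = np.zeros((209, 64, 64, 3))

# Flatten each image into a column vector and stack them, then scale pixels to [0, 1]
X = images.reshape(images.shape[0], -1).T / 255.

print(X.shape)  # (12288, 209)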
This time, we will use the ReLU function for the hidden layers and the sigmoid function for the output layer.
The ReLU function outputs 0 if the input is 0 or less, and outputs the input as is otherwise. It can be written as follows.
y = np.maximum(0, x)
The sigmoid function squashes the input into a value between 0 and 1. It can be written as follows.
y = 1 / (1 + np.exp(-x))
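As a quick sanity check (this snippet is my own illustration, not part of the network code), applying both functions to a small array makes the behavior concrete.

import numpy as np

x = np.array([-2.0, 0.0, 3.0])
print(np.maximum(0, x))      # ReLU: [0. 0. 3.]
print(1 / (1 + np.exp(-x)))  # sigmoid: approx. [0.119 0.5 0.953]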
Now that the overall design of the DNN is clear, let's review how a DNN learns. Learning proceeds according to the following procedure, and the parameters are gradually optimized by repeating steps 2 to 5.
1. Initialize the parameters
2. Forward propagation
3. Compute the cost
4. Backward propagation
5. Update the parameters
This time, all layers were initialized with Xavier initialization. Apparently you should use He initialization when using the ReLU function, but honestly I don't yet have a good grasp of the difference.
Xavier initialization draws the initial parameters randomly from a normal distribution with mean $0$ and standard deviation $\frac{1}{\sqrt{n}}$, where $n$ is the number of nodes in the previous layer. Compared with drawing from the standard normal distribution, drawing from this narrower range keeps the activations of each layer from piling up around 0 and 1, which makes vanishing gradients less likely. See page 182 of "Deep Learning from Scratch" for details. Since there are 4 layers, we need to initialize the parameters of all 4 of them. Let's create a function that takes a list of the number of nodes in each layer as an argument and returns the parameters.
import numpy as np

def initialize_parameters(layers_dims):
    np.random.seed(1)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        # Xavier-style initialization: scale by 1 / sqrt(number of nodes in the previous layer)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) / np.sqrt(layers_dims[l-1])
        # Biases start at zero
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
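As a quick check (this call is my own addition, not from the original post), initializing with the layer dimensions above produces parameters with the following shapes.

parameters = initialize_parameters([12288, 20, 7, 5, 1])

for l in range(1, 5):
    print('W' + str(l), parameters['W' + str(l)].shape, 'b' + str(l), parameters['b' + str(l)].shape)
# W1 (20, 12288) b1 (20, 1)
# W2 (7, 20)     b2 (7, 1)
# W3 (5, 7)      b3 (5, 1)
# W4 (1, 5)      b4 (1, 1)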
We then predict labels using the prepared parameters by performing the following computation for all layers. The figure is reused from another post, so the cost function $L(a, y)$ is also drawn in it, but forward propagation does not need to be aware of it, so please ignore that part.
The functions needed to do this are listed below. The cache and caches variables that appear here and there are kept for use when implementing backpropagation.
def linear_forward(A, W, b):
    # Linear part of a layer's forward step: Z = WA + b
    Z = np.dot(W, A) + b
    cache = (A, W, b)
    return Z, cache
def sigmoid(Z):
    # Sigmoid activation; Z is cached for backpropagation
    A = 1 / (1 + np.exp(-Z))
    cache = Z
    return A, cache
def relu(Z):
    # ReLU activation; Z is cached for backpropagation
    A = np.maximum(0, Z)
    cache = Z
    return A, cache
def linear_activation_forward(A_prev, W, b, activation):
    # One layer's forward step: linear transform followed by the chosen activation
    if activation == 'relu':
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
    elif activation == 'sigmoid':
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    cache = (linear_cache, activation_cache)
    return A, cache
def L_model_forward(X, parameters):
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers (each layer has a W and a b)
    # Hidden layers 1 .. L-1 use ReLU
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W'+str(l)], parameters['b'+str(l)], activation='relu')
        caches.append(cache)
    # The output layer L uses sigmoid
    AL, cache = linear_activation_forward(A, parameters['W'+str(L)], parameters['b'+str(L)], activation='sigmoid')
    caches.append(cache)
    return AL, caches
Prediction with initial parameters can be performed by executing L_model_forward.
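For example, here is a minimal sketch of running the forward pass, using dummy random input in place of the real training data X (which would have shape (12288, 209)); thresholding the sigmoid output at 0.5 to get a 0/1 prediction is the usual convention.

np.random.seed(2)
X_dummy = np.random.randn(12288, 5)  # 5 fake examples, purely for illustration

parameters = initialize_parameters([12288, 20, 7, 5, 1])
AL, caches = L_model_forward(X_dummy, parameters)

print(AL.shape)  # (1, 5): one sigmoid output per example
predictions = (AL > 0.5).astype(int)  # 1 = cat, 0 = not cat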
This time we implemented everything up through forward propagation. Next time I'd like to compute the cost.