Reviewing cross-entropy with the Coursera Machine Learning Week 2 assignment

I took Coursera's Machine Learning course about a year ago, and I feel it gave me a solid grounding in the basics of machine learning. "Cross entropy" is an error function used in classification problems. To get the concept firmly into my head once more, I redid the logistic regression task from Week 2 of Coursera ML. The course itself uses Matlab (or Octave), but here I use Python.

(Programming environment: Python 2.7.11, numpy 1.11.0, scipy 0.17.0, tensorflow 0.9.0rc.)

What is cross entropy?

First, let us look at the model of logistic regression in the classification problem.

Fig. Logistic regression model (logisticregression_diagram.PNG)

In this model the input values x1, x2, ..., xn are weighted and summed, and the result is passed through the sigmoid activation function to obtain the estimate. The figure above shows the weights w1, w2, ..., wn and the bias b separately, but Coursera's explanation groups w and b together as the parameter theta.

Cross entropy is a number that indicates how far the model's estimate is from the actual value. It is given by the following equation. (In English it is sometimes abbreviated as xentropy or xent.)

J(\theta) = \frac{1}{m}\sum_{i=1}^{m} [-y^{(i)}\log(h_{\theta}(x^{(i)})) - (1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))]
\\
(h_{\theta}(x) = g(\theta^{T}x),\ g(z) = \frac{1}{1+e^{-z}})

Here h_theta(x) is the estimate. In a binary classification problem it is the probability that the class is y = 1, so it lies in the range [0.0, 1.0]. y^(i) is the actual class (0 or 1) given in the training data. Logistic regression obtains an accurate model by finding the parameter theta that minimizes this error function J(theta) (called the cost function in Coursera). The minimization also needs the gradient of the error function, which is as follows.

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}
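To make the cost equation above concrete, here is a minimal sketch that evaluates the per-sample term for a few hypothetical predictions. The helper name sample_loss and the values 0.9, 0.5 and 0.1 are my own illustration, not part of the course material.

```python
import numpy as np

def sample_loss(y, h):
    """Cross-entropy term for one sample: -y*log(h) - (1-y)*log(1-h)."""
    return -y * np.log(h) - (1.0 - y) * np.log(1.0 - h)

# Hypothetical predictions h for a sample whose true class is y = 1
confident_right = sample_loss(1.0, 0.9)   # small penalty
undecided = sample_loss(1.0, 0.5)         # equals log(2)
confident_wrong = sample_loss(1.0, 0.1)   # large penalty

print(confident_right, undecided, confident_wrong)
```

The penalty grows quickly as the prediction moves away from the true class. With theta = 0 every prediction is 0.5, so the mean cost is log 2 (about 0.693), which is the "initial cost" printed in the run shown later in this article.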

Based on these expressions, we will create Python code.

Implementation using Scipy.Optimize

First, following the order of the course assignment, write the code that computes the cost function.

import numpy as np

def sigmoid(x):
    s = 1. / (1. + np.exp(-x))

    return s

def compute_cost(theta, xm, ym, lamb):
    '''
      function:
        compute the cost of a particular choice of theta
      args:
        theta  : parameter related to weight, bias
        xm, ym : data of features, labels
        lamb   : regularization coefficient (lambda)
    '''
    m, n = xm.shape
    xent = 0.0
    logits = np.dot(xm, theta)

    for i in range(m):
        # J calculation, J is scalar
        sigmd_x = sigmoid(logits[i])
        xent += 1. / m *(-ym[i] * np.log(sigmd_x) - (1. - ym[i]) 
                                             * np.log(1.0 - sigmd_x))
    xent = np.asarray(xent).item()   # to Python scalar (np.asscalar is deprecated since NumPy 1.16)

    # add regularization term to cost
    theta_sq = sum([item **2 for item in theta[1:]])
    cost = xent + lamb /2. /m * theta_sq

    return cost

The sigmoid function is defined first. compute_cost() then calculates the cost from the cross-entropy equation above. The code accumulates xent sample by sample in a for loop; a loop-free version would be more efficient, but I think this is fine for a first version. After the cross entropy xent is computed, the L2 regularization term (weight decay), which helps prevent overfitting, is added to give the final cost.
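For reference, the loop-free version mentioned above could look something like the following. This is only a sketch under my own assumptions: the name compute_cost_vec and the tiny demo data are made up, and the inputs are flattened so that both (n,) and (n, 1) shaped theta are accepted.

```python
import numpy as np

def compute_cost_vec(theta, xm, ym, lamb):
    """Loop-free variant of compute_cost (same formula, same regularization)."""
    theta = np.asarray(theta, dtype=float).ravel()
    ym = np.asarray(ym, dtype=float).ravel()
    m = xm.shape[0]
    h = 1.0 / (1.0 + np.exp(-xm.dot(theta)))      # predictions for all samples
    xent = np.mean(-ym * np.log(h) - (1.0 - ym) * np.log(1.0 - h))
    # L2 penalty; theta[0] (the bias term) is excluded, as in the loop version
    return xent + lamb / (2.0 * m) * np.sum(theta[1:] ** 2)

# Tiny made-up data set, only to exercise the function
xm_demo = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
ym_demo = np.array([1.0, 0.0, 1.0])
print(compute_cost_vec(np.zeros(2), xm_demo, ym_demo, 1.0))  # log(2) at theta = 0
```

np.mean replaces the explicit 1/m accumulation, and the dot product computes all logits at once.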

Deep learning frameworks compute the partial derivatives with respect to the parameters automatically, but with Scipy + Numpy you have to derive and code them yourself.

def compute_grad(theta, xm, ym, lamb):
    '''
      function:
        compute the cost gradient of a particular choice of theta
      args:
        theta  : parameter related to weight, bias
        xm, ym : data of features, labels
    '''
    m, n = xm.shape
    theta = theta.flatten()            # accept theta shaped (n,) or (n, 1)
    xent_grad = np.zeros_like(theta)
    logits = np.dot(xm, theta)

    for i in range(m):
        # grad(J) calculation
        sigmd_x = sigmoid(logits[i])
        delta = sigmd_x - ym[i]
        xent_grad += 1. / m * (delta * xm[i])

    xent_grad = xent_grad.flatten()
    theta = theta.flatten()

    # add regularization term to grad
    cost_grad = np.zeros_like(xent_grad)
    cost_grad[0] = xent_grad[0]
    cost_grad[1:] = xent_grad[1:] + lamb / m * theta[1:]
    
    return cost_grad

For the most part the equation is translated directly into code; the regularization term is added in the second half. (The program follows the structure of the Coursera assignment.)
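Because the gradient is derived by hand, it is worth checking it numerically before running the optimizer. The following is a minimal sketch using scipy.optimize.check_grad, which compares an analytic gradient with a finite-difference estimate. The functions cost and grad are compact vectorized stand-ins that I wrote for this check (they assume a flat theta), and the random data is made up.

```python
import numpy as np
from scipy import optimize

def cost(theta, xm, ym, lamb):
    m = xm.shape[0]
    h = 1.0 / (1.0 + np.exp(-xm.dot(theta)))
    xent = np.mean(-ym * np.log(h) - (1.0 - ym) * np.log(1.0 - h))
    return xent + lamb / (2.0 * m) * np.sum(theta[1:] ** 2)

def grad(theta, xm, ym, lamb):
    m = xm.shape[0]
    h = 1.0 / (1.0 + np.exp(-xm.dot(theta)))
    g = xm.T.dot(h - ym) / m          # (1/m) * sum_i (h_i - y_i) * x_i
    g[1:] += lamb / m * theta[1:]     # bias term theta[0] is not regularized
    return g

# Made-up data: 20 samples, a bias column plus 3 random features
rng = np.random.RandomState(0)
xm_demo = np.hstack([np.ones((20, 1)), rng.randn(20, 3)])
ym_demo = (rng.rand(20) > 0.5).astype(float)
theta_demo = 0.1 * rng.randn(4)

# check_grad returns the norm of (analytic gradient - numerical gradient)
err = optimize.check_grad(cost, grad, theta_demo, xm_demo, ym_demo, 0.5)
print('gradient check error = {:.2e}'.format(err))
```

A value close to zero suggests that the gradient code agrees with the cost code; a large value usually points to a sign or indexing mistake in one of them.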

In the Coursera assignment, an optimization function is provided and you solve the task with it. Here we use Scipy's scipy.optimize.minimize instead.

from scipy import optimize

def compute_cost_sp(theta):
    global xmat, ymat, lamb
    j = compute_cost(theta, xmat, ymat, lamb)

    return j

def compute_grad_sp(theta):
    global xmat, ymat, lamb
    j_grad = compute_grad(theta, xmat, ymat, lamb)

    return j_grad

if __name__ == '__main__':
    x_raw, ymat = load_data(DATA)
    xmat = map_feature(x_raw[:,0], x_raw[:, 1])
    m, n = xmat.shape

    theta_ini = np.zeros((n, 1))
   
    lamb = 1.e-6
    print('initial cost ={:9.6f}'.format(
                            np.asarray(compute_cost_sp(theta_ini)).item()))

    res1 = optimize.minimize(compute_cost_sp, theta_ini, method='BFGS',
         jac=compute_grad_sp, options={'gtol': 1.e-8, 'disp': True})
    print('lambda ={:9.6f}, cost ={:9.6f}'.format(lamb, res1.fun))

Wrapper functions around the cost and gradient calculations are prepared and passed to Scipy's optimize.minimize() together with the options. (The wrappers take theta as their only argument, which makes it explicit that theta is the variable being optimized.) Running this gives the optimal solution as follows.

initial cost = 0.693147
Optimization terminated successfully.
         Current function value: 0.259611
         Iterations: 555
         Function evaluations: 556
         Gradient evaluations: 556
lambda = 0.000001, cost = 0.259611

The data set is small, but even with 555 iterations the solution was found in a very short time.
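As an aside, the wrappers with global variables are not the only way to hand the extra data to optimize.minimize(). A minimal sketch, assuming synthetic toy data (not the course data set) and a hypothetical helper cost_and_grad: the args option forwards the data and lambda to the function, and jac=True tells Scipy that the function returns the cost and the gradient together.

```python
import numpy as np
from scipy import optimize

def cost_and_grad(theta, xm, ym, lamb):
    """Return (cost, gradient) for regularized logistic regression."""
    m = xm.shape[0]
    h = 1.0 / (1.0 + np.exp(-xm.dot(theta)))
    cost = np.mean(-ym * np.log(h) - (1.0 - ym) * np.log(1.0 - h))
    cost += lamb / (2.0 * m) * np.sum(theta[1:] ** 2)
    grad = xm.T.dot(h - ym) / m
    grad[1:] += lamb / m * theta[1:]
    return cost, grad

# Made-up, noisy toy data (not the course data set)
rng = np.random.RandomState(0)
x_demo = rng.randn(50, 2)
xm_demo = np.hstack([np.ones((50, 1)), x_demo])
ym_demo = (x_demo[:, 0] + x_demo[:, 1] + 0.5 * rng.randn(50) > 0).astype(float)

# args is forwarded to cost_and_grad after theta
res = optimize.minimize(cost_and_grad, np.zeros(3),
                        args=(xm_demo, ym_demo, 1.0),
                        jac=True, method='BFGS')
print(res.fun, res.x)
```

Whether this reads better than explicit wrappers is a matter of taste; it does avoid module-level state.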

To backtrack a little: in this task, higher-order features are generated (mapped) from data with two features (x1, x2), and the calculation is done on those.

mapFeature(x) = [1,\ x_1,\ x_2,\ x_1^2,\ x_1x_2,\ x_2^2,\  ...\ ,\ x_2^6] ^ T

The Python code is as follows.

def map_feature(x1, x2):
    degree = 6   # according to the Coursera exercise
    m = len(x1)
    out = np.ones((m, 28))
    index = 1
    for i in range(1, degree+1):
        for j in range(i+1):
            out[:, index] = x1[:] ** (i-j) * x2[:] ** j
            index += 1

    return out

As shown above, the model considers terms up to the 6th order, including the interaction terms of x1 and x2, which expands the input to 28 features. This kind of processing appears in statistical modeling textbooks, but I feel there are few occasions to actually program it. The reason is that neural network models usually have two or more layers (this logistic regression is a single-layer model), and a multi-layer model can itself express higher-order relationships, so a step like mapFeature() is considered unnecessary. mapFeature() goes up to the 6th order; I would expect a 6-layer multi-layer perceptron (MLP) to achieve comparable classification performance instead.
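The number 28 is not arbitrary: it is the count of monomials x1^a * x2^b with a + b <= 6, that is (6 + 1)(6 + 2) / 2. The sketch below generalizes the mapping to an arbitrary degree so the count can be checked. The names map_feature_deg and n_features are my own, not from the course.

```python
import numpy as np

def map_feature_deg(x1, x2, degree=6):
    """Polynomial feature map up to `degree`, with a leading column of ones."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    cols = [np.ones_like(x1)]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            cols.append(x1 ** (i - j) * x2 ** j)
    return np.column_stack(cols)

def n_features(degree):
    # number of monomials x1^a * x2^b with a + b <= degree
    return (degree + 1) * (degree + 2) // 2

out = map_feature_deg([0.5, -1.0], [2.0, 3.0])
print(out.shape, n_features(6))
```

Collecting the columns in a list avoids hard-coding the output width, so the same function works for other degrees.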

Implementation using TensorFlow

Next, I implemented the same thing with TensorFlow. Because the Scipy.Optimize code was already written, this took little effort.

import tensorflow as tf

def compute_cost_tf(theta, x, y_, lamb):
    # theta: weights (incl. bias), x: features, y_: labels,
    # lamb: regularization coefficient
    logits = tf.matmul(x, theta)
    pred = tf.sigmoid(logits)

    xent = -1. * y_ * tf.log(pred) - (1. - y_) * tf.log(1. - pred)
    # xent = tf.nn.sigmoid_cross_entropy_with_logits(logits, y_)
    xent_mean = tf.reduce_mean(xent)
    L2_sqr = tf.nn.l2_loss(theta)   # use the argument, not the global w
    cost = xent_mean + lamb * L2_sqr

    return cost, pred

TensorFlow also provides the sigmoid function tf.sigmoid() and tf.nn.l2_loss(), which computes the weight decay term for regularization. Even better, the calculation xent = -1. * y_ * tf.log(pred) - (1. - y_) * tf.log(1. - pred) can be replaced by the function tf.nn.sigmoid_cross_entropy_with_logits().

(The [documentation](https://www.tensorflow.org/versions/r0.9/api_docs/python/nn.html#sigmoid_cross_entropy_with_logits) of this tf.nn.sigmoid_... function explains the cost function of logistic regression in detail and is a helpful reference.)
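One reason to prefer a "with logits" formulation is numerical stability. As far as I understand, the cross entropy can be rearranged to work directly on the logit z as max(z, 0) - z * y + log(1 + exp(-|z|)), which never takes the log of zero. The NumPy sketch below is my own illustration of that rearrangement, not TensorFlow's actual implementation; the function name and test values are made up.

```python
import numpy as np

def sigmoid_xent_with_logits(logits, labels):
    """Cross entropy computed from logits z = theta^T x instead of sigmoid(z)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return (np.maximum(logits, 0.0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

z = np.array([-2.0, 0.0, 3.0])
y = np.array([0.0, 1.0, 1.0])
h = 1.0 / (1.0 + np.exp(-z))
naive = -y * np.log(h) - (1.0 - y) * np.log(1.0 - h)
print(np.allclose(naive, sigmoid_xent_with_logits(z, y)))   # the two forms agree

# For a large logit the naive form would hit log(0); the rearranged one stays finite
print(sigmoid_xent_with_logits(np.array([40.0]), np.array([0.0])))
```

For moderate logits both forms give the same values, so the rearrangement only matters when the model becomes very confident.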

After that, I wrote the rest of the program in the usual TensorFlow way.

if __name__ == '__main__':
    x_raw, ymat = load_data(DATA)
    xmat = map_feature(x_raw[:,0], x_raw[:, 1])

    # Variables
    x = tf.placeholder(tf.float32, [None, 28])
    y_ = tf.placeholder(tf.float32, [None, 1])
    w = tf.Variable(tf.zeros([28, 1], tf.float32))

    lamb = 1.0 / len(xmat)

    cost, y_pred = compute_cost_tf(w, x, y_, lamb)
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

    delta = tf.abs((y_ - y_pred))
    correct_prediction = tf.cast(tf.less(delta, 0.5), tf.int32)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # Train
    init = tf.initialize_all_variables()

    with tf.Session() as sess:
        sess.run(init)

        print('Training...')
        for i in range(10001):
            batch_xs, batch_ys = xmat, ymat
            fd_train = {x: batch_xs, y_: batch_ys.reshape((-1, 1))}
            train_step.run(fd_train)                 
        
            if i % 1000 == 0:
                cost_step = cost.eval(fd_train)
                train_accuracy = accuracy.eval(fd_train)
                print('  step, loss, accuracy = %6d: %8.3f,%8.3f' % (i, 
                                                cost_step, train_accuracy))
        # final model's parameter
        w_np = sess.run(w)

The result depends on the regularization coefficient (lambda), but the classification result looks like the following figure.

Fig. Training data with decision boundary (cousera_ML_wk2.png)
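A decision boundary like the one in the figure is the set of points where theta^T mapFeature(x1, x2) = 0, i.e. where the predicted probability is exactly 0.5. The sketch below evaluates that expression on a grid. Note that theta_demo is a hypothetical parameter vector chosen so that the boundary is the unit circle; it is not the trained result from this article.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    cols = [np.ones_like(x1)]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            cols.append(x1 ** (i - j) * x2 ** j)
    return np.column_stack(cols)

# Hypothetical parameters, NOT the trained ones: z = -1 + x1^2 + x2^2,
# so the boundary z = 0 is the unit circle.
theta_demo = np.zeros(28)
theta_demo[0] = -1.0   # bias
theta_demo[3] = 1.0    # coefficient of x1^2
theta_demo[5] = 1.0    # coefficient of x2^2

u = np.linspace(-1.5, 1.5, 61)
uu, vv = np.meshgrid(u, u)
z = map_feature(uu.ravel(), vv.ravel()).dot(theta_demo).reshape(uu.shape)
pred = (z > 0).astype(int)      # sigmoid(z) > 0.5 is equivalent to z > 0

print(pred[30, 30], pred[30, 60])   # centre of the grid, right edge
```

Passing uu, vv and z to a contour routine (for example matplotlib's contour with levels=[0]) would draw the boundary itself.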

This time I looked back on the concept of cross entropy using the Week 2 materials of Coursera Machine Learning. It reminded me how good the course materials are. If you are interested but have not taken the course, I recommend finding the time to do so.

References / web site

- Coursera Machine Learning (by Prof. Ng), Week 2 assignment text (PDF)
- Deep learning (written by Mr. Okaya)
- Scipy documentation, Optimization and root finding (scipy.optimize) http://docs.scipy.org/doc/scipy/reference/optimize.html
- TensorFlow documentation https://www.tensorflow.org/
- Coursera Machine Learning Challenges in Python: ex2 (Logistic Regression) - Qiita http://qiita.com/nokomitch/items/40fb63c40baa0239fb83
- Enjoy Coursera / Machine Learning materials twice - Qiita http://qiita.com/TomokIshii/items/b22a3681cb17836c8f6e
