(Regarding the benchmark test, see the addendum at the end of this article: [Addendum: benchmarking Theano GPU computation with attention to mini-batch size](http://qiita.com/TomokIshii/items/b5708a02895847e3588c#%E8%BF%BD%E8%A8%98-theano-gpu%E8%A8%88%E7%AE%97%E3%81%A7mini-batch%E3%82%B5%E3%82%A4%E3%82%BA%E3%81%AB%E7%9D%80%E7%9B%AE%E3%81%97%E3%81%A6%E3%83%99%E3%83%B3%E3%83%81%E3%83%9E%E3%83%BC%E3%82%AF).)
The other day, I posted to Qiita about porting the Matlab programming assignments of Coursera's Machine Learning course (Stanford University, Prof. Andrew Ng) to Python as a way to enjoy them. Continuing with the course afterward, I learned that there are no programming assignments from Week 10 onward. (Of the 11 weeks in total, programming assignments cover Weeks 1 through 9; quizzes continue through Weeks 10 and 11.)
Week 10, "Large Scale Machine Learning", had interesting lectures on Stochastic Gradient Descent (SGD) and online learning, but with no programming assignment I decided to study on my own. So I implemented SGD in Python, using it as an opportunity to learn the deep learning framework "Theano" along the way. I will also introduce some interesting findings (tips) from the benchmark tests run after implementing SGD.
In the lecture videos, stochastic gradient descent is explained in contrast with ordinary gradient descent (Batch Gradient Descent).
**Batch Gradient Descent**

Cost function:

```math
J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2
```

The following iterative calculation is performed to minimize this cost function.

Repeat {

```math
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \ \ \ \ (\textbf{for every } j=0, ..., n)
```

}
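As a concrete illustration, one batch update can be written in a few lines of NumPy. This is a minimal sketch for the linear hypothesis $h_\theta(x) = \theta^T x$ (my own illustration, not the Theano code used later):

```python
import numpy as np

def batch_gd_step(theta, X, y, alpha):
    # theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij
    m = len(y)
    grad = X.T.dot(X.dot(theta) - y) / m   # gradient averaged over all m examples
    return theta - alpha * grad
```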
**Stochastic Gradient Descent**

**1.** Randomly shuffle (reorder) the training examples.

**2.** Then update $\theta$ by referring to the training examples one at a time:

Repeat {

```math
for\ i := 1,...,m\ \{ \\
\ \ \ \theta_j := \theta_j - \alpha \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \ \ \ \ (\textbf{for every } j=0, ..., n) \\
\}
```

}
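In NumPy, one epoch of this per-example update might look as follows (a sketch under the same linear hypothesis as above):

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha, rng=np.random):
    idx = rng.permutation(len(y))    # step 1: random shuffle
    for i in idx:                    # step 2: one update per training example
        grad_i = (X[i].dot(theta) - y[i]) * X[i]
        theta = theta - alpha * grad_i
    return theta
```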
The lecture first explained this scheme of updating the parameters one training example at a time, and then introduced Mini-Batch Gradient Descent as a method between Batch Gradient Descent and Stochastic GD, as sketched below.
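Mini-Batch Gradient Descent sits between the two: each update averages the gradient over a small batch of examples (again a sketch under the same assumptions as above):

```python
import numpy as np

def minibatch_gd_epoch(theta, X, y, alpha, batch_size=10, rng=np.random):
    idx = rng.permutation(len(y))
    for s in range(0, len(y), batch_size):
        b = idx[s:s + batch_size]    # indices of one mini-batch
        grad = X[b].T.dot(X[b].dot(theta) - y[b]) / len(b)
        theta = theta - alpha * grad
    return theta
```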
As the data for checking the code, I selected the "Adult" dataset from the UCI Machine Learning Repository. It is extracted from the US Census database and appears to be a popular dataset in machine learning.
```
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
```
Age, educational background, occupation type, marital status, and so on are listed, and the last field of each line is the income-class label, "<=50K" or ">50K". This is the explained (target) variable used for classification. As for which explanatory variable (feature) to use in the regression, this time I chose just one: the number of years of schooling. Educational attainment has fairly fine resolution, and it seems plausible that it is linked to income in the real world.
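For reference, the `load_data()` function (omitted in the listing below) might read the file with pandas along these lines. This is only a sketch: the column positions follow the sample rows above, and the helper name is my own.

```python
import numpy as np
import pandas as pd

def load_adult(path='adult.data'):
    # No header row; column 4 is years of education ('education-num'),
    # column 14 is the income label ('<=50K' / '>50K').
    df = pd.read_csv(path, header=None, skipinitialspace=True)
    x = df[4].values.astype(float)                 # single feature: years of education
    y = (df[14] == '>50K').values.astype(float)    # 1.0 if income > 50K, else 0.0
    return x, y
```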
We start by defining a cost function and a function that computes its partial derivatives (gradient), following the approach of the earlier Coursera Machine Learning assignments.
```python
import numpy as np
import pandas as pd
import timeit

import theano
import theano.tensor as T

def load_data():
    # (omitted)
    return xtr, ytr, xte, yte

def compute_cost(w, b, x, y):
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))    # same as sigmoid(T.dot(x, w) + b)
    income_class = lambda predictor: T.gt(predictor, 0.5)   # 0.5 is the threshold
    prediction = income_class(p_1)
    xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)   # cross-entropy loss
    cost = xent.mean() + 0.01 * (w ** 2).sum()          # loss + L2 regularization
    return cost, prediction

def compute_grad(cost, w, b):
    gw, gb = T.grad(cost, [w, b])   # gradients w.r.t. w and b, in one line
    return gw, gb
```
A feature of the framework "Theano" is that once you get used to it (it can be hard to follow at first), you can express a computation very concisely. In particular, the gradient calculation can be done in a single line.
The main processing is performed using these functions.
```python
xtr, ytr, xte, yte = load_data()

# Declare Theano symbolic variables
xtr_shape = xtr.shape
if len(xtr_shape) == 2:
    w_len = xtr_shape[1]
else:
    w_len = 1
x = T.matrix('x')    # for xmat
y = T.vector('y')    # for ymat, labels
w = theano.shared(np.zeros(w_len), name='w')    # w, b <- all zero
b = theano.shared(0., name='b')

print ' Initial model: '
wi = w.get_value()
bi = b.get_value()
print 'w : [%12.4f], b : [%12.4f]' % (wi[0], bi)

cost, prediction = compute_cost(w, b, x, y)   # ... Cost-J
gw, gb = compute_grad(cost, w, b)             # ... Gradients

# Compile
train = theano.function(
    inputs=[x, y],
    outputs=[cost, prediction],
    updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)),
    allow_input_downcast=True)
predict = theano.function(inputs=[x], outputs=prediction,
    allow_input_downcast=True)

# Train (Optimization)
start_time = timeit.default_timer()
training_steps = 10000
xtr = xtr.reshape(len(xtr), 1)   # shape: (m,) to (m,1)
for i in range(training_steps):
    cost_j, pred = train(xtr, ytr)
```
As above, the parameters (w, b) that minimize the cost function were obtained by batch gradient descent. No convergence test is performed; the solution is obtained by updating the parameters a fixed number of times.
Now for the implementation of Stochastic Gradient Descent. The Coursera lecture covered both plain Stochastic Gradient Descent, which scans the training data one example at a time, and Mini-Batch Stochastic Gradient Descent, which scans small batches of roughly 2 to 100 examples. Here I chose the mini-batch variant.
In SGD, the training data is randomly shuffled as preprocessing. In addition, to speed up processing, I decided to put the data into Theano shared variables.
```python
def setup_data(xmat, ymat):
    # store the data into 'shared' variables to be accessible by Theano
    def shared_dataset(xm, ym, borrow=True):
        shared_x = theano.shared(np.asarray(xm, dtype=theano.config.floatX),
                                 borrow=borrow)
        shared_y = theano.shared(np.asarray(ym, dtype=theano.config.floatX),
                                 borrow=borrow)
        return shared_x, shared_y

    def data_shuffle(xm, ym, siz):
        idv = np.arange(siz)
        idv0 = np.array(idv)      # copy of the original index order
        np.random.shuffle(idv)
        xm[idv0] = xm[idv]
        ym[idv0] = ym[idv]
        return xm, ym

    total_len = ymat.shape[0]
    n_features = np.size(xmat) / total_len
    # Random Shuffle
    xmat, ymat = data_shuffle(xmat, ymat, total_len)
    train_len = int(total_len * 0.7)
    test_len = total_len - train_len

    xtr, ytr = shared_dataset((xmat[:train_len]).reshape(train_len, n_features),
                              ymat[:train_len])
    xte, yte = shared_dataset((xmat[train_len:]).reshape(test_len, n_features),
                              ymat[train_len:])
    rval = [(xtr, ytr), (xte, yte)]
    return rval
```
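Usage might look like this (a sketch; `xmat` and `ymat` are the NumPy arrays produced when reading 'adult.data'):

```python
datasets = setup_data(xmat, ymat)
(xtr, ytr), (xte, yte) = datasets   # each element is a Theano shared variable
```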
Because the data set now lives in shared variables, the definition of theano.function changes: the data must be supplied indirectly with the keyword **givens**, not directly with **inputs**.

Input via Theano variables (not shared variables), reprinted from above:
```python
# Compile
train = theano.function(
    inputs=[x, y],
    outputs=[cost, prediction],
    updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)),
    allow_input_downcast=True)
predict = theano.function(inputs=[x], outputs=prediction,
    allow_input_downcast=True)
```
Input data from shared variables (SGD version)
```python
# Compile
index = T.lscalar()          # symbolic mini-batch index (declaration assumed; not shown in the original)
learning_rate = T.scalar()   # symbolic learning rate, supplied at call time
batch_size = 10

train_model = theano.function(
    inputs=[index, learning_rate],
    outputs=[cost, prediction],
    updates=((w, w - learning_rate * gw), (b, b - learning_rate * gb)),
    givens=[(x, xtr[index * batch_size:(index + 1) * batch_size]),
            (y, ytr[index * batch_size:(index + 1) * batch_size])],
    allow_input_downcast=True
)
predict = theano.function(
    inputs=[],
    outputs=prediction,
    givens=[(x, xte)],
    allow_input_downcast=True
)
```
Iterative calculation is performed using the Theano function defined above.
```python
# Train (Optimization)
start_time = timeit.default_timer()
n_epochs = 20
epoch = 0
lrate_base = 0.03
lrate_coef = 20
n_train_batches = int(ytr.get_value().shape[0] / batch_size)

while (epoch < n_epochs):
    epoch += 1
    for mini_batch_index in range(n_train_batches):
        l_rate = lrate_base * lrate_coef / (epoch + lrate_coef)   # decaying learning rate
        cost_j, pred = train_model(mini_batch_index, l_rate)
    print 'epoch[%3d] : cost =%f ' % (epoch, cost_j)
```
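The learning-rate schedule above decays gradually with the epoch count. A quick check of the values it produces (my own verification, not part of the original post):

```python
lrate_base, lrate_coef = 0.03, 20
for epoch in (1, 5, 10, 20):
    l_rate = lrate_base * lrate_coef / (epoch + lrate_coef)
    print('epoch %2d : l_rate = %.4f' % (epoch, l_rate))
# epoch  1 : l_rate = 0.0286
# epoch  5 : l_rate = 0.0240
# epoch 10 : l_rate = 0.0200
# epoch 20 : l_rate = 0.0150
```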
Execution result.
```
Initial model:
w : [ 0.0000], b : [ 0.0000]
epoch[ 1] : cost =0.503755
epoch[ 2] : cost =0.510341
epoch[ 3] : cost =0.518218
epoch[ 4] : cost =0.524344
epoch[ 5] : cost =0.528745
epoch[ 6] : cost =0.531842
epoch[ 7] : cost =0.534014
epoch[ 8] : cost =0.535539
epoch[ 9] : cost =0.536614
epoch[ 10] : cost =0.537375
epoch[ 11] : cost =0.537913
epoch[ 12] : cost =0.538294
epoch[ 13] : cost =0.538563
epoch[ 14] : cost =0.538751
epoch[ 15] : cost =0.538880
epoch[ 16] : cost =0.538966
epoch[ 17] : cost =0.539021
epoch[ 18] : cost =0.539053
epoch[ 19] : cost =0.539067
epoch[ 20] : cost =0.539069
Final model:
w : [ 0.3680], b : [ -4.9370]
Elapsed time: 26.565 [s]
accuracy = 0.7868
```
I plotted how the parameters changed during the calculation.
**Fig. Plot for each epoch**

**Fig. Plot for each mini-batch**
When the resolution is increased (plotting per mini-batch rather than per epoch), the movement characteristic of stochastic gradient descent (SGD) can be observed.
Since the "Adult" dataset contains many independent variables (14 features), I next increased the number of features used in the calculation. In the code, only the input processing of the training data x changes. The features used are:

- Years of education (educational background; e.g., 12 years if educated through high school in Japan) (used in the first regression model)
- Role in the household (husband, wife, child, single, etc.) (added in this regression model)
- Working hours per week (added in this regression model)
I expected the classification accuracy to improve somewhat, but unfortunately it did not improve over the first regression model. (Since the purpose this time is the implementation itself, I did not examine the data-analysis results further.)
The results of the benchmark regarding the calculation time are as follows.
**Comparison of calculation time**
Optimization method | Number of features | CPU / GPU | Epochs | Mini-batch size | Time [s] |
---|---|---|---|---|---|
Batch Gradient Descent | 1 | CPU | 10,000 | - | 76.75 |
Batch Gradient Descent | 1 | GPU | 10,000 | - | 91.14 |
Stochastic Gradient Descent | 1 | CPU | 20 | 10 | 1.76 |
Stochastic Gradient Descent | 1 | GPU | 20 | 10 | 23.87 |
Stochastic Gradient Descent | 3 | CPU | 20 | 10 | 4.51 |
Stochastic Gradient Descent | 3 | GPU | 20 | 10 | 88.38 |
(Note) No convergence test is performed in any run; each calculation simply executes the specified number of loops. Batch Gradient Descent needed about 10,000 iterations to reach a converged solution (learning rate = 0.1).
Leaving the GPU results aside for a moment, the comparison of Batch G.D. vs. SGD shows a dramatic saving in computation time, confirming the high computational efficiency of SGD.
Now for the problem: the inefficiency of the GPU computation. Since the timer is started immediately before training and read immediately after it completes, the training part is certainly the cause. The usual suspect would be CPU computation (especially NumPy processing) mixed into the GPU portion. With that in mind I went through the code in detail but could not find the cause. (The code is modeled on the Theano Tutorial, Deep Learning 0.1 documentation, so I doubt there is a simple mistake.)
I then thought of the overhead of calling a Theano function, so I varied (increased) the mini-batch size and measured the computation time.
The horizontal axis is the mini-batch size and the vertical axis is the training time; the left plot uses a linear-linear scale and the right a log-linear scale. One must account for the number of loop iterations decreasing in proportion to the mini-batch size, but the result shows the computation time decreasing "exponentially", which suggests a large contribution from the overhead of calling the training function.
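To get a feel for this overhead, one can time a Theano function that does almost no work. This is a hypothetical micro-benchmark of my own, not from the original post:

```python
import timeit
import numpy as np
import theano
import theano.tensor as T

v = theano.shared(np.zeros(10, dtype=theano.config.floatX))
step = theano.function(inputs=[], outputs=v, updates=[(v, v + 1)])  # near-trivial work per call

n_calls = 10000
t = timeit.timeit(step, number=n_calls)
print('%.1f us per call' % (t / n_calls * 1e6))   # roughly the fixed cost of one function call
```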
In the Coursera lecture, it was explained that the mini-batch size should be chosen with the processor's parallel computation (vectorization) in mind, and that roughly 2 to 100 is practical. There also seems to be a suggestion that, for classification, the size be matched to the number of target classes (2 here for binary classification, 10 for MNIST handwritten-digit classification). This time's result, however, indicates that for GPU computation it is better to make the mini-batch reasonably large.
Since this is logistic regression, the amount of computation per batch is considerably smaller than for a neural network. I would like to investigate the effect of mini-batch size on a somewhat larger computation, such as a neural network, at a later date. Also, because the function-call overhead involves data transfer between memories, the situation may differ with the hardware. (Perhaps a laptop PC just isn't up to it?)
(Programming environment for this article: Python 2.7.8, Theano 0.7.0, CUDA Driver/Runtime 7.5/7.0.)
References:

- Coursera, Machine Learning (especially Week 10)
**Addendum: benchmarking Theano GPU computation with attention to mini-batch size**

In the article above, I wrote that stochastic gradient descent with mini-batches "seems to be affected by the mini-batch size". Having received a comment about this, I expanded the test conditions and ran the benchmark again.
As in the article, I selected "Adult" from the UCI Machine Learning Repository. The "Adult" data poses the problem of classifying whether a US resident's annual income is at most US$50k or above US$50k, based on attributes such as family structure and educational background. This time, two classification codes were used.
1. Classification by logistic regression. A regression model is created from 3 of the 14 features included in the "Adult" dataset. The data in the file 'adult.data' is split 70% / 30% into train and test data, respectively. (Last time I did not notice the test file 'adult.test' included in the dataset, hence this split.)
2. Classification by a Multi-Layer Perceptron (MLP) model. 11 of the 14 features included in the "Adult" dataset are selected and fed into the MLP model. The architecture is hidden layer 1 (22 units) + hidden layer 2 (20 units) + output layer (1 unit). The file 'adult.data' is used as train data and 'adult.test' as test data. The instance counts are Train: 32,561 and Test: 16,281.
The optimizer is stochastic gradient descent, adjusting the parameters while feeding the data in mini-batches.
The training data is treated as one set, divided into the specified mini-batch size, and fed to the classifier. One pass over the entire set is called an epoch; a fixed number of epochs (50 here) is computed without a convergence test. The code for that part follows.
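The code below references symbols built earlier (x, y_, cost, accur, one_update, trXs/trYs/teXs/teYs). As a hedged sketch of how the MLP graph and these pieces might be assembled (my own reconstruction, not the article's exact code):

```python
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(42)

def layer_params(n_in, n_out):
    # small random weights and zero biases for one layer
    w = theano.shared(np.asarray(rng.uniform(-0.1, 0.1, (n_in, n_out)),
                                 dtype=theano.config.floatX))
    b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX))
    return w, b

x = T.matrix('x')     # input features (11 per instance)
y_ = T.vector('y_')   # binary labels

w1, b1 = layer_params(11, 22)   # hidden layer 1: 22 units
w2, b2 = layer_params(22, 20)   # hidden layer 2: 20 units
w3, b3 = layer_params(20, 1)    # output layer: 1 unit

h1 = T.tanh(T.dot(x, w1) + b1)
h2 = T.tanh(T.dot(h1, w2) + b2)
p_1 = T.nnet.sigmoid(T.dot(h2, w3) + b3).flatten()

cost = T.nnet.binary_crossentropy(p_1, y_).mean()
accur = T.mean(T.eq(T.gt(p_1, 0.5), y_))   # classification accuracy

params = [w1, b1, w2, b2, w3, b3]
grads = T.grad(cost, params)
learning_rate = 0.05   # fixed rate, as an illustration
one_update = [(p, p - learning_rate * g) for p, g in zip(params, grads)]
```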
```python
#############################################
batch_size = 100
#############################################

# Compile
train_model = theano.function(
    inputs=[index],
    outputs=[cost, accur],
    updates=one_update,
    givens=[(x, trXs[index * batch_size:(index + 1) * batch_size]),
            (y_, trYs[index * batch_size:(index + 1) * batch_size])],
    allow_input_downcast=True
)
accuracy = theano.function(
    inputs=[],
    outputs=accur,
    givens=[(x, teXs), (y_, teYs)],
    allow_input_downcast=True
)

# Train (Optimization)
start_time = timeit.default_timer()
n_epochs = 50
epoch = 0
n_train_batches = int(trY.shape[0] / batch_size)

while (epoch < n_epochs):
    epoch += 1
    for mini_batch_index in range(n_train_batches):
        cost_j, accur = train_model(mini_batch_index)
    print('epoch[%3d] : cost =%8.4f' % (epoch, cost_j))

elapsed_time = timeit.default_timer() - start_time
print('Elapsed time: %10.3f [s]' % elapsed_time)

last_accur = accuracy()
print('Accuracy = %10.3f ' % last_accur)
```
One caveat concerns the number of mini-batches per epoch:
```python
# number of mini-batches = number of Train-data instances / mini-batch size
n_train_batches = int(trY.shape[0] / batch_size)
```
Any remainder is truncated, and this effect grows with the mini-batch size. For example, with 32,561 instances and a mini-batch size of 10,000, only 30,000 instances are referenced per epoch and 2,561 are skipped.
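The arithmetic can be checked directly (a small sketch using the instance count from the text):

```python
total = 32561   # Train instances in 'adult.data'
for batch_size in (10, 100, 1000, 10000):
    n_batches = total // batch_size
    skipped = total - n_batches * batch_size
    print('batch size %5d : %4d batches, %4d instances skipped'
          % (batch_size, n_batches, skipped))
# batch size 10000 yields 3 batches, with 2561 instances skipped per epoch
```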
The computer environment is as follows.
**Test results (raw data)** (unit: seconds [s])
batch_siz | Laptop_LR_fastc | Laptop_MLP_fastc | Laptop_LR_fastr | Laptop_MLP_fastr | Desktop_LR_fastr | Desktop_MLP_fastr |
---|---|---|---|---|---|---|
10 | 113.3 | 1546.6 | 108.8 | 362.7 | 15.3 | 57.4 |
20 | 56.9 | 758.6 | 55.5 | 176.1 | 8.0 | 28.5 |
50 | 22.6 | 321.6 | 22.2 | 91.4 | 3.2 | 16.6 |
100 | 11.6 | 159.8 | 11.5 | 47.0 | 3.1 | 8.6 |
200 | 6.2 | 77.0 | 5.9 | 23.8 | 1.6 | 4.5 |
500 | 4.4 | 30.6 | 4.3 | 7.9 | 1.0 | 1.8 |
1000 | 2.2 | 15.4 | 2.3 | 4.6 | 0.5 | 1.2 |
2000 | 1.2 | 9.3 | 1.3 | 3.5 | 0.3 | 0.9 |
5000 | 0.4 | 4.6 | 0.5 | 1.9 | 0.2 | 0.6 |
10000 | 0.3 | 4.0 | 0.4 | 1.6 | 0.1 | 0.5 |
Description of each column:

- batch_siz : mini-batch size
- Laptop_LR_fastc : logistic regression on the laptop PC, theano mode = fast_compile
- Laptop_MLP_fastc : MLP model classification on the laptop PC, theano mode = fast_compile
- Laptop_LR_fastr : logistic regression on the laptop PC, theano mode = fast_run
- Laptop_MLP_fastr : MLP model classification on the laptop PC, theano mode = fast_run
- Desktop_LR_fastr : logistic regression on the desktop PC, theano mode = fast_run
- Desktop_MLP_fastr : MLP model classification on the desktop PC, theano mode = fast_run
theano.config.mode is the option controlling the optimization level: 'fast_run' raises the optimization level for faster execution, while 'fast_compile' performs only some of the optimizations (shortening compilation time).
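For reference, the mode is typically selected via the THEANO_FLAGS environment variable, or programmatically before importing theano (a sketch; flag names as in standard Theano 0.7):

```python
# Set before the first `import theano`; equivalent to running e.g.
#   THEANO_FLAGS='mode=FAST_RUN,device=gpu,floatX=float32' python script.py
import os
os.environ['THEANO_FLAGS'] = 'mode=FAST_COMPILE'

import theano
print(theano.config.mode)   # confirms the active mode
```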
Next, we will look at the details while referring to the plot.
**Fig. Logistic Regression vs. MLP model (Laptop_LR_fastr vs. Laptop_MLP_fastr)**

The horizontal axis is the mini-batch size, and the vertical axis is the time spent in the learning part. First, comparing the two classification codes, logistic regression vs. the MLP model: as expected, the MLP takes roughly 3 to 4 times longer owing to its larger amount of computation. The influence of the mini-batch size is similar for both, with computation time decreasing as the mini-batch size grows.
**Fig. Theano mode FAST_COMPILE vs. FAST_RUN (Logistic Regression)**

This compares the mode FAST_COMPILE, which performs little CUDA-related optimization, with FAST_RUN, which optimizes more aggressively. As the figure shows, for logistic regression there is not much difference between the two.
**Fig. Theano mode FAST_COMPILE vs. FAST_RUN (MLP classification)**

In contrast, for the computationally heavier MLP classification, FAST_RUN's optimization pays off and reduces the computation time.
**Fig. Laptop PC vs. Desktop PC (MLP classification)**

This is thought to simply reflect the difference in hardware performance. (Both runs use Theano mode FAST_RUN. I have not examined the effect of the different OSes in detail, but it should be small.)
As seen above, under every condition the training time decreases significantly as the mini-batch size increases. The cause appears to be that with small mini-batches the number of calls to the function train_model() inside the training while loop grows, and the overhead of those function calls becomes dominant. (A profiler would be needed to investigate in more detail. I took a quick look with the standard cProfile and could identify time-consuming spots inside the Theano code, but gave up on the details for lack of skill.)
In this test (apart from the data "truncation" issue noted above), the amount of data fed is held to the same conditions throughout. In real training, however, the point is to raise the classifier's accuracy as quickly as possible, so strategies that adjust the calculation parameters (learning rate, optimizer parameters, etc.) per mini-batch are common. For efficient learning, it is important to set the mini-batch size appropriately, taking into account both such schemes and the GPU function-call overhead observed here.