I tried to implement a basic Recurrent Neural Network model

I'm interested in Recurrent Neural Networks (RNNs), but I have a hard time actually writing the code. I suspect this is a common situation. There are several possible reasons, but in my case I can think of the following two.

  1. The network configuration is simply more complicated. From MLP (Multi-Layer Perceptron) to CNN (Convolutional Neural Network), the signal flow is forward-only even when special layers are involved (excluding the error calculation); an RNN adds a recurrent flow on top of that.
  2. MLP and CNN have an easy-to-understand standard example, "MNIST" (often called the 'Hello World' of Deep Learning), but there is no such de facto standard example for RNNs.

Incidentally, both the Theano Deep Learning Tutorial and the TensorFlow tutorial cover language models. Readers familiar with language models can probably get started quickly, but beginners first need to understand what problem the example is actually trying to solve.

This time, instead of a language model, I took up an example that deals with a simpler kind of sequence and implemented a simple Recurrent Neural Network (RNN).

(The programming environment used is python 2.7.11, Theano 0.7.0.)

Simple RNN structure

To get a feel for RNNs, I first tried running the TensorFlow tutorial (ptb_word_lm.py). You can see the "perplexity" value decrease as the "epoch" count increases, but I could not understand the details of what was actually being solved. Since the model also uses LSTM (Long Short-Term Memory), I felt the bar for an introduction to RNNs was rather high.

The Elman net is introduced as a simple RNN in the "Deep Learning" text. Searching with the keyword "Elman RNN", I also found a blog called "Peter's note" (http://peterroelants.github.io/) that introduces simple RNNs, so I investigated it and used it as a reference for my program.

The figure of RNN is quoted from the above site.

Fig. Simple RNN structure SRNmodel2.png

Data enters at the input unit x, is multiplied by the weight W_in, and enters the hidden-layer unit s. The output of unit s also flows back recursively: the result of applying the weight W_rec returns to unit s at the next time step. Normally you would also need a weight W_out on the output, but to keep the structure simple it is fixed at W_out = 1.0, so the state of unit s is output as-is.

To apply BPTT (Backpropagation Through Time) to the state shown on the left, consider the "unrolled" state on the right. Starting from the initial value s_0, the hidden state moves to the right as time advances, being multiplied by the weight W_rec at each step, while [x_1, x_2, ..., x_n] is fed in at each time step. The state s_n at the final time step is output to the unit y.
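Written out, the recurrence that the unrolled figure (and the code below) implements is:

$$ s_k = x_k \cdot W_{in} + s_{k-1} \cdot W_{rec}, \qquad y = s_n \quad (W_{out} = 1.0) $$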

The above model translates into Python code as follows (quoted from "Peter's note").

import numpy as np

def update_state(xk, sk, wx, wRec):
    # One step of the recurrence: s_k = x_k * wx + s_{k-1} * wRec
    return xk * wx + sk * wRec

def forward_states(X, wx, wRec):
    # Initialise the matrix that holds all states for all input sequences.
    # The initial state s0 is set to 0.
    S = np.zeros((X.shape[0], X.shape[1]+1))
    # Use the recurrence relation defined by update_state to update the
    # states through time.
    for k in range(0, X.shape[1]):
        # S[k] = S[k-1] * wRec + X[k] * wx
        S[:,k+1] = update_state(X[:,k], S[:,k], wx, wRec)

    return S

What is the content of the example?

As for what kind of problem the above RNN model handles: the input is a sequence of binary values, X_k = 0. or 1., and the network outputs the total (sum) of those binary values. For example, for X = [0., 1., 0., 0., 0., 0., 0., 0., 0., 1.] the sum of the list is 2., so the correct output is Y = 2. The point of the example, of course, is to estimate this with an RNN (including its two weights) rather than with a "counting algorithm".

Since the output is a continuous numerical value, this can be treated as a kind of "regression" problem rather than a "classification" problem. Therefore MSE (mean squared error) is used as the cost function, and the unit values are passed through as-is, without an activation function.
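Concretely, the cost that appears later in the code is the squared error summed over the training samples, where y corresponds to the label y_ and ŷ to the model output y_hypo:

$$ \mathrm{loss} = \sum_{i} \left( y_i - \hat{y}_i \right)^2 $$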

First, training is performed on Train data (created in advance) to obtain the two weights [W_in, W_rec]. As you can easily guess from the figure above, the correct answer is [W_in, W_rec] = [1.0, 1.0]: with those values the state update becomes s_k = s_{k-1} + x_k, so the final state is exactly the sum of the inputs.
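As a quick sanity check of that claim, one can run the forward_states() function quoted above with these weights (my own snippet, assuming numpy and that function are in scope):

X = np.array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 1.]])   # one sample whose sum is 2.
S = forward_states(X, wx=1.0, wRec=1.0)
print(S[:, -1])   # [ 2.] -- with wx = wRec = 1.0 the final state equals the sum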

Preliminary study of model implementation

In the "Peter's note" article that I referred to, I used python (with numpy) to put together an IPython Notebook without using the Deep Learning library. If you copy this as it is, you can get the result as in the blog article, but considering the development, I tried to implement it using the Deep Learning library. I considered the following as options.

  1. Use "TensorFlow".
  2. Use "Theano".
  3. Use higher level (abstracted) libraries such as "Keras" and "Pylearn2".

At first I tried to port the original Python code to a TensorFlow version "line by line", but

    for k in range(0, X.shape[1]):
        # S[k] = S[k-1] * wRec + X[k] * wx
        S[:,k+1] = update_state(X[:,k], S[:,k], wx, wRec)
    
    return S

I found that the loop processing in this part could not be ported cleanly to a TensorFlow version. Judging from the TensorFlow tutorial code (ptb_word_lm.py etc.), this simple RNN model should of course be implementable, but the related class library is complicated and hard to follow, so I passed on TensorFlow this time.

Option 3, higher-level libraries such as "Keras" and "Pylearn2", was also not chosen this time because it deviates from the purpose of "understanding how an RNN is implemented".

In the end, I decided on option 2 and wrote a "Theano" version of the code.

“Theano scan” for RNN

What the Theano RNN code found on the net has in common is that almost all of it uses "theano.scan". theano.scan is a function for performing loop processing (repeated processing) and iteration processing (convergence calculations) within the Theano framework. Its specification is complicated, and it is hard to grasp even from the official documentation (Theano Documentation). Japanese-language information is also quite limited, so I investigated the behavior of theano.scan by trying small pieces of code in a Jupyter Notebook, referring to Mr. sinhrks' blog article.

import theano
import theano.tensor as T

a = T.iscalar('a')
n = T.iscalar('n')
result, updates = theano.scan(fn=lambda prior, nonseq: prior * 2,
                              sequences=None,
                              outputs_info=a,    # value from the previous loop --> prior
                              non_sequences=a,   # non-sequence (fixed) value   --> nonseq
                              n_steps=n)

myfun1 = theano.function(inputs=[a, n], outputs=result, updates=updates)
myfun1(5, 3)
# array([10, 20, 40])
# return-1 = 5 * 2
# return-2 = return-1 * 2
# return-3 = return-2 * 2

Execution result:

>>> array([10, 20, 40], dtype=int32)

I won't explain it in full detail, but will instead go through a few usage examples. theano.scan() takes the five kinds of arguments shown in the code above, summarized in the table below.

  Keyword         Contents                                                      Example of use
  fn              Function applied at each iteration                            fn=lambda prior, nonseq: prior * 2
  sequences       List or matrix-type variable fed in element by element        sequences=T.arange(x)
  outputs_info    Initial value of the sequential processing                    outputs_info=a
  non_sequences   Fixed value, not a sequence (unchanged across iterations)     non_sequences=a
  n_steps         Number of iterations                                          n_steps=n

In the code above, theano.scan() is given an initial value of 5 (not a sequence) and an iteration count of 3, and at each iteration the result of the previous step is multiplied by 2:

    1st iteration: 5 x 2 = 10
    2nd iteration: 10 x 2 = 20
    3rd iteration: 20 x 2 = 40

As a result, result = [10, 20, 40] is computed.
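For reference, the same computation written as a plain Python loop (my own illustration of what scan is doing in this "no sequences" case, not Theano code):

def scan_like(fn, n_steps, output_info, non_sequence):
    # mimic theano.scan with no sequences: feed the previous output (prior)
    # and the fixed non-sequence value into fn at every step
    outputs, prior = [], output_info
    for _ in range(n_steps):
        prior = fn(prior, non_sequence)
        outputs.append(prior)
    return outputs

print(scan_like(lambda prior, nonseq: prior * 2, n_steps=3, output_info=5, non_sequence=5))
# [10, 20, 40]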

The following is a slightly more RNN-like test.

v = T.matrix('v')
s0 = T.vector('s0')
result, updates = theano.scan(fn=lambda seq, prior: seq + prior * 2,
                              sequences=v,
                              outputs_info=s0,
                              non_sequences=None)
myfun2 = theano.function(inputs=[v, s0], outputs=result, updates=updates)

myfun2([[1., 0.], [0., 1.], [1., 1.]], [0.5, 0.5])

Execution result:

>>> array([[ 2.,  1.],
       [ 4.,  3.],
       [ 9.,  7.]], dtype=float32)

The initial value [0.5, 0.5] is fed into the function. Since we defined fn = lambda seq, prior: seq + prior * 2, the calculation proceeds as follows:

    1st iteration: [1., 0.] + [0.5, 0.5] x 2 = [2., 1.]
    2nd iteration: [0., 1.] + [2., 1.] x 2 = [4., 3.]
    3rd iteration: [1., 1.] + [4., 3.] x 2 = [9., 7.]

"theano.scan ()" is a function that supports the flow control of processing required by RNN. Similar functionality is not currently supported for TensorFlow,

Our white paper mentions a number of control flow operations that we've experimented with -- I think once we're happy with its API and confident in its implementation we will try to make it available through the public API -- we're just not quite there yet. It's still early days for us :)

(Quoted from a discussion in GitHub TensorFlow issue #208.)

So I would like to wait for future support.

(I do not know how TensorFlow's own RNN models are implemented, but the fact that RNN computation is already possible there means that a "theano.scan()"-like function is not strictly essential. I will need to study the TensorFlow sample code a bit more on this point.)

Simple RNN code details using Theano

Now that we have a feel for theano.scan(), let's look at the simple RNN code. First, the simpleRNN class is defined.

class simpleRNN(object):
    #   members:  slen  : state length
    #             w_x   : weight of input-->hidden layer
    #             w_rec : weight of recurrence
    def __init__(self, slen, nx, nrec):
        self.len = slen
        self.w_x = theano.shared(
            np.asarray(np.random.uniform(-.1, .1, (nx)),
            dtype=theano.config.floatX)
        )
        self.w_rec = theano.shared(
            np.asarray(np.random.uniform(-.1, .1, (nrec)),
            dtype=theano.config.floatX)
        )
    
    def state_update(self, x_t, s0):
        # this is the network updater for simpleRNN
        def inner_fn(xv, s_tm1, wx, wr):
            s_t = xv * wx + s_tm1 * wr
            y_t = s_t
            
            return [s_t, y_t]
        
        w_x_vec = T.cast(self.w_x[0], 'float32')
        w_rec_vec = T.cast(self.w_rec[0], 'float32')

        [s_t, y_t], updates = theano.scan(fn=inner_fn,
                                    sequences=x_t,
                                    outputs_info=[s0, None],
                                    non_sequences=[w_x_vec, w_rec_vec]
                                   )
        return y_t

The class is defined with the state length and the weights (w_x, w_rec) as its members. The class method state_update() updates the network state given the initial state s0 and the input sequence x_t, and computes the output sequence y_t. y_t is a vector, but in the main processing only its final value, y = y_t[-1], is extracted and used to compute the cost function.

In the main processing, the data used for training is created first (almost exactly as in the original "Peter's note").

    np.random.seed(seed=1)

    # Create Dataset by program
    num_samples = 20
    seq_len = 10
    
    trX = np.zeros((num_samples, seq_len))
    for row_idx in range(num_samples):
        trX[row_idx,:] = np.around(np.random.rand(seq_len)).astype(int)
    trY = np.sum(trX, axis=1)
    trX = trX.astype(np.float32)
    trX = trX.T                    # need 'List of vector' shape dataset
    trY = trY.astype(np.float32)
    # s0 is time-zero state 
    s0np = np.zeros((num_samples), dtype=np.float32)

trX is series data of length 10, with 20 samples. The point here is that the matrix is transposed with trX = trX.T. In a typical machine learning data set, the features of one sample are laid out horizontally (columns) and the samples are stacked vertically (rows).

  Data Set Shape
                  feature1   feature2   feature3  ...
     sample1:        -          -          -
     sample2:        -          -          -
     sample3:        -          -          -
       .
       .

This time, however, to update the time-series data with theano.scan(), the data had to be grouped by time step before being passed in.

(Grouping the data as follows is consistent with how theano.scan() operates.)
  Data Set Shape (updated)
               [  time1[sample1,  time2[sample1,  time3[sample1 ...    ]
                        sample2,        sample2,        sample2,
                        sample3,        sample3,        sample3,
                         ...    ]         ...   ]         ...    ]

To achieve this easily, the matrix is transposed before being passed as input to theano.scan().
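A quick shape check of that transposition, written by me in plain numpy using the shapes from the data-creation code above:

import numpy as np
trX = np.zeros((20, 10), dtype=np.float32)   # (num_samples, seq_len)
trX = trX.T                                   # (seq_len, num_samples)
print(trX.shape)   # (10, 20)
# theano.scan() iterates over the first axis, so each scan step now
# receives one time slice containing all 20 samples at once.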

After this, the cost loss is built as a Theano graph from the model output y_hypo and the Train data labels y_.

    # Tensor Declaration
    x_t = T.matrix('x_t')
    x = T.matrix('x')
    y_ = T.vector('y_')
    s0 = T.vector('s0')
    y_hypo = T.vector('y_hypo')

    net = simpleRNN(seq_len, 1, 1)  
    y_t = net.state_update(x_t, s0)
    y_hypo = y_t[-1]
    loss = ((y_ - y_hypo) ** 2).sum()

Once you reach this point, you can proceed with learning in a familiar way.

    # Train Net Model
    params = [net.w_x, net.w_rec]
    optimizer = GradientDescentOptimizer(params, learning_rate=1.e-5)
    train_op = optimizer.minimize(loss)

    # Compile ... define theano.function 
    train_model = theano.function(
        inputs=[],
        outputs=[loss],
        updates=train_op,
        givens=[(x_t, trX), (y_, trY), (s0, s0np)],
        allow_input_downcast=True
    )
    
    n_epochs = 2001
    epoch = 0
    
    w_x_ini = (net.w_x).get_value()
    w_rec_ini = (net.w_rec).get_value()
    print('Initial weights: wx = %8.4f, wRec = %8.4f' \
                % (w_x_ini, w_rec_ini))
    
    while (epoch < n_epochs):
        epoch += 1
        loss = train_model()
        if epoch % 100 == 0:
            print('epoch[%5d] : cost =%8.4f' % (epoch, loss[0]))
    
    w_x_final = (net.w_x).get_value()
    w_rec_final = (net.w_rec).get_value()
    print('Final weights : wx = %8.4f, wRec = %8.4f' \
                % (w_x_final, w_rec_final))

This time I prepared and used two optimizers: Gradient Descent and RMSProp (RMSPropOptimizer). (The code for the optimizer part is omitted here; for the RMSProp method, see the websites listed later.)
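Since that code is omitted, here is a minimal sketch of what such optimizer classes could look like. This is my own assumption of the interface used above (a minimize() method returning a list of Theano update pairs), not the author's actual implementation:

import numpy as np
import theano
import theano.tensor as T

class GradientDescentOptimizer(object):
    def __init__(self, params, learning_rate=0.01):
        self.params = params
        self.lr = learning_rate

    def minimize(self, loss):
        # plain gradient descent: p <- p - lr * dL/dp
        grads = T.grad(loss, self.params)
        return [(p, p - self.lr * g) for p, g in zip(self.params, grads)]

class RMSPropOptimizer(object):
    def __init__(self, params, learning_rate=0.001, decay=0.9, eps=1e-6):
        self.params = params
        self.lr = learning_rate
        self.decay = decay
        self.eps = eps

    def minimize(self, loss):
        # RMSProp: scale each gradient by a running average of its magnitude
        grads = T.grad(loss, self.params)
        updates = []
        for p, g in zip(self.params, grads):
            acc = theano.shared(np.zeros(p.get_value().shape,
                                         dtype=theano.config.floatX))
            acc_new = self.decay * acc + (1.0 - self.decay) * g ** 2
            updates.append((acc, acc_new))
            updates.append((p, p - self.lr * g / T.sqrt(acc_new + self.eps)))
        return updates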

Execution result

Statements that "RNNs are generally difficult to train" can be found in many places, and these results brought that home.

Condition 1. Gradient Descent, Learning Rate = 1.0e-5

Initial weights: wx =   0.0900, wRec =   0.0113
epoch[  100] : cost =529.6915
epoch[  200] : cost =504.5684
epoch[  300] : cost =475.3019
epoch[  400] : cost =435.9507
epoch[  500] : cost =362.6525
epoch[  600] : cost =  0.2677
epoch[  700] : cost =  0.1585
epoch[  800] : cost =  0.1484
epoch[  900] : cost =  0.1389
epoch[ 1000] : cost =  0.1300
epoch[ 1100] : cost =  0.1216
epoch[ 1200] : cost =  0.1138
epoch[ 1300] : cost =  0.1064
epoch[ 1400] : cost =  0.0995
epoch[ 1500] : cost =  0.0930
epoch[ 1600] : cost =  0.0870
epoch[ 1700] : cost =  0.0813
epoch[ 1800] : cost =  0.0760
epoch[ 1900] : cost =  0.0710
epoch[ 2000] : cost =  0.0663
Final weights : wx =   1.0597, wRec =   0.9863

As a result of training, we obtained values close to the correct answer [w_x, w_rec] = [1.0, 1.0]. The figure below shows how the cost function decreases.

Fig. Loss curve (GradientDescent) rnn_loss_log1.PNG

Condition 2. RMSProp method, learning rate = 0.001

Initial weights: wx =   0.0900, wRec =   0.0113
epoch[  100] : cost =  5.7880
epoch[  200] : cost =  0.3313
epoch[  300] : cost =  0.0181
epoch[  400] : cost =  0.0072
epoch[  500] : cost =  0.0068
epoch[  600] : cost =  0.0068
epoch[  700] : cost =  0.0068
epoch[  800] : cost =  0.0068
epoch[  900] : cost =  0.0068
epoch[ 1000] : cost =  0.0068
epoch[ 1100] : cost =  0.0068
epoch[ 1200] : cost =  0.0068
epoch[ 1300] : cost =  0.0068
epoch[ 1400] : cost =  0.0068
epoch[ 1500] : cost =  0.0068
epoch[ 1600] : cost =  0.0068
epoch[ 1700] : cost =  0.0068
epoch[ 1800] : cost =  0.0068
epoch[ 1900] : cost =  0.0068
epoch[ 2000] : cost =  0.0068
Final weights : wx =   0.9995, wRec =   0.9993

Fig. Loss curve (RMSProp) rnn_loss_log2.PNG

In this model, the cost function is strongly non-linear with respect to the parameters. With gradient descent the values diverge as soon as the learning rate is increased, so a very small learning rate of 1.0e-5 was required. On the other hand, with the RMSProp method, which is said to be well suited to RNNs, training proceeded without problems even at a learning rate of 0.001.

(Supplement) The referenced "Peter's note" blog gives a detailed explanation of the shape of the cost function and of RMSProp (called "Rprop" in the original post). The non-linearity of the cost function is visualized there with color shading, so please have a look if you are interested (see the references below).

References (web site)
