This is a note about where I stumbled in Chapter 5 of "Deep Learning from Scratch ❷ --- Natural Language Processing", which I suddenly started studying.
The execution environment is macOS Catalina + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.
This chapter describes recurrent neural networks.
It explains the language model and the problems that arise when trying to use CBOW as a language model. I think the reason Equation 5.8 is an approximation is that CBOW ignores the order of the words in the context.
Since word2vec ignores word order, the RNN covered in this chapter would seem better suited for obtaining distributed representations. In reality, though, the RNN came first, and word2vec was proposed later in order to handle a larger vocabulary and improve quality, so the actual flow was the opposite. I find that interesting.
This is the explanation of the RNN itself. The tanh function (hyperbolic tangent) appears as the activation function, but for some reason this book does not explain it, so google "tanh" for details.
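For reference, its definition and its relation to the sigmoid function $ \sigma $ are:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1$$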
Something that bothered me a little more is the handling that wraps back to the beginning once the data has been used up to the end during mini-batch learning. This means the end of the corpus gets connected to its beginning. However, this book treats the PTB corpus as "one big piece of time-series data" in the first place and does not even consider sentence breaks (see the scorpion-marked note in the center of P.87), so it may be meaningless to worry about the end and the beginning being connected.
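The following toy sketch shows the kind of wrap-around indexing meant here (simplified from the chapter's training loop; the variable names are my own):

```python
import numpy as np

# Toy corpus of word IDs; in the book this is the whole PTB corpus.
corpus = np.arange(10)                  # [0, 1, ..., 9]
batch_size, time_size = 3, 4

jump = len(corpus) // batch_size        # reading positions start this far apart
offsets = [i * jump for i in range(batch_size)]

time_idx = 8                            # pretend we are already near the end of the data
xs = np.empty((batch_size, time_size), dtype=int)
for t in range(time_size):
    for i, offset in enumerate(offsets):
        # The modulo wraps an index past the end back to 0,
        # which is exactly where the end and the beginning get connected.
        xs[i, t] = corpus[(offset + time_idx + t) % len(corpus)]
print(xs)   # the first row reads 8, 9 and then wraps around to 0, 1
```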
The implementation needs a little care because Figure 5-19 and Figure 5-20 omit the Repeat node after the bias $ b $. Forward propagation can be implemented just as in the figures thanks to broadcasting, but in backpropagation you have to account for it consciously when computing $ db $. There was a QA on exactly this point on teratail (teratail: Why is db summed with axis=0 in the backpropagation of the RNN?).
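To see why that sum is needed, here is a minimal NumPy sketch (toy shapes, not the book's code):

```python
import numpy as np

N, H = 3, 4                      # batch size and hidden size (toy values)
t = np.random.randn(N, H)
b = np.random.randn(H)

# Forward: b is broadcast over the batch axis, which is the omitted Repeat node.
out = t + b

# Backward: a Repeat node's gradient is the sum over the repeated axis,
# so db gathers the upstream gradient along axis 0 to match b's shape (H,).
dout = np.random.randn(N, H)
db = dout.sum(axis=0)
```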
Also, the tanh function that appeared this time is used without explanation, but forward propagation can be computed with `numpy.tanh()` as in the book's code. For backpropagation, the part `dt = dh_next * (1 - h_next ** 2)` is the derivative of tanh, which is explained in detail in "Appendix A: Differentiation of the sigmoid and tanh functions" at the end of the book.
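Putting the two points above together, a single RNN time step looks roughly like this; it is a sketch reconstructed from the chapter's formulas, so it may differ in detail from the book's RNN class:

```python
import numpy as np

class RnnStepSketch:
    """One RNN time step: h_next = tanh(h_prev @ Wh + x @ Wx + b)."""
    def __init__(self, Wx, Wh, b):
        self.params = [Wx, Wh, b]
        self.cache = None

    def forward(self, x, h_prev):
        Wx, Wh, b = self.params
        t = np.dot(h_prev, Wh) + np.dot(x, Wx) + b   # b is broadcast over the batch
        h_next = np.tanh(t)
        self.cache = (x, h_prev, h_next)
        return h_next

    def backward(self, dh_next):
        Wx, Wh, b = self.params
        x, h_prev, h_next = self.cache
        dt = dh_next * (1 - h_next ** 2)   # derivative of tanh: 1 - tanh(t)^2
        db = np.sum(dt, axis=0)            # undo the broadcast (Repeat) of b
        dWh = np.dot(h_prev.T, dt)
        dWx = np.dot(x.T, dt)
        dh_prev = np.dot(dt, Wh.T)
        dx = np.dot(dt, Wx.T)
        return dx, dh_prev, dWx, dWh, db
```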
Also, on page 205, "..." (the 3-dot ellipsis) appears again, the same as the one that appeared on page 34. As I wrote in [Chapter 1 of this memo](https://qiita.com/segavvy/items/91be1d4fc66f7e322f25#13-%E3%83%8B%E3%83%A5%E3%83%BC%E3%83%A9%E3%83%AB%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF%E3%81%AE%E5%AD%A6%E7%BF%92), rather than just memorizing that the three dots overwrite the contents, it is better to understand the relationship between slices and views of an ndarray.
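A quick toy example of that difference (not from the book):

```python
import numpy as np

a = np.arange(6)
b = a[:3]          # a slice is a view: b shares memory with a
b[0] = 100
print(a)           # [100   1   2   3   4   5]  -> a changed too

c = np.zeros(3)
c[...] = a[:3]     # the 3-dot ellipsis overwrites c's existing buffer in place
c = a[:3]          # plain assignment only rebinds the name c to a new view
```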
I will omit an explanation of the code, since it is simple and easy to follow.
The Time Embedding layer (the TimeEmbedding class in common/time_layers.py) simply loops over $ T $ Embedding layers.
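The idea of the forward pass of that loop is roughly the following (forward only; the book's class also keeps the per-step Embedding layers for the backward pass):

```python
import numpy as np

class TimeEmbeddingSketch:
    """Sketch: one Embedding lookup per time step, repeated T times in a loop."""
    def __init__(self, W):
        self.W = W                               # word-vector matrix, shape (V, D)

    def forward(self, xs):                       # xs: word IDs, shape (N, T)
        N, T = xs.shape
        D = self.W.shape[1]
        out = np.empty((N, T, D), dtype=self.W.dtype)
        for t in range(T):
            out[:, t, :] = self.W[xs[:, t]]      # look up the t-th word of every sample
        return out
```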
In the Time Affine layer (the TimeAffine class in common/time_layers.py), instead of looping $ T $ times, the input of batch size $ N $ is reshaped into a batch of size $ N \times T $, everything is computed in a single Affine calculation, and the result is reshaped back into its original form, which makes it more efficient.
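A minimal sketch of that reshape trick (forward pass only, with illustrative names):

```python
import numpy as np

def time_affine_forward(x, W, b):
    """Fold the T axis into the batch axis, do one matrix product, then restore the shape."""
    N, T, D = x.shape
    rx = x.reshape(N * T, D)        # (N*T, D): one big batch instead of a loop over T
    out = np.dot(rx, W) + b         # a single Affine computation
    return out.reshape(N, T, -1)    # back to (N, T, output_dim)
```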
The Time Softmax with Loss layer (the TimeSoftmaxWithLoss class in common/time_layers.py) is as explained in the book, but what caught my attention is that a mask using `ignore_label` is implemented. When the correct label is -1, both the loss and the gradient are set to 0, and that step is excluded from the denominator $ T $ when calculating $ L $; however, so far there has been no processing that sets a correct label to -1. It may be used in a later chapter, so I will leave it alone for now.
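The masking itself boils down to something like this simplified sketch (the function name is mine, not the class's actual code):

```python
import numpy as np

def masked_average_loss(step_losses, ts, ignore_label=-1):
    """Average per-step losses, skipping steps whose correct label equals ignore_label."""
    mask = (ts != ignore_label)            # True for valid steps, False for ignored ones
    step_losses = step_losses * mask       # ignored steps contribute a loss of 0
    return step_losses.sum() / mask.sum()  # and are excluded from the denominator
```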
Unfortunately, this implementation does not give good results when the whole PTB dataset is used, so I also skipped trying it on the word-segmented Aozora Bunko text I played with in the previous chapter. It is said to be improved in the next chapter, so I will try it there.
As an aside, when I saw the code `rn = np.random.randn` in `SimpleRnnlm.__init__()`, I was reminded how convenient it is that Python can simply put a function into a variable and use it. In C, putting a function into a variable (storing a function's entry point in a variable) involves a lot of `*` and `()`, and using it is also complicated, so I was never really good at it in my active days :sweat:
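For example, it is just this (a trivial illustration, not from the book):

```python
import numpy as np

rn = np.random.randn    # bind the function object itself to a shorter name
x = rn(2, 3)            # call it exactly like np.random.randn(2, 3)
print(x.shape)          # (2, 3)
```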
I have managed to handle time series data.
That's all for this chapter. If you have any mistakes, I would be grateful if you could point them out.