Unbearable shortness of Attention in natural language processing

"You only hear the last part of my words."

In fact, the paper below shows that this is true not only of humans but of neural networks as well.

Frustratingly Short Attention Spans in Neural Language Modeling

The excuse is "because that's enough to predict your next word," but it seems the point is the same in relationships and in research.

In this article, I would like to introduce the paper above along with related work, and look at whether only the most recent context is really all that is needed and, if so, why.

The referenced papers are collected in the following GitHub repository. It is updated daily, so if you are interested in research trends, please Star & Watch!

arXivTimes

What is Attention

Attention is a mechanism for focusing on (= attending to) the important points in the past when dealing with sequential data. Intuitively, it is like paying attention to a specific keyword in the other person's question when you answer it. As this example suggests, it is a widely used technique in natural language processing.

The figure below shows a case where, to compute the output representation $h^*$ (red box), the past five hidden states ($h_2$–$h_6$) are referenced. The values $a_1$–$a_5$ written on the arrows from the past hidden states are the Attention weights, which express how important each point in the past is.

(From Figure 1: Memory-augmented neural language modelling architectures.)
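To make this concrete, here is a minimal NumPy sketch of the idea (my own toy example, not the paper's exact formulation): Attention weights are computed over five past hidden states and used to blend them into a context vector. The dimensions and the dot-product scorer are my assumptions; the paper itself uses a small feed-forward scorer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: 5 past hidden states (h_2 ... h_6 in the figure), 100-dim each.
hidden_dim = 100
past_hiddens = np.random.randn(5, hidden_dim)   # h_2 ... h_6
current = np.random.randn(hidden_dim)           # the state used to score the past

# Score each past hidden state against the current one (dot-product scoring
# is one common choice; an additive feed-forward scorer is another).
scores = past_hiddens @ current                 # shape (5,)
attention = softmax(scores)                     # a_1 ... a_5, they sum to 1

# The context vector is the attention-weighted sum of the past hidden states.
context = attention @ past_hiddens              # shape (100,)
print(attention, context.shape)
```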

The paper's proposal: divide up the roles of the hidden state

With the advent of Attention, the number of roles the hidden state plays in an RNN has grown. On top of its original job of encoding information to predict the next word, it must also carry information that will be useful for future predictions (the content retrieved by Attention). Furthermore, since the Attention weights themselves are computed from the hidden state, it must also encode whether it is something worth attending to later.

In other words, in an RNN with Attention, the hidden state plays the following three roles:

  1. Encoding information for predicting the next word
  2. Encoding information for deciding whether it should be attended to later (computing the Attention weights)
  3. Encoding information that is useful for future predictions

This is what you might call a one-person operation inside the neural network. Wouldn't it be better to share the work a little? That is what this paper proposes.

(Figure: the hidden state split into p / k / v parts)

Orange (p) takes role 1, green (k) takes role 2, and blue (v) takes role 3. These are simply concatenated vectors: if the original hidden state was 100-dimensional, the implementation triples it to 3 × 100 = 300 dimensions.
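As a rough sketch of the split (my own simplification with assumed dimensions and a dot-product scorer, not the authors' code), the key-value-predict idea just means slicing each hidden state into three parts and using each part for exactly one of the roles above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

part = 100                              # size of each role's slice (assumed)
h_t = np.random.randn(3 * part)         # current 300-dim hidden state
past = np.random.randn(5, 3 * part)     # the past 5 hidden states

# Split every hidden state into its (predict, key, value) slices.
p_t, k_t, v_t = np.split(h_t, 3)
p_past, k_past, v_past = np.split(past, 3, axis=1)

# Role 2: keys decide where to attend (dot-product scoring as a stand-in
# for the paper's feed-forward scorer).
attention = softmax(k_past @ k_t)
# Role 3: values carry the content that is actually retrieved.
context = attention @ v_past
# Role 1: the predict slice, combined with the retrieved context, feeds
# the next-word prediction layer.
features = np.concatenate([p_t, context])
print(features.shape)                   # (200,)
```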

When this was evaluated on a Wikipedia corpus and on the Children's Book Test, a children's book corpus, the proposed model generally outperformed existing models. However, one fact became clear in the course of the evaluation.

Does Attention only look at the most recent positions?

(Figure: Attention weights over positions -1 to -15, sampled from the Wikipedia corpus)

This figure shows the Attention weights at prediction time, randomly sampled from the Wikipedia corpus used in the experiments. Reading from the right, the positions run from -1 to -15, where -1 is one step back, -2 is two steps back, and so on; the darker the color, the more important the position.

Looking at this, you can see that position -1, that is, the most recent step, is overwhelmingly important, and anything further back is hardly referenced at all.

(Figure: distribution of Attention weights by position)

This is a more detailed view, and you can see that the positions with high weights are concentrated around -1 to -5. In fact, the optimal Attention window size (how far back it looks) turned out to be 5.

Does that mean ...?

(Figure: the N-gram RNN)

This is an RNN built like an ordinary N-gram model: if Attention only looks at the last five steps anyway, then the last few hidden states can simply be used as they are for prediction.

$$
h^*_t = \tanh \left( W^N
\begin{bmatrix}
h^1_t \\
h^2_{t-1} \\
\vdots \\
h^{N-1}_{t-N+2}
\end{bmatrix}
\right)
$$

(Excerpt from Equation 13 of the paper.)
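As a sketch of what this equation computes (again in NumPy, with dimensions I am assuming for illustration), the N-gram RNN output is just a learned transform of slices taken from the last N-1 hidden states:

```python
import numpy as np

N = 4                        # a 4-gram RNN uses the last N - 1 = 3 hidden states
part = 100                   # size of each slice h^i (assumed, as above)
hidden_dim = (N - 1) * part

# Hypothetical past hidden states h_t, h_{t-1}, h_{t-2} (most recent first).
h = [np.random.randn(hidden_dim) for _ in range(N - 1)]
W = np.random.randn(hidden_dim, hidden_dim)    # the matrix W^N in the equation

# Take slice i from the hidden state i steps back, then concatenate the slices.
slices = [np.split(h[i], N - 1)[i] for i in range(N - 1)]
h_star = np.tanh(W @ np.concatenate(slices))   # the output representation h^*_t
print(h_star.shape)                            # (300,)
```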

As a result, this simple model is reported to surpass elaborate RNNs in accuracy and to come close to the method proposed in this study.

(Table: values are perplexity; lower is better. Key-Value-Predict is the method proposed in this research, and 4-gram is the model that simply reuses past hidden states.)

What is this!

(Image from Shadow Hearts 2)

And on that note, the curtain closes.

Two issues that create the unbearable shortness of Attention

To begin with, there are two possible issues behind this outcome.

The first is a problem with the task setting: the task may simply not have required long-range dependencies in the first place, which would explain the result. The same thing happened with a dataset built by DeepMind, whose weaknesses were previously pointed out by Stanford.

From "Toward acquiring the ability to read and understand text: research trends in Machine Comprehension".


DeepMind had succeeded in automatically building a training dataset from CNN news articles, but when this data was examined more closely...


...it turned out that a simple model could overwhelm the neural network. On inspection, there were few questions that required long-range dependencies or an understanding of context, so even a simple model could achieve sufficient accuracy.


In other words, the task in this case, too, may have been one that a simple model could answer well enough, which would explain both why a simple model could achieve high accuracy and why Attention stayed within such a short range. In response to this point, datasets that demand a deeper level of understanding have been developed recently: Stanford's SQuAD, Salesforce's WikiText, and many others were released in the last year alone (is there anything like this in Japanese...?).

The other issue is that the networks may simply not be capturing long-range dependencies well. Part of this is the shortage, noted above, of data that actually requires such dependencies, but there also seems to be room for improvement in the network architecture itself.

One recent trend is to give the network external memory.

There are also attempts to change the architecture so that longer-term dependencies can be captured.

This is a study on audio, and with audio the data density is quite high (ordinary music has nearly 40,000 samples per second), so the need to capture long-term dependencies is even greater. In that sense, architectures suited to capturing long-term dependencies may well appear first in the audio domain. (The paper opens with a remark to the effect of "WaveNet is fine, but I don't think a CNN can really capture long-term dependencies," which I find rather fiery.)

The proposed network stacks RNNs in a pyramid-like hierarchy, with the upper tiers responsible for longer-range dependencies. The idea is that roles are divided according to the length of the dependency each tier is in charge of.

(Figure: the hierarchical RNN architecture)
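As a very rough sketch of that idea (my own toy example, not the architecture from the audio paper), a lower RNN can run at every step while an upper RNN updates only every k steps, so the upper state naturally covers longer spans:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    """One step of a plain tanh RNN cell."""
    return np.tanh(Wx @ x + Wh @ h)

dim, k, T = 32, 8, 64             # state size, upper-tier stride, sequence length
rng = np.random.default_rng(0)
Wx_lo, Wh_lo = rng.normal(size=(dim, 2 * dim)), rng.normal(size=(dim, dim))
Wx_hi, Wh_hi = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))

x = rng.normal(size=(T, dim))     # input sequence (e.g. frames of audio features)
h_lo = np.zeros(dim)              # fast tier: updated every step
h_hi = np.zeros(dim)              # slow tier: updated every k steps

for t in range(T):
    if t % k == 0:
        # One upper-tier step spans k lower-tier steps, so its state
        # naturally summarizes longer-range structure.
        h_hi = rnn_step(h_lo, h_hi, Wx_hi, Wh_hi)
    # The lower tier sees the current input plus the slow tier's summary.
    h_lo = rnn_step(np.concatenate([x[t], h_hi]), h_lo, Wx_lo, Wh_lo)
```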

By the way, speech synthesis using this model has also been proposed.

There have also been attempts to search for cell structures to replace the LSTM commonly used in RNNs, but research has shown that the LSTM, and its simplified variant the GRU, are already quite good and not easy to improve upon.

For that reason, my impression is that it is more promising to rework the overall network configuration, including externalizing memory, than to tweak the cell itself.

Research is thus still under way from several angles. I will keep updating as the story develops beyond this ending.
