**Why are distributed representations of words important for natural language processing?**
If you are reading this article, you may already be familiar with **Word2vec** (Mikolov et al., 2013). Word2vec lets you perform vector arithmetic that appears to capture the meaning of words. The famous example is that subtracting Man from King and adding Woman yields Queen (King - Man + Woman = Queen).
(Image from https://www.tensorflow.org/get_started/embedding_viz)

In fact, internally each word is represented by a vector of roughly 200 dimensions called a **distributed representation** (also known as an embedding), and it is these vectors that are added and subtracted. The characteristics of each word are thought to be encoded in this vector, which is why arithmetic on the vectors can produce meaningful results.
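As a minimal sketch of what this looks like in practice, the snippet below loads pretrained word2vec-format vectors with gensim and runs the King - Man + Woman analogy. The file path and vocabulary are assumptions; any pretrained word2vec-format file (for example the Google News vectors) would work the same way.

```python
from gensim.models import KeyedVectors

# Load pretrained word2vec-format vectors (the file path here is a placeholder).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# King - Man + Woman: add the "positive" words, subtract the "negative" one.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to contain something like [('queen', 0.7...)]
```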
**Distributed representations of words are an important technique that is widely used in today's natural language processing.** In recent years, a huge number of neural network (NN)-based models have been proposed in NLP research, and these models usually take distributed representations of words as input.
In this article, I will explain why distributed representations of words are important for natural language processing. I will first give a brief explanation of distributed representations so that we share a common understanding, then explain the main theme of why they matter for NLP, and finally discuss their remaining challenges.
Here, I will briefly explain the **distributed representation** of words. For comparison, I will also cover the **one-hot representation** of words, which helps illustrate the benefits of distributed representations. The flow is: first the one-hot representation and its problems, then the distributed representation.
The most straightforward way to represent a word as a vector is the one-hot representation. In a one-hot representation, exactly one element is 1 and all other elements are 0; each dimension simply indicates "is this that word or not".
For example, suppose we want a one-hot representation of the word python, and the vocabulary (the set of all words) consists of five words: (nlp, python, word, ruby, one-hot). Then the vector representing python has a 1 in the python dimension and 0 everywhere else: (0, 1, 0, 0, 0).
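As a small sketch, the code below builds one-hot vectors for this five-word vocabulary (the word order simply follows the list above):

```python
import numpy as np

vocab = ["nlp", "python", "word", "ruby", "one-hot"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 in the position of the given word and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("python"))  # [0. 1. 0. 0. 0.]
```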
The one-hot representation is simple, but it has the drawback that operations between vectors do not produce meaningful results. Suppose, for example, that you take the dot product to measure the similarity between words. Because different words have their 1 in different positions and 0 everywhere else, the dot product between any two different words is always 0, which is not what we want. In addition, since each word gets its own dimension, the vectors become extremely high-dimensional as the vocabulary grows.
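Continuing the sketch above, the dot product between the one-hot vectors of two different words is always zero, no matter how related the words actually are:

```python
import numpy as np  # `one_hot` is defined in the previous snippet

# Dot products between one-hot vectors of different words are always 0.
print(np.dot(one_hot("python"), one_hot("ruby")))  # 0.0
print(np.dot(one_hot("python"), one_hot("word")))  # 0.0
```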
Distributed representations, on the other hand, represent words as low-dimensional real-valued vectors, typically with about 50 to 300 dimensions. For example, the words mentioned earlier could be expressed roughly as follows in a distributed representation.
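The actual values depend on the training corpus and model; the vectors below are purely illustrative (and much shorter than the usual 50 to 300 dimensions), chosen so that related words end up close together:

```python
import numpy as np

# Illustrative 3-dimensional "distributed representations" (made-up values).
embeddings = {
    "nlp":     np.array([ 0.2,  0.9,  0.1]),
    "python":  np.array([ 0.8,  0.1,  0.7]),
    "word":    np.array([ 0.1,  0.7, -0.3]),
    "ruby":    np.array([ 0.7,  0.2,  0.6]),
    "one-hot": np.array([-0.2,  0.4,  0.3]),
}
```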
Distributed representations solve the problems of the one-hot representation. For example, you can compute the similarity between words through operations on their vectors: with the vectors above, the similarity between python and ruby comes out higher than the similarity between python and word. Also, the number of dimensions per word does not need to grow as the vocabulary grows.
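A quick check with cosine similarity on the illustrative vectors above (the exact numbers are only as meaningful as the made-up vectors, but the relative ordering is the point):

```python
import numpy as np  # `embeddings` is defined in the previous snippet

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["python"], embeddings["ruby"]))  # high (~0.99)
print(cosine_similarity(embeddings["python"], embeddings["word"]))  # low  (~-0.07)
```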
This section describes why distributed representations of words matter in natural language processing. I will first look at what is given as input to NLP tasks, then show that distributed representations are used as that input, and finally discuss how they affect task performance.
There are many different tasks in natural language processing, and a large number of them take a sequence of words as input. For document classification, the input is the set of words contained in the document; for part-of-speech tagging and named entity recognition, the input is a tokenized word sequence. A rough picture of these inputs is sketched below.
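The original figure is not reproduced here; the examples below are hypothetical inputs I made up to illustrate the same idea:

```python
# Hypothetical examples of word-level inputs for typical NLP tasks.
document_classification_input = ["the", "movie", "was", "really", "great"]        # -> label: positive
pos_tagging_input             = ["I", "love", "natural", "language", "processing"]  # -> one POS tag per word
ner_input                     = ["Guido", "works", "at", "Dropbox"]               # -> PER, O, O, ORG
```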
Modern natural language processing relies heavily on neural networks, and these too usually receive word sequences as input. The RNNs that have long been the standard choice take words as input, and the CNN-based models that have recently been attracting attention also typically operate at the word level.
In practice, the words fed to these neural networks are usually given as distributed representations[^1]. The expectation is that an input representation that better captures the meaning of words will also improve task performance. It is also common to use distributed representations learned from a large amount of unlabeled data as the initial values of the network's embedding layer and then fine-tune them with a small amount of labeled data.
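As a rough sketch of that initialization trick, the snippet below copies pretrained vectors into a PyTorch embedding layer and leaves it trainable so it can be fine-tuned on labeled task data. Here `vectors` is the gensim `KeyedVectors` object from the earlier sketch; the toy vocabulary and the rest of the setup are assumptions, not part of any specific model.

```python
import numpy as np
import torch
import torch.nn as nn

# Build the embedding matrix row by row from the pretrained vectors
# (`vectors` is the gensim KeyedVectors loaded earlier).
vocab = ["king", "queen", "man", "woman"]  # toy vocabulary (assumption)
weights = torch.from_numpy(np.stack([vectors[w] for w in vocab]))

# Initialize the embedding layer with pretrained weights; freeze=False keeps it
# trainable so it can be fine-tuned with a small amount of labeled data.
embedding = nn.Embedding.from_pretrained(weights, freeze=False)

word_ids = torch.tensor([vocab.index("queen")])
print(embedding(word_ids).shape)  # (1, 300) for 300-dimensional pretrained vectors
```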
The choice of distributed representation matters because it affects task performance, and using distributed representations has been reported to improve performance compared with not using them [2]. In short, distributed representations of words are important because they serve as the input to many tasks and have a considerable effect on how well those tasks perform.
That said, distributed representations of words are not a silver bullet for natural language processing. Many studies have shown that they have various problems. Here, I will introduce two of them.
The first issue is that good results on the standard evaluation datasets do not translate into the performance gains you would expect on an actual task (such as document classification). Distributed representations of words are typically evaluated by how well they correlate with a human-created word-similarity dataset (Schnabel, Tobias, et al., 2015). In other words, a model whose similarities correlate well with human judgments does not necessarily improve performance when its representations are used in a real task. The sketch below shows how this intrinsic evaluation is usually computed.
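A minimal sketch of this intrinsic evaluation, assuming a small hypothetical set of human-scored word pairs and the gensim `KeyedVectors` from earlier; real benchmarks such as WordSim-353 are evaluated the same way (with extra handling for words missing from the model):

```python
from scipy.stats import spearmanr

# Hypothetical human similarity judgments: (word1, word2, human_score).
pairs = [
    ("king", "queen", 8.5),
    ("computer", "keyboard", 7.0),
    ("cat", "banana", 1.5),
]

human_scores = [score for _, _, score in pairs]
model_scores = [vectors.similarity(w1, w2) for w1, w2, _ in pairs]  # cosine similarity

# Spearman rank correlation between the model's similarities and human judgments.
correlation, _ = spearmanr(human_scores, model_scores)
print(correlation)
```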
One reason is that most evaluation datasets do not distinguish between word similarity and word relatedness. For example, (male, man) are similar, whereas (computer, keyboard) are related but not similar. It has been reported that datasets which do make this distinction correlate positively with performance on actual tasks (Chiu, Billy, Anna Korhonen, and Sampo Pyysalo, 2016).
As a result, there are ongoing attempts to create evaluation datasets that correlate with actual tasks (Oded Avraham, Yoav Goldberg, 2016). These efforts address two problems in existing datasets: word similarity and relatedness are not distinguished, and annotation scores vary between evaluators.
In addition to building new evaluation datasets, there is also work on making it easier to evaluate distributed representations on actual downstream tasks (Nayak, Neha, Gabor Angeli, and Christopher D. Manning, 2016). This should make it easy to verify whether a learned distributed representation is effective for tasks close to the one you actually want to perform.
Personally, I hope that models that have been overlooked so far will be re-examined through evaluation on these new datasets and tasks.
The second issue is that current distributed representations do not take word ambiguity into account. Words have multiple senses: for example, the word "bank" can mean a financial institution but also the bank of a river. Representing a word with a single vector, ignoring this ambiguity, has inherent limits.
Several methods have been proposed to address this by learning a representation for each word sense [5] [6] [7] [8]. SensEmbed, for example, first disambiguates word senses and then learns a representation per sense. Learning sense-level representations has been reported to improve performance on word-similarity evaluations.
The following repository collects information on distributed representations of words and sentences, pretrained vectors, and Python implementations: awesome-embedding-models
A star would be much appreciated m(_ _)m
Distributed representations of words are an interesting and actively studied field. I hope this article helps you understand them.
The following Twitter account shares easy-to-understand summaries of the latest papers on **machine learning / natural language processing / computer vision**. If you found this article interesting, you may enjoy its content as well: @arXivTimes
I also tweet about machine learning and natural language processing from my own account, so I'd love to hear from anyone interested in this field: @Hironsan
[^1]: Strictly speaking, words are given to the network as one-hot vectors (or word indices) and then converted into distributed representations by the embedding layer.