After a sentence is broken down into words, each word is converted into numbers. For example, this = [0.2, 0.4, 0.5], is = [0.1, 0.7, 0.35]. These numbers represent the features of each word, and vectors like [0.2, 0.4, 0.5] and [0.1, 0.7, 0.35] are called word vectors.
For example, suppose the only sentence you want to analyze this time is "I am John Cena."
I = [ 1 , 0 , 0 , 0 ]
am = [ 0 , 1 , 0 , 0 ]
John Cena = [ 0 , 0 , 1 , 0 ]
. = [ 0 , 0 , 0 , 1 ]
Each word can be converted into a one-hot vector like this. An encoder then transforms this word vector, turning the word into a feature representation. The vector obtained by passing a one-hot vector through the encoder is called an embedding vector.
As an example: I = [1, 0, 0, 0] ⇒ Encoder ⇒ $x_1$ = [0.3, -0.3, 0.6, 2.2]
This $x_1$ is the embedding vector.
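Here is a minimal sketch of this one-hot-to-embedding step in Python (NumPy). The embedding matrix is random purely for illustration; a real model learns these numbers during training:

```python
import numpy as np

vocab = ["I", "am", "John Cena", "."]

# One-hot vectors: a 1 in each word's own position, 0 elsewhere
one_hot = np.eye(len(vocab))
print(one_hot[0])            # "I" -> [1. 0. 0. 0.]

# The "encoder" here is just a weight matrix (random for illustration;
# a real model learns it during training)
np.random.seed(0)
E = np.random.randn(len(vocab), 4)

# Multiplying a one-hot vector by E simply picks out that word's row,
# which is that word's embedding vector
x1 = one_hot[0] @ E          # the embedding vector for "I"
print(x1)
```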
This idea is explained in detail here (I used it as a reference): https://ishitonton.hatenablog.com/entry/2018/11/25/200332
A Transformer takes a string as input and returns a string as output. Internally, it consists of a stack of encoders and decoders, as shown in the figure above. The input string first enters the encoder; the contents of the encoder are shown below.
This self-attention looks at the relationships between the words in the input string. A strong relationship between two words shows up as high similarity between their word vectors, and similarity is measured by the inner product of the vectors (a matrix product when done for all words at once). The result is then transformed by an ordinary feed-forward neural network.
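As a sketch of that idea, here is a bare-bones self-attention computation in NumPy. The word vectors are made-up numbers, and for simplicity the queries, keys, and values are the raw embeddings themselves; a real Transformer first multiplies them by learned weight matrices:

```python
import numpy as np

# Made-up 4-dim embedding vectors for the words "I", "am", "John Cena"
X = np.array([[0.3, -0.3, 0.6,  2.2],
              [0.1,  0.7, 0.35, 0.0],
              [1.0,  0.2, -0.5, 0.4]])

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Similarity between every pair of words = inner product of their vectors,
# scaled by sqrt(dimension) as in the Transformer paper
scores = X @ X.T / np.sqrt(X.shape[1])

# Softmax turns the similarities into attention weights that sum to 1 per word
weights = softmax(scores)

# Each word's new vector is a weighted mix of all the word vectors
output = weights @ X
print(weights.round(2))
```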
The decoder then uses the encoder's output to predict the next word.
This encoder-decoder attention (E-D attention) looks at the relationship between the input and the output.
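A minimal sketch of this encoder-decoder attention, again with made-up numbers: the queries come from the decoder side while the keys and values come from the encoder side, which is exactly how the output gets to look back at the input:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Encoder output: one made-up 4-dim vector per input word
enc = np.array([[0.3, -0.3, 0.6,  2.2],
                [0.1,  0.7, 0.35, 0.0]])

# Decoder state: made-up vector for the word generated so far
dec = np.array([[0.5, 0.1, -0.2, 0.9]])

# Queries come from the decoder, keys and values from the encoder,
# so each output position attends over the whole input sequence
scores = dec @ enc.T / np.sqrt(enc.shape[1])
context = softmax(scores) @ enc
print(context.round(2))
```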
That is a rough overview of the Transformer. If you want a proper explanation: https://qiita.com/omiita/items/07e69aef6c156d23c538
I mostly used this as a reference. Insanely easy to understand! https://www.youtube.com/watch?v=BcNZRiO0_AE
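Finally, for a concrete feel of the string-in, string-out behavior described above, here is a minimal sketch using PyTorch's built-in nn.Transformer. All the sizes are toy numbers, and the random tensors stand in for embedded token sequences:

```python
import torch
import torch.nn as nn

# Toy sizes: 16-dim model, 4 attention heads, 2 encoder and 2 decoder layers
model = nn.Transformer(d_model=16, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

# Random tensors standing in for embedded token sequences
src = torch.rand(5, 1, 16)   # input side: (source length, batch, d_model)
tgt = torch.rand(3, 1, 16)   # output side so far: (target length, batch, d_model)

out = model(src, tgt)        # decoder representations used to predict the next word
print(out.shape)             # torch.Size([3, 1, 16])
```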