This article is for people who want to experiment with deep learning models but are unsure how to implement them. Using the Keras functional API, which is relatively flexible yet reasonably abstracted, we implement seq2seq, which is hard to build with the Sequential API, as simply as possible.
Keras lets us implement deep learning models, but before training comes preprocessing: how do we convert raw data into a format that Keras can actually consume? Answering that question is the main topic of this article.
When the translation model infers the first word of a sentence, it is given the virtual start token <start> as input, and it learns to emit the virtual end token <end> when the sentence is finished, so both tokens are added to every sentence.
To feed a machine learning model, the loaded string data must be converted into numbers in some way. Bag of words and one-hot encoding of each word are well-known approaches. Here, because I want to use Keras's Embedding layer at the start of the network, each word is assigned a word ID and every sentence is converted into a sequence of word IDs.
Embedding layer https://keras.io/ja/layers/embeddings/
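As a point of reference, here is a minimal sketch of how the Embedding layer consumes such word ID sequences; vocab_size and embedding_dim are hypothetical values, not taken from this article.

```python
import numpy as np
from tensorflow import keras

# Hypothetical sizes; the real values depend on the dataset and model design.
vocab_size = 10000    # number of distinct word IDs (0 is reserved for padding)
embedding_dim = 256   # dimensionality of the learned word vectors

# The Embedding layer maps each integer word ID to a dense vector,
# so the network can consume sequences of word IDs directly.
embedding = keras.layers.Embedding(input_dim=vocab_size,
                                   output_dim=embedding_dim,
                                   mask_zero=True)

# A batch of two zero-padded word ID sequences.
ids = np.array([[2, 6, 42, 0], [2, 15, 3, 0]])
vectors = embedding(ids)
print(vectors.shape)  # (2, 4, 256)
```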
If possible, unify the lengths of the word ID sequences within the dataset so they are easier to feed into the LSTM later. Here, the sequences are padded to match the maximum length in the dataset.
Teacher forcing is a technique used when training a seq2seq model. Normally the decoder uses its own estimate of the previous word to predict the next word, but since the correct answers are available during training, the next word is instead predicted from the previous correct word rather than the previous estimate. Even if the model confuses "that" with "this" or "pen" with "pencil", the next input is corrected to the ground-truth word. To achieve this, prepare a word sequence shifted by one word from the target word sequence as the decoder input.
Example
If the estimation target is "This is a pen. <end>", the corresponding decoder input is "<start> This is a pen." (the same sentence shifted by one word).
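To make the one-word shift concrete, here is a small sketch using the example sentence above:

```python
# The decoder input is the target sentence shifted right by one word,
# so at every step the decoder sees the previous *correct* word.
target_words        = ["This", "is", "a", "pen", ".", "<end>"]
decoder_input_words = ["<start>", "This", "is", "a", "pen", "."]

# At step t, decoder_input_words[t] is fed in and target_words[t] must be predicted.
for inp, tgt in zip(decoder_input_words, target_words):
    print(f"input: {inp:>7s}  ->  predict: {tgt}")
```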
The above flow can be summarized as follows.
When this processing is applied, a conversion like the following is performed.

Dataset word string:
`<start> i can 't tell who will arrive first . <end>`
↓
Word ID sequence:
`[2, 6, 42, 20, 151, 137, 30, 727, 234, 4, 3, 0, 0, 0, 0, 0, 0, 0]` (18 elements)
Define the following two functions to read the data line by line from the dataset and add the start / end tokens.
```python
def preprocess_sentence(w):
    w = w.rstrip().strip()

    # Add sentence start and end tokens
    # to let the model know when to start and when to end the prediction
    w = '<start> ' + w + ' <end>'
    return w


def create_dataset(path, num_examples):
    with open(path) as f:
        word_pairs = f.readlines()
        word_pairs = [preprocess_sentence(sentence) for sentence in word_pairs]

    return word_pairs[:num_examples]
```
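For instance, it could be used like this; the file name train.en is a hypothetical placeholder for wherever the dataset file actually lives.

```python
# Read the first three sentences and attach the start / end tokens.
# 'train.en' is a placeholder path; small_parallel_enja ships files such as train.en / train.ja.
en_sentences = create_dataset('train.en', num_examples=3)
for s in en_sentences:
    print(s)
# e.g. <start> i can 't tell who will arrive first . <end>
```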
Although it is named preprocess_sentence, the function only adds the start / end tokens, so it is not a very good name. The variable in create_dataset is still called word_pairs because it is left over from the TensorFlow sample code I referred to; it is not actually pairs at all, the function simply returns num_examples word strings with start / end tokens attached.
Here, keras.preprocessing.text.Tokenizer is very convenient and saves a lot of work.
```python
def tokenize(lang):
    lang_tokenizer = keras.preprocessing.text.Tokenizer(filters='', oov_token='<unk>')
    lang_tokenizer.fit_on_texts(lang)

    tensor = lang_tokenizer.texts_to_sequences(lang)
    tensor = keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

    return tensor, lang_tokenizer
```
The fit_on_texts method determines the mapping between words and word IDs from the list of input word strings. The texts_to_sequences method then converts the list of word strings into a list of word ID sequences, and keras.preprocessing.sequence.pad_sequences takes care of the zero padding.
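Putting the two helpers together, the whole preprocessing can be driven roughly like this; the file paths and the num_examples value are hypothetical placeholders, and which side is source or target depends on the translation direction you choose.

```python
num_examples = 50000

# Hypothetical paths: Japanese as the source language, English as the target.
input_lang  = create_dataset('train.ja', num_examples)
target_lang = create_dataset('train.en', num_examples)

input_tensor,  input_tokenizer  = tokenize(input_lang)
target_tensor, target_tokenizer = tokenize(target_lang)

print(input_tensor.shape)   # (num_examples, max source length), zero-padded
print(target_tensor.shape)  # (num_examples, max target length), zero-padded
print(target_tokenizer.word_index['<start>'])  # word -> ID mapping learned by fit_on_texts
```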
Let `input_tensor` be the input word ID sequences obtained by the processing above, and `target_tensor` the corresponding correct-answer word ID sequences. Process them as follows.
```python
encoder_input_tensor = input_tensor
decoder_input_tensor = target_tensor[:, :-1]
decoder_target_tensor = target_tensor[:, 1:]  # this realizes teacher forcing
```
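As a quick sanity check, the shapes line up as in this sketch; column t of the decoder input holds the word just before column t of the decoder target.

```python
# Each decoder tensor is one column shorter than target_tensor.
print(encoder_input_tensor.shape)   # (num_examples, max source length)
print(decoder_input_tensor.shape)   # (num_examples, max target length - 1)
print(decoder_target_tensor.shape)  # (num_examples, max target length - 1)
```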
The data needed for the seq2seq model is now ready. Modeling and training will be covered in the next article.
The preprocessing part is based on: Neural machine translation with attention https://www.tensorflow.org/tutorials/text/nmt_with_attention
The learning / inference code is based on: Sequence to sequence example in Keras (character-level). https://keras.io/examples/lstm_seq2seq/
The data used for training: https://github.com/odashi/small_parallel_enja
Repository containing the code for this article https://github.com/nagiton/simple_NMT