How to make a Japanese-English translation

We will implement Japanese-English translation with TensorFlow and Keras.

This is the table of contents for this article.

  1. Environment and dataset details
  2. Basic flow
  3. Data preprocessing
  4. Build model
  5. Learning
  6. Evaluate

The full code is published on GitHub, so please refer to it: Japanese-English_Translation. It is saved as .ipynb, so it can easily be run on Google Colab. This is code I wrote when I studied natural language processing some time ago, published after a little tidying up.

I hope you find it helpful.

Environment and dataset details

Hardware environment: Google Colab

Software environment: Python 3, TensorFlow (version 2.3.1)

Dataset: small_parallel_enja

small_parallel_enja is a small dataset of sentences extracted from the Tanaka corpus. It has already been preprocessed and is very easy to use. The dataset is already split into training, validation, and test data, so there is no need to split it yourself. If you have the resources (for example, multiple GPUs), you could instead pool the training and validation data and run cross-validation.

Basic flow

We will proceed according to the following flow.

1. Data preprocessing

2. Model building

3. Learning

4. Evaluation

Well, it's the usual flow.

Data preprocessing

This part is quite easy because the dataset has already been preprocessed.

Tokenization uses the Keras API built into TensorFlow, tf.keras.preprocessing.text.Tokenizer.

It is fairly easy to use; an example is shown below.

import tensorflow as tf

# oov_token is the word that stands in for out-of-vocabulary words
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

Create an instance of tf.keras.preprocessing.text.Tokenizer, then tell that instance which words to use with fit_on_texts(texts). The instance then manages the unique words internally. After that, all you have to do is convert each sentence to a sequence of integers with tokenizer.texts_to_sequences(texts). Here texts is the text dataset, and it must be a list of strings.

texts = ["I am Niwaka", "Hello !", .., "Wow !"]

The code above is an example of the texts format passed to the tf.keras.preprocessing.text.Tokenizer instance.

For more on usage, see the documentation for tf.keras.preprocessing.text.Tokenizer.
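To make the idea concrete, here is a plain-Python sketch of what "managing unique words" and "converting sentences to integers" amount to. It is a simplified stand-in, not the Keras implementation: the helper names build_word_index and to_sequences are hypothetical, and the real Tokenizer additionally lowercases text, filters punctuation, and orders words by frequency.

```python
# Simplified stand-in for the idea behind fit_on_texts / texts_to_sequences.
# Helper names are hypothetical; this is not how Keras implements it.

def build_word_index(texts, oov_token="<unk>"):
    """Assign an integer ID to every unique word; 0 is left free for padding."""
    word_index = {oov_token: 1}
    for text in texts:
        for word in text.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def to_sequences(texts, word_index, oov_token="<unk>"):
    """Replace each word by its ID, falling back to the OOV ID for unknown words."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in text.split()] for text in texts]


texts = ["I am Niwaka", "Hello !", "Wow !"]
word_index = build_word_index(texts)
# "Groot" was never seen while building the index, so it maps to the OOV ID 1.
sequences = to_sequences(texts + ["I am Groot"], word_index)
print(word_index)
print(sequences)
```

Leaving ID 0 unused matters later, because 0 is the value used for padding.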

To train with mini-batches, all the data in a mini-batch must have the same shape. However, natural language data is generally variable length. Therefore, unlike many other datasets, it needs some extra handling.

There are two possible ways to handle variable-length data:

  1. Padding
  2. Set the batch size to 1 (perhaps appropriate when the time series length is in the hundreds?)

Here we use option 1, padding. Padding brings every sequence in a mini-batch up to the maximum time series length L in that mini-batch by filling the shorter sequences with a special value. In TensorFlow and Keras, 0 is used as that special value by default.

For example, suppose you have a dataset with non-uniform lengths like this:

sequences = [
  [12, 45],
  [3, 4, 7],
  [4],
]

The maximum length of the above dataset is 3. When padding is performed, it will be as follows.

padded_sequences = [
  [12, 45, 0],
  [3, 4, 7],
  [4, 0, 0],
]

In TensorFlow, padding for text data is provided by an API called tf.keras.preprocessing.sequence.pad_sequences.

To use it, just pass the dataset whose lengths you want to unify to tf.keras.preprocessing.sequence.pad_sequences. The padding argument specifies whether the zeros are added after the data or before it. Specifying "post" adds them after the data (the default, "pre", adds them before). Use whichever you prefer.

padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post")
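For readers who want to see what the two padding modes do without running TensorFlow, the following plain-Python sketch reproduces the behaviour on the small example above. The function names pad_post and pad_pre are hypothetical helpers, not part of any library.

```python
def pad_post(sequences, value=0):
    """Append `value` until every sequence has the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


def pad_pre(sequences, value=0):
    """Prepend `value` until every sequence has the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]


sequences = [[12, 45], [3, 4, 7], [4]]
print(pad_post(sequences))  # [[12, 45, 0], [3, 4, 7], [4, 0, 0]]
print(pad_pre(sequences))   # [[0, 12, 45], [3, 4, 7], [0, 0, 4]]
```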

Now you can train with mini-batches. But there is a problem here: how does the model interpret 0?

If possible, you want to be able to ignore the special value of 0. Otherwise, it makes no sense to use an RNN that can handle variable lengths.

TensorFlow provides a feature called Masking. Masking tells a layer to ignore the values at specified time steps, which makes it possible to handle variable-length data in one batch. (You can handle variable-length data without Masking, but then the special value 0 is also fed into the model. That's unpleasant, so I'd like to avoid it.)

For more information, see "Masking and padding with Keras", which explains in detail how to use masking and padding in TensorFlow and Keras.

In tensorflow, Masking is enabled in the following ways:

  1. Add tf.keras.layers.Masking
  2. Set the argument mask_zero of tf.keras.layers.Embedding to True.
  3. Pass the mask directly to a layer that uses it. (This is the most direct way; it may be useful for data such as video.)

Here we use option 2.

In the model used this time, Embedding automatically generates a mask, and that mask is automatically propagated to the next layer.
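The effect of a mask can be illustrated with a toy example in plain Python. This is only a sketch of the idea: compute_mask and toy_rnn are hypothetical helpers, and the update rule is a made-up recurrence rather than a GRU. The point is that a masked step is skipped and the previous state is carried forward, so padding no longer changes the result.

```python
def compute_mask(padded_sequences, pad_id=0):
    """True where a real token is present, False where the value is padding."""
    return [[token != pad_id for token in seq] for seq in padded_sequences]


def toy_rnn(seq, mask=None):
    """Toy recurrent update (not a GRU): state = 0.5 * state + token.

    A masked step is skipped, so the previous state is carried forward.
    """
    state = 0.0
    for step, token in enumerate(seq):
        if mask is not None and not mask[step]:
            continue
        state = 0.5 * state + token
    return state


padded_sequences = [[12, 45, 0], [3, 4, 7], [4, 0, 0]]
masks = compute_mask(padded_sequences)
print(masks[2])                                # [True, False, False]
print(toy_rnn(padded_sequences[2]))            # padding steps change the state
print(toy_rnn(padded_sequences[2], masks[2]))  # same as running on [4] alone
```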

Model construction

We use the Seq2Seq model. Seq2Seq is a model that transforms sequence data into some other sequence data. Sequence data here means time series data.

The interface of Seq2Seq is:

Sequence after conversion = Seq2Seq(Sequence before conversion)

For example, suppose you enter "I am a student" into the Seq2Seq model.

"I am student ." = Seq2Seq("I am a student.")

Seq2Seq consists of two modules. The first is the Encoder and the second is the Decoder. The Encoder encodes the sequence data and outputs features that are not interpretable by humans. Feeding those features and the start token into the Decoder produces some other sequence data. Here, the start token is a special word that marks the beginning of a sequence.

The interface between Encoder and Decoder is described below in a pseudo language.

Features that humans cannot understand = Encoder(Sequence before conversion)
Sequence after conversion = Decoder(Features that humans cannot understand, <start> token)

Each module uses an RNN. You can learn the specific mechanism by reading the first part of "Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)". That article is about attention, but it also describes Seq2Seq.

Seq2Seq uses RNNs. RNNs are good at processing natural language because they can handle variable-length data. In addition, the weights are shared across time steps, which keeps the number of parameters down. However, be aware that sequences that are too long lead to vanishing and exploding gradients during backpropagation. Even a time series length of only 10 is equivalent to unrolling 10 layers along the time axis. We adopted GRU as the RNN. GRU is an RNN variant whose gradients are less prone to vanishing.
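As a rough illustration of why long sequences are a problem, suppose backpropagation scaled the gradient by the same constant factor at every time step. This is a strong simplification (in a real RNN the factor depends on the weights and activations), and the factors 0.5 and 1.5 below are arbitrary values chosen for illustration.

```python
def gradient_scale(per_step_factor, num_steps):
    """Factor by which a gradient is scaled after flowing back num_steps steps,
    assuming the same (hypothetical) factor applies at every step."""
    return per_step_factor ** num_steps


for steps in (10, 50):
    shrinking = gradient_scale(0.5, steps)  # factor below 1: gradient vanishes
    growing = gradient_scale(1.5, steps)    # factor above 1: gradient explodes
    print(steps, shrinking, growing)
```

Even at 10 steps the shrinking case is already below 0.001 and the growing case is above 50.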

The model diagram of the decoder and encoder used this time is as follows.

Figure 1: Encoder

Figure 2: Decoder

I'm only using one RNN for each Encoder and Decoder. Since the data set is small, I chose a smaller model.

Below is the code using tensorflow and keras.

It uses The Functional API. If you want to represent variable length data, specify None for the shape of tf.keras.Input.

For how to use the Functional API, see "The Functional API".

Below is the model implementation code. There are three models: model, encoder, and decoder. model is used for training. The parameters learned with model.fit are then loaded into encoder and decoder, because processing differs between training and inference.

def CreateEncoderModel(vocab_size):
  units = 128
  emb_layer = tf.keras.layers.Embedding(vocab_size, units, mask_zero=True)  # mask_zero=True masks the padding value 0
  gru_layer  = tf.keras.layers.GRU(units)
  encoder_inputs = tf.keras.Input(shape=(None,))
  outputs = emb_layer(encoder_inputs)
  outputs = gru_layer(outputs)
  
  encoder = tf.keras.Model(encoder_inputs, outputs)

  return encoder

def CreateDecoderModel(vocab_size):
  units = 128

  emb_layer = tf.keras.layers.Embedding(vocab_size, units, mask_zero=True)  # mask_zero=True masks the padding value 0
  gru_layer  = tf.keras.layers.GRU(units, return_sequences=True)
  dense_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")

  decoder_inputs  = tf.keras.Input(shape=(None,))
  encoder_outputs = tf.keras.Input(shape=(units,))  # final encoder GRU state, one vector of size units per sample

  outputs = emb_layer(decoder_inputs)
  outputs = gru_layer(outputs, initial_state=encoder_outputs)
  outputs = dense_layer(outputs)
  
  decoder = tf.keras.Model([decoder_inputs, encoder_outputs], outputs)

  return decoder

def CreateModel(seed, ja_vocab_size, en_vocab_size):
  tf.random.set_seed(seed)
  encoder = CreateEncoderModel(ja_vocab_size)
  decoder = CreateDecoderModel(en_vocab_size)

  encoder_inputs = tf.keras.Input(shape=(None,))
  decoder_inputs = tf.keras.Input(shape=(None,))

  encoder_outputs = encoder(encoder_inputs)
  decoder_outputs = decoder([decoder_inputs, encoder_outputs])
  
  model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                metrics=['accuracy'])
  return model, encoder, decoder

Learning

Let's search over batch sizes 32, 64, and 128.

The experimental settings are shown below.

  1. The number of RNN units is 128
  2. The optimizer is Adam (learning rate left at the default)
  3. The number of epochs is 2
  4. The evaluation metric is BLEU

Training was stopped after two epochs, and the validation data was evaluated with the model from the second epoch. BLEU is a metric used to measure the quality of translations.
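To give a feel for what BLEU measures, here is a minimal plain-Python sketch of a sentence-level score. It is not the Evaluate function used in this article: simple_bleu is a hypothetical helper that assumes a single reference, uniform weights, no smoothing, and n-grams only up to 2 (BLEU is commonly computed up to 4-grams; 2 is used here so that short example sentences do not score zero).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, hypothesis, max_n=2):
    """Sentence-level BLEU with one reference, uniform weights, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clipped overlap: an n-gram is credited at most as often as it
        # appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one order with no match zeroes the score
        log_precisions.append(math.log(overlap / total))
    c, r = len(hypothesis), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "i am a student .".split()
hypothesis = "i am student .".split()
print(simple_bleu(reference, hypothesis))  # about 0.64
print(simple_bleu(reference, reference))   # 1.0
```

The early return of 0.0 is the case that smoothing functions are meant to soften, which is one reason scores from different BLEU implementations are not directly comparable.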

Below is the learning code.

bleu_scores = []
batch_size_list = [32, 64, 128]
for batch_size in batch_size_list:
  model, encoder, decoder = CreateModel(seed, len(ja_tokenizer.word_index)+1, len(en_tokenizer.word_index)+1)
  model.fit([train_ja_sequences, train_en_sequences[:, :-1]], train_en_sequences[:, 1:], batch_size=batch_size, epochs=2)
  model.save(str(batch_size)+"model.h5")
  encoder.load_weights(str(batch_size)+"model.h5", by_name=True)
  decoder.load_weights(str(batch_size)+"model.h5", by_name=True)
  bleu_score = Evaluate(valid_ja, valid_en, encoder, decoder)
  bleu_scores.append(bleu_score)

BLEU is measured with the Evaluate function. (Implementation is published on github.) The model was saved using each batch size as the name.
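The slices [:, :-1] and [:, 1:] in the model.fit call implement teacher forcing: the decoder receives the target sentence without its last token and is trained to predict the same sentence shifted one step to the left. The sketch below shows the shift on plain lists; the token IDs are hypothetical, with 1 standing for the start token, 2 for an end token, and 0 for padding (an end token is an assumption here, the article only mentions the start token).

```python
# Hypothetical token IDs: 1 = <start>, 2 = <end>, 0 = padding.
train_en = [
    [1, 5, 6, 7, 2],
    [1, 8, 9, 2, 0],
]

# Equivalent of train_en_sequences[:, :-1]: everything except the last step.
decoder_inputs = [seq[:-1] for seq in train_en]
# Equivalent of train_en_sequences[:, 1:]: everything except the first step.
decoder_targets = [seq[1:] for seq in train_en]

for inputs, targets in zip(decoder_inputs, decoder_targets):
    # At each step the decoder sees inputs[t] and should predict targets[t].
    print(inputs, "->", targets)
```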

The decoding method outputs the word with the highest probability at each step; that is, words are chosen greedily (greedy algorithm). In practice, it is better to use beam search. Beam search is a search algorithm that relaxes the greedy algorithm by keeping several candidate sequences at each step. Choosing the best word at every step does not guarantee the best overall sequence, which is why beam search is preferable. The video "C5W3L03 Beam Search" is a helpful explanation; it's worth watching when you have time.
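The difference between the two strategies can be seen on a toy example. The sketch below does not use the trained decoder: NEXT is a made-up table of next-word probabilities that depends only on the previous word, and greedy_decode and beam_search are hypothetical helpers. Practical implementations usually work with log probabilities and some form of length normalization, which are omitted here.

```python
# Made-up next-word probabilities, keyed by the previous word.
NEXT = {
    "<start>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 1.0},
    "x": {"<end>": 1.0},
    "y": {"<end>": 1.0},
    "z": {"<end>": 1.0},
}


def greedy_decode(max_len=10):
    """Pick the single most probable next word at every step."""
    tokens, prob = ["<start>"], 1.0
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        word, p = max(NEXT[tokens[-1]].items(), key=lambda kv: kv[1])
        tokens.append(word)
        prob *= p
    return tokens, prob


def beam_search(beam_width=2, max_len=10):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [(["<start>"], 1.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, prob in beams:
            if tokens[-1] == "<end>":
                candidates.append((tokens, prob))  # finished, keep unchanged
                continue
            for word, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [word], prob * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == "<end>" for tokens, _ in beams):
            break
    return beams[0]


print(greedy_decode())  # starts with "a" (0.6) but ends with probability 0.3
print(beam_search())    # keeps "b" alive and finds the sequence with probability 0.4
```

With beam_width=1 the beam search reduces to the greedy choice.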

One thing I am unsure about is whether Keras applies the mask when calculating the loss. I've heard that it does, but I can't say for certain because I haven't read that part of the implementation.

If that bothers you, the loss function in "Neural machine translation with attention" may be helpful. Note, however, that the linked implementation is fairly difficult if you have only ever trained with model.fit. The linked loss function is written so that the loss at masked steps is not included in the final loss.
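The idea of excluding masked steps from the loss can be sketched in plain Python as follows. This is not the linked tutorial's code: masked_cross_entropy is a hypothetical helper, the probabilities are made up, and averaging over the non-padding steps only is one possible normalization chosen for this sketch.

```python
import math


def masked_cross_entropy(targets, probs, pad_id=0):
    """Average -log p(target) over non-padding steps only.

    targets: list of sequences of token IDs (pad_id marks padding)
    probs:   for every step, a list of predicted probabilities per token ID
    """
    total, count = 0.0, 0
    for target_seq, prob_seq in zip(targets, probs):
        for target, step_probs in zip(target_seq, prob_seq):
            if target == pad_id:
                continue  # padding step: contributes nothing to the loss
            total += -math.log(step_probs[target])
            count += 1
    return total / count


targets = [[1, 2, 0]]  # the last step is padding
probs = [[[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.9, 0.05, 0.05]]]
print(masked_cross_entropy(targets, probs))  # equals -log(0.8)
```

Whatever the model predicts at the padded step, the result stays the same.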

The experimental results are shown in the graph below.

Figure 3: Experimental results

Please note that the image is rough.

The batch size with the best BLEU on the validation data was 32, so we retrain with 32. As Figure 3 shows, smaller batch sizes may give better results. Before the final evaluation, the training data and validation data are combined and the model is retrained. For retraining, the number of epochs was set to 10. Everything else is the same.

train_and_valid_ja_sequences = tf.concat([train_ja_sequences, valid_ja_sequences], 0)
train_and_valid_en_sequences = tf.concat([train_en_sequences, valid_en_sequences], 0)

best_model, best_encoder, best_decoder = CreateModel(seed, len(ja_tokenizer.word_index)+1, len(en_tokenizer.word_index)+1)
best_model.fit([train_and_valid_ja_sequences, train_and_valid_en_sequences[:, :-1]], train_and_valid_en_sequences[:, 1:], batch_size=32, epochs=10)
best_model.save("best_model.h5")

Note that if you're using a GPU, you won't always get exactly the same results, even with the same seed.

Evaluation

BLEU on the test data was 0.19 (the maximum is 1). I haven't compared it with other results, so I can't be sure, but I think it is a pretty terrible result.

The processing code for the test data is as follows.

best_encoder.load_weights("best_model.h5", by_name=True)
best_decoder.load_weights("best_model.h5", by_name=True)
bleu_score = Evaluate(test_ja, test_en, best_encoder, best_decoder)
print("bleu on test_dataset:")
print(bleu_score)

A simple question: there seem to be several ways to compute BLEU (for example, there are several smoothing_function options). It doesn't seem to be standardized, so I'd appreciate it if someone familiar with this could tell me. If it isn't standardized, you could just measure BLEU with whichever smoothing_function gives the best score... is that really acceptable?

Finally

I'll end this article with some ways to improve accuracy.

  1. Invert the input data
  2. Incorporate attention into the model
  3. Use stop words
  4. Ensemble
  5. Deepen the layers (don't forget to use skip connections)
  6. Share the weight of the embedding layer and the fully connected layer
  7. Change model to Transformer
  8. Change the initial weight setting method

These are all techniques you can find any number of by searching online. Using them does not guarantee that BLEU will improve. If you are interested, please look them up.
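As an example of how small some of these changes are, item 1 (inverting the input data) could be sketched as below. reverse_source is a hypothetical helper and the token IDs are made up; the sketch assumes the reversal is applied to the unpadded source sequences, so that any padding added afterwards still ends up on the same side.

```python
def reverse_source(sequences):
    """Reverse each unpadded source sequence; target sequences stay untouched."""
    return [list(reversed(seq)) for seq in sequences]


ja_sequences = [[12, 45], [3, 4, 7], [4]]  # hypothetical token IDs
print(reverse_source(ja_sequences))  # [[45, 12], [7, 4, 3], [4]]
```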

I'm a beginner in natural language processing, so if there's something wrong with it, I'd appreciate it if you could let me know.

References

  1. small_parallel_enja
  2. Masking and padding with Keras
  3. The Functional API
  4. Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
  5. Neural machine translation with attention
  6. C5W3L03 Beam Search
