My name is okayu, and I am studying machine translation in Nara. As the title suggests, let's do English-Japanese neural machine translation from scratch.
Neural machine translation refers to machine translation methods that use deep learning. The Transformer is a well-known model and is currently the mainstream approach.
- Attention Is All You Need (original paper)
- Paper commentary: Attention Is All You Need (Transformer) (Japanese commentary article by Ryobot)
This time, I would like to do English-Japanese neural machine translation using this Transformer model! We will use Sentencepiece, OpenNMT, MeCab, and multi-bleu.perl, and training is assumed to run on a single GPU. This article does not explain the underlying technologies such as the Transformer itself ...
The general flow is as follows:

1. Prepare a parallel corpus
2. Tokenize with Sentencepiece
3. Preprocess and train with OpenNMT-py
4. Translate the test set
5. Detokenize with Sentencepiece and segment words with MeCab
6. Evaluate with BLEU (multi-bleu.perl)
Since OpenNMT-py is used this time, the environment is Python 3.5 or later with PyTorch 1.4.
A corpus is linguistic data used in natural language processing. Japanese-English parallel datasets such as ASPEC (a corpus of scientific paper abstracts), KFTT, and JParaCrawl are currently provided free of charge for research purposes. Other corpora are listed on Graham Neubig's page here. Beyond Japanese-English, there are also corpora such as ParaCrawl for other language pairs.
First, let's prepare a parallel sentence dataset for training. What you need is as follows.
- train.ja: Japanese sentences for training
- train.en: English sentences for training
- dev.ja: Japanese sentences for validation
- dev.en: English sentences for validation
- test.ja: Japanese sentences for testing
- test.en: English sentences for testing
Since this is English-to-Japanese translation, test.ja holds the reference sentences and test.en holds the source sentences. train.ja, train.en, dev.ja, and dev.en are used for training. Corresponding lines of the .en and .ja files must be translations of each other.
As an actual example, ASPEC provides parallel sentences for train, validation (development), and test.
B-94A0894379 ||| 3 |||Since the operation on the user side is important for the idea support in material development, an interface for material manipulation at the atomic level was developed [1994.8].||| Because user operation is important for the idea support in material development, an interface for a substance operation at atomic level was developed.
The article ID and the Japanese and English sentences that are translations of each other appear together on one line. There are 1 million such lines in train1.txt, 1,790 in dev.txt, and 1,812 in test.txt. The data cannot be used for training as is; it has to be split into an English-only file and a Japanese-only file, for instance with a small script like the one below. (In this example, train1.txt is split into train.ja and train.en, dev.txt into dev.ja and dev.en, and test.txt into test.ja and test.en.) When doing this, make sure the corresponding lines of the .en and .ja files remain translations of each other.
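For reference, here is a minimal sketch of such a split script (my own illustration, not part of the official ASPEC tools; the filename split_aspec.py is just an example). It assumes each line looks like the train1.txt example above, with the Japanese sentence in the second-to-last `|||`-separated field and the English sentence in the last field.

split_aspec.py
def split_corpus(infile, ja_out, en_out):
    # Each line: ID ||| (score) ||| Japanese sentence ||| English sentence
    with open(infile, encoding='utf-8') as f, \
         open(ja_out, mode='w', encoding='utf-8') as fja, \
         open(en_out, mode='w', encoding='utf-8') as fen:
        for line in f:
            fields = [x.strip() for x in line.rstrip('\n').split('|||')]
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            fja.write(fields[-2] + '\n')  # Japanese: second-to-last field
            fen.write(fields[-1] + '\n')  # English: last field

split_corpus('train1.txt', 'train.ja', 'train.en')
split_corpus('dev.txt', 'dev.ja', 'dev.en')
split_corpus('test.txt', 'test.ja', 'test.en')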
Tokenize using Sentencepiece. MeCab or KyTea could also be used, but this time we tokenize into subword units instead of word units.
This time, the vocabulary size is 16,000, and the vocabulary is shared between English and Japanese. Running `cat train.en train.ja > tmp.txt` temporarily creates a single file containing both the English and Japanese training data. Install Sentencepiece with `pip install sentencepiece`.
Next, we train the Sentencepiece model.
spm_train.py
import sentencepiece as spm

# Train a shared English/Japanese subword model with a 16,000-token vocabulary
spm.SentencePieceTrainer.Train("--input=tmp.txt --model_prefix=spm_trained_model --vocab_size=16000")
Run `python spm_train.py` to train the model. This creates `spm_trained_model.model` and `spm_trained_model.vocab`. Since `tmp.txt` is no longer needed, delete it.
Tokenize all of the parallel datasets (`python spm_tok.py`).
spm_tok.py
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('spm_trained_model.model')

def tokenize(filename, outputfilename):
    # Encode each line into space-separated subword pieces
    with open(filename, mode='r') as f, open(outputfilename, mode='w') as foutput:
        for line in f:
            pieces = sp.EncodeAsPieces(line.rstrip('\n'))
            foutput.write(' '.join(pieces) + '\n')

tokenize('train.en', 'train.en.atok')
tokenize('train.ja', 'train.ja.atok')
tokenize('dev.en', 'dev.en.atok')
tokenize('dev.ja', 'dev.ja.atok')
tokenize('test.en', 'test.en.atok')
tokenize('test.ja', 'test.ja.atok')
The tokenized files `train.en.atok`, `train.ja.atok`, `dev.en.atok`, `dev.ja.atok`, `test.en.atok`, and `test.ja.atok` have now been created.
(OpenNMT does not use test.ja.atok, but fairseq does, so I created it anyway.)
Please refer to the official GitHub (Sentencepiece / GitHub) for how to do this from the command line.
This time, we will train with OpenNMT. Besides OpenNMT, there is also fairseq. Both OpenNMT and fairseq perform their own preprocessing, which converts the sentences into a data structure that is easy to train on.
Install `OpenNMT-py`.
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
python setup.py install
From now on, it is assumed that all processing is performed in the `OpenNMT-py` directory. Next, serialize the data with `preprocess.py`. Since this is English-to-Japanese translation, specify the English dataset for src and the Japanese dataset for tgt.
python preprocess.py -train_src train.en.atok -train_tgt train.ja.atok -valid_src dev.en.atok -valid_tgt dev.ja.atok -save_data preprocessed_dataset
When executed, the files `preprocessed_dataset.train.pt`, `preprocessed_dataset.valid.pt`, and `preprocessed_dataset.vocab.pt` will be created. (Multiple files may be created depending on the amount of data.)
The explanation of each file is as follows (quoted from the official GitHub):
After running the preprocessing, the following files are generated:
- demo.train.pt: serialized PyTorch file containing training data
- demo.valid.pt: serialized PyTorch file containing validation data
- demo.vocab.pt: serialized PyTorch file containing vocabulary data
Train with the following command (quoted from the OpenNMT FAQ). Please refer to the official documentation (English) for an explanation of the hyperparameters. Since training is performed on a single GPU, the CUDA device is specified and world_size is set to 1.
export CUDA_VISIBLE_DEVICES=0 && \
python train.py -data preprocessed_dataset -save_model save_model_name \
-layers 6 -rnn_size 512 -word_vec_size 512 \
-transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens \
-normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam \
-warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 40000 -world_size 1 -gpu_ranks 0
When you run it, it starts like this. (The full NMTModel structure is also printed, but since it is a little long, it is omitted here.)
[2020-06-16 14:50:34,141 INFO] * src vocab size = 8071
[2020-06-16 14:50:34,143 INFO] * tgt vocab size = 14921
[2020-06-16 14:50:34,145 INFO] Building model...
[2020-06-16 14:50:38,414 INFO] NMTModel(
As the Step count increases like this, it is proof that training is progressing (the numerical values are masked in the examples below):
[2020-06-18 00:59:56,869 INFO] Step 100/200000; acc: *; ppl: *; xent: *; lr: *; 12933/10876 tok/s; 122958 sec
When it finishes, it looks like this:
[2020-06-18 01:01:23,330 INFO] Step 200000/200000; acc: *; ppl: *; xent: *; lr: *; 13220/10803 tok/s; 123045 sec
[2020-06-18 01:01:23,336 INFO] Loading dataset from preprocessed_dataset.valid.pt
[2020-06-18 01:01:23,473 INFO] number of examples: 1791
[2020-06-18 01:01:26,183 INFO] Validation perplexity: *
[2020-06-18 01:01:26,185 INFO] Validation accuracy: *
[2020-06-18 01:01:26,265 INFO] Saving checkpoint save_model_name_step_200000.pt
Training completes in one to two days with the ASPEC dataset on a single GPU. (On CPU it would be much slower than on GPU.)
After training, checkpoints such as `save_model_name_step_200000.pt` will have been generated. We translate using this checkpoint.
For the source text (the English text you want to translate), use the tokenized `test.en.atok` created earlier. (`test.en` will not work.)
python translate.py -model save_model_name_step_200000.pt -src test.en.atok -output output.txt -replace_unk -verbose -gpu 0
The translated sentence file `output.txt` has been created!
The translated text `output.txt` is still tokenized, so detokenize it with Sentencepiece to restore the raw text (`python spm_detok.py`).
spm_detok.py
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('spm_trained_model.model')

openfile = 'output.txt'
# Decode space-separated subword pieces back into raw sentences
with open(openfile, mode='r') as f, open(openfile + '.detok', mode='w') as f2:
    for line in f:
        pieces = line.rstrip('\n').split(' ')
        f2.write(sp.decode_pieces(pieces) + '\n')
You now have a raw Japanese translation in `output.txt.detok`! This is the basic flow of neural machine translation. Below, we explain how the generated text is evaluated.
As we will see later, BLEU, which measures n-gram word overlap with the reference, is used for evaluation. Unlike English, Japanese does not separate words with spaces, so the sentences need to be segmented into words. This time, I will use MeCab for word segmentation. (Some people seem to use KyTea.)
Install MeCab and its dictionary following this procedure.
Next, run `python mecab_owakati.py output.txt.detok` and `python mecab_owakati.py test.ja` to segment the generated translations and the reference sentences into words.
mecab_owakati.py
import MeCab
import sys

m = MeCab.Tagger('-Owakati')
openfile = sys.argv[1]
# Write a word-segmented (wakati) version of each line to <filename>.mecab
with open(openfile, mode='r') as f, open(openfile + '.mecab', mode='w') as f2:
    for line in f:
        f2.write(m.parse(line.rstrip('\n')))
The word-segmented Japanese translation `output.txt.detok.mecab` and the word-segmented reference `test.ja.mecab` have been generated. Next, we evaluate the generated translations using these files.
BLEU is often used to evaluate machine translation, so we will use it here.
- BLEU: a Method for Automatic Evaluation of Machine Translation (original paper)
- Automatic evaluation metric BLEU (NICT BLEU commentary by Masao Uchiyama)
In recent years, evaluation methods based on BERT rather than BLEU have appeared (BERTScore, for example?), and there are also metrics such as RIBES and ROUGE. This time I will stick with BLEU. We use `multi-bleu.perl` for evaluation; when you clone OpenNMT-py, it is included in the `tools/` directory by default.
Running `perl tools/multi-bleu.perl test.ja.mecab < output.txt.detok.mecab` displays the BLEU score.
BLEU = *, */*/*/* (BP=*, ratio=*, hyp_len=*, ref_len=*)
#BLEU = global-BLEU score, precisions of 1-grams/2-grams/3-grams/4-grams (BP=brevity penalty, ratio=length ratio, hyp_len=hypothesis length, ref_len=reference length)
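For reference, the global BLEU score shown above is the standard combination of the displayed n-gram precisions $p_n$ and the brevity penalty BP, with uniform weights $w_n = 1/4$:

```math
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right), \qquad
\mathrm{BP} = \min\left(1,\ \exp\left(1 - \frac{\mathrm{ref\_len}}{\mathrm{hyp\_len}}\right)\right)
```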
By the way, the evaluation results from WAT are listed here. With the above hyperparameters on ASPEC, I think you get around BLEU 38-40 (it depends greatly on the hyperparameters used for training). Compared to that, I thought a certain N company and a certain N|CT are strong. Also, although I used multi-bleu.perl this time, I think you can also use sacreBLEU, which recently started supporting Japanese (MeCab only); see the sketch below.
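As a side note, here is a minimal sketch of scoring with sacreBLEU's Python API instead of multi-bleu.perl, assuming the sacrebleu and mecab-python3 packages are installed (the filename sacrebleu_eval.py is just an example). With the `ja-mecab` tokenizer you pass the raw, detokenized Japanese text directly, so the manual MeCab segmentation step is not needed.

sacrebleu_eval.py
import sacrebleu

# Read the detokenized system output and the raw references, one sentence per line
with open('output.txt.detok') as f:
    hyps = [line.rstrip('\n') for line in f]
with open('test.ja') as f:
    refs = [line.rstrip('\n') for line in f]

# 'ja-mecab' tokenizes the Japanese internally before computing BLEU
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize='ja-mecab')
print(bleu.score)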
When I first did machine translation, I didn't know what to do, so I wrote this article for people who have never done machine translation before (there were few Japanese articles, and since I'm not good at English, I had a hard time). I'm sorry if it's hard to understand. I hope it helps spark an interest in natural language processing and machine translation.
If you are interested in models other than the Transformer, it may be fun to try fairseq, which has various models implemented. If you have any suggestions or opinions, please do not hesitate to contact me. If there is any problem with this article, I will remove it.
~~Lately I've been doing this from the command line, but back when I didn't understand anything I was running it in Python... ignorance...~~