My name is okayu, and I am studying machine translation in Nara. As the title suggests, let's do English-Japanese neural machine translation from scratch.
Neural machine translation refers to machine translation methods that use deep learning. The Transformer is a well-known model and is currently the mainstream approach.
- Attention Is All You Need (original paper)
- Paper commentary: Attention Is All You Need (Transformer) (Japanese commentary article by Ryobot)
This time, I would like to do English-Japanese neural machine translation using this Transformer model! We will use Sentencepiece, OpenNMT, MeCab, and multi-bleu.perl, and training is assumed to run on a single GPU. This article does not explain the underlying technologies such as the Transformer itself ...
The general flow is as follows:

1. Prepare a parallel corpus
2. Tokenize with Sentencepiece
3. Preprocess and train with OpenNMT-py
4. Translate the test set
5. Detokenize with Sentencepiece and segment words with MeCab
6. Evaluate with BLEU (multi-bleu.perl)
Since OpenNMT-py is used this time, the environment is Python 3.5 or later with PyTorch 1.4.
A corpus is linguistic data used in natural language processing. Japanese-English parallel datasets such as ASPEC (a corpus of scientific paper abstracts), KFTT, and JParaCrawl are currently provided free of charge for research purposes. Other corpora are listed on Graham Neubig's page here. Beyond Japanese-English, there are also corpora such as ParaCrawl for other language pairs.
First, let's prepare a parallel sentence dataset for training. What you need is as follows.
- train.ja: Japanese sentences for training
- train.en: English sentences for training
- dev.ja: Japanese sentences for validation
- dev.en: English sentences for validation
- test.ja: Japanese sentences for testing
- test.en: English sentences for testing
Since this is English-to-Japanese translation, test.ja holds the reference sentences and test.en holds the source sentences. train.ja, train.en, dev.ja, and dev.en are used for training. Corresponding lines of the .en and .ja files must be translations of each other.
As an actual example, ASPEC provides parallel sentences for train, validation (development), and test.
B-94A0894379 ||| 3 |||Since the operation on the user side is important for the idea support in material development, an interface for material manipulation at the atomic level was developed [1994.8].||| Because user operation is important for the idea support in material development, an interface for a substance operation at atomic level was developed.
The article ID and the Japanese and English sentences that are translations of each other appear together on one line. There are 1 million such lines in train1.txt, 1,790 in dev.txt, and 1,812 in test.txt. The data cannot be used for training as is; it has to be split into an English-only file and a Japanese-only file, for instance with a small script like the one below. (In this example, train1.txt is split into train.ja and train.en, dev.txt into dev.ja and dev.en, and test.txt into test.ja and test.en.) When doing this, make sure the corresponding lines of the .en and .ja files remain translations of each other.
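For reference, here is a minimal sketch of such a split script (my own illustration, not part of the official ASPEC tools; the filename split_aspec.py is just an example). It assumes each line looks like the train1.txt example above, with the Japanese sentence in the second-to-last `|||`-separated field and the English sentence in the last field.

split_aspec.py
def split_corpus(infile, ja_out, en_out):
    # Each line: ID ||| (score) ||| Japanese sentence ||| English sentence
    with open(infile, encoding='utf-8') as f, \
         open(ja_out, mode='w', encoding='utf-8') as fja, \
         open(en_out, mode='w', encoding='utf-8') as fen:
        for line in f:
            fields = [x.strip() for x in line.rstrip('\n').split('|||')]
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            fja.write(fields[-2] + '\n')  # Japanese: second-to-last field
            fen.write(fields[-1] + '\n')  # English: last field

split_corpus('train1.txt', 'train.ja', 'train.en')
split_corpus('dev.txt', 'dev.ja', 'dev.en')
split_corpus('test.txt', 'test.ja', 'test.en')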
Tokenize using Sentencepiece. MeCab or KyTea could also be used, but this time we tokenize into subword units instead of word units.
This time, the vocabulary size is 16,000, and the vocabulary is shared between English and Japanese. Running `cat train.en train.ja > tmp.txt` temporarily creates a single file containing both the English and Japanese training data. Install Sentencepiece with `pip install sentencepiece`.
Next, we train the Sentencepiece model.
spm_train.py
import sentencepiece as spm

# Train a shared English/Japanese subword model with a 16,000-token vocabulary
spm.SentencePieceTrainer.Train("--input=tmp.txt --model_prefix=spm_trained_model --vocab_size=16000")
Run `python spm_train.py` to train the model. This creates `spm_trained_model.model` and `spm_trained_model.vocab`. Since `tmp.txt` is no longer needed, delete it.
Tokenize all of the parallel datasets (`python spm_tok.py`).
spm_tok.py
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('spm_trained_model.model')

def tokenize(filename, outputfilename):
    # Encode each line into space-separated subword pieces
    with open(filename, mode='r') as f, open(outputfilename, mode='w') as foutput:
        for line in f:
            pieces = sp.EncodeAsPieces(line.rstrip('\n'))
            foutput.write(' '.join(pieces) + '\n')

tokenize('train.en', 'train.en.atok')
tokenize('train.ja', 'train.ja.atok')
tokenize('dev.en', 'dev.en.atok')
tokenize('dev.ja', 'dev.ja.atok')
tokenize('test.en', 'test.en.atok')
tokenize('test.ja', 'test.ja.atok')
The tokenized files `train.en.atok`, `train.ja.atok`, `dev.en.atok`, `dev.ja.atok`, `test.en.atok`, and `test.ja.atok` have now been created.
(OpenNMT does not use test.ja.atok, but fairseq does, so I created it anyway.)
Please refer to the official GitHub (Sentencepiece / GitHub) for how to do this from the command line.
This time, we will train with OpenNMT. Besides OpenNMT, there is also fairseq. Both OpenNMT and fairseq perform their own preprocessing, which converts the sentences into a data structure that is easy to train on.
Install `OpenNMT-py`.
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
python setup.py install
From now on, it is assumed that all processing is performed in the `OpenNMT-py` directory. Next, serialize the data with `preprocess.py`. Since this is English-to-Japanese translation, specify the English dataset for src and the Japanese dataset for tgt.
python preprocess.py -train_src train.en.atok -train_tgt train.ja.atok -valid_src dev.en.atok -valid_tgt dev.ja.atok -save_data preprocessed_dataset
When executed, the files `preprocessed_dataset.train.pt`, `preprocessed_dataset.valid.pt`, and `preprocessed_dataset.vocab.pt` will be created. (Multiple files may be created depending on the amount of data.)
The explanation of each file is as follows (quoted from the official GitHub):
After running the preprocessing, the following files are generated:
- demo.train.pt: serialized PyTorch file containing training data
- demo.valid.pt: serialized PyTorch file containing validation data
- demo.vocab.pt: serialized PyTorch file containing vocabulary data
Train with the following command (quoted from the OpenNMT FAQ). Please refer to the official documentation (English) for an explanation of the hyperparameters. Since training is performed on a single GPU, the CUDA device is specified and world_size is set to 1.
export CUDA_VISIBLE_DEVICES=0 && \
python train.py -data preprocessed_dataset -save_model save_model_name \
-layers 6 -rnn_size 512 -word_vec_size 512 \
-transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens \
-normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam \
-warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 40000 -world_size 1 -gpu_ranks 0
When you run it, it starts like this. (The full NMTModel structure is also printed, but since it is a little long, it is omitted here.)
[2020-06-16 14:50:34,141 INFO] * src vocab size = 8071
[2020-06-16 14:50:34,143 INFO] * tgt vocab size = 14921
[2020-06-16 14:50:34,145 INFO] Building model...
[2020-06-16 14:50:38,414 INFO] NMTModel(
As the Step count increases like this, it is proof that training is progressing (the numerical values are masked in the examples below):
[2020-06-18 00:59:56,869 INFO] Step 100/200000; acc: *; ppl: *; xent: *; lr: *; 12933/10876 tok/s; 122958 sec
When it finishes, it looks like this:
[2020-06-18 01:01:23,330 INFO] Step 200000/200000; acc: *; ppl: *; xent: *; lr: *; 13220/10803 tok/s; 123045 sec
[2020-06-18 01:01:23,336 INFO] Loading dataset from preprocessed_dataset.valid.pt
[2020-06-18 01:01:23,473 INFO] number of examples: 1791
[2020-06-18 01:01:26,183 INFO] Validation perplexity: *
[2020-06-18 01:01:26,185 INFO] Validation accuracy: *
[2020-06-18 01:01:26,265 INFO] Saving checkpoint save_model_name_step_200000.pt
Training completes in one to two days with the ASPEC dataset on a single GPU. (On CPU it would be much slower than on GPU.)
After training, checkpoints such as `save_model_name_step_200000.pt` will have been generated. We translate using this checkpoint.
For the source text (the English text you want to translate), use the tokenized `test.en.atok` created earlier. (`test.en` will not work.)
python translate.py -model save_model_name_step_200000.pt -src test.en.atok -output output.txt -replace_unk -verbose -gpu 0
The translated sentence file `output.txt` has been created!
The translated text `output.txt` is still tokenized, so detokenize it with Sentencepiece to restore the raw text (`python spm_detok.py`).
spm_detok.py
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('spm_trained_model.model')

openfile = 'output.txt'
# Decode space-separated subword pieces back into raw sentences
with open(openfile, mode='r') as f, open(openfile + '.detok', mode='w') as f2:
    for line in f:
        pieces = line.rstrip('\n').split(' ')
        f2.write(sp.decode_pieces(pieces) + '\n')
You now have a raw Japanese translation in `output.txt.detok`! This is the basic flow of neural machine translation. Below, we explain how the generated text is evaluated.
As we will see later, BLEU, which measures n-gram word overlap with the reference, is used for evaluation. Unlike English, Japanese does not separate words with spaces, so the sentences need to be segmented into words. This time, I will use MeCab for word segmentation. (Some people seem to use KyTea.)
Install MeCab and its dictionary following this procedure.
Next, run `python mecab_owakati.py output.txt.detok` and `python mecab_owakati.py test.ja` to segment the generated translations and the reference sentences into words.
mecab_owakati.py
import MeCab
import sys

m = MeCab.Tagger('-Owakati')
openfile = sys.argv[1]
# Write a word-segmented (wakati) version of each line to <filename>.mecab
with open(openfile, mode='r') as f, open(openfile + '.mecab', mode='w') as f2:
    for line in f:
        f2.write(m.parse(line.rstrip('\n')))
The word-segmented Japanese translation `output.txt.detok.mecab` and the word-segmented reference `test.ja.mecab` have been generated. Next, we evaluate the generated translations using these files.
BLEU is often used to evaluate machine translation, so we will use it here.
- BLEU: a Method for Automatic Evaluation of Machine Translation (original paper)
- Automatic evaluation metric BLEU (NICT BLEU commentary by Masao Uchiyama)
In recent years, evaluation methods based on BERT rather than BLEU have appeared (BERTScore, for example?), and there are also metrics such as RIBES and ROUGE. This time I will stick with BLEU. We use `multi-bleu.perl` for evaluation; when you clone OpenNMT-py, it is included in the `tools/` directory by default.
Running `perl tools/multi-bleu.perl test.ja.mecab < output.txt.detok.mecab` displays the BLEU score.
BLEU = *, */*/*/* (BP=*, ratio=*, hyp_len=*, ref_len=*)
#BLEU = global-BLEU score, precisions of 1-grams/2-grams/3-grams/4-grams (BP=brevity penalty, ratio=length ratio, hyp_len=hypothesis length, ref_len=reference length)
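For reference, the global BLEU score shown above is the standard combination of the displayed n-gram precisions $p_n$ and the brevity penalty BP, with uniform weights $w_n = 1/4$:

```math
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right), \qquad
\mathrm{BP} = \min\left(1,\ \exp\left(1 - \frac{\mathrm{ref\_len}}{\mathrm{hyp\_len}}\right)\right)
```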
By the way, the evaluation results from WAT are listed here. With the above hyperparameters on ASPEC, I think you get around BLEU 38-40 (it depends greatly on the hyperparameters used for training). Compared to that, I thought a certain N company and a certain N|CT are strong. Also, although I used multi-bleu.perl this time, I think you can also use sacreBLEU, which recently started supporting Japanese (MeCab only); see the sketch below.
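As a side note, here is a minimal sketch of scoring with sacreBLEU's Python API instead of multi-bleu.perl, assuming the sacrebleu and mecab-python3 packages are installed (the filename sacrebleu_eval.py is just an example). With the `ja-mecab` tokenizer you pass the raw, detokenized Japanese text directly, so the manual MeCab segmentation step is not needed.

sacrebleu_eval.py
import sacrebleu

# Read the detokenized system output and the raw references, one sentence per line
with open('output.txt.detok') as f:
    hyps = [line.rstrip('\n') for line in f]
with open('test.ja') as f:
    refs = [line.rstrip('\n') for line in f]

# 'ja-mecab' tokenizes the Japanese internally before computing BLEU
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize='ja-mecab')
print(bleu.score)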
When I first did machine translation, I didn't know what to do, so I wrote this article for people who have never done machine translation before (there were few Japanese articles, and since I'm not good at English, I had a hard time). I'm sorry if it's hard to understand. I hope it helps spark an interest in natural language processing and machine translation.
If you are interested in models other than the Transformer, it may be fun to try fairseq, which has various models implemented. If you have any suggestions or opinions, please do not hesitate to contact me. If there is any problem with this article, I will remove it.
~~Lately I've been doing this from the command line, but back when I didn't understand anything I was running it in Python... ignorance...~~