English-Japanese translation was added to WMT, the world's largest workshop on machine translation research, in 2020, and DeepL now supports Japanese, so attention to Japanese machine translation is growing. Under these circumstances, some people may want to try building and training a machine translation model. I am an amateur in this field myself, but I may want to use these tools in the future, so I looked into which machine translation libraries are currently used in research. I have only ever used Fairseq and the old OpenNMT.
According to the Findings of WMT2019, a machine translation competition, over 30% of the submitted systems used Marian.
Marian NMT

Marian is a framework developed by the Microsoft Translator team. It is written in C++, so it is very fast. In terms of accuracy, the Microsoft team's systems built with Marian ranked highly in various language pairs at WMT2019.
As far as I can tell from the examples, the usual flow is to create vocabulary files from a tokenized corpus with Marian's commands and then train.
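Based on those examples (I have not run Marian myself), here is a minimal sketch of that flow in Python, assuming the marian-vocab and marian binaries are on your PATH; the file names and the en-ja language pair are placeholders:

```python
import subprocess

# marian-vocab reads tokenized text on stdin and writes a YAML vocabulary.
for lang in ("en", "ja"):
    with open(f"train.{lang}") as src, open(f"vocab.{lang}.yml", "w") as out:
        subprocess.run(["marian-vocab"], stdin=src, stdout=out, check=True)

# Train a Transformer model on the tokenized parallel training files.
subprocess.run([
    "marian",
    "--type", "transformer",
    "--train-sets", "train.en", "train.ja",
    "--vocabs", "vocab.en.yml", "vocab.ja.yml",
    "--model", "model.npz",
], check=True)
```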
To install the GPU version, you prepare CMake 3.5.1, GCC/G++ 5.4, Boost 1.65.1, and CUDA 9.0 or newer, and build it with make (I once tried to build it without knowing anything and gave up when the build failed).
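For reference, a rough sketch of the source build as I understand it from the official docs; the CMake option and paths are assumptions, so check the Marian documentation for your environment:

```python
import os
import subprocess

# Fetch the sources and do an out-of-tree build with CMake + make.
subprocess.run(["git", "clone", "https://github.com/marian-nmt/marian"], check=True)
os.makedirs("marian/build", exist_ok=True)
subprocess.run(["cmake", "..", "-DCOMPILE_CUDA=on"], cwd="marian/build", check=True)
subprocess.run(["make", "-j4"], cwd="marian/build", check=True)
```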
Unfortunately, I have not yet found any Japanese articles about installing and using Marian, so you will have to read the English documentation and tutorials.
Fairseq

Fairseq is a toolkit developed by Facebook AI. It is written in PyTorch, and its main feature is a design that makes it easy to extend; my impression is that it is updated steadily. In terms of accuracy, the Facebook team achieved excellent results with it at WMT2019. It is fast when using FP16 mode (initially it was faster than Marian, but I think I saw somewhere that a later update made Marian faster).
To use it, you binarize the corpus and vocabulary with a dedicated preprocessing script before training. At inference time, test sentences can be fed in as-is without binarization. One thing I am personally grateful for is that Fairseq commands will not overwrite binarized training data or the checkpoint files of a trained model (they stop with an AssertionError). There is a lot more I could write about Fairseq, but it would fill an article on its own and there are other Japanese articles, so I will leave it at that here. There are also official examples, and it is very easy to use.
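A minimal sketch of that workflow with the fairseq-preprocess / fairseq-train / fairseq-generate commands; the language pair, file prefixes, and hyperparameters below are placeholders, not recommendations:

```python
import subprocess

# Binarize the tokenized corpus and build the vocabularies.
subprocess.run([
    "fairseq-preprocess",
    "--source-lang", "en", "--target-lang", "ja",
    "--trainpref", "train", "--validpref", "valid", "--testpref", "test",
    "--destdir", "data-bin",
], check=True)

# Train a Transformer; --fp16 enables the faster mixed-precision mode.
subprocess.run([
    "fairseq-train", "data-bin",
    "--arch", "transformer",
    "--optimizer", "adam", "--lr", "0.0005",
    "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "4000",
    "--criterion", "label_smoothed_cross_entropy",
    "--max-tokens", "4096",
    "--save-dir", "checkpoints",
    "--fp16",
], check=True)

# Translate the binarized test split with the best checkpoint.
subprocess.run([
    "fairseq-generate", "data-bin",
    "--path", "checkpoints/checkpoint_best.pt",
    "--beam", "5",
], check=True)
```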
Installation is almost trouble-free as long as PyTorch works. For PyTorch, follow the official installation instructions for your OS, package manager, and CUDA version.
OpenNMT

OpenNMT is a toolkit developed by the Harvard NLP group and SYSTRAN. It is the oldest of the ones introduced here. There used to be a Lua version, but it was apparently discontinued when maintenance of Torch ended. Currently, development continues on a PyTorch version (OpenNMT-py) and a TensorFlow version (OpenNMT-tf), whose available features differ considerably. In addition to machine translation and language modeling, they can handle image-to-text, speech-to-text, summarization, sequence classification, and sequence tagging.
As with the others, it is typically used by running a dedicated preprocessing step and then training.
When I used OpenNMT-py, I struggled with the torchtext version during installation, and it does not automatically save the model that performs best on validation, so at inference time I had to choose a model by reading the training log; I was also frustrated at not knowing which options and hyperparameters would give good accuracy. I don't know what the situation is now. There are plenty of Japanese articles, so you may want to refer to them.
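For reference, a minimal sketch of the classic OpenNMT-py workflow as I remember it, using the onmt_preprocess / onmt_train / onmt_translate entry points from the version I used; file names and the checkpoint step are placeholders, and newer releases may differ:

```python
import subprocess

# Build the binarized dataset and vocabularies from raw parallel text.
subprocess.run([
    "onmt_preprocess",
    "-train_src", "train.en", "-train_tgt", "train.ja",
    "-valid_src", "valid.en", "-valid_tgt", "valid.ja",
    "-save_data", "data/demo",
], check=True)

# Train with default settings; checkpoints are saved at fixed step
# intervals, so you pick the "best" one yourself from the training log.
subprocess.run(
    ["onmt_train", "-data", "data/demo", "-save_model", "demo-model"],
    check=True)

# Translate with a chosen checkpoint.
subprocess.run([
    "onmt_translate",
    "-model", "demo-model_step_100000.pt",
    "-src", "test.en", "-output", "pred.ja",
], check=True)
```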
Tensor2Tensor

T2T is a library of deep learning models and datasets developed by the Google Brain team, written in TensorFlow. The other libraries in this article have machine translation as their main function, but T2T can apply deep learning models to a variety of tasks such as image classification and image generation.
The rough usage is to run a data generation command and then train and infer, but there is an option called --problem. Rather than just specifying the task, this option specifies the dataset to use. That makes it very easy to experiment with existing benchmark datasets, but to use a dataset you have prepared yourself, you need to define a class that inherits from the Problem class. It is nice that this design explicitly ties the model to the data (and the preprocessing method), but I think the other libraries are superior in ease of use.
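A minimal sketch of that pattern, following the Text2TextProblem subclassing example in the T2T documentation; the class name, file names, and vocabulary size are placeholder assumptions:

```python
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class TranslateEnjaMydata(text_problems.Text2TextProblem):
    """A hypothetical English-Japanese problem over our own corpus."""

    @property
    def approx_vocab_size(self):
        return 2 ** 15  # target a ~32k subword vocabulary

    @property
    def is_generate_per_split(self):
        return False  # generate once and let T2T shard train/dev itself

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        # Yield parallel sentence pairs; the file names are placeholders.
        with open("train.en") as src, open("train.ja") as tgt:
            for s, t in zip(src, tgt):
                yield {"inputs": s.strip(), "targets": t.strip()}
```

Once registered, the class should become selectable by its snake_case name, e.g. --problem=translate_enja_mydata, in t2t-datagen and t2t-trainer.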
There are Japanese articles and an official Jupyter Notebook, so my impression is that the examples are substantial. TensorFlow Serving can also be used, so this may be the choice when putting deep learning models into production.
Sockeye

Sockeye is a seq2seq framework built on Apache MXNet (Incubating). The usage seems to be similar to OpenNMT.
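I have not used it myself, so this is only a rough sketch based on the documented command-line entry points; the file names are placeholders:

```python
import subprocess

# Train on tokenized parallel text; validation data is required.
subprocess.run([
    "python", "-m", "sockeye.train",
    "--source", "train.en", "--target", "train.ja",
    "--validation-source", "valid.en", "--validation-target", "valid.ja",
    "--output", "model",
], check=True)

# Translate: sockeye.translate reads source sentences from stdin.
with open("test.en") as src, open("pred.ja", "w") as out:
    subprocess.run(
        ["python", "-m", "sockeye.translate", "--models", "model"],
        stdin=src, stdout=out, check=True,
    )
```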
I could not really grasp this library's distinctive features just by looking at it... Sorry this part reads like a casual blog...
I think it is best to try them in order, starting from the top of this list.