I tried building a chatbot again, changing my approach after the previous failure. This time it worked, though the result is not very interesting because it mostly follows the official documentation.
The previous failed attempt is described here. The entire code is available here.
This time I decided to use Tensor2Tensor (t2t), provided by the Google Brain team. The appeal of t2t is that, if you train on one of the already prepared datasets, you can run everything with commands alone, without writing any code. Using your own dataset is also easy: as the official documentation describes, you only need a few lines of code and a properly formatted dataset.
This time, I will train and run inference using input_corpus.txt and output_corpus.txt, extracted from the Nagoya University Conversation Corpus prepared last time, as the dataset. The execution environment is Google Colab.
We will proceed according to the official documentation [^1], this article [^2], this article [^3], and so on.
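Before handing the corpus to t2t, it is worth checking that the two files are line-aligned, since each line of input_corpus.txt is paired with the same-numbered line of output_corpus.txt. A minimal sketch of such a check (the sample lines below are made up for illustration; in the notebook you would read them from the corpus files instead):

```python
def count_valid_pairs(input_lines, output_lines):
    """Count line pairs where both sides are non-empty after stripping,
    mirroring the filtering that generate_samples will do later."""
    valid = 0
    for src, tgt in zip(input_lines, output_lines):
        if src.strip() and tgt.strip():
            valid += 1
    return valid

if __name__ == "__main__":
    # In the actual notebook these would come from the corpus files, e.g.:
    # with open('input_corpus.txt') as f: input_lines = f.readlines()
    input_lines = ["Hello\n", "\n", "How are you?\n"]
    output_lines = ["What does that mean?\n", "Yes\n", "Fine\n"]
    print(count_valid_pairs(input_lines, output_lines))  # 2
```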
If you follow the reference pages above, you need the following two files.
For details on how to create them, see the reference pages. For now, here is the code.
myproblem.py
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class chat_bot(text_problems.Text2TextProblem):

    @property
    def approx_vocab_size(self):
        return 2**13

    @property
    def is_generate_per_split(self):
        # generate_samples yields one stream; t2t splits it for us.
        return False

    @property
    def dataset_splits(self):
        # 9 training shards and 1 eval shard.
        return [{
            "split": problem.DatasetSplit.TRAIN,
            "shards": 9,
        }, {
            "split": problem.DatasetSplit.EVAL,
            "shards": 1,
        }]

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        filename_input = '/content/drive/My Drive/Colab Notebooks/input_corpus.txt'
        filename_output = '/content/drive/My Drive/Colab Notebooks/output_corpus.txt'
        # Pair each input line with the same-numbered output line,
        # skipping pairs where either side is empty.
        with open(filename_input) as f_in, open(filename_output) as f_out:
            for src, tgt in zip(f_in, f_out):
                src = src.strip()
                tgt = tgt.strip()
                if not src or not tgt:
                    continue
                yield {'inputs': src, 'targets': tgt}
The changes from the official documentation are the class name and the relevant parts of the generate_samples function. Class names are conventionally written in CamelCase, but for some reason I had to write this one in snake_case. It is a bit of a mystery, since CamelCase should work.
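As I understand it, t2t's registry normally derives the problem name by converting the class name from CamelCase to snake_case, so a class named ChatBot should be registered as chat_bot. A rough sketch of that conversion (my own re-implementation for illustration, not the actual t2t code):

```python
import re

def camel_to_snake(name):
    """Insert '_' before an uppercase letter that follows a lowercase
    letter or digit, then lowercase everything."""
    return re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name).lower()

print(camel_to_snake("ChatBot"))  # chat_bot
```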
__init__.py
from . import myproblem
For this file, just write the line above and put it in the same directory as myproblem.py.
t2t also preprocesses the data almost automatically, which is convenient. (The code below was converted from notebook format to .py format.)
ChatBot_with_t2t.py
# Tensorflow version 1.x

# Commented out IPython magic to ensure Python compatibility.
# %tensorflow_version 1.x

"""# Install the machine learning model (Transformer)"""

!pip install tensor2tensor
"""#Google Drive mount"""
from google.colab import drive
drive.mount('/content/drive')
"""#Change working directory"""
cd /content/drive/My Drive/Colab Notebooks
"""#Preprocessing of training data"""
!t2t-datagen \
--data_dir=. \
--tmp_dir=./t2t \
--problem=chat_bot \
--t2t_usr_dir=./t2t
This time, put myproblem.py and __init__.py in the t2t directory one level below ChatBot_with_t2t.ipynb. Also, I put input_corpus.txt and output_corpus.txt in the same directory as the .ipynb, but since files are generated during execution, it may be better to keep them in a separate folder.
For the command-line option --problem=, specify the class name as changed in myproblem.py (it is supposed to be converted automatically from CamelCase to snake_case, but that didn't work for me).
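Given the 9-train / 1-eval shard split defined in myproblem.py, t2t-datagen should leave shard files in data_dir whose names follow the pattern {problem}-{split}-NNNNN-of-MMMMM (with the eval split labeled "dev", as far as I observed). A small sketch to list the expected names:

```python
def expected_shard_names(problem_name, split, num_shards):
    """Build the shard filenames t2t-datagen is expected to write."""
    return [
        "%s-%s-%05d-of-%05d" % (problem_name, split, i, num_shards)
        for i in range(num_shards)
    ]

train_files = expected_shard_names("chat_bot", "train", 9)
eval_files = expected_shard_names("chat_bot", "dev", 1)
print(train_files[0])  # chat_bot-train-00000-of-00009
print(eval_files)      # ['chat_bot-dev-00000-of-00001']
```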
ChatBot_with_t2t.py
"""#Execution of learning"""
!t2t-trainer \
--data_dir=/content/drive/My\ Drive/Colab\ Notebooks \
--problem=chat_bot \
--model=transformer \
--hparams_set=transformer_base_single_gpu \
--output_dir=/content/drive/My\ Drive/Colab\ Notebooks/t2t \
--t2t_usr_dir=/content/drive/My\ Drive/Colab\ Notebooks/t2t
In the preprocessing step the directories were given as relative paths, but of course absolute paths also work. Here the paths are escaped with \ because they contain spaces.
The model is again the Transformer. Training took about 3-4 hours.
To get around Colab's 90-minute idle limit, I had the page reload automatically with a Chrome extension.
One caveat: training creates many intermediate files. In my environment, the Google Drive trash filled up and I had to empty it mid-run. The checkpoint files generated by training also take up a lot of space, so you may want to delete them once you no longer need them.
ChatBot_with_t2t.py
"""#Infer"""
!t2t-decoder \
--data_dir=./ \
--problem=chat_bot \
--model=transformer \
--hparams_set=transformer_base_single_gpu \
--output_dir=./t2t \
--decode_hparams="beam_size=4,alpha=0.6" \
--decode_interactive=true \
--t2t_usr_dir=./t2t
With the above command, decoding runs interactively: you can chat with your bot at the decode prompt. Below is an excerpt of the exchange.
Input: Hello
Output: What does that mean?
Input: No, you're saying good morning
Output: All right.
Input: What's okay(Lol)
Output: Yeah.
Input: Yeah
Output: Personality?
Input: That's the personality
Output: Well, that's right.
Input: I'm convinced
Output: Telephone cards, telephone cards, usually thank you, it's natural to give them.
Input: Suddenly rampage
Output: <Laughter> If you stab, you.
Input: Suddenly scared
Output: It's scary.
Input: I'll stop you
Output: What is it for the first time?
You can exit interactive mode by typing q.
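Besides interactive mode, t2t-decoder also supports batch decoding from a file via the --decode_from_file and --decode_to_file flags. A sketch that just assembles such a command string (the file names questions.txt and answers.txt are hypothetical placeholders):

```python
def build_decode_command(input_path, output_path):
    """Assemble a non-interactive t2t-decoder invocation that reads
    prompts from input_path and writes replies to output_path."""
    return (
        "t2t-decoder "
        "--data_dir=./ "
        "--problem=chat_bot "
        "--model=transformer "
        "--hparams_set=transformer_base_single_gpu "
        "--output_dir=./t2t "
        '--decode_hparams="beam_size=4,alpha=0.6" '
        "--decode_from_file=%s "
        "--decode_to_file=%s "
        "--t2t_usr_dir=./t2t" % (input_path, output_path)
    )

cmd = build_decode_command("questions.txt", "answers.txt")
print(cmd)
```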
The unnaturalness cannot be entirely hidden and the bot sometimes goes off the rails, but on the whole it seems to work. Since this dataset consists mainly of casual, fragmented conversation, it would probably struggle to produce formal sentences.
Last time was a complete failure, but this time I was able to build a chatbot using t2t. It answers questions rather than holding a conversation, but I think it works to some extent.
Creating a chatbot this way is easy, and t2t supports other machine learning tasks as well, so you may be able to build what you want without much effort.
[^1]: Official documentation: how to create your own dataset
[^2]: Come on! Chatbot whywaita-kun!
[^3]: Japanese-English translation with tensor2tensor