This is the story of my first attempt at using machine learning for natural language processing, written down up to the point where I tried to get it working. As of this post it does not work well, so please treat it as a cautionary tale rather than a guide. If you want to know how to do it properly, please see here.
In other words, this is an article by an apprentice engineer. If you don't expect it to be helpful, feel free to hit the back button.
I had dabbled in machine learning itself before, but I had no experience with natural language processing, so I started by gathering information and background knowledge.
The first thing that jumped out at me was that Google's BERT was amazing. I looked into its architecture and training mechanism, but I was left in a complete "?????????" state.
Still, BERT seemed to be amazing, so I decided to build something with it, and settled on a chatbot.
I also considered the later XLNet and ALBERT, but none of them, BERT included, looked like something I could easily adapt on my own.
In particular, with the BERT repository that Google provides unofficially on GitHub, text classification looked easy, but anything beyond the tasks it anticipates seemed like a high hurdle. So I looked for another approach.
After looking around, I found people doing Japanese-English translation with a Transformer and [people building chatbots with a Transformer](https://sekailab.com/wp/2019/03/27/transformer-general-responce-bot/).
So why not build a chatbot with a Transformer myself? That was the idea, and I decided to put it into practice.
Click here for the materials used this time
Now let's move on to how to build it. Everything runs on Google Colab, so the code below is assumed to be in notebook format. The full code is here (https://github.com/NJIMAMTO/transformer-chat-bot/blob/master/transformer.ipynb).
First, install keras-transformer:
!pip install keras-transformer
Next, install SentencePiece, which will be used as the tokenizer:
!pip install sentencepiece
Mount Google Drive here (how to do it)
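For reference, mounting Drive in a Colab notebook typically looks like the snippet below (the mount point /content/drive is the Colab default):

from google.colab import drive

# Mount Google Drive so the corpus and trained models survive across sessions
drive.mount('/content/drive')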
Next, download the corpus and shape it.
!git clone https://github.com/knok/make-meidai-dialogue.git
Change to the directory where the repository is located.
cd "/content/drive/My Drive/Colab Notebooks/make-meidai-dialogue"
Run the makefile:
!make all
Then go back to the original directory:
cd "/content/drive/My Drive/Colab Notebooks"
Running the makefile downloads the corpus and generates sequence.txt. In this file the conversations are written in the format input:~~~~ / output:~~~~, so we reformat them into something easier to work with later.
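If you want to confirm the raw format yourself, a quick peek at the first few lines works (the path is the same one used below; this step is optional):

# Print the first few lines of sequence.txt to confirm the input:/output: format
with open('/content/drive/My Drive/Colab Notebooks/make-meidai-dialogue/sequence.txt') as f:
    for _ in range(4):
        print(f.readline(), end='')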
input_corpus = []
output_corpus = []
for_spm_corpus = []

with open('/content/drive/My Drive/Colab Notebooks/make-meidai-dialogue/sequence.txt') as f:
    for s_line in f:
        # Strip the "input: " / "output: " prefixes and sort each line into its corpus
        if s_line.startswith('input: '):
            input_corpus.append(s_line[len('input: '):])
            for_spm_corpus.append(s_line[len('input: '):])
        elif s_line.startswith('output: '):
            output_corpus.append(s_line[len('output: '):])
            for_spm_corpus.append(s_line[len('output: '):])

with open('/content/drive/My Drive/Colab Notebooks/input_corpus.txt', 'w') as f:
    f.writelines(input_corpus)
with open('/content/drive/My Drive/Colab Notebooks/output_corpus.txt', 'w') as f:
    f.writelines(output_corpus)
with open('/content/drive/My Drive/Colab Notebooks/spm_corpus.txt', 'w') as f:
    f.writelines(for_spm_corpus)
This produces a text file of inputs to the Transformer, a text file of outputs, and a combined text file with the input: and output: prefixes removed, which will be used to train SentencePiece.
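As an optional sanity check, the input and output files should have the same number of lines, since they are paired up with zip() later on:

# The two corpora must stay line-aligned so that zip() pairs each input with its reply
with open('/content/drive/My Drive/Colab Notebooks/input_corpus.txt') as f_in, \
     open('/content/drive/My Drive/Colab Notebooks/output_corpus.txt') as f_out:
    print(sum(1 for _ in f_in), sum(1 for _ in f_out))  # the two counts should match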
Next, use SentencePiece to tokenize the conversations. Let's train it on spm_corpus.txt.
import sentencepiece as spm
# train sentence piece
spm.SentencePieceTrainer.Train("--input=spm_corpus.txt --model_prefix=trained_model --vocab_size=8000 --bos_id=1 --eos_id=2 --pad_id=0 --unk_id=5")
Details are omitted here, since the usage is described in the official SentencePiece repository.
Now, let's try tokenizing a few sentences with the trained model.
sp = spm.SentencePieceProcessor()
sp.Load("trained_model.model")
#test
print(sp.EncodeAsPieces("Oh that's right"))
print(sp.EncodeAsPieces("I see"))
print(sp.EncodeAsPieces("So what do you mean by that?"))
print(sp.DecodeIds([0,1,2,3,4,5]))
This is the execution result.
['Oh oh', 'Such thing', 'Ne']
['I see', 'all right']
['▁', 'In other words', '、', 'That', 'you', 'of', 'say', 'Want', 'That is', 'Such thing', 'Is it', '?']
、。 ⁇
As shown above, the sentences are split into subword pieces.
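As a quick extra check, the pieces can also be decoded back into the original sentence (optional):

# Round-trip check: encoding to pieces and decoding again should reproduce the sentence
pieces = sp.EncodeAsPieces("So what do you mean by that?")
print(sp.DecodePieces(pieces))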
The code from here on is a partial modification of the example in the README.md of keras-transformer.
Now let's pad the corpus and shape it into a format the Transformer can take as input.
import numpy as np

# Build padded encoder/decoder sequences from the corpus
encoder_inputs_no_padding = []
encoder_inputs, decoder_inputs, decoder_outputs = [], [], []
max_token_size = 168

with open('/content/drive/My Drive/Colab Notebooks/input_corpus.txt') as input_tokens, open('/content/drive/My Drive/Colab Notebooks/output_corpus.txt') as output_tokens:
    # Read line by line from the corpus
    input_tokens = input_tokens.readlines()
    output_tokens = output_tokens.readlines()
    for input_token, output_token in zip(input_tokens, output_tokens):
        if input_token or output_token:
            encode_tokens, decode_tokens = sp.EncodeAsPieces(input_token), sp.EncodeAsPieces(output_token)
            # Padding: bring every sequence up to the same length (max_token_size + 2)
            encode_tokens = ['<s>'] + encode_tokens + ['</s>'] + ['<pad>'] * (max_token_size - len(encode_tokens))
            output_tokens = decode_tokens + ['</s>', '<pad>'] + ['<pad>'] * (max_token_size - len(decode_tokens))
            decode_tokens = ['<s>'] + decode_tokens + ['</s>'] + ['<pad>'] * (max_token_size - len(decode_tokens))
            # Convert pieces to vocabulary IDs
            encode_tokens = list(map(lambda x: sp.piece_to_id(x), encode_tokens))
            decode_tokens = list(map(lambda x: sp.piece_to_id(x), decode_tokens))
            output_tokens = list(map(lambda x: [sp.piece_to_id(x)], output_tokens))
            encoder_inputs_no_padding.append(input_token)
            encoder_inputs.append(encode_tokens)
            decoder_inputs.append(decode_tokens)
            decoder_outputs.append(output_tokens)
        else:
            break

# Convert to arrays for input to the training model
X = [np.asarray(encoder_inputs), np.asarray(decoder_inputs)]
Y = np.asarray(decoder_outputs)
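Before training, it's worth checking the resulting shapes. With the padding above, every sequence should end up 170 tokens long (max_token_size + 2), which presumably is why max_len=170 is used at inference time later:

# Expect (num_pairs, 170) for both encoder and decoder inputs, and (num_pairs, 170, 1) for the targets
print(X[0].shape, X[1].shape, Y.shape)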
Now let's train the transformer.
from keras_transformer import get_model

# Build the model
model = get_model(
    token_num=sp.GetPieceSize(),
    embed_dim=32,
    encoder_num=2,
    decoder_num=2,
    head_num=4,
    hidden_dim=128,
    dropout_rate=0.05,
    use_same_embed=True,  # Input and output are the same language, so share one embedding
)
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)
model.summary()

# Train the model
model.fit(
    x=X,
    y=Y,
    epochs=10,
    batch_size=32,
)
This is the execution result.
Epoch 1/10
33361/33361 [==============================] - 68s 2ms/step - loss: 0.2818
Epoch 2/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2410
Epoch 3/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2331
Epoch 4/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2274
Epoch 5/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2230
Epoch 6/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2193
Epoch 7/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2163
Epoch 8/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2137
Epoch 9/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2114
Epoch 10/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2094
The loss doesn't look bad.
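Since Colab sessions are ephemeral, it can also be worth saving the trained weights to Drive at this point (an optional step; the file name is just an example):

# Save the trained weights so the session can be restarted without retraining
model.save_weights('/content/drive/My Drive/Colab Notebooks/transformer_chatbot_weights.h5')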
Let's run inference with the trained model.
from keras_transformer import decode

# Encode the user utterance into vocabulary IDs
input = "It's nice weather today, is not it"
encode = sp.EncodeAsIds(input)

# Decode a reply token by token, then convert the IDs back into text
decoded = decode(
    model,
    encode,
    start_token=sp.bos_id(),
    end_token=sp.eos_id(),
    pad_token=sp.pad_id(),
    max_len=170
)
decoded = np.array(decoded, dtype=int)
decoded = decoded.tolist()
print(sp.decode(decoded))
This is the execution result.
Hey, but that, that, that, that, that, that, that, that, that, that, that
Hmm... the results are not good at all. This bot has serious communication problems.
What was the cause of the failure? Several possibilities come to mind.
So I tried tuning the hyperparameters with Optuna, but that did not work out either.
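For reference, a minimal sketch of what such a tuning loop might look like is below. The search space and trial settings here are hypothetical, not the ones actually used in this attempt:

import optuna
from keras_transformer import get_model

def objective(trial):
    # Hypothetical search space; embed_dim must stay divisible by head_num
    embed_dim = trial.suggest_categorical('embed_dim', [32, 64, 128])
    head_num = trial.suggest_categorical('head_num', [4, 8])
    dropout_rate = trial.suggest_float('dropout_rate', 0.0, 0.3)

    trial_model = get_model(
        token_num=sp.GetPieceSize(),
        embed_dim=embed_dim,
        encoder_num=2,
        decoder_num=2,
        head_num=head_num,
        hidden_dim=128,
        dropout_rate=dropout_rate,
        use_same_embed=True,
    )
    trial_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

    # A few epochs per trial to keep the search affordable; return the final training loss
    history = trial_model.fit(x=X, y=Y, epochs=3, batch_size=32, verbose=0)
    return history.history['loss'][-1]

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)
print(study.best_params)

Note that optimizing on training loss alone like this cannot catch overfitting; a held-out validation split would be the more careful choice.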
So I decided to try a different approach. The method that eventually worked is described here.