A memo from building a simple neural network that uses text as training data, written to understand how machine-learning chatbots work.
A rule-based chatbot originally built for English text is adapted to run on Japanese text. The Japanese text is preprocessed so that it can be passed through the neural network. As training data, we used support pages related to Niantic's "Pokemon GO" scraped from the web.
Following the "rule-based" approach, which returns a response sentence prepared in advance according to the input, only the multi-class classification part that identifies and predicts "Intents" (intentions) is built here.
Since the model predicts a related "Frequently Asked Questions (FAQ)" entry from the input rather than generating text, it is built with ordinary neural network layers instead of an RNN.
A virtual environment is used instead of a Jupyter notebook.
- macOS Mojave 10.14.6
Reference page for MeCab installation
wakatigaki.py
import MeCab
import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

def create_tokenizer():
    # Read the CSV file of (label, text) pairs.
    text_list = []
    with open("pgo_train_texts.csv", "r") as csvfile:
        texts = csv.reader(csvfile)
        for text in texts:
            text_list.append(text)

    # Use MeCab to split the Japanese text into words (wakati-gaki).
    wakati_list = []
    label_list = []
    wakati = MeCab.Tagger("-Owakati")
    for label, text in text_list:
        text = text.lower()
        text_wakati = wakati.parse(text).strip()
        wakati_list.append(text_wakati)
        label_list.append(label)

    # Find the number of words in the longest sentence and
    # build the lists of text data passed to the tokenizer.
    max_len = -1
    split_list = []
    sentences = []
    for text in wakati_list:
        text = text.split()
        split_list.extend(text)
        sentences.append(text)
        if len(text) > max_len:
            max_len = len(text)
    print("Max length of texts: ", max_len)
    vocab_size = len(set(split_list))
    print("Vocabulary size: ", vocab_size)
    label_size = len(set(label_list))

    # Use the Tokenizer to assign numbers to words starting from index 1,
    # which also builds the word dictionary.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<oov>")
    tokenizer.fit_on_texts(split_list)
    word_index = tokenizer.word_index
    print("Dictionary size: ", len(word_index))
    sequences = tokenizer.texts_to_sequences(sentences)

    # The label data used for supervised learning is also numbered with a Tokenizer.
    label_tokenizer = tf.keras.preprocessing.text.Tokenizer()
    label_tokenizer.fit_on_texts(label_list)
    label_index = label_tokenizer.word_index
    label_sequences = label_tokenizer.texts_to_sequences(label_list)

    # The Tokenizer assigns numbers starting from 1, while the actual labels
    # are indexed from 0, so subtract 1.
    label_seq = []
    for label in label_sequences:
        l = label[0] - 1
        label_seq.append(l)

    # to_categorical() converts the label data passed to the model into one-hot vectors.
    one_hot_y = tf.keras.utils.to_categorical(label_seq)

    # To align the size of the training data, pad the shorter texts with 0
    # up to the length of the longest text.
    padded = pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post")
    print("padded sequences: ", padded)

    # Reverse dictionary used to look up the intent name from a predicted index.
    reverse_index = dict()
    for intent, i in label_index.items():
        reverse_index[i] = intent

    return padded, one_hot_y, word_index, reverse_index, tokenizer, max_len, vocab_size
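As a side note, padding="post" appends zeros after the word indices up to maxlen, and truncating="post" drops anything beyond it. A tiny standalone sketch with made-up index sequences illustrates the behaviour:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Made-up index sequences of different lengths, just to show post padding.
seqs = [[4, 12, 7], [9, 3]]
print(pad_sequences(seqs, maxlen=5, padding="post", truncating="post"))
# [[ 4 12  7  0  0]
#  [ 9  3  0  0  0]]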
model.py
import tensorflow as tf

def model(training, label, vocab_size):
    model = tf.keras.models.Sequential([
        # +2 covers the padding index 0 and the <oov> token,
        # since the Tokenizer numbers words from 1.
        tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=16,
                                  input_length=len(training[0])),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(30, activation="relu"),
        # One output unit per intent class.
        tf.keras.layers.Dense(len(label[0]), activation="softmax")
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.fit(x=training, y=label, epochs=100)
    model.summary()
    return model
- First, the Embedding layer is used so that the relationships between words can be captured as vectors.
- Flatten() is inserted so the embedding matrix can be flattened and passed to the fully connected Dense layer.
- If an average-pooling layer is used instead, the number of parameters of the neural network, and therefore the computational cost, can be reduced (see the sketch below).
- Since the number of "Intents" to identify equals the number of label classes, the output size is matched to the number of elements of the one-hot vector, len(label[0]).
- "softmax", which supports multi-class classification, is chosen as the activation function of the output layer.
- When compiling the model, the loss is set to the one for multi-class classification.
- "Adam" is used as the learning algorithm.
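The pooling variant mentioned in the list above is not part of this project's code; a minimal sketch, assuming the same inputs and using the global variant of average pooling so the output is already two-dimensional, could look like this:

import tensorflow as tf

def pooled_model(training, label, vocab_size):
    # Same architecture, but the embedding outputs are averaged over the sequence
    # instead of flattened, which greatly reduces the number of Dense parameters.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=16,
                                  input_length=len(training[0])),
        tf.keras.layers.GlobalAveragePooling1D(),  # output shape: (batch, 16)
        tf.keras.layers.Dense(30, activation="relu"),
        tf.keras.layers.Dense(len(label[0]), activation="softmax")
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model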
chat.py
import MeCab
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Arrange the text received from the console so that the model can process it.
def prepro_wakati(input_text, tokenizer, max_len):
    sentence = []
    input_text = input_text.lower()
    wakati = MeCab.Tagger("-Owakati")
    text_wakati = wakati.parse(input_text).strip()
    sentence.append(text_wakati)
    print(sentence)
    seq = tokenizer.texts_to_sequences(sentence)
    padded = pad_sequences(seq, maxlen=max_len, padding="post", truncating="post")
    print(padded)
    return padded

def chat(model, tokenizer, label_index, max_len):
    print("Start talking with the bot (type quit to stop): ")
    while True:
        input_text = input("You: ")
        if input_text.lower() == "quit":
            break
        x = prepro_wakati(input_text, tokenizer, max_len)
        results = model.predict(x, batch_size=1)
        print("results: ", results)
        results_index = np.argmax(results)
        print("Predicted index: ", results_index)
        # label_index maps Tokenizer indices (starting from 1) back to intent names,
        # so add 1 to the 0-based prediction.
        intent = label_index[results_index + 1]
        print("Type of intent: ", intent)
Console screen. The entered text is passed to the trained model, which predicts which of the nine "Intents" it falls under.
- Types of Intents
Call and execute the functions defined above.
ex.py
import wakatigaki
import model
import chat

# Preprocess the text, build and train the model, then start the chat loop.
padded, one_hot_y, word_index, label_index, tokenizer, max_len, vocab_size = wakatigaki.create_tokenizer()
trained_model = model.model(padded, one_hot_y, vocab_size)
chat.chat(trained_model, tokenizer, label_index, max_len)
The numbers in "results:" are the predicted probabilities for each category.
For the input "how to catch Pokemon" the model predicts "start guide", and for "poke coins and items" it predicts "shop"; in both cases it chose an appropriate category.