A memo from building a simple neural network that uses text as training data, written to understand how machine-learning chatbots work.
A rule-based chatbot originally built for English text is adapted to run on Japanese text. The Japanese text is preprocessed so that it can be passed through the neural network. As training data, we used support pages related to Niantic's "Pokemon GO" scraped from the web.
Following the "rule-based" approach, which returns a response sentence prepared in advance according to the input, only the multi-class classification part that identifies and predicts "Intents" (intentions) is built here.
Since the model predicts a related "Frequently Asked Questions (FAQ)" entry from the input rather than generating text, it is built with ordinary neural network layers instead of an RNN.
A virtual environment is used instead of a Jupyter notebook.
- macOS Mojave 10.14.6
Reference page for MeCab installation
wakatigaki.py
import MeCab
import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

def create_tokenizer():
    # Read the CSV file of (label, text) pairs.
    text_list = []
    with open("pgo_train_texts.csv", "r") as csvfile:
        texts = csv.reader(csvfile)
        for text in texts:
            text_list.append(text)

    # Use MeCab to split the Japanese text into words (wakati-gaki).
    wakati_list = []
    label_list = []
    wakati = MeCab.Tagger("-Owakati")
    for label, text in text_list:
        text = text.lower()
        text_wakati = wakati.parse(text).strip()
        wakati_list.append(text_wakati)
        label_list.append(label)

    # Find the number of words in the longest sentence and
    # build the lists of text data passed to the tokenizer.
    max_len = -1
    split_list = []
    sentences = []
    for text in wakati_list:
        text = text.split()
        split_list.extend(text)
        sentences.append(text)
        if len(text) > max_len:
            max_len = len(text)
    print("Max length of texts: ", max_len)
    vocab_size = len(set(split_list))
    print("Vocabulary size: ", vocab_size)
    label_size = len(set(label_list))

    # Use the Tokenizer to assign numbers to words starting from index 1,
    # which also builds the word dictionary.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<oov>")
    tokenizer.fit_on_texts(split_list)
    word_index = tokenizer.word_index
    print("Dictionary size: ", len(word_index))
    sequences = tokenizer.texts_to_sequences(sentences)

    # The label data used for supervised learning is also numbered with a Tokenizer.
    label_tokenizer = tf.keras.preprocessing.text.Tokenizer()
    label_tokenizer.fit_on_texts(label_list)
    label_index = label_tokenizer.word_index
    label_sequences = label_tokenizer.texts_to_sequences(label_list)

    # The Tokenizer assigns numbers starting from 1, while the actual labels
    # are indexed from 0, so subtract 1.
    label_seq = []
    for label in label_sequences:
        l = label[0] - 1
        label_seq.append(l)

    # to_categorical() converts the label data passed to the model into one-hot vectors.
    one_hot_y = tf.keras.utils.to_categorical(label_seq)

    # To align the size of the training data, pad the shorter texts with 0
    # up to the length of the longest text.
    padded = pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post")
    print("padded sequences: ", padded)

    # Reverse dictionary used to look up the intent name from a predicted index.
    reverse_index = dict()
    for intent, i in label_index.items():
        reverse_index[i] = intent

    return padded, one_hot_y, word_index, reverse_index, tokenizer, max_len, vocab_size
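As a side note, padding="post" appends zeros after the word indices up to maxlen, and truncating="post" drops anything beyond it. A tiny standalone sketch with made-up index sequences illustrates the behaviour:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Made-up index sequences of different lengths, just to show post padding.
seqs = [[4, 12, 7], [9, 3]]
print(pad_sequences(seqs, maxlen=5, padding="post", truncating="post"))
# [[ 4 12  7  0  0]
#  [ 9  3  0  0  0]]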
model.py
import tensorflow as tf

def model(training, label, vocab_size):
    model = tf.keras.models.Sequential([
        # +2 covers the padding index 0 and the <oov> token,
        # since the Tokenizer numbers words from 1.
        tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=16,
                                  input_length=len(training[0])),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(30, activation="relu"),
        # One output unit per intent class.
        tf.keras.layers.Dense(len(label[0]), activation="softmax")
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.fit(x=training, y=label, epochs=100)
    model.summary()
    return model
- First, the Embedding layer is used so that the relationships between words can be captured as vectors.
- Flatten() is inserted so the embedding matrix can be flattened and passed to the fully connected Dense layer.
- If an average-pooling layer is used instead, the number of parameters of the neural network, and therefore the computational cost, can be reduced (see the sketch below).
- Since the number of "Intents" to identify equals the number of label classes, the output size is matched to the number of elements of the one-hot vector, len(label[0]).
- "softmax", which supports multi-class classification, is chosen as the activation function of the output layer.
- When compiling the model, the loss is set to the one for multi-class classification.
- "Adam" is used as the learning algorithm.
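The pooling variant mentioned in the list above is not part of this project's code; a minimal sketch, assuming the same inputs and using the global variant of average pooling so the output is already two-dimensional, could look like this:

import tensorflow as tf

def pooled_model(training, label, vocab_size):
    # Same architecture, but the embedding outputs are averaged over the sequence
    # instead of flattened, which greatly reduces the number of Dense parameters.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=16,
                                  input_length=len(training[0])),
        tf.keras.layers.GlobalAveragePooling1D(),  # output shape: (batch, 16)
        tf.keras.layers.Dense(30, activation="relu"),
        tf.keras.layers.Dense(len(label[0]), activation="softmax")
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model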
chat.py
import MeCab
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Arrange the text received from the console so that the model can process it.
def prepro_wakati(input_text, tokenizer, max_len):
    sentence = []
    input_text = input_text.lower()
    wakati = MeCab.Tagger("-Owakati")
    text_wakati = wakati.parse(input_text).strip()
    sentence.append(text_wakati)
    print(sentence)
    seq = tokenizer.texts_to_sequences(sentence)
    padded = pad_sequences(seq, maxlen=max_len, padding="post", truncating="post")
    print(padded)
    return padded

def chat(model, tokenizer, label_index, max_len):
    print("Start talking with the bot (type quit to stop): ")
    while True:
        input_text = input("You: ")
        if input_text.lower() == "quit":
            break
        x = prepro_wakati(input_text, tokenizer, max_len)
        results = model.predict(x, batch_size=1)
        print("results: ", results)
        results_index = np.argmax(results)
        print("Predicted index: ", results_index)
        # label_index maps Tokenizer indices (starting from 1) back to intent names,
        # so add 1 to the 0-based prediction.
        intent = label_index[results_index + 1]
        print("Type of intent: ", intent)
Console screen. The entered text is passed to the trained model, which predicts which of the nine "Intents" it falls under.
- Types of Intents
Call and execute the functions defined above.
ex.py
import wakatigaki
import model
import chat

# Preprocess the text, build and train the model, then start the chat loop.
padded, one_hot_y, word_index, label_index, tokenizer, max_len, vocab_size = wakatigaki.create_tokenizer()
trained_model = model.model(padded, one_hot_y, vocab_size)
chat.chat(trained_model, tokenizer, label_index, max_len)
The numbers in "results:" are the predicted probabilities for each category.
For the input "how to catch Pokemon" the model predicts "start guide", and for "poke coins and items" it predicts "shop"; in both cases it chose an appropriate category.