Preface

Last time: "Introduction to Markov Chain Chatbot (1) Janome" with Python + Janome. This time, as preparation for the Markov chain, I will use Janome to split sentences into words and implement a simple Markov chain.

What is a Markov chain?

Introduction

A Markov chain is a type of stochastic process in which the possible states are discrete (finite or countable), that is, a discrete-state Markov process. (From Wikipedia)

Imagine a simple sugoroku (a dice-based board game). You roll the die on some square and move forward by the number rolled. Wherever you started, the rolls you have made so far have no bearing on a future event such as "moving four squares ahead." This is the Markov property. You never stop between squares; you always land on one of them. This is what "discrete" means. A Markov process like this is called a simple Markov chain. If, for example, you set a rule such as "advance by the sum of the previous roll and the current roll," it becomes a second-order Markov chain.
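The two sugoroku rules above can be sketched in a few lines of Python. This is only an illustration: the function names `simple_step` and `second_order_step` are made up for this sketch, and the board is assumed to be an endless row of numbered squares.

```python
import random

def simple_step(position, rng):
    """Simple rule: the move depends only on the current square and this roll."""
    roll = rng.randint(1, 6)
    return position + roll, roll

def second_order_step(position, previous_roll, rng):
    """Second rule: advance by the sum of the previous roll and this roll."""
    roll = rng.randint(1, 6)
    return position + previous_roll + roll, roll

rng = random.Random(0)  # seeded only so that repeated runs match

position = 0
for _ in range(3):
    position, roll = simple_step(position, rng)
    print("simple: rolled", roll, "-> square", position)

position, previous_roll = 0, 0
for _ in range(3):
    position, previous_roll = second_order_step(position, previous_roll, rng)
    print("second-order: rolled", previous_roll, "-> square", position)
```

Note that the second function needs one extra argument, the previous roll. That extra piece of remembered history is what separates the two rules.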

Application to sentence generation

In sentence generation, an Nth-order Markov chain is used. Consider the following sentence as an example.

マルコフ過程とは、マルコフ性をもつ確率過程のことをいう。未来の挙動が現在の値だけで決定され、過去の挙動と無関係であるという性質を持つ確率過程である。
(A Markov process is a stochastic process with the Markov property. It is a stochastic process whose future behavior is determined only by the current value and has nothing to do with past behavior.)

Splitting this into words with Janome gives the following.

マルコフ|過程|と|は|、|マルコフ|性|を|もつ|確率|過程|の|こと|を|いう|。|未来|の|挙動|が|現在|の|値|だけ|で|決定|さ|れ|、|過去|の|挙動|と|無関係|で|ある|という|性質|を|持つ|確率|過程|で|ある|。

First, choose a starting point arbitrarily; let's make it "マルコフ" (Markov). The next block is determined by the current state, "マルコフ". The blocks that follow "マルコフ" in the text are "過程" (process) and "性" (property), so choose one of them; let's take "過程". With a simple Markov chain, the next block is then chosen from "と", "の", and "で", the blocks that follow "過程". With a second-order Markov chain, the only candidate is "と", the block that follows the two blocks "マルコフ" and "過程". This time, let's implement the simple Markov chain, N = 1.
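The difference between the two lookups can be checked with a small sketch. The token list is hard-coded from the Janome output shown earlier so that the example runs without Janome, and the helper name `followers` is an assumption made for this illustration, not part of the article's code.

```python
import random

# Tokens of the example sentence, hard-coded so the sketch runs without Janome.
tokens = ("マルコフ|過程|と|は|、|マルコフ|性|を|もつ|確率|過程|の|こと|を|いう|。|"
          "未来|の|挙動|が|現在|の|値|だけ|で|決定|さ|れ|、|過去|の|挙動|と|無関係|で|"
          "ある|という|性質|を|持つ|確率|過程|で|ある|。").split("|")

def followers(tokens, state):
    """Return every token that directly follows the tuple `state` in `tokens`."""
    n = len(state)
    return [tokens[i + n] for i in range(len(tokens) - n)
            if tuple(tokens[i:i + n]) == state]

# N = 1: the state is the single block "過程".
first_order = followers(tokens, ("過程",))
# N = 2: the state is the pair of blocks "マルコフ", "過程".
second_order = followers(tokens, ("マルコフ", "過程"))

print(first_order)
print(second_order)
print(random.choice(first_order))  # one possible next block for N = 1
```

With N = 1 there are three candidates after "過程", while with N = 2 the longer state leaves only one, which is why higher orders stay closer to the source text.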

Implementation (simple Markov process)

from janome.tokenizer import Tokenizer
import random

t = Tokenizer()

# Example text (Japanese: Janome is a Japanese tokenizer, and the code below looks for "。")
s = "マルコフ過程とは、マルコフ性をもつ確率過程のことをいう。\
未来の挙動が現在の値だけで決定され、過去の挙動と無関係であるという性質を持つ確率過程である。"

# Join the token surfaces into one "|"-delimited string
line = ""

for token in t.tokenize(s):
    line += token.surface
    line += "|"

# Split back into a list; the trailing "|" leaves an empty string, so drop it
word_list = line.split("|")
word_list.pop()

# Map each word to the list of words that follow it
dictionary = {}
queue = ""
for word in word_list:
    if queue != "" and queue != "。":
        if queue not in dictionary:
            dictionary[queue] = []
            dictionary[queue].append(word)
        else:
            dictionary[queue].append(word)
    queue = word

# Build a sentence by repeatedly picking a random follower of the current word
def generator(start):
    sentence = start
    now_word = start
    for i in range(1000):
        if now_word == "。":
            break
        else:
            next_word = random.choice(dictionary[now_word])
            now_word = next_word
            sentence += next_word
    return sentence

for i in range(5):
    print(generator("マルコフ"))

=========== Explanation below ===========

from janome.tokenizer import Tokenizer
import random

t = Tokenizer()

s = "マルコフ過程とは、マルコフ性をもつ確率過程のことをいう。\
未来の挙動が現在の値だけで決定され、過去の挙動と無関係であるという性質を持つ確率過程である。"

line = ""

for token in t.tokenize(s):
    line += token.surface
    line += "|"

word_list = line.split("|")
word_list.pop()

The first part builds a list containing the words of the sentence above. As before, it generates a "|"-delimited string and then splits it on "|". Because the string ends with "|", split leaves an empty string at the end of the list, so it is removed with pop.
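The reason for the pop call is easy to see with a few stand-in words. The list `surfaces` below is an assumption that replaces the `token.surface` values Janome would return, so the sketch runs without Janome.

```python
# Stand-ins for the token.surface values that Janome would return.
surfaces = ["マルコフ", "過程", "と", "は"]

line = ""
for surface in surfaces:
    line += surface
    line += "|"

word_list = line.split("|")
print(word_list)  # the last element is an empty string
word_list.pop()   # drop that trailing empty string
print(word_list)

# Collecting the surfaces directly avoids the empty string altogether.
direct = [surface for surface in surfaces]
print(direct == word_list)
```

With Janome, the direct version would read `[token.surface for token in t.tokenize(s)]`, which uses only the calls that already appear in the code above.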

dictionary = {}
queue = ""
for word in word_list:
    if queue != "" and queue != "。":
        if queue not in dictionary:
            dictionary[queue] = []
            dictionary[queue].append(word)
        else:
            dictionary[queue].append(word)
    queue = word

The middle part builds a dictionary (a dict object) from the list created in the first part. queue holds the current word, and the following word, word, is appended to the entry for queue (a new entry is created if there is none yet). After appending, queue is replaced with word. Nothing is recorded when queue is "。", so the chain does not cross sentence boundaries. The contents of the dictionary look like this:

{'マルコフ': ['過程', '性'], '過程': ['と', 'の', 'で'], 'と': ['は', '無関係'], 'は': ['、'], '、': ['マルコフ', '過去'], '性': ['を'], 'を': ['もつ', 'いう', '持つ'], 'もつ': ['確率'],
'確率': ['過程', '過程'], 'の': ['こと', '挙動', '値', '挙動'], 'こと': ['を'], 'いう': ['。'], '未来': ['の'], '挙動': ['が', 'と'], 'が': ['現在'], '現在': ['の'], '値': ['だけ'],
'だけ': ['で'], 'で': ['決定', 'ある', 'ある'], '決定': ['さ'], 'さ': ['れ'], 'れ': ['、'], '過去': ['の'], '無関係': ['で'], 'ある': ['という', '。'], 'という': ['性質'],
'性質': ['を'], '持つ': ['確率']}
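As a side note, the same table can be built a little more compactly with `collections.defaultdict`, and the repeated entries in each list can be read as transition probabilities. The following is a sketch under the assumption that the token list is hard-coded (so it runs without Janome); the names `table` and `counts` are chosen only for this example.

```python
from collections import Counter, defaultdict

# Tokens of the example sentence, hard-coded so the sketch runs without Janome.
tokens = ("マルコフ|過程|と|は|、|マルコフ|性|を|もつ|確率|過程|の|こと|を|いう|。|"
          "未来|の|挙動|が|現在|の|値|だけ|で|決定|さ|れ|、|過去|の|挙動|と|無関係|で|"
          "ある|という|性質|を|持つ|確率|過程|で|ある|。").split("|")

# Pair each token with the one after it; skip pairs that start at "。"
# so the chain does not cross sentence boundaries.
table = defaultdict(list)
for current, following in zip(tokens, tokens[1:]):
    if current != "。":
        table[current].append(following)

# Relative frequency of each word that follows "の".
counts = Counter(table["の"])
total = sum(counts.values())
for word, count in counts.items():
    print(word, count / total)
```

Because `random.choice` picks uniformly from a list that keeps duplicates, a word that appears twice in a list is twice as likely to be chosen, which is exactly what these frequencies express.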

def generator(start):
    sentence = start
    now_word = start
    for i in range (1000):
        if now_word == "。":
            break
        else:
            next_word = random.choice(dictionary[now_word])
            now_word = next_word
            sentence += next_word
    return sentence

for i in range(5):
    print(generator("マルコフ"))

In the last part, it is finally time to generate sentences. The next word is chosen at random from the words that follow the current word and is appended to the sentence. Generation ends when "。" appears, but there is a chance that "。" is never selected, so the loop is capped at 1000 iterations. Finally, five sentences that start with "マルコフ" (Markov) are printed (shown here in English).

It has nothing to do with Markov property stochastic processes.
It has nothing to do with the Markov property stochastic process.
The behavior of stochastic processes with Markov properties is the present.
Markov property.
It is a stochastic process that has a Markov process.

(They are rough around the edges.) Still, sentences like these were produced.
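One thing to watch for when reusing the generator with other text: `random.choice(dictionary[now_word])` raises a KeyError if the current word has no entry, for instance when the text does not end with "。". A minimal sketch of a more defensive version follows. The function name `generate`, its parameters, and the small hand-written table are all assumptions made for illustration, not the article's code.

```python
import random

# A small hand-written follower table (hypothetical, for illustration only).
table = {
    "マルコフ": ["過程", "性"],
    "過程": ["で"],
    "性": ["を"],
    "を": ["持つ"],
    "持つ": ["過程"],
    "で": ["ある"],
    "ある": ["。"],
}

def generate(table, start, max_words=1000, rng=random):
    """Walk the table from `start` until "。", a dead end, or `max_words` steps."""
    words = [start]
    current = start
    for _ in range(max_words):
        if current == "。" or current not in table:
            break
        current = rng.choice(table[current])
        words.append(current)
    return "".join(words)

rng = random.Random(42)  # seeded only so that repeated runs match
for _ in range(3):
    print(generate(table, "マルコフ", rng=rng))
```

Passing the table and the random source as arguments also makes the function easier to try out with different dictionaries.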

Plans for next time

The simple Markov chain tends to produce incoherent sentences like those above because it ignores everything before the current word. Also, the dictionary this time was too small, so the generated sentences came out very similar to one another. Next time, I will try to implement an Nth-order Markov chain, which selects the next word based on several preceding words, with a larger dictionary.