A simple explanation of a Markov chain is that the state at the previous point in time determines the state at the next point in time. As a concrete example in text: when you see the word "tummy", it seems likely that "empty" will come next. But "empty" is not the only possible answer; "full" may come instead. So let's express this with probabilities. Suppose the word following "tummy" is "empty" with a 60% chance and "full" with a 40% chance. The probability of moving to each possible next state is called the transition probability. That is all there is to it, but that is a Markov chain. If you want to know more about this topic, please read Basics of Markov Chain and Kolmogorov Equation (Beautiful Story of High School Mathematics).
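To make the idea concrete, here is a minimal sketch (using the made-up 60/40 numbers from the example above) of how a transition probability table could be represented and sampled in Python:
import random

# Hypothetical transition probabilities for the word "tummy"
transitions = {"tummy": {"empty": 0.6, "full": 0.4}}

# Draw the next word in proportion to its transition probability
current = "tummy"
candidates = list(transitions[current])
weights = [transitions[current][w] for w in candidates]
print(random.choices(candidates, weights=weights, k=1)[0])  # "empty" about 60% of the time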
However, that does not mean every sentence can be explained by a Markov chain. Given the same word "tummy", for example, what came earlier matters: after something like "I skipped lunch, so my tummy is", "empty" is the likely continuation, while after "I ate a lot, so my tummy is", "full" is. That is, a sentence depends not only on the immediately preceding word but also on the words before it, in other words on the context. Since this article deals with Markov chains, though, I would like to cover that topic in another article.
The purpose of the program built here is to automatically generate a new report from the data of reports I have written myself. So, first, read the file.
import random
from janome.tokenizer import Tokenizer

# Read the source text; "@" will mark the end of each line
with open("data.csv", "rt", encoding="utf-8-sig") as f:
    text_raws = f.read()
text_raws = text_raws.replace("\n", "@\n").split("\n")
I loaded data.csv. It contains the author's own reports, but publishing them externally feels a little inappropriate, so I will swap in placeholder sentences when posting this to GitHub. The replace() is done right after reading because I wanted to insert "@" as a marker for the end of each sentence.
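For reference, a minimal illustration (with made-up input) of what that replace/split does:
raw = "first report\nsecond report\n"
print(raw.replace("\n", "@\n").split("\n"))
# ['first report@', 'second report@', '']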
text_lists = []
t = Tokenizer()
for text_raw in text_raws:
    text_list = []
    # Split each line into words (morphological analysis)
    tokens = t.tokenize(text_raw, wakati=True)
    for token in tokens:
        text_list.append(token)
    text_lists.append(text_list)
We perform morphological analysis using the Tokenizer. Morphological analysis divides a sentence into words, for example as follows.
"I will post an article on qiita." ↓ ['I', 'will', 'post', 'an', 'article', 'on', 'qiita', '.']
(The actual data is Japanese, so the real tokens are Japanese words and particles.) Also, by default each token carries extra information such as its part of speech, so by setting the parameter wakati=True, only the words themselves are extracted.
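As a quick illustration of the wakati option (janome's tokenize() yields full Token objects by default and plain surface strings with wakati=True):
from janome.tokenizer import Tokenizer

t = Tokenizer()
# Default: Token objects carrying part of speech and other details
for token in t.tokenize("qiitaに記事を投稿する。"):
    print(token)
# wakati=True: just the words as strings
print(list(t.tokenize("qiitaに記事を投稿する。", wakati=True)))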
dic = {}
for text_list in text_lists:
    for i in range(len(text_list) - 1):
        # Collect every word that follows text_list[i]; duplicates are kept
        # so that frequent successors are later chosen more often
        if text_list[i] in dic:
            lists = dic[text_list[i]]
        else:
            lists = []
        lists.append(text_list[i + 1])
        dic[text_list[i]] = lists
Here, the correspondence between each word and the words that can follow it is built in dictionary form, e.g. {"tummy": ["empty", "full"]}.
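The same table could be built a bit more idiomatically with collections.defaultdict; this is just an equivalent sketch, not the code used above:
from collections import defaultdict

dic = defaultdict(list)
for text_list in text_lists:
    for prev, nxt in zip(text_list, text_list[1:]):
        dic[prev].append(nxt)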
word = input("Please enter the first word")
generate = word
# Tokenize the input and start the chain from its last word
word = list(t.tokenize(word, wakati=True))[-1]
limit = 10000
cnt = 0
while cnt < limit:
    try:
        word = random.choice(dic[word])
        if word == "@":  # "@" marks the end of a sentence
            break
    except KeyError:  # the current word never appeared as a predecessor
        break
    cnt += 1
    generate += word
print(generate)
The first word is given as user input. The input is morphologically analyzed, and the Markov chain starts from its last word. The next word is drawn at random from the dictionary; because duplicate entries were kept, each candidate is picked in proportion to how often it occurred, which realizes the transition probabilities. Finally, generation stops when it reaches "@", introduced earlier as the end-of-sentence marker, or when the upper limit is hit, so that it cannot loop forever.
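To see why random.choice over a list with duplicates gives frequency-proportional transitions, here is a tiny sanity check with made-up data:
import random
from collections import Counter

followers = ["empty", "empty", "empty", "full", "full"]  # 60% / 40%
print(Counter(random.choice(followers) for _ in range(10000)))
# roughly Counter({'empty': 6000, 'full': 4000})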
This completes the program. Let's try it out.
Input "Today" Generation "Today there was a delay in aggregating less satisfying procedures and taking advantage of unnatural techniques."
Input "people" Generation "I thought it was necessary to see the flow of technical assistance for specific heavy rain data procedures in the future, which is more likely to be worn by people."
I have no idea what it is saying. As mentioned at the beginning, language is not determined only by the immediately preceding word, so the output ends up with unnatural word-to-word connections. Next time, I would like to improve the model so that it can judge the best next word from more of the past states, using an LSTM or the like. The source code is here.
References:
I want to generate tweets in Python! -Markov chain-
Basics of Markov Chain and Kolmogorov Equation (Beautiful Story of High School Mathematics)