3. Natural language processing with Python 1-1. Word N-gram

There are two broad approaches to natural language processing.
Methods that represent words using statistical information are called **count-based**, and methods that use neural networks are called **inference-based**.
As a count-based method, consider a program that generates sentences from an **N-gram frequency distribution** over sequences of characters or words.
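
To make the idea of an N-gram concrete, here is a minimal sketch that slides a window of length n over a sequence and counts the results. The helper name `ngrams` and the toy data are illustrative assumptions, not part of the program built in the following sections.

```python
from collections import Counter

def ngrams(seq, n):
    """Return all runs of n consecutive items in seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Word 2-grams over a toy sentence (hypothetical data)
words = "the cat sat on the cat mat".split()
word_bigrams = Counter(ngrams(words, 2))
print(word_bigrams[("the", "cat")])

# Character 3-grams work the same way, because a string is also a sequence
char_trigrams = Counter(ngrams("banana", 3))
print(char_trigrams[("a", "n", "a")])
```

The same counting applies whether the items are characters or words; only the input sequence changes.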

⑴ Reading text data

from google.colab import files
uploaded = files.upload()

Select a text file on your local machine and upload it to Colaboratory.

Open the uploaded text file and store it in a variable.
This time, I will use the full text of Natsume Soseki's "I Am a Cat" as a corpus.
https://github.com/yumi-ito/sample_data/blob/master/Neko.txt

with open('Neko.txt', mode='rt', encoding='utf-8') as f:
    read_text = f.read()
nekotxt = read_text

print(nekotxt)

The arguments to `open()` are, from the left, the file name, `mode='rt'` (text mode), and `encoding='utf-8'` (the character encoding). A file opened in a `with` statement is closed automatically once the indented block has finished executing.

⑵ Morphological analysis by MeCab

MeCab is an open-source morphological analysis engine that can be used in two ways.
One is **word segmentation (wakati-gaki)**: it inserts spaces between words, which lets you process Japanese text in the same way as English.
The other is **retrieving information such as the reading, base form, and part of speech of each word**, which you can use, for example, to extract only the nouns.
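
As a rough illustration of the second use, the sketch below extracts nouns from MeCab-style output. To stay runnable without MeCab installed, it parses a hand-written sample in the layout that MeCab's default output uses with the IPA dictionary: the surface form, a tab, then comma-separated features with the part of speech first, ending with an `EOS` line. The sample text and the helper name `extract_nouns` are assumptions for illustration.

```python
# Hand-written sample in MeCab's default output layout (IPA dictionary):
# surface<TAB>part of speech, sub-categories, conjugation, base form, reading, pronunciation
sample = (
    "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "で\t助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ\n"
    "ある\t助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル\n"
    "EOS\n"
)

def extract_nouns(mecab_output):
    """Collect the surface forms whose first feature is 名詞 (noun)."""
    nouns = []
    for line in mecab_output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, features = line.split("\t", 1)
        if features.split(",")[0] == "名詞":
            nouns.append(surface)
    return nouns

print(extract_nouns(sample))
```

With MeCab installed, the same function should also work on the string returned by `parse()` on a `MeCab.Tagger()` created without the `-Owakati` option, provided the dictionary uses this feature layout.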

!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

The commands above install MeCab. Next, import it and convert the full text into space-separated words (wakati-gaki).
Create an instance of the class `MeCab.Tagger()` with the argument `-Owakati`, then call its `parse()` method to get the result as a string.

import MeCab

tagger = MeCab.Tagger("-Owakati")
nekotxt = tagger.parse(nekotxt)

print(nekotxt)

Next, split the string into a list of words. When `split()` is called without arguments, whitespace is used as the delimiter.

nekotxt = nekotxt.split()
print(nekotxt)
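
The difference between calling `split()` with and without an argument matters here. The sketch below uses a short hand-written string shaped like space-separated output with a trailing space and newline; the exact trailing characters are an assumption for illustration.

```python
# Hypothetical sample: words separated by single spaces,
# followed by a trailing space and a newline
wakati = "吾輩 は 猫 で ある 。 \n"

tokens_default = wakati.split()     # splits on any run of whitespace
tokens_space = wakati.split(" ")    # splits on every single space character

print(tokens_default)
print(tokens_space)
```

Without an argument, leading and trailing whitespace is discarded, so no empty or newline-only tokens end up in the word list.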

⑶ Generation of N-gram dictionary

from collections import Counter
import numpy as np
from numpy.random import rand

Now assign the split word sequence `nekotxt` to the variable `string`.
For two words (2-gram), use `zip()` to combine the list running from the first word of `string` to the second-to-last word with the list running from the second word to the last, and call the result `double`.
For 3-gram, combine three lists: from the first word to the third-to-last, from the second word to the second-to-last, and from the third word to the last. Call the result `triple`.
At the same time, use the `filter()` function to remove any combination that contains one of the symbols defined in the variable `delimiter`.

string = nekotxt

#Character symbols to exclude
delimiter = ['「', '」', '…', '　']

#2-word list
double = list(zip(string[:-1], string[1:]))
double = filter((lambda x: not((x[0] in delimiter) or (x[1] in delimiter))), double)

#List of 3 words
triple = list(zip(string[:-2], string[1:-1], string[2:]))
triple = filter((lambda x: not((x[0] in delimiter) or (x[1] in delimiter) or (x[2] in delimiter))), triple)

#Count the number of elements and generate a dictionary
dic2 = Counter(double)
dic3 = Counter(triple)

`double` holds pairs of two consecutive words, and `triple` holds sets of three consecutive words.
Counting how often each element occurs with `Counter()` gives the frequency data, **`dic2` for 2-grams and `dic3` for 3-grams**. These are the **N-gram dictionaries**.
The following code displays the contents of the N-gram dictionaries.

for u,v in dic2.items():
    print(u, v)

for u,v in dic3.items():
    print(u, v)
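
Printing every entry of a large dictionary produces a great deal of output. As one hedged alternative, `Counter.most_common()` lists only the most frequent entries. The toy token list below stands in for `nekotxt` and is an assumption for illustration.

```python
from collections import Counter

# Toy token list standing in for the real corpus (hypothetical data)
tokens = ["a", "b", "a", "b", "a", "c"]

# Build a 2-gram dictionary the same way as dic2
toy_dic2 = Counter(zip(tokens[:-1], tokens[1:]))

# Show only the two most frequent 2-grams, highest count first
top = toy_dic2.most_common(2)
for gram, count in top:
    print(gram, count)
```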

⑷ Definition of sentence generation method

Define a function **`nextword`** that generates sentences by producing words one after another based on the N-gram dictionary.
In other words, **the frequency of occurrence of each run of consecutive words is read as the probability of the "next word"**.
Give it a first word, then repeatedly select the next word, favouring the more frequent ones, and stop generating when a sentence-ending token such as "。" is reached.

def nextword(words, dic):
    ##➀ Get grams, the number of elements in words
    grams = len(words)

    ##➁ Extract the matching elements from the N-gram dictionary dic
    #(kept as a plain list, because each item is a ragged ((w1, w2, ...), count) pair)
    #For 2 words
    if grams == 2:
        matcheditems = list(filter(
            (lambda x: x[0][0] == words[1]), #1st word matches
            dic.items()))
    #For 3 words
    else:
        matcheditems = list(filter(
            (lambda x: x[0][0] == words[1] and x[0][1] == words[2]), #1st and 2nd words match
            dic.items()))

    ##➂ Error message when there is no matching word
    if len(matcheditems) == 0:
        print("No matched generator for", words[1])
        return ''

    ##➃ Weighted appearance frequency list
    #Get the frequency of occurrence of each matched item
    probs = [row[1] for row in matcheditems]
    #Generate pseudo-random numbers in [0, 1) and multiply them by the frequencies
    weightlist = rand(len(matcheditems)) * probs

    ##➄ Get the element with the highest weighted frequency from matcheditems
    if grams == 2:
        u = matcheditems[np.argmax(weightlist)][0][1]
    else:
        u = matcheditems[np.argmax(weightlist)][0][2]
    return u

**➀** The first argument `words` holds **the word (or words) you choose as the beginning of the sentence**. The second argument `dic` is `dic2` or `dic3`, selected according to whether `words` has 2 or 3 elements.
**➁** For 2 words, the entries whose first word matches are **extracted from the N-gram dictionary**; for 3 words, the entries whose first and second words both match. The result is stored in the variable `matcheditems`.
**➂** If there is no matching entry, an error message is printed and an empty string is returned.
**➃** A list of **"weighted frequencies"** is generated by multiplying each frequency of occurrence by a pseudo-random number. If you simply took the element with the highest frequency, the result would always be the same and not very interesting, so noise is added to vary it.
**➄** The element **with the highest "weighted frequency"** is taken from `matcheditems`, and its last word is returned.
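
The noise-based selection in ➃ and ➄ favours frequent words, but the chance of picking a word is not exactly proportional to its count. For comparison only, here is a sketch of a different technique: sampling the next word directly in proportion to the counts with the standard library's `random.choices`. The function name `sample_next`, the toy dictionary, and the fixed seed are assumptions, and this is not the method used by `nextword`.

```python
import random
from collections import Counter

def sample_next(prev_word, dic, rng):
    """Sample the word following prev_word in proportion to its 2-gram count."""
    matched = [(gram[1], count) for gram, count in dic.items() if gram[0] == prev_word]
    if not matched:
        return ''  # same convention as nextword: empty string when nothing matches
    candidates, counts = zip(*matched)
    # The counts act as unnormalised probabilities of the "next word"
    return rng.choices(candidates, weights=counts, k=1)[0]

# Toy 2-gram dictionary (hypothetical counts)
toy_dic2 = Counter({("猫", "は"): 3, ("猫", "が"): 1, ("犬", "は"): 2})

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
samples = Counter(sample_next("猫", toy_dic2, rng) for _ in range(1000))
print(samples)
```

With these toy counts, "は" should be drawn roughly three times as often as "が" after "猫".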

⑸ Execution of sentence generation program

Here we use 2 words (2-gram) and enter "吾輩" ("I", the opening word of the novel) as the first word.
When printing the final output, passing the optional argument `end=''` to `print()` suppresses the line break that would otherwise follow each word, so the words are displayed as one continuous string.

#Enter the first word(s) in words
words = ['', '吾輩'] # 2-gram
#words = ['', '吾輩', 'は'] # 3-gram

#Put the first word(s) at the beginning of output
output = words[1:]

#Get the "next word" repeatedly
for i in range(100):
    #For 2 words
    if len(words) == 2:
        newword = nextword(words, dic2)
    #For 3 words
    else:
        newword = nextword(words, dic3)

    #Add the next word to output
    output.append(newword)
    #Stop if the next word ends the sentence (or nothing matched)
    if newword in ['', '。', '？', '！']:
        break
    #Prepare words for the next iteration
    words = output[-len(words):]
    print(words)

#Display output
for u in output:
    print(u, end='')

The output shows the process of extracting elements one after another from the 2-gram dictionary.

As this shows, unlike inference-based methods, the N-gram approach is very simple.
On a personal note, in 2008 (Heisei 20) I came across a very interesting research report that used N-grams.
Using the Kokin Wakashū, a collection of Japanese poems from the early Heian period, as a corpus, the study extracted expressions characteristic of men and of women with N-grams, and on that basis examined the waka poems composed by the characters in The Tale of Genji.
It is not uncommon for women to use men's language even in modern times, but the study found that the characters were skillfully portrayed through their use of men's and women's language. For example, a man who lacks male-specific expressions and so does not come across as masculine, or a woman who goes beyond male-centered social norms. I find it interesting that these are nuances only readers of that time could fully appreciate.
In short, I think N-grams still have a lot of potential, depending on what you choose as the corpus and what you set out to analyze with it.