from google.colab import files
uploaded = files.upload()
with open('Neko.txt', mode='rt', encoding='utf-8') as f:
    read_text = f.read()
nekotxt = read_text
print(nekotxt)
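The file above is read inside a `with` block. As a minimal sketch of that behavior, the following writes a small UTF-8 file and reads it back, then checks that the file object is closed once the block ends. The file name `sample.txt` and its content are assumptions made up for this illustration.

```python
import os
import tempfile

# Write a small UTF-8 text file to read back (hypothetical sample content).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, mode="wt", encoding="utf-8") as f:
    f.write("吾輩は猫である。")

# Read it in text mode; the file is closed when the with block ends.
with open(path, mode="rt", encoding="utf-8") as f:
    sample_text = f.read()

print(sample_text)
print(f.closed)
```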
The arguments to `open()` are, from the left, the file name, `mode='rt'` (text mode), and `encoding='utf-8'` (the character encoding). A file opened with a `with` statement is closed automatically once the indented block has finished running.

!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
Word segmentation is done by creating `MeCab.Tagger()` with the argument `-Owakati` and then calling its `parse()` method.

import MeCab
tagger = MeCab.Tagger("-Owakati")
nekotxt = tagger.parse(nekotxt)
print(nekotxt)
Convert the segmented text into a list of words with `split()`. When it is called without arguments, whitespace is the delimiter.

nekotxt = nekotxt.split()
print(nekotxt)
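To see what `split()` does here without installing MeCab, the sketch below uses a hand-written string shaped like space-separated segmented text. The sample string, including its trailing space and newline, is an assumption and not actual MeCab output.

```python
# Hand-written stand-in for segmented text: tokens separated by single spaces,
# followed by a trailing space and newline (an assumed shape).
wakati = "吾輩 は 猫 で ある 。 \n"

# split() with no arguments splits on any whitespace and drops empty strings.
tokens = wakati.split()
print(tokens)
```

Because `split()` without arguments ignores leading and trailing whitespace, no empty string ends up in the list.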
from collections import Counter
import numpy as np
from numpy.random import *
First, copy `nekotxt` into the variable `string`.

Combine `string` from the first word to the one before the last with the list from the second word to the end using `zip()`, and call the result `double`.

In the same way, combine three lists into one and call the result `triple`: `string` from the first word to the second from the last, the list from the second word to the one before the last, and the list from the third word to the end.

Then use the `filter()` function to remove any element that contains one of the symbols defined in the variable `delimiter`.

string = nekotxt
#Character symbols to exclude
delimiter = ['「', '」', '…', ' ']
#2-word list
double = list(zip(string[:-1], string[1:]))
double = filter((lambda x: not((x[0] in delimiter) or (x[1] in delimiter))), double)
#List of 3 words
triple = list(zip(string[:-2], string[1:-1], string[2:]))
triple = filter((lambda x: not((x[0] in delimiter) or (x[1] in delimiter) or (x[2] in delimiter))), triple)
#Count the number of elements and generate a dictionary
dic2 = Counter(double)
dic3 = Counter(triple)
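The shifted-slice `zip()` idea can be tried on a tiny list. This is only a sketch: `tokens` and `exclude` are made-up stand-ins for `string` and `delimiter`, and a list comprehension is used in place of `filter()`.

```python
from collections import Counter

# Toy token list standing in for the segmented text (an assumption).
tokens = ["a", "b", "a", "b", "c", "。"]
exclude = {"。"}  # symbols to drop, playing the role of `delimiter`

# Pair each token with its successor(s) by zipping shifted slices,
# skipping any n-gram that contains an excluded symbol.
bigrams = [g for g in zip(tokens[:-1], tokens[1:]) if not exclude & set(g)]
trigrams = [g for g in zip(tokens[:-2], tokens[1:-1], tokens[2:]) if not exclude & set(g)]

# Count how often each n-gram occurs.
bigram_counts = Counter(bigrams)
trigram_counts = Counter(trigrams)
print(bigram_counts)
print(trigram_counts)
```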
After filtering, `double` holds pairs of two consecutive words and `triple` holds runs of three consecutive words. `Counter()` counts how many times each element occurs and builds a dictionary: `dic2` is the **frequency data** for 2-grams and `dic3` is the same for 3-grams, that is, the **N-gram dictionary**. Let's look at the contents.

for u, v in dic2.items():
    print(u, v)
for u, v in dic3.items():
    print(u, v)
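Printing every entry can produce a very long listing for a whole book. As one possible alternative, `Counter.most_common()` returns only the most frequent entries. The counts in `sample_dic2` below are invented for illustration.

```python
from collections import Counter

# Hypothetical 2-gram counts standing in for dic2.
sample_dic2 = Counter({("吾輩", "は"): 5, ("は", "猫"): 2, ("猫", "で"): 1})

# Show only the two most frequent entries.
top2 = sample_dic2.most_common(2)
for gram, count in top2:
    print(gram, count)
```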
Next, define the function `nextword`, **which generates a sentence by producing words one after another based on the N-gram dictionary**.

def nextword(words, dic):
    ## ➀ Get the number of elements (grams) of the input words
    grams = len(words)
    ## ➁ Extract the matching entries from the N-gram dictionary dic
    # Each entry is a ((word, ...), count) pair, so a plain list is used
    # For 2 words
    if grams == 2:
        matcheditems = [item for item in dic.items()
                        if item[0][0] == words[1]]  # 1st word matches
    # For 3 words
    else:
        matcheditems = [item for item in dic.items()
                        if item[0][0] == words[1]
                        and item[0][1] == words[2]]  # 1st and 2nd words match
    ## ➂ Error message when there is no matching word
    if len(matcheditems) == 0:
        print("No matched generator for", words[1])
        return ''
    ## ➃ Weighted frequency list
    # Get the frequency of occurrence from the matched items
    probs = [count for _, count in matcheditems]
    # Generate pseudo-random numbers in [0, 1) and multiply them by the frequencies
    weightlist = rand(len(matcheditems)) * probs
    ## ➄ Get the entry with the highest weighted frequency from the matched items
    if grams == 2:
        u = matcheditems[np.argmax(weightlist)][0][1]
    else:
        u = matcheditems[np.argmax(weightlist)][0][2]
    return u
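Steps ➃ and ➄ pick the candidate whose frequency, scaled by a random factor, comes out largest. The sketch below isolates that selection rule using only the standard `random` module instead of NumPy; `pick_weighted`, the candidate words, and the counts are all hypothetical names and values chosen for illustration.

```python
import random

def pick_weighted(candidates, counts, rng):
    """Return the candidate whose count times a random factor in [0, 1) is largest."""
    weights = [rng.random() * c for c in counts]
    best = max(range(len(candidates)), key=weights.__getitem__)
    return candidates[best]

rng = random.Random(0)             # fixed seed so the run is repeatable
candidates = ["猫", "犬", "人間"]   # hypothetical next-word candidates
counts = [8, 1, 1]                 # hypothetical frequencies

# Draw many times to see how often each candidate wins.
picks = [pick_weighted(candidates, counts, rng) for _ in range(1000)]
for word in candidates:
    print(word, picks.count(word))
```

Note that this rule favors frequent words but is not the same as sampling strictly in proportion to frequency. If proportional sampling is wanted, `random.choices(candidates, weights=counts)` is one way to get it.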
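The first argument `words` is **the word (or words) you enter as the beginning of the sentence**. The second argument `dic` (`dic2` or `dic3`) is selected depending on whether `words` has 2 or 3 elements.

The dictionary entries that match the input are extracted into `matcheditems`. A weighted frequency is then computed for each entry in `matcheditems`, and the word with the highest value is returned.

Using a `for` loop with `print(u, end='')` for the final output eliminates the half-width spaces that would otherwise appear between the concatenated strings.

# Enter the first word(s)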
# The start word must be a token that occurs in the text
words = ['', '吾輩']  # 2-gram
#words = ['', '吾輩', 'は']  # 3-gram
# Put the starting words at the beginning of output
output = words[1:]
# Get the "next word"
for i in range(100):
    # For 2 words
    if len(words) == 2:
        newword = nextword(words, dic2)
    # For 3 words
    else:
        newword = nextword(words, dic3)
    # Append the next word to output
    output.append(newword)
    # Stop if the next word ends the sentence
    if newword in ['', '。', '?', '!']:
        break
    # Prepare the input for the next call
    words = output[-len(words):]
    print(words)
# Display output without spaces between the words
for u in output:
    print(u, end='')
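For readers who want to try the whole flow without MeCab or the full text, here is a compact, self-contained sketch of a 2-gram generator. It is a simplified variant, not the code above: `build_bigrams` and `generate` are hypothetical helpers, the token list is hand-segmented, and the next word is drawn with `random.choices`, which samples in proportion to frequency.

```python
import random
from collections import Counter

def build_bigrams(tokens):
    """Count consecutive token pairs."""
    return Counter(zip(tokens[:-1], tokens[1:]))

def generate(start, bigrams, rng, max_words=20, stops=("。",)):
    """Extend `start` one word at a time until a stop token or the length limit."""
    output = [start]
    while len(output) < max_words:
        # Candidate continuations of the current last word, with their counts.
        followers = [(pair[1], n) for pair, n in bigrams.items() if pair[0] == output[-1]]
        if not followers:
            break
        words, counts = zip(*followers)
        nxt = rng.choices(words, weights=counts, k=1)[0]
        output.append(nxt)
        if nxt in stops:
            break
    return output

# Hand-segmented toy corpus (an assumption, not MeCab output).
tokens = "吾輩 は 猫 で ある 。 名前 は まだ 無い 。".split()
sentence = generate("吾輩", build_bigrams(tokens), random.Random(0))
print("".join(sentence))
```

Joining with `"".join(...)` serves the same purpose as `print(u, end='')`: the words are output without spaces between them.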