In April of this year, **"Deep Learning from Scratch 3: Framework Edition"** was released. I had read volumes **1 and 2** of the **Deep Learning from Scratch** series and learned a lot, so I decided to take on the framework edition this time.
So I bought the book, but before studying it step by step, I decided to write some code first to get a quick grasp of the **overall framework**.
I referred to **DeZero**'s **library** and **examples** on GitHub. While browsing **them**, I wrote some simple natural language processing code on Google Colab, so I will leave it here as a memorandum.
The **Google Colab code** I created is posted on **GitHub**, so if you like, click **this link** (.ipynb) to jump to it.
When doing natural language processing in Japanese, I thought it would be convenient to have something as easy to use as **MNIST** is for image processing, so I made something similar.
The **dataset class** to be created downloads **"I Am a Cat"** from Aozora Bunko, deletes the unnecessary parts, tokenizes the text with **janome**, builds a dictionary and a corpus, and then creates the **time-series data** and the **next-word correct-answer data**.
As preliminary preparation, install the **framework dezero** with `!pip install dezero`, and install the **morphological analysis library janome** with `!pip install janome`.
Name the dataset class **Neko**, inherit from dezero's **Dataset class** as the framework prescribes, write the **processing content** in `def prepare()`, and then write the **functions needed for that processing**.
```python
import numpy as np
import dezero
from dezero.datasets import Dataset
from dezero.utils import get_file, cache_dir
import zipfile
import re
from janome.tokenizer import Tokenizer


class Neko(Dataset):
    def prepare(self):
        url = 'https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip'
        file = get_file(url)
        data = self.unzip(cache_dir + '/' + '789_ruby_5639.zip')
        self.text = self.preprocess(cache_dir + '/' + 'wagahaiwa_nekodearu.txt')
        self.wakati = self.keitaiso(self.text)
        self.corpus, self.word_to_id, self.id_to_word = self.process(self.wakati)
        self.data = np.array(self.corpus[:-1])
        self.label = np.array(self.corpus[1:])

    def unzip(self, file_path):
        with zipfile.ZipFile(file_path) as existing_zip:
            existing_zip.extractall(cache_dir)

    def preprocess(self, file_path):
        binarydata = open(file_path, 'rb').read()
        text = binarydata.decode('shift_jis')
        text = re.split(r'\-{5,}', text)[2]   # delete the header
        text = re.split('底本：', text)[0]     # delete the footer
        text = re.sub('｜', '', text)          # delete ｜ (ruby-base markers)
        text = re.sub('［＃.+?］', '', text)   # delete input notes
        text = re.sub(r'《.+?》', '', text)    # delete ruby
        text = re.sub(r'\u3000', '', text)     # delete full-width spaces
        text = re.sub(r'\r\n', '', text)       # delete line breaks
        text = text[1:]                        # delete the first character (adjustment)
        return text

    def keitaiso(self, text):
        t = Tokenizer()
        # list() so the result stays reusable (newer janome versions return a generator)
        output = list(t.tokenize(text, wakati=True))
        return output

    def process(self, text):
        # build word_to_id and id_to_word
        word_to_id, id_to_word = {}, {}
        for word in text:
            if word not in word_to_id:
                new_id = len(word_to_id)
                word_to_id[word] = new_id
                id_to_word[new_id] = word
        # build the corpus
        corpus = np.array([word_to_id[w] for w in text])
        return corpus, word_to_id, id_to_word
```
The **constructor** (`def __init__()`) of the inherited **Dataset class** calls `self.prepare()`, so `def prepare()` **runs** as soon as the Neko class is **instantiated**.
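For reference, the idea behind this hook can be sketched roughly as follows (a simplified sketch, not DeZero's actual code):

```python
# Simplified sketch of the Dataset-hook idea (not DeZero's actual implementation):
# the base-class constructor calls prepare(), so a subclass that fills in
# self.data and self.label there is ready to use the moment it is instantiated.
class DatasetSketch:
    def __init__(self):
        self.data = None
        self.label = None
        self.prepare()                 # subclass hook runs at instantiation time

    def prepare(self):
        pass                           # overridden by subclasses such as Neko

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return len(self.data)
```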
`def prepare()` uses `get_file(url)` from the dezero library to download the file from the specified `url` and save it in `cache_dir`. On Google Colab, `cache_dir` is `/root/.dezero`.
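As a quick check, you can confirm where the file lands (the paths shown are what you would expect on Colab; the exact output may differ in other environments):

```python
from dezero.utils import get_file, cache_dir

# get_file() downloads the file only if it is not already cached and
# returns the local path of the cached copy.
url = 'https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip'
file_path = get_file(url)
print(cache_dir)   # e.g. /root/.dezero
print(file_path)   # e.g. /root/.dezero/789_ruby_5639.zip
```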
After that, four functions are called in sequence to perform the processing. Finally, the **corpus** is assigned to `self.data` (the time-series data) and `self.label` (the next-word correct-answer data), as the framework expects.
The variables `text`, `wakati`, `corpus`, `word_to_id`, and `id_to_word` are each given a `self.` prefix so that they can be accessed as **attributes** once the Neko class is **instantiated**.
`def unzip()` is a function that unzips the downloaded **zip file**. `def preprocess()` is a function that reads the decompressed file and returns the text with **unnecessary parts such as ruby annotations and line breaks** removed. `def keitaiso()` is a function that morphologically analyzes the text and returns the **word segmentation**. `def process()` is a function that creates the **dictionaries** and the **corpus** from the word segmentation.
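To make the cleanup rules in `preprocess()` concrete, here is a toy example on a made-up snippet written in Aozora Bunko notation (ruby in 《 》, notes in ［＃ ］, ｜ markers); the snippet itself is invented for illustration:

```python
import re

sample = '｜吾輩《わがはい》は猫である。［＃ここに注記］\u3000名前はまだ無い。\r\n'
sample = re.sub('｜', '', sample)          # drop ruby-base markers
sample = re.sub('［＃.+?］', '', sample)   # drop input notes
sample = re.sub(r'《.+?》', '', sample)    # drop ruby readings
sample = re.sub(r'\u3000', '', sample)     # drop full-width spaces
sample = re.sub(r'\r\n', '', sample)       # drop line breaks
print(sample)  # -> 吾輩は猫である。名前はまだ無い。
```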
Let's actually run it.
**Instantiate** the Neko class with `neko = Neko()` to download the file and **start the processing**. It takes a few tens of seconds to complete because the janome tokenization takes some time. Once it's done, let's use it right away.
You can display the **text** with `neko.text`, the **word segmentation** with `neko.wakati`, and the **corpus** with `neko.corpus`. The text is one solid block of characters, the word segmentation is a word-by-word list, and the corpus is the segmentation converted to ID numbers, with IDs assigned in order of first appearance (no duplicates). By the way, let's take a look at the dictionaries.
`neko.word_to_id` is a dictionary that **converts words to numbers**, and `neko.id_to_word` is a dictionary that **converts numbers to words**. Let's look at the training data.
You can see that `neko.data` and `neko.label` are offset by one. Finally, let's look at the length of the data and the number of words in the dictionary.
The **data length** is 205,815 and the dictionary size **vocab_size** is 13,616.
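A minimal way to inspect all of this in a Colab cell might look like the following (the lengths printed are those reported above; the exact tokens and IDs depend on the tokenizer output):

```python
neko = Neko()                                # download + preprocessing runs here

print(neko.text[:15])                        # beginning of the cleaned text
print(neko.wakati[:5])                       # first few tokens
print(neko.corpus[:5])                       # first few word IDs
print(neko.word_to_id['猫'])                 # word -> ID
print(neko.id_to_word[0])                    # ID -> word
print(neko.data[:5], neko.label[:5])         # label is data shifted by one
print(len(neko.data), len(neko.word_to_id))  # 205815 13616
```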
Now, let's write the main code.
```python
import numpy as np
import dezero
from dezero import Model
from dezero import SeqDataLoader
import dezero.functions as F
import dezero.layers as L
import random
from dezero import cuda
import textwrap

max_epoch = 70
batch_size = 30
vocab_size = len(neko.word_to_id)
wordvec_size = 650
hidden_size = 650
bptt_length = 30


class Lstm_nlp(Model):
    def __init__(self, vocab_size, wordvec_size, hidden_size, out_size):
        super().__init__()
        self.embed = L.EmbedID(vocab_size, wordvec_size)
        self.rnn = L.LSTM(hidden_size)
        self.fc = L.Linear(out_size)

    def reset_state(self):  # reset the LSTM state
        self.rnn.reset_state()

    def __call__(self, x):  # describe how the layers are connected
        y = self.embed(x)
        y = self.rnn(y)
        y = self.fc(y)
        return y
```
The model has a simple structure of **embedding layer + LSTM layer + linear layer**. The input to EmbedID is the word's ID **number (an integer)**.
The size of the EmbedID word-embedding matrix is **vocab_size x wordvec_size**, i.e. 13616 x 650. The LSTM `hidden_size` is 650, the same as `wordvec_size`, and the output size of Linear, `out_size`, is 13616, the same as `vocab_size`.
Describe **how the layers are connected** in `def __call__()`. What is written here can be invoked by calling the created instance like a function with arguments. For example, if you instantiate with `model = Lstm_nlp(....)`, you can run the `def __call__()` part with `y = model(x)`. In other words, this gives you the so-called predict step. This is elegant.
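As a quick illustration of this callable-instance pattern, a shape check might look like the following (a sketch only; `demo_model` and the word IDs are arbitrary names and values for illustration):

```python
# Hypothetical shape check: feed a few word IDs through the model and look
# at the score vector it returns for the next word.
demo_model = Lstm_nlp(vocab_size, wordvec_size, hidden_size, vocab_size)

x_demo = np.array([1, 5, 12])   # three arbitrary word IDs
y_demo = demo_model(x_demo)     # runs __call__: embed -> LSTM -> Linear
print(y_demo.shape)             # (3, vocab_size): one score vector per input ID
```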
```python
model = Lstm_nlp(vocab_size, wordvec_size, hidden_size, vocab_size)  # create the model
dataloader = SeqDataLoader(neko, batch_size=batch_size)  # create the data loader
seqlen = len(neko)
optimizer = dezero.optimizers.Adam().setup(model)  # Adam as the optimizer

# if a GPU is available, move the data loader and the model to it
if dezero.cuda.gpu_enable:
    dataloader.to_gpu()  # data loader to GPU
    model.to_gpu()       # model to GPU
```
The data loader uses `SeqDataLoader`, which is designed for time-series data. Since shuffling would destroy the order of the time-series data, it instead divides the sequence at regular intervals and extracts multiple samples in parallel.
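Conceptually, serving ordered data in parallel streams can be sketched like this (a rough illustration of the idea, not DeZero's actual `SeqDataLoader` implementation):

```python
import numpy as np

# Split the corpus into batch_size streams at equal offsets and step through
# them in parallel, so each stream preserves the original word order.
def seq_batches(data, label, batch_size):
    jump = len(data) // batch_size                    # offset between streams
    offsets = [i * jump for i in range(batch_size)]   # one start point per stream
    for step in range(jump):
        x = np.array([data[(o + step) % len(data)] for o in offsets])
        t = np.array([label[(o + step) % len(data)] for o in offsets])
        yield x, t
```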
If a GPU is available, `if dezero.cuda.gpu_enable:` evaluates to True, in which case the data loader and the model are sent to the GPU.
```python
# training loop
for epoch in range(max_epoch):
    model.reset_state()
    loss, count = 0, 0
    for x, t in dataloader:
        y = model(x)  # forward propagation
        # y holds the scores of the next word (a vocab_size-dimensional vector);
        # softmax is applied and the loss is computed against the correct answer.
        # The target t is the integer index at which the 1 of the one-hot vector stands.
        loss += F.softmax_cross_entropy_simple(y, t)
        count += 1

        if count % bptt_length == 0 or count == seqlen:
            model.cleargrads()       # clear the gradients
            loss.backward()          # backpropagation
            loss.unchain_backward()  # cut the computational graph here (truncated BPTT)
            optimizer.update()       # update the weights
    avg_loss = float(loss.data) / count
    print('| epoch %d | loss %f' % (epoch + 1, avg_loss))

    # sentence generation
    model.reset_state()            # reset the LSTM state
    with dezero.no_grad():         # no gradient computation during generation
        text = []
        x = random.randint(0, vocab_size - 1)  # pick the first word ID at random
        while len(text) < 100:                 # repeat until 100 words
            x = np.array(int(x))
            y = model(x)                     # scores of the next word (vocab_size-dimensional)
            p = F.softmax_simple(y, axis=0)  # softmax turns the scores into probabilities
            xp = cuda.get_array_module(p)    # numpy on the CPU, cupy on the GPU
            sampled = xp.random.choice(len(p.data), size=1, p=p.data)  # sample an index by probability
            word = neko.id_to_word[int(sampled)]  # convert the ID back to a word
            text.append(word)                     # append the word to the text
            x = sampled                           # feed the sample back in as the next input
        text = ''.join(text)
        print(textwrap.fill(text, 60))            # display with line breaks every 60 characters
```
This is the training loop. **Forward propagation** is done with `y = model(x)`, and the loss is accumulated with `loss += F.softmax_cross_entropy_simple(y, t)`.
Here, y is a **vector** (vocab_size-dimensional) representing the **scores** of the next word; applying softmax to it gives the **appearance probabilities**, and the loss is computed against the **one-hot correct-answer data**. However, the target t is passed as the **integer index** at which the 1 of the one-hot vector stands.
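A toy example of this loss call (the numbers are made up, and a 4-word vocabulary is assumed purely for illustration):

```python
import numpy as np
import dezero.functions as F

# y_toy: score vector for one prediction, t_toy: index of the correct next word
y_toy = np.array([[2.0, 0.5, 0.1, 1.2]])   # scores over a 4-word vocabulary
t_toy = np.array([0])                       # the correct next word is word 0
loss_toy = F.softmax_cross_entropy_simple(y_toy, t_toy)
print(loss_toy)  # small loss, since word 0 already has the highest score
```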
When `if count % bptt_length == 0 or count == seqlen:` holds, that is, when count is an integer multiple of bptt_length or has reached the end of the sequence, backpropagation is performed and the weights are updated (truncated BPTT).
Next, 100 words are generated at the end of each epoch. First, the state is reset with `model.reset_state()` and gradient computation is disabled with `with dezero.no_grad():`. Then, with `x = random.randint(0, vocab_size - 1)`, the first word is chosen as a random integer ID, and the next word is predicted from it. The prediction is fed back in repeatedly to generate a sentence.
`p = F.softmax_simple(y, axis=0)` applies softmax to y to obtain the appearance probability of each candidate next word, and `xp.random.choice()` picks a word at random according to those probabilities.
The reason `xp.random.choice()` starts with **xp** is that the array module has to be **np** (numpy) when running on the CPU and **cp** (cupy) when running on the GPU. `xp = cuda.get_array_module(p)` makes that decision, so that xp = np on the CPU and xp = cp on the GPU.
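For example, on the CPU this dispatch behaves as follows (a small check; the probability values are arbitrary):

```python
import numpy as np
from dezero import cuda

p_cpu = np.array([0.7, 0.2, 0.1])
xp = cuda.get_array_module(p_cpu)   # numpy here; cupy if the array lived on the GPU
print(xp.__name__)                  # 'numpy'
sampled = xp.random.choice(len(p_cpu), size=1, p=p_cpu)
print(sampled)                      # an index drawn according to those probabilities
```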
Now let's run the main code.
When you run the main code, it learns the word order of "I Am a Cat" and generates a sentence at each epoch. It takes about 1 to 2 minutes per epoch. After it has learned for a while, the output looks like this. It is fun to watch the generated sentences gradually become more natural.
My impression from writing code by imitating the examples is very positive: because DeZero is a **simple framework written entirely in Python**, its internals are **easy to understand**, and it offers a lot of freedom while still being easy to write. I would like to take this opportunity to study the internals of the DeZero framework.