-- Summer of my third year as an undergraduate --
Me: "I'm a third-year student, so I'd better look for an internship..."
Entry sheet (ES): "**What deliverables do you have?**"
Me: "Uh..."
ES: "**Qiita? GitHub?**"
Me: "No..."
"**Rejected**"
Me, now a master's student: "I've just finished my undergraduate thesis, so this time I want something I can actually show..."
I'd like to implement something I'm interested in and already know a bit about. My field is natural language processing, and my undergraduate thesis involved generating text with deep learning, so as PyTorch practice I'll build an automatic text generator.
The generator is a stack of an Embedding layer, LSTM layers, and a linear layer. Hyperparameters such as the number of LSTM layers can be specified from the command line.
The dataset used for training is "SNOW T15: Easy Japanese Corpus" from the Natural Language Processing Laboratory at Nagaoka University of Technology.
Nagaoka University of Technology Natural Language Processing Laboratory http://www.jnlp.org/SNOW/T15
It is a very handy parallel corpus of 50,000 sentences pairing ordinary Japanese with English and easy Japanese. Since it is distributed in xlsx format, I convert it to csv beforehand. Also, because the task here is plain text generation, the easy Japanese and English columns are not used.
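For reference, the conversion can be done with pandas; the following is only a sketch, and the xlsx file name is an assumption (the csv path matches what loader.py below expects):

```python
# Sketch: convert the distributed xlsx to csv.
# The xlsx file name is a placeholder; requires pandas and openpyxl.
import pandas as pd

df = pd.read_excel("./data/T15.xlsx")
df.to_csv("./data/parallel_data.csv", index=False, encoding="utf-8")
```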
Natural language processing starts with morphological analysis, which splits a raw sentence into morphemes. A frequent problem at this stage is OOV (out-of-vocabulary) words: the output dimension of the network depends on the size of the vocabulary it can emit, and the larger that vocabulary, the more memory it needs. A common workaround is to replace low-frequency words in the corpus with UNK (unknown), among other tricks.
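As a rough illustration of that frequency-cutoff idea (not what I end up doing here), it might look like this:

```python
# Illustration only: replace words below a frequency threshold with <unk>.
from collections import Counter

def build_vocab(tokenized_sentences, min_count=2):
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(sentence, vocab, unk="<unk>"):
    return [w if w in vocab else unk for w in sentence]

corpus = [["今日", "は", "晴れ"], ["今日", "は", "雨"]]
vocab = build_vocab(corpus, min_count=2)
print(replace_oov(["明日", "は", "晴れ"], vocab))  # ['<unk>', 'は', '<unk>']
```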
Here, I use SentencePiece instead.
https://github.com/google/sentencepiece
SentencePiece is a very convenient tool that, through unsupervised learning, segments text so that it fits within a specified vocabulary size with no OOV. See the linked repository for the detailed specification. I use it to tokenize the dataset with a vocabulary of 8,000 tokens and no OOV.
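For reference, training such a model looks roughly like this; the raw-text file name is an assumption, and the model prefix index matches the index.model used later:

```python
# Sketch: train an 8000-token SentencePiece model on one sentence per line of raw text.
# The input file name is a placeholder.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=./data/japanese_sentences.txt --model_prefix=index --vocab_size=8000"
)

sp = spm.SentencePieceProcessor()
sp.load("index.model")
print(sp.EncodeAsIds("今日はいい天気です。"))  # subword ids, no OOV
```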
Well, it's an ordinary LSTM, so there isn't much to explain. I'd be delighted if you could point out any problems.
LSTM.py
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTM(nn.Module):
    def __init__(self, source_size, hidden_size, batch_size, embedding_size, num_layers):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.source_size = source_size
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.embed_source = nn.Embedding(source_size, embedding_size, padding_idx=0)
        self.embed_source.weight.data.normal_(0, 1 / embedding_size**0.5)
        # The LSTM consumes the embedding output, so its input size is embedding_size.
        self.lstm_source = nn.LSTM(embedding_size, self.hidden_size, num_layers=self.num_layers,
                                   bidirectional=True, batch_first=True)
        # Bidirectional LSTM, so its output is 2*hidden_size wide.
        self.linear = nn.Linear(self.hidden_size*2, self.source_size)

    def forward(self, sentence_words, hx, cx):
        source_k = self.embed_source(sentence_words)
        self.lstm_source.flatten_parameters()
        encoder_output, (hx, cx) = self.lstm_source(source_k, (hx, cx))
        # Log-probabilities over the vocabulary dimension (the last one); greedy pick below.
        prob = F.log_softmax(self.linear(encoder_output), dim=-1)
        _, output = torch.max(prob, dim=-1)
        return prob, output, (hx, cx)

    def init_hidden(self, bc):
        # num_layers*2 states because the LSTM is bidirectional.
        hx = torch.zeros(self.num_layers*2, bc, self.hidden_size)
        cx = torch.zeros(self.num_layers*2, bc, self.hidden_size)
        return hx, cx
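As a quick sanity check of the shapes, something like the following should work (the sizes here are arbitrary assumptions, chosen only for illustration):

```python
# Shape check with arbitrary sizes: vocabulary 8000, hidden 128, embedding 128, 1 layer.
import torch
from model.LSTM import LSTM

model = LSTM(source_size=8000, hidden_size=128, batch_size=4, embedding_size=128, num_layers=1)
ids = torch.randint(0, 8000, (4, 19))   # a batch of 4 sequences of 19 token ids
hx, cx = model.init_hidden(4)
log_probs, greedy_ids, _ = model(ids, hx, cx)
print(log_probs.shape)   # torch.Size([4, 19, 8000])
print(greedy_ids.shape)  # torch.Size([4, 19])
```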
It's normal.
Next, the training code and the dataset loader. By the way, index.model is the tokenization model built with SentencePiece. There is no validation set; I go straight from training to testing. During training, the model learns to take a Japanese sentence as input and output exactly the same sentence. At test time only the first token of each test sentence is fed in, and the rest is generated step by step with greedy decoding. In theory, that should generate text automatically...
train.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import argparse
from loader import Dataset
import torch.optim as optim
import sentencepiece as spm
from utils import seq_to_string, to_np, trim_seqs
import matplotlib.pyplot as plt
from torchviz import make_dot
from model.LSTM import LSTM


def make_model(source_size, hidden_size, batch_size, embedding_size=256, num_layers=1):
    model = LSTM(source_size, hidden_size, batch_size, embedding_size, num_layers)
    # The model emits log-probabilities, so NLLLoss is the matching criterion.
    criterion = nn.NLLLoss(reduction="sum")
    model_opt = optim.Adam(model.parameters(), lr=0.0001)
    return model, criterion, model_opt


def data_load(maxlen, source_size, batch_size):
    data_set = Dataset(maxlen=maxlen)
    data_num = len(data_set)
    # 80/20 train/test split; any rounding remainder goes to the training set.
    train_ratio = int(data_num*0.8)
    test_ratio = int(data_num*0.2)
    res = int(data_num - (train_ratio + test_ratio))
    train_ratio += res
    ratio = [train_ratio, test_ratio]
    train_dataset, test_dataset = torch.utils.data.random_split(data_set, ratio)
    dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
    del train_dataset
    del data_set
    return dataloader, test_dataloader


def run_epoch(data_iter, model, criterion, model_opt, epoch):
    model, criterion = model.cuda(), criterion.cuda()
    model.train()
    total_loss = 0
    for i, data in enumerate(data_iter):
        model_opt.zero_grad()
        # Shift by one position: predict token t+1 from tokens up to t.
        src = data[:, :-1]
        trg = data[:, 1:]
        src, trg = src.cuda(), trg.cuda()
        hx, cx = model.init_hidden(src.size(0))
        hx, cx = hx.cuda(), cx.cuda()
        output_log_probs, output_seqs, _ = model(src, hx, cx)
        flattened_log_probs = output_log_probs.view(src.size(0) * src.size(1), -1)
        loss = criterion(flattened_log_probs, trg.contiguous().view(-1))
        loss /= (src.size(0) * src.size(1))
        loss.backward()
        model_opt.step()
        # Accumulate a Python float so the computation graph is not kept alive.
        total_loss += loss.item()
        if i % 50 == 1:
            print("Step: %d Loss: %f " % (i, loss))
    mean_loss = total_loss / len(data_iter)
    torch.save({
        'model': model.state_dict()
    }, "./model_log/model.pt")
    # Dump the computation graph of the last batch to image.png.
    dot = make_dot(output_log_probs, params=dict(model.named_parameters()))
    dot.format = 'png'
    dot.render('image')
    return model, mean_loss


def depict_graph(mean_losses, epochs):
    epoch = [i+1 for i in range(epochs)]
    plt.xlim(0, epochs)
    plt.ylim(1, mean_losses[0])
    plt.plot(epoch, mean_losses)
    plt.title("loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.show()


def test(model, data_loader):
    model.eval()
    all_output_seqs = []
    all_target_seqs = []
    with torch.no_grad():
        for data in data_loader:
            src = data[:, :-1].cuda()
            # Keep the reference sentences so true.txt is not empty.
            all_target_seqs.extend(trim_seqs(src))
            del data
            # Feed only <s> plus the first real token of each sentence ...
            input_data = src[:, :2]
            hx, cx = model.init_hidden(input_data.size(0))
            for i in range(18):
                hx, cx = hx.cuda(), cx.cuda()
                output_log_probs, output_seqs, hidden = model(input_data, hx, cx)
                hx, cx = hidden[0], hidden[1]
                # ... and append the greedily chosen token at each step, up to maxlen tokens.
                input_data = torch.cat((input_data, output_seqs[:, -1:]), 1)
            all_output_seqs.extend(trim_seqs(input_data))
    out_set = (all_target_seqs, all_output_seqs)
    return out_set


if __name__ == "__main__":
    sp = spm.SentencePieceProcessor()
    sp.load("./index.model")
    source_size = sp.GetPieceSize()
    parser = argparse.ArgumentParser(description='Parse training parameters')
    parser.add_argument('--do_train', type=str, default='False')
    parser.add_argument('--batch_size', type=int, default=256)
    parser.add_argument('--maxlen', type=int, default=20)
    parser.add_argument('--epochs', type=int, default=50)
    parser.add_argument('--hidden_size', type=int, default=128)
    parser.add_argument('--embedding_size', type=int, default=128)
    parser.add_argument('--num_layers', type=int, default=1)
    args = parser.parse_args()
    model, criterion, model_opt = make_model(source_size, args.hidden_size, args.batch_size, args.embedding_size, args.num_layers)
    data_iter, test_data_iter = data_load(args.maxlen, source_size, args.batch_size)
    mean_losses = []
    if args.do_train == "True":
        for epoch in range(args.epochs):
            print(epoch+1)
            model, mean_loss = run_epoch(data_iter, model, criterion, model_opt, epoch)
            mean_losses.append(mean_loss)
        depict_graph(mean_losses, args.epochs)
    else:
        model.load_state_dict(torch.load("./model_log/model.pt")["model"])
        model = model.cuda()  # test() runs on the GPU
    # Evaluate on the held-out split.
    out_set = test(model, test_data_iter)
    true_txt = out_set[0]
    out_txt = out_set[1]
    with open("true.txt", "w", encoding="utf-8") as f:
        for i in true_txt:
            for j in i:
                f.write(sp.IdToPiece(int(j)))
            f.write("\n")
    with open("out.txt", "w", encoding="utf-8") as f:
        for i in out_txt:
            for j in i:
                f.write(sp.IdToPiece(int(j)))
            f.write("\n")
loader.py
import torch
import numpy as np
import csv
import sentencepiece as spm


class Dataset(torch.utils.data.Dataset):
    def __init__(self, maxlen):
        self.sp = spm.SentencePieceProcessor()
        self.sp.load("./index.model")
        self.maxlen = maxlen

        with open('./data/parallel_data.csv', mode='r', newline='', encoding='utf-8') as f:
            csv_file = csv.reader(f)
            read_data = [row for row in csv_file]
        self.data_num = len(read_data) - 1  # the first row is the csv header
        jp_data = []
        for i in range(1, self.data_num + 1):  # skip the header, keep every data row
            jp_data.append(read_data[i][1:2])  # column 1 holds the ordinary (not easy) Japanese sentence

        # Each row holds maxlen+1 ids: <s>, the subword ids, </s>, then padding.
        self.en_data_idx = np.zeros((len(jp_data), maxlen+1))
        for i, sentence in enumerate(jp_data):
            self.en_data_idx[i][0] = self.sp.PieceToId("<s>")
            for j, idx in enumerate(self.sp.EncodeAsIds(sentence[0])[:]):
                self.en_data_idx[i][j+1] = idx
                if j+1 == maxlen-1:  # no room left, close the sentence here
                    self.en_data_idx[i][j+1] = self.sp.PieceToId("</s>")
                    break
            if j+2 <= maxlen-1:
                self.en_data_idx[i][j+2] = self.sp.PieceToId("</s>")
                if j+3 < maxlen-1:
                    # Out of laziness, the <unk> learned by SentencePiece doubles as padding.
                    self.en_data_idx[i][j+3:] = self.sp.PieceToId("<unk>")
            else:
                self.en_data_idx[i][j+1] = self.sp.PieceToId("</s>")
                if j+2 < maxlen-1:
                    self.en_data_idx[i][j+2:] = self.sp.PieceToId("<unk>")

    def __len__(self):
        return len(self.en_data_idx)

    def __getitem__(self, idx):
        en_data = torch.tensor(self.en_data_idx[idx][:], dtype=torch.long)
        return en_data
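To see what the loader yields, a quick sketch like this can help (it assumes index.model and data/parallel_data.csv are already in place, as described above):

```python
# Sketch: inspect one batch from the Dataset defined above.
import torch
from loader import Dataset

data_set = Dataset(maxlen=20)
loader = torch.utils.data.DataLoader(data_set, batch_size=4, shuffle=True)
batch = next(iter(loader))
print(batch.shape)    # torch.Size([4, 21]) -> maxlen+1 token ids per sentence
print(batch[0][:5])   # <s> followed by the first subword ids of one sentence
```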
For now, I train for about 100 epochs. I wasn't sure how small the number of layers and the other hyperparameters needed to be to avoid overfitting, so I simply set them so they wouldn't be too large.
This is the loss during training.
It falls steadily, but progress becomes marginal partway through. It might be worth doing something about the hyperparameters.
Next, an example of the text that was actually generated.
(^ω^)... It's no good...
For one thing, the model definition itself may be a poor fit. Would seq2seq be more suitable? I'll try it if I get the chance. In any case, I had never trained a bare LSTM end to end before (aside from the embedding layer), so thinking about what task to tackle was quite fun. My master's research is mostly built on Transformers, so I expect to post about those implementations and papers from time to time.