-- Summer of my third year as an undergraduate --
Me: "I'm a third-year student, so I'd better look for an internship..."
Entry sheet (ES): "**What deliverables do you have?**"
Me: "Uh..."
ES: "**Qiita? GitHub?**"
Me: "No..."
"**Rejected**"
Me, now a master's student: "I've just finished my undergraduate thesis, so this time I want something I can actually show..."
I'd like to implement something I'm interested in and already know a bit about. My field is natural language processing, and my undergraduate thesis involved generating text with deep learning, so as PyTorch practice I'll build an automatic text generator.
The generator is a stack of an Embedding layer, LSTM layers, and a linear layer. Hyperparameters such as the number of LSTM layers can be specified from the command line.
The dataset used for training is "SNOW T15: Easy Japanese Corpus" from the Natural Language Processing Laboratory at Nagaoka University of Technology.
Nagaoka University of Technology Natural Language Processing Laboratory http://www.jnlp.org/SNOW/T15
It is a very handy parallel corpus of 50,000 sentences pairing ordinary Japanese with English and easy Japanese. Since it is distributed in xlsx format, I convert it to csv beforehand. Also, because the task here is plain text generation, the easy Japanese and English columns are not used.
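For reference, the conversion can be done with pandas; the following is only a sketch, and the xlsx file name is an assumption (the csv path matches what loader.py below expects):

```python
# Sketch: convert the distributed xlsx to csv.
# The xlsx file name is a placeholder; requires pandas and openpyxl.
import pandas as pd

df = pd.read_excel("./data/T15.xlsx")
df.to_csv("./data/parallel_data.csv", index=False, encoding="utf-8")
```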
Natural language processing starts with morphological analysis, which splits a raw sentence into morphemes. A frequent problem at this stage is OOV (out-of-vocabulary) words: the output dimension of the network depends on the size of the vocabulary it can emit, and the larger that vocabulary, the more memory it needs. A common workaround is to replace low-frequency words in the corpus with UNK (unknown), among other tricks.
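As a rough illustration of that frequency-cutoff idea (not what I end up doing here), it might look like this:

```python
# Illustration only: replace words below a frequency threshold with <unk>.
from collections import Counter

def build_vocab(tokenized_sentences, min_count=2):
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(sentence, vocab, unk="<unk>"):
    return [w if w in vocab else unk for w in sentence]

corpus = [["今日", "は", "晴れ"], ["今日", "は", "雨"]]
vocab = build_vocab(corpus, min_count=2)
print(replace_oov(["明日", "は", "晴れ"], vocab))  # ['<unk>', 'は', '<unk>']
```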
Here, I use SentencePiece instead.
https://github.com/google/sentencepiece
SentencePiece is a very convenient tool that, through unsupervised learning, segments text so that it fits within a specified vocabulary size with no OOV. See the linked repository for the detailed specification. I use it to tokenize the dataset with a vocabulary of 8,000 tokens and no OOV.
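For reference, training such a model looks roughly like this; the raw-text file name is an assumption, and the model prefix index matches the index.model used later:

```python
# Sketch: train an 8000-token SentencePiece model on one sentence per line of raw text.
# The input file name is a placeholder.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=./data/japanese_sentences.txt --model_prefix=index --vocab_size=8000"
)

sp = spm.SentencePieceProcessor()
sp.load("index.model")
print(sp.EncodeAsIds("今日はいい天気です。"))  # subword ids, no OOV
```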
Well, it's an ordinary LSTM, so there isn't much to explain. I'd be delighted if you could point out any problems.
LSTM.py
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTM(nn.Module):
    def __init__(self, source_size, hidden_size, batch_size, embedding_size, num_layers):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.source_size = source_size
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.embed_source = nn.Embedding(source_size, embedding_size, padding_idx=0)
        self.embed_source.weight.data.normal_(0, 1 / embedding_size**0.5)
        # The LSTM consumes the embedding output, so its input size is embedding_size.
        self.lstm_source = nn.LSTM(embedding_size, self.hidden_size, num_layers=self.num_layers,
                                   bidirectional=True, batch_first=True)
        # Bidirectional LSTM, so its output is 2*hidden_size wide.
        self.linear = nn.Linear(self.hidden_size*2, self.source_size)

    def forward(self, sentence_words, hx, cx):
        source_k = self.embed_source(sentence_words)
        self.lstm_source.flatten_parameters()
        encoder_output, (hx, cx) = self.lstm_source(source_k, (hx, cx))
        # Log-probabilities over the vocabulary dimension (the last one); greedy pick below.
        prob = F.log_softmax(self.linear(encoder_output), dim=-1)
        _, output = torch.max(prob, dim=-1)
        return prob, output, (hx, cx)

    def init_hidden(self, bc):
        # num_layers*2 states because the LSTM is bidirectional.
        hx = torch.zeros(self.num_layers*2, bc, self.hidden_size)
        cx = torch.zeros(self.num_layers*2, bc, self.hidden_size)
        return hx, cx
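As a quick sanity check of the shapes, something like the following should work (the sizes here are arbitrary assumptions, chosen only for illustration):

```python
# Shape check with arbitrary sizes: vocabulary 8000, hidden 128, embedding 128, 1 layer.
import torch
from model.LSTM import LSTM

model = LSTM(source_size=8000, hidden_size=128, batch_size=4, embedding_size=128, num_layers=1)
ids = torch.randint(0, 8000, (4, 19))   # a batch of 4 sequences of 19 token ids
hx, cx = model.init_hidden(4)
log_probs, greedy_ids, _ = model(ids, hx, cx)
print(log_probs.shape)   # torch.Size([4, 19, 8000])
print(greedy_ids.shape)  # torch.Size([4, 19])
```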
It's normal.
Next, the training code and the dataset loader. By the way, index.model is the tokenization model built with SentencePiece. There is no validation set; I go straight from training to testing. During training, the model learns to take a Japanese sentence as input and output exactly the same sentence. At test time only the first token of each test sentence is fed in, and the rest is generated step by step with greedy decoding. In theory, that should generate text automatically...
train.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import argparse
from loader import Dataset
import torch.optim as optim
import sentencepiece as spm
from utils import seq_to_string, to_np, trim_seqs
import matplotlib.pyplot as plt
from torchviz import make_dot
from model.LSTM import LSTM


def make_model(source_size, hidden_size, batch_size, embedding_size=256, num_layers=1):
    model = LSTM(source_size, hidden_size, batch_size, embedding_size, num_layers)
    # The model emits log-probabilities, so NLLLoss is the matching criterion.
    criterion = nn.NLLLoss(reduction="sum")
    model_opt = optim.Adam(model.parameters(), lr=0.0001)
    return model, criterion, model_opt


def data_load(maxlen, source_size, batch_size):
    data_set = Dataset(maxlen=maxlen)
    data_num = len(data_set)
    # 80/20 train/test split; any rounding remainder goes to the training set.
    train_ratio = int(data_num*0.8)
    test_ratio = int(data_num*0.2)
    res = int(data_num - (train_ratio + test_ratio))
    train_ratio += res
    ratio = [train_ratio, test_ratio]
    train_dataset, test_dataset = torch.utils.data.random_split(data_set, ratio)
    dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
    del train_dataset
    del data_set
    return dataloader, test_dataloader


def run_epoch(data_iter, model, criterion, model_opt, epoch):
    model, criterion = model.cuda(), criterion.cuda()
    model.train()
    total_loss = 0
    for i, data in enumerate(data_iter):
        model_opt.zero_grad()
        # Shift by one position: predict token t+1 from tokens up to t.
        src = data[:, :-1]
        trg = data[:, 1:]
        src, trg = src.cuda(), trg.cuda()
        hx, cx = model.init_hidden(src.size(0))
        hx, cx = hx.cuda(), cx.cuda()
        output_log_probs, output_seqs, _ = model(src, hx, cx)
        flattened_log_probs = output_log_probs.view(src.size(0) * src.size(1), -1)
        loss = criterion(flattened_log_probs, trg.contiguous().view(-1))
        loss /= (src.size(0) * src.size(1))
        loss.backward()
        model_opt.step()
        # Accumulate a Python float so the computation graph is not kept alive.
        total_loss += loss.item()
        if i % 50 == 1:
            print("Step: %d Loss: %f " % (i, loss))
    mean_loss = total_loss / len(data_iter)
    torch.save({
        'model': model.state_dict()
    }, "./model_log/model.pt")
    # Dump the computation graph of the last batch to image.png.
    dot = make_dot(output_log_probs, params=dict(model.named_parameters()))
    dot.format = 'png'
    dot.render('image')
    return model, mean_loss


def depict_graph(mean_losses, epochs):
    epoch = [i+1 for i in range(epochs)]
    plt.xlim(0, epochs)
    plt.ylim(1, mean_losses[0])
    plt.plot(epoch, mean_losses)
    plt.title("loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.show()


def test(model, data_loader):
    model.eval()
    all_output_seqs = []
    all_target_seqs = []
    with torch.no_grad():
        for data in data_loader:
            src = data[:, :-1].cuda()
            # Keep the reference sentences so true.txt is not empty.
            all_target_seqs.extend(trim_seqs(src))
            del data
            # Feed only <s> plus the first real token of each sentence ...
            input_data = src[:, :2]
            hx, cx = model.init_hidden(input_data.size(0))
            for i in range(18):
                hx, cx = hx.cuda(), cx.cuda()
                output_log_probs, output_seqs, hidden = model(input_data, hx, cx)
                hx, cx = hidden[0], hidden[1]
                # ... and append the greedily chosen token at each step, up to maxlen tokens.
                input_data = torch.cat((input_data, output_seqs[:, -1:]), 1)
            all_output_seqs.extend(trim_seqs(input_data))
    out_set = (all_target_seqs, all_output_seqs)
    return out_set


if __name__ == "__main__":
    sp = spm.SentencePieceProcessor()
    sp.load("./index.model")
    source_size = sp.GetPieceSize()
    parser = argparse.ArgumentParser(description='Parse training parameters')
    parser.add_argument('--do_train', type=str, default='False')
    parser.add_argument('--batch_size', type=int, default=256)
    parser.add_argument('--maxlen', type=int, default=20)
    parser.add_argument('--epochs', type=int, default=50)
    parser.add_argument('--hidden_size', type=int, default=128)
    parser.add_argument('--embedding_size', type=int, default=128)
    parser.add_argument('--num_layers', type=int, default=1)
    args = parser.parse_args()
    model, criterion, model_opt = make_model(source_size, args.hidden_size, args.batch_size, args.embedding_size, args.num_layers)
    data_iter, test_data_iter = data_load(args.maxlen, source_size, args.batch_size)
    mean_losses = []
    if args.do_train == "True":
        for epoch in range(args.epochs):
            print(epoch+1)
            model, mean_loss = run_epoch(data_iter, model, criterion, model_opt, epoch)
            mean_losses.append(mean_loss)
        depict_graph(mean_losses, args.epochs)
    else:
        model.load_state_dict(torch.load("./model_log/model.pt")["model"])
        model = model.cuda()  # test() runs on the GPU
    # Evaluate on the held-out split.
    out_set = test(model, test_data_iter)
    true_txt = out_set[0]
    out_txt = out_set[1]
    with open("true.txt", "w", encoding="utf-8") as f:
        for i in true_txt:
            for j in i:
                f.write(sp.IdToPiece(int(j)))
            f.write("\n")
    with open("out.txt", "w", encoding="utf-8") as f:
        for i in out_txt:
            for j in i:
                f.write(sp.IdToPiece(int(j)))
            f.write("\n")
loader.py
import torch
import numpy as np
import csv
import sentencepiece as spm


class Dataset(torch.utils.data.Dataset):
    def __init__(self, maxlen):
        self.sp = spm.SentencePieceProcessor()
        self.sp.load("./index.model")
        self.maxlen = maxlen

        with open('./data/parallel_data.csv', mode='r', newline='', encoding='utf-8') as f:
            csv_file = csv.reader(f)
            read_data = [row for row in csv_file]
        self.data_num = len(read_data) - 1  # the first row is the csv header
        jp_data = []
        for i in range(1, self.data_num + 1):  # skip the header, keep every data row
            jp_data.append(read_data[i][1:2])  # column 1 holds the ordinary (not easy) Japanese sentence

        # Each row holds maxlen+1 ids: <s>, the subword ids, </s>, then padding.
        self.en_data_idx = np.zeros((len(jp_data), maxlen+1))
        for i, sentence in enumerate(jp_data):
            self.en_data_idx[i][0] = self.sp.PieceToId("<s>")
            for j, idx in enumerate(self.sp.EncodeAsIds(sentence[0])[:]):
                self.en_data_idx[i][j+1] = idx
                if j+1 == maxlen-1:  # no room left, close the sentence here
                    self.en_data_idx[i][j+1] = self.sp.PieceToId("</s>")
                    break
            if j+2 <= maxlen-1:
                self.en_data_idx[i][j+2] = self.sp.PieceToId("</s>")
                if j+3 < maxlen-1:
                    # Out of laziness, the <unk> learned by SentencePiece doubles as padding.
                    self.en_data_idx[i][j+3:] = self.sp.PieceToId("<unk>")
            else:
                self.en_data_idx[i][j+1] = self.sp.PieceToId("</s>")
                if j+2 < maxlen-1:
                    self.en_data_idx[i][j+2:] = self.sp.PieceToId("<unk>")

    def __len__(self):
        return len(self.en_data_idx)

    def __getitem__(self, idx):
        en_data = torch.tensor(self.en_data_idx[idx][:], dtype=torch.long)
        return en_data
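To see what the loader yields, a quick sketch like this can help (it assumes index.model and data/parallel_data.csv are already in place, as described above):

```python
# Sketch: inspect one batch from the Dataset defined above.
import torch
from loader import Dataset

data_set = Dataset(maxlen=20)
loader = torch.utils.data.DataLoader(data_set, batch_size=4, shuffle=True)
batch = next(iter(loader))
print(batch.shape)    # torch.Size([4, 21]) -> maxlen+1 token ids per sentence
print(batch[0][:5])   # <s> followed by the first subword ids of one sentence
```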
For now, I train for about 100 epochs. I wasn't sure how small the number of layers and the other hyperparameters needed to be to avoid overfitting, so I simply set them so they wouldn't be too large.
This is the loss during training.
It falls steadily, but progress becomes marginal partway through. It might be worth doing something about the hyperparameters.
Next, an example of the text that was actually generated.
(^ω^)... It's no good...
For one thing, the model definition itself may be a poor fit. Would seq2seq be more suitable? I'll try it if I get the chance. In any case, I had never trained a bare LSTM end to end before (aside from the embedding layer), so thinking about what task to tackle was quite fun. My master's research is mostly built on Transformers, so I expect to post about those implementations and papers from time to time.