A while ago I read an interesting paper titled **Reproduction of sound symbols by machine learning: Generation of the strongest Pokemon**. I adapted its idea in my own way and used it for a seminar presentation. Since I put some effort into it, I wanted to share it with the outside world, so I am publishing it here.
I usually work on image tasks. I am an amateur at natural language processing and this is my first post, so there are probably many shortcomings; please bear with me.
* Link to the material used in the seminar presentation
This time, following the ideas in the paper below, we aim to generate the strongest Pokemon by the same method while adding a touch of deep learning (LSTM). The paper used a questionnaire given to test subjects to quantify the impression of each sound, but since we do not have such data, we use base stats (race values) instead.
Reproduction of sound symbols by machine learning: Generation of the strongest Pokemon. Satoshi Miura*1, Masaki Murata*1, Sho Yasuda*2, Mai Miyabe*2, Eiji Aramaki*2 (*1 Tottori University Graduate School, *2 University of Tokyo). Proceedings of the 18th Annual Meeting of the Language Processing Society (March 2012). http://luululu.com/paper/2012/C1-1.pdf
- Have 8 subjects (who do not know Pokemon) judge the strength of Pokemon by paired comparison.
- Use the results as training data to build a model with an SVM.
- Using the model, repeatedly change a Pokemon's name and re-judge its strength to generate the strongest Pokemon.
The data used this time is a table of Pokemon up to the 7th generation. Since Pokemon names (in Japanese) are at most 6 characters long, entries whose names exceed that length (Landorus Therian Forme, Zygarde 10%, etc.) are removed. Mega Evolution Pokemon are also excluded this time.
Pokemon data is borrowed from the link below.
https://rikapoke.hatenablog.jp/entry/pokemon_datasheet_gne7
import pandas as pd
status = pd.read_csv("pokemon_status.csv", encoding="shift_jis")
status
The CSV data used this time ↓
We will apply preprocessing to this data.
# Remove the extra Mega Evolution Pokemon (rows whose number contains '-')
status = status[~status['Picture book number'].str.contains('-')].copy()
status
# Remove Pokemon whose names are 7 characters or longer
status['len'] = status['Pokemon name'].str.len()
de = status[status['len'] > 6]  # the removed rows, kept for inspection
status = status[status['len'] < 7]
de
# Keep only the columns we use
status = status.loc[:, ['Pokemon name', 'total']].copy()
status
The data used this time is now complete!
From here, we continue preprocessing so that the data can be fed to the LSTM.
# Tokenize: turn a name into space-separated characters
def function(name):
    n_gram = ''
    for n in name:
        n_gram = n_gram + n + ' '
    return n_gram
status['Pokemon name'] = status['Pokemon name'].map(function)
status
# Normalize the base stat totals to [0, 1], then binarize them to 0 or 1
from sklearn import preprocessing
def labeling(pred, p=0.5):
    if pred < p:
        pred_label = 0
    else:
        pred_label = 1
    return pred_label
status['total'] = preprocessing.minmax_scale(status['total'])
status['total'] = status['total'].map(labeling)
status
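To make the normalization and labeling step concrete, here is a minimal pure-Python sketch of the same idea. The `minmax_scale` helper below is my own stand-in for the scikit-learn function of the same name, and the totals are hypothetical values chosen only for illustration, not rows from the actual CSV.

```python
def minmax_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def labeling(pred, p=0.5):
    """Return 0 if pred is below the threshold p, otherwise 1."""
    return 0 if pred < p else 1


# Hypothetical base stat totals, chosen only for illustration
totals = [200, 320, 500, 600, 680]
scaled = minmax_scale(totals)
labels = [labeling(s) for s in scaled]

print(scaled)
print(labels)
```

Note that the 0.5 threshold is the midpoint between the weakest and strongest totals, not the median, so the two classes need not be the same size.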
Data after preprocessing ↓
Split the created data into train and validation sets and save them.
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(status, random_state=1234, test_size=0.2)
train_df.to_csv("./train_df.tsv", sep='\t')
val_df.to_csv("./val_df.tsv", sep='\t')
Since I am a PyTorch devotee, I build the model with torch and briefly introduce the code here. The full code is available at the link below, so please refer to it if you are interested.
https://github.com/drop-ja/pokemon
# Note: Field, TabularDataset and BucketIterator belong to the legacy torchtext API
# (moved to torchtext.legacy in 0.9 and removed in 0.12), so an older torchtext is needed.
from torchtext import data
import torchtext
batch_size = 4
max_len = 6
# Tokenizer: split on whitespace (names were already split into characters)
tokenizer = lambda x: x.split()
# Field definitions for the name and the label
TEXT = data.Field(sequential=True, tokenize=tokenizer, include_lengths=True,
                  batch_first=True, fix_length=max_len)
LABEL = data.LabelField()
fields_train = [('id', None), ('name', TEXT), ('bs', LABEL)]
dataset_train, dataset_valid = data.TabularDataset.splits(
    path='./',
    format='TSV',
    skip_header=True,
    train="train_df.tsv",
    validation="val_df.tsv",
    fields=fields_train)
TEXT.build_vocab(dataset_train)
LABEL.build_vocab(dataset_train)
train_iter = data.BucketIterator(dataset=dataset_train, batch_size=batch_size,
                                 sort_key=lambda x: len(x.name), repeat=False, shuffle=True)
val_iter = data.BucketIterator(dataset=dataset_valid, batch_size=1,
                               sort_key=lambda x: len(x.name), repeat=False, shuffle=False)
# Model definition
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.optim as optim
import torch.nn.functional as F
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class LSTMPVClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, lstm_hidden_size, mlp_hidden_size, output_size):
        super(LSTMPVClassifier, self).__init__()
        self.lstm_hidden_size = lstm_hidden_size
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=1)
        self.lstm = nn.LSTM(embedding_dim, self.lstm_hidden_size, batch_first=True,
                            num_layers=1, bidirectional=False, dropout=0.0)
        self.fc1 = nn.Linear(self.lstm_hidden_size, mlp_hidden_size)
        self.fc2 = nn.Linear(mlp_hidden_size, output_size)
    def forward(self, x):
        b_size = x.size(0)   # Batch size
        seq_len = x.size(1)  # Pokemon name length
        x = self.embed(x)
        h0 = torch.zeros(1, b_size, self.lstm_hidden_size).to(device)
        c0 = torch.zeros(1, b_size, self.lstm_hidden_size).to(device)
        lstm_output_seq, (h_n, c_n) = self.lstm(x, (h0, c0))
        out = torch.relu(self.fc1(lstm_output_seq))
        out = torch.sigmoid(self.fc2(out))
        return out
Here is the result of training the above model for 10 epochs. 'total' is the correct label, and 'Predicted value' is the result predicted by the model.
This shows the F-score for the above result. For such a simply built model, it returns a fairly good value.
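For readers unfamiliar with the F-score, here is a minimal pure-Python sketch of how it can be computed for binary labels, treating label 1 (strong) as the positive class. The labels below are hypothetical and are not the results from this experiment; the post does not show how the score was actually calculated, so this is only an illustration of the metric.

```python
def f_score(y_true, y_pred):
    """Binary F1 score with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical correct labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f_score(y_true, y_pred))
```

Because the F-score combines precision and recall, it is more informative than plain accuracy when the two classes are not the same size.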
I fed in some names and played with the model. I entered the names of my lab members and ranked them, and just playing around like this was quite exciting.
The code is below.
# Build a dataset from a list of names
# If pri is True, print the converted values of the first batch
def to_dataset(list_obj, pri=True):
    index = pd.DataFrame(list_obj)
    index[0] = index[0].map(function)
    index.to_csv('./test.tsv', sep='\t')
    fields_test = [('id', None), ('name', TEXT)]
    dataset_test = data.TabularDataset(path='./test.tsv',
                                       format='TSV', skip_header=True, fields=fields_test)
    test_iter = data.BucketIterator(dataset=dataset_test, batch_size=1,
                                    sort_key=lambda x: len(x.name), repeat=False, shuffle=False)
    batch = next(iter(test_iter))
    if pri:
        print(batch.name)
    return test_iter
list_obj = ['Denshi Kettle', 'Gagigugego', 'microwave', 'Frying pan', 'JISABOKE', 'Pokémon']
test_iter = to_dataset(list_obj)
def result_show(test_iter, pri=True):
    test_predicted = []
    for batch in test_iter:
        text = batch.name[0]  # batch.name is (token ids, lengths) because include_lengths=True
        text = text.to(device)
        outputs = eval_net(text)  # eval_net is not defined in this post; presumably the trained model
        outputs = outputs[:, -1]  # use the output at the last time step
        tmp_pred_label = outputs.to('cpu').detach().numpy().copy()
        test_predicted.extend(tmp_pred_label[0])
    if pri:
        print(test_predicted)
    return test_predicted
result = result_show(test_iter)
df = pd.DataFrame(list_obj, columns=['name'])
df['Predicted value'] = result
# labeling() expects a single number, so apply it element-wise
df['0,1 label'] = df['Predicted value'].map(labeling)
df
Finally, the main topic.
The strongest Pokemon is generated with the same method as in the paper.
**Paper method**
- Pick a starting sample
- Randomly replace one character
- Compare the original and the modified name using the model
- Repeat the above steps 50 times
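The steps above amount to a simple hill climb over names. As a rough, self-contained sketch under stated assumptions: `toy_score` is a hypothetical stand-in for the trained model, the Latin alphabet replaces the Japanese characters used in the post, and `change_name` here is only my guess at what a replacement helper might look like (with an extra `rng` parameter for reproducibility).

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumption: the post itself works with Japanese characters


def toy_score(name):
    """Hypothetical stand-in for the trained model: share of the letters b, d, g, z."""
    return sum(ch in "bdgz" for ch in name) / len(name)


def change_name(name, n_changes, rng):
    """Replace n_changes randomly chosen positions with random characters."""
    chars = list(name)
    for pos in rng.sample(range(len(chars)), n_changes):
        chars[pos] = rng.choice(ALPHABET)
    return "".join(chars)


def generate(name, score_fn, steps=50, seed=0):
    """Repeatedly mutate one character and keep the name that scores at least as high."""
    rng = random.Random(seed)
    history = [(name, score_fn(name))]
    for _ in range(steps):
        candidate = change_name(name, 1, rng)
        # Keep the current name only if it scores strictly higher than the candidate
        if score_fn(candidate) >= score_fn(name):
            name = candidate
        history.append((name, score_fn(name)))
    return name, history


best, history = generate("parasect", toy_score)
print(best, toy_score(best))
```

Since a candidate is accepted only when it scores at least as high as the current name, the score never decreases over the 50 steps. Changing `seed` gives a different sequence of replacements and usually a different final name.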
In the paper, generation started from "Parasect" and "Nidoqueen", so I do the same. The code is below.
def generate_pokemon(string):
    history = []
    score = []
    history.append(string)
    for i in range(50):
        # change_name is not shown in this post; it presumably replaces 1 character at random
        changed_string = change_name(string, 1)
        cd_result = result_show(to_dataset([string, changed_string], False), False)
        # Record the score of the starting name only on the first iteration
        if i == 0:
            score.append(cd_result[0])
        if cd_result[0] > cd_result[1]:
            score.append(cd_result[0])
        else:
            string = changed_string
            score.append(cd_result[1])
        history.append(string)
    cd_df = pd.DataFrame(history, columns=['name'])
    cd_df['Predicted value'] = score
    return string, cd_df
string = 'Parasect'
saikyou, port = generate_pokemon(string)
print('The strongest name: ', saikyou)
pd.DataFrame(port)
Generation result starting from Parasect (steps up to 10 of 50)
The final result was "Egineo". Because the characters are replaced at random, the result changes on every run, which I found interesting, so I ran it many times. It is fun to tinker while checking the results like this.
I used Pokemon this time, but it also seems interesting to try this with ramen shop names and their Tabelog ratings.