This article is a continuation of Predicting short-lived works of Weekly Shonen Jump by machine learning (Part 1: Data analysis). Using the data acquired in Part 1, we implement and evaluate a classifier based on a multi-layer perceptron. Hereafter, "Jump" refers to Weekly Shonen Jump.
The figure above shows part of the evaluation results. With the best model (Filtered + Augmented), **given the publication order [^ publication order] up to the 7th week and the number of color pages, works that end within 20 weeks can be predicted with about 65% accuracy** [^ jump]. The 100 most recent works registered in the Japan Media Arts Database were used for evaluation, and the other works were used for training and parameter tuning. I tried various ideas, but this performance was the best I could achieve on my own. The details are explained below. The Jupyter notebook is here, and the source code is here.
This article does not express any opinion on Jump's editorial policy, nor does it call for the cancellation or continuation of any particular work. Keep it up, Jump! Keep it up, manga artists!
[^ publication order]: The Jump editorial department seems to have denied the "questionnaire supremacy" rumor, saying, "We do not necessarily consider only the results of the reader questionnaire." See: "Jump" editorial department denies rumors of questionnaire supremacy... readers have mixed feelings.
[^ jump]: As mentioned above, in reality the Jump editorial department decides which works to end based on a variety of factors. Please take this article as the daydream of a Jump fan.
2.1 anaconda
In [anaconda](https://www.continuum.io/Downloads), create a virtual environment named `comic` as follows:
conda create -n comic python=3.5
source activate comic
conda install pandas matplotlib jupyter notebook scipy scikit-learn seaborn scrapy
pip install tensorflow
The yml file is here. It includes `tensorflow` and `scikit-learn`. Also, since `pairplot()` was used in Part 1, `seaborn` is installed as well.
It is assumed that the `wj-api.json` obtained in Part 1 is in the `data` directory, and that the `ComicAnalyzer` class introduced in Part 1 is defined in `comic.py`.
import comic
wj = comic.ComicAnalyzer()
Because I want to display manga titles in Japanese, the font is configured by referring to Draw Japanese with matplotlib on Ubuntu. If you are using something other than Ubuntu, please adjust accordingly.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import matplotlib
from matplotlib.font_manager import FontProperties
font_path = '/usr/share/fonts/truetype/takao-gothic/TakaoPGothic.ttf'
font_prop = FontProperties(fname=font_path)
matplotlib.rcParams['font.family'] = font_prop.get_name()
In this article, we tackle the problem of classifying whether a work is short-lived or not, based on the following input.
As input, we use a total of 8 dimensions: the publication order in each of the first 7 weeks of serialization, plus the number of color pages. The data is limited to the first 7 weeks because I wanted to predict cancellation at least one week before the end of the shortest serialization in recent years (8 weeks). The number of color pages is used in addition to the publication order to improve prediction accuracy: intuitively, more popular works tend to get more color pages.
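As a made-up illustration of the input format (the actual feature extraction is implemented later in `ComicNet.get_x()`), one input vector might look like this:

```python
# Hypothetical 8-dimensional input for one work (values are invented):
# 7 normalized publication-order values (weeks 1-7) followed by the
# color count averaged over those 7 weeks.
x_example = [0.95, 0.80, 0.72, 0.55, 0.41, 0.38, 0.30, 2 / 7]
```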
In Part 1, "short-lived works" are defined as follows.
In this article, we use machine learning to predict short-lived works (works that finish within 10 weeks).
As a preliminary experiment, I tried to classify short-lived works using this definition, but training did not go well. Analyzing `wj-api.json` again shows that very few works end within 10 weeks.
The figure on the left shows the cumulative distribution over all works, and the figure on the right zooms in on the first 50 weeks. The horizontal axis is the serialization period, and the vertical axis is the cumulative percentage of works. From the right figure, fewer than 10% of works end within 10 weeks. As pointed out in "Why neural networks can't beat SVM", a multi-layer perceptron is not good at learning from unbalanced data [^ commitment].
According to "Applying deep learning to real-world problems" (Merantix), changing the labeling is one proposed countermeasure when the data labels are imbalanced. Therefore, for convenience, this article changes the definition of a short-lived work to **a work that ends within 20 weeks** (predicting works that end within 10 weeks is left as homework for the future...). With a threshold of 20 weeks, about half of the works can be treated as short-lived works.
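As a rough sanity check, the class balance for a given threshold can be estimated from `wj-api.json` with a snippet like the following. This is a minimal sketch that reuses the `wj = comic.ComicAnalyzer()` instance created above and assumes the Part 1 interface (`end_titles`, and `extract_item()` returning one entry per published week):

```python
def short_ratio(analyzer, thresh_week):
    """Fraction of completed works whose serialization length is <= thresh_week."""
    lengths = [len(analyzer.extract_item(title)) for title in analyzer.end_titles]
    return sum(length <= thresh_week for length in lengths) / len(lengths)

print(short_ratio(wj, 10))  # well under 0.5 (under 10%, per the figure above)
print(short_ratio(wj, 20))  # roughly half of the works
```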
[^ commitment]: It would be reasonable to point out that an SVM should then be used instead. This time, I stuck with the perceptron for study purposes.
The following is the multi-layer perceptron model used in this article. For more on multi-layer perceptrons, see "Notes on the Backpropagation Method".
The hidden part consists of 2 layers with 7 nodes each, using ReLU as the activation function. The output layer outputs the probability that a work is short-lived, using a sigmoid activation function. Adam is used for optimization, and the learning rate $ r $ is tuned with TensorBoard. Incidentally, the model configuration above (number of hidden layers, number of hidden nodes, hidden-layer activation function, and optimization algorithm) is the one that performed best in preliminary experiments.
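For reference, the same architecture can be written compactly as the following sketch. This is not the article's implementation (the actual model is built with low-level TensorFlow in `build_graph()` below); it assumes a `tf.keras` API is available:

```python
import tensorflow as tf

# 8 inputs -> 7 ReLU -> 7 ReLU -> 1 sigmoid, trained with Adam on binary cross-entropy.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(7, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(7, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
```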
In this article, we use 273 short-lived works and 273 other works (hereafter, continuing works), for a total of 546 works. Counting from the newest works, 100 are used as test data, the next 100 as validation data, and the remaining 346 as training data. The test data is for the final evaluation, the validation data is for hyperparameter tuning, and the training data is for training. "Why do we need to separate validation and test sets for supervised learning?" covers this in detail.
In this article, the training data is used in the following three different ways. `x_test` and `y_test` denote the test data, `x_val` and `y_val` the validation data, and `x_tra` and `y_tra` the training data.
In Dataset 1, all 346 works of the training data are used for training. In Dataset 2, about half of the oldest works are excluded from the training data before training, because I suspected that the oldest works no longer reflect the current cancellation policy of the Jump editorial department and would act as noise. In Dataset 3, Dataset 2 is inflated by dataset augmentation before training, because I suspected Dataset 2 alone contains too little training data to achieve sufficient generalization.
Dataset augmentation is a technique that inflates the training data by transforming existing samples. It is known to be effective mainly for improving the performance of image recognition and speech recognition. For details, see Section 7.4 of the Deep Learning book and "How to increase the number of machine learning dataset images". A hidden theme of this article is to evaluate how effective dataset augmentation is for predicting the cancellation of weekly manga. In this article, data augmentation is performed by the method shown below.
Roughly speaking, new samples are generated by randomly selecting two samples with the same label and taking a random weighted average of them. Behind this is the assumption that a work whose record (in terms of publication order) lies between those of several short-lived works would itself be short-lived. Intuitively, this does not seem like a bad assumption.
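A minimal sketch of this idea with made-up numbers (the actual implementation is `ComicNet.augment_x()` below):

```python
import numpy as np

# Two (invented) 8-dimensional feature vectors that share the same label.
x_a = np.array([0.95, 0.80, 0.72, 0.55, 0.41, 0.38, 0.30, 2 / 7])
x_b = np.array([0.90, 0.85, 0.60, 0.50, 0.45, 0.30, 0.25, 1 / 7])

w = np.random.rand(len(x_a))      # one random weight per feature
x_new = w * x_a + (1 - w) * x_b   # synthetic sample, given the same label as x_a and x_b
```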
The class `ComicNet()` that manages the multi-layer perceptron is defined below. `ComicNet()` sets up the data (test, validation, and training), builds the multi-layer perceptron, trains it, and tests it. TensorFlow is used for the implementation. For TensorFlow itself, "I'm not a programmer or data scientist in particular, but I've touched TensorFlow for a month, so it's super easy to understand" is a detailed reference.
ComicNet()
class ComicNet():
"""This class manages a multi-layer perceptron that identifies whether a manga work is short-lived or not.
:param thresh_week: Threshold that separates short-lived works from others.
    :param n_x: Number of weeks of publication order fed into the multi-layer perceptron.
"""
def __init__(self, thresh_week=20, n_x=7):
self.n_x = n_x
self.thresh_week = thresh_week
The following is a brief description of each member function.
configure_dataset() etc.
ComicNet
def get_x(self, analyzer, title):
"""It is a function to get the normalized publication order of the specified work up to the specified week."""
worsts = np.array(analyzer.extract_item(title)[:self.n_x])
bests = np.array(analyzer.extract_item(title, 'best')[:self.n_x])
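        # worsts + bests - 1 is presumably the total number of works in that week's issue
        # (given the Part 1 definitions of 'worst' and 'best'), so bests_normalized lies
        # in (0, 1], with smaller values meaning a position closer to the front.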
bests_normalized = bests / (worsts + bests - 1)
        color = sum(analyzer.extract_item(title, 'color')[:self.n_x]) / self.n_x
return np.append(bests_normalized, color)
def get_y(self, analyzer, title, thresh_week):
"""This is a function to get whether the specified work is a short-lived work."""
return int(len(analyzer.extract_item(title)) <= thresh_week)
def get_xs_ys(self, analyzer, titles, thresh_week):
"""A function that returns the features, label, and title of the specified work group.
y==0 and y==Returns the same number of data of 1.
"""
xs = np.array([self.get_x(analyzer, title) for title in titles])
ys = np.array([[self.get_y(analyzer, title, thresh_week)]
for title in titles])
        # Align the numbers of ys==0 and ys==1 samples.
idx_ps = np.where(ys.reshape((-1)) == 1)[0]
idx_ng = np.where(ys.reshape((-1)) == 0)[0]
len_data = min(len(idx_ps), len(idx_ng))
x_ps = xs[idx_ps[-len_data:]]
x_ng = xs[idx_ng[-len_data:]]
y_ps = ys[idx_ps[-len_data:]]
y_ng = ys[idx_ng[-len_data:]]
t_ps = [titles[ii] for ii in idx_ps[-len_data:]]
t_ng = [titles[ii] for ii in idx_ng[-len_data:]]
return x_ps, x_ng, y_ps, y_ng, t_ps, t_ng
def augment_x(self, x, n_aug):
"""A function that artificially generates a specified number of x data."""
if n_aug:
x_pair = np.array(
[[x[idx] for idx in
np.random.choice(range(len(x)), 2, replace=False)]
for _ in range(n_aug)])
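            # weights has shape (n_aug, 1, n_x + 1): one random weight per feature.
            # Stacking it with (1 - weights) along axis 1 and summing x_pair * weights
            # over axis 1 takes a per-feature weighted average of each pair of samples.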
weights = np.random.rand(n_aug, 1, self.n_x + 1)
weights = np.concatenate((weights, 1 - weights), axis=1)
x_aug = (x_pair * weights).sum(axis=1)
return np.concatenate((x, x_aug), axis=0)
else:
return x
def augment_y(self, y, n_aug):
"""A function that artificially generates a specified number of y data."""
if n_aug:
y_aug = np.ones((n_aug, 1)) if y[0, 0] \
else np.zeros((n_aug, 1))
return np.concatenate((y, y_aug), axis=0)
else:
return y
def configure_dataset(self, analyzer, n_drop=0, n_aug=0):
"""A function that sets a dataset.
:param analyzer:An instance of the ComicAnalyzer class
:param n_drop:Number of old data to exclude from training data
:param n_aug:Number of augmented data to add to training data
"""
x_ps, x_ng, y_ps, y_ng, t_ps, t_ng = self.get_xs_ys(
analyzer, analyzer.end_titles, self.thresh_week)
self.x_test = np.concatenate((x_ps[-50:], x_ng[-50:]), axis=0)
self.y_test = np.concatenate((y_ps[-50:], y_ng[-50:]), axis=0)
self.titles_test = t_ps[-50:] + t_ng[-50:]
self.x_val = np.concatenate((x_ps[-100 : -50],
x_ng[-100 : -50]), axis=0)
self.y_val = np.concatenate((y_ps[-100 : -50],
y_ng[-100 : -50]), axis=0)
self.x_tra = np.concatenate(
(self.augment_x(x_ps[n_drop//2 : -100], n_aug//2),
self.augment_x(x_ng[n_drop//2 : -100], n_aug//2)), axis=0)
self.y_tra = np.concatenate(
(self.augment_y(y_ps[n_drop//2 : -100], n_aug//2),
self.augment_y(y_ng[n_drop//2 : -100], n_aug//2)), axis=0)
`configure_dataset()` first obtains the inputs (`x_ps`, `x_ng`), labels (`y_ps`, `y_ng`), and titles (`t_ps`, `t_ng`) with `get_xs_ys()`. Here, the number of short-lived work samples (`x_ps`, `y_ps`, `t_ps`) is equal to the number of continuing work samples (`x_ng`, `y_ng`, `t_ng`). Of these, the newest 100 works are used as test data, the next newest 100 works as validation data, and the remainder as training data. When setting up the training data, a total of `n_drop` old samples are excluded first, and then a total of `n_aug` augmented samples are added.
build_graph()
ComicNet
def build_graph(self, r=0.001, n_h=7, stddev=0.01):
"""A function that builds a multi-layer perceptron.
:param r:Learning rate
:param n_h:Number of nodes in the hidden layer
:param stddev:Standard deviation of the initial distribution of variables
"""
tf.reset_default_graph()
#Input layer and target
n_y = self.y_test.shape[1]
self.x = tf.placeholder(tf.float32, [None, self.n_x + 1], name='x')
self.y = tf.placeholder(tf.float32, [None, n_y], name='y')
#Hidden layer (1st layer)
self.w_h_1 = tf.Variable(
tf.truncated_normal((self.n_x + 1, n_h), stddev=stddev))
self.b_h_1 = tf.Variable(tf.zeros(n_h))
self.logits = tf.add(tf.matmul(self.x, self.w_h_1), self.b_h_1)
self.logits = tf.nn.relu(self.logits)
#Hidden layer (second layer)
self.w_h_2 = tf.Variable(
tf.truncated_normal((n_h, n_h), stddev=stddev))
self.b_h_2 = tf.Variable(tf.zeros(n_h))
self.logits = tf.add(tf.matmul(self.logits, self.w_h_2), self.b_h_2)
self.logits = tf.nn.relu(self.logits)
#Output layer
self.w_y = tf.Variable(
tf.truncated_normal((n_h, n_y), stddev=stddev))
self.b_y = tf.Variable(tf.zeros(n_y))
self.logits = tf.add(tf.matmul(self.logits, self.w_y), self.b_y)
tf.summary.histogram('logits', self.logits)
#Loss function
self.loss = tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(
logits=self.logits, labels=self.y))
tf.summary.scalar('loss', self.loss)
#optimisation
self.optimizer = tf.train.AdamOptimizer(r).minimize(self.loss)
self.output = tf.nn.sigmoid(self.logits, name='output')
correct_prediction = tf.equal(self.y, tf.round(self.output))
self.acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32),
name='acc')
tf.summary.histogram('output', self.output)
tf.summary.scalar('acc', self.acc)
self.merged = tf.summary.merge_all()
In the input layer, `tf.placeholder` defines the input tensor (`x`) and the teacher label tensor (`y`).
In the hidden layers, `tf.Variable` defines the weight tensors (`w_h_1`, `w_h_2`) and bias tensors (`b_h_1`, `b_h_2`). `tf.truncated_normal` is given as the initial distribution of each `Variable`. `truncated_normal` draws from a normal distribution in which values beyond 2 standard deviations are re-drawn, and it is commonly used for initialization. In fact, the standard deviation of this `truncated_normal` is one of the important hyperparameters affecting model performance; based on preliminary experiments, it is set to 0.01 here. `tf.add`, `tf.matmul`, and `tf.nn.relu` are used to connect the tensors and form the hidden layers. Incidentally, if you rewrite `tf.nn.relu` as `tf.nn.sigmoid`, the activation function becomes sigmoid, as sketched below. See the TensorFlow documentation for the other activation functions that are available.
The output layer basically performs the same processing as the hidden layers. Note, however, that no activation function is applied in the output layer itself, because the loss function `tf.nn.sigmoid_cross_entropy_with_logits` applies the sigmoid internally. By passing tensors to `tf.summary.scalar` (and `tf.summary.histogram`), their evolution over time can be checked with TensorBoard.
`tf.train.AdamOptimizer` is used as the optimization algorithm; see the TensorFlow documentation for the other optimizers that are available. The final output `output` (the sigmoid of `logits`) is rounded (that is, thresholded at 0.5), and the accuracy against the teacher label `y` is computed as `acc`. Finally, all log information is merged with `tf.summary.merge_all`.
train()
In TensorFlow, learning takes place inside a `tf.Session`. You must always initialize the `Variable`s with `tf.global_variables_initializer()` (otherwise TensorFlow will complain).
The model is trained by `sess.run(self.optimizer)`. The first argument of `sess.run` can be a tuple, so multiple values can be fetched at once. Also, when calling `sess.run()`, values must be fed into each `placeholder` in dictionary form. `x_tra` and `y_tra` are fed during training, and `x_val` and `y_val` during validation.
Log information for TensorBoard can be saved with `tf.summary.FileWriter`, and the trained model can be saved with `tf.train.Saver`.
ComicNet
def train(self, epoch=2000, print_loss=False, save_log=False,
log_dir='./logs/1', log_name='', save_model=False,
model_name='prediction_model'):
"""A function that trains a multi-layer perceptron and saves logs and trained models.
:param epoch:Number of epochs
        :param print_loss:Whether to output the history of the loss function
:param save_log:Whether to save the log
:param log_dir:Log storage directory
:param log_name:Log save name
:param save_model:Whether to save the trained model
        :param model_name:Save name of the trained model
"""
with tf.Session() as sess:
sess.run(tf.global_variables_initializer()) #Variable initialization
#Settings for saving logs
log_tra = log_dir + '/tra/' + log_name
writer_tra = tf.summary.FileWriter(log_tra)
log_val = log_dir + '/val/' + log_name
writer_val = tf.summary.FileWriter(log_val)
for e in range(epoch):
feed_dict = {self.x: self.x_tra, self.y: self.y_tra}
_, loss_tra, acc_tra, mer_tra = sess.run(
(self.optimizer, self.loss, self.acc, self.merged),
feed_dict=feed_dict)
# validation
feed_dict = {self.x: self.x_val, self.y: self.y_val}
loss_val, acc_val, mer_val = sess.run(
(self.loss, self.acc, self.merged),
feed_dict=feed_dict)
#Save log
if save_log:
writer_tra.add_summary(mer_tra, e)
writer_val.add_summary(mer_val, e)
#Loss function output
if print_loss and e % 500 == 0:
print('# epoch {}: loss_tra = {}, loss_val = {}'.
format(e, str(loss_tra), str(loss_val)))
#Save model
if save_model:
saver = tf.train.Saver()
_ = saver.save(sess, './models/' + model_name)
test()
ComicNet
def test(self, model_name='prediction_model'):
"""A function that reads and tests the specified model.
:param model_name:The name of the model to load
"""
tf.reset_default_graph()
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
#Model loading
loader = tf.train.import_meta_graph(
'./models/{}.meta'.format(model_name))
loader.restore(sess, './models/' + model_name)
x_loaded = loaded_graph.get_tensor_by_name('x:0')
y_loaded = loaded_graph.get_tensor_by_name('y:0')
loss_loaded = loaded_graph.get_tensor_by_name('loss:0')
acc_loaded = loaded_graph.get_tensor_by_name('acc:0')
output_loaded = loaded_graph.get_tensor_by_name('output:0')
# test
feed_dict = {x_loaded: self.x_test, y_loaded: self.y_test}
loss_test, acc_test, output_test = sess.run(
(loss_loaded, acc_loaded, output_loaded), feed_dict=feed_dict)
return acc_test, output_test
`test()` is a member function that tests a trained multi-layer perceptron. `tf.train.import_meta_graph` is used to load the trained model, the test data (`x_test`, `y_test`) are passed via `feed_dict`, and `sess.run` is executed.
By visualizing the accuracy (correct-answer rate) and loss (loss function output) on the validation data with TensorBoard, the hyperparameters (learning rate $ r $ and number of epochs) are tuned. For more information on TensorBoard, please refer to the official documentation. For simplicity, this article adjusts $ r $ to only one significant digit. Although the details are omitted, the number of hidden layers (2), the hidden-layer activation function (ReLU), the standard deviation of the initial weight distribution (0.01), and the optimization algorithm (Adam) were roughly tuned in preliminary experiments.
rs = [n * 10 ** m for m in range(-4, -1) for n in range(1, 10)]
datasets = [
{'n_drop':0, 'n_aug':0},
{'n_drop':173, 'n_aug':0},
{'n_drop':173, 'n_aug':173},
]
wjnet = ComicNet()
for i, dataset in enumerate(datasets):
wjnet.configure_dataset(wj, n_drop=dataset['n_drop'],
n_aug=dataset['n_aug'])
log_dir = './logs/dataset={}/'.format(i + 1)
for r in rs:
log_name = str(r)
wjnet.build_graph(r=r)
wjnet.train(epoch=20000, save_log=True, log_dir=log_dir,
log_name=log_name)
print('Saved log of dataset={}, r={}'.format(i + 1, r))
For Dataset 1, let's look at the accuracy and loss of validation data with TensorBoard.
tensorboard --logdir=./logs/dataset=1/val
The horizontal axis is the number of epochs. From these curves, we look for the $ r $ and the number of epochs that minimize the validation loss.
For Dataset 1, $ r = 0.0003 $ and $ epoch = 2000 $ seem to be good. Do the same for Dataset 2 and Dataset 3.
For Dataset 2, $ r = 0.0005 $ and $ epoch = 2000 $ seem to be good.
For Dataset 3, $ r = 0.0001 $ and $ epoch = 8000 $ seem to be good.
For each Dataset, train with the hyperparameters adjusted above and save the model.
params = [
{'n_drop':0, 'n_aug':0, 'r':0.0003,
'e': 2000, 'name':'1: Original'},
{'n_drop':173, 'n_aug':0, 'r':0.0005,
'e': 2000, 'name':'2: Filtered'},
{'n_drop':173, 'n_aug':173, 'r':0.0001,
'e': 8000, 'name':'3: Filtered+Augmented'}
]
wjnet = ComicNet()
for i, param in enumerate(params):
model_name = str(i + 1)
wjnet.configure_dataset(wj, n_drop=param['n_drop'],
n_aug=param['n_aug'])
wjnet.build_graph(r=param['r'])
wjnet.train(save_model=True, model_name=model_name, epoch=param['e'])
print('Trained', param['name'])
Evaluate the performance with `ComicNet.test()`.
accs = []
outputs = []
for i, param in enumerate(params):
model_name = str(i + 1)
acc, output = wjnet.test(model_name)
accs.append(acc)
outputs.append(output)
print('Test model={}: acc={}'.format(param['name'], acc))
plt.bar(range(3), accs, tick_label=[param['name'] for param in params])
for i, acc in enumerate(accs):
plt.text(i - 0.1, acc-0.3, str(acc), color='w')
plt.ylabel('Accuracy')
Even random classification would give $ acc = 0.5 $, so the result is underwhelming... Fortunately, the effects of filtering and augmentation could at least be confirmed.
Let's dig a little deeper into the results of the best performing Model 3 (Filtered + Augmented).
idx_sorted = np.argsort(output.reshape((-1)))
output_sorted = np.sort(output.reshape((-1)))
y_sorted = np.array([wjnet.y_test[i, 0] for i in idx_sorted])
title_sorted = np.array([wjnet.titles_test[i] for i in idx_sorted])
t_ng = np.logical_and(y_sorted == 0, output_sorted < 0.5)
f_ng = np.logical_and(y_sorted == 1, output_sorted < 0.5)
t_ps = np.logical_and(y_sorted == 1, output_sorted >= 0.5)
f_ps = np.logical_and(y_sorted == 0, output_sorted >= 0.5)
weeks = np.array([len(wj.extract_item(title)) for title in title_sorted])
plt.plot(weeks[t_ng], output_sorted[t_ng], 'o', ms=10,
alpha=0.5, c='b', label='True negative')
plt.plot(weeks[f_ng], output_sorted[f_ng], 'o', ms=10,
alpha=0.5, c='r', label='False negative')
plt.plot(weeks[t_ps], output_sorted[t_ps], '*', ms=15,
alpha=0.5, c='b', label='True positive')
plt.plot(weeks[f_ps], output_sorted[f_ps], '*', ms=15,
alpha=0.5, c='r', label='False positive')
plt.ylabel('Output')
plt.xlabel('Serialized weeks')
plt.xscale('log')
plt.ylim(0, 1)
plt.legend()
The figure above shows the relationship between the actual serialization period and the classifier output. Blue marks works that were classified correctly (True), and red marks misclassified works (False). Stars are works classified as short-lived (Positive), and circles are works classified as continuing (Negative). The more blue points there are, and the more the distribution concentrates from the upper left to the lower right of the graph, the better the classification performance.
First, it is concerning that no output exceeds 0.75. Is training not going well? It is not clear... The next concern is the false positives in the upper right of the graph: some popular works serialized for more than 100 weeks are misclassified as short-lived. So let's compare the publication order (worst) of representative works from each classification result.
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
for output, week, title in zip(
output_sorted[t_ps][-5:], weeks[t_ps][-5:], title_sorted[t_ps][-5:]):
plt.plot(range(1, 8), wj.extract_item(title)[:7],
label='{0} ({1:>3}, {2:.2f})'.format(title[:5], week, output))
plt.ylabel('Worst')
plt.ylim(0, 23)
plt.title('Part of True positive (correctly classified short-lived work)')
plt.legend()
plt.subplot(2, 2, 2)
for output, week, title in zip(
output_sorted[f_ps], weeks[f_ps], title_sorted[f_ps]):
if week > 100:
plt.plot(range(1, 8), wj.extract_item(title)[:7],
label='{0} ({1:>3}, {2:.2f})'.format(title[:5], week, output))
plt.ylim(0, 23)
plt.title('Part of False positive (continuation work misclassified as short-lived work)')
plt.legend()
plt.subplot(2, 2, 3)
for output, week, title in zip(
output_sorted[f_ng][:5], weeks[f_ng][:5], title_sorted[f_ng][:5]):
plt.plot(range(1, 8), wj.extract_item(title)[:7],
label='{0} ({1:>3}, {2:.2f})'.format(title[:5], week, output))
plt.xlabel('Weeks')
plt.ylabel('Worst')
plt.ylim(0, 23)
plt.title('Part of False negative (short-lived work misclassified as a continuation work)')
plt.legend()
plt.subplot(2, 2, 4)
for output, week, title in zip(
output_sorted[t_ng][:5], weeks[t_ng][:5], title_sorted[t_ng][:5]):
plt.plot(range(1, 8), wj.extract_item(title)[:7],
label='{0} ({1:>3}, {2:.2f})'.format(title[:5], week, output))
plt.xlabel('Weeks')
plt.ylim(0, 23)
plt.title('Part of True negative (correctly classified continuation work)')
plt.legend()
The horizontal axis is the publication week, and the vertical axis is the publication order counted from the back of the magazine. The legend shows the title of each work (serialization period, output value). The false-positive works (upper right) show a stronger downward trend in publication order up to week 7 than the true-negative works (lower right). Put another way, the false-positive works can be seen as popular works that recovered from a poor start. Also, the publication order of the false-negative works (lower left) up to week 7 declines only gently and, at least to my eye, is indistinguishable from that of the true-negative works (lower right). The reasons for the misclassifications are understandable.
For reference, the output values of all 100 test works are plotted below.
labels = np.array(['{0} ({1:>3})'.format(title[:6], week)
for title, week in zip(title_sorted, weeks) ])
plt.figure(figsize=(4, 18))
plt.barh(np.arange(100)[t_ps], output_sorted[t_ps], color='b')
plt.barh(np.arange(100)[f_ps], output_sorted[f_ps], color='r')
plt.barh(np.arange(100)[f_ng], output_sorted[f_ng], color='r')
plt.barh(np.arange(100)[t_ng], output_sorted[t_ng], color='b')
plt.yticks(np.arange(100), labels)
plt.xlim(0, 1)
plt.xlabel('Output')
for i, out in enumerate(output_sorted):
plt.text(out + .01, i - .5, '{0:.2f}'.format(out))
The horizontal axis represents the output value, and the parentheses next to each title indicate the serialization period. Blue indicates correct classification results, and red indicates incorrect ones. The closer the output value is to 1, the more strongly the work is judged to be short-lived.
Actually, this article is positioned as an output of what I learned in the Deep Learning Foundation Nanodegree [^ nd101], and that is why I started writing it. It is also why I stubbornly stuck to the multi-layer perceptron. Applying machine learning to real-world problems really is hard; if it were not for this theme, I would surely have given up.
The final performance was disappointing, but it was good to be able to confirm the effects of filtering and augmenting the dataset. Performance might improve a little more by tuning the hyperparameters (`n_drop`, `n_aug`) that were fixed arbitrarily this time, or, as pointed out in "Why neural networks cannot beat SVM", by applying other machine learning methods such as SVM. But I am exhausted, so I will leave it here.
Since Part 1 was released, I have received feedback from many people, both in person and online. Nothing could make a Sunday programmer happier. I hope to keep learning with you in the future. Thank you for reading to the end!
[^ nd101]: I'm a so-called March student. Thank you.
In writing this article, I referred to the following. Thank you very much! :bow: