Reference URL: https://www.tensorflow.org/tutorials/keras/text_classification?hl=ja
Do the following
--Classify movie reviews into positive or negative using the text --Construct a neural network to classify as positive or negative (binary classification) --Training neural networks --Evaluate the performance of the model
import tensorflow as tf
from tensorflow import keras
import numpy as np
#Ver confirmation of tensorflow
print(tf.__version__)
2.3.0
Use IMDB dataset
num_words = 10000
is for holding 10,000 most frequently occurring words
Rarely appearing words are discarded to make the data size manageable
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
Training entries: 25000, labels: 25000
The data is labeled positive or negative with an array of integers representing the words in the movie review 0 indicates negative reviews, 1 indicates positive reviews
print(train_data[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
print(train_labels[0])
1
Movie reviews vary in length for each data Inputs to the neural network must be the same length and must be resolved
len(train_data[0]), len(train_data[1])
(218, 189)
#A dictionary that maps words to integers
word_index = imdb.get_word_index()
#The first part of the index is reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2 # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
return ' '.join([reverse_word_index.get(i, '?') for i in text])
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step
decode_review(train_data[0])
"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
There are two main ways to shape the data input to the neural network.
--Convert an array into a vector of 0s and 1s representing the occurrence of words, similar to one-hot encoding --For example, the array [3, 5] is a 10,000-dimensional vector with all zeros except indexes 3 and 5. And let this be the first layer of the network, that is, the Dense layer that can handle floating point vector data. However, this is a memory-intensive method that requires a matrix of words x reviews.
--Align the array to the same length by padding, and tensor an integer in the form of the number of samples * maximum length And make the Embedding layer that can handle this format the first layer of the network
The latter is adopted in this tutorial
Use the pad_sequences
function to standardize lengths
Reference: Preprocessing of sequence --Keras Documentation
--value: Value to pad --Padding: Position to pad (pre or post) --maxlen: Maximum length of array If not specified, it will be the maximum length in a given array
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
value=word_index["<PAD>"],
padding='post',
maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
value=word_index["<PAD>"],
padding='post',
maxlen=256)
#Make sure the lengths are the same
len(train_data[0]), len(train_data[1])
(256, 256)
#Check the first padded data
print(train_data[0])
[ 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941
4 173 36 256 5 25 100 43 838 112 50 670 2 9
35 480 284 5 150 4 172 112 167 2 336 385 39 4
172 4536 1111 17 546 38 13 447 4 192 50 16 6 147
2025 19 14 22 4 1920 4613 469 4 22 71 87 12 16
43 530 38 76 15 13 1247 4 22 17 515 17 12 16
626 18 2 5 62 386 12 8 316 8 106 5 4 2223
5244 16 480 66 3785 33 4 130 12 16 38 619 5 25
124 51 36 135 48 25 1415 33 6 22 12 215 28 77
52 5 14 407 16 82 2 8 4 107 117 5952 15 256
4 2 7 3766 5 723 36 71 43 530 476 26 400 317
46 7 4 2 1029 13 104 88 4 381 15 297 98 32
2071 56 26 141 6 194 7486 18 4 226 22 21 134 476
26 480 5 144 30 5535 18 51 36 28 224 92 25 104
4 226 65 16 38 1334 88 12 16 283 5 16 4472 113
103 32 15 16 5345 19 178 32 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]
keras.layers.Embedding
Takes an integer-encoded vocabulary and searches for the embedded vector corresponding to each word index Embedded vectors are learned during model training One dimension is added to the output matrix for vectorization As a result, the dimensions are (batch, sequence, embedding) Simply put, it returns a vector representation of each word as input.
keras.layers.GlobalAveragePooling1D
GlobalAveragePooling1D (1D global average pooling) layer For each sample, find the mean in the dimensional direction of the sequence and return a fixed-length vector Simply put, take the average value for each dimension of the word vector (compression of information amount)
Fully connected layer with 16 hidden units activation ='relu' specifies the activation function ReLU
Fully join to one output node By using a sigmoid for the activation function, the output value will be a floating point number between 0 and 1 representing probability or certainty.
#The input format is the number of vocabularies used in movie reviews (10),000 words)
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 16) 160000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16) 0
_________________________________________________________________
dense (Dense) (None, 16) 272
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
The above model has two intermediate or "hidden" layers between the input and the output. The output (unit, node, or neuron) is the number of dimensions of the internal representation of that layer. In other words, the degree of freedom in which this network acquires internal representations through learning.
If the model has more hidden units (larger dimensions in the internal representation space) Or, if there are more layers, or both, the network can learn more complex internal representations. However, as a result, the amount of calculation of the network increases, and patterns that you do not want to learn are learned. Patterns that you do not want to learn are patterns that improve the performance of training data but not the performance of test data. This problem is called overfitting.
Defining a model for learning
--optimer: Optimizer algorithm
--This time, specify ʻAdam --Other optimization algorithms: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers --loss: loss function --This time, specify
binary cross entropy --metrics: Items quantified during learning and testing --This time, specify ʻaccuracy
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
When training, verify the accuracy rate with data that the model does not see Create validation set by separating 10,000 samples from the original training data The verification data is used for adjusting hyperparameters when building a model.
x_val = train_data[:10000]
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
Train a 40 epoch model using a mini-batch of 512 samples As a result, all the samples contained in x_train and y_train will be repeated 40 times. During training, use 10,000 samples of validation data to monitor model loss and accuracy
history = model.fit(partial_x_train,
partial_y_train,
epochs=40,
batch_size=512,
validation_data=(x_val, y_val),
verbose=1)
Epoch 1/40
30/30 [==============================] - 1s 21ms/step - loss: 0.6916 - accuracy: 0.5446 - val_loss: 0.6894 - val_accuracy: 0.6478
Epoch 2/40
30/30 [==============================] - 0s 16ms/step - loss: 0.6855 - accuracy: 0.7160 - val_loss: 0.6815 - val_accuracy: 0.6852
Epoch 3/40
30/30 [==============================] - 1s 17ms/step - loss: 0.6726 - accuracy: 0.7333 - val_loss: 0.6651 - val_accuracy: 0.7548
Epoch 4/40
30/30 [==============================] - 1s 17ms/step - loss: 0.6499 - accuracy: 0.7683 - val_loss: 0.6393 - val_accuracy: 0.7636
Epoch 5/40
30/30 [==============================] - 1s 17ms/step - loss: 0.6166 - accuracy: 0.7867 - val_loss: 0.6045 - val_accuracy: 0.7791
Epoch 6/40
30/30 [==============================] - 1s 17ms/step - loss: 0.5747 - accuracy: 0.8083 - val_loss: 0.5650 - val_accuracy: 0.7975
Epoch 7/40
30/30 [==============================] - 0s 16ms/step - loss: 0.5286 - accuracy: 0.8265 - val_loss: 0.5212 - val_accuracy: 0.8165
Epoch 8/40
30/30 [==============================] - 0s 16ms/step - loss: 0.4819 - accuracy: 0.8442 - val_loss: 0.4807 - val_accuracy: 0.8285
Epoch 9/40
30/30 [==============================] - 1s 17ms/step - loss: 0.4382 - accuracy: 0.8589 - val_loss: 0.4438 - val_accuracy: 0.8415
Epoch 10/40
30/30 [==============================] - 0s 16ms/step - loss: 0.3996 - accuracy: 0.8721 - val_loss: 0.4133 - val_accuracy: 0.8496
Epoch 11/40
30/30 [==============================] - 1s 17ms/step - loss: 0.3673 - accuracy: 0.8774 - val_loss: 0.3887 - val_accuracy: 0.8570
Epoch 12/40
30/30 [==============================] - 0s 16ms/step - loss: 0.3398 - accuracy: 0.8895 - val_loss: 0.3682 - val_accuracy: 0.8625
Epoch 13/40
30/30 [==============================] - 0s 16ms/step - loss: 0.3159 - accuracy: 0.8935 - val_loss: 0.3524 - val_accuracy: 0.8652
Epoch 14/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2961 - accuracy: 0.8995 - val_loss: 0.3388 - val_accuracy: 0.8710
Epoch 15/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2785 - accuracy: 0.9049 - val_loss: 0.3282 - val_accuracy: 0.8741
Epoch 16/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2631 - accuracy: 0.9095 - val_loss: 0.3192 - val_accuracy: 0.8768
Epoch 17/40
30/30 [==============================] - 1s 17ms/step - loss: 0.2500 - accuracy: 0.9136 - val_loss: 0.3129 - val_accuracy: 0.8769
Epoch 18/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2371 - accuracy: 0.9185 - val_loss: 0.3064 - val_accuracy: 0.8809
Epoch 19/40
30/30 [==============================] - 0s 16ms/step - loss: 0.2255 - accuracy: 0.9233 - val_loss: 0.3010 - val_accuracy: 0.8815
Epoch 20/40
30/30 [==============================] - 1s 17ms/step - loss: 0.2155 - accuracy: 0.9265 - val_loss: 0.2976 - val_accuracy: 0.8829
Epoch 21/40
30/30 [==============================] - 1s 17ms/step - loss: 0.2060 - accuracy: 0.9290 - val_loss: 0.2942 - val_accuracy: 0.8823
Epoch 22/40
30/30 [==============================] - 0s 17ms/step - loss: 0.1962 - accuracy: 0.9331 - val_loss: 0.2914 - val_accuracy: 0.8844
Epoch 23/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1877 - accuracy: 0.9374 - val_loss: 0.2894 - val_accuracy: 0.8838
Epoch 24/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1794 - accuracy: 0.9421 - val_loss: 0.2877 - val_accuracy: 0.8850
Epoch 25/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1724 - accuracy: 0.9442 - val_loss: 0.2867 - val_accuracy: 0.8854
Epoch 26/40
30/30 [==============================] - 0s 16ms/step - loss: 0.1653 - accuracy: 0.9479 - val_loss: 0.2862 - val_accuracy: 0.8854
Epoch 27/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1583 - accuracy: 0.9515 - val_loss: 0.2866 - val_accuracy: 0.8842
Epoch 28/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1520 - accuracy: 0.9536 - val_loss: 0.2867 - val_accuracy: 0.8859
Epoch 29/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1459 - accuracy: 0.9563 - val_loss: 0.2868 - val_accuracy: 0.8861
Epoch 30/40
30/30 [==============================] - 0s 16ms/step - loss: 0.1402 - accuracy: 0.9582 - val_loss: 0.2882 - val_accuracy: 0.8860
Epoch 31/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1347 - accuracy: 0.9597 - val_loss: 0.2887 - val_accuracy: 0.8863
Epoch 32/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1295 - accuracy: 0.9625 - val_loss: 0.2899 - val_accuracy: 0.8862
Epoch 33/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1249 - accuracy: 0.9634 - val_loss: 0.2919 - val_accuracy: 0.8856
Epoch 34/40
30/30 [==============================] - 1s 18ms/step - loss: 0.1198 - accuracy: 0.9657 - val_loss: 0.2939 - val_accuracy: 0.8858
Epoch 35/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1155 - accuracy: 0.9677 - val_loss: 0.2957 - val_accuracy: 0.8850
Epoch 36/40
30/30 [==============================] - 1s 18ms/step - loss: 0.1109 - accuracy: 0.9691 - val_loss: 0.2988 - val_accuracy: 0.8850
Epoch 37/40
30/30 [==============================] - 1s 17ms/step - loss: 0.1068 - accuracy: 0.9709 - val_loss: 0.3005 - val_accuracy: 0.8837
Epoch 38/40
30/30 [==============================] - 1s 19ms/step - loss: 0.1030 - accuracy: 0.9718 - val_loss: 0.3045 - val_accuracy: 0.8829
Epoch 39/40
30/30 [==============================] - 0s 17ms/step - loss: 0.0997 - accuracy: 0.9733 - val_loss: 0.3077 - val_accuracy: 0.8816
Epoch 40/40
30/30 [==============================] - 1s 17ms/step - loss: 0.0952 - accuracy: 0.9751 - val_loss: 0.3088 - val_accuracy: 0.8828
Two values are returned --Loss (a numerical value indicating an error, the smaller the better) --Correct answer rate
results = model.evaluate(test_data, test_labels, verbose=2)
print(results)
782/782 - 1s - loss: 0.3297 - accuracy: 0.8723
[0.32968297600746155, 0.8723199963569641]
Visualize the progress of learning
Visualizing how learning is progressing makes it easier to find out if overfitting is occurring.
model.fit ()
returns a History object containing a dictionary that records everything that happened during training
history_dict = history.history
history_dict.keys()
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
import matplotlib.pyplot as plt
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = range(1, len(acc) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() #Clear the figure
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
In the graph above, the points are the loss and correct answer rate during training, and the solid line is the loss and correct answer rate during verification.
It has been flat since around 20 epochs. This is an example of overfitting Model performance is high for training data, but not so high for data you've never seen Beyond this point, the model is over-optimized and is learning internal representations that are characteristic of training data but cannot be generalized to test data.
In this case, overfitting can be prevented by stopping training after 20 epochs. To prevent overfitting, use callback or regularize (do it in another tutorial)
Recommended Posts