In the previous article, I described how to use CTC (Connectionist Temporal Classification) loss in TensorFlow 2.x to train a model (RNN) whose input and output are variable-length sequences: [TensorFlow 2] Learn RNN with CTC Loss - Qiita
However, one thing was left unresolved: **how to handle CTC loss cleanly with Keras**. I attempted it in the previous article, but the result was nothing to brag about, full of dubious hacks and slow to run. This time I found a solution, so I am writing it down.
I think the approach described here applies not only to CTC loss but to any case where you want to define and train with a special loss function.
I think the reason for my previous defeat was, after all, that I got stuck trying to define the CTC loss through Keras's `Model.compile()`.
However, it turns out there is a way to add a loss function and evaluation metrics (accuracy and the like) outside of `Model.compile()`.
Train and evaluate with Keras | TensorFlow Core
The overwhelming majority of losses and metrics can be computed from y_true and y_pred, where y_pred is an output of your model. But not all of them. For instance, a regularization loss may only require the activation of a layer (there are no targets in this case), and this activation may not be a model output.
In such cases, you can call self.add_loss(loss_value) from inside the call method of a custom layer. Here's a simple example that adds activity regularization (note that activity regularization is built-in in all Keras layers -- this layer is just for the sake of providing a concrete example): (Omitted) You can do the same for logging metric values: (Omitted)
**If you define your own layer and use `add_loss()`, you can define a loss function that is not bound to the `(y_true, y_pred)` signature!**
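To make this concrete, here is a minimal sketch (my own example, not taken from the tutorial) of a pass-through layer that attaches a loss and a metric from inside `call()`; the class name and the regularization rate are arbitrary choices.

```python
import tensorflow as tf

# Minimal sketch: a pass-through layer that defines a loss and a metric
# from a Tensor flowing through the model, not from (y_true, y_pred).
class ActivityRegularizationLayer(tf.keras.layers.Layer):
    def __init__(self, rate=1e-2, **kwargs):
        super().__init__(**kwargs)
        self.rate = rate

    def call(self, inputs):
        # Loss built directly from the layer's input Tensor
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        # A metric can be logged the same way
        self.add_metric(tf.reduce_mean(inputs),
                        name="activation_mean", aggregation="mean")
        return inputs  # pass the data through unchanged
```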
No, I have to read the tutorial properly ... orz
The API documentation for `add_loss()`, which defines a loss, and `add_metric()`, which defines a metric, can be found on the following page.
tf.keras.layers.Layer | TensorFlow Core v2.1.0
As you can see from the sample code in the tutorial, **the loss and the metric are `Tensor`s, built up from `Tensor` operations.** The `x1` appearing in the sample code is the `Tensor` holding a layer's output, and it can be used in the expression for the loss.
inputs = keras.Input(shape=(784,), name='digits')
x1 = layers.Dense(64, activation='relu', name='dense_1')(inputs)
x2 = layers.Dense(64, activation='relu', name='dense_2')(x1)
outputs = layers.Dense(10, name='predictions')(x2)
model = keras.Model(inputs=inputs, outputs=outputs)
model.add_loss(tf.reduce_sum(x1) * 0.1)
model.add_metric(keras.backend.std(x1),
name='std_of_activation',
aggregation='mean')
In our case, the feature sequence, the label sequence, and the length information for both are needed to compute the CTC loss, so they must all be available as `Tensor`s. In other words, they too have to be fed to the model as inputs (as `x`, not `y`): **we build a model with multiple inputs**.
The original source code is GitHub - igormq/ctc_tensorflow_example: CTC + Tensorflow Example for ASR.
ctc_tensorflow_example_tf2_keras.py
# Compatibility imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import tensorflow as tf
import scipy.io.wavfile as wav
import numpy as np
from six.moves import xrange as range
try:
from python_speech_features import mfcc
except ImportError:
print("Failed to import python_speech_features.\n Try pip install python_speech_features.")
raise ImportError
from utils import maybe_download as maybe_download
from utils import sparse_tuple_from as sparse_tuple_from
# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1 # 0 is reserved to space
FEAT_MASK_VALUE = 1e+10
# Some configs
num_features = 13
num_units = 50 # Number of units in the LSTM cell
# Accounting for the 0th index + space + blank label = 28 characters
num_classes = ord('z') - ord('a') + 1 + 1 + 1
# Hyper-parameters
num_epochs = 400
num_layers = 1
batch_size = 2
initial_learning_rate = 0.005
momentum = 0.9
# Loading the data
audio_filename = maybe_download('LDC93S1.wav', 93638)
target_filename = maybe_download('LDC93S1.txt', 62)
fs, audio = wav.read(audio_filename)
# create a dataset composed of data with variable lengths
inputs = mfcc(audio, samplerate=fs)
inputs = (inputs - np.mean(inputs))/np.std(inputs)
inputs_short = mfcc(audio[fs*8//10:fs*20//10], samplerate=fs)
inputs_short = (inputs_short - np.mean(inputs_short))/np.std(inputs_short)
# Transform in 3D array
train_inputs = tf.ragged.constant([inputs, inputs_short], dtype=np.float32)
train_seq_len = tf.cast(train_inputs.row_lengths(), tf.int32)
train_inputs = train_inputs.to_tensor(default_value=FEAT_MASK_VALUE)
# Reading targets
with open(target_filename, 'r') as f:
#Only the last line is necessary
line = f.readlines()[-1]
# Get only the words between [a-z] and replace period for none
original = ' '.join(line.strip().lower().split(' ')[2:]).replace('.', '')
    targets = original.replace(' ', '  ')  # double space so that split(' ') yields '' between words
targets = targets.split(' ')
# Adding blank label
targets = np.hstack([SPACE_TOKEN if x == '' else list(x) for x in targets])
# Transform char into index
targets = np.asarray([SPACE_INDEX if x == SPACE_TOKEN else ord(x) - FIRST_INDEX
for x in targets])
# Creating sparse representation to feed the placeholder
train_targets = tf.ragged.constant([targets, targets[13:32]], dtype=np.int32)
train_targets_len = tf.cast(train_targets.row_lengths(), tf.int32)
train_targets = train_targets.to_sparse()
# We don't have a validation dataset :(
val_inputs, val_targets, val_seq_len, val_targets_len = train_inputs, train_targets, \
train_seq_len, train_targets_len
# THE MAIN CODE!
# add loss and metrics with a custom layer
class CTCLossLayer(tf.keras.layers.Layer):
def call(self, inputs):
labels = inputs[0]
logits = inputs[1]
label_len = inputs[2]
logit_len = inputs[3]
logits_trans = tf.transpose(logits, (1, 0, 2))
label_len = tf.reshape(label_len, (-1,))
logit_len = tf.reshape(logit_len, (-1,))
loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits_trans, label_len, logit_len, blank_index=-1))
# define loss here instead of compile()
self.add_loss(loss)
# decode
decoded, _ = tf.nn.ctc_greedy_decoder(logits_trans, logit_len)
# Inaccuracy: label error rate
ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
labels))
self.add_metric(ler, name="ler", aggregation="mean")
return logits # Pass-through layer.
# Defining the cell
# Can be:
#   tf.keras.layers.SimpleRNNCell
#   tf.keras.layers.GRUCell
cells = []
for _ in range(num_layers):
cell = tf.keras.layers.LSTMCell(num_units) # Or LSTMCell(num_units)
cells.append(cell)
stack = tf.keras.layers.StackedRNNCells(cells)
input_feature = tf.keras.layers.Input((None, num_features), name="input_feature")
input_label = tf.keras.layers.Input((None,), dtype=tf.int32, sparse=True, name="input_label")
input_feature_len = tf.keras.layers.Input((1,), dtype=tf.int32, name="input_feature_len")
input_label_len = tf.keras.layers.Input((1,), dtype=tf.int32, name="input_label_len")
layer_masking = tf.keras.layers.Masking(FEAT_MASK_VALUE)(input_feature)
layer_rnn = tf.keras.layers.RNN(stack, return_sequences=True)(layer_masking)
layer_output = tf.keras.layers.Dense(
num_classes,
kernel_initializer=tf.keras.initializers.TruncatedNormal(0.0, 0.1),
bias_initializer="zeros",
name="logit")(layer_rnn)
layer_loss = CTCLossLayer()([input_label, layer_output, input_label_len, input_feature_len])
# create models for training and prediction (sharing weights)
model_train = tf.keras.models.Model(
inputs=[input_feature, input_label, input_feature_len, input_label_len],
outputs=layer_loss)
model_predict = tf.keras.models.Model(inputs=input_feature, outputs=layer_output)
optimizer = tf.keras.optimizers.SGD(initial_learning_rate, momentum)
# adding no loss: we have already defined it with a custom layer
model_train.compile(optimizer=optimizer)
# training: y is dummy!
model_train.fit(x=[train_inputs, train_targets, train_seq_len, train_targets_len], y=None,
validation_data=([val_inputs, val_targets, val_seq_len, val_targets_len], None),
epochs=num_epochs)
# Decoding
print('Original:')
print(original)
print(original[13:32])
print('Decoded:')
decoded, _ = tf.nn.ctc_greedy_decoder(tf.transpose(model_predict.predict(train_inputs), (1, 0, 2)), train_seq_len)
d = tf.sparse.to_dense(decoded[0], default_value=-1).numpy()
str_decoded = [''.join([chr(x + FIRST_INDEX) for x in np.asarray(row) if x != -1]) for row in d]
for s in str_decoded:
# Replacing blank label to none
s = s.replace(chr(ord('z') + 1), '')
# Replacing space label to space
s = s.replace(chr(ord('a') - 1), ' ')
print(s)
The execution result is as follows.
Train on 2 samples, validate on 2 samples
Epoch 1/400
2/2 [==============================] - 2s 991ms/sample - loss: 546.3565 - ler: 1.0668 - val_loss: 464.2611 - val_ler: 0.8801
Epoch 2/400
2/2 [==============================] - 0s 136ms/sample - loss: 464.2611 - ler: 0.8801 - val_loss: 179.9780 - val_ler: 1.0000
(Omitted)
Epoch 400/400
2/2 [==============================] - 0s 135ms/sample - loss: 1.6670 - ler: 0.0000e+00 - val_loss: 1.6565 - val_ler: 0.0000e+00
Original:
she had your dark suit in greasy wash water all year
dark suit in greasy
Decoded:
she had your dark suit in greasy wash water all year
dark suit in greasy
Both the processing time and the error rate look fine, so it finally works properly...! (The displayed time is per sample, so the actual time is twice the displayed value, but at under 300 ms for 2 samples it is comparable to the previous article.)
As mentioned at the beginning, `Layer.add_loss()` lets you define the loss function freely from the `Tensor`s associated with the model. The code above defines a layer called `CTCLossLayer`, whose `call()` performs almost the same processing as the TensorFlow 2.x version in the previous article (see its sample code 2). At the end, the input `logits` is returned as-is.
Here, `call()` receives four tensors (besides `self`). With these four pieces of information, the CTC loss and the decoding can both be computed. When building the model, the layer accordingly has to be connected to four inputs, as shown below.
layer_loss = CTCLossLayer()([input_label, layer_output, input_label_len, input_feature_len])
Everything passed here must be a `Tensor`. `layer_output` is the same as in an ordinary Keras model, but for `input_label`, `input_label_len`, and `input_feature_len`, dedicated `Input` layers are added.
input_feature = tf.keras.layers.Input((None, num_features), name="input_feature")
input_label = tf.keras.layers.Input((None,), dtype=tf.int32, sparse=True, name="input_label")
input_feature_len = tf.keras.layers.Input((1,), dtype=tf.int32, name="input_feature_len")
input_label_len = tf.keras.layers.Input((1,), dtype=tf.int32, name="input_label_len")
As you can see, we create `Input` layers with the appropriate `shape` and `dtype`. The `dtype` of the inputs other than the features should be `int32`. Given the argument specification of `tf.nn.ctc_loss`, I really wanted to make the shape of `input_feature_len` and `input_label_len` equal to `()`, but that caused errors later and I could not get it to work. So the shape is declared as `(1,)` and a `reshape` is performed inside `CTCLossLayer`.
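To make the effect of that `reshape` concrete, here is a tiny sketch with made-up length values showing what `CTCLossLayer` does to the length inputs.

```python
import tensorflow as tf

# Sketch with made-up values: an Input declared with shape (1,) arrives as
# (batch_size, 1), while tf.nn.ctc_loss expects length vectors of shape (batch_size,).
label_len = tf.constant([[53], [19]], dtype=tf.int32)  # shape (2, 1), as fit() delivers it
label_len = tf.reshape(label_len, (-1,))               # shape (2,), as tf.nn.ctc_loss wants it
print(label_len)  # tf.Tensor([53 19], shape=(2,), dtype=int32)
```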
Another thing I added is `sparse=True` when creating `input_label`. With this specified, the `Tensor` corresponding to `input_label` becomes a `SparseTensor`.
tf.keras.Input | TensorFlow Core v2.1.0
This `sparse=True` is there because the ground-truth labels have to be passed as a `SparseTensor` when computing the error rate of the decoded result with `tf.edit_distance` (the `tf.nn.ctc_loss` used for the CTC loss can also accept a `SparseTensor`). The data passed to `Model.fit()` and friends is therefore also created as a `SparseTensor`.
tf.nn.ctc_loss | TensorFlow Core v2.1.0
tf.edit_distance | TensorFlow Core v2.1.0
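For reference, here is a minimal sketch (with toy labels, not the article's data) of how such a `SparseTensor` can be prepared for `fit()`, mirroring the data preparation in the full listing above.

```python
import tensorflow as tf

# Toy example: variable-length label sequences as a RaggedTensor,
# converted to the SparseTensor expected by the sparse Input layer.
targets_ragged = tf.ragged.constant([[1, 2, 3], [4, 5]], dtype=tf.int32)
targets_len = tf.cast(targets_ragged.row_lengths(), tf.int32)  # [3, 2]
targets_sparse = targets_ragged.to_sparse()  # passed to fit() for input_label
```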
Similarly, if you want the input to be a `RaggedTensor`, there appears to be a `ragged=True` option.
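I have not tried it in this article, but a sketch of what that might look like (assuming the same 13-dimensional features; the input name is my own) is:

```python
import tensorflow as tf

# Sketch only: an Input declared with ragged=True receives a tf.RaggedTensor,
# which could make the FEAT_MASK_VALUE padding and the Masking layer unnecessary.
num_features = 13
input_feature_ragged = tf.keras.Input(
    shape=(None, num_features), ragged=True, name="input_feature_ragged")
```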
The models for training and for prediction (inference) are created separately, as shown below.
model_train = tf.keras.models.Model(
inputs=[input_feature, input_label, input_feature_len, input_label_len],
outputs=layer_loss)
model_predict = tf.keras.models.Model(inputs=input_feature, outputs=layer_output)
Training required four inputs, but prediction (decoding) only needs the logits (that is, the output of `Dense`), so the features alone suffice as input. The prediction model therefore works with a single input. Of course it cannot compute the loss, but that is not needed when you only want to decode, so its output is `layer_output`, taken before passing through `CTCLossLayer`.
Drawn as a diagram, it looks like this.
Since the layers that carry weights are shared between the two models, you can run inference with the prediction model as-is once the training model has been trained.
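One way to convince yourself of this sharing, assuming the code above has been run, is to check that both models refer to the very same layer object:

```python
# Both Models were built from the same layer instances, so the trainable
# weights are shared; training model_train also updates what model_predict uses.
assert model_train.get_layer("logit") is model_predict.get_layer("logit")
```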
Since the CTC loss is defined in its own layer, there is no need to define a loss function in `compile()`. In that case, simply leave out the `loss` argument.
model_train.compile(optimizer=optimizer)
The ground-truth labels and the length information used to compute the loss are data destined for input layers, so they have to be specified on the `x` side of `Model.fit()`. There is nothing to specify for `y`, so pass `None`.
Likewise, for `validation_data`, pass `None` as the second element of the tuple.
model_train.fit(x=[train_inputs, train_targets, train_seq_len, train_targets_len], y=None,
validation_data=([val_inputs, val_targets, val_seq_len, val_targets_len], None),
epochs=num_epochs)
To be honest, I am not sure this is the officially intended usage, but as long as `loss` is not specified in `compile()`, `y=None` works fine. (When `loss` is specified, a label to feed the loss function's `y_true` argument is required, so naturally an error occurs unless some data is given for `y`.)
As mentioned above, use `model_predict` for inference. It is fine to pass only the feature sequence to `predict()`.
decoded, _ = tf.nn.ctc_greedy_decoder(tf.transpose(model_predict.predict(train_inputs), (1, 0, 2)), train_seq_len)
- `Masking` and `input_feature_len` serve similar purposes, so it feels somewhat redundant...
By reading the tutorial properly, I was finally able to train with CTC loss in Keras. Keras turns out to be more flexible than I had given it credit for. My apologies, Keras.