I experimented with CTC (Connectionist Temporal Classification) loss in TensorFlow 2.x to train the parameters of an RNN (Recurrent Neural Network) that outputs a sequence. There were few samples available and I had a hard time getting things to run, so I am leaving these notes.
CTC Loss is summarized on the following pages.
- About the theory and implementation of Connectionist Temporal Classification – Is your order machine learning?
- Phoneme recognition using Connectionist Temporal Classification (CTC) – Qiita
- Voice recognition and deep learning – SlideShare
GitHub - igormq/ctc_tensorflow_example: CTC + Tensorflow Example for ASR
This is a sample implemented in TensorFlow 1.x without the Keras API. Like an end-to-end speech recognition sample, it uses an LSTM to learn the mapping from a feature sequence to a label (character) sequence.
The code is written for 1.x, but it is not difficult to get it running on TensorFlow 2.x.
#Install the required packages
pip3 install python_speech_features --user
#Get the code
git clone https://github.com/igormq/ctc_tensorflow_example.git
If you change about three lines of ctc_tensorflow_example.py as shown below, it will run on TensorFlow 2.x.
patch
diff --git a/ctc_tensorflow_example.py b/ctc_tensorflow_example.py
index 579d431..2d96d54 100644
--- a/ctc_tensorflow_example.py
+++ b/ctc_tensorflow_example.py
@@ -5,7 +5,7 @@ from __future__ import print_function
import time
-import tensorflow as tf
+import tensorflow.compat.v1 as tf
import scipy.io.wavfile as wav
import numpy as np
@@ -20,6 +20,8 @@ except ImportError:
from utils import maybe_download as maybe_download
from utils import sparse_tuple_from as sparse_tuple_from
# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
@@ -103,9 +105,9 @@ with graph.as_default():
# tf.nn.rnn_cell.GRUCell
cells = []
for _ in range(num_layers):
- cell = tf.contrib.rnn.LSTMCell(num_units) # Or LSTMCell(num_units)
+ cell = tf.nn.rnn_cell.LSTMCell(num_units) # Or LSTMCell(num_units)
cells.append(cell)
- stack = tf.contrib.rnn.MultiRNNCell(cells)
+ stack = tf.nn.rnn_cell.MultiRNNCell(cells)
# The second output is the last state and we will no use that
outputs, _ = tf.nn.dynamic_rnn(stack, inputs, seq_len, dtype=tf.float32)
Terminal
python3 ctc_tensorflow_example.py
Epoch 1/200, train_cost = 726.374, train_ler = 1.000, val_cost = 167.637, val_ler = 1.000, time = 0.549
(Omitted)
Epoch 200/200, train_cost = 0.648, train_ler = 0.000, val_cost = 0.642, val_ler = 0.000, time = 0.218
Original:
she had your dark suit in greasy wash water all year
Decoded:
she had your dark suit in greasy wash water all year
Since I am using TensorFlow 2 anyway, rewriting the code in TensorFlow 2 style should (probably) improve processing efficiency and make the code easier to maintain later. So I decided to rewrite the sample code, but I could not find any examples of how to write it ...
I finally got it working by piecing together code from various reference sites; the result is below.
ctc_tensorflow_example_tf2.py
# Compatibility imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import tensorflow as tf
import scipy.io.wavfile as wav
import numpy as np
from six.moves import xrange as range
try:
from python_speech_features import mfcc
except ImportError:
print("Failed to import python_speech_features.\n Try pip install python_speech_features.")
raise ImportError
from utils import maybe_download as maybe_download
from utils import sparse_tuple_from as sparse_tuple_from
# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1 # 0 is reserved to space
# Some configs
num_features = 13
num_units=50 # Number of units in the LSTM cell
# Accounting the 0th indice + space + blank label = 28 characters
num_classes = ord('z') - ord('a') + 1 + 1 + 1
# Hyper-parameters
num_epochs = 200
num_hidden = 50
num_layers = 1
batch_size = 1
initial_learning_rate = 1e-2
momentum = 0.9
num_examples = 1
num_batches_per_epoch = int(num_examples/batch_size)
# Loading the data
audio_filename = maybe_download('LDC93S1.wav', 93638)
target_filename = maybe_download('LDC93S1.txt', 62)
fs, audio = wav.read(audio_filename)
inputs = mfcc(audio, samplerate=fs)
# Transform in 3D array
train_inputs = np.asarray(inputs[np.newaxis, :], dtype=np.float32)
train_inputs = (train_inputs - np.mean(train_inputs))/np.std(train_inputs)
train_seq_len = [train_inputs.shape[1]]
# Reading targets
with open(target_filename, 'r') as f:
#Only the last line is necessary
line = f.readlines()[-1]
# Get only the words between [a-z] and replace period for none
original = ' '.join(line.strip().lower().split(' ')[2:]).replace('.', '')
targets = original.replace(' ', '  ')  # double the spaces so split(' ') leaves empty strings marking word boundaries
targets = targets.split(' ')
# Adding blank label
targets = np.hstack([SPACE_TOKEN if x == '' else list(x) for x in targets])
# Transform char into index
targets = np.asarray([SPACE_INDEX if x == SPACE_TOKEN else ord(x) - FIRST_INDEX
for x in targets])
train_targets = tf.sparse.SparseTensor(*sparse_tuple_from([targets], dtype=np.int32))
train_targets_len = [train_targets.shape[1]]
# We don't have a validation dataset :(
val_inputs, val_targets, val_seq_len, val_targets_len = train_inputs, train_targets, \
train_seq_len, train_targets_len
# THE MAIN CODE!
# Defining the cell
# Can be:
# tf.nn.rnn_cell.RNNCell
# tf.nn.rnn_cell.GRUCell
cells = []
for _ in range(num_layers):
cell = tf.keras.layers.LSTMCell(num_units) # Or LSTMCell(num_units)
cells.append(cell)
stack = tf.keras.layers.StackedRNNCells(cells)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.RNN(stack, input_shape=(None, num_features), return_sequences=True))
# Truncated normal with mean 0 and stdev=0.1
# Zero initialization
model.add(tf.keras.layers.Dense(num_classes,
kernel_initializer=tf.keras.initializers.TruncatedNormal(0.0, 0.1),
bias_initializer="zeros"))
optimizer = tf.keras.optimizers.SGD(initial_learning_rate, momentum)
@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
if flag_training:
with tf.GradientTape() as tape:
logits = model(inputs, training=True)
# Time major
logits = tf.transpose(logits, (1, 0, 2))
cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))
gradients = tape.gradient(cost, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
else:
logits = model(inputs)
# Time major
logits = tf.transpose(logits, (1, 0, 2))
cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))
# Option 2: tf.nn.ctc_beam_search_decoder
# (it's slower but you'll get better results)
decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)
# Inaccuracy: label error rate
ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
targets))
return cost, ler, decoded
for curr_epoch in range(num_epochs):
train_cost = train_ler = 0
start = time.time()
for batch in range(num_batches_per_epoch):
batch_cost, batch_ler, _ = step(train_inputs, train_targets, train_seq_len, train_targets_len, True)
train_cost += batch_cost*batch_size
train_ler += batch_ler*batch_size
train_cost /= num_examples
train_ler /= num_examples
val_cost, val_ler, decoded = step(val_inputs, val_targets, val_seq_len, val_targets_len, False)
log = "Epoch {}/{}, train_cost = {:.3f}, train_ler = {:.3f}, val_cost = {:.3f}, val_ler = {:.3f}, time = {:.3f}"
print(log.format(curr_epoch+1, num_epochs, train_cost, train_ler,
val_cost, val_ler, time.time() - start))
# Decoding
d = tf.sparse.to_dense(decoded[0])[0].numpy()
str_decoded = ''.join([chr(x) for x in np.asarray(d) + FIRST_INDEX])
# Replacing blank label to none
str_decoded = str_decoded.replace(chr(ord('z') + 1), '')
# Replacing space label to space
str_decoded = str_decoded.replace(chr(ord('a') - 1), ' ')
print('Original:\n%s' % original)
print('Decoded:\n%s' % str_decoded)
Only the first epoch takes a while; after that, each epoch seems to run about 30% faster than before.
python3 ctc_tensorflow_example_tf2.py
Epoch 1/200, train_cost = 774.063, train_ler = 1.000, val_cost = 505.479, val_ler = 0.981, time = 1.547
Epoch 2/200, train_cost = 505.479, train_ler = 0.981, val_cost = 496.959, val_ler = 1.000, time = 0.158
(Omitted)
Epoch 200/200, train_cost = 0.541, train_ler = 0.000, val_cost = 0.537, val_ler = 0.000, time = 0.143
Original:
she had your dark suit in greasy wash water all year
Decoded:
she had your dark suit in greasy wash water all year
The original code was built around TensorFlow 1.x's `tf.Session`, so it does not work on TensorFlow 2.x (without the `tf.compat.v1` API). `tf.placeholder` is also gone; instead, you simply write code that operates directly on the input `Tensor`.
Basically, the combination of `tf.Session` and `tf.placeholder` is rewritten as described in Effective TensorFlow 2:
# TensorFlow 1.X
outputs = session.run(f(placeholder), feed_dict={placeholder: input})
# TensorFlow 2.0
outputs = f(input)
At this point, add the `@tf.function` decorator so that `f` runs in graph mode [^1].
[^1]: It also works without `@tf.function`, but it is slow because it runs with eager execution. Eager execution is handy for debugging, so a reasonable workflow is to leave `@tf.function` off (commented out) until the code works, then add it back.
So, in principle, the original code
# TensorFlow 1.X
feed = {inputs: train_inputs,
targets: train_targets,
seq_len: train_seq_len}
batch_cost, _ = session.run([cost, optimizer], feed)
train_cost += batch_cost*batch_size
train_ler += session.run(ler, feed_dict=feed)*batch_size
is rewritten as a function (with `@tf.function`) that takes `train_inputs`, `train_targets`, and `train_seq_len` as arguments and returns `cost` and `optimizer`.
However, `optimizer` only needs to be executed, so it does not have to be returned. Also, the same `feed` is passed to the `session.run` immediately afterwards to compute `ler`, and `decoded` is needed for the decoding step after training finishes, so I return these together. (`decoded` is only used once at the very end, but it is computed internally for `ler` anyway, so returning it is probably not a waste.)
# TensorFlow 2.0
@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
(Omitted)
return cost, ler, decoded
batch_cost, batch_ler, _ = step(train_inputs, train_targets, train_seq_len, train_targets_len, True)
train_cost += batch_cost*batch_size
train_ler += batch_ler*batch_size
To reuse most of the processing for validation as well, I named the function `step` and added an argument `flag_training` that switches between training and evaluation.
There is also a new argument `targets_len`, but that is because the arguments of `tf.nn.ctc_loss` changed in TensorFlow 2.x; it should not be directly related to the move to eager execution.
`tf.sparse_placeholder`, which was used to feed the variable-length target labels, is gone as well. Previously, a tuple of `(indices, values, shape)` was prepared to feed data into `tf.sparse_placeholder`; now a `tf.SparseTensor` can be passed in directly from the outside, so I construct the `tf.SparseTensor` myself. The `dtype` matches the type of the original `tf.sparse_placeholder`, but note that it is `np.int32`, not `tf.int32` (a subtle gotcha).
# TensorFlow 1.X
train_targets = sparse_tuple_from([targets])
# TensorFlow 2.0
train_targets = tf.sparse.SparseTensor(*sparse_tuple_from([targets], dtype=np.int32))
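As a minimal illustration of this conversion (toy label values; the exact output of `sparse_tuple_from()` is assumed from the description above), an `(indices, values, shape)` tuple can be turned into a `tf.SparseTensor` like this:

```python
import numpy as np
import tensorflow as tf

# Hypothetical (indices, values, shape) triple for one label sequence of length 3,
# in the format that sparse_tuple_from() is described as producing.
indices = np.array([[0, 0], [0, 1], [0, 2]], dtype=np.int64)
values = np.array([19, 8, 5], dtype=np.int32)   # 's', 'h', 'e' as label indices (a=1, ..., z=26)
shape = np.array([1, 3], dtype=np.int64)

st = tf.sparse.SparseTensor(indices, values, shape)
print(tf.sparse.to_dense(st).numpy())           # [[19  8  5]]
```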
In TensorFlow 2.x, the `Optimizer` has been switched to the Keras one. Along with that, the code that used to call `Optimizer.minimize()` now uses `GradientTape` instead. This processing lives inside the `step()` function defined earlier.
##### TensorFlow 1.X #####
# Time major
logits = tf.transpose(logits, (1, 0, 2))
loss = tf.nn.ctc_loss(targets, logits, seq_len)
cost = tf.reduce_mean(loss)
optimizer = tf.train.MomentumOptimizer(initial_learning_rate,
0.9).minimize(cost)
##### TensorFlow 2.0 #####
optimizer = tf.keras.optimizers.SGD(initial_learning_rate, 0.9)
@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
if flag_training:
with tf.GradientTape() as tape:
logits = model(inputs, training=True)
# Time major
logits = tf.transpose(logits, (1, 0, 2))
cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))
gradients = tape.gradient(cost, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
else:
(Omitted below)
Here, the gradient computation needs the list of trainable weights. Collecting manually defined `tf.Variable`s is tedious, so I turned the hand-built computation graph of the model into a `tf.keras.Model`. That makes it easy to get the list of trainable weights as `model.trainable_variables`, and it also simplifies building the computation graph itself.
##### TensorFlow 1.X #####
# The second output is the last state and we will no use that
outputs, _ = tf.nn.dynamic_rnn(stack, inputs, seq_len, dtype=tf.float32)
shape = tf.shape(inputs)
batch_s, max_timesteps = shape[0], shape[1]
# Reshaping to apply the same weights over the timesteps
outputs = tf.reshape(outputs, [-1, num_hidden])
# Truncated normal with mean 0 and stdev=0.1
# Tip: Try another initialization
# see https://www.tensorflow.org/versions/r0.9/api_docs/python/contrib.layers.html#initializers
W = tf.Variable(tf.truncated_normal([num_hidden,
num_classes],
stddev=0.1))
# Zero initialization
# Tip: Is tf.zeros_initializer the same?
b = tf.Variable(tf.constant(0., shape=[num_classes]))
# Doing the affine projection
logits = tf.matmul(outputs, W) + b
# Reshaping back to the original shape
logits = tf.reshape(logits, [batch_s, -1, num_classes])
##### TensorFlow 2.0 #####
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.RNN(stack, input_shape=(None, num_features), return_sequences=True))
# Truncated normal with mean 0 and stdev=0.1
# Zero initialization
model.add(tf.keras.layers.Dense(num_classes,
kernel_initializer=tf.keras.initializers.TruncatedNormal(0.0, 0.1),
bias_initializer="zeros"))
When a tensor of rank 3 or higher (including the sample dimension) is fed into `tf.keras.layers.Dense`, the layer effectively
- flattens every dimension except the last one,
- multiplies by the weight matrix from the right, and
- reshapes the result back to the original shape after the computation.
In the original code, this shape manipulation around the weight multiplication was written by hand, but it can simply be left to Keras, which is very convenient.
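As a quick sanity check of this behavior (toy shapes, not from the sample code), applying `Dense` to a 3-D input gives the same result as the manual flatten-multiply-reshape:

```python
import tensorflow as tf

x = tf.random.normal((2, 5, 13))    # (batch, time, features)
dense = tf.keras.layers.Dense(28)
y = dense(x)                        # shape (2, 5, 28); the same weights are applied at every timestep

# Manual equivalent using the layer's own kernel and bias.
y_manual = tf.reshape(
    tf.matmul(tf.reshape(x, (-1, 13)), dense.kernel) + dense.bias,
    (2, 5, 28))
print(y.shape, float(tf.reduce_max(tf.abs(y - y_manual))))   # difference is ~0
```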
I also specified `dtype` to make the feature type `float32`. It works without specifying it, but a WARNING is emitted.
train_inputs = np.asarray(inputs[np.newaxis, :], dtype=np.float32)
In sample code (1) above there was only one training example, but in practice you naturally want to put multiple examples into a mini-batch for training. Since both the input data and the target label sequences have different lengths, this has to be handled with some care.
ctc_tensorflow_example_tf2_multi.py
# Compatibility imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import tensorflow as tf
import scipy.io.wavfile as wav
import numpy as np
from six.moves import xrange as range
try:
from python_speech_features import mfcc
except ImportError:
print("Failed to import python_speech_features.\n Try pip install python_speech_features.")
raise ImportError
from utils import maybe_download as maybe_download
from utils import sparse_tuple_from as sparse_tuple_from
# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1 # 0 is reserved to space
FEAT_MASK_VALUE = 1e+10
# Some configs
num_features = 13
num_units = 50 # Number of units in the LSTM cell
# Accounting the 0th indice + space + blank label = 28 characters
num_classes = ord('z') - ord('a') + 1 + 1 + 1
# Hyper-parameters
num_epochs = 400
num_hidden = 50
num_layers = 1
batch_size = 2
initial_learning_rate = 1e-2
momentum = 0.9
# Loading the data
audio_filename = maybe_download('LDC93S1.wav', 93638)
target_filename = maybe_download('LDC93S1.txt', 62)
fs, audio = wav.read(audio_filename)
# create a dataset composed of data with variable lengths
inputs = mfcc(audio, samplerate=fs)
inputs = (inputs - np.mean(inputs))/np.std(inputs)
inputs_short = mfcc(audio[fs*8//10:fs*20//10], samplerate=fs)
inputs_short = (inputs_short - np.mean(inputs_short))/np.std(inputs_short)
# Transform in 3D array
train_inputs = tf.ragged.constant([inputs, inputs_short], dtype=np.float32)
train_seq_len = tf.cast(train_inputs.row_lengths(), tf.int32)
train_inputs = train_inputs.to_sparse()
num_examples = train_inputs.shape[0]
# Reading targets
with open(target_filename, 'r') as f:
#Only the last line is necessary
line = f.readlines()[-1]
# Get only the words between [a-z] and replace period for none
original = ' '.join(line.strip().lower().split(' ')[2:]).replace('.', '')
targets = original.replace(' ', '  ')  # double the spaces so split(' ') leaves empty strings marking word boundaries
targets = targets.split(' ')
# Adding blank label
targets = np.hstack([SPACE_TOKEN if x == '' else list(x) for x in targets])
# Transform char into index
targets = np.asarray([SPACE_INDEX if x == SPACE_TOKEN else ord(x) - FIRST_INDEX
for x in targets])
# Creating sparse representation to feed the placeholder
train_targets = tf.ragged.constant([targets, targets[13:32]], dtype=np.int32)
train_targets_len = tf.cast(train_targets.row_lengths(), tf.int32)
train_targets = train_targets.to_sparse()
# We don't have a validation dataset :(
val_inputs, val_targets, val_seq_len, val_targets_len = train_inputs, train_targets, \
train_seq_len, train_targets_len
# THE MAIN CODE!
# Defining the cell
# Can be:
# tf.nn.rnn_cell.RNNCell
# tf.nn.rnn_cell.GRUCell
cells = []
for _ in range(num_layers):
cell = tf.keras.layers.LSTMCell(num_units) # Or LSTMCell(num_units)
cells.append(cell)
stack = tf.keras.layers.StackedRNNCells(cells)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Masking(FEAT_MASK_VALUE, input_shape=(None, num_features)))
model.add(tf.keras.layers.RNN(stack, return_sequences=True))
# Truncated normal with mean 0 and stdev=0.1
# Zero initialization
model.add(tf.keras.layers.Dense(num_classes,
kernel_initializer=tf.keras.initializers.TruncatedNormal(0.0, 0.1),
bias_initializer="zeros"))
optimizer = tf.keras.optimizers.SGD(initial_learning_rate, momentum)
@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
inputs = tf.sparse.to_dense(inputs, default_value=FEAT_MASK_VALUE)
if flag_training:
with tf.GradientTape() as tape:
logits = model(inputs, training=True)
# Time major
logits = tf.transpose(logits, (1, 0, 2))
cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))
gradients = tape.gradient(cost, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
else:
logits = model(inputs)
# Time major
logits = tf.transpose(logits, (1, 0, 2))
cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))
# Option 2: tf.nn.ctc_beam_search_decoder
# (it's slower but you'll get better results)
decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)
# Inaccuracy: label error rate
ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
targets))
return cost, ler, decoded
ds = tf.data.Dataset.from_tensor_slices((train_inputs, train_targets, train_seq_len, train_targets_len)).batch(batch_size)
for curr_epoch in range(num_epochs):
train_cost = train_ler = 0
start = time.time()
for batch_inputs, batch_targets, batch_seq_len, batch_targets_len in ds:
batch_cost, batch_ler, _ = step(batch_inputs, batch_targets, batch_seq_len, batch_targets_len, True)
train_cost += batch_cost*batch_size
train_ler += batch_ler*batch_size
train_cost /= num_examples
train_ler /= num_examples
val_cost, val_ler, decoded = step(val_inputs, val_targets, val_seq_len, val_targets_len, False)
log = "Epoch {}/{}, train_cost = {:.3f}, train_ler = {:.3f}, val_cost = {:.3f}, val_ler = {:.3f}, time = {:.3f}"
print(log.format(curr_epoch+1, num_epochs, train_cost, train_ler,
val_cost, val_ler, time.time() - start))
# Decoding
print('Original:')
print(original)
print(original[13:32])
print('Decoded:')
d = tf.sparse.to_dense(decoded[0], default_value=-1).numpy()
for i in range(2):
str_decoded = ''.join([chr(x) for x in np.asarray(d[i][d[i] != -1]) + FIRST_INDEX])
# Replacing blank label to none
str_decoded = str_decoded.replace(chr(ord('z') + 1), '')
# Replacing space label to space
str_decoded = str_decoded.replace(chr(ord('a') - 1), ' ')
print(str_decoded)
An example of the execution result is shown below.
Epoch 1/400, train_cost = 527.789, train_ler = 1.122, val_cost = 201.650, val_ler = 1.000, time = 1.702
Epoch 2/400, train_cost = 201.650, train_ler = 1.000, val_cost = 372.285, val_ler = 1.000, time = 0.238
(Omitted)
Epoch 400/400, train_cost = 1.331, train_ler = 0.000, val_cost = 1.320, val_ler = 0.000, time = 0.307
Original:
she had your dark suit in greasy wash water all year
dark suit in greasy
Decoded:
she had your dark suit in greasy wash water all year
dark suit in greasy
# create a dataset composed of data with variable lengths
inputs = mfcc(audio, samplerate=fs)
inputs = (inputs - np.mean(inputs))/np.std(inputs)
inputs_short = mfcc(audio[fs*8//10:fs*20//10], samplerate=fs)
inputs_short = (inputs_short - np.mean(inputs_short))/np.std(inputs_short)
# Transform in 3D array
train_inputs = tf.ragged.constant([inputs, inputs_short], dtype=np.float32)
train_seq_len = tf.cast(train_inputs.row_lengths(), tf.int32)
train_inputs = train_inputs.to_sparse()
num_examples = train_inputs.shape[0]
I took a slice of the data used in the original code to grow the dataset to two examples.
The data ultimately ends up as a `SparseTensor`, but it is easier to build by first creating a `RaggedTensor` with `tf.ragged.constant()` and then converting from there.
As mentioned in another article, I use the `Masking` layer to represent variable-length input.
Try a basic RNN (LSTM) in Keras-Qiita
model.add(tf.keras.layers.Masking(FEAT_MASK_VALUE, input_shape=(None, num_features)))
Since the mini-batch is shaped to the longest sequence at input time, shorter examples have the missing part filled with `FEAT_MASK_VALUE`.
@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
inputs = tf.sparse.to_dense(inputs, default_value=FEAT_MASK_VALUE)
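Here is a small sketch (toy feature values and shapes, not from the sample) of what this padding does: the ragged batch is padded to the longest sequence with `FEAT_MASK_VALUE`, and the `Masking` layer marks the padded frames as invalid.

```python
import tensorflow as tf

FEAT_MASK_VALUE = 1e+10

# Two toy sequences with 3 and 1 frames of 2 features each.
ragged = tf.ragged.constant(
    [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
     [[0.7, 0.8]]], dtype=tf.float32)

dense = tf.sparse.to_dense(ragged.to_sparse(), default_value=FEAT_MASK_VALUE)
print(dense.shape)   # (2, 3, 2): the short sequence is padded to the max length

# The Masking layer treats frames whose features are all FEAT_MASK_VALUE as padding.
mask = tf.keras.layers.Masking(FEAT_MASK_VALUE).compute_mask(dense)
print(mask.numpy())  # [[ True  True  True]
                     #  [ True False False]]
```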
That was the input-feature side, but the same applies to the labels. `targets[13:32]` just picks out the labels corresponding to the clipped audio segment (yes, a magic number ...).
# Creating sparse representation to feed the placeholder
train_targets = tf.ragged.constant([targets, targets[13:32]], dtype=np.int32)
train_targets_len = tf.cast(train_targets.row_lengths(), tf.int32)
train_targets = train_targets.to_sparse()
For training, create a `Dataset` that bundles the required data and use `batch()` to form mini-batches. The mini-batches can then be pulled out one after another in a `for` loop.
ds = tf.data.Dataset.from_tensor_slices((train_inputs, train_targets, train_seq_len, train_targets_len)).batch(batch_size)
for curr_epoch in range(num_epochs):
(Omitted)
for batch_inputs, batch_targets, batch_seq_len, batch_targets_len in ds:
(Omitted)
In practice, I expect the training data would be written to TFRecord files in advance and a `Dataset` would be built from them. If `tf.io.VarLenFeature` is used so that each feature is read as a `SparseTensor` at load time, the body of the current loop should (probably) work as-is.
[TensorFlow 2] It is recommended to read features from TFRecord in batch units – Qiita
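A rough sketch of that idea (the file name and feature keys here are hypothetical, and the MFCC frames would arrive flattened, so they would need to be reshaped to `(time, num_features)` afterwards):

```python
import tensorflow as tf

# VarLenFeature entries are parsed as SparseTensor, matching what step() expects
# before its tf.sparse.to_dense() call.
feature_spec = {
    'inputs': tf.io.VarLenFeature(tf.float32),   # flattened MFCC frames
    'targets': tf.io.VarLenFeature(tf.int64),    # label indices
}

def parse_batch(serialized):
    # Parse a whole batch of serialized Examples at once.
    return tf.io.parse_example(serialized, feature_spec)

ds = (tf.data.TFRecordDataset(['train.tfrecord'])   # hypothetical file name
      .batch(2)
      .map(parse_batch))
```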
It is nice that everything now runs as TensorFlow 2.x style processing, but since the model ended up as a Keras model anyway, let us see whether the training part can also be run through the Keras API.
**This turned out to be the start of a painful road ... ~~Bottom line: it seems better not to try too hard.~~**
**(Added 2020/04/27) I found a way to make this work well with Keras. See the separate article for details.** [TensorFlow 2 / Keras] How to run learning with CTC Loss in Keras – Qiita
It is based on sample code (1) above, which trains on a single example.
ctc_tensorflow_example_tf2_keras.py
(Since it is the same as the TF2 version, the first half is omitted)
# Creating sparse representation to feed the placeholder
train_targets = tf.sparse.to_dense(tf.sparse.SparseTensor(*sparse_tuple_from([targets], dtype=np.int32)))
(Omitted)
def loss(y_true, y_pred):
#print(y_true) # Tensor("dense_target:0", shape=(None, None, None), dtype=float32) ???
targets_len = train_targets_len[0]
seq_len = train_seq_len[0]
targets = tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32)
# Time major
logits = tf.transpose(y_pred, (1, 0, 2))
return tf.reduce_mean(tf.nn.ctc_loss(targets, logits,
tf.fill((tf.shape(targets)[0],), targets_len), tf.fill((tf.shape(logits)[1],), seq_len),
blank_index=-1))
def metrics(y_true, y_pred):
targets_len = train_targets_len[0]
seq_len = train_seq_len[0]
targets = tf.sparse.from_dense(tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32))
# Time major
logits = tf.transpose(y_pred, (1, 0, 2))
# Option 2: tf.nn.ctc_beam_search_decoder
# (it's slower but you'll get better results)
decoded, _ = tf.nn.ctc_greedy_decoder(logits, train_seq_len)
# Inaccuracy: label error rate
ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
targets))
return ler
model.compile(loss=loss, optimizer=optimizer, metrics=[metrics])
for curr_epoch in range(num_epochs):
train_cost = train_ler = 0
start = time.time()
train_cost, train_ler = model.train_on_batch(train_inputs, train_targets)
val_cost, val_ler = model.test_on_batch(train_inputs, train_targets)
log = "Epoch {}/{}, train_cost = {:.3f}, train_ler = {:.3f}, val_cost = {:.3f}, val_ler = {:.3f}, time = {:.3f}"
print(log.format(curr_epoch+1, num_epochs, train_cost, train_ler,
val_cost, val_ler, time.time() - start))
decoded, _ = tf.nn.ctc_greedy_decoder(tf.transpose(model.predict(train_inputs), (1, 0, 2)), train_seq_len)
d = tf.sparse.to_dense(decoded[0])[0].numpy()
str_decoded = ''.join([chr(x) for x in np.asarray(d) + FIRST_INDEX])
# Replacing blank label to none
str_decoded = str_decoded.replace(chr(ord('z') + 1), '')
# Replacing space label to space
str_decoded = str_decoded.replace(chr(ord('a') - 1), ' ')
print('Original:\n%s' % original)
print('Decoded:\n%s' % str_decoded)
That is roughly how it ended up being written. In practice, though, the behavior is quite suspicious ...
If you build a model the normal Keras way, you cannot pass sparse labels to `Model.fit()` or `Model.train_on_batch()`, so I had no choice but to convert them to an ordinary dense `Tensor`.
Since the labels must be sparse when computing the label error rate,
targets = tf.sparse.from_dense(tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32))
the code converts them back to sparse, but this drops the ID 0 symbol, which corresponds to the space (well, of course: a sparse matrix is exactly one that stores no zeros ...). As a result, the error rate is computed against a reference label sequence with the spaces removed, so it never reaches 0 (there are as many insertion errors as there are spaces). The most direct fix would be to change the ID scheme so that ID 0 becomes the blank symbol (not the space), but it would be even better to avoid converting the labels from sparse to dense and back in the first place ...
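A tiny demonstration of the problem (toy labels, not from the sample code): `tf.sparse.from_dense` silently drops every 0 entry, i.e. every space label under the current ID scheme.

```python
import tensorflow as tf

# "she had" encoded with the article's scheme: a=1 ... z=26, space=0.
dense_labels = tf.constant([[19, 8, 5, 0, 8, 1, 4]], dtype=tf.int32)
sparse_labels = tf.sparse.from_dense(dense_labels)
print(sparse_labels.values.numpy())   # [19  8  5  8  1  4] -- the space (0) is gone
```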
When writing in Keras style, the loss function is specified in `Model.compile()`. A custom callable can also be passed:
def loss(y_true, y_pred):
Since it can only take two arguments, the length information is fetched from global variables this time. Up to that point it is still tolerable.
def loss(y_true, y_pred):
#print(y_true) # Tensor("dense_target:0", shape=(None, None, None), dtype=float32) ???
(Omitted)
targets = tf.sparse.from_dense(tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32))
`y_true` should be the data coming from the reference labels (that is, `train_targets` / `val_targets`), right? Those should be two-dimensional, `(sample, time)`, yet for some reason the `Tensor` is three-dimensional ... Moreover, the original labels were created as `int32`, yet they arrive as `float32` ...
So, without really knowing what comes in as `y_true`, the line
targets = tf.sparse.from_dense(tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32))
reshapes it to two dimensions and converts the type. It is far too suspicious. And yet training seems to work properly?
This may simply be the Keras specification (design philosophy?). The documentation for `tf.keras.losses.Loss` (TensorFlow Core v2.1.0) states:
y_true: Ground truth values. shape = [batch_size, d0, .. dN]
y_pred: The predicted values. shape = [batch_size, d0, .. dN]
which reads as if the two are expected to have the same shape. That is no problem for ordinary classification losses such as cross entropy, but with something like CTC loss, where the reference labels and the predictions have different lengths, it quickly becomes confusing.
... But wait, `sparse_categorical_crossentropy` uses different shapes for `y_true` and `y_pred`, doesn't it? How is that achieved?
- `y_true`: category indices, shape `(batch_size,)`
- `y_pred`: scores for each category, shape `(batch_size, num_classes)`
In other words, it should be possible to imitate that implementation. Looking at the implementation below, it also performs reshaping and type conversion, so perhaps the current implementation here is actually the intended approach after all. (It still feels suspicious, though.)
tensorflow/backend.py at v2.1.0 · tensorflow/tensorflow · GitHub
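For reference, here is a quick shape check (toy values) of that point: `sparse_categorical_crossentropy` accepts a `y_true` of shape `(batch_size,)` together with a `y_pred` of shape `(batch_size, num_classes)`.

```python
import tensorflow as tf

y_true = tf.constant([2, 0])                 # category indices, shape (2,)
y_pred = tf.constant([[0.1, 0.2, 0.7],
                      [0.8, 0.1, 0.1]])      # per-class probabilities, shape (2, 3)
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
print(loss.numpy())                          # one loss value per sample, shape (2,)
```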
Epoch 1/200, train_cost = 774.764, train_ler = 1.190, val_cost = 387.497, val_ler = 1.000, time = 2.212
Epoch 2/200, train_cost = 387.497, train_ler = 1.000, val_cost = 638.239, val_ler = 1.000, time = 0.459
(Omitted)
Epoch 200/200, train_cost = 3.549, train_ler = 0.238, val_cost = 3.481, val_ler = 0.238, time = 0.461
Original:
she had your dark suit in greasy wash water all year
Decoded:
she had your dark suit in greasy wash water all year
It takes roughly 3 times longer than before the Keras rewrite (i.e. the plain TensorFlow 2.x version) ... [^2]
Moreover, for the reasons described above, the values of `train_ler` and `val_ler` are not reported correctly.
[^2]: Since the amount of data is tiny, the training itself is not slow; the Keras conversion probably just adds overhead that is independent of the data size. Converting the labels back and forth between dense and sparse is another possible cause.
**~~I did my best to write the training part in Keras style, but I ended up with suspicious hacks and, for now, gained nothing from it. It may be resolved by future versions of TensorFlow and Keras, but who knows.~~**
- I explained how to train parameters with CTC loss in TensorFlow 2.x. It seems to be working, at least for now.
- ~~It ends up looking like a TensorFlow-style training loop with some Keras code mixed in, but I do not recommend writing it entirely in Keras, as that turns out to be rather troublesome.~~
- **(Added 2020/04/27) See the separate article for how to train in Keras style.**