Example code bundled with various frameworks and stories about implementing machine learning models have one thing in common: there are no tests. But once a machine learning model is embedded in an application, it is part of the production code. Do you really want to put an untested implementation into your production environment? I doubt that is usually the case.
*(Image: from [Studio Ghibli Porco Rosso](https://www.amazon.co.jp/dp/B00005R5J6))*

It's easy to forget that a machine learning model is at its most accurate at the moment of release. At release time, the model has just been trained on all of the data available up to that point; after that, more and more unseen data keeps coming in. It is therefore very important to be able to verify the accuracy and validity of the model at any time. This is the same reason we test ordinary code: being a machine learning model does not make it special.
In this article, I will explain how to test such machine learning models. Of course, this is the method I am currently practicing, and I expect more practical know-how to spread as machine learning is applied to more applications in the future.
First, the code must be well designed so that it can be tested at all. I covered this point in earlier material, so I would like to quote from it.
Code design to avoid crying with machine learning
Model is the actual machine learning model (built with scikit-learn, Chainer, TensorFlow, and so on), and all processing often ends up packed into it. The point of that material is to split this into separate components: the Model itself, the Trainer, the DataProcessor, the Resource, and the Model API.
With this separation, when a problem such as poor accuracy occurs, you can isolate and verify where it comes from: is it the model itself, is the training at fault, is the model fine but something goes wrong only when it is used from the application, or is there a mistake in the data preprocessing?
However, compared to a normal program whose input and output can be clearly defined, the output of machine learning is indeterminate. DataProcessor and Resource behave much like normal programs and are easy to test, but the Trainer and the Model API, which involve the Model itself, are a problem.
I didn't go into detail on this point in the material above, so let's look at these tests from here.
There are four main things to test in a machine learning model:

* Operation test
* Verification test
* Integration test
* Evaluation test
I would like to go through these tests step by step. The code shown below is quoted from the recently developed tensorflow_qrnn repository.
It is based on TensorFlow, but I think the idea carries over to other libraries (I used the same design and tests when working with Chainer before). On the other hand, there is a TensorFlow-specific pitfall I ran into while testing, so I will also mention how to deal with it.
The operation test checks whether the Model runs from input to output without raising an error. For a neural network model, you could also call it a forward check.
Here is the code I actually used.
tensorflow_qrnn/test_tf_qrnn_forward.py
The input can be random data; what matters is that it passes through to the output. The operation test is run frequently during development, whenever you build or restructure the model, to verify as lightly and quickly as possible that it works at all. In that sense, its role is close to compilation.
Note that in TensorFlow, when you run unit tests, multiple tests share the global Graph and unintended errors occur. You therefore need to separate the Graph for each test case.
```python
class TestQRNNForward(unittest.TestCase):

    def test_qrnn_linear_forward(self):
        batch_size = 100
        sentence_length = 5
        word_size = 10
        size = 5
        data = self.create_test_data(batch_size, sentence_length, word_size)
        # Use a dedicated Graph for this test case so tests do not share state.
        with tf.Graph().as_default() as q_linear:
            qrnn = QRNN(in_size=word_size, size=size, conv_size=1)
            ...
```
This problem becomes especially messy if you do not set variable scopes. Basically, when using TensorFlow, it is important to clearly delimit variables with variable_scope when declaring them (duplicates cannot be checked with name_scope).
```python
class QRNNLinear():

    def __init__(self, in_size, size):
        self.in_size = in_size
        self.size = size
        self._weight_size = self.size * 3  # z, f, o
        with tf.variable_scope("QRNN/Variable/Linear"):
            initializer = tf.random_normal_initializer()
            self.W = tf.get_variable("W", [self.in_size, self._weight_size], initializer=initializer)
            self.b = tf.get_variable("b", [self._weight_size], initializer=initializer)
```
Scopes deserve their own explanation, so please refer to a dedicated article on them (I will summarize the topic separately). In any case, when using TensorFlow, keep the following in mind:

* Separate the Graph for each test case.
* Declare variables with variable_scope so that duplicates are detected.
*(Image: from [Studio Ghibli Porco Rosso](https://www.amazon.co.jp/dp/B00005R5J6))*

Once you have a model that passes the operation test, it is still a little hasty to train it on production data right away. The volume of production data will be considerable, and training will take time. Unless you are very confident, you should first confirm with smaller data that the model behaves as intended and records better accuracy than a baseline. This is the verification test.
Conversely, creating a dataset for the verification test and a baseline model for it will help the process of improving the machine learning model. A verification test dataset is a dataset that is easy to handle and can be trained on in a relatively short time, and the baseline model is a basic model that sets the bar: if the new model cannot beat it, it is no good.
Without these, you tend to get caught up in delusions such as "maybe a bit more data would help" or "maybe accuracy would improve with a bit more training time," and improving the essential algorithm tends to fall by the wayside.
"We can use the same data as production right away" is quite a trap: precisely because it is real data, it can be heavily skewed (for example, in diagnostic imaging, if 90% of cases are normal, even a model that simply always predicts "no abnormality" reaches 90% accuracy). It is a basic fact of machine learning that bias in the data leads to bias in the model's judgments, but the reassurance of "we are using production data" tends to distract us from exactly this point.
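To make this trap concrete, here is a minimal sketch (not from the article's repository) using synthetic labels and scikit-learn's DummyClassifier: a "model" that always predicts the majority class reaches roughly 90% accuracy when roughly 90% of the samples are normal, without learning anything.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic, heavily imbalanced labels: roughly 90% "normal" (0), 10% "abnormal" (1).
rng = np.random.RandomState(0)
X = rng.rand(1000, 8)                    # dummy features
y = (rng.rand(1000) < 0.1).astype(int)   # about 10% positives

# A "model" that always predicts the majority class and never learns anything.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

print(accuracy_score(y, baseline.predict(X)))  # roughly 0.9
```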
To avoid these problems, I recommend preparing verification test data of an easy-to-handle size with balanced labels, together with an environment to run it in.
In the implementation below, we test with the handwritten digit dataset (digits) that comes with scikit-learn. scikit-learn bundles datasets such as these, so using them saves you the trouble of preparing data.
tensorflow_qrnn/test_tf_qrnn_work.py
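The actual verification tests live in the file above. As a separate, minimal sketch of the same idea (not the repository code), here is a unittest that loads scikit-learn's digits dataset and checks that a stand-in model reaches a reasonable accuracy on held-out data; the logistic regression model and the 0.9 threshold are illustrative assumptions, standing in for the real model and its agreed-upon bar.

```python
import unittest

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


class TestDigitsVerification(unittest.TestCase):

    def test_model_beats_threshold(self):
        # Small, easy-to-handle dataset bundled with scikit-learn.
        digits = load_digits()
        X_train, X_test, y_train, y_test = train_test_split(
            digits.data, digits.target, test_size=0.2, random_state=0)

        # Stand-in for the model under development (the QRNN in the repository).
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, y_train)

        accuracy = model.score(X_test, y_test)
        self.assertGreater(accuracy, 0.9)  # illustrative threshold


if __name__ == "__main__":
    unittest.main()
```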
If you do have production data, it is better to create a well-balanced sample with respect to the target label than to simply extract records by period. With such a dataset you check whether the loss decreases and the accuracy is reasonable.
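For example, a label-balanced subset can be drawn by sampling each class separately. This is a minimal sketch with hypothetical arrays X and y standing in for your production features and labels; the class counts and sample size are made up for illustration.

```python
import numpy as np

# Hypothetical production data: features X and an imbalanced label y.
rng = np.random.RandomState(0)
X = rng.rand(10000, 16)
y = (rng.rand(10000) < 0.1).astype(int)   # about 10% positives

# Sample the same number of records from each class, instead of simply
# extracting records by period.
per_class = 200
indices = np.concatenate([
    rng.choice(np.where(y == label)[0], size=per_class, replace=False)
    for label in np.unique(y)
])
X_small, y_small = X[indices], y[indices]

print(np.bincount(y_small))  # 200 samples per class
```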
Comparison with the baseline is another important role of the verification test. It happens all the time that a plain SVM clearly beats the neural network model you worked so hard on (note: tune the baseline model properly; the point is not to use a neural network, but to find a model suited to your purpose). Fortunately, scikit-learn comes with a variety of models, which makes it ideal for this kind of verification. You can compare against a baseline model without writing much code.
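A minimal sketch of such a baseline comparison, again using scikit-learn's digits: the SVM is the baseline and a small MLP stands in for the model under development (both models and the dataset are illustrative assumptions, not the article's setup). The candidate may well lose here, which is exactly the signal the verification test is meant to give.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Baseline: an off-the-shelf model, tuned at least reasonably.
baseline = SVC(gamma="scale")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Candidate: a small neural network standing in for the model under development.
candidate = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
candidate.fit(X_train, y_train)
candidate_acc = candidate.score(X_test, y_test)

print(f"baseline (SVC): {baseline_acc:.3f}  candidate (MLP): {candidate_acc:.3f}")
if candidate_acc < baseline_acc:
    print("NG: the candidate does not beat the baseline")
```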
By putting the verification test in place as a gate, you can avoid wasting time and money (GPU fees) on a bad model.
*(Image: from [Studio Ghibli Porco Rosso](https://www.amazon.co.jp/dp/B00005R5J6))*

That said, it is also true that some models only become accurate after being trained for a very long time. In such cases, you can instead record the loss and accuracy against training time (a velocity of sorts) and check that.
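One way to record such a curve is to log elapsed training time and accuracy after each epoch and compare curves instead of a single final score. This is a sketch under the assumption that SGDClassifier and the digits dataset stand in for the real model and data; loss could be logged the same way.

```python
import time

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y_train)

history = []  # (elapsed seconds, test accuracy) recorded after each epoch
start = time.time()
for epoch in range(10):
    model.partial_fit(X_train, y_train, classes=classes)
    history.append((time.time() - start, model.score(X_test, y_test)))

for elapsed, accuracy in history:
    print(f"{elapsed:.3f}s  accuracy={accuracy:.3f}")
```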
The integration test checks whether the model can be called successfully from the application. When using a machine learning model, it is not only the accuracy that should be tested but also the preprocessing and so on.
Therefore, the DataProcessor should be tested on its own before the integration test. Then test whether the Model API works properly when used from the application. As for accuracy, just as in the verification test above, it is good to prepare a dataset of a size that is easy to verify and to check the accuracy when running the Model API, that is, the actual application path. This is because it often happens during integration testing that the API itself runs fine but accuracy drops, for example because preprocessing at inference time differs from preprocessing at training time.
For this reason, it is advisable to measure accuracy as well, not just whether the call works.
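As a hedged sketch of this idea (not the article's actual Model API), here the "Model API" is stood in for by a scikit-learn Pipeline that chains preprocessing and a model, so the test checks both that the end-to-end call works and that accuracy on a small, easy-to-verify dataset does not silently degrade. The pipeline, dataset, and threshold are all illustrative assumptions.

```python
import unittest

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class TestModelAPIIntegration(unittest.TestCase):

    def setUp(self):
        digits = load_digits()
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            digits.data, digits.target, test_size=0.2, random_state=0)
        # Stand-in for the Model API: preprocessing (DataProcessor) + model.
        self.api = Pipeline([
            ("preprocess", StandardScaler()),
            ("model", LogisticRegression(max_iter=1000)),
        ])
        self.api.fit(self.X_train, self.y_train)

    def test_call_from_application(self):
        # The end-to-end call works and returns one prediction per input.
        predictions = self.api.predict(self.X_test)
        self.assertEqual(len(predictions), len(self.X_test))

    def test_accuracy_not_degraded(self):
        # Catch problems such as preprocessing mismatches that keep the API
        # running but quietly hurt accuracy (threshold is illustrative).
        self.assertGreater(self.api.score(self.X_test, self.y_test), 0.9)


if __name__ == "__main__":
    unittest.main()
```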
The dataset used to test the accuracy of the Model API is also useful for continuously monitoring the performance of the machine learning model. This makes it possible to decide when to retrain or rebuild the model, and in that sense I recommend preparing it separately from the verification test dataset (with data a little closer to the real thing than the verification test data).
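A minimal sketch of using such a dataset for ongoing monitoring: compute accuracy on the fixed monitoring set at regular intervals and flag when it falls below an agreed threshold. The function name, the labels, and the 0.9 threshold are hypothetical.

```python
from sklearn.metrics import accuracy_score


def needs_retraining(y_true, y_pred, threshold=0.9):
    """Return True when accuracy on the monitoring dataset drops below threshold."""
    accuracy = accuracy_score(y_true, y_pred)
    return accuracy < threshold


# Hypothetical example: labels of the monitoring set vs. the current model's output.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]
print(needs_retraining(y_true, y_pred))  # accuracy 0.8 < 0.9 -> True
```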
We move on to the evaluation test once the model exceeds the baseline in the verification test and the integration test confirms that it can be called from the application.
Here, a so-called A/B test is carried out. At this stage, if necessary, we train the model thoroughly on a larger amount of data than in the verification test, and then check whether it has an advantage over the existing model.
The indicators checked at the evaluation test stage are quite different from those checked at the verification test stage. In the verification test, we check indicators that represent the performance of the model itself, such as accuracy; in the evaluation test, we check service KPIs (Key Performance Indicators), such as user engagement rate.
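As a minimal, purely hypothetical sketch of what is measured here: compute the engagement rate per variant from logged events, where variant A is served by the existing model and variant B by the new one (the log format and values are made up for illustration).

```python
# Hypothetical event logs: per user, which model variant served them
# and whether they engaged (clicked, converted, etc.).
events = [
    {"variant": "A", "engaged": True},
    {"variant": "A", "engaged": False},
    {"variant": "A", "engaged": False},
    {"variant": "B", "engaged": True},
    {"variant": "B", "engaged": True},
    {"variant": "B", "engaged": False},
]


def engagement_rate(events, variant):
    served = [e for e in events if e["variant"] == variant]
    return sum(e["engaged"] for e in served) / len(served)


rate_a = engagement_rate(events, "A")  # existing model
rate_b = engagement_rate(events, "B")  # new model
print(f"A={rate_a:.2f}  B={rate_b:.2f}")
# In practice, also check statistical significance before deciding.
```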
Ultimately, the goal is not to build a highly accurate model but to build a model that contributes to the service, that is, one that adds value for the user. The evaluation test checks exactly this point.
That covers my approach to testing when implementing machine learning models. I am still working it out through trial and error, so if you do things differently, I would love to hear from you.