First edition: 2020/3/3
Authors: Soichi Takashige, Masahiro Ito, Hitachi, Ltd.
In this series, we introduce design know-how for data preprocessing and the results of performance verification of numerical data preprocessing, aimed at readers designing a system that incorporates a machine learning model.
In this first installment, we give an overview of data preprocessing in machine learning systems and of how to design it.
Data analysis using AI technologies such as machine learning is attracting attention, and the number of projects using AI is increasing. In AI projects, customer data is analyzed to build machine learning models that provide insights and automate predictions. AI projects are often carried out in two stages: 1) a PoC (Proof of Concept) led by data scientists, who are experts in data analysis, to confirm the usefulness of the data analysis and the machine learning models, and 2) productionization, in which system engineers (hereinafter, SEs) build the production system based on those results. In addition to the report submitted to the customer, the PoC deliverables from the data scientists include source code written in a language such as Python, which the SEs use as the basis for designing and building the system.
One of the challenges of building a system from PoC deliverables is the increase in the amount of data in the production environment. Since a PoC is only a verification, the amount of data entrusted to us by customers is often small, and even when a large amount of data is available, it is often sampled for quick verification. In this way, a PoC typically uses a small amount of data that can be processed on a single desktop machine. On the other hand, since machine learning systems learn from data, the larger the amount of data, the higher the prediction accuracy tends to be. Therefore, production systems often use a large amount of data to improve prediction accuracy, and machine learning systems are required to provide high data processing performance.
In addition, while data scientists are responsible for the machine learning models, data preprocessing is generally designed mainly by engineers such as SEs. Data preprocessing is closely tied to the machine learning model, but it also requires SE skills such as infrastructure design, sizing, and failure handling. Redesigning and reimplementing the preprocessing from scratch on the SE side is not efficient, as it takes considerable cost and time. Therefore, an approach that promotes systemization by reusing the prototypes that data scientists developed in Python and similar languages is effective. However, little public information on such design know-how is currently available. In this series, based on knowledge obtained from performance verification, we introduce the design procedure and key points of data preprocessing for SEs who have completed the PoC stage and are designing a system that incorporates a machine learning model.
In a system that uses machine learning, data preprocessing is performed mainly in three phases: learning, inference, and relearning. Figure 1 gives an overview.
Figure 1 Overview of machine learning utilization system
In preprocessing at learning time, data sets stored in various formats are converted into a data structure suitable for training the model, and the data is normalized and aggregated to improve accuracy. This preprocessing often handles all of the data at once, so the amount of processing tends to be much larger than at inference time.
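As a concrete image of such learning-time preprocessing, the following is a minimal sketch in Python with pandas. The file name, column names, and aggregation are hypothetical examples, not taken from actual PoC deliverables.

```python
import pandas as pd

# Load the raw data set (file and column names are hypothetical examples).
df = pd.read_csv("sensor_data.csv", parse_dates=["timestamp"])

# Aggregation: daily mean temperature per device.
daily = (
    df.groupby(["device_id", df["timestamp"].dt.date])["temperature"]
      .mean()
      .reset_index(name="temp_mean")
)

# Normalization: scale the numeric feature to zero mean and unit variance,
# keeping the parameters so the same conversion can be applied at inference time.
mean, std = daily["temp_mean"].mean(), daily["temp_mean"].std()
daily["temp_mean_scaled"] = (daily["temp_mean"] - mean) / std
```

Because all records are processed at once here, the CPU and memory requirements grow directly with the size of the input data.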
At inference time, the model is used to classify data collected from the production site and to predict trends. Inference may have latency requirements, for example on the order of seconds, in which case the preprocessing is also subject to those requirements. Data preprocessing at inference time converts the data so that it has the same features as at learning time, but it is applied only to the data to be inferred, so the processing itself may be simpler. On the other hand, because it must be executed every time the model is used, it is called frequently and continuously in the production system.
If data that can be used for learning accumulates from the actual data at the operation site, it may be used to update the model. In that case, the same processing as the data preprocessing at learning time is performed.
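For example, the normalization parameters obtained at learning time can be saved and reused so that inference produces exactly the same features. The sketch below assumes the hypothetical parameters from the previous example were persisted to a JSON file; it is only an illustration, not the actual system code.

```python
import json
import pandas as pd

# Normalization parameters saved at learning time
# (file name and keys are hypothetical examples).
with open("scaler_params.json") as f:
    params = json.load(f)

def preprocess_for_inference(record: dict) -> pd.DataFrame:
    """Convert one incoming record into the same features used at learning time."""
    df = pd.DataFrame([record])
    df["temp_mean_scaled"] = (df["temp_mean"] - params["mean"]) / params["std"]
    return df[["temp_mean_scaled"]]

# Usage: preprocess a single record arriving from the production site.
features = preprocess_for_inference({"device_id": "dev-001", "temp_mean": 24.5})
```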
Since a PoC by data scientists is often performed with a small amount of data, preprocessing is usually implemented in Python during the PoC. On the other hand, when that Python preprocessing is carried over into the production system, the issues shown in Table 1 arise. Table 1 also shows a solution for each issue.
Table 1 Data preprocessing system issues and solutions
# | Issue | Solution |
---|---|---|
① | Preprocessing takes a very long time because of the huge amount of target data. | Use a big data processing platform: design and implement the preprocessing written in Python etc. so that it can run in parallel and distributed on a platform such as Spark when the amount of data is large. |
② | Using the big data processing platform of ① requires the preprocessing to be reimplemented, which takes man-hours. | Implement the preprocessing with systemization in mind from the PoC stage: adopt a Python implementation style at PoC time that works without major changes when converted to Spark (see the sketch after this table). |
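One possible way to realize solution ② is the pandas API on Spark (`pyspark.pandas`, available in Spark 3.2 and later), which lets pandas-style code run distributed with few changes. The sketch below is an illustrative assumption, not the implementation used in this verification; the file path and column names are hypothetical.

```python
# Pandas-style preprocessing running distributed on Spark via pyspark.pandas.
# Switching the import between "pandas" and "pyspark.pandas" leaves the rest
# of the logic largely unchanged.
import pyspark.pandas as ps

# Distributed load (file path and column names are hypothetical examples).
df = ps.read_csv("sensor_data.csv")

# The same pandas-style aggregation as in the PoC code.
daily = df.groupby("device_id", as_index=False)["temperature"].mean()
daily = daily.rename(columns={"temperature": "temp_mean"})

# Persist the preprocessed result for the training step.
daily.to_parquet("preprocessed_daily")
```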
A system that utilizes machine learning is designed as shown in Figure 2. As mentioned at the beginning of this post, it is common to carry out a PoC to confirm the usefulness of machine learning and then systematize the logic whose usefulness was confirmed there.
Figure 2 Outline of the design process of a system that utilizes machine learning
As shown in Fig. 1, a system that utilizes machine learning consists of two major parts: the learning system and the inference system. From here on, we deal with the learning system; as noted above, the inference system generally tends to process a smaller amount of data per execution than the learning system.
Table 2 shows the design items to consider when designing and implementing the data preprocessing of the learning system. Only the items that are characteristic of machine learning are shown here.
Table 2 List of design items for learning system
# | Design items | Details |
---|---|---|
1 | Examination of system requirements | |
2 | Data design | |
3 | Resource estimation on an actual machine | |
4 | Implementation | |
5 | Availability design | |
6 | Operational design | |
For the learning system, the most important point is whether model development using the target data can be completed within the period required by the system requirements. In the PoC phase, often only part of the target data is handled (a limited period, a subset of devices, forms, etc.), but in the subsequent systemization, the entire period and all types of data are handled, which can result in a huge amount of data. For preprocessing, it is therefore important in the systemization design to estimate the resources (number of CPUs, amount of memory) required so that processing of the large data set can be completed within an acceptable processing time.
To determine the resources required for preprocessing, first measure the input/output data sizes and the processing time of each step of the preprocessing using a small data set. Also identify the steps (for example, those with a large number of iterations) for which optimization of the processing logic, described in Part 2 and later, can be expected to be effective.
For the number of CPUs, the processing time on the production system is estimated from the production input data size, based on the input data size and processing time measured with the small data set (at this point, the processing time is assumed to be proportional to the data size). Dividing this estimated time by the processing time required by the system requirements gives an approximate number of CPUs needed.
For the amount of memory, the data size of each step in the production system is estimated from the input data size of each step measured with the small data set, and the total of these sizes is used as the estimate.
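A minimal sketch of this estimation in Python is shown below; all measured values and ratios are hypothetical examples inserted only to illustrate the calculation.

```python
import math

# Values measured with a small data set (hypothetical examples).
small_input_gb = 2.0                   # input data size used in the measurement
small_time_h = 0.5                     # processing time on 1 CPU for that input
step_output_ratio = [1.0, 0.3, 0.1]    # output/input size ratio of each step

# Production conditions (hypothetical examples).
prod_input_gb = 500.0                  # expected production input data size
required_time_h = 6.0                  # processing time allowed by the requirements

# Processing time is assumed to be proportional to the data size.
estimated_time_h = small_time_h * (prod_input_gb / small_input_gb)

# Approximate number of CPUs = estimated single-CPU time / allowed time.
required_cpus = math.ceil(estimated_time_h / required_time_h)

# Memory estimate: total of the estimated data sizes of each step.
required_memory_gb = sum(prod_input_gb * r for r in step_output_ratio)

print(f"estimated CPUs: {required_cpus}, estimated memory: {required_memory_gb:.0f} GB")
```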
As the execution platform for data preprocessing in the production system, it is also necessary to decide whether to build an environment that runs the PoC code written in Python as it is, or to run it on a distributed processing platform such as Spark. Basically, if the resource estimation suggests that memory will be insufficient, the preprocessing should run on Spark; if the amount of memory is not a problem, it can remain in Python as it is.
In this post, we gave an overview of data preprocessing in a system that uses machine learning and of its design. Next time, we will introduce know-how for improving the performance of numerical data preprocessing in Python, together with the results of performance verification on actual machines.
Part 2: Performance verification of preprocessing for machine learning with numerical data (1)