First edition: 2020/3/10
Authors: Soichi Takashige, Masahiro Ito, Hitachi, Ltd.
In this series, we introduce design know-how for data preprocessing and the results of performance verification of that preprocessing when building a system that incorporates a machine learning model.
In this second installment, we present know-how for improving the performance of data preprocessing written in Python, together with the verification results.
**Post list:**
Before presenting the design know-how and verification results, let us introduce the benchmark used as a reference in this verification. We used BigBench, one of the benchmark suites for big data analysis. BigBench simulates a system that analyzes the users of an online e-commerce site; it targets structured and semi-structured data such as tables in an RDB and web logs, and defines individual business scenarios together with benchmark programs for them. Figure 1 shows the BigBench data structure.
Figure 1 BigBench data structure
Of BigBench's 30 business scenarios, we chose business scenario #5, which involves the most complex processing and can be regarded as a typical example of data preprocessing, as the verification target, and implemented the initial code in Python ourselves. Business scenario #5 is as follows:
Build a logistic regression model that estimates, from a user's web access history in the online store, which product category the user is interested in.
Figure 2 shows an overview of the processing in business scenario #5.
Figure 2 Outline of the processing of BigBench business scenario #5
This scenario is divided into a learning phase and an inference phase. In the learning phase, the access history to the online store (web_clickstreams) is combined with related information such as the product database (item) and the customer databases (customer, customer_demographics), and the result is organized into statistical data per user (customer): the user's click history in the online store and the user's attributes (education, gender, and so on). From this statistical data, a regression model is created that estimates whether a user is interested in a particular category (for example, "Books"). In the inference phase, the same kind of statistical data is created for a user by aggregating the access history, and the regression model built in the learning phase is applied to it to estimate how interested that user is in the "Books" category. Figure 3 shows the flow of the data preprocessing implemented for this verification, based on business scenario #5.
Figure 3 Outline of the preprocessing algorithm for BigBench business scenario #5 (learning phase)
This pre-processing consists of the following four phases.
**1) [Join the access history and product information]** Associate the web access history with the product database so that each click can be classified by product category (① in Fig. 3).
**2) [Aggregate per user]** Count, for each accessing user, the number of clicks on products in each category (②③ in Fig. 3).
**3) [Join the user information]** Attach the attribute information of the accessing user to the per-user aggregates so that the results can also be classified by user attributes (④⑤ in Fig. 3).
**4) [Convert features to numbers]** Convert text information and the like into numeric values so that it can be used by machine learning (⑥⑦ in Fig. 3).
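For context, the learning phase shown in Figure 2 that consumes these preprocessed features might look roughly like the sketch below. This code is not part of the article; the use of scikit-learn and the label definition (treating above-average clicks in the target category as "interested") are illustrative assumptions, and result-apply.csv is the file produced by the preprocessing code in Figure 4 below.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Output of the preprocessing in Figure 4 (see below)
features = pd.read_csv('result-apply.csv')

X = features[['college_education', 'male'] +
             ['clicks_in_%d' % i for i in range(1, 8)]]
# Assumed label: users with above-average clicks in the target category
# are considered "interested" (the article does not define this)
y = (features['clicks_in_category'] >
     features['clicks_in_category'].mean()).astype(int)

model = LogisticRegression(solver='liblinear').fit(X, y)
print("training accuracy:", model.score(X, y))
```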
Figure 4 shows an example implementation of the data preprocessing for BigBench business scenario #5 in Python. In the rest of this post, we consider how to improve this code.
```python:Initial code (for PoC)
import pandas as pd
import numpy as np

# Read the input data
web_clickstreams = pd.read_csv("web_clickstreams.csv")
item = pd.read_csv("item.csv")
customer = pd.read_csv("customer.csv")
customer_demographics = pd.read_csv("customer_demographics.csv")

# Process ①: join the access history with the product information
data = web_clickstreams.loc[web_clickstreams['wcs_user_sk'].notnull(), :]
data = pd.merge(data, item, how='inner',
                left_on=['wcs_item_sk'], right_on=['i_item_sk'])

# Process ②: split the data by user ID
data = data.groupby('wcs_user_sk')

# Process ③: aggregate per user
i_category_index = "Books"
types = ['wcs_user_sk', 'clicks_in_category'] + \
        ['clicks_in_%d' % i for i in range(1, 8)]

def summarize_per_user(data_mr):
    wcs_user_sk_index = data_mr.name
    # ③-1, ③-2: count accesses to the specified product category (Books)
    clicks_in_category = len(data_mr[data_mr['i_category'] == i_category_index])
    # ③-3 (③-3-1, ③-3-2): count access-log rows for each 'i_category_id' == 1 ... 7
    return pd.Series([wcs_user_sk_index, clicks_in_category] +
                     [len(data_mr[data_mr['i_category_id'] == i]) for i in range(1, 8)],
                     index=types)

data = data.apply(summarize_per_user)

# Process ④: join with the customer information
data = pd.merge(data, customer, how='inner',
                left_on=['wcs_user_sk'], right_on=['c_customer_sk'])

# Process ⑤: join with the customer attribute information
data = pd.merge(data, customer_demographics, how='inner',
                left_on=['c_current_cdemo_sk'], right_on=['cd_demo_sk'])

# Process ⑥: convert features to numbers
data['college_education'] = data['cd_education_status'].apply(
    lambda x: 1 if x in ('Advanced Degree', 'College',
                         '4 yr Degree', '2 yr Degree') else 0)
data['male'] = data['cd_gender'].apply(lambda x: 1 if x == 'M' else 0)

# Process ⑦: extract the required columns
result = pd.DataFrame(data[['clicks_in_category', 'college_education', 'male',
                            'clicks_in_1', 'clicks_in_2', 'clicks_in_3',
                            'clicks_in_4', 'clicks_in_5', 'clicks_in_6',
                            'clicks_in_7']])

# Save the result
result.to_csv('result-apply.csv')
```
Figure 4 Code example for BigBench business scenario #5
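To try the Figure 4 code without the BigBench data generator, one could create tiny stand-in input files such as the following. All values are made up for illustration; only the columns that the code above references are included.

```python
import pandas as pd

# Made-up miniature versions of the four input tables used in Figure 4
pd.DataFrame({'wcs_user_sk': [1, 1, 2],
              'wcs_item_sk': [10, 11, 10]}).to_csv('web_clickstreams.csv', index=False)
pd.DataFrame({'i_item_sk': [10, 11],
              'i_category': ['Books', 'Music'],
              'i_category_id': [1, 2]}).to_csv('item.csv', index=False)
pd.DataFrame({'c_customer_sk': [1, 2],
              'c_current_cdemo_sk': [100, 101]}).to_csv('customer.csv', index=False)
pd.DataFrame({'cd_demo_sk': [100, 101],
              'cd_education_status': ['College', 'Unknown'],
              'cd_gender': ['M', 'F']}).to_csv('customer_demographics.csv', index=False)
```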
Let us start with improvements that can be made within pure Python. When coding in Python, logic optimization is an important piece of know-how for improving performance. For the code of this target workload, we examined and applied the following two logic optimizations, which can be expected to have a large effect.
When iterative processing is written as a for loop, Python executes it on a single CPU core. Rewriting such loops with pandas functions such as apply or map can allow the work to be executed internally in parallel with multiple threads and thus run faster.
When the same range of data is processed for several different cases, filtering operations may end up being executed many times with different conditional expressions. In such code the same data is compared against conditions over and over, which is inefficient. Such processing can be sped up by rewriting it so that the data is scanned only once.
Regarding the first point, the code in Figure 4 already uses pandas functions such as apply, so we focus on the second point here.
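As a generic illustration of the first point (the data and column names below are made up and unrelated to BigBench):

```python
import pandas as pd

df = pd.DataFrame({'price': [120, 80, 250], 'qty': [2, 5, 1]})

# Loop version: each row is handled one by one in the Python interpreter
totals = []
for _, row in df.iterrows():
    totals.append(row['price'] * row['qty'])

# pandas version: the whole multiplication runs inside pandas/NumPy
df['total'] = df['price'] * df['qty']
```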
The upper part of Fig. 5 below shows the code before optimization, which corresponds to "③-3 map loop (iteration over 0 … 7)" in Fig. 3. It looks at the 'i_category_id' column of the rows in data_mr and counts, for each value from 1 to 7, how many rows have that value. In this implementation, the same range of data is scanned seven times. By rewriting this processing with groupby, the number of scans can be reduced to one.
```python:Before improvement
# [Before optimization] scan all elements seven times, once per category ID
return pd.Series([wcs_user_sk_index, clicks_in_category] +
                 [len(data_mr[data_mr['i_category_id'] == i]) for i in range(1, 8)],
                 index=types)
```
```python:After improvement
# [After optimization] scan all elements only once
clicks_in = [0] * 8
for name, df in data_mr.groupby('i_category_id'):
    if name < len(clicks_in):
        clicks_in[name] = len(df)
return pd.Series([wcs_user_sk_index, clicks_in_category] + clicks_in[1:],
                 index=types)
```
Figure 5 Example of replacing the search process in a loop with a single search
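As an aside, the same single-pass count could also be written with value_counts. This variant is not from the article; the toy data_mr below is a made-up stand-in for one user's click records, assuming i_category_id holds integer codes 1 to 7.

```python
import pandas as pd

# Made-up stand-in for one user's click records (data_mr in Figure 5)
data_mr = pd.DataFrame({'i_category_id': [1, 3, 3, 7, 2, 3]})

# One pass over the data: count occurrences of each category ID
counts = data_mr['i_category_id'].value_counts()
clicks_in = [int(counts.get(i, 0)) for i in range(1, 8)]
print(clicks_in)  # [1, 1, 3, 0, 0, 0, 1]
```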
Now let us measure the actual effect of the logic optimization in Fig. 5.
The verification environment runs on AWS; its hardware specifications are shown in Table 1 below.
Table 1 Hardware specifications of verification environment
| | Python data preprocessing verification environment |
|---|---|
| Instance | AWS EC2 |
| OS | CentOS 7 64bit |
| CPU (number of cores) | 32 |
| Memory (GB) | 256 |
| HDD (TB) | 5 (1 TB HDD x 5) |
The software versions used for verification are shown in Table 2 below.
Table 2 Software version of verification environment
| Software | Version |
|---|---|
| Python | 3.7.3 |
| Pandas | 0.24.2 |
| Numpy | 1.16.4 |
We measured the performance of the following two processing methods:

* Single-node processing with Python (without the logic optimization in Fig. 5): the code in Fig. 4 is executed as-is on Python.
* Single-node processing with Python (with the logic optimization in Fig. 5): the code in Fig. 4 is executed on Python with the optimization in Fig. 5 applied.
The measured time is the total of the following three steps (a sketch of one way such a measurement could be instrumented follows this list):

1. Read: all tables needed for the processing (web_clickstreams, item, customer, customer_demographics) are read from text files on the local disk into data frames.
2. Preprocess: the joins and aggregations are performed on the loaded data.
3. Write: the processing result is written back to the data store as a text file on the local disk.
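The article does not show its measurement harness; the sketch below illustrates one way the three steps could be timed. The preprocess argument stands in for the Figure 4 logic and is a hypothetical callable, not code from the article.

```python
import time
import pandas as pd

def run_with_timing(preprocess):
    """Time the read, preprocess, and write steps separately.

    `preprocess` is a caller-supplied function implementing the
    Figure 4 joins and aggregations on the loaded tables.
    """
    t0 = time.perf_counter()
    # (1) Read: load the input tables from local disk into data frames
    tables = {name: pd.read_csv(name + ".csv")
              for name in ("web_clickstreams", "item",
                           "customer", "customer_demographics")}
    t1 = time.perf_counter()
    # (2) Preprocess: joins and aggregations
    result = preprocess(tables)
    t2 = time.perf_counter()
    # (3) Write: store the result as a text file on local disk
    result.to_csv("result-apply.csv")
    t3 = time.perf_counter()
    print("read %.1f s, preprocess %.1f s, write %.1f s, total %.1f s"
          % (t1 - t0, t2 - t1, t3 - t2, t3 - t0))
```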
Assuming that the data size handled by the production system is about 50 GB (the estimated size when expanded in memory), we measured with several data sizes, ranging from 1/100 of the production size up to the sizes listed in Table 3, and checked how the processing time changes. For each measurement point, Table 3 shows both the size of the input data when expanded in memory and its size as text files on the HDD. In the measurement results below, "data size" refers to the in-memory size.
Table 3 Measurement data size
| Percentage of production data size [%] | 1 | 5 | 10 | 25 | 50 | 75 | 100 | 200 | 300 |
|---|---|---|---|---|---|---|---|---|---|
| Number of rows: web_clickstreams | 6.7M | 39M | 83M | 226M | 481M | 749M | 1.09G | 2.18G | 3.39G |
| Number of rows: item | 18K | 40K | 56K | 89K | 126K | 154K | 178K | 252K | 309K |
| Number of rows: customer | 99K | 221K | 313K | 495K | 700K | 857K | 990K | 1.4M | 1.7M |
| Number of rows: customer_demographics | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M |
| Data size in memory (GB) | 0.4 | 1.9 | 3.9 | 10.3 | 21.8 | 33.7 | 49.1 | 97.9 | 152.1 |
| Data size on HDD (GB) | 0.2 | 1.0 | 2.2 | 6.3 | 13.8 | 21.7 | 29.8 | 63.6 | 100.6 |
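The article does not state how the in-memory sizes in Table 3 were obtained. One way to estimate such a value for a loaded data frame (an assumption, not necessarily the authors' method) is pandas' memory_usage:

```python
import pandas as pd

web_clickstreams = pd.read_csv("web_clickstreams.csv")

# Estimated in-memory footprint of the data frame, including string columns
size_gb = web_clickstreams.memory_usage(deep=True).sum() / 1024 ** 3
print("web_clickstreams: about %.1f GB in memory" % size_gb)
```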
Figure 6 shows the processing times measured by running each of the two variants (with and without the logic optimization in Fig. 5) for each data size of BigBench business scenario #5. The data size could be increased only up to about 22 GB (50% of the production data size); attempts to process larger sizes failed due to insufficient memory. When the input data size is 0.4 GB (the leftmost point in the graph in Fig. 6), the execution time is 412 seconds without the logic optimization and 246 seconds with it, a reduction of about 40%. When the input data size is 22 GB (the rightmost point in the graph), the execution time is 5,039 seconds without the optimization and 3,892 seconds with it, a reduction of only about 23%.
Figure 6 Data preprocessing time measurement results for each input data size
Figure 7 shows how CPU, memory, and disk I/O usage change over time while processing 22 GB of data. The verification machine has 32 cores, but Python uses only one of them, so the CPU usage rate stays at roughly 1/32 ≈ 3%. On the other hand, about 200 GB of memory is consumed to process input data whose in-memory size is about 22 GB; this is presumably because intermediate results are also kept in memory during processing. Disk I/O occurs only while the initial data is being read, which confirms that no I/O takes place during the data processing itself and that the processing is essentially done in memory.
Figure 7 Temporal changes in CPU, memory, and I / O usage in the Python + Pandas environment
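The article does not describe how the metrics in Figure 7 were collected. The sketch below shows one way such a profile could be sampled with psutil; the tool choice, one-second interval, and output format are all assumptions.

```python
import time
import psutil

# Sample CPU, memory, and cumulative disk I/O once per second
with open("resource_usage.csv", "w") as f:
    f.write("time,cpu_percent,mem_used_gb,read_mb,write_mb\n")
    for _ in range(3600):                      # sample for one hour
        cpu = psutil.cpu_percent(interval=1)   # averaged over the 1-second wait
        mem = psutil.virtual_memory().used / 1024 ** 3
        io = psutil.disk_io_counters()         # cumulative since boot
        f.write("%d,%.1f,%.2f,%.1f,%.1f\n" % (
            time.time(), cpu, mem,
            io.read_bytes / 1024 ** 2, io.write_bytes / 1024 ** 2))
```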
We confirmed that the logic optimization that avoids scanning the same data repeatedly improves performance by roughly 23 to 40%. The improvement decreased as the data size increased; this is probably because the effect depends on characteristics of the data (value ranges and distributions), and those characteristics change as the data grows. Over-tuning the logic against data of a different size, such as the small subsets typically used in a PoC, can therefore have a different effect on production-scale data, so we recommend performing logic optimization as part of the final tuning.
In this post, we introduced know-how for improving the performance of preprocessing of numerical data with Python, together with performance verification results on an actual machine. Next time, we will present the performance verification results when Spark, a parallel distributed processing platform, is used for the same preprocessing.
The third: Performance verification of data preprocessing for machine learning (numerical data) (Part 2)