Performance verification of data preprocessing for machine learning (numerical data) (Part 1)

First edition: 2020/3/10
Authors: Soichi Takashige, Masahiro Ito, Hitachi, Ltd.

Introduction

In this series of posts, we introduce design know-how for data preprocessing and the results of verifying the performance of data preprocessing when designing a system that incorporates a machine learning model.

In this second post, we introduce know-how for improving the performance of data preprocessing in Python, together with verification results.

**Post list:**

  1. About data preprocessing of the system using machine learning
  2. Performance verification of data preprocessing for machine learning (numerical data) (Part 1) (Posted)
  3. Performance verification of data preprocessing for machine learning (numerical data) (Part 2)

About the benchmark (BigBench) referred to in the performance verification

Before introducing the design know-how and performance verification results, we describe the benchmark used as a reference in this verification. We used BigBench, one of the benchmark programs for big data analysis. BigBench simulates a system that analyzes data such as the users accessing an online e-commerce site, and it targets structured and semi-structured data such as tables in an RDB and web logs. It defines individual business scenarios together with benchmark programs. Figure 1 shows the BigBench data structure.


Figure 1 BigBench data structure

This time, out of BigBench's 30 business scenarios, we used business scenario #5 as the verification target, because it involves the most complex processing and can be regarded as a typical example of data preprocessing, and we implemented the initial code in Python ourselves. Business scenario #5 is as follows:

Create a logistic regression model that estimates the category of products a user is interested in from that user's web access history in the online store.

Figure 2 shows an overview of the processing in business scenario #5.

Figure 2 Outline of the processing of BigBench business scenario #5

This scenario is divided into a learning phase and an inference phase. In the learning phase, related information such as the access history for the online store (web_clickstreams), the product information database (item), and the user information databases (customer, customer_demographics) is organized, for each user (customer), into statistical data summarizing the user's browsing history in the online store and the user's characteristics (educational background, gender, etc.). Based on this statistical data, a regression model is created that estimates whether a user is interested in a particular category (e.g., "Books"). In the inference phase, the same kind of statistical data is created by aggregating the access history and related information for a user, and it is applied to the regression model created in the learning phase to estimate how interested that user is in the "Books" category. Figure 3 shows the flow of the data preprocessing implemented for this verification, based on the processing inside business scenario #5.


Figure 3 Outline of the preprocessing algorithm for BigBench business scenario #5 (learning phase)

This pre-processing consists of the following four phases.

**1) [Combining access history and product information]** Associate the web access history with the product database so that accesses can be classified by product category (① in Fig. 3).

**2) [Aggregate by user]** Count, on the table, the number of clicks on products in each category for each accessing user (② and ③ in Fig. 3).

**3) [Combining user information]** Associate the accessing user's attribute information with the per-user aggregates so that the data can be classified by user attributes (④ and ⑤ in Fig. 3).

**4) [Feature quantification]** Convert text information and the like into numerical values so that it can be used in machine learning (⑥ and ⑦ in Fig. 3).

Coding example of BigBench business scenario #5 in Python

Figure 4 shows an example implementation of the data preprocessing of BigBench business scenario #5 in Python. We consider improvements to the processing based on this code.

Initial code (for PoC)


```python
import pandas as pd
import numpy as np

# Read data
web_clickstreams = pd.read_csv("web_clickstreams.csv")
item = pd.read_csv("item.csv")
customer = pd.read_csv("customer.csv")
customer_demographics = pd.read_csv("customer_demographics.csv")

# Process ①: Combine access history and product information
data = web_clickstreams.loc[web_clickstreams['wcs_user_sk'].notnull(), :]
data = pd.merge(data, item, how='inner', left_on=['wcs_item_sk'], right_on=['i_item_sk'])

# Process ②: Split the data by user ID
data = data.groupby('wcs_user_sk')

# Process ③: Aggregate per user
i_category_index = "Books"
types = ['wcs_user_sk', 'clicks_in_category'] + ['clicks_in_%d' % i for i in range(1, 8)]
def summarize_per_user(data_mr):
    wcs_user_sk_index = data_mr.name
    # ③-1, ③-2: Count the accesses to the specified product category (Books)
    clicks_in_category = len(data_mr[data_mr['i_category'] == i_category_index])
    # ③-3: Calculate for each 'i_category_id' == 1 ... 7
    # ③-3-1, ③-3-2: Count the access log entries with 'i_category_id' == i
    return pd.Series([wcs_user_sk_index, clicks_in_category] + \
                     [len(data_mr[data_mr['i_category_id'] == i]) for i in range(1, 8)], \
                     index=types)
data = data.apply(summarize_per_user)

# Process ④: Combine with user information
data = pd.merge(data, customer, how='inner', left_on=['wcs_user_sk'], right_on=['c_customer_sk'])

# Process ⑤: Combine with user attribute information
data = pd.merge(data, customer_demographics, how='inner', \
                left_on=['c_current_cdemo_sk'], right_on=['cd_demo_sk'])

# Process ⑥: Feature quantification
data['college_education'] = data['cd_education_status'].apply( \
    lambda x: 1 if x == 'Advanced Degree' or x == 'College' or \
    x == '4 yr Degree' or x == '2 yr Degree' else 0)
data['male'] = data['cd_gender'].apply(lambda x: 1 if x == 'M' else 0)

# Process ⑦: Extract the required information
result = pd.DataFrame(data[['clicks_in_category', 'college_education', 'male', \
                            'clicks_in_1', 'clicks_in_2', 'clicks_in_3', \
                            'clicks_in_4', 'clicks_in_5', 'clicks_in_6', 'clicks_in_7']])
# Save the result
result.to_csv('result-apply.csv')
```

Figure 4 Code example of BigBench business scenario #5

Data preprocessing design know-how in Python: Logic optimization

Let's start with improvements that can be made in pure Python. In Python coding, logic optimization is an important piece of know-how for improving performance. We examined and applied the following two logic optimizations, which can be expected to have a large effect on the code for this scenario.

Replace loops with pandas functions

When iterative processing is written as a Python for loop, the processing runs on only a single CPU core because of the constraints of the Python interpreter. Rewriting such loops with pandas functions such as apply or map may allow them to be executed in parallel with multiple threads internally, which can speed them up.
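
As a toy illustration of this point (not part of the benchmark code; the DataFrame and column names are invented for the example), the following sketch contrasts an explicit Python loop with equivalent pandas-style rewrites:

```python
import pandas as pd

df = pd.DataFrame({'price': [120, 250, 80], 'qty': [2, 1, 5]})

# Explicit Python loop: each row is handled by the interpreter, on a single core
totals = []
for _, row in df.iterrows():
    totals.append(row['price'] * row['qty'])
df['total_loop'] = totals

# pandas-style rewrite: the per-row work is pushed down into pandas/NumPy
df['total_vec'] = df['price'] * df['qty']

# map/apply can express per-element logic without an explicit Python loop
df['price_tier'] = df['price'].map(lambda p: 'high' if p >= 100 else 'low')

print(df)
```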

Simplification of duplicate loops

When the same range of data is processed under several different conditions, filtering operations such as filter may be executed many times with different conditional expressions. In such code, the condition comparison is performed repeatedly over the same data range, which is inefficient. Such processing can be sped up by rewriting it so that the data is scanned only once.
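
The concrete rewrite for business scenario #5 is shown in Figure 5 below. As a smaller generic illustration of the same pattern (with invented data, separate from the benchmark code), repeated filtering can be replaced by a single counting pass:

```python
import pandas as pd

df = pd.DataFrame({'i_category_id': [1, 3, 3, 7, 1, 2]})

# Before: the column is compared against each candidate value separately,
# so the same data is scanned once per category ID (7 scans here)
counts_filtered = [len(df[df['i_category_id'] == i]) for i in range(1, 8)]

# After: a single pass over the column produces all the counts at once
value_counts = df['i_category_id'].value_counts()
counts_single_pass = [int(value_counts.get(i, 0)) for i in range(1, 8)]

assert counts_filtered == counts_single_pass
```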

Example of logic optimization in BigBench business scenario #5

  1. Replace loops with pandas functions

This optimization is already applied in the code in Figure 4: the per-user aggregation is performed with the pandas groupby and apply functions rather than with an explicit Python loop.

  2. Simplification of duplicate loops

In the code before optimization (the upper part of Figure 5 below), the "③-3 loop over category IDs (1 ... 7)" from Figure 3 is implemented by referring to the 'i_category_id' column of data_mr and counting the elements whose value is 1 through 7, one value at a time. With this implementation, the same range of data is searched seven times. By rewriting this processing using groupby, the number of searches can be reduced to one.

```python:Before improvement
    # [Before optimization] The full data is scanned 7 times, once per category ID
    return pd.Series([wcs_user_sk_index, clicks_in_category] + \
                     [len(data_mr[data_mr['i_category_id'] == i]) for i in range(1, 8)], \
                     index=types)
```
```python:After improvement
    # [After optimization] The full data is scanned only once
    clicks_in = [0] * 8
    for name, df in data_mr.groupby('i_category_id'):
        if name < len(clicks_in):
            clicks_in[name] = len(df)
    return pd.Series([wcs_user_sk_index, clicks_in_category] + clicks_in[1:], \
                     index=types)
```

Figure 5 Example of replacing the search process in a loop with a single search

Verification of the effect of logic optimization

Let's now measure the actual effect of the logic optimization shown in Figure 5.

Verification environment

This performance verification was carried out on AWS; the hardware specifications are shown in Table 1 below.

Table 1 Hardware specifications of verification environment

| Item | Python data preprocessing verification environment |
| --- | --- |
| Instance | AWS EC2 |
| OS | CentOS 7 64bit |
| CPU (number of cores) | 32 |
| Memory (GB) | 256 |
| HDD (TB) | 5 (1 TB HDD x 5) |

The software versions used for verification are shown in Table 2 below.

Table 2 Software version of verification environment

| Software | Version |
| --- | --- |
| Python | 3.7.3 |
| Pandas | 0.24.2 |
| Numpy | 1.16.4 |

Processing methods compared

This time, we measured the performance using the following two processing methods.

  1. Single-node processing with Python (without the logic optimization in Figure 5)

    Execute the code in Figure 4 with Python.

  2. Single-node processing with Python (with the logic optimization in Figure 5)

    Execute the code in Figure 4, with the optimization in Figure 5 applied, with Python.

Processing content to be measured

The measurement records the total time required for the following three steps (a minimal sketch of such a measurement is shown after the list).

  1. Read data from the data source into memory

    All the tables needed for processing (web_clickstreams, item, customer, customer_demographics) are read from disk into data frames. The input data is stored as text files and is read from the local disk.

  2. Preprocessing of the read data, such as data combination and aggregation

  3. Write the processing result to the data store

    The result is written in text format to the local disk.
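
The following is a minimal sketch of how such an end-to-end measurement could be taken with time.perf_counter, reusing the file names from Figure 4. The preprocessing step here is only a placeholder; in the actual measurement it would be the Figure 4 code (processes ① to ⑦).

```python
import time
import pandas as pd

t0 = time.perf_counter()

# 1. Read data from the data source into memory (text files on the local disk)
web_clickstreams = pd.read_csv("web_clickstreams.csv")
item = pd.read_csv("item.csv")
customer = pd.read_csv("customer.csv")
customer_demographics = pd.read_csv("customer_demographics.csv")
t1 = time.perf_counter()

# 2. Preprocessing of the read data (joins, aggregation, feature quantification).
#    Placeholder: replace this line with the Figure 4 code, which produces `result`.
result = web_clickstreams.head()
t2 = time.perf_counter()

# 3. Write the processing result to the data store (text file on the local disk)
result.to_csv("result-apply.csv")
t3 = time.perf_counter()

print("read: %.1f s, preprocess: %.1f s, write: %.1f s, total: %.1f s"
      % (t1 - t0, t2 - t1, t3 - t2, t3 - t0))
```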

Data to be measured

Assuming that the data size processed by the production system is about 50 GB (the estimated size when expanded in memory), we measured with several data sizes ranging from 1/100 of that size up to the assumed production size, and checked how the processing time changes. For each measurement point, Table 3 shows the size of the input data when expanded in memory and the size when stored on the HDD in text format. In the measurement results that follow, data size refers to the in-memory size.

Table 3 Measurement data size

| Percentage of production data size [%] | 1 | 5 | 10 | 25 | 50 | 75 | 100 | 200 | 300 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of rows: web_clickstreams | 6.7M | 39M | 83M | 226M | 481M | 749M | 1.09G | 2.18G | 3.39G |
| Number of rows: item | 18K | 40K | 56K | 89K | 126K | 154K | 178K | 252K | 309K |
| Number of rows: customer | 99K | 221K | 313K | 495K | 700K | 857K | 990K | 1.4M | 1,715 |
| Number of rows: customer_demographics | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M | 1.9M |
| Data size in memory (GB) | 0.4 | 1.9 | 3.9 | 10.3 | 21.8 | 33.7 | 49.1 | 97.9 | 152.1 |
| Data size on HDD (GB) | 0.2 | 1.0 | 2.2 | 6.3 | 13.8 | 21.7 | 29.8 | 63.6 | 100.6 |

Performance measurement results

Figure 6 shows the processing times measured by executing each of the two methods (with and without the logic optimization in Fig. 5) for each data size in BigBench business scenario #5. Note that the data size goes up to only about 22 GB (50% of the production data size); attempts to process larger data failed due to insufficient memory. When the input data size is 0.4 GB (the leftmost point in the graph in Fig. 6), the execution time is 412 seconds without logic optimization and 246 seconds with logic optimization, a reduction of about 40%. When the input data size is 22 GB (the rightmost point in the graph in Fig. 6), the execution time is 5,039 seconds without logic optimization and 3,892 seconds with logic optimization, a reduction of about 23%.


Figure 6 Data preprocessing time measurement results for each input data size

Figure 7 shows how CPU, memory, and disk I/O usage changed over time while processing the 22 GB of data. The verification machine has 32 cores, but Python can use only one of them, so the CPU usage rate stays at roughly 1/32 ≈ 3% throughout. On the other hand, about 200 GB of memory is consumed to process input data whose in-memory size is about 22 GB. This is probably because intermediate results are kept in memory during the processing, so their size is consumed as well. I/O occurs only when the initial data is read, and it can be confirmed that no I/O occurs during the data processing itself; the processing is essentially done on-memory.


Figure 7 Temporal changes in CPU, memory, and I/O usage in the Python + Pandas environment
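
For reference, a resource profile like the one in Figure 7 could be collected with a small sampling script. The sketch below is our own illustration rather than the tooling used in this verification; it assumes the third-party psutil package and a one-second sampling interval.

```python
import csv
import time
import psutil

# Sample CPU, memory, and disk I/O once per second and append to a CSV file.
# Run alongside the preprocessing job and stop it with Ctrl+C.
with open("resource_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", "cpu_percent", "used_memory_gb", "disk_read_mb", "disk_write_mb"])
    while True:
        cpu = psutil.cpu_percent(interval=1)        # average CPU usage over the last second
        mem = psutil.virtual_memory().used / 2**30  # used memory in GB
        io = psutil.disk_io_counters()              # cumulative disk counters since boot
        writer.writerow([time.time(), cpu, mem,
                         io.read_bytes / 2**20, io.write_bytes / 2**20])
        f.flush()
```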

Effects and precautions of logic optimization

We confirmed that the logic optimization that avoids searching the same data repeatedly yields a performance improvement of roughly 23 to 40%. The improvement became smaller as the data size increased. We believe this is because the effect of the optimization depends on the characteristics of the data (value ranges and distributions), and those characteristics change as the data size grows. Consequently, over-optimizing the logic against data of a different size, such as the small subset used in a PoC, can produce different effects on production-scale data, so we recommend performing logic optimization as part of the final tuning.

In conclusion

In this post, we introduced know-how for improving the performance of numerical data preprocessing in Python, together with performance verification results on an actual machine. In the next post, we will present the performance verification results when Spark, a parallel distributed processing platform, is used for numerical data preprocessing.

Third post: Performance verification of data preprocessing for machine learning (numerical data) (Part 2)
