Kaggle Summary: BOSCH (intro + forum discussion)

Introduction

This is one of a series of posts summarizing Kaggle competitions I participated in in the past. Here I introduce the BOSCH data and pick up the prominent discussions from the forum. For the competition winner's code and useful kernels, see Kaggle Summary: BOSCH (winner) and [Kaggle Summary: BOSCH (kernels)](http://qiita.com/TomHortons/items/359f8e39b9cd424c2360). (The contents of the forum section will be added to later.)

This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work on Jupyter Notebook (adjust %matplotlib inline as needed). If you find any errors when running the sample scripts, a comment would be appreciated.

Table of contents

  1. Overview
  2. Evaluation index
  3. Introduction of data
  4. Forum
  5. Reference

1. Overview


BOSCH is a global manufacturer of machine parts, with its own manufacturing plants all over the world, and its finished products rarely contain defects. Even so, the economic loss caused by defective products is large, and their causes are complex. The goal of this competition is therefore to predict whether a part will be defective or non-defective from the observation data collected at BOSCH's manufacturing plants.

The characteristic points of this competition are the extremely imbalanced labels (only about 1 defective part per 1,000) and the very large, anonymized feature set split across numeric, categorical, and date files.

2. Evaluation index

The evaluation metric for this competition is MCC (Matthews correlation coefficient).

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives. In the label data, only about 1 in 1,000 parts is defective, so a model that predicts every part as non-defective already achieves over 99% accuracy. Metrics such as the F measure or MCC make it possible to evaluate models properly on such extremely imbalanced data.
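As a quick illustration of this (my own addition, not from the original article), scikit-learn's matthews_corrcoef can be compared with plain accuracy on labels this imbalanced:

import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Labels with roughly 1 defective part per 1,000, as in the BOSCH data.
y_true = np.zeros(100000, dtype=int)
y_true[:100] = 1

# Predicting "non-defective" for everything still gives more than 99% accuracy...
y_pred = np.zeros_like(y_true)
print 'accuracy:', accuracy_score(y_true, y_pred)

# ...but MCC comes out as 0, correctly showing the model has no predictive skill.
print 'MCC:', matthews_corrcoef(y_true, y_pred)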

The submission file is a CSV that maps each Id to a predicted Response:

Id,Response
1,0
2,1
3,0
etc.
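A minimal sketch of producing a file in this format with pandas (my own addition; the path and the all-zero predictions are placeholders to be replaced with real model output):

import numpy as np
import pandas as pd

TEST_NUMERIC = '../input/test_numeric.csv'  # assumed path to the test file

# Read only the Ids of the test set.
test_ids = pd.read_csv(TEST_NUMERIC, usecols=['Id'])['Id']

# Placeholder predictions (all non-defective); replace with real model output.
preds = np.zeros(len(test_ids), dtype=int)

pd.DataFrame({'Id': test_ids, 'Response': preds},
             columns=['Id', 'Response']).to_csv('submission.csv', index=False)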

3. Introduction of data

The data comes in three files: numeric, categorical, and date.

'Response' = 1 if the part is defective and 'Response' = 0 if it is non-defective. The data is (very) large and anonymized, and every column name follows the pattern "line_station_feature". For example, "L3_S36_F3939" is the numeric data for line 3, station 36, feature 3939.
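A small helper (my own, purely for illustration) can split such a column name into its line, station, and feature numbers:

import re

def parse_column_name(col):
    """Split e.g. 'L3_S36_F3939' into (line, station, feature) numbers."""
    line, station, feature = re.match(r'L(\d+)_S(\d+)_[FD](\d+)', col).groups()
    return int(line), int(station), int(feature)

print parse_column_name('L3_S36_F3939')   # -> (3, 36, 3939)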

Numeric data

The numeric file is so large that simply reading all of it will stall the program on a laptop. So first, check only the column names and the number of samples. TRAIN_NUMERIC is the path to train_numeric.csv.

check_numeric.py


import pandas as pd

# TRAIN_NUMERIC is the path to train_numeric.csv, e.g. '../input/train_numeric.csv'.
# Read only the header row to get the column names without loading the data.
numeric_cols = pd.read_csv(TRAIN_NUMERIC, nrows=1).columns.values
print numeric_cols
print 'cols.shape: ', numeric_cols.shape

# Load just the Id column, the first feature column, and the label.
F0 = pd.read_csv(TRAIN_NUMERIC, usecols=(numeric_cols[:2].tolist() + ['Response']))
print 'F0.shape: ', F0.shape

The execution example looks like this.

array(['Id', 'L0_S0_F0', 'L0_S0_F2', 'L0_S0_F4', 'L0_S0_F6', 'L0_S0_F8',
       'L0_S0_F10', 'L0_S0_F12', 'L0_S0_F14', 'L0_S0_F16', 'L0_S0_F18',
       'L0_S0_F20', 'L0_S0_F22', 'L0_S1_F24', 'L0_S1_F28', 'L0_S2_F32',
       'L0_S2_F36', 'L0_S2_F40', 'L0_S2_F44', 'L0_S2_F48', 'L0_S2_F52',
       'L0_S2_F56', 'L0_S2_F60', 'L0_S2_F64', 'L0_S3_F68', 'L0_S3_F72',
       .....
       'L3_S50_F4245', 'L3_S50_F4247', 'L3_S50_F4249', 'L3_S50_F4251',
       'L3_S50_F4253', 'L3_S51_F4256', 'L3_S51_F4258', 'L3_S51_F4260',
       'L3_S51_F4262', 'Response'], dtype=object)
cols.shape:  (970,)
F0.shape:  (1183747, 2)

Id is a key shared with the date and categorical files. You can see that there are 968 explanatory variables (970 columns minus Id and Response) relating to defects, and that the number of samples is very large at 1,183,747. Each variable holds real values with many missing entries, for example:

              Id  L0_S0_F0  Response
0              4     0.030         0
1              6       NaN         0
2              7     0.088         0
3              9    -0.036         0
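As a quick sanity check (my own addition, reusing F0 from check_numeric.py above), the fraction of missing values per loaded column can be printed directly:

# Fraction of NaN entries in each column of F0 loaded above.
print F0.isnull().mean()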

Categorical data

Look at the categorical data in the same way. TRAIN_CAT is the path to train_categorical.csv.

check_category.py


import pandas as pd

# Read only the header row to get the categorical column names.
cat_cols = pd.read_csv(TRAIN_CAT, nrows=1).columns.values
print 'cat_cols: ', cat_cols
print 'cat_cols.shape: ', cat_cols.shape

# Load just the Id column and the first categorical feature.
cats = pd.read_csv(TRAIN_CAT, usecols=(cat_cols[:2].tolist()))
print 'cats.shape: ', cats.shape
print cats

This is the execution result.

cat_cols: ['Id' 'L0_S1_F25' 'L0_S1_F27' ..., 'L3_S49_F4237' 'L3_S49_F4239'
 'L3_S49_F4240']

cat_cols.shape:  (2141,)

cats.shape:  (1183747, 2)

              Id L0_S1_F25
0              4       NaN
1              6       NaN
2              7       NaN
3              9       NaN
4             11       NaN
5             13       NaN
6             14       NaN
7             16       NaN
8             18       NaN

The number of samples is the same as in the numeric data, and there are 2,141 columns, roughly double the numeric count. Note that 'Response' is not included in the categorical file.

Date data

Finally, let's look at the date file. TRAIN_DATE is the path to train_date.csv.

check_date.py


import pandas as pd

# Read only the header row to get the date column names.
date_cols = pd.read_csv(TRAIN_DATE, nrows=1).columns.values
# Load just the Id column and the first date feature.
date = pd.read_csv(TRAIN_DATE, usecols=(date_cols[:2].tolist()))

print 'date_cols.shape: ', date_cols.shape
print date_cols
print 'date.shape: ', date.shape
print date

This is the execution result.

date_cols.shape:  (1157,)
['Id' 'L0_S0_D1' 'L0_S0_D3' ..., 'L3_S51_D4259' 'L3_S51_D4261'
 'L3_S51_D4263']
date.shape:  (1183747, 2)
              Id  L0_S0_D1
0              4     82.24
1              6       NaN
2              7   1618.70
3              9   1149.20
4             11    602.64
5             13   1331.66

There are 1,157 columns, slightly more than in the numeric file, and the number of samples is the same. As in "L0_S0_D1", the suffix of the variable name changes from F to D: for example, L0_S0_D1 is the timestamp of L0_S0_F0, and L0_S0_D3 is the timestamp of L0_S0_F2. I have not investigated why there are more date columns than numeric feature columns.
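Based on that observed pattern (an assumption drawn from the examples above, not an official rule), a date column name can be mapped back to the numeric column it timestamps by decrementing the number:

import re

def date_to_numeric_col(date_col):
    """Map e.g. 'L0_S0_D1' -> 'L0_S0_F0' (the D index appears to be the F index + 1)."""
    prefix, d_idx = re.match(r'(L\d+_S\d+)_D(\d+)', date_col).groups()
    return '{0}_F{1}'.format(prefix, int(d_idx) - 1)

print date_to_numeric_col('L0_S0_D1')   # -> L0_S0_F0
print date_to_numeric_col('L0_S0_D3')   # -> L0_S0_F2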

4. Forum

Here are some of the notable exchanges I found while browsing the forums. The direct solutions and sample programs that appeared in the forum are summarized in the other articles, so here I focus on know-how and general discussion.

4.1. What to do first with unknown multivariate data

You have the data and the labels, but you don't know what to do first. There was a helpful post for people in that situation.


  1. Since it may partly be raw data, first visualize it as a table. Color the missing values and the values above a threshold so you can get a rough visual overview of the whole data set.
  2. For every variable, plot the cumulative distribution for each value of the label you want to estimate on the same graph, and check whether any pattern emerges.
  3. Pick up important variables using some importance criterion such as XGBoost's feature_importance, gini, or entropy.
  4. Plot all pairwise combinations of the features selected in 3 as scatter plots.

Steps 2 to 4 are introduced with concrete code in section 4 (EDA of important features) of Kaggle Summary: BOSCH (kernels); sample code for each step also follows below.

First, sample code I wrote for step 2. It outputs violin plots of all variables to image files; set the paths as you like.

import pandas as pd
import numpy as np
import matplotlib as mpl
mpl.use('Agg')  # render figures to files without a display
import matplotlib.pyplot as plt
import seaborn as sns


DATA_DIR = "../input"


TRAIN_NUMERIC = "{0}/train_numeric.csv".format(DATA_DIR)
TEST_NUMERIC = "{0}/test_numeric.csv".format(DATA_DIR)

# Process the columns in batches of 100 so the whole file never sits in memory at once.
COL_BATCH = 100
numeric_cols = pd.read_csv(TRAIN_NUMERIC, nrows=1).columns.values

for n_ in range(len(numeric_cols) / COL_BATCH):
    # Load only the current batch of columns plus the label.
    cols = numeric_cols[(n_ * COL_BATCH):(n_ * COL_BATCH + COL_BATCH)].tolist()
    train = pd.read_csv(TRAIN_NUMERIC, index_col=0, usecols=(cols + ['Response']))

    # Melt the batch into long format, 10 variables per subplot.
    BATCH_SIZE = 10
    dummy = []
    source = train.drop('Response', axis=1)

    for n in list(range(0, train.shape[1], BATCH_SIZE)):
        data = source.iloc[:, n:n + BATCH_SIZE]
        data_cols = data.columns.tolist()
        dummy.append(pd.melt(pd.concat([data, train.Response], axis=1),
                             id_vars='Response', value_vars=data_cols))

    # One subplot per group of 10 variables; split violins compare Response = 0 vs 1.
    FIGSIZE = (3 * BATCH_SIZE, 4 * (COL_BATCH / BATCH_SIZE))
    _, axs = plt.subplots(len(dummy), figsize=FIGSIZE)
    for data, ax in zip(dummy, axs):
        v_plots = sns.violinplot(x='variable', y='value', hue='Response',
                                 data=data, ax=ax, split=True)
    v_plots.get_figure().savefig("violin_{0}.jpg".format(n_))
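For step 3, here is a rough sketch of ranking features with XGBoost's importance scores (my own addition, not from the forum; the column subset and model parameters are arbitrary assumptions):

import pandas as pd
import xgboost as xgb

TRAIN_NUMERIC = "../input/train_numeric.csv"

# Use a manageable subset of the numeric columns (index 0 is 'Id', so skip it).
numeric_cols = pd.read_csv(TRAIN_NUMERIC, nrows=1).columns.values
cols = numeric_cols[1:201].tolist()
train = pd.read_csv(TRAIN_NUMERIC, usecols=(cols + ['Response']))
X, y = train[cols].fillna(-999), train['Response']

# A small gradient-boosted model; the importances are only used for ranking.
clf = xgb.XGBClassifier(n_estimators=50, max_depth=4, learning_rate=0.1)
clf.fit(X, y)

importance = pd.Series(clf.feature_importances_, index=cols)
print importance.sort_values(ascending=False).head(20)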

Next, sample code for the scatter plots in step 4.

Data containing as many missing values as this cannot be shown as a scatter plot as it is, so fill each variable with its median before plotting. Below is a sample.

import pandas as pd
import seaborn as sns

# Features picked by importance (e.g. from step 3); TRAIN_NUMERIC is defined as above.
features_names = [
    'L0_S11_F298', 'L1_S24_F1672', 'L1_S24_F766', 'L1_S24_F1844',
    'L1_S24_F1632', 'L1_S24_F1723', 'L1_S24_F1846', 'L1_S25_F2761',
    'L1_S25_F2193'
]
features = pd.read_csv(TRAIN_NUMERIC, usecols=(features_names + ['Response']))

# Fill the missing values in each feature column with that column's median.
for f in features.columns[:-1]:
    features[f] = features[f].fillna(features[f].median())

# Balance the classes so the plots are not dominated by non-defective samples.
X_neg, X_pos = features[features['Response'] == 0], features[features['Response'] == 1]
volumes = len(X_pos) if len(X_pos) < len(X_neg) else len(X_neg)
features = pd.concat([X_pos, X_neg.iloc[:volumes]]).reset_index(drop=True)

# Scatter plots of every pair of selected features, colored by Response.
g = sns.pairplot(features, hue="Response", vars=features.columns.tolist()[:-1], markers='.')

4.2. What to do if the data is too huge

It seems that the data for this competition can be reduced considerably with some preprocessing, after which it can be handled even on a laptop with about 8 GB of memory.

Discussion 1


a) The categorical data contains duplicate columns, so drop the duplicates.
b) As explained in the kernel introduction article, more than 95% of the date columns are duplicated within each station. Dropping these makes the date features usable.
c) Use all of the numeric data.

Points a and b come from actually analyzing this particular data set; clearly, not only general analysis techniques but also approaches tailored to the individuality of each data set matter. As for c, the post says to use all of the numeric data, but my PC froze; it does not seem feasible to load it with a plain pandas read_csv as it is.
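As a rough illustration of point a (my own sketch, not raddar's actual procedure; the path and row count are assumptions), duplicate columns can be detected on a sample of rows and then excluded when re-reading the full file:

import pandas as pd

TRAIN_CAT = "../input/train_categorical.csv"

# Identify candidate duplicate columns on a sample of rows, since the full
# categorical file does not fit comfortably in 8 GB of memory.
sample = pd.read_csv(TRAIN_CAT, nrows=10000, dtype=str)
dup_mask = sample.T.duplicated()           # True for columns identical to an earlier one
keep_cols = sample.columns[~dup_mask.values].tolist()
print 'keeping %d of %d columns' % (len(keep_cols), sample.shape[1])

# Re-read the full file with only the surviving columns.
train_cat = pd.read_csv(TRAIN_CAT, usecols=keep_cols, dtype=str)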

Discussion 2


As in raddar's comment, with preprocessing (removing perfectly correlated features and duplicate features) everything, including the computation cost, fits within 8 GB of memory. I was not able to achieve such sophisticated preprocessing myself; I hope the Winner's Code article makes it clear.

4.3. Using correlation heat maps and feature generation

Although there are many variables, the raw data alone does not seem to give a clear signal, so features are generated from the correlations between variables. In another article, "Find variables useful for classification problems from heatmaps of correlation coefficients", I explained how to visualize with a heat map how the correlation coefficients change when a defective product occurs.

This method searches for combinations of variables whose correlation breaks down when a defective product occurs, and turns each combination into a new variable with PCA (principal component analysis). As explained in 4.1, the data contains a large number of missing values, so they are filled with the median first. At that point it may also be useful to generate an extra indicator feature that is 1 where a value was imputed and 0 where it was not.
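A minimal sketch of this idea (the column pair below is hypothetical, not a pair actually found to have a broken correlation): median imputation, a missing-value indicator, and PCA compressing a variable pair into one new feature.

import pandas as pd
from sklearn.decomposition import PCA

TRAIN_NUMERIC = "../input/train_numeric.csv"
pair = ['L0_S0_F0', 'L0_S0_F2']   # hypothetical pair of correlated variables

df = pd.read_csv(TRAIN_NUMERIC, usecols=(['Id'] + pair + ['Response']))

for col in pair:
    # Indicator feature: 1 where the value was missing (and then imputed), else 0.
    df[col + '_was_nan'] = df[col].isnull().astype(int)
    df[col] = df[col].fillna(df[col].median())

# Compress the imputed pair into a single principal component as a new feature.
pca = PCA(n_components=1)
df['pca_' + '_'.join(pair)] = pca.fit_transform(df[pair].values)[:, 0]
print df.head()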

4.4. Solving classification problems using deep learning

Recently, Keras on top of TensorFlow has been used very actively on Kaggle. Deep learning seems prone to overfitting and not well suited to data as extremely imbalanced as this, even when the sampling ratio is adjusted or dropout is used. Still, some people wanted to try a Keras approach, so I will add an explanation later if I have time.
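For reference, a minimal sketch of what such a Keras approach might look like (my own addition, not a solution from the forum; the feature subset, layer sizes, and class weights are arbitrary assumptions):

import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout

TRAIN_NUMERIC = "../input/train_numeric.csv"

# A small subset of numeric features (column 0 is 'Id', so skip it).
numeric_cols = pd.read_csv(TRAIN_NUMERIC, nrows=1).columns.values
cols = numeric_cols[1:101].tolist()
train = pd.read_csv(TRAIN_NUMERIC, usecols=(cols + ['Response'])).fillna(0)
X, y = train[cols].values, train['Response'].values

model = Sequential()
model.add(Dense(64, input_dim=X.shape[1], activation='relu'))
model.add(Dropout(0.5))                     # dropout to fight overfitting
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

# Weight the rare positive class more heavily (labels are roughly 1:1000).
model.fit(X, y, batch_size=1024, class_weight={0: 1.0, 1: 1000.0})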
