This is an update on a Kaggle competition I participated in in the past. Here I cover an introduction to the BOSCH data and the prominent discussions in the forum. For the winner's code and the useful kernels, see Kaggle Summary: BOSCH (winner) and [Kaggle Summary: BOSCH (kernels)](http://qiita.com/TomHortons/items/359f8e39b9cd424c2360), which summarize the solutions and discussions. (The forum contents will be added later.)
This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in a Jupyter notebook (adjust `%matplotlib inline` as appropriate). If you find any errors when running the sample scripts, a comment would be appreciated.
BOSCH is a global manufacturer of mechanical parts. It operates its own plants around the world, and, inevitably, the finished products occasionally include defective parts. The economic loss from these defects is large, and their causes are complicated. The goal of this competition is therefore to predict whether each part will be defective or non-defective from observation data collected at BOSCH's manufacturing plants.
The notable points of this competition are as follows.
The evaluation metric is MCC (Matthews Correlation Coefficient).
As for the labels, defective parts occur at a rate of roughly 1 in 1,000, so if you simply predict every part as non-defective, accuracy already exceeds 99%. Using the F-measure or MCC makes it possible to evaluate such extremely imbalanced data properly.
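As a quick illustration, MCC is available in scikit-learn as `sklearn.metrics.matthews_corrcoef`. The toy labels below are made up just to show why accuracy is misleading here:

```python
# Toy example (made-up labels): accuracy looks great on imbalanced data
# even for a useless predictor, while MCC does not.
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = np.array([0] * 999 + [1])            # roughly 1 defect per 1,000 parts
y_pred_all_ok = np.zeros(1000, dtype=int)     # always predict "non-defective"
y_pred_perfect = np.array([0] * 999 + [1])    # catches the single defect

print 'accuracy (all zero): ', accuracy_score(y_true, y_pred_all_ok)      # 0.999
# MCC is defined as 0 when a denominator term is zero (some versions warn).
print 'MCC (all zero):      ', matthews_corrcoef(y_true, y_pred_all_ok)   # 0.0
print 'MCC (perfect):       ', matthews_corrcoef(y_true, y_pred_perfect)  # 1.0
```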
The submission file is a CSV that maps Id to Response:
```
Id,Response
1,0
2,1
3,0
etc.
```
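As a minimal sketch of producing this file (assuming `TEST_NUMERIC` is the path to test_numeric.csv and `predictions` is whatever your model outputs, here just a placeholder of zeros):

```python
# Hypothetical example: write predictions in the required Id,Response format.
import numpy as np
import pandas as pd

test_ids = pd.read_csv(TEST_NUMERIC, usecols=['Id'])['Id']
predictions = np.zeros(len(test_ids), dtype=int)   # placeholder; replace with model output

submission = pd.DataFrame({'Id': test_ids, 'Response': predictions})
submission.to_csv('submission.csv', index=False)
```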
There are three types of data files.
'Response' = 1 indicates a defective part and 'Response' = 0 a non-defective one. All of the data is (very) large and anonymized, and every column name follows the pattern "line_station_feature". For example, "L3_S36_F3939" is the numeric value of feature 3939 measured at station 36 on production line 3.
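Since every column name follows this pattern, the line, station, and feature numbers can be recovered by simply splitting the name. A small sketch (the helper name is my own):

```python
# Split a column name like "L3_S36_F3939" into line, station, and feature ids.
def parse_column(col):
    line, station, feature = col.split('_')
    return int(line[1:]), int(station[1:]), int(feature[1:])

print parse_column('L3_S36_F3939')   # (3, 36, 3939)
```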
### numeric data

The numeric data is so large that simply loading all of it will stall the program on a laptop, so start by checking only the column names and the number of samples. TRAIN_NUMERIC is the path to train_numeric.csv.
check_numeric.py
```python
import numpy as np
import pandas as pd

# Read only the header row to get the column names.
numeric_cols = pd.read_csv(TRAIN_NUMERIC, nrows = 1).columns.values
print numeric_cols
print 'cols.shape: ', numeric_cols.shape

# Load only the first columns plus the label to check the sample count.
F0 = pd.read_csv(TRAIN_NUMERIC, usecols=(numeric_cols[:2].tolist() + ['Response']))
print 'F0.shape: ', F0.shape
```
The output looks like this.
```
array(['Id', 'L0_S0_F0', 'L0_S0_F2', 'L0_S0_F4', 'L0_S0_F6', 'L0_S0_F8',
       'L0_S0_F10', 'L0_S0_F12', 'L0_S0_F14', 'L0_S0_F16', 'L0_S0_F18',
       'L0_S0_F20', 'L0_S0_F22', 'L0_S1_F24', 'L0_S1_F28', 'L0_S2_F32',
       'L0_S2_F36', 'L0_S2_F40', 'L0_S2_F44', 'L0_S2_F48', 'L0_S2_F52',
       'L0_S2_F56', 'L0_S2_F60', 'L0_S2_F64', 'L0_S3_F68', 'L0_S3_F72',
       .....
       'L3_S50_F4245', 'L3_S50_F4247', 'L3_S50_F4249', 'L3_S50_F4251',
       'L3_S50_F4253', 'L3_S51_F4256', 'L3_S51_F4258', 'L3_S51_F4260',
       'L3_S51_F4262', 'Response'], dtype=object)
cols.shape: (970,)
F0.shape: (1183747, 2)
```
Id is a key shared with the date and categorical files. Excluding Id and Response, there are 968 explanatory variables, and the number of samples is very large at 1,183,747. Each variable contains real values and missing values, as shown below.
```
   Id  L0_S0_F0  Response
0   4     0.030         0
1   6       NaN         0
2   7     0.088         0
3   9    -0.036         0
```
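Because loading the whole table at once is what exhausts memory, one option is to read the file in chunks (a sketch, reusing TRAIN_NUMERIC from above; the chunk size is arbitrary):

```python
# Count samples and defects from train_numeric.csv without loading it whole.
n_rows, n_pos = 0, 0
for chunk in pd.read_csv(TRAIN_NUMERIC, usecols=['Id', 'Response'], chunksize=100000):
    n_rows += len(chunk)
    n_pos += chunk['Response'].sum()

print 'samples: ', n_rows
print 'defect rate: ', float(n_pos) / n_rows
```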
### categorical data

Check the categorical data in the same way. TRAIN_CAT is the path to train_categorical.csv.
check_category.py
```python
cat_cols = pd.read_csv(TRAIN_CAT, nrows = 1).columns.values
print 'cat_cols: ', cat_cols
print 'cat_cols.shape: ', cat_cols.shape

cats = pd.read_csv(TRAIN_CAT, usecols=(cat_cols[:2].tolist()))
print 'cats.shape: ', cats.shape
print cats
```
This is the execution result.
```
cat_cols:  ['Id' 'L0_S1_F25' 'L0_S1_F27' ..., 'L3_S49_F4237' 'L3_S49_F4239'
 'L3_S49_F4240']
cat_cols.shape:  (2141,)
cats.shape:  (1183747, 2)
   Id L0_S1_F25
0   4       NaN
1   6       NaN
2   7       NaN
3   9       NaN
4  11       NaN
5  13       NaN
6  14       NaN
7  16       NaN
8  18       NaN
```
The number of samples is the same as the numeric data, while the number of variables is 2,141, almost double. 'Response' is not included in the categorical data.
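To get a feel for what the categorical columns actually contain, the value counts of a few columns can be inspected (a sketch; the columns are picked arbitrarily and most entries are NaN):

```python
# Check the value distribution of a handful of categorical columns.
some_cols = cat_cols[1:6].tolist()   # skip 'Id'
sample = pd.read_csv(TRAIN_CAT, usecols=some_cols)
for c in some_cols:
    print c
    print sample[c].value_counts(dropna=False).head()
```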
### date data

Finally, look at the date file. TRAIN_DATE is the path to train_date.csv.
check_date.py
```python
date_cols = pd.read_csv(TRAIN_DATE, nrows = 1).columns.values
date = pd.read_csv(TRAIN_DATE, usecols=(date_cols[:2].tolist()))

print 'date_cols.shape: ', date_cols.shape
print date_cols
print 'date.shape: ', date.shape
print date
```
This is the execution result.
```
date_cols.shape:  (1157,)
['Id' 'L0_S0_D1' 'L0_S0_D3' ..., 'L3_S51_D4259' 'L3_S51_D4261'
 'L3_S51_D4263']
date.shape:  (1183747, 2)
   Id  L0_S0_D1
0   4     82.24
1   6       NaN
2   7   1618.70
3   9   1149.20
4  11    602.64
5  13   1331.66
```
The number of variables is 1,157, somewhat more than in the numeric data, and the number of samples is the same. As in "L0_S0_D1", the suffix of the variable name changes from F to D: L0_S0_D1 is the timestamp of L0_S0_F0, and L0_S0_D3 is the timestamp of L0_S0_F2. I have not investigated why there are more date variables than numeric ones.
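Since each D column is just the timestamp of the corresponding F column, one simple way to use the date file is to compute, per sample, the earliest and latest timestamps and their difference. A sketch, again reading in chunks to stay within memory:

```python
# Per-sample first/last timestamp and duration from train_date.csv.
starts, ends = [], []
for chunk in pd.read_csv(TRAIN_DATE, index_col=0, chunksize=100000):
    starts.append(chunk.min(axis=1))   # earliest timestamp over all stations
    ends.append(chunk.max(axis=1))     # latest timestamp over all stations

start_time = pd.concat(starts)
duration = pd.concat(ends) - start_time
print duration.describe()
```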
Now there is data and there are labels, but it is not obvious what to do first. There was a helpful forum post for exactly that situation.
Items 2 to 4 of that post are covered with concrete code in section 4 (EDA of important features) of Kaggle Summary: BOSCH (kernels).
The following script writes violin plots of all variables to image files. Set the paths as you like.
```python
from scipy import stats
import pandas as pd
import numpy as np
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns

DATA_DIR = "../input"
TRAIN_NUMERIC = "{0}/train_numeric.csv".format(DATA_DIR)
TEST_NUMERIC = "{0}/test_numeric.csv".format(DATA_DIR)

COL_BATCH = 100
numeric_cols = pd.read_csv(TRAIN_NUMERIC, nrows = 1).columns.values

# Process the numeric columns in batches of COL_BATCH to keep memory usage down.
for n_ in range(len(numeric_cols)/COL_BATCH):
    cols = numeric_cols[(n_*COL_BATCH):(n_*COL_BATCH+COL_BATCH)].tolist()
    train = pd.read_csv(TRAIN_NUMERIC, index_col = 0, usecols=(cols + ['Response']))
    X_neg, X_pos = train[train['Response'] == 0].iloc[:, :-1], train[train['Response']==1].iloc[:, :-1]

    # Melt BATCH_SIZE variables at a time into long format for seaborn.
    BATCH_SIZE = 10
    dummy = []
    source = train.drop('Response', axis=1)
    for n in list(range(0, source.shape[1], BATCH_SIZE)):
        data = source.iloc[:, n:n+BATCH_SIZE]
        data_cols = data.columns.tolist()
        dummy.append(pd.melt(pd.concat([data, train.Response], axis=1), id_vars = 'Response', value_vars = data_cols))

    # One subplot row per group of BATCH_SIZE variables; one image per column batch.
    FIGSIZE = (3*(BATCH_SIZE), 4*(COL_BATCH/BATCH_SIZE))
    _, axs = plt.subplots(len(dummy), figsize = FIGSIZE)
    for data, ax in zip(dummy, axs):
        v_plots = sns.violinplot(x = 'variable', y = 'value', hue = 'Response', data = data, ax = ax, split = True)
    v_plots.get_figure().savefig("violin_{0}.jpg".format(n_))
```
Data with this many missing values cannot be displayed as a scatter plot as-is, so fill each variable with its median before plotting. Below is a sample.
```python
import pandas as pd
import numpy as np
import seaborn as sns

features_names = [
    'L0_S11_F298', 'L1_S24_F1672', 'L1_S24_F766', 'L1_S24_F1844',
    'L1_S24_F1632', 'L1_S24_F1723', 'L1_S24_F1846', 'L1_S25_F2761',
    'L1_S25_F2193'
]
features = pd.read_csv(TRAIN_NUMERIC, index_col = 0, usecols=(features_names + ['Response'])).reset_index()

# Fill the missing values of each feature with that feature's median.
for f in features.columns[:-1]:
    features[f] = features[f].fillna(features[f].median())

# Downsample so that both classes contribute the same number of samples.
X_neg, X_pos = features[features['Response'] == 0], features[features['Response']==1]
volumes = len(X_pos) if len(X_pos)<len(X_neg) else len(X_neg)
features = pd.concat([X_pos[:volumes], X_neg[:volumes]]).reset_index(drop=True)

g = sns.pairplot(features, hue="Response", vars=features.columns.tolist()[:-1], markers='.')
```
It seems this data can be reduced considerably with some ingenuity in preprocessing (after which it can be handled even on a laptop with about 8 GB of memory).
a) The categorical data contains duplicated columns, so drop the duplicates.
b) As explained in the kernel introduction article, more than 95% of the date columns are duplicated within each station. Dropping them makes the date features usable.
c) Use all of the numeric data.
Points a and b come out of analyzing this particular data set; clearly, approaches tailored to the individual character of the data matter as much as general-purpose analysis techniques. Point c says to use all the numeric data, but my PC froze when I tried, so it does not seem feasible with a plain pandas.read_csv. A rough sketch of a and b is shown below.
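The sketch below is my own rough illustration of points a and b, not the exact preprocessing used by the top teams. For the date file it simply keeps the first date column of each station rather than verifying the duplicates exactly:

```python
# Point (b): keep one date column per station, since within a station the
# date columns are (almost always) identical.
date_cols = pd.read_csv(TRAIN_DATE, nrows=1).columns.tolist()

seen_stations = set()
keep = ['Id']
for col in date_cols[1:]:          # e.g. 'L0_S0_D1'
    station = col.split('_')[1]    # e.g. 'S0'
    if station not in seen_stations:
        seen_stations.add(station)
        keep.append(col)

print 'date columns kept: ', len(keep), '/', len(date_cols)
date_small = pd.read_csv(TRAIN_DATE, usecols=keep)

# Point (a): drop exact duplicate columns, shown here on the reduced date
# frame (the same idea applies to the categorical data, batch by batch).
date_small = date_small.T.drop_duplicates().T
```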
As raddar commented, with preprocessing of this kind (removing perfectly correlated features and duplicate features), everything, including the computation cost, fits within 8 GB of memory. I could not reproduce preprocessing that sophisticated myself; I hope the winner's code makes it clear.
Although there are a great many variables, the raw values by themselves do not seem to discriminate defects noticeably, so features are generated from the correlations between variables instead. In another article, I explained how to visualize, with a heatmap, how the correlation coefficients change when a defective part occurs: Find variables useful for classification problems from heatmaps of correlation coefficients.
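A minimal sketch of that idea, reusing `features_names` and the imports from the pairplot example above: compute the correlation matrix separately for non-defective and defective samples and plot the difference.

```python
# Correlation matrices for Response == 0 and Response == 1, and their difference.
feats = pd.read_csv(TRAIN_NUMERIC, usecols=(features_names + ['Response']))
corr_neg = feats[feats['Response'] == 0][features_names].corr()
corr_pos = feats[feats['Response'] == 1][features_names].corr()
corr_diff = corr_pos - corr_neg

# Pairs with a large |difference| are candidates whose correlation "breaks"
# when a defect occurs.
sns.heatmap(corr_diff, vmin=-1, vmax=1, cmap='RdBu_r')
plt.savefig('corr_diff.jpg')
```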
In this method, we search for combinations of variables whose correlation breaks down when a defective part occurs, and compress each such pair into a new variable with PCA (principal component analysis). As explained in 4.1, the data contains a large number of missing values, so they are first filled with the median. At that point it may also be worth generating a new indicator variable that is 1 where a value was imputed and 0 where it was not.
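A sketch of those two steps on a single (arbitrarily chosen) pair of columns: median imputation with an added missing-value indicator, followed by PCA to compress the pair into one component.

```python
# Median-impute a pair of columns, keep missing-value indicators, and
# compress the pair into a single PCA component.
from sklearn.decomposition import PCA

pair = ['L1_S24_F1672', 'L1_S24_F1844']   # example pair; substitute a pair found above
df = pd.read_csv(TRAIN_NUMERIC, usecols=pair)

for c in pair:
    df[c + '_was_nan'] = df[c].isnull().astype(int)   # 1 where the value was missing
    df[c] = df[c].fillna(df[c].median())

pca = PCA(n_components=1)
df['pair_pca'] = pca.fit_transform(df[pair].values).ravel()
print df.head()
```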
Recently, Keras (on top of TensorFlow) has been used very actively on Kaggle. Deep learning appears prone to overfitting here and not well suited to data as extremely imbalanced as this, even with adjusted sampling or dropout. Still, some people wanted to try a Keras approach, so if I have time later I will add an explanation.