In machine learning preprocessing, you often have to deal with data that contains missing values, as shown below.
If you do not handle these missing values appropriately, the following problems occur.
1. Statistics such as the mean and standard deviation cannot be obtained. With data that contains missing values, you cannot compute the mean or standard deviation of the ratings of all six respondents, which makes many kinds of analysis difficult.
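As a minimal illustration of this point (the ratings below are made-up values), computing the mean directly with NumPy simply propagates the missing values:
import numpy as np

# Hypothetical ratings from six respondents; two did not answer (NaN)
ratings = np.array([4.0, 3.0, np.nan, 5.0, np.nan, 2.0])

print(np.mean(ratings))     # nan -- the missing values propagate
print(np.nanmean(ratings))  # 3.5 -- computed only over the four observed ratings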
For these reasons, missing values need to be handled appropriately, and to handle them appropriately you first need to understand the mechanism by which they occur.
There are three patterns for the mechanism by which missing values occur.
MCAR Abbreviation for Missing Completely At Random: the case where the probability that a value is missing is unrelated to any of the data.
In terms of the data example above, this would be the case when values are lost because of a defect in the file that records the questionnaire results, regardless of age, sex, or rating, or when part of a handwritten questionnaire is accidentally rubbed away during transport.
MAR Abbreviation for Missing At Random: the case where, given the observed data, the probability that an item is missing can be estimated from the observed items other than the missing one.
This abstract definition is hard to grasp, so in terms of the data example above: women answer the age question in the questionnaire with some fixed probability, and men answer it with a different probability. For the data to be MAR, we must also assume that age itself does not affect whether age is missing. In that case, given the sex data, age is missing at random with a certain probability conditional on sex.
NMAR Abbreviation for Not Missing At Random: the case where the probability that an item is missing depends on the item itself, so the missingness of that item cannot be estimated from the other data items.
In terms of the data example above, this is the situation where the older people are, the less likely they are to answer the age question, while being male or female has no effect on whether age is missing, so the missingness cannot be predicted from the sex value. When data are missing due to NMAR, the imputation methods described later may not be able to fix it, so you need to consider recollecting the data.
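As a rough sketch of the three mechanisms (the columns and probabilities below are made up for illustration), missingness can be simulated like this: under MCAR every row has the same chance of losing its age, under MAR the chance depends only on the observed sex column, and under NMAR it depends on age itself.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical questionnaire data: age, sex, and a rating from 1 to 5
df = pd.DataFrame({
    'age':  rng.integers(20, 70, n).astype(float),
    'sex':  rng.choice(['male', 'female'], n),
    'rate': rng.integers(1, 6, n).astype(float),
})

# MCAR: every age value goes missing with the same 10% probability
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, 'age'] = np.nan

# MAR: the probability that age is missing depends only on the observed sex column
mar = df.copy()
p = np.where(mar['sex'] == 'female', 0.30, 0.05)
mar.loc[rng.random(n) < p, 'age'] = np.nan

# NMAR: the probability that age is missing depends on the age value itself
nmar = df.copy()
p = np.clip((nmar['age'] - 20) / 100, 0, 1)   # older respondents answer less often
nmar.loc[rng.random(n) < p, 'age'] = np.nan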
We have seen that there are three mechanisms by which missing values occur: MCAR, MAR, and NMAR.
Next, let's look at how to deal with the missing values produced by each mechanism.
MCAR In the case of MCAR, missing values occur completely at random, so even if you perform listwise deletion (a method that deletes every data row containing a missing value, introduced in the data cleansing course), no bias is introduced into the data. However, when deletion would leave you with very few data rows, you can instead fill in the missing values with the imputation methods described later.
MAR In the case of MAR, deleting the rows that contain missing values biases the data. For example, if women tend not to answer the age question and you delete the rows whose age is missing, the analysis results will strongly reflect the male answers. In this case, consider filling in the missing values with an imputation method (described later) that predicts, from the observed data, the true value that should have been there. A small simulation of this bias is shown below.
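The following sketch uses hypothetical data and probabilities: when women rate higher on average and also skip the age question more often, listwise deletion keeps mostly male rows and pulls the mean rating down.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: women rate higher on average and skip the age question more often (MAR)
sex = rng.choice(['male', 'female'], n)
df = pd.DataFrame({
    'sex':  sex,
    'age':  rng.integers(20, 70, n).astype(float),
    'rate': np.where(sex == 'female', 4.0, 3.0) + rng.normal(0, 0.5, n),
})

# Age goes missing with 40% probability for women and 5% for men
p_missing = np.where(df['sex'] == 'female', 0.40, 0.05)
df.loc[rng.random(n) < p_missing, 'age'] = np.nan

print(df['rate'].mean())            # about 3.5 on the full data
print(df.dropna()['rate'].mean())   # clearly lower: listwise deletion keeps mostly male answers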
NMAR In the case of NMAR, deleting the rows that contain missing values also biases the data, and because the missing item itself affects whether it is missing, the missing values cannot be predicted from the other observed data items. Therefore, you should basically consider recollecting the data.
There are two main approaches to imputing missing values.
Single imputation: fill in the missing values with mean imputation (see data cleansing), stochastic regression imputation, the hot deck method, and so on, creating just one complete dataset.
Multiple imputation: one "set" here means data in which the missing values have been predicted and filled in from the observed data. This method creates several such sets, builds an analysis model on each of them, and finally integrates the individually built models.
When thinking about how to deal with missing values, you first need to find out whether your data contains any.
For this, there are two convenient approaches:
checking quickly with pandas' isnull(), and
visualizing the missing values with a package such as missingno.
For example, using isnull() on data containing missing values such as the data introduced earlier:
import pandas as pd
data = pd.read_csv('./8000_data_preprocessing_data/missing_data_3_1_3.csv')
data.isnull().sum()
If you run the DataFrame's isnull() method together with the sum() method, which totals each column, you get the following result and can see how many missing values each column contains.
#Output result
rate 1
age 2
sex 2
dtype: int64
With missingno, you can easily visualize the overall picture of the missing values. Pass the pandas DataFrame to missingno's matrix function as shown below.
import missingno as msno
%matplotlib inline
data = pd.read_csv('./8000_data_preprocessing_data/missing_data_3_1_3.csv')
msno.matrix(data)
You will get an image like the following. The white parts of the image are missing values. On the right side of the image, a vertical line graph shows the number of non-missing data items in each row.
Of the two numbers displayed on the line graph, the left one is the number of non-missing items in the row with the most missing values, and the right one is the number in the row with the fewest. For the data shown in the image, the row with the most missing values still has 3 non-missing items, so you can see that at most one of the four data items is missing in any row.
Single imputation is used in different ways depending on the purpose, as follows.
1. When you only need the mean of the missing data item
In this case, use the mean imputation method introduced in data cleansing. However, as more values are filled in with the mean, the variance of that data item shrinks, so this method is not suitable when your analysis needs to take variance or error into account (see the sketch after this list).
2. When you want your data analysis to take the variance of the missing data item into account
In this case, use a method such as stochastic regression imputation, which includes an error term reflecting the variance in the imputed value.
3. When there is a lot of qualitative data and it is hard to compute imputed values parametrically with regression or the like
In this case, use a non-parametric method (one that makes no assumptions about parameters) called the hot deck method. In the hot deck method, the missing value of a data row containing a missing value (called the recipient) is filled in with the value of that item from another data row that does have it (called the donor). When looking for a donor to complement a recipient, the proximity of the data is judged with the k-nearest neighbors method, and the value of a nearby donor is used.
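As a minimal sketch of point 1 (with made-up ratings), mean imputation with pandas fills every gap with the same value, which visibly shrinks the variance:
import numpy as np
import pandas as pd

# Hypothetical ratings with two missing values
s = pd.Series([4.0, 3.0, np.nan, 5.0, np.nan, 2.0])

filled = s.fillna(s.mean())  # mean imputation: both NaNs become 3.5

print(s.var())       # variance of the four observed values (about 1.67)
print(filled.var())  # smaller (1.0) -- filling with the mean shrinks the variance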
Here, we explain single imputation with the hot deck method using the knnimpute package's knn_impute_few_observed(matrix, missing_mask, k) function.
The main arguments of the knn_impute_few_observed(matrix, missing_mask, k) function are the three below. See the docstring on the official page for more information.
matrix: The data to impute, specified as NumPy np.matrix-type data.
missing_mask: A boolean matrix indicating where the missing values are in matrix. It must have the same shape as the matrix argument.
k: The number of neighboring points to take into account, as in KNN (see Supervised Learning (Classification)).
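Below is a minimal sketch of how these arguments fit together, assuming the knnimpute package is installed. The matrix values are made up, a plain NumPy array is used for simplicity, and the function is assumed to return a filled copy of the matrix; check the docstring for the exact expected types and return value.
import numpy as np
from knnimpute import knn_impute_few_observed

# Hypothetical data matrix with NaN marking the missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [1.5, 2.5, 5.0],
    [8.0, 9.0, 10.0],
])

missing_mask = np.isnan(X)  # True where a value is missing; same shape as X

# Fill each missing entry from the k nearest rows (donors) that do have that value
X_filled = knn_impute_few_observed(X, missing_mask, k=2)
print(X_filled)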
In most cases, after filling in the missing values with an imputation method, what you ultimately want to do is estimate the parameters of an analysis model using the entire dataset, including the imputed values.
For example, with a regression model you estimate the regression coefficients; with other machine learning models you likewise want to estimate some parameters.
With single imputation, each missing value is, literally, filled in with a single value. You can then build an analysis model on the completed data and estimate its parameter values.
However, the values filled in by single imputation are predictions, and predictions carry uncertainty (that is, prediction error). If you estimate the parameters of the analysis model without taking this into account, the results will not reflect the parameters' true uncertainty, which can lead to insufficient analysis or misleading conclusions.
Multiple imputation addresses this problem by performing the imputation multiple times: it creates multiple datasets, estimates the parameters of the analysis model separately on each, and finally combines the results into one.
The flow of a multiple imputation analysis that creates three datasets by imputing three times is shown below.
Let's actually perform multiple imputation with a linear multiple regression model as an example.
As the data containing missing values, we use the data visualized in "Visualization of missing values".
This data has four items: the price of a house and three items related to the price (distance from the station, age, and size in m²).
To perform multiple imputation, statsmodels' MICE (Multiple Imputation by Chained Equations) is convenient.
In this exercise, we will predict the price by linear multiple regression from the data of distance from the station, age, and size.
To write this linear multiple regression model in statsmodels, express it as a string using the DataFrame column names, as follows.
'price ~ distance + age + m2'
To briefly explain this notation: the objective variable here is price, and there are three explanatory variables: distance, age, and m2.
The explanatory variables are combined linearly with +, and you do not need to write the coefficients in the model. A ~ is written between the objective variable and the explanatory variables, meaning that price is predicted by the three variables on the right-hand side of the ~. For details on how to write model formulas in statsmodels, refer to the statsmodels official website.
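As a quick sketch of this formula notation on complete data (the values below are hypothetical; the actual exercise data contains missing values and goes through MICE below), statsmodels' formula API accepts exactly this kind of string:
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical, complete housing data just to show the formula syntax
df = pd.DataFrame({
    'price':    [3200, 2800, 4100, 3600, 2500],  # price of the house
    'distance': [5, 12, 3, 7, 15],               # distance from the station
    'age':      [10, 25, 2, 8, 30],              # age of the building
    'm2':       [60, 55, 80, 70, 50],            # size in square meters
})

model = smf.ols('price ~ distance + age + m2', data=df).fit()
print(model.params)  # an intercept plus one coefficient per explanatory variable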
In statsmodels' MICE, you execute functions in the following order to perform multiple imputation.
For a multiple regression model such as 'y ~ x1 + x2 + x3 + x4', sample code that follows this order looks like this.
import statsmodels.api as sm
from statsmodels.imputation import mice

imp_data = mice.MICEData(data)
formula = 'y ~ x1 + x2 + x3 + x4'
model = mice.MICE(formula, sm.OLS, imp_data)
results = model.fit(10, 10)
print(results.summary())
In more detail, each line does the following.
imp_data = mice.MICEData(data)
# 1. Create a MICEData object that manages the imputed data with the mice.MICEData(data) function
formula = 'y ~ x1 + x2 + x3 + x4'
# 2. Create the linear model formula (an expression like the 'price ~ distance + age + m2' explained earlier)
model = mice.MICE(formula, sm.OLS, imp_data)
# 3. Create the analysis model (a MICE object) with the mice.MICE(formula, optimizer, imp_data) function
#    (here, since this is multiple regression, sm.OLS is used as the optimizer)
results = model.fit(10, 10)  # 10 cycles per imputation, 10 imputed datasets
# 4. Use the MICE object's fit(n_burnin, n_imputation) method to fit the model to the data
#    and obtain the results (a MICEResults object)
#    (the first argument of fit is how many times the process is repeated for one imputation,
#     and the second argument is the number of imputed datasets to create)
print(results.summary())
# 5. Check the content of the optimized results with the MICEResults object's summary() method
A sample of the resulting output looks like this:
Results: MICE
=================================================================
Method: MICE Sample size: 1000
Model: OLS Scale 1.00
Dependent variable: y Num. imputations 10
-----------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975] FMI
-----------------------------------------------------------------
Intercept -0.0234 0.0318 -0.7345 0.4626 -0.0858 0.0390 0.0128
x1 1.0305 0.0578 17.8342 0.0000 0.9172 1.1437 0.0309
x2 -0.0134 0.0162 -0.8282 0.4076 -0.0451 0.0183 0.0236
x3 -1.0260 0.0328 -31.2706 0.0000 -1.0903 -0.9617 0.0169
x4 -0.0253 0.0336 -0.7520 0.4521 -0.0911 0.0406 0.0269
=================================================================
The final result is a set of parameter estimates obtained by integrating the multiple imputed datasets. From this sample you can read off the weights with which x1, x2, x3, and x4 relate to y.
Outliers are data points that deviate greatly from the rest of the data. If outliers are mixed into the data, the following problems occur.
This section describes how to detect and exclude outliers.
The easiest first step is to visualize the data to see whether there are any outliers.
For this visualization, you can use seaborn's boxplot.
boxplot is a function that draws a so-called boxplot, as shown in the following figure. Outliers are displayed with diamond marks.
The main arguments of the boxplot function are:
In the case of the previous figure, it is specified as follows.
import pandas as pd
import seaborn as sns
data = pd.read_csv('outlier_322.csv')
sns.boxplot(y=data['height'])
If the data is two-dimensional, using jointplot makes it easy to see whether there are outliers.
Since jointplot does not have a function to mark outliers with diamonds, check visually whether there are any. The main arguments of the jointplot function are:
Here is a script that displays a jointplot of the well-known iris dataset:
import pandas as pd
import seaborn as sns
#Read iris data
iris = sns.load_dataset("iris")
sns.jointplot(x='sepal_width', y='petal_length', data=iris)
To exclude outliers, you need to detect which data points are outliers according to some criterion. There are various detection methods.
First, we will introduce LOF (Local Outlier Factor), which detects outliers based on the density of data.
LOF has the following features.
Outlier judgment with LOF can be done easily with scikit-learn.
With scikit-learn's LOF, first create a classification model with the LocalOutlierFactor function, specifying its parameters.
The main parameter is the number of neighboring points to use (n_neighbors); check the official documentation for the other parameters. Write it as follows.
clf = LocalOutlierFactor(n_neighbors=20)
Next, use the model's fit_predict(data) method to fit the data and detect outliers. You can pass a pandas DataFrame directly as the argument data.
predictions = clf.fit_predict(data)
As the return value of the fit_predict() method, you get the following array.
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
-1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
The value is -1 for data rows that are considered outliers, and 1 for data rows that are considered normal values.
Using this result, you can index the data as follows to get the rows that were judged to be outliers in the original data.
data[predictions == -1]
The following figure plots the iris data points predicted to be outliers with n_neighbors = 20.
Here is a full usage example:
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/outlier_ex.csv')
clf = LocalOutlierFactor(n_neighbors=20)
predictions = clf.fit_predict(data)
data[predictions == -1]
LOF is a method that judges outliers based on the density of data points. Here we introduce another method: Isolation Forest.
Isolation Forest has the following features.
A brief outline of the algorithm, using outlier detection on iris as an example, is as follows:
Since the point that can no longer be split apart (the red point at the bottom left of the figure) was isolated with two splits, its depth is recorded as 2.
Repeat steps 1-3 to calculate the average depth of each point.
Outliers then have a smaller average depth (they are likely to be easy to separate from the other points), so data points with a small average depth can be judged to be outliers.
As with LOF, you can use scikit-learn to predict outliers with Isolation Forest.
First, create a classification model with the IsolationForest() function.
clf = IsolationForest()
Next, fit the model to the data with the fit() method.
clf.fit(data)
Then use the predict() method to judge and predict outliers.
predictions = clf.predict(data)
As the return value of the predict() method, the following array is obtained.
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1,
1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
-1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
The value is -1 for data rows that are considered outliers, and 1 for data rows that are considered normal values.
Using this result, you can specify the data as follows to get the rows that were considered outliers in the original data.
data[predictions == -1]
The plot of the iris data points predicted to be outliers by Isolation Forest looks like the following figure.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/outlier_ex.csv')
#Example of use.
clf = IsolationForest()
clf.fit(data)
predictions = clf.predict(data)
data[predictions == -1]
Imbalanced data arises with categorical or binary data items (rather than continuous numeric ones) when a particular value of the item appears far too often or far too rarely, so that the frequencies of the values are unbalanced.
Specifically, suppose a data item takes the values 0 and 1, and 990 out of 1000 records are 1 while only 10 are 0.
In such a case, if you always predict 1 for that item, the prediction will be correct 99% of the time. However, if predicting the 0 cases is the important requirement, such a prediction model is completely useless.
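A quick sketch with made-up labels makes the problem concrete: a model that always predicts 1 scores 99% accuracy yet never catches a single 0.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 ones and 10 zeros
y_true = np.array([1] * 990 + [0] * 10)

# A "model" that always predicts 1
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.99 -- looks impressive
print(recall_score(y_true, y_pred, pos_label=0))  # 0.0  -- the 0 cases are never predicted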
Therefore, before training a machine learning model on imbalanced data, we sometimes adjust the data to prevent the misleading predictions that the imbalance would cause.
The first thing to do before adjusting imbalanced data is to check whether the data is actually imbalanced.
For this, the pandas value_counts() method is easy to use.
For example, to check the frequency of the survival values in the famous Titanic dataset, write as follows.
titanic['survived'].value_counts()
Then you can know the frequency of each value as follows.
0 549
1 342
Name: survived, dtype: int64
There are three adjustment methods, as follows.
Oversampling: replicate the data rows that contain the infrequent value among the cases you want to predict, to increase their number. In the previous example, the rows containing 0, which make up only 1% of the data, are duplicated.
Undersampling: reduce the data rows that contain the frequent value. In the previous example, the rows containing 1, which make up 99% of the data, are reduced.
Combining both: in the previous example, the rows containing 1 are reduced and the rows containing 0 are increased.
Consider a case where you want to predict whether a customer will purchase a car, with 1 for purchased and 0 for not purchased. The data looks like this:
Positive example: data rows with 1 (purchased) in the training data
Negative example: data rows with 0 (not purchased)
When these positive and negative examples are imbalanced in the training data, that is, when there are few positive examples and overwhelmingly many negative examples (a lot of not-purchased data), you can ease the imbalance by randomly deleting negative examples.
This is undersampling.
For undersampling imbalanced data, imbalanced-learn is easy to use.
There are several methods, but here we use RandomUnderSampler, which randomly deletes data.
To reduce the majority (frequent) class with RandomUnderSampler, specify 'majority' for the ratio argument, as in the sample code below.
You can also specify finer ratios in dictionary format for ratio; refer to the official documentation for the values that ratio accepts.
rus = RandomUnderSampler(ratio = 'majority')
After creating the RandomUnderSampler, pass the data, split beforehand into the objective variable and the explanatory variables, as arguments as shown below to obtain the undersampled data.
X_resampled, y_resampled = rus.fit_sample(X, y)
import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/imbalanced_ex.csv')
y = data['purchased']
X = data.loc[:, ['income', 'age', 'num_of_children']]
# Usage example
rus = RandomUnderSampler(ratio = 'majority')
X_resampled, y_resampled = rus.fit_sample(X, y)
(X_resampled, y_resampled)
Undersampling resolved the imbalance by reducing the large number of negative examples. Conversely, oversampling resolves the imbalance by increasing the number of positive examples.
While undersampling only deletes existing data, oversampling needs some way of creating additional data.
There are several ways to do this, but the simplest is to randomly duplicate existing data rows.
To inflate data randomly, use imbalanced-learn's RandomOverSampler.
Its usage is almost the same as the RandomUnderSampler used for undersampling; use it as follows.
The only difference is that, to increase the infrequent examples, you specify 'minority' for ratio.
As with RandomUnderSampler, you can specify a finer ratio for ratio; refer to the official documentation for the detailed argument patterns.
ros = RandomOverSampler(ratio = 'minority')
X_resampled, y_resampled = ros.fit_sample(X, y)
import numpy as np
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/imbalanced_ex.csv')
y = data['purchased']
X = data.loc[:, ['income', 'age', 'num_of_children']]
# Usage example
ros = RandomOverSampler(ratio = 'minority')
X_resampled, y_resampled = ros.fit_sample(X, y)
(X_resampled, y_resampled)
When adjusting imbalanced data, there is also a method that uses both oversampling and undersampling rather than just one of them.
In imbalanced-learn you can use SMOTE-ENN, which uses SMOTE (Synthetic Minority Over-sampling Technique) for the oversampling and ENN (Edited Nearest Neighbours) for the undersampling.
SMOTE has the following features.
ENN has the following features.
SMOTE-ENN is used in much the same way as RandomUnderSampler and RandomOverSampler.
The difference is that you specify the k_neighbors value used by SMOTE and the n_neighbors value used by ENN. Check the official documentation for detailed descriptions of the other optional parameters. The sample code is as follows.
sm_enn = SMOTEENN(smote=SMOTE(k_neighbors=3), enn=EditedNearestNeighbours(n_neighbors=3))
X_resampled, y_resampled = sm_enn.fit_sample(X, y)
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/imbalanced_ex.csv')
y = data['purchased']
X = data.loc[:, ['income', 'age', 'num_of_children']]
#Basic use is like this
sm_enn = SMOTEENN(smote=SMOTE(k_neighbors=3), enn=EditedNearestNeighbours(n_neighbors=3))
X_resampled, y_resampled = sm_enn.fit_sample(X, y)
(X_resampled, y_resampled)