In machine learning preprocessing, you often have to deal with data that contains missing values, as shown below.
If you do not handle these missing values appropriately, the following problems occur.
1. Statistics such as the mean and standard deviation cannot be obtained. With data that contains missing values, you cannot compute the mean or standard deviation of the ratings of all six respondents, which makes many kinds of analysis difficult.
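As a minimal illustration of this point (the ratings below are made-up values), computing the mean directly with NumPy simply propagates the missing values:
import numpy as np

# Hypothetical ratings from six respondents; two did not answer (NaN)
ratings = np.array([4.0, 3.0, np.nan, 5.0, np.nan, 2.0])

print(np.mean(ratings))     # nan -- the missing values propagate
print(np.nanmean(ratings))  # 3.5 -- computed only over the four observed ratings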
For these reasons, missing values need to be handled appropriately, and to handle them appropriately you first need to understand the mechanism by which they occur.
There are three patterns for the mechanism by which missing values occur.
MCAR Abbreviation for Missing Completely At Random: the case where the probability that a value is missing is unrelated to any of the data.
In terms of the data example above, this would be the case when values are lost because of a defect in the file that records the questionnaire results, regardless of age, sex, or rating, or when part of a handwritten questionnaire is accidentally rubbed away during transport.
MAR Abbreviation for Missing At Random: the case where, given the observed data, the probability that an item is missing can be estimated from the observed items other than the missing one.
This abstract definition is hard to grasp, so in terms of the data example above: women answer the age question in the questionnaire with some fixed probability, and men answer it with a different probability. For the data to be MAR, we must also assume that age itself does not affect whether age is missing. In that case, given the sex data, age is missing at random with a certain probability conditional on sex.
NMAR Abbreviation for Not Missing At Random: the case where the probability that an item is missing depends on the item itself, so the missingness of that item cannot be estimated from the other data items.
In terms of the data example above, this is the situation where the older people are, the less likely they are to answer the age question, while being male or female has no effect on whether age is missing, so the missingness cannot be predicted from the sex value. When data are missing due to NMAR, the imputation methods described later may not be able to fix it, so you need to consider recollecting the data.
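As a rough sketch of the three mechanisms (the columns and probabilities below are made up for illustration), missingness can be simulated like this: under MCAR every row has the same chance of losing its age, under MAR the chance depends only on the observed sex column, and under NMAR it depends on age itself.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical questionnaire data: age, sex, and a rating from 1 to 5
df = pd.DataFrame({
    'age':  rng.integers(20, 70, n).astype(float),
    'sex':  rng.choice(['male', 'female'], n),
    'rate': rng.integers(1, 6, n).astype(float),
})

# MCAR: every age value goes missing with the same 10% probability
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, 'age'] = np.nan

# MAR: the probability that age is missing depends only on the observed sex column
mar = df.copy()
p = np.where(mar['sex'] == 'female', 0.30, 0.05)
mar.loc[rng.random(n) < p, 'age'] = np.nan

# NMAR: the probability that age is missing depends on the age value itself
nmar = df.copy()
p = np.clip((nmar['age'] - 20) / 100, 0, 1)   # older respondents answer less often
nmar.loc[rng.random(n) < p, 'age'] = np.nan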
We have seen that there are three mechanisms by which missing values occur: MCAR, MAR, and NMAR.
Next, let's look at how to deal with the missing values produced by each mechanism.
MCAR In the case of MCAR, missing values occur completely at random, so even if you perform listwise deletion (a method that deletes every data row containing a missing value, introduced in the data cleansing course), no bias is introduced into the data. However, when deletion would leave you with very few data rows, you can instead fill in the missing values with the imputation methods described later.
MAR In the case of MAR, deleting the rows that contain missing values biases the data. For example, if women tend not to answer the age question and you delete the rows whose age is missing, the analysis results will strongly reflect the male answers. In this case, consider filling in the missing values with an imputation method (described later) that predicts, from the observed data, the true value that should have been there. A small simulation of this bias is shown below.
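The following sketch uses hypothetical data and probabilities: when women rate higher on average and also skip the age question more often, listwise deletion keeps mostly male rows and pulls the mean rating down.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: women rate higher on average and skip the age question more often (MAR)
sex = rng.choice(['male', 'female'], n)
df = pd.DataFrame({
    'sex':  sex,
    'age':  rng.integers(20, 70, n).astype(float),
    'rate': np.where(sex == 'female', 4.0, 3.0) + rng.normal(0, 0.5, n),
})

# Age goes missing with 40% probability for women and 5% for men
p_missing = np.where(df['sex'] == 'female', 0.40, 0.05)
df.loc[rng.random(n) < p_missing, 'age'] = np.nan

print(df['rate'].mean())            # about 3.5 on the full data
print(df.dropna()['rate'].mean())   # clearly lower: listwise deletion keeps mostly male answers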
NMAR In the case of NMAR, deleting the rows that contain missing values also biases the data, and because the missing item itself affects whether it is missing, the missing values cannot be predicted from the other observed data items. Therefore, you should basically consider recollecting the data.
There are two main approaches to imputing missing values.
Single imputation: fill in the missing values with mean imputation (see data cleansing), stochastic regression imputation, the hot deck method, and so on, creating just one complete dataset.
Multiple imputation: one "set" here means data in which the missing values have been predicted and filled in from the observed data. This method creates several such sets, builds an analysis model on each of them, and finally integrates the individually built models.
When thinking about how to deal with missing values, you first need to find out whether your data contains any.
For this, there are two convenient approaches:
checking quickly with pandas' isnull(), and
visualizing the missing values with a package such as missingno.
For example, using isnull() on data containing missing values such as the data introduced earlier:
import pandas as pd
data = pd.read_csv('./8000_data_preprocessing_data/missing_data_3_1_3.csv')
data.isnull().sum()
If you run the DataFrame's isnull() method together with the sum() method, which totals each column, you get the following result and can see how many missing values each column contains.
#Output result
rate 1
age 2
sex 2
dtype: int64
With missingno, you can easily visualize the overall picture of the missing values. Pass the pandas DataFrame to missingno's matrix function as shown below.
import missingno as msno
%matplotlib inline
data = pd.read_csv('./8000_data_preprocessing_data/missing_data_3_1_3.csv')
msno.matrix(data)
You will get an image like the following. The white parts of the image are missing values. On the right side of the image, a vertical line graph shows the number of non-missing data items in each row.
Of the two numbers displayed on the line graph, the left one is the number of non-missing items in the row with the most missing values, and the right one is the number in the row with the fewest. For the data shown in the image, the row with the most missing values still has 3 non-missing items, so you can see that at most one of the four data items is missing in any row.
Single imputation is used in different ways depending on the purpose, as follows.
1. When you only need the mean of the missing data item
In this case, use the mean imputation method introduced in data cleansing. However, as more values are filled in with the mean, the variance of that data item shrinks, so this method is not suitable when your analysis needs to take variance or error into account (see the sketch after this list).
2. When you want your data analysis to take the variance of the missing data item into account
In this case, use a method such as stochastic regression imputation, which includes an error term reflecting the variance in the imputed value.
3. When there is a lot of qualitative data and it is hard to compute imputed values parametrically with regression or the like
In this case, use a non-parametric method (one that makes no assumptions about parameters) called the hot deck method. In the hot deck method, the missing value of a data row containing a missing value (called the recipient) is filled in with the value of that item from another data row that does have it (called the donor). When looking for a donor to complement a recipient, the proximity of the data is judged with the k-nearest neighbors method, and the value of a nearby donor is used.
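As a minimal sketch of point 1 (with made-up ratings), mean imputation with pandas fills every gap with the same value, which visibly shrinks the variance:
import numpy as np
import pandas as pd

# Hypothetical ratings with two missing values
s = pd.Series([4.0, 3.0, np.nan, 5.0, np.nan, 2.0])

filled = s.fillna(s.mean())  # mean imputation: both NaNs become 3.5

print(s.var())       # variance of the four observed values (about 1.67)
print(filled.var())  # smaller (1.0) -- filling with the mean shrinks the variance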
Here, we explain single imputation with the hot deck method using the knnimpute package's knn_impute_few_observed(matrix, missing_mask, k) function.
The main arguments of the knn_impute_few_observed(matrix, missing_mask, k) function are the three below. See the docstring on the official page for more information.
matrix: The data to impute, specified as NumPy np.matrix-type data.
missing_mask: A boolean matrix indicating where the missing values are in matrix. It must have the same shape as the matrix argument.
k: The number of neighboring points to take into account, as in KNN (see Supervised Learning (Classification)).
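Below is a minimal sketch of how these arguments fit together, assuming the knnimpute package is installed. The matrix values are made up, a plain NumPy array is used for simplicity, and the function is assumed to return a filled copy of the matrix; check the docstring for the exact expected types and return value.
import numpy as np
from knnimpute import knn_impute_few_observed

# Hypothetical data matrix with NaN marking the missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [1.5, 2.5, 5.0],
    [8.0, 9.0, 10.0],
])

missing_mask = np.isnan(X)  # True where a value is missing; same shape as X

# Fill each missing entry from the k nearest rows (donors) that do have that value
X_filled = knn_impute_few_observed(X, missing_mask, k=2)
print(X_filled)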
In most cases, after filling in the missing values with an imputation method, what you ultimately want to do is estimate the parameters of an analysis model using the entire dataset, including the imputed values.
For example, with a regression model you estimate the regression coefficients; with other machine learning models you likewise want to estimate some parameters.
With single imputation, each missing value is, literally, filled in with a single value. You can then build an analysis model on the completed data and estimate its parameter values.
However, the values filled in by single imputation are predictions, and predictions carry uncertainty (that is, prediction error). If you estimate the parameters of the analysis model without taking this into account, the results will not reflect the parameters' true uncertainty, which can lead to insufficient analysis or misleading conclusions.
Multiple imputation addresses this problem by performing the imputation multiple times: it creates multiple datasets, estimates the parameters of the analysis model separately on each, and finally combines the results into one.
The flow of a multiple imputation analysis that creates three datasets by imputing three times is shown below.
Let's actually perform multiple imputation with a linear multiple regression model as an example.
As the data containing missing values, we use the data visualized in "Visualization of missing values".
This data has four items: the price of a house and three items related to the price (distance from the station, age, and size in m²).
To perform multiple imputation, statsmodels' MICE (Multiple Imputation by Chained Equations) is convenient.
In this exercise, we will predict the price by linear multiple regression from the data of distance from the station, age, and size.
To write this linear multiple regression model in statsmodels, express it as a string using the DataFrame column names, as follows.
'price ~ distance + age + m2'
To briefly explain this notation: the objective variable here is price, and there are three explanatory variables: distance, age, and m2.
The explanatory variables are combined linearly with +, and you do not need to write the coefficients in the model. A ~ is written between the objective variable and the explanatory variables, meaning that price is predicted by the three variables on the right-hand side of the ~. For details on how to write model formulas in statsmodels, refer to the statsmodels official website.
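As a quick sketch of this formula notation on complete data (the values below are hypothetical; the actual exercise data contains missing values and goes through MICE below), statsmodels' formula API accepts exactly this kind of string:
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical, complete housing data just to show the formula syntax
df = pd.DataFrame({
    'price':    [3200, 2800, 4100, 3600, 2500],  # price of the house
    'distance': [5, 12, 3, 7, 15],               # distance from the station
    'age':      [10, 25, 2, 8, 30],              # age of the building
    'm2':       [60, 55, 80, 70, 50],            # size in square meters
})

model = smf.ols('price ~ distance + age + m2', data=df).fit()
print(model.params)  # an intercept plus one coefficient per explanatory variable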
In statsmodels' MICE, you execute functions in the following order to perform multiple imputation.
For a multiple regression model such as 'y ~ x1 + x2 + x3 + x4', sample code that follows this order looks like this.
import statsmodels.api as sm
from statsmodels.imputation import mice

imp_data = mice.MICEData(data)
formula = 'y ~ x1 + x2 + x3 + x4'
model = mice.MICE(formula, sm.OLS, imp_data)
results = model.fit(10, 10)
print(results.summary())
In more detail, each line does the following.
imp_data = mice.MICEData(data)
# 1. Create a MICEData object that manages the imputed data with the mice.MICEData(data) function
formula = 'y ~ x1 + x2 + x3 + x4'
# 2. Create the linear model formula (an expression like the 'price ~ distance + age + m2' explained earlier)
model = mice.MICE(formula, sm.OLS, imp_data)
# 3. Create the analysis model (a MICE object) with the mice.MICE(formula, optimizer, imp_data) function
#    (here, since this is multiple regression, sm.OLS is used as the optimizer)
results = model.fit(10, 10)  # 10 cycles per imputation, 10 imputed datasets
# 4. Use the MICE object's fit(n_burnin, n_imputation) method to fit the model to the data
#    and obtain the results (a MICEResults object)
#    (the first argument of fit is how many times the process is repeated for one imputation,
#     and the second argument is the number of imputed datasets to create)
print(results.summary())
# 5. Check the content of the optimized results with the MICEResults object's summary() method
A sample of the resulting output looks like this:
Results: MICE
=================================================================
Method: MICE Sample size: 1000
Model: OLS Scale 1.00
Dependent variable: y Num. imputations 10
-----------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975] FMI
-----------------------------------------------------------------
Intercept -0.0234 0.0318 -0.7345 0.4626 -0.0858 0.0390 0.0128
x1 1.0305 0.0578 17.8342 0.0000 0.9172 1.1437 0.0309
x2 -0.0134 0.0162 -0.8282 0.4076 -0.0451 0.0183 0.0236
x3 -1.0260 0.0328 -31.2706 0.0000 -1.0903 -0.9617 0.0169
x4 -0.0253 0.0336 -0.7520 0.4521 -0.0911 0.0406 0.0269
=================================================================
The final result is a set of parameter estimates obtained by integrating the multiple imputed datasets. From this sample you can read off the weights with which x1, x2, x3, and x4 relate to y.
Outliers are data points that deviate greatly from the rest of the data. If outliers are mixed into the data, the following problems occur.
This section describes how to detect and exclude outliers.
The easiest first step is to visualize the data to see whether there are any outliers.
For this visualization, you can use seaborn's boxplot.
boxplot is a function that draws a so-called boxplot, as shown in the following figure. Outliers are displayed with diamond marks.
The main arguments of the boxplot function are:
In the case of the previous figure, it is specified as follows.
import pandas as pd
import seaborn as sns
data = pd.read_csv('outlier_322.csv')
sns.boxplot(y=data['height'])
If the data is two-dimensional, using jointplot makes it easy to see whether there are outliers.
Since jointplot does not have a function to mark outliers with diamonds, check visually whether there are any. The main arguments of the jointplot function are:
Here is a script that displays a jointplot of the well-known iris dataset:
import pandas as pd
import seaborn as sns
#Read iris data
iris = sns.load_dataset("iris")
sns.jointplot(x='sepal_width', y='petal_length', data=iris)
To exclude outliers, you need to detect which data points are outliers according to some criterion. There are various detection methods.
First, we will introduce LOF (Local Outlier Factor), which detects outliers based on the density of data.
LOF has the following features.
Outlier judgment with LOF can be done easily with scikit-learn.
With scikit-learn's LOF, first create a classification model with the LocalOutlierFactor function, specifying its parameters.
The main parameter is the number of neighboring points to use (n_neighbors); check the official documentation for the other parameters. Write it as follows.
clf = LocalOutlierFactor(n_neighbors=20)
Next, use the model's fit_predict(data) method to fit the data and detect outliers. You can pass a pandas DataFrame directly as the argument data.
predictions = clf.fit_predict(data)
As the return value of the fit_predict() method, you get the following array.
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
-1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
The value is -1 for data rows that are considered outliers, and 1 for data rows that are considered normal values.
Using this result, you can index the data as follows to get the rows that were judged to be outliers in the original data.
data[predictions == -1]
The following figure plots the iris data points predicted to be outliers with n_neighbors = 20.
Here is a full usage example:
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/outlier_ex.csv')
clf = LocalOutlierFactor(n_neighbors=20)
predictions = clf.fit_predict(data)
data[predictions == -1]
LOF is a method that judges outliers based on the density of data points. Here we introduce another method: Isolation Forest.
Isolation Forest has the following features.
A brief outline of the algorithm, using outlier detection on iris as an example, is as follows:
Since the point that can no longer be split apart (the red point at the bottom left of the figure) was isolated with two splits, its depth is recorded as 2.
Repeat steps 1-3 to calculate the average depth of each point.
Outliers then have a smaller average depth (they are likely to be easy to separate from the other points), so data points with a small average depth can be judged to be outliers.
As with LOF, you can use scikit-learn to predict outliers with Isolation Forest.
First, create a classification model with the IsolationForest() function.
clf = IsolationForest()
Next, fit the model to the data with the fit() method.
clf.fit(data)
Then use the predict() method to judge and predict outliers.
predictions = clf.predict(data)
As the return value of the predict() method, the following array is obtained.
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1,
1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
-1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
The value is -1 for data rows that are considered outliers, and 1 for data rows that are considered normal values.
Using this result, you can specify the data as follows to get the rows that were considered outliers in the original data.
data[predictions == -1]
The plot of the iris data points predicted to be outliers by Isolation Forest looks like the following figure.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/outlier_ex.csv')
#Example of use.
clf = IsolationForest()
clf.fit(data)
predictions = clf.predict(data)
data[predictions == -1]
Imbalanced data arises with categorical or binary data items (rather than continuous numeric ones) when a particular value of the item appears far too often or far too rarely, so that the frequencies of the values are unbalanced.
Specifically, suppose a data item takes the values 0 and 1, and 990 out of 1000 records are 1 while only 10 are 0.
In such a case, if you always predict 1 for that item, the prediction will be correct 99% of the time. However, if predicting the 0 cases is the important requirement, such a prediction model is completely useless.
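A quick sketch with made-up labels makes the problem concrete: a model that always predicts 1 scores 99% accuracy yet never catches a single 0.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 ones and 10 zeros
y_true = np.array([1] * 990 + [0] * 10)

# A "model" that always predicts 1
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.99 -- looks impressive
print(recall_score(y_true, y_pred, pos_label=0))  # 0.0  -- the 0 cases are never predicted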
Therefore, before training a machine learning model on imbalanced data, we sometimes adjust the data to prevent the misleading predictions that the imbalance would cause.
The first thing to do before adjusting imbalanced data is to check whether the data is actually imbalanced.
For this, the pandas value_counts() method is easy to use.
For example, to check the frequency of the survival values in the famous Titanic dataset, write as follows.
titanic['survived'].value_counts()
Then you can know the frequency of each value as follows.
0 549
1 342
Name: survived, dtype: int64
There are three adjustment methods, as follows.
Oversampling: replicate the data rows that contain the infrequent value among the cases you want to predict, to increase their number. In the previous example, the rows containing 0, which make up only 1% of the data, are duplicated.
Undersampling: reduce the data rows that contain the frequent value. In the previous example, the rows containing 1, which make up 99% of the data, are reduced.
Combining both: in the previous example, the rows containing 1 are reduced and the rows containing 0 are increased.
Consider a case where you want to predict whether a customer will purchase a car, with 1 for purchased and 0 for not purchased. The data looks like this:
Positive example: data rows with 1 (purchased) in the training data
Negative example: data rows with 0 (not purchased)
When these positive and negative examples are imbalanced in the training data, that is, when there are few positive examples and overwhelmingly many negative examples (a lot of not-purchased data), you can ease the imbalance by randomly deleting negative examples.
This is undersampling.
For undersampling imbalanced data, imbalanced-learn is easy to use.
There are several methods, but here we use RandomUnderSampler, which randomly deletes data.
To reduce the majority (frequent) class with RandomUnderSampler, specify 'majority' for the ratio argument, as in the sample code below.
You can also specify finer ratios in dictionary format for ratio; refer to the official documentation for the values that ratio accepts.
rus = RandomUnderSampler(ratio = 'majority')
After creating the RandomUnderSampler, pass the data, split beforehand into the objective variable and the explanatory variables, as arguments as shown below to obtain the undersampled data.
X_resampled, y_resampled = rus.fit_sample(X, y)
import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/imbalanced_ex.csv')
y = data['purchased']
X = data.loc[:, ['income', 'age', 'num_of_children']]
# Usage example
rus = RandomUnderSampler(ratio = 'majority')
X_resampled, y_resampled = rus.fit_sample(X, y)
(X_resampled, y_resampled)
Undersampling resolved the imbalance by reducing the large number of negative examples. Conversely, oversampling resolves the imbalance by increasing the number of positive examples.
While undersampling only deletes existing data, oversampling needs some way of creating additional data.
There are several ways to do this, but the simplest is to randomly duplicate existing data rows.
To inflate data randomly, use imbalanced-learn's RandomOverSampler.
Its usage is almost the same as the RandomUnderSampler used for undersampling; use it as follows.
The only difference is that, to increase the infrequent examples, you specify 'minority' for ratio.
As with RandomUnderSampler, you can specify a finer ratio for ratio; refer to the official documentation for the detailed argument patterns.
ros = RandomOverSampler(ratio = 'minority')
X_resampled, y_resampled = ros.fit_sample(X, y)
import numpy as np
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/imbalanced_ex.csv')
y = data['purchased']
X = data.loc[:, ['income', 'age', 'num_of_children']]
# Usage example
ros = RandomOverSampler(ratio = 'minority')
X_resampled, y_resampled = ros.fit_sample(X, y)
(X_resampled, y_resampled)
When adjusting imbalanced data, there is also a method that uses both oversampling and undersampling rather than just one of them.
In imbalanced-learn you can use SMOTE-ENN, which uses SMOTE (Synthetic Minority Over-sampling Technique) for the oversampling and ENN (Edited Nearest Neighbours) for the undersampling.
SMOTE has the following features.
ENN has the following features.
SMOTE-ENN is used in much the same way as RandomUnderSampler and RandomOverSampler.
The difference is that you specify the k_neighbors value used by SMOTE and the n_neighbors value used by ENN. Check the official documentation for detailed descriptions of the other optional parameters. The sample code is as follows.
sm_enn = SMOTEENN(smote=SMOTE(k_neighbors=3), enn=EditedNearestNeighbours(n_neighbors=3))
X_resampled, y_resampled = sm_enn.fit_sample(X, y)
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN
np.random.seed(0)
data = pd.read_csv('./8000_data_preprocessing_data/imbalanced_ex.csv')
y = data['purchased']
X = data.loc[:, ['income', 'age', 'num_of_children']]
#Basic use is like this
sm_enn = SMOTEENN(smote=SMOTE(k_neighbors=3), enn=EditedNearestNeighbours(n_neighbors=3))
X_resampled, y_resampled = sm_enn.fit_sample(X, y)
(X_resampled, y_resampled)