scikit-learn's impute module fills in missing data as a preprocessing step for machine learning. I checked how it behaves using some simple data.
data = {
'A': [a for a in range(10)],
'B': [a * 2 for a in range(10)],
'C': [a * 3 for a in range(10)],
'D': [a * 4 for a in range(10)],
}
import pandas as pd
data = pd.DataFrame(data)
data
| | A | B | C | D |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
1 | 1 | 2 | 3 | 4 |
2 | 2 | 4 | 6 | 8 |
3 | 3 | 6 | 9 | 12 |
4 | 4 | 8 | 12 | 16 |
5 | 5 | 10 | 15 | 20 |
6 | 6 | 12 | 18 | 24 |
7 | 7 | 14 | 21 | 28 |
8 | 8 | 16 | 24 | 32 |
9 | 9 | 18 | 27 | 36 |
import numpy as np
data2 = data.copy()  # copy so the original DataFrame stays intact
# Introduce missing values: one in B, two in C, two consecutive ones in D
data2.loc[2, 'B'] = np.nan
data2.loc[3, 'C'] = np.nan
data2.loc[5, 'C'] = np.nan
data2.loc[6, 'D'] = np.nan
data2.loc[7, 'D'] = np.nan
data2
| | A | B | C | D |
---|---|---|---|---|
0 | 0 | 0.0 | 0.0 | 0.0 |
1 | 1 | 2.0 | 3.0 | 4.0 |
2 | 2 | NaN | 6.0 | 8.0 |
3 | 3 | 6.0 | NaN | 12.0 |
4 | 4 | 8.0 | 12.0 | 16.0 |
5 | 5 | 10.0 | NaN | 20.0 |
6 | 6 | 12.0 | 18.0 | NaN |
7 | 7 | 14.0 | 21.0 | NaN |
8 | 8 | 16.0 | 24.0 | 32.0 |
9 | 9 | 18.0 | 27.0 | 36.0 |
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data2)
As shown above, column B now contains one missing value, column C contains two separate missing values, and column D contains two consecutive missing values.
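As a quick sanity check, the number of missing values per column can be counted with pandas. This is a minimal sketch that rebuilds the same frame so it runs on its own; it assumes float columns from the start (`dtype=float`), which differs slightly from the article's frame, and the name `missing_per_column` is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame so this snippet is self-contained.
# Assumption: all columns are created as float to avoid dtype upcasting.
data2 = pd.DataFrame({
    'A': [a for a in range(10)],
    'B': [a * 2 for a in range(10)],
    'C': [a * 3 for a in range(10)],
    'D': [a * 4 for a in range(10)],
}, dtype=float)
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan

# Count the missing values in each column.
missing_per_column = data2.isna().sum()
print(missing_per_column)
```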
SimpleImputer
The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be filled with a specified constant or with a statistic (mean, median, or most frequent value) of the column in which they occur.
default(mean)
By default, missing values are filled with the column mean.
from sklearn.impute import SimpleImputer
imp = SimpleImputer()  # same as SimpleImputer(missing_values=np.nan, strategy='mean')
data3 = pd.DataFrame(imp.fit_transform(data2))
data3
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.0 | 0.000000 | 0.000 | 0.0 |
1 | 1.0 | 2.000000 | 3.000 | 4.0 |
2 | 2.0 | 9.555556 | 6.000 | 8.0 |
3 | 3.0 | 6.000000 | 13.875 | 12.0 |
4 | 4.0 | 8.000000 | 12.000 | 16.0 |
5 | 5.0 | 10.000000 | 13.875 | 20.0 |
6 | 6.0 | 12.000000 | 18.000 | 16.0 |
7 | 7.0 | 14.000000 | 21.000 | 16.0 |
8 | 8.0 | 16.000000 | 24.000 | 32.0 |
9 | 9.0 | 18.000000 | 27.000 | 36.0 |
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data3)
As the plot shows, filling with the mean can look unnatural depending on the kind of data.
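To see where the fill values 9.555556, 13.875 and 16.0 come from, the fitted imputer's `statistics_` attribute can be compared with the column means that pandas computes while skipping NaN. This is a small sketch on a rebuilt copy of the data; creating the frame with `dtype=float` is an assumption made for convenience.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Rebuild the example frame (assumption: float columns from the start).
data2 = pd.DataFrame({
    'A': [a for a in range(10)],
    'B': [a * 2 for a in range(10)],
    'C': [a * 3 for a in range(10)],
    'D': [a * 4 for a in range(10)],
}, dtype=float)
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan

imp = SimpleImputer(strategy='mean')
imp.fit(data2)

# pandas skips NaN by default, so these are means of the observed values only.
column_means = data2.mean()
print(column_means)
print(imp.statistics_)  # one fill value per column
```

Every missing cell in a column receives the same value, which is why the filled points sit on a horizontal level instead of following the trend.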
median
You can also fill in the missing values with the median.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='median')
data4 = pd.DataFrame(imp.fit_transform(data2))
data4
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 2.0 | 3.0 | 4.0 |
2 | 2.0 | 10.0 | 6.0 | 8.0 |
3 | 3.0 | 6.0 | 15.0 | 12.0 |
4 | 4.0 | 8.0 | 12.0 | 16.0 |
5 | 5.0 | 10.0 | 15.0 | 20.0 |
6 | 6.0 | 12.0 | 18.0 | 14.0 |
7 | 7.0 | 14.0 | 21.0 | 14.0 |
8 | 8.0 | 16.0 | 24.0 | 32.0 |
9 | 9.0 | 18.0 | 27.0 | 36.0 |
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data4)
As with the mean, filling with the median can look unnatural depending on the data.
most_frequent
You can also fill with the mode (the most frequent value).
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data5 = pd.DataFrame(imp.fit_transform(data2))
data5
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 2.0 | 3.0 | 4.0 |
2 | 2.0 | 0.0 | 6.0 | 8.0 |
3 | 3.0 | 6.0 | 0.0 | 12.0 |
4 | 4.0 | 8.0 | 12.0 | 16.0 |
5 | 5.0 | 10.0 | 0.0 | 20.0 |
6 | 6.0 | 12.0 | 18.0 | 0.0 |
7 | 7.0 | 14.0 | 21.0 | 0.0 |
8 | 8.0 | 16.0 | 24.0 | 32.0 |
9 | 9.0 | 18.0 | 27.0 | 36.0 |
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data5)
When there is no unique mode, as here where every value occurs exactly once, the smallest of the tied values is used, which is 0 in every column.
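The tie-breaking can be checked on a column where the smallest value is not the first one. The array `X` below is a made-up example, not part of the article's data; with every observed value occurring once, the fill value should be the smallest observed value rather than the first.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single column: all observed values are unique,
# and the smallest value (1.0) is not the first entry (5.0).
X = np.array([[5.0], [3.0], [np.nan], [1.0], [4.0]])

imp = SimpleImputer(strategy='most_frequent')
filled = imp.fit_transform(X)

print(imp.statistics_)  # fill value chosen for the column
print(filled.ravel())   # column after imputation
```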
constant
You can also fill with a fixed value of your choice.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=99)
data6 = pd.DataFrame(imp.fit_transform(data2))
data6
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 2.0 | 3.0 | 4.0 |
2 | 2.0 | 99.0 | 6.0 | 8.0 |
3 | 3.0 | 6.0 | 99.0 | 12.0 |
4 | 4.0 | 8.0 | 12.0 | 16.0 |
5 | 5.0 | 10.0 | 99.0 | 20.0 |
6 | 6.0 | 12.0 | 18.0 | 99.0 |
7 | 7.0 | 14.0 | 21.0 | 99.0 |
8 | 8.0 | 16.0 | 24.0 | 32.0 |
9 | 9.0 | 18.0 | 27.0 | 36.0 |
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data6)
Unsurprisingly, the result looks very unnatural!
KNNImputer
The KNNImputer class fills in missing values using the k-Nearest Neighbors approach. By default, nan_euclidean_distances, a Euclidean distance metric that supports missing values, is used to find the nearest neighbors. Each missing feature is imputed from the neighbors that have a value for that feature, and the neighbors' values are either averaged uniformly or weighted by the distance to each neighbor. If a sample is missing more than one feature, its neighbors may differ depending on which feature is being imputed. If fewer than n_neighbors neighbors are available and there is no defined distance to the training set, the training set average for that feature is used for imputation. If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors is used.
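To make the distance metric concrete, here is a small sketch that compares `nan_euclidean_distances` with a hand calculation for two rows of the example data. As I understand the scikit-learn documentation, only coordinates present in both rows are used, and the squared distance is scaled by the total number of coordinates divided by the number of coordinates present in both; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two rows of the example data: row 7 (D missing) and row 5 (C missing).
row7 = np.array([[7.0, 14.0, 21.0, np.nan]])
row5 = np.array([[5.0, 10.0, np.nan, 20.0]])

dist = nan_euclidean_distances(row7, row5)[0, 0]

# Hand calculation: only A and B are present in both rows (2 of 4 columns),
# so the squared differences are scaled by 4 / 2 before taking the root.
manual = np.sqrt(4 / 2 * ((7 - 5) ** 2 + (14 - 10) ** 2))

print(dist, manual)
```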
n_neighbors=2
Let's explicitly set the number of neighbors to n_neighbors=2.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
data7 = pd.DataFrame(imputer.fit_transform(data2))
data7
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 2.0 | 3.0 | 4.0 |
2 | 2.0 | 4.0 | 6.0 | 8.0 |
3 | 3.0 | 6.0 | 9.0 | 12.0 |
4 | 4.0 | 8.0 | 12.0 | 16.0 |
5 | 5.0 | 10.0 | 15.0 | 20.0 |
6 | 6.0 | 12.0 | 18.0 | 18.0 |
7 | 7.0 | 14.0 | 21.0 | 26.0 |
8 | 8.0 | 16.0 | 24.0 | 32.0 |
9 | 9.0 | 18.0 | 27.0 | 36.0 |
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data7)
The fill is not very good where values are missing in two consecutive rows.
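One way to see why row 7 of column D ends up as 26 instead of 28 is to repeat the neighbor search by hand. The sketch below rebuilds the data (with float columns as a convenience assumption), finds the two nearest rows that actually have a value in D, and averages them. Row 6 cannot be a donor because its own D value is missing, so the next nearest row on that side is row 5.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Rebuild the example frame (assumption: float columns from the start).
data2 = pd.DataFrame({
    'A': [a for a in range(10)],
    'B': [a * 2 for a in range(10)],
    'C': [a * 3 for a in range(10)],
    'D': [a * 4 for a in range(10)],
}, dtype=float)
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan
X = data2.to_numpy()

# Only rows with an observed D value can donate a value for column D.
donors = np.flatnonzero(~np.isnan(X[:, 3]))
dist = nan_euclidean_distances(X[[7]], X[donors])[0]

# Take the two closest donors and average their D values.
nearest = donors[np.argsort(dist)[:2]]
manual = X[nearest, 3].mean()

imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(nearest, manual, imputed[7, 3])
```

The two donors are rows 8 and 5, whose D values are 32 and 20, so the average is pulled toward the lower side of the gap.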
default(n_neighbors=5)
By default, up to five neighbors are considered (n_neighbors=5).
from sklearn.impute import KNNImputer
imputer = KNNImputer()
data8 = pd.DataFrame(imputer.fit_transform(data2))
data8
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 2.0 | 3.0 | 4.0 |
2 | 2.0 | 5.2 | 6.0 | 8.0 |
3 | 3.0 | 6.0 | 12.0 | 12.0 |
4 | 4.0 | 8.0 | 12.0 | 16.0 |
5 | 5.0 | 10.0 | 16.2 | 20.0 |
6 | 6.0 | 12.0 | 18.0 | 23.2 |
7 | 7.0 | 14.0 | 21.0 | 23.2 |
8 | 8.0 | 16.0 | 24.0 | 32.0 |
9 | 9.0 | 18.0 | 27.0 | 36.0 |
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data8)
Column D is filled somewhat better, but the fills in columns B and C have become slightly worse.
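Since neighbors can also be weighted by distance, one variation that might be worth trying is `weights='distance'`, which gives closer rows more influence than distant ones. This is only a sketch and was not part of the experiment above; the name `data9` is hypothetical, and whether the result looks more natural depends on the data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Rebuild the example frame (assumption: float columns from the start).
data2 = pd.DataFrame({
    'A': [a for a in range(10)],
    'B': [a * 2 for a in range(10)],
    'C': [a * 3 for a in range(10)],
    'D': [a * 4 for a in range(10)],
}, dtype=float)
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan

# Same five neighbors as the default, but weighted by inverse distance.
imputer = KNNImputer(n_neighbors=5, weights='distance')
data9 = pd.DataFrame(imputer.fit_transform(data2), columns=data2.columns)
print(data9)
```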
There is probably no perfect way to fill in missing values, so consider the characteristics of your data and choose the method that suits it best!