The scikit-learn impute module fills in missing data as a preprocessing step for machine learning. I checked how it behaves using simple data.

Test data creation

import pandas as pd

# Columns B, C, and D are multiples of A, so every column is a straight line.
data = pd.DataFrame({
    'A': [a for a in range(10)],
    'B': [a * 2 for a in range(10)],
    'C': [a * 3 for a in range(10)],
    'D': [a * 4 for a in range(10)],
})
data

	A	B	C	D
0	0	0	0	0
1	1	2	3	4
2	2	4	6	8
3	3	6	9	12
4	4	8	12	16
5	5	10	15	20
6	6	12	18	24
7	7	14	21	28
8	8	16	24	32
9	9	18	27	36

import numpy as np

# astype returns a copy; B, C, and D become float so they can hold NaN.
data2 = data.astype({'B': float, 'C': float, 'D': float})
data2.loc[2, 'B'] = np.nan
data2.loc[3, 'C'] = np.nan
data2.loc[5, 'C'] = np.nan
data2.loc[6, 'D'] = np.nan
data2.loc[7, 'D'] = np.nan
data2

	A	B	C	D
0	0	0.0	0.0	0.0
1	1	2.0	3.0	4.0
2	2	NaN	6.0	8.0
3	3	6.0	NaN	12.0
4	4	8.0	12.0	16.0
5	5	10.0	NaN	20.0
6	6	12.0	18.0	NaN
7	7	14.0	21.0	NaN
8	8	16.0	24.0	32.0
9	9	18.0	27.0	36.0

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data2)

As shown above, column B now contains one missing value, column C contains two separate missing values, and column D contains two consecutive missing values.
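A quick way to confirm where the gaps are is to count the NaN cells per column. The sketch below rebuilds the same test frame so that it runs on its own; the name missing_per_column is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the test frame: B, C, and D are multiples of A.
data2 = pd.DataFrame({
    'A': np.arange(10),
    'B': np.arange(10) * 2.0,
    'C': np.arange(10) * 3.0,
    'D': np.arange(10) * 4.0,
})

# Knock out the same five cells as in the article.
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan

# isna() marks each NaN cell; summing the booleans counts them per column.
missing_per_column = data2.isna().sum()
print(missing_per_column)
```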

SimpleImputer

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be replaced with a specified constant or with a statistic (mean, median, or most frequent value) of the column in which they occur.
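As a rough illustration of the statistic-based strategies, the sketch below fits one SimpleImputer per strategy on the test data and reads the fitted per-column fill values from its statistics_ attribute. The fill_values dictionary is a name made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Rebuild the test frame with the same missing cells as in the article.
data2 = pd.DataFrame({
    'A': np.arange(10),
    'B': np.arange(10) * 2.0,
    'C': np.arange(10) * 3.0,
    'D': np.arange(10) * 4.0,
})
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan

# After fit(), statistics_ holds one fill value per column.
fill_values = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imp = SimpleImputer(strategy=strategy).fit(data2)
    fill_values[strategy] = imp.statistics_

print(pd.DataFrame(fill_values, index=data2.columns))
```

Each statistic is computed from the non-missing values of its own column only, so the other columns have no influence on the fill value.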

default(mean)

By default, missing values are filled with the mean of the column.

from sklearn.impute import SimpleImputer

imp = SimpleImputer()  # same as SimpleImputer(missing_values=np.nan, strategy='mean')
data3 = pd.DataFrame(imp.fit_transform(data2))
data3

	0	1	2	3
0	0.0	0.000000	0.000	0.0
1	1.0	2.000000	3.000	4.0
2	2.0	9.555556	6.000	8.0
3	3.0	6.000000	13.875	12.0
4	4.0	8.000000	12.000	16.0
5	5.0	10.000000	13.875	20.0
6	6.0	12.000000	18.000	16.0
7	7.0	14.000000	21.000	16.0
8	8.0	16.000000	24.000	32.0
9	9.0	18.000000	27.000	36.0

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data3)

As shown above, depending on the type of data, filling with the mean can look unnatural. Here every column is a straight line, and the single mean value breaks the trend.
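Note that fit_transform returns a plain NumPy array, which is why the column labels in the output above became 0 to 3. If the original labels are wanted, one option is to pass them back to the DataFrame constructor, as in this small sketch (the name data3_named is made up here).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Rebuild the test frame with the same missing cells as in the article.
data2 = pd.DataFrame({
    'A': np.arange(10),
    'B': np.arange(10) * 2.0,
    'C': np.arange(10) * 3.0,
    'D': np.arange(10) * 4.0,
})
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan

imp = SimpleImputer(strategy='mean')

# Reattach the original column and index labels to the imputed array.
data3_named = pd.DataFrame(
    imp.fit_transform(data2),
    columns=data2.columns,
    index=data2.index,
)
print(data3_named)
```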

median

You can also fill in the missing values with the median.

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')
data4 = pd.DataFrame(imp.fit_transform(data2))
data4

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	1.0	2.0	3.0	4.0
2	2.0	10.0	6.0	8.0
3	3.0	6.0	15.0	12.0
4	4.0	8.0	12.0	16.0
5	5.0	10.0	15.0	20.0
6	6.0	12.0	18.0	14.0
7	7.0	14.0	21.0	14.0
8	8.0	16.0	24.0	32.0
9	9.0	18.0	27.0	36.0

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data4)

As with the mean, filling with the median can look unnatural depending on the data.

most_frequent

You can also fill in the missing values with the mode (the most frequent value).

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data5 = pd.DataFrame(imp.fit_transform(data2))
data5

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	1.0	2.0	3.0	4.0
2	2.0	0.0	6.0	8.0
3	3.0	6.0	0.0	12.0
4	4.0	8.0	12.0	16.0
5	5.0	10.0	0.0	20.0
6	6.0	12.0	18.0	0.0
7	7.0	14.0	21.0	0.0
8	8.0	16.0	24.0	32.0
9	9.0	18.0	27.0	36.0

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data5)

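In this data every value occurs only once, so there is no single mode. When several values are tied, scikit-learn uses the smallest of them, which here is 0 and also happens to be the first value in each column.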
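To check how ties are handled, the sketch below uses two tiny made-up columns that are not part of the test data: one where every value is unique and the smallest value is not the first, and one with a real mode.

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='most_frequent')

# All observed values are unique, and the smallest (3.0) is not the first.
ties = np.array([[5.0], [3.0], [np.nan], [7.0]])
filled_ties = imp.fit_transform(ties)
print(filled_ties.ravel())   # the NaN becomes 3.0, the smallest tied value

# Here 7.0 occurs twice, so it is a genuine mode.
with_mode = np.array([[5.0], [7.0], [7.0], [np.nan]])
filled_mode = imp.fit_transform(with_mode)
print(filled_mode.ravel())   # the NaN becomes 7.0
```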

constant

You can also fill in the missing values with a constant that you specify with fill_value.

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=99)
data6 = pd.DataFrame(imp.fit_transform(data2))
data6

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	1.0	2.0	3.0	4.0
2	2.0	99.0	6.0	8.0
3	3.0	6.0	99.0	12.0
4	4.0	8.0	12.0	16.0
5	5.0	10.0	99.0	20.0
6	6.0	12.0	18.0	99.0
7	7.0	14.0	21.0	99.0
8	8.0	16.0	24.0	32.0
9	9.0	18.0	27.0	36.0

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data6)

Unsurprisingly, this looks very unnatural!

KNNImputer

The KNNImputer class fills in missing values using the k-nearest neighbors approach. By default, it finds the nearest neighbors with nan_euclidean_distances, a Euclidean distance metric that supports missing values. Each missing feature is imputed from the values of the n_neighbors nearest samples that have a value for that feature, either averaged uniformly or weighted by the distance to each neighbor. If a sample is missing more than one feature, its neighbors may differ depending on which feature is being imputed. If fewer than n_neighbors neighbors are available and there are no defined distances to the training set, the training set mean of that feature is used for imputation. If at least one neighbor has a defined distance, the weighted or unweighted average of the remaining neighbors is used.
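The neighbor search described above can be reproduced by hand for a single cell. The sketch below is a simplified illustration rather than the library's actual implementation: it imputes column B of row 2 by calling nan_euclidean_distances directly, keeping only rows that have a value in B, and averaging the two nearest. The names donors, nearest, manual, and library are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Rebuild the test frame with the same missing cells as in the article.
data2 = pd.DataFrame({
    'A': np.arange(10),
    'B': np.arange(10) * 2.0,
    'C': np.arange(10) * 3.0,
    'D': np.arange(10) * 4.0,
})
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan

X = data2.to_numpy(dtype=float)
target_row, target_col = 2, 1          # row 2, column B

# Pairwise distances that skip NaN coordinates and rescale the rest.
dist = nan_euclidean_distances(X, X)

# Only rows with an observed value in column B can act as neighbors.
donors = np.flatnonzero(~np.isnan(X[:, target_col]))

# Take the two closest donors and average their B values uniformly.
nearest = donors[np.argsort(dist[target_row, donors])[:2]]
manual = X[nearest, target_col].mean()

# Compare with what KNNImputer produces for the same cell.
library = KNNImputer(n_neighbors=2).fit_transform(data2)[target_row, target_col]
print(sorted(nearest.tolist()), manual, library)
```

For this cell the two nearest donors are the rows directly above and below, so the uniform average lands on the value that lies on the line.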

n_neighbors=2

Let's explicitly set the number of neighbors to consider to n_neighbors = 2.

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
data7 = pd.DataFrame(imputer.fit_transform(data2))
data7

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	1.0	2.0	3.0	4.0
2	2.0	4.0	6.0	8.0
3	3.0	6.0	9.0	12.0
4	4.0	8.0	12.0	16.0
5	5.0	10.0	15.0	20.0
6	6.0	12.0	18.0	18.0
7	7.0	14.0	21.0	26.0
8	8.0	16.0	24.0	32.0
9	9.0	18.0	27.0	36.0

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data7)

Columns B and C are filled exactly, but column D, where two consecutive values are missing, is not filled as well.

default(n_neighbors=5)

By default, n_neighbors is 5, so up to five neighbors are considered.

from sklearn.impute import KNNImputer
imputer = KNNImputer()
data8 = pd.DataFrame(imputer.fit_transform(data2))
data8

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	1.0	2.0	3.0	4.0
2	2.0	5.2	6.0	8.0
3	3.0	6.0	12.0	12.0
4	4.0	8.0	12.0	16.0
5	5.0	10.0	16.2	20.0
6	6.0	12.0	18.0	23.2
7	7.0	14.0	21.0	23.2
8	8.0	16.0	24.0	32.0
9	9.0	18.0	27.0	36.0

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data8)

Column D is filled somewhat better overall, but columns B and C, which were filled exactly with n_neighbors=2, are now slightly off.
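One knob that was not tried above is the weights parameter of KNNImputer, which accepts 'uniform' (the default) or 'distance'. The sketch below is only an experiment to run on the same test data, not a recommendation: it puts the imputed cells from both settings side by side. The names comparison and labels are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Rebuild the test frame with the same missing cells as in the article.
data2 = pd.DataFrame({
    'A': np.arange(10),
    'B': np.arange(10) * 2.0,
    'C': np.arange(10) * 3.0,
    'D': np.arange(10) * 4.0,
})
for row, col in [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]:
    data2.loc[row, col] = np.nan

mask = data2.isna().to_numpy()

# Same number of neighbors, two different ways of averaging them.
uniform = KNNImputer(n_neighbors=5, weights='uniform').fit_transform(data2)
weighted = KNNImputer(n_neighbors=5, weights='distance').fit_transform(data2)

# Label each imputed cell as "column[row]" in the same row-major order
# that boolean indexing with the mask returns.
labels = [f'{data2.columns[c]}[{r}]' for r, c in np.argwhere(mask)]
comparison = pd.DataFrame(
    {'uniform': uniform[mask], 'distance': weighted[mask]},
    index=labels,
)
print(comparison)
```

With 'distance', closer neighbors contribute more to the average, so distant rows pull the imputed value less than they do with uniform weighting.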

Summary

There is probably no perfect way to fill in missing values, so consider the characteristics of your data and choose the method that fits it best!
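Because the test data was generated from known values, the methods can also be compared numerically. The sketch below is one possible way to do that, assuming the mean absolute error over the five removed cells is a reasonable yardstick; the names truth, holes, imputers, and errors are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Ground truth: B, C, and D are exact multiples of A.
truth = pd.DataFrame({'A': np.arange(10.0)})
for name, factor in [('B', 2), ('C', 3), ('D', 4)]:
    truth[name] = truth['A'] * factor

# Remove the same five cells as in the article.
holes = [(2, 'B'), (3, 'C'), (5, 'C'), (6, 'D'), (7, 'D')]
data2 = truth.copy()
for row, col in holes:
    data2.loc[row, col] = np.nan
mask = data2.isna().to_numpy()

imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'knn (k=2)': KNNImputer(n_neighbors=2),
    'knn (k=5)': KNNImputer(n_neighbors=5),
}

# Mean absolute error, measured only on the cells that were removed.
errors = {}
for label, imputer in imputers.items():
    filled = imputer.fit_transform(data2)
    errors[label] = np.abs(filled - truth.to_numpy())[mask].mean()

# Print the methods from best to worst on this data.
for label, err in sorted(errors.items(), key=lambda kv: kv[1]):
    print(f'{label:10s} {err:.3f}')
```

The ranking only describes this small linear data set; on data with a different shape the order may well change.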
