Preprocessing of training data often includes normalization, a step that rescales the value range of the data. For normalization, scikit-learn (hereinafter sklearn) provides scaler classes with a method called fit_transform. In this article, we share sample code that normalizes training data and validation data.
sklearn's scalers provide three main methods, covering a two-step process:
fit() Calculate the parameters (e.g. minimum and maximum) from the given data
transform() Convert data using the parameters calculated by fit()
fit_transform() Execute both of the above steps in one call
If you simply want to normalize a single dataset, fit_transform() is convenient because it calculates the parameters and converts the data at the same time. However, when converting data as preprocessing for training, the training data and the validation data must be converted with the same parameters (the result of fit() on the training data). That is why fit(), which calculates the parameters from a dataset, and transform(), which converts data using the already-calculated parameters, are provided as separate methods. A simple example is shown in the sample code below.
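As a minimal sketch (with assumed example data), the two-step fit() + transform() pattern produces the same result as a single fit_transform() call:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[0.0], [5.0], [10.0]])  # assumed example data

# Pattern 1: two separate steps
scaler_a = MinMaxScaler()
scaler_a.fit(data)                    # step 1: compute min/max parameters
two_step = scaler_a.transform(data)   # step 2: convert using those parameters

# Pattern 2: both steps in one call
scaler_b = MinMaxScaler()
one_step = scaler_b.fit_transform(data)

print(np.allclose(two_step, one_step))  # True
```

The separate fit() step is what lets you reuse the training-data parameters on validation data later.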
Checking the sklearn reference, there appear to be 27 preprocessing types. I have only used a few of them, but if you are interested, please refer to it: API Reference sklearn.preprocessing scikit-learn 0.19.2 documentation
Commonly used conversion methods
MinMaxScaler() # scale data to a range defined by the minimum and maximum values
StandardScaler() # standardization (zero mean, unit variance)
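As a quick sketch of the second method, StandardScaler rescales data to zero mean and unit variance (the data here is assumed for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # assumed example data

scaler = StandardScaler()
standardized = scaler.fit_transform(data)

print(standardized.flatten())  # values centered on 0, spread scaled to std 1
print(scaler.mean_)            # mean learned by fit: [3.]
print(scaler.scale_)           # std learned by fit
```

Unlike MinMaxScaler, the output is not bounded to a fixed range; it depends on how far each value is from the mean.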
Below is a sample of normalization using sklearn. Each step of the process is explained with a comment on the corresponding line.
scaler_sample.py
#If not installed yet, pip install scikit-learn and numpy first.
from sklearn import preprocessing
import numpy as np
import pickle
#Define the normalization method: MinMaxScaler (0 <= data <= 1)
mmscaler = preprocessing.MinMaxScaler()
#Define the raw training data
train_raw = np.array(list(range(11)))
print (train_raw) # [ 0 1 2 3 4 5 6 7 8 9 10]
#fit_transform with the training data
train_transed = mmscaler.fit_transform(train_raw.reshape(-1,1))
#Display the conversion result
#The data from 0 to 10 is converted to the range 0 to 1
print (train_transed.flatten()) # [0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
#Save the fitted scaler in binary format
#Usually the training code and the validation code are implemented separately, so this is introduced as a way to save the parameters.
pickle.dump(mmscaler, open('./scaler.sav', 'wb'))
#Assuming the above runs in a separate function, load the fitted scaler (binary file) saved during training
save_scaler = pickle.load(open('./scaler.sav', 'rb'))
#Check the loaded scaler
print(save_scaler,type(save_scaler)) # MinMaxScaler() <class 'sklearn.preprocessing._data.MinMaxScaler'>
#Define the test data
test_raw = np.array(list(range(100)))
print (test_raw)
'''print (test_raw)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
'''
#Convert using the saved parameters (transform)
save_scaler_transed = save_scaler.transform(test_raw.reshape(-1,1))
print (save_scaler_transed.flatten())
#Since the parameters of the training data are used, the data range becomes 0 to 9.9
'''print (save_scaler_transed.flatten())
[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7
1.8 1.9 2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3. 3.1 3.2 3.3 3.4 3.5
3.6 3.7 3.8 3.9 4. 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5. 5.1 5.2 5.3
5.4 5.5 5.6 5.7 5.8 5.9 6. 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7. 7.1
7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8. 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9
9. 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9]
'''
#Calculate parameters from the test data and convert (fit_transform)
test_fit_transed = mmscaler.fit_transform(test_raw.reshape(-1,1))
#Since the parameters are recalculated from the test data, the range becomes 0 to 1 again
print (test_fit_transed.flatten())
'''print (test_fit_transed.flatten())
[0. 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
0.06060606 0.07070707 0.08080808 0.09090909 0.1010101 0.11111111
0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
0.18181818 0.19191919 0.2020202 0.21212121 0.22222222 0.23232323
0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
0.3030303 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
0.36363636 0.37373737 0.38383838 0.39393939 0.4040404 0.41414141
0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
0.66666667 0.67676768 0.68686869 0.6969697 0.70707071 0.71717172
0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
0.78787879 0.7979798 0.80808081 0.81818182 0.82828283 0.83838384
0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
0.96969697 0.97979798 0.98989899 1. ]
'''
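Putting the above together, the safe pattern is: fit the scaler on the training split only, then reuse it (transform only) on the validation split. The data and split below are assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # assumed example data
X_train, X_val = train_test_split(X, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_val_scaled = scaler.transform(X_val)          # reuse the same parameters

# Do NOT call fit_transform on X_val: that would recompute min/max from
# the validation data and break consistency with the training data.
```

This is the same idea as the pickle-based sample above, just within a single script instead of across separate training and validation programs.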
I wrote this article as a memorandum of what I learned about normalization. Before looking into it, I had not noticed the difference between fit_transform() and transform(). This conversion is an important part of preprocessing and also affects the accuracy on the validation data. I hope this helps avoid cases where parameters are inadvertently recomputed instead of reused. I borrowed the wisdom of my predecessors in writing this article; I will add the references at a later date. Thank you for reading. If you liked it, please LGTM!