Preprocessing of training data often includes normalization, a step that rescales the value range of the data. For normalization, scikit-learn (hereinafter sklearn) provides scaler classes with a method called fit_transform. In this article, we share sample code that normalizes training data and validation data.
sklearn's scalers provide three main methods, covering a two-step process:
fit() Calculate the parameters (e.g. minimum and maximum) from the given data
transform() Convert data using the parameters calculated by fit()
fit_transform() Execute both of the above steps in one call
If you simply want to normalize a single dataset, fit_transform() is convenient because it calculates the parameters and converts the data at the same time. However, when converting data as preprocessing for training, the training data and the validation data must be converted with the same parameters (the result of fit() on the training data). That is why fit(), which calculates the parameters from a dataset, and transform(), which converts data using the already-calculated parameters, are provided as separate methods. A simple example is shown in the sample code below.
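As a minimal sketch (with assumed example data), the two-step fit() + transform() pattern produces the same result as a single fit_transform() call:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[0.0], [5.0], [10.0]])  # assumed example data

# Pattern 1: two separate steps
scaler_a = MinMaxScaler()
scaler_a.fit(data)                    # step 1: compute min/max parameters
two_step = scaler_a.transform(data)   # step 2: convert using those parameters

# Pattern 2: both steps in one call
scaler_b = MinMaxScaler()
one_step = scaler_b.fit_transform(data)

print(np.allclose(two_step, one_step))  # True
```

The separate fit() step is what lets you reuse the training-data parameters on validation data later.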
Checking the sklearn reference, there appear to be 27 preprocessing types. I have only used a few of them, but if you are interested, please refer to it: API Reference sklearn.preprocessing scikit-learn 0.19.2 documentation
Commonly used conversion methods
MinMaxScaler() # scale data to a range defined by the minimum and maximum values
StandardScaler() # standardization (zero mean, unit variance)
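As a quick sketch of the second method, StandardScaler rescales data to zero mean and unit variance (the data here is assumed for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # assumed example data

scaler = StandardScaler()
standardized = scaler.fit_transform(data)

print(standardized.flatten())  # values centered on 0, spread scaled to std 1
print(scaler.mean_)            # mean learned by fit: [3.]
print(scaler.scale_)           # std learned by fit
```

Unlike MinMaxScaler, the output is not bounded to a fixed range; it depends on how far each value is from the mean.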
Below is a sample of normalization using sklearn. Each step of the process is explained with a comment on the corresponding line.
scaler_sample.py
#If not installed yet, pip install scikit-learn and numpy first.
from sklearn import preprocessing
import numpy as np
import pickle
#Define the normalization method: MinMaxScaler (0 <= data <= 1)
mmscaler = preprocessing.MinMaxScaler()
#Define the raw training data
train_raw = np.array(list(range(11)))
print (train_raw) # [ 0 1 2 3 4 5 6 7 8 9 10]
#fit_transform with the training data
train_transed = mmscaler.fit_transform(train_raw.reshape(-1,1))
#Display the conversion result
#The data from 0 to 10 is converted to the range 0 to 1
print (train_transed.flatten()) # [0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
#Save the fitted scaler in binary format
#Usually the training code and the validation code are implemented separately, so this is introduced as a way to save the parameters.
pickle.dump(mmscaler, open('./scaler.sav', 'wb'))
#Assuming the above runs in a separate function, load the fitted scaler (binary file) saved during training
save_scaler = pickle.load(open('./scaler.sav', 'rb'))
#Check the loaded scaler
print(save_scaler,type(save_scaler)) # MinMaxScaler() <class 'sklearn.preprocessing._data.MinMaxScaler'>
#Define the test data
test_raw = np.array(list(range(100)))
print (test_raw)
'''print (test_raw)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
'''
#Convert using the saved parameters (transform)
save_scaler_transed = save_scaler.transform(test_raw.reshape(-1,1))
print (save_scaler_transed.flatten())
#Since the parameters of the training data are used, the data range becomes 0 to 9.9
'''print (save_scaler_transed.flatten())
[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7
1.8 1.9 2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3. 3.1 3.2 3.3 3.4 3.5
3.6 3.7 3.8 3.9 4. 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5. 5.1 5.2 5.3
5.4 5.5 5.6 5.7 5.8 5.9 6. 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7. 7.1
7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8. 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9
9. 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9]
'''
#Calculate parameters from the test data and convert (fit_transform)
test_fit_transed = mmscaler.fit_transform(test_raw.reshape(-1,1))
#Since the parameters are recalculated from the test data, the range becomes 0 to 1 again
print (test_fit_transed.flatten())
'''print (test_fit_transed.flatten())
[0. 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
0.06060606 0.07070707 0.08080808 0.09090909 0.1010101 0.11111111
0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
0.18181818 0.19191919 0.2020202 0.21212121 0.22222222 0.23232323
0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
0.3030303 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
0.36363636 0.37373737 0.38383838 0.39393939 0.4040404 0.41414141
0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
0.66666667 0.67676768 0.68686869 0.6969697 0.70707071 0.71717172
0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
0.78787879 0.7979798 0.80808081 0.81818182 0.82828283 0.83838384
0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
0.96969697 0.97979798 0.98989899 1. ]
'''
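Putting the above together, the safe pattern is: fit the scaler on the training split only, then reuse it (transform only) on the validation split. The data and split below are assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # assumed example data
X_train, X_val = train_test_split(X, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_val_scaled = scaler.transform(X_val)          # reuse the same parameters

# Do NOT call fit_transform on X_val: that would recompute min/max from
# the validation data and break consistency with the training data.
```

This is the same idea as the pickle-based sample above, just within a single script instead of across separate training and validation programs.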
I wrote this article as a memorandum of what I learned about normalization. Before looking into it, I had not noticed the difference between fit_transform() and transform(). This conversion is an important part of preprocessing and also affects the accuracy on the validation data. I hope this helps avoid cases where parameters are inadvertently recomputed instead of reused. I borrowed the wisdom of my predecessors in writing this article; I will add the references at a later date. Thank you for reading. If you liked it, please LGTM!