Implementing normalization in Python training-data preprocessing with scikit-learn [fit_transform]

Preprocessing of training data includes a step called normalization, which rescales the value range of the data. For implementing normalization, scikit-learn (hereinafter sklearn) provides a method called fit_transform. In this article, I share sample code that normalizes training data and validation data.

sklearn's normalization methods

The scalers provided by sklearn expose three main methods, which correspond to a two-step process:

  1. Parameter calculation
  2. Conversion using the parameters

fit() Calculates parameters (e.g., the mean / standard deviation or the maximum / minimum values) from the input data and saves them

transform() Converts data using the parameters calculated by the fit function

fit_transform() Executes the two steps above in succession

A minimal sketch of how the three methods relate is shown below.
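In the sketch (the input array is made up for illustration), fit() followed by transform() gives the same result as a single fit_transform() call:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[0.0], [5.0], [10.0]])  # made-up input, shape (n_samples, n_features)

# Two-step process: fit() calculates and stores the parameters (here the min / max),
# transform() converts the data using them
scaler = MinMaxScaler()
scaler.fit(data)                   # stores data_min_ = 0, data_max_ = 10
two_step = scaler.transform(data)

# One-step process: fit_transform() executes both steps in succession
one_step = MinMaxScaler().fit_transform(data)

print(np.allclose(two_step, one_step))  # True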

Why are there three methods?

If you just want to normalize a single dataset, the fit_transform function is convenient because it calculates the parameters and converts the data in one call. However, when converting data as preprocessing for training, the same parameters (the result of the fit function) must be applied to both the training data and the validation data. * A simple example is shown in the sample code below. For this reason, fit(), which calculates the parameters from a given dataset, and transform(), which converts data using the already calculated parameters, are provided as separate methods.
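Condensed, the correct pattern looks like the sketch below (the arrays are made up for illustration; the full sample further down walks through it step by step):

from sklearn.preprocessing import MinMaxScaler
import numpy as np

train = np.array([[0.0], [10.0]])   # made-up training data
val = np.array([[5.0], [20.0]])     # made-up validation data

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # calculate the parameters from the training data only
val_scaled = scaler.transform(val)          # reuse the same parameters; do not fit again

print(val_scaled.flatten())  # [0.5 2. ] -- values outside the training range can exceed 1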

Normalization types

Checking the sklearn API reference, there appear to be 27 types. I have only used a few of them, but if you are interested, please have a look: API Reference: sklearn.preprocessing (scikit-learn 0.19.2 documentation).

Commonly used conversion methods:

  - MinMaxScaler(): scales data using its maximum and minimum values (to the range 0-1 by default)
  - StandardScaler(): standardization (zero mean, unit variance)
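The main sample below only uses MinMaxScaler, so here is a minimal StandardScaler sketch for comparison (the array is made up for illustration):

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # made-up data

scaler = StandardScaler()
standardized = scaler.fit_transform(data)  # (x - mean) / std, computed per feature

print(scaler.mean_, scaler.scale_)  # [3.] [1.41421356]
print(standardized.flatten())       # [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]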

Sample code

Below is a sample of normalization using sklearn. Each step is commented in the code. The sample proceeds as follows:

  1. Define the normalization method and the training data
  2. Transform with fit_transform
  3. Save the parameters -> load them
  4. Define the test data
  5. Transform the test data with the saved parameters
  6. For comparison, transform the test data with fit_transform

scaler_sample.py


# If these packages are not installed yet, install them first (e.g., pip install scikit-learn numpy).
from sklearn import preprocessing
import numpy as np
import pickle

# Define the normalization method: MinMaxScaler (scales to 0 <= data <= 1)
mmscaler = preprocessing.MinMaxScaler()

# Define the raw training data
train_raw = np.array(list(range(11)))
print(train_raw)   # [ 0  1  2  3  4  5  6  7  8  9 10]

# fit_transform with the training data
train_transed = mmscaler.fit_transform(train_raw.reshape(-1, 1))
# Display the conversion result
# The data from 0 to 10 is converted to the range 0 to 1
print(train_transed.flatten()) # [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]

# Save the fitted scaler (and thus the fit parameters) in binary format
# Training and validation code are usually implemented separately, so saving the parameters is shown here
with open('./scaler.sav', 'wb') as f:
    pickle.dump(mmscaler, f)

# Assuming the code above runs in a separate training script, load the scaler fitted on the training data (binary file)
with open('./scaler.sav', 'rb') as f:
    save_scaler = pickle.load(f)
# Check the loaded scaler
print(save_scaler, type(save_scaler))    # MinMaxScaler() <class 'sklearn.preprocessing._data.MinMaxScaler'>
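
# As a small added illustration: the fitted parameters themselves can be inspected
# through the scaler's data_min_ / data_max_ attributes
print(save_scaler.data_min_, save_scaler.data_max_)    # [0.] [10.]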

# Define the test data
test_raw = np.array(list(range(100)))
print(test_raw)
'''print(test_raw)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]
'''

# Convert using the saved parameters (transform)
save_scaler_transed = save_scaler.transform(test_raw.reshape(-1, 1))
print(save_scaler_transed.flatten())
# Because the parameters fitted on the training data (min 0, max 10) are reused, the converted range is 0 to 9.9
'''print(save_scaler_transed.flatten())
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.  1.1 1.2 1.3 1.4 1.5 1.6 1.7
 1.8 1.9 2.  2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.  3.1 3.2 3.3 3.4 3.5
 3.6 3.7 3.8 3.9 4.  4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.  5.1 5.2 5.3
 5.4 5.5 5.6 5.7 5.8 5.9 6.  6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.  7.1
 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.  8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9
 9.  9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9]
'''

# Calculate the parameters from the test data and convert in one step (fit_transform)
test_fit_transed = mmscaler.fit_transform(test_raw.reshape(-1, 1))

# Since the parameters are now calculated from the test data itself, the converted range becomes 0 to 1
print(test_fit_transed.flatten())
'''print(test_fit_transed.flatten())
[0.         0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
 0.06060606 0.07070707 0.08080808 0.09090909 0.1010101  0.11111111
 0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
 0.18181818 0.19191919 0.2020202  0.21212121 0.22222222 0.23232323
 0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
 0.3030303  0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
 0.36363636 0.37373737 0.38383838 0.39393939 0.4040404  0.41414141
 0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
 0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
 0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
 0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
 0.66666667 0.67676768 0.68686869 0.6969697  0.70707071 0.71717172
 0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
 0.78787879 0.7979798  0.80808081 0.81818182 0.82828283 0.83838384
 0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
 0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
 0.96969697 0.97979798 0.98989899 1.        ]
'''

In conclusion

I wrote this article as a memorandum of what I learned about normalization. Before looking into it, I had not noticed the difference between fit_transform() and transform(). This is an important preprocessing conversion, and it also affects the evaluation accuracy on the validation data. I hope this helps avoid cases where the fit parameters are carelessly recomputed instead of being reused. I drew on the work of others in writing this article and will add the references at a later date. Thank you for reading. If you found it helpful, an LGTM would be appreciated!
