I briefly looked into `sklearn.preprocessing.StandardScaler`, so here are my notes as a memo.
`StandardScaler` provides standardization of a dataset.
Standardization puts features that use different scales and units on a common footing.
For example, consider standardized test scores: suppose one test is out of 100 points and another is out of 50 points. Even though the point scales and units differ, standardization lets you compare the scores on an equal basis.
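As a minimal sketch of that idea (the scores below are made up for illustration), each test can be standardized against its own mean and standard deviation so the two score columns become directly comparable:

```python
import numpy as np

# Hypothetical scores: one test out of 100 points, one test out of 50 points
test_a = np.array([85, 60, 72, 90])  # max 100
test_b = np.array([40, 28, 35, 45])  # max 50

def z_score(scores):
    """Standardize using the population standard deviation (ddof=0)."""
    return (scores - scores.mean()) / scores.std()

print(z_score(test_a))  # scores expressed in standard deviations from their own mean
print(z_score(test_b))  # now on the same scale as test_a despite the different maximum
```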
A standardized value is obtained by subtracting the mean from each data point in the set and dividing by the standard deviation.
z_i = \frac{x_i - \mu}{\sigma}
Here, μ is the mean, σ is the standard deviation, and `i` is the index of a data point (any natural number up to n).
The mean is calculated by dividing the sum of the values in the set by their count.
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i
The standard deviation is the square root of the variance.
\sigma = \sqrt{s^2}
The variance is calculated by subtracting the mean from each data point in the set, summing the squared differences, and dividing by the number of data points.
s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2
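As a quick worked example with the made-up set {2, 4, 6}:

\mu = \frac{2+4+6}{3} = 4,\qquad s^2 = \frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3} = \frac{8}{3},\qquad \sigma = \sqrt{\tfrac{8}{3}} \approx 1.633

so the standardized values are approximately −1.225, 0, and 1.225.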
First, I'd like to implement it myself without using a machine learning library. I'll use the `iris` dataset as the target for standardization.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
print(X[:1]) # array([[5.1, 3.5, 1.4, 0.2]])
import math
def standardize(x):
    """
    Parameters
    ----------
    x : array to standardize (a one-dimensional vector of values).

    Returns
    -------
    mean : the mean
    var : the variance
    std : the standard deviation
    z : the standardized array
    """
    # Mean: the sum of the set divided by the number of elements
    mean = sum(x) / len(x)
    # Variance: the sum of squared differences from the mean, divided by the number of elements
    var = sum([(x_i - mean) ** 2 for x_i in x]) / len(x)
    # Standard deviation: the square root of the variance
    std = math.sqrt(var)
    # Subtract the mean from each value and divide by the standard deviation
    z = [(x_i - mean) / std for x_i in x]
    return [mean, var, std, z]
# The first 7 values of the first column (feature) of the dataset
sample_data = X[:, :1][:7]
print(sample_data.tolist()) # [[5.1], [4.9], [4.7], [4.6], [5.0], [5.4], [4.6]]
mean, var, std, z = standardize(sample_data)
print(mean) # [4.9]
print(var) # [0.07428571]
print(std) # 0.2725540575476989
print(*z) # [0.73379939] [3.25872389e-15] [-0.73379939] [-1.10069908] [0.36689969] [1.83449846] [-1.10069908]
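As a quick cross-check, NumPy produces the same statistics; note that np.var and np.std default to ddof=0, i.e. the population variance used in the formulas above (this snippet reuses sample_data from the code above):

```python
import numpy as np

print(np.mean(sample_data))  # approximately 4.9
print(np.var(sample_data))   # approximately 0.0742857 (population variance, ddof=0)
print(np.std(sample_data))   # approximately 0.2725541
print((sample_data - np.mean(sample_data)) / np.std(sample_data))  # same z values as above
```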
You can see that the values in `sample_data` have been transformed into standardized (z-scored) floating-point values.
Next, let's try `sklearn.preprocessing.StandardScaler`.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(sample_data)
print(scaler.mean_) # [4.9]
print(scaler.var_) # [0.07428571]
print(math.sqrt(scaler.var_)) # 0.2725540575476989
print(*scaler.transform(sample_data)) # [0.73379939] [3.25872389e-15] [-0.73379939] [-1.10069908] [0.36689969] [1.83449846] [-1.10069908]
#The values are the same
print(scaler.mean_ == mean) # [ True]
print(scaler.var_ == var) # [ True]
print(math.sqrt(scaler.var_) == std) # True
print(*(scaler.transform(sample_data) == z)) # [ True] [ True] [ True] [ True] [ True] [ True] [ True]
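Incidentally, fit and transform can also be combined with fit_transform, and the per-feature standard deviation is exposed directly as scale_:

```python
scaler2 = StandardScaler()
z2 = scaler2.fit_transform(sample_data)  # fit and transform in one call
print(scaler2.scale_)                    # approximately [0.27255406], i.e. sqrt(var_)
print(*z2)                               # the same standardized values as above
```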
By using `StandardScaler` in this way, standardization can be done with very little code.
Also, because the values computed during standardization, such as `mean_` and `var_`, are retained, a scaler that is fit while training a machine learning model can reuse those statistics when the model runs inference [^1]. For example, a new query can be standardized as `processed_query = (np.array(query) - scaler.mean_) / np.sqrt(scaler.var_)`, which I find very convenient.
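As a sketch of that inference-time usage (the query value below is made up), the manual formula and scaler.transform give the same result:

```python
import numpy as np

query = [[5.0]]  # hypothetical new sample, same shape as the training feature

# Reuse the statistics retained by fit()
manual = (np.array(query) - scaler.mean_) / np.sqrt(scaler.var_)
via_transform = scaler.transform(query)

print(manual)         # approximately [[0.3669]]
print(via_transform)  # approximately [[0.3669]]
```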
[^1]: Assuming that the vector given at model-inference time comes from the same distribution as the dataset used during training.