Min-max normalization and Z-score normalization (standardization) are commonly used for normalization and standardization. In this post, I tried the robust Z-score and compared it with those two methods.
## min-max normalization

min-max normalization rescales the data so that its minimum value becomes 0 and its maximum value becomes 1, using the following formula.
x' = \frac{x-min(x)}{max(x)-min(x)}
In Python, you can compute this with `minmax_scale` or `MinMaxScaler` in `sklearn.preprocessing`.
This normalization assumes that the distribution of the data is **uniform**.
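As a quick check of the formula (the sample values here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

x = np.array([2.0, 4.0, 6.0, 10.0])
# (x - min) / (max - min): the minimum maps to 0, the maximum to 1.
x_scaled = minmax_scale(x)
print(x_scaled)  # [0.   0.25 0.5  1.  ]
```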
## Z-score Normalization (Standardization)

Z-score Normalization transforms the data to have mean 0 and variance 1, using the following formula. The resulting value is called the **Z-score**; *μ* denotes the mean and *σ* the standard deviation.
x' = \frac{x-\mu}{\sigma}
In Python, you can compute this with `scale` or `StandardScaler` in `sklearn.preprocessing`.
This normalization assumes that the distribution of the data is **normal**.
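A minimal check of the formula (again with made-up sample values), confirming the scaled data has mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import scale

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = scale(x)  # (x - mean) / std, using the population standard deviation
print(z.mean(), z.std())  # both very close to 0.0 and 1.0
```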
Real data often follows neither a uniform nor a normal distribution. While looking into what to do in such cases, I found the robust Z-score in the following articles.
- Robust z-score: median and quartile, non-normal distribution, standardization including outliers (Memo)
- Exclusion of outliers using robust z-score
Below, I tried it in Python. For details on the robust Z-score, please read the articles above; what follows is a brief description and implementation.
The Z-score assumes a normal distribution. To apply it to non-normal distributions, first replace the mean *μ* with the median and the standard deviation *σ* with the interquartile range (IQR).
x' = \frac{x-median(x)}{IQR}
This formula can be computed with `robust_scale` or `RobustScaler` in `sklearn.preprocessing`.
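As a quick check (with made-up sample values), `robust_scale` with its default quantile range of (25, 75) centers on the median and divides by the IQR, matching the formula above:

```python
import numpy as np
from sklearn.preprocessing import robust_scale

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier
q1, q3 = np.percentile(x, [25, 75])
manual = (x - np.median(x)) / (q3 - q1)    # (x - median) / IQR
print(np.allclose(robust_scale(x), manual))  # True
```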
This can also be made consistent with the standard normal distribution. The IQR divided by F(0.75) − F(0.25) = 1.3489 is called the normalized interquartile range (NIQR), where F(x) is the inverse of the cumulative distribution function of the standard normal distribution:
NIQR = \frac{IQR}{1.3489}
The robust Z-score is obtained by replacing the denominator of the earlier formula, IQR, with NIQR.
robust Z score = \frac{x-median(x)}{NIQR}
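The constant 1.3489 can be verified directly with `scipy.stats.norm.ppf`, which is the inverse CDF of the standard normal distribution:

```python
from scipy.stats import norm

# F^-1(0.75) - F^-1(0.25) for the standard normal distribution
coefficient = norm.ppf(0.75) - norm.ppf(0.25)
print(coefficient)  # approximately 1.3490 (1.3489... )
```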
Implementing the above gives the following function.
```python
def robust_z(x):
    from sklearn.preprocessing import robust_scale
    from scipy.stats import norm

    # NIQR = IQR / (F^-1(0.75) - F^-1(0.25)), so multiplying the
    # IQR-scaled values by this coefficient yields (x - median) / NIQR.
    coefficient = norm.ppf(0.75) - norm.ppf(0.25)
    robust_z_score = robust_scale(x) * coefficient
    return robust_z_score
```
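As a sanity check (with made-up sample values), the computation inside the function — `robust_scale` times the coefficient — matches the (x − median) / NIQR formula directly:

```python
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import robust_scale

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
coefficient = norm.ppf(0.75) - norm.ppf(0.25)
niqr = (np.percentile(x, 75) - np.percentile(x, 25)) / coefficient
expected = (x - np.median(x)) / niqr
actual = robust_scale(x) * coefficient  # same computation as robust_z
print(np.allclose(actual, expected))  # True
```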
I would like to compare the three normalizations introduced so far. First, prepare the data. Since I want data that is neither uniform nor normal, I combined a uniform distribution and a normal distribution.
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import chisquare, shapiro, norm
from sklearn.preprocessing import minmax_scale, scale, robust_scale

np.random.seed(2020)

# Data that combines a uniform distribution and a normal distribution.
data = np.concatenate((np.random.uniform(low=5.0, high=10.0, size=100),
                       np.random.normal(loc=5.0, scale=1.0, size=100)))

# Draw a histogram.
fig, axes = plt.subplots()
axes.hist(data)
axes.set_title("Histogram of data")
fig.show()
```
Tests confirm that this data is neither uniformly nor normally distributed: uniformity is tested with the chi-square test, and normality with the Shapiro-Wilk test.
```python
# Compute the frequency distribution.
hist_data, _ = np.histogram(data, bins="auto")

# Uniformity test (chi-square test)
_, chisquare_p = chisquare(hist_data)
print("Uniformity test (chi-square test) p-value: {}".format(chisquare_p))

# Normality test (Shapiro-Wilk test)
_, shapiro_p = shapiro(data)
print("P-value of normality test (Shapiro-Wilk test): {}".format(shapiro_p))
```
The results are as follows. Both p-values are smaller than 0.05, so the data can be considered neither uniform nor normal.
```
Uniformity test (chi-square test) p-value: 3.8086163670115985e-09
P-value of normality test (Shapiro-Wilk test): 8.850588528730441e-06
```
Use this data to calculate min-max normalization, Z-score, and robust Z-score and compare them.
```python
# Normalize with each method and put the results in a data frame.
score_df = pd.DataFrame(data=np.array([minmax_scale(data), scale(data), robust_z(data)]).T,
                        columns=["min-max", "Z-score", "robust Z-score"])

# Create a graph.
fig, axs = plt.subplots(ncols=3, constrained_layout=True)

# x-axis range settings
xrange = {"min-max": (0, 1),
          "Z-score": (-2.5, 2.5),
          "robust Z-score": (-2.5, 2.5)}

# Draw each histogram.
for i, score_name in enumerate(score_df.columns):
    axs[i].hist(score_df[score_name])
    axs[i].set_title(score_name)
    axs[i].set_xlim(xrange[score_name])
fig.show()
```
The result is shown below. There is not much difference between the three methods here; whether a difference appears may depend on the distribution of the data.
In the first place, the "robust" in robust Z-score means robustness against **outliers**, and the robust Z-score is also used for outlier detection. So let me add outliers to the data and compare again. To make the comparison easier, I added a fairly large number of extreme outliers.
```python
# Append outliers (drawn from a uniform distribution) to the data.
outlier = np.concatenate((data,
                          np.random.uniform(low=19.0, high=20.0, size=15)))

# Normalize with each method and put the results in a data frame.
outlier_df = pd.DataFrame(data=np.array([minmax_scale(outlier), scale(outlier), robust_z(outlier)]).T,
                          columns=["min-max", "Z-score", "robust Z-score"])

# Combine the data frames with and without outliers.
concat_df = pd.concat([score_df, outlier_df],
                      axis=1,
                      keys=['without outlier', 'with outlier'])

# Create a graph.
fig, axs = plt.subplots(nrows=2, ncols=3, constrained_layout=True)

# x-axis range settings
xrange = {"min-max": (0, 1),
          "Z-score": (-6.5, 6.5),
          "robust Z-score": (-6.5, 6.5)}

# Draw the histograms.
for i, (data_name, score_name) in enumerate(concat_df.columns):
    row, col = divmod(i, 3)
    axs[row, col].hist(concat_df[(data_name, score_name)])
    axs[row, col].set_xlim(xrange[score_name])
    title = "\n".join([data_name, score_name])
    axs[row, col].set_title(title)
plt.show()
```
The results are shown below: the top row has no outliers, the bottom row includes them. min-max normalization is strongly affected by outliers, and the Z-score also changes considerably from the case without outliers. The robust Z-score is the least affected and stays relatively close to the outlier-free result.
The robust Z-score gives almost the same result as the Z-score when the data is normally distributed, so when in doubt I am thinking of using the robust Z-score. It seems especially effective when I want to keep outliers in the data rather than discard them.
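As a rough numerical check of this last claim (the sample size and seed here are arbitrary), the robust Z-score of a large normally distributed sample stays very close to the plain Z-score:

```python
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import scale, robust_scale

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)

z = scale(x)
# Same computation as the robust_z function above.
robust_z_score = robust_scale(x) * (norm.ppf(0.75) - norm.ppf(0.25))

# For a large normal sample the two scores nearly coincide.
print(np.abs(z - robust_z_score).max())  # a small value
```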