This article discusses scaling and normalization. It is mainly based on "Feature Engineering for Machine Learning"; please check it out if you are interested.
Numerical features may or may not have a fixed range. Count data, for example, has no fixed upper bound, and for models that are sensitive to feature scale, such as linear regression, training can fail because of outliers or differences in scale between features. Unifying the scale in such cases is called scaling. Common scaling methods include Min-Max scaling, standardization, and L2 normalization, so let's introduce them in order. If you want to know more about scaling, please see the article here.
Min-Max scaling maps the minimum value to 0 and the maximum value to 1. If outliers are present, the range occupied by normal values can become too narrow under their influence, so standardization is generally preferred in that case.
\tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)}
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Fix the random seed
np.random.seed(100)
data_array = []
for i in range(1, 100):
s = np.random.randint(0, i * 10, 10)
data_array.extend(s)
data_array.extend(np.zeros(100))
data = pd.DataFrame({'Listen Count': data_array})
print(data.max()) # 977.0
print(data.min()) # 0
scaler = MinMaxScaler()
data_n = scaler.fit_transform(data)
data_n = pd.DataFrame(data_n)
print(data_n.max()) ## 1.0
print(data_n.min()) ## 0
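To see that MinMaxScaler really applies the formula above, here is a small sketch (with hypothetical sample data) that computes the Min-Max formula by hand and compares it with the library result:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

np.random.seed(100)
# Hypothetical count-like data for illustration
x = pd.DataFrame({'Listen Count': np.random.randint(0, 1000, 50)})

# The formula applied directly: (x - min) / (max - min)
manual = (x - x.min()) / (x.max() - x.min())

# The same transformation via MinMaxScaler
lib = MinMaxScaler().fit_transform(x)

print(np.allclose(manual.values, lib))  # True
```

Note that in practice the scaler is fitted on the training data only and then applied to new data with `transform`, so that test data does not leak into the min/max estimates.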
Standardization transforms a feature so that its mean is 0 and its variance is 1. If the original feature follows a normal distribution, the standardized feature follows the standard normal distribution.
\tilde{x} = \frac{x - \mathrm{mean}(x)}{\sqrt{\mathrm{var}(x)}}
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Fix the random seed
np.random.seed(100)
data_array = []
for i in range(1, 100):
s = np.random.randint(0, i * 10, 10)
data_array.extend(s)
data_array.extend(np.zeros(100))
data = pd.DataFrame({'Listen Count': data_array})
scaler = StandardScaler()
data_n = scaler.fit_transform(data)
data_n = pd.DataFrame({'Listen Count': data_n.ravel()})
print(data_n.var()) ##1.000918
print(data_n.mean()) ##6.518741e-17
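As with Min-Max scaling, the standardization formula can be checked by hand. A minimal sketch (with hypothetical sample data), assuming the population variance (`ddof=0`), which is what both `np.var` and `StandardScaler` use by default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

np.random.seed(100)
# Hypothetical data for illustration; sklearn expects a 2D array
x = np.random.randint(0, 1000, 50).astype(float).reshape(-1, 1)

# The formula applied directly: (x - mean) / sqrt(var)
manual = (x - x.mean()) / np.sqrt(x.var())

# The same transformation via StandardScaler
lib = StandardScaler().fit_transform(x)

print(np.allclose(manual, lib))  # True
```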
L2 normalization normalizes a feature by dividing it by its L2 norm.
\tilde{x} = \frac{x}{||x||_2} \\
||x||_2 = \sqrt{x_1^2 + x_2^2+ ...+x_m^2 }
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize
# Fix the random seed
np.random.seed(100)
data_array = []
for i in range(1, 100):
s = np.random.randint(0, i * 10, 10)
data_array.extend(s)
data_array.extend(np.zeros(100))
data = pd.DataFrame({'Listen Count': data_array})
# L2 normalization
data_l2_normalized = normalize([data['Listen Count']],norm='l2')
data_l2 = pd.DataFrame({'Listen Count': data_l2_normalized.ravel()})
print(np.linalg.norm(data_l2_normalized,ord=2)) ## 0.999999999
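One detail worth knowing: `normalize` works row-wise by default (`axis=1`), which is why the feature is wrapped in a list above so the whole column is treated as one row. A minimal sketch (with hypothetical sample data) confirming that this matches dividing by the L2 norm directly:

```python
import numpy as np
from sklearn.preprocessing import normalize

np.random.seed(100)
# Hypothetical data for illustration
x = np.random.randint(0, 1000, 50).astype(float)

# The formula applied directly: divide each element by the vector's L2 norm
manual = x / np.linalg.norm(x)

# normalize scales each row to unit norm, so pass the feature as a single row
lib = normalize([x], norm='l2').ravel()

print(np.allclose(manual, lib))  # True
print(np.isclose(np.linalg.norm(manual), 1.0))  # True
```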
I'm thinking of posting videos about IT on YouTube. Likes, channel subscriptions, and high ratings motivate my YouTube and Qiita updates, so please consider them. YouTube: https://www.youtube.com/channel/UCywlrxt0nEdJGYtDBPW-peg Twitter: https://twitter.com/tatelabo