This article explains the logarithmic transformation and the Box-Cox transformation. It is largely based on the book "Feature Engineering for Machine Learning"; please check it out if you are interested.
The content of this article is also explained in more detail on YouTube, so have a look there if you are interested.
Logarithmic transformation is mainly used for the following purposes:

- Making the data follow a normal distribution
- Reducing the variance
The logarithmic function looks like the figure below. Since (for base 10) the range [1, 10] is mapped to [0, 1] and the range [10, 100] is mapped to [1, 2], small values of x are spread over a wide output range, while large values of x are compressed into a narrow one.
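This mapping is easy to verify directly: each factor-of-ten step in x becomes a step of exactly 1 after `log10`.

```python
import numpy as np

# log10 maps [1, 10] onto [0, 1] and [10, 100] onto [1, 2]:
# equal *ratios* in x become equal *differences* in log10(x).
values = np.array([1, 10, 100, 1000])
print(np.log10(values))  # [0. 1. 2. 3.]
```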
Using this logarithmic transformation, the upper tail of a heavy-tailed distribution like the one below can be compressed and the lower side expanded, bringing the distribution closer to normal. Many machine learning methods are **nonparametric models** that make no assumptions about the population distribution, so normality is not required. However, **parametric models**, which assume a particular distribution for the statistical population, often require the data to be approximately normally distributed.
Furthermore, applying the logarithmic transformation to high-variance data like the following reduces the variance.
** Before applying logarithmic transformation (variance: 5.0e+06) **
** After applying logarithmic transformation (variance: 0.332007) **
** Sample code for applying logarithmic transformation **
log.py
import numpy as np
import pandas as pd

# Fix the random seed
np.random.seed(100)

# Generate skewed sample data
data_array = []
for i in range(1, 10000):
    max_num = i if i > 3000 else 1000
    s = np.random.randint(0, max_num, 10)
    data_array.extend(s)
data = pd.DataFrame({'Listen Count': data_array})

# Add 1 so that zero counts do not produce log(0)
data_log = pd.DataFrame()
data_log['Listen Count'] = np.log10(data['Listen Count'] + 1)
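As a quick sanity check, the variance reduction quoted above can be verified directly (a self-contained sketch repeating the sample code; the exact figures depend on the random seed):

```python
import numpy as np
import pandas as pd

np.random.seed(100)
data_array = []
for i in range(1, 10000):
    max_num = i if i > 3000 else 1000
    data_array.extend(np.random.randint(0, max_num, 10))
data = pd.DataFrame({'Listen Count': data_array})
data_log = np.log10(data['Listen Count'] + 1)

# The log transform shrinks the variance by several orders of magnitude.
print(data['Listen Count'].var())  # on the order of 1e+06
print(data_log.var())              # well below 1
```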
The Box-Cox transformation is defined by the following formula:
y = \begin{cases}
\dfrac{x^\lambda - 1}{\lambda} & (\lambda \neq 0) \\
\log(x) & (\lambda = 0)
\end{cases}
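The piecewise formula above can be written directly in NumPy. A useful check is that as λ approaches 0, the first branch converges to log(x), which is why the λ = 0 case is defined that way:

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of positive x for a given lambda,
    following the piecewise formula above."""
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1) / lam

# As lambda -> 0, (x^lambda - 1)/lambda approaches log(x):
print(box_cox(np.e, 0.0))   # 1.0  (= log(e))
print(box_cox(np.e, 1e-8))  # very close to 1.0
print(box_cox(4.0, 0.5))    # 2.0, since (sqrt(4) - 1) / 0.5 = 2
```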
By using the Box-Cox transformation, data can be made to follow a normal distribution to some extent. (Note that it can only be applied to positive data.)
The graph below illustrates this transformation. Before applying the Box-Cox transform, the value of λ must be determined; here, maximum likelihood estimation is used to choose the λ that brings the transformed data closest to a normal distribution.
Applying the Box-Cox transformation to data distributed as in the figure below yields a distribution that looks close to normal.
** Before applying Box-Cox transformation **
** After applying Box-Cox transformation **
** Sample code for applying Box-Cox transformation **
from scipy import stats
import numpy as np
import pandas as pd

# Fix the random seed
np.random.seed(100)

# Generate sample data
data_array = []
for i in range(1, 1000):
    s = np.random.randint(1, i * 100, 10)
    data_array.extend(s)
data = pd.DataFrame({'Listen Count': data_array})

# Box-Cox transformation: boxcox returns the transformed data and the fitted lambda
rc_bc, bc_params = stats.boxcox(data['Listen Count'])
print(bc_params)  # 0.3419237117680786
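If you later need to map transformed values (for example, model predictions) back to the original scale, SciPy also provides the inverse via `scipy.special.inv_boxcox`. A minimal sketch, using illustrative data rather than the sample above:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

np.random.seed(0)
x = np.random.exponential(scale=100, size=100) + 1  # positive data (illustrative)

y, lam = stats.boxcox(x)       # forward transform; lambda fitted by maximum likelihood
x_back = inv_boxcox(y, lam)    # map the transformed values back to the original scale
print(np.allclose(x, x_back))  # True
```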
A Q-Q plot compares the observed quantiles against the theoretical quantiles of an ideal (here, normal) distribution. In other words, if the points fall on a straight line, the observed data can be considered normally distributed. Below are Q-Q plots of the original data, the log-transformed data, and the Box-Cox-transformed data.
** Raw data **
** After logarithmic transformation **
** After Box-Cox transformation **
From these results, we can see that the Box-Cox-transformed data follows a normal distribution most closely.
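Q-Q plots like the ones above can be produced with `scipy.stats.probplot`. A sketch on illustrative heavy-tailed data (not the sample data used earlier); the returned r value measures how well the points fit the straight line, so r closer to 1 means closer to normal:

```python
import numpy as np
from scipy import stats

np.random.seed(100)
raw = np.random.exponential(scale=100, size=1000) + 1  # heavy-tailed positive sample

# probplot computes the Q-Q points against a normal distribution
# and fits a straight line through them.
(_, _), (_, _, r_raw) = stats.probplot(raw, dist="norm")
(_, _), (_, _, r_bc) = stats.probplot(stats.boxcox(raw)[0], dist="norm")
print(r_raw < r_bc)  # True: the Box-Cox data hugs the straight line more tightly
```

Passing `plot=plt` (a matplotlib Axes or module) to `probplot` draws the plot directly.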
I plan to post reviews and explanation videos of technical books on YouTube, focusing on machine learning. I also introduce companies worth knowing about if you work in IT. Likes, channel subscriptions, and high ratings motivate both the YouTube and Qiita updates. YouTube: https://www.youtube.com/channel/UCywlrxt0nEdJGYtDBPW-peg Twitter: https://twitter.com/tatelabo