1. Purpose

If you just want to try machine learning, libraries such as scikit-learn make it relatively easy to implement. However, to get results at work or to raise your own skill level, an attitude of **"I don't know the background, but I got this result"** is clearly not enough.

This time, we take up **"standardization"**, which appears in preprocessing.

The first half of this article aims to answer questions such as "I've heard of standardization, but why do it?" and "How do I use it with scikit-learn?". The second half is for readers who think, "Being told that standardization just makes the mean 0 and the standard deviation 1 isn't enough; I want to understand from the formulas what calculation is actually performed."

First, Chapter 2 gives an overview of standardization, and Chapter 3 performs standardization with scikit-learn. Finally, Chapter 4 covers the standardization formula (and whether standardization really yields a mean of 0 and a standard deviation of 1).

I have posted several articles in the "Understanding from Mathematics" series, so I hope you will read them as well.

- [Machine learning] Understanding decision trees from both scikit-learn and mathematics
- [Machine learning] Understanding linear simple regression from both scikit-learn and mathematics
- [[Machine learning] Understanding linear multiple regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/b84a0d669bcf5267e750)
- [[Machine learning] Understanding logistic regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/ee2a0687ca451fe213be)
- [[Machine learning] Understanding SVM from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/4688a50cffb2140f297d)
- [[Machine learning] Understand from mathematics why the correlation coefficient ranges from -1 to 1](https://qiita.com/Hawaii/items/3f4e91cf9b86676c202f)

2. What is standardization?

(1) What is standardization?

In short, it is **the process of transforming each variable so that it has mean 0 and standard deviation 1**.

It is often used in preprocessing for machine learning.

Roughly speaking, it **puts all variables on a common scale**. For example, suppose you have data on sales and temperature. Sales are on the order of hundreds of millions (of yen) while temperature is at most about 40 (°C), so the scales are completely different. Standardization converts each of these two variables so that it has mean 0 and standard deviation 1. (For instance, after standardization a sales value might be 0.4 and a temperature value 0.1. These values are arbitrary.)
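As a minimal sketch of this idea, the snippet below standardizes a sales column and a temperature column with plain NumPy. The numbers are borrowed from the table in Chapter 4, and the helper name `standardize` is made up for this illustration; it uses NumPy's default (population) standard deviation.

```python
import numpy as np

# Toy data taken from the temperature/sales table in Chapter 4.
sales = np.array([400_000_000, 390_000_000, 410_000_000,
                  405_000_000, 395_000_000], dtype=float)
temperature = np.array([10, 15, 30, 20, 40], dtype=float)


def standardize(x):
    """Hypothetical helper: subtract the mean, divide by the population std."""
    return (x - x.mean()) / x.std()


sales_z = standardize(sales)
temperature_z = standardize(temperature)

# Both columns now live on a comparable scale.
print(sales_z.round(2))
print(temperature_z.round(2))
```

After the transformation, both arrays have mean 0 and standard deviation 1, even though the raw values differ by seven orders of magnitude.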

(2) Why standardize?

There are many reasons, but one of them is that **"many machine learning algorithms assume that all variables are on the same scale"**.

For example, in the example of sales (say 400 million yen) and temperature (say 35 °C) mentioned earlier, we humans know in advance that the values are "sales" and "temperature", so even if the two variables differ in scale (unit size), we can interpret them without any difficulty.

However, a computer has no such awareness. It sees only the numbers "400,000,000" and "35", so a learning algorithm may give the variable with the larger values undue weight.

To avoid this problem, standardization puts the variables on a common scale in advance.

(3) Is there any other way than standardization?

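In addition to standardization, there is a method called **normalization** for putting variables on a common scale.

I won't go into detail this time, but standardization is often preferred because **it is less affected by outliers** than normalization (usually min-max scaling, whose range is set entirely by the two most extreme values). It is also often said that **"standardized data follows a normal distribution"**, but strictly speaking standardization only shifts and rescales the data without changing the shape of its distribution; the result is a standard normal distribution only if the original data were normally distributed.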
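To make the comparison concrete, here is a small sketch that applies both methods to the same column. It assumes that "normalization" means min-max scaling to the range 0 to 1 (scikit-learn's `MinMaxScaler`); the data, including the outlier 1000, are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single-column data with one large outlier (1000).
x = np.array([[10.0], [12.0], [11.0], [13.0], [1000.0]])

x_minmax = MinMaxScaler().fit_transform(x)  # rescales to the range [0, 1]
x_std = StandardScaler().fit_transform(x)   # rescales to mean 0, std 1

print(x_minmax.ravel().round(3))
print(x_std.ravel().round(3))
```

Printing both results lets you compare where each method places the four ordinary values relative to the outlier. With an outlier this extreme, both methods are affected, so "less affected" should be read as a tendency rather than a guarantee.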

3. Standardization with scikit-learn

As a concrete example, I will use the kickstarter-projects dataset from Kaggle, which I always use. https://www.kaggle.com/kemical/kickstarter-projects

This chapter is long, but **the actual standardization is only step (V)**, so it's a good idea to look there first.

(I) Import

# Import numpy and pandas
import numpy as np
import pandas as pd

# Import for handling date data
import datetime

# Import for splitting into training and test data
from sklearn.model_selection import train_test_split

# Import for standardization
from sklearn.preprocessing import StandardScaler

# Import for accuracy verification
from sklearn.model_selection import cross_val_score

# Imports for logistic regression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix

(II) Reading data


df = pd.read_csv("ks-projects-201801.csv")

(III) Checking the size of the data

The following shows that the dataset has shape (378661, 15), that is, 378,661 rows and 15 columns.

df.shape

(IV) Data preparation

◆ Funding period in days

I will omit the details, but since the data contains the start and end times of each crowdfunding campaign, I convert them into the "funding period in days".

df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days
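The date arithmetic above can be tried on a tiny frame. The two rows below are hypothetical values that only mimic the format of the `launched` and `deadline` columns; this is a sketch, not data from the Kaggle file.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the "launched" and "deadline" columns.
toy = pd.DataFrame({
    "launched": ["2017-09-02 04:43:57", "2015-08-11 12:12:28"],
    "deadline": ["2017-11-01", "2015-10-09"],
})
toy["launched"] = pd.to_datetime(toy["launched"])
toy["deadline"] = pd.to_datetime(toy["deadline"])

# Subtracting two datetime columns yields a Timedelta column;
# .dt.days keeps only the whole-day part (hours and minutes are discarded).
toy["days"] = (toy["deadline"] - toy["launched"]).dt.days
print(toy)
```

Note that `.dt.days` truncates rather than rounds, so a period of 59 days and 19 hours is recorded as 59.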

◆ About the target variable

I will omit the details here as well. The target variable "state" has categories other than success ("successful") and failure ("failed"), but this time I only use the rows for success and failure.

df = df[(df["state"] == "successful") | (df["state"] == "failed")]

Then replace success with 1 and failure with 0.

df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)

◆ Dropping unnecessary columns

Before building the model, drop the ID and name, which seem unnecessary (they should probably be kept, but this time they are dropped), as well as the variables that only become known after the crowdfunding campaign has run.

df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)

◆ Categorical variable processing

Encode the categorical variables with pd.get_dummies.

df = pd.get_dummies(df,drop_first = True)
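To see what `drop_first=True` does, here is a minimal sketch on a hypothetical frame; the column names `country` and `goal` are used only for illustration.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "country": ["US", "GB", "JP", "US"],
    "goal": [1000, 5000, 300, 20000],
})

# drop_first=True drops the first category level ("GB" here, in sorted order),
# so k categories become k-1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level avoids perfectly redundant indicator columns: a row with 0 in every remaining `country_*` column is understood to belong to the dropped level.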

(V) Standardization

Now, finally, standardization. Before that, let's take a quick look at the current data.

As in the example at the beginning, the scales of "goal" (target amount) and "days" (funding period) are quite different.

This time, let's standardize these two variables.

stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

After doing this, let's display the data again.

The goal and days columns have been standardized!
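One way to confirm the result is to check the mean and standard deviation of the transformed columns. The sketch below does this on a small hypothetical frame standing in for the `goal` and `days` columns, since the real values are not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for the "goal" and "days" columns.
toy = pd.DataFrame({
    "goal": [1000.0, 30000.0, 45000.0, 5000.0, 19500.0],
    "days": [58.0, 59.0, 44.0, 29.0, 55.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# StandardScaler uses the population standard deviation (ddof=0),
# so check with ddof=0 rather than pandas' default ddof=1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Each column is standardized independently, so passing both columns to a single `fit_transform` call gives the same result as transforming them one at a time.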

(VI) Logistic regression implementation

The standardization itself ends at (V), but to get a sense of the whole workflow, we will go on to build a logistic regression model.

# Split into training data and test data
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

# Build the logistic regression model
# (scikit-learn 1.1+ names this loss "log_loss"; older releases used loss="log" and penalty="none")
clf = SGDClassifier(loss="log_loss", penalty=None, random_state=1234)
clf.fit(X_train, y_train)

# Check accuracy with test data
clf.score(X_test, y_test)

The resulting accuracy is 0.665.

Supplement: Without the standardization in step (V), the accuracy would be 0.529, so standardization has improved the accuracy of the model as a whole.
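The code in this article standardizes the whole data set before splitting it. A common variation, shown here only as a sketch on synthetic data and not what the article's code does, is to fit the scaler on the training data alone and reuse its mean and standard deviation for the test data, so that no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the Kickstarter data).
rng = np.random.default_rng(1234)
X = np.column_stack([
    rng.normal(50_000, 20_000, size=200),  # something like "goal"
    rng.normal(30, 10, size=200),          # something like "days"
])
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_std = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_std.mean(axis=0).round(6))
print(X_test_std.mean(axis=0).round(3))
```

With this approach the training columns have mean exactly 0, while the test columns are only approximately centered, because they are scaled with statistics they did not contribute to.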

4. Understanding standardization from mathematics

So far, we have used scikit-learn so that we can at least implement standardization.

In this chapter, I would like to understand what standardization actually does from a mathematical point of view.

(1) Standardization formula

Let's use a concrete example here: the temperature and sales example mentioned at the beginning.

| | Temperature (°C) | Sales (yen) |
|---|---|---|
| 1 | 10 | 400,000,000 |
| 2 | 15 | 390,000,000 |
| 3 | 30 | 410,000,000 |
| 4 | 20 | 405,000,000 |
| 5 | 40 | 395,000,000 |

Let's start with the conclusion. Standardization is calculated by the following formula.

$$ z = \frac{x - u}{\sigma} $$

That may not mean much at first glance. $x$ is the value of each variable, $u$ is the mean of each variable, and $\sigma$ is the standard deviation of each variable.

Put very roughly, each value has the mean subtracted from it and is then divided by the standard deviation, that is, the degree of spread, so the picture is that data scattered over different ranges are pulled into a common range.

That probably still isn't clear, so let's actually compute the numbers.

■ Mean $u$

Since $u$ is the mean, the mean of the five temperature values is $(10 + 15 + 30 + 20 + 40) / 5 = 23$.

Similarly, the mean of sales is $(400,000,000 + 390,000,000 + 410,000,000 + 405,000,000 + 395,000,000) / 5 = 400,000,000$.

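■ Standard deviation $\sigma$

I will omit the calculation process. The standard deviation of the temperature is about 12.

The standard deviation of sales is 7,905,694.

Note that these are sample standard deviations (dividing by $n - 1$). scikit-learn's StandardScaler divides by $n$ instead, which gives about 10.77 for temperature and about 7,071,068 for sales.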
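The omitted calculation can be reproduced with NumPy. The figures quoted above match the sample standard deviation (`ddof=1`, dividing by n - 1); the sketch below also prints the population version (`ddof=0`, dividing by n) for comparison.

```python
import numpy as np

temperature = np.array([10, 15, 30, 20, 40], dtype=float)
sales = np.array([400e6, 390e6, 410e6, 405e6, 395e6])

for name, x in [("temperature", temperature), ("sales", sales)]:
    sample_std = x.std(ddof=1)      # divides by n - 1
    population_std = x.std(ddof=0)  # divides by n (what StandardScaler uses)
    print(name, x.mean(), round(sample_std, 2), round(population_std, 2))
```

The two definitions differ by a factor of sqrt(n / (n - 1)), which is noticeable for five rows but negligible for a data set with hundreds of thousands of rows.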

■ After standardization

The calculation is as follows. The values in the "Standardized temperature" and "Standardized sales" columns are what is used in machine learning.

| | Temperature (°C) | Sales (yen) | Standardized temperature | Standardized sales |
|---|---|---|---|---|
| 1 | 10 | 400,000,000 | (10 - 23) / 12 | (400,000,000 - 400,000,000) / 7,905,694 |
| 2 | 15 | 390,000,000 | (15 - 23) / 12 | (390,000,000 - 400,000,000) / 7,905,694 |
| 3 | 30 | 410,000,000 | (30 - 23) / 12 | (410,000,000 - 400,000,000) / 7,905,694 |
| 4 | 20 | 405,000,000 | (20 - 23) / 12 | (405,000,000 - 400,000,000) / 7,905,694 |
| 5 | 40 | 395,000,000 | (40 - 23) / 12 | (395,000,000 - 400,000,000) / 7,905,694 |
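The expressions in the table can be evaluated numerically. This sketch computes them with the sample standard deviation, as the table does (the table rounds the temperature value to 12), and compares the result with scikit-learn's `StandardScaler`, which divides by the population standard deviation instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: temperature, sales (values from the table above).
data = np.array([
    [10, 400_000_000],
    [15, 390_000_000],
    [30, 410_000_000],
    [20, 405_000_000],
    [40, 395_000_000],
], dtype=float)

# Manual z-scores with the sample standard deviation (ddof=1), as in the table.
z_table = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# StandardScaler divides by the population standard deviation (ddof=0).
z_sklearn = StandardScaler().fit_transform(data)

print(z_table.round(3))
print(z_sklearn.round(3))
```

The two results differ only by the constant factor sqrt(4 / 5), so the ordering and relative spacing of the values are identical either way.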

(2) Why is the average 0 and standard deviation 1?

The standardization formula is $z = \frac{x - u}{\sigma}$. Let's prove that the mean becomes 0 and the standard deviation becomes 1 when this calculation is performed.

Here, we will consider the formula $y = ax + b$. Standardization is the special case $a = \frac{1}{\sigma}$, $b = -\frac{u}{\sigma}$.

Mean and standard deviation before standardization

First, let's simply find the mean and standard deviation before standardization. In general, the mean is not 0 and the standard deviation is not 1.

■ Mean

Writing the mean of $x$ as $\mu$ (the same quantity as $u$ above), **the mean of $y$ can be expressed as $a\mu + b$**.

$E$ denotes the expected value; if you are not familiar with expected values, you can roughly think of it as the "mean" (strictly speaking, they are different).

■ Standard deviation

We first find the variance, which comes out to **$a^2\sigma^2$ (so the standard deviation is $|a|\sigma$, which equals $a\sigma$ when $a > 0$)**.
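One way to write out this step, assuming $E[x] = \mu$ and using the linearity of the expected value (this is a reconstruction of the standard argument, not a copy of the original figure):

$$
E[y] = E[ax + b] = aE[x] + b = a\mu + b
$$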
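The variance calculation can be sketched as follows, assuming $V[x] = E[(x - \mu)^2] = \sigma^2$ and using $E[y] = a\mu + b$ (again a reconstruction of the standard argument):

$$
\begin{aligned}
V[y] &= E\left[(y - E[y])^2\right] \\
     &= E\left[(ax + b - a\mu - b)^2\right] \\
     &= a^2 E\left[(x - \mu)^2\right] \\
     &= a^2 \sigma^2
\end{aligned}
$$

Taking the square root gives a standard deviation of $|a|\sigma$.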

Mean and standard deviation after standardization

Next, let's find the mean and standard deviation after standardization. The idea behind the derivation is basically the same as before standardization.

Working through the algebra shows that after standardization the mean is 0 and the standard deviation is 1!

■ Mean

Since $E[x] = \mu$, the mean of $z = \frac{x - \mu}{\sigma}$ works out to 0.

■ Standard deviation

As before, we find the variance first. Since the variance is 1, the standard deviation is also 1.

With the above, we have shown from the formulas that the mean of each variable after standardization is 0 and the standard deviation is 1!
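Written out for $z = \frac{x - \mu}{\sigma}$, with $\mu = E[x]$ and $\sigma$ treated as constants (a reconstruction of the standard argument, not a copy of the original figure):

$$
E[z] = E\left[\frac{x - \mu}{\sigma}\right] = \frac{1}{\sigma}\left(E[x] - \mu\right) = \frac{1}{\sigma}(\mu - \mu) = 0
$$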
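The variance step can be sketched in the same way, using $E[z] = 0$ and $E[(x - \mu)^2] = \sigma^2$ (again a reconstruction of the standard argument):

$$
\begin{aligned}
V[z] &= E\left[(z - E[z])^2\right] = E\left[z^2\right] \\
     &= E\left[\frac{(x - \mu)^2}{\sigma^2}\right] \\
     &= \frac{1}{\sigma^2} E\left[(x - \mu)^2\right] \\
     &= \frac{\sigma^2}{\sigma^2} = 1
\end{aligned}
$$

The standard deviation is therefore $\sqrt{1} = 1$.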
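The same facts can also be checked numerically. The sketch below uses arbitrary random data and arbitrary coefficients `a` and `b` (all made up for illustration) to confirm the behavior of $y = ax + b$, then treats standardization as the special case $a = 1/\sigma$, $b = -\mu/\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=23.0, scale=12.0, size=1000)  # arbitrary made-up data

mu, sigma = x.mean(), x.std()  # population mean and standard deviation

# General linear transform y = a*x + b with arbitrary coefficients.
a, b = 3.0, -5.0
y = a * x + b
print(y.mean(), a * mu + b)      # mean(y) equals a*mu + b
print(y.std(), abs(a) * sigma)   # std(y) equals |a|*sigma

# Standardization is the special case a = 1/sigma, b = -mu/sigma.
z = (x - mu) / sigma
print(round(z.mean(), 6), round(z.std(), 6))
```

Because the check uses the data's own mean and standard deviation, the result holds for any data set with non-zero spread, not just for normally distributed values.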

5. Conclusion

How was it? My view is that, since you can't interpret extremely complicated code from the start, it is very important at first not to worry about accuracy and simply to try implementing a basic series of steps with scikit-learn and similar tools.

However, once you get used to it, I feel it is very important to understand, from the mathematical background, how these tools work behind the scenes.

Some of the content may be difficult to follow, but I hope it helps to deepen your understanding.

[Machine learning] Understand from mathematics that standardization results in an average of 0 and a standard deviation of 1.