[Machine learning] Understand from mathematics that standardization results in an average of 0 and a standard deviation of 1.

1. Purpose

Anyone can implement machine learning relatively easily with scikit-learn and similar libraries. However, if you want to deliver results at work or raise your level, an attitude of **"I don't know the background, but I got this result"** is clearly a weakness.

This time, we will take up **"standardization"**, which appears in preprocessing.

The goal of this article is to answer questions such as "I've heard of standardization, but why do we do it?", "How do I use it with scikit-learn?", and "I know standardization is just a process that makes the mean 0 and the standard deviation 1, but I want to understand from the formulas what calculation is actually being done."

First, Chapter 2 outlines standardization, and Chapter 3 actually performs standardization with scikit-learn. Finally, Chapter 4 touches on the standardization formula (and on whether standardization really does yield a mean of 0 and a standard deviation of 1).

2. What is standardization?

(1) What is standardization?

To state the conclusion first, it is **the process of transforming each variable so that it has a mean of 0 and a standard deviation of 1**.

It is often used in preprocessing in machine learning.

Roughly speaking, the image is that it **aligns the units (scales) of the data**. For example, suppose you have data on sales and temperature: with sales in the hundreds of millions of yen and temperature topping out around 40 °C, the scales are completely different, so we convert both so that each has a mean of 0 and a standard deviation of 1. (Concretely, after standardization a sales value might be 0.4 and a temperature value 0.1; these numbers are just illustrative.)
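As a rough numerical sketch of this idea (the numbers below are made up for illustration, just like the 0.4 and 0.1 above), standardizing two columns on very different scales might look like:

```python
import numpy as np

# Made-up data on completely different scales
sales = np.array([400e6, 390e6, 410e6, 405e6, 395e6])   # hundreds of millions
temperature = np.array([10.0, 15.0, 30.0, 20.0, 40.0])  # tens

def standardize(x):
    # Subtract the mean, then divide by the (population) standard deviation
    return (x - x.mean()) / x.std()

print(standardize(sales))        # now centered on 0, spread of about 1
print(standardize(temperature))  # same scale as the standardized sales
```

After this transformation both columns live on a common scale, regardless of their original units.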

(2) Why standardize?

There seem to be many reasons, but one of them is that **many machine learning algorithms assume that the variables are on the same scale**.

For example, with the sales (say, 400 million yen) and temperature (say, 35 °C) data mentioned earlier, we humans know in advance that one column is "sales" and the other is "temperature", so even though the two variables are on different scales (different units), we can interpret them without any discomfort.

A computer, however, is not aware of this: it may see nothing but the numbers "400,000,000" and "35" and give them the wrong weight during machine learning.

Standardization adjusts the scale of each variable in advance to eliminate such adverse effects.

(3) Is there any other way than standardization?

In addition to standardization, there is also a method called **normalization** (min-max scaling) for matching the scales of variables.

I won't go into detail this time, but standardization seems to be preferred mainly because **it is less strongly affected by outliers than normalization**. (Note that, contrary to a common misconception, standardization does not turn the data into a normal distribution; it only shifts and rescales it.)
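Here is a small sketch (with made-up numbers) of the outlier point: min-max normalization squeezes everything into [0, 1], so one extreme value crushes all the ordinary ones into a narrow band, while standardization keeps more of their spread:

```python
import numpy as np

# Four ordinary values plus one large outlier (illustrative data)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Normalization (min-max scaling): squeeze everything into [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: mean 0, standard deviation 1
standardized = (x - x.mean()) / x.std()

# With min-max, the outlier crushes the ordinary values into a tiny band near 0
print(normalized)     # the first four values all land below 0.04
print(standardized)
```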

3. Standardization with scikit-learn

As a concrete example, I will use Kaggle's kickstarter-projects dataset, which I always use. https://www.kaggle.com/kemical/kickstarter-projects

This chapter is long, but **the essential standardization is only in (v)**, so it's fine to jump there first.

(i) Imports

# Import numpy and pandas
import numpy as np
import pandas as pd

# Import datetime to process the date columns
import datetime

# Import for splitting into training and test data
from sklearn.model_selection import train_test_split

# Import for standardization
from sklearn.preprocessing import StandardScaler

# Import for accuracy validation
from sklearn.model_selection import cross_val_score

# Imports for logistic regression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix

(ii) Reading the data


df = pd.read_csv("ks-projects-201801.csv")

(iii) Checking the number of records

Running the following shows that the dataset has the shape (378661, 15).

df.shape

(iv) Data preparation

◆ Number of campaign days

I'll omit the details, but since the data contains the start time and deadline of each crowdfunding campaign, we convert these into a "number of campaign days".

df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days
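As a minimal sketch of what this conversion does (the dates below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# One hypothetical campaign (dates are made up, not from the dataset)
demo = pd.DataFrame({"launched": ["2017-01-01"], "deadline": ["2017-01-31"]})
demo["deadline"] = pd.to_datetime(demo["deadline"])
demo["launched"] = pd.to_datetime(demo["launched"])

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
demo["days"] = (demo["deadline"] - demo["launched"]).dt.days
print(demo["days"].iloc[0])  # 30
```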

◆ The objective variable

I'll omit the details here as well: the objective variable "state" contains categories besides success ("successful") and failure ("failed"), but this time we only use the success and failure records.

df = df[(df["state"] == "successful") | (df["state"] == "failed")]

Then replace success with 1 and failure with 0.

df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)

◆ Dropping unnecessary columns

Before building the model, we drop ID and name, which we judge unnecessary (these would normally be kept, but we discard them this time), along with the variables that only become known once the crowdfunding campaign has run.

df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)

◆ Processing categorical variables

One-hot encode the categorical variables with pd.get_dummies.

df = pd.get_dummies(df,drop_first = True)
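As a small sketch of what pd.get_dummies with drop_first=True does (the category values below are hypothetical, not the real Kickstarter categories):

```python
import pandas as pd

# A hypothetical categorical column
demo = pd.DataFrame({"category": ["Music", "Film", "Games", "Music"]})

# drop_first=True drops the first level ("Film") to avoid redundant columns:
# a row with all zeros then implicitly means "Film"
encoded = pd.get_dummies(demo, drop_first=True)
print(encoded.columns.tolist())  # ['category_Games', 'category_Music']
```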

(v) Standardization

It's finally time for standardization. But first, let's take a quick look at the current data.

(Screenshot キャプチャ1.PNG: the first few rows of df before standardization)

As in the example at the beginning, you can see that the scales of "goal" (target amount) and "days" (campaign period) are quite different.

This time, let's standardize these two variables.

stdsc = StandardScaler()
df["goal"] = stdsc.fit_transform(df[["goal"]].values)
df["days"] = stdsc.fit_transform(df[["days"]].values)

After doing this, let's display the data again.

(Screenshot キャプチャ2.PNG: the same rows after standardization)

Goal and days data have been standardized!
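If you want to confirm the effect numerically, check the mean and standard deviation of a standardized column: they should come out at (almost exactly) 0 and 1. A self-contained sketch with made-up goal values, not the real Kickstarter data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up "goal" values; shape (n, 1) because StandardScaler expects 2D input
goal = np.array([[1000.0], [5000.0], [20000.0], [50000.0], [300.0]])

stdsc = StandardScaler()
scaled = stdsc.fit_transform(goal)

print(scaled.mean())  # ~0 up to floating-point error
print(scaled.std())   # ~1 (StandardScaler divides by the population std)
```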

(vi) Logistic regression implementation

The standardization itself was completed in (v), but to get a feel for the whole workflow, let's go on and build a logistic regression model.

# Split into training and test data
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

# Build the logistic regression model
# (newer scikit-learn versions spell these loss="log_loss" and penalty=None)
clf = SGDClassifier(loss = "log", penalty = "none", random_state=1234)
clf.fit(X_train, y_train)

# Check accuracy on the test data
clf.score(X_test, y_test)

This gives an accuracy of about 0.665.

4. Understanding standardization from mathematics

So far, we have used scikit-learn to perform standardization, with the aim of simply getting an implementation working.

In this chapter, I would like to understand, from a mathematical point of view, what processing standardization is actually doing.

(1) Standardization formula

Let's work through a concrete example here: the temperature and sales data mentioned at the beginning.

   temperature   sales (yen)
1           10   400,000,000
2           15   390,000,000
3           30   410,000,000
4           20   405,000,000
5           40   395,000,000

Let's start with the conclusion. Standardization is calculated by the following formula.

z = \frac{x - \mu}{\sigma}

On its own this may not mean much. $x$ is each value of the variable, $\mu$ is the mean of the variable, and $\sigma$ is the standard deviation of the variable.

Put very roughly, we subtract the mean from each value and then divide by the variable's standard deviation, i.e. its degree of spread, so the image is that data scattered over all sorts of ranges is gathered onto one common scale.

...Still not clear? Let's actually work out the numbers.

■ Mean $\mu$: Since $\mu$ is the mean, the mean of the five temperature values is $(10 + 15 + 30 + 20 + 40) / 5 = 23$.

Similarly, calculating the mean of sales gives $(400,000,000 + 390,000,000 + 410,000,000 + 405,000,000 + 395,000,000) / 5 = 400,000,000$.

■ Standard deviation $\sigma$: I will omit the calculation process, but the (sample) standard deviation of the temperature is about 12.

The (sample) standard deviation of sales is 7,905,694.

■ Putting it together: the calculation gives the results below, and it is the values in the standardized temperature and standardized sales columns that are then used in machine learning.

   temperature   sales (yen)   standardized temperature   standardized sales
1           10   400,000,000   (10-23)/12                 (400,000,000-400,000,000)/7,905,694
2           15   390,000,000   (15-23)/12                 (390,000,000-400,000,000)/7,905,694
3           30   410,000,000   (30-23)/12                 (410,000,000-400,000,000)/7,905,694
4           20   405,000,000   (20-23)/12                 (405,000,000-400,000,000)/7,905,694
5           40   395,000,000   (40-23)/12                 (395,000,000-400,000,000)/7,905,694
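These figures are easy to check with NumPy. Note that 12 and 7,905,694 correspond to the sample standard deviation (ddof=1); scikit-learn's StandardScaler divides by the population standard deviation (ddof=0) instead:

```python
import numpy as np

temperature = np.array([10.0, 15.0, 30.0, 20.0, 40.0])
sales = np.array([400e6, 390e6, 410e6, 405e6, 395e6])

print(temperature.mean())       # 23.0
print(sales.mean())             # 400000000.0
print(temperature.std(ddof=1))  # ~12.04  (the "about 12" above)
print(sales.std(ddof=1))        # ~7905694.15

# Standardized temperature, matching the (x - 23) / 12 column above
z_temp = (temperature - temperature.mean()) / temperature.std(ddof=1)
print(z_temp)
```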

(2) Why is the average 0 and standard deviation 1?

The standardization formula is $z = \frac{x - \mu}{\sigma}$; let's prove that performing this calculation really does give a mean of 0 and a standard deviation of 1.

To do this, consider the linear transformation $y = ax + b$. Standardization is the special case $a = 1/\sigma$ and $b = -\mu/\sigma$, since $z = \frac{x - \mu}{\sigma} = \frac{1}{\sigma}x - \frac{\mu}{\sigma}$.


Mean and standard deviation before standardization

First, let's simply find the mean and standard deviation of $y = ax + b$ for general $a$ and $b$. As you can see, neither the mean nor the standard deviation is 0 or 1 in general.

■ Mean: writing $\mu$ for the mean of $x$, **the mean of $y$ can be expressed as $a\mu + b$**:

\bar{y} = \frac{1}{n}\sum_{i=1}^{n}(ax_i + b) = a \cdot \frac{1}{n}\sum_{i=1}^{n}x_i + b = a\mu + b

■ Standard deviation: we first find the variance, which comes out to **$a^2\sigma^2$ (so the standard deviation is $a\sigma$)**:

\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n}(ax_i + b - a\mu - b)^2 = a^2 \cdot \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 = a^2\sigma^2


Mean and standard deviation after standardization

Next, let's find the mean and standard deviation after standardization. The idea is exactly the same as before standardization: we simply substitute $a = 1/\sigma$ and $b = -\mu/\sigma$ into the results above.

Working through the substitution shows that after standardization the mean is 0 and the standard deviation is 1!

■ Mean

a\mu + b = \frac{1}{\sigma}\mu - \frac{\mu}{\sigma} = 0

■ Standard deviation: as before, we find the variance first. Since the variance is 1, the standard deviation is also 1.

a^2\sigma^2 = \frac{1}{\sigma^2} \cdot \sigma^2 = 1
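The proof can also be checked numerically: treating standardization as the linear transformation with $a = 1/\sigma$ and $b = -\mu/\sigma$, the transformed data ends up with mean 0 and standard deviation 1 (the random data here is just for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(loc=50.0, scale=8.0, size=10000)  # arbitrary mean and spread

mu, sigma = x.mean(), x.std()
a, b = 1 / sigma, -mu / sigma  # standardization written as z = a*x + b
z = a * x + b

print(z.mean())  # 0 up to floating-point error, i.e. a*mu + b
print(z.std())   # 1, i.e. the square root of a**2 * sigma**2
```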

With the above, we have proved from the formulas that after standardization each variable has a mean of 0 and a standard deviation of 1!

5. Conclusion

How was it? My view is that, at first, rather than trying to interpret extremely complicated code, it is very important to just implement a basic series of steps with scikit-learn and the like, without worrying too much about accuracy.

Once you get used to that, though, I feel it is very important to understand, from the mathematical background, how these tools work behind the scenes.

Some of the content may be hard to follow, but I hope it helps deepen your understanding.
