Anyone who wants to try machine learning can implement it fairly easily with scikit-learn and similar libraries. However, if you want to deliver results at work or raise your skill level, you will find that ** "I don't know the background, but I got this result" ** is clearly a weak position.
This time, we will take up ** "standardization" **, which appears in preprocessing.
This article is for readers who have heard of standardization but wonder "Why do we do it?" or "How do I use it with scikit-learn?", and for readers who know that "standardization is just a process that makes the mean 0 and the standard deviation 1" but want to understand from the formulas what calculation is actually performed.
First, Chapter 2 outlines standardization, and Chapter 3 standardizes real data with scikit-learn. Finally, Chapter 4 looks at the standardization formula (whether standardization really results in a mean of 0 and a standard deviation of 1).
To start with the conclusion, standardization is ** the process of transforming each variable so that it has mean 0 and standard deviation 1 **.
It is often used in preprocessing in machine learning.
Roughly speaking, it ** aligns the scales of the variables **. For example, if you have data on sales and temperature, sales are in the hundreds of millions while temperature is at most about 40 (℃), so the scales are completely different. We therefore convert both variables so that each has a mean of 0 and a standard deviation of 1. (For instance, after standardization sales might be 0.4 and temperature 0.1. * These values are made up for illustration.)
There are several reasons for standardizing, but one of them is that ** "many machine learning algorithms assume that each variable has the same scale" **.
Take the earlier example of sales (say 400 million yen) and temperature (say 35 ℃). We humans know in advance that one value is "sales" and the other is "temperature", so even though the two variables are on different scales (different units), we can interpret them without any trouble.
A computer has no such knowledge. It sees only the numbers "400,000,000" and "35", and a machine learning algorithm may give the larger number far more weight than it deserves.
To avoid this, standardization adjusts the scale of each variable in advance.
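To make the scale problem concrete, here is a minimal sketch using only the Python standard library. The three records and the helper `standardize_columns` are hypothetical, chosen for illustration; the sketch compares the Euclidean distance between two records before and after standardization.

```python
import math
import statistics

# Hypothetical (temperature in ℃, sales in yen) records, for illustration only.
records = [(10, 400_000_000), (15, 390_000_000), (40, 395_000_000)]


def standardize_columns(rows):
    """Standardize each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


# Raw distance between the first two records: the sales gap swamps the temperature gap.
raw_distance = math.dist(records[0], records[1])

# After standardization, both variables contribute on a comparable scale.
scaled = standardize_columns(records)
scaled_distance = math.dist(scaled[0], scaled[1])

print(raw_distance)
print(scaled_distance)
```

Before scaling, the distance is essentially the 10,000,000 yen difference in sales, and the 5 ℃ difference in temperature is invisible. After scaling, both differences are expressed in units of standard deviations.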
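Besides standardization, there is another method called ** normalization ** for matching the scale of each variable.
I won't go into detail this time, but standardization is often preferred because ** "it is less sensitive to outliers than normalization" **. Note that standardization does not change the shape of a distribution: data that already follow a normal distribution become standard normal (mean 0, standard deviation 1), but data that are not normally distributed do not become normal.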
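For readers who want to compare the two methods on the same data, the following sketch applies both to a list with one extreme value. It assumes that "normalization" here means min-max scaling to the range [0, 1]; the function names and the data are hypothetical.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical temperatures with one extreme outlier (1000), for illustration only.
temperatures = [10, 15, 30, 20, 40, 1000]

normalized = min_max_normalize(temperatures)
standardized = standardize(temperatures)

print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
```

Min-max scaling is determined entirely by the smallest and largest values, so a single outlier fixes the whole range. The mean and standard deviation are computed from every value, although they are still pulled by the outlier to some extent.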
As a concrete example, I will use Kaggle's kickstarter-projects dataset, which I use regularly. https://www.kaggle.com/kemical/kickstarter-projects
This chapter is long, but ** the standardization itself happens only in step (v) **, so you may want to look there first.
#Import numpy and pandas
import numpy as np
import pandas as pd
#Import to perform some processing on date data
import datetime
#Import for training and test data split
from sklearn.model_selection import train_test_split
#Import for standardization
from sklearn.preprocessing import StandardScaler
#Import for accuracy verification
from sklearn.model_selection import cross_val_score
#Import for logistic regression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix
df = pd.read_csv("ks-projects-201801.csv")
The following shows that the dataset has shape (378661, 15).
df.shape
I will omit the details, but since the data contain the launch time and deadline of each crowdfunding campaign, I convert them into a "campaign duration in days".
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days
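The same subtraction can be tried with the standard `datetime` module. The timestamps below are made up for illustration, not rows quoted from the dataset. The point of the sketch is that taking `.days` keeps only the whole-day part of the difference, which is also what pandas' `.dt.days` returns.

```python
from datetime import datetime

# Hypothetical launch and deadline timestamps, for illustration only.
launched = datetime(2015, 8, 11, 12, 12, 28)
deadline = datetime(2015, 10, 9)

# Subtracting two datetimes gives a timedelta; .days is its whole-day component.
duration = deadline - launched
days = duration.days

print(duration)
print(days)
```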
I will omit the details here as well, but the objective variable "state" has categories other than success ("successful") and failure ("failed"). This time I use only the rows for success and failure.
df = df[df["state"].isin(["successful", "failed"])].copy()
Then replace success with 1 and failure with 0.
df["state"] = df["state"].map({"failed": 0, "successful": 1})
Before building the model, drop ID and name, which I consider unnecessary (they should arguably be kept, but they are dropped this time), along with the variables whose values are known only after the crowdfunding campaign has run.
df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)
Convert the categorical variables into dummy variables with pd.get_dummies.
df = pd.get_dummies(df,drop_first = True)
Now for the standardization itself. Before that, let's take a quick look at the current data.
As in the example at the beginning, you can see that the scales of "goal" (target amount) and "days" (campaign duration) are quite different.
Let's standardize these two variables.
stdsc = StandardScaler()
#fit_transform returns an (n, 1) array, so flatten it before assigning it back to the column
df["goal"] = stdsc.fit_transform(df[["goal"]]).ravel()
df["days"] = stdsc.fit_transform(df[["days"]]).ravel()
After doing this, let's display the data again.
The goal and days columns have now been standardized!
The standardization itself ends with (v), but to show the whole flow, we also build a logistic regression model.
#Divided into training data and test data
y = df["state"].values
X = df.drop("state", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
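#Logistic regression model construction
#loss="log_loss" is the name used since scikit-learn 1.1 (formerly "log"); penalty=None means no regularization
clf = SGDClassifier(loss="log_loss", penalty=None, random_state=1234)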
clf.fit(X_train,y_train)
#Check accuracy with test data
clf.score(X_test, y_test)
Now the accuracy is 0.665.
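One point worth noting about the flow above: the mean and standard deviation were computed on the whole dataset before it was split. A common alternative is to compute them on the training data only and reuse them for the test data, so that no information from the test set enters preprocessing. The sketch below shows the idea in plain Python. `SimpleScaler` and the sample values are hypothetical stand-ins, not part of scikit-learn. With `StandardScaler`, the equivalent is to call `fit` (or `fit_transform`) on the training data and `transform` on the test data.

```python
import statistics


class SimpleScaler:
    """Hypothetical minimal scaler with separate fit and transform steps."""

    def fit(self, values):
        # Learn the statistics from the data passed in (here: training data only).
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the previously learned mean and standard deviation.
        return [(v - self.mean_) / self.std_ for v in values]


# Hypothetical "days" values, already split into training and test portions.
days_train = [58, 59, 44, 29, 55, 30]
days_test = [34, 19]

scaler = SimpleScaler().fit(days_train)
train_z = scaler.transform(days_train)
test_z = scaler.transform(days_test)  # reuses the training mean and std

print(round(statistics.fmean(train_z), 6), round(statistics.pstdev(train_z), 6))
print([round(v, 3) for v in test_z])
```

The test column will generally not have exactly mean 0 and standard deviation 1 after this transformation, which is expected.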
So far, we have used scikit-learn so that we can at least implement standardization.
In this chapter, I would like to understand what standardization actually does from a mathematical point of view.
Let's use a concrete example: temperature and sales, as mentioned at the beginning.
No. | Temperature (℃) | Sales (yen) |
---|---|---|
1 | 10 | 400,000,000 |
2 | 15 | 390,000,000 |
3 | 30 | 410,000,000 |
4 | 20 | 405,000,000 |
5 | 40 | 395,000,000 |
Let's start with the conclusion. Standardization is calculated by the following formula.
$$ z = \frac{x - u}{σ} $$
This may not mean much at first glance. $x$ is each value of a variable, $u$ is the mean of that variable, and $σ$ is its standard deviation.
Put very roughly, we subtract the mean from each value and divide by the standard deviation (the degree of spread), so data scattered over different ranges are brought onto a common scale.
That is still hard to picture, so let's actually compute the numbers.
■ Mean $u$
Since $u$ is the mean, the mean of the five temperature values is $(10 + 15 + 30 + 20 + 40) / 5 = 23$.
Similarly, the mean of sales is $(400,000,000 + 390,000,000 + 410,000,000 + 405,000,000 + 395,000,000) / 5 = 400,000,000$.
■ Standard deviation $σ$
I will omit the calculation, but the standard deviation of temperature is about 12 (12.04) and the standard deviation of sales is about 7,905,694. These are sample standard deviations (the variance is divided by $n - 1$). With the population formula (divided by $n$), which scikit-learn's StandardScaler uses, they would be about 10.77 and 7,071,068.
■ After standardization
The results are as follows. The standardized temperature and standardized sales columns are the values used in machine learning.
No. | Temperature (℃) | Sales (yen) | Standardized temperature | Standardized sales |
---|---|---|---|---|
1 | 10 | 400,000,000 | -1.08 | 0.00 |
2 | 15 | 390,000,000 | -0.66 | -1.26 |
3 | 30 | 410,000,000 | 0.58 | 1.26 |
4 | 20 | 405,000,000 | -0.25 | 0.63 |
5 | 40 | 395,000,000 | 1.41 | -0.63 |
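The numbers in this example can be reproduced with a few lines of standard-library Python. This is a sketch rather than the article's original calculation. It assumes the sample standard deviation (`statistics.stdev`, dividing by n - 1), which is what yields the 7,905,694 quoted for sales. The helper name `standardize` is hypothetical.

```python
import statistics

temperature = [10, 15, 30, 20, 40]
sales = [400_000_000, 390_000_000, 410_000_000, 405_000_000, 395_000_000]


def standardize(values, sample=True):
    """Return z-scores; sample=True divides the variance by n - 1, otherwise by n."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(v - mean) / std for v in values]


temperature_z = standardize(temperature)
sales_z = standardize(sales)

print(statistics.fmean(temperature), statistics.fmean(sales))  # the two means
print(round(statistics.stdev(temperature), 2))  # sample std of temperature
print(round(statistics.stdev(sales)))  # sample std of sales
print([round(z, 2) for z in temperature_z])
print([round(z, 2) for z in sales_z])
```

Passing `sample=False` switches to the population standard deviation, which matches what `StandardScaler` computes and gives slightly larger z-scores.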
The standardization formula is $z = \frac{x - u}{σ}$. Let's prove that the mean is 0 and the standard deviation is 1 when this calculation is performed.
To do so, we first consider the general formula $y = ax + b$.
First, let's find the mean and standard deviation of $y$ before standardization comes in. In general, the mean is not 0 and the standard deviation is not 1.
■ Mean
The mean of ** $y$ can be expressed as $au + b$ **, where $u$ is the mean of $x$.
■ Standard deviation
We first find the variance, which is ** $a^2σ^2$ **, where $σ$ is the standard deviation of $x$. (The standard deviation is therefore $|a|σ$, which equals $aσ$ when $a > 0$.)
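For reference, here is one way to write out this step. It is a sketch that assumes $n$ values $x_1, \dots, x_n$ with mean $u$ and variance $σ^2$ (using the divide-by-$n$ definition), and $y_i = a x_i + b$. The symbols $\bar{y}$ and $s_y^2$ for the mean and variance of $y$ are notation introduced here.

$$
\begin{aligned}
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n}(a x_i + b) = a \cdot \frac{1}{n}\sum_{i=1}^{n} x_i + b = a u + b \\
s_y^2 &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n}(a x_i + b - a u - b)^2 = a^2 \cdot \frac{1}{n}\sum_{i=1}^{n}(x_i - u)^2 = a^2 σ^2
\end{aligned}
$$

The constant $b$ cancels in the variance, which is why shifting the data changes the mean but not the spread.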
Next, let's find the mean and standard deviation after standardization. The approach is basically the same as before.
The standardization formula can be rewritten as $z = \frac{1}{σ}x - \frac{u}{σ}$, which is $y = ax + b$ with $a = \frac{1}{σ}$ and $b = -\frac{u}{σ}$, where $u$ is the mean of $x$ and $σ$ is its standard deviation. Substituting these values shows that the mean is 0 and the standard deviation is 1 after standardization!
■ Mean
$au + b = \frac{u}{σ} - \frac{u}{σ} = 0$
■ Standard deviation
As before, we find the variance first: $a^2σ^2 = \frac{σ^2}{σ^2} = 1$. Since the variance is 1, the standard deviation is also 1.
With the above, we have shown from the formulas that each variable has a mean of 0 and a standard deviation of 1 after standardization!
How was it? My view is that, since nobody can interpret extremely complicated code from the start, it is very important to first implement a basic series of steps with scikit-learn and the like, without worrying about accuracy.
However, once you get used to that, I feel it is very important to understand from the mathematical background how these tools work behind the scenes.
Some of the content may have been hard to follow, but I hope it helps deepen your understanding.
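The result can also be checked numerically. The sketch below uses the temperature values from the example and arbitrary coefficients `a` and `b` chosen for illustration. It uses the population standard deviation (`statistics.pstdev`), in line with a divide-by-n definition of variance.

```python
import statistics

x = [10, 15, 30, 20, 40]
a, b = 2.5, -7.0  # arbitrary coefficients, chosen for illustration

u = statistics.fmean(x)
sigma = statistics.pstdev(x)

# General case: y = ax + b has mean a*u + b and standard deviation |a|*sigma.
y = [a * v + b for v in x]
mean_y = statistics.fmean(y)
std_y = statistics.pstdev(y)

# Standardization is the special case a = 1/sigma, b = -u/sigma.
z = [(v - u) / sigma for v in x]

print(mean_y, a * u + b)
print(std_y, abs(a) * sigma)
print(statistics.fmean(z), statistics.pstdev(z))
```

Because of floating-point rounding, the last line may print values extremely close to 0 and 1 rather than exactly 0 and 1.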