I recently learned about machine learning, so I've summarized the steps to take when implementing it in Python.

Data preprocessing

In machine learning, the first step is to load the data and understand how it is distributed. This section describes the procedure.

Data reading

To load the data, read the CSV file with the read_csv function of pandas, which is imported at the top of the code below.

# Load the libraries
import pandas as pd
import numpy as np

# Load hoge.csv from the current directory
df = pd.read_csv("./hoge.csv")
# Display the first 5 rows
df.head()

Confirmation of read data

Machine learning is often imagined as something that produces useful results as soon as you feed it data, but in reality you need to look closely at the data first. For example: are there missing values, is the variation too large, and are the variables correlated?
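As a rough illustration of the missing-value check mentioned above, here is a minimal sketch. The DataFrame `toy` and its column names are hypothetical stand-ins for the real data, and dropping rows is only one of several possible ways to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for hoge.csv; the column names are made up.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0],
    "x2": [10.0, 20.0, 30.0, 40.0],
    "y": [1.5, np.nan, 3.5, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One simple option: drop every row that contains a missing value
toy_clean = toy.dropna()
print(len(toy), "->", len(toy_clean), "rows")
```

Whether dropping rows is acceptable depends on how much data is lost; here three of the four rows contain a missing value, so most of the toy data disappears.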

Basic statistics

Running the following code shows the count, mean, standard deviation, minimum, maximum, and other statistics of each column at once.

#Calculation of statistics
df.describe()

The basic statistics are listed in a table like this. (Figure: basic statistics)

Confirmation of distribution

However, numbers such as the standard deviation and the mean are hard to interpret on their own, and a graph is easier for humans to understand. So next, plot the distribution.

%matplotlib inline
# Load seaborn, a library for drawing graphs
import seaborn as sns

# Check the distribution
# (distplot is deprecated in recent seaborn versions; histplot with kde=True is its replacement)
sns.histplot(df["x1"], kde=True)  # Check one column at a time (here, column x1)

This is nice data because it appears to follow a normal distribution. (Figure: distribution of x6)

Confirmation of correlation coefficient

If the inputs have no correlation with the output at all, there is little point in training on them, so check the correlation coefficients. The correlation coefficient lies between -1 and +1, and the closer its absolute value is to 1, the stronger the linear correlation.

# Calculate the correlation coefficients between all columns
df.corr()
# Check the relationships visually with pairwise scatter plots
sns.pairplot(df)
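When there are many columns, the full correlation table can be hard to read. A minimal sketch, assuming the output column is named `y` (a hypothetical name, as is the synthetic data), is to pull out just the correlations with the output and sort them by absolute value:

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: y depends on x1 but not on x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "y": 3.0 * x1 + rng.normal(scale=0.1, size=200),
})

# Correlation of every input column with the output column
corr_with_y = toy.corr()["y"].drop("y")

# Rank by absolute value, since strong negative correlations matter too
ranked = corr_with_y.abs().sort_values(ascending=False)
print(ranked)
```

In this toy data, x1 should rank far above x2 because only x1 was used to generate y.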

Separation of input variables and output variables

What we actually build is a model of the form $y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots$, so the data must be divided into the output variable y and the input variables x. For this, I use the iloc indexer of pandas.

# df.iloc[row, column] extracts the data at that row and column position
example = df.iloc[1, 3]  # Result: 100
# Fetch all rows, and every column up to (but not including) the last one (input variables X)
X = df.iloc[:, :-1]
# Writing the last column's number explicitly also works, but is less versatile:
# X = df.iloc[:, :<last column number>]

# Take out y (the last column)
y = df.iloc[:, -1]
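To make the slicing concrete, here is a small sketch on a hypothetical three-row DataFrame (the name `toy` and its columns are made up for illustration), assuming, as in the text, that the output variable is stored in the last column:

```python
import pandas as pd

# Hypothetical toy data: two input columns followed by the output column.
toy = pd.DataFrame({
    "x1": [1, 2, 3],
    "x2": [4, 5, 6],
    "y": [7, 8, 9],
})

X_toy = toy.iloc[:, :-1]  # all rows, every column except the last
y_toy = toy.iloc[:, -1]   # all rows, last column only

print(X_toy.shape)     # (3, 2)
print(y_toy.shape)     # (3,)
print(toy.iloc[1, 1])  # 5 (second row, second column; positions start at 0)
```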

Machine learning using preprocessed data

With the steps above, we are ready to actually perform machine learning. Training starts in the next step. Here we use scikit-learn, a machine learning library.

Divide into training data and verification data

The purpose of machine learning is to train on data so that the model can make predictions when unknown data is entered. In other words, the model should be evaluated on data that was not used for training. If you make predictions on the same data used for training, the results will naturally look accurate, because the model has already learned from that data.

Therefore, it is necessary to separate training data (train) and verification data (test) before training.

from sklearn.model_selection import train_test_split

#Separation of training data and verification data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4,random_state=1)

test_size specifies the fraction of the data held out as verification data, and here it is set so that training : verification = 6 : 4. Also, random_state is fixed to keep the split reproducible.
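The effect of test_size and random_state can be checked on a small array. This is only a sketch with made-up data (`X_toy` and `y_toy` are hypothetical names), using the same arguments as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=1
)
print(len(X_tr), len(X_te))  # 6 4

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=1
)
print(np.array_equal(X_te, X_te2))  # True
```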

Model construction / verification

With scikit-learn, you can build and verify a model with just the following code. The model used this time is multiple regression analysis.

# Import the library
from sklearn.linear_model import LinearRegression

# Model declaration (LinearRegression performs multiple regression analysis)
model = LinearRegression()

# Model training (adjusting parameters), using the training data only
model.fit(X_train, y_train)

# Check the parameters (the learned weights)
model.coef_

# Coefficient of determination (prediction accuracy) on the verification data;
# it is at most 1, and the closer to 1 the better
model.score(X_test, y_test)

# Predicted value calculation
x = X_test.iloc[0, :]  # Take out the first row of X_test
y_pred = model.predict([x])
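For reference, the coefficient of determination returned by score can be reproduced by hand as $R^2 = 1 - SS_{res} / SS_{tot}$. The following is a minimal sketch on synthetic data (all variable names and the data itself are assumptions for illustration), comparing the manual calculation with score on held-out data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: y = 2*x0 - 1*x1 + noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=1
)
toy_model = LinearRegression().fit(X_tr, y_tr)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_hat = toy_model.predict(X_te)
ss_res = np.sum((y_te - y_hat) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, toy_model.score(X_te, y_te))  # the two values should agree
```

Because the residual term can exceed the total term for a poor model, the value can fall below 0 on held-out data even though 1 is the upper limit.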

Save / load model

You can save the model with the following code

# Import joblib
# (sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly)
import joblib

# Save the model as hoge.pkl
joblib.dump(model, "hoge.pkl")

Load the model with the following code

# Load hoge.pkl
model_new = joblib.load("hoge.pkl")

#Display the predicted value of the loaded model
model_new.predict([x])[0]
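To confirm that a saved model behaves the same after loading, a round trip can be sketched as follows. The toy data, the file name `toy_model.pkl`, and the use of a temporary directory are all assumptions made so that the example leaves no files behind:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on the line y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```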

The above is the basic flow of machine learning. This time the model was multiple regression analysis, but the basic flow is the same when, for example, you want to use logistic regression or an SVM.
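As a sketch of that last point, the same split, fit, and score steps can be applied to classifiers. The data here is generated with make_classification purely for illustration, and the default settings of LogisticRegression and SVC are an assumption, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic classification data for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=1
)

# Same flow as before: declare the model, fit on training data, score on verification data
scores = {}
for name, clf in [("logistic regression", LogisticRegression()), ("SVM", SVC())]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)  # for classifiers, score is the accuracy

print(scores)
```

Note that for classifiers score returns the fraction of correct predictions rather than the coefficient of determination.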

Summary of the basic flow of machine learning with Python