Hello.
I recently got interested in machine learning and deep learning, so I entered a Kaggle competition. Kaggle has a Notebook feature, and I was eager to understand the code in those notebooks!
"I don't know what this means at all"
I had no programming knowledge at all, so when I looked at the code in Kaggle's notebooks, it looked like a cipher (laughs). So I decided to work through it slowly, one piece at a time, and write it up here like a diary.
This time, the topic is "Pipeline".
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris_data = datasets.load_iris()
input_data = iris_data.data
correct = iris_data.target
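Just to see what load_iris actually gives us, here is a quick check I added myself (not part of the original notebook):
print(input_data.shape)         # (150, 4) -> 150 samples, 4 features
print(correct.shape)            # (150,)   -> class labels 0, 1, 2
print(iris_data.feature_names)  # names of the 4 features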
First of all, I looked at the official documentation: sklearn.pipeline.Pipeline — scikit-learn 0.23.2 documentation
According to this, the basic form is
from sklearn.pipeline import Pipeline
pipe = Pipeline([(preprocessing step), (learning step)])
pipe.fit(explanatory variables, objective variable)
and it seems the code can be simplified this way.
Based on this, I tried training the iris data with a random forest.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier as RFC

X_train, X_test, y_train, y_test = train_test_split(input_data, correct)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('RandomForestClassifier', RFC())])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
# 0.9473684210526315
With the above, the explanatory variables are standardized and then a random forest is trained on them. By bundling the steps into a Pipeline like this, the code stays "concise".
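As a side note (my own addition, assuming the pipe and X_test from the block above are still defined): when you call predict on the fitted pipeline, the StandardScaler that was fitted on X_train is applied to the new data automatically before the random forest makes its prediction, so there is no need to scale X_test by hand.
# Prediction with the fitted pipeline (my own check):
# the scaler fitted on X_train is applied to X_test automatically,
# then the random forest predicts on the scaled data.
pred = pipe.predict(X_test)
print(pred[:5])   # e.g. [1 0 2 1 1], class labels for the first five test samples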
Below is the code for confirmation.
X_train, X_test, y_train, y_test = train_test_split(input_data, correct)
tr_x, te_x, tr_y, te_y = X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy()  # copies for the manual check

# Pipeline version
pipe = Pipeline([('scaler', StandardScaler()),
                 ('Classifier', RFC())])
pipe.fit(X_train, y_train)
print("pipe score = " + str(pipe.score(X_test, y_test)))

# Manual version: fit the scaler on the training data only,
# then apply that same scaler to the test data (this is what Pipeline does internally)
stdsc = StandardScaler()
tr_x = stdsc.fit_transform(tr_x)
te_x = stdsc.transform(te_x)
clf = RFC()
clf.fit(tr_x, tr_y)
print("RFC score = " + str(clf.score(te_x, te_y)))

# pipe score = 0.9473684210526315
# RFC score = 0.9473684210526315
The two scores match, so it looks like the Pipeline really is doing the standardization and the random forest training internally, in the same way as the manual version.
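As one more check (my own addition, assuming pipe and stdsc from the block above are still in scope), the scaler fitted inside the Pipeline can be pulled out via named_steps and compared with the manually fitted one:
# Compare the scaler fitted inside the Pipeline with the manually fitted one (my own check)
print(np.allclose(pipe.named_steps['scaler'].mean_, stdsc.mean_))    # expected: True
print(np.allclose(pipe.named_steps['scaler'].scale_, stdsc.scale_))  # expected: True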
I see, now I have a rough idea of what Pipeline does. But what if there are several preprocessing steps? Can you only put in one?
Apparently, multiple steps can be combined.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Preprocessing pipeline for categorical features:
# fill missing values with the string 'missing', then one-hot encode
preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# The whole preprocessing Pipeline becomes the first step of another Pipeline
rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RFC())])

rf.fit(X_train, y_train)  # note: this imputer/one-hot preprocessing is meant for categorical features, so X_train here would need to be categorical data rather than the numeric iris features
So, in the basic form pipe = Pipeline([(preprocessing step), (learning step)]), the (preprocessing step) slot can itself be another Pipeline; you can nest them, a bit like the way BNF notation expands (that is just how I picture it).
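To convince myself that the nesting really works, here is a tiny toy example of my own (not from the original notebook): a small categorical dataset with missing values, so the imputer, the one-hot encoder, and the random forest all have something to do.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Toy categorical data with missing values (made up for this check)
X_cat = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red'],
                      'size': ['S', 'M', 'L', np.nan]})
y_cat = np.array([0, 1, 1, 0])

preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

rf = Pipeline([
    ('preprocess', preprocessing),   # the nested preprocessing Pipeline
    ('classifier', RFC())])

rf.fit(X_cat, y_cat)        # imputation -> one-hot encoding -> random forest
print(rf.predict(X_cat))    # e.g. [0 1 1 0]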