Hello.
I recently got interested in machine learning and deep learning, so I entered a Kaggle competition. Kaggle has a Notebook feature, and I was eager to understand the code in those notebooks!
"I don't know what this means at all"
I had no programming knowledge at all, so when I looked at the code in Kaggle's notebooks, it looked like a cipher (laughs). So I decided to work through it slowly, one piece at a time, and write it up here like a diary.
This time, the topic is "Pipeline".
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris_data = datasets.load_iris()
input_data = iris_data.data
correct = iris_data.target
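Just to see what load_iris actually gives us, here is a quick check I added myself (not part of the original notebook):
print(input_data.shape)         # (150, 4) -> 150 samples, 4 features
print(correct.shape)            # (150,)   -> class labels 0, 1, 2
print(iris_data.feature_names)  # names of the 4 features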
First of all, I looked at the official documentation: sklearn.pipeline.Pipeline — scikit-learn 0.23.2 documentation
According to this, the basic form is
from sklearn.pipeline import Pipeline
pipe = Pipeline([(preprocessing step), (learning step)])
pipe.fit(explanatory variables, objective variable)
and it seems the code can be simplified this way.
Based on this, I tried training the iris data with a random forest.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier as RFC

X_train, X_test, y_train, y_test = train_test_split(input_data, correct)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('RandomForestClassifier', RFC())])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
# 0.9473684210526315
With the above, the explanatory variables are standardized and then a random forest is trained on them. By bundling the steps into a Pipeline like this, the code stays "concise".
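As a side note (my own addition, assuming the pipe and X_test from the block above are still defined): when you call predict on the fitted pipeline, the StandardScaler that was fitted on X_train is applied to the new data automatically before the random forest makes its prediction, so there is no need to scale X_test by hand.
# Prediction with the fitted pipeline (my own check):
# the scaler fitted on X_train is applied to X_test automatically,
# then the random forest predicts on the scaled data.
pred = pipe.predict(X_test)
print(pred[:5])   # e.g. [1 0 2 1 1], class labels for the first five test samples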
Below is the code for confirmation.
X_train, X_test, y_train, y_test = train_test_split(input_data, correct)
tr_x, te_x, tr_y, te_y = X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy()  # copies for the manual check

# Pipeline version
pipe = Pipeline([('scaler', StandardScaler()),
                 ('Classifier', RFC())])
pipe.fit(X_train, y_train)
print("pipe score = " + str(pipe.score(X_test, y_test)))

# Manual version: fit the scaler on the training data only,
# then apply that same scaler to the test data (this is what Pipeline does internally)
stdsc = StandardScaler()
tr_x = stdsc.fit_transform(tr_x)
te_x = stdsc.transform(te_x)
clf = RFC()
clf.fit(tr_x, tr_y)
print("RFC score = " + str(clf.score(te_x, te_y)))

# pipe score = 0.9473684210526315
# RFC score = 0.9473684210526315
The two scores match, so it looks like the Pipeline really is doing the standardization and the random forest training internally, in the same way as the manual version.
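As one more check (my own addition, assuming pipe and stdsc from the block above are still in scope), the scaler fitted inside the Pipeline can be pulled out via named_steps and compared with the manually fitted one:
# Compare the scaler fitted inside the Pipeline with the manually fitted one (my own check)
print(np.allclose(pipe.named_steps['scaler'].mean_, stdsc.mean_))    # expected: True
print(np.allclose(pipe.named_steps['scaler'].scale_, stdsc.scale_))  # expected: True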
I see, now I have a rough idea of what Pipeline does. But what if there are several preprocessing steps? Can you only put in one?
Apparently, multiple steps can be combined.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Preprocessing pipeline for categorical features:
# fill missing values with the string 'missing', then one-hot encode
preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# The whole preprocessing Pipeline becomes the first step of another Pipeline
rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RFC())])

rf.fit(X_train, y_train)  # note: this imputer/one-hot preprocessing is meant for categorical features, so X_train here would need to be categorical data rather than the numeric iris features
So, in the basic form pipe = Pipeline([(preprocessing step), (learning step)]), the (preprocessing step) slot can itself be another Pipeline; you can nest them, a bit like the way BNF notation expands (that is just how I picture it).
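To convince myself that the nesting really works, here is a tiny toy example of my own (not from the original notebook): a small categorical dataset with missing values, so the imputer, the one-hot encoder, and the random forest all have something to do.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Toy categorical data with missing values (made up for this check)
X_cat = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red'],
                      'size': ['S', 'M', 'L', np.nan]})
y_cat = np.array([0, 1, 1, 0])

preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

rf = Pipeline([
    ('preprocess', preprocessing),   # the nested preprocessing Pipeline
    ('classifier', RFC())])

rf.fit(X_cat, y_cat)        # imputation -> one-hot encoding -> random forest
print(rf.predict(X_cat))    # e.g. [0 1 1 0]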