What is this

I use sklearn's Pipeline from time to time, but when I called pipeline.fit_transform(X, y) I became curious about what processing actually happens inside the pipeline. So I read the official documentation [^1] and the source code [^2] and decided to organize what I found.

The questions I had are written as comments in the code below. Some people may think, "That's obvious!", but I was genuinely curious, so I looked into it.

# Question 1: Is fit_transform called on the transformers and fit on the estimator??
# Question 2: What should I do if I want to pass parameters to a transformer or the estimator at this point??
# Question 3: What requirements must my own estimator / transformer meet if I want to include it???
pipe.fit(X, y)

# Question 4: Is fit_transform called on the transformers and predict on the estimator??
pipe.predict(X)

1. What is a pipeline

In a machine learning project, an estimator that performs classification or regression is often used together with a transformer that preprocesses the data. Pipeline is provided as a way to bundle the whole flow, from data transformation through training and prediction, into a single estimator.

1.1. Example of using pipeline

A pipeline is built from a list whose elements are (key, value) tuples. Put the name of each estimator / transformer in key and the estimator / transformer object in value, then pass the list to Pipeline as steps. An example is shown below.

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn import datasets

# Prepare sample data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Create the pipeline
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(steps=estimators)

# Train
pipe.fit(X, y)

# Predict
pipe.predict(X)
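
To make the "one estimator" idea concrete, here is a minimal sketch that compares the pipeline above with the same two steps written out by hand. The variable names (pca, clf, manual_pred, same) are mine and not from the original code, and the sketch assumes the default settings of PCA and SVC.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Pipeline version: one object handles transformation, training and prediction
pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])
pipe_pred = pipe.fit(X, y).predict(X)

# Hand-written version of the same two steps
pca = PCA()
clf = SVC()
X_reduced = pca.fit_transform(X)             # training: fit_transform on the transformer
clf.fit(X_reduced, y)                        # training: fit on the estimator
manual_pred = clf.predict(pca.transform(X))  # prediction: transform, then predict

# Both approaches should give identical predictions
same = bool(np.array_equal(pipe_pred, manual_pred))
print(same)
```

The hand-written version has to remember to call transform (not fit_transform) at prediction time; the pipeline takes care of that.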

2. Estimator / transformer requirements

You may want to put your own estimator / transformer in a pipeline. This section describes the requirements it must meet. The requirements differ depending on whether the step is the last one in the pipeline's steps (final_estimator) or any other step (not_final_estimator).

The documentation calls a not_final_estimator a "transform", but since that clashes with the name of the transform method, this article uses a different name.

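- final_estimator: has a fit method
- not_final_estimator: has a transform method, plus either a fit method or a fit_transform method (Pipeline's step validation rejects an intermediate step that has no transform, even if it defines fit_transform)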

More requirements apply depending on which pipeline method you call, but the above is the minimum that must be met.
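
As a rough illustration of these minimum requirements, the sketch below defines one transformer and one final estimator of my own and puts them in a pipeline. CenterColumns and MajorityClassifier are hypothetical classes made up for this example; inheriting from BaseEstimator and the mixins is a convenience I am assuming here, not something the requirements above demand.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.pipeline import Pipeline


class CenterColumns(TransformerMixin, BaseEstimator):
    """Hypothetical not_final_estimator: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical final_estimator: always predicts the most frequent label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.classes_ = values
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        # predict is only needed because pipeline.predict is called below
        return np.full(len(X), self.majority_)


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 1])

pipe = Pipeline(steps=[("center", CenterColumns()), ("clf", MajorityClassifier())])
pipe.fit(X, y)
pred = pipe.predict(X)
learned_mean = pipe.named_steps["center"].mean_
```

The final estimator only needs fit for pipeline.fit to work; predict is the kind of extra requirement that appears once you call pipeline.predict.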

3. Processing in pipeline

Following the code in 1.1, I looked at what happens inside the pipeline when pipeline.fit and pipeline.predict are called [^3]. The commonly used pipeline methods are summarized below. The columns are, from the left: the pipeline method, the parameters passed to it, the method called on each not_final_estimator, and the method called on the final_estimator.

| pipeline | Parameters | not_final_estimator | final_estimator |
| --- | --- | --- | --- |
| fit | X, y=None, **fit_params | fit_transform | fit |
| fit_transform | X, y=None, **fit_params | fit_transform | fit_transform |
| predict | X, **predict_params | transform | predict |
| fit_predict | X, y=None, **fit_params | fit_transform | fit_predict |
| score | X, y=None, sample_weight=None | transform | score |
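
The fit and predict rows of the table can be checked with a small experiment. The sketch below uses two hypothetical classes, RecordingTransformer and RecordingClassifier, that only append their method names to a list when called; the class names and the calls list are my own and exist just for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.pipeline import Pipeline

calls = []  # records which methods the pipeline calls, in order


class RecordingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that records every call and returns X unchanged."""

    def fit(self, X, y=None):
        calls.append("transformer.fit")
        self.fitted_ = True
        return self

    def transform(self, X):
        calls.append("transformer.transform")
        return X

    def fit_transform(self, X, y=None):
        calls.append("transformer.fit_transform")
        self.fitted_ = True
        return X


class RecordingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator that records every call and predicts a fixed label."""

    def fit(self, X, y):
        calls.append("estimator.fit")
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        calls.append("estimator.predict")
        return np.full(len(X), self.classes_[0])


X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 1, 1])

pipe = Pipeline(steps=[("tr", RecordingTransformer()), ("clf", RecordingClassifier())])

pipe.fit(X, y)
fit_calls = list(calls)

calls.clear()
pipe.predict(X)
predict_calls = list(calls)

print(fit_calls)
print(predict_calls)
```

In this experiment the transformer's plain fit method is never called during pipeline.fit, because fit_transform is defined and takes priority.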

The points to be noted are listed below.

- If the fit_transform method is not defined, the fit method and the transform method are executed in order.
- Unlike fit_transform, there is no fallback for fit_predict: an error occurs if the fit_predict method is not defined.
- **fit_params can be passed in the form <target step name (the key part of the tuple)>__<parameter name>.
  - Example: pipeline.fit(X, y, key1__param1=True)
- Unlike **fit_params, **predict_params can only pass parameters to the predict method called on the final_estimator. To use it, just specify the parameter name of the predict method as it is.
  - Example: pipeline.predict(X, param1=True)
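
The two ways of passing parameters can be sketched as follows. ScaleBy and ThresholdClassifier are hypothetical classes, and the factor and threshold parameters are invented for this example; the sketch also assumes sklearn's default configuration, in which metadata routing is not enabled.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.pipeline import Pipeline


class ScaleBy(TransformerMixin, BaseEstimator):
    """Hypothetical transformer whose fit accepts an extra `factor` argument."""

    def fit(self, X, y=None, factor=1.0):
        self.factor_ = factor
        return self

    def transform(self, X):
        return np.asarray(X) * self.factor_


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator whose predict accepts an extra `threshold` argument."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self

    def predict(self, X, threshold=0.0):
        # label 1 when the row sum exceeds the threshold, otherwise 0
        return (np.asarray(X).sum(axis=1) > threshold).astype(int)


X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 1])

pipe = Pipeline(steps=[("scale", ScaleBy()), ("clf", ThresholdClassifier())])

# fit parameter: <step name>__<parameter name> routes `factor` to the "scale" step
pipe.fit(X, y, scale__factor=2.0)

# predict parameter: no step prefix, it goes straight to the final estimator's predict
pred = pipe.predict(X, threshold=10.0)
```

After scaling by 2.0 the row sums are 4, 8 and 12, so only the last row exceeds the threshold of 10.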

As an aside, a sklearn-compliant model is expected to take its parameters in the constructor, and should not be designed to accept them when the fit method is executed. It is therefore better to avoid passing parameters through **fit_params as much as possible. sklearn-compliant models are described in detail here.
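
One way to follow that advice, sketched here under the assumption that the parameter in question is an ordinary constructor argument, is to set it on the step itself, either when building the pipeline or afterwards with set_params and the same <step name>__<parameter name> form. The concrete values of C and n_components are arbitrary choices for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Parameters given in the constructors of the steps
pipe = Pipeline(steps=[("reduce_dim", PCA(n_components=2)), ("clf", SVC(C=1.0))])

# The same parameters can be changed later without touching fit
pipe.set_params(clf__C=10.0, reduce_dim__n_components=3)

params = pipe.get_params()
print(params["clf__C"], params["reduce_dim__n_components"])
```

Because nothing is passed at fit time, pipe.fit(X, y) keeps the plain signature that tools such as grid search rely on.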

[^1]: User Guide
[^2]: Source code
[^3]: pipeline documentation
