I use sklearn's Pipeline from time to time, but I became curious about what processing actually happens inside the pipeline when I call pipeline.fit_transform(X, y), so I read the official documentation [^1] and the source code [^2] and organized what I found.
The questions I had are written as comments in the code below. Some people may think the answers are obvious, but I was genuinely curious, so I looked into them.
# Question 1: Is fit_transform called on the transformers and fit on the estimator?
# Question 2: What should I do if I want to pass parameters to a transformer or the estimator at this point?
# Question 3: What requirements must my own estimator / transformer meet to be used in a pipeline?
pipe.fit(X, y)
# Question 4: Is fit_transform called on the transformers and predict on the estimator?
pipe.predict(X)
When an estimator for classification or regression is used in a machine learning project, it is often combined with transformers. Pipeline is provided to bundle the processing from data transformation through training and prediction into a single estimator.
A pipeline is built from a list of (key, value) tuples. Pass this list to Pipeline as steps, with the name of each estimator / transformer as the key and the estimator / transformer object as the value. An example is shown below.
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn import datasets
# Prepare sample data
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Create the pipeline
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(steps=estimators)
# Train
pipe.fit(X, y)
# Predict
pipe.predict(X)
You may want to put your own estimator / transformer in a pipeline. This section describes the requirements it must meet. The requirements differ depending on whether the step is the last one in the pipeline's steps (final_estimator) or any other step (not_final_estimator).
- final_estimator: has a fit method
- not_final_estimator: has a transform method, plus either a fit method or a fit_transform method
More methods are required depending on which pipeline method you call, but the above is the minimum.
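The minimum requirements can be sketched with two small classes. This is only an illustration: CenterColumns and MeanPredictor are hypothetical names, and inheriting from BaseEstimator is my own assumption to keep the sketch compatible with recent scikit-learn versions, not a requirement stated above. The predict method on the final step is there only because pipeline.predict is called at the end.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline


class CenterColumns(BaseEstimator):
    """Hypothetical not_final_estimator: defines fit and transform only."""

    def fit(self, X, y=None):
        # Learn the per-column mean from the training data.
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # Subtract the learned mean from every row.
        return np.asarray(X) - self.mean_


class MeanPredictor(BaseEstimator):
    """Hypothetical final_estimator: fit is the minimum, predict is extra."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Always predict the mean of the training targets.
        return np.full(len(X), self.mean_)


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

pipe = Pipeline(steps=[("center", CenterColumns()), ("model", MeanPredictor())])
pipe.fit(X, y)
predictions = pipe.predict(X)
```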
For the code shown in 1.1., I checked what the pipeline does when pipeline.fit and pipeline.predict are called [^3]. The pipeline methods that are likely to be used most often are summarized below. The columns are, from the left: the pipeline method, the parameters passed to it, the method called on each not_final_estimator, and the method called on the final_estimator.
| Pipeline method | Parameters | Method called on not_final_estimator | Method called on final_estimator |
|---|---|---|---|
| fit | X, y=None, **fit_params | fit_transform | fit |
| fit_transform | X, y=None, **fit_params | fit_transform | fit_transform |
| predict | X, **predict_params | transform | predict |
| fit_predict | X, y=None, **fit_params | fit_transform | fit_predict |
| score | X, y=None, sample_weight=None | transform | score |
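The fit and predict rows of the table can be checked with steps that record which of their methods get called. This is a minimal sketch, not the library's own test: TracingTransformer, TracingEstimator and the calls list are hypothetical names introduced for illustration, and the BaseEstimator base class is an assumption made for compatibility with recent scikit-learn versions.

```python
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline

calls = []  # method names in the order the pipeline calls them


class TracingTransformer(BaseEstimator):
    """Hypothetical not_final_estimator that logs every call."""

    def fit(self, X, y=None):
        calls.append("trans.fit")
        self.fitted_ = True
        return self

    def transform(self, X):
        calls.append("trans.transform")
        return X

    def fit_transform(self, X, y=None):
        calls.append("trans.fit_transform")
        self.fitted_ = True
        return X


class TracingEstimator(BaseEstimator):
    """Hypothetical final_estimator that logs every call."""

    def fit(self, X, y=None):
        calls.append("est.fit")
        self.fitted_ = True
        return self

    def predict(self, X):
        calls.append("est.predict")
        return [0] * len(X)


X = [[0.0], [1.0], [2.0]]
y = [0, 1, 0]
pipe = Pipeline(steps=[("trans", TracingTransformer()), ("est", TracingEstimator())])

pipe.fit(X, y)
fit_calls = list(calls)       # calls made during pipeline.fit

calls.clear()
pipe.predict(X)
predict_calls = list(calls)   # calls made during pipeline.predict
```

After running this, fit_calls holds the fit_transform call on the transformer followed by fit on the estimator, and predict_calls holds transform followed by predict, matching the first and third rows of the table.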
Points to note:
- If a step does not define a fit_transform method, its fit method and transform method are called in order.
- Unlike fit_transform, there is no such fallback for fit_predict: an error occurs if the final_estimator does not define a fit_predict method.
- \*\*fit_params are passed in the form (target step name)__(parameter name), where the step name is the key part of the tuple.
  - Example: pipeline.fit(X, y, key1__param1=True)
- Unlike \*\*fit_params, \*\*predict_params can only be passed to the predict method of the final_estimator. Specify the parameter names exactly as they appear in the predict method.
  - Example: pipeline.predict(X, param1=True)
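The points above can be tried out with a small sketch. Scale and FirstColumn are hypothetical classes, and their factor and as_int parameters are invented for illustration. The sketch assumes a default scikit-learn configuration (metadata routing not enabled) and uses BaseEstimator as a base class only for compatibility with recent versions.

```python
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline


class Scale(BaseEstimator):
    """Hypothetical transformer without fit_transform.

    Its fit method accepts an extra `factor` parameter (illustrative only).
    """

    def fit(self, X, y=None, factor=1.0):
        self.factor_ = factor
        return self

    def transform(self, X):
        return [[value * self.factor_ for value in row] for row in X]


class FirstColumn(BaseEstimator):
    """Hypothetical final estimator with no fit_predict method."""

    def fit(self, X, y=None):
        self.fitted_ = True
        return self

    def predict(self, X, as_int=False):
        first = [row[0] for row in X]
        return [int(value) for value in first] if as_int else first


X = [[0.5], [1.5]]
y = [0, 1]
pipe = Pipeline(steps=[("scale", Scale()), ("model", FirstColumn())])

# fit_params: <step name>__<parameter name> reaches Scale.fit as factor=2.0.
# Scale has no fit_transform, so fit and then transform are called.
pipe.fit(X, y, scale__factor=2.0)

# predict_params: passed as-is to the predict method of the final step.
plain = pipe.predict(X)
as_integers = pipe.predict(X, as_int=True)

# The final step defines no fit_predict, so the pipeline does not expose it.
has_fit_predict = hasattr(pipe, "fit_predict")
```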
As an aside, a sklearn-compliant model should not be designed to accept parameters when its fit method is called. It is therefore better to avoid passing parameters through \*\*fit_params as much as possible. What makes a model sklearn-compliant is described in detail here.
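One alternative to \*\*fit_params, offered here as a suggestion rather than something the text above prescribes, is to set parameters through the constructor or through set_params, which uses the same (step name)__(parameter name) convention. The sketch below reuses the step names from the earlier example; the values of n_components and C are arbitrary.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Parameters are given when the steps are constructed...
pipe = Pipeline(steps=[("reduce_dim", PCA(n_components=2)), ("clf", SVC(C=1.0))])

# ...or changed afterwards with <step name>__<parameter name>.
pipe.set_params(clf__C=10.0)

c_value = pipe.named_steps["clf"].C
n_components = pipe.get_params()["reduce_dim__n_components"]
```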
[^1]: User Guide
[^2]: Source code
[^3]: Pipeline documentation