Google translation of http://scikit-learn.org/0.18/modules/pipeline.html. Part of [scikit-learn 0.18 User Guide 4. Dataset transformations](http://qiita.com/nazoking@github/items/267f2371757516f8c168#4-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88%E5%A4%89%E6%8F%9B).
[Pipeline](http://scikit-learn.org/0.18/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) can be used to chain multiple estimators into one. This is useful because there is often a fixed sequence of steps in processing data, for example feature selection, normalization, and classification. A pipeline serves two purposes here:
- **Convenience**: You only have to call `fit` and `predict` once on your data to fit a whole sequence of estimators (see the sketch after the construction example below).
- **Joint parameter selection**: You can [grid search](http://scikit-learn.org/0.18/modules/grid_search.html#grid-search) over the parameters of all estimators in the pipeline at once.
All estimators in a pipeline, except the last one, must be transformers (that is, they must have a `transform` method). The last estimator may be of any type (transformer, classifier, etc.).

A pipeline is built from a list of `(key, value)` pairs, where `key` is a string naming the step and `value` is an estimator instance:
```python
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',
  n_components=None, random_state=None, svd_solver='auto', tol=0.0,
  whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None,
  coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])
```
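To make the convenience point above concrete, here is a minimal sketch, not part of the original guide, that fits such a pipeline on scikit-learn's bundled iris dataset: a single `fit` call chains `PCA.fit_transform` into `SVC.fit`, and a single `predict` call chains `PCA.transform` into `SVC.predict`.

```python
# A minimal sketch (not part of the original guide): one fit call and one
# predict call drive the whole chain, using the bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])

pipe.fit(iris.data, iris.target)    # PCA.fit_transform, then SVC.fit
print(pipe.predict(iris.data[:5]))  # PCA.transform, then SVC.predict
```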
The utility function `make_pipeline` is a shorthand for constructing pipelines: it takes a variable number of estimators and returns a pipeline, filling in the names automatically:
```python
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB())
Pipeline(steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
  ('multinomialnb', MultinomialNB(alpha=1.0,
  class_prior=None,
  fit_prior=True))])
```
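As a sketch of how those automatic names behave: they are the lowercased class names, and they work in the `<estimator>__<parameter>` syntax described below just like explicitly chosen names do.

```python
# Sketch: make_pipeline names each step after the lowercased class name,
# and those generated names work in the parameter syntax shown below.
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

pipe = make_pipeline(Binarizer(), MultinomialNB())
print([name for name, _ in pipe.steps])    # ['binarizer', 'multinomialnb']
pipe.set_params(multinomialnb__alpha=0.5)  # address a step by its generated name
```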
The estimators of a pipeline are stored as a list in the `steps` attribute:
```python
>>> pipe.steps[0]
('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False))
```
and as a `dict` in `named_steps`:
```python
>>> pipe.named_steps['reduce_dim']
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
```
Parameters of the estimators in the pipeline can be accessed using the `<estimator>__<parameter>` syntax:
```python
>>> pipe.set_params(clf__C=10)
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',
  n_components=None, random_state=None, svd_solver='auto', tol=0.0,
  whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None,
  coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])
```
This is especially important when doing grid searches:
```python
>>> from sklearn.model_selection import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=params)
```
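Here is a minimal sketch, not in the original guide, that runs this grid search end-to-end on the bundled digits dataset; the reported best combination depends on the data and scikit-learn version:

```python
# Sketch (not in the original guide): running the grid search above on the
# bundled digits dataset; best_params_ depends on the data and version.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
params = dict(reduce_dim__n_components=[2, 5, 10],
              clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params)

digits = load_digits()
grid_search.fit(digits.data, digits.target)
print(grid_search.best_params_)  # the winning parameter combination
```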
Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to `None`:
```python
>>> from sklearn.linear_model import LogisticRegression
>>> params = dict(reduce_dim=[None, PCA(5), PCA(10)],
...               clf=[SVC(), LogisticRegression()],
...               clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=params)
```
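A step set to `None` is simply skipped at fit and predict time, so the raw features flow straight to the next step. A minimal sketch of that behavior (note that newer scikit-learn versions spell this `'passthrough'` instead of `None`):

```python
# Sketch: a non-final step set to None is skipped, so the classifier sees
# the raw, unreduced features. Newer scikit-learn spells this 'passthrough'.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
pipe.set_params(reduce_dim=None)   # disable the PCA step
pipe.fit(iris.data, iris.target)   # fits SVC directly on iris.data
```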
Examples:

- Pipeline Anova SVM
- [Sample pipeline for text feature extraction and evaluation](http://scikit-learn.org/0.18/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py)
- Pipelining: chaining a PCA and a logistic regression
- Explicit feature map approximation for RBF kernels
- [SVM-Anova: SVM with univariate feature selection](http://scikit-learn.org/0.18/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py)

See also:

- Tuning the hyper-parameters of an estimator
Calling `fit` on the pipeline is the same as calling `fit` on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods of the last estimator in the pipeline. That is, if the last estimator is a classifier, the pipeline can be used as a classifier; if the last estimator is a transformer, so is the pipeline.
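The following sketch, not part of the original guide, spells out that equivalence under the assumption of the two-step pipeline from earlier:

```python
# Sketch of the equivalence: pipe.fit is the same as fit_transform-ing
# through each intermediate step and fitting the final estimator.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])
pipe.fit(X, y)

# Roughly what pipe.fit does internally:
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # intermediate step: fit, then transform
clf = SVC().fit(X_reduced, y)      # final step: plain fit
```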
`FeatureUnion` combines several transformer objects into a new transformer that concatenates their outputs. A `FeatureUnion` takes a list of transformer objects. During fitting, each of these is fit to the data independently. To transform data, the transformers are applied in parallel and the sample vectors they output are concatenated end-to-end into larger vectors. `FeatureUnion` serves the same purposes as `Pipeline`: convenience, and joint parameter estimation and validation. `FeatureUnion` and `Pipeline` can be combined to create complex models. (A `FeatureUnion` has no way of checking whether two transformers produce identical features; it only produces a union when the feature sets are disjoint, and making sure of that is the caller's responsibility.)
A `FeatureUnion` is built from a list of `(key, value)` pairs, where `key` is the name you give to a given transformation (an arbitrary string; it only serves as an identifier) and `value` is an estimator object:
```python
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
  iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca',
  KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3,
  eigen_solver='auto', fit_inverse_transform=False, gamma=None,
  kernel='linear', kernel_params=None, max_iter=None, n_components=None,
  n_jobs=1, random_state=None, remove_zero_eig=False, tol=0))],
  transformer_weights=None)
```
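A minimal sketch, not in the original guide, of the end-to-end concatenation described above, assuming the bundled iris dataset for illustration:

```python
# Sketch: the transformed outputs are concatenated end-to-end, so the
# result is as wide as the component outputs combined (2 + 3 columns here).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

iris = load_iris()
combined = FeatureUnion([('linear_pca', PCA(n_components=2)),
                         ('kernel_pca', KernelPCA(n_components=3))])
print(combined.fit_transform(iris.data).shape)  # (150, 5)
```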
Like pipelines, feature unions have a concise constructor called [make_union](http://scikit-learn.org/0.18/modules/generated/sklearn.pipeline.make_union.html#sklearn.pipeline.make_union) that does not require explicit naming of the components.
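As a sketch of the shorthand:

```python
# Sketch: make_union, like make_pipeline, names the components after
# their lowercased class names.
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_union

combined = make_union(PCA(), KernelPCA())
print([name for name, _ in combined.transformer_list])  # ['pca', 'kernelpca']
```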
As with pipelines, individual steps can be replaced using `set_params`, and ignored by setting them to `None`:

```python
>>> combined.set_params(kernel_pca=None)
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
  iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', None)],
  transformer_weights=None)
```
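A sketch of the effect, under the same iris assumption as above; note that newer scikit-learn versions use the string `'drop'` instead of `None` for feature unions:

```python
# Sketch: with kernel_pca set to None, only the linear PCA output remains.
# (Newer scikit-learn versions use the string 'drop' instead of None.)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

iris = load_iris()
combined = FeatureUnion([('linear_pca', PCA(n_components=2)),
                         ('kernel_pca', KernelPCA(n_components=3))])
combined.set_params(kernel_pca=None)
print(combined.fit_transform(iris.data).shape)  # (150, 2), PCA columns only
```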
Examples:

- Concatenating multiple feature extraction methods
- Feature Union with Heterogeneous Data Sources
[scikit-learn 0.18 User Guide 4. Dataset transformations](http://qiita.com/nazoking@github/items/267f2371757516f8c168#4-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88%E5%A4%89%E6%8F%9B)
© 2010 - 2016, scikit-learn developers (BSD license).