[Translation] scikit-learn 0.18 User Guide 4.1. Pipeline and FeatureUnion: combining estimators

Google translation of http://scikit-learn.org/0.18/modules/pipeline.html. From [scikit-learn 0.18 User Guide 4. Dataset transformations](http://qiita.com/nazoking@github/items/267f2371757516f8c168#4-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88%E5%A4%89%E6%8F%9B).


4.1. Pipeline and FeatureUnion: combining estimators

4.1.1. Pipeline: chaining estimators

[Pipeline](http://scikit-learn.org/0.18/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) can be used to chain multiple estimators into one. This is useful because there is often a fixed sequence of steps in processing data, for example feature selection, normalization, and classification. Pipeline serves two purposes here:

- **Convenience**: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
- **Joint parameter selection**: You can [grid search](http://scikit-learn.org/0.18/modules/grid_search.html#grid-search) over the parameters of all estimators in the pipeline at once.

All estimators in a pipeline, except the last one, must be transformers (that is, they must have a transform method). The last estimator may be of any type (transformer, classifier, etc.).

4.1.1.1. Usage

The Pipeline is built using a list of (key, value) pairs, where key is the string name you give to this step and value is an estimator instance:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe 
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',
n_components=None, random_state=None, svd_solver='auto', tol=0.0,
whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])

The utility function make_pipeline is a shorthand for constructing pipelines: it takes a variable number of estimators and returns a pipeline, choosing the names automatically:

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB()) 
Pipeline(steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
                ('multinomialnb', MultinomialNB(alpha=1.0,
                                                class_prior=None,
                                                fit_prior=True))])

The estimators of a pipeline are stored as a list in the steps attribute:

>>> pipe.steps[0]
('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False))

and as a dict in named_steps:

>>> pipe.named_steps['reduce_dim']
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:

>>> pipe.set_params(clf__C=10) 
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',
    n_components=None, random_state=None, svd_solver='auto', tol=0.0,
    whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False))])

This is particularly important when doing grid searches:

>>> from sklearn.model_selection import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=params)
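
A minimal sketch of running this search end to end; the synthetic data from make_classification is an assumption for illustration, not part of the original guide:

>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=20, random_state=0)
>>> _ = grid_search.fit(X, y)        # cross-validates every (n_components, C) combination
>>> sorted(grid_search.best_params_) # winning parameters, keyed by step__param
['clf__C', 'reduce_dim__n_components']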

Individual steps can also be replaced as parameters, and non-final steps can be skipped by setting them to None:

>>> from sklearn.linear_model import LogisticRegression
>>> params = dict(reduce_dim=[None, PCA(5), PCA(10)],
...               clf=[SVC(), LogisticRegression()],
...               clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=params)

Examples:

- Pipeline ANOVA SVM
- [Sample pipeline for text feature extraction and evaluation](http://scikit-learn.org/0.18/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py)
- Pipeline: chaining a PCA and a logistic regression
- Explicit RBF kernel feature map approximation
- [SVM-Anova: SVM with univariate feature selection](http://scikit-learn.org/0.18/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py)

See also:

- Tuning the hyper-parameters of an estimator

4.1.1.2. Notes

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has; that is, if the last estimator is a classifier, the pipeline can be used as a classifier. If the last estimator is a transformer, so is the pipeline.
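
To make this concrete, here is a minimal sketch (the iris data and the hand-chained refit are illustrative assumptions) showing that fitting the pipeline is equivalent to fitting and transforming through each step in turn:

>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> pipe2 = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
>>> _ = pipe2.fit(X, y)              # fits PCA on X, transforms X, then fits SVC on the result
>>> preds = pipe2.predict(X)         # applies the fitted PCA, then SVC.predict
>>> pca = PCA().fit(X)               # the same two steps, chained by hand
>>> clf = SVC().fit(pca.transform(X), y)
>>> bool((preds == clf.predict(pca.transform(X))).all())
True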

4.1.2. FeatureUnion: composite feature spaces

FeatureUnion combines several transformer objects into a new transformer that concatenates their outputs. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. To transform the data, the transformers are applied in parallel and the sample vectors they output are concatenated end-to-end into larger vectors. FeatureUnion serves the same purposes as Pipeline: convenience and joint parameter estimation and validation. FeatureUnion and Pipeline can be combined to create complex models. (A FeatureUnion has no way of checking whether two transformers produce identical features; it only produces a union when the feature sets are disjoint, and making sure of that is the caller's responsibility.)

4.1.2.1. Usage

A FeatureUnion is built using a list of (key, value) pairs, where key is the name you give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined 
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
    iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca',
    KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3,
    eigen_solver='auto', fit_inverse_transform=False, gamma=None,
    kernel='linear', kernel_params=None, max_iter=None, n_components=None,
    n_jobs=1, random_state=None, remove_zero_eig=False, tol=0))],
    transformer_weights=None)
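
To illustrate the end-to-end concatenation described above, a small sketch (the random toy data and the explicit n_components values are assumptions for demonstration):

>>> import numpy as np
>>> X = np.random.RandomState(0).rand(10, 6)
>>> union = FeatureUnion([('linear_pca', PCA(n_components=2)),
...                       ('kernel_pca', KernelPCA(n_components=3))])
>>> union.fit_transform(X).shape     # 2 PCA columns + 3 KernelPCA columns, side by side
(10, 5)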

Like pipelines, feature unions have a shorthand constructor, [make_union](http://scikit-learn.org/0.18/modules/generated/sklearn.pipeline.make_union.html#sklearn.pipeline.make_union), that does not require explicit naming of the components.

>>> combined.set_params(kernel_pca=None) 
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
      iterated_power='auto', n_components=None, random_state=None,
      svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', None)],
    transformer_weights=None)

As the example above shows, individual steps can be replaced using set_params, just as with pipelines, and are skipped when set to None.
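
Continuing the toy data from the sketch above, a quick check of the effect: with kernel_pca set to None, only the PCA step contributes columns (again an illustrative assumption):

>>> combined.fit_transform(X).shape  # PCA alone; n_components=None keeps all 6 components
(10, 6)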

Examples:

- Concatenating multiple feature extraction methods
- Feature Union with heterogeneous data sources


[scikit-learn 0.18 User Guide 4. Dataset transformations](http://qiita.com/nazoking@github/items/267f2371757516f8c168#4-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88%E5%A4%89%E6%8F%9B)

© 2010 - 2016, scikit-learn developers (BSD license).
