[Translation] scikit-learn 0.18 User Guide 1.13 Feature selection

Google translation of http://scikit-learn.org/0.18/modules/feature_selection.html [scikit-learn 0.18 User Guide 1. Supervised Learning](http://qiita.com/nazoking@github/items/267f2371757516f8c168#1-%E6%95%99%E5%B8%AB%E4%BB%98%E3%81%8D%E5%AD%A6%E7%BF%92)


1.13. Feature selection

The classes in the sklearn.feature_selection module can be used for feature selection / dimensionality reduction on sample sets, either to improve an estimator's accuracy score or to boost its performance on very high-dimensional datasets.

1.13.1. Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet a given threshold. By default, it removes all zero-variance features, that is, features that have the same value in all samples. For example, suppose you have a dataset with Boolean features and you want to remove every feature that is either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is

\mathrm {Var} [X] = p(1-p)

Therefore, you can select using the threshold .8 * (1 - .8).

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As expected, VarianceThreshold has removed the first column, which has a probability $p = 5/6 > .8$ of containing a zero.
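To make the arithmetic concrete, the following sketch computes the per-column variances by hand and compares them with the threshold. The variable names are illustrative only; the data is the same six-sample matrix used above.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = 0.8 * (1 - 0.8)  # p(1 - p) for p = 0.8, i.e. about 0.16

# Population variance of each column; for a 0/1 column this is p(1 - p).
column_variances = X.var(axis=0)

sel = VarianceThreshold(threshold=threshold).fit(X)
kept = sel.get_support()  # boolean mask of the columns that are retained

print(column_variances)  # 5/36, 2/9 and 1/4
print(kept)
```

Only the first column has a variance (5/36, about 0.139) below the threshold, so it is the only one dropped.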

1.13.2. Univariate feature selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

- SelectKBest removes all but the $k$ highest scoring features.
- SelectPercentile removes all but a user-specified highest scoring percentage of features.
- SelectFpr (false positive rate), SelectFdr (false discovery rate), and SelectFwe (family-wise error) apply common univariate statistical tests to each feature.
- GenericUnivariateSelect performs univariate feature selection with a configurable strategy. This makes it possible to choose the best univariate selection strategy with a hyperparameter search estimator.
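As a minimal sketch of the configurable strategy, GenericUnivariateSelect can be set up to behave like SelectKBest by passing mode="k_best". The choice of the iris data and of k = 2 is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# mode is one of 'percentile', 'k_best', 'fpr', 'fdr' or 'fwe';
# param is the setting for that mode (here: keep the 2 best features).
generic = GenericUnivariateSelect(chi2, mode="k_best", param=2).fit(X, y)
k_best = SelectKBest(chi2, k=2).fit(X, y)

# Both selectors should keep the same columns.
print(generic.get_support())
print(k_best.get_support())
```

Because mode and param are ordinary constructor parameters, they can be included in a grid search like any other hyperparameter.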

For example, you can run a $\chi^2$ test on the samples to retrieve only the two best features:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):

- For regression: f_regression, [mutual_info_regression](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression)
- For classification: chi2, [f_classif](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif), mutual_info_classif
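The scoring functions listed above are plain functions and can be called directly. The sketch below, which assumes the iris data purely for illustration, shows that a scoring function returns a pair of arrays (scores, p-values) and that a fitted selector stores the same scores.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)

# Each scoring function returns one score and one p-value per feature.
chi2_scores, chi2_pvalues = chi2(X, y)
f_scores, f_pvalues = f_classif(X, y)

# After fitting, the selector exposes the scores it used.
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.scores_)
print(selector.get_support())
```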

The methods based on the F-test (f_regression, f_classif) estimate the degree of linear dependency between two random variables. Mutual information methods (mutual_info_regression, mutual_info_classif), on the other hand, can capture any kind of statistical dependency, but because they are nonparametric they require more samples for accurate estimation.
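The difference can be illustrated on synthetic data. In this sketch the target depends linearly on the first feature, nonlinearly on the second, and not at all on the third; the data-generating formula and the random seeds are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
# Linear in x0, sinusoidal in x1, independent of x2.
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * rng.randn(1000)

f_scores, _ = f_regression(X, y)
mi_scores = mutual_info_regression(X, y, random_state=0)

# Normalize so that the best feature under each criterion scores 1.
print("F-test:", (f_scores / f_scores.max()).round(2))
print("MI:    ", (mi_scores / mi_scores.max()).round(2))
```

The F-test ranks the linearly related feature highest, while mutual information ranks the sinusoidal one highest, since its dependency is strong but not linear.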

- Feature selection with sparse data: If you use sparse data (that is, data represented as sparse matrices), chi2, mutual_info_regression, and mutual_info_classif will process the data without making it dense.
- Warning: Be careful not to use a regression scoring function with a classification problem; the results will be useless.
- Examples:
  - Univariate feature selection
  - [Comparison of F-test and mutual information](http://scikit-learn.org/stable/auto_examples/feature_selection/plot_f_test_vs_mi.html#sphx-glr-auto-examples-feature-selection-plot-f-test-vs-mi-py)

1.13.3. Recursive feature elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and a weight is assigned to each one. Then the features with the smallest absolute weights are pruned from the current set. This procedure is repeated recursively on the pruned set until the desired number of features is reached. RFECV performs RFE in a cross-validation loop to find the optimal number of features.
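The procedure can be sketched as follows. The linear-kernel SVC, the iris data, and the target of two features are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A linear SVC exposes coef_, which RFE uses as the feature weights.
estimator = SVC(kernel="linear", C=1)

# step=1 removes one feature per iteration until 2 remain.
rfe = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)

print(rfe.support_)  # mask of the selected features
print(rfe.ranking_)  # 1 for selected features, larger for those pruned earlier
```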

- Examples:
  - Recursive feature elimination: an example showing the relevance of pixels in a digit classification task.
  - [Recursive feature elimination with cross-validation](http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py): an example of recursive feature elimination with automatic tuning of the number of features selected by cross-validation.

1.13.4. Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that can be used with any estimator that has a coef_ or feature_importances_ attribute after fitting. A feature is considered unimportant and removed if its corresponding coef_ or feature_importances_ value is below the given threshold parameter. Besides specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. The available heuristics are "mean", "median", and float multiples of these such as "0.1*mean". See the sections below for usage examples.
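A minimal sketch of the string heuristics, assuming a random forest on the iris data as the underlying estimator (any estimator with coef_ or feature_importances_ would do):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

thresholds = {}
n_kept = {}
for name in ["mean", "median", "0.1*mean"]:
    # SelectFromModel fits a clone of the estimator, then resolves the
    # string into a numeric threshold available as threshold_.
    selector = SelectFromModel(forest, threshold=name).fit(X, y)
    thresholds[name] = selector.threshold_
    n_kept[name] = int(selector.get_support().sum())

print(thresholds)
print(n_kept)
```

Since forest importances sum to one, the "mean" threshold here is simply 1 divided by the number of features.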

- Examples:
  - [Feature selection using SelectFromModel and LassoCV](http://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_boston.html#sphx-glr-auto-examples-feature-selection-plot-select-from-model-boston-py): selecting the two most important features from the Boston dataset without knowing the threshold beforehand.

1.13.4.1. L1-based feature selection

[Linear models](http://scikit-learn.org/stable/modules/linear_model.html#linear-model) penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data for use with another classifier, they can be combined with [feature_selection.SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) to select the non-zero coefficients. In particular, sparse estimators useful for this purpose are [linear_model.Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) for regression, and linear_model.LogisticRegression and [svm.LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) for classification:

>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)

With SVMs and logistic regression, the parameter C controls sparsity: the smaller C, the fewer features are selected. With Lasso, the higher the alpha parameter, the fewer features are selected.
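The effect of C can be observed by refitting the L1-penalized LinearSVC with several values and counting the features that survive. The particular values of C and the raised max_iter are assumptions for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

n_selected = {}
for C in [0.01, 0.1, 1.0]:
    # The L1 penalty requires the primal formulation (dual=False).
    lsvc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    X_new = SelectFromModel(lsvc, prefit=True).transform(X)
    n_selected[C] = X_new.shape[1]

print(n_selected)  # number of features kept for each value of C
```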

- Examples:
  - Classification of text documents using sparse features: a comparison of different algorithms for document classification, including L1-based feature selection.

L1 recovery and compressed sensing

For a good choice of alpha, Lasso can fully recover the exact set of non-zero variables from only a few observations, provided certain conditions are met. In particular, the number of samples must be "sufficiently large", or L1 models will perform at random. "Sufficiently large" depends on the number of non-zero coefficients, the logarithm of the number of features, the amount of noise, the smallest absolute value of the non-zero coefficients, and the structure of the design matrix X. In addition, the design matrix must display certain properties, such as not being too correlated. There is no general rule for choosing the alpha parameter for recovery of non-zero coefficients. It can be set by cross-validation (LassoCV or LassoLarsCV), though this may lead to under-penalized models: including a small number of irrelevant variables is not detrimental to the prediction score. BIC (LassoLarsIC), on the contrary, tends to set high values of alpha.
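The two ways of choosing alpha mentioned above can be compared side by side. This is only a sketch on synthetic data generated with make_regression; the sample size, noise level, and number of informative features are arbitrary assumptions, and the outcome on other data may differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsIC

# 30 features, of which only 5 have a non-zero true coefficient.
X, y, true_coef = make_regression(
    n_samples=200, n_features=30, n_informative=5,
    noise=5.0, coef=True, random_state=0)

cv_model = LassoCV(cv=5).fit(X, y)                  # alpha by cross-validation
bic_model = LassoLarsIC(criterion="bic").fit(X, y)  # alpha by BIC

print("true non-zero:", np.count_nonzero(true_coef))
print("CV : alpha =", cv_model.alpha_,
      "non-zero =", np.count_nonzero(cv_model.coef_))
print("BIC: alpha =", bic_model.alpha_,
      "non-zero =", np.count_nonzero(bic_model.coef_))
```

Comparing the non-zero counts against the true number of informative features gives a rough sense of whether a given criterion is under- or over-penalizing.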

- Reference:
  - Richard G. Baraniuk, "Compressive Sensing", IEEE Signal Processing Magazine [120] July 2007 http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/baraniukCSlecture07.pdf

1.13.4.2. Randomized sparse model

L1-penalized models for regression and classification have some well-known limitations with respect to feature selection. For example, Lasso is known to select a single variable out of a group of highly correlated features. Furthermore, even when the correlation between features is not too high, the conditions under which L1 penalization consistently selects "good" features can be restrictive in general. To mitigate this problem, it is possible to use randomization techniques such as those presented in [B2009] and [M2010]. The latter technique is known as stability selection and is implemented in the sklearn.linear_model module. In stability selection, a subsample of the data is fit to an L1-penalized model in which the penalty of a random subset of coefficients has been scaled. Specifically, given a subsample of the data $(x_i, y_i), i \in I$, where $I \subset \{1, 2, \ldots, n\}$ is a random subset of the data of size $n_I$, the following modified Lasso fit is obtained:

\hat{w_I} = \mathrm{arg}\min_{w} \frac{1}{2n_I} \sum_{i \in I} (y_i - x_i^T w)^2 + \alpha \sum_{j=1}^p \frac{ \vert w_j \vert}{s_j},

where $s_j \in \{s, 1\}$ are independent trials of a fair Bernoulli random variable and $0 < s < 1$ is the scaling factor. By repeating this procedure across different random subsamples and Bernoulli trials, one can count the fraction of times each feature is selected and use these fractions as feature selection scores. RandomizedLasso implements this strategy for regression settings using Lasso, while RandomizedLogisticRegression uses logistic regression and is suitable for classification tasks. To get a full path of stability scores, use lasso_stability_path.
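The procedure can also be sketched by hand with a plain Lasso, which is useful because RandomizedLasso is not available in recent scikit-learn releases. This is not the library's implementation: the function name stability_scores and its parameters are hypothetical, and the data is synthetic. The sketch relies on the observation that penalizing $|w_j| / s_j$ is equivalent to fitting an ordinary Lasso after multiplying column $j$ of the data by $s_j$.

```python
import numpy as np
from sklearn.linear_model import Lasso


def stability_scores(X, y, alpha=0.1, scaling=0.5, sample_fraction=0.75,
                     n_resampling=100, random_state=0):
    """Fraction of randomized Lasso fits in which each feature is selected."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_resampling):
        # Random subsample I of size n_I, drawn without replacement.
        idx = rng.choice(n_samples, size=n_sub, replace=False)
        # s_j equals s or 1, each with probability 1/2.
        s = np.where(rng.rand(n_features) < 0.5, scaling, 1.0)
        # Scaling column j by s_j reproduces the penalty |w_j| / s_j.
        model = Lasso(alpha=alpha, fit_intercept=False).fit(X[idx] * s, y[idx])
        counts += (model.coef_ != 0)
    return counts / n_resampling


# Synthetic data: only the first two of ten features matter.
rng = np.random.RandomState(42)
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.randn(200)

scores = stability_scores(X, y)
print(scores.round(2))
```

Features that are selected in nearly every resampling receive scores close to 1, while irrelevant features are selected only occasionally.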

Note that for randomized sparse models to be more powerful than standard F statistics at detecting non-zero features, the ground-truth model must be sparse; in other words, only a small fraction of the features should be non-zero.

- Examples:
  - [Sparse recovery: feature selection for sparse linear models](http://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_recovery.html#sphx-glr-auto-examples-linear-model-plot-sparse-recovery-py): an example comparing different feature selection approaches and discussing in which situations each approach is to be favored.
- References:
  - [B2009] F. Bach, "Model-Consistent Sparse Estimation through the Bootstrap" https://hal.inria.fr/hal-00354771/
  - [M2010] N. Meinshausen, P. Buhlmann, "Stability selection", Journal of the Royal Statistical Society, 72 (2010) http://arxiv.org/pdf/0809.2932.pdf

1.13.4.3. Tree-based feature selection

Tree-based estimators (see the sklearn.tree module and the forests of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the sklearn.feature_selection.SelectFromModel meta-transformer):

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_ 
array([ 0.04..., 0.05..., 0.4..., 0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape 
(150, 2)

- Examples:
  - Feature importances with forests of trees: an example on synthetic data showing the recovery of the actually meaningful features.
  - [Pixel importances with a parallel forest of trees](http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-faces-py): an example on face recognition data.

1.13.5. Feature selection as part of a pipeline

Feature selection is typically used as a pre-processing step before the actual learning. We recommend using sklearn.pipeline.Pipeline to do this with scikit-learn.

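from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
# X, y are assumed to be an existing feature matrix and target vector.
clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)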

This snippet combines sklearn.svm.LinearSVC with sklearn.feature_selection.SelectFromModel to assess the importance of features and select the most relevant features. The sklearn.ensemble.RandomForestClassifier is then trained on the transformed output, i.e. using only the relevant features. Similar operations can be performed with other feature selection methods and, of course, with classifiers that provide a way to assess the importance of features. See the sklearn.pipeline.Pipeline example for more information.
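One practical benefit of putting the selector inside the pipeline is that it is re-fit on each training fold during cross-validation. The following sketch assumes the iris data and illustrative hyperparameter values (C=0.1, 50 trees) to make the snippet self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

clf = Pipeline([
    ("feature_selection",
     SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))),
    ("classification",
     RandomForestClassifier(n_estimators=50, random_state=0)),
])

# The selector is fit only on the training part of each fold, so the
# held-out samples never influence which features are kept.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```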


© 2010-2016, scikit-learn developers (BSD license)
