[Translation] scikit-learn 0.18 User Guide 1.16. Probability calibration

Google translation of http://scikit-learn.org/0.18/modules/calibration.html [scikit-learn 0.18 User Guide 1. Supervised Learning](http://qiita.com/nazoking@github/items/267f2371757516f8c168#1-%E6%95%99%E5%B8%AB%E4%BB%98%E3%81%8D%E5%AD%A6%E7%BF%92)


1.16. Probability calibration

When performing classification, you often want not only to predict the class label but also to obtain the probability of each label. This probability gives some measure of confidence in the prediction. Some models give poor estimates of the class probabilities, and some do not support probability prediction at all. The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.

A well-calibrated classifier is a probabilistic classifier for which the output of the predict_proba method can be directly interpreted as a confidence level. For instance, among the samples to which a well-calibrated (binary) classifier gives a predict_proba value close to 0.8, approximately 80% actually belong to the positive class. The following plot compares how well the probabilistic predictions of different classifiers are calibrated.

Calibration plots (reliability curves)

LogisticRegression returns well-calibrated predictions by default because it directly optimizes log loss. In contrast, the other methods return biased probabilities, with a different bias for each method.
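The numbers behind a reliability curve like the one discussed above can be computed with sklearn.calibration.calibration_curve. The following is a minimal sketch, not the code used for the original figure: the synthetic dataset, its size, and the number of bins are assumptions chosen for illustration.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem (sizes are illustrative assumptions).
X, y = make_classification(n_samples=5000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
prob_pos = clf.predict_proba(X_test)[:, 1]  # predicted P(y = 1)

# For each non-empty bin of predicted probabilities:
#   fraction_of_positives: observed share of positive samples in the bin
#   mean_predicted_value:  average predicted probability in the bin
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_test, prob_pos, n_bins=10)

for mean_pred, frac_pos in zip(mean_predicted_value, fraction_of_positives):
    print("predicted ~%.2f -> observed %.2f" % (mean_pred, frac_pos))
```

For a well-calibrated classifier, the two printed columns should be close to each other in every bin, which corresponds to a curve near the diagonal.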

Two approaches for calibrating probabilistic predictions are provided: a parametric approach based on Platt's sigmoid model and a non-parametric approach based on isotonic regression (sklearn.isotonic). Probability calibration should be performed on new data that was not used for model fitting. The CalibratedClassifierCV class uses a cross-validation generator and, for each split, estimates the model parameters on the training samples and the calibration on the test samples. The probabilities predicted for the folds are then averaged. Already fitted classifiers can be calibrated by CalibratedClassifierCV via the parameter cv="prefit". In this case, the user must make sure manually that the data used for model fitting and for calibration are disjoint.
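The cross-validated usage described above might look like the following sketch. The dataset and the choice of cv=3 are assumptions for illustration. LinearSVC is used as the base estimator because it has no predict_proba of its own, so the wrapper also shows how calibration adds support for probability prediction.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary problem (an illustrative assumption).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

probas = {}
for method in ("sigmoid", "isotonic"):
    # For each of the 3 folds, LinearSVC is fitted on the training part
    # and the calibrator is fitted on the held-out part.
    calibrated = CalibratedClassifierCV(LinearSVC(), method=method, cv=3)
    calibrated.fit(X_train, y_train)
    probas[method] = calibrated.predict_proba(X_test)
    print(method, probas[method][:3, 1])
```

Because each calibrator only sees the held-out part of its fold, the requirement that calibration data be separate from fitting data is handled by the cross-validation itself.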

The following images demonstrate the benefit of probability calibration. The first image shows a dataset with two classes and three blobs of data. The blob in the middle contains random samples of each class. The probability for the samples in this blob should be 0.5.

The following image shows, on the data above, the probabilities estimated using a Gaussian naive Bayes classifier without calibration, with sigmoid calibration, and with non-parametric isotonic calibration. One can observe that the non-parametric model provides the most accurate probability estimates for the samples in the middle, i.e., 0.5.

The following experiment is performed on an artificial dataset for binary classification with 100,000 samples and 20 features, of which 1,000 samples are used for model fitting. Of the 20 features, only 2 are informative and 10 are redundant. The figure shows the estimated probabilities obtained with logistic regression, a linear support vector classifier (SVC), and linear SVC with both isotonic calibration and sigmoid calibration. Calibration performance is evaluated with the Brier score (brier_score_loss), reported in the legend (the smaller the better).
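A scaled-down version of this experiment can be sketched as follows. It is an approximation under stated assumptions, not the original script: the total sample count is reduced to 20,000 to keep the run short, and the random seed and cv=2 are arbitrary choices. The feature layout (20 features, 2 informative, 10 redundant) and the 1,000 fitting samples follow the description above.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# 20 features: 2 informative, 10 redundant, the rest pure noise.
X, y = make_classification(n_samples=20000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=42)
# Only 1,000 samples are used for fitting; the rest is used for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1000, random_state=42)

models = {
    "logistic regression": LogisticRegression(),
    "linear SVC + sigmoid": CalibratedClassifierCV(
        LinearSVC(), method="sigmoid", cv=2),
    "linear SVC + isotonic": CalibratedClassifierCV(
        LinearSVC(), method="isotonic", cv=2),
}

brier = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    prob_pos = model.predict_proba(X_test)[:, 1]
    # Mean squared difference between predicted probability and outcome.
    brier[name] = brier_score_loss(y_test, prob_pos)
    print("%-22s Brier score: %.4f" % (name, brier[name]))
```

The exact scores depend on the seed and the sample sizes, so they should not be expected to match the values in the original legend.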

Here one can observe that logistic regression is well calibrated, as its curve is nearly diagonal. The calibration curve of linear SVC has a sigmoid shape, which is typical of an under-confident classifier. In the case of LinearSVC, this is caused by the margin property of the hinge loss, which lets the model focus on hard samples close to the decision boundary (the support vectors). Both kinds of calibration can fix this issue and yield nearly identical results. The next figure shows the calibration curve of Gaussian naive Bayes on the same data, with both kinds of calibration and without calibration.

One can see that Gaussian naive Bayes performs very badly, but in a different way than linear SVC. While linear SVC exhibited a sigmoid calibration curve, the calibration curve of Gaussian naive Bayes has a transposed-sigmoid shape. This is typical of an over-confident classifier. In this case, the classifier's overconfidence is caused by the redundant features, which violate the naive Bayes assumption of feature independence.

Calibrating the probabilities of Gaussian naive Bayes with isotonic regression can fix this issue, as can be seen from the nearly diagonal calibration curve. Sigmoid calibration also improves the Brier score slightly, although not as strongly as the non-parametric isotonic calibration. This is an intrinsic limitation of sigmoid calibration, whose parametric form assumes a sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic calibration model, however, makes no such strong assumption and can deal with either shape, provided that there is sufficient calibration data. In general, sigmoid calibration is preferable when the calibration curve is sigmoid and calibration data is limited, while isotonic calibration is preferable for non-sigmoid calibration curves and in situations where a large amount of calibration data is available.

CalibratedClassifierCV can also handle classification tasks that involve more than two classes, if the base estimator can do so. In this case, the classifier is first calibrated for each class separately in a one-vs-rest fashion. When predicting probabilities for unseen data, the calibrated probabilities for each class are predicted separately. As those probabilities do not necessarily sum to one, a post-processing step normalizes them.

The next image illustrates how sigmoid calibration changes the predicted probabilities for a 3-class classification problem. Illustrated is the standard 2-simplex, whose three corners correspond to the three classes. The arrows point from the probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier after sigmoid calibration on a hold-out validation set. The colors indicate the true class of an instance (red: class 1, green: class 2, blue: class 3).
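The normalization step mentioned for the multi-class case can be illustrated in a few lines. This is only a sketch of the idea, not the library's internal code, and the per-class probabilities below are made-up values standing in for the output of three separately fitted one-vs-rest calibrators.

```python
import numpy as np

# Hypothetical calibrated one-vs-rest probabilities for two samples and
# three classes. Each column comes from a separate calibrator, so the
# rows are not guaranteed to sum to 1.
ovr_proba = np.array([[0.7, 0.2, 0.3],
                      [0.1, 0.1, 0.1]])
print(ovr_proba.sum(axis=1))   # row sums before normalization

# Post-processing: divide each row by its sum so that it becomes a
# valid probability vector over the three classes.
normalized = ovr_proba / ovr_proba.sum(axis=1, keepdims=True)
print(normalized)
print(normalized.sum(axis=1))  # row sums after normalization
```

After this step, each row is a point on the 2-simplex, which is what the arrows in the figure connect.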

The base classifier is a random forest classifier with 25 base estimators (trees). If this classifier is trained on all 800 training data points, it is overly confident in its predictions and thus incurs a large log loss. Calibrating an identical classifier, trained on 600 data points, with method='sigmoid' on the remaining 200 data points reduces the confidence of the predictions, i.e., moves the probability vectors from the edges of the simplex towards the center.

This calibration results in a lower log loss. Note that an alternative would have been to increase the number of base estimators, which would have resulted in a similar decrease in log loss.
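The setup described above can be approximated with the sketch below. The three-class blob dataset, its spread, and the seeds are assumptions, so the resulting log losses will differ from those of the original figure. The 0.18 release covered here calibrates an already fitted classifier with cv="prefit". The FrozenEstimator branch reflects an assumption about newer scikit-learn releases, where that wrapper is the replacement for cv="prefit".

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

# Three overlapping classes (illustrative assumption); make_blobs shuffles.
X, y = make_blobs(n_samples=1000, n_features=2, centers=3,
                  cluster_std=5.0, random_state=42)
X_train, y_train = X[:600], y[:600]          # model fitting
X_valid, y_valid = X[600:800], y[600:800]    # calibration
X_test, y_test = X[800:], y[800:]            # evaluation

# Uncalibrated forest trained on all 800 non-test points.
clf_all = RandomForestClassifier(n_estimators=25, random_state=0)
clf_all.fit(X[:800], y[:800])
loss_uncalibrated = log_loss(y_test, clf_all.predict_proba(X_test))

# Forest trained on 600 points, then sigmoid-calibrated on the other 200.
clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X_train, y_train)
try:
    # Newer releases (assumed: scikit-learn >= 1.6).
    from sklearn.frozen import FrozenEstimator
    calibrated = CalibratedClassifierCV(FrozenEstimator(clf), method="sigmoid")
except ImportError:
    # Older releases, including 0.18.
    calibrated = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
calibrated.fit(X_valid, y_valid)
proba_calibrated = calibrated.predict_proba(X_test)
loss_calibrated = log_loss(y_test, proba_calibrated)

print("log loss, uncalibrated: %.3f" % loss_uncalibrated)
print("log loss, calibrated:   %.3f" % loss_calibrated)
print("rows sum to 1:", np.allclose(proba_calibrated.sum(axis=1), 1.0))
```

Keeping the 600 fitting points and the 200 calibration points in separate slices is what ensures the two sets are disjoint, which has to be done by hand when the classifier is fitted beforehand.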



© 2010-2016, scikit-learn developers (BSD license).
