Predicting Good Probabilities With Supervised Learning
Alexandru Niculescu-Mizil
Rich Caruana
Cornell University
What are good probabilities?
Ideally, if the model predicts 0.75 for an example, then the conditional probability, given the available attributes, that the example is positive is 0.75.
In practice:
Good calibration: out of all the cases the model predicts 0.75 for, 75% are positive.
Low Brier score (squared error).
Low cross-entropy (log-loss).
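A minimal sketch (not from the talk) of computing the last two scores with scikit-learn; y_true and y_prob are placeholder arrays standing in for real labels and predictions:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y_true = np.array([0, 1, 1, 0, 1])            # 0/1 labels
y_prob = np.array([0.1, 0.8, 0.7, 0.3, 0.9])  # predicted probabilities

print("Brier score:", brier_score_loss(y_true, y_prob))  # squared error
print("Log-loss:", log_loss(y_true, y_prob))             # cross-entropy
```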
Why good probabilities?
Intelligibility
If the classifier is part of a larger system:
Speech recognition
Handwriting recognition
If the classifier is used for decision making:
Cost-sensitive decisions
Medical applications
Meteorology
Risk analysis
What did we do?
We analyzed the predictions made by ten supervised learning algorithms.
For the analysis we used eight binary classification problems.
Limitations:
Only binary problems, no multiclass.
No high-dimensional problems (dimensionality under 200).
Only moderately sized training sets.
Questions addressed in this talk
Which models are well calibrated and which are not?
Can we fix the models that are not well calibrated?
Which learning algorithm makes the best probabilistic predictions?
Reliability diagrams
Put the cases with predicted values between 0 and 0.1 in the first bin, between 0.1 and 0.2 in the second, etc.
For each bin, plot the mean predicted value against the true fraction of positives.
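As an illustration (not in the original slides), a small sketch of this binning; reliability_diagram is a hypothetical helper:

```python
import numpy as np

def reliability_diagram(y_true, y_prob, n_bins=10):
    """Return (mean predicted value, true fraction of positives) per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so a prediction of exactly 1.0
    # falls into the last bin instead of overflowing.
    ids = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    points = []
    for b in range(n_bins):
        mask = ids == b
        if mask.any():  # skip empty bins
            points.append((y_prob[mask].mean(), y_true[mask].mean()))
    return points
```

For a perfectly calibrated model the plotted points lie on the diagonal.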
Which models are well calibrated?
[Reliability diagrams for ANN, BAG-DT, LOGREG, SVM, BST-DT, BST-STMP, RF, DT, KNN, and NB]
Questions addressed in this talk
Which models are well calibrated and which are not?
Can we fix the models that are not well calibrated?
Which learning algorithm makes the best probabilistic predictions?
Can we fix the models that are not well calibrated?
Platt Scaling
Method used by Platt to obtain calibrated probabilities from SVMs. [Platt '99]
Converts the outputs by passing them through a sigmoid.
The sigmoid is fitted using an independent calibration set.
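A rough sketch of the idea, assuming scikit-learn: fitting a logistic regression to the one-dimensional classifier scores is a simple stand-in for Platt's sigmoid fit (Platt's actual procedure fits the sigmoid by maximum likelihood with regularized targets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_cal, y_cal):
    """Fit p = 1 / (1 + exp(-(a*f + b))) on a held-out calibration set."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores_cal).reshape(-1, 1), y_cal)
    return lr

def platt_probabilities(lr, scores):
    """Map raw classifier scores to calibrated probabilities."""
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```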
Can we fix the models that are not well calibrated?
Isotonic Regression [Robertson et al. '88]
More general calibration method used by Zadrozny and Elkan. [Zadrozny & Elkan '01, '02]
Converts the outputs by passing them through a general isotonic (monotonically increasing) function.
The isotonic function is fitted using an independent calibration set.
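A corresponding sketch with scikit-learn's IsotonicRegression, which fits this kind of monotone function (the helper name and calibration-set split are assumptions for illustration):

```python
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(scores_cal, y_cal):
    """Fit a monotonically increasing map from raw scores to probabilities."""
    iso = IsotonicRegression(out_of_bounds="clip")  # clamp scores outside the fitted range
    iso.fit(scores_cal, y_cal)
    return iso

# Usage: calibrated = fit_isotonic(scores_cal, y_cal).predict(scores_test)
```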
Max-margin methods
[Histograms and reliability plots for BST-DT, BST-STMP, and SVM, before and after calibration with Platt Scaling and Isotonic Regression]
Predictions are pushed away from 0 and 1.
Reliability plots have a sigmoidal shape.
Calibration undoes the shift in predictions: more cases have predicted values closer to 0 and 1.
Boosted decision trees
[Histograms and reliability plots for boosted decision trees, before and after Platt Scaling and Isotonic Regression, on the eight problems: COVT, ADULT, LET1, LET2, MEDIS, SLAC, HS, and MG]
Naive Bayes
[Histograms and reliability plots for Naive Bayes, before and after Platt Scaling and Isotonic Regression]
Naive Bayes pushes predictions toward 0 and 1 because of its unrealistic independence assumptions.
This generates reliability plots that have an inverted sigmoid shape.
Even though Platt Scaling helps improve the calibration, it is clear that a sigmoid is not the right function for Naive Bayes models.
Isotonic Regression provides a better fit.
Platt Scaling vs. Isotonic Regression
[Brier score vs. calibration set size (10 to 10000 cases) for ANN, BST-DT, RF, and NB, comparing uncalibrated predictions (UNCAL) with Platt Scaling (PLATT) and Isotonic Regression (ISO)]
Questions addressed in this talk
Which models are well calibrated and which are not?
Can we fix the models that are not well calibrated?
Which learning algorithm makes the best probabilistic predictions?
Empirical Comparison
[Brier score comparison of the ten learning algorithms: BST-DT, SVM, RF, ANN, BAG, KNN, STMP, DT, LR, NB]
Summary and Conclusions
We examined the quality of the probabilities predicted by ten supervised learning algorithms.
Neural nets, bagged trees and logistic regression have well calibrated predictions.
Max-margin methods such as boosting and SVMs push the predicted values away from 0 and 1. This yields a sigmoid-shaped reliability diagram.
Learning algorithms such as Naive Bayes distort the probabilities in the opposite way, pushing them closer to 0 and 1.
Summary and Conclusions
We examined two methods to calibrate the predictions.
Max-margin methods and Naive Bayes benefit greatly from calibration, while well-calibrated methods do not.
Platt Scaling is more effective when the calibration set is small, but Isotonic Regression is more powerful when there is enough data to prevent overfitting.
The methods that predict the best probabilities are calibrated boosted trees, calibrated random forests, calibrated SVMs, uncalibrated bagged trees and uncalibrated neural nets.
Thank you!
Questions?