Machine Learning: Basic Techniques
Alvaro J. Riascos Villegas
Universidad de los Andes and Quantil
July 6, 2018
Contents

1 Optimal Learning Algorithms
2 Linear Regression and k-NN
   Feature selection
   Best subset, forward, backward and stagewise
   Principal components
3 Regularization
4 Classification and Regression Trees
5 Random Forests
6 Model validation
   ROC Curve
   Calibration Curve
Bayes regression and classification algorithms

For the regression problem it is possible to prove that the best learning function, when the loss is quadratic, is:

f(x) = E_P[Y | X]

For the classification problem it is possible to prove that the best learning function, when the loss is zero-one, is: f(x) = 1 if P(Y | X) >= 0.5 and zero otherwise. For multiple categories this is easily generalized.
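A minimal sketch of the zero-one-optimal rule (not from the slides: the simulated data and the logistic form of P(Y = 1 | X = x) are assumptions made so that the true conditional probability is known exactly and can be thresholded at 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_y_given_x(x):
    # Assumed true conditional probability P(Y=1 | X=x) for this toy example.
    return 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0)))

x = rng.normal(size=10_000)
y = rng.binomial(1, p_y_given_x(x))               # labels drawn from the true P(Y|X)

bayes_pred = (p_y_given_x(x) >= 0.5).astype(int)  # the Bayes classifier f(x)
print("estimated Bayes error:", np.mean(bayes_pred != y))
```

No learning algorithm can beat this error rate in expectation; real methods only approximate P(Y | X) from data.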
Linear Regression

The Linear Regression model assumes:

f(x) \approx x^T \beta

If we minimize the risk subject to the restriction that functions must be linear, we obtain:

\beta = E(XX^T)^{-1} E(XY)

The model assumes that f(x) is globally linear.
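A minimal sketch of this formula (the simulated data are an assumption): replacing the population expectations E(XX^T) and E(XY) with sample averages gives the ordinary least squares estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Sample analogue of E(XX^T)^{-1} E(XY):
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true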
k-NN: k-Nearest Neighbors

k-NN estimates the conditional expected value locally as a constant function:

f(x) \approx \mathrm{Mean}(y_i | x_i \in N_k(x))
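A minimal sketch of this local average (data and distance choice are illustrative assumptions): find the k nearest training points and average their responses.

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k=5):
    dist = np.linalg.norm(X_train - x0, axis=1)   # distances to all training points
    nearest = np.argsort(dist)[:k]                # indices of the neighborhood N_k(x0)
    return y_train[nearest].mean()                # locally constant fit

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)
print(knn_predict(np.array([1.0]), X, y, k=10))   # roughly sin(1.0)
```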
Comparison between Linear Regression and k-NN

Both methods approximate E(Y | X = x) using averages, but they make different assumptions about the true learning function:

Linear Regression assumes that f(x) is globally linear.
k-NN assumes that f(x) is locally constant.
Feature selection

Two common problems:

1 Test error: it is possible to reduce the test error by reducing the number of variables (this reduces complexity and variance), although it increases bias.
2 Interpretation: a smaller number of features allows an easier and better interpretation.

We are going to discuss different ways of reducing the number of features.
Feature selection: Best subset of features

Best subset of features: the subset of features which produces the smallest test error is chosen.

Computationally expensive: it is only viable for models with fewer than about 40 features.
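A sketch of best-subset selection by exhaustive enumeration (the data and the held-out split used to score subsets are illustrative assumptions). The 2^p subsets explain the "fewer than 40 features" caveat; here p = 6.

```python
import itertools
import numpy as np

def fit_ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=n)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

best_err, best_subset = np.inf, None
for k in range(1, p + 1):
    for subset in itertools.combinations(range(p), k):   # all 2^p - 1 subsets
        cols = list(subset)
        beta = fit_ols(X_tr[:, cols], y_tr)
        err = np.mean((y_te - X_te[:, cols] @ beta) ** 2) # held-out error
        if err < best_err:
            best_err, best_subset = err, subset
print("best subset:", best_subset, "test MSE:", round(best_err, 4))
```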
Feature selection: Best subset of features

[Figure 3.5 (Hastie et al., The Elements of Statistical Learning): All possible subset models for the prostate cancer example. At each subset size is shown the residual sum-of-squares for each model of that size.]

...cross-validation [is used] to estimate prediction error and select k; the AIC criterion is a popular alternative. We defer more detailed discussion of these and other approaches to Chapter 7 [of Hastie et al.].
3.3.2 Forward- and Backward-Stepwise Selection

Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9). Like best-subset regression, forward stepwise produces a sequence of models indexed by k, the subset size, which must be determined.

Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models. In this sense it might seem sub-optimal compared to best-subset selection. However, there are several reasons why it might be preferred:
Feature selection: Forward, Backward and Stagewise selection

Forward: start with a model with only a constant and sequentially add the feature which reduces the prediction error the most. At each stage the model is re-estimated (see the sketch after this list).

Backward: start with a model with all the features and sequentially eliminate the feature that contributes least to the final prediction (this can be done using the Z-score). At each stage the model is re-estimated.

Stagewise: start with a model with only a constant and sequentially add the feature that has the largest correlation with the residual errors of the last model. Do not re-estimate.
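A minimal sketch of the forward variant (data and the choice of residual sum of squares as the fit criterion are illustrative assumptions): greedily add the feature that most reduces the RSS, refitting at each stage.

```python
import numpy as np

def rss(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 1] - X[:, 4] + rng.normal(size=n)

intercept = np.ones((n, 1))
selected, remaining = [], list(range(p))   # start with the intercept only
for _ in range(3):                         # greedily add three features
    scores = {j: rss(np.hstack([intercept, X[:, selected + [j]]]), y)
              for j in remaining}
    j_best = min(scores, key=scores.get)   # feature with the largest RSS drop
    selected.append(j_best)
    remaining.remove(j_best)
print("order of entry:", selected)
```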
Feature selection: Growth Regressions

[Excerpt from Hal R. Varian:] ...model averaging, a technique related to, but not identical with, spike-and-slab. Hendry and Krolzig (2004) examined an iterative significance test selection method.

Table 4 shows ten predictors that were chosen by Sala-i-Martín (1997) using his two million regressions, Ley and Steel (2009) using Bayesian model averaging, LASSO, and spike-and-slab. The table is based on that in Ley and Steel (2009) but the metrics used are not strictly comparable across the various models. The "Bayesian model averaging" and "spike-and-slab" columns show posterior probabilities of inclusion; the "LASSO" column just shows the ordinal importance of the variable or a dash indicating that it was not included in the chosen model; and the CDF(0) measure is defined in Sala-i-Martín (1997).

The LASSO and the Bayesian techniques are very computationally efficient and would likely be preferred to exhaustive search. All four of these variable selection methods give similar results for the first four or five variables, after which they diverge. In this particular case, the dataset appears to be too small to resolve the question of what is "important" for economic growth.
Variable Selection in Time Series Applications

The machine learning techniques described up until now are generally applied to cross-sectional data where independently distributed data is a plausible assumption. However, there are also techniques that work with time series. Here we...
Table 4: Comparing Variable Selection Algorithms: Which Variables Appeared as Important Predictors of Economic Growth?

Predictor              Bayesian model averaging   CDF(0)   LASSO   Spike-and-Slab
GDP level 1960                 1.000              1.000      -         0.9992
Fraction Confucian             0.995              1.000      2         0.9730
Life expectancy                0.946              0.942      -         0.9610
Equipment investment           0.757              0.997      1         0.9532
Sub-Saharan dummy              0.656              1.000      7         0.5834
Fraction Muslim                0.656              1.000      8         0.6590
Rule of law                    0.516              1.000      -         0.4532
Open economy                   0.502              1.000      6         0.5736
Degree of capitalism           0.471              0.987      9         0.4230
Fraction Protestant            0.461              0.966      5         0.3798

Source: The table is based on that in Ley and Steel (2009); the data analyzed is from Sala-i-Martín (1997).
Notes: We illustrate different methods of variable selection. This exercise involved examining a dataset of 72 countries and 42 variables in order to see which variables appeared to be important predictors of economic growth. The table shows ten predictors that were chosen by Sala-i-Martín (1997) using a CDF(0) measure defined in the 1997 paper, and by Ley and Steel (2009) using Bayesian model averaging, LASSO, and spike-and-slab regressions. Metrics used are not strictly comparable across the various models. The "Bayesian model averaging" and "Spike-and-Slab" columns are posterior probabilities of inclusion; the "LASSO" column just shows the ordinal importance of the variable or a dash indicating that it was not included in the chosen model.
Principal components

Sometimes many features are highly correlated.

The information they give to the model is redundant.

Can we construct a few new features that explain most of the variation that the original features contain?

What if we wanted to build a single feature to replace all the others? How could we do that?
Principal components

The principal components are new features.

They are linear combinations of the original features.

The first principal component is the linear combination that maximizes the variance.

The second principal component is the linear combination that maximizes the variance subject to being orthogonal to the first principal component.

Proceeding successively, p principal components can be built.
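A minimal sketch of this construction via the singular value decomposition (the correlated toy data are an assumption): the right singular vectors of the centered data matrix are the loading directions, and the squared singular values give each component's share of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated features

Xc = X - X.mean(axis=0)                  # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                       # the principal components (new features)
explained = s**2 / np.sum(s**2)          # fraction of variance per component
print(np.round(explained, 3))            # decreasing; first PC dominates
```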
Principal Components

Application: handwritten digits.

[Figure 14.23 (Hastie et al.): Left panel: the first two principal components of the handwritten threes. The circled points are the closest projected images to the vertices of a grid, defined by the marginal quantiles of the principal components. Right panel: the images corresponding to the circled points. These show the nature of the first two principal components.]

[Figure 14.24 (Hastie et al.): The 256 singular values for the digitized threes, compared to those for a randomized version of the data (each column of X was scrambled).]
PCA in practice

1 Apply PCA.
2 Choose the first PCs that explain x% of the data's variance.
3 Build a classification model using the PCs.

This can turn a 200-feature model into a 5-feature model (see the sketch below).
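A sketch of this recipe, assuming scikit-learn is available (the digits dataset and the 95% variance threshold are illustrative choices, not from the slides): keep the components explaining 95% of the variance, then classify on the component scores.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)                # 64 pixel features
model = make_pipeline(PCA(n_components=0.95),      # PCs explaining 95% of variance
                      LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())   # accuracy on far fewer features
```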
Regularization

The best subset can have the smallest test error, but it is a discrete method: features are either included or excluded fully.

This makes the method have high variance.

Regularization methods are more continuous and have smaller variance.

Let us consider Ridge Regression and the Lasso.

Regularization is a key concept in machine learning. A must for economists.

It allows one to control complexity, trading bias for variance.
Regularization: Ridge

Solve:

\min_{\beta_0, \beta} \Big\{ \sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \|\beta\|^2 \Big\}   (1)
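A minimal sketch of (1) in closed form (the simulated data are an assumption): on centered variables, so that the intercept β0 is not penalized, the solution solves (X^T X + λI)β = X^T y.

```python
import numpy as np

def ridge(X, y, lam):
    Xc = X - X.mean(axis=0)                   # center so the intercept is unpenalized
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y.mean() - X.mean(axis=0) @ beta  # recover the intercept
    return beta0, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=100)
for lam in (0.0, 10.0, 1000.0):
    print(lam, np.round(ridge(X, y, lam)[1], 3))   # coefficients shrink toward 0
```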
Regularization: Ridge

[Table 3.3 (Hastie et al.): Estimated coefficients and test error results, for different subset and shrinkage methods applied to the prostate data. The blank entries correspond to variables omitted.]

Term          LS      Best Subset   Ridge    Lasso    PCR      PLS
Intercept     2.465   2.477         2.452    2.468    2.497    2.452
lcavol        0.680   0.740         0.420    0.533    0.543    0.419
lweight       0.263   0.316         0.238    0.169    0.289    0.344
age          -0.141                -0.046            -0.152   -0.026
lbph          0.210                 0.162    0.002    0.214    0.220
svi           0.305                 0.227    0.094    0.315    0.243
lcp          -0.288                 0.000            -0.051    0.079
gleason      -0.021                 0.040             0.232    0.011
pgg45         0.267                 0.133            -0.056    0.084
Test Error    0.521   0.492         0.492    0.479    0.449    0.528
Std Error     0.179   0.143         0.165    0.164    0.105    0.152
[From Hastie et al., Section 3.4:] The ridge coefficients minimize a penalized residual sum of squares,

\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}.   (3.41)

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other). The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay (Chapter 11).

An equivalent way to write the ridge problem is

\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t,   (3.42)

which makes explicit the size constraint on the parameters. There is a one-to-one correspondence between the parameters λ in (3.41) and t in (3.42). When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin. By imposing a size constraint on the coefficients, as in (3.42), this problem is alleviated.

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.41). In addition, ...
Regularization: Ridge

[Figure 3.8 (Hastie et al.): Profiles of ridge coefficients for the prostate cancer example, as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.]
Summing up

There are several techniques for controlling model complexity.

In the k-NN model, complexity is controlled with k.

In the linear regression model, it is controlled with variable selection techniques (subset selection, backward, forward, PCA, etc.) and regularization.

Controlling complexity trades bias for variance, with the aim of reducing the test error.
CART

The best classifier can be expressed by:

f(x) = \sum_{m=1}^{M} c_m I\{x \in R_m\}   (2)

where the R_m are different regions on which the function is approximated by a constant c_m.

Some regions are difficult to describe (top left panel, next figure).

An alternative is to find regions by making sequential binary partitions.
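A minimal sketch of recursive binary partitioning (scikit-learn and the synthetic data are assumptions, not prescribed by the slides); the fitted tree is exactly a piecewise-constant function of the form (2).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(export_text(tree))                     # the sequential binary splits
print("test accuracy:", tree.score(X_te, y_te))
```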
Classification and Regression Trees (CART)

Panels 2, 3 and 4 represent regions obtained by making sequential binary partitions.

[Figure 9.2 (Hastie et al.): Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.]
Example: Home Mortgage Disclosure Act

[Excerpt from Hal R. Varian:] ...race was included as one of the predictors. The coefficient on race showed a statistically significant negative impact on probability of getting a mortgage for black applicants. This finding prompted considerable subsequent debate and discussion; see Ladd (1998) for an overview.

Here I examine this question using the tree-based estimators described in the previous section. The data consists of 2,380 observations of 12 predictors, one of which was race. Figure 5 shows a conditional tree estimated using the R package party.

The tree fits pretty well, misclassifying 228 of the 2,380 observations for an error rate of 9.6 percent. By comparison, a simple logistic regression does slightly better, misclassifying 225 of the 2,380 observations, leading to an error rate of 9.5 percent. As you can see in Figure 5, the most important variable is "dmi" = "denied mortgage insurance." This variable alone explains much of the variation in the data. The race variable ("black") shows up far down the tree and seems to be relatively unimportant.

One way to gauge whether a variable is important is to exclude it from the prediction and see what happens. When this is done, it turns out that the accuracy of the tree-based model doesn't change at all: exactly the same cases are misclassified. Of course, it is perfectly possible that there was racial discrimination...

Figure 5: Home Mortgage Disclosure Act (HMDA) Data Tree

Notes: Figure 5 shows a conditional tree estimated using the R package party. The black bars indicate the fraction of each group who were denied mortgages. The most important determinant of this is the variable "dmi," or "denied mortgage insurance." Other variables are: "dir," debt payments to total income ratio; "hir," housing expenses to income ratio; "lvr," ratio of size of loan to assessed value of property; "ccs," consumer credit score; "mcs," mortgage credit score; "pbcr," public bad credit record; "self," self-employed; "single," applicant is single; "uria," 1989 Massachusetts unemployment rate in the applicant's industry; "condominium," unit is condominium; "black," race of applicant black; and "deny," mortgage application denied.
Classification Trees

[Figure 9.5 (Hastie et al.): The pruned tree for the spam example. The split variables are shown in blue on the branches, and the classification is shown in every node. The numbers under the terminal nodes indicate misclassification rates on the test data.]
Classification Trees

[Figure 9.4 (Hastie et al.): Results for the spam example. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with standard error bars. The minimum occurs at a tree size with about 17 terminal nodes (using the "one-standard-error" rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α. The tree sizes shown refer to |T_α|, the size of the original tree indexed by α.]
Random Forests

It is a technique that consists of constructing many uncorrelated trees and then averaging them. By doing that, the variance is reduced.
Random Forests: Algorithm

For b = 1, ..., B:

Draw a bootstrap sample Z* of the same size as the original sample.

Grow a tree T_b on Z* using the following steps, until each node contains fewer than n_min observations:

1 Randomly select m variables out of the p variables.
2 Choose the best partition over the m variables in the node.
3 Repeat 1 and 2 until the minimum size is reached in each leaf.

For the regression task, average the trees. For the classification task, use the majority vote (each tree contributes one vote). A sketch follows below.
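A compact sketch of the regression version of this algorithm (delegating the tree-growing step to scikit-learn's DecisionTreeRegressor is an implementation choice, and the data are simulated): B bootstrap samples, m variables tried at each split, minimum leaf size n_min, predictions averaged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest(X, y, B=100, m=None, n_min=5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = m or max(1, p // 3)                  # regression default (see recommendations)
    trees = []
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # bootstrap sample Z*
        tree = DecisionTreeRegressor(max_features=m,        # m variables per split
                                     min_samples_leaf=n_min,
                                     random_state=b)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)   # average the trees

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)
forest = random_forest(X, y)
print(predict(forest, X[:3]))
```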
Random Forests: Properties

When each tree grows sufficiently large, it is possible to decrease the bias (at the cost of increasing the variance). The averaging then reduces the variance.

The average bias is similar to the bias of each individual tree, because every tree is built using the same process. Hence, the potential benefits come from the decrease in variance.

The variance reduction is obtained efficiently by averaging uncorrelated trees.

The random choice of variables at each partition guarantees that a few variables do not dominate the regression.
Random Forests: Performance

[Figure 15.1 (Hastie et al.): Bagging, random forest, and gradient boosting, applied to the spam data. For boosting, 5-node trees were used, and the number of trees was chosen by 10-fold cross-validation (2500 trees). Each "step" in the figure corresponds to a change in a single misclassification (in a test set of 1536).]
Random Forests: Performance

[Figure 15.3 (Hastie et al.): Random forests compared to gradient boosting on the California housing data. The curves represent mean absolute error on the test data as a function of the number of trees in the models. Two random forests are shown, with m = 2 and m = 6. The two gradient boosted models use a shrinkage parameter ν = 0.05 in (10.41), and have interaction depths of 4 and 6. The boosted models outperform random forests.]
Random Forests: Recommendations

For classification, m = √p and n_min = 1.

For regression, m = p/3 and n_min = 5.

In practice, the parameters must be calibrated. For instance, in the California housing example, other parameter values work better (see the sketch below).
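A sketch of these defaults in scikit-learn terms (the library choice is an assumption): max_features plays the role of m and min_samples_leaf the role of n_min; in practice both should be tuned, e.g. by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(max_features="sqrt", min_samples_leaf=1)  # m = sqrt(p), n_min = 1
reg = RandomForestRegressor(max_features=1 / 3, min_samples_leaf=5)    # m = p/3,     n_min = 5

X, y = make_regression(n_samples=300, n_features=9, random_state=0)
print(reg.fit(X, y).score(X, y))   # tune max_features / min_samples_leaf in practice
```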
Example: Relative importance

To measure the importance of each variable, we first define the relative importance of each variable in a single tree.

For each variable, its importance can be measured as the sum of the squared reductions in error at each node where the variable is used to split.
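A sketch of this measure in scikit-learn (an assumed library; note it sums impurity reductions, e.g. the Gini index for classification, rather than squared error reductions, which is the analogous quantity for classification trees), averaged over the trees of the forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ averages, over trees, the weighted impurity reduction
# obtained at every node where the variable is used to split.
for j, imp in sorted(enumerate(forest.feature_importances_), key=lambda t: -t[1]):
    print(f"x{j}: {imp:.3f}")
```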
Example: Relative importance

[Figure 15.5 (Hastie et al.): Variable importance plots for a classification random forest grown on the spam data. The left plot bases the importance on the Gini splitting index, as in gradient boosting. The rankings compare well with the rankings produced by gradient boosting. The right plot uses OOB randomization to compute variable importances, and tends to spread the importances more uniformly.]
Classification Models

Regression models: AIC, R², MAPE, etc.

Classification models: ROC curve, calibration curve, etc.
ROC Curve

The ROC curve and the area under the curve are important validation methods for classification problems.
ROC Curve

Receiver operating characteristic

We also briefly explain the concept of an ROC curve. The construction of an ROC curve is illustrated in figure 2, showing possible distributions of rating scores for defaulting and non-defaulting debtors. For a perfect rating model, the left distribution and the right distribution in figure 2 would be separate. For real rating systems, perfect discrimination in general is not possible. Both distributions will overlap, as illustrated in figure 2.

Assume someone has to find out from the rating scores which debtors will survive during the next period and which debtors will default. One possibility for the decision-maker would be to introduce a cutoff value C as in figure 2, and to classify each debtor with a rating score lower than C as a potential defaulter and each debtor with a rating score higher than C as a non-defaulter. Then four decision results would be possible. If the rating score is below the cutoff value C and the debtor defaults subsequently, the decision was correct. Otherwise the decision-maker wrongly classified a non-defaulter as a defaulter. If the rating score is above the cutoff value and the debtor does not default, the classification was correct. Otherwise, a defaulter was incorrectly assigned to the non-defaulters group.

Using the notation of Sobehart & Keenan (2001), we define the hit rate HR(C) as:

HR(C) = \frac{H(C)}{N_D},

where H(C) (equal to the light area in figure 2) is the number of defaulters predicted correctly with the cutoff value C, and N_D is the total number of defaulters in the sample. The false alarm rate FAR(C) (equal to the dark area in figure 2) is defined as:

FAR(C) = \frac{F(C)}{N_{ND}},

where F(C) is the number of false alarms, that is, the number of non-defaulters that were classified incorrectly as defaulters by using the cutoff value C. The total number of non-defaulters in the sample is denoted by N_{ND}. The ROC curve is constructed as follows. For all cutoff values C that are contained in the range of the rating scores, the quantities HR(C) and FAR(C) are calculated. The ROC curve is a plot of HR(C) versus FAR(C). This is shown in figure 3.

A rating model's performance is better the steeper the ROC curve is at the left end and the closer the ROC curve's position is to the point (0, 1). Similarly, the larger the area below the ROC curve, the better the model. We denote this area by A. It can be calculated as:

A = \int_0^1 HR \, d(FAR)

The area A is 0.5 for a random model without discriminative power and it is 1.0 for a perfect model. It is between 0.5 and 1.0 for any reasonable rating model in practice.
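A minimal sketch (the simulated scores are an assumption) computing HR(C), FAR(C) and the area A directly from the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)
s_def = rng.normal(loc=-1.0, size=300)   # rating scores of defaulters (lower on average)
s_nod = rng.normal(loc=+1.0, size=700)   # rating scores of non-defaulters

# Sweep the cutoff C over all observed scores and compute HR(C) and FAR(C).
cutoffs = np.sort(np.concatenate([s_def, s_nod]))
HR = np.array([(s_def < c).mean() for c in cutoffs])    # H(C) / N_D
FAR = np.array([(s_nod < c).mean() for c in cutoffs])   # F(C) / N_ND

# Area under the ROC curve by the trapezoid rule: A = integral of HR d(FAR).
A = np.sum(np.diff(FAR) * (HR[1:] + HR[:-1]) / 2)
print("A =", round(A, 3))   # ~0.5 for a random model, 1.0 for a perfect one
```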
Connection between ROC curves and CAP curves

We prove a relation between the accuracy ratio and the area under the ROC curve (A) in order to demonstrate that both measures are equivalent. By a simple calculation, we get for the area a_P between the CAP of the perfect rating model and the CAP of the random model:

a_P = \frac{0.5 \, N_{ND}}{N_D + N_{ND}}

We introduce some additional notation. If we randomly draw a debtor from the total sample of debtors, the resulting score is described by a random variable S_T. If the debtor is drawn randomly from the sample of defaulters only, the corresponding random variable is denoted by S_D, and if the debtor is drawn from the sample of non-defaulters only, the random variable is denoted by S_{ND}. Note that HR(C) = P(S_D < C) and FAR(C) = P(S_{ND} < C).

To calculate the area a_R between the CAP of the rating model being validated and the CAP of the random model, we need the cumulative distribution function P(S_T < C), where S_T is the distribution of the rating scores in the total population of all debtors. In terms of S_D and S_{ND}, the cumulative distribution function P(S_T < C) can be expressed as:

P(S_T < C) = \frac{N_D \, P(S_D < C) + N_{ND} \, P(S_{ND} < C)}{N_D + N_{ND}}

Since we assumed that the distributions of S_D and S_{ND} are continuous, we have P(S_D = C) = P(S_{ND} = C) = 0 for all attainable scores C.

Using this, we find for the area a_R:

a_R = \int_0^1 P(S_D < C) \, dP(S_T < C) - 0.5
    = \frac{N_D \int_0^1 P(S_D < C) \, dP(S_D < C) + N_{ND} \int_0^1 P(S_D < C) \, dP(S_{ND} < C)}{N_D + N_{ND}} - 0.5
    = \frac{0.5 \, N_D + N_{ND} \, A}{N_D + N_{ND}} - 0.5
    = \frac{N_{ND} \, (A - 0.5)}{N_D + N_{ND}}

With these expressions for a_P and a_R, the accuracy ratio can be calculated as:

AR = \frac{a_R}{a_P} = \frac{N_{ND} \, (A - 0.5)}{0.5 \, N_{ND}} = 2 \, (A - 0.5)

This means that the accuracy ratio can be calculated directly from the area below the ROC curve and vice versa. Hence, both summary statistics contain the same information.
[Figure 2: Distribution of rating scores for defaulting and non-defaulting debtors. Frequency vs. rating score; a cutoff value C separates predicted defaulters (low scores) from non-defaulters (high scores).]
[Figure 3: Receiver operating characteristic curves. Hit rate vs. false alarm rate for the random model, the rating model being validated, and the perfect model; A denotes the area below the rating model's ROC curve.]
Binary classification models can be extended to multi-classclassification models.
ROC Curve

Consider the graph of the cumulative distributions of good and bad scores. The score at which the distance between these two distributions is maximal defines the Kolmogorov-Smirnov (KS) distance.

If we draw these two distribution functions in the same plot, we obtain the ROC curve: the y-axis shows the bad-score distribution function and the x-axis the good-score distribution function, i.e., sensitivity vs. (1 - specificity).

The KS distance represents the score where the horizontal distance between the ROC curve and the diagonal line is maximum (slope = 1).

The Gini coefficient is two times the area between the diagonal line and the ROC curve (see the sketch below).

In the ROC curve, KS is the point where the curve has slope 1, i.e., the greatest distance to the diagonal.
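A minimal sketch (the score distributions are illustrative assumptions) computing the KS distance and the Gini coefficient from the two cumulative distributions described above:

```python
import numpy as np

rng = np.random.default_rng(0)
bad = rng.beta(2, 5, size=400)    # scores of bad cases (lower on average)
good = rng.beta(5, 2, size=600)   # scores of good cases

grid = np.sort(np.concatenate([bad, good]))
F_bad = np.array([(bad <= s).mean() for s in grid])     # y-axis of the ROC curve
F_good = np.array([(good <= s).mean() for s in grid])   # x-axis of the ROC curve

ks = np.max(F_bad - F_good)                   # Kolmogorov-Smirnov distance
auc = np.sum(np.diff(F_good) * (F_bad[1:] + F_bad[:-1]) / 2)
gini = 2 * auc - 1                            # twice the area between ROC and diagonal
print(f"KS = {ks:.3f}, Gini = {gini:.3f}")
```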
ROC Curve

[Figure: Distributions of the predicted default probability (PI) at origination ("otorgamiento"), at 3 months, and at 12 months, for performing ("cumplen") and defaulting ("incumplen") debtors.]
ROC Curve

[Figure: Cumulative distributions of the predicted default probability (PI) at origination, at 3 months, and at 12 months, for performing and defaulting debtors.]
[Figure: ROC curves for the model (true positive rate vs. false positive rate), computed using 0, 3, 6, 9 and 12 past observations.]
Calibration Curve

Measures the error between the predicted frequencies of an event and the observed frequencies.

In most machine learning applications, a χ² test is used to determine the statistical significance of the error.
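A sketch of one common way to form such a χ² test (a Hosmer-Lemeshow-style grouping; the slides do not prescribe a specific construction, and the data and decile binning are illustrative assumptions): bin the predictions, compare observed and predicted event counts per bin, and sum the standardized squared errors.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.01, 0.99, size=2000)   # predicted event probabilities
y = rng.binomial(1, p_hat)                   # outcomes; perfectly calibrated by construction

# Decile bins of the predictions: compare observed vs. predicted event counts.
edges = np.quantile(p_hat, np.linspace(0, 1, 11))
idx = np.digitize(p_hat, edges[1:-1])        # bin index 0..9 for every prediction
stat = 0.0
for b in range(10):
    mask = idx == b
    n, obs, exp = mask.sum(), y[mask].sum(), p_hat[mask].sum()
    stat += (obs - exp) ** 2 / (exp * (1 - exp / n))   # Hosmer-Lemeshow-style term
print("chi2 =", round(stat, 2), " p-value =", round(chi2.sf(stat, df=8), 3))
```

A large statistic (small p-value) indicates a miscalibrated model; plotting observed vs. predicted frequencies per bin gives the calibration curve itself.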