Machine Learning: Basic Techniques
Alvaro J. Riascos Villegas
Universidad de los Andes and Quantil
July 6, 2018
Contents

1 Optimal Learning Algorithms
2 Linear Regression and k-NN
   Feature selection
   Best subset, forward, backward and stagewise
   Principal components
3 Regularization
4 Classification and Regression Trees
5 Random Forests
6 Model validation
   ROC Curve
   Calibration Curve
Bayes regression and classification algorithms

For the regression problem it is possible to prove that the best learning function, when the loss is quadratic, is:

f(x) = E_P[Y | X]

For the classification problem it is possible to prove that the best learning function, when the loss is zero-one, is: f(x) = 1 if P(Y | X) >= 0.5 and zero otherwise. For multiple categories this is easily generalized.
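A minimal sketch of the zero-one-optimal rule (not from the slides: the simulated data and the logistic form of P(Y = 1 | X = x) are assumptions made so that the true conditional probability is known exactly and can be thresholded at 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_y_given_x(x):
    # Assumed true conditional probability P(Y=1 | X=x) for this toy example.
    return 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0)))

x = rng.normal(size=10_000)
y = rng.binomial(1, p_y_given_x(x))               # labels drawn from the true P(Y|X)

bayes_pred = (p_y_given_x(x) >= 0.5).astype(int)  # the Bayes classifier f(x)
print("estimated Bayes error:", np.mean(bayes_pred != y))
```

No learning algorithm can beat this error rate in expectation; real methods only approximate P(Y | X) from data.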
Linear Regression

The Linear Regression model assumes:

f(x) \approx x^T \beta

If we minimize the risk subject to the restriction that functions must be linear, we obtain:

\beta = E(XX^T)^{-1} E(XY)

The model assumes that f(x) is globally linear.
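A minimal sketch of this formula (the simulated data are an assumption): replacing the population expectations E(XX^T) and E(XY) with sample averages gives the ordinary least squares estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Sample analogue of E(XX^T)^{-1} E(XY):
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true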
k-NN: k-Nearest Neighbors

k-NN estimates the conditional expected value locally as a constant function:

f(x) \approx \mathrm{Mean}(y_i | x_i \in N_k(x))
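A minimal sketch of this local average (data and distance choice are illustrative assumptions): find the k nearest training points and average their responses.

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k=5):
    dist = np.linalg.norm(X_train - x0, axis=1)   # distances to all training points
    nearest = np.argsort(dist)[:k]                # indices of the neighborhood N_k(x0)
    return y_train[nearest].mean()                # locally constant fit

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)
print(knn_predict(np.array([1.0]), X, y, k=10))   # roughly sin(1.0)
```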
Comparison between Linear Regression and k-NN

Both methods approximate E(Y | X = x) using averages, but they make different assumptions about the true learning function:

Linear Regression assumes that f(x) is globally linear.
k-NN assumes that f(x) is locally constant.
Feature selection

Two common problems:

1 Test error: it is possible to reduce the test error by reducing the number of variables (this reduces complexity and variance), although it increases bias.
2 Interpretation: a smaller number of features allows an easier and better interpretation.

We are going to discuss different ways of reducing the number of features.
Feature selection: Best subset of features

Best subset of features: the subset of features which produces the smallest test error is chosen.

Computationally expensive: it is only viable for models with fewer than about 40 features.
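A sketch of best-subset selection by exhaustive enumeration (the data and the held-out split used to score subsets are illustrative assumptions). The 2^p subsets explain the "fewer than 40 features" caveat; here p = 6.

```python
import itertools
import numpy as np

def fit_ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=n)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

best_err, best_subset = np.inf, None
for k in range(1, p + 1):
    for subset in itertools.combinations(range(p), k):   # all 2^p - 1 subsets
        cols = list(subset)
        beta = fit_ols(X_tr[:, cols], y_tr)
        err = np.mean((y_te - X_te[:, cols] @ beta) ** 2) # held-out error
        if err < best_err:
            best_err, best_subset = err, subset
print("best subset:", best_subset, "test MSE:", round(best_err, 4))
```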
Feature selection: Best subset of features

[Figure 3.5 (Hastie et al., The Elements of Statistical Learning): All possible subset models for the prostate cancer example. At each subset size is shown the residual sum-of-squares for each model of that size.]

...cross-validation [is used] to estimate prediction error and select k; the AIC criterion is a popular alternative. We defer more detailed discussion of these and other approaches to Chapter 7 [of Hastie et al.].
3.3.2 Forward- and Backward-Stepwise Selection

Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9). Like best-subset regression, forward stepwise produces a sequence of models indexed by k, the subset size, which must be determined.

Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models. In this sense it might seem sub-optimal compared to best-subset selection. However, there are several reasons why it might be preferred:
Feature selection: Forward, Backward and Stagewise selection

Forward: start with a model with only a constant and sequentially add the feature which reduces the prediction error the most. At each stage the model is re-estimated (see the sketch after this list).

Backward: start with a model with all the features and sequentially eliminate the feature that contributes least to the final prediction (this can be done using the Z-score). At each stage the model is re-estimated.

Stagewise: start with a model with only a constant and sequentially add the feature that has the largest correlation with the residual errors of the last model. Do not re-estimate.
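A minimal sketch of the forward variant (data and the choice of residual sum of squares as the fit criterion are illustrative assumptions): greedily add the feature that most reduces the RSS, refitting at each stage.

```python
import numpy as np

def rss(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 1] - X[:, 4] + rng.normal(size=n)

intercept = np.ones((n, 1))
selected, remaining = [], list(range(p))   # start with the intercept only
for _ in range(3):                         # greedily add three features
    scores = {j: rss(np.hstack([intercept, X[:, selected + [j]]]), y)
              for j in remaining}
    j_best = min(scores, key=scores.get)   # feature with the largest RSS drop
    selected.append(j_best)
    remaining.remove(j_best)
print("order of entry:", selected)
```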
Feature selection: Growth Regressions

[Excerpt from Hal R. Varian:] ...model averaging, a technique related to, but not identical with, spike-and-slab. Hendry and Krolzig (2004) examined an iterative significance test selection method.

Table 4 shows ten predictors that were chosen by Sala-i-Martín (1997) using his two million regressions, Ley and Steel (2009) using Bayesian model averaging, LASSO, and spike-and-slab. The table is based on that in Ley and Steel (2009) but the metrics used are not strictly comparable across the various models. The "Bayesian model averaging" and "spike-and-slab" columns show posterior probabilities of inclusion; the "LASSO" column just shows the ordinal importance of the variable or a dash indicating that it was not included in the chosen model; and the CDF(0) measure is defined in Sala-i-Martín (1997).

The LASSO and the Bayesian techniques are very computationally efficient and would likely be preferred to exhaustive search. All four of these variable selection methods give similar results for the first four or five variables, after which they diverge. In this particular case, the dataset appears to be too small to resolve the question of what is "important" for economic growth.
Variable Selection in Time Series Applications

The machine learning techniques described up until now are generally applied to cross-sectional data where independently distributed data is a plausible assumption. However, there are also techniques that work with time series. Here we...
Table 4: Comparing Variable Selection Algorithms: Which Variables Appeared as Important Predictors of Economic Growth?

Predictor              Bayesian model averaging   CDF(0)   LASSO   Spike-and-Slab
GDP level 1960                 1.000              1.000      -         0.9992
Fraction Confucian             0.995              1.000      2         0.9730
Life expectancy                0.946              0.942      -         0.9610
Equipment investment           0.757              0.997      1         0.9532
Sub-Saharan dummy              0.656              1.000      7         0.5834
Fraction Muslim                0.656              1.000      8         0.6590
Rule of law                    0.516              1.000      -         0.4532
Open economy                   0.502              1.000      6         0.5736
Degree of capitalism           0.471              0.987      9         0.4230
Fraction Protestant            0.461              0.966      5         0.3798

Source: The table is based on that in Ley and Steel (2009); the data analyzed is from Sala-i-Martín (1997).
Notes: We illustrate different methods of variable selection. This exercise involved examining a dataset of 72 countries and 42 variables in order to see which variables appeared to be important predictors of economic growth. The table shows ten predictors that were chosen by Sala-i-Martín (1997) using a CDF(0) measure defined in the 1997 paper, and by Ley and Steel (2009) using Bayesian model averaging, LASSO, and spike-and-slab regressions. Metrics used are not strictly comparable across the various models. The "Bayesian model averaging" and "Spike-and-Slab" columns are posterior probabilities of inclusion; the "LASSO" column just shows the ordinal importance of the variable or a dash indicating that it was not included in the chosen model.
Principal components

Sometimes many features are highly correlated.

The information they give to the model is redundant.

Can we construct a few new features that explain most of the variation that the original features contain?

What if we wanted to build a single feature to replace all the others? How could we do that?
Principal components

The principal components are new features.

They are linear combinations of the original features.

The first principal component is the linear combination that maximizes the variance.

The second principal component is the linear combination that maximizes the variance subject to being orthogonal to the first principal component.

Proceeding successively, p principal components can be built.
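A minimal sketch of this construction via the singular value decomposition (the correlated toy data are an assumption): the right singular vectors of the centered data matrix are the loading directions, and the squared singular values give each component's share of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated features

Xc = X - X.mean(axis=0)                  # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                       # the principal components (new features)
explained = s**2 / np.sum(s**2)          # fraction of variance per component
print(np.round(explained, 3))            # decreasing; first PC dominates
```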
Principal Components

Application: handwritten digits.

[Figure 14.23 (Hastie et al.): Left panel: the first two principal components of the handwritten threes. The circled points are the closest projected images to the vertices of a grid, defined by the marginal quantiles of the principal components. Right panel: the images corresponding to the circled points. These show the nature of the first two principal components.]

[Figure 14.24 (Hastie et al.): The 256 singular values for the digitized threes, compared to those for a randomized version of the data (each column of X was scrambled).]
PCA in practice

1 Apply PCA.
2 Choose the first PCs that explain x% of the data's variance.
3 Build a classification model using the PCs.

This can turn a 200-feature model into a 5-feature model (see the sketch below).
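A sketch of this recipe, assuming scikit-learn is available (the digits dataset and the 95% variance threshold are illustrative choices, not from the slides): keep the components explaining 95% of the variance, then classify on the component scores.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)                # 64 pixel features
model = make_pipeline(PCA(n_components=0.95),      # PCs explaining 95% of variance
                      LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())   # accuracy on far fewer features
```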
Regularization

The best subset can have the smallest test error, but it is a discrete method: features are either included or excluded fully.

This makes the method have high variance.

Regularization methods are more continuous and have smaller variance.

Let us consider Ridge Regression and the Lasso.

Regularization is a key concept in machine learning. A must for economists.

It allows one to control complexity, trading bias for variance.
Regularization: Ridge

Solve:

\min_{\beta_0, \beta} \Big\{ \sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \|\beta\|^2 \Big\}   (1)
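A minimal sketch of (1) in closed form (the simulated data are an assumption): on centered variables, so that the intercept β0 is not penalized, the solution solves (X^T X + λI)β = X^T y.

```python
import numpy as np

def ridge(X, y, lam):
    Xc = X - X.mean(axis=0)                   # center so the intercept is unpenalized
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y.mean() - X.mean(axis=0) @ beta  # recover the intercept
    return beta0, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=100)
for lam in (0.0, 10.0, 1000.0):
    print(lam, np.round(ridge(X, y, lam)[1], 3))   # coefficients shrink toward 0
```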
Regularization: Ridge

[Table 3.3 (Hastie et al.): Estimated coefficients and test error results, for different subset and shrinkage methods applied to the prostate data. The blank entries correspond to variables omitted.]

Term          LS      Best Subset   Ridge    Lasso    PCR      PLS
Intercept     2.465   2.477         2.452    2.468    2.497    2.452
lcavol        0.680   0.740         0.420    0.533    0.543    0.419
lweight       0.263   0.316         0.238    0.169    0.289    0.344
age          -0.141                -0.046            -0.152   -0.026
lbph          0.210                 0.162    0.002    0.214    0.220
svi           0.305                 0.227    0.094    0.315    0.243
lcp          -0.288                 0.000            -0.051    0.079
gleason      -0.021                 0.040             0.232    0.011
pgg45         0.267                 0.133            -0.056    0.084
Test Error    0.521   0.492         0.492    0.479    0.449    0.528
Std Error     0.179   0.143         0.165    0.164    0.105    0.152
[From Hastie et al., Section 3.4:] The ridge coefficients minimize a penalized residual sum of squares,

\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}.   (3.41)

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other). The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay (Chapter 11).

An equivalent way to write the ridge problem is

\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t,   (3.42)

which makes explicit the size constraint on the parameters. There is a one-to-one correspondence between the parameters λ in (3.41) and t in (3.42). When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin. By imposing a size constraint on the coefficients, as in (3.42), this problem is alleviated.

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.41). In addition, ...
Regularization: Ridge

[Figure 3.8 (Hastie et al.): Profiles of ridge coefficients for the prostate cancer example, as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.]
Summing up

There are several techniques for controlling model complexity.

In the k-NN model, complexity is controlled with k.

In the linear regression model, it is controlled with variable selection techniques (subset selection, backward, forward, PCA, etc.) and regularization.

Controlling complexity trades bias for variance, with the aim of reducing the test error.
CART

The best classifier can be expressed by:

f(x) = \sum_{m=1}^{M} c_m I\{x \in R_m\}   (2)

where the R_m are different regions on which the function is approximated by a constant c_m.

Some regions are difficult to describe (top left panel, next figure).

An alternative is to find regions by making sequential binary partitions.
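A minimal sketch of recursive binary partitioning (scikit-learn and the synthetic data are assumptions, not prescribed by the slides); the fitted tree is exactly a piecewise-constant function of the form (2).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(export_text(tree))                     # the sequential binary splits
print("test accuracy:", tree.score(X_te, y_te))
```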
Classification and Regression Trees (CART)

Panels 2, 3 and 4 represent regions obtained by making sequential binary partitions.

[Figure 9.2 (Hastie et al.): Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.]
Example: Home Mortgage Disclosure Act

[Excerpt from Hal R. Varian:] ...race was included as one of the predictors. The coefficient on race showed a statistically significant negative impact on probability of getting a mortgage for black applicants. This finding prompted considerable subsequent debate and discussion; see Ladd (1998) for an overview.

Here I examine this question using the tree-based estimators described in the previous section. The data consists of 2,380 observations of 12 predictors, one of which was race. Figure 5 shows a conditional tree estimated using the R package party.

The tree fits pretty well, misclassifying 228 of the 2,380 observations for an error rate of 9.6 percent. By comparison, a simple logistic regression does slightly better, misclassifying 225 of the 2,380 observations, leading to an error rate of 9.5 percent. As you can see in Figure 5, the most important variable is "dmi" = "denied mortgage insurance." This variable alone explains much of the variation in the data. The race variable ("black") shows up far down the tree and seems to be relatively unimportant.

One way to gauge whether a variable is important is to exclude it from the prediction and see what happens. When this is done, it turns out that the accuracy of the tree-based model doesn't change at all: exactly the same cases are misclassified. Of course, it is perfectly possible that there was racial discrimination...

Figure 5: Home Mortgage Disclosure Act (HMDA) Data Tree

Notes: Figure 5 shows a conditional tree estimated using the R package party. The black bars indicate the fraction of each group who were denied mortgages. The most important determinant of this is the variable "dmi," or "denied mortgage insurance." Other variables are: "dir," debt payments to total income ratio; "hir," housing expenses to income ratio; "lvr," ratio of size of loan to assessed value of property; "ccs," consumer credit score; "mcs," mortgage credit score; "pbcr," public bad credit record; "self," self-employed; "single," applicant is single; "uria," 1989 Massachusetts unemployment rate in the applicant's industry; "condominium," unit is condominium; "black," race of applicant black; and "deny," mortgage application denied.
Classification Trees

[Figure 9.5 (Hastie et al.): The pruned tree for the spam example. The split variables are shown in blue on the branches, and the classification is shown in every node. The numbers under the terminal nodes indicate misclassification rates on the test data.]
Classification Trees

[Figure 9.4 (Hastie et al.): Results for the spam example. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with standard error bars. The minimum occurs at a tree size with about 17 terminal nodes (using the "one-standard-error" rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α. The tree sizes shown refer to |T_α|, the size of the original tree indexed by α.]
Random Forests

It is a technique that consists of constructing many uncorrelated trees and then averaging them. By doing that, the variance is reduced.
Random Forests: Algorithm

For b = 1, ..., B:

Draw a bootstrap sample Z* of the same size as the original sample.

Grow a tree T_b on Z* using the following steps, until each node contains fewer than n_min observations:

1 Randomly select m variables out of the p variables.
2 Choose the best partition over the m variables in the node.
3 Repeat 1 and 2 until the minimum size is reached in each leaf.

For the regression task, average the trees. For the classification task, use the majority vote (each tree contributes one vote). A sketch follows below.
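A compact sketch of the regression version of this algorithm (delegating the tree-growing step to scikit-learn's DecisionTreeRegressor is an implementation choice, and the data are simulated): B bootstrap samples, m variables tried at each split, minimum leaf size n_min, predictions averaged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_forest(X, y, B=100, m=None, n_min=5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = m or max(1, p // 3)                  # regression default (see recommendations)
    trees = []
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # bootstrap sample Z*
        tree = DecisionTreeRegressor(max_features=m,        # m variables per split
                                     min_samples_leaf=n_min,
                                     random_state=b)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)   # average the trees

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)
forest = random_forest(X, y)
print(predict(forest, X[:3]))
```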
Random Forests: Properties

When each tree grows sufficiently large, it is possible to decrease the bias (at the cost of increasing the variance). The averaging then reduces the variance.

The average bias is similar to the bias of each individual tree, because every tree is built using the same process. Hence, the potential benefits come from the decrease in variance.

The variance reduction is obtained efficiently by averaging uncorrelated trees.

The random choice of variables at each partition guarantees that a few variables do not dominate the regression.
Random Forests: Performance

[Figure 15.1 (Hastie et al.): Bagging, random forest, and gradient boosting, applied to the spam data. For boosting, 5-node trees were used, and the number of trees was chosen by 10-fold cross-validation (2500 trees). Each "step" in the figure corresponds to a change in a single misclassification (in a test set of 1536).]
Random Forests: Performance

[Figure 15.3 (Hastie et al.): Random forests compared to gradient boosting on the California housing data. The curves represent mean absolute error on the test data as a function of the number of trees in the models. Two random forests are shown, with m = 2 and m = 6. The two gradient boosted models use a shrinkage parameter ν = 0.05 in (10.41), and have interaction depths of 4 and 6. The boosted models outperform random forests.]
Random Forests: Recommendations

For classification, m = √p and n_min = 1.

For regression, m = p/3 and n_min = 5.

In practice, the parameters must be calibrated. For instance, in the California housing example, other parameter values work better (see the sketch below).
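A sketch of these defaults in scikit-learn terms (the library choice is an assumption): max_features plays the role of m and min_samples_leaf the role of n_min; in practice both should be tuned, e.g. by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(max_features="sqrt", min_samples_leaf=1)  # m = sqrt(p), n_min = 1
reg = RandomForestRegressor(max_features=1 / 3, min_samples_leaf=5)    # m = p/3,     n_min = 5

X, y = make_regression(n_samples=300, n_features=9, random_state=0)
print(reg.fit(X, y).score(X, y))   # tune max_features / min_samples_leaf in practice
```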
Example: Relative importance

To measure the importance of each variable, we first define the relative importance of each variable in a single tree.

For each variable, its importance can be measured as the sum of the squared reductions in error at each node where the variable is used to split.
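A sketch of this measure in scikit-learn (an assumed library; note it sums impurity reductions, e.g. the Gini index for classification, rather than squared error reductions, which is the analogous quantity for classification trees), averaged over the trees of the forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ averages, over trees, the weighted impurity reduction
# obtained at every node where the variable is used to split.
for j, imp in sorted(enumerate(forest.feature_importances_), key=lambda t: -t[1]):
    print(f"x{j}: {imp:.3f}")
```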
Example: Relative importance

[Figure 15.5 (Hastie et al.): Variable importance plots for a classification random forest grown on the spam data. The left plot bases the importance on the Gini splitting index, as in gradient boosting. The rankings compare well with the rankings produced by gradient boosting. The right plot uses OOB randomization to compute variable importances, and tends to spread the importances more uniformly.]
Classification Models

Regression models: AIC, R², MAPE, etc.

Classification models: ROC curve, calibration curve, etc.
ROC Curve

The ROC curve and the area under the curve are important validation methods for classification problems.
ROC Curve

Receiver operating characteristic

We also briefly explain the concept of an ROC curve. The construction of an ROC curve is illustrated in figure 2, showing possible distributions of rating scores for defaulting and non-defaulting debtors. For a perfect rating model, the left distribution and the right distribution in figure 2 would be separate. For real rating systems, perfect discrimination in general is not possible. Both distributions will overlap, as illustrated in figure 2.

Assume someone has to find out from the rating scores which debtors will survive during the next period and which debtors will default. One possibility for the decision-maker would be to introduce a cutoff value C as in figure 2, and to classify each debtor with a rating score lower than C as a potential defaulter and each debtor with a rating score higher than C as a non-defaulter. Then four decision results would be possible. If the rating score is below the cutoff value C and the debtor defaults subsequently, the decision was correct. Otherwise the decision-maker wrongly classified a non-defaulter as a defaulter. If the rating score is above the cutoff value and the debtor does not default, the classification was correct. Otherwise, a defaulter was incorrectly assigned to the non-defaulters group.

Using the notation of Sobehart & Keenan (2001), we define the hit rate HR(C) as:

HR(C) = \frac{H(C)}{N_D},

where H(C) (equal to the light area in figure 2) is the number of defaulters predicted correctly with the cutoff value C, and N_D is the total number of defaulters in the sample. The false alarm rate FAR(C) (equal to the dark area in figure 2) is defined as:

FAR(C) = \frac{F(C)}{N_{ND}},

where F(C) is the number of false alarms, that is, the number of non-defaulters that were classified incorrectly as defaulters by using the cutoff value C. The total number of non-defaulters in the sample is denoted by N_{ND}. The ROC curve is constructed as follows. For all cutoff values C that are contained in the range of the rating scores, the quantities HR(C) and FAR(C) are calculated. The ROC curve is a plot of HR(C) versus FAR(C). This is shown in figure 3.

A rating model's performance is better the steeper the ROC curve is at the left end and the closer the ROC curve's position is to the point (0, 1). Similarly, the larger the area below the ROC curve, the better the model. We denote this area by A. It can be calculated as:

A = \int_0^1 HR \, d(FAR)

The area A is 0.5 for a random model without discriminative power and it is 1.0 for a perfect model. It is between 0.5 and 1.0 for any reasonable rating model in practice.
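A minimal sketch (the simulated scores are an assumption) computing HR(C), FAR(C) and the area A directly from the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)
s_def = rng.normal(loc=-1.0, size=300)   # rating scores of defaulters (lower on average)
s_nod = rng.normal(loc=+1.0, size=700)   # rating scores of non-defaulters

# Sweep the cutoff C over all observed scores and compute HR(C) and FAR(C).
cutoffs = np.sort(np.concatenate([s_def, s_nod]))
HR = np.array([(s_def < c).mean() for c in cutoffs])    # H(C) / N_D
FAR = np.array([(s_nod < c).mean() for c in cutoffs])   # F(C) / N_ND

# Area under the ROC curve by the trapezoid rule: A = integral of HR d(FAR).
A = np.sum(np.diff(FAR) * (HR[1:] + HR[:-1]) / 2)
print("A =", round(A, 3))   # ~0.5 for a random model, 1.0 for a perfect one
```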
Connection between ROC curves and CAP curves

We prove a relation between the accuracy ratio and the area under the ROC curve (A) in order to demonstrate that both measures are equivalent. By a simple calculation, we get for the area a_P between the CAP of the perfect rating model and the CAP of the random model:

a_P = \frac{0.5 \, N_{ND}}{N_D + N_{ND}}

We introduce some additional notation. If we randomly draw a debtor from the total sample of debtors, the resulting score is described by a random variable S_T. If the debtor is drawn randomly from the sample of defaulters only, the corresponding random variable is denoted by S_D, and if the debtor is drawn from the sample of non-defaulters only, the random variable is denoted by S_{ND}. Note that HR(C) = P(S_D < C) and FAR(C) = P(S_{ND} < C).

To calculate the area a_R between the CAP of the rating model being validated and the CAP of the random model, we need the cumulative distribution function P(S_T < C), where S_T is the distribution of the rating scores in the total population of all debtors. In terms of S_D and S_{ND}, the cumulative distribution function P(S_T < C) can be expressed as:

P(S_T < C) = \frac{N_D \, P(S_D < C) + N_{ND} \, P(S_{ND} < C)}{N_D + N_{ND}}

Since we assumed that the distributions of S_D and S_{ND} are continuous, we have P(S_D = C) = P(S_{ND} = C) = 0 for all attainable scores C.

Using this, we find for the area a_R:

a_R = \int_0^1 P(S_D < C) \, dP(S_T < C) - 0.5
    = \frac{N_D \int_0^1 P(S_D < C) \, dP(S_D < C) + N_{ND} \int_0^1 P(S_D < C) \, dP(S_{ND} < C)}{N_D + N_{ND}} - 0.5
    = \frac{0.5 \, N_D + N_{ND} \, A}{N_D + N_{ND}} - 0.5
    = \frac{N_{ND} \, (A - 0.5)}{N_D + N_{ND}}

With these expressions for a_P and a_R, the accuracy ratio can be calculated as:

AR = \frac{a_R}{a_P} = \frac{N_{ND} \, (A - 0.5)}{0.5 \, N_{ND}} = 2 \, (A - 0.5)

This means that the accuracy ratio can be calculated directly from the area below the ROC curve and vice versa. Hence, both summary statistics contain the same information.
[Figure 2: Distribution of rating scores for defaulting and non-defaulting debtors. Frequency vs. rating score; a cutoff value C separates predicted defaulters (low scores) from non-defaulters (high scores).]
[Figure 3: Receiver operating characteristic curves. Hit rate vs. false alarm rate for the random model, the rating model being validated, and the perfect model; A denotes the area below the rating model's ROC curve.]
Binary classification models can be extended to multi-classclassification models.
ROC Curve

Consider the graph of the cumulative distributions of good and bad scores. The score at which the distance between these two distributions is maximal defines the Kolmogorov-Smirnov (KS) distance.

If we draw these two distribution functions in the same plot, we obtain the ROC curve: the y-axis shows the bad-score distribution function and the x-axis the good-score distribution function, i.e., sensitivity vs. (1 - specificity).

The KS distance represents the score where the horizontal distance between the ROC curve and the diagonal line is maximum (slope = 1).

The Gini coefficient is two times the area between the diagonal line and the ROC curve (see the sketch below).

In the ROC curve, KS is the point where the curve has slope 1, i.e., the greatest distance to the diagonal.
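A minimal sketch (the score distributions are illustrative assumptions) computing the KS distance and the Gini coefficient from the two cumulative distributions described above:

```python
import numpy as np

rng = np.random.default_rng(0)
bad = rng.beta(2, 5, size=400)    # scores of bad cases (lower on average)
good = rng.beta(5, 2, size=600)   # scores of good cases

grid = np.sort(np.concatenate([bad, good]))
F_bad = np.array([(bad <= s).mean() for s in grid])     # y-axis of the ROC curve
F_good = np.array([(good <= s).mean() for s in grid])   # x-axis of the ROC curve

ks = np.max(F_bad - F_good)                   # Kolmogorov-Smirnov distance
auc = np.sum(np.diff(F_good) * (F_bad[1:] + F_bad[:-1]) / 2)
gini = 2 * auc - 1                            # twice the area between ROC and diagonal
print(f"KS = {ks:.3f}, Gini = {gini:.3f}")
```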
ROC Curve

[Figure: Distributions of the predicted default probability (PI) at origination ("otorgamiento"), at 3 months, and at 12 months, for performing ("cumplen") and defaulting ("incumplen") debtors.]
ROC Curve

[Figure: Cumulative distributions of the predicted default probability (PI) at origination, at 3 months, and at 12 months, for performing and defaulting debtors.]
[Figure: ROC curves for the model (true positive rate vs. false positive rate), computed using 0, 3, 6, 9 and 12 past observations.]
Calibration Curve

Measures the error between the predicted frequencies of an event and the observed frequencies.

In most machine learning applications, a χ² test is used to determine the statistical significance of the error.
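A sketch of one common way to form such a χ² test (a Hosmer-Lemeshow-style grouping; the slides do not prescribe a specific construction, and the data and decile binning are illustrative assumptions): bin the predictions, compare observed and predicted event counts per bin, and sum the standardized squared errors.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.01, 0.99, size=2000)   # predicted event probabilities
y = rng.binomial(1, p_hat)                   # outcomes; perfectly calibrated by construction

# Decile bins of the predictions: compare observed vs. predicted event counts.
edges = np.quantile(p_hat, np.linspace(0, 1, 11))
idx = np.digitize(p_hat, edges[1:-1])        # bin index 0..9 for every prediction
stat = 0.0
for b in range(10):
    mask = idx == b
    n, obs, exp = mask.sum(), y[mask].sum(), p_hat[mask].sum()
    stat += (obs - exp) ** 2 / (exp * (1 - exp / n))   # Hosmer-Lemeshow-style term
print("chi2 =", round(stat, 2), " p-value =", round(chi2.sf(stat, df=8), 3))
```

A large statistic (small p-value) indicates a miscalibrated model; plotting observed vs. predicted frequencies per bin gives the calibration curve itself.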