
Bayesian decision theory

Andrea Passerini [email protected]

Machine Learning


Bayesian decision theory: introduction

Overview

Bayesian decision theory allows us to make optimal decisions in a fully probabilistic setting.
It assumes all relevant probabilities are known.
It allows us to derive upper bounds on achievable errors and to evaluate classifiers accordingly.
Bayesian reasoning can be generalized to cases in which the probabilistic structure is not entirely known.


Bayesian decision theory: introduction

Binary classification

Assume examples (x, y) ∈ X × {−1,1} are drawn from a known distribution p(x, y).
The task is to predict the class y of an example given its input x.
Bayes' rule allows us to write this in probabilistic terms as:

P(y|x) = p(x|y) P(y) / p(x)


Bayesian decision theory: introduction

Bayes rule

Bayes' rule allows us to compute the posterior probability given the likelihood, the prior, and the evidence:

posterior = (likelihood × prior) / evidence

posterior P(y|x) is the probability that the class is y given that x was observed
likelihood p(x|y) is the probability of observing x given that its class is y
prior P(y) is the prior probability of the class, without any evidence
evidence p(x) is the probability of the observation, and by the law of total probability it can be computed as:

p(x) = ∑_{i=1}^{2} p(x|y_i) P(y_i)
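As a concrete illustration, here is a minimal Python sketch of Bayes' rule for a two-class problem; the Gaussian class-conditional densities, their parameters, and the priors are invented for illustration and are not from the slides:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem: univariate Gaussian likelihoods p(x|y_i)
likelihoods = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.0)]
priors = np.array([0.6, 0.4])  # hypothetical P(y_i)

def posterior(x):
    # Bayes' rule: P(y_i|x) = p(x|y_i) P(y_i) / p(x),
    # with the evidence p(x) given by the law of total probability.
    joint = np.array([l.pdf(x) for l in likelihoods]) * priors
    return joint / joint.sum()

print(posterior(0.5))  # posterior probabilities of the two classes at x = 0.5
```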


Bayesian decision theory: introduction

Probability of error

The probability of error given x is:

P(error|x) =
  P(y_2|x) if we decide y_1
  P(y_1|x) if we decide y_2

The average probability of error is:

P(error) = ∫_{−∞}^{∞} P(error|x) p(x) dx


Bayesian decision theory: introduction

Bayes decision rule

y_B = argmax_{y_i ∈ {−1,1}} P(y_i|x) = argmax_{y_i ∈ {−1,1}} p(x|y_i) P(y_i)

The probability of error given x is:

P(error|x) = min[P(y_1|x), P(y_2|x)]

The Bayes decision rule minimizes the probability of error
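A minimal sketch of the Bayes decision rule in this binary setting (the Gaussian likelihoods and priors are again invented for illustration):

```python
import numpy as np
from scipy.stats import norm

likelihoods = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.0)]  # p(x|y_i), hypothetical
priors = np.array([0.6, 0.4])                                        # P(y_i), hypothetical

def bayes_decide(x):
    # Maximize p(x|y_i) P(y_i); equivalent to maximizing the posterior,
    # since the evidence p(x) is the same for both classes.
    joint = np.array([l.pdf(x) for l in likelihoods]) * priors
    return int(np.argmax(joint))

def p_error_given_x(x):
    # P(error|x) = min_i P(y_i|x) under the Bayes decision rule
    joint = np.array([l.pdf(x) for l in likelihoods]) * priors
    return (joint / joint.sum()).min()

print(bayes_decide(0.2), p_error_given_x(0.2))
```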


Bayesian decision theory: Continuous features

Setting

Inputs are vectors in a d-dimensional Euclidean space ℝ^d, called the feature space.
The output is one of c possible categories or classes, Y = {y_1, . . . , y_c} (binary/multiclass classification).
A decision rule is a function f : X → Y telling which class to predict for a given observation.
A loss function ℓ : Y × Y → ℝ measures the cost of predicting y_i when the true class is y_j.

The conditional risk of predicting y given x is:

R(y|x) = ∑_{i=1}^{c} ℓ(y, y_i) P(y_i|x)
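A sketch of the conditional-risk computation for a hypothetical three-class problem (the loss matrix and posterior values are invented for illustration):

```python
import numpy as np

# loss[i, j] plays the role of l(y_i, y_j): the cost of predicting y_i
# when the true class is y_j (values invented for illustration).
loss = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 2.0],
                 [2.0, 1.0, 0.0]])
post = np.array([0.5, 0.3, 0.2])  # hypothetical P(y_j|x) for some fixed x

# Conditional risk R(y_i|x) = sum_j loss(y_i, y_j) P(y_j|x)
risks = loss @ post
print(risks, np.argmin(risks))  # argmin anticipates the Bayes rule below
```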


Bayesian decision theory: Continuous features

Bayes decision rule

y* = argmin_{y ∈ Y} R(y|x)

The overall risk of a decision rule f is given by

R[f] = ∫ R(f(x)|x) p(x) dx = ∫ ∑_{i=1}^{c} ℓ(f(x), y_i) p(y_i, x) dx

The Bayes decision rule minimizes the overall risk. The Bayes risk R* is the overall risk of a Bayes decision rule, and it is the best performance achievable.


Minimum-error-rate classification

Zero-one loss

ℓ(y, y_i) =
  0 if y = y_i
  1 if y ≠ y_i

It assigns a unit cost to a misclassified example and zero cost to a correctly classified one. It assumes that all errors have the same cost. Its risk corresponds to the average probability of error:

R(y_i|x) = ∑_{j=1}^{c} ℓ(y_i, y_j) P(y_j|x) = ∑_{j ≠ i} P(y_j|x) = 1 − P(y_i|x)
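Under zero-one loss, minimizing the conditional risk is therefore the same as picking the class with maximum posterior; a one-line sketch with hypothetical posterior values:

```python
import numpy as np

post = np.array([0.5, 0.3, 0.2])            # hypothetical P(y_i|x)
risks = 1.0 - post                          # R(y_i|x) under zero-one loss
assert np.argmin(risks) == np.argmax(post)  # minimum risk <=> maximum posterior
```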


Minimum-error-rate classification: binary case

[Figure: class-conditional densities p(x|ω1) and p(x|ω2) over x, with decision regions R1 and R2 induced by the likelihood-ratio thresholds θa and θb]

Bayes decision rule

y_B = argmax_{y_i ∈ {−1,1}} P(y_i|x) = argmax_{y_i ∈ {−1,1}} p(x|y_i) P(y_i)

Likelihood ratio: decide y_1 if

p(x|y_1) P(y_1) > p(x|y_2) P(y_2)

that is, if

p(x|y_1) / p(x|y_2) > P(y_2) / P(y_1) = θ_a
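The same rule expressed as a likelihood-ratio test (self-contained sketch with invented Gaussian likelihoods and priors):

```python
from scipy.stats import norm

likelihoods = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.0)]  # hypothetical p(x|y_i)
priors = [0.6, 0.4]                                                  # hypothetical P(y_i)

def likelihood_ratio_decide(x):
    # Decide y_1 iff p(x|y_1)/p(x|y_2) > P(y_2)/P(y_1) = theta_a
    theta_a = priors[1] / priors[0]
    return 1 if likelihoods[0].pdf(x) / likelihoods[1].pdf(x) > theta_a else 2

print(likelihood_ratio_decide(0.2))  # 1 or 2, standing for y_1 / y_2
```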


Representing classifiers

Discriminant functions

A classifier can be represented as a set of discriminant functions g_i(x), i = 1, . . . , c, giving:

y = argmax_{i=1,...,c} g_i(x)

A discriminant function is not unique ⇒ the most convenient one for computational or explanatory reasons can be used:

g_i(x) = P(y_i|x) = p(x|y_i) P(y_i) / p(x)

g_i(x) = p(x|y_i) P(y_i)

g_i(x) = ln p(x|y_i) + ln P(y_i)
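All three forms give the same argmax, since they differ only by a monotone transformation and by a factor that does not depend on i; a sketch with the same invented Gaussian setup as above:

```python
import numpy as np
from scipy.stats import norm

likelihoods = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.0)]  # hypothetical
priors = np.array([0.6, 0.4])

def g_joint(x):
    return np.array([l.pdf(x) for l in likelihoods]) * priors

def g_posterior(x):
    return g_joint(x) / g_joint(x).sum()   # divide by the evidence p(x)

def g_log(x):
    # The log form avoids numerical underflow far from the means
    return np.array([l.logpdf(x) for l in likelihoods]) + np.log(priors)

x = 0.7
assert np.argmax(g_joint(x)) == np.argmax(g_posterior(x)) == np.argmax(g_log(x))
```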


Representing classifiers

[Figure: joint densities p(x|ω1)P(ω1) and p(x|ω2)P(ω2) over a two-dimensional feature space, with the decision boundary separating decision regions R1 and R2]

Decision regions

The feature space is divided into decision regions R1, . . . , Rc such that:

x ∈ R_i if g_i(x) > g_j(x) ∀ j ≠ i

Decision regions are separated by decision boundaries, regions in which ties occur among the largest discriminant functions.


Normal density

Multivariate normal density

p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − µ)^t Σ⁻¹ (x − µ))

The covariance matrix Σ is always symmetric and positive semi-definite. It is strictly positive definite provided the distribution is not degenerate, i.e. not concentrated on a subspace of dimension lower than d (otherwise |Σ| = 0).
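A sketch evaluating this density directly from the formula (µ and Σ are invented; scipy.stats.multivariate_normal.pdf gives the same value):

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    # Multivariate normal density:
    # (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-1/2 (x-mu)^t Sigma^-1 (x-mu))
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(sigma, diff)   # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha2) / norm_const

mu = np.array([0.0, 1.0])                     # invented parameters
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, sigma))
```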


Normal density

[Figure: elliptical contours of constant density of a two-dimensional normal distribution centred at µ, with axes x1 and x2]

Hyperellipsoids

The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance from x to µ. The principal axes of these hyperellipsoids are the eigenvectors of Σ, and their lengths are determined by the corresponding eigenvalues (the axis half-lengths are proportional to the square roots of the eigenvalues).


Discriminant functions for normal density

Discriminant functions

g_i(x) = ln p(x|y_i) + ln P(y_i)
       = −½ (x − µ_i)^t Σ_i⁻¹ (x − µ_i) − (d/2) ln 2π − ½ ln |Σ_i| + ln P(y_i)

Discarding terms which are independent of i we obtain:

g_i(x) = −½ (x − µ_i)^t Σ_i⁻¹ (x − µ_i) − ½ ln |Σ_i| + ln P(y_i)


Discriminant functions for normal density

case Σ_i = σ²I

Features are statistically independent and all features have the same variance σ².
The covariance determinant |Σ_i| = σ^{2d} can be ignored, being independent of i.
The covariance inverse is given by Σ_i⁻¹ = (1/σ²) I.
The discriminant functions become:

g_i(x) = −||x − µ_i||² / (2σ²) + ln P(y_i)


Discriminant functions for normal density

[Figure: two Gaussian classes ω1, ω2 with equal priors P(ω1) = P(ω2) = .5 and Σ_i = σ²I, shown in one, two, and three dimensions; the decision boundary between R1 and R2 lies halfway between the means]

case Σ_i = σ²I

Expansion of the quadratic form leads to:

g_i(x) = −(1/(2σ²)) [x^t x − 2 µ_i^t x + µ_i^t µ_i] + ln P(y_i)

Discarding terms which are independent of i we obtain linear discriminant functions:

g_i(x) = w_i^t x + w_{i0},  with w_i = (1/σ²) µ_i and w_{i0} = −(1/(2σ²)) µ_i^t µ_i + ln P(y_i)
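A sketch of the resulting linear machine (means, variance, and priors invented for illustration):

```python
import numpy as np

mus = np.array([[0.0, 0.0], [3.0, 1.0]])   # hypothetical class means mu_i
sigma2 = 1.5                               # hypothetical shared variance sigma^2
priors = np.array([0.5, 0.5])              # hypothetical P(y_i)

W = mus / sigma2                                                # rows are w_i
w0 = -np.sum(mus ** 2, axis=1) / (2 * sigma2) + np.log(priors)  # w_i0

def decide(x):
    # Linear discriminants g_i(x) = w_i^t x + w_i0; predict the argmax
    return int(np.argmax(W @ x + w0))

print(decide(np.array([1.0, 0.2])))
```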


case Σ_i = σ²I

Separating hyperplane

Setting g_i(x) = g_j(x), we see that the decision boundaries are pieces of hyperplanes:

w^t (x − x_0) = 0,  with w = µ_i − µ_j and x_0 = ½ (µ_i + µ_j) − (σ² / ||µ_i − µ_j||²) ln(P(y_i)/P(y_j)) (µ_i − µ_j)

The hyperplane is orthogonal to the vector w ⇒ orthogonal to the line linking the means.
The hyperplane passes through x_0:

if the prior probabilities of the classes are equal, x_0 is halfway between the means
otherwise, x_0 shifts away from the more likely mean, enlarging the decision region of the more likely class


case Σ_i = σ²I

Separating hyperplane: derivation (1)

g_i(x) − g_j(x) = 0

(1/σ²) µ_i^t x − (1/(2σ²)) µ_i^t µ_i + ln P(y_i) − (1/σ²) µ_j^t x + (1/(2σ²)) µ_j^t µ_j − ln P(y_j) = 0

(µ_i − µ_j)^t x − ½ (µ_i^t µ_i − µ_j^t µ_j) + σ² ln(P(y_i)/P(y_j)) = 0

This has the form w^t (x − x_0) = 0, with

w = µ_i − µ_j

(µ_i − µ_j)^t x_0 = ½ (µ_i^t µ_i − µ_j^t µ_j) − σ² ln(P(y_i)/P(y_j))


case Σ_i = σ²I

Separating hyperplane: derivation (2)

(µ_i − µ_j)^t x_0 = ½ (µ_i^t µ_i − µ_j^t µ_j) − σ² ln(P(y_i)/P(y_j))

Using (µ_i^t µ_i − µ_j^t µ_j) = (µ_i − µ_j)^t (µ_i + µ_j) and

ln(P(y_i)/P(y_j)) = ((µ_i − µ_j)^t (µ_i − µ_j) / ||µ_i − µ_j||²) ln(P(y_i)/P(y_j))

we can solve for x_0:

x_0 = ½ (µ_i + µ_j) − (σ² / ||µ_i − µ_j||²) ln(P(y_i)/P(y_j)) (µ_i − µ_j)
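A numerical sketch of w and x_0 (invented means, variance, and priors), verifying that x_0 indeed lies on the decision boundary:

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])  # hypothetical means
sigma2, p_i, p_j = 1.5, 0.7, 0.3                          # hypothetical values

w = mu_i - mu_j
diff2 = np.dot(w, w)  # ||mu_i - mu_j||^2
x0 = 0.5 * (mu_i + mu_j) - sigma2 / diff2 * np.log(p_i / p_j) * w

def g(x, mu, p):
    # Linear discriminant for the sigma^2 I case
    return mu @ x / sigma2 - mu @ mu / (2 * sigma2) + np.log(p)

# x0 is on the boundary: both discriminants agree there
assert np.isclose(g(x0, mu_i, p_i), g(x0, mu_j, p_j))
```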


case Σ_i = σ²I

[Figure: effect of unequal priors on the decision boundary for two Gaussian classes ω1, ω2 in one, two, and three dimensions, for priors P(ω1)/P(ω2) of .7/.3, .9/.1, .8/.2, and .99/.01; as the prior of one class grows, the boundary between R1 and R2 moves away from its mean]


Discriminant functions for normal density

case Σ_i = Σ

All classes share the same covariance matrix. The discriminant functions become:

g_i(x) = −½ (x − µ_i)^t Σ⁻¹ (x − µ_i) + ln P(y_i)

Expanding the quadratic form and discarding terms independent of i we again obtain linear discriminant functions:

g_i(x) = w_i^t x + w_{i0},  with w_i^t = µ_i^t Σ⁻¹ and w_{i0} = −½ µ_i^t Σ⁻¹ µ_i + ln P(y_i)

The separating hyperplanes are not necessarily orthogonal to the line linking the means:

w^t (x − x_0) = 0,  with w = Σ⁻¹ (µ_i − µ_j) and

x_0 = ½ (µ_i + µ_j) − (ln(P(y_i)/P(y_j)) / ((µ_i − µ_j)^t Σ⁻¹ (µ_i − µ_j))) (µ_i − µ_j)
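A sketch of the shared-covariance case (all parameters invented); this is essentially linear discriminant analysis:

```python
import numpy as np

mus = np.array([[0.0, 0.0], [3.0, 1.0]])    # hypothetical class means
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # hypothetical shared covariance
priors = np.array([0.6, 0.4])               # hypothetical priors

Sigma_inv = np.linalg.inv(Sigma)
W = mus @ Sigma_inv                          # rows are w_i^t = mu_i^t Sigma^-1
w0 = np.array([-0.5 * mu @ Sigma_inv @ mu + np.log(p)
               for mu, p in zip(mus, priors)])  # w_i0

def decide(x):
    # g_i(x) = w_i^t x + w_i0 stays linear even though Sigma is full
    return int(np.argmax(W @ x + w0))

print(decide(np.array([1.5, 0.5])))
```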


case Σ_i = Σ

[Figure: two Gaussian classes ω1, ω2 sharing the same covariance matrix, with priors P(ω1)/P(ω2) of .5/.5 and .1/.9; the decision boundary between R1 and R2 is linear but not necessarily orthogonal to the line linking the means]


Discriminant functions for normal density

case Σ_i = arbitrary

The discriminant functions are inherently quadratic:

g_i(x) = x^t W_i x + w_i^t x + w_{i0},  with

W_i = −½ Σ_i⁻¹
w_i = Σ_i⁻¹ µ_i
w_{i0} = −½ µ_i^t Σ_i⁻¹ µ_i − ½ ln |Σ_i| + ln P(y_i)

In the two-category case, the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, etc.
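A sketch of the quadratic discriminants (parameters invented); this is essentially quadratic discriminant analysis:

```python
import numpy as np

# Hypothetical two-class problem with different covariance matrices
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[2.0, 0.8], [0.8, 1.5]])]
priors = [0.5, 0.5]

def g(x, mu, Sigma, p):
    # Quadratic discriminant: x^t W_i x + w_i^t x + w_i0
    Si = np.linalg.inv(Sigma)
    Wi = -0.5 * Si
    wi = Si @ mu
    wi0 = -0.5 * mu @ Si @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(p)
    return x @ Wi @ x + wi @ x + wi0

def decide(x):
    return int(np.argmax([g(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]))

print(decide(np.array([1.0, 1.0])))
```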


case Σ_i = arbitrary

[Figure: examples of the resulting hyperquadric decision boundaries between pairs of Gaussian classes with arbitrary covariance matrices]

Reducible error

[Figure: joint densities p(x|ω_i)P(ω_i) for two classes over x; choosing a suboptimal boundary x* instead of the Bayes boundary x_B adds a reducible error on top of ∫_{R1} p(x|ω2)P(ω2) dx and ∫_{R2} p(x|ω1)P(ω1) dx]

Probability of error in binary classification

P(error) = P(x ∈ R1, y_2) + P(x ∈ R2, y_1)
         = ∫_{R1} p(x|y_2) P(y_2) dx + ∫_{R2} p(x|y_1) P(y_1) dx

The reducible error is the error resulting from a suboptimalchoice of the decision boundary
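A numerical sketch for two hypothetical 1-D Gaussian classes, showing that moving the threshold away from the Bayes boundary only adds reducible error:

```python
from scipy.stats import norm

l1, l2 = norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.0)  # hypothetical p(x|y_i)
p1, p2 = 0.5, 0.5                                             # hypothetical priors

def p_error(threshold):
    # R1 = (-inf, threshold], R2 = (threshold, inf):
    # P(error) = int_{R1} p(x|y_2)P(y_2) dx + int_{R2} p(x|y_1)P(y_1) dx
    return l2.cdf(threshold) * p2 + l1.sf(threshold) * p1

print(p_error(0.5))  # Bayes boundary here (equal priors and variances): ~0.067
print(p_error(1.5))  # suboptimal boundary: larger error; the gap is reducible
```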


Bayesian decision theory: arbitrary inputs and outputs

Setting

Examples are input-output pairs (x, y) ∈ X × Y generated with probability p(x, y).
The conditional risk of predicting y* given x is:

R(y*|x) = ∫_Y ℓ(y*, y) p(y|x) dy

The overall risk of a decision rule f is given by

R[f] = ∫ R(f(x)|x) p(x) dx = ∫_X ∫_Y ℓ(f(x), y) p(x, y) dy dx

Bayes decision rule

y_B = argmin_{y ∈ Y} R(y|x)
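A sketch estimating the conditional risk over a continuous output space by a Riemann sum; the squared loss and the Gaussian posterior p(y|x) are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

p_y_given_x = norm(loc=1.0, scale=0.5)   # hypothetical posterior p(y|x)
ys = np.linspace(-3.0, 5.0, 2001)        # grid over the output space Y
dy = ys[1] - ys[0]

def risk(y_star):
    # R(y*|x) = int_Y loss(y*, y) p(y|x) dy with loss(y*, y) = (y* - y)^2
    return np.sum((y_star - ys) ** 2 * p_y_given_x.pdf(ys)) * dy

# The Bayes decision argmin_y R(y|x) lands at the posterior mean (1.0 here)
candidates = np.linspace(-1.0, 3.0, 401)
print(candidates[np.argmin([risk(c) for c in candidates])])
```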


Handling missing features

Marginalize over missing variables

Assume the input x consists of an observed part x_o and a missing part x_m.
The posterior probability of y_i given the observation can be obtained from probabilities over entire inputs by marginalizing over the missing part:

P(y_i|x_o) = p(y_i, x_o) / p(x_o)
           = ∫ p(y_i, x_o, x_m) dx_m / p(x_o)
           = ∫ P(y_i|x_o, x_m) p(x_o, x_m) dx_m / ∫ p(x_o, x_m) dx_m
           = ∫ P(y_i|x) p(x) dx_m / ∫ p(x) dx_m
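A grid-based sketch for a two-feature input where the second feature is missing (the posterior and density models are invented for illustration):

```python
import numpy as np

xm_grid = np.linspace(-5.0, 5.0, 1001)   # grid over the missing feature x_m
dxm = xm_grid[1] - xm_grid[0]

def posterior_yi(xo, xm):   # P(y_i|x) over the full input, invented
    return 1.0 / (1.0 + np.exp(-(xo + 0.5 * xm)))

def density(xo, xm):        # p(x), invented standard bivariate Gaussian
    return np.exp(-0.5 * (xo ** 2 + xm ** 2)) / (2 * np.pi)

def posterior_given_observed(xo):
    # P(y_i|x_o) = int P(y_i|x) p(x) dx_m / int p(x) dx_m
    num = np.sum(posterior_yi(xo, xm_grid) * density(xo, xm_grid)) * dxm
    den = np.sum(density(xo, xm_grid)) * dxm
    return num / den

print(posterior_given_observed(1.0))
```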


Handling noisy features

Marginalize over true variables

Assume x consists of a clean part x_c and a noisy part x_n.
Assume we have a noise model p(x_n|x_t) for the probability of the noisy feature given its true version x_t.
The posterior probability of y_i given the observation can be obtained from probabilities over clean inputs by marginalizing over the true variables via the noise model:

P(y_i|x_c, x_n) = p(y_i, x_c, x_n) / p(x_c, x_n)
  = ∫ p(y_i, x_c, x_n, x_t) dx_t / ∫ p(x_c, x_n, x_t) dx_t
  = ∫ P(y_i|x_c, x_n, x_t) p(x_c, x_n, x_t) dx_t / ∫ p(x_c, x_n, x_t) dx_t
  = ∫ P(y_i|x_c, x_t) p(x_n|x_c, x_t) p(x_c, x_t) dx_t / ∫ p(x_n|x_c, x_t) p(x_c, x_t) dx_t
  = ∫ P(y_i|x) p(x_n|x_t) p(x) dx_t / ∫ p(x_n|x_t) p(x) dx_t

where the last step uses the fact that, given the clean input x = (x_c, x_t), the class does not depend on the noisy measurement and the noise depends only on x_t.
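A grid-based sketch mirroring the missing-feature case, now weighting each candidate true value x_t by the noise model p(x_n|x_t) (all component models are invented for illustration):

```python
import numpy as np

xt_grid = np.linspace(-5.0, 5.0, 1001)   # grid over the true feature x_t
dxt = xt_grid[1] - xt_grid[0]

def posterior_yi(xc, xt):   # P(y_i|x_c, x_t), invented
    return 1.0 / (1.0 + np.exp(-(xc + 0.5 * xt)))

def density(xc, xt):        # p(x_c, x_t), invented standard bivariate Gaussian
    return np.exp(-0.5 * (xc ** 2 + xt ** 2)) / (2 * np.pi)

def noise_model(xn, xt):    # p(x_n|x_t): unit-variance Gaussian noise, invented
    return np.exp(-0.5 * (xn - xt) ** 2) / np.sqrt(2 * np.pi)

def posterior_given_noisy(xc, xn):
    # P(y_i|x_c,x_n) = int P(y_i|x) p(x_n|x_t) p(x) dx_t / int p(x_n|x_t) p(x) dx_t
    w = noise_model(xn, xt_grid) * density(xc, xt_grid)
    return np.sum(posterior_yi(xc, xt_grid) * w) * dxt / (np.sum(w) * dxt)

print(posterior_given_noisy(1.0, 0.3))
```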
