
Bayesian Decision Theory

Hong Chang

Institute of Computing Technology, Chinese Academy of Sciences

Machine Learning Methods (Fall 2012)


Outline

1 Introduction

2 Losses and Risks

3 Discriminant Functions

4 Bayesian Networks


Coin Tossing Example

Outcome of tossing a coin ∈ {head, tail}.
Random variable X:

$$X = \begin{cases} 1 & \text{if outcome is head} \\ 0 & \text{if outcome is tail} \end{cases}$$

X is Bernoulli-distributed:

$$P(X) = p_0^X (1 - p_0)^{1-X}$$

where the parameter $p_0$ is the probability that the outcome is head, i.e., $p_0 = P(X = 1)$.


Estimation and Prediction

Estimation of the parameter $p_0$ from a sample $\mathcal{X} = \{x^{(i)}\}_{i=1}^{N}$:

$$\hat{p}_0 = \frac{\#\text{heads}}{\#\text{tosses}} = \frac{\sum_{i=1}^{N} x^{(i)}}{N}$$

Prediction of the outcome of the next toss:

$$\text{Predicted outcome} = \begin{cases} \text{head} & \text{if } \hat{p}_0 > 1/2 \\ \text{tail} & \text{otherwise} \end{cases}$$

We choose the more probable outcome, which minimizes the probability of error: 1 minus the probability of the predicted outcome.
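A minimal sketch of this estimate-then-predict loop; the sample below is hypothetical:

```python
# Maximum likelihood estimate of p0 from a (hypothetical) sample of
# coin tosses, followed by prediction of the next outcome.
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # 1 = head, 0 = tail

p0_hat = sum(sample) / len(sample)        # #heads / #tosses = 0.7
prediction = "head" if p0_hat > 0.5 else "tail"

# Probability of error = 1 - probability of the predicted outcome.
p_error = 1 - max(p0_hat, 1 - p0_hat)     # 0.3

print(p0_hat, prediction, p_error)        # 0.7 head 0.3
```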


Classification as Bayesian Decision

Credit scoring example:
Inputs: income and savings, or $\mathbf{x} = (x_1, x_2)^T$
Output: risk ∈ {low, high}, or $C \in \{0, 1\}$
Prediction:

$$\text{Choose} \begin{cases} C = 1 & \text{if } P(C = 1|\mathbf{x}) > 0.5 \\ C = 0 & \text{otherwise} \end{cases}$$

or equivalently

$$\text{Choose} \begin{cases} C = 1 & \text{if } P(C = 1|\mathbf{x}) > P(C = 0|\mathbf{x}) \\ C = 0 & \text{otherwise} \end{cases}$$

Probability of error:

$$1 - \max(P(C = 1|\mathbf{x}), P(C = 0|\mathbf{x}))$$


Bayes’ Rule

Bayes’ rule:

$$P(C|\mathbf{x}) = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} = \frac{p(\mathbf{x}|C)\, P(C)}{p(\mathbf{x})}$$

Some useful properties:
$P(C = 1) + P(C = 0) = 1$
$p(\mathbf{x}) = p(\mathbf{x}|C = 1)P(C = 1) + p(\mathbf{x}|C = 0)P(C = 0)$
$P(C = 0|\mathbf{x}) + P(C = 1|\mathbf{x}) = 1$
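Putting the rule and the credit-scoring decision together in a short sketch; the priors and likelihood values are hypothetical:

```python
# Posterior via Bayes' rule for the two-class credit-scoring example.
# Priors and class-conditional likelihoods at a fixed x are made up.
prior = {0: 0.6, 1: 0.4}           # P(C = 0), P(C = 1)
likelihood = {0: 0.05, 1: 0.20}    # p(x | C = 0), p(x | C = 1)

evidence = sum(likelihood[c] * prior[c] for c in (0, 1))             # p(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in (0, 1)}

choice = 1 if posterior[1] > 0.5 else 0   # choose C = 1 iff P(C=1|x) > 0.5
p_error = 1 - max(posterior.values())

print(posterior, choice, p_error)  # posteriors sum to 1; here choice = 1
```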


Bayes’ Rule for K > 2 Classes

Bayes’ rule for the general case ($K$ mutually exclusive and exhaustive classes):

$$P(C_i|\mathbf{x}) = \frac{p(\mathbf{x}|C_i)\, P(C_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_i)\, P(C_i)}{\sum_{k=1}^{K} p(\mathbf{x}|C_k)\, P(C_k)}$$

Optimal decision rule for the Bayes’ classifier:

$$\text{Choose } C_i \text{ if } P(C_i|\mathbf{x}) = \max_k P(C_k|\mathbf{x})$$
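The same computation generalizes directly to $K$ classes; the numbers below are again hypothetical:

```python
# K-class Bayes' classifier: normalize likelihood x prior over all
# classes, then choose the class with the highest posterior.
priors = [0.5, 0.3, 0.2]        # P(C_k), mutually exclusive and exhaustive
likelihoods = [0.1, 0.4, 0.3]   # p(x | C_k) at a fixed x

joint = [l * p for l, p in zip(likelihoods, priors)]
evidence = sum(joint)           # p(x) = sum_k p(x | C_k) P(C_k)
posteriors = [j / evidence for j in joint]

i_star = max(range(len(posteriors)), key=posteriors.__getitem__)
print(posteriors, "choose C%d" % (i_star + 1))   # choose C2
```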


Losses and Risks

Different decisions or actions may not be equally good or costly.
Action $\alpha_i$: decision to assign the input $\mathbf{x}$ to class $C_i$
Loss $\lambda_{ik}$: loss incurred for taking action $\alpha_i$ when the actual state is $C_k$
Expected risk for taking action $\alpha_i$:

$$R(\alpha_i|\mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k|\mathbf{x})$$

Optimal decision rule with minimum expected risk:

$$\text{Choose } \alpha_i \text{ if } R(\alpha_i|\mathbf{x}) = \min_k R(\alpha_k|\mathbf{x})$$
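A sketch of the minimum-risk rule with a hypothetical loss matrix; note that once losses are unequal, the chosen action need not be the maximum-posterior class:

```python
# Minimum-expected-risk decision: R(a_i | x) = sum_k lambda_ik P(C_k | x).
loss = [[0.0, 1.0, 5.0],       # lambda_ik: rows = actions, cols = true class
        [2.0, 0.0, 1.0],
        [4.0, 1.0, 0.0]]
posterior = [0.5, 0.3, 0.2]    # hypothetical P(C_k | x)

risks = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
i_star = min(range(len(risks)), key=risks.__getitem__)

print(risks, "take action a%d" % (i_star + 1))
# risks = [1.3, 1.2, 2.3]: action a2 wins even though C1 has the highest
# posterior, because mistaking C3 for C1 is very costly here.
```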


0-1 Loss

All correct decisions have no loss and all errors have unit cost:

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \neq k \end{cases}$$

Expected risk:

$$R(\alpha_i|\mathbf{x}) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k|\mathbf{x}) = \sum_{k \neq i} P(C_k|\mathbf{x}) = 1 - P(C_i|\mathbf{x})$$

Optimal decision rule with minimum expected risk (or, equivalently, highest posterior probability):

$$\text{Choose } \alpha_i \text{ if } P(C_i|\mathbf{x}) = \max_k P(C_k|\mathbf{x})$$
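A quick numeric check, with hypothetical posteriors, that the 0-1 risk reduces to one minus the posterior:

```python
# Under 0-1 loss, R(a_i | x) = 1 - P(C_i | x), so minimizing risk is
# the same as maximizing the posterior.
K = 3
posterior = [0.5, 0.3, 0.2]      # hypothetical P(C_k | x)
loss01 = [[0.0 if i == k else 1.0 for k in range(K)] for i in range(K)]

risks = [sum(loss01[i][k] * posterior[k] for k in range(K)) for i in range(K)]
assert all(abs(risks[i] - (1 - posterior[i])) < 1e-12 for i in range(K))
print(risks)                     # [0.5, 0.7, 0.8]
```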


Reject Option

If the certainty of a decision is low but misclassification has a very high cost, the reject action ($\alpha_{K+1}$) is preferred.
Loss function:

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is the loss incurred for choosing the reject action.
Expected risk:

$$R(\alpha_i|\mathbf{x}) = \begin{cases} \sum_{k=1}^{K} \lambda\, P(C_k|\mathbf{x}) = \lambda & \text{if } i = K + 1 \\ \sum_{k \neq i} P(C_k|\mathbf{x}) = 1 - P(C_i|\mathbf{x}) & \text{if } i \in \{1, \ldots, K\} \end{cases}$$


Reject Option (2)

Optimal decision rule:

$$\begin{cases} \text{Choose } C_i & \text{if } R(\alpha_i|\mathbf{x}) = \min_{1 \le k \le K} R(\alpha_k|\mathbf{x}) < R(\alpha_{K+1}|\mathbf{x}) \\ \text{Reject} & \text{otherwise} \end{cases}$$

Equivalent form of the optimal decision rule:

$$\begin{cases} \text{Choose } C_i & \text{if } P(C_i|\mathbf{x}) = \max_{1 \le k \le K} P(C_k|\mathbf{x}) > 1 - \lambda \\ \text{Reject} & \text{otherwise} \end{cases}$$

This approach is meaningful only if $0 < \lambda < 1$:
If $\lambda = 0$, we always reject (a reject is as good as a correct classification).
If $\lambda \ge 1$, we never reject (a reject is as costly as or more costly than a misclassification).
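A sketch of the reject rule with hypothetical posteriors and reject loss:

```python
# Reject option: accept the maximum-posterior class only if its
# posterior exceeds 1 - lambda; otherwise reject.
lam = 0.3                       # reject loss, must satisfy 0 < lambda < 1
posterior = [0.5, 0.3, 0.2]     # hypothetical P(C_k | x)

best = max(range(len(posterior)), key=posterior.__getitem__)
if posterior[best] > 1 - lam:
    decision = "choose C%d" % (best + 1)
else:
    decision = "reject"         # here 0.5 <= 0.7, so we reject

print(decision)
```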


Discriminant Functions

One way of performing classification is through a set of discriminant functions.
Classification rule:

$$\text{Choose } C_i \text{ if } g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})$$

Different ways of defining the discriminant functions:
$g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x})$
$g_i(\mathbf{x}) = P(C_i|\mathbf{x})$
$g_i(\mathbf{x}) = p(\mathbf{x}|C_i)\, P(C_i)$

For the two-class case, we may define a single discriminant function:

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

with the following classification rule:

$$\text{Choose} \begin{cases} C_1 & \text{if } g(\mathbf{x}) > 0 \\ C_2 & \text{otherwise} \end{cases}$$
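A tiny sketch of the two-class rule, taking $g_i(\mathbf{x}) = P(C_i|\mathbf{x})$ with a made-up posterior model:

```python
# Two-class classification via a single discriminant g(x) = g1(x) - g2(x),
# here with g_i(x) = P(C_i | x) supplied by a stand-in posterior model.
def posterior_c1(x):
    return 0.8 if x > 0 else 0.3   # hypothetical P(C1 | x)

def classify(x):
    g = posterior_c1(x) - (1 - posterior_c1(x))   # g1(x) - g2(x)
    return "C1" if g > 0 else "C2"

print(classify(1.0), classify(-1.0))   # C1 C2
```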


Decision Regions

The feature space is divided into $K$ decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$, where

$$\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})\}$$

The decision regions are separated by decision boundaries, where ties occur among the largest discriminant functions.


Bayesian Networks

A.k.a. belief networks, probabilistic networks, or, more generally, graphical models.
Node (or vertex): random variable
Edge (or arc, or link): direct influence between variables
Structure: directed acyclic graph (DAG) formed by the nodes and edges
Parameters: probabilities and conditional probabilities


Causal Graph and Diagnostic Inference

Causal graph: rain ($R$) is the cause of wet grass ($W$).
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
Bayes’ rule (with $P(R) = 0.4$, $P(W|R) = 0.9$, and $P(W|\sim R) = 0.2$):

$$P(R|W) = \frac{P(W|R)\, P(R)}{P(W)} = \frac{0.9 \times 0.4}{0.9 \times 0.4 + 0.2 \times 0.6} = 0.75 > P(R) = 0.4$$
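The same computation as a few lines of code, using the probabilities above:

```python
# Diagnostic inference P(R | W) for the single-cause graph R -> W.
p_r = 0.4                  # P(R)
p_w_r, p_w_nr = 0.9, 0.2   # P(W | R), P(W | ~R)

p_w = p_w_r * p_r + p_w_nr * (1 - p_r)   # P(W) = 0.48
p_r_w = p_w_r * p_r / p_w                # P(R | W) = 0.36 / 0.48

print(p_r_w)               # 0.75 > P(R) = 0.4
```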


Two Causes: Causal Inference

Causal or predictive inference: if the sprinkler ($S$) is on, what is the probability that the grass is wet? Since $R$ and $S$ are independent a priori, $P(R|S) = P(R)$:

$$\begin{aligned} P(W|S) &= P(W|R,S)\, P(R|S) + P(W|\sim R,S)\, P(\sim R|S) \\ &= P(W|R,S)\, P(R) + P(W|\sim R,S)\, P(\sim R) \\ &= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92 \end{aligned}$$


Two Causes: Diagnostic Inference

Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?

$$P(S|W) = \frac{P(W|S)\, P(S)}{P(W)}$$

where

$$\begin{aligned} P(W) &= P(W|R,S)P(R,S) + P(W|\sim R,S)P(\sim R,S) \\ &\quad + P(W|R,\sim S)P(R,\sim S) + P(W|\sim R,\sim S)P(\sim R,\sim S) \\ &= P(W|R,S)P(R)P(S) + P(W|\sim R,S)P(\sim R)P(S) \\ &\quad + P(W|R,\sim S)P(R)P(\sim S) + P(W|\sim R,\sim S)P(\sim R)P(\sim S) \\ &= 0.52 \end{aligned}$$

so, with $P(S) = 0.2$,

$$P(S|W) = \frac{0.92 \times 0.2}{0.52} = 0.35 > P(S) = 0.2$$


Two Causes: Explaining Away

Bayes’ rule:

$$P(S|R,W) = \frac{P(W|R,S)\, P(S|R)}{P(W|R)} = \frac{P(W|R,S)\, P(S)}{P(W|R)} = 0.21$$

Explaining away:

$$0.21 = P(S|R,W) < P(S|W) = 0.35$$

Knowing that it has rained decreases the probability that the sprinkler is on.
Knowing that the grass is wet, rain and sprinkler become dependent:

$$P(S|R,W) \neq P(S|W)$$
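A sketch reproducing the three two-cause computations. $P(R)$, $P(S)$, $P(W|R,S)$, and $P(W|\sim R,S)$ are taken from the slides; $P(W|R,\sim S) = 0.9$ and $P(W|\sim R,\sim S) = 0.1$ are assumed values, chosen so that $P(W)$ matches the stated 0.52:

```python
# Two-cause wet-grass network: R and S are independent parents of W.
p_r, p_s = 0.4, 0.2
# P(W | R, S); the (1, 0) and (0, 0) entries are assumed, not from the slides.
p_w = {(1, 1): 0.95, (0, 1): 0.90, (1, 0): 0.90, (0, 0): 0.10}

# Causal inference: P(W | S) = sum_r P(W | r, S) P(r).
p_w_s = p_w[(1, 1)] * p_r + p_w[(0, 1)] * (1 - p_r)               # 0.92

# Evidence: P(W), summing over the four (R, S) configurations.
p_grass = sum(p_w[(r, s)] * (p_r if r else 1 - p_r)
              * (p_s if s else 1 - p_s)
              for r in (0, 1) for s in (0, 1))                    # 0.52

# Diagnostic inference: P(S | W) = P(W | S) P(S) / P(W).
p_s_w = p_w_s * p_s / p_grass                                     # 0.35

# Explaining away: P(S | R, W) = P(W | R, S) P(S) / P(W | R).
p_w_r = p_w[(1, 1)] * p_s + p_w[(1, 0)] * (1 - p_s)               # 0.91
p_s_rw = p_w[(1, 1)] * p_s / p_w_r                                # 0.21

print(round(p_w_s, 2), round(p_grass, 2), round(p_s_w, 2), round(p_s_rw, 2))
```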


Dependent Causes

Causal inference, where $C$ is a common cause of both rain and sprinkler:

$$\begin{aligned} P(W|C) &= P(W|R,S,C)P(R,S|C) + P(W|\sim R,S,C)P(\sim R,S|C) \\ &\quad + P(W|R,\sim S,C)P(R,\sim S|C) + P(W|\sim R,\sim S,C)P(\sim R,\sim S|C) \\ &= P(W|R,S)P(R|C)P(S|C) + P(W|\sim R,S)P(\sim R|C)P(S|C) \\ &\quad + P(W|R,\sim S)P(R|C)P(\sim S|C) + P(W|\sim R,\sim S)P(\sim R|C)P(\sim S|C) \end{aligned}$$

Independence: $W$ and $C$ are independent given $R$ and $S$; $R$ and $S$ are independent given $C$.


Local Structures

The network represents conditional independence statements.
The joint distribution can be broken down into local structures:

$$P(C, S, R, W, F) = P(C)\, P(S|C)\, P(R|C)\, P(W|S,R)\, P(F|R)$$

In general,

$$P(X_1, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \text{parents}(X_i))$$

where each $X_i$ is either continuous or discrete with two or more possible values.
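A sketch of this factorization for binary variables; all CPT values below are hypothetical illustration numbers:

```python
# Joint probability from local CPTs:
# P(C, S, R, W, F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R).
p_c = 0.5
p_s_c = {1: 0.1, 0: 0.5}                 # P(S = 1 | C)
p_r_c = {1: 0.8, 0: 0.1}                 # P(R = 1 | C)
p_w_sr = {(1, 1): 0.95, (1, 0): 0.90,    # P(W = 1 | S, R)
          (0, 1): 0.90, (0, 0): 0.10}
p_f_r = {1: 0.7, 0: 0.1}                 # P(F = 1 | R)

def bern(p, v):                          # P(X = v) for a binary variable
    return p if v else 1 - p

def joint(c, s, r, w, f):
    return (bern(p_c, c) * bern(p_s_c[c], s) * bern(p_r_c[c], r)
            * bern(p_w_sr[(s, r)], w) * bern(p_f_r[r], f))

# Sanity check: the joint sums to 1 over all 2^5 configurations.
total = sum(joint(c, s, r, w, f)
            for c in (0, 1) for s in (0, 1) for r in (0, 1)
            for w in (0, 1) for f in (0, 1))
print(total)                             # 1.0
```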


Bayesian Network for Classification

Bayes’ rule inverts the edge:

$$P(C|\mathbf{x}) = \frac{P(\mathbf{x}|C)\, P(C)}{P(\mathbf{x})}$$

Classification as diagnostic inference.


Naive Bayes’ Classifier

Given $C$, the input variables $x_j$ are independent:

$$p(\mathbf{x}|C) = \prod_{j=1}^{d} p(x_j|C)$$

The naive Bayes’ classifier ignores possible dependencies among the input variables and reduces a multivariate problem to a group of univariate problems.
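A minimal sketch with binary features; the per-feature probabilities and priors are hypothetical:

```python
# Naive Bayes with Bernoulli features: p(x | C) = prod_j p(x_j | C),
# classified by maximizing p(x | C) P(C).
priors = {0: 0.6, 1: 0.4}
theta = {0: [0.2, 0.7, 0.1],   # theta[c][j] = P(x_j = 1 | C = c)
         1: [0.8, 0.3, 0.5]}

def likelihood(x, c):
    out = 1.0
    for t, xj in zip(theta[c], x):   # one univariate factor per feature
        out *= t if xj else 1 - t
    return out

def classify(x):
    scores = {c: likelihood(x, c) * priors[c] for c in priors}
    return max(scores, key=scores.get)

print(classify([1, 0, 1]))   # 1
```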
