Bayesian decision theory: introduction
Overview

- Bayesian decision theory makes it possible to take optimal decisions in a fully probabilistic setting
- It assumes all relevant probabilities are known
- It provides upper bounds on achievable errors, against which classifiers can be evaluated
- Bayesian reasoning can be generalized to cases where the probabilistic structure is not entirely known
Binary classification
Assume examples $(x, y) \in X \times \{-1, 1\}$ are drawn from a known distribution $p(x, y)$. The task is predicting the class $y$ of examples given the input $x$. Bayes rule allows us to write this in probabilistic terms as:

$$P(y|x) = \frac{p(x|y)P(y)}{p(x)}$$
Bayes rule

Bayes rule computes the posterior probability from the likelihood, the prior and the evidence:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

- posterior: $P(y|x)$ is the probability that the class is $y$ given that $x$ was observed
- likelihood: $p(x|y)$ is the probability of observing $x$ given that its class is $y$
- prior: $P(y)$ is the prior probability of the class, before seeing any evidence
- evidence: $p(x)$ is the probability of the observation, which by the law of total probability can be computed as:

$$p(x) = \sum_{i=1}^{2} p(x|y_i)P(y_i)$$
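To make the mechanics concrete, here is a minimal Python sketch (the Gaussian class-conditional densities and all numbers are invented for illustration) that computes the posterior via Bayes rule, with the evidence obtained from the law of total probability:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical binary problem: 1-D Gaussian class-conditional densities
# p(x|y) and priors P(y); all parameter values are invented.
priors = {+1: 0.6, -1: 0.4}
likelihoods = {+1: norm(loc=2.0, scale=1.0), -1: norm(loc=-1.0, scale=1.5)}

def posterior(y, x):
    """P(y|x) = p(x|y) P(y) / p(x), with p(x) = sum_i p(x|y_i) P(y_i)."""
    evidence = sum(likelihoods[c].pdf(x) * priors[c] for c in priors)
    return likelihoods[y].pdf(x) * priors[y] / evidence

x = 0.5
print(posterior(+1, x), posterior(-1, x))  # the two posteriors sum to 1
```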
Probability of error

The probability of error given $x$ is:

$$P(\text{error}|x) = \begin{cases} P(y_2|x) & \text{if we decide } y_1 \\ P(y_1|x) & \text{if we decide } y_2 \end{cases}$$

The average probability of error is:

$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx$$
Bayes decision rule

$$y_B = \operatorname{argmax}_{y_i \in \{-1,1\}} P(y_i|x) = \operatorname{argmax}_{y_i \in \{-1,1\}} p(x|y_i)P(y_i)$$

The probability of error given $x$ is then:

$$P(\text{error}|x) = \min\left[P(y_1|x), P(y_2|x)\right]$$

The Bayes decision rule minimizes the probability of error.
Bayesian decision theory: Continuous features

Setting

- Inputs are vectors in a $d$-dimensional Euclidean space $\mathbb{R}^d$ called the feature space
- The output is one of $c$ possible categories or classes, $Y = \{y_1, \ldots, y_c\}$ (binary/multiclass classification)
- A decision rule is a function $f: X \to Y$ telling which class to predict for a given observation
- A loss function $\ell: Y \times Y \to \mathbb{R}$ measures the cost of predicting $y_i$ when the true class is $y_j$

The conditional risk of predicting $y$ given $x$ is:

$$R(y|x) = \sum_{i=1}^{c} \ell(y, y_i)\,P(y_i|x)$$
Bayes decision rule

$$y^* = \operatorname{argmin}_{y \in Y} R(y|x)$$

The overall risk of a decision rule $f$ is given by:

$$R[f] = \int R(f(x)|x)\,p(x)\,dx = \int \sum_{i=1}^{c} \ell(f(x), y_i)\,p(y_i, x)\,dx$$

The Bayes decision rule minimizes the overall risk. The Bayes risk $R^*$ is the overall risk of a Bayes decision rule, and it is the best performance achievable.
Minimum-error-rate classification

Zero-one loss

$$\ell(y, y_i) = \begin{cases} 0 & \text{if } y = y_i \\ 1 & \text{if } y \neq y_i \end{cases}$$

- It assigns a unit cost to a misclassified example and zero cost to a correctly classified one
- It assumes that all errors have the same cost
- Its risk corresponds to the average probability of error:

$$R(y_i|x) = \sum_{j=1}^{c} \ell(y_i, y_j)\,P(y_j|x) = \sum_{j \neq i} P(y_j|x) = 1 - P(y_i|x)$$
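A short sketch (posteriors and loss values are made up) of how the conditional risk is computed from a loss matrix, and a check that under zero-one loss it reduces to $1 - P(y_i|x)$, so minimizing the risk amounts to maximizing the posterior:

```python
import numpy as np

post = np.array([0.7, 0.2, 0.1])   # hypothetical posteriors P(y_j|x)

# Generic loss matrix: loss[i, j] = cost of predicting y_i when the truth is y_j.
loss = np.array([[0.0, 2.0, 4.0],
                 [1.0, 0.0, 1.0],
                 [3.0, 2.0, 0.0]])
risk = loss @ post                 # R(y_i|x) = sum_j loss(y_i, y_j) P(y_j|x)
print("risk-minimizing decision:", risk.argmin())

# Zero-one loss: the conditional risk equals 1 - P(y_i|x).
zero_one = 1.0 - np.eye(3)
print(np.allclose(zero_one @ post, 1.0 - post))   # True
```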
Minimum-error-rate classification: binary case

[Figure: class-conditional densities $p(x|\omega_1)$ and $p(x|\omega_2)$ over $x$, with decision regions $\mathcal{R}_1, \mathcal{R}_2$ induced by two likelihood-ratio thresholds $\theta_a$ and $\theta_b$.]
Bayes decision rule

$$y_B = \operatorname{argmax}_{y_i \in \{-1,1\}} P(y_i|x) = \operatorname{argmax}_{y_i \in \{-1,1\}} p(x|y_i)P(y_i)$$

Likelihood ratio: deciding $y_1$ when

$$p(x|y_1)P(y_1) > p(x|y_2)P(y_2)$$

is equivalent to deciding $y_1$ when the likelihood ratio exceeds a threshold determined by the priors:

$$\frac{p(x|y_1)}{p(x|y_2)} > \frac{P(y_2)}{P(y_1)} = \theta_a$$
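A minimal sketch of the likelihood-ratio test (densities and priors invented); it checks that thresholding the ratio at $\theta_a$ agrees with comparing $p(x|y_i)P(y_i)$ directly:

```python
from scipy.stats import norm

# Invented class-conditional densities and priors.
p1, p2 = norm(2.0, 1.0), norm(-1.0, 1.0)
P1, P2 = 0.3, 0.7
theta_a = P2 / P1                  # threshold fixed by the priors

def decide(x):
    # Decide y1 iff the likelihood ratio exceeds theta_a.
    return +1 if p1.pdf(x) / p2.pdf(x) > theta_a else -1

x = 0.8
assert decide(x) == (+1 if p1.pdf(x) * P1 > p2.pdf(x) * P2 else -1)
print(decide(x))
```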
Representing classifiers

Discriminant functions

A classifier can be represented as a set of discriminant functions $g_i(x)$, $i \in \{1, \ldots, c\}$, giving:

$$y = \operatorname{argmax}_{i \in \{1, \ldots, c\}} g_i(x)$$

A discriminant function is not unique $\Rightarrow$ the most convenient one for computational or explanatory reasons can be used:

$$g_i(x) = P(y_i|x) = \frac{p(x|y_i)P(y_i)}{p(x)}$$
$$g_i(x) = p(x|y_i)P(y_i)$$
$$g_i(x) = \ln p(x|y_i) + \ln P(y_i)$$
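The three forms above differ only by monotone transformations, so they induce the same decision; a quick numerical check with invented Gaussian classes:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 3-class problem; all parameter values are invented.
priors = np.array([0.5, 0.3, 0.2])
dists = [norm(-2.0, 1.0), norm(0.0, 0.5), norm(3.0, 2.0)]

def g_posterior(x):                # g_i(x) = P(y_i|x)
    joint = np.array([d.pdf(x) for d in dists]) * priors
    return joint / joint.sum()

def g_joint(x):                    # g_i(x) = p(x|y_i) P(y_i)
    return np.array([d.pdf(x) for d in dists]) * priors

def g_log(x):                      # g_i(x) = ln p(x|y_i) + ln P(y_i)
    return np.array([d.logpdf(x) for d in dists]) + np.log(priors)

for x in np.linspace(-5, 5, 11):
    assert g_posterior(x).argmax() == g_joint(x).argmax() == g_log(x).argmax()
print("all three discriminants give the same decision")
```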
[Figure: joint densities $p(x|\omega_1)P(\omega_1)$ and $p(x|\omega_2)P(\omega_2)$ over a two-dimensional feature space, with the decision boundary separating regions $\mathcal{R}_1$ and $\mathcal{R}_2$.]
Decision regions

The feature space is divided into decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_c$ such that:

$$x \in \mathcal{R}_i \text{ if } g_i(x) > g_j(x)\ \forall j \neq i$$

Decision regions are separated by decision boundaries, i.e. regions in which ties occur among the largest discriminant functions.
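A small sketch (same invented classes as above) that labels a 1-D grid with its decision region and locates the boundaries as the points where the winning discriminant changes:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.3, 0.2])                 # invented
dists = [norm(-2.0, 1.0), norm(0.0, 0.5), norm(3.0, 2.0)]

xs = np.linspace(-6, 8, 1401)
g = np.stack([d.logpdf(xs) for d in dists]) + np.log(priors)[:, None]
region = g.argmax(axis=0)          # index i of the region R_i containing each x

# Decision boundaries: grid points where the region label changes.
boundaries = xs[1:][np.diff(region) != 0]
print("approximate boundaries:", boundaries)
```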
Normal density

Multivariate normal density

$$p(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^t \Sigma^{-1} (x-\mu)\right)$$

- The covariance matrix $\Sigma$ is always symmetric and positive semi-definite
- The covariance matrix is strictly positive definite provided the distribution does not degenerate to a subspace of dimension lower than $d$ (otherwise $|\Sigma| = 0$)
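A minimal check (invented 2-D parameters) that the density formula above matches a reference implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

def gaussian_pdf(x, mu, Sigma):
    d = mu.size
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff      # (x-mu)^t Sigma^-1 (x-mu)
    const = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / const

x = np.array([0.5, 0.0])
print(np.isclose(gaussian_pdf(x, mu, Sigma),
                 multivariate_normal(mu, Sigma).pdf(x)))   # True
```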
[Figure: constant-density ellipse of a two-dimensional Gaussian centred at $\mu$, with axes $x_1, x_2$.]

Hyperellipsoids

The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance from $x$ to $\mu$. The principal axes of such hyperellipsoids are the eigenvectors of $\Sigma$; their lengths are determined by the corresponding eigenvalues.
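A numerical sketch (invented covariance) of the principal-axes claim: stepping from $\mu$ along each eigenvector of $\Sigma$, scaled by the square root of its eigenvalue, lands on the same constant-Mahalanobis-distance ellipsoid:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])     # invented, symmetric positive definite
eigvals, eigvecs = np.linalg.eigh(Sigma)
print("principal axis directions (columns):\n", eigvecs)

mu = np.zeros(2)
Sinv = np.linalg.inv(Sigma)
for lam, v in zip(eigvals, eigvecs.T):
    x = mu + np.sqrt(lam) * v      # one step along a principal axis
    print((x - mu) @ Sinv @ (x - mu))   # squared Mahalanobis distance = 1
```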
Discriminant functions for normal density

Discriminant functions

$$g_i(x) = \ln p(x|y_i) + \ln P(y_i) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(y_i)$$

Discarding terms which are independent of $i$ we obtain:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(y_i)$$
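A sketch (invented parameters) implementing the full and reduced discriminants and checking that they produce the same decisions, since they differ only by the class-independent term $-\frac{d}{2}\ln 2\pi$:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Invented two-class Gaussian problem.
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.array([[1.0, 0.2], [0.2, 2.0]]),
          np.array([[1.5, -0.3], [-0.3, 0.5]])]
priors = [0.4, 0.6]

def g_full(x, i):                  # ln p(x|y_i) + ln P(y_i)
    return multivariate_normal(mus[i], Sigmas[i]).logpdf(x) + np.log(priors[i])

def g_reduced(x, i):               # drops the -d/2 ln 2pi constant
    diff = x - mus[i]
    return (-0.5 * diff @ np.linalg.inv(Sigmas[i]) @ diff
            - 0.5 * np.log(np.linalg.det(Sigmas[i])) + np.log(priors[i]))

rng = np.random.default_rng(0)
for x in rng.normal(size=(20, 2)) * 3:
    assert np.argmax([g_full(x, i) for i in range(2)]) == \
           np.argmax([g_reduced(x, i) for i in range(2)])
print("full and reduced discriminants agree")
```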
case $\Sigma_i = \sigma^2 I$

- Features are statistically independent
- All features have the same variance $\sigma^2$
- The covariance determinant $|\Sigma_i| = \sigma^{2d}$ can be ignored, being independent of $i$
- The covariance inverse is given by $\Sigma_i^{-1} = (1/\sigma^2)I$
- The discriminant functions become:

$$g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} + \ln P(y_i)$$
[Figure: spherical Gaussian class-conditional densities with equal priors $P(\omega_1) = P(\omega_2) = .5$ in one, two and three dimensions; the boundary between $\mathcal{R}_1$ and $\mathcal{R}_2$ lies halfway between the means.]
case $\Sigma_i = \sigma^2 I$ (cont.)

Expansion of the quadratic form leads to:

$$g_i(x) = -\frac{1}{2\sigma^2}\left[x^t x - 2\mu_i^t x + \mu_i^t \mu_i\right] + \ln P(y_i)$$

Discarding terms which are independent of $i$ (here $x^t x$), we obtain linear discriminant functions:

$$g_i(x) = \underbrace{\frac{1}{\sigma^2}\mu_i^t}_{w_i^t}\,x\ \underbrace{-\ \frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(y_i)}_{w_{i0}}$$
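A quick check (invented means, variance and priors) that the distance-based form and the linear form $w_i^t x + w_{i0}$ give identical decisions, since they differ only by the shared term $-x^t x / 2\sigma^2$:

```python
import numpy as np

sigma2 = 1.5                       # invented
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
priors = [0.5, 0.3, 0.2]

def g_dist(x, i):
    return -np.sum((x - mus[i]) ** 2) / (2 * sigma2) + np.log(priors[i])

def g_linear(x, i):
    w = mus[i] / sigma2
    w0 = -mus[i] @ mus[i] / (2 * sigma2) + np.log(priors[i])
    return w @ x + w0

rng = np.random.default_rng(1)
for x in rng.normal(size=(20, 2)) * 2:
    assert np.argmax([g_dist(x, i) for i in range(3)]) == \
           np.argmax([g_linear(x, i) for i in range(3)])
print("distance-based and linear discriminants agree")
```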
Separating hyperplane

Setting $g_i(x) = g_j(x)$ we note that the decision boundaries are pieces of hyperplanes:

$$\underbrace{(\mu_i - \mu_j)^t}_{w^t}\Bigg(x - \underbrace{\Big(\frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\ln\frac{P(y_i)}{P(y_j)}\,(\mu_i - \mu_j)\Big)}_{x_0}\Bigg) = 0$$

- The hyperplane is orthogonal to the vector $w$ $\Rightarrow$ orthogonal to the line linking the means
- The hyperplane passes through $x_0$:
  - if the prior probabilities of the classes are equal, $x_0$ is halfway between the means
  - otherwise, $x_0$ shifts away from the more likely mean, enlarging the corresponding decision region (a sketch follows)
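A minimal sketch (invented means, variance and priors) computing $x_0$ and showing the shift away from the more likely mean:

```python
import numpy as np

sigma2 = 1.0                       # invented
mu_i, mu_j = np.array([2.0, 0.0]), np.array([-2.0, 0.0])

def x0(P_i, P_j):
    d = mu_i - mu_j
    return 0.5 * (mu_i + mu_j) - sigma2 / (d @ d) * np.log(P_i / P_j) * d

print(x0(0.5, 0.5))   # equal priors: halfway between the means, [0, 0]
print(x0(0.9, 0.1))   # x0 moves toward mu_j, away from the likelier mu_i
```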
Separating hyperplane: derivation (1)

$$g_i(x) - g_j(x) = 0$$
$$\frac{1}{\sigma^2}\mu_i^t x - \frac{1}{2\sigma^2}\mu_i^t\mu_i + \ln P(y_i) - \frac{1}{\sigma^2}\mu_j^t x + \frac{1}{2\sigma^2}\mu_j^t\mu_j - \ln P(y_j) = 0$$
$$(\mu_i - \mu_j)^t x - \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) + \sigma^2\ln\frac{P(y_i)}{P(y_j)} = 0$$

Matching this against $w^t(x - x_0) = 0$ with $w = (\mu_i - \mu_j)$ gives:

$$(\mu_i - \mu_j)^t x_0 = \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) - \sigma^2\ln\frac{P(y_i)}{P(y_j)}$$
Separating hyperplane: derivation (2)

$$(\mu_i - \mu_j)^t x_0 = \frac{1}{2}(\mu_i^t\mu_i - \mu_j^t\mu_j) - \sigma^2\ln\frac{P(y_i)}{P(y_j)}$$

Using $(\mu_i^t\mu_i - \mu_j^t\mu_j) = (\mu_i - \mu_j)^t(\mu_i + \mu_j)$ and

$$\sigma^2\ln\frac{P(y_i)}{P(y_j)} = \frac{(\mu_i - \mu_j)^t(\mu_i - \mu_j)}{\|\mu_i - \mu_j\|^2}\,\sigma^2\ln\frac{P(y_i)}{P(y_j)} = (\mu_i - \mu_j)^t\,\frac{(\mu_i - \mu_j)}{\|\mu_i - \mu_j\|^2}\,\sigma^2\ln\frac{P(y_i)}{P(y_j)}$$

a solution is:

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\ln\frac{P(y_i)}{P(y_j)}\,(\mu_i - \mu_j)$$
[Figure: for $\Sigma_i = \sigma^2 I$, panels with increasingly unequal priors ($P(\omega_1)/P(\omega_2)$ = .7/.3, .9/.1, .8/.2, .99/.01) in one, two and three dimensions; the boundary between $\mathcal{R}_1$ and $\mathcal{R}_2$ shifts away from the more likely mean as its prior grows.]
Discriminant functions for normal density

case $\Sigma_i = \Sigma$

All classes have the same covariance matrix. The discriminant functions become:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t \Sigma^{-1}(x-\mu_i) + \ln P(y_i)$$

Expanding the quadratic form and discarding terms independent of $i$, we again obtain linear discriminant functions:

$$g_i(x) = \underbrace{\mu_i^t \Sigma^{-1}}_{w_i^t}\,x\ \underbrace{-\ \frac{1}{2}\mu_i^t \Sigma^{-1}\mu_i + \ln P(y_i)}_{w_{i0}}$$

The separating hyperplanes are not necessarily orthogonal to the line linking the means:

$$\underbrace{(\mu_i - \mu_j)^t \Sigma^{-1}}_{w^t}\Bigg(x - \underbrace{\Big(\frac{1}{2}(\mu_i + \mu_j) - \frac{\ln P(y_i)/P(y_j)}{(\mu_i - \mu_j)^t \Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)\Big)}_{x_0}\Bigg) = 0$$
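A quick check (invented shared covariance and means) that with a non-spherical $\Sigma$ the boundary normal $w = \Sigma^{-1}(\mu_i - \mu_j)$ is generally not parallel to the line joining the means:

```python
import numpy as np

Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.0]])     # invented shared covariance
mu_i, mu_j = np.array([1.0, 1.0]), np.array([-1.0, 0.0])

d = mu_i - mu_j
w = np.linalg.inv(Sigma) @ d
cos_angle = w @ d / (np.linalg.norm(w) * np.linalg.norm(d))
print("cosine between w and the mean-difference direction:", cos_angle)  # < 1
```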
[Figure: shared-covariance Gaussians in two and three dimensions for equal priors ($P(\omega_1)/P(\omega_2)$ = .5/.5) and unequal priors (.1/.9); the boundaries are hyperplanes, generally not orthogonal to the line between the means, shifting away from the more likely mean.]
Discriminant functions for normal density

case $\Sigma_i$ = arbitrary

The discriminant functions are inherently quadratic:

$$g_i(x) = x^t \underbrace{\Big(-\frac{1}{2}\Sigma_i^{-1}\Big)}_{W_i}\,x + \underbrace{\mu_i^t \Sigma_i^{-1}}_{w_i^t}\,x\ \underbrace{-\ \frac{1}{2}\mu_i^t \Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(y_i)}_{w_{i0}}$$

In the two-category case, the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, etc.
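A sketch (invented parameters) verifying that the decomposition into $W_i$, $w_i$ and $w_{i0}$ reproduces $\ln p(x|y_i) + \ln P(y_i)$ up to the class-independent constant $-\frac{d}{2}\ln 2\pi$:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -0.5])         # invented
Sigma = np.array([[1.2, 0.4],
                  [0.4, 0.8]])
P = 0.3
d = mu.size

Sinv = np.linalg.inv(Sigma)
W = -0.5 * Sinv
w = Sinv @ mu
w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(P)

x = np.array([0.7, 0.2])
quadratic = x @ W @ x + w @ x + w0
direct = multivariate_normal(mu, Sigma).logpdf(x) + np.log(P)
print(np.isclose(quadratic, direct + 0.5 * d * np.log(2 * np.pi)))   # True
```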
[Figure: examples of hyperquadric decision surfaces arising in the case $\Sigma_i$ = arbitrary.]
Reducible error

[Figure: $p(x|\omega_i)P(\omega_i)$ for two classes over $x$, with a suboptimal boundary $x^*$ and the Bayes boundary $x_B$; the integrals $\int_{\mathcal{R}_2} p(x|\omega_1)P(\omega_1)\,dx$ and $\int_{\mathcal{R}_1} p(x|\omega_2)P(\omega_2)\,dx$ are the two error contributions, and the area between $x_B$ and $x^*$ is the reducible error.]
Probability of error in binary classification

$$P(\text{error}) = P(x \in \mathcal{R}_1, y_2) + P(x \in \mathcal{R}_2, y_1) = \int_{\mathcal{R}_1} p(x|y_2)P(y_2)\,dx + \int_{\mathcal{R}_2} p(x|y_1)P(y_1)\,dx$$

The reducible error is the error resulting from a suboptimal choice of the decision boundary.
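A numerical sketch (invented 1-D Gaussians, equal priors) integrating the two error contributions for a given boundary; the Bayes boundary, here the midpoint of the means, yields the smaller error:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p1, p2 = norm(-1.0, 1.0), norm(2.0, 1.0)   # invented
P1, P2 = 0.5, 0.5

def error(boundary):
    # R_1 = (-inf, boundary], R_2 = (boundary, inf)
    e1, _ = quad(lambda x: p2.pdf(x) * P2, -np.inf, boundary)
    e2, _ = quad(lambda x: p1.pdf(x) * P1, boundary, np.inf)
    return e1 + e2

x_bayes = 0.5     # equal priors and variances: midpoint of the means
print(error(x_bayes), error(1.5))   # suboptimal boundary -> reducible error
```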
Bayesian decision theory: arbitrary inputs and outputs

Setting

Examples are input-output pairs $(x, y) \in X \times Y$ generated with probability $p(x, y)$. The conditional risk of predicting $y^*$ given $x$ is:

$$R(y^*|x) = \int_Y \ell(y^*, y)\,p(y|x)\,dy$$

The overall risk of a decision rule $f$ is given by:

$$R[f] = \int R(f(x)|x)\,p(x)\,dx = \int_X \int_Y \ell(f(x), y)\,p(y, x)\,dy\,dx$$

Bayes decision rule

$$y_B = \operatorname{argmin}_{y \in Y} R(y|x)$$
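As an added example (not from the slide): with continuous outputs and squared loss $\ell(y^*, y) = (y^* - y)^2$, the conditional risk is minimized by the posterior mean $E[y|x]$; a small numerical check under an invented posterior:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

posterior = norm(1.3, 0.7)         # invented p(y|x) for some fixed x

def risk(y_star):
    # R(y*|x) = integral of (y* - y)^2 p(y|x) dy
    val, _ = quad(lambda y: (y_star - y) ** 2 * posterior.pdf(y),
                  -np.inf, np.inf)
    return val

res = minimize_scalar(risk)
print(res.x)    # ~1.3: the Bayes decision under squared loss is E[y|x]
```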
Handling missing features

Marginalize over missing variables

Assume the input $x$ consists of an observed part $x_o$ and a missing part $x_m$. The posterior probability of $y_i$ given the observation can be obtained from probabilities over entire inputs by marginalizing over the missing part:

$$P(y_i|x_o) = \frac{p(y_i, x_o)}{p(x_o)} = \frac{\int p(y_i, x_o, x_m)\,dx_m}{p(x_o)} = \frac{\int P(y_i|x_o, x_m)\,p(x_o, x_m)\,dx_m}{\int p(x_o, x_m)\,dx_m} = \frac{\int P(y_i|x)\,p(x)\,dx_m}{\int p(x)\,dx_m}$$
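A discrete sketch (invented joint distribution) of the same marginalization, with sums in place of integrals:

```python
import numpy as np

# joint[i, o, m] = P(y_i, x_o = o, x_m = m): 2 classes, 2 values per feature.
joint = np.array([[[0.10, 0.05], [0.15, 0.10]],
                  [[0.05, 0.20], [0.05, 0.30]]])
assert np.isclose(joint.sum(), 1.0)

x_o = 0
p_y_xo = joint[:, x_o, :].sum(axis=1)   # marginalize over the missing x_m
posterior = p_y_xo / p_y_xo.sum()       # normalize by p(x_o)
print(posterior)                        # P(y_i | x_o = 0)
```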
Handling noisy features

Marginalize over true variables

Assume $x$ consists of a clean part $x_c$ and a noisy part $x_n$, and that we have a noise model $p(x_n|x_t)$ for the probability of the noisy feature given its true version $x_t$. The posterior probability of $y_i$ given the observation can be obtained from probabilities over clean inputs by marginalizing over the true variables via the noise model:

$$P(y_i|x_c, x_n) = \frac{p(y_i, x_c, x_n)}{p(x_c, x_n)} = \frac{\int p(y_i, x_c, x_n, x_t)\,dx_t}{\int p(x_c, x_n, x_t)\,dx_t} = \frac{\int p(y_i|x_c, x_n, x_t)\,p(x_c, x_n, x_t)\,dx_t}{\int p(x_c, x_n, x_t)\,dx_t}$$
$$= \frac{\int p(y_i|x_c, x_t)\,p(x_n|x_c, x_t)\,p(x_c, x_t)\,dx_t}{\int p(x_n|x_c, x_t)\,p(x_c, x_t)\,dx_t} = \frac{\int p(y_i|x)\,p(x_n|x_t)\,p(x)\,dx_t}{\int p(x_n|x_t)\,p(x)\,dx_t}$$

where the last steps use the facts that the class depends only on the true features and the noise depends only on $x_t$, and $x = (x_c, x_t)$ denotes the clean version of the input.
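A 1-D sketch of the noisy-feature case (all densities invented; for simplicity there is no clean part, so $x = x_t$): the unknown true value is integrated out through a Gaussian noise model $p(x_n|x_t)$:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p_xt = {0: norm(-1.0, 1.0), 1: norm(2.0, 1.0)}   # invented p(x_t|y_i)
priors = {0: 0.5, 1: 0.5}
noise_std = 0.5                                   # invented noise model

def posterior(y, x_n):
    def num(x_t):   # p(y_i, x_t) p(x_n|x_t)
        return priors[y] * p_xt[y].pdf(x_t) * norm(x_t, noise_std).pdf(x_n)
    def den(x_t):   # p(x_t) p(x_n|x_t)
        return sum(priors[c] * p_xt[c].pdf(x_t) for c in priors) \
               * norm(x_t, noise_std).pdf(x_n)
    return quad(num, -np.inf, np.inf)[0] / quad(den, -np.inf, np.inf)[0]

x_n = 0.8
print(posterior(0, x_n), posterior(1, x_n))   # the two posteriors sum to 1
```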