Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Conditional Gaussian Distributions
Prof. Nicholas Zabaras
Materials Process Design and Control Laboratory
Sibley School of Mechanical and Aerospace Engineering
101 Frank H. T. Rhodes Hall
Cornell University
Ithaca, NY 14853-3801
Email: [email protected]
URL: http://mpdc.mae.cornell.edu/
January 23, 2014
Contents

Conditional Gaussian Distributions
The Precision Matrix
Completing the Square
The Conditional Distribution, Conditional Mean and Variance Formulas
The Marginal Distribution, Summary of Marginals/Conditionals
2D Distributions Example
Interpolating Noise-Free Data
Data Imputation

References:
Chris Bishop, Pattern Recognition and Machine Learning, Chapter 2
Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 4
Conditional Gaussian Distributions

If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.

Suppose x is a D-dimensional vector with Gaussian distribution N(x|μ,Σ) and that we partition x into two disjoint subsets x_a (M components) and x_b (D−M components):

\[ \mathbf{x} = \begin{pmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{pmatrix} \]
Conditional Gaussian Distributions

This partition also implies similar partitions for the mean and covariance:

\[ \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{pmatrix} \]

Σᵀ = Σ implies that Σaa and Σbb are symmetric and Σba = Σabᵀ.
The Precision Matrix

We define the precision matrix Λ as Σ⁻¹. Its partition is given as

\[ \boldsymbol{\Lambda} = \begin{pmatrix} \boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab} \\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb} \end{pmatrix}, \]

where from Σᵀ = Σ we conclude that Λaa and Λbb are symmetric (the inverse of a symmetric matrix is symmetric) and Λba = Λabᵀ.

Note that the above partition does NOT imply that Λaa is the inverse of Σaa, etc.
Completing the Square

We are given a quadratic form defining the exponent terms in a Gaussian distribution, and we determine the corresponding mean and covariance:

\[ -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) = -\tfrac{1}{2}\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \text{constant} \]

Here "constant" denotes terms independent of x.

If we are given only the right-hand side, we can immediately identify the inverse of the covariance matrix from the first term (quadratic in x), and subsequently the mean of the distribution from the second term (linear in x).

This approach is used often in analytical calculations.
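As a sanity check, here is a minimal NumPy sketch (illustrative code, not from the original slides; all names are ours) that expands the exponent of a known Gaussian and then "completes the square" in reverse to recover the mean and covariance:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)   # a valid (SPD) covariance matrix
mu = rng.standard_normal(3)

# Expanding -(1/2)(x - mu)^T Sigma^{-1} (x - mu) gives
#   -(1/2) x^T J x + x^T h + const,  with J = Sigma^{-1} and h = Sigma^{-1} mu
J = np.linalg.inv(Sigma)
h = J @ mu

# Reading the Gaussian back off the quadratic form:
Sigma_rec = np.linalg.inv(J)        # covariance = inverse of the quadratic coefficient
mu_rec = Sigma_rec @ h              # mean = covariance times the linear coefficient

assert np.allclose(Sigma_rec, Sigma) and np.allclose(mu_rec, mu)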
The Conditional Distribution

We are now interested in computing p(x_a|x_b). An easy way to do that is to look at the joint distribution p(x_a, x_b), treating x_b as constant.

Using the partition of the precision matrix, we can write:

\[ -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) = -\tfrac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_a)^T \boldsymbol{\Lambda}_{aa} (\mathbf{x}_a-\boldsymbol{\mu}_a) - \tfrac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_a)^T \boldsymbol{\Lambda}_{ab} (\mathbf{x}_b-\boldsymbol{\mu}_b) \]
\[ \qquad\qquad - \tfrac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^T \boldsymbol{\Lambda}_{ba} (\mathbf{x}_a-\boldsymbol{\mu}_a) - \tfrac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^T \boldsymbol{\Lambda}_{bb} (\mathbf{x}_b-\boldsymbol{\mu}_b) \]
The Conditional Distribution

We fix x_b and consider the expansion above (previous slide) as a function of x_a. It is quadratic, so we have a Gaussian. We need to complete the square in x_a.

Quadratic term:

\[ -\tfrac{1}{2}\mathbf{x}_a^T \boldsymbol{\Lambda}_{aa} \mathbf{x}_a \;\Rightarrow\; \boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Lambda}_{aa}^{-1} \]

Linear term:

\[ \mathbf{x}_a^T\left[\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)\right] \;\Rightarrow\; \boldsymbol{\mu}_{a|b} = \boldsymbol{\Lambda}_{aa}^{-1}\left[\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)\right] = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b) \]

In conclusion:

\[ p(\mathbf{x}_a|\mathbf{x}_b) = \mathcal{N}\left(\mathbf{x}_a \,\middle|\, \boldsymbol{\mu}_{a|b},\, \boldsymbol{\Lambda}_{aa}^{-1}\right), \qquad \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b) \]
The Partitioned Inverse Formula

We can also write (with more complicated expressions) the previous results in terms of the partitioned covariance matrix.

We can show that the following result holds:

\[ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{pmatrix}, \qquad \text{where } M = A - BD^{-1}C. \]

This is called the partitioned inverse formula. M is the Schur complement of our matrix with respect to D.
Partitioned Inverse Formula: Proof

Step 1. Multiplying on the left zeroes out the upper-right block:

\[ \begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A - BD^{-1}C & 0 \\ C & D \end{pmatrix} \]

Step 2. Multiplying on the right zeroes out the lower-left block:

\[ \begin{pmatrix} A - BD^{-1}C & 0 \\ C & D \end{pmatrix} \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} = \begin{pmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{pmatrix} \]

Step 3. Combining the steps above (with M = A − BD⁻¹C):

\[ \begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} = \begin{pmatrix} M & 0 \\ 0 & D \end{pmatrix} \]

Inverting both sides gives

\[ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -D^{-1}C & I \end{pmatrix} \begin{pmatrix} M^{-1} & 0 \\ 0 & D^{-1} \end{pmatrix} \begin{pmatrix} I & -BD^{-1} \\ 0 & I \end{pmatrix} = \begin{pmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{pmatrix} \]
Partitioned Inverse Formula

We can also use the Schur complement with respect to A. This leads to the following result, which we easily test with direct multiplication:

\[ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + A^{-1}BM^{-1}CA^{-1} & -A^{-1}BM^{-1} \\ -M^{-1}CA^{-1} & M^{-1} \end{pmatrix}, \qquad \text{where } M = D - CA^{-1}B. \]
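A quick numerical verification of both forms of the partitioned inverse (an illustrative sketch; block sizes, seeds, and variable names are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(2)
p, q = 3, 4
A = rng.standard_normal((p, p)) + 5 * np.eye(p)   # keep blocks well conditioned
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
Dm = rng.standard_normal((q, q)) + 5 * np.eye(q)

full = np.block([[A, B], [C, Dm]])
inv = np.linalg.inv(full)

# Form based on the Schur complement with respect to D:
M = A - B @ np.linalg.solve(Dm, C)
Minv, Dinv = np.linalg.inv(M), np.linalg.inv(Dm)
blocks = np.block([[Minv, -Minv @ B @ Dinv],
                   [-Dinv @ C @ Minv, Dinv + Dinv @ C @ Minv @ B @ Dinv]])
assert np.allclose(inv, blocks)

# Form based on the Schur complement with respect to A:
N = Dm - C @ np.linalg.solve(A, B)
Ninv, Ainv = np.linalg.inv(N), np.linalg.inv(A)
assert np.allclose(inv[:p, :p], Ainv + Ainv @ B @ Ninv @ C @ Ainv)
assert np.allclose(inv[p:, p:], Ninv)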
Matrix Inversion Lemma – Sherman–Morrison–Woodbury Formula

From the two expressions of the inverse formula, we can derive useful identities.

Equating the upper-left blocks we obtain:

\[ (A - BD^{-1}C)^{-1} = A^{-1} + A^{-1}B\,(D - CA^{-1}B)^{-1}\,CA^{-1} \]

Similarly, equating the top-right blocks we obtain:

\[ (A - BD^{-1}C)^{-1}BD^{-1} = A^{-1}B\,(D - CA^{-1}B)^{-1} \]

Finally, one can show the determinant identity:

\[ |A - BD^{-1}C| = |D - CA^{-1}B|\,|D|^{-1}\,|A| \]
Woodbury Matrix Inversion Formula

In addition to completing the square and the matrix inversion formula for a partitioned matrix discussed earlier, the Woodbury matrix inversion formula is quite useful for manipulating Gaussians:

\[ (A - BD^{-1}C)^{-1} = A^{-1} + A^{-1}B\,(D - CA^{-1}B)^{-1}\,CA^{-1} \]

Consider the following application. Let A = Σ be an N×N diagonal matrix, let B = Cᵀ = X of size N×D where N ≫ D, and let D⁻¹ = −I_{D×D}. Then we have

\[ (\boldsymbol{\Sigma} + \mathbf{X}\mathbf{X}^T)^{-1} = \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\mathbf{X}\left(\mathbf{I}_{D\times D} + \mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1} \]

The LHS takes O(N³) time to compute; the RHS takes only O(D³) time.
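A small NumPy sketch of this application (illustrative code of ours; sizes and seed are arbitrary), showing that the Woodbury route only ever solves a D×D system:

import numpy as np

rng = np.random.default_rng(3)
N, D = 500, 10
sigma = rng.random(N) + 0.5               # diagonal entries of Sigma (positive)
X = rng.standard_normal((N, D))

# Direct route: invert an N x N matrix, O(N^3)
direct = np.linalg.inv(np.diag(sigma) + X @ X.T)

# Woodbury route: only a D x D system is solved, O(D^3) plus O(N D^2) products
Sinv_X = X / sigma[:, None]               # Sigma^{-1} X, exploiting diagonality
small = np.eye(D) + X.T @ Sinv_X          # I_{DxD} + X^T Sigma^{-1} X
woodbury = np.diag(1.0 / sigma) - Sinv_X @ np.linalg.solve(small, Sinv_X.T)

assert np.allclose(direct, woodbury)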
Rank One Update of an Inverse

Another useful application arises in computing the rank-1 update of an inverse matrix. Select B = u (a column vector) and C = vᵀ (a row vector), and let D = −1 (a scalar). Then, using

\[ (A - BD^{-1}C)^{-1} = A^{-1} + A^{-1}B\,(D - CA^{-1}B)^{-1}\,CA^{-1}, \]

we obtain

\[ (A + \mathbf{u}\mathbf{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\,\mathbf{v}^T A^{-1}}{1 + \mathbf{v}^T A^{-1}\mathbf{u}} \]

This is important when we incrementally add (or subtract) one data point at a time to the design matrix and we want to update the sufficient statistics.
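An illustrative sketch of the rank-1 update (our own example values), which costs O(n²) instead of the O(n³) of a fresh inversion:

import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)
Ainv = np.linalg.inv(A)
u = rng.standard_normal(n)
v = rng.standard_normal(n)

# Sherman-Morrison update of A^{-1} after the rank-1 change A + u v^T
Au = Ainv @ u
vA = v @ Ainv
Ainv_updated = Ainv - np.outer(Au, vA) / (1.0 + v @ Au)

assert np.allclose(Ainv_updated, np.linalg.inv(A + np.outer(u, v)))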
The Conditional Distribution

Let us use the inversion formula above to write down the relation between the inverse of the covariance matrix and the precision matrix. Applying

\[ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M^{-1} & -M^{-1}BD^{-1} \\ -D^{-1}CM^{-1} & D^{-1} + D^{-1}CM^{-1}BD^{-1} \end{pmatrix}, \quad M = A - BD^{-1}C, \]

to the partitioned covariance gives

\[ \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}, \]

with

\[ \Lambda_{aa} = (\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1} \]
\[ \Lambda_{ab} = -(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\,\Sigma_{ab}\Sigma_{bb}^{-1} \]
\[ \Lambda_{bb} = \Sigma_{bb}^{-1} + \Sigma_{bb}^{-1}\Sigma_{ba}(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} \]

and Λba = Λabᵀ.
The Conditional Distribution

We can reverse the previous results as well and write the partitioned covariance matrix in terms of the inverse of the partitioned precision matrix:

\[ \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \]

with

\[ \Sigma_{aa} = (\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1} \]
\[ \Sigma_{ab} = -(\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1}\,\Lambda_{ab}\Lambda_{bb}^{-1} \]
\[ \Sigma_{bb} = \Lambda_{bb}^{-1} + \Lambda_{bb}^{-1}\Lambda_{ba}(\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1}\Lambda_{ab}\Lambda_{bb}^{-1} \]

and Σba = Σabᵀ.
The Conditional Distribution

From the earlier expressions of the conditional mean and variance, together with the block identities above, we can write:

\[ \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b - \boldsymbol{\mu}_b) = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{x}_b - \boldsymbol{\mu}_b) \]

\[ \boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Lambda}_{aa}^{-1} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba} \]

\[ p(\mathbf{x}_a|\mathbf{x}_b) = \mathcal{N}(\mathbf{x}_a|\boldsymbol{\mu}_{a|b}, \boldsymbol{\Sigma}_{a|b}) \]

Note that the conditional mean is linear in x_b and the conditional variance is independent of x_b.
The Marginal Distribution

We are now interested in computing p(x_a). An easy way to do this is to look at the joint distribution p(x_a, x_b) and integrate x_b out.

Using the partition of the precision matrix, we can group the terms involving x_b:

\[ -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) = -\tfrac{1}{2}\mathbf{x}_b^T \boldsymbol{\Lambda}_{bb} \mathbf{x}_b + \mathbf{x}_b^T \mathbf{m} + (\text{terms not depending on } \mathbf{x}_b), \]

where

\[ \mathbf{m} = \boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a - \boldsymbol{\mu}_a). \]
The Marginal Distribution

To integrate x_b out, we complete the square in x_b:

\[ -\tfrac{1}{2}\mathbf{x}_b^T \boldsymbol{\Lambda}_{bb} \mathbf{x}_b + \mathbf{x}_b^T \mathbf{m} = -\tfrac{1}{2}\left(\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m}\right)^T \boldsymbol{\Lambda}_{bb} \left(\mathbf{x}_b - \boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m}\right) + \tfrac{1}{2}\mathbf{m}^T \boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m} \]

The first term gives a normalization factor when integrating over x_b.
The Marginal Distribution

We are left with the following terms that depend on x_a (the ½ mᵀΛbb⁻¹m contribution, combined with the remaining terms from the joint exponent):

\[ \tfrac{1}{2}\left[\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a - \boldsymbol{\mu}_a)\right]^T \boldsymbol{\Lambda}_{bb}^{-1} \left[\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a - \boldsymbol{\mu}_a)\right] - \tfrac{1}{2}\mathbf{x}_a^T \boldsymbol{\Lambda}_{aa}\mathbf{x}_a + \mathbf{x}_a^T\left(\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a + \boldsymbol{\Lambda}_{ab}\boldsymbol{\mu}_b\right) + \text{const} \]
\[ = -\tfrac{1}{2}\mathbf{x}_a^T \left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)\mathbf{x}_a + \mathbf{x}_a^T \left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)\boldsymbol{\mu}_a + \text{const} \]
The Marginal Distribution

By completing the square in x_a, we can find the covariance and mean of the marginal:

Quadratic term:

\[ -\tfrac{1}{2}\mathbf{x}_a^T\left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)\mathbf{x}_a \;\Rightarrow\; \boldsymbol{\Sigma}_a = \left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)^{-1} = \boldsymbol{\Sigma}_{aa} \]

(the last step follows from the partitioned inverse formula).

Linear term:

\[ \mathbf{x}_a^T\left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)\boldsymbol{\mu}_a \;\Rightarrow\; \mathbb{E}[\mathbf{x}_a] = \boldsymbol{\Sigma}_a\left(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba}\right)\boldsymbol{\mu}_a = \boldsymbol{\mu}_a \]
Conditional and Marginal Distributions

For a marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix. In the conditional distribution, the partitioned precision matrix gives rise to simpler expressions:

\[ p(\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_a|\boldsymbol{\mu}_a, \boldsymbol{\Sigma}_{aa}) \]

\[ p(\mathbf{x}_a|\mathbf{x}_b) = \mathcal{N}\left(\mathbf{x}_a|\boldsymbol{\mu}_{a|b}, \boldsymbol{\Lambda}_{aa}^{-1}\right), \qquad \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b - \boldsymbol{\mu}_b) \]
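A compact helper packaging both results (an illustrative sketch; the function names and index convention are our own):

import numpy as np

def gauss_marginal(mu, Sigma, a):
    """Marginal p(x_a) of N(mu, Sigma); `a` indexes the kept components."""
    a = np.asarray(a)
    return mu[a], Sigma[np.ix_(a, a)]

def gauss_condition(mu, Sigma, a, b, xb):
    """Conditional p(x_a | x_b = xb) of N(mu, Sigma)."""
    a, b = np.asarray(a), np.asarray(b)
    Saa, Sab = Sigma[np.ix_(a, a)], Sigma[np.ix_(a, b)]
    Sbb = Sigma[np.ix_(b, b)]
    gain = Sab @ np.linalg.inv(Sbb)          # Sigma_ab Sigma_bb^{-1}
    mu_cond = mu[a] + gain @ (xb - mu[b])
    Sigma_cond = Saa - gain @ Sab.T
    return mu_cond, Sigma_cond

# Example usage on a random 4D Gaussian:
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)
mu = rng.standard_normal(4)
print(gauss_marginal(mu, Sigma, [0, 1]))
print(gauss_condition(mu, Sigma, [0, 1], [2, 3], np.array([0.5, -1.0])))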
Conditional & Marginals of 2D Gaussians

Consider the 2D Gaussian with covariance

\[ \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \]

Applying our previous results, we can write:

\[ p(x_1|x_2) = \mathcal{N}\left(x_1 \,\middle|\, \mu_1 + \frac{\rho\sigma_1\sigma_2}{\sigma_2^2}(x_2 - \mu_2),\; \sigma_1^2 - \frac{(\rho\sigma_1\sigma_2)^2}{\sigma_2^2}\right) \]

For σ1 = σ2 = σ, we simplify further to:

\[ p(x_1|x_2) = \mathcal{N}\left(x_1 \,\middle|\, \mu_1 + \rho(x_2 - \mu_2),\; \sigma^2(1 - \rho^2)\right) \]

[Figure: the joint p(x1,x2), the marginal p(x1), and the conditional p(x1|x2=1), for ρ = 0.8, σ = 1, μ = 0. gaussCondition2Ddemo2 from PMTK.]
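As a quick numerical check, we can reuse the gauss_condition sketch from the summary slide with the values the figure appears to use (ρ = 0.8, σ1 = σ2 = 1, μ = 0); this snippet assumes that function is already in scope:

import numpy as np

rho = 0.8
mu = np.zeros(2)
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])    # sigma1 = sigma2 = 1

# p(x1 | x2 = 1) via the generic partitioned formulas
m, S = gauss_condition(mu, Sigma, [0], [1], np.array([1.0]))

# Closed form for sigma1 = sigma2 = sigma:
# mean mu1 + rho (x2 - mu2), variance sigma^2 (1 - rho^2)
assert np.isclose(m[0], rho * 1.0)
assert np.isclose(S[0, 0], 1.0 - rho**2)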
Conditional and Marginal Probability Densities

[Figure: surface and contour plots of the marginal bivariate normal pdf p(x) and the conditional bivariate normal pdf p(x|y=2); the ellipsoids are equiprobability curves of p(x, y). Link here for a MatLab program to generate these figures.]
Interpolating Noise-Free Data

Suppose we want to estimate a 1d function defined on the interval [0, T], such that y_i = f(t_i) for N points t_i.

To start with, we assume that the data is noise-free, and thus our task is simply to interpolate.

We assume that the unknown function is smooth. This requires a prior over functions; updating such a prior with observed values yields a posterior over functions.

Here we discuss MAP estimation of functions defined on 1d inputs.

D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007
Interpolating Noise-Free Data

We discretize the function as follows:

\[ x_j = f(s_j), \qquad s_j = jh, \quad h = T/D, \quad 1 \le j \le D \]

As a smoothness prior, we assume that each value is the average of its neighbors plus Gaussian noise:

\[ x_j = \tfrac{1}{2}(x_{j-1} + x_{j+1}) + \epsilon_j, \quad j = 2, \ldots, D-1, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}\!\left(\mathbf{0}, \tfrac{1}{\lambda}\mathbf{I}\right) \]

The precision λ encodes our belief about the function's smoothness: small λ corresponds to a wiggly function, large λ to a smooth function.

In matrix form, we can summarize the above equations using the second-difference matrix

\[ \mathbf{L} = \begin{pmatrix} -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \end{pmatrix}, \qquad \text{a } (D-2)\times D \text{ matrix.} \]

D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007
Interpolating Noise-Free Data

The corresponding prior is:

\[ p(\mathbf{x}) = \mathcal{N}\!\left(\mathbf{x} \,\middle|\, \mathbf{0}, (\lambda^2 \mathbf{L}^T\mathbf{L})^{-1}\right) \propto \exp\!\left(-\frac{\lambda^2}{2}\|\mathbf{L}\mathbf{x}\|_2^2\right) \]

Λ = λ²LᵀL is the precision matrix (one can incorporate λ into L). It has rank D−2, so this is an improper prior. For N ≥ 2 data points, however, the posterior is proper.

Partition x into a vector x1 of D−N unknown components and a vector x2 of N noise-free observed components. This induces a partition L = [L1, L2] with blocks of size (D−2)×(D−N) and (D−2)×N.

The corresponding partition of the precision matrix Λ is then:

\[ \boldsymbol{\Lambda} = \begin{pmatrix} \boldsymbol{\Lambda}_{11} & \boldsymbol{\Lambda}_{12} \\ \boldsymbol{\Lambda}_{21} & \boldsymbol{\Lambda}_{22} \end{pmatrix} = \lambda^2\begin{pmatrix} \mathbf{L}_1^T\mathbf{L}_1 & \mathbf{L}_1^T\mathbf{L}_2 \\ \mathbf{L}_2^T\mathbf{L}_1 & \mathbf{L}_2^T\mathbf{L}_2 \end{pmatrix} \]
Interpolating Noise-Free Data

Let us use the form of the joint distribution:

\[ p(\mathbf{x}) \propto \exp\!\left(-\frac{\lambda^2}{2}\|\mathbf{L}\mathbf{x}\|_2^2\right) = \exp\!\left(-\frac{\lambda^2}{2}\|\mathbf{L}_1\mathbf{x}_1 + \mathbf{L}_2\mathbf{x}_2\|_2^2\right) \]

The conditional distribution can be computed directly from the above (keeping x2 fixed) or using our earlier results:

\[ p(\mathbf{x}_1|\mathbf{x}_2) = \mathcal{N}\!\left(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2}\right), \qquad \boldsymbol{\mu}_{1|2} = -\boldsymbol{\Lambda}_{11}^{-1}\boldsymbol{\Lambda}_{12}\mathbf{x}_2, \qquad \boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Lambda}_{11}^{-1} = \frac{1}{\lambda^2}\left(\mathbf{L}_1^T\mathbf{L}_1\right)^{-1} \]

It is easy to compute the posterior mean, noticing that L1 is tridiagonal in structure and that x2 is held at its prescribed values: we solve the sparse linear system

\[ \boldsymbol{\Lambda}_{11}\,\boldsymbol{\mu}_{1|2} = -\boldsymbol{\Lambda}_{12}\,\mathbf{x}_2 \]

Note that the posterior mean is equal to the observed data at the specified locations and smoothly interpolates in between.
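A minimal NumPy sketch of this construction, in the spirit of gaussInterpDemo from PMTK but not its actual code (grid size, observation locations, and values below are our own illustrative choices):

import numpy as np

D, lam = 150, 30.0                         # grid size and prior precision
obs_idx = np.array([20, 60, 110])          # indices of noise-free observations
obs_val = np.array([0.5, -1.0, 1.5])

# (D-2) x D second-difference matrix L, rows (-1, 2, -1)
L = np.zeros((D - 2, D))
for j in range(D - 2):
    L[j, j:j + 3] = [-1.0, 2.0, -1.0]
L *= lam                                   # incorporate lambda into L

hid_idx = np.setdiff1d(np.arange(D), obs_idx)
L1, L2 = L[:, hid_idx], L[:, obs_idx]

# Posterior over hidden values: Lam11 mu_{1|2} = -Lam12 x2, Sigma_{1|2} = Lam11^{-1}
Lam11 = L1.T @ L1
mu_hid = np.linalg.solve(Lam11, -L1.T @ (L2 @ obs_val))
var_hid = np.diag(np.linalg.inv(Lam11))    # marginal posterior variances

x_map = np.empty(D)
x_map[hid_idx], x_map[obs_idx] = mu_hid, obs_val   # the MAP interpolant

Note that λ cancels in mu_hid (it scales both sides of the linear system), while var_hid shrinks as λ grows, matching the observations on the next two slides.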
Prior Modeling: Smoothness Prior

[Figure: posterior mean and pointwise credibility bands for the noise-free interpolation problem, for a high prior precision (λ = 30, left) and a low one (λ = 0.1, right).]

The variance goes up as we move away from the data. The variance also goes up as we decrease the precision of the prior, λ.

λ has no effect on the posterior mean, since it cancels out in the product Λ11⁻¹Λ12 (again, for noise-free data).
Prior Modeling: Smoothness Prior

[Figure: the same two panels (λ = 30 and λ = 0.1), with posterior samples overlaid on the marginal credibility intervals μ_{1|2,j} ± 2√Σ_{1|2,jj}. gaussInterpDemo from PMTK.]

The marginal credibility intervals do not capture the fact that neighboring locations are correlated. We can represent that by drawing complete functions (i.e., vectors x) from the posterior and plotting them (thin lines). These are not as smooth as the posterior mean itself, since the prior only penalizes first-order differences.
Data Imputation

Suppose we are missing some entries in a design matrix. If the columns are correlated, we can use the observed entries to predict the missing entries.

In the figure, we sample some data from a 20-dimensional Gaussian and then deliberately "hide" 50% of the data in each row.

We then infer the missing entries given the observed entries, using the true (generating) model.

More precisely, for each row i we compute p(x_{h_i} | x_{v_i}, θ), where h_i and v_i are the indices of the hidden and visible entries in case i.

From this, we compute the marginal distribution of each missing variable, p(x_{h_{ij}} | x_{v_i}, θ). We then plot the mean of this distribution.
Data Imputation

The mean x̂_{ij} = E[x_{ij} | x_{v_i}, θ] represents our "best guess" about the true value of that entry, in the sense that it minimizes our expected squared error.

The figure shows that the estimates are quite close to the truth. (If j ∈ v_i, the expected value is equal to the observed value, x̂_{ij} = x_{ij}.)

We can use Var[x_{h_{ij}} | x_{v_i}, θ] as a measure of confidence in this guess (not shown). Alternatively, we could draw multiple samples from p(x_{h_i} | x_{v_i}, θ) (multiple imputation).
Data Imputation

[Figure: three rows of the data matrix, one per row of panels. Left column: visualization of the observed data with missing entries. Middle: mean of the posterior predictive, based on the partially observed data in that row but the true model parameters. Right: true values. gaussImputationDemo from PMTK.]

We may also be interested in computing the likelihood of each partially observed row in the table, p(x_{v_i} | θ), which can be computed using

\[ p(\mathbf{x}_{v_i}|\boldsymbol{\theta}) = \mathcal{N}\!\left(\mathbf{x}_{v_i} \,\middle|\, \boldsymbol{\mu}_{v_i}, \boldsymbol{\Sigma}_{v_i v_i}\right). \]

This is useful for detecting outliers.
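A small sketch of this imputation procedure (illustrative code of our own, not the PMTK demo itself; sizes, seed, and variable names are assumptions):

import numpy as np

rng = np.random.default_rng(6)
D, n = 20, 5                                  # dimension and number of rows
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)               # the "true" generating covariance
mu = np.zeros(D)
X = rng.multivariate_normal(mu, Sigma, size=n)

mask = rng.random((n, D)) < 0.5               # True = hidden entry (about 50%)
X_imputed = X.copy()
for i in range(n):
    h = np.where(mask[i])[0]                  # hidden indices h_i
    v = np.where(~mask[i])[0]                 # visible indices v_i
    Shv = Sigma[np.ix_(h, v)]
    Svv = Sigma[np.ix_(v, v)]
    resid = X[i, v] - mu[v]
    # posterior mean E[x_h | x_v, theta] = mu_h + Sigma_hv Sigma_vv^{-1} (x_v - mu_v)
    X_imputed[i, h] = mu[h] + Shv @ np.linalg.solve(Svv, resid)
    # log-likelihood of the visible part, p(x_v | theta) = N(x_v | mu_v, Sigma_vv),
    # useful for flagging outlier rows
    ll = -0.5 * (len(v) * np.log(2 * np.pi)
                 + np.linalg.slogdet(Svv)[1]
                 + resid @ np.linalg.solve(Svv, resid))
    print(f"row {i}: log p(x_v | theta) = {ll:.2f}")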