UNIVERSITY OF HELSINKI, Dept. of Computer Science, Patrik Hoyer


Page 1:

Multivariate probability distributions and linear regression

Page 2:

Contents:

• Random variable, probability distribution

• Joint distribution

• Marginal distribution

• Conditional distribution

• Independence, conditional independence

• Generating data

• Expectation, variance, covariance, correlation

• Multivariate Gaussian distribution

• Multivariate linear regression

• Estimating a distribution from sample data

Page 3:

• Random variable

- sample space (set of possible elementary outcomes)
- probability distribution over the sample space

• Examples:

- The throw of a die:

  x:    1    2    3    4    5    6
  P(x): 1/6  1/6  1/6  1/6  1/6  1/6

- The sum of two dice:

  x:    2     3     4     5    6     7    8     9    10    11    12
  P(x): 1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36

- Two separate dice (red, blue):

  x:    (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)  (2,1)  (2,2)  (2,3)  ...  (6,6)
  P(x): 1/36   1/36   1/36   1/36   1/36   1/36   1/36   1/36   1/36   ...  1/36
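As a quick sanity check, the sum-of-two-dice table above can be reproduced by enumerating the 36 equally likely outcomes (a minimal Python sketch; the variable names are illustrative):

```python
from fractions import Fraction
from collections import Counter

# Enumerate all 36 equally likely (red, blue) outcomes and tally each sum.
counts = Counter(r + b for r in range(1, 7) for b in range(1, 7))
P_sum = {s: Fraction(n, 36) for s, n in counts.items()}

print(P_sum[7])             # 1/6
print(P_sum[2])             # 1/36
print(sum(P_sum.values()))  # 1  (the probabilities sum to one)
```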

Page 4:

• Discrete variables:

- Finite number of states (e.g. the dice examples)

- Infinite number of states (e.g. how many heads before the first tails in a sequence of coin tosses?)

• Continuous variables: Each particular state has probability zero, so we need the concept of a probability density:

  P(X ≤ x) = ∫_{-∞}^{x} p(t) dt

  (e.g. how long until the next bus arrives? what will be the price of oil a year from now?)

Page 5:

• A probability distribution satisfies:

1. Probabilities are non-negative:

   P(X = x) = P_X(x) = P(x) ≥ 0

2. Sum to one:

   ∑_x P(x) = 1   (discrete)

   ∫ p(x) dx = 1   (continuous)

[Note that in the discrete case this means that there exists no value of x such that P(x) > 1. However, this does not in general hold for a continuous density p(x)!]
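A small numerical illustration of the note above: an exponential density with rate 2 (a hypothetical choice, not from the slides) exceeds 1 near zero, yet still integrates to one:

```python
import math

lam = 2.0  # rate of the exponential density p(x) = lam * exp(-lam * x), x >= 0

def p(x):
    return lam * math.exp(-lam * x)

# The density exceeds 1 near zero...
print(p(0.0))  # 2.0

# ...yet it still integrates to 1 (midpoint rule on [0, 20]; the tail beyond 20 is negligible):
n, hi = 200000, 20.0
h = hi / n
total = sum(p((i + 0.5) * h) for i in range(n)) * h
print(round(total, 4))  # ~1.0
```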

Page 6:

• The joint distribution of two random variables:

- Let X and Y be random variables. Their joint distribution is

  P(x, y) = P(X = x and Y = y)

- Example: Two coin tosses; X denotes the first throw, Y the second (note: independence!)

  P(x, y):       Y = H   Y = T
       X = H     0.25    0.25
       X = T     0.25    0.25

- Example: X: rain today? Y: rain tomorrow?

  P(x, y):       Y = Y   Y = N
       X = Y     0.5     0.2
       X = N     0.1     0.2

Page 7:

• Marginal distribution:

- 'Interested in or observing only one of the two variables'

- The distribution is obtained by summing (or integrating) over the other variable:

  P(x) = ∑_y P(x, y)   (discrete)

  p(x) = ∫ p(x, y) dy   (continuous)

- Example (continued): What is the probability of rain tomorrow? That is, what is P(y)?

  P(x, y):       Y = Y   Y = N
       X = Y     0.5     0.2
       X = N     0.1     0.2

  ⇒ P(y): 0.6 (rain), 0.4 (no rain)

  In the same fashion, we can calculate that the chance of rain today is 0.7.
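The marginalization above can be sketched in a few lines of Python (the dictionary encoding of the rain table is my own, not from the slides):

```python
# Joint distribution of the rain example: first key = X (rain today), second = Y (rain tomorrow).
P = {('Y', 'Y'): 0.5, ('Y', 'N'): 0.2,
     ('N', 'Y'): 0.1, ('N', 'N'): 0.2}

# Marginalize by summing over the other variable.
P_y = {y: sum(p for (x, yy), p in P.items() if yy == y) for y in ('Y', 'N')}
P_x = {x: sum(p for (xx, y), p in P.items() if xx == x) for x in ('Y', 'N')}

print(round(P_y['Y'], 10))  # 0.6 -> rain tomorrow with probability 0.6
print(round(P_x['Y'], 10))  # 0.7 -> rain today with probability 0.7
```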

Page 8:

• Conditional distribution:

- 'If we observe X = x, how does that affect our belief about the value of Y?'

- Obtained by selecting the appropriate row/column of the joint distribution, and renormalizing it to sum to one:

  P(y | X = x) = P(y | x) = P(x, y) / P(x)   (discrete)

  p(y | x) = p(x, y) / p(x)   (continuous)

- Example (continued): What is the probability that it rains tomorrow, given that today it does not rain? That is, what is P(y | X = 'no rain')?

  P(x, y):       Y = Y   Y = N
       X = Y     0.5     0.2
       X = N     0.1     0.2

  ⇒ P(y | X = 'no rain'):  0.1 / (0.1 + 0.2) ≈ 0.33 (rain),  0.2 / (0.1 + 0.2) ≈ 0.67 (no rain)
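The same select-and-renormalize step can be sketched in Python (again with an ad hoc encoding of the joint table):

```python
# Joint distribution of the rain example: first key = X (today), second = Y (tomorrow).
P = {('Y', 'Y'): 0.5, ('Y', 'N'): 0.2,
     ('N', 'Y'): 0.1, ('N', 'N'): 0.2}

def conditional_y_given_x(P, x):
    """Select the row X = x and renormalize it to sum to one (hypothetical helper)."""
    norm = sum(p for (xx, y), p in P.items() if xx == x)
    return {y: p / norm for (xx, y), p in P.items() if xx == x}

cond = conditional_y_given_x(P, 'N')
print({k: round(v, 3) for k, v in cond.items()})  # {'Y': 0.333, 'N': 0.667}
```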

Page 9:

• So the joint distribution can be specified directly, or using the marginal and conditional distributions (one can even choose 'which way' to specify it)

⇒ Chain rule:

  P(x, y) = P(x) P(y|x) = P(y) P(x|y)   (discrete)

  p(x, y) = p(x) p(y|x) = p(y) p(x|y)   (continuous)

Page 10:

• Independence:

Two random variables are independent if and only if knowing the value of one does not change our belief about the other:

  ∀x: P(y|x) = P(y)   ⇔   ∀y: P(x|y) = P(x)

This is equivalent to being able to write the joint distribution as the product of the marginals:

  P(x, y) = P(x) P(y)

We write this as X ⊥⊥ Y, or, if we want to explicitly specify the distribution, (X ⊥⊥ Y)_P.

• Example: Two coin tosses...

Page 11:

• Three or more variables:

- joint distribution ('multidimensional array/function'):

  P(v, w, x, y, z, ...)

- marginal distributions (e.g.):

  P(x) = ∑_{v,w,y,z,...} P(v, w, x, y, z, ...)

  P(x, y) = ∑_{v,w,z,...} P(v, w, x, y, z, ...)

- conditional distributions (e.g.):

  P(x | v, w, y, z, ...) = P(v, w, x, y, z, ...) / P(v, w, y, z, ...)

  P(x, y | v, w, z, ...) = P(v, w, x, y, z, ...) / P(v, w, z, ...)

  P(v, w, y, z, ... | x) = P(v, w, x, y, z, ...) / P(x)

- both marginal and conditional (e.g.):

  P(x | y) = ∑_{v,w,z,...} P(v, w, x, z, ... | y)

Page 12:

- Chain rule:

  P(v, w, x, y, z, ...) = P(v) P(w|v) P(x|v, w) P(y|v, w, x) P(z|v, w, x, y) P(...|v, w, x, y, z)

- Complete independence between all variables if and only if:

  P(v, w, x, y, z, ...) = P(v) P(w) P(x) P(y) P(z) P(...)

- Conditional independence (e.g.: if we know the value of z, then y does not give any additional information about x):

  P(x, y | z) = P(x|z) P(y|z)

  This is also written X ⊥⊥ Y | Z, or, explicitly noting the distribution, (X ⊥⊥ Y | Z)_P.

Page 13:

- In general, we can say that marginal distributions are conditional on not knowing the value of other variables:

  P(x) = P(x | ∅)

  and (marginal) independence is independence conditional on not observing other variables:

  P(x, y | ∅) = P(x | ∅) P(y | ∅)

- Example of conditional independence: Drownings and ice-cream sales. These are mutually dependent (both happen during warm weather) but are, at least approximately, conditionally independent given the weather.

Page 14:

- Example: conditional dependence. Two coin tosses and a bell that rings whenever they get the same result. The coins are marginally independent but conditionally dependent given the bell!

  X: first coin toss,  Y: second coin toss,  Z: bell

  P(x, y):       Y = H   Y = T
       X = H     0.25    0.25
       X = T     0.25    0.25      (independent)

  P(x, y | Z = 'bell rang'):
                 Y = H   Y = T
       X = H     0.5     0
       X = T     0       0.5       (dependent!)
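The claim that the coins are marginally independent but conditionally dependent given the bell can be verified by brute-force enumeration (a sketch; the helper `P` and the state names are my own):

```python
from itertools import product

# X, Y are fair coin tosses; Z ('bell') rings iff X == Y.
joint = {}
for x, y in product('HT', repeat=2):
    z = 'bell' if x == y else 'silent'
    joint[(x, y, z)] = 0.25

def P(**fixed):
    """Probability that the named variables (x, y, z) take the given values."""
    names = ('x', 'y', 'z')
    return sum(p for key, p in joint.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))

# Marginally independent: P(x, y) = P(x) P(y)
assert abs(P(x='H', y='H') - P(x='H') * P(y='H')) < 1e-12

# But conditionally dependent given the bell:
pxy_given_z = P(x='H', y='H', z='bell') / P(z='bell')
px_given_z = P(x='H', z='bell') / P(z='bell')
py_given_z = P(y='H', z='bell') / P(z='bell')
print(pxy_given_z, px_given_z * py_given_z)  # 0.5 vs 0.25: not equal, so dependent
```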

Page 15:

• Data generation, sampling

- Given some P(x), how can we draw samples (generate data) from that distribution?

  Answer: Divide the unit interval [0, 1] into parts whose lengths correspond to the probabilities, draw a uniformly distributed number from the interval, and select the state into whose part it fell.

  [Figure: the unit interval [0, 1] partitioned into segments of lengths P(x₁), ..., P(x₆); the draw 0.30245... falls in the second segment, so X := x₂]
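The unit-interval construction above translates almost literally into code (a minimal sketch; `sample` is a hypothetical helper name):

```python
import random
from collections import Counter

def sample(states, probs, rng):
    """Place the probabilities end to end on [0, 1], draw a uniform number,
    and return the state whose segment it falls into."""
    u = rng.random()
    cumulative = 0.0
    for s, p in zip(states, probs):
        cumulative += p
        if u < cumulative:
            return s
    return states[-1]  # guard against float round-off when u is very close to 1

rng = random.Random(0)
states, probs = ['x1', 'x2', 'x3'], [0.2, 0.5, 0.3]
draws = Counter(sample(states, probs, rng) for _ in range(100000))
print(draws['x2'] / 100000)  # ~0.5, matching P(x2)
```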

Page 16:

• Given a joint distribution P(x, y, z), how can we draw samples (generate data)?

- We could list all joint states and then proceed as above, or...

- Draw the values sequentially from conditional distributions:

  1. First draw x from P(x)

  2. Next draw y from P(y|x)

  3. Finally draw z from P(z|x, y)

  Note: We can freely choose any ordering of the variables!
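The sequential scheme above can be sketched for the coins-and-bell example (ancestral sampling; the helper names are my own):

```python
import random

rng = random.Random(1)

def draw(dist, rng):
    """Draw from a discrete distribution given as {state: probability} (hypothetical helper)."""
    u, c = rng.random(), 0.0
    for s, p in dist.items():
        c += p
        if u < c:
            return s
    return s  # float round-off guard

# Sequential sampling for P(x, y, z) = P(x) P(y|x) P(z|x, y),
# illustrated on the coins-and-bell example: z = 'bell' iff x == y.
def sample_xyz(rng):
    x = draw({'H': 0.5, 'T': 0.5}, rng)
    y = draw({'H': 0.5, 'T': 0.5}, rng)   # here P(y|x) = P(y): the coins are independent
    z = 'bell' if x == y else 'silent'    # P(z|x, y) is deterministic
    return x, y, z

n = 100000
bell_freq = sum(sample_xyz(rng)[2] == 'bell' for _ in range(n)) / n
print(bell_freq)  # ~0.5, matching P(Z = 'bell')
```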

Page 17:

Example (continued): Two coin tosses and a bell that rings if and only if the two tosses give the same result

- We can draw all the variables simultaneously by listing all joint states, calculating their probabilities, placing them on the unit interval, and then drawing the joint state

- We can first independently generate the coin tosses, then assign the bell

- We can first draw one coin toss and the bell, and then assign the second coin toss

Page 18:

• Numerical random variables

- Expectation:

  E{X} = ∑_x x P(x)   (discrete)

  E{X} = ∫ x p(x) dx   (continuous)

- Variance:

  Var(X) = σ_X² = σ_XX = E{(X - E{X})²}

- Covariance:

  Cov(X, Y) = σ_XY = E{(X - E{X})(Y - E{Y})}

- Correlation coefficient:

  ρ_XY = σ_XY / √(σ_X² σ_Y²)

Page 19:

- Multivariate numerical random variables... (random vectors)

  Expectation:

  E{V} = (E{V₁}, E{V₂}, ..., E{V_N})^T

  Covariance matrix ('variance-covariance matrix'):

  C_V = Σ_V = E{(V - E{V})(V - E{V})^T}

      = [ Var(V₁)       ...  Cov(V₁, V_N) ]     [ σ_V₁V₁   ...  σ_V₁V_N  ]
        [   ...         ...     ...       ]  =  [   ...    ...    ...    ]
        [ Cov(V_N, V₁)  ...  Var(V_N)     ]     [ σ_V_NV₁  ...  σ_V_NV_N ]
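The covariance-matrix definition can be checked against simulated data (a sketch assuming NumPy; the chosen 2-dimensional distribution is my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw N samples of a 2-dimensional random vector V = (V1, V2)
# with V2 = V1 + independent noise, so the components are correlated.
N = 200000
v1 = rng.normal(0.0, 1.0, N)
v2 = v1 + rng.normal(0.0, 1.0, N)
V = np.stack([v1, v2])  # shape (2, N): one column per sample

# Sample estimate of C_V = E{(V - E{V})(V - E{V})^T}
m = V.mean(axis=1, keepdims=True)
C = (V - m) @ (V - m).T / N

print(np.round(C, 1))  # close to the true covariance matrix [[1, 1], [1, 2]]
```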

Page 20:

• Conditional expectation, variance, covariance, correlation

- Conditional expectation (note: a function of y!):

  E{X | Y = y} = ∑_x x P(x|y)   (discrete)

  E{X | Y = y} = ∫ x p(x|y) dx   (continuous)

- Conditional variance (note: a function of y!):

  Var(X | Y = y) = σ²_X|y = σ_XX|y = E{(X - E{X})²}, expectations under P(X | Y = y)

- Conditional covariance (note: a function of z!):

  Cov(X, Y | z) = σ_XY|z = E{(X - E{X})(Y - E{Y})}, expectations under P(X, Y | Z = z)

- Conditional correlation coefficient (note: a function of z!):

  ρ_XY|z = σ_XY|z / √(σ²_X|z σ²_Y|z)

Page 21:

- Multivariate Gaussian ('normal') density:

  p(x) = N(μ, Σ)

  [Figure: contours of a 2-dimensional Gaussian density over (x₁, x₂)]

  has the following properties:

  • mean vector μ and covariance matrix Σ as the only parameters

[Excerpt: "gaussian identities", sam roweis (revised July 1999)]

0.1 multidimensional gaussian

a d-dimensional gaussian (normal) density for x is:

  N(μ, Σ) = (2π)^{-d/2} |Σ|^{-1/2} exp(-½ (x - μ)^T Σ^{-1} (x - μ))   (1)

it has entropy:

  S = ½ log₂((2πe)^d |Σ|) - const  bits   (2)

where Σ is a symmetric positive semi-definite covariance matrix and the (unfortunate) constant is the log of the units in which x is measured over the "natural units"

0.2 linear functions of a normal vector

no matter how x is distributed,

  E[Ax + y] = A(E[x]) + y   (3a)

  Covar[Ax + y] = A(Covar[x])A^T   (3b)

in particular this means that for normally distributed quantities:

  x ~ N(μ, Σ) ⇒ (Ax + y) ~ N(Aμ + y, AΣA^T)   (4a)

  x ~ N(μ, Σ) ⇒ Σ^{-1/2}(x - μ) ~ N(0, I)   (4b)

  x ~ N(μ, Σ) ⇒ (x - μ)^T Σ^{-1} (x - μ) ~ χ²_n   (4c)

Page 22:

• All marginal and conditional distributions are also Gaussian, and the conditional (co)variances do not depend on the values of the conditioning variables:

Let x and y be random vectors of dimensions n and m. If they are joined together into one random vector z = (x^T, y^T)^T of dimension n + m, then its mean m_z and covariance matrix C_z are

  m_z = [ m_x ]        C_z = [ C_x   C_xy ]
        [ m_y ],              [ C_yx  C_y  ],    (1)

where m_x and m_y are the means of x and y, C_x and C_y are the covariance matrices of x and y respectively, and C_xy contains the cross-covariances.

If z is multivariate Gaussian then x and y are also Gaussian. Additionally, the conditional distributions p(x|y) and p(y|x) are Gaussian. The latter's mean and covariance matrix are

  m_y|x = m_y + C_yx C_x^{-1} (x - m_x)   (2)

  C_y|x = C_y - C_yx C_x^{-1} C_xy   (3)

Exercise: Let v be a Gaussian random vector over three variables (v₁, v₂, v₃)^T whose mean is m_v = E{v} = 0, and whose covariance matrix is

  C_v = E{vv^T} = [ 6  2  1 ]
                  [ 2  1  0 ]
                  [ 1  0  1 ].   (4)

Calculate the marginal distribution p(v₁, v₃). Are v₁ and v₃ independent? What is their correlation coefficient? Are v₂ and v₃ independent? Are v₂ and v₃ conditionally independent, given v₁?
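For readers who want to check their answers to the exercise numerically, formulas (1)-(3) applied to the given covariance matrix look like this (a sketch assuming NumPy; note that the printed values give away the answers):

```python
import numpy as np

# Covariance matrix from the exercise, eq. (4).
C = np.array([[6., 2., 1.],
              [2., 1., 0.],
              [1., 0., 1.]])

# Marginal covariance of (v1, v3): just pick rows/columns 0 and 2.
C13 = C[np.ix_([0, 2], [0, 2])]
rho13 = C13[0, 1] / np.sqrt(C13[0, 0] * C13[1, 1])
print(rho13)  # ~0.408 = 1/sqrt(6), nonzero, so v1 and v3 are dependent

# Covariance of v2 and v3 is zero; being jointly Gaussian, they are independent.
print(C[1, 2])  # 0.0

# Conditional covariance of (v2, v3) given v1, via C_y|x = C_y - C_yx C_x^{-1} C_xy:
Cy = C[np.ix_([1, 2], [1, 2])]
Cyx = C[np.ix_([1, 2], [0])]
cond = Cy - Cyx @ np.linalg.inv(C[np.ix_([0], [0])]) @ Cyx.T
print(np.round(cond, 3))  # off-diagonal -1/3: NOT conditionally independent given v1
```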

Page 23:

• The conditional variance, conditional covariance, and conditional correlation coefficient, for the Gaussian distribution, are known as the partial variance σ²_X·Z, partial covariance σ_XY·Z, and partial correlation coefficient ρ_XY·Z (respectively)

• These can of course always be computed directly from the covariance matrix (regardless of whether the distribution actually is Gaussian!)...

...but they can only be safely interpreted as conditional variance, conditional covariance, and conditional correlation coefficient (respectively) for the Gaussian distribution.

Page 24:

• For the Gaussian distribution:

  zero partial covariance ⇔ zero conditional covariance ⇔ conditional independence

  i.e.  (σ_XY·Z = 0) ⇔ (∀z: σ_XY|z = 0) ⇔ (X ⊥⊥ Y | Z)

• In general, we only have a one-way implication:

  zero conditional covariance ⇐ conditional independence

  i.e.  (∀z: σ_XY|z = 0) ⇐ (X ⊥⊥ Y | Z)

  Note, however, that conditional independence does not imply zero partial covariance in the completely general case!

Page 25:

Linear regression:

• Fit a line through the data, explaining how y varies with x:

  y = r_yx x + ε_y

• Minimizing the sum of squared errors between the observed y and the fitted value ŷ = r_yx x gives

  r_yx = σ_XY / σ_X²

• Probabilistic interpretation (note that this is true only for roughly linear relationships):

  ŷ ≈ E{Y | X = x}
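The identity r_yx = σ_XY / σ_X² can be confirmed on simulated, roughly linear data (a sketch assuming NumPy; the true slope 2.0 is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulate roughly linear data: y = 2 x + noise.
n = 100000
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)

# The least-squares slope equals Cov(X, Y) / Var(X).
r_yx = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(round(r_yx, 2))  # ~2.0, recovering the true slope
```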

Page 26:

• Note the symmetry: We could equally well regress x on y!

  x = r_xy y + ε_x

Page 27:

• Multivariate linear regression:

  z = ax + by + ε_z

  a = r_zx·y = σ_ZX·Y / σ²_X·Y

Note that the partial regression coefficient r_zx·y is NOT the same, in general, as the coefficient r_zx one gets from regressing z on x, ignoring y.

Note also that r_zx·y is derived from the partial (co)variances. This holds regardless of the form of the underlying distribution.
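The warning that r_zx·y differs from r_zx can be demonstrated on simulated data with correlated regressors (a sketch assuming NumPy; all coefficients are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200000
# Correlated regressors: y shares a component with x.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 1.0 * x + 2.0 * y + rng.normal(size=n)

# Multivariate least squares: solve for (a, b) in z = a x + b y + eps.
A = np.stack([x, y], axis=1)
a, b = np.linalg.lstsq(A, z, rcond=None)[0]

# Simple regression of z on x alone: r_zx = Cov(Z, X) / Var(X).
r_zx = np.cov(z, x)[0, 1] / np.var(x, ddof=1)

print(round(a, 2), round(r_zx, 2))  # ~1.0 vs ~2.6: the two coefficients differ
```

The simple coefficient absorbs the effect of y (which is correlated with x), while the partial coefficient isolates the direct contribution of x.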