CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
M. Soleymani
Review (Probability & Linear Algebra)
Outline
- Axioms of probability theory
- Conditional probability, joint probability, Bayes rule
- Discrete and continuous random variables
- Probability mass and density functions
- Expected value, variance, standard deviation
- Expectation for two variables: covariance, correlation
- Some probability distributions: Gaussian distribution
- Linear Algebra
Basic Probability Elements
- Sample space (Ω): the set of all possible outcomes (or worlds). Outcomes are assumed to be mutually exclusive.
- An event is a certain set of outcomes (i.e., a subset of Ω).
- A random variable is a function defined over the sample space, e.g.
  Gender: Ω → {male, female}
  Height: Ω → ℝ
Probability Space
A probability space is defined as a triple (Ω, F, P):
- A sample space Ω ≠ ∅ that contains the set of all possible outcomes (outcomes are also called states of nature).
- A set F whose elements are called events. The events are subsets of Ω; F should be a "Borel field".
- P represents the probability measure that assigns probabilities to events.
Probability and Logic
Probability theory can be viewed as a generalization of propositional logic (probabilistic logic):
- a is a propositional logic sentence.
- The belief of an agent in a is no longer restricted to true, false, or unknown.
- P(a) can range from 0 to 1.
Probability Axioms (Kolmogorov)
Axioms define a reasonable theory of uncertainty. Kolmogorov's probability axioms (propositional notation):
- P(a) ≥ 0  (∀a)
- If a is a tautology, P(a) = 1
- P(a ∨ b) = P(a) + P(b) − P(a ∧ b)  (∀a, b)
[Figure: Venn diagram of events a and b inside the set of all worlds (True), with overlap a ∧ b]
Random Variables
Random variables are the variables of probability theory.
- Domain of random variables: Boolean, discrete, or continuous.
- Probability distribution: the function describing the probabilities of the possible values of a random variable, e.g., P(X = x_1) = p_1, P(X = x_2) = p_2, …
Random Variables
A random variable is a function that maps every outcome in Ω to a real (or complex) number. This makes it easy to define probabilities as functions on (real) numbers, along with expectation, variance, …
[Figure: coin-toss outcomes mapped to the real line, e.g., Tail → 0 and Head → 1]
Probabilistic Inference
- Joint probability distribution: specifies the probability of every atomic event; every probability can be found from it (by summing over atomic events).
- Prior and posterior probabilities: belief in the absence or presence of evidence.
- Bayes' rule: used when it is difficult to compute P(a|b) but we have information about P(b|a).
- Independence: new evidence may be irrelevant.
Joint Probability Distribution
- Probability of all combinations of the values for a set of random variables.
- If two or more random variables are considered together, they can be described in terms of their joint probability.
- Example: joint probability of features P(X_1, X_2, …, X_d)
Prior and Posterior Probabilities
- Prior or unconditional probabilities of propositions: belief in the absence of any other evidence, e.g.
  P(Cavity = true) = 0.14,  P(Weather = sunny) = 0.6
- Posterior or conditional probabilities: belief in the presence of evidence; P(a|b) is the probability of a given that b is true, e.g.
  P(Cavity = true | Toothache = true)
Conditional Probability
P(a|b) = P(a ∧ b) / P(b),  if P(b) ≠ 0
- Product rule (an alternative formulation): P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
- P(·|b) obeys the same rules as probabilities, e.g., P(a|b) + P(¬a|b) = 1
Conditional Probability
- For statistically dependent variables, knowing the value of one variable may allow us to better estimate the other.
- All probabilities are in effect conditional probabilities, e.g., P(a) = P(a|Ω).
- P(·|b) renormalizes the probabilities of the events that occur jointly with b (the sample space shrinks from Ω to b).
Conditional Probability: Example
Rolling a fair die:
- a: the outcome is an even number
- b: the outcome is a prime number
P(a|b) = P(a ∧ b) / P(b) = (1/6) / (1/2) = 1/3
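For a finite sample space, such conditional probabilities can be checked by enumeration; a minimal Python sketch of the die example above:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}            # sample space of a fair die
a = {x for x in omega if x % 2 == 0}  # event: outcome is even
b = {2, 3, 5}                         # event: outcome is prime

def prob(event):
    # Each outcome is equally likely, so P(E) = |E| / |Omega|.
    return Fraction(len(event), len(omega))

# P(a|b) = P(a and b) / P(b)
p_a_given_b = prob(a & b) / prob(b)
print(p_a_given_b)  # 1/3
```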
Chain Rule
The chain rule is derived by successive application of the product rule:
P(x_1, …, x_n) = P(x_1, …, x_{n−1}) P(x_n | x_1, …, x_{n−1})
             = P(x_1, …, x_{n−2}) P(x_{n−1} | x_1, …, x_{n−2}) P(x_n | x_1, …, x_{n−1})
             = … = P(x_1) Π_{i=2}^{n} P(x_i | x_1, …, x_{i−1})
Law of Total Probability
If B_1, …, B_n are mutually exclusive events whose union is Ω, and A is an event (subset of Ω):
P(A) = P(A|B_1) P(B_1) + P(A|B_2) P(B_2) + ⋯ + P(A|B_n) P(B_n)
Independence
Propositions a and b are independent iff
P(a|b) = P(a) ⇔ P(b|a) = P(b) ⇔ P(a, b) = P(a) P(b)
Knowing b tells us nothing about a (and vice versa).
Bayes Rule
Bayes rule is obtained from the product rule P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a):
P(a|b) = P(b|a) P(a) / P(b)
In some problems it may be difficult to compute P(a|b) directly, yet we might have information about P(b|a).
Bayes Rule
Often it is useful to derive the rule a bit further, expanding the denominator with the law of total probability:
P(a|b) = P(b|a) P(a) / Σ_{a′} P(b|a′) P(a′)
Bayes Rule: Example
Meningitis (m) and stiff neck (s):
P(s|m) = 0.7,  P(s) = 0.01,  P(m) = 0.00002
P(m|s) = P(s|m) P(m) / P(s) = 0.7 × 0.00002 / 0.01 = 0.0014
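A quick arithmetic check of this example in plain Python (the variable names are just for illustration):

```python
p_s_given_m = 0.7   # P(stiff neck | meningitis)
p_m = 0.00002       # prior P(meningitis)
p_s = 0.01          # P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s  # Bayes rule
print(p_m_given_s)  # ~0.0014
```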
Bayes Rule: General Form
For mutually exclusive events A_1, …, A_n whose union is Ω:
P(A_i|B) = P(B|A_i) P(A_i) / Σ_{j=1}^{n} P(B|A_j) P(A_j)
Calculating Probabilities from Joint Distribution
Useful techniques:
- Marginalization: P(X = x) = Σ_{y∈Y} P(x, y)
- Conditioning: P(X = x) = Σ_{y∈Y} P(x|y) P(y)
- Normalization: P(X|y) = P(X, y) / Σ_{x∈X} P(x, y)
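A minimal numpy sketch of these three operations on a toy 2×2 joint table (the numbers are made up for illustration):

```python
import numpy as np

# Joint distribution P(X, Y) as a table: rows index x, columns index y.
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_x = joint.sum(axis=1)              # marginalization: P(x) = sum_y P(x, y)
p_y = joint.sum(axis=0)              # marginalization over x
p_x_given_y0 = joint[:, 0] / p_y[0]  # normalization: P(x | y=0)

# Conditioning recovers the marginal: P(x) = sum_y P(x|y) P(y)
p_x_again = (joint / p_y) @ p_y
print(p_x, p_x_again)                # both [0.4 0.6]
```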
Probability Mass Function (PMF)
- The Probability Mass Function (PMF) gives the probability of each value of a discrete random variable.
- Each impulse magnitude is equal to the probability of the corresponding outcome:
  P(X = x) ≥ 0,  Σ_{x∈X} P(X = x) = 1
- Example: PMF of a fair die: P(X = x) = 1/6 for x ∈ {1, …, 6}
Probability Density Function (PDF)
- The Probability Density Function (PDF) is defined for continuous random variables.
- The probability that X falls in a small interval of width Δ around x_0 is approximately p(x_0) × Δ:
  p_X(x_0) = lim_{Δ→0} P(x_0 − Δ/2 < X ≤ x_0 + Δ/2) / Δ
- p(x) ≥ 0,  ∫ p(x) dx = 1
Cumulative Distribution Function (CDF)
- Defined as the integral of the PDF (similarly defined for discrete variables, with a summation instead of an integration):
  F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} p_X(u) du
- Properties:
  - Non-decreasing, right continuous
  - F(−∞) = 0, F(∞) = 1
  - p(x) = dF(x)/dx
  - P(u < X ≤ v) = F_X(v) − F_X(u)
Distribution Statistics
Basic descriptors of distributions:
- Mean value
- Variance & standard deviation
- Moments
- Covariance & correlation
Expected Value
Expected (or mean) value: the weighted average of all possible values of the random variable.
- Expectation of a discrete random variable X: E[X] = Σ_x x P(x)
- Expectation of a function of a discrete random variable X: E[g(X)] = Σ_x g(x) P(x)
- Expected value of a continuous random variable X: E[X] = ∫ x p_X(x) dx
- Expectation of a function of a continuous random variable X: E[g(X)] = ∫ g(x) p_X(x) dx
Variance
Variance: a measure of how far the values of a random variable are spread out around its expected value μ_X = E[X]:
Var(X) = σ_X^2 = E[(X − μ_X)^2] = E[X^2] − μ_X^2
Standard deviation: the square root of the variance, σ_X = √Var(X)
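A small numpy sketch computing the mean and variance of the fair-die PMF from the definitions above:

```python
import numpy as np

values = np.arange(1, 7)   # faces of a fair die
pmf = np.full(6, 1 / 6)    # P(X = x) = 1/6

mean = np.sum(values * pmf)               # E[X] = sum_x x P(x)
var = np.sum((values - mean) ** 2 * pmf)  # Var(X) = E[(X - mu)^2]
# Equivalent form: E[X^2] - mu^2
var_alt = np.sum(values ** 2 * pmf) - mean ** 2

print(mean, var, var_alt)  # 3.5, ~2.9167, ~2.9167
```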
Moments
- n-th order moment of a random variable X: E[X^n]
- Normalized (mean-removed) n-th order moment: M_n = E[(X − μ_X)^n]
- The first-order moment is the mean value.
- The second-order moment is the variance plus the square of the mean: E[X^2] = σ_X^2 + μ_X^2
Correlation & Covariance
- Correlation: Crr(X, Y) = E[XY]
  For discrete RVs: E[XY] = Σ_{x,y} x y P(x, y)
- Covariance is the correlation of the mean-removed variables:
  Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y
Covariance: Example
[Figure: two scatter plots of samples, one with Cov(X, Y) = 0 and one with Cov(X, Y) = 0.9]
Covariance Properties
The covariance value shows the tendency of the pair of RVs to increase together:
- Cov(X, Y) > 0: X and Y tend to increase together
- Cov(X, Y) < 0: Y tends to decrease when X increases
- Cov(X, Y) = 0: no linear correlation between X and Y
Orthogonal, Uncorrelated & Independent RVs
- Orthogonal random variables: Crr(X, Y) = E[XY] = 0
- Uncorrelated random variables: Cov(X, Y) = 0
- Independent random variables: P(X, Y) = P(X) P(Y)
- Cov(X, Y) = 0 ⇏ independent random variables
Pearson's Product Moment Correlation
ρ_{XY} = Cov(X, Y) / (σ_X σ_Y)
- Defined only if both σ_X and σ_Y are finite and nonzero.
- ρ_{XY} shows the degree of linear dependence between X and Y.
- −1 ≤ ρ_{XY} ≤ 1, since |E[(X − μ_X)(Y − μ_Y)]| ≤ σ_X σ_Y by the Cauchy-Schwarz inequality (|E[UV]| ≤ √(E[U^2] E[V^2])).
- ρ_{XY} = 1 shows a perfect positive linear relationship and ρ_{XY} = −1 shows a perfect negative linear relationship.
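A short numpy sketch estimating covariance and Pearson correlation from samples (synthetic data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.9 * x + rng.normal(scale=0.5, size=1000)  # linearly related to x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(X, Y)
rho = cov_xy / (x.std() * y.std())                 # Pearson correlation

print(cov_xy, rho)
print(np.corrcoef(x, y)[0, 1])  # library check, matches rho
```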
Pearson's Correlation: Examples
[Figure: scatter plots labeled with their correlation coefficients ρ = ρ(X, Y); from Wikipedia]
Covariance Matrix
If X = [X_1, …, X_d]^T is a vector of random variables (d-dim random vector), the covariance matrix indicates the tendency of each pair of RVs to vary together:
Σ = E[(X − μ)(X − μ)^T] =
[ E[(X_1 − μ_1)(X_1 − μ_1)] ⋯ E[(X_1 − μ_1)(X_d − μ_d)]
  ⋮                          ⋱ ⋮
  E[(X_d − μ_d)(X_1 − μ_1)] ⋯ E[(X_d − μ_d)(X_d − μ_d)] ]
where μ = E[X] = [E[X_1], …, E[X_d]]^T = [μ_1, …, μ_d]^T
Covariance Matrix: Two Variables
σ_12 = σ_21 = Cov(X, Y)
[Figure: sample scatter plots for Σ = [1 0; 0 1] and Σ = [1 0.9; 0.9 1]]
Covariance Matrix
σ_ij shows the covariance of X_i and X_j:
σ_ij = Cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)]
Σ = [ σ_1^2 σ_12  ⋯ σ_1d
      σ_21  σ_2^2 ⋯ σ_2d
      ⋮     ⋮     ⋱ ⋮
      σ_d1  σ_d2  ⋯ σ_d^2 ]
Sums of Random Variables
Z = X + Y
- Mean: μ_Z = μ_X + μ_Y
- Variance: Var(Z) = Var(X) + Var(Y) + 2 Cov(X, Y)
  If X, Y are independent: Var(Z) = Var(X) + Var(Y)
- Distribution: p_Z(z) = ∫ p_{X,Y}(x, z − x) dx
  If X, Y are independent, this becomes the convolution p_Z(z) = ∫ p_X(x) p_Y(z − x) dx
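The discrete analogue of the convolution formula can be checked directly; this numpy sketch sums two independent fair dice:

```python
import numpy as np

pmf_die = np.full(6, 1 / 6)  # P(X = 1..6) for one fair die

# Sum of two independent dice: the PMF is the convolution of the PMFs.
pmf_sum = np.convolve(pmf_die, pmf_die)  # supports values 2..12

print(pmf_sum)        # peaks at 7 with probability 6/36
print(pmf_sum.sum())  # 1.0
```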
Some Famous Probability Mass and Density Functions
- Uniform: X ~ U(a, b)
  p(x) = (1/(b − a)) [u(x − a) − u(x − b)]  (u: unit step function)
- Gaussian (Normal): X ~ N(μ, σ^2)
  p(x) = (1/(√(2π) σ)) exp(−(x − μ)^2 / (2σ^2)) = N(μ, σ^2)
[Figure: the uniform density of height 1/(b − a) on [a, b], and the Gaussian bell curve centered at μ]
Some Famous Probability Mass and Density Functions
- Binomial: X ~ B(n, p)
  P(X = k) = C(n, k) p^k (1 − p)^{n−k}
- Exponential:
  p(x) = λ e^{−λx} for x ≥ 0, and 0 for x < 0; equivalently p(x) = λ e^{−λx} u(x)
[Figure: a binomial PMF and an exponential density]
Gaussian (Normal) Distribution
- About 68% of the probability mass lies within [μ − σ, μ + σ], and about 95% within [μ − 2σ, μ + 2σ].
- It is widely used to model the distribution of continuous variables.
- Standard Normal distribution: N(0, 1)
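These coverage figures follow from the Gaussian CDF, Φ(x) = (1/2)(1 + erf(x/√2)); a quick check using only the Python standard library:

```python
from math import erf, sqrt

def std_normal_cdf(x):
    # CDF of N(0, 1) expressed through the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for k in (1, 2):
    coverage = std_normal_cdf(k) - std_normal_cdf(-k)
    print(f"P(|X - mu| <= {k} sigma) = {coverage:.4f}")
# ~0.6827 and ~0.9545
```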
Multivariate Gaussian Distribution
x is a vector of d Gaussian variables:
p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2) (x − μ)^T Σ^{−1} (x − μ)) = N(μ, Σ)
μ = E[x]
Σ = E[(x − μ)(x − μ)^T]
Mahalanobis distance: r^2 = (x − μ)^T Σ^{−1} (x − μ)
[Figure: probability density surface of a bivariate Gaussian over (x1, x2)]
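A compact numpy sketch of the density formula and the Mahalanobis distance (the example μ and Σ are arbitrary):

```python
import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

def mvn_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    # Mahalanobis distance squared: (x - mu)^T Sigma^{-1} (x - mu)
    r2 = diff @ np.linalg.solve(sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * r2) / norm

x = np.array([1.0, 1.0])
print(mvn_pdf(x, mu, sigma))
```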
Multivariate Gaussian Distribution
- The covariance matrix is always symmetric and positive semi-definite.
- A multivariate Gaussian is completely specified by d + d(d + 1)/2 parameters.
- Special cases of Σ:
  - Σ = σ^2 I: independent random variables with the same variance (circularly symmetric Gaussian)
  - Diagonal matrix Σ = diag(σ_1^2, …, σ_d^2): independent random variables with different variances
Multivariate Gaussian Distribution: Level Surfaces
- The Gaussian density is constant on surfaces in x-space for which
  (x − μ)^T Σ^{−1} (x − μ) = c  (hyper-ellipsoids)
- The principal axes of the hyper-ellipsoids are the eigenvectors of Σ.
- Bivariate Gaussian: curves of constant density are ellipses.
Bivariate Gaussian Distribution
λ_1 and λ_2 are the eigenvalues of Σ = Cov(x), and u_1 and u_2 are the corresponding eigenvectors.
[Figure: constant-density ellipse centered at μ with principal axes along u_1 and u_2, with lengths proportional to √λ_1 and √λ_2]
Linear Transformations on Multivariate Gaussian
If x ~ N(μ, Σ) and y = A^T x, then y ~ N(A^T μ, A^T Σ A).
A_w = Φ Λ^{−1/2}: whitening transform (Φ: eigenvectors of Σ as columns, Λ: diagonal matrix of its eigenvalues); the transformed variable y = A_w^T x has identity covariance.
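A minimal numpy sketch of whitening, assuming synthetic sample data:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
x = rng.multivariate_normal(mean=[0, 0], cov=sigma, size=5000)

# Eigendecomposition of the sample covariance: Sigma = Phi Lambda Phi^T
lam, phi = np.linalg.eigh(np.cov(x, rowvar=False))
a_w = phi @ np.diag(lam ** -0.5)  # whitening transform A_w = Phi Lambda^{-1/2}

y = x @ a_w                              # y = A_w^T x for each sample row
print(np.cov(y, rowvar=False).round(2))  # ~ identity matrix
```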
Gaussian Distribution Properties
Some attractive properties of the Gaussian distribution:
- Marginal and conditional distributions of a Gaussian are also Gaussian.
- After a linear transformation, a Gaussian distribution is again Gaussian; in particular, there exists a linear transformation that diagonalizes the covariance matrix (whitening transform) and converts the multivariate normal distribution into a spherical one.
- The Gaussian maximizes the entropy among distributions with a given mean and variance.
- The Gaussian is stable and infinitely divisible.
- Central Limit Theorem: some distributions can be approximated by a Gaussian when their parameter value is sufficiently large (e.g., Binomial).
Central Limit Theorem
(Under mild conditions.) Suppose X_i (i = 1, …, N) are i.i.d. (independent, identically distributed) RVs with finite variances, and let S_N be the sum of these RVs:
S_N = Σ_{i=1}^{N} X_i
The distribution of S_N converges to a normal distribution as N increases, regardless of the distribution of the RVs.
Example: X_i ~ uniform, i = 1, …, N
[Figure: distribution of S_N for increasing N, approaching a Gaussian shape]
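A quick empirical illustration with numpy (the sample sizes and set of N values are arbitrary): as N grows, the fraction of standardized sums within one standard deviation approaches the Gaussian value ≈ 0.6827.

```python
import numpy as np

rng = np.random.default_rng(2)
for n in (1, 2, 10, 50):
    # 100000 realizations of S_N = sum of n uniform(0, 1) variables
    s = rng.uniform(size=(100_000, n)).sum(axis=1)
    s = (s - s.mean()) / s.std()  # standardize
    # Fraction within one standard deviation; ~0.6827 for a Gaussian
    print(n, np.mean(np.abs(s) <= 1).round(4))
```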
Linear Algebra: Basic Definitions
- Matrix A:
  A = [a_ij]_{m×n} =
  [ a_11 a_12 … a_1n
    a_21 a_22 … a_2n
    …    …    …  …
    a_m1 a_m2 … a_mn ]
- Matrix transpose: B = A^T, where b_ij = a_ji (1 ≤ i ≤ n, 1 ≤ j ≤ m)
- Symmetric matrix: A^T = A
- Vector: a = [a_1, …, a_n]^T
Linear Mapping
A linear function f satisfies:
- f(x + y) = f(x) + f(y)  ∀x, y ∈ V
- f(αx) = α f(x)  ∀x ∈ V, α ∈ ℝ
In general, a matrix A_{m×n} = [a_ij] can be used to denote a linear map f: ℝ^n → ℝ^m where y = Ax, i.e., y_i = a_i1 x_1 + ⋯ + a_in x_n.
Linear Algebra: Basic Definitions
- Inner (dot) product: a · b = a^T b = Σ_{i=1}^{n} a_i b_i
- Matrix multiplication: for A = [a_ij]_{m×p} and B = [b_ij]_{p×n},
  AB = C = [c_ij]_{m×n},  c_ij = a_i^T b_j
  where a_i^T is the i-th row of A and b_j is the j-th column of B.
Inner Product
- Inner (dot) product: a^T b = Σ_{i=1}^{n} a_i b_i
- Length (Euclidean norm) of a vector: ||a|| = √(a^T a) = √(Σ_{i=1}^{n} a_i^2)
  a is normalized iff ||a|| = 1
- Angle between vectors a and b: cos θ = a^T b / (||a|| ||b||)
- Orthogonal vectors a and b: a^T b = 0
- Orthonormal set of vectors u_1, u_2, …, u_n: ∀i, j: u_i^T u_j = 1 if i = j, and 0 otherwise
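These operations in numpy (the vectors are chosen arbitrarily):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, -3.0])

length_a = np.sqrt(a @ a)  # Euclidean norm, same as np.linalg.norm(a)
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(length_a)   # 5.0
print(cos_theta)  # 0.0 -> a and b are orthogonal
```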
Linear Independence
A set of vectors {v_1, …, v_n} is linearly independent if no vector is a linear combination of the other vectors:
c_1 v_1 + c_2 v_2 + ⋯ + c_n v_n = 0 ⇒ c_1 = c_2 = ⋯ = c_n = 0
Matrix Determinant and Trace
- Determinant: det(AB) = det(A) × det(B)
  Cofactor expansion: det(A) = Σ_{j=1}^{n} a_ij A_ij for any i = 1, …, n,
  where A_ij = (−1)^{i+j} det(M_ij) and M_ij is the minor obtained by deleting row i and column j.
- Trace: tr[A] = Σ_{j=1}^{n} a_jj
Matrix Inversion
- Inverse of A_{n×n}: AB = BA = I_n ⇒ B = A^{−1}
- A^{−1} exists iff det(A) ≠ 0 (A is nonsingular)
  - Singular: det(A) = 0
  - Ill-conditioned: A is nonsingular but close to being singular
- Pseudo-inverse for a non-square matrix A: A# = (A^T A)^{−1} A^T, when A^T A is not singular; then A# A = I
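A numpy check of the pseudo-inverse formula on an arbitrary tall matrix:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # 3x2, so A^T A is 2x2 and invertible

pinv = np.linalg.inv(a.T @ a) @ a.T          # A# = (A^T A)^{-1} A^T
print((pinv @ a).round(10))                  # A# A = I (2x2 identity)
print(np.allclose(pinv, np.linalg.pinv(a)))  # matches numpy's pinv: True
```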
Matrix Rank
rank(A): the maximum number of linearly independent columns or rows of A.
- rank(A_{m×n}) ≤ min(m, n)
- A_{n×n} is full rank (rank(A) = n) iff A is nonsingular.
Eigenvectors and Eigenvalues
Av = λv,  v ≠ 0
- Characteristic equation: det(A − λI) = 0, an n-th order polynomial with n roots λ_1, …, λ_n
- tr(A) = Σ_{j=1}^{n} λ_j
- det(A) = Π_{j=1}^{n} λ_j
Eigenvectors and Eigenvalues: Symmetric Matrix
- For a symmetric matrix, the eigenvectors corresponding to distinct eigenvalues are orthogonal.
- These eigenvectors can be used to form an orthonormal set U = [u_1, …, u_n] (U^T U = I and U U^T = I).
Eigen Decomposition: Symmetric Matrix
AU = UΛ ⇒ A = UΛU^{−1} = UΛU^T
Eigen decomposition of a symmetric matrix: A = UΛU^T = Σ_{i=1}^{n} λ_i u_i u_i^T
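A numpy sketch verifying the decomposition (and the trace identity from the earlier slide) for an arbitrary symmetric matrix:

```python
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # symmetric

lam, u = np.linalg.eigh(a)  # eigenvalues (ascending) and orthonormal eigenvectors
print(np.allclose(u.T @ u, np.eye(2)))         # U^T U = I: True
print(np.allclose(u @ np.diag(lam) @ u.T, a))  # A = U Lambda U^T: True
print(lam.sum(), np.trace(a))                  # sum of eigenvalues = trace
```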
Positive Definite Matrix
- A symmetric matrix A_{n×n} is positive definite iff x^T A x > 0 for all x ≠ 0.
- The eigenvalues of a positive definite matrix are positive: ∀i, λ_i > 0.
Simple Vector Derivatives
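A brief sketch of the standard identities for differentiating with respect to a vector x (assumed typical content for this heading, not recovered from the slide):

```latex
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{a}^T \mathbf{x} \right) = \mathbf{a}, \qquad
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^T \mathbf{x} \right) = 2\mathbf{x}, \qquad
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^T A \mathbf{x} \right) = (A + A^T)\,\mathbf{x}
```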