CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
M. Soleymani
Review (Probability & Linear Algebra)
Outline
- Axioms of probability theory
- Conditional probability, joint probability, Bayes rule
- Discrete and continuous random variables
- Probability mass and density functions
- Expected value, variance, standard deviation
- Expectation for two variables: covariance, correlation
- Some probability distributions: Gaussian distribution
- Linear Algebra
Basic Probability Elements
- Sample space (Ω): the set of all possible outcomes (or worlds). Outcomes are assumed to be mutually exclusive.
- An event is a certain set of outcomes (i.e., a subset of Ω).
- A random variable is a function defined over the sample space, e.g.
  Gender: Ω → {male, female}
  Height: Ω → ℝ
Probability Space
A probability space is defined as a triple (Ω, F, P):
- A sample space Ω ≠ ∅ that contains the set of all possible outcomes (outcomes are also called states of nature).
- A set F whose elements are called events. The events are subsets of Ω; F should be a "Borel field".
- P represents the probability measure that assigns probabilities to events.
Probability and Logic
Probability theory can be viewed as a generalization of propositional logic (probabilistic logic):
- a is a propositional logic sentence.
- The belief of an agent in a is no longer restricted to true, false, or unknown.
- P(a) can range from 0 to 1.
Probability Axioms (Kolmogorov)
Axioms define a reasonable theory of uncertainty. Kolmogorov's probability axioms (propositional notation):
- P(a) ≥ 0  (∀a)
- If a is a tautology, P(a) = 1
- P(a ∨ b) = P(a) + P(b) − P(a ∧ b)  (∀a, b)
[Figure: Venn diagram of events a and b inside the set of all worlds (True), with overlap a ∧ b]
Random Variables
Random variables are the variables of probability theory.
- Domain of random variables: Boolean, discrete, or continuous.
- Probability distribution: the function describing the probabilities of the possible values of a random variable, e.g., P(X = x_1) = p_1, P(X = x_2) = p_2, …
Random Variables
A random variable is a function that maps every outcome in Ω to a real (or complex) number. This makes it easy to define probabilities as functions on (real) numbers, along with expectation, variance, …
[Figure: coin-toss outcomes mapped to the real line, e.g., Tail → 0 and Head → 1]
Probabilistic Inference
- Joint probability distribution: specifies the probability of every atomic event; every probability can be found from it (by summing over atomic events).
- Prior and posterior probabilities: belief in the absence or presence of evidence.
- Bayes' rule: used when it is difficult to compute P(a|b) but we have information about P(b|a).
- Independence: new evidence may be irrelevant.
Joint Probability Distribution
- Probability of all combinations of the values for a set of random variables.
- If two or more random variables are considered together, they can be described in terms of their joint probability.
- Example: joint probability of features P(X_1, X_2, …, X_d)
Prior and Posterior Probabilities
- Prior or unconditional probabilities of propositions: belief in the absence of any other evidence, e.g.
  P(Cavity = true) = 0.14,  P(Weather = sunny) = 0.6
- Posterior or conditional probabilities: belief in the presence of evidence; P(a|b) is the probability of a given that b is true, e.g.
  P(Cavity = true | Toothache = true)
Conditional Probability
P(a|b) = P(a ∧ b) / P(b),  if P(b) ≠ 0
- Product rule (an alternative formulation): P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
- P(·|b) obeys the same rules as probabilities, e.g., P(a|b) + P(¬a|b) = 1
Conditional Probability
- For statistically dependent variables, knowing the value of one variable may allow us to better estimate the other.
- All probabilities are in effect conditional probabilities, e.g., P(a) = P(a|Ω).
- P(·|b) renormalizes the probabilities of the events that occur jointly with b (the sample space shrinks from Ω to b).
Conditional Probability: Example
Rolling a fair die:
- a: the outcome is an even number
- b: the outcome is a prime number
P(a|b) = P(a ∧ b) / P(b) = (1/6) / (1/2) = 1/3
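For a finite sample space, such conditional probabilities can be checked by enumeration; a minimal Python sketch of the die example above:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}            # sample space of a fair die
a = {x for x in omega if x % 2 == 0}  # event: outcome is even
b = {2, 3, 5}                         # event: outcome is prime

def prob(event):
    # Each outcome is equally likely, so P(E) = |E| / |Omega|.
    return Fraction(len(event), len(omega))

# P(a|b) = P(a and b) / P(b)
p_a_given_b = prob(a & b) / prob(b)
print(p_a_given_b)  # 1/3
```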
Chain Rule
The chain rule is derived by successive application of the product rule:
P(x_1, …, x_n) = P(x_1, …, x_{n−1}) P(x_n | x_1, …, x_{n−1})
             = P(x_1, …, x_{n−2}) P(x_{n−1} | x_1, …, x_{n−2}) P(x_n | x_1, …, x_{n−1})
             = … = P(x_1) Π_{i=2}^{n} P(x_i | x_1, …, x_{i−1})
Law of Total Probability
If B_1, …, B_n are mutually exclusive events whose union is Ω, and A is an event (subset of Ω):
P(A) = P(A|B_1) P(B_1) + P(A|B_2) P(B_2) + ⋯ + P(A|B_n) P(B_n)
Independence
Propositions a and b are independent iff
P(a|b) = P(a) ⇔ P(b|a) = P(b) ⇔ P(a, b) = P(a) P(b)
Knowing b tells us nothing about a (and vice versa).
Bayes Rule
Bayes rule is obtained from the product rule P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a):
P(a|b) = P(b|a) P(a) / P(b)
In some problems it may be difficult to compute P(a|b) directly, yet we might have information about P(b|a).
Bayes Rule
Often it is useful to derive the rule a bit further, expanding the denominator with the law of total probability:
P(a|b) = P(b|a) P(a) / Σ_{a′} P(b|a′) P(a′)
Bayes Rule: Example
Meningitis (m) and stiff neck (s):
P(s|m) = 0.7,  P(s) = 0.01,  P(m) = 0.00002
P(m|s) = P(s|m) P(m) / P(s) = 0.7 × 0.00002 / 0.01 = 0.0014
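A quick arithmetic check of this example in plain Python (the variable names are just for illustration):

```python
p_s_given_m = 0.7   # P(stiff neck | meningitis)
p_m = 0.00002       # prior P(meningitis)
p_s = 0.01          # P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s  # Bayes rule
print(p_m_given_s)  # ~0.0014
```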
Bayes Rule: General Form
For mutually exclusive events A_1, …, A_n whose union is Ω:
P(A_i|B) = P(B|A_i) P(A_i) / Σ_{j=1}^{n} P(B|A_j) P(A_j)
Calculating Probabilities from Joint Distribution
Useful techniques:
- Marginalization: P(X = x) = Σ_{y∈Y} P(x, y)
- Conditioning: P(X = x) = Σ_{y∈Y} P(x|y) P(y)
- Normalization: P(X|y) = P(X, y) / Σ_{x∈X} P(x, y)
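A minimal numpy sketch of these three operations on a toy 2×2 joint table (the numbers are made up for illustration):

```python
import numpy as np

# Joint distribution P(X, Y) as a table: rows index x, columns index y.
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_x = joint.sum(axis=1)              # marginalization: P(x) = sum_y P(x, y)
p_y = joint.sum(axis=0)              # marginalization over x
p_x_given_y0 = joint[:, 0] / p_y[0]  # normalization: P(x | y=0)

# Conditioning recovers the marginal: P(x) = sum_y P(x|y) P(y)
p_x_again = (joint / p_y) @ p_y
print(p_x, p_x_again)                # both [0.4 0.6]
```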
Probability Mass Function (PMF)
- The Probability Mass Function (PMF) gives the probability of each value of a discrete random variable.
- Each impulse magnitude is equal to the probability of the corresponding outcome:
  P(X = x) ≥ 0,  Σ_{x∈X} P(X = x) = 1
- Example: PMF of a fair die: P(X = x) = 1/6 for x ∈ {1, …, 6}
Probability Density Function (PDF)
- The Probability Density Function (PDF) is defined for continuous random variables.
- The probability that X falls in a small interval of width Δ around x_0 is approximately p(x_0) × Δ:
  p_X(x_0) = lim_{Δ→0} P(x_0 − Δ/2 < X ≤ x_0 + Δ/2) / Δ
- p(x) ≥ 0,  ∫ p(x) dx = 1
Cumulative Distribution Function (CDF)
- Defined as the integral of the PDF (similarly defined for discrete variables, with a summation instead of an integration):
  F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} p_X(u) du
- Properties:
  - Non-decreasing, right continuous
  - F(−∞) = 0, F(∞) = 1
  - p(x) = dF(x)/dx
  - P(u < X ≤ v) = F_X(v) − F_X(u)
Distribution Statistics
Basic descriptors of distributions:
- Mean value
- Variance & standard deviation
- Moments
- Covariance & correlation
Expected Value
Expected (or mean) value: the weighted average of all possible values of the random variable.
- Expectation of a discrete random variable X: E[X] = Σ_x x P(x)
- Expectation of a function of a discrete random variable X: E[g(X)] = Σ_x g(x) P(x)
- Expected value of a continuous random variable X: E[X] = ∫ x p_X(x) dx
- Expectation of a function of a continuous random variable X: E[g(X)] = ∫ g(x) p_X(x) dx
Variance
Variance: a measure of how far the values of a random variable are spread out around its expected value μ_X = E[X]:
Var(X) = σ_X^2 = E[(X − μ_X)^2] = E[X^2] − μ_X^2
Standard deviation: the square root of the variance, σ_X = √Var(X)
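A small numpy sketch computing the mean and variance of the fair-die PMF from the definitions above:

```python
import numpy as np

values = np.arange(1, 7)   # faces of a fair die
pmf = np.full(6, 1 / 6)    # P(X = x) = 1/6

mean = np.sum(values * pmf)               # E[X] = sum_x x P(x)
var = np.sum((values - mean) ** 2 * pmf)  # Var(X) = E[(X - mu)^2]
# Equivalent form: E[X^2] - mu^2
var_alt = np.sum(values ** 2 * pmf) - mean ** 2

print(mean, var, var_alt)  # 3.5, ~2.9167, ~2.9167
```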
Moments
- n-th order moment of a random variable X: E[X^n]
- Normalized (mean-removed) n-th order moment: M_n = E[(X − μ_X)^n]
- The first-order moment is the mean value.
- The second-order moment is the variance plus the square of the mean: E[X^2] = σ_X^2 + μ_X^2
Correlation & Covariance
- Correlation: Crr(X, Y) = E[XY]
  For discrete RVs: E[XY] = Σ_{x,y} x y P(x, y)
- Covariance is the correlation of the mean-removed variables:
  Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y
Covariance: Example
[Figure: two scatter plots of samples, one with Cov(X, Y) = 0 and one with Cov(X, Y) = 0.9]
Covariance Properties
The covariance value shows the tendency of the pair of RVs to increase together:
- Cov(X, Y) > 0: X and Y tend to increase together
- Cov(X, Y) < 0: Y tends to decrease when X increases
- Cov(X, Y) = 0: no linear correlation between X and Y
Orthogonal, Uncorrelated & Independent RVs
- Orthogonal random variables: Crr(X, Y) = E[XY] = 0
- Uncorrelated random variables: Cov(X, Y) = 0
- Independent random variables: P(X, Y) = P(X) P(Y)
- Cov(X, Y) = 0 ⇏ independent random variables
Pearson's Product Moment Correlation
ρ_{XY} = Cov(X, Y) / (σ_X σ_Y)
- Defined only if both σ_X and σ_Y are finite and nonzero.
- ρ_{XY} shows the degree of linear dependence between X and Y.
- −1 ≤ ρ_{XY} ≤ 1, since |E[(X − μ_X)(Y − μ_Y)]| ≤ σ_X σ_Y by the Cauchy-Schwarz inequality (|E[UV]| ≤ √(E[U^2] E[V^2])).
- ρ_{XY} = 1 shows a perfect positive linear relationship and ρ_{XY} = −1 shows a perfect negative linear relationship.
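A short numpy sketch estimating covariance and Pearson correlation from samples (synthetic data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.9 * x + rng.normal(scale=0.5, size=1000)  # linearly related to x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(X, Y)
rho = cov_xy / (x.std() * y.std())                 # Pearson correlation

print(cov_xy, rho)
print(np.corrcoef(x, y)[0, 1])  # library check, matches rho
```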
Pearson's Correlation: Examples
[Figure: scatter plots labeled with their correlation coefficients ρ = ρ(X, Y); from Wikipedia]
Covariance Matrix
If X = [X_1, …, X_d]^T is a vector of random variables (d-dim random vector), the covariance matrix indicates the tendency of each pair of RVs to vary together:
Σ = E[(X − μ)(X − μ)^T] =
[ E[(X_1 − μ_1)(X_1 − μ_1)] ⋯ E[(X_1 − μ_1)(X_d − μ_d)]
  ⋮                          ⋱ ⋮
  E[(X_d − μ_d)(X_1 − μ_1)] ⋯ E[(X_d − μ_d)(X_d − μ_d)] ]
where μ = E[X] = [E[X_1], …, E[X_d]]^T = [μ_1, …, μ_d]^T
Covariance Matrix: Two Variables
σ_12 = σ_21 = Cov(X, Y)
[Figure: sample scatter plots for Σ = [1 0; 0 1] and Σ = [1 0.9; 0.9 1]]
Covariance Matrix
σ_ij shows the covariance of X_i and X_j:
σ_ij = Cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)]
Σ = [ σ_1^2 σ_12  ⋯ σ_1d
      σ_21  σ_2^2 ⋯ σ_2d
      ⋮     ⋮     ⋱ ⋮
      σ_d1  σ_d2  ⋯ σ_d^2 ]
Sums of Random Variables
Z = X + Y
- Mean: μ_Z = μ_X + μ_Y
- Variance: Var(Z) = Var(X) + Var(Y) + 2 Cov(X, Y)
  If X, Y are independent: Var(Z) = Var(X) + Var(Y)
- Distribution: p_Z(z) = ∫ p_{X,Y}(x, z − x) dx
  If X, Y are independent, this becomes the convolution p_Z(z) = ∫ p_X(x) p_Y(z − x) dx
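The discrete analogue of the convolution formula can be checked directly; this numpy sketch sums two independent fair dice:

```python
import numpy as np

pmf_die = np.full(6, 1 / 6)  # P(X = 1..6) for one fair die

# Sum of two independent dice: the PMF is the convolution of the PMFs.
pmf_sum = np.convolve(pmf_die, pmf_die)  # supports values 2..12

print(pmf_sum)        # peaks at 7 with probability 6/36
print(pmf_sum.sum())  # 1.0
```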
Some Famous Probability Mass and Density Functions
- Uniform: X ~ U(a, b)
  p(x) = (1/(b − a)) [u(x − a) − u(x − b)]  (u: unit step function)
- Gaussian (Normal): X ~ N(μ, σ^2)
  p(x) = (1/(√(2π) σ)) exp(−(x − μ)^2 / (2σ^2)) = N(μ, σ^2)
[Figure: the uniform density of height 1/(b − a) on [a, b], and the Gaussian bell curve centered at μ]
Some Famous Probability Mass and Density Functions
- Binomial: X ~ B(n, p)
  P(X = k) = C(n, k) p^k (1 − p)^{n−k}
- Exponential:
  p(x) = λ e^{−λx} for x ≥ 0, and 0 for x < 0; equivalently p(x) = λ e^{−λx} u(x)
[Figure: a binomial PMF and an exponential density]
Gaussian (Normal) Distribution
- About 68% of the probability mass lies within [μ − σ, μ + σ], and about 95% within [μ − 2σ, μ + 2σ].
- It is widely used to model the distribution of continuous variables.
- Standard Normal distribution: N(0, 1)
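These coverage figures follow from the Gaussian CDF, Φ(x) = (1/2)(1 + erf(x/√2)); a quick check using only the Python standard library:

```python
from math import erf, sqrt

def std_normal_cdf(x):
    # CDF of N(0, 1) expressed through the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for k in (1, 2):
    coverage = std_normal_cdf(k) - std_normal_cdf(-k)
    print(f"P(|X - mu| <= {k} sigma) = {coverage:.4f}")
# ~0.6827 and ~0.9545
```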
Multivariate Gaussian Distribution
x is a vector of d Gaussian variables:
p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2) (x − μ)^T Σ^{−1} (x − μ)) = N(μ, Σ)
μ = E[x]
Σ = E[(x − μ)(x − μ)^T]
Mahalanobis distance: r^2 = (x − μ)^T Σ^{−1} (x − μ)
[Figure: probability density surface of a bivariate Gaussian over (x1, x2)]
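A compact numpy sketch of the density formula and the Mahalanobis distance (the example μ and Σ are arbitrary):

```python
import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

def mvn_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    # Mahalanobis distance squared: (x - mu)^T Sigma^{-1} (x - mu)
    r2 = diff @ np.linalg.solve(sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * r2) / norm

x = np.array([1.0, 1.0])
print(mvn_pdf(x, mu, sigma))
```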
Multivariate Gaussian Distribution
- The covariance matrix is always symmetric and positive semi-definite.
- A multivariate Gaussian is completely specified by d + d(d + 1)/2 parameters.
- Special cases of Σ:
  - Σ = σ^2 I: independent random variables with the same variance (circularly symmetric Gaussian)
  - Diagonal matrix Σ = diag(σ_1^2, …, σ_d^2): independent random variables with different variances
Multivariate Gaussian Distribution: Level Surfaces
- The Gaussian density is constant on surfaces in x-space for which
  (x − μ)^T Σ^{−1} (x − μ) = c  (hyper-ellipsoids)
- The principal axes of the hyper-ellipsoids are the eigenvectors of Σ.
- Bivariate Gaussian: curves of constant density are ellipses.
Bivariate Gaussian Distribution
λ_1 and λ_2 are the eigenvalues of Σ = Cov(x), and u_1 and u_2 are the corresponding eigenvectors.
[Figure: constant-density ellipse centered at μ with principal axes along u_1 and u_2, with lengths proportional to √λ_1 and √λ_2]
Linear Transformations on Multivariate Gaussian
If x ~ N(μ, Σ) and y = A^T x, then y ~ N(A^T μ, A^T Σ A).
A_w = Φ Λ^{−1/2}: whitening transform (Φ: eigenvectors of Σ as columns, Λ: diagonal matrix of its eigenvalues); the transformed variable y = A_w^T x has identity covariance.
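A minimal numpy sketch of whitening, assuming synthetic sample data:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
x = rng.multivariate_normal(mean=[0, 0], cov=sigma, size=5000)

# Eigendecomposition of the sample covariance: Sigma = Phi Lambda Phi^T
lam, phi = np.linalg.eigh(np.cov(x, rowvar=False))
a_w = phi @ np.diag(lam ** -0.5)  # whitening transform A_w = Phi Lambda^{-1/2}

y = x @ a_w                              # y = A_w^T x for each sample row
print(np.cov(y, rowvar=False).round(2))  # ~ identity matrix
```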
Gaussian Distribution Properties
Some attractive properties of the Gaussian distribution:
- Marginal and conditional distributions of a Gaussian are also Gaussian.
- After a linear transformation, a Gaussian distribution is again Gaussian; in particular, there exists a linear transformation that diagonalizes the covariance matrix (whitening transform) and converts the multivariate normal distribution into a spherical one.
- The Gaussian maximizes the entropy among distributions with a given mean and variance.
- The Gaussian is stable and infinitely divisible.
- Central Limit Theorem: some distributions can be approximated by a Gaussian when their parameter value is sufficiently large (e.g., Binomial).
Central Limit Theorem
(Under mild conditions.) Suppose X_i (i = 1, …, N) are i.i.d. (independent, identically distributed) RVs with finite variances, and let S_N be the sum of these RVs:
S_N = Σ_{i=1}^{N} X_i
The distribution of S_N converges to a normal distribution as N increases, regardless of the distribution of the RVs.
Example: X_i ~ uniform, i = 1, …, N
[Figure: distribution of S_N for increasing N, approaching a Gaussian shape]
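A quick empirical illustration with numpy (the sample sizes and set of N values are arbitrary): as N grows, the fraction of standardized sums within one standard deviation approaches the Gaussian value ≈ 0.6827.

```python
import numpy as np

rng = np.random.default_rng(2)
for n in (1, 2, 10, 50):
    # 100000 realizations of S_N = sum of n uniform(0, 1) variables
    s = rng.uniform(size=(100_000, n)).sum(axis=1)
    s = (s - s.mean()) / s.std()  # standardize
    # Fraction within one standard deviation; ~0.6827 for a Gaussian
    print(n, np.mean(np.abs(s) <= 1).round(4))
```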
Linear Algebra: Basic Definitions
- Matrix A:
  A = [a_ij]_{m×n} =
  [ a_11 a_12 … a_1n
    a_21 a_22 … a_2n
    …    …    …  …
    a_m1 a_m2 … a_mn ]
- Matrix transpose: B = A^T, where b_ij = a_ji (1 ≤ i ≤ n, 1 ≤ j ≤ m)
- Symmetric matrix: A^T = A
- Vector: a = [a_1, …, a_n]^T
Linear Mapping
A linear function f satisfies:
- f(x + y) = f(x) + f(y)  ∀x, y ∈ V
- f(αx) = α f(x)  ∀x ∈ V, α ∈ ℝ
In general, a matrix A_{m×n} = [a_ij] can be used to denote a linear map f: ℝ^n → ℝ^m where y = Ax, i.e., y_i = a_i1 x_1 + ⋯ + a_in x_n.
Linear Algebra: Basic Definitions
- Inner (dot) product: a · b = a^T b = Σ_{i=1}^{n} a_i b_i
- Matrix multiplication: for A = [a_ij]_{m×p} and B = [b_ij]_{p×n},
  AB = C = [c_ij]_{m×n},  c_ij = a_i^T b_j
  where a_i^T is the i-th row of A and b_j is the j-th column of B.
Inner Product
- Inner (dot) product: a^T b = Σ_{i=1}^{n} a_i b_i
- Length (Euclidean norm) of a vector: ||a|| = √(a^T a) = √(Σ_{i=1}^{n} a_i^2)
  a is normalized iff ||a|| = 1
- Angle between vectors a and b: cos θ = a^T b / (||a|| ||b||)
- Orthogonal vectors a and b: a^T b = 0
- Orthonormal set of vectors u_1, u_2, …, u_n: ∀i, j: u_i^T u_j = 1 if i = j, and 0 otherwise
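These operations in numpy (the vectors are chosen arbitrarily):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, -3.0])

length_a = np.sqrt(a @ a)  # Euclidean norm, same as np.linalg.norm(a)
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(length_a)   # 5.0
print(cos_theta)  # 0.0 -> a and b are orthogonal
```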
Linear Independence
A set of vectors {v_1, …, v_n} is linearly independent if no vector is a linear combination of the other vectors:
c_1 v_1 + c_2 v_2 + ⋯ + c_n v_n = 0 ⇒ c_1 = c_2 = ⋯ = c_n = 0
Matrix Determinant and Trace
- Determinant: det(AB) = det(A) × det(B)
  Cofactor expansion: det(A) = Σ_{j=1}^{n} a_ij A_ij for any i = 1, …, n,
  where A_ij = (−1)^{i+j} det(M_ij) and M_ij is the minor obtained by deleting row i and column j.
- Trace: tr[A] = Σ_{j=1}^{n} a_jj
Matrix Inversion
- Inverse of A_{n×n}: AB = BA = I_n ⇒ B = A^{−1}
- A^{−1} exists iff det(A) ≠ 0 (A is nonsingular)
  - Singular: det(A) = 0
  - Ill-conditioned: A is nonsingular but close to being singular
- Pseudo-inverse for a non-square matrix A: A# = (A^T A)^{−1} A^T, when A^T A is not singular; then A# A = I
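A numpy check of the pseudo-inverse formula on an arbitrary tall matrix:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # 3x2, so A^T A is 2x2 and invertible

pinv = np.linalg.inv(a.T @ a) @ a.T          # A# = (A^T A)^{-1} A^T
print((pinv @ a).round(10))                  # A# A = I (2x2 identity)
print(np.allclose(pinv, np.linalg.pinv(a)))  # matches numpy's pinv: True
```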
Matrix Rank
rank(A): the maximum number of linearly independent columns or rows of A.
- rank(A_{m×n}) ≤ min(m, n)
- A_{n×n} is full rank (rank(A) = n) iff A is nonsingular.
Eigenvectors and Eigenvalues
Av = λv,  v ≠ 0
- Characteristic equation: det(A − λI) = 0, an n-th order polynomial with n roots λ_1, …, λ_n
- tr(A) = Σ_{j=1}^{n} λ_j
- det(A) = Π_{j=1}^{n} λ_j
Eigenvectors and Eigenvalues: Symmetric Matrix
- For a symmetric matrix, the eigenvectors corresponding to distinct eigenvalues are orthogonal.
- These eigenvectors can be used to form an orthonormal set U = [u_1, …, u_n] (U^T U = I and U U^T = I).
Eigen Decomposition: Symmetric Matrix
AU = UΛ ⇒ A = UΛU^{−1} = UΛU^T
Eigen decomposition of a symmetric matrix: A = UΛU^T = Σ_{i=1}^{n} λ_i u_i u_i^T
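A numpy sketch verifying the decomposition (and the trace identity from the earlier slide) for an arbitrary symmetric matrix:

```python
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # symmetric

lam, u = np.linalg.eigh(a)  # eigenvalues (ascending) and orthonormal eigenvectors
print(np.allclose(u.T @ u, np.eye(2)))         # U^T U = I: True
print(np.allclose(u @ np.diag(lam) @ u.T, a))  # A = U Lambda U^T: True
print(lam.sum(), np.trace(a))                  # sum of eigenvalues = trace
```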
Positive Definite Matrix
- A symmetric matrix A_{n×n} is positive definite iff x^T A x > 0 for all x ≠ 0.
- The eigenvalues of a positive definite matrix are positive: ∀i, λ_i > 0.
Simple Vector Derivatives
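A brief sketch of the standard identities for differentiating with respect to a vector x (assumed typical content for this heading, not recovered from the slide):

```latex
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{a}^T \mathbf{x} \right) = \mathbf{a}, \qquad
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^T \mathbf{x} \right) = 2\mathbf{x}, \qquad
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^T A \mathbf{x} \right) = (A + A^T)\,\mathbf{x}
```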