Stochastic processes
A stochastic (or random) process {Xi} is an indexed sequence of random variables. The dependencies among the random variables can be arbitrary. The process is characterized by the joint probability mass functions
p(x1, x2, . . . , xn) = Pr{(X1, X2, . . . , Xn) = (x1, x2, . . . , xn)}, (x1, x2, . . . , xn) ∈ X^n
A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e.
Pr{X1 = x1, . . . ,Xn = xn} = Pr{X1+l = x1, . . . ,Xn+l = xn}
for every n and shift l and for all x1, . . . , xn ∈ X .
Markov chains
A discrete stochastic process X1, X2, . . . is said to be a Markov chain or a Markov process if for n = 1, 2, . . .
Pr{Xn+1 = xn+1|Xn = xn, . . . ,X1 = x1} = Pr{Xn+1 = xn+1|Xn = xn}
for all x1, . . . , xn, xn+1 ∈ X .
For a Markov chain we can write the joint probability mass function as
p(x1, x2, . . . , xn) = p(x1)p(x2|x1) · · · p(xn|xn−1)
A Markov chain is said to be time invariant if the conditional probability p(xn+1|xn) does not depend on n, i.e.
Pr{Xn+1 = b|Xn = a} = Pr{X2 = b|X1 = a}
for all a, b ∈ X and n = 1, 2, . . ..
Markov chains, cont.
If {Xi} is a Markov chain, Xn is called the state at time n. A time-invariant Markov chain is characterized by its initial state and a probability transition matrix P = [Pij], where Pij = Pr{Xn+1 = j | Xn = i} (assuming that X = {1, 2, . . . , m}).
If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, the Markov chain is said to be irreducible. If the greatest common divisor of the lengths of the different paths from a state to itself is 1, the Markov chain is said to be aperiodic.
If the probability mass function at time n is p(xn), the probability mass function at time n + 1 is

p(xn+1) = ∑_{xn} p(xn) P_{xn xn+1}
Markov chains, cont.
A distribution on the states such that the distribution at time n + 1 is the same as the distribution at time n is called a stationary distribution. It is so called because if the initial state of the Markov chain is drawn according to a stationary distribution, the Markov chain forms a stationary process.
If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is unique, and from any starting distribution, the distribution of Xn tends to the stationary distribution as n → ∞.
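This convergence can be illustrated with a short sketch in Python; the two-state transition matrix is an assumed toy example, not from the text:

```python
# Power iteration toward the stationary distribution.
# The 2-state transition matrix P is an assumed toy example:
# P[i][j] = Pr{X_{n+1} = j | X_n = i}.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(dist, P):
    """One step of p(x_{n+1}) = sum_{x_n} p(x_n) P_{x_n x_{n+1}}."""
    m = len(P)
    return [sum(dist[i] * P[i][j] for i in range(m)) for j in range(m)]

# Start from an arbitrary distribution and iterate.
dist = [1.0, 0.0]
for _ in range(200):
    dist = step(dist, P)

# The fixed point solves mu = mu P; for this P it is mu = (0.8, 0.2).
print(dist)
```

Since this toy chain is irreducible and aperiodic, any starting distribution ends up at the same fixed point.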
Entropy rate
The entropy rate (or just entropy) of a stochastic process {Xi} is defined by
H(X) = lim_{n→∞} (1/n) H(X1, X2, . . . , Xn)
when the limit exists. It can also be defined as
H′(X) = lim_{n→∞} H(Xn|Xn−1, Xn−2, . . . , X1)
when the limit exists. For stationary processes, both limits exist and are equal:
H(X ) = H ′(X )
For a stationary Markov chain, we have
H(X) = H′(X) = lim_{n→∞} H(Xn|Xn−1, Xn−2, . . . , X1) = lim_{n→∞} H(Xn|Xn−1) = H(X2|X1)
Entropy rate, cont.
For a Markov chain, the stationary distribution μ is the solution to the equation system
μj = ∑_i μi Pij , j = 1, . . . , m
Let {Xi} be a stationary Markov chain with stationary distribution μ and transition matrix P, and let X1 ∼ μ. The entropy rate of the Markov chain is then given by
H(X) = −∑_{i,j} μi Pij log Pij = ∑_i μi (−∑_j Pij log Pij)
i.e., a weighted sum of the entropies for each state.
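A minimal sketch of this formula, again using an assumed toy chain (the stationary distribution can be checked against μj = ∑_i μi Pij by hand):

```python
import math

# Entropy rate of a stationary Markov chain:
# H(X) = sum_i mu_i * ( -sum_j P_ij log P_ij ).
# The chain (P, mu) is an assumed toy example; mu solves mu_j = sum_i mu_i P_ij.
P = [[0.9, 0.1],
     [0.4, 0.6]]
mu = [0.8, 0.2]

def row_entropy(row):
    """Entropy (in bits) of the transition probabilities out of one state."""
    return -sum(p * math.log2(p) for p in row if p > 0)

H_rate = sum(mu[i] * row_entropy(P[i]) for i in range(len(P)))
print(H_rate)   # bits per symbol
```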
Functions of Markov chains
Let X1, X2, . . . be a stationary Markov chain and let Yi = φ(Xi) be a process where each term is a function of the corresponding state in the Markov chain. What is the entropy rate H(Y)? If Yi also forms a Markov chain this is easy to calculate, but that is not generally the case. We can however find upper and lower bounds.
Since the Markov chain is stationary, so is Y1, Y2, . . .. Thus, the entropy rate H(Y) is well defined. We might try to compute H(Yn|Yn−1, . . . , Y1) for each n to find the entropy rate; however, the convergence might be arbitrarily slow.
We already know that H(Yn|Yn−1, . . . , Y1) is an upper bound on H(Y), since it converges monotonically to the entropy rate. For a lower bound, we can use H(Yn|Yn−1, . . . , Y1, X1).
Functions of Markov chains, cont.
We have

H(Yn|Yn−1, . . . , Y2, X1) ≤ H(Y)
Proof: For k = 1, 2, . . . we have
H(Yn|Yn−1, . . . , Y2, X1) = H(Yn|Yn−1, . . . , Y2, Y1, X1)
= H(Yn|Yn−1, . . . , Y2, Y1, X1, X0, . . . , X−k)
= H(Yn|Yn−1, . . . , Y2, Y1, X1, X0, . . . , X−k, Y0, . . . , Y−k)
≤ H(Yn|Yn−1, . . . , Y2, Y1, Y0, . . . , Y−k)
= H(Yn+k+1|Yn+k, . . . , Y2, Y1)
Since this is true for all k it will be true in the limit too. Thus
H(Yn|Yn−1, . . . , Y1, X1) ≤ lim_{k→∞} H(Yn+k+1|Yn+k, . . . , Y2, Y1) = H(Y)
Functions of Markov chains, cont.
The gap between the upper and lower bounds tends to 0, i.e.
H(Yn|Yn−1, . . . ,Y1)− H(Yn|Yn−1, . . . ,Y1,X1)→ 0
Proof: The difference can be rewritten as
H(Yn|Yn−1, . . . , Y1) − H(Yn|Yn−1, . . . , Y1, X1) = I(X1; Yn|Yn−1, . . . , Y1)
By the properties of mutual information
I (X1; Y1,Y2, . . . ,Yn) ≤ H(X1)
Since I (X1; Y1,Y2, . . . ,Yn) increases with n, the limit exists and
lim_{n→∞} I(X1; Y1, Y2, . . . , Yn) ≤ H(X1)
Functions of Markov chains, cont.
The chain rule for mutual information gives us
H(X1) ≥ lim_{n→∞} I(X1; Y1, Y2, . . . , Yn)
= lim_{n→∞} ∑_{i=1}^{n} I(X1; Yi|Yi−1, . . . , Y1)
= ∑_{i=1}^{∞} I(X1; Yi|Yi−1, . . . , Y1)
Since this infinite sum is finite and all terms are non-negative, the terms must tend to 0, which means that
lim_{n→∞} I(X1; Yn|Yn−1, . . . , Y1) = 0
Functions of Markov chains, cont.
We have thus shown that
H(Yn|Yn−1, . . . ,Y1,X1) ≤ H(Y) ≤ H(Yn|Yn−1, . . . ,Y1)
and that
lim_{n→∞} H(Yn|Yn−1, . . . , Y1, X1) = H(Y) = lim_{n→∞} H(Yn|Yn−1, . . . , Y1)
This result can also be generalized to the case when Yi is a random function of Xi, called a hidden Markov model.
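The sandwich bound H(Yn|Yn−1, . . . , Y1, X1) ≤ H(Y) ≤ H(Yn|Yn−1, . . . , Y1) can be evaluated by brute-force enumeration for a small chain. A sketch, where the three-state transition matrix and the lumping map φ are assumed illustrative choices:

```python
import itertools
import math

# Brute-force evaluation of the bounds
#   H(Yn | Y^{n-1}, X1)  <=  H(Y)  <=  H(Yn | Y^{n-1})
# for Y_i = phi(X_i). The 3-state chain and the map phi are assumed examples;
# P is doubly stochastic, so the uniform distribution is stationary.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
mu = [1/3, 1/3, 1/3]
phi = [0, 0, 1]          # states 0 and 1 are indistinguishable in Y

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def joint(n, keep_x1=False):
    """Exact distribution of (Y1..Yn), optionally together with X1."""
    dist = {}
    for xs in itertools.product(range(3), repeat=n):
        p = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= P[a][b]
        key = tuple(phi[x] for x in xs)
        if keep_x1:
            key = (xs[0],) + key
        dist[key] = dist.get(key, 0.0) + p
    return dist

n = 6
upper = entropy(joint(n)) - entropy(joint(n - 1))              # H(Yn|Y^{n-1})
lower = entropy(joint(n, True)) - entropy(joint(n - 1, True))  # H(Yn|Y^{n-1},X1)
print(lower, upper)    # lower <= H(Y) <= upper
```

The map φ here is chosen so that Y is not itself a Markov chain, which is why the two bounds differ at finite n.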
Higher order Markov sources
A Markov chain is a process where the dependency of an outcome at time n only reaches back one step in time
p(xn|xn−1, xn−2, xn−3, . . .) = p(xn|xn−1)
In some applications, particularly in data compression, we might want to have source models where the dependency reaches back longer in time. A Markov source of order k is a random process where the following holds
p(xn|xn−1, xn−2, xn−3, . . .) = p(xn|xn−1, . . . , xn−k)
A Markov source of order 0 is a memoryless process. A Markov source of order 1 is a Markov chain. Let Xn be a stationary Markov source of order k, and form the new source Sn = (Xn−k+1, Xn−k+2, . . . , Xn). This new process is a Markov chain. The entropy rate is

H(Sn|Sn−1) = H(Xn|Xn−1, . . . , Xn−k)
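A sketch of this construction for an assumed order-2 binary source (the conditional probabilities are illustrative, not from the text):

```python
import itertools
import math

# Order-2 binary Markov source, turned into an ordinary Markov chain on
# pairs S_n = (X_{n-1}, X_n). The conditional probabilities are assumed:
# cond[(x_{n-2}, x_{n-1})] = Pr{X_n = 1 | past}.
cond = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.9}
states = list(itertools.product([0, 1], repeat=2))

def trans(s, t):
    """Pr{S_{n+1} = t | S_n = s}; t must overlap s in one symbol."""
    if s[1] != t[0]:
        return 0.0
    p1 = cond[s]
    return p1 if t[1] == 1 else 1.0 - p1

# Stationary distribution over pairs, by power iteration.
mu = {s: 0.25 for s in states}
for _ in range(500):
    mu = {t: sum(mu[s] * trans(s, t) for s in states) for t in states}

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# H(S_n | S_{n-1}) = H(X_n | X_{n-1}, X_{n-2})
H_rate = sum(mu[s] * h2(cond[s]) for s in states)
print(H_rate)
```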
Differential entropy
A continuous random variable X has the probability density function f(x). Let the support set S be the set of x where f(x) > 0. The differential entropy h(X) of the variable is defined as
h(X) = −∫_S f(x) log f(x) dx
Unlike the entropy for a discrete variable, the differential entropy can be both positive and negative.
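For instance, a uniform density on an interval shorter than 1 has negative differential entropy; a quick check (the interval [0, 0.25] is an arbitrary choice):

```python
import math

# A uniform density on [a, b] has h(X) = log(b - a), which is negative
# whenever b - a < 1. The interval [0, 0.25] is an arbitrary choice.
a, b = 0.0, 0.25
h_closed = math.log2(b - a)                     # = -2 bits

# Same value via a Riemann sum of -f log f over the support.
f = 1.0 / (b - a)
N = 100000
dx = (b - a) / N
h_numeric = sum(-f * math.log2(f) * dx for _ in range(N))
print(h_closed, h_numeric)
```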
Some common distributions
Normal distribution (gaussian distribution)

f(x) = (1/(√(2π)σ)) e^(−(x−m)²/(2σ²)),  h(X) = (1/2) log 2πeσ²

Laplace distribution

f(x) = (1/(√2 σ)) e^(−√2 |x−m|/σ),  h(X) = (1/2) log 2e²σ²

Uniform distribution

f(x) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise,  h(X) = log(b − a) = (1/2) log 12σ²
AEP for continuous variables
Let X1, X2, . . . , Xn be a sequence of i.i.d. random variables drawn according to the density f(x). Then
−(1/n) log f(X1, X2, . . . , Xn) → E[− log f(X)] = h(X) in probability
For ε > 0 and any n we define the typical set A_ε^(n) with respect to f(x) as follows

A_ε^(n) = {(x1, . . . , xn) ∈ S^n : |−(1/n) log f(x1, . . . , xn) − h(X)| ≤ ε}
The volume Vol(A) of any set A is defined as

Vol(A) = ∫_A dx1 dx2 . . . dxn
AEP, cont.
The typical set A_ε^(n) has the following properties

1. Pr{A_ε^(n)} > 1 − ε for sufficiently large n
2. Vol(A_ε^(n)) ≤ 2^(n(h(X)+ε)) for all n
3. Vol(A_ε^(n)) ≥ (1 − ε) 2^(n(h(X)−ε)) for sufficiently large n
This indicates that the typical set contains almost all probability and has a volume of approximately 2^(nh(X)).
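A Monte Carlo sketch of the convergence behind the AEP; the choice of a gaussian density with σ = 2 is an assumed example:

```python
import math
import random

# Monte Carlo check of the AEP: for i.i.d. samples from a gaussian density
# (sigma = 2 is an assumed example), -(1/n) log f(X1..Xn) should be close
# to h(X) = (1/2) log 2*pi*e*sigma^2.
random.seed(0)
sigma = 2.0
h_closed = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

def log2_f(x):
    """log2 of the N(0, sigma^2) density at x."""
    return math.log2(1.0 / (math.sqrt(2 * math.pi) * sigma)) \
        - (x * x) / (2 * sigma * sigma) * math.log2(math.e)

n = 200000
estimate = sum(-log2_f(random.gauss(0.0, sigma)) for _ in range(n)) / n
print(estimate, h_closed)
```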
Quantization
Suppose we do uniform quantization of a continuous random variable X, i.e. we divide the range of X into bins of length ∆. Assuming that f(x) is continuous in each bin, there exists a value xi within each bin such that
f(xi)∆ = ∫_{i∆}^{(i+1)∆} f(x) dx
Consider the quantized variable X∆ defined by

X∆ = xi if i∆ ≤ X < (i + 1)∆
The probability p(xi) = pi that X∆ = xi is

pi = ∫_{i∆}^{(i+1)∆} f(x) dx = f(xi)∆
Quantization, cont.
The entropy of the quantized variable is
H(X∆) = −∑_i pi log pi
= −∑_i ∆f(xi) log(∆f(xi))
= −∑_i ∆f(xi) log f(xi) − ∑_i ∆f(xi) log ∆
≈ −∫_{−∞}^{∞} f(x) log f(x) dx − log ∆
= h(X) − log ∆
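This approximation can be checked numerically; a sketch for an assumed standard gaussian X and ∆ = 1/64:

```python
import math

# Check H(X_Delta) ≈ h(X) - log Delta for an assumed standard gaussian X,
# where h(X) = (1/2) log 2*pi*e ≈ 2.047 bits, and Delta = 1/64.
h_closed = 0.5 * math.log2(2 * math.pi * math.e)
delta = 1.0 / 64

def cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

H = 0.0
for i in range(-1000, 1000):                   # bins cover about +-15 sigma
    p = cdf((i + 1) * delta) - cdf(i * delta)  # p_i = integral of f over bin i
    if p > 0:
        H -= p * math.log2(p)

print(H, h_closed - math.log2(delta))
```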
Differential entropy, cont.
Let X and Y be two random variables with joint density function f(x, y) and conditional density functions f(x|y) and f(y|x). The joint differential entropy is defined as
h(X,Y) = −∫ f(x, y) log f(x, y) dx dy
The conditional differential entropy is defined as
h(X|Y) = −∫ f(x, y) log f(x|y) dx dy
We have

h(X,Y) = h(X) + h(Y|X) = h(Y) + h(X|Y)
which can be generalized to (chain rule)
h(X1,X2, . . . ,Xn) = h(X1) + h(X2|X1) + . . .+ h(Xn|X1,X2, . . . ,Xn−1)
Relative entropy
The relative entropy (Kullback–Leibler distance) between two densities f and g is defined by
D(f||g) = ∫ f log (f/g)
The relative entropy is finite only if the support set of f is contained in the support set of g.
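A numerical sketch of D(f||g) for two assumed gaussian densities, compared with the known closed form for the gaussian case:

```python
import math

# D(f||g) for two assumed gaussian densities, compared with the known
# closed form for gaussians (in nats):
#   D = ln(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 2.0

def pdf(x, m, s):
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

closed = math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2 * s2 * s2) - 0.5

# Riemann sum of f log(f/g); [-10, 10] holds essentially all of f's mass.
dx = 0.001
numeric = sum(pdf(i * dx, m1, s1) * math.log(pdf(i * dx, m1, s1) / pdf(i * dx, m2, s2)) * dx
              for i in range(-10000, 10000))
print(numeric, closed)
```

Note that f (the narrower density) has its support inside the region where g is positive, so the integral is finite, in line with the condition above.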
Mutual information
The mutual information between X and Y is defined as
I(X;Y) = ∫ f(x, y) log ( f(x, y) / (f(x)f(y)) ) dx dy
which gives
I(X;Y) = h(X) − h(X|Y) = h(Y) − h(Y|X) = h(X) + h(Y) − h(X,Y)
and

I(X;Y) = D(f(x, y)||f(x)f(y))
Given two uniformly quantized versions of X and Y,
I(X∆;Y∆) = H(X∆) − H(X∆|Y∆)
≈ h(X) − log ∆ − (h(X|Y) − log ∆)
= I(X;Y)
Properties
D(f ||g) ≥ 0
with equality iff f and g are equal almost everywhere.
I (X ; Y ) ≥ 0
with equality iff X and Y are independent.
h(X |Y ) ≤ h(X )
with equality iff X and Y are independent.
h(X + c) = h(X )
h(aX ) = h(X ) + log |a|
h(AX) = h(X) + log | det(A)|
Differential entropy, cont.
The gaussian distribution is the distribution that maximizes the differential entropy, for a given covariance matrix.
I.e., for the one-dimensional case, the differential entropy for a variable X with variance σ² satisfies the inequality
h(X) ≤ (1/2) log 2πeσ²
with equality iff X is gaussian.
For the general case, the differential entropy for a random n-dimensional vector X with covariance matrix K satisfies the inequality
h(X) ≤ (1/2) log((2πe)^n |K|)
with equality iff X is a multivariate gaussian vector.
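Comparing the closed-form entropies listed earlier at a common variance illustrates the maximum-entropy property (pure arithmetic on those formulas, no further assumptions):

```python
import math

# Closed-form differential entropies (in bits) at a common variance sigma^2,
# from the formulas earlier in the notes: the gaussian comes out on top.
sigma2 = 1.0
h_gauss   = 0.5 * math.log2(2 * math.pi * math.e * sigma2)
h_laplace = 0.5 * math.log2(2 * math.e ** 2 * sigma2)
h_uniform = 0.5 * math.log2(12 * sigma2)
print(h_gauss, h_laplace, h_uniform)
```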