
Stochastic processes

A stochastic (or random) process {X_i} is an indexed sequence of random variables. The dependencies among the random variables can be arbitrary. The process is characterized by the joint probability mass functions

p(x_1, x_2, \ldots, x_n) = \Pr\{(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\}, \quad (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n

A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e.

\Pr\{X_1 = x_1, \ldots, X_n = x_n\} = \Pr\{X_{1+l} = x_1, \ldots, X_{n+l} = x_n\}

for every n and shift l and for all x_1, \ldots, x_n \in \mathcal{X}.


Markov chains

A discrete stochastic process X_1, X_2, \ldots is said to be a Markov chain or a Markov process if for n = 1, 2, \ldots

\Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1\} = \Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n\}

for all x_1, \ldots, x_n, x_{n+1} \in \mathcal{X}.

For a Markov chain we can write the joint probability mass function as

p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_{n-1})

A Markov chain is said to be time invariant if the conditional probability p(x_{n+1} \mid x_n) does not depend on n, i.e.

\Pr\{X_{n+1} = b \mid X_n = a\} = \Pr\{X_2 = b \mid X_1 = a\}

for all a, b \in \mathcal{X} and n = 1, 2, \ldots
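As an aside not in the original slides, a minimal Python sketch of such a chain may help; the two-state transition matrix P and all numbers are assumptions chosen only for illustration.

import numpy as np

# Hypothetical two-state transition matrix; P[i, j] = Pr{X_{n+1} = j | X_n = i}.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

def simulate_chain(P, x0, n, rng):
    """Generate n states X_1, ..., X_n of a time-invariant Markov chain, starting in x0."""
    states = [x0]
    for _ in range(n - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return np.array(states)

x = simulate_chain(P, x0=0, n=10_000, rng=np.random.default_rng(0))
print(np.bincount(x) / len(x))   # empirical state frequencies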


Markov chains, cont.

If {X_i} is a Markov chain, X_n is called the state at time n. A time-invariant Markov chain is characterized by its initial state and a probability transition matrix P = [P_ij], where P_ij = Pr{X_{n+1} = j | X_n = i} (assuming that X = {1, 2, \ldots, m}).

If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, the Markov chain is said to be irreducible. If the greatest common divisor of the lengths of different paths from a state to itself is 1, the Markov chain is said to be aperiodic.

If the probability mass function at time n is p(x_n), the probability mass function at time n + 1 is

p(x_{n+1}) = \sum_{x_n} p(x_n) P_{x_n x_{n+1}}
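A one-line sketch of this update (not from the slides), treating the pmf as a row vector and using the same hypothetical P as above:

import numpy as np

# pmf update p(x_{n+1}) = sum_{x_n} p(x_n) P_{x_n x_{n+1}} as a vector-matrix product.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
p_n = np.array([1.0, 0.0])   # pmf at time n
p_next = p_n @ P             # pmf at time n + 1
print(p_next)                # [0.9 0.1]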


Markov chains, cont.

A distribution on the states such that the distribution at time n + 1 is the same as the distribution at time n is called a stationary distribution. The name comes from the fact that if the initial state of the Markov chain is drawn according to a stationary distribution, the Markov chain forms a stationary process.

If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is unique, and from any starting distribution, the distribution of X_n tends to the stationary distribution as n → ∞.
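A minimal sketch (not in the slides) of both facts for the assumed two-state matrix P used earlier: the stationary distribution is the left eigenvector of P for eigenvalue 1, and iterating the update from any start tends to it.

import numpy as np

# Hypothetical transition matrix, irreducible and aperiodic.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()
print(mu)                      # [0.75, 0.25]

# From any starting distribution, p P^n tends to the stationary distribution.
p = np.array([0.0, 1.0])
for _ in range(100):
    p = p @ P
print(p)                       # close to [0.75, 0.25]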


Entropy rate

The entropy rate (or just entropy) of a stochastic process {X_i} is defined by

H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)

when the limit exists. It can also be defined as

H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)

when the limit exists. For stationary processes, both limits exist and are equal:

H(\mathcal{X}) = H'(\mathcal{X})

For a stationary Markov chain, we have

H(\mathcal{X}) = H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1) = \lim_{n \to \infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1)


Entropy rate, cont.

For a Markov chain, the stationary distribution µ is the solution to the equation system

\mu_j = \sum_i \mu_i P_{ij}, \quad j = 1, \ldots, m

Consider a stationary Markov chain {X_i} with stationary distribution µ and transition matrix P, and let X_1 ∼ µ. The entropy rate of the Markov chain is then given by

H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij} = \sum_i \mu_i \Bigl( -\sum_j P_{ij} \log P_{ij} \Bigr)

i.e. a weighted sum of the entropies of the transition distributions of the states.
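A small sketch (not in the slides) of this weighted sum for the hypothetical P and its stationary distribution from the earlier examples, with logs to base 2:

import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
mu = np.array([0.75, 0.25])      # stationary distribution of P (computed above)

# Entropy (in bits) of each state's transition distribution, i.e. of each row of P.
row_entropy = np.array([-sum(p * np.log2(p) for p in row if p > 0) for row in P])

# Entropy rate as the weighted sum of the row entropies.
print(np.dot(mu, row_entropy))   # about 0.57 bits per symbol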


Functions of Markov chains

Let X_1, X_2, \ldots be a stationary Markov chain and let Y_i = φ(X_i) be a process where each term is a function of the corresponding state in the Markov chain. What is the entropy rate H(Y)? If Y_i also forms a Markov chain this is easy to calculate, but this is not generally the case. We can, however, find upper and lower bounds.

Since the Markov chain is stationary, so is Y_1, Y_2, \ldots. Thus, the entropy rate H(Y) is well defined. We might try to compute H(Y_n | Y_{n-1}, \ldots, Y_1) for each n to find the entropy rate; however, the convergence might be arbitrarily slow.

We already know that H(Y_n | Y_{n-1}, \ldots, Y_1) is an upper bound on H(Y), since it decreases monotonically to the entropy rate. For a lower bound, we can use H(Y_n | Y_{n-1}, \ldots, Y_1, X_1).


Functions of Markov chains, cont.

We have

H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1) \le H(\mathcal{Y})

Proof: For k = 1, 2, . . . we have

H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1) = H(Y_n \mid Y_{n-1}, \ldots, Y_2, Y_1, X_1)
= H(Y_n \mid Y_{n-1}, \ldots, Y_2, Y_1, X_1, X_0, \ldots, X_{-k})
= H(Y_n \mid Y_{n-1}, \ldots, Y_2, Y_1, X_1, X_0, \ldots, X_{-k}, Y_0, \ldots, Y_{-k})
\le H(Y_n \mid Y_{n-1}, \ldots, Y_2, Y_1, Y_0, \ldots, Y_{-k})
= H(Y_{n+k+1} \mid Y_{n+k}, \ldots, Y_2, Y_1)

The equalities use, in turn, that Y_1 is a function of X_1, the Markov property, and that Y_0, \ldots, Y_{-k} are functions of X_0, \ldots, X_{-k}; the inequality holds because conditioning reduces entropy, and the last step uses stationarity. Since this is true for all k, it is also true in the limit. Thus

H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le \lim_{k \to \infty} H(Y_{n+k+1} \mid Y_{n+k}, \ldots, Y_2, Y_1) = H(\mathcal{Y})


Functions of Markov chains, cont.

The gap between the upper and lower bounds tends to 0, i.e.

H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \to 0

Proof: The difference can be rewritten as

H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1)

By the properties of mutual information

I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1)

Since I(X_1; Y_1, Y_2, \ldots, Y_n) is non-decreasing in n and bounded, the limit exists and

\lim_{n \to \infty} I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1)


Functions of Markov chains, cont.

The chain rule for mutual information gives us

H(X_1) \ge \lim_{n \to \infty} I(X_1; Y_1, Y_2, \ldots, Y_n) = \lim_{n \to \infty} \sum_{i=1}^{n} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1) = \sum_{i=1}^{\infty} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1)

Since this infinite sum is finite and all terms are non-negative, the terms must tend to 0, which means that

\lim_{n \to \infty} I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1) = 0


Functions of Markov chains, cont.

We have thus shown that

H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \ldots, Y_1)

and that

\lim_{n \to \infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = H(\mathcal{Y}) = \lim_{n \to \infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1)

This result can also be generalized to the case when Y_i is a random function of X_i, called a hidden Markov model.
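As a numerical illustration (not part of the slides), the two bounds can be evaluated by brute-force enumeration of all X-sequences for a small chain; the 3-state matrix P, the function phi, and all numbers below are assumptions made up for the example, and logs are to base 2.

import itertools
from collections import defaultdict
import numpy as np

# Hypothetical 3-state stationary Markov chain and a binary function phi.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
phi = [0, 0, 1]                        # Y_i = phi[X_i]

# Stationary distribution of P (left eigenvector for eigenvalue 1).
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()

def entropy(pmf):
    p = np.array(list(pmf.values()))
    return -np.sum(p * np.log2(p))

def bounds(n):
    """Return (H(Y_n|Y_1..Y_{n-1}, X_1), H(Y_n|Y_1..Y_{n-1})) by enumeration."""
    py, pxy = defaultdict(float), defaultdict(float)     # Pr{Y_1..Y_n}, Pr{X_1, Y_1..Y_n}
    py_, pxy_ = defaultdict(float), defaultdict(float)   # the same, without Y_n
    for xs in itertools.product(range(len(P)), repeat=n):
        p = mu[xs[0]] * np.prod([P[a, b] for a, b in zip(xs, xs[1:])])
        if p == 0:
            continue
        ys = tuple(phi[x] for x in xs)
        py[ys] += p
        pxy[(xs[0],) + ys] += p
        py_[ys[:-1]] += p
        pxy_[(xs[0],) + ys[:-1]] += p
    upper = entropy(py) - entropy(py_)
    lower = entropy(pxy) - entropy(pxy_)
    return lower, upper

for n in range(2, 7):
    print(n, bounds(n))    # lower and upper bounds close in on the entropy rate H(Y)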


Higher order Markov sources

A Markov chain is a process where the dependency of an outcome at time n only reaches back one step in time

p(x_n \mid x_{n-1}, x_{n-2}, x_{n-3}, \ldots) = p(x_n \mid x_{n-1})

In some applications, particularly in data compression, we might want to have source models where the dependency reaches back longer in time. A Markov source of order k is a random process where the following holds

p(x_n \mid x_{n-1}, x_{n-2}, x_{n-3}, \ldots) = p(x_n \mid x_{n-1}, \ldots, x_{n-k})

A Markov source of order 0 is a memoryless process. A Markov source of order 1 is a Markov chain. Given X_n, a stationary Markov source of order k, form a new source S_n = (X_{n-k+1}, X_{n-k+2}, \ldots, X_n). This new process is a Markov chain. The entropy rate is

H(S_n \mid S_{n-1}) = H(X_n \mid X_{n-1}, \ldots, X_{n-k})
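A sketch (not in the slides) of this construction for an assumed binary order-2 source: the blocks S_n = (X_{n-1}, X_n) form a 4-state Markov chain, and its entropy rate equals H(X_n | X_{n-1}, X_{n-2}). The conditional probability table is made up for illustration.

import itertools
import numpy as np

# Hypothetical binary order-2 source: cond[(x_{n-2}, x_{n-1})][x_n] = p(x_n | x_{n-1}, x_{n-2}).
cond = {(0, 0): [0.9, 0.1],
        (0, 1): [0.4, 0.6],
        (1, 0): [0.7, 0.3],
        (1, 1): [0.2, 0.8]}

# Transition matrix of the block chain S_n = (X_{n-1}, X_n):
# a transition (a, b) -> (c, d) is possible only if c == b, with probability p(d | a, b).
states = list(itertools.product([0, 1], repeat=2))
T = np.zeros((4, 4))
for i, (a, b) in enumerate(states):
    for j, (c, d) in enumerate(states):
        if c == b:
            T[i, j] = cond[(a, b)][d]

# Stationary distribution of the block chain.
eigvals, eigvecs = np.linalg.eig(T.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()

# Entropy rate H(S_n | S_{n-1}) = H(X_n | X_{n-1}, X_{n-2}), in bits.
row_entropy = np.array([-sum(p * np.log2(p) for p in row if p > 0) for row in T])
print(np.dot(mu, row_entropy))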


Differential entropy

A continuous random variable X has the probability density function f(x). Let the support set S be the set of x where f(x) > 0. The differential entropy h(X) of the variable is defined as

h(X) = -\int_S f(x) \log f(x)\, dx

Unlike the entropy for a discrete variable, the differential entropy can be both positive and negative.
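A minimal numerical sketch (not in the slides), assuming a uniform density on [0, 0.25] and logs to base 2, showing that the differential entropy can indeed be negative:

import numpy as np

# Numerical h(X) = -integral of f log2 f over the support, by a midpoint Riemann sum.
a, b = 0.0, 0.25
dx = (b - a) / 100_000
x = a + dx * (np.arange(100_000) + 0.5)   # midpoints of the integration cells
f = np.full_like(x, 1.0 / (b - a))
h = -np.sum(f * np.log2(f)) * dx
print(h)    # log2(b - a) = -2 bits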


Some common distributions

Normal distribution (Gaussian distribution)

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-m)^2}{2\sigma^2}}, \qquad h(X) = \frac{1}{2}\log 2\pi e \sigma^2

Laplace distribution

f(x) = \frac{1}{\sqrt{2}\,\sigma}\, e^{-\frac{\sqrt{2}\,|x-m|}{\sigma}}, \qquad h(X) = \frac{1}{2}\log 2 e^2 \sigma^2

Uniform distribution

f(x) = \begin{cases} \frac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}, \qquad h(X) = \log(b-a) = \frac{1}{2}\log 12\sigma^2
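A Monte Carlo check (not in the slides) of the Gaussian entry above, using h(X) = E[-log2 f(X)] with assumed parameters m = 1 and sigma = 2:

import numpy as np

m, sigma = 1.0, 2.0
rng = np.random.default_rng(0)
x = rng.normal(m, sigma, 1_000_000)
log2_f = -0.5 * ((x - m) / sigma) ** 2 / np.log(2) - np.log2(np.sqrt(2 * np.pi) * sigma)
print(-log2_f.mean())                               # Monte Carlo estimate of h(X)
print(0.5 * np.log2(2 * np.pi * np.e * sigma**2))   # closed form, about 3.05 bits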


AEP for continuous variables

Let X_1, X_2, \ldots, X_n be a sequence of i.i.d. random variables drawn according to the density f(x). Then

-\frac{1}{n}\log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X) \quad \text{in probability}

For ε > 0 and any n we define the typical set A_ε^{(n)} with respect to f(x) as follows:

A_\varepsilon^{(n)} = \left\{ (x_1, \ldots, x_n) \in S^n : \left| -\frac{1}{n}\log f(x_1, \ldots, x_n) - h(X) \right| \le \varepsilon \right\}

The volume Vol(A) of any set A is defined as

\mathrm{Vol}(A) = \int_A dx_1\, dx_2 \cdots dx_n

Page 16: Stochastic processes - Informationskodning processes A stochastic (or random) process fX igis an indexed sequence of random variables. The dependencies among the random variables can

AEP, cont.

The typical set A_ε^{(n)} has the following properties:

1. \Pr\{A_\varepsilon^{(n)}\} > 1 - \varepsilon for sufficiently large n

2. \mathrm{Vol}(A_\varepsilon^{(n)}) \le 2^{n(h(X)+\varepsilon)} for all n

3. \mathrm{Vol}(A_\varepsilon^{(n)}) \ge (1-\varepsilon)\, 2^{n(h(X)-\varepsilon)} for sufficiently large n

This indicates that the typical set contains almost all the probability and has a volume of approximately 2^{n h(X)}.
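An empirical sketch of the AEP (not in the slides): for i.i.d. Gaussian samples with an assumed sigma = 2, the normalized log-density of each long sequence should be close to h(X), about 3.05 bits.

import numpy as np

sigma, n = 2.0, 1_000
rng = np.random.default_rng(1)
for _ in range(5):
    x = rng.normal(0.0, sigma, n)
    # log2 of the joint density of the i.i.d. sequence is the sum of the per-sample log2 densities.
    log2_f = -0.5 * (x / sigma) ** 2 / np.log(2) - np.log2(np.sqrt(2 * np.pi) * sigma)
    print(-log2_f.sum() / n)    # each draw should be close to 3.05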

Page 17: Stochastic processes - Informationskodning processes A stochastic (or random) process fX igis an indexed sequence of random variables. The dependencies among the random variables can

Quantization

Suppose we do uniform quantization of a continuous random variable X, i.e. we divide the range of X into bins of length ∆. Assuming that f(x) is continuous in each bin, there exists a value x_i within each bin such that

f(x_i)\,\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx

Consider the quantized variable X^∆ defined by

X^\Delta = x_i \quad \text{if } i\Delta \le X < (i+1)\Delta

The probability p(x_i) = p_i that X^∆ = x_i is

p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx = f(x_i)\,\Delta

Page 18: Stochastic processes - Informationskodning processes A stochastic (or random) process fX igis an indexed sequence of random variables. The dependencies among the random variables can

Quantization, cont.

The entropy of the quantized variable is

H(X^\Delta) = -\sum_i p_i \log p_i
= -\sum_i \Delta f(x_i) \log(\Delta f(x_i))
= -\sum_i \Delta f(x_i) \log f(x_i) - \sum_i \Delta f(x_i) \log \Delta
\approx -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx - \log \Delta
= h(X) - \log \Delta
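A numerical sketch of this relation (not in the slides), assuming a Gaussian with sigma = 1, bin width ∆ = 0.1, and logs to base 2; H(X^∆) is estimated from a histogram of quantized samples.

import numpy as np

sigma, delta = 1.0, 0.1
rng = np.random.default_rng(2)
x = rng.normal(0.0, sigma, 2_000_000)

bins = np.floor(x / delta).astype(int)              # index of the bin containing each sample
counts = np.bincount(bins - bins.min())
p = counts[counts > 0] / len(x)                     # empirical pmf of the quantized variable

H_quant = -np.sum(p * np.log2(p))
h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(H_quant, h - np.log2(delta))                  # the two values nearly agree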

Page 19: Stochastic processes - Informationskodning processes A stochastic (or random) process fX igis an indexed sequence of random variables. The dependencies among the random variables can

Differential entropy, cont.

Let X and Y be two random variables with joint density function f(x, y) and conditional density functions f(x|y) and f(y|x). The joint differential entropy is defined as

h(X, Y) = -\int f(x, y) \log f(x, y)\, dx\, dy

The conditional differential entropy is defined as

h(X \mid Y) = -\int f(x, y) \log f(x \mid y)\, dx\, dy

We have

h(X, Y) = h(X) + h(Y \mid X) = h(Y) + h(X \mid Y)

which can be generalized to (chain rule)

h(X_1, X_2, \ldots, X_n) = h(X_1) + h(X_2 \mid X_1) + \ldots + h(X_n \mid X_1, X_2, \ldots, X_{n-1})
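A closed-form sketch (not in the slides) of h(X, Y) = h(X) + h(Y|X) for an assumed bivariate Gaussian: the joint entropy uses the formula (1/2) log((2πe)^n |K|) given at the end of these slides, and the conditional variance of Y given X is the standard Gaussian expression K_yy − K_xy²/K_xx.

import numpy as np

# Assumed covariance matrix of (X, Y); values are made up for the example (bits).
K = np.array([[2.0, 0.8],
              [0.8, 1.0]])
h_joint = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
h_x = 0.5 * np.log2(2 * np.pi * np.e * K[0, 0])
var_y_given_x = K[1, 1] - K[0, 1] ** 2 / K[0, 0]    # conditional variance of Y given X
h_y_given_x = 0.5 * np.log2(2 * np.pi * np.e * var_y_given_x)
print(h_joint, h_x + h_y_given_x)                   # the two values agree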

Page 20: Stochastic processes - Informationskodning processes A stochastic (or random) process fX igis an indexed sequence of random variables. The dependencies among the random variables can

Relative entropy

The relative entropy (Kullback-Leibler distance) between two densities f and g is defined by

D(f \,\|\, g) = \int f \log \frac{f}{g}

The relative entropy is finite only if the support set of f is contained in the support set of g.
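A sketch (not in the slides) of D(f||g) for two assumed Gaussian densities: a Monte Carlo estimate of E_f[log f − log g] compared with the known closed form for Gaussians, in nats.

import numpy as np

m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
rng = np.random.default_rng(3)
x = rng.normal(m1, s1, 1_000_000)

def log_pdf(x, m, s):
    # Natural log of the Gaussian density with mean m and standard deviation s.
    return -0.5 * ((x - m) / s) ** 2 - np.log(np.sqrt(2 * np.pi) * s)

d_mc = np.mean(log_pdf(x, m1, s1) - log_pdf(x, m2, s2))
d_exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(d_mc, d_exact)    # both about 0.44 nats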

Page 21: Stochastic processes - Informationskodning processes A stochastic (or random) process fX igis an indexed sequence of random variables. The dependencies among the random variables can

Mutual information

The mutual information between X and Y is defined as

I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x)\, f(y)}\, dx\, dy

which gives

I(X; Y) = h(X) - h(X \mid Y) = h(Y) - h(Y \mid X) = h(X) + h(Y) - h(X, Y)

and

I(X; Y) = D(f(x, y) \,\|\, f(x)\, f(y))

Given two uniformly quantized versions of X and Y, we have

I(X^\Delta; Y^\Delta) = H(X^\Delta) - H(X^\Delta \mid Y^\Delta)
\approx h(X) - \log \Delta - (h(X \mid Y) - \log \Delta)
= I(X; Y)
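A sketch (not in the slides) of this approximation, assuming a correlated Gaussian pair with correlation rho = 0.8 and bin width ∆ = 0.2: the mutual information of the quantized pair, estimated from a 2-D histogram, is compared with the exact I(X; Y) = −(1/2) log2(1 − rho²).

import numpy as np

rho, delta, n = 0.8, 0.2, 1_000_000
rng = np.random.default_rng(4)
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Quantize both variables to bins of width delta and build a joint histogram.
xi = np.floor(x / delta).astype(int)
yi = np.floor(y / delta).astype(int)
xi -= xi.min()
yi -= yi.min()
joint = np.zeros((xi.max() + 1, yi.max() + 1))
np.add.at(joint, (xi, yi), 1.0)

pxy = joint / joint.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
mask = pxy > 0
I_quant = np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))
print(I_quant)                        # estimate of I(X^Delta; Y^Delta)
print(-0.5 * np.log2(1 - rho**2))     # I(X; Y), about 0.74 bits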

Page 22: Stochastic processes - Informationskodning processes A stochastic (or random) process fX igis an indexed sequence of random variables. The dependencies among the random variables can

Properties

D(f \,\|\, g) \ge 0

with equality iff f and g are equal almost everywhere.

I(X; Y) \ge 0

with equality iff X and Y are independent.

h(X \mid Y) \le h(X)

with equality iff X and Y are independent.

h(X + c) = h(X)

h(aX) = h(X) + \log |a|

h(A\mathbf{X}) = h(\mathbf{X}) + \log |\det(A)|

Page 23: Stochastic processes - Informationskodning processes A stochastic (or random) process fX igis an indexed sequence of random variables. The dependencies among the random variables can

Differential entropy, cont.

The Gaussian distribution is the distribution that maximizes the differential entropy for a given covariance matrix.

That is, for the one-dimensional case, the differential entropy of a variable X with variance σ² satisfies the inequality

h(X) \le \frac{1}{2}\log 2\pi e \sigma^2

with equality iff X is Gaussian.

For the general case, the differential entropy for a random n-dimensional vector X with covariance matrix K satisfies the inequality

h(\mathbf{X}) \le \frac{1}{2}\log\bigl((2\pi e)^n |K|\bigr)

with equality iff X is a multivariate Gaussian vector.
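As a closing sketch (not in the slides), the closed forms from the distribution table earlier in these slides illustrate the bound in the one-dimensional case: for a common variance, the Gaussian has the largest differential entropy.

import numpy as np

# Common variance sigma^2 = 1; differential entropies in bits from the table above.
sigma2 = 1.0
h_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
h_laplace = 0.5 * np.log2(2 * np.e**2 * sigma2)
h_uniform = 0.5 * np.log2(12 * sigma2)
print(h_gauss, h_laplace, h_uniform)    # about 2.05, 1.94, and 1.79 bits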