Stochastic processes
A stochastic (or random) process {Xi} is an indexed sequence of random variables. The dependencies among the random variables can be arbitrary. The process is characterized by the joint probability mass functions
p(x1, x2, . . . , xn) = Pr{(X1, X2, . . . , Xn) = (x1, x2, . . . , xn)}, (x1, x2, . . . , xn) ∈ X^n
A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e.
Pr{X1 = x1, . . . ,Xn = xn} = Pr{X1+l = x1, . . . ,Xn+l = xn}
for every n and shift l and for all x1, . . . , xn ∈ X .
Markov chains
A discrete stochastic process X1, X2, . . . is said to be a Markov chain or a Markov process if for n = 1, 2, . . .
Pr{Xn+1 = xn+1|Xn = xn, . . . ,X1 = x1} = Pr{Xn+1 = xn+1|Xn = xn}
for all x1, . . . , xn, xn+1 ∈ X .
For a Markov chain we can write the joint probability mass function as
p(x1, x2, . . . , xn) = p(x1)p(x2|x1) · · · p(xn|xn−1)
A Markov chain is said to be time invariant if the conditional probability p(xn+1|xn) does not depend on n, i.e.
Pr{Xn+1 = b|Xn = a} = Pr{X2 = b|X1 = a}
for all a, b ∈ X and n = 1, 2, . . ..
Markov chains, cont.
If {Xi} is a Markov chain, Xn is called the state at time n. A time-invariant Markov chain is characterized by its initial state and a probability transition matrix P = [Pij], where Pij = Pr{Xn+1 = j | Xn = i} (assuming that X = {1, 2, . . . , m}).
If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, the Markov chain is said to be irreducible. If the greatest common divisor of the lengths of the different paths from a state to itself is 1, the Markov chain is said to be aperiodic.
If the probability mass function at time n is p(xn), the probability mass function at time n + 1 is

p(xn+1) = ∑_{xn} p(xn) P_{xn xn+1}
Markov chains, cont.
A distribution on the states such that the distribution at time n + 1 is the same as the distribution at time n is called a stationary distribution. It is so called because if the initial state of the Markov chain is drawn according to a stationary distribution, the Markov chain forms a stationary process.
If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is unique, and from any starting distribution, the distribution of Xn tends to the stationary distribution as n → ∞.
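This convergence can be illustrated with a short sketch in Python; the two-state transition matrix is an assumed toy example, not from the text:

```python
# Power iteration toward the stationary distribution.
# The 2-state transition matrix P is an assumed toy example:
# P[i][j] = Pr{X_{n+1} = j | X_n = i}.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(dist, P):
    """One step of p(x_{n+1}) = sum_{x_n} p(x_n) P_{x_n x_{n+1}}."""
    m = len(P)
    return [sum(dist[i] * P[i][j] for i in range(m)) for j in range(m)]

# Start from an arbitrary distribution and iterate.
dist = [1.0, 0.0]
for _ in range(200):
    dist = step(dist, P)

# The fixed point solves mu = mu P; for this P it is mu = (0.8, 0.2).
print(dist)
```

Since this toy chain is irreducible and aperiodic, any starting distribution ends up at the same fixed point.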
Entropy rate
The entropy rate (or just entropy) of a stochastic process {Xi} is defined by
H(X) = lim_{n→∞} (1/n) H(X1, X2, . . . , Xn)
when the limit exists. It can also be defined as
H′(X) = lim_{n→∞} H(Xn|Xn−1, Xn−2, . . . , X1)
when the limit exists. For stationary processes, both limits exist and are equal:
H(X ) = H ′(X )
For a stationary Markov chain, we have
H(X) = H′(X) = lim_{n→∞} H(Xn|Xn−1, Xn−2, . . . , X1) = lim_{n→∞} H(Xn|Xn−1) = H(X2|X1)
Entropy rate, cont.
For a Markov chain, the stationary distribution μ is the solution to the equation system
μj = ∑_i μi Pij , j = 1, . . . , m
Let {Xi} be a stationary Markov chain with stationary distribution μ and transition matrix P, and let X1 ∼ μ. The entropy rate of the Markov chain is then given by
H(X) = −∑_{i,j} μi Pij log Pij = ∑_i μi (−∑_j Pij log Pij)
i.e., a weighted sum of the entropies for each state.
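A minimal sketch of this formula, again using an assumed toy chain (the stationary distribution can be checked against μj = ∑_i μi Pij by hand):

```python
import math

# Entropy rate of a stationary Markov chain:
# H(X) = sum_i mu_i * ( -sum_j P_ij log P_ij ).
# The chain (P, mu) is an assumed toy example; mu solves mu_j = sum_i mu_i P_ij.
P = [[0.9, 0.1],
     [0.4, 0.6]]
mu = [0.8, 0.2]

def row_entropy(row):
    """Entropy (in bits) of the transition probabilities out of one state."""
    return -sum(p * math.log2(p) for p in row if p > 0)

H_rate = sum(mu[i] * row_entropy(P[i]) for i in range(len(P)))
print(H_rate)   # bits per symbol
```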
Functions of Markov chains
Let X1, X2, . . . be a stationary Markov chain and let Yi = φ(Xi) be a process where each term is a function of the corresponding state in the Markov chain. What is the entropy rate H(Y)? If Yi also forms a Markov chain this is easy to calculate, but that is not generally the case. We can however find upper and lower bounds.
Since the Markov chain is stationary, so is Y1, Y2, . . .. Thus, the entropy rate H(Y) is well defined. We might try to compute H(Yn|Yn−1, . . . , Y1) for each n to find the entropy rate; however, the convergence might be arbitrarily slow.
We already know that H(Yn|Yn−1, . . . , Y1) is an upper bound on H(Y), since it converges monotonically to the entropy rate. For a lower bound, we can use H(Yn|Yn−1, . . . , Y1, X1).
Functions of Markov chains, cont.
We have

H(Yn|Yn−1, . . . , Y2, X1) ≤ H(Y)
Proof: For k = 1, 2, . . . we have
H(Yn|Yn−1, . . . , Y2, X1) = H(Yn|Yn−1, . . . , Y2, Y1, X1)
= H(Yn|Yn−1, . . . , Y2, Y1, X1, X0, . . . , X−k)
= H(Yn|Yn−1, . . . , Y2, Y1, X1, X0, . . . , X−k, Y0, . . . , Y−k)
≤ H(Yn|Yn−1, . . . , Y2, Y1, Y0, . . . , Y−k)
= H(Yn+k+1|Yn+k, . . . , Y2, Y1)
Since this is true for all k it will be true in the limit too. Thus
H(Yn|Yn−1, . . . , Y1, X1) ≤ lim_{k→∞} H(Yn+k+1|Yn+k, . . . , Y2, Y1) = H(Y)
Functions of Markov chains, cont.
The gap between the upper and lower bounds tends to 0, i.e.
H(Yn|Yn−1, . . . ,Y1)− H(Yn|Yn−1, . . . ,Y1,X1)→ 0
Proof: The difference can be rewritten as
H(Yn|Yn−1, . . . , Y1) − H(Yn|Yn−1, . . . , Y1, X1) = I(X1; Yn|Yn−1, . . . , Y1)
By the properties of mutual information
I (X1; Y1,Y2, . . . ,Yn) ≤ H(X1)
Since I (X1; Y1,Y2, . . . ,Yn) increases with n, the limit exists and
lim_{n→∞} I(X1; Y1, Y2, . . . , Yn) ≤ H(X1)
Functions of Markov chains, cont.
The chain rule for mutual information gives us
H(X1) ≥ lim_{n→∞} I(X1; Y1, Y2, . . . , Yn)
= lim_{n→∞} ∑_{i=1}^{n} I(X1; Yi|Yi−1, . . . , Y1)
= ∑_{i=1}^{∞} I(X1; Yi|Yi−1, . . . , Y1)
Since this infinite sum is finite and all terms are non-negative, the terms must tend to 0, which means that
lim_{n→∞} I(X1; Yn|Yn−1, . . . , Y1) = 0
Functions of Markov chains, cont.
We have thus shown that
H(Yn|Yn−1, . . . ,Y1,X1) ≤ H(Y) ≤ H(Yn|Yn−1, . . . ,Y1)
and that
lim_{n→∞} H(Yn|Yn−1, . . . , Y1, X1) = H(Y) = lim_{n→∞} H(Yn|Yn−1, . . . , Y1)
This result can also be generalized to the case when Yi is a random function of Xi, called a hidden Markov model.
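The sandwich bound H(Yn|Yn−1, . . . , Y1, X1) ≤ H(Y) ≤ H(Yn|Yn−1, . . . , Y1) can be evaluated by brute-force enumeration for a small chain. A sketch, where the three-state transition matrix and the lumping map φ are assumed illustrative choices:

```python
import itertools
import math

# Brute-force evaluation of the bounds
#   H(Yn | Y^{n-1}, X1)  <=  H(Y)  <=  H(Yn | Y^{n-1})
# for Y_i = phi(X_i). The 3-state chain and the map phi are assumed examples;
# P is doubly stochastic, so the uniform distribution is stationary.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
mu = [1/3, 1/3, 1/3]
phi = [0, 0, 1]          # states 0 and 1 are indistinguishable in Y

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def joint(n, keep_x1=False):
    """Exact distribution of (Y1..Yn), optionally together with X1."""
    dist = {}
    for xs in itertools.product(range(3), repeat=n):
        p = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= P[a][b]
        key = tuple(phi[x] for x in xs)
        if keep_x1:
            key = (xs[0],) + key
        dist[key] = dist.get(key, 0.0) + p
    return dist

n = 6
upper = entropy(joint(n)) - entropy(joint(n - 1))              # H(Yn|Y^{n-1})
lower = entropy(joint(n, True)) - entropy(joint(n - 1, True))  # H(Yn|Y^{n-1},X1)
print(lower, upper)    # lower <= H(Y) <= upper
```

The map φ here is chosen so that Y is not itself a Markov chain, which is why the two bounds differ at finite n.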
Higher order Markov sources
A Markov chain is a process where the dependency of an outcome at time n only reaches back one step in time
p(xn|xn−1, xn−2, xn−3, . . .) = p(xn|xn−1)
In some applications, particularly in data compression, we might want to have source models where the dependency reaches back longer in time. A Markov source of order k is a random process where the following holds
p(xn|xn−1, xn−2, xn−3, . . .) = p(xn|xn−1, . . . , xn−k)
A Markov source of order 0 is a memoryless process. A Markov source of order 1 is a Markov chain. Let Xn be a stationary Markov source of order k, and form the new source Sn = (Xn−k+1, Xn−k+2, . . . , Xn). This new process is a Markov chain. The entropy rate is

H(Sn|Sn−1) = H(Xn|Xn−1, . . . , Xn−k)
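A sketch of this construction for an assumed order-2 binary source (the conditional probabilities are illustrative, not from the text):

```python
import itertools
import math

# Order-2 binary Markov source, turned into an ordinary Markov chain on
# pairs S_n = (X_{n-1}, X_n). The conditional probabilities are assumed:
# cond[(x_{n-2}, x_{n-1})] = Pr{X_n = 1 | past}.
cond = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.9}
states = list(itertools.product([0, 1], repeat=2))

def trans(s, t):
    """Pr{S_{n+1} = t | S_n = s}; t must overlap s in one symbol."""
    if s[1] != t[0]:
        return 0.0
    p1 = cond[s]
    return p1 if t[1] == 1 else 1.0 - p1

# Stationary distribution over pairs, by power iteration.
mu = {s: 0.25 for s in states}
for _ in range(500):
    mu = {t: sum(mu[s] * trans(s, t) for s in states) for t in states}

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# H(S_n | S_{n-1}) = H(X_n | X_{n-1}, X_{n-2})
H_rate = sum(mu[s] * h2(cond[s]) for s in states)
print(H_rate)
```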
Differential entropy
A continuous random variable X has the probability density function f(x). Let the support set S be the set of x where f(x) > 0. The differential entropy h(X) of the variable is defined as
h(X) = −∫_S f(x) log f(x) dx
Unlike the entropy for a discrete variable, the differential entropy can be both positive and negative.
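For instance, a uniform density on an interval shorter than 1 has negative differential entropy; a quick check (the interval [0, 0.25] is an arbitrary choice):

```python
import math

# A uniform density on [a, b] has h(X) = log(b - a), which is negative
# whenever b - a < 1. The interval [0, 0.25] is an arbitrary choice.
a, b = 0.0, 0.25
h_closed = math.log2(b - a)                     # = -2 bits

# Same value via a Riemann sum of -f log f over the support.
f = 1.0 / (b - a)
N = 100000
dx = (b - a) / N
h_numeric = sum(-f * math.log2(f) * dx for _ in range(N))
print(h_closed, h_numeric)
```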
Some common distributions
Normal distribution (gaussian distribution)

f(x) = (1/(√(2π)σ)) e^(−(x−m)²/(2σ²)),  h(X) = (1/2) log 2πeσ²

Laplace distribution

f(x) = (1/(√2 σ)) e^(−√2 |x−m|/σ),  h(X) = (1/2) log 2e²σ²

Uniform distribution

f(x) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise,  h(X) = log(b − a) = (1/2) log 12σ²
AEP for continuous variables
Let X1, X2, . . . , Xn be a sequence of i.i.d. random variables drawn according to the density f(x). Then
−(1/n) log f(X1, X2, . . . , Xn) → E[− log f(X)] = h(X) in probability
For ε > 0 and any n we define the typical set A_ε^(n) with respect to f(x) as follows

A_ε^(n) = {(x1, . . . , xn) ∈ S^n : |−(1/n) log f(x1, . . . , xn) − h(X)| ≤ ε}
The volume Vol(A) of any set A is defined as

Vol(A) = ∫_A dx1 dx2 . . . dxn
AEP, cont.
The typical set A_ε^(n) has the following properties

1. Pr{A_ε^(n)} > 1 − ε for sufficiently large n
2. Vol(A_ε^(n)) ≤ 2^(n(h(X)+ε)) for all n
3. Vol(A_ε^(n)) ≥ (1 − ε) 2^(n(h(X)−ε)) for sufficiently large n
This indicates that the typical set contains almost all probability and has a volume of approximately 2^(nh(X)).
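A Monte Carlo sketch of the convergence behind the AEP; the choice of a gaussian density with σ = 2 is an assumed example:

```python
import math
import random

# Monte Carlo check of the AEP: for i.i.d. samples from a gaussian density
# (sigma = 2 is an assumed example), -(1/n) log f(X1..Xn) should be close
# to h(X) = (1/2) log 2*pi*e*sigma^2.
random.seed(0)
sigma = 2.0
h_closed = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

def log2_f(x):
    """log2 of the N(0, sigma^2) density at x."""
    return math.log2(1.0 / (math.sqrt(2 * math.pi) * sigma)) \
        - (x * x) / (2 * sigma * sigma) * math.log2(math.e)

n = 200000
estimate = sum(-log2_f(random.gauss(0.0, sigma)) for _ in range(n)) / n
print(estimate, h_closed)
```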
Quantization
Suppose we do uniform quantization of a continuous random variable X, i.e. we divide the range of X into bins of length ∆. Assuming that f(x) is continuous in each bin, there exists a value xi within each bin such that
f(xi)∆ = ∫_{i∆}^{(i+1)∆} f(x) dx
Consider the quantized variable X∆ defined by

X∆ = xi if i∆ ≤ X < (i + 1)∆
The probability p(xi) = pi that X∆ = xi is

pi = ∫_{i∆}^{(i+1)∆} f(x) dx = f(xi)∆
Quantization, cont.
The entropy of the quantized variable is
H(X∆) = −∑_i pi log pi
= −∑_i ∆f(xi) log(∆f(xi))
= −∑_i ∆f(xi) log f(xi) − ∑_i ∆f(xi) log ∆
≈ −∫_{−∞}^{∞} f(x) log f(x) dx − log ∆
= h(X) − log ∆
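This approximation can be checked numerically; a sketch for an assumed standard gaussian X and ∆ = 1/64:

```python
import math

# Check H(X_Delta) ≈ h(X) - log Delta for an assumed standard gaussian X,
# where h(X) = (1/2) log 2*pi*e ≈ 2.047 bits, and Delta = 1/64.
h_closed = 0.5 * math.log2(2 * math.pi * math.e)
delta = 1.0 / 64

def cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

H = 0.0
for i in range(-1000, 1000):                   # bins cover about +-15 sigma
    p = cdf((i + 1) * delta) - cdf(i * delta)  # p_i = integral of f over bin i
    if p > 0:
        H -= p * math.log2(p)

print(H, h_closed - math.log2(delta))
```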
Differential entropy, cont.
Let X and Y be two random variables with joint density function f(x, y) and conditional density functions f(x|y) and f(y|x). The joint differential entropy is defined as
h(X,Y) = −∫ f(x, y) log f(x, y) dx dy
The conditional differential entropy is defined as
h(X|Y) = −∫ f(x, y) log f(x|y) dx dy
We have

h(X,Y) = h(X) + h(Y|X) = h(Y) + h(X|Y)
which can be generalized to (chain rule)
h(X1,X2, . . . ,Xn) = h(X1) + h(X2|X1) + . . .+ h(Xn|X1,X2, . . . ,Xn−1)
Relative entropy
The relative entropy (Kullback–Leibler distance) between two densities f and g is defined by
D(f||g) = ∫ f log (f/g)
The relative entropy is finite only if the support set of f is contained in the support set of g.
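A numerical sketch of D(f||g) for two assumed gaussian densities, compared with the known closed form for the gaussian case:

```python
import math

# D(f||g) for two assumed gaussian densities, compared with the known
# closed form for gaussians (in nats):
#   D = ln(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 2.0

def pdf(x, m, s):
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

closed = math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2 * s2 * s2) - 0.5

# Riemann sum of f log(f/g); [-10, 10] holds essentially all of f's mass.
dx = 0.001
numeric = sum(pdf(i * dx, m1, s1) * math.log(pdf(i * dx, m1, s1) / pdf(i * dx, m2, s2)) * dx
              for i in range(-10000, 10000))
print(numeric, closed)
```

Note that f (the narrower density) has its support inside the region where g is positive, so the integral is finite, in line with the condition above.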
Mutual information
The mutual information between X and Y is defined as
I(X;Y) = ∫ f(x, y) log ( f(x, y) / (f(x)f(y)) ) dx dy
which gives
I(X;Y) = h(X) − h(X|Y) = h(Y) − h(Y|X) = h(X) + h(Y) − h(X,Y)
and

I(X;Y) = D(f(x, y)||f(x)f(y))
Given two uniformly quantized versions of X and Y,
I(X∆;Y∆) = H(X∆) − H(X∆|Y∆)
≈ h(X) − log ∆ − (h(X|Y) − log ∆)
= I(X;Y)
Properties
D(f ||g) ≥ 0
with equality iff f and g are equal almost everywhere.
I (X ; Y ) ≥ 0
with equality iff X and Y are independent.
h(X |Y ) ≤ h(X )
with equality iff X and Y are independent.
h(X + c) = h(X )
h(aX ) = h(X ) + log |a|
h(AX) = h(X) + log | det(A)|
Differential entropy, cont.
The gaussian distribution is the distribution that maximizes the differential entropy, for a given covariance matrix.
I.e., for the one-dimensional case, the differential entropy for a variable X with variance σ² satisfies the inequality
h(X) ≤ (1/2) log 2πeσ²
with equality iff X is gaussian.
For the general case, the differential entropy for a random n-dimensional vector X with covariance matrix K satisfies the inequality
h(X) ≤ (1/2) log((2πe)^n |K|)
with equality iff X is a multivariate gaussian vector.
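Comparing the closed-form entropies listed earlier at a common variance illustrates the maximum-entropy property (pure arithmetic on those formulas, no further assumptions):

```python
import math

# Closed-form differential entropies (in bits) at a common variance sigma^2,
# from the formulas earlier in the notes: the gaussian comes out on top.
sigma2 = 1.0
h_gauss   = 0.5 * math.log2(2 * math.pi * math.e * sigma2)
h_laplace = 0.5 * math.log2(2 * math.e ** 2 * sigma2)
h_uniform = 0.5 * math.log2(12 * sigma2)
print(h_gauss, h_laplace, h_uniform)
```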