
Page 1

Entropy, Inference, and Channel Coding

Sean Meyn

Department of Electrical and Computer Engineering, University of Illinois and the Coordinated Science Laboratory

NSF support: ECS 02-17836, ITR 00-85929 and CCF 00-49089

Page 2

Overview

Hypothesis testing and channel coding

Structure of optimal codes

Error exponents

Algorithms

[Figure: error exponent E_r(R) versus rate R for the optimal code and for QAM]

Page 3

References

Large deviations

Dembo and Zeitouni, Large Deviations Techniques And Applications, 1998

Kontoyiannis, Lastras-Montano and Meyn, Relative Entropy and Exponential Deviation Bounds for General Markov Chains, ISIT, 2005

Pandit and Meyn, Extremal Distributions and Worst-Case Large-Deviation Bounds, 2004

Hypothesis testing

D&Z 1998

Zeitouni and Gutman, On universal hypothesis testing via large deviations, IT-37, 1991

Pandit, Meyn and Veeravalli, Asymptotic Robust Neyman-Pearson Testing Based on Moment Classes, ISIT, 2004.

Page 4

References

Channel coding

Csiszár and Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, 1997

MacKay, Information Theory, Inference, and Learning Algorithms, CUP, 2003 http://www.inference.phy.cam.ac.uk/mackay/itila/

Blahut, Hypothesis testing and information theory, IT-20, 1974

Page 5

Outline (today)

Introduction

Relative entropy & Large deviations

Hypothesis testing

Channel capacity

Conclusions


Page 6

Memoryless Channel Model

Memoryless channel with input sequence X, output sequence Y

Channel kernel

If X is i.i.d. with marginal distribution µ

Then, Y is i.i.d. with marginal distribution π

P(dy | x) = P{Y_t ∈ dy | X_t = x}

π( · ) = ∫ P( · | x) µ(dx)
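For a finite-alphabet channel this relation is a single matrix-vector product. A minimal sketch (the kernel and input distribution below are made-up values for illustration):

```python
import numpy as np

# Hypothetical discrete memoryless channel with 3 inputs and 2 outputs.
# P[x, y] = P(Y_t = y | X_t = x); each row sums to 1.
P = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])

mu = np.array([0.5, 0.3, 0.2])    # input marginal distribution

# Output marginal: pi(y) = sum_x P(y | x) mu(x)
pi = mu @ P
print(pi)                         # [0.64, 0.36]
assert np.isclose(pi.sum(), 1.0)
```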

Page 7

Random codebook

Channel kernel

N-dimensional code words

N-dimensional output Y received: i.i.d., with marginal distribution π

X^i,  i = 1, 2, . . . , e^{NR}

P(dy | x) = P{Y_t ∈ dy | X_t = x}
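A sketch of drawing such a random codebook: roughly e^{NR} length-N codewords sampled i.i.d. from µ (the alphabet, block length, and rate below are illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

N, R = 8, 0.5                       # block length, rate in nats/symbol (illustrative)
M = int(np.ceil(np.exp(N * R)))     # number of codewords, about e^{NR}

alphabet = np.array([-1.0, 0.0, 1.0])   # hypothetical input alphabet
mu = np.array([0.25, 0.5, 0.25])        # input distribution mu

# codebook[i] is the N-dimensional codeword X^i, with entries drawn i.i.d. from mu
codebook = rng.choice(alphabet, size=(M, N), p=mu)
print(codebook.shape)                   # (55, 8): e^4 is about 54.6
```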

Page 8

IEEE Std 802.11a-1999, Supplement to IEEE Standard for Information Technology

[Constellation diagrams from the standard: BPSK, QPSK, 16-QAM, and 64-QAM bit-to-symbol mappings on the I/Q plane]

Page 9

Questions & Objectives

1. What is the structure of optimal µ ?

2. Construct algorithms based on this structure

3. Worst-case modeling to simplify code construction

4. Decoding algorithms and evaluation

Page 10

Questions & Objectives

1. What is the structure of optimal µ ?

2. Construct algorithms based on this structure

3. Worst-case modeling to simplify code construction

4. Decoding algorithms and evaluation

Methodology & Viewpoint:

Hypothesis testing

Large deviations

Convex & linear optimization theory

Page 11

Example: Rayleigh Channel Y = AX + N

σ_A² = 1,  σ_N² = 1,  and  σ_P² = 26.4  (SNR = 14.2 dB)

A and N are i.i.d. and mutually independent:
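A Monte Carlo sketch of this channel: Y = AX + N is simulated with a Rayleigh amplitude A normalized to E[A²] = 1 and unit-variance Gaussian noise, and the mutual information of a discrete input is estimated by simulation. The 3-point constellation and its probabilities below are placeholders, not the optimized three-point distribution discussed on the following slides:

```python
import numpy as np

rng = np.random.default_rng(1)

sigma_N = 1.0                              # noise variance sigma_N^2 = 1
points = np.array([0.0, 4.0, 8.0])         # illustrative 3-point constellation
probs  = np.array([0.5, 0.3, 0.2])         # illustrative input distribution

a_mc = rng.rayleigh(scale=1/np.sqrt(2), size=1000)   # Rayleigh A with E[A^2] = 1

def cond_density(y, x):
    """Monte Carlo estimate of p(y | x) = E_A[ Normal(y; A*x, sigma_N^2) ]."""
    z = (y[:, None] - a_mc[None, :] * x) / sigma_N
    return np.exp(-0.5 * z**2).mean(axis=1) / (np.sqrt(2 * np.pi) * sigma_N)

# Draw (X, Y) pairs and estimate I(X;Y) = E[ log p(Y|X) - log p(Y) ]
n = 5000
x = rng.choice(points, size=n, p=probs)
y = rng.rayleigh(scale=1/np.sqrt(2), size=n) * x + rng.normal(0.0, sigma_N, size=n)

p_y_given_x = np.zeros(n)
p_y = np.zeros(n)
for xv, pv in zip(points, probs):
    d = cond_density(y, xv)
    p_y += pv * d
    p_y_given_x[x == xv] = d[x == xv]

print(f"estimated I(X;Y): {np.mean(np.log(p_y_given_x) - np.log(p_y)):.3f} nats/symbol")
```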

Page 12

Example: Rayleigh Channel Y = AX + N

σ_A² = 1,  σ_N² = 1,  and  σ_P² = 26.4  (SNR = 14.2 dB)

16-point QAM

I = 0.2 nats/symbol.

A and N are i.i.d. and mutually independent:

Standard: 16-point QAM

Rate: 2.57   7.71

Page 13

Example: Rayleigh Channel Y = AX + N

σ_A² = 1,  σ_N² = 1,  and  σ_P² = 26.4  (SNR = 14.2 dB)

A and N are i.i.d. and mutually independent:

16-point QAM: 2.57   7.71        Three-point constellation: 2.7   8

Page 14

Example: Rayleigh Channel Y = AX + N

σ_A² = 1,  σ_N² = 1,  and  σ_P² = 26.4  (SNR = 14.2 dB)

A and N are i.i.d. and mutually independent:

3-point distribution: three-fold improvement over 16-point QAM

[Figure: error exponent E_r(R) versus rate R, 0 ≤ R ≤ 0.6, for the two constellations]

Page 15

Outline

Introduction

Relative entropy & Large deviations

Hypothesis testing

Channel capacity

Conclusions


Page 16

Large Deviations

Simulate a function

X = {X1, X2, . . . } a nice Markov chain on X, marginal distribution µ

g : X → R

c_n = n^{-1} ∑_{t=1}^{n} g(X_t)

Page 17

Large Deviations

Simulate a function

Probability of over-estimate

X = {X1, X2, . . . } a nice Markov chain on X, marginal distribution µ

c > c_0

n^{-1} log P{ n^{-1} ∑_{t=1}^{n} g(X_t) ≥ c } → −Λ∗(c)

g : X → R

c_n = n^{-1} ∑_{t=1}^{n} g(X_t) → c_0 = µ(g)

Page 18

Large Deviations

Simulate a function

Rate function & log-moment generating function

Probability of over-estimate

X = {X1, X2, . . . } a nice Markov chain on X, marginal distribution µ

c > c_0 = µ(g),   Λ∗(c) = sup_{θ>0} [θc − Λ(θ)]

n^{-1} log P{ n^{-1} ∑_{t=1}^{n} g(X_t) ≥ c } → −Λ∗(c)

Λ(θ) = lim_{n→∞} n^{-1} log E[ exp( θ ∑_{t=1}^{n} g(X_t) ) ]

g : X → R

c_n = n^{-1} ∑_{t=1}^{n} g(X_t) → c_0 = µ(g)
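In the i.i.d. case the limit reduces to Λ(θ) = log E[e^{θ g(X_1)}], and Λ∗(c) is a one-dimensional Legendre transform that can be evaluated numerically. A small sketch (the three-point distribution and the values of c are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# i.i.d. example: X_t uniform on {0, 1, 2}, g(x) = x (arbitrary choices)
vals  = np.array([0.0, 1.0, 2.0])
probs = np.array([1/3, 1/3, 1/3])

def Lambda(theta):
    """Log-moment generating function: log E[exp(theta * g(X_1))]."""
    return np.log(np.sum(probs * np.exp(theta * vals)))

def Lambda_star(c):
    """Rate function: sup over theta > 0 of theta*c - Lambda(theta)."""
    res = minimize_scalar(lambda th: -(th * c - Lambda(th)),
                          bounds=(1e-8, 50.0), method="bounded")
    return -res.fun

c0 = float(np.dot(probs, vals))     # c_0 = mu(g) = 1.0
for c in [1.2, 1.5, 1.8]:           # over-estimates c > c_0
    print(c, Lambda_star(c))        # P{c_n >= c} decays like exp(-n * Lambda_star(c))
```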

Page 19

Hoeffding's Bound

Marginal distribution µ unknown

Worst-case rate function & log-moment generating function

X = {X1, X2, . . . } is i.i.d. on X

c_n = n^{-1} ∑_{t=1}^{n} X_t → c_0 = µ(g)

g(x) = x,   X = [0, 1]

inf{Λ∗_µ(c) : µ(g) = c_0}        sup{Λ_µ(θ) : µ(g) = c_0}

Page 20

Hoeffding's Bound

Marginal distribution µ unknown

Worst-case rate function & log-moment generating function

X = {X_1, X_2, . . . } is i.i.d. on X

c_n = n^{-1} ∑_{t=1}^{n} X_t → c_0 = µ(g)

g(x) = x,   X = [0, 1]

inf{Λ∗_µ(c) : µ(g) = c_0}        sup{Λ_µ(θ) : µ(g) = c_0}

Solution: the extremal µ is binary, supported on {0, 1}
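A quick numerical illustration of why the binary law is extremal: among distributions on [0, 1] with a common mean, the one supported on {0, 1} maximizes E[e^{θX}] for every θ > 0 (by convexity of e^{θx}), and hence minimizes the rate function; the resulting worst-case rate is the binary divergence that appears in Hoeffding's bound. The mean, θ, and comparison distributions below are arbitrary:

```python
import numpy as np

c0, theta = 0.3, 2.0                       # common mean and an arbitrary theta > 0

def mgf(points, probs):
    """E[exp(theta * X)] for a discrete distribution on [0, 1]."""
    return float(np.sum(np.asarray(probs) * np.exp(theta * np.asarray(points))))

# Three distributions on [0, 1], all with mean c0 = 0.3
print(mgf([0.0, 1.0], [1 - c0, c0]))       # Bernoulli(c0): the largest MGF
print(mgf([0.1, 0.5], [0.5, 0.5]))         # a two-point alternative
print(mgf([c0], [1.0]))                    # point mass at c0: the smallest

# Worst-case rate function = binary divergence between Bernoulli(c) and Bernoulli(c0)
c = 0.5
print(c * np.log(c / c0) + (1 - c) * np.log((1 - c) / (1 - c0)))
```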

Page 21

Bennett's Lemma

Marginal distribution µ unknown

Worst-case rate function & log-moment generating function

X = {X_1, X_2, . . . } is i.i.d. on X;  mean and variance given

c_n = n^{-1} ∑_{t=1}^{n} X_t

g(x) = x,   X = [0, 1]

inf{Λ∗_µ(c) : µ(g_i) = c_i, i = 1, 2}        sup{Λ_µ(θ) : µ(g_i) = c_i, i = 1, 2}

Page 22

Bennett's Lemma

Marginal distribution µ unknown

Worst-case rate function & log-moment generating function

X = {X_1, X_2, . . . } is i.i.d. on X;  mean and variance given

c_n = n^{-1} ∑_{t=1}^{n} X_t

g(x) = x,   X = [0, 1]

inf{Λ∗_µ(c) : µ(g_i) = c_i, i = 1, 2}        sup{Λ_µ(θ) : µ(g_i) = c_i, i = 1, 2}

Solution: the extremal µ∗ is binary, supported on {x_0, 1}

Page 23

Generalized Bennett's Lemma

Marginal distribution µ unknown

Worst-case moment generating function:

X = {X_1, X_2, . . . } is i.i.d. on X;  n moments µ(g_i) = c_i given

c_n = n^{-1} ∑_{t=1}^{n} g(X_t),   X = [0, 1]

λ(θ) = E[e^{θ g(X_t)}] = ⟨µ, e^{θg}⟩

Page 24

Generalized Bennett's Lemma

Marginal distribution µ unknown

Worst-case moment generating function:

Linear program over M:

X = {X_1, X_2, . . . } is i.i.d. on X;  n moments µ(g_i) = c_i given

c_n = n^{-1} ∑_{t=1}^{n} g(X_t),   X = [0, 1]

λ(θ) = E[e^{θ g(X_t)}] = ⟨µ, e^{θg}⟩

max ⟨µ, e^{θg}⟩    s.t.   ⟨µ, g_i⟩ = c_i,  i = 1, . . . , n

The optimizer µ∗ is discrete
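A sketch of this linear program after discretizing X = [0, 1] to a finite grid, using scipy's LP solver. The function g, the value of θ, and the moment values are illustrative; the point of the example is that the maximizer returned by the solver puts mass on only a handful of grid points, consistent with µ∗ being discrete:

```python
import numpy as np
from scipy.optimize import linprog

# Discretize X = [0, 1]; maximize <mu, e^{theta g}> over mu matching given moments.
x = np.linspace(0.0, 1.0, 201)
theta = 3.0
g = x                                     # g(x) = x (illustrative)
c1, c2 = 0.3, 0.15                        # fixed mean and second moment (illustrative)

obj = -np.exp(theta * g)                  # linprog minimizes, so negate the objective
A_eq = np.vstack([np.ones_like(x), x, x**2])
b_eq = np.array([1.0, c1, c2])            # normalization plus the two moment constraints

res = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
mu_star = res.x
print("worst-case lambda(theta):", -res.fun)
print("support of mu*:", x[mu_star > 1e-8])   # only a few points carry mass
```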

Page 25

Sanov's Theorem

State space:  X        Probability measures:  M

Empirical measures:   L_n := (1/n) ∑_{t=0}^{n−1} δ_{X_t},   with L_n ∈ M for n ≥ 1

Notation: for µ a measure and g a function on X,

⟨µ, g⟩ = µ(g) := ∫ g(y) µ(dy),   so that   ⟨L_n, g⟩ = (1/n) ∑_{t=0}^{n−1} g(X_t)

Page 26

Sanov's Theorem

State space:  X        Probability measures:  M

Empirical measures:   L_n := (1/n) ∑_{t=0}^{n−1} δ_{X_t},   with L_n ∈ M for n ≥ 1

Notation: for µ a measure and g a function on X,

⟨µ, g⟩ = µ(g) := ∫ g(y) µ(dy)

Relative entropy:   D(ν‖µ) = ⟨ν, log(dν/dµ)⟩ = ∫ log( dν/dµ (x) ) ν(dx)
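For distributions on a finite alphabet the relative entropy is a one-line computation; a minimal helper (the two example distributions are arbitrary):

```python
import numpy as np

def relative_entropy(nu, mu):
    """D(nu || mu) = <nu, log(dnu/dmu)> for distributions on a common finite alphabet."""
    nu, mu = np.asarray(nu, float), np.asarray(mu, float)
    mask = nu > 0
    if np.any(mu[mask] == 0):
        return np.inf                    # nu not absolutely continuous w.r.t. mu
    return float(np.sum(nu[mask] * np.log(nu[mask] / mu[mask])))

print(relative_entropy([0.5, 0.5], [0.9, 0.1]))   # about 0.51 nats
print(relative_entropy([0.9, 0.1], [0.9, 0.1]))   # 0.0
```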

Page 27

Sanov's Theorem

L_n := (1/n) ∑_{t=0}^{n−1} δ_{X_t}

Law of large numbers:   L_n → µ  as  n → ∞

[Figure: the simplex of probability measures, with empirical measures L_n converging to µ]

Page 28

Sanov's Theorem

K ⊂ M  a convex set of probability measures, with  µ ∉ K

n^{-1} log P{L_n ∈ K} → − ?

[Figure: the set K in the simplex of probability measures, with µ outside K and the empirical measures L_n near µ]

Page 29

Sanov's Theorem

K ⊂ M  a convex set of probability measures,  µ ∉ K

n^{-1} log P{L_n ∈ K} → − inf_{ν∈K} J(ν) = −η

Q_η = {ν : J(ν) < η}

[Figure: the sub-level set Q_η around µ, grown until it reaches K]

Page 30

Sanov's Theorem

Q_η = {ν : J(ν) < η}

i.i.d. source:   J(ν) = D(ν‖µ)

Markov:   J(ν) = inf{ D(ν ⊙ P̌ ‖ ν ⊙ P) : P̌ a transition kernel with ν invariant }

[Figure: the sub-level set Q_η around µ in the simplex]

Page 31

Sanov's Theorem

Example:   K = {ν : ⟨ν, g⟩ ≥ c}

n^{-1} log P{L_n ∈ K} → − inf_{ν∈K} J(ν) = −η = −Λ∗(c)
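A numerical check of this example on a finite alphabet: the constrained infimum of D(ν‖µ) over K = {ν : ⟨ν, g⟩ ≥ c} is attained at an exponentially tilted version of µ, and its value coincides with the Legendre transform Λ∗(c). The base distribution, g, and c below are arbitrary choices:

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar

vals = np.array([0.0, 1.0, 2.0])     # alphabet, with g(x) = x
mu   = np.array([1/3, 1/3, 1/3])     # base distribution
c    = 1.5                           # threshold, above the mean c_0 = 1

Lambda = lambda th: np.log(np.dot(mu, np.exp(th * vals)))

def tilted_mean(th):
    """Mean of g under the tilted measure nu_th(x) proportional to mu(x) exp(th*g(x))."""
    w = mu * np.exp(th * vals)
    w /= w.sum()
    return np.dot(w, vals)

# The infimum of D(nu || mu) over K is attained at the tilt with <nu_th, g> = c
theta_c = brentq(lambda th: tilted_mean(th) - c, 0.0, 50.0)
sanov_value = theta_c * c - Lambda(theta_c)

legendre = -minimize_scalar(lambda th: -(th * c - Lambda(th)),
                            bounds=(0.0, 50.0), method="bounded").fun
print(sanov_value, legendre)         # the two numbers agree: inf_K D(nu||mu) = Lambda*(c)
```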

Page 32

Sanov's Theorem

Example:   K = {ν : ⟨ν, g⟩ ≥ c},  with boundary  {ν : ⟨ν, g⟩ = c}

n^{-1} log P{L_n ∈ K} → − inf_{ν∈K} J(ν) = −η = −Λ∗(c)

Q_η = {ν : J(ν) < η}

[Figure: the half-space K touching the sub-level set Q_η on the hyperplane ⟨ν, g⟩ = c]

Page 33

Outline

Introduction

Relative entropy & Large deviations

Hypothesis testing

Channel capacity

Conclusions


Page 34

Neyman Pearson Hypothesis Testing

Observations X = {X_t : t = 1, 2, . . . , N}, i.i.d. with marginal π_j under H_j, j = 0, 1

Hypothesis test:   φ(X) = 1 if H_1 is declared true, based on the N observations

Error probabilities:

P_{e,0} = P_0{φ(X) = 1},    P_{e,1} = P_1{φ(X) = 0}

N-P criterion:    inf_φ P_{e,1}   subject to   P_{e,0} ≤ e^{−Nη}
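A sketch of the resulting universal test for a finite alphabet, in the form used on the next slides: declare H_0 exactly when the empirical distribution lies in the relative-entropy ball Q_η(π_0). The alphabet, the two marginals, η, and N below are illustrative:

```python
import numpy as np

def empirical_distribution(x, alphabet):
    """L_N: the fraction of observations equal to each symbol."""
    return np.array([(x == a).mean() for a in alphabet])

def relative_entropy(nu, mu):
    mask = nu > 0
    return np.inf if np.any(mu[mask] == 0) else float(np.sum(nu[mask] * np.log(nu[mask] / mu[mask])))

def universal_test(x, alphabet, pi0, eta):
    """phi = 0 (declare H_0) iff L_N is in Q_eta(pi0), i.e. D(L_N || pi0) <= eta."""
    L = empirical_distribution(x, alphabet)
    return 0 if relative_entropy(L, pi0) <= eta else 1

rng = np.random.default_rng(2)
alphabet = np.array([0, 1])
pi0, pi1 = np.array([0.8, 0.2]), np.array([0.5, 0.5])
eta, N = 0.05, 200
x = rng.choice(alphabet, size=N, p=pi1)          # data generated under H_1
print(universal_test(x, alphabet, pi0, eta))     # typically 1: H_1 declared
```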

Page 35

Neyman Pearson Hypothesis Testing

Observations X = {X_t : t = 1, 2, . . . , N}, i.i.d. with marginal π_j under H_j, j = 0, 1

Error probabilities:

P_{e,0} = P_0{φ(X) = 1},    P_{e,1} = P_1{φ(X) = 0}

N-P criterion:    inf_φ P_{e,1}   subject to   P_{e,0} ≤ e^{−Nη}

Solution:   φ(X) = 0  if  L_n ∈ Q_η(π_0)

[Figure: the sub-level set Q_η(π_0) around π_0, with π_1 outside]

Page 36

Neyman Pearson Hypothesis Testing

Solution:   φ(X) = 0  if  L_n ∈ Q_η(π_0)

lim_{N→∞} N^{-1} log P_0{φ_N = 1} = −η

lim_{N→∞} N^{-1} log P_1{φ_N = 0} = −β∗

[Figure: Q_η(π_0) around π_0, with π_1 outside]

Page 37

Neyman Pearson Hypothesis Testing

Solution:   φ(X) = 0  if  L_n ∈ Q_η(π_0)

lim_{N→∞} N^{-1} log P_0{φ_N = 1} = −η

lim_{N→∞} N^{-1} log P_1{φ_N = 0} = −β∗

β∗ = inf{J_1(ν) : J_0(ν) ≤ η}
   = inf{β > 0 : Q_β(π_1) ∩ Q_η(π_0) ≠ ∅}

[Figure: Q_η(π_0) and Q_{β∗}(π_1) meeting at a separating hyperplane]

Page 38

Robust Neyman Pearson Hypothesis Testing

Uncertainty classes defined by moment constraints

π_0 ∈ P_0,    π_1 ∈ P_1

[Figure: the uncertainty classes P_0 and P_1 in the simplex of probability measures]

Page 39

Robust Neyman Pearson Hypothesis Testing

Uncertainty classes defined by moment constraints

π_0 ∈ P_0,    π_1 ∈ P_1

[Figure: the uncertainty classes P_0 and P_1, and the enlarged sub-level set Q_η(P_0) around P_0]

Page 40

Robust Neyman Pearson Hypothesis Testing

Uncertainty classes defined by moment constraints

β∗ = inf_{π_1∈P_1}  inf_{µ∈Q_η(P_0)}  D(µ ‖ π_1)

There exist π_0∗ ∈ P_0, π_1∗ ∈ P_1, and µ∗ solving this pair of infima.

[Figure: P_0, P_1, Q_η(P_0), and the optimizers π_0∗, π_1∗, µ∗]

Page 41

Robust Neyman Pearson Hypothesis Testing

Uncertainty classes defined by moment constraints

Optimizers again discrete

β∗ = inf_{π_1∈P_1}  inf_{µ∈Q_η(P_0)}  D(µ ‖ π_1)

There exist π_0∗ ∈ P_0, π_1∗ ∈ P_1, and µ∗ solving this pair of infima.

[Figure: Q_η(P_0) and Q_{β∗}(P_1) meeting at µ∗, separated by the hyperplane ⟨µ, log ℓ⟩ = ⟨µ∗, log ℓ⟩ determined by the optimal likelihood ratio]

Page 42

Outline

Introduction

Relative entropy & Large deviations

Hypothesis testing

Channel capacity

Conclusions


Page 43

Channel Coding and Sanov's Theorem

Channel kernel

N-dimensional code words

X is i.i.d. with marginal distribution µ

Y is i.i.d. with marginal distribution π

N-dimensional output Y received

X^i,  i = 1, 2, . . . , e^{NR}

P(dy | x) = P{Y_t ∈ dy | X_t = x}

π( · ) = ∫ P( · | x) µ(dx)

Page 44

Channel Coding and Sanov's Theorem

Channel kernel:   P(dy | x) = P{Y_t ∈ dy | X_t = x}

N-dimensional code words:   X^i,  i = 1, 2, . . . , e^{NR}

N-dimensional output Y received

If i is the true codeword, then (X^i, Y) has marginal distribution   µ ⊙ P (dx, dy) = µ(dx) P(dy | x)

Otherwise, independence:   µ ⊗ π (dx, dy) = µ(dx) π(dy)
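For a finite-alphabet channel both joint laws, and the relative entropy between them (which reappears as the mutual information on Page 48), can be tabulated directly; the kernel and input distribution below are illustrative:

```python
import numpy as np

P  = np.array([[0.9, 0.1],
               [0.2, 0.8]])          # P[x, y] = P(y | x), a toy kernel
mu = np.array([0.6, 0.4])            # input distribution

joint   = mu[:, None] * P            # (mu ⊙ P)(x, y) = mu(x) P(y | x)
pi      = joint.sum(axis=0)          # output marginal pi
product = np.outer(mu, pi)           # (mu ⊗ pi)(x, y) = mu(x) pi(y)

# Mutual information I(X;Y) = D(mu ⊙ P || mu ⊗ pi), in nats
mask = joint > 0
print(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))
```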

Page 45

Channel Coding and Sanov's Theorem

Two hypotheses based on the observations:

H_0:   µ ⊗ π (dx, dy) = µ(dx) π(dy)         (independence)

H_1:   µ ⊙ P (dx, dy) = µ(dx) P(dy | x)      (codeword i sent)

Page 46

Channel Coding and Sanov's Theorem

Two hypotheses based on the observations:

H_0:   µ ⊗ π (dx, dy) = µ(dx) π(dy)

H_1:   µ ⊙ P (dx, dy) = µ(dx) P(dy | x)

Empirical distributions for the joint observations (X^i, Y)

Solution: Reject codeword i (φ = 0)  if  L_n ∈ Q_η(π_0)

Page 47

Channel Coding and Sanov's Theorem

Solution:   φ = 0  if  L_n ∈ Q_η(π_0)

lim_{N→∞} N^{-1} log P_0{φ_N = 1} = −η

The error probability e^{−Nη} must be multiplied by the number of codewords, e^{NR}.

For vanishing error,   e^{NR} × e^{−Nη} < 1,   that is,   R < η

[Figure: Q_η(π_0) around µ ⊗ π, with µ ⊙ P outside]

Page 48

Channel Coding and Sanov's Theorem

Solution:   φ = 0  if  L_n ∈ Q_η(π_0)

lim_{N→∞} N^{-1} log P_0{φ_N = 1} = −η

The error probability e^{−Nη} is multiplied by e^{NR}, so reliable decoding requires

R < η_max = max_µ D(µ ⊙ P ‖ µ ⊗ π) = mutual information
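The maximization over µ is not spelled out on the slide; for a discrete memoryless channel one standard way to carry it out is the Blahut-Arimoto iteration, sketched below (the kernel values and iteration count are arbitrary):

```python
import numpy as np

def blahut_arimoto(P, iters=200):
    """Capacity (in nats) and capacity-achieving input law for a kernel P[x, y] = P(y|x)."""
    nx = P.shape[0]
    mu = np.full(nx, 1.0 / nx)
    for _ in range(iters):
        pi = mu @ P                              # output marginal
        q = (mu[:, None] * P) / pi[None, :]      # posterior q(x | y)
        log_r = np.sum(P * np.log(q + 1e-300), axis=1)
        mu = np.exp(log_r - log_r.max())         # update mu(x) ~ exp(sum_y P(y|x) log q(x|y))
        mu /= mu.sum()
    joint = mu[:, None] * P
    pi = joint.sum(axis=0)
    mask = joint > 0
    C = np.sum(joint[mask] * np.log(joint[mask] / np.outer(mu, pi)[mask]))
    return C, mu

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                       # toy kernel
C, mu_star = blahut_arimoto(P)
print(C, mu_star)                                # capacity (nats/symbol), optimal mu
```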

Page 49

Error Exponent

lim_{N→∞} −N^{-1} log P{error} = E(R, µ)

Formula expressed as the solution to a robust hypothesis testing problem:

For a given input distribution µ, denote the product measures on X × Y with first marginal µ by

P_0 = { µ ⊗ ν : ν is a probability measure on Y }

Page 50

Error Exponent

lim_{N→∞} −N^{-1} log P{error} = E(R, µ)

Formula expressed as the solution to a robust hypothesis testing problem:

For a given input distribution µ, denote the product measures on X × Y with first marginal µ by

P_0 = { µ ⊗ ν : ν is a probability measure on Y }

Hypothesis H_0: Code word i not sent; (X^i_j) and (Y_j) independent

Test: Empirical distribution within an entropy ball around P_0

Page 51

Error Exponent

H_0: {(X^i_j, Y_j) : j = 1, . . . , N} has marginal distribution π_0 ∈ P_0

H_1: {(X^i_j, Y_j) : j = 1, . . . , N} has marginal distribution π_1 := µ ⊙ P

Entropy neighborhood of P_0:    Q^+_R(P_0) = { γ : min_ν D(γ ‖ µ ⊗ ν) ≤ R }

Entropy neighborhood of π_1:    Q^+_β(µ ⊙ P) = { γ : D(γ ‖ µ ⊙ P) ≤ β }

Page 52

Error Exponent

lim_{N→∞} −N^{-1} log P{error} = E(R, µ)

= infimum over β such that these entropy neighborhoods meet

[Figure: Q^+_β(µ ⊙ P) around µ ⊙ P and Q^+_R(P_0) around P_0, meeting when β = E(R, µ)]

Page 53

Error Exponent

lim_{N→∞} −N^{-1} log P{error} = E(R, µ)

E(R, µ) = inf{ β : Q^+_β(µ ⊙ P) ∩ Q^+_R(P_0) ≠ ∅ }

E(R) = random coding exponent = supremum over µ of E(R, µ)

[Figure: the entropy neighborhoods Q^+_β(µ ⊙ P) and Q^+_R(P_0) meeting]
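A numerical sketch of the random coding exponent for a toy channel. Instead of the entropy-neighborhood description on the slide, it uses Gallager's classical formula for the exponent of an i.i.d. random codebook, E(R, µ) = max over 0 ≤ ρ ≤ 1 of [E_0(ρ, µ) − ρR]; taking the supremum over µ then gives the random coding exponent E_r(R). The binary symmetric channel and the rates below are illustrative:

```python
import numpy as np

def E0(rho, mu, P):
    """Gallager's E_0(rho, mu) for a discrete memoryless channel P[x, y] = P(y|x), in nats."""
    inner = np.sum(mu[:, None] * P ** (1.0 / (1.0 + rho)), axis=0)
    return -np.log(np.sum(inner ** (1.0 + rho)))

def error_exponent(R, mu, P, grid=np.linspace(0.0, 1.0, 201)):
    """E(R, mu) = max over 0 <= rho <= 1 of [E_0(rho, mu) - rho * R]."""
    return max(E0(rho, mu, P) - rho * R for rho in grid)

eps = 0.1
P = np.array([[1 - eps, eps],
              [eps, 1 - eps]])       # binary symmetric channel
mu = np.array([0.5, 0.5])            # uniform input (optimal for the BSC by symmetry)
for R in [0.05, 0.2, 0.35]:          # rates in nats/symbol
    print(R, error_exponent(R, mu, P))
```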

Page 54

Outline

Introduction

Relative entropy & Large deviations

Hypothesis testing

Channel capacity

Conclusions


Page 55

Summary

Large Deviations is the grand unifying principle of Information Theory

Page 56

Summary

Standard coding is based on AWGN models

This may be unrealistic for wireless channels with fading

Discrete distributions arise in coding and in other applications involving optimization over M

Extremal distributions arise in worst-case models


Page 57

What's Next?

Part II: Channel models

Convex optimization and channel coding

Cutting plane algorithm

Part III: Worst-case models

Extremal distributions