
Analytic Combinatorics (of P. Flajolet)

Profile of Digital Trees

Wojciech Szpankowski∗

Department of Computer Science

Purdue University, W. Lafayette, IN

U.S.A.

August 14, 2012

Polish Combinatorial Conference, Bedlewo, 2012

Dedicated to PHILIPPE FLAJOLET

∗Joint work with H-K. Hwang, M. Drmota, P. Flajolet, P. Nicodeme, G. Park, and M. Ward.

Outline of the Presentation

1. Tries and Suffix Trees

2. Profiles of Tries

• Trie Parameters

• Main Results

• Sketch of Proofs

• Consequences (height, fill-up, shortest path)

3. Profile of Digital Search Trees

4. Applications: Error Resilient Lempel-Ziv’77 (Suffix Trees)

5. Analysis of Algorithms and Analytic Information Theory

Tries and Suffix Trees

[Figure: a binary trie storing the keys 00, 0100, 01010, 01011, 011, 1010, and 1011; edges are labeled 0 (left) and 1 (right).]

A trie, or prefix tree, is an ordered tree data structure that stores keys usually represented by strings.

Tries were introduced by de la Briandais (1959) and Fredkin (1960), who coined the name: "trie" is derived from retrieval.

A suffix tree is a trie built from the suffixes of one string.

Other digital trees are PATRICIA tries and digital search trees (DST).

Typical Tries: In this talk we mostly discuss random tries (and DST) built from $n$ independent sequences generated by a binary memoryless source, with $p$ denoting the probability of generating a "0" ($q = 1 - p \le p$).

Digital Trees

Digital Trees:

Tries, PATRICIA Tries, Digital Search Trees:

Figure 1: A trie, PATRICIA trie, and a digital search tree built from: x1 = 11100..., x2 = 10111..., x3 = 00110..., and x4 = 00001....

Usefulness of Tries

Tries, suffix trees, and DST are widely used in diverse applications:

• automatically correcting words in texts; Kukich (1992);

• taxonomies of regular languages; Watson (1995);

• event history in datarace detection for multi-threaded object-oriented

programs; Choi et al. (2002);

• internet IP addresses lookup; Nilsson and Tikkanen (2002);

• data compression, Lempel-Ziv, . . . ; W.S. (2001);

• distributed hash tables, Malkhi et al. (2002) and Adler et al. (2003).

Fundamental, prototype data structures:

• have a large number of variations and extensions (Patricia, DST, bucket

digital search trees, k-d tries, quadtries, LC-tries, multiple-tries, etc.);

• closely connected to several splitting procedures using coin-flipping:

collision resolution in multi-access (or broadcast) communication

models, loser selection or leader election, etc.

• have direct combinatorial interpretations in terms of words, urn models,

etc.

Outline Update

1. Tries and Suffix Trees

2. Usefulness of Tries

3. Profiles of Tries

• Trie Parameters

• Main Results

• Sketch of Proofs

• Consequences

4. Applications


G. Park, H-K. Hwang, P. Nicodeme, W. Szpankowski,

“Profiles of Tries,”

SIAM J. Computing, 38, 5, 1821-1880, 2009.

External and Internal Profiles

[Figure: the same binary trie as before, annotated with its external and internal profiles. For this tree: $B_n^0 = 0,\ I_n^0 = 1$; $B_n^1 = 0,\ I_n^1 = 2$; $B_n^2 = 1,\ I_n^2 = 2$; $B_n^3 = 1,\ I_n^3 = 2$; $B_n^4 = 3,\ I_n^4 = 1$; $B_n^5 = 2,\ I_n^5 = 0$.]

External profile and internal profile:

$B_n^k$ = number of external nodes at distance $k$ from the root;

$I_n^k$ = number of internal nodes at distance $k$ from the root.
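To make the definitions concrete, here is a small simulation sketch (my own illustration, not from the talk): it builds a trie from $n$ independent keys drawn from a memoryless source with $P(\text{"0"}) = p$ and tabulates the external and internal profiles level by level. The conventions (keys stored only in leaves, a leaf counts as an external node) are my reading of the definitions above.

import random
from collections import defaultdict

class Node:
    __slots__ = ("children", "key")
    def __init__(self):
        self.children = {}   # bit -> Node
        self.key = None      # set only in external (leaf) nodes

def insert(root, bits):
    """Insert a bit string into the trie, splitting leaves as needed."""
    node, depth = root, 0
    while True:
        if node.key is not None:                    # leaf collision: push old key down
            old = node.key
            node.key = None
            node.children[old[depth]] = Node()
            node.children[old[depth]].key = old
        if not node.children:                       # empty node becomes the new leaf
            node.key = bits
            return
        node.children.setdefault(bits[depth], Node())
        node, depth = node.children[bits[depth]], depth + 1

def profiles(root):
    """Return dicts level -> B_n^k (external) and level -> I_n^k (internal)."""
    B, I = defaultdict(int), defaultdict(int)
    stack = [(root, 0)]
    while stack:
        node, k = stack.pop()
        if node.key is not None or not node.children:
            B[k] += 1
        else:
            I[k] += 1
            for child in node.children.values():
                stack.append((child, k + 1))
    return B, I

def random_key(p, length=64):
    return "".join("0" if random.random() < p else "1" for _ in range(length))

root = Node()
n, p = 1000, 0.75
for _ in range(n):
    insert(root, random_key(p))
B, I = profiles(root)
print({k: B[k] for k in sorted(B)})   # external profile of one random trie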

Why Study Profiles?

• Fine, informative shape characteristic;

• Related to path length, depth, height, shortest path, width, etc.;

• Breadth-first search;

• Compression algorithms.

• Mathematically challenging, phenomenally interesting!

Example: Parameters such as the height $H_n$, the shortest path $s_n$, the fill-up level $F_n$, and the depth $D_n$ can be studied through the profiles, since

$$H_n = \max\{k : B_n^k > 0\}, \qquad s_n = \min\{k : B_n^k > 0\},$$

$$F_n = \max\{k : I_n^k = 2^k\}, \qquad \Pr(D_n = k) = \frac{E[B_n^k]}{n}.$$

[Figure: a small trie on keys $x_1, \ldots, x_5$ with the levels $F_n$, $s_n$, $D_n$, and $H_n$ marked.]

Recurrence for the Profiles

External Profile $B_n^k$:

Define the probability generating function

$$B_n^k(u) = E\big[u^{B_n^k}\big] = \sum_{\ell \ge 0} P(B_n^k = \ell)\, u^{\ell}.$$

Then

$$B_n^k(u) = \sum_{i=0}^{n} \binom{n}{i} p^i q^{n-i}\, B_i^{k-1}(u)\, B_{n-i}^{k-1}(u),$$

with $B_n^0(u) = 1$ for $n \ne 1$ and $B_1^0(u) = u$.

The internal profile probability generating function $I_n^k(u) = E\big[u^{I_n^k}\big]$ satisfies the same recurrence with $I_n^0(u) = u$ for $n > 1$ and $I_0^0(u) = I_1^0(u) = 1$.

Average External Profile:

$$E[B_n^k] = \sum_{i=0}^{n} \binom{n}{i} p^i q^{n-i}\big(E[B_i^{k-1}] + E[B_{n-i}^{k-1}]\big), \qquad n \ge 2,\ k \ge 1,$$

under some initial conditions (e.g., $E[B_0^k] = 0$ for all $k$).
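The recurrence can be evaluated exactly by dynamic programming. Below is a short sketch (mine, not from the paper); the initial conditions for $n \le 1$ are my reading of the slides' convention that a single key sits in the root as an external node, i.e., $E[B_1^0] = 1$ and $E[B_n^k] = 0$ for the other small cases.

from math import comb

def expected_external_profile(n_max, k_max, p):
    q = 1.0 - p
    # E[k][n] = E[B_n^k]
    E = [[0.0] * (n_max + 1) for _ in range(k_max + 1)]
    E[0][1] = 1.0                       # one key sits in the root as an external node
    for k in range(1, k_max + 1):
        for n in range(2, n_max + 1):
            E[k][n] = sum(
                comb(n, i) * p**i * q**(n - i) * (E[k - 1][i] + E[k - 1][n - i])
                for i in range(n + 1)
            )
    return E

E = expected_external_profile(n_max=200, k_max=25, p=0.75)
print([round(E[k][200], 3) for k in range(10)])   # expected profile for n = 200 keys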

Main Results

Notation: $r = p/q = p/(1-p) > 1$, and $\alpha := \alpha_{n,k} = k/\log n$. Define

$$\alpha_1 := \frac{1}{\log(1/q)}, \qquad \alpha_2 := \frac{p^2 + q^2}{p^2\log(1/p) + q^2\log(1/q)}, \qquad \alpha_3 := \frac{2}{\log(1/(p^2 + q^2))}.$$

[Figure: $\alpha_1$, $\alpha_2$, and $\alpha_3$ plotted as functions of $p \in (0.5, 1)$.]
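A quick numeric check of these thresholds (my own sketch; natural logarithms assumed, matching $\alpha = k/\log n$ above):

from math import log

def thresholds(p):
    q = 1.0 - p
    a1 = 1.0 / log(1.0 / q)
    a2 = (p**2 + q**2) / (p**2 * log(1.0 / p) + q**2 * log(1.0 / q))
    a3 = 2.0 / log(1.0 / (p**2 + q**2))
    return a1, a2, a3

print(thresholds(0.75))   # e.g. p = 0.75 gives alpha_1 < alpha_2 < alpha_3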

1: Exponential Growth ($0 < \alpha < \alpha_1$):

Let $1 \le k \le \frac{1}{\log q^{-1}}\big(\log n - \log\log\log n + \log(r-1) - \varepsilon\big)$:

$$E[B_n^k] = n q^k (1-q^k)^{n-1}\Big(1 + O\big((\log n)^{-\delta}\big)\Big) = O\big(2^{-n^{\nu}}\big).$$

2: Logarithmic Growth ($0 < \alpha < \alpha_1$):

Let $1 \le k \le \frac{1}{\log q^{-1}}\big(\log n - \log\log\log n + m\log(r-1) - \varepsilon\big)$:

$$E[B_n^k] = O\big(\log\log n \cdot \log^{m-\beta} n\big),$$

where $m$ and $\beta$ are constants ($\beta$ smaller or greater than $m$).

Phase Transitions

3: Polynomial Growth ($\alpha_1\log n < k < \alpha_2\log n$, i.e., $\alpha_1 < \alpha < \alpha_2$):

$$E[B_n^k] \sim G_1(\log n)\, \frac{p^{\rho} q^{\rho}\,(p^{-\rho} + q^{-\rho})}{\sqrt{2\pi\alpha\log(p/q)}} \cdot \frac{n^{\upsilon_1}}{\sqrt{\log n}},$$

where $G_1(x)$ is a periodic function and

$$\upsilon_1 = -\rho + \alpha\log(p^{-\rho} + q^{-\rho}), \qquad \rho = -\frac{1}{\log(p/q)}\log\left(\frac{-1 - \alpha\log q}{1 + \alpha\log p}\right).$$

Figure 2: The fluctuating part of the periodic function $G_1(x)$ for $p = 0.55, 0.65, \ldots, 0.95$; the amplitude of the fluctuations is tiny (of order $10^{-23}$ for $p = 0.55$ and of order $10^{-3}$ for $p = 0.75$).

4: Polynomial Growth/Decay ($\alpha_2\log n < k$, i.e., $\alpha_2 < \alpha$):

$$E[B_n^k] = \frac{2pq}{p^2 + q^2}\, n^{\nu_2} + O(n^{\nu_3}),$$

where $\nu_2 = 2 + \alpha\log(p^2 + q^2)$ for some $\nu_3 < \nu_2$.
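The exponents above are easy to evaluate numerically. The following sketch (mine; natural logarithms assumed) computes $\rho(\alpha)$, $\upsilon_1(\alpha)$, and $\nu_2(\alpha)$ from the closed forms just given:

from math import log

def exponents(p, alpha):
    q = 1.0 - p
    rho = -log((-1.0 - alpha * log(q)) / (1.0 + alpha * log(p))) / log(p / q)
    upsilon1 = -rho + alpha * log(p**-rho + q**-rho)
    nu2 = 2.0 + alpha * log(p**2 + q**2)        # relevant for alpha > alpha_2
    return rho, upsilon1, nu2

p = 0.75
for alpha in (1.0, 1.5, 2.0):                   # values between alpha_1 ~ 0.72 and alpha_2 ~ 2.5
    print(alpha, exponents(p, alpha))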

External Shapes

[Figure: typical external shapes of random tries for $p = 0.5$ (where $\alpha_1 = \alpha_2 = 1/\log 2$) and for $p = 0.75$, annotated with the fill-up level, the typical depth $\log n/(p\log(1/p) + q\log(1/q))$, and the height $\frac{2}{\log(1/(p^2+q^2))}\log n + O(1)$; in the symmetric case the marked levels are $\log_2 n + O(1)$ and $2\log_2 n + O(1)$.]

Average Internal Profile

1: Almost Full Tree ($k < \alpha_1\log n$):

$$E[I_n^k] = 2^k - E[B_n^k](1 + o(1)).$$

2: Phase Transition I ($\alpha_1\log n < k < \alpha_0\log n$, where $\alpha_0 = \frac{2}{\log(1/p) + \log(1/q)}$):

$$E[I_n^k] = 2^k - G_2(\log n)\, E[B_n^k](1 + o(1)),$$

where $G_2(x)$ is a periodic function.

3: Phase Transition II ($\alpha_0\log n < k < \alpha_2\log n$):

$$E[I_n^k] = G_2(\log n)\, E[B_n^k](1 + o(1)),$$

where $G_2(x)$ is a periodic function.

4: Polynomial Growth/Decay ($\alpha_2\log n < k$):

$$E[I_n^k] = \frac{1}{2}\, n^{\nu_2}(1 + o(1)),$$

where $\nu_2 = 2 + \alpha\log(p^2 + q^2)$, as for the external profile.

Variance and Limiting Distributions of the External Profile

Variance:

1: $k < \alpha_1\log n$: $\quad V[B_n^k] \sim E[B_n^k]$.

2: $\alpha_1\log n < k < \alpha_2\log n$: $\quad V[B_n^k] \sim G_3(\log n)\, E[B_n^k]$, where $G_3(\log n)$ is a periodic function.

3: $\alpha_2\log n < k$: $\quad V[B_n^k] \sim 2\, E[B_n^k]$.

Limiting Distributions:

Central Limit Theorem: For $\alpha_1\log n < k < \alpha_3\log n$:

$$\frac{B_n^k - E[B_n^k]}{\sqrt{V[B_n^k]}} \to N(0, 1),$$

where $N(0,1)$ is the standard normal distribution.

Poisson Distribution: For $\alpha_3\log n < k$:

$$P(B_n^k = 2m) = \frac{\lambda_0^m}{m!}\, e^{-\lambda_0} + o(1), \qquad P(B_n^k = 2m+1) = o(1),$$

where $\lambda_0 := pq\, n^2 (p^2 + q^2)^{k-1}$.

Consequences

Height: For large $n$ (cf. Flajolet, 1980; Pittel, 1985; W.S., 1988; Devroye, 1992)

$$H_n = \frac{2}{\log(p^2 + q^2)^{-1}}\,\log n = \alpha_3\log n := k_H \quad \text{(whp)}.$$

Upper Bound: with $k = \lceil(1+\epsilon)k_H\rceil$,

$$P\big(H_n > (1+\epsilon)k_H\big) \le P(B_n^{k} \ge 1) \le E[B_n^{k}] \to 0.$$

Lower Bound:

$$P\big(H_n < (1-\epsilon)k_H\big) \le P\big(B_n^{\lceil(1-\epsilon)k_H\rceil} = 0\big) \le \frac{V\big[B_n^{\lceil(1-\epsilon)k_H\rceil}\big]}{\big(E\big[B_n^{\lceil(1-\epsilon)k_H\rceil}\big]\big)^2} = O\left(\frac{1}{E\big[B_n^{\lceil(1-\epsilon)k_H\rceil}\big]}\right) \to 0.$$

Define: $k_S := \big\lfloor \frac{1}{\log q^{-1}}\big(\log n - \log\log\log n + \log(e\log r)\big)\big\rfloor$.

Shortest Path: For large $n$ (cf. Knessl and W.S., 2005)

$$P(s_n = k_S \ \text{or}\ s_n = k_S + 1) \to 1.$$

Fill-up: For large $n$ (cf. Pittel, 1986; Devroye, 1992; Knessl and W.S., 2005)

$$P(F_n = k_S - 1 \ \text{or}\ F_n = k_S) \to 1.$$
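As a numeric illustration (my own, based directly on the formulas above), the predicted height level $k_H$ and the shortest-path/fill-up level $k_S$ can be evaluated as follows (natural logarithms assumed):

from math import log, floor, e

def height_and_fillup_levels(n, p):
    q = 1.0 - p
    r = p / q
    k_H = 2.0 / log(1.0 / (p**2 + q**2)) * log(n)
    k_S = floor((log(n) - log(log(log(n))) + log(e * log(r))) / log(1.0 / q))
    return k_H, k_S

print(height_and_fillup_levels(10**6, 0.75))   # height ~ k_H; fill-up in {k_S - 1, k_S}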


Sketch of the Proof

1. Recurrence: $E[B_n^k] = \sum_{i=0}^{n}\binom{n}{i} p^i q^{n-i}\big(E[B_i^{k-1}] + E[B_{n-i}^{k-1}]\big)$, $n \ge 2$, $k \ge 1$.

2. Poisson Transform $E_k(z) = \sum_{n\ge 0} E[B_n^k]\frac{z^n}{n!}e^{-z}$:

$$E_k(z) = E_{k-1}(zp) + E_{k-1}(zq), \qquad k \ge 2.$$

3. Mellin Transform $E_k^*(s) := \int_0^{\infty} z^{s-1}E_k(z)\,dz = (p^{-s} + q^{-s})\,E_{k-1}^*(s)$:

$$E_k^*(s) = (p^{-s} + q^{-s})^{k-1}\cdot s\cdot(p^{-s} + q^{-s} - 1)\,\Gamma(s)$$

for $\Re(s) \in (-2, \infty)$, where $\Gamma(s)$ is the Euler gamma function.

4. Inverse Mellin Transform $E_k(z) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} z^{-s} E_k^*(s)\,ds$:

$$E_k(z) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} s\,(p^{-s} + q^{-s} - 1)\,\Gamma(s)\, z^{-s}(p^{-s} + q^{-s})^{k-1}\,ds,$$

evaluated through the saddle point method (see the next slide for more).

5. Depoissonization: From the Poisson transform $E_k(z)$ back to $E[B_n^k]$.


Saddle Point Method: Phase Transitions

By depoissonization we have $E[B_n^k] \sim E_k(n)$, where recall

$$E_k(n) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} g(s)\,\Gamma(s+1)\, n^{-s}(p^{-s} + q^{-s})^{k}\,ds = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} g(s)\,\Gamma(s+1)\,\exp\big(h(s)\log n\big)\,ds, \qquad k = \alpha\log n,$$

where $g(s) = p^{-s} + q^{-s} - 1$ (note $g(0) = g(-1) = 0$). For $k = \alpha\log n$, the factor $n^{-s}(p^{-s} + q^{-s})^k = \exp(h(s)\log n)$ is large.

Saddle Point Method:

Let $F(z)$ be analytic with nonnegative coefficients and "fast growth". Then

$$f_n = [z^n]F(z) = \frac{1}{2\pi i}\oint_{\mathcal{C}} F(z)\,\frac{dz}{z^{n+1}} = \frac{1}{2\pi i}\oint_{\mathcal{C}} e^{H(z)}\,dz,$$

where $H(z) := \log F(z) - (n+1)\log z$.

Define the saddle point $z_0$ by $H'(z_0) = 0$, so that

$$H(z) = H(z_0) + \tfrac{1}{2}(z - z_0)^2 H''(z_0) + O\big(H'''(z_0)(z - z_0)^3\big).$$

Then

$$[z^n]F(z) = \frac{\exp(H(z_0))}{\sqrt{2\pi|H''(z_0)|}}\left(1 + O\left(\frac{H'''(z_0)}{(H''(z_0))^{3/2}}\right)\right).$$

Infinity of Saddle Points

The saddle point equation is $h'(s) = 0$, where

$$h(s) = \alpha\log(p^{-s} + q^{-s}) - s.$$

It has a unique real root

$$\rho = \frac{-1}{\log r}\log\left(\frac{\alpha\log q^{-1} - 1}{1 - \alpha\log p^{-1}}\right), \qquad \frac{1}{\log q^{-1}} < \alpha < \frac{1}{\log p^{-1}}.$$

Note that $\big|p^{-\rho - i t_j} + q^{-\rho - i t_j}\big| = p^{-\rho} + q^{-\rho}$ for $t_j = 2\pi j/\log r$, $j \in \mathbb{Z}$.
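A quick numerical sanity check (mine, not from the slides): the closed form for $\rho$ above indeed satisfies the saddle point equation $h'(\rho) = 0$.

from math import log

def rho_closed_form(p, alpha):
    q = 1.0 - p
    r = p / q
    return -log((alpha * log(1 / q) - 1) / (1 - alpha * log(1 / p))) / log(r)

def h_prime(s, p, alpha):
    q = 1.0 - p
    num = -(p**-s) * log(p) - (q**-s) * log(q)
    return alpha * num / (p**-s + q**-s) - 1.0

p, alpha = 0.75, 1.5
rho = rho_closed_form(p, alpha)
print(rho, h_prime(rho, p, alpha))   # h'(rho) should be ~ 0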

[Figure: the saddle points $\rho \pm i\,\frac{2\pi j}{\log r}$, $j = 1, 2, \ldots$, lie on the vertical line $\Re(s) = \rho$ in the complex plane.]

Phase Transitions:

1. There are infinitely many saddle points $\rho + i t_j$.

2. $\rho \to \infty$ as $\alpha \downarrow 1/\log q^{-1} = \alpha_1$.

3. $\rho \to -\infty$ as $\alpha \uparrow 1/\log p^{-1}$.

4. Saddle points coalesce with poles of the $\Gamma(s+1)$ function at $s = -2, -3, \ldots$. The pole $s = -2$ leads to $\alpha_2$.

Analytic Depoissonization

Theorem 1 (Jacquet and W.S., 1998). Let $G(z) = \sum_{n\ge 0} g_n\frac{z^n}{n!}e^{-z}$ be the Poisson transform of $g_n$. In a linear cone $S_\theta$ assume two conditions:

(I) For $z \in S_\theta$ and some reals $B, R > 0$, $\nu$:

$$|z| > R \ \Rightarrow\ |G(z)| \le B|z|^{\nu}\,\Psi(|z|),$$

where $\Psi(x)$ is a slowly varying function.

(O) For $z \notin S_\theta$ and $A, \alpha < 1$:

$$|z| > R \ \Rightarrow\ |G(z)e^{z}| \le A\exp(\alpha|z|).$$

Then

$$g_n = G(n) + O\big(n^{\nu-1}\Psi(n)\big).$$

Using the depoissonization theorem, we get $E[B_n^k] = E_k(n) + O\big(n^{\nu-1}\sqrt{\log n}\big)$.
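As a small numerical illustration of depoissonization (my own sketch, reusing the expected_external_profile routine from the earlier trie sketch, which is assumed to be in scope), one can compare $g_n = E[B_n^k]$ with its Poisson transform evaluated at $z = n$; the Poisson sum is truncated at $2n$, where the Poisson($n$) tail is negligible.

from math import exp, lgamma, log

def poisson_transform_at(g, z):
    # sum_m g[m] * z^m / m! * e^{-z}, using log-weights for numerical stability
    return sum(gm * exp(m * log(z) - lgamma(m + 1) - z) for m, gm in enumerate(g) if gm)

n, k, p = 100, 5, 0.75
E = expected_external_profile(n_max=2 * n, k_max=k, p=p)   # from the earlier sketch
g = [E[k][m] for m in range(2 * n + 1)]                    # g_m = E[B_m^k]
print(g[n], poisson_transform_at(g, n))                    # close for large n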

Outline Update

1. Tries and Suffix Trees

2. Profiles of Tries

3. Profile of Digital Search Trees

4. Applications: Error Resilient Lempel-Ziv’77

M. Drmota and W. Szpankowski,

“The Expected Profile of Digital Search Trees”,

J. Combin. Theory, Ser. A, 118, 1939-1965, 2011.

Digital Search Trees

Figure 3: A digital search tree built on eight strings s1, . . . , s8 (i.e.,

s1 = 0 . . ., s2 = 1 . . ., s3 = 01 . . ., s4 = 11 . . ., etc.) with internal and

external (squares) nodes, and its profiles.


Profile of Digital Search Trees

1. Recurrence: $E[B_{n+1}^{k+1}] = \sum_{i=0}^{n}\binom{n}{i} p^i q^{n-i}\big(E[B_i^{k}] + E[B_{n-i}^{k}]\big)$, $n \ge 2$, $k \ge 0$.

2. Poisson Transform $E_k(z) = \sum_{n\ge 0} E[B_n^k]\frac{z^n}{n!}e^{-z}$:

$$E'_{k+1}(z) + E_{k+1}(z) = E_k(zp) + E_k(zq), \qquad k \ge 2.$$

3. Mellin Transform $E_k^*(s) := \int_0^{\infty} z^{s-1}E_k(z)\,dz = -\Gamma(s)\,F_k^*(s)$:

$$F_{k+1}^*(s) - F_{k+1}^*(s-1) = (p^{-s} + q^{-s})\cdot F_k^*(s)$$

for $\Re(s) \in (-k-1, 0)$, and $F_0^*(s) = 1$.

4. The Power Series $f(x, s) = \sum_{k\ge 0} F_k^*(s)\,x^k$ becomes

$$f(x, s) = \frac{g(x, s)}{g(x, 0)}, \qquad g(x, s) = \frac{h(x, s)}{1 - x(p^{-s} + q^{-s})},$$

where $g(x, s) = 1 + x\sum_{j\ge 0} g(x, s-j)\big(p^{-s+j} + q^{-s+j}\big)$, and asymptotically

$$F_k^*(s) \sim A(s)\,(p^{-s} + q^{-s})^{k},$$

where $A(s)$ is analytic with $A(-r) = 0$ for $r = 0, 1, 2, \ldots$.

5. Inverse Mellin Transform by the saddle point method and depoissonization.

6. Asymptotically, the DST profile behaves similarly to the profile of tries.

Main Results

Theorem 2. Let $E[I_n^k]$ denote the expected internal profile in (asymmetric) digital search trees, $0 < p < q = 1 - p < 1$. The following assertions hold:

1. $\alpha_1 + \varepsilon = \frac{1}{\log\frac{1}{p}} + \varepsilon \le \frac{k}{\log n} \le \alpha_0 - \varepsilon$:

$$E[I_n^k] = 2^k - G_3\big(\log(p^k n)\big)\,\frac{(p^{-\rho_{n,k}} + q^{-\rho_{n,k}})^{k}\, n^{-\rho_{n,k}}}{\sqrt{2\pi\beta(\rho_{n,k})\,k}}\,\big(1 + O(k^{-1/2})\big),$$

where $G_3(x)$ is a non-zero periodic function with period 1.

2. $k = \alpha_0\big(\log n + \xi\sqrt{\alpha_0\beta(0)\log n}\big)$, where $\xi = o\big((\log n)^{1/6}\big)$; then

$$E[I_n^k] = 2^k\,\Phi(-\xi)\left(1 + O\left(\frac{1 + |\xi|^3}{\sqrt{\log n}}\right)\right),$$

where $\Phi$ is the standard normal distribution function.

3. $\alpha_0 + \varepsilon \le \frac{k}{\log n} \le \frac{1}{\log\frac{1}{q}} - \varepsilon = \alpha_4 - \varepsilon$:

$$E[I_n^k] = G_3\big(\log(p^k n)\big)\,\frac{(p^{-\rho_{n,k}} + q^{-\rho_{n,k}})^{k}\, n^{-\rho_{n,k}}}{\sqrt{2\pi\beta(\rho_{n,k})\,k}}\,\big(1 + O(k^{-1/2})\big),$$

where $\rho_{n,k} = \rho(k/\log n)$.

Thank You!

Outline Update

1. Tries and Suffix Trees

2. Usefulness of Tries

3. Profiles of Tries

4. Applications: Error Resilient Lempel-Ziv’77 (Suffix Trees)

S. Lonardi, M. Ward, and W. Szpankowski,

“Error Resilient LZ’77 Compression: Algorithms, Analysis, and Experiments”,

IEEE Trans. Information Theory, 53, 1799-1813, 2007.

Error Resilient LZ’77 Scheme

1. The Lempel-Ziv'77 scheme works on-line: it compresses phrases by replacing the longest prefix with a (pointer, length) reference to its copy.

2. Castelli and Lastras proved in 2004 that a single error in LZ'77 corrupts $O(n^{2/3})$ phrases, thus about $O(n^{2/3}\log n)$ symbols, where $n$ is the size of the database.

3. There are multiple copies of the longest prefix; we denote their number by $M_n$ for a database of length $n$.

4. By a judicious choice of pointers in the LZ'77 scheme, we can recover $\lfloor\log_2 M_n\rfloor$ bits without losing a bit in compression. Parity bits recovered from the multiple copies are used for Reed-Solomon channel coding.

[Figure: the history and the current position in the LZ'77 window.]
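The pointer-selection idea can be sketched in a few lines (my own toy illustration; the function names and framing are mine, not the paper's): with $M_n$ copies of the longest match available, the choice of which copy to reference encodes $\lfloor\log_2 M_n\rfloor$ extra bits, and the decoder recovers them by re-locating all copies of the decoded match.

from math import floor, log2

def choose_pointer(match_positions, hidden_bits):
    """Embed up to floor(log2(M_n)) hidden bits (e.g., RS parity) in the pointer choice."""
    m = len(match_positions)                     # this is M_n
    capacity = floor(log2(m)) if m > 1 else 0    # how many bits we can hide
    index = int(hidden_bits[:capacity] or "0", 2)
    return match_positions[index], capacity      # the chosen copy, and bits consumed

def recover_bits(chosen_position, match_positions):
    """The decoder re-finds all copies, so the index of the chosen one reveals the bits."""
    m = len(match_positions)
    capacity = floor(log2(m)) if m > 1 else 0
    index = match_positions.index(chosen_position)
    return format(index, "0{}b".format(capacity)) if capacity else ""

positions = [3, 17, 42, 58, 91]                  # M_n = 5 copies -> 2 hidden bits
ptr, used = choose_pointer(positions, "10")
print(ptr, recover_bits(ptr, positions))         # -> 42, "10"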

Analysis of Mn Via Suffix Trees

Why does LZRS’77 work so well?

Performance of LZRS’77 depends on Mn. How does Mn typically behave?

Build a suffix tree from the first $n$ suffixes of the database $X$ (i.e., $S_1 = X_1^{\infty}$, $S_2 = X_2^{\infty}$, ..., $S_n = X_n^{\infty}$). Then insert the $(n+1)$st suffix, $S_{n+1} = X_{n+1}^{\infty}$.

The depth of insertion of $S_{n+1}$ is the $(n+1)$st phrase length, and $M_n$ is the size of the subtree that starts at the insertion point of the $(n+1)$st suffix.

Figure 6: $M_4\,(= 2)$ is the size of the subtree at the insertion point of $S_5$.

Analysis of Mn for Independent Tries

1. Consider digital tries built over $n$ independent strings ($p = 1 - q$ is the probability of generating a "1"). The average $E[M_n^I]$ and the probability generating function satisfy the following recurrences:

$$E[M_n^I] = p^n\big(q\,n + p\,E[M_n^I]\big) + q^n\big(p\,n + q\,E[M_n^I]\big) + \sum_{k=1}^{n-1}\binom{n}{k} p^k q^{n-k}\big(p\,E[M_k^I] + q\,E[M_{n-k}^I]\big),$$

$$E\big[u^{M_n^I}\big] = p^n\big(q\,u^n + p\,E[u^{M_n^I}]\big) + q^n\big(p\,u^n + q\,E[u^{M_n^I}]\big) + \sum_{k=1}^{n-1}\binom{n}{k} p^k q^{n-k}\big(p\,E[u^{M_k^I}] + q\,E[u^{M_{n-k}^I}]\big).$$

[Figure: recursive decomposition of $M_n^I$ into $M_k^I$ or $M_{n-k}^I$ depending on the first bit.]

• (Analytic) Poissonization $\big(W(z) = \sum_{n\ge 0} E[M_n^I]\frac{z^n}{n!}e^{-z}\big)$:

$$W(z) = qpz\,e^{qz} + pqz\,e^{pz} + p\,W(pz) + q\,W(qz).$$

• Mellin Transform $\big(f^*(s) = \int_0^{\infty} f(x)\,x^{s-1}\,dx\big)$:

$$W^*(s) = \frac{\Gamma(s+1)\,(p\,q^{-s} + q\,p^{-s})}{1 - p^{-s+1} - q^{-s+1}}.$$

• Inverse Mellin Transform: $W(z) = 1/h + \text{fluctuations}$.

• (Analytic) Depoissonization: $E[M_n^I] = 1/h + \text{fluctuations}$.
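A Monte Carlo sketch (mine, under the independent-strings model described above, with the convention that $M_n^I = n$ when the fresh string branches off at the root) comparing the simulated mean of $M_n^I$ with the leading term $1/h$:

import random
from math import log

def longest_match_count(strings, new, depth=0):
    """Count strings sharing the longest prefix they have in common with `new`."""
    if len(strings) <= 1 or depth >= len(new):
        return len(strings) if strings else 0
    same = [s for s in strings if s[depth] == new[depth]]
    return longest_match_count(same, new, depth + 1) if same else len(strings)

def estimate(n, p, trials=500, length=32):
    bit = lambda: "1" if random.random() < p else "0"
    key = lambda: "".join(bit() for _ in range(length))
    total = 0
    for _ in range(trials):
        total += longest_match_count([key() for _ in range(n)], key())
    return total / trials

p, q = 0.6, 0.4
h = -p * log(p) - q * log(q)
print(estimate(200, p), 1.0 / h)   # the two should be reasonably close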

Analysis of Mn for Dependent Strings

2. Suffix Trees: Using analytic combinatorics on words we prove that

$$M(z, u) = \sum_{n=1}^{\infty}\sum_{k=1}^{\infty} P(M_n = k)\,u^k z^n = \sum_{w\in\mathcal{A}^*}\sum_{\alpha\in\mathcal{A}} \frac{u\,P(\alpha)\,P(w)}{D_w(z)}\cdot\frac{D_{w\alpha}(z) - (1-z)}{D_w(z) - u\big(D_{w\alpha}(z) - (1-z)\big)},$$

where $D_w(z) = (1-z)S_w(z) + z^m P(w)$ and $S_w(z)$ is the autocorrelation polynomial:

$$S_w(z) = \sum_{k\in\mathcal{P}(w)} P\big(w_{k+1}^{m}\big)\,z^{m-k};$$

$\mathcal{P}(w)$ denotes the set of positions $k$ of $w$ satisfying $w_1\ldots w_k = w_{m-k+1}\ldots w_m$.

For any $\varepsilon > 0$ there exists $\beta > 1$ such that (all the hard analytic work is here!)

$$|\Pr(M_n = k) - \Pr(M_n^I = k)| = O\big(n^{-\varepsilon}\beta^{-k}\big).$$

Random suffix trees resemble random independent tries (cf. P. Jacquet and W.S., 1994; Lonardi, W.S., and Ward, 2005; IEEE Trans. Inf. Theory, 2007).

Main Results

Theorem 4 (Ward, W.S., 2005). Let $z_k = \frac{2kr\pi i}{\ln p}$ for all $k \in \mathbb{Z}$, where $\frac{\ln p}{\ln q} = \frac{r}{s}$ for some relatively prime integers $r, s$ (i.e., $\frac{\ln p}{\ln q}$ is rational).

The $j$th factorial moment $E[(M_n)_j] = E[M_n(M_n - 1)\cdots(M_n - j + 1)]$ is

$$E[(M_n)_j] = \Gamma(j)\,\frac{q(p/q)^j + p(q/p)^j}{h} + \delta_j(\log_{1/p} n) + O(n^{-\eta}),$$

where $h = -p\log p - q\log q$ is the entropy rate, $\eta > 0$, $\Gamma$ is the Euler gamma function, and

$$\delta_j(t) = \sum_{k\ne 0} \frac{-e^{2kr\pi i t}\,\Gamma(z_k + j)\,\big(p^j q^{-z_k - j + 1} + q^j p^{-z_k - j + 1}\big)}{p^{-z_k + 1}\ln p + q^{-z_k + 1}\ln q}.$$

$\delta_j$ is a periodic function that has small magnitude and exhibits fluctuations when $\frac{\ln p}{\ln q}$ is rational.

Note: On average there are $E[M_n] \sim 1/h$ additional pointers.

Magnitude of the fluctuations, $\frac{1}{\ln 2}\sum_{k\ne 0}\big|\Gamma\big(j - \frac{2ki\pi}{\ln 2}\big)\big|$, for selected $j$:

j     (1/ln 2) sum_{k != 0} |Gamma(j - 2 k i pi / ln 2)|
1     1.4260 x 10^{-5}
3     1.2072 x 10^{-3}
5     1.1421 x 10^{-1}
6     1.1823 x 10^{0}
8     1.4721 x 10^{2}
9     1.7798 x 10^{3}
10    2.2737 x 10^{4}

Distribution of Mn

Theorem 5 (Ward, W.S., 2005). Let $z_k = \frac{2kr\pi i}{\ln p}$ for all $k \in \mathbb{Z}$, where $\frac{\ln p}{\ln q} = \frac{r}{s}$. Then

$$P(M_n = j) = \frac{p^j q + q^j p}{jh} + \sum_{k\ne 0} \frac{-e^{2kr\pi i\log_{1/p} n}\,\Gamma(z_k)\,(p^j q + q^j p)\,(z_k)_j}{j!\,\big(p^{-z_k+1}\ln p + q^{-z_k+1}\ln q\big)} + O(n^{-\eta}),$$

where $\eta > 0$ and $\Gamma$ is the Euler gamma function.

Therefore, $M_n$ follows the logarithmic series distribution with mean $1/h$ (plus some fluctuations).

The logarithmic series distribution $\big((p^j q + q^j p)/(jh)\big)$ is well concentrated around its mean $E[M_n] \approx 1/h$.

[Figure: plot of the logarithmic series distribution.]
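A small numeric check (mine, not from the talk) of the leading term: the mass of $(p^j q + q^j p)/(jh)$ sums to 1 and its mean is $1/h$, consistent with $E[M_n] \approx 1/h$.

from math import log

def leading_pmf(p, j_max=60):
    q = 1.0 - p
    h = -p * log(p) - q * log(q)
    pmf = [(p**j * q + q**j * p) / (j * h) for j in range(1, j_max + 1)]
    return pmf, h

pmf, h = leading_pmf(0.6)
mean = sum((j + 1) * w for j, w in enumerate(pmf))
print(sum(pmf), mean, 1.0 / h)   # total mass ~ 1, mean ~ 1/h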

Analytic Information Theory and Analysis of Algorithms

• In the 1997 Shannon Lecture Jacob Ziv presented compelling

arguments for “backing off” from first-order asymptotics in order to

predict the behavior of real systems with finite length description.

• To overcome these difficulties we propose replacing first-order analyses

by full asymptotic expansions and more accurate analyses (e.g., large

deviations, central limit laws).

• Following Hadamard’s precept1, we study information theory problems

using techniques of complex analysis such as generating functions,

combinatorial calculus, Rice’s formula, Mellin transform, Fourier series,

sequences distributed modulo 1, saddle point methods, analytic

poissonization and depoissonization, and singularity analysis.

• This program, which applies complex-analytic tools of analysis of

algorithms to information theory, constitutes analytic information

theory.

1The shortest path between two truths on the real line passes through the complex plane.

Thank You!
