
ECE750-TXB Lecture 1: Asymptotics

Todd L. Veldhuizen ([email protected])
Electrical & Computer Engineering, University of Waterloo, Canada
February 26, 2007


Motivation

- We want to choose the best algorithm or data structure for the job.
- Need characterizations of resource use, e.g., time, space; for circuits: area, depth.
- Many, many approaches:
  - Worst Case Execution Time (WCET): for hard real-time applications
  - Exact measurements for a specific problem size, e.g., the number of gates in a 64-bit addition circuit.
  - Performance models, e.g., R_∞, n_{1/2} for latency-throughput, HINT curves for linear algebra (characterize performance through different cache regimes), etc.
  - ...


Asymptotic analysis

- We will focus on asymptotic analysis: a good "first approximation" of performance that describes behaviour on big problems.
- Reasonably independent of:
  - machine details (e.g., 2 cycles for add+mult vs. 1 cycle);
  - clock speed, programming language, compiler, etc.


Asymptotics: Brief history

- Basic ideas originated in Paul du Bois-Reymond's Infinitärcalcül ("calculus of infinities"), developed in the 1870s.
- G. H. Hardy greatly expanded on du Bois-Reymond's ideas in his monograph Orders of Infinity (1910) [3].
- The "big-O" notation was first used by Bachmann (1894) and popularized by Landau (hence it is sometimes called "Landau notation").
- Adopted by computer scientists [4] to characterize resource consumption, independent of small machine differences, languages, compilers, etc.


Basic asymptotic notations

Asymptotic ≡ behaviour as n → ∞, where for our purposes n is the "problem size." Three basic notations:

- f ∼ g ("f and g are asymptotically equivalent")
- f ≼ g ("f is asymptotically dominated by g")
- f ≍ g ("f and g are asymptotically bounded by one another")


Basic asymptotic notations

f ∼ g means lim_{n→∞} f(n)/g(n) = 1.

Example: 3x² + 2x + 1 ∼ 3x².

∼ is an equivalence relation:

- Transitive: (x ∼ y) ∧ (y ∼ z) ⇒ (x ∼ z)
- Reflexive: x ∼ x
- Symmetric: (x ∼ y) ⇒ (y ∼ x)

Basic idea: we only care about the "leading term," disregarding less quickly-growing terms.
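
A quick numerical check makes the definition concrete (a minimal Python sketch of ours, not part of the original notes): the ratio f(n)/g(n) tends to 1 as n grows.

    # Illustrate f ~ g numerically: the ratio f(n)/g(n) approaches 1.
    f = lambda n: 3 * n**2 + 2 * n + 1
    g = lambda n: 3 * n**2

    for n in [10, 100, 1000, 10000]:
        print(n, f(n) / g(n))
    # 10 -> 1.07, 100 -> 1.0067, 1000 -> 1.000667, 10000 -> 1.0000667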


Basic asymptotic notations

f ≼ g means lim sup_{n→∞} f(n)/g(n) < ∞

i.e., f(n)/g(n) is eventually bounded by a finite value.

- Basic idea: f grows more slowly than g, or just as quickly as g.
- ≼ is a preorder (or quasiorder):
  - Transitive: (f ≼ g) ∧ (g ≼ h) ⇒ (f ≼ h)
  - Reflexive: f ≼ f
- ≼ fails to be a partial order because it is not antisymmetric: there are functions f, g where f ≼ g and g ≼ f but f ≠ g.
- Variant: g ≽ f means f ≼ g.


Basic asymptotic notations

Write f ≍ g when there are positive constants c₁, c₂ such that

  c₁ ≤ f(n)/g(n) ≤ c₂

for sufficiently large n.

- Examples:
  - n ≍ 2n
  - n ≍ (2 + sin πn)·n
- ≍ is an equivalence relation.


Strict forms

Write f ≺ g when f ≼ g but not f ≍ g.

- Basic idea: f grows strictly less quickly than g.
- Equivalently: f ≺ g exactly when lim_{n→∞} f(n)/g(n) = 0.
- Examples:
  - x² ≺ x³
  - log x ≺ x
- Variant: f ≻ g means g ≺ f.


Orders of growth

We can use ≺ as a "ruler" by which to judge the growth of functions. Some common "tick marks" on this ruler are:

  log log n ≺ log n ≺ log^k n ≺ n^ε ≺ n ≺ n² ≺ ··· ≺ 2ⁿ ≺ n! ≺ nⁿ ≺ 2^{2ⁿ}

Within ≺ we can always find a dense total order without endpoints, i.e.,

- there is no slowest-growing function;
- there is no fastest-growing function;
- if f ≺ h we can always find a g such that f ≺ g ≺ h.

(The canonical example of a dense total order without endpoints is Q, the rationals.) This fact allows us to sketch graphs in which points on the axes are asymptotes.


Big-O Notation

"Big-O" is a convenient family of notations for asymptotics:

  O(g) ≡ { f : f ≼ g }

i.e., O(g) is the set of functions f such that f ≼ g.

- O(n²) contains n², 7n², n, log n, n^{3/2}, 5, ...
- Note that f ∈ O(g) means exactly f ≼ g.
- A standard abuse of notation is to treat a big-O expression as if it were a term:

    x² + 2x^{1/2} + 1 = x² + O(x^{1/2})

  (the discarded terms 2x^{1/2} + 1 are ≺ x²). The above equation should be read as "there exists a function f ∈ O(x^{1/2}) such that x² + 2x^{1/2} + 1 = x² + f(x)."


Big-O for algorithm analysis

- Big-O notation is an excellent tool for expressing machine/compiler/language-independent complexity properties.
- On one machine a sorting algorithm might take ≈ 5.73 n log n seconds; on another it might take ≈ 9.42 n log n + 3.2 n seconds.
- We can wave these differences aside by saying the algorithm runs in O(n log n) seconds.
- O(f(n)) means something that behaves asymptotically like f(n):
  - disregarding any initial transient behaviour;
  - disregarding any multiplicative constants c · f(n);
  - disregarding any additive terms that grow less quickly than f(n).


Basic properties of big-O notation

Given a choice between a sorting algorithm that runs in O(n²) time and one that runs in O(n log n) time, which should we choose?

1. Gut instinct: the O(n log n) one, of course!
2. But: note that the class of functions O(n²) also contains n log n. Just because we say an algorithm is O(n²) does not mean it takes n² time!
3. It could be that the O(n²) algorithm is faster than the O(n log n) one.
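
A small timing experiment (our own sketch, not from the slides) makes point 3 concrete: a Θ(n²) insertion sort, implemented in the same language as a Θ(n log n) merge sort, wins on small inputs and loses badly on large ones.

    import random, timeit

    def insertion_sort(a):                 # Theta(n^2) worst case
        a = list(a)
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a

    def merge_sort(a):                     # Theta(n log n)
        if len(a) <= 1:
            return list(a)
        mid = len(a) // 2
        left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        return out + left[i:] + right[j:]

    for n in [4, 16, 64, 1024]:
        data = [random.random() for _ in range(n)]
        t_ins = timeit.timeit(lambda: insertion_sort(data), number=200)
        t_mrg = timeit.timeit(lambda: merge_sort(data), number=200)
        print(f"n={n:5d}  insertion/merge time ratio: {t_ins / t_mrg:.2f}")
    # Ratios below 1 for small n (insertion sort wins), well above 1 for large n.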


Additional notations

To distinguish between "at most this fast," "at least this fast," etc., there are additional big-O-like notations:

  f ∈ O(g) ≡ f ≼ g    upper bound
  f ∈ o(g) ≡ f ≺ g    strict upper bound
  f ∈ Θ(g) ≡ f ≍ g    tight bound
  f ∈ Ω(g) ≡ f ≽ g    lower bound
  f ∈ ω(g) ≡ f ≻ g    strict lower bound


Tricks for a bad remembering day

- Lower case means strict:
  - o(n) is the strict version of O(n);
  - ω(n) is the strict version of Ω(n).
- ω, Ω (omega) is the last letter of the Greek alphabet: if f ∈ ω(g), then f comes after g in the asymptotic ordering.
- f ∈ Θ(g): the line through the middle of the theta suggests the asymptotes converging on a tight bound.


Notation: o(·)

f ∈ o(g) means f ≺ g

- o(·) expresses a strict upper bound.
- If f(n) is o(g(n)), then f grows strictly slower than g.
- Example: Σ_{k=0}^{n} 2^{-k} = 2 − 1/2ⁿ = 2 + o(1)
- o(1) denotes the class of functions g for which lim_{n→∞} g(n)/1 = 0, i.e., lim_{n→∞} g(n) = 0.
- 2 + o(1) means "2 plus something that vanishes as n → ∞."
- If f is o(g), it is also O(g).
- n! = o(nⁿ).


Notation: ω(·)

f ∈ ω(g) means f ≻ g

- ω(·) expresses a strict lower bound.
- If f(n) is ω(g(n)), then f grows strictly faster than g.
- f ∈ ω(g) is equivalent to g ∈ o(f).
- Example: the harmonic numbers

    h_n = Σ_{k=1}^{n} 1/k ∼ ln n + γ + O(n^{-1})

  - h_n ∈ ω(1) (it is unbounded)
  - h_n ∈ ω(ln ln n)
- n! = ω(2ⁿ) (grows faster than 2ⁿ)


Notation: Ω(·)

f ∈ Ω(g) means f ≽ g

- Ω(·) expresses a lower bound, not necessarily strict.
- If f(n) is Ω(g(n)), then f grows at least as fast as g.
- f ∈ Ω(g) is equivalent to g ∈ O(f).
- Example: matrix multiplication requires Ω(n²) time (at least enough time to look at each of the n² entries in the matrices).


Notation: Θ(·)

f ∈ Θ(g) means f ≍ g

- Θ(·) expresses a tight asymptotic bound.
- If f(n) is Θ(g(n)), then f(n)/g(n) is eventually contained in a finite positive interval [c₁, c₂].
- Θ(·) bounds are very precise, but often hard to obtain.
- Example: QuickSort runs in time Θ(n log n) on average. (Tight! Not much faster or slower!)
- Example: Stirling's approximation ln n! ∼ n ln n − n + O(ln n) implies that ln n! is Θ(n ln n).
- Don't make the mistake of thinking that f ∈ Θ(g) means lim_{n→∞} f(n)/g(n) = k for some constant k.


Algebraic manipulations of big-O

- Manipulating big-O terms requires some thought: always keep in mind what the symbols mean!
- An additive O(f(n)) term swallows any terms that are ≼ f(n):

    n² + n^{1/2} + O(n) + 3 = n² + O(n)

  The n^{1/2} and 3 on the l.h.s. are meaningless in the presence of an O(n) term.
- O(f(n)) − O(f(n)) = O(f(n)), not 0!
- O(f(n)) · O(g(n)) = O(f(n)g(n)).
- Example: what is ln n + γ + O(n^{-1}) times n + O(n^{1/2})?

    [ln n + γ + O(n^{-1})] · [n + O(n^{1/2})] = n ln n + γn + O(n^{1/2} ln n)

  The terms γ·O(n^{1/2}), O(n^{-1/2}), O(1), etc. get swallowed by O(n^{1/2} ln n).


Sharpness of estimates

Example: for a constant c,

  ln(n + c) = ln(n(1 + c/n))
            = ln n + ln(1 + c/n)
            = ln n + c/n − c²/(2n²) + ···   (Maclaurin series)
            = ln n + Θ(1/n)

It is also correct to write

  ln(n + c) = ln n + O(n^{-1})
  ln(n + c) = ln n + o(1)

since Θ(n^{-1}) ⊆ O(1/n) ⊆ o(1). However, the Θ(1/n) error term is sharper: a better estimate of the error.
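
The expansion is easy to check with a computer algebra system. Here is a minimal sympy sketch (ours; the lecture later uses Maple's asympt for the same job), assuming sympy's series() at x0 = oo behaves as documented:

    import sympy as sp

    n, c = sp.symbols('n c', positive=True)

    # Expand ln(n + c) for large n; this reproduces
    # ln n + c/n - c**2/(2*n**2) + ...
    print(sp.series(sp.log(n + c), n, sp.oo, 3))
    # -> log(n) + c/n - c**2/(2*n**2) + O(n**(-3), (n, oo))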


Sharpness of estimates & the Riemann Hypothesis

Example: let π(n) be the number of prime numbers ≤ n. The Prime Number Theorem is that

  π(n) ∼ Li(n)    (1)

where Li(n) = ∫_{x=2}^{n} dx/ln x is the logarithmic integral, and

  Li(n) ∼ n/ln n

Note that (1) is equivalent to:

  π(n) = Li(n) + o(Li(n))

It is known that the error term can be improved, for example to

  π(n) = Li(n) + O((n/ln n)·e^{-a√(ln n)})


Sharpness of estimates & the Riemann Hypothesis (continued)

The famous Riemann Hypothesis is the conjecture that a sharper error estimate is true:

  π(n) = Li(n) + O(n^{1/2} ln n)

This is one of the Clay Institute millennium problems, with a $1,000,000 reward for a positive proof. Sharp estimates matter!


Sharpness of estimates

To maintain sharpness of asymptotic estimates during analysis, some caution is required.

E.g., if f(n) = 2ⁿ + O(n), what is log f(n)?

Bad answer: log f(n) = n + O(n).

More careful answer:

  log f(n) = log(2ⁿ + O(n))
           = log(2ⁿ(1 + O(n·2^{-n})))
           = log(2ⁿ) + log(1 + O(n·2^{-n}))

Since log(1 + δ(n)) = O(δ(n)) when δ ∈ o(1),

  log f(n) = n + O(n·2^{-n})

i.e., log f(n) is equal to n plus some value converging exponentially fast to 0.


Sharpness of estimates

  log f(n) = n + O(n·2^{-n})

is a reasonably sharp estimate (but what happens if we take 2^{log f(n)} with this estimate?). If we don't care about the rate of convergence we can write

  log f(n) = n + o(1)

where o(1) represents some function converging to zero. This is less sharp, since we have lost the rate of convergence. Even less sharp is

  log f(n) ∼ n

which loses the idea that log f(n) − n → 0, and doesn't rule out things like log f(n) = n + n^{3/4}.


Asymptotic expansions

An asymptotic expansion of a function describes how that function behaves for large values. Often it is used when an explicit description of the function is too messy or hard to derive.

E.g., if I choose a string of n bits uniformly at random (i.e., each of the 2ⁿ possible strings has probability 2^{-n}), what is the probability of getting ≥ (3/4)n 1's?

Easy to write the answer: there are (n choose k) ways of arranging k 1's, so the probability of getting ≥ (3/4)n 1's is:

  P(n) = Σ_{k=⌈3n/4⌉}^{n} 2^{-n} (n choose k)

This equation is both exact and wholly uninformative.


Asymptotic expansions

Can we do better? Yes! The number of 1's in a random bit string has a binomial distribution, which is well-approximated by the normal distribution as n → ∞:

  Σ_{k = n/2 + α√n}^{n} 2^{-n} (n choose k) ∼ ∫_{α}^{∞} (1/√(2π)) e^{-x²/2} dx = 1 − F(α)

where F(x) = (1/2)(1 + erf(x/√2)) is the cumulative normal distribution. Maple's asympt command yields the asymptotic expansion:

  F(x) ∼ 1 − O(1/(x·e^{x²/2}))


Asymptotic expansions example

We want to estimate the probability of ≥ (3/4)n 1's:

  (1/2)n + α√n = (3/4)n

gives α = √n/4. Therefore the probability is

  P(n) ∼ 1 − F(√n/4) ∼ 1 − 1 + O(1/(√n·e^{n/32})) = O(1/(√n·e^{n/32}))

So, the probability of having more than (3/4)n 1's converges to 0 exponentially fast.
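
As a sanity check (our sketch, not in the notes), we can compute the exact tail sum P(n) with math.comb and compare it against 1/(√n·e^{n/32}). The ratio stays bounded, consistent with the O(·) estimate (in fact it heads toward 0, since the normal approximation is conservative this far out in the tail).

    import math

    def exact_tail(n):
        """P(n): probability that a uniform n-bit string has >= 3n/4 ones."""
        lo = math.ceil(3 * n / 4)
        return sum(math.comb(n, k) for k in range(lo, n + 1)) / 2**n

    for n in [16, 32, 64, 128, 256]:
        estimate = 1 / (math.sqrt(n) * math.exp(n / 32))
        print(n, exact_tail(n), exact_tail(n) / estimate)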


Asymptotic Expansions

- When taking an asymptotic expansion, one writes

    ln n! ∼ n ln n − n + O(1)

  rather than

    ln n! = n ln n − n + O(1)

  Writing ∼ is a clue to the reader that an asymptotic expansion is being taken, rather than just carrying an error term around.
- Asymptotic expansions are very important in average-case analysis, where we are interested in characterizing how an algorithm performs for most inputs.
- To prove an algorithm runs in O(f(n)) on average, one technique is to obtain an asymptotic estimate of the probability of running in time exceeding f(n), and show that it converges to zero very quickly.


Asymptotic Expansions for Average-Case Analysis

- The time required to add two n-bit integers by a no-carry adder is proportional to the longest carry sequence.
- It can be shown that the probability of having a carry sequence of length ≥ t(n) satisfies

    Pr(carry sequence ≥ t(n)) ≤ 2^{−t(n) + log n + O(1)}

- If t(n) ≻ log n, the probability converges to 0. We can conclude that the average running time is O(log n).
- In fact we can make a stronger statement:

    Pr(carry sequence ≥ log n + ω(1)) → 0

  Translation: "the probability of having a carry sequence longer than log n + δ(n), where δ(n) is any unbounded function, converges to zero."


The Taylor series method of asymptotic expansion

- This is a very simple method for asymptotic expansion that works for simple cases; it is one technique Maple's asympt function uses.
- Recall that the Taylor series of a C^∞ function about x = 0 is given by:

    f(x) = f(0) + x f′(0) + (x²/2!) f″(0) + (x³/3!) f‴(0) + ···

- To obtain an asymptotic expansion of some function F(n) as n → ∞:
  1. Substitute n = x^{-1} into F(n). (Then n → ∞ as x → 0.)
  2. Take a Taylor series about x = 0.
  3. Substitute x = n^{-1}.
  4. Use the dominating term(s) as the expansion, and the next term as the error term.


Taylor series method of asymptotic expansion: example

Example expansion: F(n) = e^{1 + 1/n}.

Obviously lim_{n→∞} F(n) = e, so we expect something of the form F(n) ∼ e + o(1).

1. Substitute n = x^{-1} into F(n): obtain F(x^{-1}) = e^{1+x}.
2. Taylor series about x = 0:

     e^{1+x} = e + xe + (x²/2)e + (x³/6)e + ···

3. Substitute x = n^{-1}:

     = e + e/n + e/(2n²) + e/(6n³) + ···

4. Since e ≻ e/n ≻ e/(2n²) ≻ ···,

     F(n) ∼ e + Θ(1/n)


Asymptotics of algorithms

Asymptotics is a key tool for algorithms and data structures:

- Analyze algorithms/data structures to obtain sharp estimates of asymptotic resource consumption (e.g., time, space).
- Possibly use asymptotic expansions in the analysis to estimate, e.g., probabilities.
- Use these resource estimates to:
  - decide which algorithm/data structure is "best" according to design criteria;
  - reason about the performance of compositions (combinations) of algorithms and data structures.


References on asymptotics

- Course text: [1], asymptotic notations.
- Concrete Mathematics, Ronald L. Graham, Donald E. Knuth and Oren Patashnik, Ch. 9, Asymptotics [2].
- Advanced:
  - Shackell, Symbolic Asymptotics [6]
  - Hardy, Orders of Infinity [3]
  - Lightstone & Robinson, Nonarchimedean Fields and Asymptotic Expansions [5]


Bibliography I

[1] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. McGraw Hill, 1991.

[2] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA, USA, second edition, 1994.


Bibliography II

[3] G. H. Hardy. Orders of Infinity: The 'Infinitärcalcül' of Paul du Bois-Reymond. Hafner Publishing Co., New York, 1971. Reprint of the 1910 edition, Cambridge Tracts in Mathematics and Mathematical Physics, No. 12.

[4] Donald E. Knuth. Big omicron and big omega and big theta. SIGACT News, 8(2):18-24, 1976.

[5] A. H. Lightstone and Abraham Robinson. Nonarchimedean Fields and Asymptotic Expansions. North-Holland Publishing Co., Amsterdam, 1975. North-Holland Mathematical Library, Vol. 13.


Bibliography III

[6] John R. Shackell. Symbolic Asymptotics, volume 12 of Algorithms and Computation in Mathematics. Springer-Verlag, Berlin, 2004.


ECE750-TXB Lecture 2: Resources and Complexity Classes

Todd L. Veldhuizen ([email protected])
Electrical & Computer Engineering, University of Waterloo, Canada
January 16, 2007


Resource Consumption

To decide which algorithm or data structure to use, we are interested in their resource consumption. Depending on the problem context, we might be concerned with:

- Time and space consumption
- For logic circuits:
  - number of gates
  - depth
  - area
  - heat production
- For parallel/distributed computing:
  - number of processors
  - amount of communication required
  - parallel running time
- For randomized algorithms:
  - number of random bits used
  - error probability


Machine models

- The performance of an algorithm must always be analyzed with reference to some machine model that defines:
  - the basic operations supported (e.g., random-access memory; arithmetic; obtaining a random bit; etc.);
  - the resource cost of each operation.
- Some common machine models:
  - Turing machine (TM): very primitive, tape-based, used for theoretical arguments only;
  - Nondeterministic Turing machine: a TM that can effectively fork its execution at each step, so that after t steps it can behave as if it were a superfast parallel machine with e.g. 2^t processors;
  - RAM (Random Access Machine): a model that corresponds more-or-less to an everyday single-CPU desktop machine, but with infinite memory;
  - PRAM and LogP [2, 3] are popular models for parallel computing.


Machine models

- The performance of an algorithm can change drastically when you change machine models. E.g., many problems believed to take exponential time on a RAM (assuming P ≠ NP) can be solved in polynomial time on a nondeterministic TM.
- Often there are generic results that let you translate resource bounds on one machine model to another:
  - An algorithm taking time T(n) and space S(n) on a Turing machine can be simulated in O(T(n) log log S(n)) time by a RAM;
  - An algorithm taking time T(n) and space S(n) on a RAM can be simulated in O(T³(n)(S(n) + T(n))²) time by a Turing machine.
- Unless otherwise stated, people are usually referring to a RAM or similar machine model.


Machine models

- When you are analyzing an algorithm, know your machine model.
- There are embarrassing papers in the literature in which nonspecialists have "proven" outlandish complexity results by making basic mistakes,
- e.g. assuming that arbitrary-precision real numbers can be stored in O(1) space and multiplied, added, etc. in O(1) time. On realistic sequential (nonparallel) machine models, d-digit real numbers take:
  - O(d) space
  - O(d) time to add
  - O(d log d) time to multiply


Example of time and space complexity

- Let's compare three containers for storing values: list, tree, sorted array. Let n be the number of elements stored.
- Average-case complexity (on a RAM) is:

                    Space   Search time   Insert time
    List            Θ(n)    Θ(n)          Θ(n)
    Balanced tree   Θ(n)    Θ(log n)      Θ(log n)
    Sorted array    Θ(n)    Θ(log n)      Θ(n)

- If search time is important: since log n ≺ n, a balanced tree or sorted array will be faster than a list for sufficiently large n.
- If insert time is important: use a balanced tree.
- Caveat: asymptotic performance says nothing about performance for small cases.
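
A tiny experiment (our sketch, not part of the slide) contrasts the Θ(n) list search with the Θ(log n) sorted-array search, using Python's bisect module for the latter:

    import bisect, random, timeit

    n = 100_000
    keys = sorted(random.sample(range(10 * n), n))    # the sorted array
    probe = keys[n // 2]

    def linear_search(a, x):        # Theta(n), as for an unsorted list
        for i, v in enumerate(a):
            if v == x:
                return i
        return -1

    def binary_search(a, x):        # Theta(log n), sorted array
        i = bisect.bisect_left(a, x)
        return i if i < len(a) and a[i] == x else -1

    print(timeit.timeit(lambda: linear_search(keys, probe), number=100))
    print(timeit.timeit(lambda: binary_search(keys, probe), number=100))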


Example: Circuit complexity

- In circuit complexity, we do not analyze programs per se, but a family of circuits, one for each problem size (e.g., addition circuits for n-bit integers).
- Circuits are built from basic gates. The most realistic model is gates that have finite fan-in and fan-out, i.e., gates have 2 inputs and output signals can be fed into at most k inputs.
- Common resource measures are:
  - time (i.e., delay, circuit depth)
  - number of gates (or cells, for VLSI)
  - fan-out
  - area
- E.g., addition circuits:

    Adder type              Gates       Depth
    Ripple-carry adder      ≈ 7n        ≈ 2n
    Carry-skip (1L)         ≈ 8n        ≈ 4√n
    Carry-lookahead         ≈ 14n       ≈ 4 log n
    Conditional-sum adder   ≈ 3n log n  ≈ 2 log n


Resource consumption tradeoffs

- Often there are tradeoffs between consumption of resources.
- Example: testing whether a number is prime. The Miller-Rabin test takes time Θ(k log³ n) and has probability of error 4^{-k}.
- Choosing k = 20 yields time Θ(log³ n) and probability of error 2^{-40}.
- Choosing k = (1/2) log n yields time Θ(log⁴ n) and probability of error 1/n.
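
For concreteness, here is a minimal Miller-Rabin sketch (ours, for illustration only) with the round count k as the knob in the time/error tradeoff; each extra round multiplies the error bound for composites by 1/4:

    import random

    def is_probable_prime(n, k=20):
        """Miller-Rabin with k rounds; error probability <= 4**(-k) for composite n."""
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        d, r = n - 1, 0
        while d % 2 == 0:           # write n - 1 = 2**r * d with d odd
            d //= 2
            r += 1
        for _ in range(k):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False        # witness found: definitely composite
        return True                 # no witness in k rounds: probably prime

    print(is_probable_prime(2**61 - 1))   # True: a Mersenne prime
    print(is_probable_prime(2**61 + 1))   # False: divisible by 3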


Resource consumption tradeoffs: time-space

Cracking passwords has a time-space tradeoff:

- Passwords are stored encrypted to make them hard to recover: e.g. htpasswd (web passwords) turns "foobar" into "AjsRaSQk32S6s".
- Brute-force approach: if there are n possible passwords, precompute a database of size O(n) containing every possible encrypted password and its plaintext. Crack passwords in O(log n) time by looking them up in the database.
- Prohibitively expensive in space: e.g. n ≈ 2^{64}.
- Hellman: can recover plaintext in O(n^{2/3}) time using a database of size O(n^{2/3}).
- MS-Windows LanManager passwords are 14 characters; they are stored hashed (encrypted). With a precomputed database of size 1.4 GB (two CD-ROMs), 99.9% of all alphanumerical password hashes can be cracked in 13.6 seconds [4].


Resource consumption tradeoffs: Area-time

In designing circuits, e.g. VLSI, one is concerned with how much area a circuit takes up vs. how fast it is (its gate depth).

- Often one can sacrifice area for time (depth), and vice versa.
- E.g. multiplying two n-bit numbers: with A the area and T the time, it is known [1] that for any circuit family

    A·T^{2α} = Ω(n^{1+α})

  This is an "area-time product."


Kinds of problems

- We write algorithms to solve problems.
- Some special classes of problems:
  - Decision problems: require a yes/no answer. Example: does this file contain a valid Java program?
  - Optimization problems: require choosing a solution that minimizes (or maximizes) some objective function. Example: find a circuit made out of AND, OR, and NOT gates that computes the sum of two 8-bit integers and has the fewest gates.
  - Counting problems: count the number of objects that satisfy some criterion. Example: for how many inputs will this circuit output zero?


Complexity classes

- A complexity class is defined as:
  - a style of problem
  - that can be solved with a specified amount of resources
  - on a specified machine model.
- Example: P (a.k.a. PTIME) is the class of decision problems that can be solved in polynomial time (i.e., time O(n^d) for some d ∈ N) on a Turing machine.
- Complexity classes:
  - let us lump together problems according to how "hard" they are;
  - are usually defined so as to be invariant under non-radical changes of machine model (e.g., the class P on a TM is the same as the class P on a RAM).


Some basic distinctions

- At the coarsest level of structure, decision problems come in three varieties:
  - Problems we can write computer programs to solve. What this course is about! (The program will always stop and say "yes" or "no," and be right!)
  - Problems we can define, but not write computer programs to solve (e.g., deciding whether a Java program runs in polynomial time).
  - Problems we cannot even define.
- Consider deciding whether x ∈ A for some set A ⊆ N of natural numbers, e.g., the prime numbers.
  - In any (effective) notation system we care to choose, there are ℵ₀ (countably many) problem definitions. (They can be put into 1-1 correspondence with the natural numbers.)
  - There are 2^{ℵ₀} (uncountably many) problems, i.e. subsets A ⊆ N. (They can be put into 1-1 correspondence with the reals.)


An aside: Hasse diagrams

- Complexity classes are sets of problems.
- Some complexity classes are contained inside other complexity classes.
  - E.g., every problem in class P (polynomial time on a TM) is also in class PSPACE (polynomial space on a TM).
  - We can write P ⊆ PSPACE to mean: the class P is contained in the class PSPACE.
- ⊆ is a partial order: reflexive, transitive, anti-symmetric.
- Hasse diagrams are intuitive ways of drawing partial orders.


An aside: Hasse diagrams

- Example: I am a professor and a geek. Professors are people; geeks are people (are too!)
  - me ⊆ professors
  - me ⊆ geeks
  - professors ⊆ people
  - geeks ⊆ people

  The corresponding Hasse diagram:

            people
           /      \
      professors  geeks
           \      /
             me


Whirlwind tour of major complexity classes

- Some caveats:
  - There are 462 classes in the Complexity Zoo.
  - We'll see... slightly fewer than that. (Most complexity classes are interesting primarily to structural complexity theorists; they capture fine distinctions that we're not concerned with day-to-day.)
  - For every class we shall see, there are many classes above, beside, and below it that are not shown.
  - The Hasse diagrams do not imply that the containment is strict: e.g., when the diagram shows NP above P, this means P ⊆ NP, not P ⊂ NP.


Whirlwind tour of major complexity classes

        Decidable
            |
           EXP
            |
         PSPACE
         /      \
      coNP      NP
         \      /
            P

- EXP = decision problems, exponential time on a TM (a.k.a. EXPTIME)
- PSPACE = decision problems, polynomial space on a TM
- P = decision problems, polynomial time on a TM (a.k.a. PTIME)
- NP, coNP: we'll get to these...


Randomness-related classes

ZPP, RP, coRP, BPP: probabilistic classes (the machine has access to random bits).

         EXPTIME
        /   |    \
      NP   BPP   coNP
       \   / \   /
        RP    coRP
          \   /
           ZPP
            |
          PTIME

- BPP ≈ problems that can be solved in polynomial time with access to a random number source, with probability of error < 1/2. (Run many times and vote: get the error as low as you like.)
- ZPP = problems that can be solved in polynomial time with access to a random number source, with zero probability of error.


Polynomial-time and below

    PTIME       polynomial time
      |
     NC         "Nick's class"
      |
    LOGSPACE    logarithmic space
      |
     NC1        logarithmic-depth circuits, bounded fan-in/out
      |
     AC0        constant-depth circuits, unbounded fan-in


Structural complexity theory

- Structural complexity theory = the study of complexity classes and their interrelationships.
- Many fundamental relationships are not known:
  - Is P = NP? (Lots of industrially important problems are NP, like placement & routing for VLSI, designing communication networks, etc.)
  - Is ZPP = P? (Is randomness really necessary?)
  - Is BPP ⊆ NP? If so, we can solve those hard problems in NP by flipping coins, with some error so tiny we don't care.
- Lots of conditional results are known, e.g.: "If BPP contains NP, then RP = NP and PH is contained in BPP; any proof of BPP = P would require showing either that NEXP is not in P/poly or that #P requires superpolynomial-sized circuits."
- Luckily (for me and you) this is not a course in complexity theory. We will do basics only.


Bibliography I

[1] Richard P. Brent and H. T. Kung. The area-time complexity of binary multiplication. J. ACM, 28(3):521-534, 1981.

[2] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: towards a realistic model of parallel computation. In Marina Chen, editor, Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1-12, San Diego, CA, May 1993. ACM Press.


Bibliography II

[3] David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and Thorsten von Eicken. LogP: a practical model of parallel computation. Commun. ACM, 39(11):78-85, 1996.

[4] Philippe Oechslin. Making a faster cryptanalytic time-memory trade-off. In Dan Boneh, editor, CRYPTO, volume 2729 of Lecture Notes in Computer Science, pages 617-630. Springer, 2003.


ECE750-TXB Lecture 3: Basic Algorithm Analysis, Recurrences, and Z-transforms

Todd L. Veldhuizen ([email protected])
Electrical & Computer Engineering, University of Waterloo, Canada
February 28, 2007


Part I

Basic Algorithm Analysis


RAM-style machine models

- Unless we are dealing with parallelism, randomness, circuits, etc., for the remainder of this course we will always assume a RAM-style machine.
- RAM = random access machine:
  - Every memory location can be read and written in O(1) time. (This is in contrast to a Turing machine, where reading a symbol at position p on the tape requires moving the head across the tape to p, requiring O(p) steps.)
  - Memory locations, variables, registers, etc. all contain objects of size O(1) (e.g., 64-bit words).
  - Basic operations (addition, multiplication, etc.) all take O(1) time.


Styles of analysis

- Worst case: if an algorithm has worst-case O(f(n)) time, there are constants c₁, c₂ such that no input requires more than c₁ + c₂·f(n) time, for n big.
- Average case: average-case O(f(n)) time means: the time required by the algorithm on inputs of size n, averaged according to some probability distribution (usually uniform), is O(f(n)).
- Amortized analysis: if a data structure has amortized time O(f(m)), then a sequence of m operations will take O(m·f(m)) time. (Most operations are cheap, but every now and then you need to do something expensive.)


What is n?

- When we say an algorithm takes O(f(n)) time, what does n refer to?
- Default: n is the number of bits required to represent the input.
- However, often we choose n to be a natural description of the "size" of the problem:
  - the number of vertices in a graph (the input length is O(n²) bits to specify edges);
  - for number-theory algorithms: n is often an integer (the input length is O(log n) bits);
  - for linear algebra, n usually indicates rank:
    - the input is O(n) bits for vectors, e.g., dot product;
    - the input is O(n²) bits for matrices.
- Exactly what n stands for is important:
  - Two integers ≤ n can be multiplied in O(log n) time.
  - Two n-bit integers can be multiplied in O(n log n) time.


Tools for analyzing algorithms

- Asymptotics
- Recurrences, Z-transforms
- Combinatorics, Ramsey theory
- Discrepancy
- Probability, statistics
- Information theory
- Random objects (e.g., random graphs), zero-one laws
- Kolmogorov complexity
- ... pretty much anything else you can think of


No silver bullet

- Finding a bound for the running time of an algorithm is an undecidable problem; it is impossible to write a program that will automatically prove a bound, if one exists, for any program.
- There are very simple algorithms that have extremely long proofs of complexity bounds.
- There are very simple algorithms that nobody knows the running time of! E.g. the Collatz problem.
- In any formal system (e.g., ZFC set theory) there are simple algorithms that have a complexity bound, but this cannot be proven.
- There is no finite set of tools that suffices for algorithm analysis.
- However, there are well-defined classes of algorithms that can be analyzed in a systematic way, and we will learn some of these.


Recurrence equations

- Recurrence equations are one of the simplest techniques for algorithm analysis, and for simple programs the analysis is easily automated.
- Recipe:
  1. Write out the algorithm in some suitable programming language or pseudocode, so that every step is expressed in terms of basic operations supported by the machine model that take O(1) time.
  2. Attach to each statement/syntax block a variable that counts the amount of resource used there (e.g., time).
  3. Write equations that relate the variables; simplify or approximate as necessary.
  4. Solve the equations.


Pseudocode language

- A simple pseudocode language:

    s   = loc ← e                   assignment
        | if e then b [else b]opt   if statement
        | for v = e to e b          for loop
        | v(e, ···, e)              function call
        | return e
    b   = s | s b                   statement block (one or more statements)
    loc = v[e]                      array
        | v                         variable
    e   = loc                       location (array or variable)
        | e op e                    operator: + ∗ − etc.
        | e R e                     relation: ≤, =, etc.
        | constant


Analysis Rules

- Basic operations (array access, multiply, add, compare, etc.) take O(1) time.
- Represent constant-time operations by arbitrary constants c₁, c₂, etc.


For loops (simple version)

  t₁ | for i = 1 to n
     |   t₂(i) | ...
     | end

  t₁ = c₁ + Σ_{i=1}^{n} (c₂ + t₂(i))

- The time required by a loop is:
  - c₁: the time required to initialize i = 1;
  - for each loop iteration (the sum):
    - some constant overhead c₂ (the time required to increment i and compare to n);
    - the time required by the body of the loop, t₂(i), which might depend on the value of i.


If statements (simple version)

  t₁ | if e then        (t₂ = time to evaluate e)
     |   t₃ | ...
     | else
     |   t₄ | ...

  t₁ = t₂ + c₁ + max(t₃, t₄)

- Time required for an if statement:
  - t₂ = the time required to evaluate the branch condition;
  - c₁ = some constant time required for branching;
  - t₃, t₄ = the time taken in the two branches;
  - we use max(t₃, t₄) because we are seeking an upper bound on running time.


Function calls

- For each function F, introduce a time function T_F(n): it represents the amount of time used by the function F on inputs characterized by parameters n = (n₁, n₂, ···). (Usually we have just a single parameter: T_F(n).)
- The variable(s) n should include any values on which the time required by the function depends.
- Example: a (naive) function to compute r^s:

    Exp(r, s)
      p ← 1
      for i = 1 to s
        p ← p ∗ r
      end
      return p

  The time depends on s (the exponent) but not r (the base), so the time function should be T_Exp(s).
- Time required for a function call:

    t₁ | Exp(x, y)        t₁ = c₁ + T_Exp(y)


Example: Exp(r , s)

  Exp(r, s)
  t₁ | t₂ | p ← 1
     | t₃ | for i = 1 to s
     |    |   t₄ | p ← p ∗ r
     |    | end
     | t₅ | return p

  t₁ = t₂ + t₃ + t₅
  t₂ = c₂
  t₃ = c₃ + Σ_{i=1}^{s} (c′₃ + t₄)
  t₄ = c₄
  t₅ = c₅

- Solve:

    T_Exp(s) = t₁ = c₂ + c₃ + s(c′₃ + c₄) + c₅

- So, T_Exp(s) = c + c′s.
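
An instrumented version (our own sketch) counts the basic operations directly; the tally grows as c + c′s, matching the analysis:

    def exp_count(r, s):
        """Naive exponentiation; returns (r**s, number of basic operations)."""
        ops = 0
        p = 1; ops += 1           # p <- 1
        for i in range(1, s + 1):
            ops += 1              # loop bookkeeping (increment and compare)
            p = p * r; ops += 1   # p <- p * r
        ops += 1                  # return
        return p, ops

    for s in [1, 10, 100, 1000]:
        _, ops = exp_count(2, s)
        print(s, ops)             # ops = 2s + 2: linear in s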


A simplifying notation

- Write c to mean Θ(1). Anytime a constant is needed, just use c.
- The result is an upper bound on the time, for c sufficiently large.
- Example:

    Exp(r, s)
    t₁ | t₂ | p ← 1
       | t₃ | for i = 1 to s
       |    |   t₄ | p ← p ∗ r
       |    | end
       | t₅ | return p

    t₁ = t₂ + t₃ + t₅
    t₂ = c
    t₃ = c + Σ_{i=1}^{s} (c + t₄)
    t₄ = c
    t₅ = c

    T_Exp(s) = c + cs

- Fine point: each time c occurs, it means Θ(1) (some bounded value), but not the same value at each occurrence.


Matrix Multiply Example

  Matrix-Multiply(n, A, B, C)
    for i = 1 to n
      for j = 1 to n
        C(i, j) ← 0
        for k = 1 to n
          C(i, j) ← C(i, j) + A(i, k) ∗ B(k, j)
        end
      end
    end


Matrix Multiply: analyze

  Matrix-Multiply(n, A, B, C)
  t₁ | for i = 1 to n
     |   t₂ | for j = 1 to n
     |      |   t₃ | C(i, j) ← 0
     |      |   t₄ | for k = 1 to n
     |      |      |   t₅ | C(i, j) ← C(i, j) + A(i, k) ∗ B(k, j)
     |      |      | end
     |      | end
     | end

  t₁ = c + Σ_{i=1}^{n} (c + t₂)
  t₂ = c + Σ_{j=1}^{n} (c + t₃ + t₄)
  t₃ = c
  t₄ = c + Σ_{k=1}^{n} (c + t₅)
  t₅ = c


Matrix multiply: solve

- Solve:

    t₁ = c + Σ_{i=1}^{n} (2c + Σ_{j=1}^{n} (3c + Σ_{k=1}^{n} 2c))
       = c + n · (2c + n · (3c + n · 2c))
       = 2cn³ + 3cn² + 2cn + c

- MatrixMultiply(n, A, B, C) takes Θ(n³) time.


Analyzing Recursion

- When functions call themselves, we will get time functions defined in terms of themselves.
- Such equations are called recurrences.
- Example:

    T(1) = c              (base case)
    T(n) = c + T(n − 1)   (recursion)

  This example is easy to solve: T(n) = c + c + ··· + c (n copies of c) = cn.
- Rarely that easy in practice!


Fibonacci example

  Fibonacci(n)
    if n ≤ 2 then
      return 1
    else
      return Fibonacci(n − 1) + Fibonacci(n − 2)

- Analyze base case(s) separately:

    T(1) = c
    T(2) = c

- Recurrence:

    T(n) = c + T(n − 1) + T(n − 2)


Fibonacci example: Call graph

(Figure: the tree of recursive calls made by Fibonacci(n); the same subproblems are recomputed many times.)
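
Counting calls directly (our sketch; the slide shows the call tree as a figure) confirms the recurrence's exponential solution: the number of calls is 2·Fibonacci(n) − 1, which grows as Θ(φⁿ) with φ = (1 + √5)/2 ≈ 1.618.

    def fib(n):
        """Naive recursion from the slide above, instrumented to count calls."""
        global calls
        calls += 1
        if n <= 2:
            return 1
        return fib(n - 1) + fib(n - 2)

    for n in [10, 20, 30]:
        calls = 0
        fib(n)
        print(n, calls)   # 109, 13529, 1664079: each step of 10 multiplies by ~phi**10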


Part II

Z-transforms


Unilateral Z-transform I

- Transforms give us an alternative representation of functions or series in which certain manipulations and/or insights are easier.
- The Z-transform is of special interest to algorithm analysis because it makes solving some recurrences simple.
- To put the Z-transform in the context of transforms in general: there are integral transforms (for functions of the real line) and their discrete cousins, generating functions and formal power series.
- Integral transforms [1, 2]: represent a function f(x) by its transform F(p).
  - Forward transform: F(p) = ∫ K(p, x) f(x) dx
  - Inverse transform: f(x) = ∫ L(x, p) F(p) dp
  - K, L are kernels.
  - Laplace transform: K(p, x) = e^{−px}


Unilateral Z-transform II

  - Fourier transform: K(p, x) = e^{iωt}
  - Mellin transform: K(p, x) = x^{p−1}
  - Exotica: Hankel transform, Hilbert transform, Abel transform, ···
- Generating functions / formal power series [6, 2]: represent a sequence/discrete function f(n) by its transform/generating function F(z).
  - Forward transform: F(z) = Σ K(z, n) f(n)
  - Inverse transform: f(n) = ∫ L(n, z) F(z)
  - Ordinary generating functions: K(z, n) = zⁿ
  - Z-transforms (our focus): K(z, n) = z^{−n}
  - Exponential generating functions: K(z, n) = zⁿ/n!
  - Poisson generating functions: K(z, n) = e^{−z} zⁿ/n!
  - Exotica: Lambert series, Bell series


Unilateral Z-transform III

- Z-transforms are more-or-less the same as ordinary generating functions. The OGF can be obtained from the Z-transform, and vice versa, by the substitution z ↦ z^{−1}. The OGF form is more common in combinatorics and theoretical computer science; the Z-transform is more common in engineering, particularly signals and controls.
- Very useful for solving linear difference equations, i.e., equations of the form

    f(n) + c₁f(n − a₁) + c₂f(n − a₂) + ··· = g(n)

  that arise frequently in algorithm analysis.


Unilateral Z-transform IV

- Linear difference equations are a special case of recurrences. Examples of recurrences not in this class are:

    T(n) = T(n/2) + 1     solution: T(n) = Θ(log n)
    T(n) = T(√n) + 1      solution: T(n) = Θ(log log n)

  However, we can use Z-transforms to obtain approximate solutions to the above pair of recurrences by performing a change of variables that results in a linear recurrence: n = 2^r for the first, n = 2^{2^r} for the second. For a survey of recurrence-solving techniques, see [3].


Unilateral Z-transform V

- Definition of the Z-transform and its inverse:

    Z[f(n)] = Σ_{n=0}^{∞} f(n) z^{−n} = F(z)    (1)

    Z^{−1}[F(z)] = (1/(2πi)) ∮_C F(z) z^{n−1} dz    (2)

  where the contour C must be in the region of convergence of F(z) and encircle the origin.
- The function f(n) is discrete, i.e., f : N → R.
- The Z-transform F(z) is complex: F : C → C.
- The sum of Eqn. (1) often converges for only part of the complex plane, called the Region of Convergence (ROC) of F(z).
- In practice, never use Eqns. (1, 2): instead use tables of transform pairs.
- Standard references: [4, 6, 5] or any DSP book.


Unilateral Z-transform VI

- An important transform pair: the Z-transform of f(n) = bⁿ is

    Z[bⁿ] = Σ_{n=0}^{∞} z^{−n} bⁿ = 1/(1 − bz^{−1})

  Note that the sequence f(0), f(1), f(2), ··· = b⁰, b¹, b², ··· can be read off from the series expansion of F(z):

    1/(1 − bz^{−1}) = b⁰ + b¹z^{−1} + b²z^{−2} + b³z^{−3} + ···

  This is by definition; compare Eqn. (1).


Unilateral Z-transform VII

- A typical Z-transform of a function f(n) looks like:

    F(z) = [(1 − a₁z^{−1})(1 − a₂z^{−1}) ···] / [(1 − b₁z^{−1})(1 − b₂z^{−1}) ···]

  Here we have written F(z) in factored form.
- When z = aᵢ, (1 − aᵢz^{−1}) = 0, and F(z) = 0. Such values of z are called zeros.
- When z → bᵢ, (1 − bᵢz^{−1}) → 0 and F(z) → ∞. Such values of z are called poles.
- To take the inverse Z-transform of something in the form

    F(z) = N(z) / [(1 − b₁z^{−1})(1 − b₂z^{−1}) ···]


Unilateral Z-transform VIII

  where the b₁, b₂, ··· are all distinct, we can use partial fractions expansion to write

    F(z) = N₁(z)/(1 − b₁z^{−1}) + N₂(z)/(1 − b₂z^{−1}) + ···

  and then use the transform pair Z[bⁿ] = 1/(1 − bz^{−1}) to obtain something like

    f(n) = c₁b₁ⁿ + c₂b₂ⁿ + ···

  The term with the largest value of |bᵢ| will be asymptotically dominant, e.g. if f(n) = 2ⁿ + 3ⁿ, then f(n) ∼ 3ⁿ.
- Hence, the asymptotic behaviour of f(n) can be read off directly from F(z): find the pole farthest from the origin (i.e., the bᵢ with |bᵢ| largest); then f(n) = Θ(bᵢⁿ).


Unilateral Z-transform IX

I When the largest pole occurs in a form such as (1 − b_i z^{-1})^2 or (1 − b_i z^{-1})^3 etc. (double, triple poles), we need to consult a table of transforms and find what form (1 − b_i z^{-1})^{-k} will take:

    Z^{-1}[1 / (1 − b z^{-1})^2] = (n + 1) b^n
    Z^{-1}[1 / (1 − b z^{-1})^3] = ((1/2) n^2 + (3/2) n + 1) b^n


Z-transforms

I Two compelling reasons for using Z-transforms:

  1. Because of the transform pair

       Z[f(n − a)] = z^{-a} F(z)

     linear difference equations become linear equations that can be solved by simple algebraic manipulation.

  2. The asymptotics of f(n) are governed by the pole(s) of F(z) farthest from the origin. If we just want to know Θ(f(n)), we can take the Z-transform and find the outermost pole(s) [2].

I e.g. If the outermost pole of F(z) is a single pole at z = 2, then f(n) is Θ(2^n).
I e.g. If the outermost pole of F(z) is a double pole at z = 5, then f(n) is Θ(n 5^n).


Solving linear recurrences with Z-transforms

I Workflow for exact solution:

  1. Use (discrete) δ-functions to encode initial conditions:

       δ(n − a) = 1 when n = a, and 0 otherwise

  2. Take the Z-transform of the difference equation(s) to obtain equation(s) in F(z).
  3. Solve for F(z). Linear difference equations result in F(z) being a ratio of polynomials in z. Factor the denominator.
  4. Use partial fraction expansion to split into a sum of simple terms, and take the inverse Z-transform.

I Workflow for asymptotic solution:

  1. Disregard initial conditions.
  2. Take the Z-transform of the recurrence. Solve for F(z). Factor the denominator.
  3. Identify the outermost pole(s). If they are > 1, find the inverse Z-transform of the term corresponding to those pole(s). If the outermost poles are ≤ 1, the initial conditions may matter ⇒ exact solution.


Common Z-transform pairs

I Linearity:

    Z[a f(n) + b g(n)] = a Z[f(n)] + b Z[g(n)]

I Common transform pairs:

    Z[T(n − a)]    = z^{-a} T(z)                        shift
    Z[δ(n)]        = 1                                  impulse
    Z[a^n]         = 1 / (1 − a z^{-1})                 single pole
    Z[(n + 1) a^n] = 1 / (1 − a z^{-1})^2               double pole
    Z[1]           = 1 / (1 − z^{-1})                   single pole at 1
    Z[n]           = z^{-1} / (1 − z^{-1})^2            double pole at 1
    Z[n^2]         = (z^{-1} + z^{-2}) / (1 − z^{-1})^3  triple pole at 1


Finding boundary conditions I

I We use the Z-transform as a unilateral transform: all functions are assumed to be 0 for n < 0.
I Initial conditions must be dealt with by introducing δ-functions.
I E.g., the Fibonacci numbers satisfy the recurrence

    f(n) = f(n − 1) + f(n − 2)

  with the boundary conditions (BCs) f(0) = f(1) = 1.
I If we evaluate f(0) = f(−1) + f(−2) = 0, it doesn't satisfy the BC f(0) = 1.


Finding boundary conditions II

I We add a term δ(n):

    f(n) = f(n − 1) + f(n − 2) + α δ(n)

  Then f(0) = α, so we choose α = 1 to match the BC f(0) = 1. Then try f(1):

    f(1) = f(0) + f(−1) + α δ(1) = 1 + 0 + 0 = 1

  So, our BC f(1) = 1 is satisfied.

I In general, if the recurrence has a term f(n − k), we may need terms

    α_0 δ(n) + α_1 δ(n − 1) + · · · + α_k δ(n − k)

  to account for BCs.
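A small executable check (a sketch, not from the notes) that the δ term really does repair the boundary conditions: evaluating f(n) = f(n − 1) + f(n − 2) + δ(n) with f(n) = 0 for n < 0 reproduces 1, 1, 2, 3, 5, . . .

    public class DeltaBC {
        // f(n) = f(n-1) + f(n-2) + alpha*delta(n), with alpha = 1
        static int f(int n) {
            if (n < 0) return 0;  // unilateral convention: 0 for n < 0
            return f(n - 1) + f(n - 2) + (n == 0 ? 1 : 0);
        }
        public static void main(String[] args) {
            for (int n = 0; n <= 6; n++)
                System.out.print(f(n) + " ");  // prints: 1 1 2 3 5 8 13
        }
    }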


Finding boundary conditions III

I However, if we are only interested in asymptotic behaviour, boundary conditions often do not matter:
I Functions α δ(n − k) have the Z-transform α z^{-k}.
I They contribute term(s) to the numerator of the Z-transform when it is written in factored form — i.e., they can move its zeros, but introduce no new poles.
I If the dominant poles of the Z-transform are > 1, then the term(s) contributed by the δ(n − k) functions do not change the asymptotics.


Z-Transforms: Fibonacci Example

I Example: our time recurrence for the Fibonacci function.

    t(n) = c + t(n − 1) + t(n − 2)

I Ignore initial conditions; take the Z-transform:

    T(z) = c / (1 − z^{-1}) + z^{-1} T(z) + z^{-2} T(z)

I Solve for T(z):

    T(z) = c / (1 − 2z^{-1} + z^{-3})

I Asymptotics: there are poles at z = 1 and z = (1 ± √5)/2.
I The outermost pole (i.e., with |z| maximized) is z = (1 + √5)/2; it dominates the asymptotics.
I T(n) is Θ(φ^n) with φ = (1 + √5)/2 ≈ 1.618.
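The recurrence t(n) = c + t(n − 1) + t(n − 2) describes a naive doubly recursive implementation along these lines (a sketch; the routine from the earlier lecture may differ in detail):

    // Each call does constant work c, plus the two recursive calls:
    static long fibonacci(int n) {
        if (n <= 2) return 1;
        return fibonacci(n - 1) + fibonacci(n - 2);
    }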


Z-Transforms: Fibonacci Example

I Exact solution: via partial fractions. Let φ = (1 + √5)/2 and θ = (1 − √5)/2.

    T(z) = c / (1 − 2z^{-1} + z^{-3})
         = c / ((1 − z^{-1})(1 − φz^{-1})(1 − θz^{-1}))
         = A / (1 − z^{-1}) + B / (1 − φz^{-1}) + C / (1 − θz^{-1})

    A = T(z)(1 − z^{-1}) |_{z=1} = −c
    B = T(z)(1 − φz^{-1}) |_{z=φ} = c √5 (√5 + 1)^2 / (10(√5 − 1))
    C = T(z)(1 − θz^{-1}) |_{z=θ} = c √5 (√5 − 1)^2 / (10(1 + √5))

  (Yech.)

I Inverse Z-transform: Z^{-1}[α / (1 − az^{-1})] = α a^n, so

    f(n) = −c + B φ^n + C θ^n = B φ^n + O(1)

I Partial fractions is tedious: if we only want asymptotics, just read off the pole locations and do not bother with an exact inverse transform.


Method 2: Maple

The practical method of choice is to use a symbolic algebra package like Maple:

    > rsolve(T(n) = c + T(n-1) + T(n-2), T(1..2) = c, T);

        -1/5 c sqrt(5) (-1/2 sqrt(5) + 1/2)^n + 1/5 c sqrt(5) (1/2 + 1/2 sqrt(5))^n
        - 1/5 c sqrt(5) (-2 (sqrt(5) + 1)^(-1))^n
        - 1/5 c (sqrt(5) - 1) sqrt(5) (-2 (-sqrt(5) + 1)^(-1))^n (-sqrt(5) + 1)^(-1) - c

    > asympt(%, n, 2);

        2/5 c sqrt(5) ((1 + sqrt(5))/2)^n + O(1)

So, the running time is Θ(φ^n) with φ = (1 + √5)/2 = 1.6180 · · ·


Fibonacci example cont’d

I So, our little Fibonacci(n) function requires exponential time.
I Is there a better way?
I Iterate: for i = 2..n sum the previous two elements. Requires Θ(n) time.
I Use our Z-transform powers:

    Fib(n) = 1 if n ≤ 2, and Fib(n − 1) + Fib(n − 2) otherwise
           = Fib(n − 1) + Fib(n − 2) + δ(n) − δ(n − 2)

  Z-transform, solve, inverse Z-transform:

    Fib(n) = a φ^n + b θ^n

  where a, b are constants, and φ, θ are as before. This can be implemented in O(log n) time.
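One standard O(log n) realization (a sketch, not the lecture's code) is "fast doubling", which obtains F(2k) and F(2k + 1) from F(k) and F(k + 1) and recurses on n/2 — in effect computing the n-th power by repeated squaring, while staying in exact integer arithmetic and so avoiding the floating-point error of evaluating a φ^n + b θ^n directly:

    // Returns {F(n), F(n+1)}; uses F(2k) = F(k)(2F(k+1) - F(k)) and
    // F(2k+1) = F(k)^2 + F(k+1)^2. Exact in a long up to F(92).
    static long[] fibPair(long n) {
        if (n == 0) return new long[] { 0, 1 };
        long[] p = fibPair(n / 2);
        long a = p[0], b = p[1];
        long c = a * (2 * b - a);   // F(2k)
        long d = a * a + b * b;     // F(2k+1)
        return (n % 2 == 0) ? new long[] { c, d }
                            : new long[] { d, c + d };
    }

    static long fib(long n) { return fibPair(n)[0]; }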


Bibliography

[1] Brian Davies. Integral Transforms and Their Applications. Springer, 3rd edition, 2005.
[2] Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. 2007. Book draft.
[3] George S. Lueker. Some techniques for solving recurrences. ACM Comput. Surv., 12(4):419–436, 1980.
[4] Alan V. Oppenheim, Alan S. Willsky, and Syed Hamid Nawab. Signals and Systems. Prentice-Hall signal processing series. Prentice-Hall, second edition, 1997.
[5] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, 1996.
[6] Herbert S. Wilf. Generatingfunctionology. Academic Press, 1990.


ECE750-TXB Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

February 14, 2007


Problem: Searching a sorted array

I Let 〈T, ≤〉 be a total order, e.g. T could be the integers Z.

I Inputs:

  1. An integer n > 0.
  2. An array A[0..n − 1] of elements of T, sorted in ascending order so that (i ≤ j) ⇒ (A[i] ≤ A[j]).
  3. An element x of T.

I Specification: Return true if x is equal to some element in the array, false otherwise.


Linear Searching

I Let's analyze the worst-case time complexity of the following (naive) algorithm on a RAM.

    Linsearch(int n, T[] A, T x)
      for i = 0 to n − 1
        if A[i] = x then return true
      end
      return false

I e.g.,

    Linsearch(5, [3, 5, 9, 9, 13], 4) returns false
    Linsearch(5, [3, 5, 9, 9, 13], 9) returns true
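A direct Java rendering of Linsearch (a sketch; equality on T is taken to be equals(), standing in for the total order's equality test):

    static <T> boolean linsearch(int n, T[] A, T x) {
        for (int i = 0; i < n; i++)
            if (A[i].equals(x)) return true;
        return false;
    }

    // linsearch(5, new Integer[]{3, 5, 9, 9, 13}, 4) returns false
    // linsearch(5, new Integer[]{3, 5, 9, 9, 13}, 9) returns true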


Linear Searching

I Without thinking much we can say this algorithm takes time Θ(n), but let's get the practice, and be sure:
I The time taken depends on n, but not on x or the contents of A[], assuming that the type T allows comparisons in time Θ(1).
I Let T(n) be the time taken for an array of size n. Attach a cost variable to each part of the algorithm:

    Linsearch(int n, T[] A, T x)        -- t7: the whole call
      for i = 0 to n − 1                -- t4: loop initialization; t5: the whole loop
        if A[i] = x then return true    -- t3: the if; t1: the test; t2: return true
      end
      return false                      -- t6: return false


Linear Searching

I Write equations: (see rules from Lecture 3)

    t1 = c1
    t2 = c2
    t3 = c3 + t1 + max(t2, 0)
    t4 = c4
    t5 = t4 + Σ_{i=0}^{n−1} (c5 + t3)
    t6 = c6
    t7 = t5 + t6

I (In the above analysis, I analyzed the "one-armed if" (t3) by pretending it was an if · · · then · · · else · · · statement in which the second branch required zero time.)

I Solving,

    T(n) = t7 = n(c5 + c3 + c1 + c2) + c4 + c6

  So, Linsearch requires Θ(n) time as expected.


How To Catch A Lion In A Desert

The Bolzano-Weierstraß method. Divide the desert by a line running from north to south. The lion is then either in the eastern or in the western part. Let's assume it is in the eastern part. Divide this part by a line running from east to west. The lion is either in the northern or in the southern part. Let's assume it is in the northern part. We can continue this process arbitrarily, thereby constructing with each step an increasingly narrow fence around the selected area. The diameter of the chosen partitions converges to zero, so that the lion is caged into a fence of arbitrarily small diameter. — from How To Catch A Lion In The Desert, Mathematical Methods

Not as elegant as the inversion method¹, but a good starting point for a search algorithm.

¹ Place a spherical cage in the desert, enter it and lock it from the inside. Perform an inversion with respect to the cage. Then, the lion is inside the cage and you are outside.


Binary Search

I Search a sorted array A[0..n − 1] by repeatedly dividing it in half, and searching one of the halves for x.
I The portion of the array we are searching will be l..h (for low and high). Initially we'll have l = 0 and h = n − 1, so that we are searching A[0..n − 1].
I We will design the algorithm around an invariant: a property that is maintained as the algorithm runs.
I The invariant has three parts, each of which must always be true:

  I A is sorted.
  I l ≤ h. Otherwise, A[l..h] is not a valid interval of the array.
  I A[l] ≤ x ≤ A[h]. The lion is always in our segment.


Binary Search

BinarySearch(n, A[], x):

I Require A sorted.
I Let l = 0 and h = n − 1. A[l..h] is our search range.
I If x < A[l] or A[h] < x then x is not in the array; return false.
I Otherwise, A[l] ≤ x ≤ A[h] and l ≤ h. We have established the invariant.
I Call BinarySearch2(A, x, l, h).

BinarySearch2(A[], x, l, h):

I Require l ≤ h, A[l] ≤ x ≤ A[h], A sorted (Invariant).
I If l = h then return true.
I Otherwise, split the search range in two by choosing the midpoint i = l + ⌊(h − l)/2⌋. Then either:

  I x ≤ A[i], in which case A[l] ≤ x ≤ A[i]. Return BinarySearch2(A, x, l, i).
  I A[i] < x, in which case either

    I A[i] < x < A[i + 1], in which case return false; or
    I A[i + 1] ≤ x ≤ A[h]. Return BinarySearch2(A, x, i + 1, h).


Binary Search: Code

    BinarySearch2(A, x, l, h)
      requires (l ≤ h) ∧ (A[l] ≤ x ≤ A[h]) ∧ (A sorted)
      if l = h then
        return true
      else
        i ← l + ⌊(h − l)/2⌋
        if x ≤ A[i] then
          return BinarySearch2(A, x, l, i)
        else
          if x < A[i + 1] then
            return false
          else
            return BinarySearch2(A, x, i + 1, h)
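A faithful Java translation of the pseudocode (a sketch, using int keys so that the total order is the usual ≤ on integers), together with the front end described above:

    static boolean binarySearch2(int[] A, int x, int l, int h) {
        // requires l <= h, A[l] <= x <= A[h], A sorted
        if (l == h)
            return true;          // the invariant forces A[l] = x here
        int i = l + (h - l) / 2;  // integer division floors, as required
        if (x <= A[i])
            return binarySearch2(A, x, l, i);
        else if (x < A[i + 1])
            return false;
        else
            return binarySearch2(A, x, i + 1, h);
    }

    static boolean binarySearch(int n, int[] A, int x) {
        if (n < 1 || x < A[0] || x > A[n - 1])
            return false;         // otherwise the invariant is established
        return binarySearch2(A, x, 0, n - 1);
    }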


Binary Search: Example

I Example: n = 7, search for x = 3.

    A[] = [1, 3, 6, 9, 10, 10, 13]        (indices 0..6)

    l = 0, h = 6, i = 3:   x ≤ A[3] = 9, so recurse on A[l..i] = A[0..3]
    l = 0, h = 3, i = 1:   x ≤ A[1] = 3, so recurse on A[l..i] = A[0..1]
    l = 0, h = 1, i = 0:   ¬(x ≤ A[0]) ∧ ¬(x < A[1]), so recurse on A[i+1..h] = A[1..1]
    l = h = 1:             return true


Binary Search: Time analysis for h − l + 1 = 2^k

I Let n = h − l + 1, the size of the search range.
I We will analyze BinarySearch2(· · · ) only; the 'setup' function BinarySearch(· · · ) adds a constant overhead.
I Let T(n) be an (upper bound for) the time complexity. Attach a cost variable to each part of the algorithm:

    BinarySearch2(A, x, l, h)                     -- t11: the whole call
      if l = h then                               -- t10: the test
        return A[l] = x                           -- t9
      else
        i ← l + ⌊(h − l)/2⌋                       -- t8
        if x ≤ A[i] then                          -- t7: this if; t6: the test
          return BinarySearch2(A, x, l, i)        -- t5
        else
          if x < A[i + 1] then                    -- t4: this if; t3: the test
            return false                          -- t2
          else
            return BinarySearch2(A, x, i + 1, h)  -- t1


Binary Search: Time analysis

I Analyze for the case where h − l + 1 = 2^k, for some k ∈ N.
I We need to prove that when BinarySearch2 calls itself, we go from a problem of size 2^k to a problem of size 2^{k−1}: then the recurrence will have the form T(n) = · · · + T(n/2) + · · · .

Proposition (Both halves have size 2^{k−1}.)
If h − l + 1 = 2^k and i = l + ⌊(h − l)/2⌋, then i − l + 1 = 2^{k−1} and h − (i + 1) + 1 = 2^{k−1}.

Proof.
h − l + 1 = 2^k implies i = l + ⌊(h − l)/2⌋ = l + ⌊(2^k − 1)/2⌋ = l + 2^{k−1} − 1, so i − l + 1 = (l + 2^{k−1} − 1) − l + 1 = 2^{k−1}, and h − (i + 1) + 1 = h − (l + 2^{k−1} − 1 + 1) + 1 = 2^k − 2^{k−1} = 2^{k−1}.


Binary Search: Time analysis

I Assuming n = h − l + 1 = 2^k, a recursive call has problem size 2^{k−1} = n/2.
I Write equations: (use c = Θ(1) notation)

    t1 = c + T(n/2)
    t2 = c
    t3 = c
    t4 = c + max(t1, t2)
    t5 = c + T(n/2)
    t6 = c
    t7 = t6 + c + max(t5, t4)
    t8 = c
    t9 = c
    t10 = c
    t11 = c             if n = 1
    t11 = c + t7 + t8   otherwise


Binary Search: Time analysis

I Solving, simplifying, and folding constants, we obtain the recurrence

    T(n) = c             if n = 1
    T(n) = c + T(n/2)    otherwise

I So far, we have only seen recurrences of the form

    F(n) = α F(n − a) + β F(n − b) + · · · + G(n)

  i.e. linear difference equations.

I Change of variables: Let ξ = log n, and r(ξ) = T(2^ξ). Then T(n/2) = r(ξ − 1). New recurrence:

    r(ξ) = c + r(ξ − 1)

  Now it is a linear difference equation in ξ.


Binary Search: Solve recurrence

I From inspection r(ξ) = c(1 + ξ), but for practice:²

    r(ξ) = c + r(ξ − 1)

  Take the Z-transform:

    R(z) = c / (1 − z^{-1}) + z^{-1} R(z)

I Solve: we get R(z) = c / (1 − z^{-1})^2, a double pole at z = 1. Recall the pairs

    Z[n] = z^{-1} / (1 − z^{-1})^2        Z[T(n − a)] = z^{-a} T(z)

  Write R(z) = c · z^{+1} · z^{-1} / (1 − z^{-1})^2. Then

    Z^{-1}[c · z^{+1} · z^{-1} / (1 − z^{-1})^2] = c · Z^{-1}[z^{-1} / (1 − z^{-1})^2] |_{ξ ← ξ+1} = c ξ |_{ξ ← ξ+1} = c(ξ + 1)

² Note: we treat the c term as if it were c · u(n), where u(n) is the step function: u(n) = 1 if and only if n ≥ 0.


Binary Search: Solve recurrence

I Therefore r(ξ) = c(ξ + 1). And T(n) = r(log n), so T(n) = c(1 + log n).
I So, binary search takes time O(log n) for n = 2^k.
I It turns out (we won't prove) it is O(log n) for any n ≥ 1. (See assignment 1, section 1, problem 1.)
I Much faster than Linsearch. On an array of size n = 1,000,000:

  I On average Linsearch will require ≈ 500,000 comparisons.
  I BinarySearch requires ≤ 21 comparisons.

I There is an even faster search method, interpolation search, that under favourable assumptions takes average time Θ(log log n) steps.


Binary Search: Correctness Proof

I The invariant helps immensely in proving correctness.
I To prove: if the preconditions are satisfied, then BinarySearch2(A, x, l, h) returns true if and only if there is a k ∈ [l, h] such that A[k] = x.
I Proof architecture: Progress and Preservation.

  1. Progress is made: in each recursive call, the problem becomes strictly smaller. (This implies a base case is eventually reached.)
  2. Preservation: the invariant is satisfied at all entries to the function.

    2.1 The invariant is satisfied when BinarySearch2 is called from BinarySearch (initial entry).
    2.2 The invariant is preserved when BinarySearch2 calls itself recursively.

  3. Base cases: when BinarySearch2 returns directly (without calling itself), its return value is correct.
  4. (1, 2, 3) together yield a simple correctness proof by induction over problem size.


Binary Search: Correctness Proof

A basic rule of inference: in the branches of an if statement,

    if ψ then
      (1) · · ·
    else
      (2) · · ·

I At point (1), ψ is true.
I At point (2), ψ is false.

Example: if ψ were 'n > m', where n, m ∈ N, then at (1) n > m is true, and at (2) ¬(n > m) is true, which means n ≤ m.


Binary Search: Correctness Proof

Lemma (Base cases correct)
If the invariant holds, and BinarySearch2 returns directly without calling itself, then it returns true if and only if there is a k ∈ [l, h] such that A[k] = x. Moreover, BinarySearch2 always returns directly on problems of size n = 1.

Proof. There are two cases.

1. The program path taken is:

    if (1) l = h then
      return true

  We have l = h, and the invariant says A[l] ≤ x ≤ A[h]; since 〈T, ≤〉 is a total order, antisymmetry ((x ≤ y) ∧ (y ≤ x) ⇒ x = y) implies A[l] = x. The return value is true, satisfying the requirement. If the problem size is h − l + 1 = 1 then l = h, and BinarySearch2 returns directly.

2. The program path taken is:

    BinarySearch2(A, x, l, h)
      if (1) l = h then
      else
        if (2) x ≤ A[i] then
        else
          if (3) x < A[i + 1] then
            return false

  We have (1) l ≠ h and (2) x > A[i] and (3) x < A[i + 1]. Putting (2) and (3) together, A[i] < x < A[i + 1], which together with the array being sorted implies x is not in the array, and the return value is false.


Binary Search: Correctness Proof

Lemma (Progress)
Each time BinarySearch2 calls itself, the problem size is strictly smaller.

Proof. The possible recursion paths are:

    BinarySearch2(A, x, l, h)
      if (1) l = h then
      else
        i ← l + ⌊(h − l)/2⌋
        · · · (a) BinarySearch2(A, x, l, i)
        · · · (b) BinarySearch2(A, x, i + 1, h)

From (1) we have l ≠ h. Therefore the problem size n = (h − l + 1) satisfies n ≥ 2.


We have two cases:

1. (Call site (a).) The problem size of the recursive call is i − l + 1, so we must prove i − l + 1 < h − l + 1. This is equivalent to i < h.
   (By contradiction.) Suppose that i ≥ h. Then, substituting i = l + ⌊(h − l)/2⌋, we obtain ⌊(h − l)/2⌋ ≥ h − l, a contradiction since h − l ≥ 1.

2. (Call site (b).) To prove: h − (i + 1) + 1 < h − l + 1. This is equivalent to l < i + 1.
   From the invariant, l ≤ h, so ⌊(h − l)/2⌋ ≥ 0. Since i = l + ⌊(h − l)/2⌋, l ≤ i. Therefore l < i + 1.


Binary Search: Correctness Proof

Lemma (Preservation)
If the invariant is satisfied on a call to BinarySearch2, then it is satisfied on calls by BinarySearch2 to itself.

Proof. Recall the invariant is:

    (l ≤ h) ∧ (A[l] ≤ x ≤ A[h]) ∧ (A sorted)

We never modify the array, so A remains sorted. There are two cases to consider.


1. The program path taken is

    BinarySearch2(A, x, l, h)
      if (1) l = h then
      else
        i ← l + ⌊(h − l)/2⌋
        if (2) x ≤ A[i] then
          return BinarySearch2(A, x, l, i)

  We have (1) l ≠ h and (2) x ≤ A[i]. We need to prove:

  1.1 A[l] ≤ x ≤ A[i]. We have A[l] ≤ x from the invariant, and x ≤ A[i] from the branch condition (2), so A[l] ≤ x ≤ A[i].
  1.2 l ≤ i. From the invariant, l ≤ h, so ⌊(h − l)/2⌋ ≥ 0. Therefore l ≤ i = l + ⌊(h − l)/2⌋.


2. The program path taken is

    BinarySearch2(A, x, l, h)
      if (1) l = h then
      else
        i ← l + ⌊(h − l)/2⌋
        if (2) x ≤ A[i] then
        else
          if (3) x < A[i + 1] then
          else
            return BinarySearch2(A, x, i + 1, h)

  We need to prove:

  2.1 A[i + 1] ≤ x ≤ A[h]. From (3), x ≥ A[i + 1], and from the invariant, x ≤ A[h]. Therefore A[i + 1] ≤ x ≤ A[h].
  2.2 i + 1 ≤ h. We proved i < h in the first case of the progress lemma, so i + 1 ≤ h.


Binary Search: Correctness Proof: Denouement

Putting everything together:

Lemma (Correct Step)
If the invariant is initially satisfied, then

1. If the problem is of size n = 1, BinarySearch2 immediately returns a correct answer.
2. If the problem is of size n > 1, then BinarySearch2 either immediately returns a correct answer, or calls itself with the invariant satisfied on a problem of size n′ where 1 ≤ n′ < n.

Proof.
(1) follows from the base cases lemma. (2) If it returns immediately, it is correct by the base cases lemma. Otherwise, it calls itself: the progress lemma gives n′ < n, the preservation lemma gives the invariant satisfied, and 1 ≤ n′ follows from l ≤ h (invariant).


Binary Search: Correctness Proof

Theorem
If the invariant is satisfied then BinarySearch2 returns a correct answer.

Proof.
By induction on problem size.

1. Base case. To prove: if n = 1 a correct answer is returned. Proof: apply the Correct Step Lemma.
2. Induction step. To prove: if a correct answer is returned for problems of size ≤ n (induction hypothesis), then a correct answer is returned for a problem of size n + 1. Proof: apply the Correct Step Lemma: for a problem of size n + 1, BinarySearch2 is correct or calls itself on a problem of size n′ < n + 1. Therefore n′ ≤ n, and from the induction hypothesis a correct answer is returned.


Binary Search: The front end

One last item: the entry routine BinarySearch. It establishes the invariant. Its only requirement is that A is a sorted array of at least one element.

    BinarySearch(n, A, x)
      requires (A sorted) ∧ (n ≥ 1)
      if (1) x < A[0] then
        return false
      else
        if (2) x > A[n − 1] then
          return false
        else
          return (3) BinarySearch2(A, x, 0, n − 1)

For (3) to be reached, A must be sorted, and

1. n ≥ 1 implies 0 ≤ n − 1, establishing l ≤ h;
2. (1) gives x ≥ A[0], and (2) gives x ≤ A[n − 1], establishing A[l] ≤ x ≤ A[h].


Binary Search: Correctness Proof

I This establishes the correctness of binary search up to a basic level of rigour.
I However, proofs by hand are notoriously error-prone:

  I I did this proof by hand, and given my track record, I give it a 25% chance of being correct.
  I Reward of $5 for each error found, up to a maximum of $20. :)

I The gold standard is a formal proof in a system such as Isabelle, Coq, ACL2, etc.

I Gold standard is a formal proof in a system such asIsabelle, Coq, ACL2, etc.

ECE750-TXBLecture 4: Search

& CorrectnessProofs

Todd L.Veldhuizen

[email protected]

Outline

Bibliography

In praise of invariants

I A good invariant is an indispensable tool in designing algorithms and data structures.
I The fine details of the algorithm are often dictated by the need to

  1. Preserve the invariant (during recursion, iteration, changing state);
  2. Make progress;
  3. Handle the base cases correctly.

I If you design an algorithm around an invariant:

  1. the invariant guides you in the design;
  2. you are more likely to have a correct implementation;
  3. the proof of correctness is often easier (and, sometimes, straightforward).


ECE750-TXB Lecture 5: Veni, Divisi, Vici (Divide and Conquer)

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

February 14, 2007


Divide And Conquer

BinarySearch is an (arguably degenerate) instance of a basic algorithm design pattern:

Divide And Conquer

1. If the problem is of trivial size, solve it and return.
2. Otherwise:

   2.1 Divide the problem into several problems of smaller size.
   2.2 Conquer these smaller problems by recursively applying the divide and conquer pattern.
   2.3 Combine the answers to the smaller problems into an answer to the whole problem.


Divide and Conquer

Binary Search as Divide-and-Conquer:

1. If the problem is of size n = 1, the answer is obvious — return.
2. Otherwise:

   I Split the array into two halves. Principle:

       (x in A[l..h]) ≡ (x in A[l..i]) ∨ (x in A[i + 1..h])

   I Search the two halves: for one half, call self recursively; for the other half, the answer is false.
   I Combine the two answers: since one answer is always false, and "x or false" is just "x", simply return the answer from the half we searched.


Divide-and-Conquer Recurrences

If

1. the base cases (trivially small problems) require time O(1), and
2. a problem of size n is split into k subproblems of size s(n), and
3. splitting the problem into subproblems and combining the answers takes time f(n),

then the general form of the time recurrence is

    T(n) = c + k T(s(n)) + f(n)

e.g. for binary search we had k = 1 (we only had to search one half), s(n) = n/2, and f(n) = 0.


Strassen Matrix Multiplication

I Recall that our MatrixMultiply routine took Θ(n^3) time.
I Can we do better? No one thought so until...
I A landmark paper: Volker Strassen, Gaussian elimination is not optimal (1969) [1].
I An o(n^3) divide-and-conquer approach to matrix multiplication.


Strassen’s method

Strassen: “If A,B are matrices of order m2k+1 to bemultiplied, write

A =

[A11 A12

A21 A22

]B =

[B11 B12

B21 B22

]C =

[C11 C12

C21 C22

]where the Aik ,Bik ,Cik matrices are of order m2k ...


Strassen’s method

“Then compute

(7 subproblems)

I = (A11 + A22)(B11 + B22)II = (A21 + A22)B11

III = A11(B12 − B22)IV = A22(−B11 + B21)V = (A11 + A12)B22

VI = (−A11 + A21)(B11 + B12)VII = (A12 − A22)(B21 + B22)

(and combine)

C11 = I + IV − V + VIIC21 = II + IVC12 = III + VC22 = I + III − II + VI
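A numeric check (an illustration, not from the paper) of the formulas on a 2 × 2 matrix of scalars: the seven products recombine into the same C as the ordinary eight-multiplication product.

    public class StrassenCheck {
        public static void main(String[] args) {
            double a11 = 1, a12 = 2, a21 = 3, a22 = 4;
            double b11 = 5, b12 = 6, b21 = 7, b22 = 8;

            double p1 = (a11 + a22) * (b11 + b22);   // I
            double p2 = (a21 + a22) * b11;           // II
            double p3 = a11 * (b12 - b22);           // III
            double p4 = a22 * (-b11 + b21);          // IV
            double p5 = (a11 + a12) * b22;           // V
            double p6 = (-a11 + a21) * (b11 + b12);  // VI
            double p7 = (a12 - a22) * (b21 + b22);   // VII

            double c11 = p1 + p4 - p5 + p7;
            double c21 = p2 + p4;
            double c12 = p3 + p5;
            double c22 = p1 + p3 - p2 + p6;

            // Each comparison against the naive product prints true:
            System.out.println(c11 == a11 * b11 + a12 * b21);
            System.out.println(c12 == a11 * b12 + a12 * b22);
            System.out.println(c21 == a21 * b11 + a22 * b21);
            System.out.println(c22 == a21 * b12 + a22 * b22);
        }
    }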


Strassen’s method

1. Subproblems: compute 7 matrix multiplications of sizen/2.

2. Constructing the subproblems and combining theanswers is done with matrix additions/subtractions,taking Θ(n2) time.

3. Apply general divide-and-conquer recurrence withk = 7, s(n) = n/2, f (n) = Θ(n2):

T (n) = c + 7T (n/2) + Θ(n2)


Strassen’s method

Recall that Θ(n2) means “some function f (n) for which f (n)n2

is eventually restricted to some finite positive interval[c1, c2].”Pick a value c > c2; then eventually f (n) ≤ cn2.Solve recurrence:

T (n) = c + 7T (n/2) + cn2

This will give an asymptotically correct bound, but possiblyT (n) is less than the actual time required for small n.


Strassen’s methodI Let r(ξ) = T (2ξ), and ξ = log2 n. Recurrence becomes

r(ξ) = c + 7r(ξ − 1) + c(2ξ)2

= c + 7r(ξ − 1) + c(4ξ)

I Z-transform:

Z [c] =c

1− z−1

Z [7r(ξ − 1)] = 7z−1R(z)

Z[c(4ξ)

]=

c

1− 4z−1

I Z-transform version of recurrence is:

R(z) =c

1− z−1+ 7z−1R(z) +

c

1− 4z−1

I Solve:

R(z) =c2(1− 5

2z−1)

(1− z−1)(1− 4z−1)(1− 7z−1)


Strassen’s method

I Look at singularities: zero at z = 52 , poles at z = 1, 4, 7.

I Pole at z = 7 is asymptotically dominant:

r(ξ) ∼ c17ξ

I Change variables back: ξ = log2 n:

T (n) ∼ c17log2 n

= c1nlog2 7 = c1n

2.807···

I Strassen matrix multiplication takes Θ(n2.807···) time.


Divide-and-conquer recurrences I

I Our analysis of Strassen's algorithm is easily generalized.
I Consider a divide-and-conquer recurrence of the form

    T(n) = k T(n/s) + Θ(n^d)               (1)

I To obtain an asymptotic upper bound we can solve the related recurrence

    T′(n) = k T′(n/s) + c n^d              (2)

I It can be shown that T(n) ≼ T′(n).
I Analyzing for the case when n = s^ξ with ξ ∈ N, we can change variables to turn Eqn. (2) into:

    r(ξ) = k r(ξ − 1) + c (s^d)^ξ


Divide-and-conquer recurrences II

I Z-transform:

    R(z) = k z^{-1} R(z) + c / (1 − s^d z^{-1})

I Solve:

    R(z) = c / ((1 − k z^{-1})(1 − s^d z^{-1}))

I Which is the dominant pole? It depends on the values of k, s, d.
I Three cases: s^d < k, s^d = k, s^d > k.

  1. s^d < k. Then the dominant pole is z = k. We get

       r(ξ) ≼ k^ξ

     Since n = s^ξ, use ξ = log_s n:

       T′(n) ≼ k^{log_s n} = (2^{log k})^{log n / log s} = (2^{log n})^{log k / log s} = n^{log k / log s}


Divide-and-conquer recurrences III

  2. s^d = k. Then we get a double pole at z = k. Recall that

       Z^{-1}[k z^{-1} / (1 − k z^{-1})^2] = ξ k^ξ

     We end up with

       r(ξ) ≼ ξ k^ξ
       T′(n) ≼ (log_s n) k^{log_s n} ≼ (log n) 2^{log k · log n / log s} = n^{log k / log s} log n = n^{log s^d / log s} log n = n^d log n

  3. s^d > k. Then the dominant singularity is z = s^d. We get

       r(ξ) ≼ (s^d)^ξ
       T′(n) ≼ (s^d)^{log_s n} = (2^{d log s})^{log n / log s} = n^d


‘Master’ Theorem

Theorem (‘Master’)

The solution to a divide-and-conquer recurrence of the form

    T(n) = k T(⌈n/s⌉) + Θ(n^d)

where s > 1, is

    T(n) = Θ(n^{log k / log s})   if s^d < k
    T(n) = Θ(n^d log n)           if s^d = k
    T(n) = Θ(n^d)                 if s^d > k

I Examples:

  I Binary search: k = 1, s = 2, d = 0: second case (s^d = 1 = k), log k / log s = 0, so T(n) = Θ(n^0 log n) = Θ(log n).
  I Strassen: k = 7, s = 2, d = 2: first case (s^d = 4 < 7), log k / log s ≈ 2.807, so T(n) = Θ(n^{2.807···}).
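The case analysis mechanizes readily; a tiny calculator (a sketch, not from the lecture) that reports the Θ class for given k, s, d:

    // k subproblems of size n/s, Theta(n^d) split/combine cost.
    static String masterTheorem(int k, int s, int d) {
        double sd = Math.pow(s, d);           // exact for small integer s, d
        double e = Math.log(k) / Math.log(s);
        if (sd < k)  return "Theta(n^" + e + ")";
        if (sd == k) return "Theta(n^" + d + " log n)";
        return "Theta(n^" + d + ")";
    }

    // masterTheorem(1, 2, 0) returns "Theta(n^0 log n)", i.e. Theta(log n)
    // masterTheorem(7, 2, 2) returns "Theta(n^2.807354922057604)"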


Abstract Data Types


Abstract Data Types

I A basic software engineering principle:

    Separate the interface (what you can do) from the implementation (how it is done).

I An abstract data type is an interface to a collection of data.
I There may be numerous ways to implement an ADT, each with different performance characteristics.
I An ADT consists of

  1. Some types, which may be required to provide operations and relations, and to satisfy properties, e.g. a totally ordered set.
  2. An interface: the capabilities provided by the ADT.


Abstract Data Types

I Notation for types: 〈T, f1, f2, · · · , R1, R2, · · · 〉 indicates a set T together with

  I some operators or functions f1, f2, . . .
  I some relations R1, R2, · · · .

  This is the standard notation for a structure in logic. It could be

  I an algebra (functions but no relations), e.g. a field 〈F, +, ∗, −, ·^{−1}, 0, 1〉;
  I a relational structure (relations but no functions), e.g. a total order 〈T, ≤〉;
  I some structure with both functions and relations, e.g. an ordered field 〈F, +, ∗, −, ·^{−1}, 0, 1, ≤〉.

I Often a type is required to satisfy certain axioms, or belong to a specified class of structures (e.g. a field, a total order, a distance metric).


ADT: Dictionary[K, V]

I Stores a set of pairs (key, value), and finds values by key. At most one value per key is permitted.
I Types:

  I 〈K〉: key type (e.g. a word).
  I 〈V, 0〉: value type (e.g. a definition of a word). The value 0 is a special value used to indicate the absence of a dictionary entry.

I Operations:

  I insert(k, v): insert a key-value pair into the dictionary. If an entry for the key is already present, it is replaced.
  I V find(k): if (k, v) is in the dictionary, returns v; otherwise returns 0.
  I remove(k): deletes the entry for key k, if present.
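One way (a sketch) to write this ADT down as a Java interface; here the absence of an entry is represented by null rather than a special 0 value:

    interface Dictionary<K, V> {
        void insert(K key, V value);  // replaces any existing entry for key
        V find(K key);                // the value for key, or null if absent
        void remove(K key);           // deletes the entry for key, if present
    }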


Abstract Data Types

An ADT such as Dictionary may be implemented by several data structures — a linked list, a sorted array, or a search tree — with different performance characteristics:

    Implementation   insert time   find time   remove time
    Linked list      O(n)          O(n)        O(n)
    Sorted array     O(n)          O(log n)    O(n)
    Search tree      O(log n)      O(log n)    O(log n)


The role of ADTs in choosing data structures

Questions to consider when choosing a data structure:

1. What type of data do you need to store?

   I Does it have a natural total order? e.g. integers, strings under lexicographic ordering, database keys, etc.
   I Does it bear some natural partial order?
   I Do elements represent points or regions in some metric space, e.g., R^n? (e.g., screen regions, boxes in a three-dimensional space, etc.)

   The order relation(s) or geometric organization of your data may allow the use of ADTs that exploit those properties to allow efficient access.


2. What operations are required? Do you need to

   I determine whether a value is in the data set (search)?
   I insert, modify, delete elements?
   I iterate through the elements (order doesn't matter)?
   I iterate through the elements in some sorted order?
   I find the "biggest" or "smallest" element?
   I find elements that are "close" to some value?

   These requirements can be compared with the interfaces provided by ADTs, to decide which ADTs might be suitable.


3. What is the typical mixture of operations you will perform? Will you

   I insert frequently?
   I delete frequently?
   I search frequently?

   Different ADT implementations may offer different performance characteristics. By understanding the typical mixture of operations you can choose an implementation with the most suitable performance characteristics.


ADT: Array[V]

I A finite sequence of cells, each containing a value.
I Types:

  I 〈V, 0〉: a value type, with a "default value" 0 used to initialize cells.

I Operations:

  I Array(n): create an array of length n, where n ∈ N is a positive integer. Cells are initialized to 0.
  I integer length(): returns the length of the array.
  I get(i): returns the value of cell i. It is required that 0 ≤ i ≤ length() − 1.
  I set(i, v): sets the value of cell i to v. It is required that 0 ≤ i ≤ length() − 1.


ADT: Set[V]

I Stores a set of values. Permits inserting, removing, and testing whether a value is in the set. At most one instance of a value can be in the set.
I Types:

  I 〈V〉: a value type.

I Operations:

  I insert(v): adds v to the set, if absent. Inserting an element already in the set causes no change.
  I remove(v): removes v from the set, if present.
  I boolean contains(v): returns true if and only if v is in the set.


ADT: Multiset[V]

I Stores a multiset of values (i.e., with duplicate elements permitted). Permits inserting, removing, and testing whether a value is in the multiset. Sometimes called a bag.
I Types:

  I 〈V〉: a value type.

I Operations:

  I insert(v): adds v to the multiset.
  I remove(v): removes an instance of v from the multiset, if present.
  I boolean contains(v): returns true if and only if v is in the multiset.


Stacks, Queues, and Priority Queues

I Informally, a queue is a collection of objects "awaiting their turn", e.g., customers queueing at a grocery store. The queueing policy governs "who goes next."

  I First In, First Out (FIFO): like a line at the grocery store: the element that was added least recently goes next.
  I Last In, First Out (LIFO): the item added most recently goes next. A stack: like an "in-box" of work where new items are placed on the top, and whatever is on the top of the stack gets processed next.
  I Priority Queueing: items are associated with priorities; the item with the highest priority goes next.


ADT: Queue[V]

I A FIFO queue (first in, first out) of objects.
I Types:

  I 〈V〉: a value type.

I Operations:

  I insert(v): adds the object v to the end of the queue.
  I boolean isEmpty(): returns true just when the queue is empty.
  I V next(): returns and removes the value at the front of the queue. It is an error to perform this operation when the queue is empty.


ADT: Stack[V]

I A LIFO (last in, first out) stack of objects.
I Types:

  I 〈V〉: a value type.

I Operations:

  I push(v): adds a value to the top of the stack.
  I boolean isEmpty(): returns true just when the stack contains no elements.
  I V pop(): returns and removes the value at the top of the stack. It is an error to perform this operation when the stack is empty.


ADT: PriorityQueue[P, V]

I A queue of objects, each with an associated priority, in which an object with maximal priority is always chosen next.
I Types:

  I 〈P, ≤〉: a priority type, with a total order ≤.
  I 〈V〉: a value type.

I Operations:

  I insert(p, v): insert a pair (p, v) where p ∈ P is a priority, and v ∈ V is a value.
  I boolean isEmpty(): returns true just when the queue is empty.
  I (P, V) next(): returns and removes the object at the front of the queue, which is guaranteed to have maximal priority. It is an error to perform this operation on an empty queue.
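A sketch (not from the lecture) of realizing this ADT on top of java.util.PriorityQueue, which is a min-heap by default — so the comparator is reversed to make next() yield an entry of maximal priority; for brevity next() returns only the value:

    import java.util.Comparator;
    import java.util.PriorityQueue;

    class MaxPriorityQueue<P extends Comparable<P>, V> {
        private static class Entry<P, V> {
            final P priority; final V value;
            Entry(P p, V v) { priority = p; value = v; }
        }

        private final PriorityQueue<Entry<P, V>> heap = new PriorityQueue<>(
            Comparator.comparing((Entry<P, V> e) -> e.priority).reversed());

        public void insert(P p, V v) { heap.add(new Entry<>(p, v)); }
        public boolean isEmpty()     { return heap.isEmpty(); }
        public V next() {
            if (heap.isEmpty()) throw new RuntimeException("Queue is empty.");
            return heap.poll().value;   // an entry of maximal priority
        }
    }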


Data Structure: Linked List

I Realizes a list of items, e.g., (2, 3, 5, 7, 11).
I Inserting and removing elements at the front of the list requires O(1) time.
I Searching requires iterating through the list: O(n) time.
I The list can be iterated through from front to back.
I The basic building block is a node that contains:

  I data: a piece of data;
  I next: a pointer to the next node, or a null pointer if the node is the end of the list.


Data Structure: Linked List

    public class LinkedList<T> {
        Node<T> head;   // Pointer to the first node in the list
        ...
    }

    class Node<T> {
        T data;         // Data item
        Node<T> next;   // Next node in the list, if any

        Node(T data, Node<T> next) {
            this.data = data;   // 'this.' is needed: the parameters shadow the fields
            this.next = next;
        }
    }


Data Structure: Linked List I

I Insert a new data element:

    public void insert(T data) {
        head = new Node<T>(data, head);
    }

I Remove the front element, if any:

    public T removeFirst() {
        if (head == null)
            throw new RuntimeException("List is empty.");
        else {
            T data = head.data;
            head = head.next;
            return data;
        }
    }


Data Structure: Linked List II

I Check if the list is empty:

    public boolean isEmpty() {
        return (head == null);
    }


Implementing Stack[V] with a Linked List

I The ADT Stack[V] is naturally implemented by a linked list.

    public class Stack<V> {
        LinkedList<V> list = new LinkedList<V>();

        public void push(V v)    { list.insert(v); }
        public boolean isEmpty() { return list.isEmpty(); }
        public V pop()           { return list.removeFirst(); }
    }

I push(v), isEmpty(), and pop() require O(1) time.


Iterators

Iterator[V]

I An iterator is an ADT that provides traversal through some container. It abstracts away from the details of how the data are stored, and presents a simple interface for retrieving the elements of the container one at a time.
I Types:

  I V: the value type stored in the container.

I Operations:

  I Iterator(C): initialize the iterator to point to the first element in the container C.
  I boolean hasNext(): returns true just when there is another element in the container.
  I V next(): returns the next element in the container, and advances the iterator.


An Iterator for a linked list

    public class ListIterator<T> {
        Node<T> node;   // Next node to be visited, or null at the end

        ListIterator(LinkedList<T> list) {
            node = list.head;
        }

        boolean hasNext() {
            return (node != null);
        }

        T next() {
            if (node == null)
                throw new RuntimeException("Tried to iterate past end of list");
            T data = node.data;
            node = node.next;
            return data;
        }
    }


Iterators

Example usage of an iterator:

    ListIterator<Integer> iter = new ListIterator<Integer>(list);
    while (iter.hasNext())
        System.out.println("The next element is " + iter.next());


Implementing Queue[V] with a Linked List

I Recall that the Queue[V] ADT requires first-in first-out queueing. But the list as we have shown it only supports inserting and removing at one end.
I A simple variant of a linked list: in addition to maintaining a pointer to the head of the list, also maintain a pointer to the tail of the list.


Bidirectional Linked List

    class BiNode<T> {
        T data;          // Data item
        BiNode<T> next;  // Next node in the list, if any
        BiNode<T> prev;  // Previous node in the list, if any

        BiNode(T data, BiNode<T> next) {
            this.data = data;
            this.next = next;
            if (next != null)
                next.prev = this;   // guard: there may be no old head
        }
    }


Bidirectional Linked List

    public class BiLinkedList<T> {
        BiNode<T> head;  // Pointer to the first node in the list
        BiNode<T> tail;  // Pointer to the last node in the list

        public void insert(T data) {
            head = new BiNode<T>(data, head);
            if (tail == null)
                tail = head;
        }
    }


Bidirectional Linked List

    public T removeLast() {
        if (tail == null)
            throw new RuntimeException("List is empty.");
        else {
            T data = tail.data;
            if (tail.prev == null) {
                head = null;    // removing the only element
                tail = null;
            } else {
                tail = tail.prev;
                tail.next = null;
            }
            return data;
        }
    }


Implementing Queue[V] with a Bidirectional Linked List

    public class Queue<V> {
        BiLinkedList<V> list = new BiLinkedList<V>();

        public void insert(V v)  { list.insert(v); }
        public boolean isEmpty() { return list.isEmpty(); }
        public V next()          { return list.removeLast(); }
    }

I insert(v), isEmpty() and next() all take O(1) time.


Bibliography

[1] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.


ECE750-TXB Lecture 6: Lists and Trees

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

February 14, 2007

ECE750-TXBLecture 6: Lists

and Trees

Todd L.Veldhuizen

[email protected]

Linear DataStructures

Trees

Bibliography

Iterators

Iterator[V]

I An iterator is an ADT that provides traversal through some container. It abstracts away from the details of how the data are stored, and presents a simple interface for retrieving the elements of the container one at a time.
I Types:
  I V : the value type stored in the container
I Operations:
  I boolean hasNext(): returns true just when there is another element in the container;
  I V next(): returns the next element in the container, and advances the iterator.


An Iterator for a linked list

public class ListIterator<T> {
    Node<T> node;

    ListIterator(LinkedList<T> list) {
        node = list.head;
    }

    boolean hasNext() {
        return (node != null);  // the slide had (node == null), which is inverted
    }

    T next() {
        if (node == null)
            throw new RuntimeException("Tried to iterate past end of list");
        T data = node.data;
        node = node.next;
        return data;
    }
}


Iterators

Example usage of an iterator:

ListIterator<Integer> iter = new ListIterator<Integer>(list);
while (iter.hasNext())
    System.out.println("The next element is " + iter.next());


Bidirectional Linked List

I Bidirectional Linked List
  I Each node has a link to both the next and previous items in the list.
  I We maintain a pointer to both the front and the back.
  I We can insert and remove items at both the front and back of the list.
I We will use Bidirectional Linked Lists to illustrate two basic, but extremely useful principles:
  1. Maintaining invariants of data structures;
  2. Symmetry.


Bidirectional Linked List

I We have encountered invariants already, in the correctness sketch of binary search. That was an invariant for a recursive algorithm, and was required to be true of each recursive invocation of the function. Here we discuss data structure invariants.

  An invariant of a data structure is a property that is required to be always true, except when we are performing some transient update operation.

I Invariants help us to implement data structures correctly: many basic operations can be viewed as disruptions of the invariant (e.g., inserting an element) after which we need to repair or maintain the invariant.


Bidirectional Linked List

I As in a linked list, the basic building block of a bidirectional linked list is a node.

class BiNode<T> {
    T data;          /* Data item */
    BiNode<T> next;  /* Next node in the list, if any */
    BiNode<T> prev;  /* Previous node in the list, if any */

    BiNode(T data) {
        this.data = data;  // the slide's "data = data" is a self-assignment
        next = null;
        prev = null;
    }
}


Bidirectional Linked List: Invariants

I Let's look at a few examples to see what invariants suggest themselves.


I Here are the invariants we will use:

  1. (front ≠ null) implies (front.prev = null). (If the list is nonempty, there is no element before the front element.)
  2. (back ≠ null) implies (back.next = null). (If the list is nonempty, there is no element after the last element.)
  3. (front = null) if and only if (back = null).
  4. For any node x,
     4.1 (x.next ≠ null) implies x.next.prev = x;
     4.2 (x.prev ≠ null) implies x.prev.next = x.

  (An executable check of these invariants is sketched below.)
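These invariants translate directly into executable checks. The following is a minimal sketch of a debugging helper, under our own naming (checkInvariants is not part of the lecture's code); it is assumed to live inside the bidirectional list class, with fields front and back as in the invariants above.

// Hypothetical debugging aid: walk the list and verify invariants 1-4.
// Useful to call after each update operation while developing.
void checkInvariants() {
    assert (front == null) == (back == null);        // Inv. 3
    if (front != null) assert front.prev == null;    // Inv. 1
    if (back != null)  assert back.next == null;     // Inv. 2
    for (BiNode<T> x = front; x != null; x = x.next) {
        if (x.next != null) assert x.next.prev == x; // Inv. 4.1
        if (x.prev != null) assert x.prev.next == x; // Inv. 4.2
    }
}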


Bidirectional Linked List: Symmetry

I Bidirectional linked lists have a natural symmetry: if we 'reverse' the list by swapping the front/back pointers and each of the next/prev pointers, we get another bidirectional linked list.
I We'll call this the dual list.
I The symmetry extends to operations on the list: insertFront() is 'dual' to insertBack(), and removeFront() is dual to removeBack().
I This kind of duality has the following nice property:
  I Carrying out a sequence of operations op1, . . . , opk gives the same result as
  I Taking the dual list, carrying out the dual sequence of operations op1', . . . , opk', and taking the dual list.


Bidirectional Linked List: Symmetry

I Example: starting from the list [1, 2], we can
  I insertBack(3) to give [1, 2, 3];
  I removeFront() to give [2, 3].
  The dual version:
  I take the dual list [2, 1];
  I insertFront(3) to give [3, 2, 1];
  I removeBack() to give [3, 2];
  I take the dual list [2, 3].
  We get the same answer both ways.


Bidirectional Linked List: Symmetry

I Why we care about symmetry: if our implementation is correct,
  I We can obtain the routine insertBack() by taking the code for insertFront() and swapping front/back and prev/next;
  I Ditto for removeBack() and removeFront();
  I The set of invariants should not change under swapping front/back and prev/next. Example: For any node x,
    1. (x.next ≠ null) implies x.next.prev = x;
    2. (x.prev ≠ null) implies x.prev.next = x.
  If we swap next/prev, we get
    1. (x.prev ≠ null) implies x.prev.next = x;
    2. (x.next ≠ null) implies x.next.prev = x.
  i.e., the same set of invariants.


Bidirectional Linked List: Implementation

I For operations at the front of the list, there are three cases to consider:
  1. front = null (empty list)
  2. front ≠ null and front.next = null (one-element list)
  3. front ≠ null and front.next ≠ null (multi-element list)
I We need to consider each of these cases when we implement, and ensure that in each case, the invariants are maintained.


Bidirectional Linked List: Implementation

public void insertFront(T data) {
    BiNode<T> node = new BiNode<T>(data);

    if (front == null) {       /* Case 1 */
        front = node;          /* Both made non-null for Inv. 3 */
        back = node;
    } else {                   /* Cases 2, 3 */
        front.prev = node;     /* Inv 4.1 */
        node.next = front;     /* Inv 4.2 */
        front = node;
    }
}


Bidirectional Linked List: Implementation

public void insertBack(T data) {
    BiNode<T> node = new BiNode<T>(data);

    if (back == null) {        /* Case 1 */
        back = node;           /* Both made non-null for Inv. 3 */
        front = node;
    } else {                   /* Cases 2, 3 */
        back.next = node;      /* Inv 4.1 */
        node.prev = back;      /* Inv 4.2 */
        back = node;
    }
}


Bidirectional Linked List: Implementation

public T removeFront() {
    if (front == null) {       /* Case 1 */
        throw new RuntimeException("Empty list.");
    } else {
        T data = front.data;
        front = front.next;
        if (front == null)     /* Case 2 */
            back = null;
        else
            front.prev = null; /* Case 3 */
        return data;
    }
}


Bidirectional Linked List: Implementation

public T removeBack() {
    if (back == null) {        /* Case 1 */
        throw new RuntimeException("Empty list.");
    } else {
        T data = back.data;
        back = back.prev;
        if (back == null)      /* Case 2 */
            front = null;
        else
            back.next = null;  /* Case 3 */
        return data;
    }
}


Trees


I Recall that binary search allowed us to find items in a sorted array in Θ(log n) time. However, inserting or removing an item from the array took Θ(n) time in the worst case.
I Balanced Binary Search Trees offer Θ(log n) search, and also Θ(log n) insert and remove.
I More generally, trees offer a hierarchical decomposition of a search space:
  I Spatial searching: R-trees, quadtrees, octrees, kd-trees;
  I Databases: B-trees and their kin;
  I Intervals: interval trees;
  I ...


Binary Trees

I Basic building block is a tree node, which contains:
  I A data value, drawn from some total order 〈T, ≤〉;
  I A pointer to a left child;
  I A pointer to a right child.


Binary Trees

[Figure: example binary tree with nodes A-I; the traversal orders below refer to it.]

I Traversing a tree means visiting all its nodes. There are three common orders for doing this:
  I Preorder: a node is visited before its children, e.g., [E, C, B, A, D, G, F, H, I]
  I Inorder: the left subtree is visited, then the node, then the right subtree, e.g., [A, B, C, D, E, F, G, H, I].
  I Postorder: a node is visited after its children, e.g., [A, B, D, C, F, I, H, G, E].


Binary Trees

I Terminology (referring to the same example tree):
  I E is the root and is on level 0;
  I C, G are children of E and are on level 1;
  I C is the parent of B, D;
  I C is an ancestor of A (as are E and B);
  I I is a descendant of G (and E and H);
  I The sequence (E, C, B, A) is a path from the root to A.


Binary Search Trees

I Binary Search Trees satisfy the following invariant: for any tree node x,
  1. If x.left ≠ null, then x.left.data ≤ x.data;
  2. If x.right ≠ null, then x.data ≤ x.right.data.
I If we want to visit the nodes of the tree in order, we start at the root and recursively:
  1. Visit the left subtree;
  2. Visit the node;
  3. Visit the right subtree.
  i.e., an inorder traversal.


Binary Search Trees

I The search procedure is very similar to binary search of an array.

boolean contains(int z) {
    if (z == data)
        return true;
    else if ((z < data) && (left != null))
        return left.contains(z);
    else if ((z > data) && (right != null))
        return right.contains(z);
    return false;
}


Binary Search Trees

I The worst-case performance of contains(z) depends on the height of the tree.
I The height of a tree is the length of the longest path from the root to a leaf.
  I Root: the node at the top of the tree
  I Leaf (or external node): a node with no children
  I Internal node: any node that is not a leaf.
I Worst-case time required for contains(z) is O(h), where h is the height of the tree.


Binary Search Trees

I A binary search tree of height h can contain up to 2^h − 1 values.
I Given n values, we can always construct a binary search tree of height at most 1 + log2(n):
  I Sort the n values in ascending order and put them in an array A[0..n − 1].
  I Make the root element A[⌊n/2⌋].
  I Build the left subtree using elements A[0..⌊n/2⌋ − 1].
  I Build the right subtree using elements A[⌊n/2⌋ + 1..n − 1].
  (A sketch of this construction appears below.)
I But we can also construct a binary tree of height n − 1:
  I Make the root A[0]; make its right subtree from A[1], ...
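The balanced construction above is easy to code directly. A minimal sketch, under our own naming (buildBalanced is not the lecture's code; it reuses the Tree(data, left, right) constructor seen later in the rotation example):

// Hypothetical sketch of the balanced construction: the middle element
// of the sorted range always becomes the root, so the height is about log2(n).
Tree buildBalanced(int[] A, int lo, int hi) {  // builds from sorted A[lo..hi]
    if (lo > hi)
        return null;
    int mid = (lo + hi) / 2;
    return new Tree(A[mid],
                    buildBalanced(A, lo, mid - 1),
                    buildBalanced(A, mid + 1, hi));
}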


Binary Search Trees

I To achieve O(log n) search times, it is necessary to have the tree balanced, i.e., have all leaves roughly the same distance from the root.
  I This is easy if the contents of the tree are fixed.
  I This is not easy if we are adding and removing elements dynamically.
I We can aim for average-case balance, i.e., the probability of having a badly balanced tree → 0 as n → ∞.
  I Example: treaps
I We can have deterministic balancing that guarantees balance in the worst case:
  I red-black trees;
  I AVL trees;
  I 2-3 trees;
  I B-trees;
  I splay trees (whose balance guarantee is amortized rather than per-operation).


Enumeration of Binary Search Trees

I A useful fact: the number of valid binary search trees on n keys is given by the Catalan numbers:

  C_n = \binom{2n}{n} \frac{1}{n+1} \sim \frac{4^n}{\sqrt{\pi}\, n^{3/2}}

(Sequence A000108 in the Online Encyclopedia of Integer Sequences.)

I First few values:

  n   : 1  2  3  4   5   6    7    8     9     10     11     12
  C_n : 1  2  5  14  42  132  429  1430  4862  16796  58786  208012
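As a quick sanity check on the table, the Catalan numbers satisfy the recurrence C_{n+1} = C_n · 2(2n+1)/(n+2) with C_0 = 1, which gives a simple way to reproduce the values above. The method name below is our own:

// Hypothetical helper reproducing the table via the standard recurrence.
// The division is always exact here, since C_k * 2(2k+1) = C_{k+1} * (k+2).
long catalan(int n) {
    long c = 1;                               // C_0
    for (int k = 0; k < n; ++k)
        c = c * 2 * (2 * k + 1) / (k + 2);    // C_{k+1} from C_k
    return c;                                 // e.g. catalan(12) == 208012
}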


Binary Search Trees

I A naive insertion strategy, which does not guarantee balance, is the following:

void insert(int z) {
    if (z == data) return;
    else if (z < data) {
        if (left == null)
            left = new Tree(z);
        else
            left.insert(z);
    } else if (z > data) {
        if (right == null)
            right = new Tree(z);
        else
            right.insert(z);
    }
}

Note the symmetry: we can swap left/right and <, >.


Binary Search Trees

I Naive insertion works fairly well in the average case [1].

Theorem
The expected height of a binary search tree constructed by inserting a sequence of n random values is ∼ c log n with c ≈ 4.311.

I Equivalently, inserting n values in a randomly chosen order.
I Using Markov's inequality, we can say that if H is a random variable giving the height of a tree after the insertion of n keys chosen uniformly at random, then

  Pr(H ≥ αn) ≤ E[H]/(αn) = (c log n)/(αn) = O((log n)/n)

i.e., the probability of a tree having height linear in n converges to zero. So, badly balanced trees are very unlikely for large n.


Binary Search Trees

[Figure: a binary search tree resulting from 100 random insertions.]


Rotations

I A rotation is a simple, local operation that makes some subtrees shorter and others deeper.
I Rotations preserve the inorder traversal, i.e., the order of keys in the tree remains the same.
I Any two binary search trees on the same set of keys can be transformed into one another by a sequence of rotations.¹
I Rotations are a common method to restore balance to a tree.

[Figure: rotation. Rotating right at D transforms the tree D(B(A, C), E) into B(A, D(C, E)); rotating left at B reverses the transformation.]

¹In fact, something even more interesting is true: for each n there is a sequence of rotations that produces every possible binary tree without any duplicates, and eventually returns the tree to its initial configuration (i.e., the rotation graph G_n, where vertices are trees and edges are rotations, contains a Hamiltonian path [2]).


Rotations

I Code to rotate right (the names A-E refer to the rotation figure):

Tree rotateRight() {
    if (left == null)
        throw new RuntimeException("Cannot rotate here");
    Tree A = left.left;
    Tree B = left;
    Tree C = left.right;
    Tree D = this;
    Tree E = right;
    return new Tree(B.data, A, new Tree(D.data, C, E));
}

I Code to rotate left: use duality, swap left/right.
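Applying that duality mechanically (swap left and right everywhere) gives the following sketch; it is our own transcription, not code from the notes:

// Sketch obtained from rotateRight() by swapping left/right (duality):
// transforms D(E, B(C, A)) into B(D(E, C), A).
Tree rotateLeft() {
    if (right == null)
        throw new RuntimeException("Cannot rotate here");
    Tree A = right.right;
    Tree B = right;
    Tree C = right.left;
    Tree D = this;
    Tree E = left;
    return new Tree(B.data, new Tree(D.data, E, C), A);
}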


Rotation example

I Badly balanced binary tree: rebalance by rotating left at a, then left at c:

  [Figure: the right-leaning chain a(·, b(·, c(·, d(·, e)))) becomes b(a, c(·, d(·, e))) after rotating left at a, and then b(a, d(c, e)) after rotating left at c.]


Iterator for a binary search tree

class BSTIterator implements Iterator {
    Stack stack;

    public BSTIterator(BSTNode t) {
        stack = new Stack();
        fathom(t);
    }

    public boolean hasNext() {
        return !stack.empty();
    }

    public Object next() {
        BSTNode t = (BSTNode) stack.pop();
        if (t.right_child != null)
            fathom(t.right_child);
        return t;
    }

    // Push t and the chain of left children below it onto the stack.
    void fathom(BSTNode t) {
        if (t == null) return;  // guard added: the slide's do-while would push null for an empty tree
        do {
            stack.push(t);
            t = t.left_child;
        } while (t != null);
    }
}
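For comparison with the linked-list iterator earlier, here is how the BST iterator might be driven. This usage snippet is ours, and it assumes BSTNode exposes its key through a data field:

// Hypothetical usage: prints the keys in ascending (inorder) order.
BSTIterator iter = new BSTIterator(root);
while (iter.hasNext()) {
    BSTNode t = (BSTNode) iter.next();
    System.out.println("The next element is " + t.data);
}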


Bibliography

[1] Luc Devroye. A note on the height of binary search trees. Journal of the ACM, 33(3):489–498, 1986.

[2] J. M. Lucas, D. R. van Baronaigien, and F. Ruskey. On rotations and the generation of binary trees. Journal of Algorithms, 15(3):343–366, November 1993.


ECE750-TXB Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. [email protected]

Electrical & Computer EngineeringUniversity of Waterloo

Canada

February 14, 2007


Binary Search Trees

I Recall that in a binary tree of height h the time required to find or insert an element is O(h).
I In the worst case h = n, the number of elements.
I To keep h ∈ O(log n) one needs a balancing strategy.
I Balancing strategies may be either:
  I Randomized: e.g., a random insert order results in expected height of c log n with c ≈ 4.311.
  I Deterministic (in the sense of not random).
I Today we will see an example of each:
  I Red-black trees: deterministic balancing.
  I Treaps: randomized. Also demonstrate persistence and unique representation.


Red-black trees

I Red-black trees are a popular form of binary search tree with a deterministic balancing strategy.
I Nodes are coloured red or black.
I Properties of the node-colouring ensure that the longest path to a leaf is no more than twice the length of the shortest path.
I This ensures a height of ≤ 2 log2(n + 1), which implies search, min, max in O(log n) worst-case time.
I Insert and delete can also be performed in O(log n) worst-case time.
I Invented by Bayer [2]; the red-black formulation is due to Guibas and Sedgewick [9]. Other sources: [5, 10].


Red-Black Trees: Invariants

I Balance invariants:

  1. No red node has a red child.
  2. Every path from a node down to a leaf of its subtree contains the same number of black nodes.


Red-Black Trees: Balance

Let bh(x) be the number of black nodes along any path from a node x to a leaf, excluding the leaf.

Lemma
The number of internal nodes in the subtree rooted at x is at least 2^{bh(x)} − 1.

Proof.


By induction on height:

1. Base case: If x has height 0, then x is a leaf, and bh(x) = 0; the number of internal (non-leaf) descendants of x is 0 = 2^{bh(x)} − 1.
2. Induction step: assume the hypothesis is true for height ≤ h, and consider a node x of height h + 1. From invariant (2), each child has black height either bh(x) − 1 (if the child is black) or bh(x) (if the child is red). By the induction hypothesis, each child subtree has at least 2^{bh(x)−1} − 1 internal nodes. The total number of internal nodes in the subtree rooted at x is therefore ≥ (2^{bh(x)−1} − 1) + 1 + (2^{bh(x)−1} − 1) = 2^{bh(x)} − 1.


Red-Black Trees: Balance

Theorem
A red-black tree with n internal nodes has height at most 2 log2(n + 1).

Proof.
Let h be the tree height. From invariant 1 (a red node must have both children black), the black-height of the root must be ≥ h/2. Applying the Lemma, the number of internal nodes n of the tree satisfies n ≥ 2^{h/2} − 1. Rearranging, h ≤ 2 log2(n + 1).


Red-Black Trees: Balance

I As with all non-randomized binary search trees, balance must be maintained when insert or delete operations are performed.
I These operations may disrupt the invariants, so rotations and recolourings are needed to restore them.
I Insert for a red-black tree:
  1. Insert the new key as a red node, using the usual binary tree insert.
  2. Perform restructurings and recolourings along the path from the newly added leaf to the root to restore invariants.
  3. The root is always coloured black.


Red-Black Trees: Balance

I Four cases for red nodes with red children. [Figure: the four red-red configurations.]
I Restructure/recolour to correct: each of the above cases becomes the same balanced configuration. [Figure.]


Red-Black Trees: Example

I Insertion of [1, 2, 3, 4, 5] into a red-black tree. [Figure: the tree after each insertion.]
I Implementation of rebalancing is straightforward but a bit involved.


Heaps and Treaps

I Treaps are randomized search trees that combine TRees and hEAPs.
I First, let's look at heaps.
I Consider determining the maximum element of a set.
  I We could iterate through the array and keep track of the maximum element seen so far. Time taken: Θ(n).
  I We could build a binary tree (e.g., red-black). We can obtain the maximum (minimum) element in O(h) time by following rightmost (leftmost) branches. If the tree is balanced, this requires O(n log n) time to build the tree, and O(log n) time to retrieve the maximum element.
I A heap is a highly efficient data structure for maintaining the maximum element of a set. It is a rudimentary example of a dynamic algorithm/data structure.


Dynamic Algorithms

I A static problem is one where we are given an instance of a problem to solve, we solve it, and are done (e.g., sort an array).
I A dynamic problem is one where we are given a problem to solve, and we solve it.
  I Then the problem is changed slightly and we re-solve.
  I ... ad infinitum.
I The challenge goes from solving a single instance of a problem to maintaining a solution as the problem is modified.
I It is usually more efficient to update the solution than to recompute from scratch.
I E.g., binary search trees can be viewed as a method for dynamically maintaining an ordered list as elements are inserted and removed.


Heaps

I A heap dynamically maintains the maximum element in a collection (or, dually, the minimum element). A binary heap can:
  I Obtain the maximum element in O(1) time;
  I Remove the maximum element in O(log n) time;
  I Insert a new element in O(log n) time.
  Heaps are a natural implementation of the PriorityQueue ADT.
I There are several flavours of heaps: binary heaps, binomial heaps, Fibonacci heaps, pairing heaps. The more sophisticated of these support merging (melding) two heaps.
I We will look at binary heaps.


Binary Heap Invariants

1. A binary heap is a complete binary tree of height h − 1, plus a possibly incomplete last level at depth h, filled from left to right.
2. The key stored at each node is ≥ the key(s) stored in its children.


Binary Heap

I A binary heap may be stored as a (1-based) array, where
  I Parent(j) = ⌊j/2⌋
  I LeftChild(i) = 2i
  I RightChild(i) = 2i + 1
I E.g., [17, 11, 13, 9, 6, 2, 12, 4, 3, 1] is an array representation of the heap: [Figure.]


Heap operations

I To insert a key k into the heap:
  I Place k at the next available position.
  I Swap k with its parent(s) until the heap invariant is satisfied. (Takes O(log n) time.)
I The maximum element is just the key stored at the root, which can be read off in O(1) time.
I To delete the maximum element:
  I Place the key at the last heap position at the root (overwriting the current maximum), and decrease the size of the heap by one.
  I Choose the largest of the root and its two children, and make this the root; perform this procedure recursively until the heap invariant is satisfied.
  (A sketch of both operations appears below.)
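The two repair loops just described are often called sift-up and sift-down. Below is a minimal sketch over a 1-based int array, using the Parent/LeftChild/RightChild indexing from the previous slide; the class name and fields are our own assumptions, not the lecture's code.

// Hypothetical binary max-heap sketch over a 1-based array a[1..size].
class BinaryHeap {
    int[] a = new int[1024];   // a[0] unused; fixed capacity for brevity
    int size = 0;

    void insert(int k) {                       // place at next position, sift up
        a[++size] = k;
        for (int j = size; j > 1 && a[j] > a[j / 2]; j /= 2)
            swap(j, j / 2);
    }

    int deleteMax() {                          // move last key to root, sift down
        int max = a[1];
        a[1] = a[size--];
        int i = 1;
        while (2 * i <= size) {
            int c = 2 * i;                     // left child
            if (c + 1 <= size && a[c + 1] > a[c]) c = c + 1;  // pick larger child
            if (a[i] >= a[c]) break;           // heap invariant restored
            swap(i, c);
            i = c;
        }
        return max;
    }

    void swap(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}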


Heap: insert example

I Example: insert 23 into the heap and restore the heap invariant. [Figure.]


Heap: delete-max example

I To delete the max element, move the element from the last position (2) to the root. [Figure.]
I To restore the heap invariant, swap the root with the largest child greater than it, if any, and repeat down the heap.


Treaps

Treaps (binary TRee + hEAP)

I a randomized binary search tree
I with O(log n) average-case insert, delete, search
I with O(∆ log n) average-case union, intersection, ⊆, ⊇, where ∆ = |(A \ B) ∪ (B \ A)| is the difference between the sets
I uniquely represented (to be explained)
I easily made persistent (to be explained)
I Due to Vuillemin [14] and, independently, Seidel and Aragon [11]. Additional references: [3, 16, 15].


Treaps: Basics

I Keys are assigned (randomly chosen) priorities.
I Two total orders on keys:
  I The usual key order;
  I A randomly chosen priority order, often obtained by assigning each key a random integer, or using an appropriate hash function.
I Treaps are kept sorted by key in the usual way (inorder tree traversal visits keys in order).
I The heap property is maintained with respect to the priority order.


Treap ordering

I Each node has key k and priority p.
I Ordering invariants, for a node (k2, p2) with left child (k1, p1) and right child (k3, p3):

  k1 ≤ k2 ≤ k3        (key order)
  p2 ≥ p1, p2 ≥ p3    (priority order)

Every node has a higher priority than its descendants.


Treaps: Basics

I If priorities are chosen randomly, the tree is on average balanced, and insert, delete, search take O(log n) time.
I Random priorities behave like a random insertion order: the structure of the treap is exactly that obtained by inserting the keys into a binary search tree in descending order of heap priority.
I If keys are unique (no duplicates), and priorities are unique, then the treap has the unique representation property.


Unique representation

I Unique representation: each set is represented by a unique data structure [1, 13, 12].
I Most tree data structures do not have this property: depending on the order of inserts, deletes, etc., the tree can have different forms for the same set of keys.
I Recall there are C_n ∼ 4^n n^{−3/2} π^{−1/2} ways to place n keys in a binary search tree (Catalan numbers), e.g., C_20 = 6564120420.
I Deterministic (i.e., not randomized) uniquely represented search trees are known to require Ω(√n) worst-case time for insert, delete, search [12].
I Treaps are randomized (not deterministic), and have O(log n) average-case time for insert, delete, search.
I If you memoize or cache the constructors of a uniquely represented data structure, you can do equality testing in O(1) time by comparing pointers.


Treap: Example

Treap A1 = R.insert("f");   // Insert the key f
Treap A2 = A1.insert("u");  // Insert the key u

Treap B1 = R.insert("u");   // Insert the key u into R
Treap B2 = B1.insert("f");  // Insert the key f (the slide wrote R.insert("f"),
                            // but the version graph below derives B2 from B1)

[Figure: the resulting treaps; A2 and B2 contain the same keys.]


Canonical forms

I The structure of the treap does not depend on the order in which the operations are carried out.
I Treaps give a canonical form for sets: if A, B are sets, we can determine whether A = B by constructing treaps containing the elements of A and B, and comparing them. If the treaps are the same, the sets are equal.
I Treaps give an easy decision procedure for equality of terms modulo associativity, commutativity, and idempotency.
I Treaps are very useful in program analysis (e.g., for compilers) for solving fixpoint equations on sets.


Persistent Data Structures

Literature: [7, 8, 4, 6]

I Partially persistent: can access previous versions of a data structure, but cannot derive new versions from them (read-only access to a linear past).
I Fully persistent: can make changes in previous versions of the data structure: versions can "fork."
  I Any linked data structure with constant bounded in-degree can be made fully persistent with amortized O(1) space and time overhead, and worst-case O(1) overhead for access [7].
I Confluently persistent: can branch into two versions of the data structure, and later reconcile these branches.


The Version Graph

The version graph shows how versions of a data structure are derived from one another.

I Vertices: data structures
I Edges: show how one data structure was derived from another
I Treaps example:

      R
     / \
    A1  B1
    |   |
    A2  B2


Version graph

I Partial persistence: the version graph is a linear sequence of versions, each derived from the previous version.
I Partial/full persistence: get a version tree.
I Confluent persistence: get a version DAG (directed acyclic graph):

      X
     / \
    Y1  Z
    |   |
    Y2  |
     \ /
      W


Purely Functional Data Structures

I Literature: [10]
I Functional data structures: cannot modify a node of the data structure once it is created. (One implication: no cyclic data structures.)
I Functional data structures are by nature partially persistent: we can always hold onto pointers to old versions of the data structure.


Scopes

I Partial persistence is very useful for managing scopes in compilers and program analysis.
I A scope is a representation of the names that are visible at a given program point:

int foo(int a, int b)               // S1
{
    int x = a*a, y = b*b, z = 0;    // S2
    for (int k = 0; k < x; ++k)     // S3
        for (int l = 0; l < y; ++l) // S4
            ++z;                    // (the slide wrote ++c, but no c is in scope; z seems intended)
    // S5
    return x;
}


Bibliography

[1] A. Andersson and T. Ottmann. Faster uniquely represented dictionaries. In Proceedings: 32nd Annual Symposium on Foundations of Computer Science, San Juan, Puerto Rico, October 1–4, 1991, pages 642–649. IEEE Computer Society Press, 1991.

[2] Rudolf Bayer. Symmetric binary B-trees: Data structure and maintenance algorithms. Acta Informatica, 1:290–306, 1972.


[3] Guy E. Blelloch and Margaret Reid-Miller. Fast set operations using treaps. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 16–26, Puerto Vallarta, Mexico, June 1998.

[4] Adam L. Buchsbaum and Robert E. Tarjan. Confluently persistent deques via data-structural bootstrapping. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 155–164. ACM Press, 1993.

[5] Thomas H. Cormen, Charles E. Leiserson, and Ronald R. Rivest. Introduction to Algorithms. McGraw-Hill, 1991.


[6] P. F. Dietz. Fully persistent arrays. In F. Dehne, J.-R. Sack, and N. Santoro, editors, Proceedings of the Workshop on Algorithms and Data Structures, volume 382 of LNCS, pages 67–74. Springer, Berlin, August 1989.

[7] James R. Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. Making data structures persistent. In ACM Symposium on Theory of Computing, pages 109–121, 1986.


[8] Amos Fiat and Haim Kaplan. Making data structures confluently persistent. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA-01), pages 537–546. ACM Press, New York, January 7–9, 2001.

[9] Leonidas J. Guibas and Robert Sedgewick. A dichromatic framework for balanced trees. In FOCS, pages 8–21. IEEE, 1978.

[10] Chris Okasaki. Purely Functional Data Structures. Cambridge University Press, Cambridge, UK, 1998.

[11] Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4/5):464–497, 1996.


[12] Lawrence Snyder. On uniquely representable data structures. In 18th Annual Symposium on Foundations of Computer Science, pages 142–146. IEEE Computer Society Press, Long Beach, CA, USA, October 1977.

[13] R. Sundar and R. E. Tarjan. Unique binary search tree representations and equality-testing of sets and sequences. In Baruch Awerbuch, editor, Proceedings of the 22nd Annual ACM Symposium on the Theory of Computing, pages 18–25. ACM Press, Baltimore, MD, May 1990.


[14] Jean Vuillemin. A unifying look at data structures. Communications of the ACM, 23(4):229–239, 1980.

[15] M. A. Weiss. A note on construction of treaps and Cartesian trees. Information Processing Letters, 54(2):127, April 1995.

[16] Mark Allen Weiss. Linear-time construction of treaps and Cartesian trees. Information Processing Letters, 52(5):253–257, December 1994.


ECE750-TXB Lecture 8: Treaps, Tries, and Hash Tables

Todd L. [email protected]

Electrical & Computer EngineeringUniversity of Waterloo

Canada

February 1, 2007


Review: Treaps

I Recall that a binary search tree has keys drawn from a totally ordered structure 〈K, ≤〉.
I An inorder traversal of the tree recovers the keys in ascending order.

        d
      /   \
     b     h
    / \   / \
   a   c f   i


Review: Treaps

I Recall that a heap has priorities drawn from a totally ordered structure 〈P, ≤〉.
I The priority of a parent is ≥ that of its children (for a max-heap).
I The largest priority is at the root.

        23
       /  \
     11    14
    /  \  /  \
   7   1  6   13


Review: Treaps

I In a treap, nodes contain a pair (k, p) where k ∈ K is a key, and p ∈ P is a priority.
I A treap is a mixture of a binary search tree and a heap:
  I a binary search tree with respect to keys;
  I a heap with respect to priorities.

            (d,23)
           /      \
      (b,11)      (h,14)
      /    \      /    \
   (a,7) (c,1) (f,6) (i,13)


Review: Unique Representation

I If the keys and priorities are unique, then treaps have the unique representation property: given a set of (k, p) pairs, there is only one way to build the tree.
  I For the heap property to be satisfied, there is only one (k, p) pair that can be the root: the one with the highest priority.
  I The left subtree of the root will contain all keys < k, and the right subtree of the root will contain all keys > k.
  I Of the keys < k, the one with the highest priority must occupy the left child of the root. This then splits constructing the left subtree into two subproblems.
  I etc.


Review: Unique Representation

I Example: to build a treap from (i, 13), (c, 1), (d, 23), (b, 11), (h, 14), (a, 7), (f, 6), there is a unique choice of root: (d, 23).

             (d,23)
            /      \
  {(c,1), (b,11), (a,7)}   {(i,13), (h,14), (f,6)}

I To build the left subtree, pick out the highest-priority element: (b, 11). And so forth.

             (d,23)
            /      \
       (b,11)    {(i,13), (h,14), (f,6)}
       /    \
    (a,7)  (c,1)


Review: Unique Representation

I Data structures with unique representation can be checked for equality in O(1) time by using caching (also known as memoization):
  I Implement the data structure in a purely functional style (a node's fields are never altered after construction; any changes require creating a new node).
  I Maintain a map from (key, priority, lchild, rchild) tuples to already constructed nodes.
  I Before constructing a node, check the cache to see if it already exists; if so, return the pointer to that node. Otherwise, construct the node and add it to the cache.
I If two treaps contain the same keys, their root pointers will be equal: can be checked in O(1) time.
I Checking and maintaining the cache requires additional time overhead. (A sketch of such a node cache appears below.)
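This caching idea is often called hash-consing. A minimal sketch, under our own naming (TreapNode, a static HashMap cache, and a make factory are assumptions, not the lecture's code):

import java.util.HashMap;

// Hypothetical hash-consing sketch: nodes are immutable and created only
// through make(), which returns an already-cached node when one exists.
class TreapNode {
    final String key; final int priority;
    final TreapNode left, right;
    static HashMap<String, TreapNode> cache = new HashMap<String, TreapNode>();

    private TreapNode(String k, int p, TreapNode l, TreapNode r) {
        key = k; priority = p; left = l; right = r;
    }

    static TreapNode make(String k, int p, TreapNode l, TreapNode r) {
        // Identity hashes of the children suffice, because the children
        // are themselves hash-consed and therefore canonical.
        String sig = k + "/" + p + "/" + System.identityHashCode(l)
                       + "/" + System.identityHashCode(r);
        TreapNode node = cache.get(sig);
        if (node == null) {
            node = new TreapNode(k, p, l, r);
            cache.put(sig, node);
        }
        return node;  // equal subtreaps share one node, so == tests set equality
    }
}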


Review: Balance of treaps

I Treaps are balanced if the priorities are chosen randomly.
I Recall that building a binary search tree with a random insertion order results in a tree of expected height c log n, with c ≈ 4.311.
I A treap with random priorities assigned to keys has exactly the same structure as a binary search tree created by inserting keys in descending order of priority.
  I Descending order of priority is a random order;
  I Therefore treaps have expected height c log n with c ≈ 4.311.


Insertion into treaps

I Insertion for treaps is much simpler than that for red-black trees.
  1. Insert the (k, p) pair as for a binary search tree, by key alone: the new node will be placed somewhere at the bottom of the tree.
  2. Perform rotations along the path from the new leaf to the root to restore invariants:
     I If there is a node x whose right subchild has a higher priority, rotate left at x.
     I If there is a node x whose left subchild has a higher priority, rotate right at x.
  (A sketch of this insert appears below.)
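Written recursively, both steps collapse into a few lines: insert by key, then rotate on the way back up whenever a child outranks its parent. This is a sketch under our own naming (a Node class with key, priority, left, right fields), not the notes' code:

// Hypothetical recursive treap insert: BST insert by key, then restore the
// heap property with at most one rotation per level on the way back up.
static Node insert(Node t, String k, int p) {
    if (t == null) return new Node(k, p);
    if (k.compareTo(t.key) < 0) {
        t.left = insert(t.left, k, p);
        if (t.left.priority > t.priority) t = rotateRight(t);  // left child outranks parent
    } else if (k.compareTo(t.key) > 0) {
        t.right = insert(t.right, k, p);
        if (t.right.priority > t.priority) t = rotateLeft(t);  // right child outranks parent
    }
    return t;  // equal keys are ignored, as in the naive BST insert of Lecture 6
}

static Node rotateRight(Node d) {  // mirrors Tree.rotateRight() from Lecture 6
    Node b = d.left;
    d.left = b.right;
    b.right = d;
    return b;
}

static Node rotateLeft(Node b) {   // dual: swap left/right
    Node d = b.right;
    b.right = d.left;
    d.left = b;
    return d;
}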


Insertion into treaps

I Example: the treap below has just had (e, 19) inserted as a new leaf. Rotations have not yet been performed.

            (d,23)
           /      \
      (b,11)      (h,14)
      /    \      /    \
   (a,7) (c,1) (f,6) (i,13)
                 /
             (e,19)

I f has a left subchild with greater priority: rotate right at f.


Insertion into treaps

I After rotating right at f:

            (d,23)
           /      \
      (b,11)      (h,14)
      /    \      /    \
   (a,7) (c,1) (e,19) (i,13)
                    \
                   (f,6)

I h has a left subchild with greater priority: rotate right at h.


Insertion into treaps

I After rotating right at h:

            (d,23)
           /      \
      (b,11)      (e,19)
      /    \          \
   (a,7) (c,1)       (h,14)
                     /    \
                  (f,6) (i,13)

I Heap invariant is satisfied: all done.


I Treaps are easily made persistent (retain previous versions) by implementing them in a purely functional style. Insertion requires duplicating at most a sequence of nodes from the root to a leaf: an O(log n) space overhead. The remaining parts of the tree are shared.
I E.g., the previous insert done in a purely functional style: [Figure: Version 1 and Version 2 each have their own root (d,23), but share all subtrees except the path from the root to the new node; only that path is copied.]


Strings

I A string is a sequence of characters drawn from some alphabet Σ. We will often use Σ = {0, 1}: binary strings.
I We write Σ* to mean all finite strings¹ composed of characters from Σ. (* is the Kleene closure.)
I Σ* contains the empty string ε.
I If w, v ∈ Σ* are strings, we write w · v or just wv to mean the concatenation of w and v.
I Example: given w = 010 and v = 11, w · v = 01011.

〈Σ*, ·, ε〉 is an example of a monoid: a set (Σ*) together with an associative binary operator (·) and an identity element (ε). For any strings u, v, w ∈ Σ*,

  u · (v · w) = (u · v) · w
  vε = εv = v

¹Infinite strings are very useful also: if we write a real number x ∈ [0, 1] as a binary number, e.g., 0.101100101000···, this is a representation of x by an infinite string from Σ^ω.


Tries

I Recall that we may label the left and right links of a binary tree with 0 (for left) and 1 (for right):

       ·
     0/ \1
     x   ·
       0/ \1
       y   z

I To describe a path in the tree, one can list the sequence of left/right branches to take from the root. E.g., 10 gives y, 11 gives z.
I The set of all paths from the root to leaves is P = {0, 10, 11}.
I The set of all paths from the root to leaves or internal nodes is P• = {ε, 0, 1, 10, 11}, where ε is the empty string indicating the path starting and ending at the root.


Tries

I The set P is prefix-free: no string is an initial segment of any other string. Otherwise, there would be a path to a leaf passing through another leaf!
I The set P• is prefix-closed: if wv ∈ P•, then w ∈ P• also, i.e., P• contains all prefixes of all strings in P•.²

²We can define • as an operator by A• ≡ {w : wv ∈ A}. • is a closure operator. A useful fact: every closure operator has as its range a complete lattice, where meet and join are given by (X ⊓ Y)• = X• ∩ Y• and (X ⊔ Y)• = (X• ∪ Y•)•. Applying this fact to the representation of binary trees by strings, • induces a lattice of binary trees.


Tries

I Given a binary tree, we can produce a set of strings P• or P that describe all paths (resp. all paths to leaves).
I The converse is also true: given a set P• or P, we can reproduce the tree.³
I Example: the set {100, 11, 001, 01} is prefix-free, and the corresponding tree can be built by simply adding the paths one-by-one to an initially empty tree. [Figure: the resulting tree.]

³Formally, we can say there is a bijection (a 1-1 correspondence) between binary trees and prefix-closed (resp. prefix-free) sets.


Tries

I A tree constructed in this way, by interpreting a set of strings as paths of the tree, is called a trie. (The term comes from reTRIEval; pronounced either "tree" or "try" depending on taste. Tries were invented by de la Briandais, and independently by Fredkin [5].)
I The most common use of a trie is to implement a Dictionary〈K, V〉, i.e., maintaining a map f : K ⇀ V by associating each k ∈ K with a path through the trie to a node where f(k) is stored.⁴
I Tries find applications in bioinformatics, coding and compression, sorting, SAT solving, routing, natural language processing, very large databases (VLDBs), data mining, etc.
I Binary Decision Diagrams (BDDs) are essentially tries with caching and sharing of subtrees.
I Recent survey by Flajolet [4]. (A minimal trie sketch appears below.)

⁴The notation K ⇀ V indicates a partial function from K to V: a function that might not be defined for some keys.



Trie example: word list

I Example: build a trie to store English words: trie, larch, saxophone, tried, saxifrage, squeak, try, squeak, squeaky, squeakily, squeakier.
I Common implementation variants of a trie (a code sketch of the basic structure follows this list):
  I Associate internal nodes with entries also, if one occurs there. (Can use 1 bit on internal nodes to indicate whether a key terminates there.)
  I When a node has only one descendent, end the trie there, rather than including a possibly long chain of nodes with single children.
  I Use the trie to store keys only; implicitly the values we are storing are V = {0, 1}. The function the trie represents is a map χ : K → {0, 1}, where χ is the characteristic function of the set: χ(k) = 1 if and only if k is in the set.
  I Use the alphabet {a, b, ···, z}.
  I Instead of having a 26-way branch in each node, put a little BST at each node with up to 26 elements in it (a "ternary search trie" [1]).
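To make the first and third variants concrete, here is a minimal sketch (in Python; the class and method names are mine, not from the lecture) of a trie storing a set of strings, with the one-bit "a key terminates here" flag on each node:

    class TrieNode:
        def __init__(self):
            self.children = {}   # maps one character to a child TrieNode
            self.is_key = False  # the 1-bit "a key terminates here" flag

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_key = True

        def contains(self, word):
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.is_key

    t = Trie()
    for w in ["trie", "tried", "try", "squeak", "squeaky"]:
        t.insert(w)
    assert t.contains("tried") and not t.contains("squeaki")

Here each node branches on whole characters (a dict per node); the ternary-search-trie variant would replace each dict with a small BST.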


[Figure: the trie for the word list; the shared prefixes lead to nodes for larch, saxifrage, saxophone, squeak, squeaky, squeakier, squeakily, trie, tried, and try, with a small BST at each branching node.]



Trie example: coding

I Suppose we want to transmit (or compress) data.

I At the receiving (or decoding) end, we will have a long string of bits to decode.
I A simple but effective strategy is to build a codebook that maps binary codewords to plaintext. The incoming transmission is then just a sequence of codewords that we will replace, one by one, with their corresponding plaintext. (Footnote 5)
I A code that can be described by a trie, with outputs only at the leaves, is an example of a uniquely decodeable code: there is only one way an encoded message can be decoded. Specifically, such codes are called prefix codes or instantaneous codes.

Footnote 5: This strategy is asymptotically optimal (achieves a bitrate ≤ H + ε for any ε > 0) for stationary ergodic random processes, with an appropriate choice of codebook.


I Example: to encode English, we might assign codewords to sequences of three letters, giving the most frequent words shorter codes:

    Three-letter combination   Codeword
    the                        000
    and                        001
    for                        010
    are                        011
    but                        100
    not                        1010
    you                        1011
    all                        1100
    ...                        ...
    etc                        11101101
    ...                        ...
    qxw                        1111011001101001

I These codewords are chosen to be a prefix-free set.



I For decoding messages we build a trie:

[Figure: the decoding trie; following 0/1-labelled edges from the root reaches the leaves the (000), and (001), for (010), are (011), but (100), not (1010), you (1011), all (1100).]


Trie example: decoding

I Incoming message: 100101001010111100

I To decode: start at the root of the trie, and follow the path given by the bits. When a leaf is reached, output the word there, and return to the root:

    100    1010   010    1011   1100
    but    not    for    you    all

I This requires substantially fewer bits than transmitting as ASCII text (24 bits per 3-letter sequence).
I A good code assigns short codewords to frequently-occurring strings; if a string occurs with probability p_i, one wants the codeword to have length about −log2 p_i.
I Later in the course we shall see how such codes can be constructed optimally using a greedy algorithm.
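The decoding walk is easy to express in code. Below is a minimal sketch (Python; the names are mine) using the codebook table above; because the codewords are prefix-free, the first match while accumulating bits is the only possible match, which corresponds to reaching a leaf of the decoding trie:

    # The codebook from the table above (a prefix-free set).
    codebook = {"000": "the", "001": "and", "010": "for", "011": "are",
                "100": "but", "1010": "not", "1011": "you", "1100": "all"}

    def decode(bits):
        words, buf = [], ""
        for b in bits:
            buf += b
            if buf in codebook:        # reached a leaf of the decoding trie
                words.append(codebook[buf])
                buf = ""               # return to the root
        if buf:
            raise ValueError("trailing bits do not form a codeword")
        return words

    print(decode("100101001010111100"))  # ['but', 'not', 'for', 'you', 'all']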



Tries: Kraft’s inequality

I Kraft's inequality is a simple constraint on the lengths of codewords in a prefix code (equivalently, leaf depths in a binary tree).

Theorem (Kraft). Let (d_1, d_2, ...) be a sequence of code lengths of a code. There is a prefix code with code lengths d_1, d_2, ... (equivalently, a binary tree with leaves at depths d_1, d_2, ...) if and only if

    ∑_{i=1}^{n} 2^(−d_i) ≤ 1    (1)


I Positive example: the codeword lengths 3, 3, 2, 2 satisfy Kraft's inequality: 1/8 + 1/8 + 1/4 + 1/4 = 3/4 ≤ 1. Possible trie realization:

[Figure: a trie with two leaves at depth 3 and two leaves at depth 2.]

I Negative example: the codeword lengths 3, 3, 3, 2, 2, 2 violate Kraft's inequality: the sum is 9/8 > 1.
I Kraft's inequality becomes an equality for trees in which every internal node has two children.
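Checking Kraft's inequality for a proposed set of codeword lengths is nearly a one-liner; this sketch (Python, with exact arithmetic via fractions; the function name is mine) reproduces both examples above:

    from fractions import Fraction

    def kraft_sum(lengths):
        # Sum of 2^(-d) over the codeword lengths d; a prefix code
        # with these lengths exists iff the sum is <= 1.
        return sum(Fraction(1, 2**d) for d in lengths)

    print(kraft_sum([3, 3, 2, 2]))        # 3/4 -> a prefix code exists
    print(kraft_sum([3, 3, 3, 2, 2, 2]))  # 9/8 -> no prefix code exists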



Two ways to prove Kraft's inequality:
I Put each node of a binary tree in correspondence with a subinterval of [0, 1] on the real line: the root is [0, 1], and its children get [0, 1/2] and [1/2, 1]. Each node at depth d receives an interval of length 2^(−d) and splits it in half for its children. The union of the intervals at the leaves is ⊆ [0, 1], and the intervals at the leaves are pairwise disjoint, so the sum of their interval lengths is ≤ 1.
I Kraft's inequality can also be proved with a simple induction argument. The list of valid codeword length sequences can be generated from the initial sequence ⟨1, 1⟩ (codewords 0, 1) by the rewrite rules k → k + 1, k + 1 (expand a node into two children) and k → k + 1 (expand a node to have a single child). Base case: with ⟨1, 1⟩ obviously 2^(−1) + 2^(−1) = 1. Induction step: if the sum is ≤ 1, consider expanding a single element of the sequence: we have either the rewrite k → k + 1, k + 1, with 2^(−k) ≥ 2^(−k−1) + 2^(−k−1), or the rewrite k → k + 1, with 2^(−k) ≥ 2^(−k−1). So rewrites never increase the "weight" of a node.


It is occasionally useful to have an infinite set of codewords handy, in case we do not know in advance how many different objects we might need to code. For an infinite set of codewords (or infinite binary tree), Kraft's inequality implies

    d_k ≥ c + log⁺ k + log log⁺ log∗ k    infinitely often    (2)

where

    log⁺ x ≡ log x + log log x + log log log x + ···

with the sum taken only over the positive terms, and log∗ x is the "iterated logarithm":

    log∗ x = 0 if x ≤ 1, and log∗ x = 1 + log∗(log x) otherwise.



See e.g. [2, 9]. Where does this bound come from? Well, a necessary condition for

    ∑_{k=0}^{∞} 2^(−d_k) ≤ 1

to hold is that the series ∑_{k=0}^{∞} 2^(−d_k) converges. For example, if d_k = log k, then 2^(−d_k) = 1/k, the Harmonic series. The Harmonic series diverges, so Kraft's inequality can't hold. We can parlay this into an inequality by remembering the "comparison test" for convergence of series: if a_k, b_k are two positive series, and a_k ≤ b_k for all k, then ∑ a_k ≤ ∑ b_k. If we stick the Harmonic series in for a_k and 2^(−d_k) for b_k, we get:

    If 1/k ≤ 2^(−d_k) for all k, then ∞ ≤ ∑ 2^(−d_k).

The premiss of this test must be false if ∑ 2^(−d_k) does not diverge to infinity. Therefore 2^(−d_k) must be < 1/k for at least some k. If 2^(−d_k) < 1/k for only some finite number of choices of k, the series would still diverge. So, a necessary condition for ∑ 2^(−d_k) to converge is that 2^(−d_k) < 1/k


for infinitely many terms. Taking logarithms and multiplying through by −1 we get d_k > log k for infinitely many k. We can generalize this by saying that if g ∈ ω(1) is any diverging function, then d_k > −log g′(k) for infinitely many k. (The Harmonic series bound follows from choosing g(x) = log x.) Unfortunately there is no "slowest growing function" g(x) from which we could obtain a tightest possible bound.

Eqn. (2) is from [2]; Bentley credits the result to Ronald Graham and Fan Chung, apparently unpublished.



Tries: Variations on a theme

There are many useful variants of tries [4]:

I Multiway branching: instead of choosing Σ = {0, 1}, one can choose any finite alphabet, and allow each node to have |Σ| children.
I Paged trie: each node is required to have a minimal number of leaves descended from it; when this threshold is not met, the subtree is converted into a compact form (e.g., an array of keys and values) suitable for secondary storage. This technique can also be used to increase performance in main memory [6].
I Patricia tries [7] ("Practical Algorithm To Retrieve Information Coded in Alphanumeric" — Footnote 6) introduce skip pointers to avoid long sequences of single-branch nodes like 0 → 1 → 1 → 0 → ···


I LC-Trie: the first few levels of a big trie tend to be almost a complete binary tree of some depth, which can be collapsed into an array of pointers to tries [8].
I Ternary Search Tries (TSTs): a blend of a trie and a BST; can require substantially less space than a trie. For a large |Σ|, replace a |Σ|-way branch at each internal node with a BST of depth ≤ log |Σ|.

Footnote 6: Almost better than my all-time favourite strained CS acronym, PERIDOT: "Programming by Example for Real-time Interface Design Obviating Typing." Great project, despite the acronym.



Hash Tables

I Suppose we wanted to represent the following set:

    M = {35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657}

Given some x ∈ N, we want to quickly test whether x ∈ M.
I Binary search trees: require following a path through a tree — perhaps not fast enough for our problem.
I Super fast way: allocate an array of 4657 bytes. Set

    A[i] = 1 if i ∈ M, and A[i] = 0 if i ∉ M.

Then, on a RAM, we can test whether x ∈ M with a single memory access to A[x] (a constant amount of time). However, the space required by this strategy is O(sup M).


I Obviously the array A would contain mostly empty space. Can we somehow "compress" the array but still support fast access?
I Yes: allocate a much smaller table B of length k. Define a function h : [1, 4657] → [1, k] that maps indices of A to indices of B, can be computed quickly, and ensures that if x, y ∈ M and x ≠ y, then h(x) ≠ h(y), i.e., no two elements of M have the same index in B.

I Then, x ∈ M if and only if B[h(x)] = x .



I For our example, h(x) = x mod 17 does the trick. Here is the array B:

    j   B[j]      j   B[j]      j   B[j]
    0   0         6   0         12  0
    1   35        7   0         13  0
    2   0         8   1691      14  0
    3   139       9   1760      15  3789
    4   395       10  1795      16  4657
    5   0         11  3632

I e.g.: x = 1691: h(x) = 8, and B[8] = 1691, so x ∈ M.

I e.g.: x = 1692: h(x) = 9, and B[9] = 1760 ≠ 1692, so x ∉ M.
I This is a hash table. h(x) = x mod 17 is called a hash function.
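The whole scheme fits in a few lines. A minimal sketch (Python; the variable names are mine) of the table B and the membership test:

    M = [35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657]

    m = 17
    B = [0] * m
    for x in M:
        B[x % m] = x          # h(x) = x mod 17 is collision-free on M

    def member(x):
        return B[x % m] == x  # a single probe decides membership

    assert member(1691) and not member(1692)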


Hash Functions

I A hash function is a map h : K → H from some (usually large) key space K to some (usually small) set of hash values H. In our example, we were mapping from K = [1, 4657] to H = [1, 17].
I If the set M ⊆ K is chosen uniformly at random, keys are uniformly distributed (i.e., each k ∈ K has the same probability of appearing in a set to represent). In this case the hash function should distribute the keys evenly amongst elements of H, i.e., we want that |h^(−1)(y)| ≈ |h^(−1)(z)| for y, z ∈ H. (Footnote 7)
I For a nonuniform distribution on keys, one just wants to choose h so that the distribution induced on H is close to uniform.

Footnote 7: Recall that for a function f : R → S, f^(−1)(s) ≡ {r : f(r) = s}.



I We will describe some hash functions where K = N (keys are nonnegative integers). These are easily adapted to other kinds of keys (e.g., strings) by interpreting the binary representation of the key as an integer.

Some commonly used hash functions are the following:

1. Division: use h(k) = k mod m, where m = |H| is usually chosen to be a prime number far away from any power of 2. (Footnote 8)
  I For long bit strings, use Horner's rule for evaluating polynomials in Z/mZ (will explain).

2. Multiplication: use h(k) = ⌊m{kφ}⌋, where 0 < φ < 1 is an irrational number and {x} ≡ x − ⌊x⌋ denotes the fractional part. A popular choice of φ is φ = (√5 − 1)/2.

Footnote 8: A particularly terrible choice would be m = 256, which would hash objects based only on their lowest 8 bits, e.g., the hash of a string would depend only on its last character.
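Here is a small sketch of both methods (Python; the function names are mine). Note the multiplication method below uses floating point, which is fine for illustration but loses precision for very large k:

    import math

    def h_div(k, m):
        # Division method: m is best chosen prime, far from a power of 2.
        return k % m

    PHI = (math.sqrt(5) - 1) / 2   # the popular irrational multiplier

    def h_mul(k, m):
        # Multiplication method: take the fractional part {k*phi},
        # then scale it up to one of the m slots.
        frac = (k * PHI) % 1.0
        return int(m * frac)

    print([h_mul(k, 100) for k in range(1, 6)])   # [61, 23, 85, 47, 9]

The printed slots match the first rows of the table in the example that follows.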


Multiplication hash functions: Example

Example of the multiplication hash function using φ = (√5 − 1)/2, and a hash table with m = 100 slots:

    key k   {kφ}       ⌊m{kφ}⌋       key k   {kφ}       ⌊m{kφ}⌋
    1       0.618034   61            10      0.180340   18
    2       0.236068   23            11      0.798374   79
    3       0.854102   85            12      0.416408   41
    4       0.472136   47            13      0.034442   3
    5       0.090170   9             14      0.652476   65
    6       0.708204   70            15      0.270510   27
    7       0.326238   32            16      0.888544   88
    8       0.944272   94            17      0.506578   50
    9       0.562306   56

The idea is that the third column (the hash slots) 'looks like' a random sequence.



Multiplication hash functions

I The reason why h(k) = ⌊m{kφ}⌋ is a reasonable hash function is interesting.
I The short answer is that the sequence {kφ} for k = 1, 2, 3, ... 'kind of behaves like' a random real drawn from (0, 1). So, h(k) = ⌊m{kφ}⌋ 'looks like' a randomly chosen hash function. A less sketchy explanation:
  1. {kφ} is uniformly distributed on (0, 1): asymptotically, the proportion of {kφ} falling in an interval (α, β) ⊆ (0, 1) is (β − α). Just like a uniform distribution on (0, 1)!
  2. {kφ} satisfies an ergodic theorem: if we sample a suitably well-behaved (Footnote 9) function f at the points {kφ} and average, this converges to the integral:

      (1/m) ∑_{k=1}^{m} f({kφ}) → ∫_0^1 f(x) dx

  Just like a uniform distribution on (0, 1)! See [3]. Variously called Weyl's ergodic principle, or Weyl's equidistribution theorem.

However, {kφ} is emphatically not a random sequence.

Footnote 9: Continuously differentiable and periodic with period 1.


Hash Functions

I To evaluate whether a hash function is a good choice for a set of data S ⊆ K, one can see how the observed distribution of keys into hash table slots compares to a uniform distribution.

I Suppose there are n keys and m hash slots. Compute the observed distribution of the keys:

    p_i = |{k : h(k) = i}| / n

I To measure how far from uniform, compute

    D(P‖U) = log2 m + ∑_{i=1}^{m} p_i log2 p_i

Convention: 0 log2 0 = 0.
I This is the Kullback-Leibler divergence of the observed distribution P from the uniform distribution U. It may be thought of as the "distance" from P to U.
I The smaller D(P‖U), the better the hash function.
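A sketch of this measurement (Python; the function name is mine), taking the observed slot counts and returning D(P‖U) in bits:

    import math

    def kl_from_uniform(counts):
        # counts[i] = number of keys hashed to slot i; returns
        # D(P||U) = log2(m) + sum_i p_i log2 p_i, with 0 log2 0 = 0.
        n, m = sum(counts), len(counts)
        d = math.log2(m)
        for c in counts:
            if c > 0:                  # convention: 0 log2 0 = 0
                p = c / n
                d += p * math.log2(p)
        return d

    print(kl_from_uniform([2, 2, 2, 2]))  # 0.0 -> perfectly uniform
    print(kl_from_uniform([8, 0, 0, 0]))  # 2.0 -> worst case for m = 4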



Bibliography

[1] Jon L. Bentley and Robert Sedgewick. Fast algorithms for sorting and searching strings. In SODA '97: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360–369, Philadelphia, PA, USA, 1997. Society for Industrial and Applied Mathematics.

[2] Jon Louis Bentley and Andrew Chi-Chih Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3):82–87, 1976.

[3] Bernard Chazelle. The Discrepancy Method — Randomness and Complexity. Cambridge University Press, Cambridge, 2000.


[4] Philippe Flajolet. The ubiquitous digital tree. In Bruno Durand and Wolfgang Thomas, editors, STACS, volume 3884 of Lecture Notes in Computer Science, pages 1–22. Springer, 2006.

[5] Edward Fredkin. Trie memory. Commun. ACM, 3(9):490–499, 1960.

[6] Steffen Heinz, Justin Zobel, and Hugh E. Williams. Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst., 20(2):192–223, 2002.



[7] Donald R. Morrison. PATRICIA — practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514–534, 1968.

[8] Stefan Nilsson and Gunnar Karlsson. IP-address lookup using LC-tries. IEEE Journal on Selected Areas in Communications, 17:1083–1092, June 1999.

[9] Jorma Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.


ECE750-TXB Lecture 9: Hashing

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

February 6, 2007


Hash tables

I Recall that a hash table consists of
  I m slots into which we are placing items;
  I a map h : K → [0, m − 1] from key values to slots.
I We put n keys k1, k2, ..., kn into locations h(k1), h(k2), ..., h(kn).
I In the ideal situation we can then locate keys with O(1) operations.



Horner's Rule

I Horner's rule gives an efficient method for evaluating hash functions for sequences, e.g., strings.

I Consider a hash function of the form

h(k) = k mod m

I If we wish to hash a string such as "hello," we can interpret it as a long binary number: in ASCII, "hello" is

    01101000 01100101 01101100 01101100 01101111
       h        e        l        l        o

I As a sequence of integers, "hello" is [104, 101, 108, 108, 111]. We want to compute

    (104 · 2^32 + 101 · 2^24 + 108 · 2^16 + 108 · 2^8 + 111 · 2^0) mod m


I Horner's rule is a general trick for evaluating a polynomial. We write

    ax^3 + bx^2 + cx + d = (ax^2 + bx + c)x + d
                         = ((ax + b)x + c)x + d

So that instead of computing x^3, x^2, ... we have only multiplications:

    t1 = ax + b
    t2 = t1·x + c
    t3 = t2·x + d

I Trivia: some early CPUs included an instruction opcode for applying Horner's rule. May be making a comeback!



I To use Horner's rule for hashing: to compute (a · 2^24 + b · 2^16 + c · 2^8 + d) mod m,

    t1 = (a · 2^8 + b) mod m
    t2 = (t1 · 2^8 + c) mod m
    t3 = (t2 · 2^8 + d) mod m

Note that multiplying by 2^k is simply a shift by k bits.
I Why this works. In short, algebra. The integers Z form a ring under multiplication and addition. The hash function h(k) = k mod m can be interpreted as a homomorphism from the ring Z of integers to the ring Z/mZ of integers modulo m. Homomorphisms preserve structure in the following sense: if we write + for integer addition, and ⊕ for addition modulo m,

    h(a + b) = h(a) ⊕ h(b)

i.e., it doesn't matter whether we compute (a + b) mod m, or compute (a mod m) and (b mod m) and add with modular


Horner’s Rule IVarithmetic: we get the same answer either way. Similarly, if wewrite × for multiplication in Z, and ⊗ for multiplication in Z/mZ,

h(a× b) = h(a)⊗ h(b)

Horner’s rule works precisely because h : Z → Z/mZ is ahomomorphism:

h(((a× 28 + b)× 28 + c)× 28 + d)

= (((h(a)⊗ h(28)⊕ h(b))⊗ h(28)⊕ h(c))⊗ h(28)⊕ h(d))

This can be optimized to use fewer applications of h, as above.In this form it is obvious why m = 28 is a horrible choice for ahash table size: 28 mod 28 = 0, so

(((h(a)⊗ h(28)⊕ h(b))⊗ h(28)⊕ h(c))⊗ h(28)⊕ h(d))

= (((h(a)⊗ 0⊕ h(b))⊗ 0⊕ h(c))⊗ 0⊕ h(d))

= h(d)

i.e., the hash value depends only on the last byte. Similarly, if weused m = 216, we would have h(216) = 0, which would remove allbut the last two bytes from the hash value computation.

For background on algebra see, e.g., [1, 9, 7].
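A minimal sketch of the resulting string hash (Python; the function name is mine). It reduces mod m after every byte, so intermediate values stay small, yet by the homomorphism argument it equals reducing the whole number at once:

    def horner_hash(s, m):
        # Treat the bytes of s as digits base 2^8, reducing mod m at
        # each step, exactly as in the t1, t2, t3, ... recurrence.
        h = 0
        for byte in s.encode("ascii"):
            h = ((h << 8) + byte) % m   # multiply by 2^8 via a shift
        return h

    big = int.from_bytes("hello".encode("ascii"), "big")
    assert horner_hash("hello", 8191) == big % 8191  # same answer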



Collisions

I A collision occurs when two keys map to the same location in the hash table, i.e., there are distinct x, y ∈ M such that h(x) = h(y).
I Strategies for handling collisions:
  1. Pick a value of m large enough so that collisions are rare, and can be easily dealt with e.g., by maintaining a short "overflow" list of items whose hash slot is already occupied.
  2. Pick the hash function h to avoid collisions.
  3. Put another data structure in each hash table slot (a list, tree, or another hash table).
  4. If a hash slot is full, then try some other slots in some fixed sequence (open addressing).


Collision Strategy 1: Pick m big

I Let's see how big m must be for the probability of collisions to be small.
I Two cases:
  I n > m: then there must be a collision, by the pigeonhole principle. (Footnote 1)
  I n ≤ m: may or may not be a collision.
I The "birthday problem": what is the probability that amongst n people, at least two share the same birthday?
I This is a hashing problem: people are keys, days of the year are slots, and h maps people to their birthdays.
I If n ≥ 23, then the probability of two people having the same birthday is > 1/2. (Counterintuitive, but true.)
I The "birthday problem" analysis is straightforward to adapt to hashing.



I Suppose the hash function h and the distribution of keys cooperate to produce a uniform distribution of keys into hash table slots.
I Recall that with a uniform distribution, probability may be computed by simple counting:

    Pr(event E happens) = (# outcomes in which E happens) / (# outcomes)

I First we count the number of hash functions without collisions:
  I There are m choices of where to put the first key; m − 1 choices of where to put the second key; ...; m − n + 1 choices of where to put the nth key.
  I The number of hash functions with no collisions is m^{\underline{n}} = m · (m − 1) ··· (m − n + 1) = m!/(m − n)!. (Footnote 2)
I Next we count the number of hash functions allowing collisions:


  I There are m choices of where to put the first key; m choices of where to put the second key; ...; m choices of where to put the nth key.
  I The number of hash functions allowing collisions is m^n.

I The probability of a collision-free arrangement is

    p = m! / ((m − n)! · m^n)

I Asymptotic estimate of ln p, assuming m ≫ n:

    ln p ∼ −n²/(2m) + n/(2m) + O(n³/m²)    (1)

Here we have used Stirling's approximation and ln(m − n) = ln m − n/m − O(n²/m²).
I Two cases: If n² ≺ m then ln p → 0. If n² ≻ m then ln p → −∞.



I Recall that if ln p = x + ε, then

    p = e^(x+ε)
      = e^x · e^ε
      = e^x (1 + ε + ε² + ···)    (Taylor series)
      = e^x (1 + O(ε))            if ε ∈ o(1)

I The probability of a collision-free arrangement is

    p ∼ e^(−n(n−1)/(2m)) + O( n³ e^(−n(n−1)/(2m)) / m² )

I Interpretation:


  I If m ∈ ω(n²) there are no collisions (almost surely).
  I If m ∈ o(n²) there is a collision (almost surely).
I i.e., if we want a low probability of collisions, our hash table has to be quadratic (or more) in the number of items.

Footnote 1: If m + 1 pigeons are placed in m pigeonholes, there must be two pigeons in the same hole. (Replace "pigeons" with "keys," and "pigeonholes" with "hash slots.")
Footnote 2: The handy notation m^{\underline{n}} is called a "falling power" [8].
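These estimates are easy to check numerically. A sketch (Python; the function name is mine) computing the exact collision-free probability as the falling-power product, next to the asymptotic estimate, for the classic birthday numbers n = 23, m = 365:

    import math

    def p_no_collision(n, m):
        # Exact probability that n uniformly hashed keys are
        # collision-free: m(m-1)...(m-n+1) / m^n.
        p = 1.0
        for i in range(n):
            p *= (m - i) / m
        return p

    n, m = 23, 365
    print(p_no_collision(n, m))              # ~0.4927
    print(math.exp(-n * (n - 1) / (2 * m)))  # ~0.4999, the estimate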



Threshold functions

I m = ½n² is an example of a threshold function:
  I ≺ the threshold, the asymptotic probability of the event is 0;
  I ≻ the threshold, the asymptotic probability of the event is 1.

[Figure: probability of no collision vs. hash table size m, rising from 0 to 1 around the threshold m ≈ n²; the m-axis is marked at n, n^(2−ε), n², n^(2+ε), n³.]


I Picking m big is not an effective strategy for handling collisions.

I For n = 1000 elements, this table shows how big m must be to achieve the desired (small) probability p of a collision:

    p       m
    0.1     5 000 000
    0.01    50 000 000
    10^−6   500 000 000 000
    10^−9   500 000 000 000 000



I The analysis of collisions in hashing demonstrates two pigeonhole principles.
I The simplest pigeonhole principle states that if you put ≥ m + 1 pigeons in m holes, there must be one hole with ≥ 2 pigeons.
I With respect to hash tables, the pigeonhole applies as follows: if a hash table with m slots is used to store ≥ m + 1 elements, there is a collision.
I The probability-of-collision analysis of the previous slide demonstrates a probabilistic pigeonhole principle: if you put ω(√n) pigeons in n holes, there is a hole with ≥ 2 pigeons almost surely (i.e., with probability converging to 1 as n → ∞).


Collision Strategy 2: pick h carefully

I Can we pick our hash function h to avoid collisions?

I For example, if we use hash functions of the form

    h(k) = ⌊m{kφ}⌋

we could try random values of φ ∈ (0, 1) until we found one that was collision-free.
I We have a probability of success

    p ∼ e^(−n(n−1)/(2m)) (1 + o(1))

I Geometric distribution:
  I Probability of success p, probability of failure 1 − p.
  I Each trial independent, identically distributed.
  I Probability that k tries are needed for success = (1 − p)^(k−1) p.
  I Mean: p^(−1).



I Number of values of φ we expect to try before we find a collision-free hash table for n = 1000:

    m        # Expected failures before success
    1000     10^217
    2000     10^109
    10000    10^22
    100000   147

I Picking hash functions randomly in this manner is unlikely to be practical.

I There are better strategies: see [6, 2].


Collision Strategy 3: secondary data structures

I By far the most common technique for handling collisions is to put a secondary data structure in each hash table slot (a code sketch of chaining appears after the analysis below):
  I a linked list ('chaining');
  I a binary search tree (BST);
  I another hash table.
I Let α = n/m be the load factor: the average number of items per hash table slot.
I Assuming uniform distribution of keys into slots:
  I Linked lists require 1 + α steps (on average) to find a key.
  I Suitable BSTs require 1 + max(c log α, 0) steps (on average). (Footnote 3)
  I Using secondary hash tables of size quadratic in the number of elements in the slot, one can achieve O(1) lookups on average, and require only Θ(n) space.



I Analysis of secondary hash tables:
  I Let N_i be a random variable indicating the number of items landing in slot i.
  I E[N_i] = α
  I Var[N_i] = n · (1/m)(1 − 1/m)    (a Bernoulli variance)
  I The space required for secondary hash tables is proportional to

      E[ ∑_{1≤i≤m} N_i² ] = ∑_{1≤i≤m} E[N_i²] = ∑_{1≤i≤m} (Var[N_i] + α²)
                          = m · ( n · (1/m)(1 − 1/m) + n²/m² )
                          ∼ n²/m + n − n/m

    Plus space Θ(m) for the primary hash table, giving Θ(m + n²/m + n). Choosing m = Θ(n) yields linear space.

Footnote 3: The max(···) deals with the possibility that α < 1, in which case log α < 0.
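A minimal sketch of this strategy (Python; the names are mine). For brevity the secondary structure is a list ('chaining') rather than a BST or a secondary hash table; lookups cost the expected 1 + α probes under a uniform hash:

    class ChainedHashTable:
        def __init__(self, m=17):
            self.slots = [[] for _ in range(m)]  # one chain per slot
            self.m = m

        def insert(self, key, value):
            chain = self.slots[hash(key) % self.m]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)      # overwrite existing key
                    return
            chain.append((key, value))

        def lookup(self, key):
            # Expected 1 + alpha probes with a uniform hash.
            for k, v in self.slots[hash(key) % self.m]:
                if k == key:
                    return v
            return None

    t = ChainedHashTable()
    t.insert("larch", 1); t.insert("saxifrage", 2)
    assert t.lookup("saxifrage") == 2 and t.lookup("oak") is None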


Collision Strategy 4: open addressing

I Open addressing is a family of techniques for resolving collisions that do not require secondary data structures. This has the advantage of not requiring any dynamic memory allocation.
I In the simplest scenario we have a function s : H → H that is ideally a permutation of the hash values, for example the "linear probing" function

    s(x) = (x + 1) mod m

I When we attempt to insert a key k, we look in slot h(k), s(h(k)), s(s(h(k))), etc. until an empty slot is found.
I To find a key k, we look in slot h(k), s(h(k)), s(s(h(k))), etc. until either k or an empty slot is found.
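A minimal sketch of linear probing (Python; the names are mine; it assumes the table never becomes completely full, and omits deletion, which is subtle under open addressing):

    class LinearProbingTable:
        def __init__(self, m=17):
            self.table = [None] * m   # None marks an empty slot
            self.m = m

        def insert(self, key):
            i = key % self.m                     # h(k)
            while self.table[i] is not None and self.table[i] != key:
                i = (i + 1) % self.m             # s(x) = (x + 1) mod m
            self.table[i] = key

        def contains(self, key):
            i = key % self.m
            while self.table[i] is not None:     # stop at an empty slot
                if self.table[i] == key:
                    return True
                i = (i + 1) % self.m
            return False

    t = LinearProbingTable()
    for x in [35, 52, 139]:   # 35 and 52 collide: both hash to slot 1
        t.insert(x)
    assert t.contains(52) and not t.contains(69)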



I However, the use of permutations performs badly as the hash table becomes fuller: we tend to get "clumps/clusters," i.e., long sequences h(k), s(h(k)), s(s(h(k))), ... where all the slots are occupied (see e.g. [10]).
I Performance can be good for not very full tables, e.g. α < 2/3. As α → 1, operations begin to take Θ(√n) time [5].
I Quadratic probing offers less clumping: try slots h_0(k), h_1(k), ··· where

    h_i(k) = (h(k) + i²) mod m

h(k) is an initial fixed hash function. If m is prime, the sequence h_i(k) will visit every slot.
I Double hashing uses two hash functions, h1 and h2:

    h_i(k) = (h1(k) + i · h2(k)) mod m

h1(k) gives an initial slot to try; h2(k) gives a 'stride' (reduces to linear probing when h2(k) = 1).


I Under favourable conditions, an open addressing scheme behaves like a geometric distribution when searching for an open slot: the probability of finding an empty slot is 1 − α, so the expected number of trials is 1/(1 − α). Note the catastrophe when α → 1.



Summary of collision strategies

    Strategy                E[access time]         Space
    Choose m big            O(1)                   Ω(n²)
    Linked list             1 + α                  O(n + m)
    Binary search tree      1 + max(c log α, 0)    O(n + m)
    Secondary hash tables   O(1)                   O(n + m)
    Open addressing         1/(1 − α)              O(m)

I Open addressing can be quite effective if α ≪ 1, but fails catastrophically as α → 1.


I If unexpectedly n ≫ m (e.g. we have far more data than we designed for), then α → ∞. For example, if m ∈ O(1) and n ∈ ω(1):
  I a linked list has O(n) accesses;
  I BSTs have O(log n) accesses — they offer a gentler failure mode.
I If the hash function is badly nonuniform:
  I a linked list can be O(n);
  I a BST will have O(log n);
  I secondary hash tables may require O(n²) space.
I To summarize: hash table + BST will give fast search times, and let you sleep at night.
I To maintain O(1) access times as n → ∞, it is necessary to maintain m ∝ n. This can be done by choosing an allowable interval α ∈ [c1, c2]; when α > c2, resize the hash table to make α = c1. So long as c2 > c1, this strategy adds O(1) amortized time per insertion, as in dynamic arrays.



Applications of hashing

I Hashing is a ubiquitous concept, used not just for maintaining collections but also for
  I cryptography
  I combinatorics
  I data mining
  I computational geometry
  I databases
  I router traffic analysis

I An example: probabilistic counting


Probabilistic Counting

I Problem: estimate the number of unique elements in a LARGE collection (e.g., a database, a data stream) without requiring much working space.
I Useful for query optimization in databases [11]:
  I e.g. to evaluate A ∩ B ∩ C, we can do either A ∩ (B ∩ C) or (A ∩ B) ∩ C;
  I one of these might be very fast, one very slow;
  I have rough estimates of |B ∩ C| vs |A ∩ B| to decide which strategy will be faster.



I Less serious (but more readily understood) example:
  I Shakespeare's complete works:
    I N = 884,647 words (or so)
    I n = 28,239 unique words (or so)
    I w = average word length
    I Nmax = a prior estimate on n (here Nmax ≈ n)
I Problem: estimate n — the number of unique words used. Approaches:
  1. Sorting: put all 884,647 words in a list and sort, then count. (Time O(Nw log N), space O(Nw).)
  2. Trie: scan through the words and build a trie, with counters at each node; requires O(nw) space (neglecting the size of counters).
  3. Super-LogLog probabilistic counting [3]: use 128 bytes of space, obtain an estimate of 30897 words (error 9.4%).


I Inputs: a multiset A of elements, possibly with many duplicates (e.g., Shakespeare's plays).
I Problem: estimate card(A): the number of unique elements in A (e.g., the number of distinct words Shakespeare used).
I Simple starting idea: hash the objects into an m-element hash table. Instead of storing keys, just count the number of elements landing in each hash slot.
I Extreme cases to illustrate the principle:
  I Elements of A are all different: we will get an even distribution in the hash table.
  I Elements of A are all the same: we will get one hash table slot with all the elements!
  I The shape of the hash table distribution reflects the frequency of duplicates.



I Linear Counting [11]:
  I Compute hash values in the range [0, Nmax).
  I Maintain a bitmap representing which elements of the hash table would be occupied, and estimate n from the sparsity of the hash table.
  I Uses Θ(Nmax) bits, e.g., on the order of card(A) bits.
I Room for improvement: the precise sparsity pattern doesn't matter: just the number of full vs. empty slots.


I Probabilistic Counting [4]:
  I Compute hash values in the range [0, Nmax).
  I Instead of counting hash values directly, count the occurrence of hash values matching certain patterns:

      Pattern     Expected occurrences
      xxxxxxx1    2^(−1) · card(A)
      xxxxxx10    2^(−2) · card(A)
      xxxxx100    2^(−3) · card(A)
      xxxx1000    2^(−4) · card(A)
      ...         ...

    Use these counts to estimate card(A).
  I To improve accuracy, use m different hash functions.
  I Uses Θ(m log Nmax) storage, and delivers accuracy of O(m^(−1/2)).



I Super-LogLog [3] requires Θ(log log Nmax) bits. With 1.28 kB of memory it can estimate card(A) to within an accuracy of 2.5% for Nmax ≤ 130 million.
I Probabilistic counters: count to N using log log N bits:

[Diagram: a chain of counter states; the counter advances from each state to the next with probability 1/2, 1/4, 1/8, 1/16, ..., and stays put with probability 1/2, 3/4, 7/8, 15/16, ...]

Need log N states, which can be encoded in log log N bits.
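As an illustration of the idea (a Morris-style counter, not the course's Super-LogLog algorithm itself), here is a sketch in Python: the only state kept is k, which advances with probability 2^(−k), and 2^k − 1 is an unbiased estimate of the number of events counted, so k needs only about log log N bits:

    import random

    def probabilistic_count(n_events):
        k = 0
        for _ in range(n_events):
            if random.random() < 2.0 ** -k:  # advance with prob. 2^-k
                k += 1
        return 2 ** k - 1    # E[2^k] = n_events + 1, so this is unbiased

    random.seed(1)
    print([probabilistic_count(100000) for _ in range(5)])
    # estimates scattered around 100000, each from a few bits of state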


Bibliography

[1] Stanley Burris and H. P. Sankappanavar. A Course in Universal Algebra. Springer-Verlag, 1981.

[2] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, and Friedhelm Meyer auf der Heide. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., 23(4):738–761, 1994.

[3] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Giuseppe Di Battista and Uri Zwick, editors, ESA, volume 2832 of Lecture Notes in Computer Science, pages 605–617. Springer, 2003.



[4] Philippe Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, September 1985.

[5] Philippe Flajolet, Patricio V. Poblete, and Alfredo Viola. On the analysis of linear probing hashing. Algorithmica, 22(4):490–515, 1998.

[6] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.


[7] Joseph A. Gallian. Contemporary Abstract Algebra. D. C. Heath and Company, Toronto, 3rd edition, 1994.

[8] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA, USA, second edition, 1994.

[9] Saunders MacLane and Garrett Birkhoff. Algebra. Chelsea Publishing Co., New York, third edition, 1988.



[10] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, Reading, MA, 1996.

[11] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2):208–229, 1990.


ECE750-TXB Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

March 1, 2007


Part I: Design Tradeoffs



Theme: Design Tradeoffs

I Tradeoffs between design parameters: a recurring theme in algorithms & data structures.
I Examples:
  I By making a hash table bigger, we can decrease α (the load factor) and achieve faster search times. (A tradeoff between space and time.)
  I In designing circuits to add n-bit integers, we can obtain very low delays (the maximum number of gates between inputs and outputs) by increasing the number of gates: trading time (delay) for area (number of gates).
  I In many tasks we can trade the precision of an answer for time and space, e.g., responding quickly to database queries with an estimate of the answer, rather than the exact answer.


I Design tradeoffs are often parameterizable.
  I For example, in speed/accuracy tradeoffs we don't usually have to choose either speed or accuracy. Instead we have a parameter ε — the allowable error — that we can adjust.
  I With ε large we get fast (but possibly not very accurate) answers.
  I As ε → 0 we get very accurate answers that take longer to compute.
I Let's look at an example of a tradeoff in the design of data structures.



Design Tradeoff: Hash tables vs. BSTs

I Consider representing a collection of n keys drawn from an ordered structure ⟨K, ≤⟩.
I A (balanced) binary search tree (BST) has Θ(log n) search times.
I A hash table has Θ(1) search (if we keep the size on the order of the number of elements, and choose an appropriate hash function).
I Difference between these two data structures:
  I A BST allows us to iterate through the elements in order, using Θ(log n) working space. The Θ(log n) space is used to record the path from the root to the iterator position in a stack.
  I Items in a hash table are not stored in order — if we want to iterate through them in order, we need extra space and time, e.g. Θ(n) space for a temporary array and Θ(n log n) time to sort the items.


I We can view BSTs and hash tables as two points in a design space:

    Data structure        Search time   Working space for ordered iteration
    Hash table            Θ(1)          Θ(n)
    Binary search tree    Θ(log n)      Θ(log n)

I Suppose:
  I We have a very large (n = 10⁹) collection of keys that barely fits in memory.
  I Dynamic: keys added and removed frequently.
  I We need fast search, fast insert, fast remove.
  I Red-black tree: height is ≈ 2 log n ≈ 61 levels.
  I We need to be able to iterate through the collection in order.
  I There is not enough room in memory to create a temporary array for sorting; also, this would be prohibitively slow.



I Let's make a simple data structure that will offer a smoother tradeoff between search time and the working space required for an ordered iteration.
I If you think of BST + hash table as two points in a design space, we want a structure that will 'interpolate' smoothly between them.

[Figure: the design space, with 'working space for ordered iteration' on one axis (from log n to n) and search time on the other (from 1 to c log n); the binary search tree and the hash table sit at the two extreme corners.]


I Consider a hash table of m slots, using a BST in each slot to resolve collisions:

[Figure: a hash table of m slots, each slot pointing to a small binary search tree of keys.]

I Observation:
  I When m = 1 we have a degenerate hash table with a single slot.
  I All the keys are put in a single BST.
  I So, choosing m = 1 essentially gives us a BST: we can iterate through the keys in order, and search requires c log n steps, where c reflects the average tree depth.



I What about the case m = 2?
  I We have a hash table with two slots. If the hash function is good, we get two BSTs of roughly n/2 keys apiece.
  I Search time is about c log(n/2).
  I Can we iterate through the keys in order?
  I Yes: have two iterators, one for each tree. Initially the two iterators point at the smallest key in their tree.
  I At each step of the iteration, choose the iterator that is pointing at the smaller of the two keys. Retrieve that key, and advance the iterator.
I Generalize: if we choose an arbitrary m,
  I We will have m BSTs of average size n/m.
  I Search times will be around c log(n/m), assuming m ≪ n.
  I To iterate through the keys in order,
    I Obtain m iterators, one for each tree.
    I At each step, choose the iterator pointing at the smallest key, retrieve that key and advance the iterator.


I To do this efficiently, we need a fast way to maintain a collection (of iterators) that lets us quickly obtain the one with the smallest value (the iterator pointing at the smallest key).
I Easy: a min-heap.
I Our algorithm for ordered iteration will look like this (a code sketch follows this list):
  1. Create an array of m BST iterators, one for each hash table slot.
  2. Turn this array into a min-heap, ordering iterators by the key they are pointing at. (The heap can be built in O(m) time.)
  3. To obtain the next element,
    3.1 Remove the least element from the min-heap. (This takes O(log m) time.)
    3.2 Obtain its key, and advance the iterator. (Advancing a BST iterator requires O(1) amortized time.)
    3.3 Put the iterator back into the min-heap. (This takes O(log m) time.)
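A minimal sketch of this ordered iteration (Python, using the standard heapq module; for brevity each slot is any sorted iterable, standing in for a per-slot BST iterator):

    import heapq

    def ordered_iteration(slots):
        # Merge m per-slot iterators with a min-heap, as in steps 1-3.
        heap = []
        for i, slot in enumerate(slots):
            it = iter(slot)
            first = next(it, None)
            if first is not None:
                heap.append((first, i, it))
        heapq.heapify(heap)                   # O(m) heap construction
        while heap:
            key, i, it = heapq.heappop(heap)  # O(log m) per key
            yield key
            nxt = next(it, None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, i, it))

    slots = [[2, 9, 11], [1, 5], [3, 4, 10]]  # three "hash slots"
    print(list(ordered_iteration(slots)))     # [1, 2, 3, 4, 5, 9, 10, 11]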



I We can iterate through the keys in order in time O(n(1 + log m)):
  I O(m) time to obtain the iterators and build the heap;
  I O(1 + log m) time per key to adjust the heap, times n keys = O(n log m). (The 1 + ··· handles the case m = 1.)
  I Overall, O(n(1 + log m)) time, assuming m ≪ n.
I The space required for iterating through the keys in order is O(m(1 + log(n/m))):
  I We need m iterators, one per hash table slot.
  I Each iterator requires space O(1 + log(n/m)), on average, for a stack recording its position in the tree. (The 1 + ··· handles the case where n = m.)
I The number of steps for searching is on average 1 + c log(n/m), where c is a constant depending on the kind of BST we choose. The constant 1 is added to reflect visiting the correct slot in the hash table, and to handle the case where m = n, in which case c log(n/m) = 0, and having 0 search steps doesn't make sense.


I Looking at these complexities, a sensible parameterization is m = n^(1−β):
  I When β = 0, m = n and we get a hash table;
  I When β = 1, m = 1 and we get a BST.
I Space and time:
  I Number of search steps is ≈ 1 + c log(n/m) = 1 + c log(n/n^(1−β)) = 1 + c log n^β = 1 + βc log n.
  I β directly multiplies our search time: choosing β = 1/2 halves our search time.
  I Working space for ordered iteration is O(m(1 + log(n/m))) = O(n^(1−β)(1 + log n^β)).
  I E.g., if we choose β = 1/2 we are twice as fast as a BST for searching, and require O(√n log n) working space for ordered iteration.
  I The amount of extra space we need for ordered iteration, relative to the space needed to store the keys, is ≈ n^(1−β)(1 + β log n)/n = n^(−β)(1 + β log n). NB: if β > 0 the relative space overhead for supplying ordered iteration is → 0.



I Let's look at some real-life numbers. Take n = 10⁹ keys.
I Assume we use red-black trees, so that the average depth of keys in a tree of n/m elements is ≤ 2 log(n/m).

    Parameter β   #Search steps     Space for iter.         Space overhead
                  1 + β·2 log n     n^(1−β)(1 + β log n)    n^(−β)(1 + β log n)
    (Hash)  0     1                 1000000000              100%
            1/8   4.7               355237568               35%
            1/4   16.0              47654705                4.7%
            1/2   31.9              504341                  0.05%
            3/4   45.8              4165                    0.0004%
            7/8   53.3              362                     0.00004%
    (BST)   1     60.8              31                      0.000003%

I e.g. Choosing β = 1/4, we can get searches 4 times faster than the plain red-black tree, and have only a 4.7% space overhead for ordered iteration.
I Choosing β = 1/2, we can get searches twice as fast as a plain red-black tree, with a 0.05% space overhead for ordered iteration.


Part II: Introduction to Average-Case Analysis



Average-case Analysis

I Worst-case analysis is very important for some applications (e.g., real-time systems: worst-case execution time), and in theoretical computer science.
I However, average-case performance is usually more important for practical engineering work.
I Given a choice between a data structure that always finds an item in 253 steps, vs one that finds an item in 5 steps on average (but with probability 10^(−12) takes more than 10000 steps), we would usually choose the fast-on-average data structure.
I In practice we are usually interested in performance for the average case, rather than worst case.


I We can often find algorithms + data structures that are much more efficient on average than the best worst-case data structures.
I We shall see that randomness can be an extremely effective tool for achieving good average-case performance.
I Example: uniquely represented dictionaries.
  I It is known that deterministic (no randomness) tree-based uniquely represented dictionaries require Ω(√n) time for insert/search operations.
  I Sundar-Tarjan trees [3] achieve this bound.
  I Treaps are uniquely represented and on average achieve O(log n) search and insert. However, with vanishingly small probability they may require O(n) time.


- Example: QuickSort
  - QuickSort is the standard sorting algorithm. It achieves O(n log n) time on average, and is faster in practice than other algorithms. However, in the worst case it requires O(n^2) time.
  - Merge Sort requires O(n log n) time in the worst case, but is slower in practice than QuickSort.
- Example: Searching
  - Binary search requires O(log n) time in the worst case.
  - Interpolation search [1] requires O(log log n) time on average, if the data is uniformly distributed. However, it can require O(n) time with pathological distributions. (A sketch follows this list.)
  - The function log log n is so slowly growing as to be effectively constant: log log 10^10 ≈ 5; log log 10^20 ≈ 6; log log 10^40 ≈ 7.
  - We can often use hashing to make an arbitrary key distribution uniform.
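As an illustration of the searching example, here is a small sketch of interpolation search on sorted integer keys (illustrative code, not from the original notes):

```python
def interpolation_search(a, key):
    """Return an index of key in sorted integer list a, or -1.

    Probes where the key 'should' sit under a uniform distribution:
    O(log log n) probes on average for uniform data, O(n) worst case.
    """
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= key <= a[hi]:
        if a[hi] == a[lo]:              # avoid division by zero
            break
        # Interpolate a probe position instead of taking the midpoint.
        mid = lo + (hi - lo) * (key - a[lo]) // (a[hi] - a[lo])
        if a[mid] == key:
            return mid
        if a[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return lo if lo <= hi and a[lo] == key else -1

print(interpolation_search([2, 9, 21, 30, 44, 58], 44))  # 4
```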


Some key ideas we will explore I

1. We can always obtain an algorithm that combines the best average-case and best worst-case bounds of any algorithms we have available. (We needn't settle for "fast on average, but occasionally catastrophically slow.")
2. If an algorithm has O(f(n)) average time, the probability of taking ω(f(n)) time goes to zero.
3. The amount of randomness, or entropy, of an input distribution plays a critical role in the performance of algorithms.
   - Randomness can help average-case performance: if there are comparatively few "worst cases," then as long as we have at least a certain amount of randomness in the inputs, those worst cases do not contribute to the average running time.


Some key ideas we will explore II

- Randomness can hurt average-case performance: we can design BSTs so that search time depends not on the number of keys n, but just on the amount of randomness in the distribution of keys we are asked to search for. In this case, the less randomness, the better!


Best of both worlds

- Algorithms that are good in the average case occasionally break down on pathological examples, e.g., inputs that make QuickSort run in O(n^2) time, cause search trees of height O(n), etc.
- Suppose we have a pair of algorithms:
  - Algorithm A has average-case time Θ(f(n)) and worst case Θ(g(n));
  - Algorithm B has average-case time Θ(f′(n)) and worst case Θ(g′(n)).
- We can always construct an algorithm that has the best of both:
  - average-case time Θ(min(f(n), f′(n))), and
  - worst-case time Θ(min(g(n), g′(n))).


- Easy: Run A and B in parallel (or interleaved), and return the result of the first finisher.

  [Figure: the input is fed to Algorithm A and Algorithm B side by side; the first finisher supplies the answer.]

- E.g., we can simultaneously perform a binary search and an interpolation search: this gives an algorithm with O(log log n) average case, and O(log n) worst case [2]. (A sketch of this interleaving follows below.)
- However, note that this may entail larger constant factors and extra space compared to using a single algorithm.
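A minimal sketch of the "first finisher" construction (illustrative, not from the original notes; it assumes each algorithm can be written as a generator that yields once per unit of work):

```python
def first_finisher(gen_a, gen_b):
    """Interleave two step-wise computations; return the first answer.

    Run in lockstep, total work is at most ~2x the faster algorithm's,
    preserving both the better average case and the better worst case.
    """
    while True:
        for g in (gen_a, gen_b):
            try:
                next(g)                       # one step of this algorithm
            except StopIteration as done:
                return done.value             # its final answer

def linear_scan(a, key):                      # slow stand-in algorithm
    for i, x in enumerate(a):
        yield
        if x == key:
            return i
    return -1

def binary_search(a, key):                    # fast stand-in algorithm
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        yield
        mid = (lo + hi) // 2
        if a[mid] == key:
            return mid
        lo, hi = (mid + 1, hi) if a[mid] < key else (lo, mid - 1)
    return -1

a = list(range(0, 100, 2))
print(first_finisher(binary_search(a, 34), linear_scan(a, 34)))  # 17
```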


Applications of randomness

- Here are a few of the applications of randomness we will encounter:
  - Modelling the distribution of inputs, so we can:
    - compute average-case performance;
    - determine the structure of "typical" inputs, so we can tune our algorithm to them;
  - Exploiting the amount of randomness (entropy) of inputs to achieve better performance;
  - Using randomness to force some quantity of interest into a desirable distribution (e.g., height of a treap);
  - Using randomness to foil an adversary: e.g., our algorithm/data structure could perform poorly only if the entity generating the queries could predict some random sequence we were using;
  - Using randomness to break symmetry, e.g., leader election in distributed systems;
  - Using randomness to efficiently approximate answers in extremely small amounts of time or space.
- Random distributions of inputs can cause complexity classes to collapse, so that problems that were hard to solve efficiently suddenly become efficiently solvable on average.


Typical inputs I

- We can use tools of average-case analysis to characterize the "typical case" for which we should tune our algorithms.
- Simple example: find the first nonzero bit in a binary string of n bits.¹
- Simple strategy: scan the string from left to right, stop when a 1 is encountered.
- Clearly has worst case O(n).
- What does a typical input look like?
  - With a uniform distribution on n-bit strings, each bit is 0 or 1 with probability 1/2.


Typical inputs II

- Waiting time to encounter a 1 follows a geometric distribution, with probability of success p = 1/2:

      Bitstring pattern    Probability
      1xxxxx···            1/2
      01xxxx···            1/4
      001xxx···            1/8
      ...                  ...

- Mean of a geometric distribution is 1/p = 2.
- Conclusion: on average we encounter a 1 after 1/p = 2 bits. The running time of our naive "scan left to right looking for a 1" algorithm is Θ(1) — it does not depend on n.

¹ In practice, many CPUs have a single instruction that will compute this for you.


Typical inputs

- This is an example of an exponential concentration:

  [Plot: probability of the first nonzero bit being at location 1, 2, 3, 4, ... falls off as 1/2, 1/4, 1/8, 1/16, ...]

- Probability is concentrated around the mean.
- Probability of the first nonzero bit being ≥ 2 + δ is ≤ 2^(−δ−1) = O(2^(−δ)): it falls off exponentially quickly.
- Exponential concentrations are enormously useful.
- An exponential concentration can swallow any polynomial function: if f(n) = O(n^a) for a ∈ N, then

      ∑_{δ=0}^{∞} O(2^(−δ)) · f(c + δ) = O(1)

  The tail of the distribution can contribute only an O(1) factor when we compute an average of f(···).


Typical inputs

- On a more practical note, understanding typical inputs can help us engineer fast implementations.
- Consider our "first nonzero bit" example:²
  - From our analysis we know that with probability 15/16 ≈ 0.94 we will encounter a 1 in the first 4 bits.
  - Design: have a lookup table indexed by the first four bits: 94% of the time we can just glance in the table and return.
  - For the other ≈ 6% of the time, scan the remaining bits to find the first 1. (A sketch follows below.)

² Again, this is only 'for example': in practice, if you were at all interested in performance, you would be using a single CPU instruction for this.
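A sketch of the table-driven design just described (illustrative code, not from the original notes; bit strings are modelled as lists of 0/1 values):

```python
# TABLE[v] = position (0-3) of the first 1 in the 4-bit value v
# (most significant bit first); TABLE[0] is unused (no 1 present).
TABLE = [None] + [next(i for i in range(4) if (v >> (3 - i)) & 1)
                  for v in range(1, 16)]

def first_one(bits):
    """Index of the first 1 in a 0/1 list, or -1 if all zeros."""
    if len(bits) >= 4:
        v = bits[0] << 3 | bits[1] << 2 | bits[2] << 1 | bits[3]
        if v:                       # ~94% of uniform inputs end here
            return TABLE[v]
        start = 4                   # rare case: fall back to scanning
    else:
        start = 0
    for i in range(start, len(bits)):
        if bits[i]:
            return i
    return -1

print(first_one([0, 0, 1, 0, 1]))  # 2
```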


Average-case time

- We will consider several equivalent definitions of average-case performance. First, an informal definition:

  An algorithm runs in average time O(f(n)) (respectively Θ, Ω, o, ω) if the average time T̄(n) is O(f(n)), where

      T̄(n) = ∑_{inputs w of size n} Pr(w) · T(w)

  with Pr(w) the probability of input w, and T(w) the time of the algorithm on input w.


I Let’s make this more precise.

I For each n, let Kn be all possible inputs of size n.I For each n, let µn : Kn → R be a probability

distribution on Kn.I µn(w) ≥ 0 for all w ∈ Kn. (Probabilities are positive.)I

∑w∈Kn

µn(w) = 1. (Probabilities sum to 1.)

I Example: Bit stringsI Kn = 0, 1n, e.g., K2 = 00, 01, 10, 11.I We often choose the uniform distribution µn(w) = 1

2n .e.g. µ2(00) = 1

4 , µ2(01) = 14 , etc.

I (Kn)n∈N is a family of sets indexed by n.

I Similarly, (µn)n∈N is a family of distributions. We oftencall (µn)n∈N an asymptotic distribution.


- With these definitions in hand, our first formal definition:

  Definition (1). Let the average time T̄(n) for inputs of size n be given by

      T̄(n) = ∑_{w∈K_n} µ_n(w) T(w)

  An algorithm has average-case time O(f(n)) if and only if T̄(n) ∈ O(f(n)).
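For small n this definition can be checked by brute force. An illustrative sketch (not from the original notes), using the "first nonzero bit" scan with the uniform µ_n:

```python
from itertools import product

def scan_steps(w):
    """Steps used by the left-to-right scan for the first 1 in w."""
    for i, bit in enumerate(w):
        if bit == 1:
            return i + 1
    return len(w)                   # all-zero string: scanned everything

def average_time(n):
    """T(n) = sum over K_n of mu_n(w) * T(w), with mu_n(w) = 2^-n."""
    return sum(scan_steps(w) for w in product((0, 1), repeat=n)) / 2 ** n

for n in (4, 8, 12):
    print(n, average_time(n))       # ~1.88, ~1.99, ~2.00: tends to 2
```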


Bibliography I

[1] Yehoshua Perl, Alon Itai, and Haim Avni. Interpolation search—a log log n search. Commun. ACM, 21(7):550–553, 1978.

[2] Nicola Santoro and Jeffrey B. Sidney. Interpolation-binary search. Inform. Process. Lett., 20(4):179–181, 1985.

[3] R. Sundar and R. E. Tarjan. Unique binary search tree representations and equality-testing of sets and sequences. In Baruch Awerbuch, editor, Proceedings of the 22nd Annual ACM Symposium on the Theory of Computing, pages 18–25, Baltimore, MD, May 1990. ACM Press.


ECE750-TXB Lecture 11: Probability

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

February 28, 2007


Twentieth-Century Probability

- Two foundational questions:
  1. How to define probability of infinite sequences in a meaningful way?
  2. What is randomness? What is a "random sequence"?
- Two landmarks:
  1. Andrei N. Kolmogorov (1933): the rigorous formulation of probability theory using measure theory, allowing a consistent treatment of both finite and infinite sample spaces.
  2. Per Martin-Löf (1966): an acceptable definition of random sequences using constructive measure theory. Martin-Löf's definition implies a sequence is random if a computer is incapable of compressing it. [2]


Probability and Measure I

- Modern probability theory is based on the concept of a measure. A measure generalizes the idea of "volumes," "lengths," "probabilities," and so forth.
- Consider defining a "length measure" that assigns a measure to subsets of the real line.
- Recall the following notations for closed and open intervals:

      [a, b] = {x ∈ R : a ≤ x ≤ b}
      (a, b) = {x ∈ R : a < x < b}

  where we require a ≤ b.
- We will say that the interval [a, b] has measure b − a.
- The empty set ∅ has measure 0.
- This is the usual definition of the measure of an interval, called Lebesgue measure.

Probability and Measure II

- Lebesgue measure is a function µ that assigns measures to certain subsets of the reals R. For example, µ([1, 2]) = 1.
- What should the measure of [1, 2] ∪ [3, 5] be?
  - [1, 2] and [3, 5] are disjoint sets, so we should be able to add their measures: µ([1, 2]) = 1 and µ([3, 5]) = 2, so we can set µ([1, 2] ∪ [3, 5]) = 3.
- In general, if A and B are disjoint sets,

      µ(A ∪ B) = µ(A) + µ(B)

- What should be the measure of the open interval (1, 2) (that is, [1, 2] with the endpoints removed)?
  - Note that [1, 2] = (1, 2) ∪ {1} ∪ {2}.
  - We can write the set {1} as [1, 1], and by our previous definition

        µ([1, 1]) = 1 − 1 = 0

    and similarly for {2}: "points have no length."


Probability and Measure III

- We can then apply the disjoint-sets rule to say that

      µ([1, 2]) = µ({1} ∪ (1, 2) ∪ {2})
                = µ({1}) + µ((1, 2)) + µ({2})
                = 0 + µ((1, 2)) + 0

  therefore µ([1, 2]) = µ((1, 2)).
- By similar reasoning, µ([a, b]) = µ((a, b)).
- The disjoint-sets rule implies we can combine any finite number of disjoint sets:

      µ(A_1 ∪ A_2 ∪ ··· ∪ A_n) = ∑_{i=1}^{n} µ(A_i)

  Should this extend also to an infinite collection of disjoint sets?

Probability and Measure IV

- For example, we can construct the interval (0, 1] as the union of the intervals

      (0, 1] = (1/2, 1] ∪ (1/4, 1/2] ∪ (1/8, 1/4] ∪ ···

  We would like to say that

      µ( ⋃_{i=0}^{∞} (2^(−i−1), 2^(−i)] ) = ∑_{i=0}^{∞} 2^(−i−1) = 1

- We could also build the interval (0, 1] as a union of points: ⋃_{0<x≤1} {x}. But, points have measure 0. If we allowed the disjoint-sets rule for arbitrary (here uncountable) collections of sets, we could calculate:

      µ( (0, 1] ) = µ( ⋃_{x∈(0,1]} {x} ) = ∑_{x∈(0,1]} µ({x}) = ∑_{x∈(0,1]} 0 = 0

  This is an inconsistency.


Probability and Measure V

- To avoid this particular inconsistency, we can restrict the union-of-disjoint-sets rule to countable sequences of sets, i.e., collections of sets that can be put into one-to-one correspondence with the naturals:
  - If A_1, A_2, ··· are a countable sequence of pairwise disjoint sets, then

        µ( ⋃_{i∈N} A_i ) = ∑_{i∈N} µ(A_i)

- However, even this restriction (to countable sequences of sets) is not enough. In certain flavours of set theory (e.g., ZFC: the usual set theory with the axiom of choice), a measure cannot be assigned to every subset of R in a consistent way; this leads to contradictions like being able to chop up the unit interval (measure 1) into pieces and reassemble it into something of measure 2. (If interested, see Vitali sets, Banach-Tarski paradox.)


Probability and Measure VI

- Measure theory sidesteps these inconsistencies by declaring certain sets to be nonmeasurable. In Lebesgue measure on the real line, measurable sets are defined by the following rules:
  1. R is measurable, and has measure µ(R) = ∞.
  2. All intervals of the form [a, b] are measurable.
  3. If A is measurable, then its complement R \ A is measurable.
  4. The union of a finite or countable sequence of measurable sets is measurable.
- Sets that cannot be constructed by the above rules are deemed nonmeasurable (e.g., the Vitali sets just mentioned).
- So, a measure space consists of three things:
  1. A set Ω on which measures are being defined (e.g., R);
  2. A set F of measurable sets. Each X ∈ F is a subset of Ω;
  3. A measure µ : F → [0, ∞].


Probability Spaces I

- A probability space comprises:
  - A sample space of outcomes. For continuous distributions the sample space is often R or R^n; for discrete distributions the sample space is often Z, N, {0, 1}, etc.
  - A class of measurable sets of the sample space;
  - A probability measure that assigns probabilities to events.
- A probability space is a triple (Ω, F, µ) where
  - Ω is a sample space. The elements x ∈ Ω are outcomes.
  - F is a collection of subsets of Ω we call events (the measurable sets);
  - µ : F → R is a probability measure.
- Of the events F, these properties are required:
  - Ω ∈ F (we can measure the whole sample space);
  - F is closed under complementation and countable union:
    - If X ∈ F then so is its complement (Ω \ X) ∈ F;


Probability Spaces II

    - If X_1, X_2, ··· are in F, then so is ⋃_{i∈N} X_i.
- The probability measure µ must satisfy these properties (the Kolmogorov axioms):
  1. Probabilities are positive: for every X ∈ F, µ(X) ≥ 0.
  2. µ(Ω) = 1. (Probabilities sum to 1.)
  3. For any finite or countable sequence of pairwise disjoint events X_1, X_2, ···,

         µ( ⋃ X_i ) = ∑_i µ(X_i)

     (The probability of one of the events X_i happening is the sum of their probabilities.)
- Finite probability spaces are very simple. We usually take F = 2^Ω (the powerset of Ω), i.e., every subset of Ω is measurable.


Probability Spaces III

- Example: a random bit. We can define the probability space by Ω = {0, 1}, F = {∅, {0}, {1}, {0, 1}}, and

      µ(∅) = 0
      µ({0}) = 1/2
      µ({1}) = 1/2
      µ({0, 1}) = 1

- We can take a product of two probability spaces. For a uniform distribution on two bits, we can use the probability space

      (Ω², F², µ²) = (Ω, F, µ) × (Ω, F, µ)

  which has
  - Sample space Ω² = Ω × Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}, i.e., all combinations of two bits;


Probability Spaces IV

  - F² = F × F, containing all pairs of events drawn from F;
  - Probability measure µ² defined by

        µ²(X, Y) = µ(X) µ(Y)

    for all X, Y. Note this implies that events in the first probability space of the product are independent from those of the second.
- We can repeat this process to obtain probability measures µ³, µ⁴, ... on any finite number of bits.
- One of the useful consequences of the measure-theoretic treatment of probability is the Kolmogorov extension theorem, which says that the finite distributions µ¹, µ², µ³, ... (on bit sequences of length 1, 2, 3, etc.) define a unique stochastic process:
  - a probability space whose sample space Ω^ω is infinite binary sequences,


Probability Spaces V

  - with an appropriate set F of measurable sets of sequences,
  - with a probability measure µ^ω on those measurable sets;
  - such that the finite projections of µ^ω (e.g., the first k bits) match the finite distributions µ¹, µ², ···.
- Any set of finite distributions satisfying the Kolmogorov consistency conditions can be extended to a random process in this way.
- This particular stochastic process is an example of a Bernoulli process, in which outcomes are sequences of digits drawn from a binary alphabet.


Probability Basics I

- Informally, we will just write Pr(···) to mean the probability of some event; the implication is that "···" specifies some measurable event X ∈ F.
- For example, when we write Pr(Z ≥ 0) we are referring to µ(X) where X ∈ F is the set of outcomes in which Z ≥ 0. (Strictly speaking, it would not make sense to write µ(Z ≥ 0) because 'Z ≥ 0' is a formula, rather than a subset of Ω.)

Probability essentials.

1. Independence. Two events X and Y are said to be independent if and only if

       µ(X ∩ Y) = µ(X) · µ(Y)

   i.e., Pr(both X and Y happen) = Pr(X) · Pr(Y).


Probability Basics II

   Similarly, events X_1, X_2, ··· are independent if and only if

       µ( ⋂ X_i ) = ∏_i µ(X_i)

2. Union bound. For any finite or countable set of events X_1, X_2, ···,

       µ( ⋃ X_i ) ≤ ∑_i µ(X_i)    (1)

   Eqn. (1) is an equality when the X_i are pairwise disjoint. The union bound is often used to obtain an upper bound on the probability of some rare event happening.

   - Example: what is the probability that a binary string of length n chosen uniformly at random contains a run of ≥ √n ones? (For convenience, we limit ourselves to n = k² for some integer k.)


Probability Basics III

   - Uniformly at random means the probability space is defined as a product measure Ωⁿ with Ω = {0, 1}, and µ_1(0) = µ_1(1) = 1/2.
   - Define X_i to be the event that the string contains a run of √n ones starting at position i, where i ≤ n − √n.
   - The probability of a run of √n 1's starting at position i is µ_n(X_i) = 2^(−√n). (This is obvious, but to be finicky we could consider every possible string having such a run starting at i. Each such string w is an event {w}; the events are pairwise disjoint; there are 2^(n−√n) such strings, each with probability 2^(−n), so summing over the pairwise disjoint events (cf. Kolmogorov's 3rd axiom) the sum comes out to 2^(−√n).)
   - The events X_1, X_2, ···, X_{n−√n} are definitely not independent. However, we can use the union bound to obtain

         Pr(run of √n ones) ≤ ∑_{i=1}^{n−√n} µ(X_i) = (n − √n) · 2^(−√n)


Probability Basics IV

   So, the probability of having a run of √n ones is O(n 2^(−√n)), which is going to zero very quickly. (Note we used O(·), not Θ(·), because we have a possibly loose upper bound.)

3. Inclusion-Exclusion. For any two events X_1 and X_2,

       µ(X_1 ∪ X_2) = µ(X_1) + µ(X_2) − µ(X_1 ∩ X_2)

   More generally, for any events X_1, X_2, ···,

       µ( ⋃ X_i ) = ∑_i µ(X_i) − ∑_{i<j} µ(X_i ∩ X_j) + ∑_{i<j<k} µ(X_i ∩ X_j ∩ X_k) − ···


Probability Basics V

   - This can be easily remembered by looking at Venn diagrams: sum up the areas, subtract the things you counted twice, then add in the things you subtracted too many times, ...
   - Inclusion-Exclusion is a particular instance of a general principle called Möbius inversion.

4. Random variables. In the measure-theoretic treatment of probability, a random variable Z is a function Z : Ω → V from outcomes to some set V. Commonly V is R (a continuous random variable), N (a discrete random variable), or {0, 1} (an indicator variable or Bernoulli variable).

   - Example. Take bitstrings of length n again. Let Z be a random variable counting the number of ones in the string. Then formally Z is a function from {0, 1}ⁿ → N, so that for example

         Z(010010110) = 4


Probability Basics VI

5. Indicator (Bernoulli) random variables. In the special case that a random variable takes on only the values 0 and 1, it is called an indicator variable or Bernoulli random variable. We can associate each event E ∈ F with an indicator variable Z_E that is 1 if E occurs, and 0 otherwise, i.e.,

       Z_E(X) = 1 if X ∈ E, and 0 otherwise

6. Expectation. For a random variable Z, we write E[Z] for the expected value of Z. This is simply the average over the sample space. For a finite probability space (i.e., the number of outcomes |Ω| is finite),

       E[Z] ≡ ∑_{X∈Ω} Z(X) µ({X})    (2)

   But in the general case (i.e., a possibly infinite sample space) the expectation is an integral over the probability space.


   In the case that Ω = R, this is the familiar integral on the real line defined in terms of pdf's. I.e., if F(x) = µ( (−∞, x] ) (the cdf), and f(x) = F′(x) (the pdf), then

       E[Z] = ∫_{−∞}^{+∞} x f(x) dx

7. Linearity of Expectation. If Z_1, Z_2, ... are random variables, then

       E[ ∑_i Z_i ] = ∑_i E[Z_i]

   - The usefulness of this cannot be overstated! The random variables may be far from independent, but we can still sum their expectations.


Probability Basics VIII

   - In particular, the combination of indicator variables with linearity of expectation is very powerful, and one of the most basic tools of the Probabilistic Method [1].
   - In algorithm analysis, we can sometimes choose a set of indicator variables Z_1, Z_2, ... where each Z_i represents some piece of work we may or may not have to do. The expected value of the sum E[Z_1 + Z_2 + ···] is an upper bound on the average amount of work we need to do. For example, in [3, §2.5] you can find an ingeniously simple analysis of the average time complexity of QuickSort using this method.
   - It can also be used to characterize what the typical "largest occurrence" of some pattern is in a random object. For example, what is the expected length of the longest run of 1's in a random n-bit string?
     - Let t be the length of a run. (We will solve for t to find the most likely situation.)


Probability Basics IX

     - Let X_1, X_2, ..., X_{n−t+1} be indicator random variables, where X_i = 1 just when a run of ≥ t 1's starts at position i. The probability of a run of t 1's is 2^(−t), so Pr(X_i = 1) = 2^(−t).
     - Let Y = X_1 + X_2 + ··· + X_{n−t+1} be the number of runs of length ≥ t. Although the X_i are not independent, we can use linearity of expectation to obtain

           E[Y] = E[X_1 + X_2 + ··· + X_{n−t+1}]
                = E[X_1] + E[X_2] + ··· + E[X_{n−t+1}]
                = (n − t + 1) 2^(−t)

       To find a likely value for t, we can set E[Y] = 1: i.e., we want to find the value of t for which we expect to have one run of t 1's.


Probability Basics X

     - This is a simple asymptotics exercise: taking logarithms of E[Y] = 1, we obtain

           t = log(n − t + 1)
             = log(n) − Θ(t/n)
             = log(n) − Θ((log n)/n)

     - Note also the similarity to the union bound: if we view the X_i as events (rather than random variables), we can say

           Pr( ⋃_i X_i ) ≤ ∑_i Pr(X_i) = (n − t + 1) 2^(−t)

       This works because the expected value of an indicator variable is just its probability of being 1.


Probability Basics XI

     - We can combine the above results to obtain a concentration inequality for the length of the longest run: set t = log(n) + δ. Then

           Pr(run of length ≥ log(n) + δ) ≤ (n − log n − δ + 1) 2^(−log n − δ)
                                          = (n − log n − δ + 1) (1/n) 2^(−δ)
                                          = (1 − o(1)) 2^(−δ)

       So, with very little work we have obtained an exponential concentration for the length of the longest run.
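A quick empirical check of this concentration (an illustrative sketch, not from the original notes): estimate the longest run of 1's in uniform random n-bit strings and compare it with log2 n:

```python
import random

def longest_run(n, rng):
    """Longest run of 1's in a uniform random n-bit string."""
    best = cur = 0
    for _ in range(n):
        cur = cur + 1 if rng.getrandbits(1) else 0
        best = max(best, cur)
    return best

rng = random.Random(1)
n, trials = 1 << 16, 200
runs = [longest_run(n, rng) for _ in range(trials)]
print(sum(runs) / trials)  # typically close to log2(n) = 16
print(max(runs))           # rarely far above 16: exponential tail in delta
```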

8. Markov’s inequality. This inequality gives us quick butloose bounds on the deviation of random variables fromtheir expectation. If X is a random variable and α > 0,then

Pr(|X | ≥ α) ≤ E[|X |]α


Probability Basics XII

   If X takes on only positive values, then the |·|'s may be dropped.

   - Example. Recall that the expected height of a binary search tree constructed by inserting a random sequence of n keys is c log n, with c ≈ 4.311. If we let H be a random variable representing the height of the tree, then Markov's inequality gives

         Pr(H ≥ βn) ≤ (c log n) / (βn) = O((log n)/n)

     So, the probability of getting a tree of height Θ(n) tends to zero as O((log n)/n).


Probability Basics XIII

   - Example. Suppose an algorithm runs in average-case time f(n) but has worst-case time g(n), where f(n) ≺ g(n). What is the probability that the algorithm will take time g(n)? Treat the running time as a random variable; applying Markov's inequality, we immediately obtain

         Pr(running in time ≥ g(n)) ≤ f(n) / g(n)

     Since f(n) ≺ g(n), the probability of running in time ≥ g(n) is tending to zero, at least as quickly as the ratio between the average and worst-case time.


Probability Basics XIV

9. Variance and Standard Deviation. Recall that the variance and standard deviation of a random variable X are:

       Var[X] = E[(X − E[X])²] = E[X²] − (E[X])²
       σ[X] = (Var[X])^(1/2)

   A common special case: if X is a Bernoulli random variable with probability β of being 1, then Var[X] = β(1 − β).

   If X_1, X_2, ... are independent random variables, then

       Var[ ∑_i X_i ] = ∑_i Var[X_i]


Probability Basics XV

10. Chebyshev's bound. Chebyshev's bound, like Markov's bound, is a tail inequality that gives a bound on how slowly probability can drop as you move away from the mean.

        Pr(|X − E[X]| ≥ a·σ[X]) ≤ 1/a²

11. Distributions. A probability space (Ω, F, µ) on Ω = N or Ω = R coincides with our familiar idea of a "probability distribution."

    A random variable X : Ω → Ω_X has an associated distribution (probability space) (Ω_X, F_X, µ_X), where
    - The sample space is Ω_X;
    - The measurable events are given by F_X = {E ⊆ Ω_X : X^(−1)(E) ∈ F};
    - The probability measure is µ_X(E) = µ(X^(−1)(E)).


Probability Basics XVI

    - Consider a uniform distribution on 3-bit strings, i.e., Ω = {0, 1}³, and a random variable X that counts the number of bits that are 1, e.g., X(011) = 2. Then

          µ_X({2}) = µ(X^(−1)({2})) = µ({110, 101, 011})

    - For a continuous random variable, e.g., something of the form X : Ω → R, the familiar probability density function (pdf) and cumulative distribution function (cdf) are:

          F(x) = µ( (−∞, x] )
          f(x) = (d/dx) F(x)


Bibliography I

[1] Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley, second edition, 2000.

[2] Per Martin-Löf. The definition of random sequences. Information and Control, 9(6):602–619, December 1966.

[3] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.


ECE750-TXB Lecture 12: Markov Chains and their Applications

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

February 20, 2007


Markov Chains I

- The probabilistic counter is a simple example of a Markov chain. Roughly speaking, a Markov chain is a finite state machine with probabilistic transitions. Consider the following two-state Markov chain with transition probabilities as shown:

  [Diagram: states A and B; each stays put with probability 1 − p and moves to the other state with probability p.]

  Let f_A(n) be the probability that the machine is in state A at step n, and similarly f_B(n). Obviously the machine can only be in one state at a time, so

      f_A(n) + f_B(n) = 1


Markov Chains II

Let’s consider the scenario where the machine is in stateA to begin:

fA(0) = 1

fB(0) = 0

These are the initial conditions.We can write equations to describe how the systemevolves. For each state, we look at the incident edges,and write the probability of being in that state at timen in terms of where it was at time n − 1. We reachstate A at time n

I with probability (1− p) if we were in state A at timen − 1;

I with probability p if we were in state B at time n − 1


Markov Chains III

  This leads to the equations

      f_A(n) = (1 − p) f_A(n − 1) + p f_B(n − 1)
      f_B(n) = p f_A(n − 1) + (1 − p) f_B(n − 1)

  To encode the initial conditions we add δ(n) to the equation for f_A; this will result in f_A(0) = 1. In matrix form, we can write the equations as:

      [f_A(n)]   [1 − p    p  ] [f_A(n − 1)]   [δ(n)]
      [f_B(n)] = [  p    1 − p] [f_B(n − 1)] + [ 0  ]

  where the 2×2 matrix is P, the transition matrix. These are called the Chapman-Kolmogorov equations. Taking z-transforms, we obtain

      [F_A(z)]   [1 − p    p  ] [z⁻¹ F_A(z)]   [1]
      [F_B(z)] = [  p    1 − p] [z⁻¹ F_B(z)] + [0]


Markov Chains IV

  Rearranging,

      (P z⁻¹ − I) F = x    (1)

  or,

      z⁻¹ (P − zI) F = x    (2)

  where I is the identity matrix, F = [F_A(z) F_B(z)]ᵀ, and x = [1 0]ᵀ encodes the initial conditions.


Markov Chains V

- If you are familiar with eigenvalues, the term (P − zI) should look conspicuous. Note that we can solve Eqn. (1) by left-multiplying through by z(P − zI)⁻¹ to obtain

      F = z (P − zI)⁻¹ x

  Furthermore, (P − zI)⁻¹ can be written in terms of the adjoint and determinant: recall that K⁻¹ = adj K / |K|, where |K| is the determinant. We can then write the solution as:

      F = (z · adj(P − zI) / |P − zI|) x

  So, the poles of the functions in F will occur at values of z where

      |P − zI| = 0

  Compare this to the characteristic equation for the eigenvalues of P:

      |P − λI| = 0


Markov Chains VI

- The poles are located at λ_1, λ_2, ···, where the λ_i are the eigenvalues of the transition matrix P. Solving for F_A(z), we obtain

      F_A(z) = (1 − (1 − p) z⁻¹) / ((1 − z⁻¹)(1 − (1 − 2p) z⁻¹))

  This has poles at z = 1 and z = 1 − 2p, and a zero at z = 1 − p.
  - The pole at z = 1 reflects the limiting distribution of the chain. (Recall that Z⁻¹[c/(1 − z⁻¹)] = c · u(n).)
  - The pole at z = 1 − 2p produces transient behaviour (so long as 0 < p < 1).
  - If p = 0 or p = 1, the zero at z = 1 − p cancels one of the poles.


Markov Chains VII

  Taking the inverse transform, we obtain

      f_A(n) = 1/2 + (1/2)(1 − 2p)ⁿ
      f_B(n) = 1/2 − (1/2)(1 − 2p)ⁿ

  where the constant term comes from the pole at z = 1, the transient term from the pole at z = 1 − 2p, and we have used f_B(n) = 1 − f_A(n).

  Depending on the value of p, this two-state Markov chain can exhibit five distinct behaviours:

  1. p = 0: the machine always stays in state A. The only possible sequence is AAAAAA···.
  2. p = 1: always get a strict alternation between states: ABABABAB···.
  3. p < 1/2: get a monotone, exponentially fast approach to the limiting distribution f_A(n) = f_B(n) = 1/2.
  4. p = 1/2: get f_A(n) = f_B(n) = 1/2 for n ≥ 1. Every sequence of A's and B's is equally likely.


  5. p > 1/2: get oscillating decay to the limiting distribution f_A(n) = f_B(n) = 1/2. (A numerical check of the closed form follows below.)
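A quick numerical check of the closed form (an illustrative sketch, not from the original notes): iterate the Chapman-Kolmogorov recurrence and compare with f_A(n) = 1/2 + (1/2)(1 − 2p)^n:

```python
def f_a_iterated(p, n):
    """Iterate f_A(n) = (1-p) f_A(n-1) + p f_B(n-1) from f_A(0) = 1."""
    fa, fb = 1.0, 0.0
    for _ in range(n):
        fa, fb = (1 - p) * fa + p * fb, p * fa + (1 - p) * fb
    return fa

p, n = 0.3, 10
closed_form = 0.5 + 0.5 * (1 - 2 * p) ** n
print(f_a_iterated(p, n), closed_form)  # both ~0.5000524288
```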

- Markov Chains.
  - A (finite) Markov chain is a set of states S = {s_1, ..., s_n} together with a matrix P of transition probabilities p_ij of moving from state s_i to state s_j.
  - The sample space is Ω = S^ω, i.e., infinite sequences of states.
  - For an initial distribution u, the distribution after n steps is Pⁿu.
  - Write p_ij^(n) for the probability of going from state i to state j in n steps. (Note: this is the entry (i, j) of the matrix Pⁿ.)
  - Write i →⁺ j if there is an n > 0 such that p_ij^(n) > 0, i.e., j can be reached from i.
  - If i →⁺ j and j →⁺ i, we say i and j are communicating, and we write i ↔ j.

    The relation ↔ is an equivalence relation, and it partitions the states into classes of states that communicate with each other.

- Classification of Markov chains:

      Markov chain ─┬─ Reducible
                    └─ Irreducible/Ergodic ─┬─ Aperiodic/Mixing
                                            └─ Periodic

- Irreducible: all states are communicating. This means there is only one class of long-term behaviours. If a chain is irreducible, there is a limiting distribution u such that Pu = u, and the chain spends a proportion of time u_i in state i. This chain is irreducible:

  [Diagram: three states A, B, C arranged in a cycle, each moving to the next with probability 1.]


  and the distribution u = [1/3 1/3 1/3]ᵀ is a limiting distribution. In particular, for a sample sequence,

      u_i = lim_{n→∞} (1/n) (#times in state i)

  with probability 1. (This is a Cesàro limit: the probability of being in state i after n steps might not converge — it might be 0, 0, 1, 0, 0, 1, ... — but the Cesàro limit does converge.)

  An irreducible Markov chain is ergodic — meaning that sample-space averages coincide with time averages (with probability 1). (Note that some authors use "ergodic" as a synonym for "mixing," which is not quite correct.)

- A Markov chain is aperiodic if for all i, j,

      gcd{n : p_ij^(n) > 0} = 1

  otherwise, the chain is called periodic.


Markov Chains XI

- If a chain is not irreducible, it is called reducible. There are multiple equivalence classes under ↔; this means there is more than one possible long-term behaviour. A simple example is:

  [Diagram: state B moves to A or C with probability 1/2 each; A and C each loop on themselves with probability 1.]

- An irreducible chain is mixing if for any initial distribution p_0,

      lim_{n→∞} Pⁿ p_0 = u

  where u is the limiting distribution. This chain is mixing:

  [Diagram: three states A, B, C in a cycle; each state loops on itself with probability 3/4 and moves to the next state with probability 1/4.]


Markov Chains XII

  A chain that is mixing "forgets" its initial conditions, in the sense that the distribution Pⁿ p_0 is asymptotically independent of the distribution p_0.

- An absorbing state is one that communicates only with itself:

  [Diagram: a single state A with a self-loop of probability 1.]

  A chain is called absorbing if every state communicates with an absorbing state.

- Applications discussed in lecture:
  - PageRank (Google)
  - Anomaly detection
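An illustrative sketch (not from the original notes) of computing the limiting distribution by power iteration, i.e., repeated multiplication by P; this is also the computation underlying PageRank:

```python
def limiting_distribution(P, steps=1000):
    """Approximate u with P u = u by power iteration.

    P is column-stochastic, given as a list of rows:
    P[i][j] = probability of moving from state j to state i.
    """
    n = len(P)
    u = [1.0 / n] * n                  # start from the uniform distribution
    for _ in range(steps):
        u = [sum(P[i][j] * u[j] for j in range(n)) for i in range(n)]
    return u

# The mixing 3-state example above: stay with prob 3/4, advance with 1/4.
P = [[0.75, 0.00, 0.25],
     [0.25, 0.75, 0.00],
     [0.00, 0.25, 0.75]]
print(limiting_distribution(P))        # ~[1/3, 1/3, 1/3]
```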


Convergence Time of Markov Chains I

- Recall that the z-transform of the Chapman-Kolmogorov equations has the form

      F = (N(z) / |P − zI|) x

  where P is the transition matrix, N(z) is some polynomial in z, and x encodes the initial conditions.
- The inverse z-transform of f_i(n) will have terms corresponding to the poles of N(z)/|P − zI|; for example,

      f_i(n) = u_i + α_2 (λ_2)ⁿ + α_3 (λ_3)ⁿ + ···

  where the λ_k are the poles, equivalently the eigenvalues of the transition matrix P. By convention we number the eigenvalues so that

      |λ_1| ≥ |λ_2| ≥ |λ_3| ≥ ···


Convergence Time of Markov Chains II

  The largest eigenvalue, λ_1, always satisfies λ_1 = 1, which generates the limiting-distribution term u_i.

- The rate of convergence to the limiting distribution is governed by |λ_2|, the magnitude of the second-largest pole/eigenvalue. (Or, the largest eigenvalue with |λ_i| < 1, if there are multiple eigenvalues equal to 1.) For example, if λ_2 = 0.99, we get a term of the form α_2 (0.99)ⁿ, which has a half-life of n ≈ 69 (very slow convergence). If λ_2 = 0.7, the half-life is n ≈ 2 (very fast convergence).
- |λ_2| is sometimes referred to as the SLEM: Second Largest Eigenvalue Modulus. (Modulus = magnitude.)
- In designing randomized algorithms that can be modelled as Markov chains, we can optimize the convergence time by minimizing |λ_2|.


Leader Election I

- Scenario: we have some number of computers that can communicate with each other. We want them to randomly elect one of them to be the leader. Each computer must run the same algorithm.
- One method for leader election is the following:
  - Initially, each machine thinks it is the leader.
  - At each time step, machines broadcast whether they think they are the leader or not.
  - If a machine thinks it is the leader, but some other machine also does, it flips a coin to decide whether to drop out or stay in the leadership race.
  - The process is finished when there is only one machine that thinks it is the leader.
  - If nobody thinks they are the leader, start all over again with everyone thinking they are the leader.
- Goal: minimize the amount of time needed to elect a leader.


Leader Election II

- Implies: make the Markov chain converge to a stable configuration as quickly as possible.
- Implies: choose the dropout probability p in order to make the secondary pole as close to the origin as possible.
- E.g., with 2 players:
  - 4 states: 11 (everyone thinks they could be a leader), 10, 01 (one leader), 00 (both drop out).

  [Diagram: from state 11, move to 01 or 10 with probability p(1 − p) each, to 00 with probability p², and stay at 11 with probability (1 − p)²; states 01 and 10 loop on themselves with probability 1; state 00 returns to 11 with probability 1.]


Leader Election III

- We will show that the average time to elect a leader is minimized when p = (1/2)(3 − √5) ≈ 0.381966.
- Equations:

      [f_00(n)]   [0  0  0  p²     ] [f_00(n−1)]   [  0  ]
      [f_01(n)] = [0  1  0  p(1−p) ] [f_01(n−1)] + [  0  ]
      [f_10(n)]   [0  0  1  p(1−p) ] [f_10(n−1)]   [  0  ]
      [f_11(n)]   [1  0  0  (1−p)² ] [f_11(n−1)]   [δ(n)]

- The eigenvalues/pole locations are:

      λ_1 = 1
      λ_2 = 1
      λ_{3,4} = (1/2)p² − p + 1/2 ± (1/2)√(p⁴ − 4p³ + 10p² − 4p + 1)

  (via Maple.)


Leader Election IV

- The positive pole dominates. To find the p that minimizes it, set dλ_3/dp = 0 to obtain

      p = (1/2)(3 − √5)

- This yields λ_3 ≈ 0.618, corresponding to a half-life of about 1.44 steps. (Choosing p = 1/2, an unbiased coin flip, would yield λ_3 ≈ 0.64, and half-life ≈ 1.55.)
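A simulation sketch of the two-player protocol (illustrative, not from the original notes); the optimized dropout probability gives a slightly smaller average election time than an unbiased coin, consistent with the pole analysis:

```python
import random

def rounds_to_elect(p, rng):
    """Simulate two contenders; count steps until exactly one remains."""
    rounds, contenders = 0, 2
    while contenders != 1:
        rounds += 1
        if contenders == 0:
            contenders = 2                  # both dropped out: restart
        else:
            # each contender independently stays in with probability 1-p
            contenders = sum(rng.random() >= p for _ in range(2))
    return rounds

rng = random.Random(42)
for p in (0.5, 0.381966):
    avg = sum(rounds_to_elect(p, rng) for _ in range(100_000)) / 100_000
    print(p, avg)   # ~2.5 for p = 1/2, ~2.43 for the optimized p
```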



ECE750-TXB Lecture 13: Information Theory

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

February 22, 2007


Entropy I

- The central concept of information theory is entropy, which measures:
  1. The "amount of randomness" in a distribution;
  2. How many bits are required, on average, to represent an object, assuming the distribution is known to both the encoder and decoder.
- Entropy is a functional from distributions to R: if µ is a distribution, then the entropy H(µ) is a real number describing "how random" the distribution µ is.
- The following requirements lead to a unique definition of entropy (up to a multiplicative constant). Suppose µ is a distribution on {1, 2, ..., n}; we can treat µ as a vector in Rⁿ, i.e.,

      µ = [p_1 p_2 ··· p_n]

  where p_1 is the probability of outcome 1, etc.

Entropy II

1. Continuity: H(µ) should be a continuous function of µ. Using a δ–ε definition of continuity, for each µ and each ε > 0 there is a δ > 0 such that

       ‖µ − µ′‖ < δ implies |H(µ) − H(µ′)| < ε

2. Maximality: H(µ) attains its maximum value when µ = [1/n 1/n ··· 1/n], i.e., a uniform distribution is the "most random."

3. Additivity: If µ and µ′ are two distributions, then

       H(µ × µ′) = H(µ) + H(µ′)

   i.e., the entropy of a product of probability spaces is the sum of their entropies.

4. Expandability: If we expand the domain of the distribution µ from {1, 2, ..., n} to {1, 2, ..., n + 1}, then

       H([p_1 p_2 ··· p_n]) = H([p_1 p_2 ··· p_n 0])


Entropy III

- The unique function satisfying these conditions is

      H(µ) = β ∑_{i=1}^{n} −p_i log2 p_i

  where 0 log 0 = 0 to satisfy continuity. The constant β is usually taken to be 1, so that the entropy can be interpreted as "the number of bits needed to represent an outcome."
- Example. Let µ = [1/n 1/n ··· 1/n]. Then

      H(µ) = ∑_{i=1}^{n} −p_i log p_i = n · (−(1/n) log(1/n)) = log n


Entropy IV

  With a uniform distribution on n outcomes, log n bits are needed to represent an outcome. The uniform distribution is the only distribution on n outcomes that has entropy log n.
- Example. Let µ = [0 0 1 0 0 ··· 0]. Then

      H(µ) = (n − 1)(−0 log 0) − (1 log 1) = 0

  The only distributions with H(µ) = 0 are those where a single outcome has probability 1.
- Example. Let µ = [1/2 1/2]. Then H(µ) = 1: one bit is required to represent the outcome of a uniform distribution on two outcomes (e.g., a fair coin flip).


Entropy V

- Example. Let µ = [p (1 − p)]. (A Bernoulli RV with probability p.) Then

      H(µ) = −p log p − (1 − p) log(1 − p)

  [Plot: entropy of a Bernoulli random variable as a function of p ∈ [0, 1]; H rises from 0 at p = 0 to a maximum of 1 at p = 1/2, and falls back to 0 at p = 1.]
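The definition is a one-liner to check numerically; a small illustrative sketch (not from the original notes):

```python
from math import log2

def entropy(mu):
    """H(mu) = sum of -p log2 p, with 0 log 0 = 0."""
    return -sum(p * log2(p) for p in mu if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 = log2(4): uniform
print(entropy([0, 0, 1, 0]))              # 0.0: no randomness
print(entropy([0.5, 0.5]))                # 1.0: a fair coin flip
```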


Noiseless coding theorem I

- Recall {0, 1}* is the set of finite binary strings.
- Let C ⊆ {0, 1}* be a prefix-free set of codewords. (Prefix-free means no codeword occurs as a prefix of another codeword; see the lecture on tries.)
- A code for µ is a function c : dom µ → C. (Note that we can treat c as a random variable.)
- The average code length of c is

      c̄ = E[|c|] = ∑_{x ∈ dom µ} µ(x) |c(x)|

- Shannon's noiseless coding theorem states that:
  1. c̄ ≥ H(µ). That is, no code can achieve an average code length less than the entropy.
  2. There exists a code achieving c̄ ≤ 1 + H(µ).
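An illustrative sketch (not from the original notes) using Huffman coding, one classical way of getting within the "+1" of the theorem, to compare the average code length with H(µ). The distribution here is dyadic, so the two agree exactly:

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword length per symbol for a Huffman code over probs."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    length = [0] * len(probs)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:          # every merged symbol gains one bit
            length[i] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return length

mu = [0.5, 0.25, 0.125, 0.125]
L = huffman_lengths(mu)
avg = sum(p * l for p, l in zip(mu, L))
H = -sum(p * log2(p) for p in mu)
print(L, avg, H)   # [1, 2, 3, 3] 1.75 1.75
```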


Applications of information theory to algorithms

- Information theory is applied in algorithm design and analysis in several ways:
  1. Deriving lower bounds on the time or space required for an algorithm or data structure, using a uniform distribution. (Usually called the "information-theoretic lower bound.")
  2. Designing structures for searching that exploit the entropy of the distribution on keys.
  3. The noiseless coding theorem can be used to derive probability bounds. (Often such arguments are phrased in terms of Kolmogorov complexity; the technique is called "the Incompressibility Method.")


Information-theoretic lower bounds I

- Information theory provides a quick method for obtaining lower bounds, i.e., Ω(·)-bounds, on time and space.
- Example: sorting an array. Consider the problem of sorting an array of n integers using comparison operations and swaps.
  - Assume no two elements of the array are equal.
  - Assume comparisons such as "a[i] ≤ a[j]" are the only method of obtaining information about the input array ordering.
  - There are n! orderings of the input array, but only one possible output ordering.
  - Each comparison test of the form "a[i] ≤ a[j]" yields a true-or-false answer, i.e., at most one bit of information about the ordering of the input array.
  - To distinguish amongst the n! possible orderings of the input array, at least log n! tests are required, on average. This can be established in several ways:


Information-theoretic lower bounds II

  - Decision tree: consider the sequence of comparisons performed by a sorting algorithm as a path through a tree, where each tree node represents a comparison test; the left branch is taken if a[i] > a[j], and the right branch is taken if a[i] ≤ a[j]. Each leaf of the tree is a rearrangement (sorting) of the input array. Since there are n! possible orderings of the input array, the tree must have ≥ n! leaves; since it is a binary tree, it must be of depth ≥ log n!.
  - It must be possible to recover the initial array ordering from the sequence of test outcomes; so, the sequence of test outcomes constitutes a "code" for input array orderings. Apply the noiseless coding theorem: at least log n! comparisons are required, on average.


Information-theoretic lower bounds III

  - After sorting the input array, all information about the original ordering is lost. If we wanted to "run the algorithm backwards" and recover the original input array, we would need log n! bits of information to reproduce the original, unsorted array. We say that sorting the array incurs an "irreversibility cost" of log n! bits. RAM and Turing-machine models allow at most O(1) bits of information to be "erased" at each step. Therefore Ω(log n!) = Ω(n log n) time steps are required. (These ideas are stock-in-trade of the "thermodynamics of computation," which investigates (among other things) the minimum amount of heat that must be produced when computations are performed.)

  From any one of these three arguments, one can conclude that any comparison-based sorting algorithm requires Ω(log n!) = Ω(n log n) worst-case time.


Information-theoretic lower bounds IV

- It is possible to sort in o(n log n) time in certain circumstances: for example, to sort a large array of values in the range [0, 255] one can simply maintain an array of 256 counters, scan the array to build a histogram, and then expand the histogram into a sorted array. This can be done with O(n) operations. (A sketch follows below.)
- However, due to either the second or third argument above (the coding argument or the irreversibility-cost argument), it is never possible to sort an array in less than O(H) operations, where H is the entropy of the ordering of the input array.
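A sketch of the counting-sort example above (illustrative code, not from the original notes):

```python
def counting_sort_bytes(a):
    """O(n) sort for values in range(256) using 256 counters."""
    counts = [0] * 256
    for x in a:                  # one scan builds the histogram
        counts[x] += 1
    out = []
    for v, k in enumerate(counts):
        out.extend([v] * k)      # expand histogram into sorted order
    return out

print(counting_sort_bytes([255, 3, 17, 3, 0]))  # [0, 3, 3, 17, 255]
```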

- Example: Binary Decision Diagrams (BDDs). BDDs are a very popular representation for boolean functions.
  - A boolean function on k variables is a function f(x_1, ..., x_k) : {0, 1}^k → {0, 1}.


Information-theoretic lower bounds V

I Example: the majority function on three variables. MAJ(x, y, z) is true when a majority of its inputs are true:

    x y z | MAJ(x, y, z)
    0 0 0 | 0
    0 0 1 | 0
    0 1 0 | 0
    0 1 1 | 1
    1 0 0 | 0
    1 0 1 | 1
    1 1 0 | 1
    1 1 1 | 1

I Note that a boolean function on k variables has 2^k possible input combinations (hence, 8 lines in the above truth table.)

I BDDs are popular because they can often represent a boolean function very compactly, e.g., linear in the number of variables. Can they always do this?


Information-theoretic lower bounds VI

I Put a uniform distribution on boolean functions of k variables. There are 2^k possible input combinations; a boolean function can be true or false for each input combination; hence there are 2^(2^k) boolean functions on k variables.

I Under a uniform distribution, the entropy is

    log2 2^(2^k) = 2^k

I Any representation of boolean functions can be viewed as a "code" to represent the boolean function in memory. Applying the noiseless coding theorem, the average code length c satisfies

    c ≥ 2^k

That is, any representation for boolean functions requires an average code length of 2^k bits — exponential in the number of inputs.

I We can therefore say: any representation of boolean functions requires Ω(2^k) bits per function, on average, with a uniform distribution.


Information-theoretic lower bounds VII

I The fact that BDDs often achieve small representations suggests that:

1. the distribution on boolean functions used in practice has quite low entropy;

2. BDDs are a reasonably efficient code for that distribution.


Entropy and searching I

I Scenario: retrieving records from a database.

I Let K be a set of keys, and n = |K|. We have seen that binary search, or binary search trees, yield worst-case time of O(log n).

I It turns out that the best average-case performance that can be achieved does not depend on n per se, but rather on the entropy of the input distribution on keys.

I Suppose µ is a probability distribution on K, indicating the frequency with which keys are requested.

I In some applications, search problems have highly nonuniform distributions on input keys.

I Often the probability distribution on very large key sets follows a distribution where the mth most popular key has probability ≺ m^(−1) (cf. Zipf's law, Chris Anderson's article The Long Tail).

I Let H = H(µ) be the entropy of the input distribution.


Entropy and searching II

I We can achieve search times of O(H) by placing commonly requested keys close to the root of the tree.

I Example: suppose we are performing dictionary lookups on the following set of keys:

    Key             Probability
    entropy         1/2
    caffeine        1/4
    Markov          1/16
    thermodynamics  1/16
    convex          1/16
    stationary      1/16


Entropy and searching III

I We can use the following search tree:

                  entropy
                 /       \
          caffeine        stationary
         /               /          \
     convex         Markov      thermodynamics

I We have placed the most commonly requested keys (entropy, caffeine) close to the root.

I Average depth is

    (1/2)(1) + (1/4)(2) + (1/16)(2) + 3 · (1/16)(3) ≈ 1.69

I There are good algorithms to design such search trees offline (i.e., when the distribution is known): one can build binary search trees that achieve optimal search times, e.g., [1].


Entropy and searching IV

I However, in some situations it is impractical to build such trees offline, since the contents of the database are changing, or the key distribution is not stationary (e.g., every day brings different "hot topics" that people are searching for.)

I Splay trees [2] are fascinating binary search trees that reorganize themselves in response to the input key distribution. The underlying idea is very simple: each time a key is requested, it is moved to the root of the tree (see the sketch below). In this way, popular keys tend to hang out close to the root.

I Splay trees are known to be optimal for a static distribution on keys. Their performance for nonstationary distributions is a longstanding open question: the "dynamic optimality conjecture."
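A minimal sketch of the move-to-root step, assuming a plain binary search tree of Node records; real splay trees use paired (zig-zig/zig-zag) rotations along the access path, which is essential for their amortized guarantees:

    // Hypothetical BST node.
    static class Node { int key; Node left, right; }

    // Rotate child c up over its parent p; returns the new subtree root.
    static Node rotateUp(Node p, Node c) {
        if (p.left == c) { p.left = c.right; c.right = p; }
        else             { p.right = c.left; c.left  = p; }
        return c;
    }

Repeating rotateUp along the path from an accessed key to the root brings that key to the root, so frequently requested keys accumulate near the top of the tree.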


Bibliography I

[1] Kurt Mehlhorn. Nearly optimal binary search trees. Acta Informat., 5(4):287–295, 1975.

[2] Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652–686, July 1985.


ECE750-TXB Lecture 14: Typical Inputs

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo

Canada

February 27, 2007


Asymptotic Distributions I

I Scenario for average-case analysis:
  I n is the input "size".
  I Inputs: (Kn)n∈N is a family of sets indexed by n, giving the possible inputs to an algorithm of size n.
  I For each n, there is a probability distribution µn on inputs.
  I Example: if an algorithm operates on binary strings, we could choose size to mean "length of the string," and Kn = {0,1}^n (strings of length n.)

I An asymptotic distribution is a family of probability distributions (µn)n∈N where µn is a probability measure on the sample space Kn.

I When our meaning is obvious, we will write µn(w) for the probability of the input w ∈ Kn. (If µn is a measure then to be fastidious we should write µn({w}), where {w} is the event that the outcome is w. But writing µn(w) is clearer.)


Asymptotic Distributions II

I Example: for binary strings of length n, the uniform distribution on {0,1}^n is defined by

    µn(w) = 1/2^n

since there are |Kn| = 2^n binary strings of length n.

I To design algorithms that behave well on average, it helps to know what properties are "typical" for the distribution of inputs.


Sets of asymptotic measure 1 I

I Let K = ⋃n∈N Kn be all possible inputs of any length.

I For a class of inputs A ⊆ K, the asymptotic measure of A, if it exists, is given by

    µ∞(A) = lim_{n→∞} µn(A ∩ Kn)

I Note that in general the limit may not exist. For example, taking Kn to be binary strings, the probability of the set

    A = {w ∈ {0,1}* : |w| is even}

alternates between 0 and 1:

    µ3(A ∩ K3) = 0
    µ4(A ∩ K4) = 1
    µ5(A ∩ K5) = 0
    ...


Sets of asymptotic measure 1 II

and so the limit fails to exist.

I If µ∞(A) = 1, we can say
  I A has asymptotic measure 1;
  I A is almost surely true;
  I A happens almost surely.
  I The phrases with probability 1, almost certain, almost always, and with high probability are also used.
  I The abbreviation a.s. is commonly used for almost surely.

I Let's look at a few examples of almost sure properties of random binary strings:

1. Runs of 1's
2. The balance of 0's and 1's
3. The position of the first nonzero bit
4. The number of prime divisors of the string when interpreted as a base-2 integer.


Runs in random strings I

I Example: Runs of 1's in binary strings.

I Let Kn = {0,1}^n be binary strings of length n, and µn the uniform distribution.

I Define the random variable R : Kn → N to be the length of the longest run of 1's. For example (see the routine below),

    R(0100111110100) = 5
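A minimal routine making the definition of R concrete; the method name is illustrative:

    // Length of the longest run of 1's in a binary string.
    static int longestRun(String w) {
        int best = 0, run = 0;
        for (char c : w.toCharArray()) {
            run = (c == '1') ? run + 1 : 0;  // extend or reset the current run
            best = Math.max(best, run);
        }
        return best;
    }

For example, longestRun("0100111110100") returns 5.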

I Recall that in previous lectures we obtained a concentration inequality for R:

    Pr(R ≥ t) ≤ (n − t + 1)2^(−t)

We used the union bound: let Xi be the event that a run of t 1's starts at position i; then

    Pr(R ≥ t) = µ(X1 ∪ · · · ∪ Xn−t+1)


Runs in random strings II

    ≤ Σ_{i=1}^{n−t+1} µ(Xi)
    = (n − t + 1)2^(−t)

where we are requiring that t ≤ n of course.

I Assume t ≺ n, set the bound (n − t + 1)2^(−t) = 1 and take logarithms:

    t ≤ log(n − t + 1) = log n − Θ(t/n)

I Choosing t(n) = log n + δ, we obtain

    Pr(R ≥ log n + δ) ≤ (n − log n − δ + 1)2^(−log n − δ)
                      = (1/n)(n − log n − δ + 1)2^(−δ)
                      = (1 − o(1))2^(−δ)


Runs in random strings III

Conversely,

    Pr(R < log n + δ) = 1 − Pr(R ≥ log n + δ)
                      > 1 − (1 − o(1))2^(−δ)

I If δ ∈ ω(1) then Pr(R < log n + δ) → 1.

I We can say "Almost surely, a binary string chosen uniformly at random does not have a run of length log n + ω(1)."

I Define

    Aδ ≡ {w ∈ K : longest run length is < log |w| + δ(|w|)}

(Aδ is a family of sets of strings, indexed by a function δ.)


Runs in random strings IV

I For any function δ,

    µn(Aδ ∩ Kn) > 1 − (1 − o(1))2^(−δ(n))

A less sharp, but clearer statement is:

    µn(Aδ ∩ Kn) = 1 − O(2^(−δ(n)))

Note δ ∈ ω(1) implies µ∞(Aδ) = 1.

I Aδ is an example of what we shall call a typical set.


Balance of 0’s and 1’s I

I Example: Balance of 0's and 1's in a string.

I Choose a binary string of length n uniformly at random, and define the random variables Y1, ..., Yn by:

    Yi = +1 if the ith bit is 1
         −1 if the ith bit is 0

Then E[Yi] = 0, and

    Var[Yi] = E[(Yi − E[Yi])²] = 1

I Let Y = Σ_{i=1}^{n} Yi.

I Y can be interpreted as a "random walk" on Z, where each bit of the string indicates whether to move up or down.

I |Y| is the discrepancy between the number of zeros and ones.


Balance of 0’s and 1’s II

I The expectation and variance of Y are:

    E[Y] = 0
    Var[Y] = Σ_{i=1}^{n} Var[Yi] = n

I To bound the discrepancy |Y| we can use:

Theorem (Chernoff inequality)
Let Y1, ..., Yn be discrete, independent random variables with E[Yi] = 0 and |Yi| ≤ 1 for all i. Let Y = Σ_{i=1}^{n} Yi, and σ² = Var[Y] be the variance of Y. Then

    Pr(|Y| ≥ λσ) ≤ 2e^(−λ²/4)


Balance of 0's and 1's III

I Applying the Chernoff inequality with σ² = n, we obtain

    Pr(|Y| ≥ λ√n) ≤ 2e^(−λ²/4)

I Let's work the right-hand side of the inequality into the form 2^(−δ). Setting 2^(−δ) = 2e^(−λ²/4) and solving we obtain

    λ = 2√(ln 2 (1 + δ))

I Substituting,

    Pr(|Y| ≥ 2√(n(δ + 1) ln 2)) ≤ 2^(−δ)

I Let Bδ be the set of binary strings satisfying this bound:

    Bδ = {w ∈ K : discrepancy < 2√(|w|(δ + 1) ln 2)}

I As in the previous example,

    µn(Bδ ∩ Kn) = 1 − O(2^(−δ))
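A quick Monte Carlo check of this tail bound, assuming java.util.Random; purely illustrative:

    import java.util.Random;

    // Estimate Pr(|Y| >= 2*sqrt(n(delta+1) ln 2)) and compare with 2^-delta.
    public class DiscrepancyCheck {
        public static void main(String[] args) {
            int n = 1024, trials = 100_000, delta = 5;
            double threshold = 2 * Math.sqrt(n * (delta + 1) * Math.log(2));
            Random rng = new Random();
            int exceed = 0;
            for (int t = 0; t < trials; t++) {
                int y = 0;
                for (int i = 0; i < n; i++)
                    y += rng.nextBoolean() ? 1 : -1;   // one step of the walk
                if (Math.abs(y) >= threshold) exceed++;
            }
            System.out.printf("empirical %.6f vs. bound %.6f%n",
                              (double) exceed / trials, Math.pow(2, -delta));
        }
    }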


First nonzero bit I

I Example: First nonzero bit in a string.

I As before, consider binary strings of length n under a uniform distribution.

I Let Y be an R.V. indicating the position of the first nonzero bit: for example,

    Y(000010110111) = 5

I Y has a geometric distribution with probability p = 1/2:

    E[Y] = 1/(1 − p) = 2

    Pr(Y ≤ δ) = Σ_{k=1}^{δ} (1/2)^k = 1 − 2^(−δ)


First nonzero bit II

I Let Cδ be strings whose first nonzero bit is at position ≤ δ; then

    µn(Cδ ∩ Kn) = 1 − 2^(−δ)

I Almost surely, a binary string of length n has a 1 in a position ≤ f(n), for any f ∈ ω(1).


Erdos-Kac theorem I

I Example: Number of prime divisors.

I Let w be a binary string of length n chosen uniformly at random.

I We can interpret w as a number (written in base 2): for example, given the string w = 010011, we can take

    010011₂ = 19

I Let W be a random variable counting the number of prime divisors of w.

I The Erdős-Kac theorem [1] states that the distribution of W converges to a normal distribution:

    E[W] = ln n + ln ln 2 + o(1)

    Pr(a ≤ (W − E[W])/√E[W] ≤ b) = (1/√(2π)) ∫_a^b e^(−t²/2) dt + o(1)


Erdos-Kac theorem II

I Choosing a = −b and integrating the normal distribution,

    Pr(−b ≤ (W − E[W])/√E[W] ≤ b) = 1 − erfc(b/√2)

I We employ the following inequality, found on the internet (MathWorld) so it must be true:

    erfc(α) < (2/√π) · e^(−α²) / (α + √(α² + 2))

I This yields

    Pr(−b ≤ (W − E[W])/√E[W] ≤ b) > 1 − (2/√π) · e^(−b²/2) / (b/√2 + √(b²/2 + 2))


Erdos-Kac theorem III

I If b = b(n) = ω(1) then

    Pr(−b ≤ (W − E[W])/√E[W] ≤ b) > 1 − O(e^(−b²/2))

where we have deliberately made the asymptotic bound less sharp to make the next step easier: setting

    O(2^(−δ)) = O(e^(−b²/2))

we obtain b = √(2δ ln 2).


Erdos-Kac theorem IV

I Therefore the number of prime divisors W satisfies

    Pr(|W − E[W]| ≤ √(2δ ln 2 · E[W])) > 1 − O(2^(−δ))     (1)

where

    E[W] = ln n + ln ln 2 + o(1)

I Let Dδ be the set of strings w ∈ K satisfying Eqn. (1), where n = |w|.


Typical sets I

The following definition of typical sets is loosely inspired by a similar idea in information theory, but using a parameter δ resembling the "randomness deficiency" of Kolmogorov complexity [3, 2].

Definition
Let Aδ be a family of sets indexed by functions δ : N → R. We say Aδ is typical if

    µn(Aδ ∩ Kn) = 1 − O(2^(−δ(n)))

I We will call Aδ a typical set, even though strictly speaking it is a family of sets indexed by δ.

I The following properties are straightforward:

1. If δ ∈ ω(1) then µ∞(Aδ) = 1.
2. If Aδ ⊆ Bδ, and Aδ is a typical set, then so is Bδ.
3. The set of all possible inputs Kδ = K = ⋃n∈N Kn is typical.


Typical sets II

I A typical set represents an almost sure property with an exponential concentration inequality:
  I Every input is in Aδ almost surely when δ ∈ ω(1);
  I The probability of not being in Aδ falls off as O(2^(−δ)).

I The intersection Aδ ∩ Bδ ∩ Cδ ∩ ... of any finite number of typical sets is also typical. We prove this for the intersection of two sets; any finite number follows by induction.

Proposition
If Aδ and Bδ are typical, so is Cδ = Aδ ∩ Bδ.


Typical sets III

I We will use the following elementary probability identity:

    Pr(α ∧ β) = Pr(¬¬(α ∧ β))
              = 1 − Pr(¬(α ∧ β))
              = 1 − Pr(¬α ∨ ¬β)
              = 1 − [Pr(¬α) + Pr(¬β) − Pr((¬α) ∧ (¬β))]
              = 1 − (1 − Pr(α)) − (1 − Pr(β)) + Pr((¬α) ∧ (¬β))
              = Pr(α) + Pr(β) − 1 + Pr(¬α ∧ ¬β)

Proof.


Typical sets IV

Let Aδ,n = Aδ ∩ Kn, and similarly for Bδ,n. Write Āδ,n for the complement Kn \ Aδ,n. We start from the following identity:

    µ(Aδ,n ∩ Bδ,n) = µ(Aδ,n) + µ(Bδ,n) − 1 + µ(Āδ,n ∩ B̄δ,n)

Note that

    µ(Āδ,n) = 1 − µ(Aδ,n) = 1 − (1 − O(2^(−δ))) = O(2^(−δ))

and similarly for µ(B̄δ,n). Since µ(Āδ,n ∩ B̄δ,n) ≤ max(µ(Āδ,n), µ(B̄δ,n)),

    µ(Āδ,n ∩ B̄δ,n) = O(2^(−δ))

Therefore

    µ(Aδ,n ∩ Bδ,n) = (1 − O(2^(−δ))) + (1 − O(2^(−δ))) − 1 + O(2^(−δ))
                   = 1 − O(2^(−δ))


Typical sets V

I Binary strings chosen uniformly at random have all of the following properties, almost surely, for any δ ∈ ω(1):

1. A run of 1's no longer than log n + δ;
2. The discrepancy between the number of 0's and 1's is less than √(4n(δ + 1) ln 2);
3. The first nonzero bit appears at a position ≤ δ;
4. When viewed as a base-2 integer, has ln n + ln ln 2 prime divisors, ± √(2δ ln 2 (ln n + ln ln 2)).

I For example, choosing δ = 10 (2^(−10) ≈ 10^(−3)), a 1024-bit string has, with fairly high probability:

1. A run of ≤ 20 bits;
2. A discrepancy of ≤ 176 bits;
3. A 1 in the first 10 positions;
4. About 6.5 ± 9.5 prime divisors.


Typical sets VI

I (Note that the constant factors associated with the concentration inequality 1 − O(2^(−δ)) may change when we take intersections of typical sets. For these examples I am just using δ = 10 and hiding the constant factor inside the waffly "fairly high probability.") The numbers above can be reproduced as in the sketch below.
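A minimal computation reproducing those four δ = 10 figures from the formulas above:

    // Typical-set bounds for n = 1024, delta = 10.
    public class TypicalBounds {
        public static void main(String[] args) {
            int n = 1024, delta = 10;
            double runLen  = Math.log(n) / Math.log(2) + delta;               // ~20
            double discrep = Math.sqrt(4.0 * n * (delta + 1) * Math.log(2));  // ~176
            double ew      = Math.log(n) + Math.log(Math.log(2));             // ~6.5
            double dev     = Math.sqrt(2 * delta * Math.log(2) * ew);         // ~9.5
            System.out.printf("run %.0f, discrepancy %.0f, divisors %.1f +- %.1f%n",
                              runLen, discrep, ew, dev);
        }
    }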


Typical sets as a filter

I We have established the following properties of typical sets:

1. If Aδ is typical, and Aδ ⊆ Bδ, then Bδ is typical.
2. If Aδ and Bδ are typical, then Aδ ∩ Bδ is typical.
3. The set of all inputs Kδ = K = ⋃n∈N Kn is typical.
4. The empty set ∅ is not typical.

I The typical sets form a mathematical structure called a filter.

I A filter on a set K is a collection F ⊆ 2^K of subsets of K satisfying these properties:

1. If A ∈ F and A ⊆ B, then B ∈ F;
2. If A, B ∈ F then (A ∩ B) ∈ F;
3. K ∈ F;
4. ∅ ∉ F.

I Filters are a bit abstract, but powerful. One useful application is an ultraproduct, which can be used to construct a single (infinite) structure that embodies the Σ^1_1 properties of typical inputs. (Σ^1_1 properties are definable by second-order sentences of the form ∃R1, ..., Rk . ψ(R1, ..., Rk) — which includes first-order sentences. For example, χ-colourability of graphs is a Σ^1_1 property.)


Typical sets and average-case time I

I Say an algorithm runs in time O(f(n)) on a typical set Aδ if for any δ ∈ O(1), the algorithm has worst-case performance O(f(n)) on Aδ.

I Question: does running in time O(f(n)) on a typical set imply average-case time O(f(n))?

I Answer: not necessarily — it's easy to construct counterexamples. Consider the following algorithm on strings:

    function Broken(w)
        if w = 111···11 then
            wait for 2^(2^|w|) seconds
        return

I It returns right away, unless the string is all 1's, in which case it takes O(2^(2^n)) time (where n = |w|).


Typical sets and average-case time II

I So, it runs in O(1) time on a typical set. (Using, for example, the set Aδ of strings with runs of length < log |w| + δ.)

I Average time is (1 − 2^(−n)) · c + 2^(−n) · O(2^(2^n)) = O(2^(2^n − n)). Doubly-exponential!

I Suppose the worst-case running time of the algorithm can be expressed in the form O(g(n, δ)): note that O(g(n, O(1))) gives worst-case time on a typical set.

I The average-case time is then:

    T(n) = Σ_{δ=0}^{log |Kn|} O(2^(−δ)) O(g(n, δ))
         = Σ_{δ=0}^{log |Kn|} O(2^(−δ) g(n, δ))


Typical sets and average-case time III

Note that anything of the form Σ_{k=0}^{∞} 2^(−k) k^c where c ∈ O(1) converges to a constant — an exponential concentration swallows any polynomial.

I If g(n, δ)/g(n, O(1)) is at most polynomial in δ, then worst-case time on the typical set equals average-case time.


Example: No-carry adder I

I The no-carry adder is a simple algorithm for adding binary numbers.

I Let x0, y0 be n-bit integers. The no-carry adder repeats the following iteration:

    x_{i+1} = x_i ⊕ y_i
    y_{i+1} = (x_i & y_i) LSH 1

where ⊕ is bitwise XOR, & is bitwise AND, and LSH 1 shifts left by one bit. At each iteration x_i holds a partial sum, and y_i holds carry bits. The iteration continues until y_i = 0. (A direct implementation appears below.)
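A direct implementation of the iteration in Java; long is used so the shift behaves for operands up to 63 bits:

    // Add two nonnegative integers using only XOR, AND, and left shift.
    static long noCarryAdd(long x, long y) {
        while (y != 0) {
            long sum   = x ^ y;          // partial sum, carries ignored
            long carry = (x & y) << 1;   // carry bits, shifted into place
            x = sum;
            y = carry;
        }
        return x;
    }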


Example: No-carry adder II

I Example: to calculate the sum of

    x0 = 01101110₂
    y0 = 00000010₂

the following steps occur:

    x1 = 01101100₂    y1 = 00000100₂
    x2 = 01101000₂    y2 = 00001000₂
    x3 = 01100000₂    y3 = 00010000₂
    x4 = 01110000₂    y4 = 00000000₂

I How many iterations are required?


Example: No-carry adder III

I The number of iterations is determined by the length of the longest "carry sequence," i.e., the longest span across which a carry must be propagated.

I For there to be a carry sequence of length t, there must be a bit position where x0 and y0 are both 1, followed by t − 1 positions where x0 and y0 have opposite bits:

    x0 = 01101110
    y0 = 00000010

I The probability of a carry sequence of length t is easily bounded by employing the union bound: let Zi be the event that x0, y0 match in bit positions i through i + t − 2. Then

    Pr(⋃ Zi) ≤ Σ Pr(Zi) = (n − t + 1)2^(−t+1)


Example: No-carry adder IV

This is very close to the equation for a run of 1's; using t = log n + δ, we obtain

    Pr(⋃ Zi) = O(2^(−δ))

i.e., with probability 1 − O(2^(−δ)) there is no carry sequence of length log n + δ.

I So, the number of iterations is O(g(n, δ)) where

    g(n, δ) = log n + δ

I To calculate the average case:

    T(n) = Σ_{δ=0}^{n} O(2^(−δ)) O(log n + δ)
         = Σ_{δ=0}^{n} O(2^(−δ) log n + δ2^(−δ))
         = O(log n)

since Σ_δ (log n) 2^(−δ) = log n · Σ_δ 2^(−δ) = O(log n), and Σ_δ δ2^(−δ) = O(1).


Bibliography I

[1] P. Erdős and M. Kac. The Gaussian law of errors in the theory of additive number theoretic functions. Amer. J. Math., 62:738–742, 1940.

[2] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer-Verlag, New York, 2nd edition, 1997.

[3] V. G. Vovk. The Kolmogorov-Stout law of the iterated logarithm. Mat. Zametki, 44(1):27–37, 154, 1988.


ECE750-TXB Lecture 16: Randomized Algorithms

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo

Canada

March 6, 2007


Stochastic algorithms and data structures I

I A stochastic algorithm or data structure is one with access to a stream of random bits (a.k.a. coin flips).

I These random bits can be used to make or influence decisions about how to proceed. The intended effect might be to:
  I Avoid worst cases;
  I Achieve average-case performance even for arbitrary inputs;
  I Use the random bits to guess answers, if good answers are plentiful.

I In understanding stochastic algorithms/data structures there are two distributions to keep in mind:

1. The distribution of inputs;
2. The distribution of the random bits being used.

I Some possibly familiar examples of stochastic algorithms include simulated annealing, genetic algorithms, Kernighan-Lin graph partitioning, etc.


Stochastic algorithms and data structures II

I A randomized algorithm (or data structure) is one that offers good performance for any input, with high probability, i.e., there are no classes of inputs/operation sequences for which the performance is asymptotically poor.

I A classic application of randomization is to Hoare's QuickSort. To sort an array A[1...n]:

1. If n = 1 then done.
2. Otherwise, choose a pivot element A[i] by examining some finite number of elements of the array.
3. Partition the array into three parts: items > A[i], items < A[i], and items = A[i].
4. Recursively sort the first two partitions, and merge the resulting arrays.


Stochastic algorithms and data structures III

I If the pivot element is chosen deterministically, then we can force the algorithm to take Θ(n²) time by designing the input array carefully. For example, a common heuristic is "median of three": choose the pivot to be the median of A[1], A[⌊n/2⌋], A[n]. By placing the maximum elements of the array in these positions, the array is partitioned into subarrays of size n − 3, 3, and 0. Repeating this design recursively yields an array for which QuickSort requires Θ(n²) time.

I In Randomized QuickSort, one chooses the pivot element uniformly at random from 1...n. Then, it is impossible to design a worst-case array input without knowing in advance the random bits being used to choose the pivot.

I Performance for randomized algorithms is usually measured as "worst average case":


Stochastic algorithms and data structures IV

I The time required for an input w, which we write T(w), is no longer a deterministic function, but a random variable of the coin flip sequence used by the algorithm.

    [Figure: a randomized algorithm consumes an input w together with a stream of random bits (coin flips) s = 011001101100...]

I We measure the time required by the algorithm as

    max_{w ∈ Kn} E_s[T(w)]

  I the maximum over all inputs w ∈ Kn of length n
  I of the expectation with respect to the random bit sequence s of the running time.

I The input distribution is ignored: one is concerned with the worst-case (with respect to inputs) of the average time (with respect to the random bits).


Randomized Equality Protocol I

    [Figure: Machine A and Machine B, each holding a file of n bits, joined by a reliable communication link]

I Consider the problem of maintaining a mirror of a large database across a reliable network connection. Both machine A and B have a copy of the database, and we wish to determine whether the files are the same.

I Any algorithm achieving zero error for arbitrary files must transmit ≥ n bits: one can do no better than just transmitting the entire file from machine A to B.

I Why? Each bit transmitted can be thought of as the outcome of some test performed on a file. If t tests are performed, and t < n, then there are 2^t test outcomes and 2^n > 2^t possible files; by pigeonhole there must be two different files with the same test outcomes.


Randomized Equality Protocol II

I Note that if we transmit, e.g., an md5 checksum, there exist pairs of files that are different but have the same checksums, called hash collisions. (In fact there are growing databases one can access on the internet to attempt to produce md5 hash collisions.)

I There is a simple randomized algorithm that:

1. Transmits O(log n) bits;
2. Achieves an astronomically low error probability, and this probability can be made as low as desired;
3. Makes it impossible to produce "hash collisions" that reliably cause the algorithm to wrongly report files are equal when they are not.

I Randomized Equality Protocol:
  I Alice has a file x = x0x1···xn−1, and Bob has a file y = y0y1···yn−1. (xi, yi are bits; we interpret x and y as large integers.)
  I Alice chooses a prime p uniformly at random in [2, n²]. (This prime can be represented in ≤ 2 log n bits.)


Randomized Equality Protocol III

  I Alice computes

        s = x mod p

    and transmits s and p to Bob. (This requires ≤ 2⌈2 log n⌉ bits, plus change.)

  I Bob computes

        q = y mod p

    If q = s, Bob outputs "x = y." If q ≠ s, Bob outputs "x ≠ y."

I Note that for a file of 10^16 bytes (≈ 900 Tb), the amount of data transmitted is ≈ 256 bytes. (A sketch of the protocol appears below.)
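A minimal sketch of the protocol using java.math.BigInteger; choosing the random prime by rejection sampling with isProbablePrime is a convenience of this sketch, not part of the protocol:

    import java.math.BigInteger;
    import java.util.Random;

    public class EqualityProtocol {
        // Alice: choose a random prime p in [2, n^2]; send (p, x mod p).
        static BigInteger[] alice(BigInteger x, long n, Random rng) {
            BigInteger two = BigInteger.valueOf(2);
            BigInteger nSq = BigInteger.valueOf(n).multiply(BigInteger.valueOf(n));
            BigInteger p;
            do {   // rejection-sample until we land on a prime in range
                p = new BigInteger(nSq.bitLength(), rng);
            } while (p.compareTo(two) < 0 || p.compareTo(nSq) > 0
                     || !p.isProbablePrime(50));
            return new BigInteger[] { p, x.mod(p) };
        }

        // Bob: report "x = y" iff y mod p matches Alice's fingerprint s.
        static boolean bob(BigInteger y, BigInteger p, BigInteger s) {
            return y.mod(p).equals(s);
        }
    }

Running the exchange with k independently chosen primes drives the error down to ε^k — the success amplification discussed below.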

I To analyze the error, we take the usual "worst-case average" approach: for the worst possible choice of files, what is the average probability of error?


Randomized Equality Protocol IV

I Say a prime p is 'bad' for (x, y) if x mod p = y mod p, but x ≠ y. Otherwise, say p is 'good' for (x, y). Our general approach is to prove that the 'good' primes vastly outnumber the bad ones, and so our chance of picking a 'good' prime is high.

I The probability that an error occurs is

    (# bad primes in [2, n²]) / (# primes in [2, n²])

I The number of primes in [2, n²] is

    π(n²) ∼ Li(n²) ∼ n²/ln n²

(Prime number theorem; Li is the logarithmic integral.)


Randomized Equality Protocol V

I An error occurs when x ≠ y but x, y are the same modulo p, i.e., we can write

    x = x′ · p + s
    y = y′ · p + s

for some integers x′, y′. Then, p divides (x − y), since x − y = x′ · p + s − (y′ · p + s) = (x′ − y′) · p. Let r = |x − y|. Since r ≤ 2^n, r has ≤ n − 1 prime divisors. The probability ε of p being a prime divisor of r is therefore

    ε ≤ (n − 1)/π(n²) ∼ n / (n²/ln n²) = (2 ln n)/n

Therefore the probability of error is

    ε ≤ (2 ln n)/n · (1 − o(1))


Randomized Equality Protocol VI

For example, if n = 10^16, the error probability is ≈ 10^(−14).

I This is a specific example of a general pattern: "abundance of witnesses." The principle is that if x ≠ y and p does not divide (x − y), then p is a "witness" to the fact "x ≠ y." There are lots of witnesses, so if we choose a potential witness (a prime) at random, we're likely to find one.

I To get an even lower error, we can repeat the protocol k times: Alice chooses k primes uniformly at random from [2, n²] and transmits x mod pi for each prime pi. With k independent trials, and failure probability ε in each, the probability of k failures is ≤ ε^k. For example, with n = 10^16 and k = 10, by sending ≈ 2 kb of data, we can obtain a probability of error ≈ 10^(−141). This is an example of success amplification.


Classification of randomized algorithms I

I Stochastic algorithms: use random bits in some way.

1. Las Vegas algorithms: no error; use coin flips to avoid worst cases; get good worst-case expected time.

2. Monte Carlo algorithms: allow some probability of error.

2.1 One-sided Monte Carlo (1MC): a YES answer is always correct; a NO answer has some probability of error (i.e., false negatives are possible).

2.2 Bounded-error Monte Carlo (2MC): computes a function f(w) with probability ≥ 1/2 + δ of being correct, δ > 0.

2.3 Unbounded-error Monte Carlo (UMC).


One-sided Monte Carlo I

I Recall that a decision problem is described by some set L; we are asked to decide "Is w ∈ L?"

I An algorithm A is a one-sided Monte Carlo algorithm when:
  I If x ∈ L then Pr(A(x) = 1) ≥ 1/2.
  I If x ∉ L then Pr(A(x) = 0) = 1.

I The randomized equality protocol we saw was a one-sided Monte Carlo algorithm:
  I It had zero probability of error if the files were equal, and some probability of error when the files were unequal.
  I To match the definition of one-sided MC, we could take the set being decided to be pairs of files of n bits that differ. (A NO answer to the decision problem = the files are equal.)

I Since the probability of error is < 1/2, we can get error ≤ δ with t ≥ ⌈− log δ⌉ repetitions (answering YES if any repetition answers YES).


Bounded-Error Monte Carlo I

I Also known as two-sided Monte Carlo.

I Computes a function f : Σ* → Σ*, e.g., a function of binary strings.

I Probability of being correct is ≥ 1/2 + ε, for some constant ε > 0.

I Since ε > 0, to obtain an error probability < δ we need only a constant number of iterations, independent of n.

I If ε ∈ o(1) then we might need exponentially many repetitions (in n) to achieve an error probability < δ.

I Success amplification (see the sketch below):
  I Run the algorithm t times.
  I If an output appears at least ⌈t/2⌉ times, output it (i.e., majority vote).
  I Otherwise, output "?" (algorithm fails.)


Bounded-Error Monte Carlo II

I A tedious analysis shows that to achieve an error probability < δ, it suffices to choose

    t ≥ 2 ln δ / ln(1 − 4ε²)

I Note this formula does not depend on the length of the input.


Unbounded Error Monte Carlo

I Has probability 1/2 + ε(n) of being correct, i.e., better than chance (but possibly not much better!)

I Using the same formula as before, to obtain an error δ, the number of repetitions required is

    t ≥ 2 ln δ / ln(1 − 4ε²(n))

If ε ∈ o(1), then 1 − 4ε²(n) → 1, and ln(1 − 4ε²(n)) → 0. So, t ∈ ω(1).

I Need t ∈ Ω(ε^(−2)) to keep the error bounded.

I It could be that exponentially many repetitions of the algorithm are required.




ECE750-TXB Lecture 17: Algorithms for binary relations and graphs

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo

Canada

March 8, 2007


Binary Relations I

I Recall that a binary relation on a set X is a set R ⊆ X².

I We may interpret a binary relation as a directed graph G = (X, R).

I Some common axioms relations may satisfy:

1. Transitive (T):

    ∀x, y, z . (R(x, y) ∧ R(y, z) → R(x, z))

If there is a path from x to z, there is an edge from x to z.


Binary Relations II

2. Reflexive (R): ∀x . R(x, x)

Every vertex has an edge to itself.

3. Symmetric (S):

    ∀x, y . R(x, y) → R(y, x)

If there is an edge from x to y, there is an edge from y to x. Usually one draws the graph without arrows, and it is called simply a "graph" rather than a directed graph.


Binary Relations III

4. Antisymmetric (A):

    ∀x, y . R(x, y) ∧ R(y, x) → (x = y)

When the relation is reflexive, transitive and also antisymmetric, it is a partial order.

I A rough classification of binary relations:

    Binary Relation / Directed Graph
    ├─ Graph (S)
    └─ Preorder/Quasiorder (T,R)
       ├─ Equivalence (T,R,S)
       └─ Partial order/Poset (T,R,A)
          └─ Tree order
             └─ Total order


Binary Relations IV

I Good algorithms for managing the common classes of binary relations are known. If you can identify the abstract relation(s) underlying a problem, this may lead you directly to efficient algorithms.


Part I

Equivalence Relations


Equivalence relations and partitions I

I An equivalence relation ∼ is a binary relation that is reflexive, transitive, and symmetric. (The most familiar example: equality, "=".)

I Pictured as a graph, an equivalence relation is a collection of cliques:

    [Figure: two cliques, one on {a, b, c, d} and one on {e, f, g}]

I For an equivalence ∼ ⊆ X², we write
  I [a]∼ = {b ∈ X : a ∼ b} for the equivalence class of a;


Equivalence relations and partitions II

  I X/∼ for the set of equivalence classes induced by ∼:

        X/∼ = {[a]∼ : a ∈ X}

    X/∼ is a partition. (Recall that a partition of a set X is a collection of subsets Y1, ..., Yk of X that are pairwise disjoint and satisfy ⋃ Yi = X.)

I Example: In the above figure, the equivalence classes are {a, b, c, d} and {e, f, g}.

I Example: take N with a ∼ b ≡ (a mod 5 = b mod 5). The equivalence classes N/∼ are {0, 5, 10, ...}, {1, 6, 11, ...}, ..., {4, 9, 14, ...}.

I Common algorithmic problems we encounter with equivalence classes:
  I Answering queries of the form "Is a ∼ b?"


Equivalence relations and partitions III

  I Maintaining an equivalence relation as we progressively decide objects are equivalent. (This results from an inductively defined equivalence relation.) Example: the Nelson-Oppen method for equational reasoning [7].
  I Maintaining an equivalence relation as we progressively decide objects are not equivalent. (This results from a co-inductive definition of equivalence [6].) Examples: minimizing states of a DFA [4], maintaining bisimulations, congruence closure [3].

I A system of representatives is the primary means for efficient manipulation of equivalence relations.

I A system of representatives for ∼ is a function s : (X/∼) → X choosing a single element from each block of the partition, such that

    a ∼ b if and only if s(a) = s(b)


Equivalence relations and partitions IV

I Example: to reason about equivalence of integers modulo 5, we could choose the representatives 0, 1, 2, 3, and 4. The integer 1 represents the equivalence class [1]∼ = {1, 6, 11, 16, ...}.

I With a means to quickly compute representatives, we can test whether a ∼ b by computing the representatives of the equivalence classes [a]∼ and [b]∼, then using equality (see the sketch below).

I If the equivalence relation is static, one can precompute a system of representatives as, e.g., a table. If the equivalence relation is discovered dynamically, more sophisticated methods are needed.
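A minimal sketch for the mod-5 example, where the representative function is just the remainder:

    // Representative of the equivalence class of a (mod 5).
    static int rep(int a) { return Math.floorMod(a, 5); }

    // a ~ b iff the representatives agree: equivalent(1, 16) is true.
    static boolean equivalent(int a, int b) { return rep(a) == rep(b); }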


Disjoint Set Union I

I Disjoint Set Union is algorithms-speak for maintaining an inductively-defined equivalence relation:
  I Initially we have a set of objects, none of which are known to be equivalent.
  I We gradually discover that objects are equivalent, and we wish to maintain a representation of the equivalence relation that lets us quickly answer queries of the form "Is a ∼ b?"

I Interface:
  I union(a, b): include a ∼ b in the equivalence relation
  I find(a): returns an equivalence class representative (ECR) for a.

I There is a wonderfully elegant data structure due to Tarjan [8] that performs a sequence of n of these operations in O(nα(n)) time, where α(n) ≤ 3 for n less than (cosmologists' best estimate of) the number of particles in the universe.


Disjoint Set Union II

I Tarjan's data structure maintains the equivalence relation on the set X as a forest — a collection of trees. Each node in a tree is an element of the set X, each tree is an equivalence class, and each root is an equivalence class representative.

    [Figure: a forest of two trees — root b with children a, c, d; root e with children f, g — representing the equivalence classes {a, b, c, d}, {e, f, g}]

I Each element has a pointer to its parent; to determine the equivalence class representative, we follow the parent pointers to the root of the tree.


Disjoint Set Union III

I The efficiency of the representation depends on how deep the trees are. To keep the trees shallow, two techniques are employed: (i) path compression; and (ii) 'union by rank.'

I Record representation: for each element x ∈ X, we track
  I parent(x): a pointer to the parent of x, or a pointer to itself if it is the root (alternately, a null pointer can be used.)
  I rank(x): indicates how deep trees are (but not depth per se).

I Pseudocode for find(a):

    find(a)
        if parent(a) ≠ a then
            parent(a) ← find(parent(a))   // repoint at the root (path compression)
        return parent(a)


Disjoint Set Union IV

This recursively follows the parent pointers up to the root, then rewrites all the parent pointers so they point directly at the root, called "path compression":

    [Figure: left, root f with children d and e, where d has child c; right, after calling find(c), both c and d point directly at the root f]

I A simple way to implement union(a, b): just make the root of a's tree have b as a parent.

    union(a, b)
        parent(find(a)) ← b


Disjoint Set Union V

However, this can lead to poorly balanced trees. For better asymptotic efficiency, one can track how deep the trees are and always make the deeper tree the parent of the shallower tree: called "union by rank."

    union(a, b)
        pa ← find(a)
        pb ← find(b)
        if pa = pb then return
        if rank(pa) > rank(pb) then
            parent(pb) ← pa
        else
            parent(pa) ← pb
            if rank(pa) = rank(pb) then
                rank(pb) ← rank(pa) + 1


Disjoint Set Union VI

I Tarjan proved that using both path compression and union by rank, a sequence of n calls to union and find requires O(nα(n)) time, where α(n) ≤ 3 for

    n ≤ 2^2^2^···^2   (a tower of 65536 powers of two)

The function α(n) is the 'inverse' of the Ackermann function; see CLR [2] or [8] for details.

I For any practical purpose, the time required by Tarjan's algorithm is indistinguishable from O(n) for a sequence of n operations; or O(1) per operation amortized time (to come.)


Part II

Graphs


Representation of Graphs I

I Here are four common methods of representing graphs.

I If the graph is large (e.g., infinite), the structure is not known beforehand, etc., we may choose an implicit representation for the graph, where vertices and edges are computed on-the-fly as needed. For example, the graph G = (N, E) where (x, y) ∈ E if and only if y divides x, is an infinite graph where the edges can be computed on the fly by factorization.

I An explicit representation is one where we directly encode the structure of the graph in a data structure. Some common methods for this:


Representation of Graphs II

I Adjacency matrix: an n × n matrix A of 0's and 1's, with Aij = 1 if and only if (vi, vj) ∈ E. Row i indicates the out-edges for vertex i, and column i indicates the in-edges.

    A = [ 0 1 1 0
          0 0 0 1
          0 0 0 1
          0 0 0 0 ]

    [Figure: the corresponding graph on vertices a, b, c, d with edges a→b, a→c, b→d, c→d]


Representation of Graphs III

I Adjacency lists: each vertex maintains a set of vertices to/from which there is an edge, e.g.,

    out(a) = {b, c}
    out(b) = {d}
    out(c) = {d}
    out(d) = ∅

I If the graph structure is static (i.e., not changing as the algorithm runs), it is common to represent lists of in- and out-edges as vectors, for efficiency.

I For more elaborate algorithms on, e.g., weighted graphs, a representation of this sort is commonly used:


Representation of Graphs IV

    public class Edge {
        Vertex x, y;       // endpoints
        double weight;     // edge weight
    }

    public class Vertex {
        Set<Edge> out;     // outgoing edges
        Set<Edge> in;      // incoming edges
    }


Depth-First Search I

I One of the commonest operations on a graph is to visit the vertices of the graph one by one in some desired order. This is commonly called a search.

I In a depth-first search, we explore along a single path into the graph as far as we can until no new vertices can be reached; then we return to some earlier point where new vertices are still reachable and continue. (Think of exploring a maze.)

I Example of a depth-first search starting at the center vertex of a graph: [Figure: a graph with the depth-first search path highlighted]


Depth-First Search II

I As we visit each new vertex, we perform some action there. The choice of action depends on what we hope to accomplish; for now we will just call it "visiting the vertex," but later we will see examples of specific useful actions. We might choose to visit the vertex the first time we see it (preorder), or the last time we see it (postorder).

I Here is a recursive implementation of depth-first search. It uses a set Seen to track which vertices have been visited. One can also include a flag field as part of the vertex data structure that can be "marked" to indicate the vertex has been seen.


Depth-First Search III

    dfs(x)
        dfs(x, ∅)

    dfs(x, Seen)
        if x ∉ Seen
            Seen ← Seen ∪ {x}
            preorderVisit(x)        // Do something
            for each edge (x, y),
                dfs(y, Seen)
            postorderVisit(x)       // Do something

I This search is easily implemented in a nonrecursive version, using a stack data structure to keep track of the current path into the graph:


Depth-First Search IV

    dfs(x)
        Seen ← ∅
        Stack S
        push(S, x)
        while S is not empty,
            y ← pop(S)
            if y ∉ Seen then
                Seen ← Seen ∪ {y}
                preorderVisit(y)
                for each edge (y, z),
                    push(S, z)


Topological Sort I

I A Directed Acyclic Graph (DAG) is a graph in which there are no cycles (i.e., paths from a vertex to itself.)

I The reflexive, transitive closure of a DAG is a partial order. (If you add to a DAG an edge (x, y) whenever there is a path from x to y, plus self-loops (x, x), the resulting edge relation is a partial order: reflexive, transitive, and anti-symmetric.)

I Every finite partial order can be extended to a total order: i.e., if ⊑ is a partial order on a finite set, there is a total order ≤ such that (x ⊑ y) ⇒ (x ≤ y); or, more obtusely, ⊑ ⊆ ≤. (The axiom of choice implies this for infinite sets also.)


Topological Sort II

I Example: let V = N² (pairs of natural numbers), and for all i, j, put edges (i, j) → (i + 1, j) and (i, j) → (i, j + 1):

    [Figure: the infinite grid graph on N², with all edges pointing right or up]

Then the transitive reflexive closure of this graph is a partial order ⊑ where (i, j) ⊑ (i′, j′) if and only if i ≤ i′ and j ≤ j′:

    [Figure: the Hasse diagram of ⊑, with (0,0) at the bottom, (1,0) and (0,1) above it, then (2,0), (1,1), (0,2), and so on]


Topological Sort III

One way to extend ⊑ to a total order is to sweep through the grid diagonal by diagonal:

    [Figure: a zigzag path through the grid visiting (0,0), (1,0), (0,1), (2,0), (1,1), (0,2), ...]

An example of what computer scientists call "dovetailing."

I Topological sort is a method for obtaining a total-order extension of a partial order.


Topological Sort IV

I Example: Suppose we want to evaluate a digital circuit:

[Figure: a small digital circuit with signals a, b, c, d, e.]

Build a graph where signals are vertices, and an edge indicates that one signal depends upon another (a 'dependence graph'):

[Figure: the dependence graph on the vertices a, b, c, d, e.]


Topological Sort V

The transitive, reflexive closure of this graph yields an order ⊒, where e.g. 'e ⊒ d' means signal e can be evaluated only after signal d. Extending ⊒ to a total order ≥ gives us a valid order in which to evaluate the signals, e.g.,

e ≥ d ≥ c ≥ b ≥ a

If we evaluate signals in the order a, b, c, d, e we respect the dependencies.

I Other examples:
    I Ordering the presentation of topics in a course or paper.
    I Solving equations
    I Makefiles
    I Planning (keeping track of task dependencies)
    I Spreadsheets and dataflow languages [5]
    I Ordering static initializers in programming languages
    I Dynamization of static algorithms, e.g. [1]



Topological Sort VI

I Here is an algorithm for topological sort based on depth-first search. Note that there are many ways in which a partial order can be extended to a total order; this is just one method.

TopologicalSort(V, E)
    Set<Node> visited;
    List<Node> order;
    for x ∈ V
        dfs(x, visited, order)

dfs(x, visited, order)
    if x ∉ visited
        visited.add(x)
        for each out edge (x, y)
            dfs(y, visited, order)
        order.insertBack(x)
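I As a concrete illustration, here is a runnable Java sketch of this method, assuming vertices numbered 0..n−1 and a DAG in adjacency-list form (the names and the example dependence graph are my own, not the lecture's):

import java.util.*;

class TopoSort {
    // Topological sort by postorder depth-first search: each vertex is
    // appended to the back of the list after every vertex it points to,
    // so with edge (x, y) read as "x depends on y", every vertex appears
    // after its dependencies.
    static List<Integer> topologicalSort(List<List<Integer>> adj) {
        Set<Integer> visited = new HashSet<>();
        List<Integer> order = new ArrayList<>();
        for (int x = 0; x < adj.size(); ++x)
            dfs(x, adj, visited, order);
        return order;
    }

    static void dfs(int x, List<List<Integer>> adj,
                    Set<Integer> visited, List<Integer> order) {
        if (visited.contains(x)) return;
        visited.add(x);
        for (int y : adj.get(x))
            dfs(y, adj, visited, order);
        order.add(x);                      // insertBack, postorder
    }

    public static void main(String[] args) {
        // A hypothetical dependence graph on a=0, b=1, c=2, d=3, e=4:
        // c depends on b; d depends on a and b; e depends on d and c.
        List<List<Integer>> adj = Arrays.asList(
            Arrays.asList(), Arrays.asList(), Arrays.asList(1),
            Arrays.asList(0, 1), Arrays.asList(3, 2));
        System.out.println(topologicalSort(adj));  // prints [0, 1, 2, 3, 4]
    }
}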


Topological Sort VII

I We search the dependence graph depth-first, visiting vertices postorder, at which time we insert them at the back of the list.

I Example: for the circuit example, a depth-first search might visit the vertices in the order a, b, d, c, e.


Connected components of undirected graph I

I Defn: A set of vertices Y ⊆ V is connected if for every a, b ∈ Y there is a path from a to b. Y is a maximal connected component if it cannot be enlarged, i.e., for any connected set of vertices Y′ with Y ⊆ Y′, Y = Y′.

I Note that the connected components of a graph form a partition of the vertices:

[Figure: an undirected graph on the vertices a, b, c, d, e, g, q, drawn so that each connected component forms a separate cluster.]

The connected components are the clusters of mutually reachable vertices among a, b, g, q, c, d, e shown in the figure.

I Using Tarjan's disjoint set union, there is a very simple algorithm for connected components:


Connected components of undirected graph II

1. Have a parent pointer and rank associated with each vertex (e.g., by creating a separate record for each vertex, or by storing these fields directly in the vertex data structure).

2. For each edge (a, b), call union(a, b).

No searching is necessary! The complexity is O((|E| + |V|) α(|E| + |V|)), 'practically' linear in the number of vertices and edges.
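I A small runnable Java sketch of this algorithm, with path compression and union by rank; the array-based representation and the example edges are my own assumptions:

import java.util.*;

class ConnectedComponents {
    int[] parent, rank;

    ConnectedComponents(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; ++i) parent[i] = i;  // each vertex alone
    }

    // Find with path compression.
    int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    // Union by rank.
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (rank[ra] < rank[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;
        if (rank[ra] == rank[rb]) rank[ra]++;
    }

    public static void main(String[] args) {
        // Edges of an undirected graph on vertices 0..5.
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        ConnectedComponents cc = new ConnectedComponents(6);
        for (int[] e : edges) cc.union(e[0], e[1]);  // no searching!
        // Vertices with the same root are in the same component.
        for (int v = 0; v < 6; ++v)
            System.out.println(v + " is in component " + cc.find(v));
    }
}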


Bibliography I

[1] Umut A. Acar, Guy E. Blelloch, Robert Harper, Jorge L. Vittes, and Shan Leung Maverick Woo. Dynamizing static algorithms, with applications to dynamic trees and history independence. In SODA '04: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 531–540, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.

[2] Thomas H. Cormen, Charles E. Leiserson, and Ronald R. Rivest. Introduction to Algorithms. McGraw-Hill, 1991.


Bibliography II

[3] Peter J. Downey, Ravi Sethi, and Robert Endre Tarjan. Variations on the common subexpression problem. Journal of the ACM, 27(4):758–771, 1980.

[4] J. E. Hopcroft. An n log n algorithm for minimizing the states in a finite automaton. In Z. Kohavi, editor, Theory of Machines and Computations, pages 189–196. Academic Press, 1971.

[5] Wesley M. Johnston, J. R. Paul Hanna, and Richard J. Millar. Advances in dataflow programming languages. ACM Computing Surveys, 36(1):1–34, 2004.


Bibliography III

[6] Y. N. Moschovakis. Elementary Induction on Abstract Structures. North-Holland, Amsterdam, 1974.

[7] Greg Nelson and Derek C. Oppen. Fast decision procedures based on congruence closure. Journal of the ACM, 27(2):356–364, 1980.

[8] R. E. Tarjan. Efficiency of a good but not linear disjoint set union algorithm. Journal of the ACM, 22:215–225, 1975.


ECE750-TXB Lecture 18: Graph Algorithms

Todd L. [email protected]

Electrical & Computer EngineeringUniversity of Waterloo

Canada

March 13, 2007

Weighted Graphs

I A weighted graph is a triple G = (V, E, w) where w is a weight function: often

    w : E → R ∪ {+∞}
    w : E → Q⁺

I Often w(x, y) > 0 and represents a "distance" or "score."

I Example: vertices are cities, and edges represent driving times between adjacent cities.


Distance metric on graphs I

I If edge weights are positive, we can define a distance metric (or quasimetric) on the graph. You are familiar with Euclidean distance, for example, in R²:

    d(x, z) = √((x₁ − z₁)² + (x₂ − z₂)²)

We can define distances in graphs in such a way that they share many of the useful properties of Euclidean distance.

I A distance metric d : V² → R satisfies:

    1. d(x, y) ≥ 0
    2. d(x, y) = 0 if and only if x = y
    3. d(x, y) = d(y, x) (symmetry)
    4. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality)

If the symmetry axiom (3) is omitted, d is called a quasimetric. (For weighted directed graphs, a quasimetric may be appropriate.)

Distance metric on graphs II

I A set V together with a distance metric d : V² → R is called a metric space.

I A connected graph with nonnegative edge weights can be turned into a metric space:

    1. Define the length of a path to be the sum of edge weights along the path;
    2. Define d(x, x) = 0 and d(x, y) to be the minimum path length from x to y.

I In R² we can define open and closed discs:

    {(x, y) : √(x² + y²) < r}
    {(x, y) : √(x² + y²) ≤ r}


Distance metric on graphs III

I In a metric space (V, d) we can define open and closed balls:

    B_r(x) = {y ∈ V : d(x, y) < r}

E.g., if we construct a graph of settlements in Ontario where edges indicate roads and weights are driving times, then a ball is, e.g., the set of settlements within two hours of Waterloo.

Breadth-first search I

I Breadth-first search is another method to visit all the vertices of a graph. Conceptually, we put weights of 1 on each edge. Then from some starting vertex x, we consider balls of radius r around x and take r → ∞; we visit vertices in the order they are added to the ball.


Breadth-first search II

I Basic scheme: we maintain a queue of vertices that are just outside the current ball.

BFS(V, E, x)
    Seen ← ∅
    Queue Q
    Enqueue(Q, x)
    While Q is not empty,
        Get y = next element in queue.
        if y ∉ Seen,
            Seen ← Seen ∪ {y}
            Visit y
            For each edge (y, z) ∈ E,
                Enqueue(Q, z)

This algorithm is linear in the number of edges.
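I For concreteness, a compact Java sketch of this scheme, assuming an adjacency-list graph (the println stands in for "visit"):

import java.util.*;

class BFS {
    // Breadth-first search from x: vertices are visited in order of
    // hop distance from x (balls of radius 1, 2, 3, ...).
    static void bfs(List<List<Integer>> adj, int x) {
        Set<Integer> seen = new HashSet<>();
        Queue<Integer> q = new ArrayDeque<>();
        q.add(x);
        while (!q.isEmpty()) {
            int y = q.remove();
            if (seen.add(y)) {             // true if y was not yet seen
                System.out.println("visit " + y);
                q.addAll(adj.get(y));      // enqueue all out-neighbours
            }
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> adj = Arrays.asList(
            Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(3),
            Arrays.asList());
        bfs(adj, 0);  // visits 0, then 1 and 2, then 3
    }
}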

Single-source shortest paths I

I The BFS algorithm is easily modified to solve the following problem: given a connected graph with nonnegative edge weights and a specified vertex x, compute d(x, y) for all y ∈ V. That is, find the length of the shortest path from x to every other vertex in the graph.

I Intuition: again, consider balls centered around x, but use edge weights. We want to visit vertices in order of their distance from x, so we modify the BFS algorithm to use a priority queue. We put pairs (z, d) into the priority queue, where z is a vertex, and d is the length of some path from x to z. The priority queue orders (z, d) pairs by d, using e.g. a min-heap, so that at each step we can efficiently retrieve the next closest vertex to x.


Single-source shortest paths II

SSSP(V, E, x, w)
    Seen ← ∅
    PriorityQueue PQ
    Put (x, 0) in PQ.
    While PQ is not empty,
        Get (y, d) from PQ (least element).
        if y ∉ Seen then
            Seen ← Seen ∪ {y}
            Visit (y, d)
            For each edge (y, z), put (z, d + w(y, z)) in PQ.

I Time complexity: if we use a min-heap, this achieves O(|E| log |E|) time. It is possible to get this down to O(|V| log |V| + |E|) by the use of somewhat exotic data structures such as Fibonacci heaps.
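I Here is a minimal Java sketch of this algorithm, using java.util.PriorityQueue as the min-heap; the edge encoding (int[]{target, weight}) is my own assumption:

import java.util.*;

class SSSP {
    // Single-source shortest paths, following the SSSP pseudocode above:
    // the priority queue holds (vertex, d) pairs ordered by d, and stale
    // entries are skipped because the vertex is already "Seen" (in dist).
    static Map<Integer, Integer> sssp(List<List<int[]>> adj, int x) {
        Map<Integer, Integer> dist = new HashMap<>();   // final d(x, y)
        PriorityQueue<int[]> pq =
            new PriorityQueue<>(Comparator.comparingInt((int[] p) -> p[1]));
        pq.add(new int[]{x, 0});
        while (!pq.isEmpty()) {
            int[] p = pq.remove();          // least d first (min-heap)
            int y = p[0], d = p[1];
            if (dist.containsKey(y)) continue;   // y already visited
            dist.put(y, d);                      // "Visit (y, d)"
            for (int[] e : adj.get(y))
                pq.add(new int[]{e[0], d + e[1]});
        }
        return dist;
    }

    public static void main(String[] args) {
        // 0 --5--> 1, 0 --2--> 2, 2 --1--> 1.
        List<List<int[]>> adj = Arrays.asList(
            Arrays.asList(new int[]{1, 5}, new int[]{2, 2}),
            Arrays.asList(),
            Arrays.asList(new int[]{1, 1}));
        System.out.println(sssp(adj, 0));  // d(0,0)=0, d(0,2)=2, d(0,1)=3
    }
}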

Transitive Closure I

I Let G = (V, E) be a graph.

I The transitive closure of G is G′ = (V, E*) where (x, y) ∈ E* if there is a path from x to y in G.

I Define T(E) = {(x, y) : there is a path from x to y in E}.
I Then E* = T(E).

I T is a closure operator:

    1. E ⊆ T(E) (nondecreasing)
    2. (E₁ ⊆ E₂) ⇒ (T(E₁) ⊆ T(E₂)) (monotone)
    3. T(T(E)) = T(E) (idempotent/fixpoint)

I The complexity of transitive closure is closely linked to that of matrix multiplication.


Transitive Closure II

I There is a path of length 2 from i to j if there is some vertex k such that E(i, k) ∧ E(k, j). We can write this as:

    E²(i, j) = ⋁_{k∈V} E(i, k) ∧ E(k, j)

where ⋁_{k∈V} is a disjunction over all vertices k ∈ V (⋁ is to ∨ as Σ is to +, as Π is to ×, etc.)

[Figure: vertices i and j, with a length-2 path from i to j through each candidate intermediate vertex.]

Transitive Closure III

I Compare to matrix multiplication: if B = AA, then

    b_ij = Σ_k a_ik · a_kj

I If A is the adjacency matrix of the graph, then to find paths of length 2 we can compute the matrix product A² in the boolean ring (B, +, ·, 0, 1) where B = {0, 1}, addition is disjunction (α + β) ≡ (α ∨ β), and multiplication is conjunction (α · β) ≡ (α ∧ β).

I To find paths of any length, we can write

    A* = I + A + A² + A³ + · · ·    (1)

where we need only compute terms up to A^(n−1), where n = |V|, the number of vertices.

With the leading I term, Eqn. (1) gives a reflexive transitive closure. The difference between transitive closure and reflexive-transitive closure is trivial: the latter has E*(x, x) for every x.


Transitive Closure IV

I The obvious method of evaluating Eqn. (1) requires O(n⁵) time. If we write

    A* = I + A(I + A(I + A(I + A(I + · · · ))))    (2)

we can compute A* with O(n⁴) operations.

I By using power trees, e.g., A⁸ = (((A²)²)²), we can compute the transitive closure with O(n³ log n) operations; or O(n^γ log n), where γ is the exponent of e.g. Strassen matrix multiplication.

I There is a simple algorithm (Warshall's Algorithm) that computes transitive closure from the adjacency matrix in O(n³) time.
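I For concreteness, here is a Java sketch of Warshall's algorithm over a boolean adjacency matrix; the representation and names are my own:

class Warshall {
    // Warshall's algorithm: boolean adjacency matrix in, transitive
    // closure out, in O(n^3) time.
    static boolean[][] transitiveClosure(boolean[][] a) {
        int n = a.length;
        boolean[][] t = new boolean[n][n];
        for (int i = 0; i < n; ++i)
            t[i] = a[i].clone();
        for (int k = 0; k < n; ++k)        // allow k as an intermediate
            for (int i = 0; i < n; ++i)
                if (t[i][k])
                    for (int j = 0; j < n; ++j)
                        t[i][j] |= t[k][j];   // path i -> k -> j
        return t;
    }

    public static void main(String[] args) {
        boolean[][] a = new boolean[3][3];
        a[0][1] = true; a[1][2] = true;       // 0 -> 1 -> 2
        boolean[][] t = transitiveClosure(a);
        System.out.println(t[0][2]);          // true: there is a path 0 -> 2
    }
}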

I In practice, the best way to compute transitive closure depends strongly on the anticipated structure of the input graph, its size, density, planarity, etc. There is a large literature on algorithms for TC.

Transitive Closure V

I Perhaps surprisingly, there is an algorithm computing transitive closure in O(n²) average time, for a uniform distribution on graphs.

I The G(n, p) random graph model is a distribution on graphs of n vertices where each edge is present independently with probability p. Choosing p = 1/2 gives a uniform distribution on graphs.

I In G(n, 1/2), transitive closure can be computed in O(n²) time on average.
I The reason why: with probability tending to 1, every vertex is at most two steps away from every other vertex:
    I Let x, y ∈ V be vertices; there are n − 2 choices of intermediate vertices to make a path of length 2 from x to y. With each intermediate vertex w, we have a probability 1/4 of having both the edge (x, w) and (w, y).
    I The probability of there being no w such that E(x, w) ∧ E(w, y) is (3/4)^(n−2).


Transitive Closure VI

I There are (n choose 2) choices of x and y. Let Z be the event "there exist (x, y) such that there is no w where E(x, w) ∧ E(w, y)." Then, using the union bound,

    Pr(Z) ≤ (n choose 2) (3/4)^(n−2) = O(n² (3/4)ⁿ) = o(1)

This probability goes to 0 very fast as n → ∞.

I To turn this insight into an algorithm, consider paths of length 1, 2, . . .. For each pair (x, y), consider all possible intermediate sequences of vertices; stop when an intermediate sequence is found that gives a path from x to y. Then, stop when a path is found between every pair of vertices.

I To find paths of length 2, the number of intermediate vertices that need to be examined follows a geometric distribution, with a mean of (1/4)⁻¹ = 4 vertices.

Transitive Closure VII

I And, with probability tending to 1, we can stop after only considering paths of length 2. Because the probability is converging exponentially, we get an average time complexity of O(n²).

I That transitive closure can be computed quickly in the G(n, 1/2) random graph model is an instance of a much deeper pattern arising from "zero-one laws" in finite model theory.

I Note that we can write transitive closure as an iteration of a first-order sentence:

    E⁰(x, y) = ⊥
    E^(k+1)(x, y) = (x = y) ∨ E^k(x, y) ∨ ∃w. E^k(x, w) ∧ E(w, y)

"There is a path of length ≤ k + 1 from x to y if there is a path of length k, or there is a vertex w so there is a path of length k from x to w, and an edge from w to y."

I This is an example of a FO+lfp (first order logic with least fixpoint) definition.


Transitive Closure VIII

I For every FO+lfp definable relation, there is an FO definable relation (i.e., without iteration) that is equivalent with probability tending to 1 as n → ∞, in the G(n, 1/2) random graph model. For example, the "approximate" transitive closure Ẽ* given by:

    Ẽ*(x, y) = (x = y) ∨ ∃w. E(x, w) ∧ E(w, y)

is equal to the real transitive closure E* with asymptotic probability 1.

I cf. Almost-everywhere equivalence [2].

Strongly Connected Components I

I Strongly connected components is the directed-graph analogue of connected components.

I Consider this set of equations:

    x = 3 + y
    y = x − 2
    z = x + 4w
    w = z − 8

We can solve such systems of linear equations with Gaussian elimination in time O(n³).


Strongly Connected Components II

I If we look at the dependence graph, we discover something useful:

[Figure: the dependence graph: x and y point to each other; z and w point to each other; z also points to x.]

We do not need to solve the whole system at once; instead we can first solve {x, y} and then solve {z, w}. {x, y}, {z, w} are the strongly connected components.

I A subset of vertices Y ⊆ V is strongly connected if for each x, y ∈ Y, there is a path from x to y, and a path from y to x.

I Write x ⪯ y if there is a path from x to y.
I ⪯ is transitive and reflexive, but not necessarily antisymmetric.
I ⪯ is a preorder.

I Any preorder can be decomposed into:

Strongly Connected Components III

1. An equivalence relation ∼, where V/∼ are the strongly connected components:

    x ∼ y ≡ (x ⪯ y) ∧ (y ⪯ x)

2. A partial order ≤ on V/∼:

    [x]∼ ≤ [y]∼ ≡ x ⪯ y

(This is a common method of constructing hierarchies. For example, in computability theory and structural complexity theory, a reducibility relation defines a preorder on problems, and the resulting partial order ≤ on interreducible problems gives a hierarchy of what are sometimes called 'degrees.' The class of NP-complete problems we shall see later can be constructed this way.)


Strongly Connected Components IV

I Example: for the set of equations above, we obtain the equivalence relation ∼ given by the partition {x, y}, {z, w}, and the corresponding partial order is given by this Hasse diagram:

[Hasse diagram: {x, y} drawn above {z, w}.]

I The graph obtained by merging all the elements of the strongly connected components into "supernodes", and keeping whatever edges remain between supernodes, is called a condensation graph.

I Strongly connected components can sometimes be used to decompose a problem into a collection of smaller problems that can be solved in order. For example, to solve a system of linear equations efficiently, one can

Strongly Connected Components V

1. Compute the strongly connected components;
2. Compute a topological sort of the resulting condensation graph;
3. Solve the smaller systems of equations according to the topological sort order.

This technique is used in program analysis and compilers, for efficient solution of lattice equations.

I A simple (but not terribly efficient) method to compute the strongly connected components:

    1. Compute the transitive closure E*;
    2. For each (x, y), if both E*(x, y) and E*(y, x), then do union(x, y) with e.g. Tarjan's disjoint set union.
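I A runnable Java sketch of this simple method, assuming the transitive closure is already available as a boolean matrix (e.g., computed by Warshall's algorithm above); the inline union-find omits ranks for brevity:

import java.util.Arrays;

class SimpleSCC {
    static int[] parent;
    static int find(int x) {
        return parent[x] == x ? x : (parent[x] = find(parent[x]));
    }

    // tc is the transitive closure as a boolean matrix; merge x and y
    // whenever each reaches the other. Returns a component id per vertex.
    static int[] scc(boolean[][] tc) {
        int n = tc.length;
        parent = new int[n];
        for (int i = 0; i < n; ++i) parent[i] = i;
        for (int x = 0; x < n; ++x)
            for (int y = x + 1; y < n; ++y)
                if (tc[x][y] && tc[y][x])
                    parent[find(x)] = find(y);   // union(x, y)
        int[] comp = new int[n];
        for (int v = 0; v < n; ++v) comp[v] = find(v);
        return comp;
    }

    public static void main(String[] args) {
        // x=0, y=1, z=2, w=3 from the equations example:
        // 0 and 1 reach each other; 2 and 3 reach each other and reach 0, 1.
        boolean[][] tc = {
            {true, true, false, false},
            {true, true, false, false},
            {true, true, true, true},
            {true, true, true, true}};
        System.out.println(Arrays.toString(scc(tc)));  // e.g. [1, 1, 3, 3]
    }
}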

I However, there is a much more efficient algorithm due to Kosaraju, which you can find in CLR [1], which uses two passes of depth-first search. This is quite suitable for an explicit graph representation.


Spanning trees I

I A spanning tree of a connected, undirected graph G = (V, E) is a subset T ⊆ E such that for all x, y ∈ V, there is a unique path from x to y through T.

[Figure: a graph in which the highlighted (yellow) edges form a spanning tree.]

I Proposition: |T| = |V| − 1. (By induction on the number of vertices.)

I Proposition: If |T| = |V| − 1 and (V, T) is connected, it is a spanning tree.

I These two properties make finding a spanning tree dead easy: any set of |V| − 1 edges that does not contain a cycle forms a spanning tree.

I Generic spanning tree algorithm:

Spanning trees II

1. Set T = ∅.
2. Pick an edge e ∈ E.
3. If T ∪ {e} has no cycle, set T ← T ∪ {e}.
4. If |T| < |V| − 1, go to 2.


Minimum Spanning Tree (MST) I

I Problem: given a weighted graph (V, E, w), find a spanning tree T for G such that

    Σ_{(x,y)∈T} w(x, y)

is minimized.

I Kruskal's algorithm: always pick the lowest-cost edge that does not cause a cycle. (This is an example of a greedy algorithm.)

PriorityQueue Q of edges, ordered ascending by weight.
Put each edge in Q.
While |T| ≠ |V| − 1,
    Take edge e from Q.
    If T ∪ {e} has no cycle, set T ← T ∪ {e}.

Minimum Spanning Tree (MST) II

I To detect cycles, keep track of connectivity of T with Tarjan's disjoint set union:

PriorityQueue Q of edges, ordered ascending by weight.
Put each edge in Q.
While |T| ≠ |V| − 1,
    Take edge e = (x, y) from queue.
    If find(x) ≠ find(y),
        union(x, y)
        T ← T ∪ {e}

I The time required is O(|E| log |E|): dominated by the time to maintain the heap.
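I Here is a compact runnable Java sketch of Kruskal's algorithm with disjoint set union for cycle detection; the {x, y, weight} edge encoding is my own, and the sketch assumes the graph is connected:

import java.util.*;

class Kruskal {
    static int[] parent;
    static int find(int x) {
        return parent[x] == x ? x : (parent[x] = find(parent[x]));
    }

    // Kruskal's MST: greedily take edges in ascending weight order,
    // skipping any edge whose endpoints are already connected in T.
    static List<int[]> mst(int n, int[][] edges) {
        parent = new int[n];
        for (int i = 0; i < n; ++i) parent[i] = i;
        PriorityQueue<int[]> q =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[2]));
        q.addAll(Arrays.asList(edges));
        List<int[]> tree = new ArrayList<>();
        while (tree.size() != n - 1) {
            int[] e = q.remove();
            if (find(e[0]) != find(e[1])) {        // adding e makes no cycle
                parent[find(e[0])] = find(e[1]);   // union(x, y)
                tree.add(e);
            }
        }
        return tree;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1, 4}, {1, 2, 1}, {0, 2, 2}, {2, 3, 7}};
        for (int[] e : mst(4, edges))
            System.out.println(e[0] + "-" + e[1] + " (w=" + e[2] + ")");
        // picks 1-2 (w=1), 0-2 (w=2), 2-3 (w=7)
    }
}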


Bibliography I

[1] Thomas H. Cormen, Charles E. Leiserson, and Ronald R. Rivest. Introduction to Algorithms. McGraw-Hill, 1991.

[2] Lauri Hella, Phokion G. Kolaitis, and Kerkko Luosto. Almost everywhere equivalence of logics in finite model theory. The Bulletin of Symbolic Logic, 2(4):422–443, December 1996.


ECE750-TXB Lecture 19: Greedy Algorithms and Dynamic Programming

Todd L. [email protected]

Electrical & Computer EngineeringUniversity of Waterloo

Canada

March 15, 2007

Multi-stage Decision Problems I

I A multi-stage decision problem is one where we can think of the problem as having a tree structure: each tree node represents a decision point, and each decision leads us to another decision to be made.

I For example: choose as many elements of the set {1, 3, 7, 9, 13} as possible without exceeding 21. We can view this as a tree:

[Figure: the root branches on the first choice 1, 3, 7, 9, or 13; beneath the choice 1 are the further choices {1,3}, {1,7}, {1,9}, {1,13}.]

Here we have decided to include 1 in the set, which opens up a further set of choices.


Greedy algorithms I

I A greedy algorithm always chooses the most attractive option available to it: each decision is made based on its immediate reward.

I For example, we might pick elements of the set {1, 3, 7, 9, 13} according to their benefit-cost ratio: picking 3, for example, increases the size of the set by 1 (benefit) while increasing the sum by 3 (cost), so we would assign it a benefit-cost ratio of 1/3. Choosing elements according to this scheme gives the set {1, 3, 7, 9}.

I Kruskal's minimum spanning tree algorithm is another example of a greedy algorithm: at each step it chooses the edge with minimum cost, without thinking ahead as to how that might affect future decisions.

Greedy algorithms II

I Greedy algorithms only rarely give optimal answers. Surprisingly often they give "reasonably-good" answers; for example, greedy algorithms are one method of obtaining approximate solutions to NP-hard optimization problems.

I There is a greedy algorithm for constructing a prefix-free code that yields an optimal code, called a Huffman code.

I Recall that the entropy of a discrete distribution µ on a set of symbols S is given by

    H(µ) = Σ_{s∈S} −µ(s) log µ(s)

And, from Shannon's noiseless coding theorem, we know there exists a prefix code achieving an average code length of c ≤ H(µ) + 1.

I Huffman's algorithm uses a greedy method to construct an optimal code:


Greedy algorithms III

1. Start with a collection of singleton trees, one for each symbol. For each tree we keep track of its probability; initially each singleton tree has the probability of its symbol.
2. While there is more than one tree,
    2.1 Choose two trees with least probabilities;
    2.2 Combine the two trees into one by making a new root whose left child is one tree, and whose right child is the other. The probability of the new tree is the sum of the probabilities of the subtrees. We label the edge to the left subtree with "0", and the edge to the right subtree with "1".

The end result is a trie giving an optimal prefix code.
I Example: consider this set of symbols:

    Symbol   Probability
    a        0.3
    b        0.2
    c        0.2
    d        0.1
    e        0.1
    f        0.05
    g        0.05

Greedy algorithms IV

This distribution has H(µ) = 2.546. Huffman's algorithm proceeds this way: The two symbols with least probability are f and g, so these are combined into a little tree with combined probability 0.1. We then have several choices of what to do next; we choose to combine d and e. Our collection of trees then looks like this:

[Figure: the forest at this stage: singleton trees a (0.3), b (0.2), c (0.2), a two-leaf tree combining d and e (0.2), and a two-leaf tree combining f and g (0.1), with 0/1 edge labels.]

We then combine the subtrees d-e and f-g, and so forth. The final result is this tree:


Greedy algorithms V

[Figure: the final Huffman trie, with 0/1 edge labels: b and c at depth 2 under one subtree, a at depth 2 under the other, and d, e, f, g at depth 4.]

Which gives us the code table:

    Symbol   Code
    a        10
    b        00
    c        01
    d        1100
    e        1101
    f        1110
    g        1111

Greedy algorithms VI

This achieves an average code length of c = 2.6, only slightly more than the entropy of 2.546.
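I To make the construction concrete, here is a runnable Java sketch of Huffman's algorithm applied to this table (class and method names are my own; when probabilities tie, the codewords chosen may differ from the slides, but the average code length is still 2.6):

import java.util.*;

class Huffman {
    // One tree in the forest: a leaf (symbol) or an internal node.
    static class Node {
        double p; char sym; Node left, right;
        Node(double p, char sym) { this.p = p; this.sym = sym; }
        Node(Node l, Node r) { p = l.p + r.p; left = l; right = r; }
    }

    // Greedy construction: repeatedly combine the two trees of least
    // probability until a single tree remains.
    static void codes(char[] syms, double[] probs) {
        PriorityQueue<Node> q =
            new PriorityQueue<>(Comparator.comparingDouble((Node n) -> n.p));
        for (int i = 0; i < syms.length; ++i)
            q.add(new Node(probs[i], syms[i]));
        while (q.size() > 1)
            q.add(new Node(q.remove(), q.remove()));  // edges 0 and 1
        print(q.remove(), "");
    }

    static void print(Node t, String code) {
        if (t.left == null) System.out.println(t.sym + " : " + code);
        else { print(t.left, code + "0"); print(t.right, code + "1"); }
    }

    public static void main(String[] args) {
        codes(new char[]{'a', 'b', 'c', 'd', 'e', 'f', 'g'},
              new double[]{0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05});
    }
}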

I Huffman's algorithm is a rare exception: greedy algorithms rarely give optimal answers because they are myopic.

I For a greedy algorithm to give an optimal answer, a problem must have an underlying 'matroid' structure [4].

I Dynamic programming may succeed where greedy algorithms fail to yield an optimal answer; it can be more farsighted.


Variational Problems I

I Certain optimization problems can be formulated as routefinding problems, possibly in an abstract sense. For example, finding a shortest driving route between two cities; or finding a "least cost" path through a multistage decision problem.

I We consider first the continuous analogue of these problems, sometimes called trajectory optimization.

I Consider the problem of designing a tobogganing hill (or, if you like, a downhill ski slope).

Variational Problems II

I Making the usual unrealistic assumptions (no friction, no air resistance, no brakes), we want to design a hill that lets people travel from A to B in the shortest time possible:

[Figure: a hill profile descending from point A at the top to point B at the bottom.]

I This is a classical problem solved by Bernoulli, called the Brachistochrone problem. It launched the study of the calculus of variations, a method for solving continuous, infinite-dimensional optimization problems such as finding a curve of optimal shape.


Variational Problems III

I In a variational problem, one has a functional to minimize or maximize. A functional maps functions to real values, typically via an integral. For our tobogganing problem, we want to find a function (say) y(x), and the cost function would look something like:

    ∫_P C(x, y, ẏ) dx

where ẏ = dy/dx, and the functional assigns to each curve y(x) the time required to travel from A to B.

I Using the calculus of variations, one can obtain from the functional a differential equation that can be solved to find an optimal path; the process is analogous to the way one can find the extrema of a convex function F(z) by setting (d/dz) F(z) = 0. However, our interest is in discrete analogues of variational problems, which can be solved efficiently by dynamic programming.

Variational Problems IV

I Variational problems possess two important properties common to a wide class of problems that can be tackled with dynamic programming:

1. We can write the cost of a path as a sum of the cost of subpaths; for example, the time required to descend the hill is the sum of the time required to reach the midpoint, plus the time to travel from the midpoint to the end.
2. Any subpath of an optimal path is optimal. Otherwise, we could excise the suboptimal subpath and replace it with an optimal subpath and decrease the overall cost, contradicting the premiss that the path is optimal.

I Dynamic programming can be used to solve discretized versions of variational problems; this was one of the earliest applications. Discretizing such a problem results in a discrete routefinding problem.


Variational Problems V

I In a discrete routefinding problem, we have a state space S, and in each state there are moves we can make, each with an associated cost. Given a pair of states a, b ∈ S, we want to find a minimal cost path between them. Abstractly, we can think of S as a weighted graph, where states are vertices, edges are moves, and each edge has a cost.

I For example, this graph shows driving distances between some towns close to Waterloo:

Variational Problems VI

[Figure: a weighted graph on the towns Waterloo, Cambridge, Guelph, Milton, Brantford, and Hamilton, with driving distances (17, 27, 27, 30, 37, 50, 52, 53, 76) as edge weights.]

We might want to find the shortest path between, say, Waterloo and Hamilton.


Variational Problems VII

I For a more abstract example, suppose we have a sequence of operations we wish to perform on a data structure: inserting keys, finding keys, and iterating through the keys, for example. Some operations can be performed very efficiently on a linked list, for example, inserting a key. Others can be performed quickly on a binary search tree, such as finding a key. If we have a long sequence of inserts followed by a long sequence of find operations, it might make sense to start with a linked list, and then switch to a binary search tree. How can we determine the optimal times to switch configurations of the data structure? This is an example of a metrical task system [2], in which one has a sequence of tasks to perform, and configurations the system can switch between (e.g., linked list and BST). Each configuration has different costs for performing the tasks; and switching between configurations has a specified cost.

Variational Problems VIII

For the linked list/BST example, the possible strategies form a graph like this:

[Figure: two horizontal rows of vertices, one per task 1, 2, 3, 4, . . ., n — the top row for the linked list configuration, the bottom row for the BST — with horizontal edges for performing a task in a configuration and diagonal edges for switching configurations, plus a start vertex on the far left and an end vertex on the far right.]

Before performing each task, we have the option of switching between a linked list and a tree. Edges are weighted with the cost of performing a task in a given configuration, or with the cost of switching between configurations. Finding an optimal strategy for switching between representations consists of finding a minimal path from the start vertex (far left) to the end vertex (far right).


Dynamic Programming I

I Consider finding the shortest path from a to b through the following graph:

[Figure: a DAG with weighted edges a →2→ d, a →1→ c, d →3→ f, d →4→ e, c →3→ e, f →2→ b, e →2→ b.]

I There are only two edges incident on b. An optimal path will either:
    I go from a to f, then take the edge from f to b; or
    I go from a to e, then take the edge from e to b.

I Write d(x, y) for the shortest path between vertices. Using the above idea, we can write this equation for d(a, b):

    d(a, b) = min(2 + d(a, f), 2 + d(a, e))

Dynamic Programming II

I We could turn this idea into a recursive procedure. (Assume the graph is acyclic.)

d(r, s)
    if r = s then
        return 0                // base case
    otherwise
        dist ← ∞
        for each edge (x, s),
            dist ← min(dist, d(r, x) + weight(x, s))
        return dist


Dynamic Programming III

I However, this procedure would solve the same subproblems over and over again. Graphs of this form would require exponential time:

[Figure: a chain of diamonds from a through x, y, z to b: each consecutive pair of vertices is joined by two parallel two-edge paths.]

We would call d(a, z) twice, d(a, y) four times, d(a, x) eight times, etc.

I Instead of writing a recursive function, consider the following system of equations for the example graph shown earlier:

    d(a, b) = min(2 + d(a, f), 2 + d(a, e))
    d(a, f) = min(3 + d(a, d))
    d(a, e) = min(4 + d(a, d), 3 + d(a, c))
    d(a, c) = min(1 + d(a, a))
    d(a, d) = min(2 + d(a, a))

Dynamic Programming IV

I These are called the Bellman equations in honour of Richard Bellman, who invented dynamic programming [1]. The reason why it is called 'dynamic programming' is amusing; see [3].

I Incidentally: these equations use only the operations min and +, under which the integers form an algebraic structure sometimes called a tropical semiring [5], also called min-plus algebras (or max-plus algebras). Tropical, because a person strongly associated with them is Imre Simon, who did his PhD at Waterloo in the 1970's and became a professor at the University of Sao Paulo!

I Now draw a dependence graph for the values d(r, s), where edges go from terms on the left-hand side of an equation to things appearing on the right-hand side:

[Figure: the dependence graph: d(a, b) points to d(a, f) and d(a, e); d(a, f) points to d(a, d); d(a, e) points to d(a, d) and d(a, c); d(a, c) and d(a, d) point to d(a, a).]


Dynamic Programming V

I Using topological sort, we obtain an order in which we can solve the equations. (For example, perform a depth-first search starting from d(a, b).)

    d(a, a) = 0
    d(a, d) = 2
    d(a, c) = 1
    d(a, e) = min(4 + 2, 3 + 1) = 4
    d(a, f) = 5
    d(a, b) = min(2 + 5, 2 + 4) = 6

So, the shortest path from a to b is of length 6.

I A few things to note:
    I We only had to evaluate each equation once; the number of calculations was linear in the number of edges. This is a vast improvement over the potentially exponential recursive version!
    I The dependence graph looks just like the original graph, but with the edges reversed.

Dynamic Programming VI

I In solving the equations, we actually find a shortest path from a to every vertex. If we draw these paths all on the same graph, we get the directed-graph analogue of a spanning tree, called an arborescence:

[Figure: the arborescence of shortest paths: a →2→ d →3→ f, and a →1→ c →3→ e →2→ b.]

I The shortest path from a to b contains the shortest paths from c to b, a to e, etc.

I The recursive procedure described earlier would achieve the same efficiency as the equations approach if we maintained a cache of results of calling d(r, s), and consulted this cache for the answer each time the procedure was called. This is called memoization.
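I As a concrete sketch, here is the memoized recursive procedure in Java for the example graph, with my own encoding of vertices (a = 0, c = 1, d = 2, e = 3, f = 4, b = 5) and in-edge lists:

import java.util.*;

class DagShortestPath {
    // In-edges of each vertex: each entry is {source vertex, edge weight}.
    static List<List<int[]>> inEdges;
    // Memoization cache for d(a, s); the start vertex r is fixed here,
    // so the cache is keyed on s alone.
    static Map<Integer, Integer> cache = new HashMap<>();

    static int d(int r, int s) {
        if (r == s) return 0;                  // base case
        if (cache.containsKey(s)) return cache.get(s);
        int dist = Integer.MAX_VALUE;
        for (int[] e : inEdges.get(s))         // edges (x, s)
            dist = Math.min(dist, d(r, e[0]) + e[1]);
        cache.put(s, dist);                    // each subproblem solved once
        return dist;
    }

    public static void main(String[] args) {
        inEdges = new ArrayList<>();
        for (int i = 0; i < 6; ++i) inEdges.add(new ArrayList<>());
        inEdges.get(1).add(new int[]{0, 1});   // a -1-> c
        inEdges.get(2).add(new int[]{0, 2});   // a -2-> d
        inEdges.get(3).add(new int[]{2, 4});   // d -4-> e
        inEdges.get(3).add(new int[]{1, 3});   // c -3-> e
        inEdges.get(4).add(new int[]{2, 3});   // d -3-> f
        inEdges.get(5).add(new int[]{4, 2});   // f -2-> b
        inEdges.get(5).add(new int[]{3, 2});   // e -2-> b
        System.out.println(d(0, 5));           // prints 6, as above
    }
}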


Dynamic Programming VII

I Finding shortest paths in a graph is a canonical example of dynamic programming. Dynamic programming can be applied when problems have the following properties:

    1. Optimal substructure: an optimal solution is composed of optimal solutions to subproblems.
    2. Overlapping subproblems: a recursive solution would solve the same subproblems repeatedly.

I A dynamic programming solution usually starts from some principle describing how solutions to subproblems can be combined into solutions to the entire problem. The Bellman equations shown earlier are an example of this.

I One exploits overlapping subproblems by remembering answers to subproblems. This can be done by memoization (caching function return values), or by maintaining a table or other data structure of answers to subproblems.

I Some applications of dynamic programming:
    I route finding

Dynamic Programming VIII

    I trajectory optimization
    I solving discretized versions of variational problems
    I finding optimal query strategies for relational databases
    I Viterbi algorithm (Hidden Markov Models)
    I metrical task systems (offline form)
    I parsing
    I approximation algorithms for NP-hard problems, for example, the knapsack problem
    I calculating edit distances (e.g., the minimal number of local changes required to transform one string or tree into another)


Bibliography I

[1] Richard Bellman. On the theory of dynamic programming. Proc. Nat. Acad. Sci. U.S.A., 38:716–719, 1952.

[2] Allan Borodin, Nathan Linial, and Michael E. Saks. An optimal on-line algorithm for metrical task system. J. ACM, 39(4):745–763, 1992.

[3] Stuart Dreyfus. Richard Bellman on the birth of dynamic programming. Oper. Res., 50(1):48–51, 2002. 50th anniversary issue of Operations Research.

[4] Jack Edmonds. Matroids and the greedy algorithm. Math. Programming, 1:127–136, 1971.

Bibliography II

[5] Jean-Eric Pin. Tropical semirings. In Idempotency (Bristol, 1994), volume 11 of Publ. Newton Inst., pages 50–69. Cambridge Univ. Press, Cambridge, 1998.



ECE750-TXB Lecture 20: Amortization, Online Algorithms

Todd L. [email protected]

Electrical & Computer EngineeringUniversity of Waterloo

Canada

March 20, 2007


Part I

Amortized Analysis


Amortized Analysis I

I In accounting, amortization is a method for spreading a large lump-sum amount over a period of time by breaking it into smaller payments; for example, when one purchases a house the mortgage payments are arranged according to an amortization schedule.

I In algorithm analysis, amortization looks at the total cost of a sequence of operations, averaged over the number of operations. If a single operation is very expensive, that expense can be "amortized" over a long sequence of operations.

I If a sequence of m operations takes O(f(m)) time, we say the amortized cost per operation is O(m⁻¹ f(m)).


Amortized Analysis II

I If the worst-case time per operation is O(g(m)), this implies the amortized time is O(g(m)) also, but the converse is not true: in amortized analysis we can allow a small number of very expensive operations, so long as that expense is "averaged out" (amortized) over a sufficiently long sequence of operations.

I Example: recall the binary tree iterator:

class BSTIterator implements Iterator {
    Stack stack;

    public BSTIterator(BSTNode t) {
        stack = new Stack();
        fathom(t);
    }

    public boolean hasNext() {
        return !stack.empty();
    }

    public Object next() {
        BSTNode t = (BSTNode) stack.pop();
        if (t.right_child != null)
            fathom(t.right_child);
        return t;
    }

    void fathom(BSTNode t) {
        do {
            stack.push(t);
            t = t.left_child;
        } while (t != null);
    }
}


Amortized Analysis IV

I If the binary tree contains n elements and is balanced, then the fathom() operation takes no more than O(log n) time; and there must be at least some nodes of depth ≥ c log n. Therefore each invocation of next() requires Θ(log n) time in the worst case.

I However, iterating through the entire tree by a sequence of next() operations requires O(1) amortized time per operation:

    1. The iterator visits each node at most twice: once when it is pushed onto the stack by fathom(), and once when it is popped from the stack by next().
    2. The total time spent in the next() and fathom() methods is linear in the number of pushes and pops onto the stack.
    3. Assuming push and pop operations take O(1) time (e.g., linked list implementation of stack), the total cost of iterating through the tree is O(n).
    4. Therefore the amortized cost of the iteration is O(n⁻¹ · n) = O(1).


Dynamic Arrays I

I Arrays have two advantages over more complex data structures: (1) they are very fast to access, both for random access and for iterating through the contents of the array; (2) they are very efficient in memory use.

I For example, to store a set of n single-precision floating-point values (4 bytes apiece),

    1. an array requires 4n + O(1) bytes;
    2. a binary search tree requires ≈ 16n + O(1) bytes: for each tree node, we need 4 bytes for the float and 2 × 4 bytes for the left/right child pointers. And typically small objects such as this are padded up to an alignment boundary, e.g., 16 bytes. So, a tree can take 3-4 times as much memory as an array, for storing small objects.

I However, appending items to an array can be very inefficient: if the array is full, one must usually allocate a new, larger array and copy the elements over, for a cost of O(n).


Dynamic Arrays II

I A dynamic array is one that keeps room for extra elements, resizing itself according to a schedule that yields an O(1) amortized time for append operations, despite the occasional operation taking O(n) time.

I The array maintains:

    1. a size (number of elements in the array);
    2. a capacity (allocated size of the array);
    3. the array itself.

I When an append operation is performed, size is incremented; if size exceeds capacity then:

    1. Allocate a new array of size f(capacity), where f is to be determined;
    2. Copy the elements 1..size to the new array;
    3. Set capacity = the new capacity.


Dynamic Arrays III

I Now analyze the amortized time complexity. Consider a sequence of n insert operations, starting from an empty array. Each time we hit the array capacity, we incur a cost of O(n); other append operations incur only an O(1) cost.

[Figure: time required for each append operation, plotted against operation number 1, 2, 3, . . .: mostly constant, with spikes at the operations that trigger a resize.]


Dynamic Arrays IV

I Suppose the array has an initial capacity of 1. The cost of the resizings will be

    Σ_{k : f^(k)(1) ≤ n} f^(k)(1)

where f^(0)(x) = x, and f^(i+1)(x) = f(f^(i)(x)).

I If we take f(k) = k + 16, i.e., we increase the capacity of the array by 16 elements each time we run out of room, then the total cost is O(n²).

I If however we choose f(k) = βk, with β > 1, then using the geometric series formula, we have a total cost of

    Σ_{k : β^k ≤ n} β^k = (β^(m+1) − β)/(β − 1) evaluated at m = log_β n
                        = β(n − 1)/(β − 1)


Dynamic Arrays V

                        = O(n)

So, the amortized cost of appends into the dynamic array is O(n⁻¹ · n) = O(1).

I Increasing the capacity of the array by, say, 5% each time we run out of space leads to an O(1) amortized time for appends.
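I A minimal Java sketch of such a dynamic array, using growth factor β = 2 (doubling); the float payload echoes the array-of-floats example above:

class DynamicArray {
    // Growth-factor dynamic array: doubling the capacity on overflow
    // gives O(1) amortized appends; any fixed β > 1 works.
    private float[] data = new float[1];   // initial capacity 1
    private int size = 0;

    void append(float x) {
        if (size == data.length) {         // full: resize to β * capacity
            float[] bigger = new float[2 * data.length];
            System.arraycopy(data, 0, bigger, 0, size);  // O(n) copy
            data = bigger;
        }
        data[size++] = x;                  // O(1) in the common case
    }

    public static void main(String[] args) {
        DynamicArray a = new DynamicArray();
        for (int i = 0; i < 1000; ++i)
            a.append(i);                   // ~10 resizings, O(n) total work
    }
}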


Part II

Online Algorithms


Online Algorithms I

I Consider the problem of assigning restaurant patrons to tables. If you know in advance who wants to eat dinner at your restaurant, when they want to arrive, how long they will stay, and how much they will spend, you can figure out in advance what subset of patrons to accept to maximize your revenue.

I In the real world, people just show up at restaurants expecting to be fed: when each party arrives, you must decide whether or not you can seat them; and once given a table, you can't evict them before they are finished eating to make room for someone else.

I The difference is that between an offline and an online problem.

I In an offline scenario, we know the entire sequence of requests that will be made in advance; in principle we can use this knowledge to find an optimal solution.


Online Algorithms II

I In an online scenario, we are presented with requests one at a time, and we must commit to a decision without knowing what the subsequent requests might be.

I Many realistic problems are online, for example:
    I page replacement policies in operating systems and caches;
    I call routing in networks;
    I memory allocation;
    I data structure operations.

I The field of online algorithms studies such problems, in particular how online solutions compare to their optimal offline versions [1].

I In an online problem, one is presented with a request sequence I = (σ₁, σ₂, . . . , σₙ), and each σᵢ must be handled with no knowledge of σᵢ₊₁, σᵢ₊₂, . . . , σₙ. Once a decision of how to handle request σᵢ is made, it cannot be altered.


Online Algorithms III

I We characterize performance by assigning a cost to a sequence of decisions.
I Write OPT(I) for the cost of the optimal offline solution.
I If ALG(I) is an online algorithm, we say ALG is an (asymptotic) c-approximation algorithm if for all legal request sequences I,

    ALG(I) − c · OPT(I) ≤ α

for some constant α not depending on I. If α = 0, ALG is a c-approximation algorithm, and we have

    ALG(I) / OPT(I) ≤ c

The online algorithm yields a cost that is at most c times the optimal cost.
I c is called the competitive ratio.


Example: Load Balancing I

I Consider the following load balancing problem: we have n jobs to complete. Each job j ∈ {1, . . . , n} requires time T(j). We have m machines, each equally capable.

I We want to assign the jobs to machines so that we finish all jobs as quickly as possible.

I In an offline version, we would know T(j) in advance. In an online version, we must assign job j to a machine knowing only the jobs 1..j − 1 and what machines they were assigned to.

I An assignment is a map A : {1, . . . , n} → {1, . . . , m}, where A(j) = i means job j is assigned to machine i.


Example: Load Balancing II

I The makespan is how long we must wait until all jobs are finished:

    Makespan(A) = max_i Σ_{j : A(j)=i} T(j)

I There is an online greedy algorithm that provides a competitive ratio of 2 (i.e., the schedule chosen takes at most twice as long as the optimal offline version). The algorithm is simple:
    I Always assign job j to a machine with the earliest finishing time.
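I A small runnable Java sketch of this greedy rule, keeping machine loads in a min-heap so that the machine with the earliest finishing time is always at the front (the encoding of jobs as an int array is my own assumption):

import java.util.*;

class GreedyLoadBalance {
    // Greedy online scheduler: each arriving job goes to the machine
    // with the least current load (earliest finishing time).
    static long makespan(int m, int[] jobTimes) {
        // Heap of {load, machine id}, least-loaded machine first.
        PriorityQueue<long[]> machines =
            new PriorityQueue<>(Comparator.comparingLong((long[] p) -> p[0]));
        for (int i = 0; i < m; ++i) machines.add(new long[]{0, i});
        long makespan = 0;
        for (int t : jobTimes) {           // jobs arrive online, one at a time
            long[] least = machines.remove();
            least[0] += t;                 // assign job to this machine
            makespan = Math.max(makespan, least[0]);
            machines.add(least);
        }
        return makespan;
    }

    public static void main(String[] args) {
        System.out.println(makespan(3, new int[]{2, 3, 4, 6, 2, 2}));
        // prints 8; the optimal offline makespan here is 7, within the
        // factor-2 guarantee proved below
    }
}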

I The proof of the competitive ratio stems from two observations:


Example: Load Balancing III

1. If the sum of the times of all jobs is, say, 60 minutes, and we have 3 machines, an optimal schedule can't possibly require less than 60/3 = 20 minutes. In general:

    OPT ≥ (1/m) Σ_j T(j)

2. The optimal schedule has to be at least as long as the longest job:

    OPT ≥ max_j T(j)


Example: Load Balancing IV

I Suppose machine i has the longest running time, and j is the last job assigned to machine i.

[Figure: a Gantt-style diagram of a schedule; region (A) spans all machines up to the start of job j, and region (B) is job j itself at the end of machine i's row.]

The above diagram shows a schedule with 3 machines.

I During the time up to the beginning of job j, all the machines are in full use (region (A) in the figure above). Otherwise, there would be a machine finishing earlier than the one to which j has been assigned. The sum of all the times in region (A) is ≤ Σ_j T(j). Since OPT ≥ (1/m) Σ_j T(j), the length of time up until job j starts is ≤ OPT.


Example: Load Balancing V

I The length of region (B) in the above figure is the duration of the job j. Since OPT ≥ max_j T(j), region (B) is also ≤ OPT in duration.

I Therefore the time until all the jobs are finished is ≤ 2 · OPT.

I The greedy algorithm yields a schedule at most twice as long as the optimal offline solution; we say it is "2-competitive."

I The current best known online algorithm is 1.9201-competitive; it is known that no deterministic algorithm can have a competitive ratio better than 1.88. There is a randomized algorithm achieving a competitive ratio of 1.916.


Bibliography I

[1] Allan Borodin and Ran El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, Cambridge, 1998.



ECE750-TXB Lecture 21: Memory hierarchy and locality

Todd L. [email protected]

Electrical & Computer EngineeringUniversity of Waterloo

Canada

March 22, 2007


I So far we have primarily been concerned with asymptotic efficiency of algorithms. Having chosen an algorithm that is efficient in theory, there are still serious engineering challenges to getting high performance in practice.

I To obtain decent performance on problems involving nontrivial amounts of data, you need to understand:

    1. how the memory hierarchy works;
    2. how to take advantage of this.

I Historical performance of a typical desktop machine:

    Year   Clock cycle (ns)   DRAM latency (ns)   Disk latency (ms)
    1980   500                375                 87
    1990   50                 100                 28
    2000   1                  60                  8


In two decades:
    I CPU: 500x faster clock cycle.
    I Main memory: 6x faster.
    I Disk: 10x faster.

I It can now take hundreds of clock cycles to access data in main memory: "Memory is the new disk."

I To mitigate the widening gap between CPU speed and memory access time, an elaborate memory hierarchy has evolved.


I Example memory hierarchy of a desktop machine:

                   Bandwidth    Latency   Size     Block size
    L1 Cache       20 Gb/s      1 ns      16kb     -
    L2 Cache       10 Gb/s      8 ns      1Mb      -
    Main memory    2 Gb/s       200 ns    2Gb      64 bytes
    Disk           0.08 Gb/s    10⁶ ns    400Gb    1024 bytes

I Bandwidth is the rate at which data can be transferred in a sustained, bulk manner, scanning through contiguous memory locations. Main memory can supply data at 1/10th the rate of L1 cache.

I Latency is the amount of time that elapses between a request for data and the start of its arrival. Disk is a million times slower than L1 cache.

I Block size is the "chunk size" in which data is transferred up to the next-fastest level of the hierarchy.


I The speed at which a program can run is always determined by some bottleneck: the CPU, main memory, the disk. Programs where the bottleneck is main memory are called "memory bound" — most of the execution time is spent waiting for data to arrive from main memory or disk.

I A desktop machine with, say, a 2 GHz CPU effectively runs much slower if it is memory bound, e.g.:
    I If L2 cache is the bottleneck: effective speed 1 GHz (about a Cray Y-MP supercomputer, circa 1988).
    I If main memory is the bottleneck: effective speed 200 MHz (about a Cray 1 supercomputer, circa 1976).
    I If disk is the bottleneck: effective speed between < 1 MHz (about a 1981 IBM PC) and ≈ 80 MHz, depending on access patterns.


Latency and throughput I

I Many operations (transfers of data from disk to memory, floating-point vector operations, network communication, etc.) follow a characteristic performance curve: slow for a small number of items, faster for a large number of items.

I Let R∞ be the asymptotic rate achievable (e.g., bandwidth).
I Let t_0 be the latency.
I Then, an operation on n items takes time ≈ t_0 + n/R∞.

I The effective rate is

    R(n) = n / (t_0 + n/R∞)
         = R∞ / (1 + R∞ t_0 / n)
         ∼ R∞ − R∞² t_0 / n + O(1/n²)


Latency and throughput II

I E.g. with R∞ = 1 and t_0 = 10:

[Figure: the curve R(n) for R∞ = 1, t_0 = 10, rising from 0 toward the asymptote R∞ = 1 as n grows from 0 to 200.]

I A useful parameter: n_1/2 is the value at which half the asymptotic rate is attained: R(n_1/2) = (1/2) R∞.

    n_1/2 = t_0 R∞


Latency and throughput III

I n1/2 gives an approximate "chunk size" required to achieve close to the asymptotic performance. For example, a current typical disk has R∞ = 0.08 Gb/s and t0 = 4 ms. For these parameters, n1/2 = t0 · R∞ = 320000 bytes — about 320 kb.

I If you are dealing in chunks substantially smaller than n1/2, actual performance may be a tiny fraction of R∞.

I Example: performance of a FAXPY on a desktop machine. A FAXPY operation looks like this:

      float X[N], Y[N];
      for (int i=0; i < N; ++i)
          Y[i] = Y[i] + a*X[i];


Latency and throughput IV

(The term FAXPY comes from the BLAS library, a living relic from Fortran 77.) The following graph shows the millions of floating point operations per second (Mflops/s) versus the length of the vectors, N. Each plateau corresponds to a level of the memory hierarchy.


Latency and throughput V

[Figure: DAXPY benchmark for Vector<T>. Mflops/s (100 to 900) versus vector length (10^0 to 10^8, log scale).]

The first part of the curve follows a typical latency-throughput curve (i.e., R∞, n1/2). It reaches a plateau corresponding to the L1 cache. Performance then drops to a second plateau, when the vectors fit in L2 cache. The third plateau is for main memory.


Latency and throughput VI

The abrupt drop at the far right of the graph indicates that the vectors no longer fit in memory, and the operating system is paging data to disk.


Units of memory transfer I

I Memory is transferred between levels of the hierarchy in minimal block sizes:

I Between CPU and cache: a word, typically 64 or 32 bits.

I Between cache and main memory: a cache line, typically 32, 64 or 128 bytes. (Pentium 4 has 64-byte cache lines.)

I Between main memory and disk: a disk block or page, often 512 bytes to 4096 bytes for desktop machines, but sometimes 32 kb-256 kb for serving large media files.


Units of memory transfer II

[Figure: units of memory transfer, to scale: a byte, a 64-bit word, a 64-byte cache line, a 1024-byte disk block.]

I The ideal memory access pattern for good performance:

I Appropriate-sized chunks (cache lines, blocks);

I in contiguous memory locations, e.g., reading a long sequence of disk pages that are stored consecutively;

I doing as much work as possible on a given piece of data before moving on to other memory locations.


Units of memory transfer III

I Memory layout of arrays and data structures is a crucial performance issue.

I Memory layout is largely beyond your control in languages like Java that provide a programming model far removed from the actual machine.

I We'll look at some examples in C/C++, where memory layout can be controlled at quite a fine level of detail.


Matrix Multiplication I

I Example: how to write a matrix-multiplication algorithm that will perform well?

I NB: In general, you would be well advised to use a matrix multiplication routine from a high-performance library, and not waste time tuning your own. (But the principles are worth knowing.)

I Here is a naive matrix multiplication:

      for i=1 to n
        for j=1 to n
          for k=1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)

I If the matrices are small enough to fit in cache, this may perform acceptably.

I For large matrices, this is disastrously slow.


Matrix Multiplication II

I Assuming C-style (row major) arrays, the layout of the B matrix in memory looks like this:

[Figure: row-major layout of B. Consecutive memory locations hold B(1,1), B(1,2), B(1,3), ..., B(2,8), etc.; the innermost-loop index k walks down a column, while a light-blue 64-byte cache line lies along a row.]

I Each element of the array is a double that takes 8 bytes.

I The innermost loop (k) is iterating down a column of the B matrix.

I Light-blue shows a cache line of 64 bytes. It lies along a row of the matrix.


Matrix Multiplication III

I If 64·n is greater than the cache size, each element of B accessed in the innermost loop will bring in an entire cache line from memory (64 bytes), of which only 8 bytes will be used. The remaining 56 bytes will be discarded before they are used.

I The innermost loop travels over a row of A, so it will travel along (rather than across) cache lines.

I Each step of the innermost loop will bring in ≈ 64 + 8 bytes from memory on average, but only use 16 bytes of it = an efficiency of only 22%. (By transposing the B matrix, we could make it run 4.5 times faster.)


Matrix Multiplication IV

I A generally useful strategy is blocking (also called tiling): perform the matrix multiplication over submatrices of size m × m, where m is chosen so that 8m² ≤ cache size.

      A = [ A11 A12 ... ]      B = [ B11 B12 ... ]
          [ A21 A22     ]          [ B21 B22     ]
          [  ...        ]          [  ...        ]

  If each submatrix is of size m × m, then a multiplication such as A11·B11 requires 16m² bytes from memory, but performs 2m³ floating point operations. The number of floating point operations per memory access is 2m³ / 16m² = m/8. As m is made bigger, the overhead of accessing main memory can be made very small: the expensive memory access is amortized over a large amount of computation.
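
I A minimal C++ sketch of this blocked loop structure (ours, not from a library; it assumes row-major n × n matrices of double with n a multiple of the block size m):

      // C += A*B over m x m blocks. The three blocks being worked on
      // fit in cache when roughly 3*8*m*m bytes <= cache size.
      void matmul_blocked(int n, int m, const double* A, const double* B,
                          double* C) {
          for (int ii = 0; ii < n; ii += m)
            for (int jj = 0; jj < n; jj += m)
              for (int kk = 0; kk < n; kk += m)
                // multiply block A[ii..ii+m)[kk..kk+m) into C[ii..][jj..)
                for (int i = ii; i < ii + m; ++i)
                  for (int k = kk; k < kk + m; ++k) {
                      double a = A[i*n + k];
                      for (int j = jj; j < jj + m; ++j)
                          C[i*n + j] += a * B[k*n + j];  // stride-1 in B and C
                  }
      }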


Matrix Multiplication V

I Linear algebra libraries (e.g. ATLAS) can automatically tune themselves to a machine's memory hierarchy by choosing appropriate block sizes, unrollings of inner loops, etc. [2, 9].


Example: Linked lists I

I Designing data structures that will perform well on memory hierarchies is nontrivial:

I Data structures are often built from small pieces (e.g., nodes, vertices) that may be smaller than a cache line.

I Manipulation of data structures often entails following pointers unpredictably through memory, so that prefetching hardware is unable to anticipate memory locations that will be needed.

I Many techniques developed in the past for data structures on disk (called external memory data structures [1, 8]) are increasingly relevant as techniques for managing data structures in main memory!

I Example: Linked lists are useful for maintaining lists of unpredictable size, e.g., queues, hash table chaining, etc.

I Recall that a simple linked list has a structure such as


Example: Linked lists II

      struct ListNode {
          float data;
          ListNode* next;
      };

(Here we are storing a list of floating-point values.)

I The memory layout of this structure in a 64-byte cache line:

[Figure: the 8-byte node (data, next) occupies only the first 8 bytes of a 64-byte cache line.]


Example: Linked lists III

I Each ListNode contains 8 bytes of data. But, if it is brought in from main memory we will read an entire cache line— say, 64 bytes of data. Unless we're lucky, those extra 56 bytes contain nothing of value; we only get 4 bytes of actual useful data (the data field) for 64 bytes read from memory, an efficiency of only 4/64 = 6.25%.

I This style of data structure is among the worst possible for reading from main memory:

I Iterating through a list means "pointer-chasing": following pointers to unpredictable memory locations. Some processors can exploit constant-stride memory access patterns and prefetch data, but are powerless against irregular access patterns.

I Each list node accessed brings in 64 bytes of memory, only 4 bytes of which may be useful!


Example: Linked lists IV

I A better layout is to compromise between a linked list and an array: have linked list nodes that contain little arrays, sized so that each list node fills a cache line:

      struct ListNode2 {
          float data[14];
          int count;
          ListNode2* next;
      };

  This version stores up to 14 pieces of data per node, and uses the count field to keep track of how full it is. The ListNode2 structure is exactly 64 bytes long (assuming 4-byte int, 4-byte pointers). We get 56 bytes of useful data for every 64-byte cache line: about 87.5% efficiency if most of the nodes are full.
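
I Scanning such a list then looks like this (a sketch, assuming the ListNode2 above): the inner loop runs over a contiguous little array, so most reads hit the cache line the node brought in:

      float sum(const ListNode2* p) {
          float s = 0;
          for (; p != 0; p = p->next)        // one pointer chase per node...
              for (int i = 0; i < p->count; ++i)
                  s += p->data[i];           // ...amortized over up to 14 items
          return s;
      }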


Example: Linked lists V

I The following graph shows benchmark results comparing the performance of an array, a linked list, and linked lists of arrays (sized so that each element of the list fills 1, 4, or 16 cache lines). The operation being timed is to scan through the list and add all the data elements:

      ListNode* p = first;
      do {
          s += p->data;
          p = p->next;
      } while (p != 0);


Example: Linked lists VI

[Figure: access rate (millions of items per second, 0 to 450) versus number of items in list (10^1 to 10^8, log scale), for an array, a linked list, and lists of arrays sized to 1, 4, and 16 cache lines.]


Example: Linked lists VII

I In the L1 cache, the linked list is very fast. But compare performance to other regions of the memory hierarchy (numbers are millions of items per second):

                     Array   Linked List   Linked list of arrays (1 cache line)
      L1 cache        411        411              317
      L2 cache        406         90              285
      Main memory     377          7               84

  In L2 cache, the list-of-arrays is 3 times faster; in main memory it is 12 times faster.

I In main memory, the linked list is 50 times slower than the array.

I The regular memory access pattern for the array allows the memory prefetching hardware to predict what memory will be needed in advance, resulting in the high performance for the array.

I Lessons:


Example: Linked lists VIII

I Data structure nodes of size less than one cache line can cause serious performance problems when working out-of-cache.

I Performance can sometimes be improved by making "fatter" nodes that are one or more cache lines long.

I Nothing beats accessing a contiguous sequence of memory locations (an array). For this reason, many cache-efficient data structures "bottom out" to an array of some size at the last level.


Locality I

I Efficient operation of the memory hierarchy rests on two basic strategies:

I Caching memory contents in the hope that memory, once used, is likely to be soon revisited.

I Anticipating where the attention of the program is likely to wander, and prefetching those memory contents into cache.

I The caches and prefetching strategies of the memory hierarchy are only effective if memory access patterns exhibit some degree of locality of reference:

I Temporal locality: if a memory location is accessed at time t, it is more likely to be referenced at times close to t.

I Spatial locality: if memory location x is referenced, one is more likely to access memory locations 'close' to x.

I Denning's notion of a working set [3, 4]:


Locality II

I Let W(t, τ) be the items accessed during the time interval (t − τ, t). This is called the working set (or locality set).

I Denning's thesis was that programs progress through a series of working sets or locales, and while in that locale do not make (many) references outside of it. Optimal cache management then consists of guaranteeing that the locale is present in high-speed memory when a program needs it.

I Working sets are a model that programs often adhere to. One of the early triumphs of the concept was explaining why multiprocess systems, when overloaded, ground to a halt (thrashing) instead of degrading gracefully: thrashing occurred when there was not room in memory for the working sets of all the processes at the same time.

I The memory hierarchy has evolved around the concept of a working set, with the result that "programs have small working sets" has gone from being a descriptive statement to a prescription for good performance.


Locality III

I Let ω(t, τ) = |W(t, τ)| be the number of distinct items accessed in the interval (t − τ, t) — the size of the working set. If ω(t, τ) grows rapidly as a function of τ, then caching of recently used data will have little effect. If however it levels off, then a high "hit rate" in the cache can be achieved.

[Figure: ω(t, τ) plotted against τ. With good locality of reference the curve levels off below the cache size and caching works well; a curve that keeps growing past the cache size makes caching ineffective.]
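
I The working-set size is easy to estimate from a recorded address trace. A C++ sketch (the trace representation and names are ours):

      #include <cstddef>
      #include <unordered_set>
      #include <vector>

      // omega(t, tau): number of distinct addresses referenced in the
      // window of tau references ending at time t.
      std::size_t working_set_size(const std::vector<std::size_t>& trace,
                                   std::size_t t, std::size_t tau) {
          std::unordered_set<std::size_t> distinct;
          std::size_t start = (tau < t) ? t - tau : 0;
          for (std::size_t i = start; i < t && i < trace.size(); ++i)
              distinct.insert(trace[i]);
          return distinct.size();
      }

  Plotting this against τ for a fixed t gives curves like the ones sketched above.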


Locality IV

I Algorithms can sometimes be manipulated to improve their locality of reference. The matrix multiplication example seen earlier was an example: by multiplying m × m blocks instead of columns, one can do O(m³) work while reading only O(m²) memory locations.

I Good compilers will try to do this automatically for certain simple arrangements of loops, but the general problem is too difficult for automated solutions.

I Some standard strategies for increasing locality:

I Blocking and tiling: decompose multidimensional data sets into a collection of "blocks" or "tiles". Example:


Locality V

If a large image is stored on disk column-by-column, then rendering a small region (dotted line) requires retrieving many disk pages (left). If instead the image is broken into tiles, many fewer pages are needed (right).

I Iteration-space tiling or pipelining: in certain multistep algorithms, instead of performing one step over the entire structure, one can perform several steps over each local region.

I Space-filling curves: if an algorithm requires traversing some multidimensional space, one can sometimes improve locality by following a space-filling curve, for example:


Locality VI

I Graph partitioning: if a dataset lacks a clear geometric structure, one can sometimes still achieve results similar to tiling by partitioning the dataset into regions that have small "perimeters" relative to the amount of data they contain. There are numerous good algorithms for this, in particular from the VLSI and parallel computing communities [7, 5, 6].


Locality VII

Indeed, the idea of "tiling" can be regarded as a special case of graph partitioning for very regular graphs.

I Vertical partitioning: the term comes from databases, but the concept applies to the design of memory layouts also. A simple example is the layout of complex-valued arrays, where each element has a real and an imaginary component. One could store pairs (a, b) for each array element to represent a + bi, or one could have two separate arrays, one for the real and one for the imaginary components. These two layouts have very different performance characteristics!

  Another common situation where vertical partitioning can apply: suppose a computation proceeds in phases, and different information is needed during each phase. E.g., in a graph algorithm one might have:


Locality VIII

      struct Edge {
          Node* from;
          Node* to;
          float weight;   /* Needed in phase one */
          Edge* parent;   /* Phase two: union-find */
      };

I Instead of having all the data needed for each phase in one structure, it might be better to break up the Edge object into several structures, one for each phase of the algorithm.
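
I For instance, a vertically partitioned layout might look like this (a sketch; the array names are ours, and we index nodes by number rather than by pointer):

      #include <vector>

      struct EdgeTopology { int from, to; };    // needed in every phase

      // Parallel arrays indexed by edge number: each phase scans only
      // the fields it actually uses, so its working set is smaller.
      std::vector<EdgeTopology> edges;
      std::vector<float>        weight;   // phase one only
      std::vector<int>          parent;   // phase two only (union-find)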


External Memory Data Structures I

I External memory data structures (also known as out-of-core data structures) are used when the data set is too large to fit in memory, and must be stored on disk. Databases are the primary application.

I Some good surveys: [1, 8].

I Basic concerns:

I Disk has very high latency: often ≈ 10^6 clock cycles.

I Bandwidth is lower than main memory, perhaps 1/10th or 1/20th the rate.

I Block sizes are very large (multiples of 1 kbyte) compared to main memory.

I Only a small fraction of the data can be in memory at once; only a minute fraction can be in cache.

I Disk space is cheap (in dollar cost).

I Basic coping strategies:


External Memory Data Structures II

I High latency = very large n1/2 ⇒ "fat" data structure nodes that contain a lot of data, stored contiguously in arrays.

I The high latency (≈ 10^6 clock cycles) means that a lot of computational resources can be expended to decide on the best strategy for storing and retrieving data:

I External memory data structures often manage their own I/O, rather than relying on the operating system's page management, since they can better predict what pages to prefetch and evict.

I Databases expend a lot of effort to find a good strategy for answering a query, since a bad strategy can be disastrously costly.

I To improve the rate of data transfer, data compression and parallel I/O (parallel disk model) are sometimes used.

I Since disk space is cheap, it can pay off to duplicate data in several different forms, each appropriate for a certain class of queries.


External Memory Data Structures III

I Classic example: the B-tree, a balanced multiway search tree suitable for external memory.

I Each node has Θ(B) children, and the tree is kept balanced, so that the height is roughly log_B N.

I Supported operations: find, insert, delete, and range search (retrieve all records in an interval [k1, k2]).

I When insertions cause a node to overflow, it is split into two half-full nodes, which can cause the parent node to overflow and split, etc.

I Typically the branching factor is very large, so that a massive data set can be stored in a B-tree of height 3 or 4, for example.
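
I A sketch of what a B-tree node might look like, sized to fill one 4096-byte disk page (the field names and sizing are ours, assuming 8-byte keys and page numbers on an LP64 machine):

      const int B = 255;   // branching factor chosen so the node fills a page

      struct BTreeNode {
          int  nkeys;          // number of keys currently in use
          bool leaf;           // (padding brings the header to 8 bytes)
          long key[B];         // keys in sorted order:     8*255 = 2040 bytes
          long child[B + 1];   // page numbers of children: 8*256 = 2048 bytes
      };                       // total: 8 + 2040 + 2048 = 4096 bytes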


Summary: Principles for effective use of the memory hierarchy I

I Algorithms should exhibit locality of reference; when possible, rearrange the order in which computations are performed so as to reuse data that will be in cache.

I The ability of the memory hierarchy to cache and prefetch data depends on predictable access patterns. When working out-of-cache:

I Performance is best when a long, contiguous sequence of memory locations is accessed (e.g., scanning an array).

I Performance is worst when memory accesses are unpredictable, accessing many small pieces of data scattered through memory in an unpredictable pattern (e.g., pointer-chasing through linked lists, binary search trees, etc.).

I Design data layouts so that:

I Items used together are stored together.


Summary: Principles for effective use of the memory hierarchy II

I Items not used together are stored apart.

I If nodes of a data structure reside at a level of the memory hierarchy where transfer blocks are of size B, then those nodes should be of size k · B when possible, where k > 1; i.e., for main-memory data structures, use nodes that are several cache lines long; for data structures on disk, use nodes that are several pages in size.

I Maximize the amount of useful information in each block.


Bibliography I

[1] Lars Arge. External memory data structures. In Handbook of Massive Data Sets, pages 313-357. Kluwer Academic Publishers, Norwell, MA, USA, 2002.

[2] Jeff Bilmes, Krste Asanovic, Chee-whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997.

[3] Peter J. Denning. The working set model for program behavior. Commun. ACM, 11(5):323-333, 1968.


Bibliography II

[4] Peter J. Denning. The locality principle. In J. Barria, editor, Communication Networks and Computer Systems, pages 43-67. Imperial College Press, 2006.

[5] Josep Díaz, Jordi Petit, and Maria Serna. A survey of graph layout problems. ACM Comput. Surv., 34(3):313-356, 2002.

[6] Ulrich Elsner. Graph partitioning: A survey. Technical Report S393, Technische Universität Chemnitz, 1997.


Bibliography III

[7] Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), page 28, New York, NY, USA, 1995. ACM Press.

[8] Jeffrey Scott Vitter. External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv., 33(2):209-271, 2001.

[9] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, January 2001.


ECE750-TXB Lecture 22: NP-Complete problems

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

March 27, 2007


Tractable problems I

I Recall that P is the class of decision problems (i.e., those with yes/no answers) that can be answered in time O(n^c) on a Turing machine, c a constant.

I In the 1960s, proposals were made to identify the class P with "feasible" problems, i.e., problems for which computers can obtain an answer in a reasonable amount of time. (One early proponent of this idea was Jack Edmonds, a professor at Waterloo and a pioneer of combinatorial optimization.)

I This idea of 'P = feasible' has gained widespread acceptance, of course modulo the common-sense caveats that must accompany such a blanket generalization: algorithms requiring 10^100 · n^2 or O(n^1000) time are unlikely to be practical. However, most 'natural' problems in P (those that arise from questions of interest, as opposed to those artificially


Tractable problems II

constructed by, e.g., diagonalization) have low exponents such as O(n), O(n^2), O(n^3), etc. Very few interesting problems in P have best-known exponents of n^10 or more. So, equating P with practical algorithms is a quite reasonable generalization.

I However, many very important problems are not known to be in P. Some of these problems are conceptually very close to problems that are known to be in P:

I We have seen a polynomial-time algorithm to find the shortest path between two vertices in a graph. What about finding the longest path between two vertices? Nobody knows if this problem can be solved in polynomial time.


Tractable problems III

I An Euler tour of a connected graph is a path that starts and ends at the same vertex and traverses each edge exactly once. Testing whether a graph has an Euler tour is trivial: a connected graph has an Euler tour if and only if every vertex has an even number of edges incident on it. This can clearly be tested in polynomial time! On the other hand, a Hamiltonian circuit is a path that starts and ends at the same vertex and visits each vertex exactly once. It is not known whether Hamiltonian-circuit has a polynomial time decision procedure.

I Many of the interesting problems not known to be in P are easily seen to belong to the class NP, which


Tractable problems IV

informally is the class of problems whose answers we can check in polynomial time. More technically:

      NP is the class of decision problems where, if the answer is YES, there exists a certificate (a string in some alphabet) that can be checked in polynomial time by a Turing machine.

I The terms proof and witness are commonly used in place of certificate.

I NP stands for nondeterministic polynomial time.

I A deterministic machine is one whose state transition relation allows only one possible trace: every state is succeeded by a uniquely defined state, so that a trace looks like

      · → · → · → · · ·


Tractable problems V

I A nondeterministic machine can branch simultaneously into several successor states, and the semantics are typically defined so that the machine accepts an input if any of its branches accept. A trace is a tree branching into the future:

[Figure: a trace tree; each state may fan out into several successor states, branching repeatedly into the future.]


Tractable problems VI

I Do not confuse nondeterminism with randomness! If you know what threads are, a good conceptual model is a multithreaded program where each thread can create another new thread whenever it wants, and all threads run simultaneously with no slowdown, as if there were an infinite number of processors and no resource contention.

I A nondeterministic machine can simultaneously explore all possible certificates, and halt with a YES answer if it finds one that certifies the answer is YES.

I The following diagram illustrates the containment relationships between the classes we will visit in this lecture:


Tractable problems VII

[Figure: containment diagram. P sits inside NP ∩ co-NP; NP and co-NP overlap; the NP-complete problems form the maximal class within NP; the NP-hard problems extend beyond NP.]

I Note that NP contains the class P: every problem in P is automatically an NP problem also.

I Whether NP is a strict superset of P is a central problem in complexity theory, and one of the Clay Institute Millennium Problems. It could be that:

      1. P ≠ NP; or
      2. P = NP; or


Tractable problems VIII

      3. the truth of 'P = NP' is independent of the usual (ZF) set theory axioms, as the continuum hypothesis and axiom of choice were found to be.

I It is widely believed that P ≠ NP, i.e., that hard problems in NP require superpolynomial time.

I Note that 'superpolynomial' includes, but is not equivalent to, exponential time. The class of superpolynomial subexponential functions includes functions of the form n^f(n) where 1 ≺ f ≺ n/ln n. For example, n^(1 + log log n) is superpolynomial and subexponential.

I A didactic ditty on the theme of superpolylogarithmic subexponential functions may be found in [5].


Circuit-Satisfiability I

I Example: Circuit-satisfiability. Given a single-output boolean circuit composed of AND, OR, and NOT gates, is there a setting of the input signals causing the output to be true? (If so, the circuit is said to be satisfiable.)

[Figure: a small circuit with inputs a, b, c and output d.]

I Clearly we can answer this problem by enumerating a truth table of all possible inputs:

      a b c | d
      0 0 0 | 0
      0 1 0 | 0
      1 0 0 | 0
      1 1 0 | 0
      0 0 1 | 0
      0 1 1 | 1 ←


Circuit-Satisfiability II

This requires 2^n time, where n is the number of input signals.

I A 'certificate' that the circuit is satisfiable is a setting of the input signals — a = 0, b = 1, c = 1 — causing the output to be true. We can say that a = 0, b = 1, c = 1 is a witness to the satisfiability of the circuit.

I A verifier can check very efficiently (on a RAM model in, say, O(m) time where m is the number of gates) that this combination of inputs does, in fact, produce d = 1. Hence the problem is in NP.
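
I The brute-force search is a few lines of C++ (a sketch; for illustration we hard-code a function consistent with the truth table above, d = ¬a ∧ b ∧ c, rather than the circuit from the figure):

      #include <cstdio>

      bool circuit(bool a, bool b, bool c) { return !a && b && c; }

      int main() {
          // try all 2^n input settings; n = 3 here
          for (int bits = 0; bits < 8; ++bits) {
              bool a = bits & 1, b = bits & 2, c = bits & 4;
              if (circuit(a, b, c))
                  std::printf("witness: a=%d b=%d c=%d\n",
                              (int)a, (int)b, (int)c);
          }
      }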


The class co-NP I

I The class co-NP consists of decision problems having a NO certificate (proof, witness) that can be checked in polynomial time.

I Example: circuit equivalence. Decision problem: given two single-output boolean circuits, do they compute the same function?

[Figure: two circuits over the inputs a, b, c, with outputs d and d′.]

  Are d and d′ always the same?


The class co-NP II

I Again, we can decide whether the two circuits are equivalent by enumerating the truth table:

      a b c | d  d′
      0 0 0 | 0  0
      0 1 0 | 0  0
      1 0 0 | 0  0
      1 1 0 | 0  0
      0 0 1 | 0  0
      0 1 1 | 1  1
      1 1 1 | 0  0
      1 0 1 | 1  0 ←

  They differ on the input a = 1, b = 0, c = 1. The answer to the decision problem is NO.

I The input signal setting a = 1, b = 0, c = 1 is a witness to the fact that the two circuits are not equivalent.


The class co-NP III

I Note that the complement of an NP set is a co-NP set, and vice versa.

I Example: the set of circuits that are not satisfiable is a co-NP set. (The complement of satisfiable circuits.)

I Example: the set of pairs of circuits that are inequivalent is an NP set. (The complement of equivalent pairs of circuits.)

I It is believed that P ≠ NP ∩ co-NP, but this is not known.


Part I

Some examples of NP problems


Decision vs. optimization problems I

I The classes P and NP are defined in terms of decision problems with YES/NO answers.

I Many of the interesting problems in NP are optimization problems, where one aims to minimize or maximize an objective function. Such problems can be expressed in two versions:

      1. The decision version: Does there exist a ... of cost less than/more than ...? (This has a YES/NO answer, and can be said to be in the class NP if the objective function can be evaluated in polynomial time, etc.)
      2. The optimization version: Find a ... of minimal/maximal cost/profit. (There is a class NPO of NP optimization problems we shall see next lecture, in which problems are expressed in this form.)

I The class of NP-hard optimization problems includes many of high industrial relevance:

I VLSI layout


Decision vs. optimization problems II

I optimizing compilers (register allocation, alias analysis)
I choosing warehouse locations
I vehicle routing (e.g., designing delivery routes)
I designing networks (fibre, highways, subways, public transit)
I spectrum allocation on wireless networks
I airline schedule planning
I packing shipping containers
I choosing an investment portfolio
I verification of circuits
I mixed-integer programming
I (simplified models of) protein folding
I and thousands more!


Example: Graph colouring I

I Graph colouring: given a graph G, an r-colouring of G is a function χ : V → {1, 2, . . . , r} from vertices to "colours" 1, 2, . . . , r such that no two adjacent vertices have the same colour: if (v1, v2) ∈ E, then χ(v1) ≠ χ(v2).

I The least r such that an r-colouring is possible is called the chromatic number of G.

I Graph colouring is an NP problem: a witness to the decision problem "Does there exist a colouring with ≤ r colours?" is an r-colouring of the graph.

I Graph colouring has numerous practical applications. It is an excellent tool for reasoning about contention for resources.
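
I Although computing the chromatic number is hard, a simple greedy heuristic often gives usable colourings in practice. A C++ sketch (ours; it may use more colours than the chromatic number):

      #include <vector>

      // Greedy colouring: visit vertices in index order, giving each the
      // smallest colour not already used by a coloured neighbour.
      std::vector<int> greedy_colouring(const std::vector<std::vector<int> >& adj) {
          int n = adj.size();
          std::vector<int> colour(n, -1);
          for (int v = 0; v < n; ++v) {
              std::vector<bool> used(n + 1, false);
              for (int w : adj[v])
                  if (colour[w] != -1) used[colour[w]] = true;
              int c = 0;
              while (used[c]) ++c;   // first free colour
              colour[v] = c;
          }
          return colour;
      }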


Example: Graph colouring II

I For example, one of the last steps of compiling a program is to emit assembly code. The program will run faster if variables are placed in registers, rather than in stack- or heap-locations. In a typical register allocation scenario, one has a sequence of operations such as this:

      y := 3
      x := y + 1
      z := x + 2
      w := x + 1
      u := z * w

  Each variable has a liveness range: from its initial assignment to its last use. We want to place these variables in registers in such a way that a register is never simultaneously being used to store two variables. (Here, the registers are the resource under contention.)


Example: Graph colouring III

I This problem can be solved by graph colouring [1]:

I Construct a graph whose vertices are variables, and there is an edge (x, y) if the liveness ranges of x and y overlap.

[Figure: the liveness range of each of y, x, z, w, u drawn alongside the code (y = 3; x = y + 1; z = x + 2; w = x + 1; u = z * w; return u), and the resulting interference graph on the vertices y, x, z, w, u.]

I Registers are colours: a valid colouring ensures that if the liveness ranges of two variables overlap, they will be in different registers.

[Figure: the interference graph on y, x, z, w, u coloured with three colours.]


Example: Graph colouring IV

I In this example, three registers suffice: put the 'red' variables in r1, the 'green' variables in r2, and the 'blue' variable in r3.

      r1 ← 3
      r2 ← +(r1, 1)
      r1 ← +(r2, 2)
      r3 ← +(r2, 1)
      r2 ← +(r1, r3)

I Some other examples of colouring:

I Spectrum allocation: given a set of transmitter locations, construct a graph where the vertices are transmitters and edges indicate pairs of transmitters close enough to interfere with each other if assigned the same frequency. Then, the chromatic number of this graph tells you how many different frequencies are required for none of the towers to interfere with each


Example: Graph colouring V

other. A colouring gives you a frequency assignment that achieves this.

I Circuit layout: given a planar layout of a circuit, build a graph where vertices represent wires and there is an edge between two wires if they cross. Then, the chromatic number tells you how many layers suffice to route all the signals with none of them crossing. (This also works for subways!)


Example: Steiner networks I

I Suppose a company has a number of downtown locations it wants to connect with a highspeed fibre network. Suppose fibre costs ≈ $100/m to lay in a downtown area, and can only travel along streets.

[Figure: a downtown street grid with the company's locations marked.]

I Problem: find a minimal-cost network connecting the locations.

[Figure: the same street grid with a minimal-cost network drawn along the streets.]


Example: Steiner networks II

I This is the rectilinear Steiner tree problem, which has been used in VLSI layout to route signals. ('Rectilinear' here refers to the constraint that edges travel only up-down or left-right.)

I Finding a minimal Steiner tree is an optimization problem, the decision version of which (Does there exist a network of cost < x?) is in NP.


Example: Graph Partitioning I

I Problem: Given a weighted graph G = (V, E, w), where w : E → Q+, partition V into V1, V2 where ||V1| − |V2|| ≤ 1 and so that the total weight of cut edges is minimized:

      minimize   Σ w(v1, v2)   over (v1, v2) ∈ E with v1 ∈ V1, v2 ∈ V2

I This has numerous applications in parallel load balancing, circuit layout, improving locality of reference, etc.


Part II

NP-Complete problems


Propositional Satisfiability I

I Recall that propositional logic consists of:

      Boolean variables   V = {p1, p2, . . .}
      Literals            pi, ¬pi
      Formulas            ϕ ::= pi | ϕ ∧ ϕ | ϕ ∨ ϕ | ¬ϕ

I A formula in conjunctive normal form (CNF) is a conjunction of disjunctions, e.g.,

      (p1 ∨ p2) ∧ (p2 ∨ p3 ∨ p4) ∧ (p1 ∨ p3)

I ⊤ is true, ⊥ is false.


Propositional Satisfiability II

I A truth assignment is a substitution ρ : V → {⊤, ⊥}. We extend ρ to a function ρ̄ on formulas:

      ρ̄(pi)     = ρ(pi)
      ρ̄(¬ϕ)     = ⊤ if ρ̄(ϕ) = ⊥;  ⊥ if ρ̄(ϕ) = ⊤
      ρ̄(α ∨ β)  = ⊤ if ρ̄(α) = ⊤ or ρ̄(β) = ⊤;  ⊥ otherwise
      ρ̄(α ∧ β)  = ⊤ if ρ̄(α) = ⊤ and ρ̄(β) = ⊤;  ⊥ otherwise

I The propositional satisfiability problem, or SAT: given a propositional formula ϕ with free variables p1, . . . , pk, decide whether there exists a truth assignment ρ such that ρ̄(ϕ) = ⊤.
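
I The recursive definition of ρ̄ translates directly into code. A C++ sketch (the representation and names are ours):

      #include <vector>

      enum Kind { Var, Not, And, Or };

      struct Formula {
          Kind kind;
          int  var;                      // used when kind == Var
          const Formula *left, *right;   // right is unused for Not
      };

      // rho-bar: evaluate a formula under the truth assignment rho,
      // where rho[i] is the value assigned to variable p_i.
      bool eval(const Formula* f, const std::vector<bool>& rho) {
          switch (f->kind) {
              case Var: return rho[f->var];
              case Not: return !eval(f->left, rho);
              case And: return eval(f->left, rho) && eval(f->right, rho);
              case Or:  return eval(f->left, rho) || eval(f->right, rho);
          }
          return false;   // unreachable
      }

  A brute-force SAT procedure then simply calls eval on all 2^k assignments.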


Propositional Satisfiability III

I Obviously SAT and Circuit-satisfiability are closely related. We can formalize this by means of a "Karp reduction" (or many-one reduction):

I Let us make the simplifying assertion that all decision problem instances can be expressed as binary strings. For example, a boolean circuit can be described by a binary string that enumerates the gates and their interconnections in some prefix-free code. Letting Z ⊆ Σ* be the set of all strings representing circuits that are satisfiable, the decision problem "Is this circuit satisfiable?" becomes "Is the string x (representing the circuit) in the set Z?"

I A many-one, or Karp, reduction from a problem Y to a problem Z is a function r : Σ* → Σ* computable in polynomial time such that y ∈ Y if and only if r(y) ∈ Z.

I Intuitively, we take a problem of one sort and turn it into a problem of a different sort, while preserving the YES/NO outcome of the decision.


Propositional Satisfiability IV

I For example, let Y be the set of binary strings representing satisfiable boolean formulas, and Z be binary strings representing satisfiable circuits. To reduce Y to Z, we just convert the boolean formulas into a circuit: r(α ∧ β) becomes an AND gate whose inputs are r(α) and r(β), and so forth. This translation from a boolean formula to a circuit can be done in polynomial time: SAT is many-one reducible to Circuit-SAT.

I Note:

      If Y is Karp-reducible to Z, then a polynomial-time solution for Z would imply a polynomial-time solution for Y.

  (We can turn instances of Y into instances of Z in polynomial time.)


Propositional Satisfiability V

I Karp-reducibility is a preorder: transitive and reflexive. We can write

      Y ≤ᴾₘ Z

  to mean Y is Karp-reducible to Z. (The P stands for polynomial time, and the m stands for many-one.)

I Recall that whenever we have a preorder, we can turn it into an equivalence relation and a partial order (a hierarchy):

I Define X ∼ Y to mean (X ≤ᴾₘ Y) ∧ (Y ≤ᴾₘ X). (This is an equivalence relation, whose equivalence classes consist of problems that are interreducible. These equivalence classes are sometimes called degrees.)

I Then define [X]∼ ≤ [Z]∼ on equivalence classes by [X]∼ ≤ [Z]∼ iff X ≤ᴾₘ Z.

I The class P is the least element of the partial order.


Propositional Satisfiability VI

I If we restrict ourselves to problems in NP, this poset has a maximal element: the class NPC of NP-complete problems.

I It is known that if P ≠ NP, then there are equivalence classes strictly between P and NPC, called NP-intermediate problems. Proving the existence of such a class would immediately imply P ≠ NP.

I Defn: X is NP-hard if, for any Y ∈ NP, Y ≤ᴾₘ X. (Note: X is not required to be in NP. For example, the halting problem is NP-hard.)

I Defn: X is NP-complete if:

      1. X ∈ NP; and
      2. X is NP-hard.

I Roughly speaking, NPC (the set of NP-complete problems) consists of the very hardest problems known to be in NP; if we could solve just one of those problems in polynomial time, then every problem in NP could be solved in polynomial time, which would imply P = NP.


Propositional Satisfiability VII

I The first NP-completeness proof was the Cook-Levin theorem (1971): propositional satisfiability (SAT) is an NP-complete problem. (Stephen Cook is a professor at the University of Toronto.)

I Proof sketch: encode valid traces of a nondeterministic Turing machine running in s steps and s tape locations as a propositional formula. Set boundary conditions so that if the formula is satisfiable the machine halts with a YES output.

I Import: any problem in NP is Karp-reducible to SAT; a polynomial time algorithm for SAT would imply P = NP.

I To prove a problem X is NP-complete, it is necessary to prove that (i) the problem is in NP (this is usually easy); and that (ii) some problem already known to be in NPC can be reduced to X.


Propositional Satisfiability VIII

I Since the original NPC problems were discovered, many useful problems have been proven to be in this class; there are some 1000+ known NPC problems!

I The classic text on the subject: Garey and Johnson [2].

I Why is knowing whether a problem is in NPC important?

I You know not to waste your time looking for a polynomial-time algorithm (unless you have a fondness for tilting at windmills).

I There is a rich understanding of how NPC optimization problems can be approximately solved in polynomial time, e.g., to within a small difference from optimality. Identifying a problem as NPC is a first step to figuring out how to best approach its approximation.


Example NPC proof: Independent set I

I To determine if a problem is NP-complete, your first step should be to check a catalogue of known NP-complete problems, such as the online survey "A compendium of NP optimization problems".

I We will look at a simple example of an NP-completeness proof as an example of how such results are obtained.

I In a graph G = (V, E), a subset of vertices V′ ⊆ V is an independent set if for all v1, v2 ∈ V′, there is no edge (v1, v2) ∈ E.

[Figure: a graph in which four green vertices form an independent set of size 4.]


Example NPC proof: Independent set II

I An independent set V′ is the 'opposite' of a clique, in the sense that the vertices V′ form a clique in the graph (V, V² \ E).

I The Independent Set problem: given a graph G = (V, E) and a positive integer m, decide if G has an independent set on m vertices.

I Proving NP-completeness requires proving (i) the problem is in NP; (ii) a known NP-complete problem can be reduced to it.

I Clearly, independent set is in NP: a certificate is the set of m vertices forming the independent set, which can be checked in polynomial time.

I We will reduce 3-SAT to Independent Set. (This proof and example are from [3].)


Example NPC proof: Independent set III

I 3-SAT is a restricted form of propositional satisfiability, in which formulas are in CNF form, with each disjunction containing at most three literals, for example:

      (x1 ∨ x2 ∨ x3) ∧ (x1 ∨ x2 ∨ x3) ∧ (x1 ∨ x2 ∨ x3) ∧ (x1 ∨ x2 ∨ x3)

  (We write x̄i as a shortform for ¬xi.)

I 3-SAT is known to be an NP-complete problem: every SAT problem can be rewritten into the above form.

I We demonstrate a translation from a 3-SAT problem to an independent set problem, such that an independent set of m vertices exists if and only if the 3-SAT formula is satisfiable.

I For a 3-SAT formula of m clauses (disjunctions), construct a graph where:

I There is a vertex for each occurrence of a literal in the formula;


Example NPC proof: Independent set IV

I Within each disjunction, there is an edge between each pair of vertices corresponding to literals in that disjunction;

I There is an edge between every occurrence of a literal and its negation, i.e., we connect each x1 to each x̄1, etc.

I This graph can be constructed in polynomial time.

[Figure: the graph built from the four-clause formula above. Each clause contributes a triangle of literal-vertices; additional edges join each literal to every occurrence of its negation.]

I We then ask: Is there an independent set of size m?

I In the above graph, the green vertices show an independent set of size 4.
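
I The construction itself is mechanical. A C++ sketch (ours), taking clauses as lists of signed integers (+i for xi, −i for x̄i) and returning the edge list:

      #include <utility>
      #include <vector>

      typedef std::vector<std::vector<int> > Clauses;  // <= 3 literals per clause

      std::vector<std::pair<int, int> >
      reduction_graph(const Clauses& clauses, std::vector<int>& vertex_literal) {
          std::vector<std::pair<int, int> > edges;
          for (std::size_t c = 0; c < clauses.size(); ++c) {
              int base = vertex_literal.size();
              // one vertex per literal occurrence; a clique within the clause
              for (std::size_t i = 0; i < clauses[c].size(); ++i) {
                  vertex_literal.push_back(clauses[c][i]);
                  for (std::size_t j = 0; j < i; ++j)
                      edges.push_back(std::make_pair(base + (int)j, base + (int)i));
              }
          }
          // an edge between every literal occurrence and its negation
          for (std::size_t u = 0; u < vertex_literal.size(); ++u)
              for (std::size_t v = u + 1; v < vertex_literal.size(); ++v)
                  if (vertex_literal[u] == -vertex_literal[v])
                      edges.push_back(std::make_pair((int)u, (int)v));
          return edges;
      }

  Asking for an independent set of size equal to the number of clauses completes the reduction.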


Example NPC proof: Independent set V

I Because each clause (disjunction) corresponds to a clique (e.g. a triangle) in the graph, any independent set on m vertices must have at most one vertex from each clique.

I Choose a truth assignment so that the literal corresponding to each vertex of the independent set is ⊤ (true). E.g., if x̄2 is in the independent set, we choose x2 = ⊥ so that x̄2 = ⊤.

I Since there is an edge between all literals and their negations, we cannot simultaneously choose xi = ⊤ and xi = ⊥; so a consistent truth assignment exists.

I We will have one vertex from each clique = one literal from each clause made true. Therefore each disjunction will evaluate to ⊤, and their conjunction will be ⊤ also.

I E.g., in the above example we choose x1 = ⊤, x2 = ⊥, and x3 can be either ⊤ or ⊥.


Example NPC proof: Independent set VI

I The graph has an independent set of size m if and only if the 3-SAT formula is satisfiable. Since the reduction can be done in polynomial time, this is a Karp reduction.

I This reduction demonstrates that if we could solve Independent Set in polynomial time, we could solve 3-SAT in polynomial time, which in turn implies we could solve any problem in NP in polynomial time.

I Hence Independent Set is NP-Complete.


Bibliography I

[1] Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. Register allocation via coloring. Computer Languages, 6(1):47-57, 1981.

[2] M. R. Garey and D. S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.

[3] H. R. Lewis and C. H. Papadimitriou. Elements of the Theory of Computation. Prentice-Hall, Englewood Cliffs, New Jersey, 1981.



Bibliography II

[5] Alan T. Sherman. On superpolylogarithmic subexponential functions (Part I). SIGACT News, 22(1):65, 1991.

Page 348: ECE750-TXB Lecture 1: Asymptotics Motivationece750-ads/notes/alllectures.pdf · Basic asymptotic notations Asymptotic behaviour as n ! 1 , where for our purposes n is the \problem


ECE750-TXB Lecture 23: Approximation Algorithms for NP-Complete problems

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

March 29, 2007


Approximation Algorithms I

I Recall that
  I P = decision problems, decidable in polynomial time.
  I NP = decision problems, YES certificates checkable in polynomial time.
  I NPC = the 'hardest' problems in NP; if any of these can be solved in polynomial time then P = NP.

I It is believed that P ≠ NP: NPC problems are thought to require superpolynomial time.

I However, for many useful optimization problems we can obtain approximate solutions in polynomial time.

I In an optimization problem, we are trying to minimize or maximize some objective function; the corresponding NP decision version is a question "Is there a solution with cost ≤ k?" or "Is there a solution with profit ≥ k?"



Approximation Algorithms II

I The class NPO consists of optimization problems whose decision version is in NP. Technically, an NPO problem consists of:

1. A set of valid problem instances D ⊆ Σ∗, recognizable in polynomial time. (Here we take Σ∗ to be binary strings, for simplicity.)
2. A set of feasible solutions S ⊆ Σ∗, and a polynomial time algorithm that decides S given a problem instance.
3. An objective function J(I, s) that maps an instance I ∈ D and a solution s ∈ S to a cost or profit in Q+.
4. An indication of whether we aim to maximize or minimize J.


Approximation Algorithms III

I For example, the problem Maximum Independent Set asks for the largest set of vertices in a graph such that no two vertices have an edge between them. This is an NPO problem: the problem instances are graphs (suitably encoded as binary strings), feasible solutions are independent sets, and the objective function measures the size of an independent set.

I Note that a polynomial time algorithm for an NPO optimization problem implies a polynomial time algorithm for the decision version: e.g., to answer the question "Does this graph have an independent set of ≥ 4 vertices?" we can find the largest independent set and see how large it is.

I If the decision version of an optimization problem is NP-complete, then a polynomial time algorithm for the optimization problem would imply P = NP.



Approximation Algorithms IV

I Between P and NP-completeness there are several grades of optimization problems where we can obtain approximate solutions in polynomial time. Let OPT represent the optimal value of an objective function, and suppose for simplicity we seek to minimize the objective function.

1. The class APX (approximable) contains NPO problems where it is possible to obtain a solution of cost ≤ δ · OPT in polynomial time, for some fixed constant δ.
2. The class PTAS (Polynomial Time Approximation Scheme) contains NPO problems where for any fixed constant ε > 0, we can obtain ≤ (1 + ε) · OPT in polynomial time. (However, the time required might be n^f(ε) where f(ε) → ∞ as ε → 0. The algorithm is polynomial only if ε is fixed.)


Approximation Algorithms V

3. The class FPTAS (Fully Polynomial Time Approximation Scheme) contains NPO problems where for any fixed constant ε > 0, we can obtain ≤ (1 + ε) · OPT in time O(n^c ε^{-d}) where c, d are constants.
4. For maximization problems, replace ≤ (1 + ε) · OPT with ≥ (1 − ε) · OPT.

I The containment relations between these classes are P ⊆ FPTAS ⊆ PTAS ⊆ APX ⊆ NPO:

[Figure: nested regions NPO ⊇ APX ⊇ PTAS ⊇ FPTAS ⊇ P.]




Example: Vertex cover I

I Suppose that to improve the performance of a computer network, you want to collect statistics on packets being transmitted. Draw a graph where the vertices are computers/routers, and edges are communication links.

Collecting statistics can slow down the network, and requires installing custom software etc. So, we want to monitor the traffic on as few nodes as possible. We choose as small a set of nodes as possible on which to install the monitoring software, so that each communication link has monitoring software on at least one end, e.g.:



Example: Vertex cover II

If we install the monitoring software on the green nodes, every communication link has at least one end green.

I This is a vertex cover.

I Minimum Vertex Cover: Given a graph G = (V, E), find a set V′ ⊆ V of minimal size such that every edge has at least one vertex in V′.

I This problem is in NPO:

1. Instances: graphs.
2. Feasible solutions: subsets of vertices V′ such that every edge has at least one end in V′.
3. Objective function: |V′|.


Example: Vertex cover III

4. It is a minimization problem.

I The decision version (Does G have a vertex cover of ≤ m vertices?) is in NP: a vertex cover V′ with |V′| ≤ m is a YES certificate, checkable in polynomial time (see the sketch below).

I Minimum Vertex Cover is known to be NP-complete (in its decision version).
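
The certificate check itself is a one-line predicate; a sketch with our own naming:

    # Polynomial-time check of a YES certificate for the decision version:
    # is Vp a vertex cover of (V, E) with at most m vertices?
    def is_vertex_cover(edges, Vp, m):
        return len(Vp) <= m and all(u in Vp or v in Vp for u, v in edges)

Checking costs O(|E|) set lookups, which is what puts the decision version in NP.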



An approximation algorithm for vertex cover I

I Now let's see an approximation algorithm for vertex cover that lets us get a nearly optimal answer in polynomial time.

I The approximation we'll see uses a maximal matching.

I A matching of a graph G = (V, E) is a subset of edges E′ ⊆ E such that no vertex is the endpoint of more than one edge.

I A matching is maximal if it cannot be enlarged.

[Figure: a graph (left) and a maximal matching (right). Picking any more edges would result in a vertex being the endpoint of two edges.]


An approximation algorithm for vertex cover II

I The vertices that are endpoints of the edges in the matching form a vertex cover:

[Figure: the endpoints of the edges in the matching (green) form a vertex cover of the graph.]

I Why? Suppose the endpoints of the maximal matching did not form a vertex cover. By the definition of a vertex cover, there would be an edge (v1, v2) such that neither v1 nor v2 is in the cover, and hence neither is an endpoint of an edge in the maximal matching. But then (v1, v2) could be added to the matching, contradicting the premiss that the matching is maximal.



An approximation algorithm for vertex cover III

I This gives us an approximation algorithm for Minimum Vertex Cover:

1. Find a maximal matching E′ ⊆ E. (This can be done by simply considering the edges in arbitrary order, and adding them to E′ if they do not have an endpoint in common with an edge already in E′.)
2. Output the list of vertices that are endpoints of E′.
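
Before analyzing the approximation ratio, here is a minimal sketch of the algorithm in Python (the function name and edge-list representation are our own):

    # Matching-based 2-approximation for Minimum Vertex Cover.
    # Edges are pairs of hashable vertices, in arbitrary order.
    def vertex_cover_2approx(edges):
        """Return a vertex cover of size <= 2 * OPT."""
        matched = set()               # endpoints of the matching so far
        for u, v in edges:
            if u not in matched and v not in matched:
                matched.update((u, v))   # (u, v) joins the matching
        return matched                # endpoints of a maximal matching

    cover = vertex_cover_2approx([(1, 2), (2, 3), (3, 4), (4, 1)])
    # On the 4-cycle this returns all four vertices; the optimum is 2,
    # consistent with the factor-2 guarantee proved below.

Each edge is examined once, so the running time is linear in |E|.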

Theorem. This approximation algorithm yields a vertex cover of size ≤ 2 · OPT.


An approximation algorithm for vertex cover IV

Proof. Let E′ ⊆ E be a maximal matching. We prove that |E′| ≤ OPT. Since the vertex cover output is of size 2|E′|, this establishes that the size of the vertex cover is ≤ 2 · OPT. Suppose to the contrary that |E′| > OPT. Let V∗ ⊆ V be a vertex cover with |V∗| = OPT. The edges of a matching are vertex-disjoint, so by pigeonhole there must be an edge (v1, v2) ∈ E′ such that neither v1 nor v2 is in V∗; but this contradicts the premiss that V∗ is a vertex cover. □

[Figure: illustration of the contradiction: if the maximal matching contains 4 edges, there cannot be a vertex cover of only 3 vertices (green), since this would leave an edge uncovered.]



An approximation algorithm for vertex cover V

I The maximal matching approximation finds a vertex cover that is at most twice optimal. This establishes that Minimum Vertex Cover is in the class APX.

I Whether it is possible to do better than 2 · OPT is an open problem.


Steiner trees I

I A network design problem is one of the following form: given a set of points/vertices, find a minimal network spanning them. For example, we might consider how to build roads between four cities so that the total road length is minimized.



Steiner trees II

I A more general setting of the problem is the Minimum Steiner Tree problem on graphs:

Given a graph G = (V, E), a positive weight w(v1, v2) on edges, and a subset V′ ⊆ V of vertices to be connected, find a minimum cost subtree of G that includes all the vertices in V′.

I The cost is measured as the sum of the weights of the edges in the tree.

I The tree may contain vertices not in V′; these are called Steiner nodes.


Steiner trees III

I Example:

[Figure: a weighted graph with terminal vertices a, b, c, d and several Steiner nodes; edge weights range from 1 to 4. The black vertices are the set V′ to be connected.]

I The decision version of this problem is known to be NP-complete.

I However, there is a fast approximation algorithm:

1. For each pair v1, v2 ∈ V′, compute the shortest path between them, and call its length d(v1, v2).



Steiner trees IV

2. Construct a complete graph G′ on V′, having edges for every pair v1, v2 ∈ V′, and weights d(v1, v2).

[Figure: the complete graph G′ on a, b, c, d, weighted by shortest-path distances.]

3. Find a minimum spanning tree on the graph G′.

[Figure: a minimum spanning tree of G′.]


Steiner trees V

4. Construct a tree on the original graph G by taking the union of the paths represented by the edges of the minimum spanning tree on the graph G′. When necessary, prune edges to remove cycles.

[Figure: the resulting subtree of G connecting a, b, c, d through Steiner nodes.]

Theorem. This approximation algorithm produces a network of cost ≤ 2 · OPT.

The proof involves Euler tours and Hamiltonian cycles; see e.g. [1].
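
The four steps translate almost line-for-line into code. Here is a sketch using the networkx library (our choice for illustration; the lecture does not prescribe one):

    # 2-approximation for Minimum Steiner Tree via the metric closure.
    import networkx as nx

    def steiner_2approx(G: nx.Graph, terminals) -> nx.Graph:
        terminals = list(terminals)
        # Steps 1-2: complete graph G' on the terminals, weighted by
        # shortest-path distances in G (the "metric closure").
        closure = nx.Graph()
        for i, u in enumerate(terminals):
            for v in terminals[i + 1:]:
                d = nx.shortest_path_length(G, u, v, weight="weight")
                closure.add_edge(u, v, weight=d)
        # Step 3: minimum spanning tree of G'.
        mst = nx.minimum_spanning_tree(closure, weight="weight")
        # Step 4: union of the shortest paths behind each MST edge,
        # then prune cycles by taking a spanning tree of the union.
        union = nx.Graph()
        for u, v in mst.edges():
            path = nx.shortest_path(G, u, v, weight="weight")
            for a, b in zip(path, path[1:]):
                union.add_edge(a, b, weight=G[a][b]["weight"])
        return nx.minimum_spanning_tree(union, weight="weight")

The returned tree costs no more than the MST of G′, which the theorem bounds by 2 · OPT.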



Steiner trees VI

I This approximation algorithm establishes that Minimum Steiner Tree is in APX. However, the problem is known to be APX-complete; the APX-complete problems are the 'hardest' problems in APX, and this implies that Minimum Steiner Tree is not in PTAS unless P = NP.

I There is an approximation algorithm for this problem that achieves (1 + (ln 3)/2) · OPT ≈ 1.55 · OPT.


Knapsack I

I So far we have seen two examples of problems in APX. Now let's see one in FPTAS, where we can get to within (1 + ε) · OPT in O(n^c ε^{-d}) time, i.e., time polynomial in n and 1/ε.

I In a knapsack problem, one has

1. A set S = {a1, . . . , an} of objects;
2. A profit function profit : S → Z+;
3. A size function size : S → Z+;
4. A knapsack capacity B ∈ Z+.

I A feasible solution is a subset S′ ⊆ S of objects with ∑_{s∈S′} size(s) ≤ B.

I The objective is to maximize the profit ∑_{s∈S′} profit(s).



Knapsack II

I Example: here is a knapsack problem with 3 objects a1, a2, a3.

    Object   size(ai)   profit(ai)
    a1          3           2
    a2          2           1
    a3          1           3

If the capacity of the knapsack is B = 5, the best solution is to pick objects a1 and a3, which achieves a profit of 5 and a size of 4.

I Knapsack has numerous real, practical applications:

I Shipping (deciding how to pack shipping containers to maximize profit);
I Multirate networks (given a total available bandwidth and users who bid for various bandwidths at various prices, what subset of users should be chosen to maximize profit?)


Knapsack III

I Capital budgeting (given a fixed amount of money available for capital purchases, a set of objects that could be bought, and an estimate of how much their purchase would benefit the company, what subset of objects should be purchased?)

I Web caching (given a fixed amount of memory/disk space on a web cache, web pages of various sizes, and estimates of how much latency/bandwidth could be saved by caching those pages, what subset of pages should be cached?)

I The naive approach of greedily choosing objects with the best profit/size ratio can be made to perform arbitrarily badly. (Consider for example objects a, b with sizes 1, 100, profits 10, 900, and capacity 100: greedy takes a first, since 10/1 > 900/100, after which b no longer fits; it earns 10 where the optimum is 900.)
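
A quick check of that counterexample (a throwaway script of our own, not part of the lecture):

    # Greedy by profit/size ratio on the counterexample above.
    items = [("a", 1, 10), ("b", 100, 900)]   # (name, size, profit)
    capacity, load, profit = 100, 0, 0
    for name, size, p in sorted(items, key=lambda t: t[2] / t[1], reverse=True):
        if load + size <= capacity:
            load, profit = load + size, profit + p
    print(profit)   # 10 -- taking b alone would have earned 900

Scaling the capacity and the big item's profit up, while keeping the small item's ratio slightly higher, makes the gap as large as we please.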



Knapsack IV

I Knapsack (in its decision version) is an NP-complete problem. However, it has an FPTAS: we can obtain a solution that is ≥ (1 − ε) · OPT in time O(n^3 ε^{-1}): with each doubling of the amount of time we are willing to spend, we can halve the distance between the approximate answer and the optimal answer.

I The approximation algorithm we will see is based on dynamic programming. Let P = max_{s∈S} profit(s) be the maximum profit of any item. We can find an exact solution to the knapsack problem in time O(n^2 P):

1. Clearly the optimal solution has to have profit ≤ n · P: the number of objects times the maximum profit of any object.
2. We will consider the objects in order, and figure out how much profit we can make with some subset of the first i objects.
3. Let S(i, p) be a subset of {a1, . . . , ai} with profit p and minimum size, or ∅ if no such set exists.


Knapsack V

4. Let A(i, p) be the size of S(i, p), or +∞ if it doesn't exist:

       A(i, p) = +∞                        if S(i, p) = ∅
       A(i, p) = ∑_{s∈S(i,p)} size(s)      otherwise

5. Given the solutions to A(i, p) for the first i objects, we can obtain the solutions for A(i + 1, p) by considering either taking or not taking object ai+1, for each value of p:

   I If profit(ai+1) ≤ p, then
       A(i + 1, p) = min( A(i, p), size(ai+1) + A(i, p − profit(ai+1)) )
   I Otherwise, A(i + 1, p) = A(i, p).

6. Once we have solved all the A(n, p), we choose the maximum p such that A(n, p) ≤ B. This is the optimal answer. (To recover the objects in the optimal set, we need to keep track of them as we build the table; but this is not difficult.)
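
A direct transcription of this table-filling scheme (a sketch with our own naming; for brevity it returns only the optimal profit, not the set):

    # Exact knapsack DP indexed by profit, O(n^2 * P) time.
    import math

    def knapsack_exact(sizes, profits, B):
        """Maximum total profit of a subset with total size <= B."""
        n, P = len(sizes), max(profits)
        INF = math.inf
        # A[p] = minimum size achieving profit exactly p with the
        # objects considered so far (the row A(i, .) of the table).
        A = [0] + [INF] * (n * P)
        for i in range(n):
            # Scan p downward so each object is taken at most once.
            for p in range(n * P, profits[i] - 1, -1):
                A[p] = min(A[p], sizes[i] + A[p - profits[i]])
        return max(p for p in range(n * P + 1) if A[p] <= B)

    # The running example: sizes (3, 2, 1), profits (2, 1, 3), B = 5.
    print(knapsack_exact([3, 2, 1], [2, 1, 3], 5))   # -> 5

The final array reproduces row i = 3 of the table on the next slide.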



Knapsack VI

I Example: Here is the table of A(i, p) for the example shown earlier: the maximum profit of any one object is P = 3, so we needn't look farther than a profit of nP = 9.

    i\p   0   1   2   3   4   5   6   7   8   9
     1    0   ∞   3   ∞   ∞   ∞   ∞   ∞   ∞   ∞
     2    0   2   3   5   ∞   ∞   ∞   ∞   ∞   ∞
     3    0   2   3   1   3   4   6   ∞   ∞   ∞

I The running time is determined by the size of this table, which is controlled by n and P. If we could make either n or P smaller, the algorithm would run faster.

I We choose to make P smaller by scaling and rounding the profit values so the table A(i, p) becomes smaller; this yields an approximate answer.

I Approximation algorithm:


Knapsack VII

1. Let K = εP/n, where P is the maximum profit of any item. We will be effectively rounding profits down to the nearest multiple of K; as ε → 0, the rounding has less and less effect.
2. For each object ai, let profit′(ai) = ⌊profit(ai)/K⌋.
3. Use dynamic programming to solve this new problem and find the most profitable set S′.
4. Output S′.
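
As a sketch, the whole FPTAS is a thin wrapper around the exact DP (here knapsack_exact_set is a hypothetical variant of the earlier sketch that also returns the chosen index set, i.e., it does the bookkeeping mentioned in step 6):

    # FPTAS wrapper: scale profits down, solve exactly on the rounded
    # instance, report the original profit of the chosen set.
    def knapsack_fptas(sizes, profits, B, eps):
        """Total profit >= (1 - eps) * OPT, in time poly(n, 1/eps)."""
        n, P = len(profits), max(profits)
        K = eps * P / n
        scaled = [int(p // K) for p in profits]        # floor(profit / K)
        chosen = knapsack_exact_set(sizes, scaled, B)  # hypothetical helper
        return sum(profits[i] for i in chosen)

The rounded profits are at most ⌊n/ε⌋, so the DP table shrinks to n · n⌊n/ε⌋ = O(n^3/ε) entries, which is where the O(n^3 ε^{-1}) running time comes from.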

Theorem. This approximation algorithm yields a profit of ≥ (1 − ε) · OPT.

Proof. Let O be a set of objects achieving the optimal profit. Write OPT = profit(O) for the sum of the profits of the objects in O. Since for each object our maximum rounding error of the profit is ≤ K, we have

    profit(O) − K · profit′(O) ≤ nK        (1)

where profit(O) = OPT and K · profit′(O) is the profit after rounding.



Knapsack VIII

Rearranging,

    K · profit′(O) ≥ profit(O) − nK        (2)

Since the dynamic programming algorithm yields an optimal solution for the rounded problem, the answer we get (S′) must be at least as good as O under the rounded profits:

    profit(S′) ≥ K · profit′(O)                       (3)
               ≥ profit(O) − nK     from Eqn. (2)     (4)
               = OPT − εP           using K = εP/n    (5)

Since OPT ≥ P (assuming no objects are bigger than the knapsack!),

    profit(S′) ≥ (1 − ε) OPT    □


Knapsack IX

I This establishes that Knapsack is in FPTAS: there is an approximation algorithm yielding ≥ (1 − ε) · OPT in time O(n^3 ε^{-1}).



Where to go from here?

I Vazirani's text Approximation Algorithms [1] is an excellent introduction to this area.

I There is another text I have not yet had the opportunity to read, by Ausiello et al., Complexity and Approximation, that includes a reference guide to known results on approximation. The reference guide is available online and is well worth browsing:

I A Compendium of NP-Hard Optimization Problems
I http://www.nada.kth.se/~viggo/wwwcompendium/

The compendium is organized hierarchically by problem class (graph theory, network design, etc.) and for each problem lists the best known approximation algorithms. It includes almost 500 references to the literature.


Bibliography I

[1] Vijay V. Vazirani. Approximation Algorithms. Springer-Verlag, 2001.