Locality Sensitive Distributed Computing

David Peleg, Weizmann Institute

Structure of mini-course

1. Basics of distributed network algorithms

2. Locality-preserving network representations

3. Constructions and applications

Part 1: Basic distributed algorithms

• Model
• Broadcast
• Tree constructions
• Synchronizers
• Coloring, MIS

The distributed network model

Point-to-point communication network

The distributed network model

Described by an undirected weighted graph G(V,E,ω)

V = {v1,…,vn} - processors (network sites)
E - bidirectional communication links

The distributed network model

ω: E → ℝ+ - edge weight function representing transmission costs (usually satisfies the triangle inequality)

Unique processor IDs: ID: V → S, where S = {s1,s2,…} is an ordered set of integers

Communication

Processor v has deg(v,G) ports (external connection points)

Edge e represents pair ((u,i),(v,j)) = link connecting u's port i to v's port j

Communication

Message transmission from u to neighbor v:

• u loads M onto port i
• v receives M in the input buffer of port j

Communication

Assumption: At most one message can occupy a communication link at any given time (the link is available for the next transmission only after the previous message has been removed from the input buffer by the receiving processor)

Allowable message size = O(log n) bits (messages carry a fixed number of vertex IDs, e.g., sender and destination)

Issues unique to distributed computing

There are several inherent differences between the distributed and the traditional centralized-sequential computational models.

Communication

In centralized setting: Issue nonexistent

In distributed setting: Communication
• has its limits (in speed and capacity)
• does not come “for free”

Communication should be treated as a computational resource, such as time or memory (often the dominating consideration)

Communication as a scarce resource

One common model: LOCAL

Assumes local processing comes for free (Algorithm pays only for communication)

Incomplete knowledge

In centralized-sequential setting: Processor knows everything (inputs, intermediate results, etc.)

In distributed setting: Processors have a very partial picture

Partial topological knowledge

Model of anonymous networks: identical nodes, no IDs, no topology knowledge

Intermediate models:
• estimates for network diameter, # nodes, etc.
• unique identifiers
• neighbor knowledge

Partial topological knowledge (cont)

Permissive models: topological knowledge of large regions, or even the entire network

Structured models: known sub-structure, e.g., spanning tree / subgraph / hierarchical partition / routing service available

Other knowledge deficiencies

• know only local portion of the input
• do not know who else participates
• do not know current stage of other participants

Coping with failures

In centralized setting: Straightforward - upon abnormal termination or system crash: locate the source of the failure, fix it and go on.

In distributed setting: Complication - when one component fails, others continue

Ambitious goal: ensure protocol runs correctly despite occasional failures at some machines (including “confusion-causing failures”, e.g., failed processors sending corrupted messages)

Timing and synchrony

Fully synchronous network:
• All link delays are bounded
• Each processor keeps a local clock
• Local pulses satisfy the following property:

Think of entire system as driven by global clock

A message sent from v to neighbor u at pulse p of v arrives at u before its pulse p+1

Timing and synchrony

Machine cycle of processors - composed of 3 steps:

1. Send msgs to (some) neighbors

2. Wait to receive msgs from neighbors

3. Perform some local computation

Asynchronous model

Algorithms are event-driven :

• No access to global clock

• Messages sent from processor to neighbor arrive within finite but unpredictable time

Asynchronous model

Clock can't tell if a message is coming or not: perhaps “the message is still on its way”

Impossible to rely on ordering of events (might reverse due to different message transmission speeds)

Nondeterminism

Asynchronous computations are inherently nondeterministic (even when protocols do not use randomization)

Nondeterminism

Reason: Message arrival order may differ from one execution to another (e.g., due to other events concurrently occurring in the system – queues, failures)

Run same algorithm twice on same inputs - get different outputs / “scenarios”


Complexity measures

• Traditional (time, memory)
• New (messages, communication)

Time

For synchronous algorithm Π:
Time(Π) = (worst case) # pulses during execution

For asynchronous algorithm Π?

(Even a single message can incur arbitrary delay!)

Time

For asynchronous algorithm Π:

Time(Π) = (worst-case) # time units from start to end of execution, assuming each message incurs a delay of at most 1 time unit (*)

Time

Note:
1. Assumption (*) is used only for performance evaluation, not for correctness.
2. (*) does not restrict the set of possible scenarios – any execution can be “normalized” to fit this constraint.
3. “Worst-case” means all possible inputs and all possible scenarios over each input.

Memory

Mem(Π) = (worst-case) # memory bits used throughout the network

MaxMem(Π) = maximum local memory

Message complexity

Basic message = O(log n) bits

Longer messages cost proportionally to length

Sending basic message over edge costs 1

Message(Π) = (worst case) # basic messages sent during execution

Distance definitions

Length of path (e1,...,es) = s

dist(u,w,G) = length of shortest u - w path in G

Diameter:

Diam(G) = max_{u,v∈V} {dist(u,v,G)}

Distance definitions (cont)

Radius:

Rad(v,G) = max_{w∈V} {dist(v,w,G)}

Rad(G) = min_{v∈V} {Rad(v,G)}

A center of G: a vertex v s.t. Rad(v,G) = Rad(G)

Observe: Rad(G) ≤ Diam(G) ≤ 2·Rad(G)

Broadcast

Goal:Disseminate message M originated at source r0 to all vertices in network


Basic lower bounds

Thm: For every broadcast algorithm B:

• Message(B) ≥ n-1,

• Time(B) ≥ Rad(r0,G) = Ω(Diam(G))

Tree broadcast

Algorithm Tcast(r0,T)

• Use spanning tree T of G rooted at r0

• Root broadcasts M to all its children
• Each node v getting M forwards it to its children

Tree broadcast (cont)

Assume: Spanning tree known to all nodes

(Q: what does it mean in distributed context?)

Tree broadcast (cont)

Claim: For spanning tree T rooted at r0:

• Message(Tcast) = n-1

• Time(Tcast) = Depth(T)

Tcast on BFS tree

BFS (Breadth-First Search) tree = Shortest-paths tree:

The level of each v in T is dist(r0,v,G)

Tcast (cont)

Corollary: For a BFS tree T w.r.t. r0:

• Message(Tcast) = n-1

• Time(Tcast) ≤ Diam(G) (optimal in both)

But what if there is no spanning tree ?

The flooding algorithm

Algorithm Flood(r0)

1. Source sends M on each outgoing link

2. For every other vertex v:
• On receiving M for the first time, over edge e: store it in the buffer; forward it on every edge ≠ e
• On receiving M again (over other edges): discard it and do nothing
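A minimal centralized simulation sketch of Algorithm Flood in the synchronous model (the dict-of-lists graph format and the function name are illustrative assumptions, not from the slides); it records, for each vertex, the edge over which M first arrived, i.e., the implicit spanning tree discussed below.

```python
from collections import deque

def flood(adj, r0):
    """Simulate synchronous flooding of a message M from source r0.
    adj: dict mapping each vertex to the list of its neighbors.
    Returns (parent, level, messages): implicit spanning tree, arrival pulses, message count."""
    parent, level = {r0: None}, {r0: 0}
    queue = deque([r0])
    messages = 0
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w == parent[v]:
                continue                 # don't send M back over the edge it arrived on
            messages += 1                # one copy of M crosses the edge (v, w)
            if w not in parent:          # w receives M for the first time
                parent[w] = v
                level[w] = level[v] + 1
                queue.append(w)
            # otherwise w discards the duplicate copy
    return parent, level, messages

# Example: flood({1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}, 1) on a 4-cycle
# yields a BFS tree of depth 2 and Theta(|E|) messages.
```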

Flooding - correctness

Lemma:
1. Alg. Flood yields a correct broadcast
2. Time(Flood) = Θ(Rad(r0,G)) = Θ(Diam(G))

3. Message(Flood) = Θ(|E|) in both the synchronous and the asynchronous model

Proof: Message complexity: Each edge delivers M at most once in each direction

Neighborhoods

Γρ(v) = ρ-neighborhood of v = vertices at distance ρ or less from v

(Figure: nested neighborhoods Γ0(v), Γ1(v), Γ2(v))

Time complexity

Verify (by induction on t) that:

After t time units, M has already reached every vertex at distance ≤ t from r0

(= every vertex in the t-neighborhood Γt(r0))

Note: In asynchronous model, M may have reached additional vertices(messages may travel faster)

Time complexity

Note: Algorithm Flood implicitly constructs directed spanning tree T rooted at r0,

defined as follows:

The parent of each v in T is the node from which v received M for the first time

Lemma: In the synchronous model,T is a BFS tree w.r.t. r0, with depth Rad(r0,G)

Flood time

Note: In the asynchronous model, T may be deeper (up to depth n-1)

Note: Time is still O(Diam(G)) even in this case!

Broadcast with echo

Goal: Verify successful completion of broadcast

Method: Collect acknowledgements on a spanning tree T

Broadcast with echo

Converge(Ack) process - code for v

Upon getting M do:• For v leaf in T:

- Send up an Ack message to parent• For v non-leaf:

- Collect Ack messages from all children- Send Ack message to parent
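A small sketch of Tcast followed by the Converge(Ack) process on a rooted tree (the children-dict input and the function name are assumptions for illustration): each subtree reports a joint Ack once all of its vertices have received M.

```python
def broadcast_with_echo(children, root):
    """Simulate Tcast followed by Converge(Ack) on a rooted tree.
    children: dict mapping each node to the list of its children.
    Returns the number of broadcast and of Ack messages (each is n-1)."""
    down = up = 0
    def visit(v):
        nonlocal down, up
        for w in children.get(v, []):
            down += 1          # v forwards M to its child w
            visit(w)           # w's subtree completes its broadcast ...
            up += 1            # ... and w sends a joint Ack for its subtree up to v
        # a leaf sends its Ack immediately after receiving M
    visit(root)
    return down, up

# broadcast_with_echo({'r': ['a', 'b'], 'a': ['c']}, 'r') == (3, 3)
```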

Collecting Ack’s

Semantics of Ack from v: a “joint ack” for the entire subtree Tv rooted at v, signifying that each vertex in Tv received M

⇒ r0 receives Ack from all its children only after all vertices received M

Claim: On tree T,

• Message(Converge(Ack)) = O(n)

• Time(Converge(Ack))=O(Depth(T))

Tree selection

Tree broadcast alg: Take the same tree used for the broadcast. Time / message complexities grow by a constant factor.

Flooding alg: Use the tree T defined by the broadcast.
Synch. model: BFS tree - complexities double
Asynch. model: no guarantee

Tree selection - complexity

Lemma: In a network G(V,E) of diameter D, the complexities of “broadcast with echo” are:

• Message(FloodEcho) = O(|E|)
• Time(FloodEcho) = O(D) in the synchronous model, O(n) in the asynchronous model.

• In both models, M reaches all vertices by time D

BFS tree constructions

In the synchronous model:

Algorithm Flood generates a BFS tree with optimal complexities:

• Message(Flood) = Θ(|E|)

• Time(Flood) = Θ(Diam(G))

In the asynchronous model: the tree generated by Algorithm Flood is not BFS

Level-synchronized BFS construction (Dijkstra)

Idea:
• Develop the BFS tree from root r0 in phases, level by level
• Build the next level by adding all vertices adjacent to nodes in the lowest tree level

After p phases: a constructed partial tree Tp

• The tree Tp is a BFS tree for Γp(r0)

• Each v in Tp knows its parent, children, depth

Level-synchronized BFS (Dijkstra)


Phase p+1:
1. r0 broadcasts the message Pulse on Tp

2. Each leaf of Tp sends an “exploration” message Layer to all neighbors except its parent.

Level-synchronized BFS (Dijkstra)

3. Vertex w receiving Layer message for the first time (possibly from many neighbors) picks one neighbor v, lists it as parent, sends back Ack messages to all Layer messages

Vertex w in Tp receiving Layer message sends back Ack messages to all Layer messages

Level-synchronized BFS (Dijkstra)

4. Each leaf v collects Acks on its exploration msgs. If w chose v as parent, v lists w as a child

5. Once it receives an Ack on all its Layer messages, leaf v Acks its parent. Acks are convergecast on Tp back to r0.

6. Once the convergecast terminates, r0 starts the next phase
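A centralized sketch of the layer-by-layer growth (phase p adds layer p+1); it abstracts away the Pulse / Layer / Ack exchanges and only records the parent choices, assuming a dict-of-lists graph representation.

```python
def layered_bfs(adj, r0):
    """Level-synchronized (Dijkstra-style) BFS construction: grow the tree
    from r0 one layer per phase; returns parent pointers and layer numbers."""
    parent, layer = {r0: None}, {r0: 0}
    leaves = [r0]                        # leaves of the current partial tree T_p
    p = 0
    while leaves:
        new_leaves = []
        for v in leaves:                 # leaves send Layer messages to their neighbors
            for w in adj[v]:
                if w not in parent:      # w joins the tree, listing v as its parent
                    parent[w] = v
                    layer[w] = p + 1
                    new_leaves.append(w)
        leaves = new_leaves              # convergecast done; r0 starts the next phase
        p += 1
    return parent, layer
```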

Analysis

Correctness: By induction on p, show:

• After phase p, the variables parent and child define a legal BFS tree spanning r0's p-neighborhood

⇒ The algorithm constructs a BFS tree rooted at r0.

Analysis (cont)

Time complexity:

Time(Phase p) = 2p+2

Time = ∑p (2p+2) = O(Diam²(G))

Analysis (cont)

Message complexity: For integer p ≥ 0 let

Vp = vertices in layer p
Ep = edges internal to Vp

Ep,p+1 = edges between Vp and Vp+1

Analysis (cont)

Phase p: Layer msgs of phase p are sent only on Ep and Ep,p+1

• Only O(1) messages are sent over each edge
• Tp edges are traversed twice (≤ 2n messages)

Analysis (cont)

Comm(Phase p) = O(n) + O(|Ep| + |Ep,p+1|)

In total: Comm = ∑p O(n + |Ep| + |Ep,p+1|) = O(n·Diam(G) + |E|)

Complexities of BFS algorithms

Reference       Messages          Time
Lower bound     E                 D   (also in the sync. model)
Dijkstra        E + n·D           D²
Bellman-Ford    n·E               D
Best known      E + n·log³n       D·log³n

Synchronizers

Goal: Transform an algorithm for synchronous networks into an algorithm for asynchronous networks.

Motivation: Algorithms for the synchronous model are easier to design / debug / test than ones for the asynchronous model

(The behavior of an asynchronous system is harder to analyze)

Synchronizers

Synchronizer: a methodology for such a simulation:

Given an algorithm S for a synchronous network and a synchronizer ν, combine them to yield a protocol A = ν(S) executable on an asynchronous network

Correctness requirement: A's execution on the asynchronous network is “similar” to S's execution on the synchronous one

Underlying simulation principles

The combined protocol A is composed of two parts:
• original component
• synchronization component
(each with its own local variables and message types)

Pulse generator: Processor v has a pulse variable Pv, generating a sequence of local clock pulses, i.e., periodically increasing Pv = 0,1,2,...

Underlying simulation principles

Under protocol A, each v performs during time interval when Pv=p precisely the actions it should perform during round p of the synchronous algorithm S

Def: t(v,p) = global time when v increased its pulse to p.

We say that “v is at pulse Pv=p” during the time interval [t(v,p), t(v,p+1))

Underlying simulation principles

Pulse compatibility:

If processor v sends original message M to neighbor w during its pulse Pv=p

then w receives M during its pulse Pw=p

Correct simulations

Synchronous protocol S

Simulating protocol A = ν(S)

Execution of S in the synchronous network on (G,I)

Execution of A in the asynchronous network on (G,I)

(same topology G, same input I)

Correct simulations (cont)

Similar executions: The executions of A and S are similar if for every v, for every neighbor w, for every original local variable X at v, and for every integer p ≥ 0:

1. X value at beginning of pulse p in A = X value at beginning of round p in S

Correct simulations (cont)

2. Original messages sent by v to w during pulse p in execution A - same as those sent by v to w during round p in execution S

3. Original messages received by v from w during pulse p in A -same as those received by v from w during round p in S

4. Final output of v in A – same as in S

Correct simulations (cont)

Correct simulation: Asynchronous protocol A simulates synchronous protocol S if for every network topology and initial input, the executions of A and S are similar

Synchronizer ν is correct if for every synchronous protocol S, the protocol A = ν(S) simulates S

Correct simulations (cont)

Lemma: If synchronizer ν guarantees pulse compatibility then it is correct

Goal: Impose pulse compatibility

Correct simulations (cont)

Fundamental question:

When is it permissible for a processor to increase its pulse number?

Correct simulations (cont)

First answer: Increase pulse from p to p+1 once it is certain that original messages of algorithm S sent by neighbors during their pulse p will no longer arrive

Question:How can that be ensured?

Correct simulations (cont)

Readiness property: Processor v is ready for pulse p, denoted Ready(v,p), once it has already received all algorithm messages sent to it by its neighbors during their pulse p-1.

Readiness rule: Processor v may generate pulse p once it has finished its original actions for pulse p-1, and Ready(v,p) holds.

Correct simulations (cont)

Problem: Obeying the readiness rule does not impose pulse compatibility

(Bad scenario: v is ready for pulse p, generates pulse p, and sends a msg of pulse p to neighbor w, yet w is still “stuck” at pulse p-1, waiting for msgs of pulse p-1 from some other neighbor z)

Correct simulations (cont)

Fix: Delay messages that arrived too early

Delay rule: On receiving in pulse p-1 a msg sent from w at its pulse p, temporarily store it;

process it only after generating pulse p

Correct simulations (cont)

Lemma: A synchronizer imposing both readiness and delay rules guarantees pulse compatibility

Corollary: If synchronizer imposes the readiness and delay rules, then it is correct

Implementation phases

Problem: To satisfy Ready(v,p), v must ensure that it already received all algorithm messages sent to it by its neighbors in pulse p-1

If w did not send any message to v in pulse p-1, then v may end up waiting forever

(link delays in an asynchronous network are unpredictable...)

Implementation phases

Conceptual solution: Employ two communication phases

Phase A:
1. Each processor sends its original messages
2. A processor receiving a message from a neighbor sends back an Ack

⇒ Each processor learns (within finite time) that all messages it sent during pulse p have arrived

Implementation phases

Safety property: Node v is safe w.r.t. pulse p, denoted Safe(v,p), if all messages it sent during pulse p have already arrived.

Fact: If each neighbor w of v satisfies Safe(w,p), then v satisfies Ready(v,p+1)

⇒ A node may generate a new pulse once it learns that all its neighbors are safe w.r.t. the current pulse.

Implementation phases

Phase B: Apply a procedure to let each processor know when all its neighbors are safe w.r.t. pulse p

Synchronizer constructions:
• based on the 2-phase strategy
• all use the same Phase A procedure
• but different Phase B procedures

Synchronizer complexity

Initialization costs: Tinit(ν) and Cinit(ν) = time and message costs of the initialization procedure setting up synchronizer ν

Pulse overheads: Cpulse(ν) = cost of the synchronization messages sent by all vertices during their pulse p

Tpulse(ν) = ?

Synchronizer complexity

Tpulse(ν) = ?

(The time periods during which different nodes are at pulse p may overlap only partially…)

Synchronizer complexity

Let tmax(p) = max_{v∈V} {t(v,p)} (the time when the slowest processor reached pulse p)

Tpulse(ν) = max_{p≥0} {tmax(p+1) - tmax(p)}

Synchronizer complexity

Lemma: For a synchronous algorithm S and asynchronous A = ν(S),

• Comm(A) = Cinit(ν) + Comm(S) + Time(S) · Cpulse(ν),

• Time(A) = Tinit(ν) + Time(S) · Tpulse(ν)

Basic synchronizer α

Phase B of synchronizer α: direct.

After executing pulse p, when processor v learns it is safe, it reports this fact to all its neighbors.

Claim: Synchronizer α is correct.

Basic synchronizer α

Claim:
• Cinit(α) = O(|E|)

• Tinit(α) = O(Diam)

• Cpulse(α) = O(|E|)

• Tpulse(α) = O(1)

Note: Synchronizer α is optimal for trees, planar graphs and bounded-degree networks (mesh, butterfly, cube-connected cycles, ring, …)

Basic synchronizer β

Assume: a rooted spanning tree T in G

Phase B of β: a convergecast process on T

Basic synchronizer β

• When processor v learns all its descendants in T are safe, it reports this fact to its parent.

• When r0 learns all processors in G are safe, it broadcasts this along the tree.

Convergecast ends ⇒ all nodes are safe

Basic synchronizer β

Claim: Synchronizer β is correct.

Claim:
• Cinit(β) = O(n|E|)

• Tinit(β) = O(Diam)

• Cpulse(β) = O(n)

• Tpulse(β) = O(Diam)

Note: Synchronizer β is optimal for bounded-diameter networks.

Understanding the effects of locality

Model:
• synchronous
• simultaneous wakeup
• large messages allowed

Goal: Focus on the limitations stemming from locality of knowledge

Symmetry breaking algorithms

Vertex coloring problem: Associate a color φv with each v in V, s.t. any two adjacent vertices have different colors

Naive solution: Use unique vertex IDs ⇒ a legal coloring with n colors

Goal: obtain coloring with few colors

Symmetry breaking algorithms

Basic palette reduction procedure: Given a legal coloring with m colors, reduce the # of colors

Δ(G) = max vertex degree in G

Reduction idea: v's neighbors occupy at most Δ distinct colors

⇒ Δ+1 colors always suffice to find a “free” color

Symmetry breaking algorithms

First Free coloring (for a palette of colors Ψ and node set W ⊆ V)

FirstFree(W,Ψ) = the minimal color in Ψ that is currently not used by any vertex in W

Standard palette: Ψm = {1,...,m}, for m ≥ 1

Sequential color reduction

For every node v do (sequentially):

φv ← FirstFree(Γ(v), ΨΔ+1)

/* Pick a new color 1 ≤ c ≤ Δ+1, different from those used by the neighboring nodes */

Procedure Reduce(m) - example

(Figure: example re-coloring with palette Ψ3 = {1,2,3})

Procedure Reduce(m) - parallelization

Code for v:

For round j = Δ+2 to m do:
/* all nodes colored j re-color themselves simultaneously */

• If v's original color is φv = j then do:

1. Set φv ← FirstFree(Γ(v), ΨΔ+1)
/* pick a new color 1 ≤ c ≤ Δ+1, different from those used by the neighbors */

2. Inform all neighbors
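A centralized sketch of Procedure Reduce (graph and coloring passed as dicts; all names are assumptions): in round j every node of color j picks a free color from the (Δ+1)-palette.

```python
def reduce_colors(adj, color):
    """Palette reduction: given a legal coloring with colors 1..m, return a legal
    coloring with at most Delta+1 colors, one cancelled color class per round."""
    delta = max(len(adj[v]) for v in adj)        # max degree Delta
    m = max(color.values())
    for j in range(delta + 2, m + 1):            # rounds j = Delta+2 .. m
        for v in [u for u in adj if color[u] == j]:
            used = {color[w] for w in adj[v]}    # colors currently held by neighbors
            color[v] = min(c for c in range(1, delta + 2) if c not in used)
    return color
```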

Procedure Reduce(m) - analysis

Lemma:
• Procedure Reduce produces a legal coloring of G with Δ+1 colors
• Time(Reduce(m)) = m - (Δ+1)

Proof: Time bound: each iteration requires one time unit.

Procedure Reduce(m) - analysis (cont)

Correctness: Consider iteration j.
• When node v re-colors itself, it always finds a non-conflicting color (≤ Δ neighbors, and a (Δ+1)-color palette)

• No conflict with nodes recolored in earlier iterations (or originally colored 1, 2, …, Δ+1).

• No conflict with the choices of other nodes in iteration j (they are all mutually nonadjacent, by legality of the original coloring)

⇒ The new coloring is legal

3-coloring trees

Goal: color a tree T with 3 colors in time O(log*n)

Recall:
log^(1) n = log n
log^(i+1) n = log(log^(i) n)
log* n = min { i | log^(i) n ≤ 2 }

General idea:
• Look at colors as bit strings.
• Attempt to reduce the # of bits used for colors.
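For concreteness, a tiny helper computing log* according to the definition above (the function name is an assumption):

```python
import math

def log_star(n):
    """log* n = min number of times log2 must be applied before the value is <= 2."""
    i = 0
    while n > 2:
        n = math.log2(n)
        i += 1
    return i

# log_star(16) == 2, log_star(65536) == 3: log* grows extremely slowly.
```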

3-coloring trees

|φv| = # bits in φv

φv[i] = the i-th bit in the bit string representing φv

Specific idea: Produce a new color from the old φv:
1. Find an index 0 ≤ i < |φv| in which v's color differs from its parent's (the root picks, say, index 0).

2. Set the new color to ⟨i, φv[i]⟩
/* the index i concatenated with the bit φv[i] */

3-coloring trees

We will show:
a. neighbors have different new colors
b. the length of the new coloring is roughly logarithmic in that of the previous coloring

(Figure: an old coloring of a rooted tree)

3-coloring trees (cont)

Algorithm SixColor(T) - code for v

Let φv ← ID(v) /* initial coloring */

Repeat:
• ℓ ← |φv|
• If v is the root then set I ← 0, else set I ← min{ i | φv[i] ≠ φparent(v)[i] }
• Set φv ← ⟨I, φv[I]⟩
• Inform all children of this choice

until |φv| stops shrinking
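A centralized sketch of Algorithm SixColor (Cole-Vishkin style color reduction); colors are kept as integers, and the parent-dict input, distinct-ID assumption and function name are illustrative.

```python
def six_color_tree(parent, ids, bits):
    """Iteratively shrink the colors of a rooted tree from K-bit IDs down to at
    most 6 colors.  parent: child -> parent (root maps to None); ids: node ->
    distinct integer ID; bits: number of bits K in the IDs."""
    color = {v: ids[v] for v in parent}          # initial coloring = the IDs
    length = bits
    while True:
        new_length = max(1, (length - 1).bit_length()) + 1   # ceil(log2 length) + 1
        new_color = {}
        for v in parent:
            if parent[v] is None:
                i = 0                                        # the root picks index 0
            else:
                diff = color[v] ^ color[parent[v]]
                i = (diff & -diff).bit_length() - 1          # lowest differing bit index
            new_color[v] = (i << 1) | ((color[v] >> i) & 1)  # <index, bit> packed
        color = new_color
        if new_length >= length:                  # color length stopped shrinking
            return color                          # final indices < 3 => at most 6 colors
        length = new_length
```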

3-coloring trees (cont)

Lemma: In each iteration, Procedure SixColor produces a legal coloring

Proof: Consider iteration i and neighboring nodes v,w ∈ T with v = parent(w).

I = index picked by v; J = index picked by w

3-coloring trees (cont)

If I ≠ J: the new colors of v and w differ in the 1st component

If I = J: the new colors differ in the 2nd component

(Figure: the two cases, e.g., I=1, J=2 vs. I=J=2)

3-coloring trees (cont)

Ki = # bits in the color representation after the i-th iteration.

(K0 = K = O(log n) = # bits in the original ID coloring.)

Note: K_{i+1} = ⌈log Ki⌉ + 1

⇒ The 2nd coloring uses about log^(2) n bits, the 3rd about log^(3) n, etc.

3-coloring trees (cont)

Lemma: The final coloring uses six colors

Proof: The final iteration i satisfies Ki = K_{i-1} ≤ 3

In the final coloring, there are ≤ 3 choices for the index to the bit in the (i-1)-st coloring, and two choices for the value of the bit

⇒ A total of six possible colors

Reducing from 6 to 3 colors

Shift-down operation: Given a legal coloring of T:
1. re-color each non-root vertex by the color of its parent
2. re-color the root by a new color (different from its current one)

Reducing from 6 to 3 colors

Claim:
1. The shift-down step preserves coloring legality
2. In the new coloring, siblings are monochromatic

Reducing from 6 to 3 colors

Cancelling color x, for x ∈ {4,5,6}:

1. Perform the shift-down operation on the current coloring,

2. All nodes colored x apply FirstFree(Γ(v), Ψ3)
/* choose a new color from among {1,2,3} not used by any neighbor */

Reducing from 6 to 3 colors

(Figure: shift-down followed by FirstFree; example: cancelling color 4)

Claim: The rule for cancelling color x produces a legal coloring

Overall 3 coloring process

Thm: There is a deterministic distributed algorithm for 3-coloring trees in time O(log*n)

1. Invoke Algorithm SixColor(T) (O(log*n) time)
2. Cancel colors 6, 5, 4 (O(1) time)
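A centralized sketch of the 6-to-3 reduction (the parent/children dicts and the function name are assumptions): each pass performs a shift-down and then cancels one of the colors 6, 5, 4.

```python
def cancel_colors(parent, children, color):
    """Reduce a legal 6-coloring of a rooted tree to a legal 3-coloring."""
    for x in (6, 5, 4):
        old = dict(color)
        for v in parent:                          # shift-down step
            if parent[v] is None:
                color[v] = min(c for c in (1, 2, 3) if c != old[v])  # root: fresh color
            else:
                color[v] = old[parent[v]]         # non-root: take the parent's old color
        for v in [u for u in parent if color[u] == x]:               # cancel color x
            nbrs = ([parent[v]] if parent[v] is not None else []) + children.get(v, [])
            used = {color[w] for w in nbrs}       # at most 2 distinct colors (siblings agree)
            color[v] = min(c for c in (1, 2, 3) if c not in used)
    return color
```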

Δ+1-coloring for arbitrary graphs

Goal: Color a graph G of max degree Δ with Δ+1 colors in O(Δ log n) time

Node IDs in G = K-bit strings

Idea: A recursive procedure ReColor(x), where x = a binary string of ≤ K bits.

Ux = { v | ID(v) has suffix x } (so |Ux| ≤ 2^{K-|x|})

The procedure is applied to Ux, and returns with a coloring of the Ux vertices with Δ+1 colors.

Δ+1-coloring for arbitrary graphs

Procedure ReColor(x) - intuition

If |x| = K (Ux has at most one node) then return color 0.
Otherwise:
1. Separate Ux into two sets U0x and U1x

2. Recursively compute a Δ+1 coloring for each, invoking ReColor(0x) and ReColor(1x)

3. Remove conflicts between the two colorings by altering the colors of the U1x vertices, color by color, as in Procedure Reduce.

ReColor – distributed implementation

Procedure ReColor(x) – code for v ∈ Ux

/* ID(v) = a1a2...aK , x = a_{K-|x|+1}...aK */

• Set ℓ ← |x|
• If ℓ = K /* singleton Ux = {v} */ then set φv ← 0 and return

• Set b ← a_{K-ℓ} /* v ∈ Ubx */

• φv ← ReColor(bx)

Procedure ReColor(x) - code for v (cont)

/* Reconciling the colorings on U0x and U1x */

• If b = 1 then do:
  • For round i = 1 through Δ+1 do:
    • If φv = i then do:
      • φv ← FirstFree(Γ(v), ΨΔ+1)
        (pick a new color 1 ≤ c ≤ Δ+1, different from those used by any neighbor)
      • Inform all neighbors of this choice
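A centralized sketch of ReColor (recursion on ID suffixes; the bit conventions and names are illustrative assumptions). The FirstFree step is restricted to neighbors inside Ux, matching the sub-claim below that each call colors the induced subgraph G(Ux) legally.

```python
def recolor(adj, ids, K):
    """Delta+1 coloring of G by recursion on ID suffixes.
    adj: adjacency dict; ids: node -> K-bit ID.  Returns colors in {0,...,Delta}."""
    delta = max(len(adj[v]) for v in adj)
    color = {}

    def rec(x, length):                      # color U_x = nodes whose ID ends with x
        U = [v for v in adj if (ids[v] & ((1 << length) - 1)) == x]
        if not U:
            return
        if length == K:                      # U_x is a singleton
            for v in U:
                color[v] = 0
            return
        rec(x, length + 1)                   # color U_{0x}
        rec(x | (1 << length), length + 1)   # color U_{1x}
        Uset = set(U)
        ones = [v for v in U if (ids[v] >> length) & 1]      # U_{1x} re-colors
        orig = {v: color[v] for v in ones}
        for i in range(delta + 1):           # one color class per round
            for v in [u for u in ones if orig[u] == i]:
                used = {color[w] for w in adj[v] if w in Uset}
                color[v] = min(c for c in range(delta + 1) if c not in used)

    rec(0, 0)
    return color
```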

Analysis

Lemma: For λ = the empty word:
• Procedure ReColor(λ) produces a legal coloring of G with Δ+1 colors
• Time(ReColor(λ)) = O(Δ log n)

Analysis

Proof: Sub-claim: ReColor(x) yields a legal Δ+1-coloring for the vertices of the subgraph G(Ux) induced by Ux

Proof: By induction on the length of the parameter x.
Base (|x| = K): immediate.
General case: Consider a run of ReColor(x).

Note: Coloring assigned to U0x is legal (by Ind. Hyp.), and does not change later.

Analysis (cont)

Consider v in U1x recoloring itself in some iteration i via the FirstFree operation.

Note: v always finds a non-conflicting color:
• No conflict with nodes of U1x recolored in earlier iterations, or with nodes of U0x

• No conflict with other nodes that recolor in iteration i (mutually non-adjacent, by legality of the coloring assigned by ReColor(1x) to the set U1x)

⇒ the new coloring is legal

Analysis (cont)

Time bound: Each of the K = O(log n) recursion levels requires Δ+1 time units

⇒ O(Δ log n) time

Lower bound for 3-coloring the ring

Lower bound: Any deterministic distributed algorithm for 3-coloring n-node rings requires at least (log*n - 1)/2 time.

Applies in a strong model: after t time units, v knows everything known to anyone in its t-neighborhood.

In particular, given no inputs but vertex IDs: after t steps, node v learns the topology of its t-neighborhood Γt(v) (including IDs)

Lower bound for 3-coloring the ring

On a ring, v has learned a (2t+1)-tuple (x1,...,x_{2t+1}) from the space W_{2t+1,n}, where
W_{s,n} = {(x1,...,xs) | 1 ≤ xi ≤ n, xi ≠ xj for i ≠ j},

• x_{t+1} = ID(v),

• x_t and x_{t+2} = the IDs of v’s two neighbors,

• etc.

Coloring lower bound (cont)

W.l.o.g., any deterministic t(n)-step algorithm A_t for coloring a ring with cmax colors follows a 2-phase policy:

• Phase 1: For t rounds, exchange topology info. At the end, each v holds a tuple τ(v) ∈ W_{2t+1,n}

• Phase 2: Select φv ← fA(τ(v)), where

fA : W_{2t+1,n} → {1,...,cmax} is the coloring function of algorithm A

Coloring lower bound (cont)

Define a graph B_{s,n} = (W_{s,n}, E_{s,n}), where E_{s,n} contains all edges of the form

{ (x1,x2,...,xs) , (x2,...,xs,x_{s+1}) }

satisfying x1 ≠ x_{s+1}

Coloring lower bound (cont)

Note: Two s-tuples (x1,x2,...,xs) and (x2,...,xs,x_{s+1}) of W_{s,n} are connected in B_{s,n}

⇔ they may occur as the tuples corresponding to two neighboring nodes in some ID assignment for the ring.

Coloring lower bound (cont)

Lemma: If algorithm A_t produces a legal coloring for every n-node ring, then the function fA defines a legal coloring for the graph B_{2t+1,n}

Proof: Suppose fA is not a legal coloring for B_{2t+1,n}, i.e., there exist two neighboring vertices σ = (x1,x2,...,x_{2t+1}) and σ' = (x2,...,x_{2t+1},x_{2t+2}) in B_{2t+1,n} s.t.

fA(σ) = fA(σ')

Coloring lower bound (cont)

Consider n-node ring with the following ID assignments:

Coloring lower bound (cont)

Then algorithm A colors the neighboring nodes v and w by the colors fA(σ) and fA(σ') respectively. These colors are identical, so the ring coloring is illegal; contradiction

Coloring lower bound (cont)

Corollary: If the n-vertex ring can be colored in t rounds using cmax colors, then χ(B_{2t+1,n}) ≤ cmax

Thm: Any deterministic distributed algorithm for coloring the (2n)-vertex ring with two colors requires at least n-1 rounds.

Coloring lower bound (cont)

Proof: By the Corollary, if there is a 2-coloring algorithm working in t time units, then χ(B_{2t+1,2n}) ≤ 2 (i.e., B_{2t+1,2n} is 2-colorable), hence B_{2t+1,2n} is bipartite.

But for t ≤ n-2 this leads to a contradiction, since B_{2t+1,2n} contains an odd-length cycle, hence it is not bipartite.

Coloring lower bound (cont)

The odd cycle:

(1,2,…,2t+1) → (2,…,2t+2) → (3,…,2t+3) → (4,…,2t+3,1) → (5,…,2t+3,1,2) → … → (2t+3,1,2,…,2t) → (1,2,…,2t+1)

(a cycle on 2t+3 tuples, i.e., of odd length)

Coloring lower bound (cont)

Def: A family of directed graphs D_{s,n}, with
vertex set {(x1,...,xs) | 1 ≤ x1 < ... < xs ≤ n }, and
arc set = all (directed) arcs from (x1,x2,...,xs) to (x2,...,xs,x_{s+1})

Returning to 3-coloring, we prove the following:

Lemma: χ(B_{2t+1,n}) ≥ log^(2t) n

Coloring lower bound (cont)

Claim: χ(D_{s,n}) ≤ χ(B_{s,n})

Proof: The undirected version of D_{s,n} is a subgraph of B_{s,n}

To prove the lemma, i.e., bound χ(B_{2t+1,n}), it suffices to show that χ(D_{2t+1,n}) ≥ log^(2t) n

Coloring lower bound (cont)

A recursive representation for the directed graphs D_{s,n}, based on directed line graphs:

Def: For a directed graph H = (U,F), the line graph of H, ℒ(H), is a directed graph with
V(ℒ(H)) = F, and
E(ℒ(H)) contains an arc (e,e') (for e,e' ∈ F) iff in H, e' starts at the vertex in which e ends

(Figure: an arc pair e, e' in digraph H and the corresponding arc in ℒ(H))

Coloring lower bound (cont)

Lemma:
1. D_{1,n} = the complete directed graph on n nodes (with every two vertices connected by one arc in each direction)

2. D_{s+1,n} = ℒ(D_{s,n})

Proof: Claim 1 is immediate from the definition.

Coloring lower bound (cont)

Claim 2: Establish an appropriate isomorphism between D_{s+1,n} and ℒ(D_{s,n}) as follows. Consider

e = ⟨(x1,...,xs) , (x2,...,x_{s+1})⟩
e = an arc of D_{s,n} = a node of ℒ(D_{s,n})

Map e to the node (x1,...,xs,x_{s+1}) of D_{s+1,n}

It is straightforward to verify that this mapping preserves the adjacency relation

Coloring lower bound (cont)

Lemma: For every directed graph H, χ(ℒ(H)) ≥ log χ(H)

Proof: Let k = χ(ℒ(H)). Consider a k-coloring φ of ℒ(H).

φ = an arc coloring for H, s.t. if e' starts at the vertex in which e ends, then φ(e') ≠ φ(e).

⇒ The coloring φ can be used to create a 2^k-coloring Ψ for H, by setting the color of node v to be the set Ψv = { φ(e) | e ends in v }

Coloring lower bound (cont)

Note: Ψ uses ≤ 2^k colors. Ψ is legal

⇒ χ(H) ≤ 2^k, proving the lemma.

Coloring lower bound (cont)

Corollary: χ(D_{s,n}) ≥ log^(s-1) n

Proof: Immediate from the last two lemmas:
(1) D_{1,n} = the complete directed n-node graph, and D_{s+1,n} = ℒ(D_{s,n})
(2) χ(ℒ(H)) ≥ log χ(H)

Corollary: χ(D_{2t+1,n}) ≥ log^(2t) n
Corollary: χ(B_{2t+1,n}) ≥ log^(2t) n

Coloring lower bound (cont)

Thm: Any deterministic distributed algorithm for coloring n-vertex rings with 3 colors requires time t ≥ (log*n - 1)/2

Proof: If A is such an algorithm and it requires t rounds, then log^(2t) n ≤ χ(B_{2t+1,n}) ≤ 3,

⇒ log^(2t+1) n ≤ 2

⇒ 2t+1 ≥ log* n

Distributed Maximal Independent Set

Goal: Select MIS in graph G

Independent set: U ⊆ V s.t.

u,w ∈ U ⇒ u,w are non-adjacent

Maximal IS: Adding any vertex violates independence

Distributed Maximal Independent Set

Note: Maximal IS ≠ Maximum IS

(Figure: a maximal IS; a non-maximal, non-maximum IS; a maximum IS)

Distributed Maximal Independent Set

Sequential greedy MIS construction

Set U ← V, M ← ∅

While U ≠ ∅ do:
• Pick an arbitrary v in U
• Set U ← U - Γ(v)
• Set M ← M ∪ {v}

Distributed Maximal Independent Set

Note:
1. M is independent throughout the process
2. once U is exhausted, M forms an MIS

Complexity: O(|E|) time

Distributed implementation

Distributedly marking an MIS: Set a local boolean variable xv at each v:

v ∈ MIS ⇒ xv = 1

v ∉ MIS ⇒ xv = 0

Distributed implementation

Algorithm MIS-DFS
• A single token traverses G in depth-first order, marking vertices as in / out of the MIS.
• On reaching an unmarked vertex:
  1. add it to the MIS (by setting xv to 1),
  2. mark its neighbors as excluded from the MIS

Complexity:
• Message(MIS-DFS) = O(|E|)
• Time(MIS-DFS) = O(n)

Lexicographically smallest MIS

LexMIS: the lexicographically smallest MIS over V = {1,…,n}

({1,3,5,9} < {1,3,7,9})

Note: It is possible to construct LexMIS by a simple sequential (non-distributed) procedure: go over the node list 1,2,…:
- add v to the MIS,
- erase its neighbors from the list
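A minimal sketch of the sequential LexMIS procedure just described (the dict-of-lists input and the function name are assumptions):

```python
def lex_mis(adj, n):
    """Lexicographically smallest MIS over V = {1,...,n}: scan the nodes in
    increasing order; add a node if it is still on the list, then erase it and
    its neighbors from the list."""
    mis, alive = set(), set(range(1, n + 1))
    for v in range(1, n + 1):
        if v in alive:
            mis.add(v)
            alive -= set(adj[v]) | {v}
    return mis
```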

Distributed LexMIS computation

Algorithm MIS-Rank - code for v
• Invoke Procedure Join
• On getting msg Decided(1) from a neighbor w do:
  - Set xv ← 0
  - Send Decided(0) to all neighbors

• On getting msg Decided(0) from a neighbor w do:
  - Invoke Procedure Join

Distributed LexMIS computation

Procedure Join – code for v
• If every neighbor w of v with a larger ID has decided xw = 0 then do:
  - Set xv ← 1
  - Send Decided(1) to all neighbors

Complexity – Distributed LexMIS

Claim:
• Message(MIS-Rank) = O(|E|)
• Time(MIS-Rank) = O(n)

Note: Worst case complexities no better than naive sequential procedure

Reducing coloring to MIS

Procedure ColorToMIS(m) - code for v

For round i = 1 through m do:
- If v's original color is φv = i then do:

  • If none of v's neighbors has joined the MIS yet, then: decide xv ← 1 (join the MIS) and inform all neighbors

  • Else decide xv ← 0
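A centralized sketch of Procedure ColorToMIS (names and dict representation assumed): color classes are processed round by round, and within a round the candidates are mutually non-adjacent.

```python
def color_to_mis(adj, color, m):
    """Turn a legal m-coloring into a maximal independent set in m rounds."""
    mis = set()
    for i in range(1, m + 1):                          # round i: color class i decides
        for v in [u for u in adj if color[u] == i]:
            if not any(w in mis for w in adj[v]):      # no neighbor has joined so far
                mis.add(v)                             # join the MIS
    return mis
```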

Analysis

Lemma: Procedure ColorToMIS constructs an MIS for G in time m

Proof: Independence:
• A node v that joins the MIS in iteration i is not adjacent to any w that joined the MIS earlier.
• It is also not adjacent to any w trying to join in the current iteration (since they belong to the same color class)

Analysis

Maximality: By contradiction. For the set M marked by the procedure, suppose there is a node v ∉ M s.t. M ∪ {v} is independent. Suppose φv = i. Then in iteration i, the decision made by v was erroneous.

Analysis (cont)

Corollary: Given algorithm for coloring G with f(G) colors in time T(G), it is possible to construct MIS for G in time T(G)+f(G)

Corollary: There is a deterministic distributed MIS algorithm for trees / bounded-degree graphs with time O(log*n).

Analysis (cont)

Corollary: There is a deterministic distributed MIS algorithm for arbitrary graphs with time complexity O(Δ(G) log n).

Lower bound for MIS on rings

Fact: Given an MIS for the ring, it is possible to 3-color the ring in one round.

Proof: v ∈ MIS: takes color 1, sends “2” to its left neighbor.
w ∉ MIS: takes color 2 if it gets the msg “2”; otherwise takes color 3

Reducing coloring to MIS (cont)

Validity of the 3-coloring: MIS vertices are spaced 2 or 3 places apart around the ring

Corollary: Any deterministic distributed MIS algorithm for the n-vertex ring requires at least (log*n-3)/2 time.

Randomized distributed MIS algorithm

Doable in time O(log n)

“Store and forward” routing schemes

Routing scheme: a mechanism specifying for each pair u,v ∈ V a path in G connecting u to v

Routing labels: a labeling assignment

Labels = { labels assigned to G's vertices }

Headers = { allowable message headers }

“Store and forward” routing schemes

Data structures: Each v stores:

1. Initial header function Iv: Labels → Headers

2. Header function Hv: Headers → Headers

3. Port function Fv: Headers → [1..deg(v,G)]

Forwarding protocol

For u to send a message M to v:

1. Prepare the header h = Iu(v), attach it to M (typically it consists of the label of the destination v, plus some additional routing information)

2. Load M onto exit port i=Fu(h)

Forwarding protocol

A message M with header h' arriving at node w:

• Read h', check whether w is the final destination.

• If not:
  1. Prepare a new header by setting h = Hw(h'); replace the old header h' attached to M by h
  2. Compute the exit port by setting i = Fw(h)
  3. Load M onto port i

Routing schemes (cont)

For every pair u,v, the scheme RS specifies a route

ρ(RS,u,v) = (u=w1,w2,...,wj=v),

through which M travels from u to v.

|ρ(RS,u,v)| = route length

Partial routing schemes: Schemes specifying a route only for some vertex pairs in G

Performance measures

ω(e) = cost of using link e ~ estimated link delay for a message sent on e

Comm(RS,u,v) = cost of u→v routing by RS = weighted route length of ρ(RS,u,v)

Performance measures (cont)

Stretch factor: Given a routing scheme RS for G, we say RS stretches the path from u to v by

Dilation(RS,u,v) = Comm(RS,u,v) / dist(u,v)

Dilation(RS,G) = max_{u,v∈V} {Dilation(RS,u,v)}

Performance measures (cont)

Memory requirement: Mem(v,Iv,Hv,Fv) = # memory bits for storing the label and the functions Iv, Hv, Fv at v.

Total memory requirement of RS: Mem(RS) = ∑_{v∈V} Mem(v,Iv,Hv,Fv)

Maximal memory requirement of RS: MaxMem(RS) = max_{v∈V} Mem(v,Iv,Hv,Fv)

Routing strategies

Routing strategy: an algorithm computing a routing scheme RS for every G(V,E,ω).

A routing strategy has stretch factor k if for every G it produces a scheme RS with Dilation(RS,G) ≤ k.

Memory requirement of routing strategy (as function of n) =maximum (over all n-vertex G) memory requirement of routing schemes produced.

Routing strategies (cont)

Solution 1: Full tables routing (FTR)
The port function Fv stored at v specifies an entire table (one entry per destination u ≠ v) listing the exit port used for forwarding M to u.

(Figure: port function table for node 1)

FTR (cont)

Note: The pointers to a particular destination u form shortest path tree rooted at u

Optimal communication cost: Dilation(FTR,G) = 1

Disadvantage: Expensive for large systems (each v stores O(n log n) bit routing table)

FTR (cont)

Example: Unweighted ring

Consider the unit-cost n-vertex ring.

FTR strategy implementation:
• Label the vertices consecutively as 0,...,n-1
• Route from i to j along the shorter of the two ring segments (inferred from the labels i,j)

Stretch = 1 (optimal routes)
2·log n bits per vertex (each stores its own label and n)
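A sketch of the ring implementation (assuming the convention, not fixed by the slides, that port 0 is the clockwise link and port 1 the counter-clockwise one): the port function needs only the node's own label and n.

```python
def ring_port(i, j, n):
    """Exit port at node i for a message destined to node j on the unit-cost
    n-vertex ring labeled 0..n-1: follow the shorter of the two ring segments."""
    if i == j:
        return None                      # message already at its destination
    clockwise = (j - i) % n              # length of the clockwise segment i -> j
    return 0 if clockwise <= n - clockwise else 1
```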

Solution 2: Flooding

Origin broadcasts M throughout entire network.

Requires no routing tables (optimal memory)

Non-optimal communication (unbounded stretch)

FTR vs. Flooding: the two extreme endpoints of the communication-memory tradeoff

Part 2: Representations

1. Clustered representations
• Basic concepts: clusters, covers, partitions
• Sparse covers and partitions
• Decompositions and regional matchings

2. Skeletal representations

• Spanning trees and tree covers
• Sparse and light-weight spanners

Basic idea of locality-sensitive distributed computing

Utilize locality to both simplify control structures and algorithms and reduce their costs

Operation performed in large network may concern few processors in small region

(Global operation may have local sub-operations)

Reduce costs by utilizing “locality of reference”

Components of locality theory

• General framework, complexity measures and algorithmic methodology

• Suitable graph-theoretic structures and efficient construction methods

• Adaptation to wide variety of applications

Fundamental approach

Clustered representation:
• Impose a clustered hierarchical organization on an arbitrary given network
• Use it efficiently for bounding the complexity of distributed algorithms.

Skeletal representation:
• Sparsify the given network
• Execute applications on the remaining skeleton, reducing complexity

Clusters, covers and partitions

Cluster = a connected subset of vertices S ⊆ V.

Cover of G(V,E,ω) = a collection of clusters 𝒮 = {S1,...,Sm} containing all vertices of G

(i.e., s.t. ∪i Si = V).

Partitions

Partial partition of G = a collection of disjoint clusters 𝒮 = {S1,...,Sm}, i.e., s.t. S ∩ S' = ∅ for every S ≠ S'

Partition = cover + partial partition.

Evaluation criteria

Locality and Sparsity

Locality level: cluster radius

Sparsity level: vertex / cluster degrees

Evaluation criteria

Locality - sparsity tradeoff:

locality and sparsity parametersgo opposite ways:

better sparsity ⇔ worse locality (and vice versa)

Evaluation criteria

Locality measures

Weighted distances:

Length of path (e1,...,es) = ∑_{1≤i≤s} ω(ei)

dist(u,w,G) = (weighted) length of a shortest path

dist(U,W) = min{ dist(u,w) | u∈U, w∈W }

Evaluation criteria

Diameter, radius: as before, except weighted

For a collection of clusters 𝒮:
• Diam(𝒮) = maxi Diam(Si)

• Rad(𝒮) = maxi Rad(Si)

Sparsity measures

Cover sparsity measure - overlap:

deg(v,𝒮) = # occurrences of v in clusters S ∈ 𝒮, i.e., the degree of v in the hypergraph (V,𝒮)

Δ(𝒮) = maximum degree of the cover 𝒮

AvΔ(𝒮) = average degree of 𝒮 = ∑_{v∈V} deg(v,𝒮) / n = ∑_{S∈𝒮} |S| / n

(Figure: a vertex v with deg(v,𝒮) = 3)

Partition sparsity measure - adjacency

Intuition: “contract” the clusters into super-nodes and look at the resulting cluster graph of 𝒮,
G(𝒮) = (𝒮, E(𝒮)),
E(𝒮) = {(S,S') | S,S' ∈ 𝒮, G contains an edge (u,v) with u ∈ S and v ∈ S'}

(Edges of the cluster graph correspond to inter-cluster edges)

Example: A basic construction

Goal: produce a partition 𝒮 with:

1. clusters of radius ≤ k
2. few inter-cluster edges (i.e., low Avc(𝒮))

Algorithm BasicPart

The algorithm operates in iterations, each constructing one cluster

Example: A basic construction

At the end of an iteration:
- Add the resulting cluster S to the output collection 𝒮
- Discard it from V
- If V is not empty then start a new iteration

Iteration structure
• Arbitrarily pick a vertex v from V

• Grow a cluster S around v, adding layer by layer

• Vertices added to S are discarded from V

Iteration structure

• The layer merging process is carried out repeatedly until the required sparsity condition is reached:

- the next layer would increase the # of vertices by a factor of less than n^{1/k}

(i.e., |Γ(S)| < |S| · n^{1/k})
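A centralized sketch of Algorithm BasicPart (the dict-of-lists graph and the function name are assumptions): grow a cluster around an arbitrary remaining vertex, merging the next layer only while it multiplies the cluster size by at least n^{1/k}.

```python
def basic_part(adj, k):
    """Partition the vertex set into clusters of radius <= k-1 such that the
    number of inter-cluster edges is at most n^(1+1/k)."""
    n = len(adj)
    remaining = set(adj)
    clusters = []
    while remaining:
        v = next(iter(remaining))
        cluster = {v}                    # grow a cluster around an arbitrary vertex
        while True:
            layer = {w for u in cluster for w in adj[u] if w in remaining} - cluster
            if len(cluster) + len(layer) < n ** (1.0 / k) * len(cluster):
                break                    # sparsity condition reached: stop growing
            cluster |= layer             # merge the next layer
        clusters.append(cluster)
        remaining -= cluster             # discard the cluster's vertices from V
    return clusters
```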

Analysis

Thm: Given an n-vertex graph G(V,E) and an integer k ≥ 1, Alg. BasicPart creates a partition 𝒮 satisfying:
1) Rad(𝒮) ≤ k-1,
2) # inter-cluster edges in G(𝒮) ≤ n^{1+1/k}

(or, Avc(𝒮) ≤ n^{1/k})

Analysis

Proof:

Correctness:
• Every S added to 𝒮 is a (connected) cluster
• The generated clusters are disjoint (the alg' erases from V every v added to a cluster)
• 𝒮 is a partition (covers all vertices)

Analysis (cont)

Property (2): By the termination condition of the internal loop, the resulting S satisfies |Γ(S)| < n^{1/k} · |S|

⇒ (# inter-cluster edges touching S) < n^{1/k} · |S|

The number can only decrease in later iterations, if adjacent vertices get merged into the same cluster

⇒ |E(𝒮)| ≤ ∑_S n^{1/k} · |S| = n^{1+1/k}

Analysis (cont)

Property (1): Consider an iteration of the main loop.

Let J = # times the internal loop was executed.

Let Si = the S constructed in the i-th internal iteration

|Si| ≥ n^{(i-1)/k} for 2 ≤ i ≤ J (by induction on i)

⇒ J ≤ k (otherwise |S| > n)

Note: Rad(Si) ≤ i-1 for every 1 ≤ i ≤ J (S1 is composed of a single vertex, and each additional layer increases Rad(Si) by 1)

⇒ Rad(SJ) ≤ k-1

Synchronizers revisited

Goal: a synchronizer capturing reasonable middle points on the time-communication tradeoff scale

Synchronizer γ

Assumption: Given a low-degree partition 𝒮

For each cluster in 𝒮, build a rooted spanning tree.

In addition, between any two neighboring clusters designate a synchronization link.

Synchronizer γ

Handling safety information (in Phase B)

Step 1: For every cluster separately, apply synchronizer β. (By the end of this step, every v knows that every w in its cluster is safe.)

Step 2: Every processor incident to a synchronization link sends a message to the other cluster, saying its cluster is safe.

Handling safety information (in Phase B)

Step 3: Repetition of step 1, except the convergecast performed in each cluster carries different information:

• Whenever v learns all clusters neighboring its subtree are safe, it reports this to parent.

Step 4: When root learns all neighboring clusters are safe, it broadcasts “start new pulse” on tree

Synchronizer γ

Phases of synchronizer γ (in each cluster):

1. Converge(And, Safe(v,p))
2. Tcast(ClusterSafe(p))
3. Send ClusterSafe(p) messages to adjacent clusters

4. Converge(And, AdjClusterSafe(v,p))
5. Tcast(AllSafe(p))

Analysis

Correctness: Recall:

Readiness property: Processor v is ready for pulse p once it already received all alg' msgs sent to it by neighbors during their pulse p-1.

Readiness rule:Processor v may generate pulse p once it finished its original actions for pulse p-1, and Ready(v,p) holds.

Analysis

To prove that synchronizer γ properly implements Phase B, we need to show that it imposes the readiness rule.

Claim: Synchronizer γ is correct.

Complexity

Claim:
1. Cpulse(γ) = O(n^{1+1/k})
2. Tpulse(γ) = O(k)

Proof: Time to implement one pulse: ≤ 2 broadcast / convergecast rounds in the clusters (+ 1 message-exchange step among border vertices in neighboring clusters)

⇒ Tpulse(γ) ≤ 4·Rad(𝒮) + 1 = O(k)

Complexity

Messages: The broadcast / convergecast rounds, performed separately in each cluster, cost O(n) msgs in total (the clusters are disjoint)

A single communication step among neighboring clusters requires n·Avc(𝒮) = O(n^{1+1/k}) msgs

⇒ Cpulse(γ) = O(n^{1+1/k})
