
Graph Problems in the Streaming Model

Sampath Kannan, University of Pennsylvania

Work done with Joan Feigenbaum, Andrew McGregor, Siddharth Suri and Jian Zhang


Graph Streaming

G = (V, E), V known; |V| = n
E revealed in arbitrary order (e1, e2, …)

Space allowed O(n polylog n): semi-streaming


Motivation?

Fundamental problems … help ‘calibrate’ model

Massive graphs such as the webgraph can appear as stream

Recommendation systems… and more generally data mining


Why so much space?

Even simple problems need it:

Given u, v, and a streamed graph G, is there a path of length 2 between u and v?

Requires Ω(n) space. More generally … for balanced graph properties …
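For contrast with the lower bound, here is a minimal sketch of the obvious one-pass algorithm, which remembers only the neighbors of u and of v and therefore uses O(n) words. The function name `has_length2_path` and the edge-pair stream format are illustrative assumptions, not part of the talk.

```python
def has_length2_path(edge_stream, u, v):
    """One pass over the edges; stores only N(u) and N(v), so O(n) space."""
    nbrs_u, nbrs_v = set(), set()
    for a, b in edge_stream:
        # Treat each undirected edge in both orientations.
        for x, y in ((a, b), (b, a)):
            if x == u and y != v:
                nbrs_u.add(y)
            if x == v and y != u:
                nbrs_v.add(y)
    # A path u - w - v exists exactly when u and v share a neighbor w.
    return not nbrs_u.isdisjoint(nbrs_v)
```

For example, `has_length2_path([(1, 2), (2, 3)], 1, 3)` finds the common neighbor 2.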


Balanced Properties

A property is balanced if there exists a stream of edges such that, before seeing the last edge:

There exists v such that the last edge is (v,x), and for Ω(n) x’s the property holds, while for Ω(n) x’s the property doesn’t hold.


Lower Bound for Balanced Props

Consider all isomorphic versions of the graph that demonstrates the balance property.

Before seeing last edge, streaming algorithm has to remember the subset x of vertices such that the addition of edge (v,x) causes property to hold.

As we range over isomorphisms... this is an arbitrary subset of the given cardinality... and there are exponentially many possibilities.


“Exceptions”

Counting Local Structures

• Counting triangles (Bar-Yossef et al, Buriol et al)

• Counting |E(G²)| (Ganguly et al)

• Duplicate elimination and aggregation (Cormode, Muthukrishnan)


One algorithm design technique

Sparsification (Eppstein, Galil, Italiano, Nissenzweig ’97)

For graph property P: G’ strong certificate for G if ∀ H: (G ⋃ H) ∈ P ⇔ (G’ ⋃ H) ∈ P.

Existence of quickly computable, sparse, strong certificates leads to good semi-streaming algorithms


Sparsification-based algorithms

Bipartiteness, 1-, 2-, 3-vertex connected components, 2-, 3-edge connected components: O(α(n)) per edge

MST, 4-vertex connected comps., 3-edge connected comps.: O(log n)

Higher connectivities: Õ(n). (Zelke)
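As a minimal sketch of the sparsification idea for the simplest property, connectivity: a spanning forest F of G is a sparse strong certificate, because G ∪ H is connected exactly when F ∪ H is. The code below keeps such a forest in one pass with a union-find structure (union by size plus path halving). The function name `spanning_forest` and the 0..n-1 vertex numbering are assumptions made for illustration.

```python
def spanning_forest(n, edge_stream):
    """Keep at most n-1 edges that certify the connectivity of the stream."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for a, b in edge_stream:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # edge closes a cycle: not needed in the certificate
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra  # union by size
        size[ra] += size[rb]
        forest.append((a, b))
    return forest
```

The number of connected components is `n - len(forest)`.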


Bipartite Matching

Approximable with local greed

Matching (maximal)

Augmenting path


Constant-pass 2/3-approx for bip. matching

• Maximal matching is a 1/2-approximation:

If M’ is maximum and M is maximal, then M matches at least one endpoint of each edge in M’… so M has at least |M’|/2 edges.

• If M has only α|M| vertex-disjoint 3-aug-paths ⇒ |M| (1 + α) ≥ 2 OPT/3

M’ maximum: M’∆ M – bunch of augmenting paths. Count!
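The maximal matching itself is easy to obtain in one pass. A minimal sketch, with the illustrative name `greedy_maximal_matching` (not from the talk): keep an edge whenever both endpoints are still free.

```python
def greedy_maximal_matching(edge_stream):
    """One pass; O(n) space for the set of matched vertices."""
    matched = set()
    matching = []
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

On the path a-b-c-d, the arrival order matters: if (b,c) arrives first the result has one edge, while the maximum matching has two, which is the factor-2 gap described above.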


• Can find maximal matching

• To go beyond: Need to get most aug. paths of length 3.

Randomly project all free vertices into Layer 0 or Layer 3

• Matched edges go from layer 1 to layer 2.

• Expect half the augmenting paths of length 3 to respect layering

• Use maximal matchings between successive layers to get constant fraction of these.

• Gives constant-pass 2/3 - approximation
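The layering step can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the talk: free vertices go to layer 0 or 3 by fair coin flips, each matched edge is oriented at random from layer 1 to layer 2, and two greedy passes pick maximal matchings between layers 0-1 and 2-3. The function name `augment_with_3_paths` and the in-memory edge list (read twice, standing in for two passes) are hypothetical.

```python
import random

def augment_with_3_paths(edges, matching, seed=0):
    """Enlarge `matching` along vertex-disjoint augmenting paths of length 3."""
    rng = random.Random(seed)
    mate = {}
    for a, b in matching:
        mate[a], mate[b] = b, a
    layer = {}
    for x in sorted({x for e in edges for x in e}):
        if x not in mate:
            layer[x] = rng.choice((0, 3))  # free vertices: layer 0 or 3
    for a, b in matching:
        if rng.random() < 0.5:
            a, b = b, a
        layer[a], layer[b] = 1, 2  # matched edges go from layer 1 to layer 2

    # Pass 1: greedy maximal matching between layer 0 and layer 1.
    left, used0 = {}, set()  # left: layer-1 vertex -> layer-0 partner
    for x, y in edges:
        for p, q in ((x, y), (y, x)):
            if layer[p] == 0 and layer[q] == 1 and p not in used0 and q not in left:
                left[q] = p
                used0.add(p)

    # Pass 2: greedy maximal matching between layer 2 and layer 3, restricted
    # to layer-2 vertices whose mate was reached in pass 1.
    right, used3 = {}, set()  # right: layer-2 vertex -> layer-3 partner
    for x, y in edges:
        for p, q in ((x, y), (y, x)):
            if (layer[p] == 2 and layer[q] == 3 and mate[p] in left
                    and p not in right and q not in used3):
                right[p] = q
                used3.add(q)

    # Augment along each path a - b - c - d that was found.
    new = {frozenset(e) for e in matching}
    for c, d in right.items():
        b = mate[c]
        a = left[b]
        new.remove(frozenset((b, c)))
        new.add(frozenset((a, b)))
        new.add(frozenset((c, d)))
    return new
```

In this sketch a given length-3 augmenting path respects the random layering with constant probability, so repeating with fresh randomness finds more of them.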


To get approximation scheme: Need to find most augmenting paths of length

• Again project vertices into k+1 layers to find augmenting paths of length k

• Use carefully chosen maximal matching algorithms between successive layers

• Repeat constant number of times

Gives streaming linear time approx scheme for unweighted matching in general graphs (McGregor)


Weighted Matching


A 1/6 Approximation in 1 Pass

At all times we store some matching M.

On seeing edge e = (u,v), we compare w(e) with the total weight W of the edges e1 and e2 in M incident on u and v.

If w(e) > 2W then

M ← M ∪ {e} \ {e1, e2}
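The replacement rule can be sketched in a few lines. This is a minimal illustration assuming the stream yields (u, v, weight) triples; the function name `one_pass_weighted_matching` and the dictionary representation of M are choices made here, not taken from the talk.

```python
def one_pass_weighted_matching(weighted_edge_stream):
    """Keep a matching M; swap in a new edge if it outweighs twice its conflicts."""
    mate = {}  # vertex -> (other endpoint, weight) of its edge in M
    for u, v, w in weighted_edge_stream:
        if u == v:
            continue
        # Edges of M incident on u or v (at most two), keyed to avoid double counting.
        conflicts = {}
        for x in (u, v):
            if x in mate:
                y, wx = mate[x]
                conflicts[frozenset((x, y))] = wx
        W = sum(conflicts.values())
        if w > 2 * W:
            for e in conflicts:
                for x in e:
                    del mate[x]
            mate[u] = (v, w)
            mate[v] = (u, w)
    return {frozenset((x, y)): w for x, (y, w) in mate.items()}
```

On the stream (a,b,1), (b,c,3), (c,d,1), (a,b,10), the edge (b,c) first evicts (a,b), survives (c,d), and is finally evicted by the heavy copy of (a,b).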


• To show 1/6 approx: Account for the weight of edges lost in terms of weight of edges that survive

• Can improve approx to 1/2 − ε (McGregor) in constant number of passes:

• Choose an edge if its weight exceeds (1 + γ) times the weight of the edges that it kills.


Approximating Distances


The “Sketch” Approach

A two-stage approach:

First stage: While going through the stream, construct a small sketch of the input graph.

Second stage: Compute the distance using the sketch, without further access to the stream.

Perform BFS-like computations in the second stage.


Graph Spanners as Sketches

Multiplicative t-spanner: Edge subgraph H of a graph G, s.t., for any pair of vertices u and v, dist_H(u,v) ≤ t·dist_G(u,v).

There is a t-spanner with O(n^(1+1/t)) edges.

Reduce streaming graph distance to streaming spanner construction.

BFS-like subroutines are used in most existing spanner constructions.


Streaming Spanner Construction

For each incoming edge, decide whether it should be in the spanner.

If the edge causes a cycle of length ≤ t, do not put the edge in the spanner.

This gives a t-spanner, because there is a path P of length < t connecting the two endpoints of any discarded edge.

This spanner is sparse.

Thm [Bollobás78]: A graph whose girth is larger than k can only have O(n^(1+2/(k-1))) edges.

Need to know: For an incoming edge, does a short path exist?
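A naive way to answer the "does a short path exist?" question is a depth-bounded BFS over the spanner stored so far. The sketch below does exactly that for unweighted graphs; it is only an illustration of the greedy rule (its per-edge time can be large, which is the difficulty the talk goes on to address). The names `within_distance` and `streaming_spanner` are hypothetical.

```python
from collections import deque

def within_distance(adj, s, t, limit):
    """True if t can be reached from s using at most `limit` edges of adj."""
    if s == t:
        return True
    dist = {s: 0}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        if dist[x] == limit:
            continue  # do not explore beyond the depth limit
        for y in adj.get(x, ()):
            if y not in dist:
                if y == t:
                    return True
                dist[y] = dist[x] + 1
                queue.append(y)
    return False

def streaming_spanner(edge_stream, t):
    """Keep an edge unless it would close a cycle of length at most t."""
    adj = {}
    spanner = []
    for u, v in edge_stream:
        if within_distance(adj, u, v, t - 1):
            continue  # a path of length < t already connects u and v
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        spanner.append((u, v))
    return spanner
```

On a triangle streamed as (0,1), (1,2), (0,2) with t = 3, the last edge is dropped because the path 0-1-2 already has length 2.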


Baswana & Sen show almost linear time non-streaming algorithm for spanners… growing BFS-trees from appropriate nodes.

Difficult to do in streaming fashion…

Instead we grow a BFS-like tree not just from its root!

Clusters: Rooted BFS trees

Preclusters: Free floating pieces of BFS trees … will attach to clusters


Summary of the One-Pass Algorithm

Use a vertex-labeling scheme to construct clusters.

Structure of the algorithm:

– In the pre-processing phase, generate a multi-level set of labels for the vertices.

– Go through the stream; for each edge:
  • According to the current assignment of labels to vertices, decide whether to put this edge in the spanner.
  • Depending on the type of edge, possibly assign more labels to one of its endpoints.

Next, an example with t = log n


Labels

– log n/2 levels
– w.h.p., there are top-level labels.
– Semantics of labels:

• The set of vertices assigned the same top-level label forms a cluster.

• The set of vertices assigned the same lower-level label forms a “pre-cluster.”

Level 0: (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)
Level 1: (1,2) (1,4) (1,7) (1,11)
Level 2: (2,2) (2,7)


Initial Label Assignment

Vertices: v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12
Level 0: (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)
Level 1: (1,2) (1,4) (1,7) (1,11)
Level 2: (2,2) (2,7)


On arrival of an edge

Already know what to do with:
– Intra-cluster/pre-cluster edges
– Inter-cluster edges

Edges connecting pre-clusters: the sticky edges
– They are added to the spanner.
– They may lead to new label assignment and cluster growth.


“Good” Neighbor (1)

[Figure: edge between v and u; labels shown: (0,2), (1,2), (2,2), (3,2), (0,6), (1,6). The neighbor has marked labels.]


Good Neighbor (2)

[Figure: edge between v and u, with clusters C(1,2), C(2,2), C(3,2) and C(1,6).]


“Bad” Neighbor

[Figure: edge between v and u; labels shown: (3,2), (1,6). The neighbor has no marked labels.]


Properties of the Clusters

Small diameter

Number of clusters bounded by .

Do not need to cover the whole graph with clusters, but the uncovered subgraph is sparse.

The uncovered subgraph consists of sticky edges, and there are not too many of them.


Sticky Edges are Rare

[Figure: vertex v and its neighbors u1, u2, u3, u4, … in order of arrival.]

A neighbor is good with probability at least ½.

After seeing at most log n/2 good neighbors, v will be assigned a top-level label and be included in a cluster. No more sticky edges for v.

The number of sticky edges can be bounded by the length of the shortest prefix in the above sequence that contains log n/2 good neighbors.
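One way to make this bound concrete, as a sketch under stated assumptions: suppose each arriving neighbor is good independently with probability p ≥ 1/2 (independence is an assumption made here), read log n/2 as (log n)/2, and let T be the length of the shortest prefix containing k = (log n)/2 good neighbors. Then T is dominated by a negative binomial waiting time, so in expectation v sees only about log n sticky edges:

```latex
\[
  \mathbb{E}[T] \;\le\; \frac{k}{p} \;\le\; \frac{k}{1/2} \;=\; 2k \;=\; \log n,
  \qquad k = \tfrac{1}{2}\log n .
\]
```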


4. Lower Bounds


One-pass diameter lower bound

Theorem: For any γ, any one-pass algorithm that returns a k (slightly better than 1/γ) approx to diameter in weighted graph requires Ω(n^(1+γ)) space.

Proof (Sketch):

Some properties of random graph G in G_{n,p} with p = 1/n^(1−γ)

• w.h.p. contains a set E’ of edges, |E’| = n^(1+γ)/64:
  • no edge in E’ is in a cycle of length k or less.
  • When all edges in E’ are removed, graph still has diameter < 2/γ

Fix one such G = (V, E ∪ E’)


Sketch (cont’d): Reduce from INDEX (hard for comm. complexity)

INDEX: Alice has m-bit string x and Bob has index i. One-way comm. complexity for Bob to learn x_i is Ω(m).

Reduction: m edges in E’ enumerated 1 .. m.

Alice constructs prefix of stream corresponding to multiple copies of H = (V, E ∪ E’’), where E’’ ⊆ E’ corresponds to the indices where x_i = 1. All Alice’s edges have weight 1.

Bob constructs rest of stream: If his index corresponds to edge (a,b) in E’
• He connects vertex b in one copy with vertex a in next copy at 0 weight
• Also creates source s and sink t and connects s to a in 1st copy and b in last copy to t at high weight.

Properties: If x_i = 1 where i is Bob’s index then small diameter; else large diameter.

Small space streaming violates comm. lower bound.


Open Problems

Are there interesting subclasses of graphs for which distances and diameters are “easier” in streaming model?

Is there a more generous but reasonable model?


Network Intrusion Detection Systems

Current techniques fairly primitive:
– Misuse: Pattern match packets with misuse signatures in database
– Anomaly: Look for statistical anomalies in individual packet headers and payload

Needed:
– Look across multiple packets for intrusions
– Deal with interleaved traffic


An Example: Browsing habits

You read sports and cartoons. You’re equally likely to read both. You do not remember what you read last.

You’d expect a “random” sequence

SCSSCSSCSSCCSCCCSSSSCSC…


Two readers

I like health, entertainment, and politics.

I always read entertainment first, health next and politics last.

The sequence would be

EHPEHPEHPEHPEHPEHPEHP…


Two readers, one log file

If there is one log file…

Assume there is no correlation between us

SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE…

Is there enough information to tell that there are two people browsing?
What are they browsing?
How are they browsing?
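To experiment with these questions, a log like the one above can be generated synthetically. The sketch below assumes that each log entry comes from either reader with probability 1/2, independently; that mixing rule and the names `interleaved_log` and `project` are assumptions made here for illustration.

```python
import random

def interleaved_log(length, seed=0):
    """Interleave a memoryless S/C reader with a cyclic E -> H -> P reader."""
    rng = random.Random(seed)
    cycle = "EHP"
    pos = 0
    out = []
    for _ in range(length):
        if rng.random() < 0.5:
            out.append(rng.choice("SC"))  # reader 1: S or C, equally likely
        else:
            out.append(cycle[pos % 3])    # reader 2: next page in the E, H, P cycle
            pos += 1
    return "".join(out)

def project(log, alphabet):
    """Restrict the log to the symbols in `alphabet`, preserving order."""
    return "".join(ch for ch in log if ch in alphabet)
```

Projecting the log onto {E, H, P} always recovers a prefix of EHPEHP…, which is the kind of structure that separates the second reader from the first.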


Clues in stream? Yes, under model assumptions.

H, E, P have special relationship.

They cannot belong to different (uncorrelated) people.

Not clear about S and C ... These could be two people or one person.

SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE…


Markov Chains as Stochastic Sources

[Figure: a seven-state Markov chain (states 1 to 7) with transition probabilities on the edges.]

Output sequence: 1 4 7 7 1 2 5 7 ...


Markov chains on S,E,C,H,F

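[Figure: a chain on S and C with all four transition probabilities equal to 1/2.]

Modeled by …

[Figure: a chain on H, E, F with three transitions of probability 1.]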
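A Markov chain used as a stochastic source can be sketched as a table of transition probabilities plus a sampler. The two tables below follow the slide's description (every S/C transition has probability 1/2; the other chain is a deterministic cycle), but the cycle order E → H → F and the names `run_chain`, `SC` and `EHF` are assumptions made here.

```python
import random

def run_chain(transitions, start, steps, seed=0):
    """Emit `steps` symbols from a Markov chain given as {state: {next: prob}}."""
    rng = random.Random(seed)
    state = start
    out = [state]
    for _ in range(steps - 1):
        nxt, probs = zip(*transitions[state].items())
        state = rng.choices(nxt, weights=probs)[0]
        out.append(state)
    return "".join(out)

# Memoryless reader: S and C are equally likely whatever was read last.
SC = {"S": {"S": 0.5, "C": 0.5}, "C": {"S": 0.5, "C": 0.5}}
# Cyclic reader: each state has a single successor with probability 1.
EHF = {"E": {"H": 1.0}, "H": {"F": 1.0}, "F": {"E": 1.0}}
```

For instance, `run_chain(EHF, "E", 7)` yields the deterministic cycle, while `run_chain(SC, "S", 50)` yields a random-looking S/C string.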


Need more realistic generalizations of such analysis to deal with:

• Worm detection

• Anomaly detection at high traffic links in a network

• TCP compliance

• BGP policy behavior


Partial Solution: Clusters (1)

A cluster is a subset of vertices and a small diameter spanning tree built on these vertices.

Intra-cluster edge


Partial Solution: Clusters (2)

Inter-cluster edges

Bollobás’s result no longer applies. Need to control the number of clusters (i.e., make it ).


Open Shortest Path First (OSPF)

Packet routing protocol:
• Each link broadcasts its weight (initially could be 1/bw...)
• To route from A to B, each router sends along shortest path to B, dividing traffic evenly if many shortest paths.

Adjustments:
• Human operator observing congestion on link could raise wt
• Local decisions could lead to oscillation & suboptimality

• Link latency: Convex function of its utilization
• Goal: Minimize max link latency, total link latency, expected path latency, etc.
• Exact optimizations typically NP-hard


Streaming problem

• Can we automate the weight adjustments?

Simple scenario:
• Assume weights have been optimized for current traffic matrix
• Assume we now have a new (unknown) traffic matrix observed at routers
• Assume some simple goal ... minimize time to converge to new solution ... or something ...

Streaming algorithm should itself be allowed to generate traffic for communication between monitors and for diagnostics, but this overhead should be low.


Early Worm Detection

EarlyBird System [Singh et al] identifies following characteristics:

1. Substantial volume of identical traffic
2. Rising infection levels (# sources & destinations increasing)
3. Random probing (infected source tries many IP addresses)

1. Top-k type streaming algorithm can identify high volume of identical traffic at one location. Can we do better in distributed fashion?
2. How do we communicate to detect rising inf. levels?
3. Sophisticated worms may not use random probing. What other discriminating tests are possible?
4. Sophisticated worms are polymorphic… not “identical” traffic.
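As one example of a "top-k type" streaming algorithm for spotting high-volume identical traffic at a single location, here is a minimal sketch of the Misra-Gries frequent-items summary. It is a standard technique swapped in for illustration, not necessarily what EarlyBird uses; the function name `misra_gries` and the idea of treating each packet payload as a stream item are assumptions.

```python
def misra_gries(stream, k):
    """Keep at most k-1 counters; any item occurring more than len(stream)/k
    times is guaranteed to still hold a counter at the end."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The surviving counts are underestimates of the true frequencies, so a second pass (or a verification step) is needed if exact volumes matter.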