Graph Problems in the Streaming Model

Sampath Kannan, University of Pennsylvania

Work done with Joan Feigenbaum, Andrew McGregor, Siddharth Suri and Jian Zhang

Graph Streaming

G = (V, E), V known; |V| = n

E revealed in arbitrary order (e1, e2, …)

Space allowed O(n polylog n): semi-streaming

Motivation?

Fundamental problems … help ‘calibrate’ model

Massive graphs such as the webgraph can appear as stream

Recommendation systems… and more generally data mining

Why so much space?

Even simple problems need it:

Given u,v, and a streamed graph G, is there path of length 2 between u & v?

Requires Ω(n) space. More generally … for balanced graph properties …
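The matching upper bound for the length-2 path question is easy to see. The sketch below is a minimal illustration, not taken from the slides: it keeps only the neighbor sets of u and v, which is O(n) space, and checks whether they share a vertex. The function name `has_two_path` and the edge-pair stream format are assumptions made for the example.

```python
def has_two_path(u, v, edge_stream):
    """One pass, O(n) space: is there a path u - x - v of length 2?

    Assumes an undirected graph given as a stream of (a, b) pairs.
    Only the neighbors of u and of v are remembered.
    """
    nbrs_u, nbrs_v = set(), set()
    for a, b in edge_stream:
        if a == u:
            nbrs_u.add(b)
        elif b == u:
            nbrs_u.add(a)
        if a == v:
            nbrs_v.add(b)
        elif b == v:
            nbrs_v.add(a)
    # A common neighbor other than u and v themselves is the midpoint x.
    return bool((nbrs_u & nbrs_v) - {u, v})
```

For example, on the stream `[(1, 2), (2, 3), (4, 5)]` the call `has_two_path(1, 3, ...)` finds the midpoint 2, while `has_two_path(1, 4, ...)` finds none.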

Balanced Properties

A property is balanced if there exists a stream of edges such that, before seeing the last edge:

There exists v such that the last edge is (v,x), where for Ω(n) x’s the property holds and for Ω(n) x’s the property doesn’t hold.

Lower Bound for Balanced Props

Consider all isomorphic versions of the graph that demonstrates the balance property.

Before seeing the last edge, the streaming algorithm has to remember the subset x of vertices such that the addition of edge (v,x) causes the property to hold.

As we range over isomorphisms... this is an arbitrary subset of the given cardinality... and there are exponentially many possibilities.

“Exceptions”

Counting Local Structures

• Counting triangles (Bar-Yossef et al, Buriol et al)

• Counting |E(G2)| (Ganguly et al)

• Duplicate elimination and aggregation (Cormode, Muthukrishnan)

One algorithm design technique: Sparsification (Eppstein, Galil, Italiano, Nissenzweig ‘97)

For graph property P: G’ strong certificate for G if ∀ H: (G ⋃ H) ∈ P ⇔ (G’ ⋃ H) ∈ P.

Existence of quickly computable, sparse, strong certificates leads to good semi-streaming algorithms
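As a concrete instance of a sparse strong certificate, a spanning forest is a strong certificate for connectivity: adding any extra edge set H to G or to its forest gives the same connected components. The sketch below is an illustration under that assumption, not the construction from the cited paper; `spanning_forest` and `is_connected` are hypothetical names, and vertices are assumed to be numbered 0..n-1.

```python
def spanning_forest(n, edge_stream):
    """One pass over the edges; keeps at most n - 1 of them.

    Uses union-find with path halving. The kept edges form a spanning
    forest of the streamed graph.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for a, b in edge_stream:
        ra, rb = find(a), find(b)
        if ra != rb:          # edge joins two components: keep it
            parent[ra] = rb
            forest.append((a, b))
    return forest


def is_connected(n, edges):
    """A graph on n vertices is connected iff its forest has n - 1 edges."""
    return len(spanning_forest(n, edges)) == n - 1
```

With G = `[(0, 1), (1, 2), (2, 0), (3, 4)]` on 5 vertices and H = `[(2, 3)]`, both G ⋃ H and certificate ⋃ H are connected, and with H empty neither is.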

Sparsification-based algorithms

Bipartiteness, 1-, 2-, 3-vertex connected components, 2-, 3-edge connected components: O(α(n)) per edge

MST, 4-vertex connected comps., 3-edge connected comps. O(log n)

Higher connectivities: Õ(n). (Zelke)

Bipartite Matching

Approximable with local greed

Matching (maximal)

Augmenting path

Constant-pass 2/3-approx for bip. matching

• Maximal matching is .5 approx:

If M’ is maximum and M is maximal, then M matches at least one endpoint of each edge in M’… so M has at least |M’|/2 edges.

• If M has only ε|M| vertex-disjoint 3-aug-paths => |M| (1 + ε) ≥ 2 OPT/3

M’ maximum: M’ ∆ M: a bunch of augmenting paths. Count!
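The maximal matching used as the starting point can be built greedily in one pass. This is a minimal sketch of that standard greedy rule; the function name and the edge-pair stream format are assumptions for the example.

```python
def greedy_maximal_matching(edge_stream):
    """One pass: take an edge whenever both endpoints are still free.

    The result is maximal (no streamed edge can be added to it), so it
    has at least half as many edges as a maximum matching.
    Space is O(n): one entry per matched vertex.
    """
    matched = set()
    matching = []
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

On the path a-b-c-d streamed as `[("b", "c"), ("a", "b"), ("c", "d")]`, the greedy rule keeps only (b, c), while the maximum matching has two edges: exactly the factor-2 worst case, and the remaining path a-b-c-d is a 3-augmenting path.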

• Can find maximal matching

• To go beyond: Need to get most aug. paths of length 3.

Randomly project all free vertices into Layer 0 or Layer 3

• Matched edges go from layer 1 to layer 2.

• Expect half the augmenting paths of length 3 to respect layering

• Use maximal matchings between successive layers to get constant fraction of these.

• Gives constant-pass 2/3 - approximation

To get approximation scheme: Need to find most augmenting paths of length k

• Again project vertices into k+1 layers to find augmenting paths of length k

• Use carefully chosen maximal matchings algorithms between successive layers

• Repeat constant number of times

Gives streaming linear time approx scheme for unweighted matching in general graphs (McGregor)

Weighted Matching

A 1/6 Approximation in 1 Pass

At all times we store some matching M.

On seeing edge e = (u,v), we compare w(e) with the total weight W of the edges e1 and e2 in M incident on u and v.

If w(e) > 2W then

M ← M ∪ {e} \ {e1, e2}

• To show 1/6 approx: Account for the weight of edges lost in terms of weight of edges that survive

• Can improve approx to 1/2 − ε (McGregor) in constant number of passes:

• Choose an edge if it is (1 + ε) times the weight of edges that it kills.
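The one-pass replacement rule can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the name `one_pass_weighted_matching`, the `(u, v, w)` stream format, and the `factor` parameter are assumptions. With `factor=2.0` it applies the "w(e) > 2W" test; a factor closer to 1 corresponds to the more aggressive rule used in the multi-pass improvement.

```python
def one_pass_weighted_matching(edge_stream, factor=2.0):
    """Keep a matching M; swap in a new edge if it is heavy enough.

    edge_stream yields (u, v, w) with w > 0. An arriving edge replaces
    the (at most two) matching edges it conflicts with when its weight
    exceeds factor times their total weight W.
    """
    mate = {}  # vertex -> (partner, weight of the matching edge)
    for u, v, w in edge_stream:
        if u == v:
            continue
        # Matching edges incident on u or v (a set, so an edge touching
        # both endpoints is counted once).
        conflicts = set()
        for x in (u, v):
            if x in mate:
                y, wy = mate[x]
                conflicts.add((frozenset((x, y)), wy))
        W = sum(wy for _, wy in conflicts)
        if w > factor * W:
            for edge, _ in conflicts:
                for x in edge:
                    del mate[x]
            mate[u] = (v, w)
            mate[v] = (u, w)
    return {(frozenset((x, y)), w) for x, (y, w) in mate.items()}
```

On the stream `[("a", "b", 1), ("b", "c", 3), ("c", "d", 10)]` each arrival more than doubles the weight it displaces, so the final matching is the single edge (c, d) of weight 10, against an optimum of 11.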

Approximating Distances

The “Sketch” Approach

A two-stage approach:

First stage: While going through the stream, construct a small sketch of the input graph.

Second stage: Compute the distance using the sketch, without further access to the stream.

Perform BFS-like computations in the second stage.

Graph Spanners as Sketches

Multiplicative t-spanner: Edge subgraph H of a graph G, s.t., for any pair of vertices u and v, dist_H(u,v) ≤ t·dist_G(u,v).

There is a t-spanner with O(n^{1+1/t}) edges.

Reduce streaming graph distance to streaming spanner construction.

BFS-like subroutines are used in most existing spanner constructions.

Streaming Spanner Construction

For each incoming edge, decide whether it should be in the spanner.

If the edge closes a cycle of length at most t, do not put the edge in the spanner.

This gives a t-spanner, because there is a path P of length < t connecting the two endpoints of any discarded edge.

This spanner is sparse. Thm [Bollobás78]: A graph whose girth is larger than k can only have O(n^{1+2/(k-1)}) edges.

Need to know: For an incoming edge, does a short path exist?
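The simple rule above can be sketched directly, answering the "does a short path exist?" question with a depth-limited BFS over the edges kept so far. This is a naive illustration (the per-edge BFS is exactly the cost the cluster-based algorithm is designed to avoid); the function names and the unweighted edge-pair stream are assumptions.

```python
from collections import deque


def within_distance(adj, src, dst, limit):
    """Depth-limited BFS: is dist(src, dst) <= limit in the graph adj?"""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        x, d = frontier.popleft()
        if d == limit:
            continue  # do not expand beyond the depth limit
        for y in adj.get(x, ()):
            if y == dst:
                return True
            if y not in seen:
                seen.add(y)
                frontier.append((y, d + 1))
    return False


def greedy_spanner(edge_stream, t):
    """Discard an edge if the spanner already joins its endpoints by a
    path of length < t, i.e. the edge would close a cycle of length <= t."""
    adj = {}
    kept = []
    for u, v in edge_stream:
        if within_distance(adj, u, v, t - 1):
            continue
        kept.append((u, v))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return kept
```

On the 4-cycle `[(0, 1), (1, 2), (2, 3), (3, 0)]`, `t = 4` drops the closing edge (a path of length 3 already exists), while `t = 3` keeps all four edges.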

Baswana & Sen show an almost linear time non-streaming algorithm for spanners… growing BFS-trees from appropriate nodes.

Difficult to do in streaming fashion…

Instead we grow a BFS-like tree not just from its root!

Clusters: Rooted BFS trees

Preclusters: Free-floating pieces of BFS trees … will attach to clusters

Summary of the One-Pass Algorithm

Use a vertex-labeling scheme to construct clusters.

Structure of the algorithm:

– In the pre-processing phase, generate a multi-level set of labels for the vertices.

– Go through the stream; for each edge:

• According to the current assignment of labels to vertices, decide whether to put this edge in the spanner.

• Depending on the type of edge, possibly assign more labels to one of its endpoints.

Next, an example with t = log n

Labels

– log n/2 levels

– w.h.p., there are top-level labels.

– Semantics of labels:

• The set of vertices assigned the same top-level label forms a cluster.

• The set of vertices assigned the same lower-level label forms a “pre-cluster.”

Level 0: (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)

Level 1: (1,2) (1,4) (1,7) (1,11)

Level 2: (2,2) (2,7)

Initial Label Assignment

[Figure: vertices v1 … v12 shown beneath the Level 0 labels (0,1) … (0,12), with the Level 1 and Level 2 labels as above.]

On arrival of an edge

Already know what to do with:

– Intra-cluster/pre-cluster edges

– Inter-cluster edges

Edges connecting pre-clusters: the sticky edges

– They are added to the spanner.

– They may lead to new label assignment and cluster growth.

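“Good” Neighbor (1)

[Figure: adjacent vertices v and u with label stacks (3,2), (2,2), (1,2), (0,2) and (1,6), (0,6), (2,2), (3,2). Caption: has marked labels.]

Good Neighbor (2)

[Figure: vertices v and u with clusters C(1,2), C(2,2), C(3,2) and C(1,6).]

“Bad” Neighbor

[Figure: vertices v and u with labels (3,2) and (1,6). Caption: no marked labels.]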

Properties of the Clusters

Small diameter

Number of clusters is bounded.

Do not need to cover the whole graph with clusters, but the uncovered subgraph is sparse.

The uncovered subgraph consists of sticky edges, and there are not too many of them.

Sticky Edges are Rare

[Figure: vertex v with neighbors u1, u2, u3, u4, … arriving in sequence.]

A neighbor is good with probability at least ½.

After seeing at most log n/2 good neighbors, v will be assigned a top-level label and be included in a cluster. No more sticky edges for v.

The number of sticky edges can be bounded by the length of the shortest prefix in the above sequence that contains log n/2 good neighbors.
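The counting argument can be made concrete with a small helper that, given which neighbors of v turned out to be good, returns the length of the shortest prefix containing the required number of good ones. This is only an illustration of the bound; the name `sticky_edge_bound` and the parameter `needed` (standing in for log n/2) are assumptions.

```python
def sticky_edge_bound(neighbor_is_good, needed):
    """Length of the shortest prefix with `needed` good neighbors.

    neighbor_is_good is a list of booleans, one per neighbor of v in
    arrival order. If the list never reaches `needed` good neighbors,
    every edge may be sticky, so the whole length is returned.
    """
    good = 0
    for count, is_good in enumerate(neighbor_is_good, start=1):
        good += bool(is_good)
        if good == needed:
            return count
    return len(neighbor_is_good)
```

If one additionally assumes neighbors are good independently with probability exactly ½, the expected prefix length is 2·needed, which is why sticky edges stay rare per vertex.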

4. Lower Bounds

One-pass diameter lower bound

Theorem: For any ε, any one-pass algorithm that returns a k (slightly better than 1/ε) approximation to the diameter in a weighted graph requires Ω(n^{1+ε}) space.

Proof (Sketch):

Some properties of a random graph G in G(n,p) with p = 1/n^{1−ε}:

• w.h.p. it contains a set E’ of edges with |E’| = n^{1+ε}/64 such that:

• no edge in E’ is in a cycle of length k or less.

• when all edges in E’ are removed, the graph still has diameter < 2/ε.

Fix one such G = (V, E ∪ E’)

Sketch (cont’d): Reduce from INDEX (hard for communication complexity)

INDEX: Alice has m-bit string x and Bob has index i. One-way comm. complexity for Bob to learn xi is m.

Reduction: m edges in E’ enumerated 1 .. m.

Alice constructs a prefix of the stream corresponding to multiple copies of H = (V, E ∪ E’’), where E’’ ⊆ E’ are the edges whose indices i have xi = 1. All of Alice’s edges have weight 1.

Bob constructs the rest of the stream: if his index corresponds to edge (a,b) in E’

• He connects vertex b in one copy with vertex a in the next copy at 0 weight.

• He also creates a source s and a sink t, and connects s to a in the 1st copy and b in the last copy to t at high weight.

Properties: If xi = 1 where i is Bob’s index then small diameter; else large diameter.

Small space streaming violates comm. lower bound.

Open Problems

Are there interesting subclasses of graphs for which distances and diameters are “easier” in streaming model?

Is there a more generous but reasonable model?

Network Intrusion Detection Systems

Current techniques fairly primitive:

– Misuse: Pattern match packets with misuse signatures in database

– Anomaly: Look for statistical anomalies in individual packet headers and payload

Needed:

– Look across multiple packets for intrusions

– Deal with interleaved traffic

An Example: Browsing habits

You read sports and cartoons. You’re equally likely to read both. You do not remember what you read last.

You’d expect a “random” sequence

SCSSCSSCSSCCSCCCSSSSCSC…

Two readers

I like health, entertainment, and politics.

I always read entertainment first, health next and politics last.

The sequence would be

EHPEHPEHPEHPEHPEHPEHP…

Two readers, one log file

If there is one log file… Assume there is no correlation between us

SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE…

Is there enough information to tell that there are two people browsing? What are they browsing? How are they browsing?

Clues in stream? Yes, under model assumptions.

H, E, P have a special relationship. They cannot belong to different (uncorrelated) people.

Not clear about S and C ... These could be two people or one person.

SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE…
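One way to see the clue is to project the merged log onto a subset of symbols and test whether the projection is a deterministic cycle. The sketch below does this for the log shown above; the helper names `project` and `follows_cycle` are hypothetical, and the test is an illustration rather than the detection method of the talk.

```python
def project(log, symbols):
    """Keep only the characters of `log` that belong to `symbols`."""
    return "".join(ch for ch in log if ch in symbols)


def follows_cycle(seq, cycle):
    """True if seq is cycle repeated (possibly cut off mid-cycle)."""
    return all(ch == cycle[i % len(cycle)] for i, ch in enumerate(seq))


log = "SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE"

# E, H, P always occur in the order E, H, P, E, H, P, ...
ehp_is_cyclic = follows_cycle(project(log, "EHP"), "EHP")

# S and C show no such fixed order.
sc_is_cyclic = follows_cycle(project(log, "SC"), "SC")
```

Here the E/H/P projection is perfectly cyclic while the S/C projection is not, matching the observation that H, E, P cannot come from uncorrelated readers.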

Markov Chains as Stochastic Sources

[Figure: a Markov chain on states 1 to 7 with transition probabilities labeling its edges.]

Output sequence: 1 4 7 7 1 2 5 7 ...

Markov chains on S,E,C,H,F

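[Figure: a two-state chain on S and C in which every transition has probability 1/2.]

Modeled by …

[Figure: a chain on H, E, F in which every transition has probability 1.]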

Need more realistic generalizations of such analysis to deal with:

• Worm detection

• Anomaly detection at high traffic links in a network

• TCP compliance

• BGP policy behavior

Partial Solution: Clusters (1)

A cluster is a subset of vertices and a small diameter spanning tree built on these vertices.

Intra-cluster edge

Partial Solution: Clusters (2)

Inter-cluster edges

Bollobás’s result no longer applies. Need to control the number of clusters.

Open Shortest Path First (OSPF)

Packet routing protocol:

• Each link broadcasts its weight (initially could be 1/bw...)

• To route from A to B, each router sends along a shortest path to B, dividing traffic evenly if there are many shortest paths.

Adjustments:

• A human operator observing congestion on a link could raise its weight

• Local decisions could lead to oscillation & suboptimality

• Link latency: convex function of its utilization

• Goal: minimize max link latency, total link latency, expected path latency, etc.

• Exact optimizations typically NP-hard

Streaming problem

• Can we automate the weight adjustments?

Simple scenario:

• Assume weights have been optimized for current traffic matrix

• Assume we now have a new (unknown) traffic matrix observed at routers

• Assume some simple goal ... minimize time to converge to new solution ... or something ...

Streaming algorithm should itself be allowed to generate traffic for communication between monitors and for diagnostics, but this overhead should be low.

Early Worm Detection

EarlyBird System [Singh et al] identifies following characteristics:

1. Substantial volume of identical traffic

2. Rising infection levels (# sources & destinations increasing)

3. Random probing (infected source tries many IP addresses)

Corresponding streaming questions:

1. A top-k type streaming algorithm can identify a high volume of identical traffic at one location. Can we do better in a distributed fashion?

2. How do we communicate to detect rising infection levels?

3. Sophisticated worms may not use random probing. What other discriminating tests are possible?

4. Sophisticated worms are polymorphic… not “identical” traffic.
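For the first point, one well-known top-k style summary is the Misra-Gries frequent-items algorithm, sketched below. The slides do not say which algorithm EarlyBird or the speaker had in mind, so this is a swapped-in illustration; the name `misra_gries` and treating each packet payload as a hashable `item` are assumptions.

```python
def misra_gries(stream, k):
    """Frequent-items summary using at most k - 1 counters (k >= 2).

    Any item occurring more than len(stream) / k times is guaranteed
    to be among the returned keys; the counts are underestimates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit 0.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

On the stream `x a x b x c x d x x` with `k = 3`, the item `x` (6 of 10 occurrences) survives with count 4 and all one-off items are evicted.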