QSX: Querying Social Graphs. Graph Pattern Matching: graph pattern matching via subgraph isomorphism; graph pattern matching via graph simulation; revisions of graph simulation for social network analysis

  • Slide 1
  • QSX: Querying Social Graphs. Graph Pattern Matching: graph pattern matching via subgraph isomorphism; graph pattern matching via graph simulation; revisions of graph simulation for social network analysis
  • Slide 2
  • The need for studying graph pattern matching: prevalent use in traditional and emerging applications. Applications: pattern recognition, knowledge discovery, intelligence analysis, transportation network analysis, Web site classification, social position and community detection, social media marketing, knowledge fusion, ...
  • Slide 3
  • Subgraph isomorphism: complexity and algorithm
  • Slide 4
  • Social graphs. A directed graph G = (V, E, f_A), where f_A(u) gives the attributes of node u. Simplification (node labels): assume f_A(u) has a unique attribute, the label of u. [Figure: example graph with nodes labeled Gen, Med, Soc, AI, Chem, DB, Eco.]
  • Slide 5
  • Subgraph isomorphism. A function f from the nodes of Q to the nodes of G such that: for each node u in Q, u and f(u) have the same label; there exists an edge (u, u') in Q if and only if there exists an edge (f(u), f(u')) in G. A bijection: identical label matching, edge-to-edge relations. [Figure: example data graph G, including nodes v1 and v2, and pattern Q over the labels A, B, D, E.]
  • Slide 6
  • Matching by subgraph isomorphism
    Input: a directed graph G and a graph pattern Q.
    Output: all subgraphs of G that are isomorphic to Q.
    Complexity: intractable. NP-complete, with exponentially many matches.
    Remains NP-hard even when Q is a tree and G is a forest, or Q is acyclic and G is a tree; PTIME if Q is a forest and G is a tree.
    The lower bound is rather robust.
  • Slide 7
  • Algorithms for computing subgraph isomorphism
    Input: pattern Q and graph G. Output: all isomorphic mappings P from Q to G.
    Match(P)    (P: a partial mapping, initially empty)
      if P covers all nodes in Q then output P;
      else compute the set S(P) of all candidate pairs for inclusion in P;
        for each pair p = (u, v) in S(P)
          if p passes the feasibility check then
            P ← P ∪ {p}; call Match(P); restore data structures
    Candidate pairs S(P): nodes that are directly connected to those already in P, with the same labels.
    Recursion and refinement: for each pair p = (u, v) in S(P), enumerate all possible extensions; if the feasibility test is not successful, drop p and try the next.
    The feasibility check guarantees correctness.
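The Match(P) recursion can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: graphs are assumed to be dictionaries mapping each node to its set of successors, the names `feasible` and `match` are hypothetical, and the candidate set is simply every node of G rather than only the nodes adjacent to those already in P.

```python
from typing import Dict, Hashable, Iterator, Set

Node = Hashable
Adj = Dict[Node, Set[Node]]  # node -> set of successors


def feasible(q: Adj, g: Adj, q_label: dict, g_label: dict,
             p: Dict[Node, Node], u: Node, v: Node) -> bool:
    """Check that adding (u, v) keeps P a partial isomorphic mapping."""
    if q_label[u] != g_label[v] or v in p.values():
        return False  # labels differ, or v is already used (injectivity)
    if (u in q[u]) != (v in g[v]):
        return False  # self-loops must agree
    for u2, v2 in p.items():
        # Edge (u, u2) in Q iff edge (v, v2) in G, in both directions.
        if (u2 in q[u]) != (v2 in g[v]) or (u in q[u2]) != (v in g[v2]):
            return False
    return True


def match(q: Adj, g: Adj, q_label: dict, g_label: dict,
          p: Dict[Node, Node] = None) -> Iterator[Dict[Node, Node]]:
    """Yield every mapping from the nodes of Q to the nodes of G."""
    p = {} if p is None else p
    if len(p) == len(q):          # P covers all nodes in Q
        yield dict(p)
        return
    u = next(n for n in q if n not in p)   # next unmapped pattern node
    for v in g:                            # assumption: all of G as candidates
        if feasible(q, g, q_label, g_label, p, u, v):
            p[u] = v                       # P <- P U {(u, v)}
            yield from match(q, g, q_label, g_label, p)
            del p[u]                       # restore data structures
```

Restricting the candidates to neighbors of already mapped nodes, as the slide describes, would prune the search further without changing the set of mappings found for a connected pattern.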
  • Slide 8
  • VF2: a popular algorithm for subgraph isomorphism
    Match(P)
      if P covers all nodes in Q then output P;
      else compute the set S(P) of all candidate pairs for inclusion in P;
        for each pair p = (u, v) in S(P)
          if p passes the feasibility check then
            P ← P ∪ {p}; call Match(P); restore data structures
    Feasibility check: five k-look-ahead rules, to make sure that P is a partial isomorphic mapping.
    Feasibility rules, for each pair (u, v) in P: their predecessors are already mapped and included in P; their successors can possibly be mapped; certain conditions on the cardinalities of predecessors and successors ensure correctness and expandability.
    The rules guarantee correctness and reduce backtracking.
    L. P. Cordella, P. Foggia, C. Sansone, M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs, IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004
  • Slide 9
  • Ullmann's algorithm: an algorithm that is still being used
    Backtrack(P)
      if P covers all nodes in Q then output P and return;
      for each node u in Q that is not yet in P
        find a node v in G; p ← (u, v); P ← P ∪ {p};
        if P makes a partial mapping (an injective function that preserves edges) then call Backtrack(P);
    Uses adjacency matrices of G and Q, their transposes, and a form of permutation matrices; P is represented by expanding permutation matrices.
    For each candidate pair p = (u, v): enumerate all possible extensions, for refinement.
    Backtracking: no matter whether the test is successful or not, go back to the previous level and try another p.
    J. R. Ullmann. An Algorithm for Subgraph Isomorphism. JACM 1976
  • Slide 10
  • Graph simulation: complexity and algorithm
  • Slide 11
  • Graph simulation. A relation (as opposed to a function): identical label matching, edge-to-edge mapping. A binary relation R on the nodes of Q and the nodes of G such that: for each node u in Q, there exists a node v in G such that (u, v) is in R, and u and v have the same label; for each pair (u, v) in R and each edge (u, u') in Q, there exists an edge (v, v') in G such that (u', v') is in R. [Figure: example data graph G, including nodes v1 and v2, and pattern Q over the labels A, B, D, E.]
  • Slide 12
  • Matching by graph simulation
    Input: a directed graph G and a graph pattern Q.
    Output: the maximum simulation relation R.
    Uses relations instead of functions.
    Maximum simulation relation: always exists and is unique. If a match relation exists, then there exists a maximum one; otherwise it is the empty set, which is still maximum.
    Complexity: quadratic time, O((|V| + |V_Q|)(|E| + |E_Q|)).
    The output is a unique relation, possibly of size |Q||V|.
  • Slide 13
  • Data locality. Given a pattern Q, a graph G and a node v in G, can we decide whether v matches some node in Q by inspecting only nodes within d hops of v, where d is determined by Q only? Subgraph isomorphism has data locality: with d the diameter of Q, we only need to inspect the d-neighborhood of v. Graph simulation does not have data locality: it is a recursive computation. [Figure: pattern Q and data graph G.]
  • Slide 14
  • Algorithm for computing graph simulation
    Input: pattern Q and graph G. Output: for each u in Q, sim(u): the matches w in G.
    Similarity(P)
      for all nodes u in Q do
        sim(u) ← the set of candidate matches w in G;
      while there exist an edge (u, v) in Q and a node w in sim(u) (in G) that violate the simulation condition, i.e. successor(w) ∩ sim(v) = ∅
        sim(u) ← sim(u) \ {w};
      output sim(u) for all u in Q
    Candidate matches: nodes with the same label; moreover, if u has an outgoing edge, so does w.
    Violation (successor(w) ∩ sim(v) = ∅): there exists an edge from u to v in Q, but the candidate w of u has no corresponding edge to a node w' that matches v. Removing such a w is the refinement step.
    Correct, but not in quadratic time.
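The refinement loop above can be written directly as a fixpoint computation. The following is a sketch under assumptions: graphs are dictionaries from a node to its set of successors, the function name `simulation` is hypothetical, and, following the statement that the maximum relation is empty when no match exists, all sets are emptied if any pattern node ends up without a match.

```python
def simulation(q, g, q_label, g_label):
    """Naive maximum graph simulation: returns sim(u) for each u in Q."""
    # Candidates: same label; if u has an outgoing edge, so must w.
    sim = {
        u: {w for w in g
            if g_label[w] == q_label[u] and (not q[u] or g[w])}
        for u in q
    }
    changed = True
    while changed:
        changed = False
        for u in q:
            for v in q[u]:  # pattern edge (u, v)
                # w violates the condition if successor(w) misses sim(v).
                bad = {w for w in sim[u] if not (g[w] & sim[v])}
                if bad:
                    sim[u] -= bad
                    changed = True
    # No match relation exists if some pattern node has no match.
    if any(not matches for matches in sim.values()):
        return {u: set() for u in q}
    return sim
```

Each pass rescans every pattern edge and every candidate, which is why this version is correct but does not meet the quadratic bound.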
  • Slide 15
  • Speedup
    For each node u in pattern Q, maintain prevsim(u): the nodes once considered as candidate matches of u, a superset of sim(u).
    Invariant: for each edge (u, v) in Q and each w in sim(u), successor(w) ∩ prevsim(v) ≠ ∅.
    prevsim(u) \ sim(u): invalid candidates. Each node in prevsim(u) is looked up only once.
    If successor(w) ∩ prevsim(v) = ∅, then w should be removed from sim(u), where u is a predecessor of v: propagate violations upward.
    Terminate when prevsim(u) = sim(u) for all nodes u in Q: sim can't be refined further.
    Once w is removed, it is never put back.
  • Slide 16
  • Algorithm
    Similarity(P)
      for all nodes v in Q do
        sim(v) ← the set of candidate matches in G;
        prevsim(v) ← the set of all the nodes in G;
      while there exists a node v in Q such that sim(v) ≠ prevsim(v)
        remove ← predecessor(prevsim(v)) \ predecessor(sim(v));
        for all u in predecessor(v) do
          sim(u) ← sim(u) \ remove;
        prevsim(v) ← sim(v);
      output sim(v) for all v in Q
    Candidate matches: nodes with the same label; moreover, if u has an outgoing edge, so does w.
    remove: a dynamically maintained set, used to propagate violations up to the predecessors of v.
    For each w ∈ prevsim(v) \ sim(v), w is checked only once, hence |V_Q||V| in total.
    Can be implemented in O((|V| + |V_Q|)(|E| + |E_Q|)) time.
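The prevsim-based refinement can be sketched as below. This is an illustration of the control flow only, under assumptions: graphs are dictionaries from a node to its set of successors, `refined_simulation` is a hypothetical name, and `remove` is recomputed from scratch in each round, so the sketch does not achieve the stated time bound (that requires maintaining `remove` incrementally). A snapshot of sim(v) is taken before the update so that a self-loop on v in the pattern is handled safely.

```python
def refined_simulation(q, g, q_label, g_label):
    """Maximum graph simulation via prevsim/remove refinement."""
    g_pred = {w: set() for w in g}
    for w, succs in g.items():
        for x in succs:
            g_pred[x].add(w)
    q_pred = {v: set() for v in q}
    for u, succs in q.items():
        for v in succs:
            q_pred[v].add(u)

    def pre(nodes):
        """All predecessors in G of the given set of nodes."""
        return set().union(*(g_pred[n] for n in nodes))

    sim = {
        v: {w for w in g
            if g_label[w] == q_label[v] and (not q[v] or g[w])}
        for v in q
    }
    prevsim = {v: set(g) for v in q}
    while True:
        v = next((x for x in q if sim[x] != prevsim[x]), None)
        if v is None:
            break  # sim cannot be refined further
        current = set(sim[v])  # snapshot, in case v has a self-loop
        # Nodes that had a successor in prevsim(v) but none in sim(v).
        remove = pre(prevsim[v]) - pre(current)
        for u in q_pred[v]:
            sim[u] -= remove  # propagate the violation upward
        prevsim[v] = current
    if any(not matches for matches in sim.values()):
        return {v: set() for v in q}
    return sim
```

On the inputs tried here it returns the same relation as the naive loop, while touching each removed candidate of v only when v is processed.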
  • Slide 17
  • Graph simulation revised for social network analysis
  • Slide 18
  • Graph pattern matching: the conventional notions. Input: a query Q and a data graph G. Output: all the matches of Q in G. Subgraph isomorphism: a bijective function f on nodes such that (u, u') ∈ Q iff (f(u), f(u')) ∈ G. Graph simulation: a binary relation S on nodes such that, for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G with (u', v') ∈ S. Can we use the conventional notions for social network analysis?
  • Slide 19
  • Example query for graph pattern matching: find all matches of a pattern in a graph. Identify suspects in a drug ring; understand the structure of drug trafficking organizations. [Figure: pattern graph with nodes B, AM, S, FW and edge bounds 3, 3, 1; data graph with nodes B, A1, ..., Am and W.]
  • Slide 20
  • Pattern matching in social graphs. Neither subgraph isomorphism nor graph simulation works, for both scalability and effectiveness. Callouts on the drug ring example: not allowed by a bijection; relation instead of function; edges to paths. [Figure: pattern graph with nodes B, AM, S, FW and edge bounds 3, 3, 1; data graph with nodes B, A1, ..., Am and W.]
  • Slide 21
  • Social graphs: modeling attributes. A directed graph G = (V, E, f_A), where f_A(u) is a tuple of attributes (A_1 = a_1, ..., A_n = a_n), e.g. label, keywords, blogs, comments, rating. [Figure: example graph with nodes Gen, Med, Soc, AI, Chem, DB, Eco, annotated with tuples such as (dept=CS, field=AI), (dept=CS, field=DB), (dept=Bio, field=Gen), (dept=Bio, field=Eco).]
  • Slide 22
  • Bounded patterns: incorporating search conditions and bounds on the number of hops. Pattern graph Q = (V_Q, E_Q, f_v, f_e), where f_v(u) is a search condition, a conjunction of atoms A op a with op a comparison operator (e.g. f_v(u): dept=CS), and f_e(u, u') is a bound: either a constant k (bounded, within k hops) or the symbol * (unbounded). [Figure: pattern over CS, Bio, Soc, Med with edge bounds *, 3, *, 2, 2, 3.]
  • Slide 23
  • Bounded simulation: mapping edges to bounded paths. G = (V, E, f_A) matches Q = (V_Q, E_Q, f_v, f_e) via bounded simulation if there exists a binary relation S ⊆ V_Q × V such that S is a total mapping that satisfies the search conditions and the bounds on edge-to-path mappings:
    for each u ∈ V_Q, there exists v ∈ V such that (u, v) ∈ S;
    for each (u, v) ∈ S, the attributes f_A(v) satisfy the predicate f_v(u);
    for each (u, v) ∈ S, each edge (u, u') in E_Q is mapped to a nonempty path from v to v' in G of length at most f_e(u, u') (of any length if f_e(u, u') = *), with (u', v') ∈ S.
    There exists a unique maximum match. [Figure: data graph over CS, DB, Soc, Med, Gen, Eco, AI, Chem, Bio and a pattern with edge bounds *, 3, *, 2, 2, 3, related by S.]
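A naive way to compute the maximum bounded simulation is to precompute shortest nonempty-path lengths and then refine candidate sets to a fixpoint. The sketch below makes several assumptions that are not in the slides: the names `path_lengths` and `bounded_simulation`, the representation of search conditions as Python predicates over attribute dictionaries, and `None` standing for the unbounded symbol *. It is not the cubic-time algorithm from the reading list.

```python
from collections import deque


def path_lengths(g, src):
    """Length of the shortest nonempty path from src to each reachable node."""
    dist = {}
    queue = deque((x, 1) for x in g[src])
    while queue:
        node, d = queue.popleft()
        if node in dist:
            continue  # already reached by a path that is no longer
        dist[node] = d
        queue.extend((x, d + 1) for x in g[node])
    return dist


def bounded_simulation(q_pred, q_edges, g, g_attr):
    """q_pred: pattern node -> predicate on attributes.
    q_edges: (u, u2) -> bound k, or None for '*'."""
    dist = {w: path_lengths(g, w) for w in g}
    sim = {u: {w for w in g if pred(g_attr[w])} for u, pred in q_pred.items()}

    def within(w, w2, k):
        return w2 in dist[w] and (k is None or dist[w][w2] <= k)

    changed = True
    while changed:
        changed = False
        for (u, u2), k in q_edges.items():
            # Keep w only if some match of u2 is reachable within the bound.
            keep = {w for w in sim[u]
                    if any(within(w, w2, k) for w2 in sim[u2])}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    if any(not matches for matches in sim.values()):
        return {u: set() for u in q_pred}
    return sim
```

With every bound set to 1 and predicates that test label equality, the same loop computes plain graph simulation, which is the special case noted in the lecture.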
  • Slide 24
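  • Bounded simulation in social graphs. The match S is the set of all suspects involved in a drug ring; it is a relation instead of a function, and maps edges to paths. [Figure: pattern graph with nodes B, AM, S, FW and edge bounds 3, 3, 1; data graph with nodes B, A1, ..., Am and W.]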
  • Slide 25
  • Complexity
    Input: pattern Q and data graph G. Output: Q(G), the unique maximum match relation, which always exists.
    Bounded simulation: O(|V||E| + |E_Q||V|^2 + |V_Q||V|), cubic time.
    Subgraph isomorphism: intractable. Graph simulation: O((|V| + |V_Q|)(|E| + |E_Q|)). The bounds are comparable, since Q is small in practice.
    Query-driven approximation: use bounded simulation instead of subgraph isomorphism, to identify sensible matches and be computable in low PTIME.
    Criteria: lower complexity; effectiveness (the query answers are sensible).
    Algorithm? See the reading list.
  • Slide 26
  • Bounded simulation vs. graph simulation
    Graph simulation is a special case of bounded simulation: the same bound 1 on all pattern edges (edge-to-edge mapping); unique attributes vs. search conditions (label equality).
    Complexity: O((|V_G| + |V_Q|)(|E_G| + |E_Q|)) vs. O(|V_G||E_G| + |E_Q||V_G|^2 + |V_Q||V_G|).
    Applications: process calculus, Web site classification, social position detection.
    Bounded simulation captures more sensible matches in social graphs (by 80%).
  • Slide 27
  • Homeomorphism and monomorphism
    Graph homeomorphism: G = (V, E) matches Q = (V_Q, E_Q) via an injective function from V_Q to V that maps edges to pairwise node-disjoint simple paths in G (a function rather than a relation; constraints on paths).
    Monomorphism revised: G = (V, E) matches Q = (V_Q, E_Q) via an injective function from V_Q to V that maps edges to nonempty paths in G.
    Intractable, even when Q is a tree and G is a DAG.
    Aim: strike a balance between expressive power and complexity.
  • Slide 28
  • Graph pattern matching: Incorporating edge relationships
  • Slide 29
  • Edge relationships: what is this pattern to find? Edge types: S (supervise), C (co-author). [Figure: data graph over Ann (CS), Pat (DB), John (DB), Mat (DB), Bill (Bio), Don (Gen), Tom (Bio) with edges labeled S or C; pattern over DB, CS, Bio with edges labeled C, C and S+.]
  • Slide 30
  • Edge relation: (Alice, Facebook), (Alice, Sunita), (Jose, Twitter), (Jose, Sunita), (Mikhail, Facebook), (Mikhail, Twitter), (Sunita, Facebook), (Sunita, Alice), (Sunita, Jose). [Figure: graph over Alice, Sunita, Jose, Mikhail, Twitter, Facebook.]
  • Slide 31
  • Graph encodings: adding edge types (edge labels): (Alice, fan-of, Facebook), (Alice, friend-of, Sunita), (Jose, fan-of, Twitter), (Jose, friend-of, Sunita), (Mikhail, fan-of, Facebook), (Mikhail, fan-of, Twitter), (Sunita, fan-of, Facebook), (Sunita, friend-of, Alice), (Sunita, friend-of, Jose). [Figure: graph over Alice, Sunita, Jose, Mikhail, Twitter, Facebook with edges labeled fan-of or friend-of.]
  • Slide 32
  • Graph encodings: adding weights: (Alice, fan-of, 0.5, Facebook), (Alice, friend-of, 0.9, Sunita), (Jose, fan-of, 0.5, Twitter), (Jose, friend-of, 0.3, Sunita), (Mikhail, fan-of, 0.8, Facebook), (Mikhail, fan-of, 0.7, Twitter), (Sunita, fan-of, 0.7, Facebook), (Sunita, friend-of, 0.9, Alice), (Sunita, friend-of, 0.3, Jose). Even further, you can add weights and other annotations. [Figure: graph over Alice, Sunita, Jose, Mikhail, Twitter, Facebook with typed, weighted edges.]
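One way to hold such typed, weighted edges in memory is a list of tuples plus an index keyed by source node and edge type. The sketch below uses the tuples from the slide; the names `Edge`, `build_index` and `neighbors`, and the `min_weight` filter, are illustrative assumptions rather than part of the lecture.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    kind: str      # edge type, e.g. "fan-of" or "friend-of"
    weight: float
    dst: str


EDGES = [
    Edge("Alice", "fan-of", 0.5, "Facebook"),
    Edge("Alice", "friend-of", 0.9, "Sunita"),
    Edge("Jose", "fan-of", 0.5, "Twitter"),
    Edge("Jose", "friend-of", 0.3, "Sunita"),
    Edge("Mikhail", "fan-of", 0.8, "Facebook"),
    Edge("Mikhail", "fan-of", 0.7, "Twitter"),
    Edge("Sunita", "fan-of", 0.7, "Facebook"),
    Edge("Sunita", "friend-of", 0.9, "Alice"),
    Edge("Sunita", "friend-of", 0.3, "Jose"),
]


def build_index(edges):
    """Group (dst, weight) pairs by (src, kind) for fast lookup."""
    index = defaultdict(list)
    for e in edges:
        index[(e.src, e.kind)].append((e.dst, e.weight))
    return index


def neighbors(index, src, kind, min_weight=0.0):
    """Targets of `kind` edges leaving `src` with weight >= min_weight."""
    return {dst for dst, w in index.get((src, kind), ()) if w >= min_weight}


INDEX = build_index(EDGES)
```

For example, `neighbors(INDEX, "Sunita", "friend-of", min_weight=0.5)` keeps only the friend-of edge with weight 0.9.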
  • Slide 33
  • Regular patterns: mapping edges to paths satisfying associated regular expressions. Pattern Q = (V_Q, E_Q, f_v, f_e), where f_v(u) is a conjunction of atoms A op a with op a comparison operator, and f_e(u, u') is a regular expression of the form F ::= c | c^{≤k} | c^+ | FF (c^{≤k}: bounded; c^+: unbounded). Simple regular expressions: fairly common; allow optimizing patterns (checking containment in linear time); low complexity in matching. [Figure: pattern over DB, CS, Bio with edges labeled C, C and S+.]
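Checking whether a path from v to v' satisfies such an expression can be sketched by processing the expression one atom at a time. The encoding below is an assumption made for illustration: an expression is a list of (color, bound) pairs, where bound 1 stands for a single edge of color c, an integer k for at most k edges of color c, and `None` for c+; edges are stored as a dictionary from a node to a list of (color, target) pairs. The function names and the example graph are hypothetical.

```python
def reach_atom(edges, starts, color, bound):
    """Nodes reachable from `starts` by a nonempty path of `color` edges
    of length at most `bound` (None means unbounded)."""
    reached = set()
    frontier = set(starts)
    steps = 0
    while frontier and (bound is None or steps < bound):
        nxt = {dst for src in frontier
               for c, dst in edges.get(src, ()) if c == color}
        frontier = nxt - reached   # only expand newly found nodes
        reached |= nxt
        steps += 1
    return reached


def matches_path(edges, v, v2, expr):
    """True if some path from v to v2 spells out the expression `expr`."""
    frontier = {v}
    for color, bound in expr:
        frontier = reach_atom(edges, frontier, color, bound)
    return v2 in frontier


# Hypothetical data: S = supervise, C = co-author.
EXAMPLE = {
    "Ann": [("S", "Pat")],
    "Pat": [("S", "John")],
    "John": [("C", "Bill")],
}
```

Here `matches_path(EXAMPLE, "Ann", "Bill", [("S", None), ("C", 1)])` asks for a chain of supervise edges followed by one co-author edge.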
  • Slide 34
  • Complexity
    Input: pattern Q and data graph G. Output: Q(G).
    Regular matching: O(|V||E| + m|E_Q||V|^2 + |V_Q||V|), where m is the number of distinct colors in Q.
    Bounded simulation is a special case: a single color c (hence m = 1), with f_e(u, u') = c^{≤k} (or c^+ when unbounded).
    Adding edge colors does not incur extra complexity. What about general regular expressions?
  • Slide 35
  • Graph pattern matching: Capturing graph topology
  • Slide 36
  • Limitations of graph simulation: what is wrong? A disconnected graph matches a connected pattern. The yellow node in the pattern has 3 parents, in contrast to 1 in the data graph. An undirected cycle matches a tree. Simulation does not preserve topology in matching. [Figure: pattern and data graph.]
  • Slide 37
  • Limitations of graph simulation. A cycle with two nodes matches a cycle of unbounded length. The match relation may be excessively large. Hence the need for revising simulation to enforce locality: when social distances increase, the closeness of relationships decreases. [Figure: pattern and data graph.]
  • Slide 38
  • Dual simulation: preserve parent relationships and connectivity. G = (V, E, f_A) matches Q = (V_Q, E_Q, f_v, f_e) via dual simulation if there exists a binary relation S ⊆ V_Q × V such that S is a total mapping, satisfies the search conditions, and preserves both child and parent relationships; that is, for each (u, v) ∈ S:
    each edge (u, u') in E_Q is mapped to an edge (v, v') in G with (u', v') ∈ S (child);
    each edge (u'', u) in E_Q is mapped to an edge (v'', v) in G with (u'', v'') ∈ S (parent).
    Q(G): a unique maximum match relation.
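Dual simulation differs from graph simulation only in that each pattern edge is checked in both directions. A naive fixpoint sketch, assuming label equality as the search condition, graphs given as dictionaries from a node to its set of successors, and the hypothetical name `dual_simulation`:

```python
def dual_simulation(q, g, q_label, g_label):
    """Maximum dual simulation: sim(u) for each pattern node u."""
    g_pred = {w: set() for w in g}
    for w, succs in g.items():
        for x in succs:
            g_pred[x].add(w)
    sim = {u: {w for w in g if g_label[w] == q_label[u]} for u in q}
    changed = True
    while changed:
        changed = False
        for u in q:
            for u2 in q[u]:  # pattern edge (u, u2)
                # Child condition: w needs a successor matching u2.
                keep = {w for w in sim[u] if g[w] & sim[u2]}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
                # Parent condition: w2 needs a predecessor matching u.
                keep2 = {w2 for w2 in sim[u2] if g_pred[w2] & sim[u]}
                if keep2 != sim[u2]:
                    sim[u2] = keep2
                    changed = True
    if any(not matches for matches in sim.values()):
        return {u: set() for u in q}
    return sim
```

For the pattern A → B and a data graph with an edge a1 → b1 plus an isolated B-labeled node b2, plain graph simulation would keep b2 as a match of B, while the parent condition here removes it.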
  • Slide 39
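  • Locality. Diameter d_Q: the maximum shortest distance between nodes of Q (over undirected paths). d_Q-radius subgraph G[v, d_Q]: the subgraph of G centered at v, within d_Q hops. Locality: matches are contained in G[v, d_Q] for some v. [Figure: data graph around a center v with distances 1 and 2 marked, showing an excessive match.]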
  • Slide 40
  • Strong simulation: duality plus locality. G matches Q via strong simulation if there exists a node v in G such that G[v, d_Q] matches Q via dual simulation.
    Match: the subgraph G_S of G[v, d_Q] representing the maximum match S: for each (u, v) in the maximum match S, v is in G_S; for each edge (u, u') in Q, (v, v') is in G_S if (u', v') ∈ S.
    Matching: given Q and G, find the set Q(G) of all matches.
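Putting the pieces together, strong simulation can be sketched as: compute the diameter d_Q of the pattern, extract the ball G[v, d_Q] around each node v, and run dual simulation inside it. The sketch follows only the statement on this slide; the published definition may impose further conditions on a match. It assumes a nonempty, connected pattern, label equality as the search condition, and hypothetical function names throughout.

```python
from collections import deque


def undirected(adj):
    nbrs = {n: set() for n in adj}
    for n, succs in adj.items():
        for m in succs:
            nbrs[n].add(m)
            nbrs[m].add(n)
    return nbrs


def hops(nbrs, src, limit=None):
    """Undirected BFS distances from src, cut off at `limit` hops."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        n = queue.popleft()
        if limit is not None and dist[n] == limit:
            continue
        for m in nbrs[n] - dist.keys():
            dist[m] = dist[n] + 1
            queue.append(m)
    return dist


def dual_simulation(q, g, q_label, g_label):
    g_pred = {w: set() for w in g}
    for w, succs in g.items():
        for x in succs:
            g_pred[x].add(w)
    sim = {u: {w for w in g if g_label[w] == q_label[u]} for u in q}
    changed = True
    while changed:
        changed = False
        for u in q:
            for u2 in q[u]:
                keep = {w for w in sim[u] if g[w] & sim[u2]}
                keep2 = {w2 for w2 in sim[u2] if g_pred[w2] & keep}
                if keep != sim[u] or keep2 != sim[u2]:
                    sim[u] = keep
                    # For a self-loop (u == u2) only the child filter is
                    # stored this round; the next round applies the rest.
                    sim[u2] = keep2 if u2 != u else keep
                    changed = True
    if any(not s for s in sim.values()):
        return {u: set() for u in q}
    return sim


def strong_simulation(q, g, q_label, g_label):
    """Map each center v to the maximum dual simulation inside G[v, d_Q]."""
    q_nbrs, g_nbrs = undirected(q), undirected(g)
    d_q = max(max(hops(q_nbrs, u).values()) for u in q)  # pattern diameter
    matches = {}
    for v in g:
        ball_nodes = set(hops(g_nbrs, v, d_q))
        ball = {n: g[n] & ball_nodes for n in ball_nodes}
        rel = dual_simulation(q, ball, q_label, g_label)
        if any(rel.values()):
            matches[v] = rel
    return matches
```

With a two-node cycle as the pattern, a two-node cycle in the data graph is matched, whereas a four-node cycle is not, because no ball of radius d_Q = 1 contains the whole cycle.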
  • Slide 41
  • Preserving the topology of patterns (what about graph simulation?)
    Child and parent relationships.
    Connectivity: if Q is connected (via undirected paths), so is G_S.
    Cycles: a directed (resp. undirected) cycle in Q matches a directed (resp. undirected) cycle in G_S.
    Bounded matches: the diameter of G_S is at most 2 * d_Q.
    Bounded number of matches: |M(Q, G)| ≤ |V|.
  • Slide 42
  • Strong simulation vs. graph simulation
    Input: pattern Q and data graph G. Output: Q(G).
    Complexity of strong simulation: O(|V|(|V| + (|V_Q| + |E_Q|)(|V| + |E|))), cubic time.
    Hierarchy: if G matches Q via subgraph isomorphism, then G matches Q via strong simulation; if via strong simulation, then via dual simulation; if via dual simulation, then via graph simulation.
    Dual simulation preserves topology, but not bounded matches. Graph simulation does not preserve parents, connectivity, undirected cycles, or bounded matches.
    Strong simulation: a balance between complexity and the ability to preserve topology.
  • Slide 43
  • Making strong simulation stronger?
    Bounded cycles: if G matches Q, then the longest simple cycle in G is no longer than its counterpart in Q.
    Bisimulation instead of simulation: find all subgraphs that are bisimilar to a pattern. For each (u, v) ∈ S, each edge (u, u') in E_Q is mapped to an edge (v, v') in G_S with (u', v') ∈ S, and each edge (v, v') in G_S is mapped to an edge (u, u') in E_Q with (u', v') ∈ S.
    Both extensions take matching from PTIME to intractable.
  • Slide 44
  • Summing up
  • Slide 45
  • Various notions for graph pattern matching. Query-driven approximation: from subgraph isomorphism (intractable) to strong simulation or bounded simulation (cubic time).
    matching               complexity       |M(Q, G)|
    subgraph isomorphism   NP-complete      |V|^|V_Q|
    graph simulation       quadratic time   |V||V_Q|
    bounded simulation     cubic time       |V||V_Q|
    regular matching       cubic time       |V||V_Q|
    strong simulation      cubic time       |V|
  • Slide 46
  • Summary
    Graph pattern matching: subgraph isomorphism, graph simulation, bounded simulation, regular matching, strong simulation, ...
    The study has raised as many questions as it has answered: querying both topology and data content; what query language should we use for social data analysis?; striking a balance between expressivity and complexity; a uniform framework for these.
    Reading: W. Fan. Graph Pattern Matching Revised for Social Network Analysis, ICDT 2012 (a survey of graph pattern matching).
  • Slide 47
  • Summary and review
    What is subgraph isomorphism? Complexity? Algorithm? Name a few applications.
    What is graph simulation? Complexity? Understand its algorithm. Name a few applications.
    Why do we need to revise conventional graph pattern matching for social network analysis? How should we do it? Why?
    Understand bounded simulation. Read its algorithm. Complexity?
    What is strong simulation? Complexity? Name a few applications in which strong simulation is useful.
    Find other revisions of conventional graph pattern matching that are not covered in the lecture.
  • Slide 48
  • Project (1): a development project. Recall bounded graph simulation.
    Implement an algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via bounded simulation.
    Develop optimization strategies.
    Experimentally evaluate your algorithm, especially its scalability with the size of G.
    Write a survey on revisions of conventional graph simulation, as related work.
  • Slide 49
  • Project (2): a research and development project. Recall graph simulation.
    Develop a MapReduce algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via graph simulation.
    Develop optimization strategies.
    Experimentally evaluate your algorithm, especially its scalability with the size of G.
    Write a survey on revisions of conventional graph simulation, as part of the related work.
  • Slide 50
  • Project (3): a development project. Recall subgraph isomorphism.
    Develop two algorithms that, given a pattern Q and a graph G, compute the maximum match of Q in G via subgraph isomorphism: one in MapReduce (see Lecture 4) and one in BSP (see Lecture 5).
    Develop optimization strategies to reduce parallel computational cost and data shipment cost.
    Experimentally evaluate your algorithms, especially their scalability with the size of G.
    Write a survey on parallel algorithms for subgraph isomorphism.
  • Slide 51
  • Papers for you to review
    M. R. Henzinger, T. Henzinger, and P. Kopke. Computing simulations on finite and infinite graphs. FOCS, 1995. http://infoscience.epfl.ch/record/99332/files/HenzingerHK95.pdf
    L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004. (search Google Scholar)
    A. Fard, M. U. Nisar, J. A. Miller, and L. Ramaswamy. Distributed and scalable graph pattern matching: models and algorithms. Int. J. Big Data. http://cobweb.cs.uga.edu/~ar/papers/IJBD_final.pdf
    W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph pattern matching: From intractable to polynomial time. VLDB, 2010.
    W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries. ICDE, 2011.
    S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Strong simulation: Capturing topology in graph pattern matching. TODS 39(1): 4, 2014.