Jure Leskovec Machine Learning Department Carnegie Mellon University

Embed Size (px)

Citation preview

  • Slide 1
  • Jure Leskovec Machine Learning Department Carnegie Mellon University
  • Slide 2
  • Today: Large on-line systems have detailed records of human activity On-line communities: Facebook (64 million users, billion dollar business) MySpace (300 million users) Communication: Instant Messenger (~1 billion users) News and Social media: Blogging (250 million blogs world-wide, presidential candidates run blogs) On-line worlds: World of Warcraft (internal economy 1 billion USD) Second Life (GDP of 700 million USD in 07) Opportunities for impact in science and industry 2
  • Slide 3
  • b) Internet (AS) c) Social networks a) World wide web d) Communicatione) Citationsf) Protein interactions 3
  • Slide 4
  • We know lots about the network structure: Properties: Scale free [Barabasi 99], 6-degrees of separation [Milgram 67], Navigation [Adamic-Adar 03, LibenNowell 05], Bipartite cores [Kumar et al. 99], Network motifs [Milo et al. 02], Communities [Newman 99], Conductance [Mihail-Papadimitriou-Saberi 06], Hubs and authorities [Page et al. 98, Kleinberg 99] Models: Preferential attachment [Barabasi 99], Small- world [Watts-Strogatz 98], Copying model [Kleinberg el al. 99], Heuristically optimized tradeoffs [Fabrikant et al. 02], Congestion [Mihail et al. 03], Searchability [Kleinberg 00], Bowtie [Broder et al. 00], Transit-stub [Zegura 97], Jellyfish [Tauro et al. 01] We know much less about processes and dynamics of networks 4
  • Slide 5
  • Network Dynamics: Network evolution How network structure changes as the network grows and evolves? Diffusion and cascading behavior How do rumors and diseases spread over networks? 5
  • Slide 6
  • We need massive network data for the patterns to emerge: MSN Messenger network [WWW 08, Nature 08] (the largest social network ever analyzed) 240M people, 255B messages, 4.5 TB data Product recommendations [EC 06] 4M people, 16M recommendations Blogosphere [work in progress] 60M posts, 120M links 6
  • Slide 7
  • Diffusion & Cascades Network Evolution Patterns & Observations Q1: What do cascades look like? Trees? Stars? Chains? Q4: How does network structure change as the network grows/evolves? Models & Algorithms Q2: How do we find influential nodes? Q3: How do we quickly detect epidemics? Q5: How do we generate realistic looking and evolving networks? 7
  • Slide 8
  • Diffusion & Cascades Network Evolution Patterns & Observations Q1: What do cascades look like? Trees? Stars? Chains? Q4: How does network structure change as the network grows/evolves? Models & Algorithms Q2: How do we find influential nodes? Q3: How do we quickly detect epidemics? Q5: How do we generate realistic looking and evolving networks? 8
  • Slide 9
  • Behavior that cascades from node to node like an epidemic News, opinions, rumors Word-of-mouth in marketing Infectious diseases As activations spread through the network they leave a trace a cascade Cascade (propagation graph) Network 9
  • Slide 10
  • People send and receive product recommendations, purchase products Data: Large online retailer: 4 million people, 16 million recommendations, 500k products 10% credit10% off 10 [w/ Adamic-Huberman, EC 06]
  • Slide 11
  • Bloggers write posts and refer (link) to other posts and the information propagates Data: 10.5 million posts, 16 million links 11 [w/ Glance-Hurst et al., SDM 07]
  • Slide 12
  • Viral marketing cascades are more social: Collisions (no summarizers) Richer non-tree structures Are they stars? Chains? Trees? Information cascades (blogosphere): Viral marketing (DVD recommendations): (ordered by frequency) 12 propagation [w/ Kleinberg-Singh, PAKDD 06]
  • Slide 13
  • Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes. Viral marketing cascades more social. Q4: How does network structure change as the network grows/evolves? Models & Algorithms Q2: How do we find influential nodes? Q3: How do we quickly detect epidemics? Q5:How do we generate realistic looking and evolving networks? 13
  • Slide 14
  • Blogs information epidemics Which are the influential/infectious blogs? Viral marketing Who are the trendsetters? Influential people? Disease spreading Where to place monitoring stations to detect epidemics? 14
  • Slide 15
  • How to quickly detect epidemics as they spread? 15 c1c1 c2c2 c3c3 [w/ Krause-Guestrin et al., KDD 07] (best student paper)
  • Slide 16
  • Cost: Cost of monitoring is node dependent Reward: Minimize the number of affected nodes: If A are the monitored nodes, let R(A) denote the number of nodes we save We also consider other rewards: Minimize time to detection Maximize number of detected outbreaks R(A) A 16 ( ) [w/ Krause-Guestrin et al., KDD 07] (best student paper)
  • Slide 17
  • Reward for detecting cascade i Given: Graph G(V,E), budget M Data on how cascades C 1, , C i, ,C K spread over time Select a set of nodes A maximizing the reward subject to cost(A) M Solving the problem exactly is NP-hard Max-cover [Khuller et al. 99] 17
  • Slide 18
  • Problem structure: Submodularity of the reward functions CELF: algorithm with approximate guarantee Speed up: Lazy evaluation to speed-up CELF We extend the result of [Kempe-Kleinberg-Tardos 03] 18 [w/ Krause-Guestrin et al., KDD 07] (best student paper)
  • Slide 19
  • Theorem: Reward function R is submodular (diminishing returns, think of it as convexity) 19 (best student paper) [w/ Krause-Guestrin et al., KDD 07] Gain of adding a node to a small setGain of adding a node to a large set R(A {u}) R(A) R(B {u}) R(B) A B S1S1 S2S2 Placement A={S 1, S 2 } S New monitored node: Adding S helps a lot S2S2 S4S4 S1S1 S3S3 Placement B={S 1, S 2, S 3, S 4 } S Adding S helps very little
  • Slide 20
  • We develop CELF (cost-effective lazy forward-selection) algorithm: Two independent runs of a modified greedy Solution set A: ignore cost, greedily optimize reward Solution set A: greedily optimize reward/cost ratio Pick best of the two: arg max(R(A), R(A)) Theorem: If R is submodular then CELF is near optimal CELF achieves (1-1/e) factor approximation For the size of our problems nave CELF is too slow 20 (best student paper) [w/ Krause-Guestrin et al., KDD 07]
  • Slide 21
  • Observation: Submodularity guarantees that marginal rewards decrease with the solution size Idea: Use marginal reward from previous step as a upper bound on current marginal reward d Marginal reward a b c d e 21
  • Slide 22
  • CELF algorithm: Keep an ordered list of marginal rewards r i from previous step Re-evaluate r i only for the top node a b c a b c d d e e Marginal reward 22
  • Slide 23
  • CELF algorithm: Keep an ordered list of marginal rewards r i from previous step Re-evaluate r i only for the top node a a b c d dbce e Marginal reward 23
  • Slide 24
  • CELF algorithm: Keep an ordered list of marginal rewards r i from previous step Re-evaluate r i only for the top node a c a b c d d b Marginal reward e e 24
  • Slide 25
  • For more info see our website: www.blogcascade.org Which blogs should one read to catch big stories? CELF In-links Random # posts Out-links Number of selected blogs (sensors) Reward (higher is better) 25 (used by Technorati)
  • Slide 26
  • CELF Greedy Exhaustive search CELF runs 700x faster than simple greedy algorithm Number of selected blogs (sensors) Run time (seconds) (lower is better) 26
  • Slide 27
  • Given: a real city water distribution network data on how contaminants spread over time Place sensors (to save lives) Problem posed by the US Environmental Protection Agency SS 27 c1c1 c2c2 [w/ Krause et al., J. of Water Resource Planning]
  • Slide 28
  • Our approach performed best at the Battle of Water Sensor Networks competition CELF Population Random Flow Degree AuthorScore CMU (CELF) 26 Sandia21 U Exter20 Bentley systems19 Technion (1)14 Bordeaux12 U Cyprus11 U Guelph7 U Michigan4 Michigan Tech U3 Malcolm2 Proteo2 Technion (2)1 Number of placed sensors Population saved (higher is better) 28 [w/ Ostfeld et al., J. of Water Resource Planning]
  • Slide 29
  • Diffusion & Cascades Network Evolution Patterns & Observations A1: Cascade shapes. Viral marketing cascades more social. Q4: How does network structure change as the network grows/evolves? Models & Algorithms A2, A3: CELF algorithm for detecting cascades and outbreaks Q5: How do we generate realistic looking and evolving networks? 29
  • Slide 30
  • Empirical findings on real graphs led to new network models Such models make assumptions/predictions about other network properties What about network evolution? 30 log degree log prob. Model Power-law degree distributionPreferential attachment Explains
  • Slide 31
  • Networks are denser over time Densification Power Law: a densification exponent (1 a 2) What is the relation between the number of nodes and the edges over time? Prior work assumes: constant average degree over time Internet Citations a=1.2 a=1.6 N(t) E(t) N(t) E(t) 31 [w/ Kleinberg-Faloutsos, KDD 05] (best research paper)
  • Slide 32
  • Prior models and intuition say that the network diameter slowly grows (like log N, log log N) time diameter size of the graph Internet Citations Diameter shrinks over time as the network grows the distances between the nodes slowly decrease 32 [w/ Kleinberg-Faloutsos, KDD 05] (best research paper)
  • Slide 33
  • Want to generate realistic networks: Why synthetic graphs? Anomaly detection, Simulations, Predictions, Null-model, Sharing privacy sensitive graphs, Q: What is a good model that we can fit to the data? A: Next slide. Q: Which network properties do we care about? A: Dont commit, lets match adjacency matrices 33 Compare graphs properties, e.g., degree distribution Given a real network Generate a synthetic network
  • Slide 34
  • We prove Kronecker graphs mimic real graphs: Power-law degree distribution, Densification, Shrinking/stabilizing diameter, Spectral properties Initiator (9x9) (3x3) (27x27) Kronecker product of graph adjacency matrices 34 [w/ Chakrabarti-Kleinberg-Faloutsos, PKDD 05] p ij Edge probability
  • Slide 35
  • Maximum likelihood estimation Nave estimation takes O(N!N 2 ): N! for different node labelings: Our solution: Metropolis sampling: N! (big) const N 2 for traversing graph adjacency matrix Our solution: Kronecker product (E