Web as a graph

Anna Karpovsky

Page 1: Web as a graph

Computer Science

Web as a graph

Anna Karpovsky

Page 2: Web as a graph

Motivation

Is the Web graph really random? Can it be described by the Erdos-Renyi or Watts-Strogatz models?

The Web graph is a fascinating object of study:

  • Unregulated growth
  • Variety
  • Improved Web algorithms
  • Sociological information

Anna Karpovsky:

Examples: Web search, more accurate topic-classification algorithms, and enumeration of emergent cyber-communities.

Page 3: Web as a graph

Outline

  • HITS algorithm + Trawling
  • Web graph properties
  • Random graph models

Anna Karpovsky:

HITS and trawling are both driven by the presence of certain structures in the Web graph. These structures appear to be a fundamental by-product of the manner in which Web content is created.

Page 4: Web as a graph

HITS Algorithm: revisited

Sampling step:

  • Root set (200 pages)
  • Base set (1000-3000 pages)

Weight-propagation step:

  • Authority weight
  • Hub weight
  • Compact way

Anna Karpovsky:

The updates amount to power iteration on AA^T, which converges to the principal eigenvector, so the weights are an intrinsic feature of the collection of linked pages. Pages with large weights represent a very dense pattern of linkage from pages of large hub weight to pages of large authority weight.
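As a rough illustration of the weight-propagation step, the sketch below applies the standard HITS updates: the authority weight of a page is the sum of the hub weights of the pages linking to it, and the hub weight of a page is the sum of the authority weights of the pages it links to, with normalization after each round. The function name `hits`, the adjacency-matrix input format, and the iteration count are assumptions made for this example; they are not taken from the slides.

```python
import math

def hits(adj, iterations=50):
    """adj[p][q] is truthy when page p links to page q (assumed input format)."""
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update (a = A^T h): sum hub weights of pages linking to p.
        auths = [sum(hubs[q] for q in range(n) if adj[q][p]) for p in range(n)]
        # Hub update (h = A a): sum authority weights of pages that p links to.
        hubs = [sum(auths[q] for q in range(n) if adj[p][q]) for p in range(n)]
        # Normalize both vectors to unit Euclidean length.
        a_norm = math.sqrt(sum(a * a for a in auths)) or 1.0
        h_norm = math.sqrt(sum(h * h for h in hubs)) or 1.0
        auths = [a / a_norm for a in auths]
        hubs = [h / h_norm for h in hubs]
    return hubs, auths

# Toy graph: pages 0 and 1 link to pages 2 and 3; page 4 links to page 2 only.
links = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
hubs, auths = hits(links)
```

In this toy graph, page 2 ends up with the largest authority weight and pages 0 and 1 with the largest hub weights, which matches the dense hub-to-authority pattern described in the note.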

Page 5: Web as a graph

Trawling Algorithm

Definitions:

  • Complete bipartite clique
  • Bipartite core

On any sufficiently well-represented topic on the Web, there will be a bipartite core in the Web graph.
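The slide names the two definitions without spelling them out. Under the usual reading in the trawling literature (an assumption here), a complete bipartite clique K_{i,j} has i "fan" pages that each link to all of j "center" pages, and a bipartite core C_{i,j} is a subgraph on i + j pages that contains such a clique. A minimal sketch of the membership check, with the hypothetical helper `is_bipartite_core` and an out-link dictionary as the assumed input format:

```python
def is_bipartite_core(out_links, fans, centers):
    """True when every fan links to every center.

    out_links maps a page to the collection of pages it links to
    (assumed format). Extra links are allowed; only the fan-to-center
    edges are required.
    """
    centers = set(centers)
    return all(centers <= set(out_links.get(fan, ())) for fan in fans)

# Hypothetical example: three fan pages that all cite the same two centers.
web = {
    "fan1": ["siteA", "siteB", "other"],
    "fan2": ["siteA", "siteB"],
    "fan3": ["siteB", "siteA"],
    "fan4": ["siteA"],
}
core_found = is_bipartite_core(web, ["fan1", "fan2", "fan3"], ["siteA", "siteB"])
not_a_core = is_bipartite_core(web, ["fan1", "fan4"], ["siteA", "siteB"])
```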

Page 6: Web as a graph

Elimination-generation paradigm

  • Elimination: necessary conditions (elimination filters)
  • Generation: identify barely-qualifying nodes (generation filter)
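One necessary condition that can serve as an elimination filter, stated here as an assumption since the slide does not spell it out: a fan of a C_{i,j} core needs out-degree at least j, and a center needs in-degree at least i. The sketch below prunes edges repeatedly until every surviving edge passes both tests. The function name `prune_for_core` and the edge-list input are hypothetical.

```python
def prune_for_core(edges, i, j):
    """Iteratively drop edges whose source cannot be a fan (out-degree < j)
    or whose target cannot be a center (in-degree < i) of an (i, j) core."""
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for src, dst in edges:
            out_deg[src] = out_deg.get(src, 0) + 1
            in_deg[dst] = in_deg.get(dst, 0) + 1
        kept = {
            (src, dst)
            for src, dst in edges
            if out_deg[src] >= j and in_deg[dst] >= i
        }
        if kept == edges:  # fixed point: nothing more can be eliminated
            return kept
        edges = kept

# Three fans (a, b, c) pointing at two centers (x, y), plus two stray links.
toy_edges = [
    ("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "x"), ("c", "y"),
    ("d", "x"),  # d has out-degree 1, so it cannot be a fan for j = 2
    ("a", "z"),  # z has in-degree 1, so it cannot be a center for i = 3
]
survivors = prune_for_core(toy_edges, i=3, j=2)
```

Pruning is only a filter: the surviving edges are candidates, and a generation step would still have to confirm actual cores among them.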

Page 7: Web as a graph

Properties

Degree distribution

  • Prob(D = i) is proportional to 1/i^a (power law)
  • The probability of finding documents with a large number of links, and of finding very popular addresses, is rather significant

Anna Karpovsky:

The probability of finding documents with a large number of links is rather significant, the network connectivity being dominated by highly connected Web pages. The probability of finding very popular addresses, to which a large number of other documents point, is non-negligible, an indication of the flocking sociology of the WWW. While the owner of each Web page has complete freedom in choosing the number of links on a document and the addresses to which they point, the overall system obeys scaling laws characteristic only of highly interactive self-organized systems and critical phenomena.
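To make "rather significant" concrete, the sketch below compares the tail of a power-law degree distribution with the tail of a Poisson distribution. The exponent 2.1, the Poisson mean 7.2, the degree threshold 50, and the truncation at degree 10,000 are illustrative assumptions chosen for this example, as are the helper names.

```python
import math

def power_law_tail(a, threshold, max_degree=10_000):
    """P(D >= threshold) when Prob(D = i) is proportional to 1 / i**a,
    truncated at max_degree (an assumption to keep the sum finite)."""
    weights = [i ** -a for i in range(1, max_degree + 1)]
    return sum(weights[threshold - 1:]) / sum(weights)

def poisson_tail(mean, threshold, terms=500):
    """P(D >= threshold) for a Poisson distribution, summed term by term
    in log space to avoid overflow in the factorial."""
    log_mean = math.log(mean)
    return sum(
        math.exp(-mean + k * log_mean - math.lgamma(k + 1))
        for k in range(threshold, threshold + terms)
    )

heavy = power_law_tail(2.1, 50)
light = poisson_tail(7.2, 50)
```

Under these assumptions the power-law tail is many orders of magnitude larger than the Poisson tail, which is why highly linked pages are far more common than a traditional random graph would predict.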

Page 8: Web as a graph

Properties

Number of bipartite cores

  • Experiments generated well over 100,000 bipartite cores

Diameter of the web graph = 19 links

Page 9: Web as a graph

Random Graph Models

Erdos-Renyi Model

  • N nodes; each pair of nodes is connected with probability p

Watts-Strogatz Model

  • N nodes form a regular lattice. With probability p, each edge is rewired randomly.
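Both models can be sketched in a few lines. The function names, the ring lattice with k neighbours on each side, and the rule of rewiring one endpoint to a uniformly chosen non-neighbour are assumptions made for illustration; the slide only fixes N and p.

```python
import itertools
import random

def erdos_renyi(n, p, rng):
    """Each of the n*(n-1)/2 unordered pairs becomes an edge with probability p."""
    return {
        (u, v)
        for u, v in itertools.combinations(range(n), 2)
        if rng.random() < p
    }

def watts_strogatz(n, k, p, rng):
    """Ring lattice (each node joined to its k nearest neighbours on each side,
    assuming n > 2 * k), then each lattice edge is rewired with probability p."""
    edges = {frozenset((u, (u + s) % n)) for u in range(n) for s in range(1, k + 1)}
    for u in range(n):
        for s in range(1, k + 1):
            old = frozenset((u, (u + s) % n))
            if old in edges and rng.random() < p:
                # Candidate new endpoints: no self-loops, no duplicate edges.
                free = [w for w in range(n) if w != u and frozenset((u, w)) not in edges]
                if free:
                    edges.remove(old)
                    edges.add(frozenset((u, rng.choice(free))))
    return edges

rng = random.Random(0)
er_graph = erdos_renyi(10, 0.3, rng)
lattice = watts_strogatz(20, 2, 0.0, rng)
small_world = watts_strogatz(20, 2, 0.3, rng)
```

Rewiring moves edges without adding or removing any, so the number of edges stays at N times k for every value of p.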

Page 10: Web as a graph

Traditional Random Graph Models

Random graph:

  • Degree distribution: Poisson or binomial
  • Number of bipartite cores: negligible
  • Number of vertices: fixed
  • Connectivity: random and uniform

Page 11: Web as a graph

Copying process

  • Create and delete nodes at random
  • Linear growth: links are available right away
  • Exponential growth: only see the previous “epochs” of pages
  • With some probability, b, add k edges from v to random nodes
  • With probability 1-b, copy k edges from a randomly chosen node to v
  • Two probabilistic processes: which node to copy from and how many edges to copy

Anna Karpovsky:

Intuition: an author who decides to create a new Web page is more likely to choose a larger topic. A new viewpoint on the topic will probably link to many pages “within” the topic, but will also probably introduce a new spin on the topic, linking to some new pages whose connection to the topic was previously unrecognized.
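A minimal sketch of the linear-growth copying step, read literally from the slide: each new node v either adds k edges to uniformly random existing nodes (probability b) or copies the k out-links of a randomly chosen existing node (probability 1-b). The seed graph, the single coin flip per node, and the function name `copying_model` are assumptions; node deletion and the exponential-growth variant are not modelled.

```python
import random

def copying_model(steps, k, b, rng):
    """Grow a directed graph one node at a time; returns node -> list of out-links."""
    # Assumed seed: k + 1 nodes, each linking to the k others.
    out_links = {u: [w for w in range(k + 1) if w != u] for u in range(k + 1)}
    for v in range(k + 1, k + 1 + steps):
        existing = list(out_links)  # all nodes created before v
        if rng.random() < b:
            # Uniform choice: k distinct random targets.
            targets = rng.sample(existing, k)
        else:
            # Copying: reuse the out-links of a randomly chosen prototype.
            prototype = rng.choice(existing)
            targets = list(out_links[prototype])
        out_links[v] = targets
    return out_links

net = copying_model(steps=50, k=3, b=0.5, rng=random.Random(1))
pure_copy = copying_model(steps=30, k=2, b=0.0, rng=random.Random(2))
```

With b = 0 every new node duplicates an existing link list, so nodes pile up behind the same targets; that is the mechanism that produces many bipartite cores in this family of models.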

Page 12: Web as a graph

ACL (Aiello, Chung, Lu)

  • The degree sequence is given by a power law, which fixes the number of vertices and edges
  • There are y vertices of degree x, where log y = a - b log x; a is the logarithm of the size of the graph and b can be regarded as the log-log growth rate of the graph
  • A set is constructed with as many copies of each vertex as its degree
  • A random matching in this set is chosen

Anna Karpovsky:

The power law for degrees is an intrinsic feature of the model rather than an emergent one.

The model does not explain the large number of bipartite cliques observed in the Web.

It is not clear how to adapt it to an evolving graph.
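The construction can be sketched as follows, assuming the degree counts follow y = exp(a) / x^b rounded down to an integer and degrees run from 1 up to exp(a/b). The function name `acl_graph` and the handling of an odd number of copies (one copy is dropped) are assumptions; self-loops and repeated edges are kept, since a random matching can produce them.

```python
import math
import random

def acl_graph(a, b, rng):
    """Return (number_of_vertices, edges) for a random matching on vertex copies."""
    copies = []
    vertex = 0
    max_degree = int(math.exp(a / b))
    for x in range(1, max_degree + 1):
        y = int(math.exp(a) / x ** b)  # number of vertices of degree x
        for _ in range(y):
            copies.extend([vertex] * x)  # as many copies as the degree
            vertex += 1
    if len(copies) % 2:
        copies.pop()  # assumption: drop one copy so a perfect matching exists
    rng.shuffle(copies)
    # Pair consecutive copies: a uniformly random perfect matching.
    edges = list(zip(copies[0::2], copies[1::2]))
    return vertex, edges

n_vertices, acl_edges = acl_graph(5.0, 2.5, random.Random(0))
```

Because the degree sequence is fixed before any edge is drawn, the power law is built in rather than emerging from a growth process.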

Page 13: Web as a graph

Scale-free Model

Preferential connectivity

  • Higher probability of being linked to a vertex that already has a large number of connections: Π(k_i) = k_i / Σ_j k_j
  • Independent of time

Anna Karpovsky:

The observed power law describes systems of different sizes at different stages of their development.
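A minimal sketch of growth with preferential connectivity: each new vertex attaches m edges, picking targets with probability proportional to their current degree. Sampling from a list that holds each vertex once per incident edge gives exactly that weighting. The seed graph, the parameter m, and the function name are assumptions; rejecting repeated targets makes the sampling only approximately proportional.

```python
import random

def preferential_attachment(n, m, rng):
    """Return the edge list of a graph grown to n vertices, m edges per new vertex."""
    # Assumed seed: a complete graph on m + 1 vertices.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Each vertex appears in this list once per incident edge.
    endpoints = [x for edge in edges for x in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Degree-proportional choice; duplicates are simply redrawn.
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

ba_edges = preferential_attachment(200, 2, random.Random(0))
```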

Page 14: Web as a graph

Analysis on number of cliques

Evolving copying models

  • There are many (t^) large cliques
  • Idea: call vt’ a leader if at least one of its d out-links is chosen uniformly. Call v a duplicator if it copies all d of its out-links. In each epoch there is some probability that at least one vertex copies from vt’, so the expected number of duplicators of vt’ can be derived. vt’ and its duplicators form a complete bipartite subgraph.
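The last step of the idea can be checked mechanically: any group of i nodes that share the same d out-links (a leader plus its duplicators) forms a complete bipartite subgraph with i fans and d centers. The helper below groups nodes by their out-link set; its name and the dictionary input format are hypothetical.

```python
from collections import defaultdict

def duplicate_groups(out_links, d):
    """Map each shared set of d targets to the nodes that all point at it.

    Only groups with at least two nodes are returned, i.e. a leader with
    at least one duplicator.
    """
    groups = defaultdict(list)
    for node, targets in out_links.items():
        if len(set(targets)) == d:
            groups[frozenset(targets)].append(node)
    return {targets: nodes for targets, nodes in groups.items() if len(nodes) >= 2}

# Hypothetical example: node 3 is the leader, nodes 5 and 6 copied all its links.
toy = {
    3: [0, 1],
    4: [1, 2],
    5: [0, 1],
    6: [1, 0],
}
groups = duplicate_groups(toy, d=2)
```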

Page 15: Web as a graph

Analysis on number of cliques

Evolving uniform model

  • Number of Cij is negligible for ij > i+j
  • Idea: observed out-degree = 7.2

Page 16: Web as a graph

Analysis on number of cliques

Cliques in the ACL model

  • Number of Cij is constant for i > 2/(b-2)
  • Idea: sum, over all i-tuples and j-tuples of vertices, the probability that all the edges between them exist. We know that the maximum degree of a vertex is given by exp(a/b) (since 0 <= log y = a - b log x) and that the probability that a vertex has degree d is proportional to exp(a)/d^b, where a and b are the parameters of the ACL power law.
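The maximum-degree bound follows directly from the power-law relation. A short derivation, assuming b > 0 and that a degree x can occur only if at least one vertex has it (y >= 1):

```latex
\begin{align*}
y = \frac{e^{a}}{x^{b}} \quad &\Longleftrightarrow \quad \log y = a - b \log x, \\
y \ge 1 \quad &\Longleftrightarrow \quad \log y \ge 0
  \quad \Longleftrightarrow \quad \log x \le \frac{a}{b}
  \quad \Longleftrightarrow \quad x \le e^{a/b}.
\end{align*}
```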

Page 17: Web as a graph

Comments

  • Links are not invariant in time
  • Documents are not stable
  • Hierarchical structure of the Web pages