Computer Science
Web as a graph
Anna Karpovsky
Motivation
Is the Web graph really random? Can it be described by the Erdos-Renyi or Watts-Strogatz models?
The Web graph is a fascinating object of study
  Unregulated growth
  Variety
  Improved Web algorithms
  Sociological information
Anna Karpovsky:
Examples: Web search, more accurate topic-classification algorithms, and enumerating emergent cyber-communities.
Outline
HITS algorithm + Trawling
Web graph properties
Random graph models
Anna Karpovsky:
HITS and trawling are both driven by the presence of certain structures in the Web graph. These structures appear to be a fundamental by-product of the manner in which Web content is created.
HITS Algorithm: revisited
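Sampling step:
  Root set (200 pages)
  Base set (1000-3000 pages)
Weight-propagation step
  Authority weight: $x_p = \sum_{q : q \to p} y_q$
  Hub weight: $y_p = \sum_{q : p \to q} x_q$
  Compact way: $x \leftarrow A^T y$, $y \leftarrow A x$, where $A$ is the adjacency matrix of the base set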
Anna Karpovsky:
Power iteration on $AA^T$: the weights converge to the principal eigenvector, so they are an intrinsic feature of the collection of linked pages. Pages with large weights represent a very dense pattern of linkage from pages of large hub weight to pages of large authority weight.
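The weight-propagation step can be sketched as a short power iteration. This is a minimal illustration, not the presenter's code: the function name `hits`, the edge-list input, the fixed iteration count, and the toy graph are all assumptions made for the example.

```python
import math

def hits(edges, num_pages, iterations=50):
    """Power iteration for HITS on a directed graph given as (source, target) pairs."""
    authority = [1.0] * num_pages
    hub = [1.0] * num_pages
    for _ in range(iterations):
        # Authority update: sum the hub weights of the pages that link to each page.
        new_authority = [0.0] * num_pages
        for source, target in edges:
            new_authority[target] += hub[source]
        # Hub update: sum the new authority weights of the pages each page links to.
        new_hub = [0.0] * num_pages
        for source, target in edges:
            new_hub[source] += new_authority[target]
        # Normalize to unit length so the weights stay bounded.
        a_norm = math.sqrt(sum(w * w for w in new_authority)) or 1.0
        h_norm = math.sqrt(sum(w * w for w in new_hub)) or 1.0
        authority = [w / a_norm for w in new_authority]
        hub = [w / h_norm for w in new_hub]
    return authority, hub

# Hypothetical toy graph: pages 0 and 1 link to pages 2 and 3; page 4 links only to 3.
toy_edges = [(0, 2), (0, 3), (1, 2), (1, 3), (4, 3)]
authority, hub = hits(toy_edges, num_pages=5)
```

In this toy graph page 3 ends up with the largest authority weight, and pages 0 and 1 with the largest hub weights, which matches the dense hub-to-authority linkage described in the note.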
Trawling Algorithm
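Definitions:
  Complete bipartite clique $K_{i,j}$: every one of i nodes has an edge directed to each of j nodes
  Bipartite core $C_{i,j}$: a graph on i + j nodes that contains at least one $K_{i,j}$ as a subgraph
On any sufficiently well-represented topic on the Web, there will be a bipartite core in the Web graph.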
Elimination-generation paradigm
Elimination
  Necessary conditions: elimination filters
Generation
  Identify barely-qualifying nodes: generation filter
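One possible elimination filter can be sketched as iterative degree pruning. The sketch assumes a core consists of i "fan" pages that all point to the same j "center" pages, so a fan needs out-degree at least j and a center needs in-degree at least i. The function name `prune_for_cores` and the page labels are hypothetical.

```python
def prune_for_cores(edges, i, j):
    """Iteratively drop links that cannot belong to any core with i fans and j centers.

    A link (fan, center) survives only while the fan still has out-degree >= j
    and the center still has in-degree >= i. Pruning repeats until nothing changes.
    """
    edges = set(edges)
    while True:
        out_degree, in_degree = {}, {}
        for fan, center in edges:
            out_degree[fan] = out_degree.get(fan, 0) + 1
            in_degree[center] = in_degree.get(center, 0) + 1
        kept = {(fan, center) for fan, center in edges
                if out_degree[fan] >= j and in_degree[center] >= i}
        if kept == edges:
            return kept
        edges = kept

# Hypothetical example: a 2-by-2 core plus two stray links that get filtered out.
core = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2")}
stray = {("x", "c1"), ("f1", "z")}
survivors = prune_for_cores(core | stray, i=2, j=2)
```

The filter only applies a necessary condition: the surviving links are candidates for cores, not a guarantee that a core exists.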
Properties
Degree distribution
  $\Pr(D = i) \propto 1/i^{a}$: power law
  The probability of finding documents with a large number of links, and of finding very popular addresses, is rather significant
Anna Karpovsky:
The probability of finding documents with a large number of links is rather significant, the network connectivity being dominated by highly connected web pages. The probability of finding very popular addresses, to which a large number of other documents point, is non-negligible, an indication of the flocking sociology of the WWW. While the owner of each web page has complete freedom in choosing the number of links on a document and the addresses to which they point, the overall system obeys scaling laws characteristic only of highly interactive self-organized systems and critical phenomena.
Properties
Number of bipartite cores
  Experiments generated well over 100,000 bipartite cores
Diameter of the Web graph = 19 links
Random Graph Models
Erdos-Renyi Model
  N nodes; each pair of nodes is connected with probability p
Watts-Strogatz Model
  N nodes form a regular lattice. With probability p, each edge is rewired randomly.
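Both models are simple enough to sample directly. The sketch below is illustrative only: the function names, the ring lattice with k nearest neighbours, and the choice to keep one endpoint fixed when rewiring are assumptions, not details given on the slide.

```python
import random

def erdos_renyi(n, p, rng):
    """Edge set of a random graph: each unordered pair is included with probability p."""
    return {(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p}

def watts_strogatz(n, k, p, rng):
    """Ring lattice where each node links to its k nearest neighbours (k even, n > k),
    after which each lattice edge is rewired with probability p."""
    edges = {tuple(sorted((u, (u + step) % n)))
             for u in range(n) for step in range(1, k // 2 + 1)}
    for u, v in sorted(edges):  # iterate over a snapshot of the original lattice edges
        if rng.random() < p:
            # Keep endpoint u and pick a new partner, avoiding self-loops and duplicates.
            candidates = [w for w in range(n)
                          if w != u and tuple(sorted((u, w))) not in edges]
            if candidates:
                edges.remove((u, v))
                edges.add(tuple(sorted((u, rng.choice(candidates)))))
    return edges

rng = random.Random(0)
random_graph = erdos_renyi(100, 0.05, rng)
small_world = watts_strogatz(100, 4, 0.1, rng)
```

Rewiring preserves the number of edges, so the Watts-Strogatz graph always has n * k / 2 edges regardless of p.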
Traditional Random Graph Models
Property: Random graph
  Degree distribution: Poisson or binomial
  Number of bipartite cores: Negligible
  Number of vertices: Fixed
  Connectivity: Random and uniform
Copying process
Create and delete nodes at random
  Linear growth: links available right away
  Exponential growth: only see the previous “epochs” of pages
With some probability, b, add k edges from v to random nodes
With probability 1-b, copy k edges from a randomly chosen node to v
Two probabilistic processes: which to copy from and how many to copy
Anna Karpovsky:
Intuition: an author who decides to create a new web page is more likely to choose a larger topic. A new viewpoint on the topic will probably link to many pages “within” the topic, but will also probably introduce a new spin on it, linking to some new pages whose connection to the topic was previously unrecognized.
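A linear-growth version of the copying process can be sketched in a few lines. The sketch makes assumptions the slide leaves open: the coin with probability b is flipped per link rather than per page, node deletion is omitted, the seed graph is k + 1 mutually linked pages, and repeated targets are allowed. The name `copying_model` is hypothetical.

```python
import random

def copying_model(num_nodes, k, b, rng):
    """Grow a directed graph one page at a time; returns out_links[v] = list of k targets."""
    # Seed graph (an assumption): k + 1 pages, each linking to the other k.
    out_links = [[w for w in range(k + 1) if w != u] for u in range(k + 1)]
    for v in range(k + 1, num_nodes):
        prototype = rng.randrange(v)  # the existing page to copy from
        links = []
        for slot in range(k):
            if rng.random() < b:
                links.append(rng.randrange(v))             # link to a uniformly random page
            else:
                links.append(out_links[prototype][slot])   # copy the prototype's link
        out_links.append(links)
    return out_links

graph = copying_model(1000, k=3, b=0.5, rng=random.Random(1))
```

Pages that copy every link from the same prototype share all k targets, which is how the process produces many pages pointing at a common set of destinations.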
ACL (Aiello, Chung, Lu)
Degree sequence is given by a power law, which fixes the number of vertices and edges
  $\log y = a - b \log x$, where y is the number of vertices of degree x; a is the logarithm of the size of the graph and b can be regarded as the log-log growth rate of the graph
A set is constructed with as many copies of each vertex as its degree
A random matching in this set is chosen
Anna Karpovsky:
The power law for degrees is an intrinsic feature of the model rather than an emergent one.
The model does not explain the large number of bipartite cliques observed in the Web.
It is not clear how to adapt it to an evolving graph.
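The construction (copies of each vertex, then a random matching) can be sketched as follows. The sketch assumes the number of vertices of degree x is the integer part of exp(a) / x^b, that degrees run from 1 up to the integer part of exp(a/b), and that self-loops and repeated edges are tolerated. The name `acl_graph` is hypothetical.

```python
import math
import random

def acl_graph(a, b, rng):
    """Build a power-law degree sequence, then pair up vertex copies at random."""
    degrees = []
    max_degree = int(math.exp(a / b))
    for x in range(1, max_degree + 1):
        count = int(math.exp(a) / x ** b)  # y vertices of degree x
        degrees.extend([x] * count)
    # One copy ("stub") of each vertex per unit of degree.
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    if len(stubs) % 2:
        stubs.pop()  # an odd stub cannot be matched
    # A random matching: pair consecutive stubs of the shuffled list.
    edges = list(zip(stubs[0::2], stubs[1::2]))
    return degrees, edges

degrees, edges = acl_graph(5.0, 2.5, random.Random(0))
```

Because the degree sequence is fixed before any edge is drawn, the power law here is built in by construction rather than produced by a growth process.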
Scale-free Model
Preferential connectivity
  Higher probability of being linked to a vertex that already has a large number of connections: $\Pi(k_i) = k_i / \sum_j k_j$
Independent of time
Anna Karpovsky:
The observed power law describes systems of different sizes at different stages of their development.
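Preferential connectivity can be simulated by sampling from a list in which each vertex appears once per connection, so a vertex is chosen with probability proportional to its current degree. This is a sketch under assumptions: each new vertex adds m edges, the seed is m + 1 fully connected vertices, and repeated targets are rejected. The name `preferential_attachment` is hypothetical.

```python
import random

def preferential_attachment(num_nodes, m, rng):
    """Grow an undirected graph where new vertices prefer well-connected targets."""
    # Seed graph (an assumption): m + 1 fully connected vertices.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each vertex appears in this list once per edge endpoint, i.e. degree-many times.
    endpoints = [node for edge in edges for node in edge]
    for new_node in range(m + 1, num_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))  # probability proportional to degree
        for target in sorted(targets):
            edges.append((target, new_node))
            endpoints.extend((target, new_node))
    return edges

network = preferential_attachment(1000, m=2, rng=random.Random(0))
```

Rejecting repeated targets is a small departure from exact degree-proportional sampling; it keeps the graph free of duplicate edges.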
Analysis of the number of cliques
Evolving copying models
There are many (t^) large cliques
Idea: Call $v_{t'}$ a leader if at least one of its d out-links is chosen uniformly. Call v a duplicator if it copies all d of its out-links. In each epoch there is some probability that at least one vertex copies from $v_{t'}$, so one can derive the expected number of duplicators of $v_{t'}$. $v_{t'}$ and its duplicators form a complete bipartite subgraph.
Analysis of the number of cliques
Evolving uniform model
Number of $C_{i,j}$ is negligible for $ij > i + j$
Idea:
  observed out-degree = 7.2
Analysis of the number of cliques
Cliques in the ACL model
Number of $C_{i,j}$ is constant for $i > 2/(b-2)$
Idea: Sum, over all i-tuples and j-tuples of vertices, the probability that all the edges exist between them. We know: the maximum degree of a vertex is given by $\exp(a/b)$ (from $0 \le \log y = a - b \log x$), and the number of vertices of degree d is given by $\exp(a)/d^{b}$.
Comments
Links are not invariant in time
Documents are not stable
Hierarchical structure of the web pages