The Graph Structure of the Web - Aggregated by Pay-Level Domain @ Web Science 2014
The Graph Structure of the Web - Aggregated by Pay-Level Domain
Oliver Lehmberg, Robert Meusel, Christian Bizer
Research Group Data and Web Science
General Knowledge about the Web Graph
• Broder et al.* in 2000:
– In- and outdegree follow power laws
– There is a directed path between two pages in 25% of all cases
– The Web graph has a bow-tie structure
Version 6/25/2014 The Graph Structure in the Web - Aggregated by Pay-Level Domain Lehmberg/Meusel/Bizer
*A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In WWW’00, pages 309–320. North-Holland Publishing Co, 2000.
Our Contributions
• R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.
– Analysis of the 2012 Web graph on page level
• This presentation:
– Analysis of the same graph, aggregated by pay-level domain (PLD)
– Focus on inter-website connections
– No intra-website links
• Additionally:
– Interconnections between topical groups of websites
– Public suffix aggregation
DATA SET
Web Data Commons Hyperlink Graph
• Page level: the largest hyperlink graph available to the public
– Extracted from Common Crawl
– 3.5 billion nodes (web pages)
– 128 billion arcs (hyperlinks)
• Aggregated by pay-level domain:
– 43 million nodes (websites)
– 623 million arcs (aggregated hyperlinks)
– Covers about 18% of the 240 million domains registered in the Web in 2012*
• Pay-level domain:
– dws.informatik.uni-mannheim.de → uni-mannheim.de
*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
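The mapping from host name to pay-level domain, and the aggregation that drops intra-website links, can be sketched as follows. This is a minimal illustration rather than the published extraction code: the suffix set is a tiny hand-picked subset (an assumption; a real run would load the complete Public Suffix List), and the function names `pay_level_domain` and `aggregate_by_pld` are hypothetical.

```python
# Tiny illustrative subset of public suffixes (assumption); a real run
# would load the complete Public Suffix List instead.
PUBLIC_SUFFIXES = {"com", "org", "net", "de", "nl", "uk", "co.uk"}


def pay_level_domain(host: str) -> str:
    """Return the public suffix plus one more label, e.g. 'uni-mannheim.de'."""
    host = host.lower().rstrip(".")
    labels = host.split(".")
    # Candidates are tried longest first, so 'co.uk' wins over 'uk'.
    for start in range(len(labels)):
        suffix = ".".join(labels[start:])
        if suffix in PUBLIC_SUFFIXES:
            if start == 0:
                return host  # the host is itself a public suffix
            return ".".join(labels[start - 1:])
    # Unknown suffix: fall back to the last two labels.
    return ".".join(labels[-2:])


def aggregate_by_pld(host_links):
    """Collapse host-level links to distinct PLD-level arcs, dropping intra-site links."""
    arcs = set()
    for src_host, dst_host in host_links:
        src, dst = pay_level_domain(src_host), pay_level_domain(dst_host)
        if src != dst:
            arcs.add((src, dst))
    return arcs
```

With this sketch, `pay_level_domain("dws.informatik.uni-mannheim.de")` yields `uni-mannheim.de`, matching the example on the slide.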
Downloading the WDC Hyperlink Graph
http://webdatacommons.org/hyperlinkgraph/
• 4 aggregation levels:
Graph                      #Nodes        #Arcs           Size (zipped)
Page graph                 3.56 billion  128.73 billion  376 GB
Subdomain graph            101 million   2,043 million   10 GB
1st level subdomain graph  95 million    1,937 million   9.5 GB
PLD graph                  43 million    623 million     3.1 GB
• Extraction code is published under the Apache License
– Extraction costs per run: ~200 US$ in Amazon EC2 fees
GRAPH HANDS-ON
Node Centrality Ranking
http://wwwranking.webdatacommons.org
Top PLD Lists
Rank  Website              Outdegree  Website          Indegree   Website                 PageRank
1     blogspot.com         3.898.561  wordpress.org    1.822.440  wordpress.org           113,388
2     wordpress.com        2.249.553  youtube.com      1.319.548  gmpg.org                111,173
3     youtube.com          1.078.938  wikipedia.org    1.243.291  youtube.com             88,206
4     wikipedia.org        862.705    gmpg.org         1.156.727  twitter.com             54,644
5     serebella.com        699.609    blogspot.com     1.034.450  wikipedia.org           54,081
6     refertus.info        668.271    google.com       782.660    blogspot.com            40,901
7     top20directory.com   650.884    wordpress.com    710.590    google.com              40,799
8     typepad.com          551.360    twitter.com      646.239    wordpress.com           28,018
9     botw.org             496.645    yahoo.com        554.251    yahoo.com               27,594
10    tumblr.com           496.045    flickr.com       339.231    networkadvertising.org  27,395
11    dmoz.org             476.890    facebook.com     314.051    apple.com               23,929
12    vindhetviahier.nl    424.646    apple.com        312.396    phpbb.com               22,329
13    jcsearch.com         423.918    miibeian.gov.cn  289.605    miibeian.gov.cn         22,165
14    startpagina.nl       392.543    vimeo.com        269.003    hugedomains.com         20,793
15    yahoo.com            371.087    tumblr.com       226.596    facebook.com            20,254
16    tatu.us              370.918    joomla.org       201.863    joomla.org              18,146
17    freeseek.org         362.310    amazon.com       196.690    flickr.com              17,966
18    lap.hu               352.668    w3.org           196.507    adobe.com               17,903
19    blau-webkatalog.com  312.924    nytimes.com      193.907    linkedin.com            16,083
20    allepaginas.nl       276.578    sourceforge.net  189.663    w3.org                  15,539
Most interlinked PLDs
GRAPH ANALYSIS
In- and Outdegree – Power-Laws?
Power-Law:
Methodology:
• Maximum-likelihood fitting following Clauset et al.* (plfit*²)
• Goodness-of-fit test
Indegree results:
Cannot reject power law hypothesis
* Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009.
*² https://github.com/ntamas/plfit
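The maximum-likelihood step can be illustrated with a small sketch. It assumes the standard power-law form p(x) ∝ x^(-α) for x ≥ x_min and uses the closed-form approximation for integer data given by Clauset et al.; it is not plfit itself, the function name `fit_alpha` is hypothetical, and both the search for x_min and the goodness-of-fit test are left out.

```python
import math


def fit_alpha(degrees, x_min):
    """Approximate MLE of alpha for integer data with p(x) ~ x^(-alpha), x >= x_min."""
    if x_min < 1:
        raise ValueError("x_min must be at least 1")
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    # Approximation for integer data: shift x_min by 1/2 inside the logarithm.
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    alpha = 1.0 + len(tail) / log_sum
    return alpha, len(tail)
```

The function returns the estimated exponent together with the number of observations in the fitted tail, since both are needed to judge how much of the degree distribution the fit actually describes.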
In- and Outdegree – Power-Laws?
Outdegree results:
Must reject power law hypothesis
Yet unclear which distribution fits
Bow-Tie Structure
Observations:
• Small IN component
• Large OUT component
• Tendrils and tubes almost non-existent
Compared to Broder et al.:
• Unbalanced
• LSCC much larger
Compared to our page graph*:
• Proportions of IN and OUT exchanged
• A large fraction of IN pages was merged into the LSCC (ca. 1 billion pages)
* R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.
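How the bow-tie components are derived from the largest strongly connected component (LSCC) can be sketched as below. This is an illustration for small graphs only: it finds components by intersecting forward and backward reachability, which is quadratic in the worst case and not how a graph with 43 million nodes would be processed. The names `bow_tie` and `reachable` are hypothetical, and tendrils, tubes and disconnected nodes are lumped together as OTHER.

```python
from collections import defaultdict


def reachable(adj, sources):
    """Return all nodes reachable from `sources` by following `adj`."""
    seen, stack = set(sources), list(sources)
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(arcs):
    """Split nodes into LSCC, IN, OUT and OTHER (tendrils, tubes, disconnected)."""
    fwd, bwd, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in arcs:
        fwd[u].append(v)
        bwd[v].append(u)
        nodes.update((u, v))
    # The SCC of a node is what it can reach, intersected with what reaches it.
    lscc, unassigned = set(), set(nodes)
    while unassigned:
        seed = next(iter(unassigned))
        scc = reachable(fwd, [seed]) & reachable(bwd, [seed])
        unassigned -= scc
        if len(scc) > len(lscc):
            lscc = scc
    out_part = reachable(fwd, lscc) - lscc  # reachable from the LSCC
    in_part = reachable(bwd, lscc) - lscc   # can reach the LSCC
    other = nodes - lscc - in_part - out_part
    return {"LSCC": lscc, "IN": in_part, "OUT": out_part, "OTHER": other}
```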
Distance Distribution
Methodology:
Approximate the distribution several times (using HyperBall*)
Connected pairs: 42%
Avg. distance: 4.27
Diameter (at least):
*P. Boldi and S. Vigna. In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond. In ICDMW 2013. IEEE, 2013
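The three quantities on this slide can be made concrete with an exact computation on a small graph. The sketch below runs a breadth-first search from every node, which is only feasible for toy inputs; HyperBall approximates the same distribution probabilistically on large graphs. The function name `distance_stats` is hypothetical, and pairs are counted as ordered pairs of distinct nodes (an assumption about the convention).

```python
from collections import Counter, defaultdict, deque


def distance_stats(arcs):
    """Exact distance distribution over ordered node pairs via BFS from each node."""
    adj, nodes = defaultdict(list), set()
    for u, v in arcs:
        adj[u].append(v)
        nodes.update((u, v))
    histogram = Counter()  # distance -> number of ordered pairs at that distance
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    histogram[dist[v]] += 1
                    queue.append(v)
    n = len(nodes)
    connected = sum(histogram.values())
    total_pairs = n * (n - 1)
    weighted = sum(d * c for d, c in histogram.items())
    return {
        "connected_fraction": connected / total_pairs if total_pairs else 0.0,
        "avg_distance": weighted / connected if connected else 0.0,
        "max_distance": max(histogram, default=0),
    }
```

On an approximated distribution the largest observed distance is only a lower bound for the diameter, which is why the slide reports the diameter as "at least".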
High connectivity based on Hubs?
• LSCC of 51.9%, 42% connected pairs, and avg. distance of 4.27
– How important are hubs in this graph?
• Approach:
– A) Remove links to hubs (i.e. nodes with high indegree)
– B) Keep only links to hubs
– Repeat this for different indegree thresholds and measure the largest remaining WCC/SCC
• Results:
– Removing links to nodes with high indegree: no large SCC remains once all links to nodes with indegree 10 or higher are removed
– Removing links to nodes with low indegree: the more links we remove, the more likely the remaining nodes are to be part of the largest SCC
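The two variants of the experiment can be sketched as one function that filters arcs by the indegree of their target and then measures the largest SCC. This is a minimal illustration under assumptions: the names `hub_experiment` and `largest_scc_size` are hypothetical, indegree is taken from the unfiltered graph, only the SCC (not the WCC) is measured, and the component search is a plain iterative Kosaraju pass rather than whatever tooling was used for the real graph.

```python
from collections import Counter, defaultdict


def largest_scc_size(nodes, arcs):
    """Size of the largest strongly connected component (Kosaraju, iterative)."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        fwd[u].append(v)
        bwd[v].append(u)
    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child not in seen:
                    seen.add(child)
                    stack.append((child, iter(fwd[child])))
                    break
            else:
                order.append(node)
                stack.pop()
    # Pass 2: sweep the reversed graph in decreasing completion order.
    best, assigned = 0, set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        size, stack = 0, [root]
        while stack:
            node = stack.pop()
            size += 1
            for prev in bwd[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        best = max(best, size)
    return best


def hub_experiment(arcs, threshold, keep_hubs):
    """Keep only links to hubs (keep_hubs=True) or only links to non-hubs (False)."""
    indegree = Counter(v for _, v in arcs)  # indegree in the unfiltered graph
    nodes = {n for arc in arcs for n in arc}
    kept = [(u, v) for u, v in arcs if (indegree[v] >= threshold) == keep_hubs]
    return largest_scc_size(nodes, kept)
```

Calling `hub_experiment(arcs, t, False)` for a range of thresholds `t` corresponds to variant A, and `hub_experiment(arcs, t, True)` to variant B.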
Two Layer Model
Approach:
Remove incoming links from the graph and measure the sizes of the largest SCC/WCC
Subgraph with indegree
• 73.7% of all nodes weakly connected
• No large strongly connected component
• Low Degree Layer
Subgraph with indegree
• Removed incoming links of 79.2% of all nodes
• 16.1% of all nodes strongly connected
• High Degree Layer
PLD Topic Graph
Approach:
Use topical categories from the Open Directory Project* to categorise our websites.
15 topical categories
Results:
“computers”: 6th largest, but largest number of links
“shopping”: much more incoming than outgoing links, few internal links
Conclusion:
No obvious patterns, more properties needed
Figure: topic graph over the 15 categories adult, arts, business, computers, games, health, home, kids and teens, news, recreation, reference, science, shopping, society, sports.
*http://dmoz.org
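Aggregating PLD-level links into category-level links, and reading off internal, incoming and outgoing totals as in the "shopping" observation, can be sketched as follows. The mapping `group_of` from website to category is an assumed input (for example derived from a directory listing), and the function names are hypothetical; websites without a category are simply skipped.

```python
from collections import Counter


def aggregate_links(arcs, group_of):
    """Count links between groups; nodes without a group are skipped."""
    counts = Counter()
    for src, dst in arcs:
        if src in group_of and dst in group_of:
            counts[(group_of[src], group_of[dst])] += 1
    return counts


def link_profile(counts, group):
    """Internal, outgoing and incoming link totals for one group."""
    internal = counts[(group, group)]
    outgoing = sum(c for (s, d), c in counts.items() if s == group and d != group)
    incoming = sum(c for (s, d), c in counts.items() if d == group and s != group)
    return {"internal": internal, "outgoing": outgoing, "incoming": incoming}
```

The same two functions work for any grouping of websites, so a mapping from website to public suffix could be passed in place of the topical categories.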
Public Suffix (PS) Graph
Approach:
Top ten PSs from our PLD graph + “others”
Generally agrees with Verisign Domain Industry Brief*
gTLDs:
more external than internal links
ccTLDs:
more internal than external links
Extreme cases:
.com: does not follow this rule
.de: half of all links come from a single spammer
Figure: public suffix graph over com, de, info, it, net, nl, org, ru, co.uk and "others".
*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
WebDataCommons.org also offers:
1. Corpus of 17 billion RDFa, Microdata, and Microformats statements
2. Corpus of 147 million relational HTML tables
Thank you for your attention!