
The Graph Structure of the Web - Aggregated by Pay-Level Domain


The Graph Structure of the Web - Aggregated by Pay-Level Domain @ Web Science 2014


Page 1: The Graph Structure of the Web - Aggregated by Pay-Level Domain

The Graph Structure of the Web - Aggregated by Pay-Level Domain

Oliver Lehmberg, Robert Meusel, Christian Bizer

Research Group Data and Web Science

Page 2: The Graph Structure of the Web - Aggregated by Pay-Level Domain

General Knowledge about the Web Graph

• Broder et al.* in 2000:

– In- and outdegree follow power laws

– There is a directed path between two pages in 25% of all cases

– The Web graph has a bow-tie structure

Version 6/25/2014 The Graph Structure in the Web - Aggregated by Pay-Level Domain Lehmberg/Meusel/Bizer

*A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In WWW’00, pages 309–320. North-Holland Publishing Co, 2000.


Page 3: The Graph Structure of the Web - Aggregated by Pay-Level Domain


Our Contributions

• R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.

– Analysis of the 2012 Web Graph on page level

• This presentation:

– Analysis of the same graph, aggregated by pay-level domain (PLD)

– Focus on inter-website connections

– No intra-website links

• Additionally:

– Interconnections between topical groups of websites

– Public Suffix aggregation


Page 4: The Graph Structure of the Web - Aggregated by Pay-Level Domain

DATA SET


Page 5: The Graph Structure of the Web - Aggregated by Pay-Level Domain

Web Data Commons Hyperlink Graph

• Page level: the largest hyperlink graph available to the public – extracted from Common Crawl

– 3.5 billion nodes (web pages)

– 128 billion arcs (hyperlinks)

• Aggregated by pay-level domain (PLD):

– 43 million nodes (websites)

– 623 million arcs (aggregated hyperlinks)

– Covers about 18% of the 240 million domains registered in the Web in 2012*

• Pay-level domain:

– dws.informatik.uni-mannheim.de → uni-mannheim.de
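The aggregation step can be illustrated with a minimal sketch. It assumes a tiny, hypothetical subset of the Public Suffix List (`PUBLIC_SUFFIXES`) and page-level arcs that are already reduced to host names; the function names are illustrative and are not taken from the published extraction code.

```python
# Hypothetical, tiny subset of the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = {"com", "org", "de", "uk", "co.uk"}


def pay_level_domain(host):
    """Return the pay-level domain of a host name, or None if none is found."""
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before "uk".
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            # PLD = the one label in front of the suffix, plus the suffix.
            return ".".join(labels[i - 1:])
    return None


def aggregate_by_pld(host_arcs):
    """Collapse host-level arcs into a set of PLD-level arcs.

    Intra-website links (same PLD on both ends) are dropped and
    parallel links between two PLDs are merged into a single arc.
    """
    pld_arcs = set()
    for src_host, dst_host in host_arcs:
        src, dst = pay_level_domain(src_host), pay_level_domain(dst_host)
        if src and dst and src != dst:
            pld_arcs.add((src, dst))
    return pld_arcs
```

For example, `pay_level_domain("dws.informatik.uni-mannheim.de")` yields `"uni-mannheim.de"`. A real pipeline would load the full Public Suffix List instead of the hard-coded set.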


*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf


Page 6: The Graph Structure of the Web - Aggregated by Pay-Level Domain

Downloading the WDC Hyperlink Graph


http://webdatacommons.org/hyperlinkgraph/

• 4 aggregation levels:

Graph                      #Nodes        #Arcs           Size (zipped)
Page graph                 3.56 billion  128.73 billion  376 GB
Subdomain graph            101 million   2,043 million   10 GB
1st level subdomain graph  95 million    1,937 million   9.5 GB
PLD graph                  43 million    623 million     3.1 GB

• Extraction code is published under the Apache License

– Extraction costs per run: ~200 US$ in Amazon EC2 fees


Page 7: The Graph Structure of the Web - Aggregated by Pay-Level Domain


GRAPH HANDS-ON


Page 8: The Graph Structure of the Web - Aggregated by Pay-Level Domain


Node Centrality Ranking

http://wwwranking.webdatacommons.org


Page 9: The Graph Structure of the Web - Aggregated by Pay-Level Domain

Top PLD Lists


Rank  Website              Outdegree  Website          Indegree   Website                 PageRank
1     blogspot.com         3,898,561  wordpress.org    1,822,440  wordpress.org           113,388
2     wordpress.com        2,249,553  youtube.com      1,319,548  gmpg.org                111,173
3     youtube.com          1,078,938  wikipedia.org    1,243,291  youtube.com             88,206
4     wikipedia.org        862,705    gmpg.org         1,156,727  twitter.com             54,644
5     serebella.com        699,609    blogspot.com     1,034,450  wikipedia.org           54,081
6     refertus.info        668,271    google.com       782,660    blogspot.com            40,901
7     top20directory.com   650,884    wordpress.com    710,590    google.com              40,799
8     typepad.com          551,360    twitter.com      646,239    wordpress.com           28,018
9     botw.org             496,645    yahoo.com        554,251    yahoo.com               27,594
10    tumblr.com           496,045    flickr.com       339,231    networkadvertising.org  27,395
11    dmoz.org             476,890    facebook.com     314,051    apple.com               23,929
12    vindhetviahier.nl    424,646    apple.com        312,396    phpbb.com               22,329
13    jcsearch.com         423,918    miibeian.gov.cn  289,605    miibeian.gov.cn         22,165
14    startpagina.nl       392,543    vimeo.com        269,003    hugedomains.com         20,793
15    yahoo.com            371,087    tumblr.com       226,596    facebook.com            20,254
16    tatu.us              370,918    joomla.org       201,863    joomla.org              18,146
17    freeseek.org         362,310    amazon.com       196,690    flickr.com              17,966
18    lap.hu               352,668    w3.org           196,507    adobe.com               17,903
19    blau-webkatalog.com  312,924    nytimes.com      193,907    linkedin.com            16,083
20    allepaginas.nl       276,578    sourceforge.net  189,663    w3.org                  15,539
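Degree rankings of this kind can be derived directly from an arc list. The following is a minimal sketch, assuming the PLD graph is available as an iterable of `(source, target)` pairs; the function name and the tie-breaking rule (alphabetical) are illustrative choices, not part of the published ranking.

```python
from collections import Counter


def top_by_degree(arcs, k=20):
    """Return the top-k nodes by outdegree and by indegree.

    Duplicate arcs are counted once, as in an aggregated graph.
    Ties are broken alphabetically so the result is deterministic.
    """
    unique_arcs = set(arcs)
    outdegree = Counter(src for src, _ in unique_arcs)
    indegree = Counter(dst for _, dst in unique_arcs)

    def ranked(counter):
        return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

    return ranked(outdegree), ranked(indegree)


toy_arcs = [("a", "x"), ("a", "y"), ("a", "z"),
            ("b", "x"), ("b", "y"), ("c", "x")]
top_out, top_in = top_by_degree(toy_arcs, k=2)
```

PageRank is not covered by this sketch; it requires an iterative computation over the whole graph.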


Page 10: The Graph Structure of the Web - Aggregated by Pay-Level Domain

Most interlinked PLDs


Page 11: The Graph Structure of the Web - Aggregated by Pay-Level Domain

GRAPH ANALYSIS


Page 12: The Graph Structure of the Web - Aggregated by Pay-Level Domain

In- and Outdegree – Power-Laws?

Power-Law:

Methodology:

• Clauset et al.* Maximum-likelihood fitting (plfit *²)

• Goodness-of-fit test

Indegree results:

Cannot reject power law hypothesis
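As a rough illustration of the fitting step, the sketch below estimates the exponent α of a power law p(x) ∝ x^(-α) by maximum likelihood for a given lower cutoff `x_min`. It uses the closed-form estimators from Clauset et al. (the continuous MLE and the approximation for discrete data, which shifts the cutoff by 0.5). It is not the plfit implementation: it does not select `x_min` and does not perform the goodness-of-fit test.

```python
import math


def alpha_mle(values, x_min, discrete=True):
    """Estimate the power-law exponent for the tail values >= x_min.

    discrete=True applies the approximation for integer data,
    alpha ~ 1 + n / sum(ln(x_i / (x_min - 0.5))),
    which Clauset et al. describe as reasonable for x_min of about 6 or more.
    """
    shift = 0.5 if discrete else 0.0
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / (x_min - shift)) for x in tail)
    if not tail or log_sum <= 0:
        raise ValueError("not enough tail data above x_min")
    return 1.0 + len(tail) / log_sum


# Toy indegree sample (hypothetical values, not taken from the WDC graph).
sample = [10, 12, 15, 20, 33, 47, 80, 150, 400, 1200]
alpha = alpha_mle(sample, x_min=10)
```

The standard error of the estimate is approximately (α - 1) / √n for n tail observations, so a sample as small as the toy one above gives only a very loose estimate.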

* Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009.

*² https://github.com/ntamas/plfit

Page 13: The Graph Structure of the Web - Aggregated by Pay-Level Domain

In- and Outdegree – Power-Laws?

Outdegree results:

Must reject power law hypothesis

Yet unclear which distribution fits


Page 14: The Graph Structure of the Web - Aggregated by Pay-Level Domain

Bow-Tie Structure

Observations:

Small IN component

Large OUT component

TEND and TUBES almost non-existent

Compared to Broder et al.:

Unbalanced

LSCC much larger

Compared to our page graph*:

Proportions of IN and OUT exchanged

Large fraction of IN pages were merged into LSCC (ca. 1 billion pages)
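The bow-tie components can be computed on a toy graph as sketched below. The largest strongly connected component (LSCC) is found by intersecting forward and backward reachability, which is simple but only practical for small graphs; IN is everything that reaches the LSCC, OUT is everything reachable from it. Tendrils, tubes, and disconnected nodes are lumped together as `OTHER` here, which is a simplification of the full bow-tie model.

```python
from collections import defaultdict, deque


def reachable(start_nodes, adj):
    """Return all nodes reachable from any node in start_nodes via adj."""
    seen = set(start_nodes)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bow_tie(arcs):
    """Split the nodes of a directed graph into LSCC, IN, OUT and OTHER."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in arcs:
        fwd[u].add(v)
        bwd[v].add(u)
        nodes.update((u, v))

    # The SCC of a node is the set of nodes it can reach and be reached from.
    lscc, assigned = set(), set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        scc = reachable([n], fwd) & reachable([n], bwd)
        assigned |= scc
        if len(scc) > len(lscc):
            lscc = scc

    out_part = reachable(lscc, fwd) - lscc
    in_part = reachable(lscc, bwd) - lscc
    other = nodes - lscc - in_part - out_part
    return {"LSCC": lscc, "IN": in_part, "OUT": out_part, "OTHER": other}


toy = [("a", "b"), ("b", "c"), ("c", "a"),   # core cycle
       ("in", "a"), ("c", "out"), ("x", "y")]
parts = bow_tie(toy)
```

On a graph with tens of millions of nodes, a linear-time SCC algorithm such as Tarjan's would replace the per-node reachability loop.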


* R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web – revisited. WWW ’14, 2014.


Page 15: The Graph Structure of the Web - Aggregated by Pay-Level Domain

Distance Distribution

Methodology:

Approximate the distance distribution several times (using HyperBall*)

Connected pairs: 42%

Avg. distance: 4.27

Diameter (at least):
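The three reported quantities can be made concrete with an exact computation on a toy graph. The sketch below runs one breadth-first search per node, which is a stand-in for HyperBall: HyperBall approximates the same distribution with probabilistic counters because exact all-pairs BFS is infeasible at the scale of the PLD graph.

```python
from collections import defaultdict, deque


def distance_stats(arcs):
    """Return (fraction of connected ordered pairs, avg. distance, diameter).

    Distances are directed shortest-path lengths; the average and the
    diameter are taken over connected pairs only.
    """
    adj = defaultdict(set)
    nodes = set()
    for u, v in arcs:
        adj[u].add(v)
        nodes.update((u, v))

    histogram = defaultdict(int)  # distance -> number of ordered pairs
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if target != source:
                histogram[d] += 1

    ordered_pairs = len(nodes) * (len(nodes) - 1)
    connected = sum(histogram.values())
    if ordered_pairs == 0 or connected == 0:
        return 0.0, 0.0, 0
    average = sum(d * count for d, count in histogram.items()) / connected
    return connected / ordered_pairs, average, max(histogram)


fraction, average, diameter = distance_stats([("a", "b"), ("b", "c")])
```

In the two-arc chain above, three of the six ordered pairs are connected, at distances 1, 1 and 2.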


*P. Boldi and S. Vigna. In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond. In ICDMW 2013. IEEE, 2013


Page 16: The Graph Structure of the Web - Aggregated by Pay-Level Domain


High connectivity based on Hubs?

• LSCC of 51.9%, 42% connected pairs, and an avg. distance of 4.27

– How important are hubs in this graph?

• Approach:

– A) Remove links to hubs (i.e. nodes with high indegree)

– B) Keep only links to hubs

– Repeat this for different indegree thresholds and measure the largest remaining WCC/SCC

• Results:

– Removing links to nodes with high indegree: no large SCC remains once all links to nodes with indegree 10 or higher are removed

– Removing links to nodes with low indegree: the more links we remove, the more likely the remaining nodes are to be part of the largest SCC
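The two variants of the experiment can be sketched on a toy graph as follows. The threshold semantics (a hub is a node with indegree greater than or equal to the threshold, measured on the original graph) and all function names are assumptions made for this illustration; the SCC search uses plain reachability and is only suitable for small inputs.

```python
from collections import Counter, defaultdict, deque


def reach(start, adj):
    """Return all nodes reachable from start by BFS over adj."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def largest_scc_size(nodes, arcs):
    """Size of the largest strongly connected component (toy-scale method)."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for u, v in arcs:
        fwd[u].add(v)
        bwd[v].add(u)
    best, assigned = 0, set()
    for n in nodes:
        if n in assigned:
            continue
        scc = reach(n, fwd) & reach(n, bwd)
        assigned |= scc
        best = max(best, len(scc))
    return best


def hub_experiment(arcs, threshold):
    """Return largest SCC sizes for (A) links to hubs removed, (B) only links to hubs kept."""
    nodes = {n for arc in arcs for n in arc}
    indegree = Counter(dst for _, dst in arcs)
    without_hub_links = [(u, v) for u, v in arcs if indegree[v] < threshold]
    only_hub_links = [(u, v) for u, v in arcs if indegree[v] >= threshold]
    return (largest_scc_size(nodes, without_hub_links),
            largest_scc_size(nodes, only_hub_links))


toy = [("a", "h"), ("b", "h"), ("c", "h"), ("h", "a"), ("a", "b"), ("b", "a")]
result = hub_experiment(toy, threshold=3)
```

Running `hub_experiment` over a range of thresholds gives one curve per variant, which is the shape of the measurement described in the approach.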


Page 17: The Graph Structure of the Web - Aggregated by Pay-Level Domain

Two Layer Model


Approach:

Remove incoming links from the graph and measure the sizes of the largest SCC/WCC

Subgraph with indegree

• 73.7% of all nodes weakly connected

• No large strongly connected component

• Low Degree Layer

Subgraph with indegree

• Removed incoming links of 79.2% of all nodes

• 16.1% of all nodes strongly connected

• High Degree Layer

Page 18: The Graph Structure of the Web - Aggregated by Pay-Level Domain

PLD Topic Graph

Approach:

Use topical categories from the Open Directory Project* to categorise our websites.

15 topical categories

Results:

“computers”: 6th largest, but largest number of links

“shopping”: much more incoming than outgoing links, few internal links

Conclusion:

No obvious patterns, more properties needed


[Figure: link graph between the 15 topical categories: adult, arts, business, computers, games, health, home, kids and teens, news, recreation, reference, science, shopping, society, sports]

*http://dmoz.org

Page 19: The Graph Structure of the Web - Aggregated by Pay-Level Domain

Public Suffix (PS) Graph

Approach:

Top ten PSs from our PLD graph + “others”

Generally agrees with Verisign Domain Industry Brief*

gTLDs:

more external than internal links

ccTLDs:

more internal than external links

Extreme cases:

.com: does not follow this rule

.de: half of all links come from a single spammer
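The internal/external comparison can be sketched as a simple count over PLD-level arcs. The sketch assumes a precomputed mapping from each PLD to its public suffix (`suffix_of`, hypothetical) and counts an external link once for each of the two suffixes it connects; how the slides count external links is not stated, so this convention is an assumption.

```python
from collections import Counter


def suffix_link_profile(pld_arcs, suffix_of):
    """Return {suffix: (internal_links, external_links)} for a PLD arc list."""
    internal, external = Counter(), Counter()
    for src, dst in pld_arcs:
        s, d = suffix_of[src], suffix_of[dst]
        if s == d:
            internal[s] += 1
        else:
            # A cross-suffix link is external for both endpoints.
            external[s] += 1
            external[d] += 1
    suffixes = set(internal) | set(external)
    return {ps: (internal[ps], external[ps]) for ps in suffixes}


# Toy data with hypothetical domain names.
suffix_of = {"a.com": "com", "b.com": "com",
             "c.de": "de", "d.de": "de", "e.de": "de"}
arcs = [("a.com", "b.com"), ("a.com", "c.de"), ("c.de", "d.de"),
        ("d.de", "e.de"), ("e.de", "b.com")]
profile = suffix_link_profile(arcs, suffix_of)
```

A suffix whose first count exceeds its second follows the ccTLD pattern described above (more internal than external links).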


[Figure: link graph between public suffixes; recoverable node labels: com, de, info, it, net, nl, org, ru, co.uk, others]

*http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf


Page 20: The Graph Structure of the Web - Aggregated by Pay-Level Domain

WebDataCommons.org also offers:

1. Corpus of 17 billion RDFa, Microdata, Microformats statements

2. Corpus of 147 million relational HTML tables

Thank you for your attention!
