WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley



The Web

• A universal information resource
– Model weak, strong agreement
• How to exploit it?


WebBase

WEB PAGE


WebBase Goals

• Manage very large collections of Web pages
– Today: 1500 GB HTML, 200 M pages
• Enable large-scale Web-related research
• Locally provide a significant portion of the Web
• Efficient wide-area Web data distribution


WebBase Architecture


WebBase Remote Users

• Berkeley
• Columbia
• U. Washington
• Harvey Mudd
• Università degli Studi di Milano
• U. of Arizona
• California Digital Library
• Cornell
• U. of Houston
• Learning Lab Lower Saxony (L3S)
• France Telecom
• U. Texas


Outline

• Technical Challenges
• WebBase Use
• The Future


Challenges

• Scalability
– crawling
– archive distribution
– index construction
– storage
• Consistency
– freshness
– versions
• Dissemination
• Archiving
– “units”
– coordination
• IP Management
– copy access
– link access
– access control
• Hidden Web
• Topic-Specific Collection Building


What is a Crawler?

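[Figure: crawler loop. Initial URLs seed the to-visit list (init); the crawler then repeats: get next URL, get page from the web, extract URLs. It maintains to-visit URLs, visited URLs, and the collected web pages.]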
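The loop in the diagram can be sketched in a few lines. This is a minimal illustration rather than the WebBase crawler: `TOY_WEB`, `toy_get_page`, and the `max_pages` cap are hypothetical stand-ins, and a "page" is reduced to the list of URLs it links to.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}


def toy_get_page(url):
    """Stand-in for an HTTP fetch; returns the links found on the page."""
    return TOY_WEB.get(url, [])


def crawl(initial_urls, get_page, max_pages=1000):
    """Basic crawl loop: init, get next URL, get page, extract URLs."""
    to_visit = deque(initial_urls)  # init: seed the to-visit list
    visited = set()                 # URLs already fetched
    pages = {}                      # collected web pages
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()    # get next url
        if url in visited:
            continue
        visited.add(url)
        links = get_page(url)       # get page
        pages[url] = links
        for link in links:          # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages
```

Calling `crawl(["a"], toy_get_page)` visits every page reachable from the seed in breadth-first order; a FIFO to-visit list is one choice among many orderings.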


Parallel Crawling

[Figure: several crawler processes (C) crawling the web in parallel.]


Independent Crawlers

[Figure: two crawler processes (C) crawling the web independently; site 1 holds pages a through e and site 2 holds pages f through i.]


Partition: Firewall

[Figure: two crawler processes (C) and a partition between site 1 (pages a through e) and site 2 (pages f through i).]

Partition options:
· URL hash
· Site hash
· Hierarchical
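Two of the partition options listed above, URL hash and site hash, can be sketched as follows. The function names and the choice of SHA-256 are assumptions made for illustration; the slide does not say which hash function is used, and the hierarchical option is not shown.

```python
import hashlib
from urllib.parse import urlsplit


def stable_hash(text):
    """Deterministic hash (Python's built-in hash() of str is salted per run)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def url_hash_partition(url, num_crawlers):
    """Assign each URL to a crawler by hashing the full URL."""
    return stable_hash(url) % num_crawlers


def site_hash_partition(url, num_crawlers):
    """Assign each URL to a crawler by hashing only its host name."""
    site = urlsplit(url).hostname or ""
    return stable_hash(site) % num_crawlers
```

With a site hash, all pages of one site land on the same crawler, so links within a site never cross the partition; with a URL hash, pages of the same site can be spread over several crawlers.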


Partition: Cross-Over

[Figure: two crawler processes (C) and a partition between site 1 (pages a through e) and site 2 (pages f through i).]



Partition: Exchange

[Figure: two crawler processes (C) and a partition between site 1 (pages a through e) and site 2 (pages f through i).]
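The three partition modes named on these slides (firewall, cross-over, exchange) differ in what a crawler does with a link that leaves its own partition. The simulation below is a rough sketch under assumed semantics: firewall drops such links, cross-over follows them only once the crawler's own partition is exhausted, and exchange hands them to the owning crawler (immediately here, for simplicity). The link structure in `LINKS` is hypothetical; only the page names and the two-site split come from the slides.

```python
from collections import deque

# Hypothetical links; pages a..e are on site 1, f..i on site 2.
LINKS = {
    "a": ["b", "c"], "b": ["d"], "c": ["e", "g"], "d": [], "e": [],
    "f": ["h"], "g": ["i"], "h": ["d"], "i": [],
}
OWNER = {page: 0 for page in "abcde"}          # crawler 0 owns site 1
OWNER.update({page: 1 for page in "fghi"})     # crawler 1 owns site 2


def crawl_parallel(mode, seeds):
    """Step the crawlers round-robin; return each crawler's download log."""
    n = len(seeds)
    own = [deque(s) for s in seeds]            # URLs in own partition
    foreign = [deque() for _ in range(n)]      # used by cross-over only
    seen = [set() for _ in range(n)]
    fetched = [[] for _ in range(n)]

    def next_url(c):
        # Own partition first; foreign URLs only when own queue is empty.
        for queue in (own[c], foreign[c]):
            while queue:
                url = queue.popleft()
                if url not in seen[c]:
                    return url
        return None

    progress = True
    while progress:
        progress = False
        for c in range(n):
            url = next_url(c)
            if url is None:
                continue
            progress = True
            seen[c].add(url)
            fetched[c].append(url)
            for link in LINKS[url]:
                owner = OWNER[link]
                if owner == c:
                    own[c].append(link)
                elif mode == "cross-over":
                    foreign[c].append(link)
                elif mode == "exchange":
                    own[owner].append(link)    # hand over to the owner
                # "firewall": the inter-partition link is dropped
    return fetched
```

On this toy graph with seeds `[["a"], ["f"]]`, firewall mode never reaches pages g and i, cross-over reaches every page but downloads d twice, and exchange reaches every page exactly once at the cost of communication between crawlers.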



Coverage vs Overlap

cross-over crawler; 5 random seeds per C-proc
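The slide does not define the two quantities. A common formulation, assumed here, is overlap = (N - I) / I and coverage = I / U, where N is the total number of downloads across all crawler processes, I is the number of unique pages downloaded, and U is the number of pages the crawl was supposed to reach. The helper names are hypothetical.

```python
def overlap(downloads_per_crawler):
    """(N - I) / I: redundant downloads relative to unique pages."""
    total = sum(len(d) for d in downloads_per_crawler)   # N
    unique = len(set().union(*downloads_per_crawler))    # I
    return (total - unique) / unique


def coverage(downloads_per_crawler, target_pages):
    """I / U: fraction of the target pages that some crawler downloaded."""
    target = set(target_pages)                           # U
    unique = set().union(*downloads_per_crawler)
    return len(unique & target) / len(target)
```

For example, if two crawlers download `["a", "b", "d"]` and `["d", "f"]` out of eight target pages, overlap is 0.25 and coverage is 0.5.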


WebBase Parallel Crawling

[Figure: each computer runs several crawl processes, each with its own site queues, fetching from the web; a coordinator links this computer with the other computers.]
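The per-process site queues in the figure might be organized along these lines. This is a sketch under an assumed policy, one FIFO queue per site served round-robin so that consecutive requests go to different sites; the slide does not state the actual scheduling rule, and `SiteQueues` is a hypothetical name.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class SiteQueues:
    """One FIFO queue per site, served round-robin (assumed policy)."""

    def __init__(self):
        self.queues = OrderedDict()  # site -> deque of pending URLs

    def add(self, url):
        site = urlsplit(url).hostname
        self.queues.setdefault(site, deque()).append(url)

    def next_url(self):
        if not self.queues:
            return None
        # Take one URL from the site at the front of the rotation.
        site, queue = next(iter(self.queues.items()))
        url = queue.popleft()
        del self.queues[site]
        if queue:
            self.queues[site] = queue  # move the site to the back
        return url
```

Rotating across sites spreads the load so that no single server receives a burst of back-to-back requests from one process.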


WebBase Parallel Crawling

[Figure: pages/sec, CPU utilization, and sites being crawled versus number of processes (1 to 17). Left axis: 0 to 2500; right axis: CPU utilization from 0% to 200% (2 CPUs).]


Challenges

• Scalability
– crawling
– archive distribution
– index construction
– storage
• Consistency
– freshness
– versions
• Dissemination
• Archiving
– “units”
– coordination
• IP Management
– copy access
– link access
– access control
• Hidden Web
• Topic-Specific Collection Building

done

next


How to Refresh?

[Figure: pages a and b on the web and their copies in the repository.]

a changes daily
b changes once a week
can visit one page per week

• How should we visit pages?
– a a a a a a a ...
– b b b b b b b ...
– a b a b a b a b ... [uniform]
– a a a a a a b a a a ... [proportional]
– ?
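One way to compare the candidate visit orders is to estimate the long-run freshness of the repository under each. The sketch below assumes a Poisson change model, which the slide does not state: a page that changes at rate c and is revisited at regular intervals at rate v is up to date for a fraction (1 - e^(-c/v)) / (c/v) of the time. The rates follow the slide (a changes about 7 times a week, b once a week, one visit per week in total); the function names are hypothetical.

```python
import math


def page_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page under a Poisson change model."""
    if visit_rate == 0:
        return 0.0   # never refreshed: stale in the long run
    if change_rate == 0:
        return 1.0   # never changes: always fresh once fetched
    x = change_rate / visit_rate  # expected changes between two visits
    return (1.0 - math.exp(-x)) / x


def average_freshness(change_rates, visit_rates):
    values = [page_freshness(c, v) for c, v in zip(change_rates, visit_rates)]
    return sum(values) / len(values)


CHANGE_RATES = [7.0, 1.0]  # changes per week for pages a and b
POLICIES = {               # visits per week for a and b (sum to 1)
    "only a": [1.0, 0.0],
    "only b": [0.0, 1.0],
    "uniform": [0.5, 0.5],
    "proportional": [7 / 8, 1 / 8],
}


def compare_policies():
    return {name: average_freshness(CHANGE_RATES, rates)
            for name, rates in POLICIES.items()}
```

Under these assumptions the uniform policy yields higher average freshness than the proportional one, because visits spent on the rapidly changing page a buy very little: its copy goes stale again almost immediately.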


Using WebBase

• Fast Page Rank
• Complex Queries


Structure of the Web

Color the nodes by their domain:
red = stanford.edu
green = berkeley.edu
blue = mit.edu


Structure of the Web

[Figure: link graph with nodes grouped by domain: stanford.edu, berkeley.edu, mit.edu.]


Nested Block Structure of the Web

[Figure: link matrix with "from" and "to" axes, showing nested blocks labeled Berkeley and Stanford.]


Personalized Page Rank

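As a point of reference, personalized PageRank can be computed with a plain power iteration in which the random jump follows a personalization vector instead of a uniform one. The sketch below is a generic textbook version, not the fast method developed in the WebBase work; the toy graph, the damping factor of 0.85, and the iteration count are assumptions.

```python
def pagerank(links, personalization=None, damping=0.85, iterations=100):
    """Power iteration; the jump distribution is the personalization vector."""
    nodes = sorted(links)
    n = len(nodes)
    if personalization is None:
        personalization = {node: 1.0 / n for node in nodes}  # plain PageRank
    jump = {node: personalization.get(node, 0.0) for node in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) * jump[node] for node in nodes}
        for node in nodes:
            out = links[node]
            if out:
                share = damping * rank[node] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank along the jump vector.
                for target in nodes:
                    new_rank[target] += damping * rank[node] * jump[target]
        rank = new_rank
    return rank


# Hypothetical three-page graph: page -> pages it links to.
TOY_GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
```

Calling `pagerank(TOY_GRAPH, {"c": 1.0})` biases the ranking toward page c and the pages it points to, compared with `pagerank(TOY_GRAPH)`.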


Complex Queries

Stanford WebBase Repository

• Text search
– E.g., search for “SARS Symptoms”
• Bulk/streaming access
– Large-scale mining & indexing
– E.g., compute PageRank, extract communities
• Complex queries
– Declarative analysis interface


Example of a Complex Query

Goal: find universities collaborating with Stanford on mobile networking

• Compute S = stanford.edu pages containing phrase “Mobile networking”
– Rank pages in S by PageRank
• Compute R = set of all “.edu” domains pointed to by pages in S
– Rank domains in R by sum (incoming ranks)
• List top 10 domains in R

[Figure: within the entire Web, the stanford.edu mobile networking pages (S) point to the domains in R.]
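The steps of this query can be sketched over a tiny in-memory repository. Everything in `PAGES` is made-up illustration data, and two choices are assumptions rather than part of the slide: a domain's score is read as the sum of the PageRank values of the pages in S that link to it, and stanford.edu itself is left out of R.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical repository: URL -> (text, PageRank, outgoing links).
PAGES = {
    "http://cs.stanford.edu/mobile": (
        "Mobile networking research", 0.5,
        ["http://www.berkeley.edu/net", "http://www.mit.edu/lab"]),
    "http://ee.stanford.edu/wireless": (
        "mobile networking seminar", 0.25,
        ["http://www.berkeley.edu/net", "http://example.com/x"]),
    "http://www.stanford.edu/news": (
        "campus news", 0.125,
        ["http://www.mit.edu/lab"]),
}


def domain_of(url):
    """Last two labels of the host name, e.g. cs.stanford.edu -> stanford.edu."""
    return ".".join(urlsplit(url).hostname.split(".")[-2:])


def top_collaborators(pages, phrase, home="stanford.edu", k=10):
    # S: pages in the home domain whose text contains the phrase.
    s = {url: rank for url, (text, rank, _links) in pages.items()
         if domain_of(url) == home and phrase.lower() in text.lower()}
    # R: .edu domains pointed to by pages in S, scored by summed ranks.
    scores = defaultdict(float)
    for url, rank in s.items():
        for domain in {domain_of(link) for link in pages[url][2]}:
            if domain.endswith(".edu") and domain != home:  # assumption
                scores[domain] += rank
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

A declarative interface would let the user state this query once and leave the evaluation order to the system; the procedural version here only shows what has to be computed.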


Supernodes

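[Figure: a Web graph over pages P1 to P5 is partitioned into supernodes {N1, N2, N3}. The supernode graph connects N1, N2, N3 with superedges E1,2, E1,3, E3,1, E3,2. Lower-level graphs shown: IntraNode1 (P1, P2), IntraNode3 (P4, P5), SEdgePos1,3 (P2, P5), SEdgeNeg3,2 (P5, P3).]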
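The two-level idea in the figure can be sketched as follows: group pages into supernodes, keep links inside a supernode in a small intranode graph, and keep links between supernodes under a superedge. The partition in `SUPERNODE_OF` is inferred from the figure labels, and the page-level links in `PAGE_LINKS` are hypothetical, chosen so that the resulting superedges match the ones named on the slide. The positive/negative superedge encoding (SEdgePos, SEdgeNeg) is not modeled.

```python
# Assumed partition of pages into supernodes (inferred from the figure labels).
SUPERNODE_OF = {"P1": 1, "P2": 1, "P3": 2, "P4": 3, "P5": 3}

# Hypothetical page-level links.
PAGE_LINKS = [
    ("P1", "P2"), ("P1", "P3"), ("P2", "P5"),
    ("P4", "P5"), ("P4", "P1"), ("P5", "P3"),
]


def build_supernode_graph(page_links, supernode_of):
    """Split page links into intranode graphs and superedge graphs."""
    intranode = {}   # supernode -> links among its own pages
    superedges = {}  # (i, j) -> page links going from supernode i to j
    for src, dst in page_links:
        i, j = supernode_of[src], supernode_of[dst]
        if i == j:
            intranode.setdefault(i, []).append((src, dst))
        else:
            superedges.setdefault((i, j), []).append((src, dst))
    return intranode, superedges
```

The top-level supernode graph is just the key set of `superedges`, which is small enough to keep in memory; the per-supernode and per-superedge link lists can be loaded only when a query touches them.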


Growth of Supernode Graph

[Figure: size of supernode graph (MB, 20 to 100) versus number of pages (millions, 0 to 120). Annotation: 82 MB at 115M pages (830 GB of raw HTML).]


Query Execution Times

[Figure: time for navigation operation (secs, 0 to 600) for Query 1 through Query 6, comparing four representations: S-Node representation, relational DB, Connectivity Server, and files of adjacency lists.]


Query Optimization

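[Figure: two example selections over the page relation P, each combining a bound on pDepth (4 and 5) with a condition on pURL, one of them a LIKE pattern match on a ".net" domain.]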


Impact of cluster-based optimization

[Figure: query execution time (secs, 0 to 600) for sample queries Q1 through Q9, with no optimization versus optimization enabled.]

Dataset: 35 million pages, 600 million links, 300 GB of HTML

40-45% reduction in query execution times


Conclusion (So Far)

• Web is universal information resource
• WebBase exploits this resource
• WebBase Challenges:
– scalability, consistency, complex queries...
• The Future for WebBase (and clones)??


Will WebBase Scale?

[Figure: indexable web content and WebBase capacity (pessimistic and optimistic projections) plotted over time, starting from today.]


Pessimistic Scenario

• Specialized WebBases
– sports
– shopping
– ...

[Figure: indexable web content versus WebBase capacity (pessimistic) over time, starting from today.]


Optimistic Scenario

• Web in a Box
– web delivered in “CD” monthly
– search engine handles updates

[Figure: indexable web content versus WebBase capacity (optimistic) over time, starting from today.]


Legal Issues?

• Is WebBase legal?
– copies
– links, deep linking

• International regulations


Biasing Results

• How long will Google, Altavista, etc. resist “temptations”?
• Biasing Crawler
• Link and Content Spam


Access Data

• WebBase does not capture access patterns

[Figure: the web and WebBase, with a question mark between them.]


Semantic Web?

• Will tags be generated?
• By whom?
• Agreement?

[Figure: the web, semantic tags, and WebBase, with a question mark between them.]


Future Technical Challenges

• Incremental Updates
• Query Optimization
• Crawling Deep Web


Final Conclusion

• Many challenges ahead...
• Additional information:

Google: Stanford WebBase

WEB PAGE