1 Clustering Web Content for Efficient Replication Yan Chen, Lili Qiu, Weiyu Chen, Luan Nguyen, Randy H. Katz EECS Department UC Berkeley Microsoft Research

1

Clustering Web Content for Efficient Replication

Yan Chen, Lili Qiu*, Weiyu Chen, Luan Nguyen, Randy H. Katz

EECS DepartmentUC Berkeley

*Microsoft Research

2

Motivation• Amazing growth in WWW traffic

– Daily growth of roughly 7M Web pages– Annual growth of 200% predicted for next 4 years

• Content Distribution Network (CDN) commercialized to improve Web performance

– Un-cooperative pull-based replication

• Paradigm shift: cooperative push more cost-effective– Push replicas with greedy algorithms can achieve close to

optimal performance [JJKRS01, QPV01]– Improving availability during flash crowds and disasters

• Orthogonal issue: scalability– Per Website? Per URL? -> Clustering! – Clustering based on aggregated clients’ access patterns

• Adapt to users’ dynamic access patterns– Incremental clustering (online and offline)

3

Outlines

• Motivation• Architecture• Related Work • Problem Formulation• Simulation methodology• Granularity of replication• Dynamic clustering and replication• Conclusions

4

CDN name server

Client 1

Local DNS server

Local CDN server

1. G

ET r

equest

4. lo

cal C

DN

serv

er

IP

addre

ss

Web content server

Client 2

Local DNS server

Local CDN server

2. Request for hostname resolution

3. Reply: local CDN server IP

address

5.GET request

8. Response6.GET request if cache miss

ISP 2

ISP 1

Conventional CDN: Un-cooperative Pull

7. Response

Big waste of replication!

5

CDN name server

Client 1

Local DNS server

Local CDN server

1. G

ET r

equest

4. R

edir

ect

ed

serv

er

IP

addre

ss

Web content server

Client 2

Local DNS server

Local CDN server

2. Request for hostname resolution

3. Reply: nearby replica server or

Web server IP address

ISP 2

ISP 1

5.GET request

6. Response

6. Response

5.GET request if no replica yet

Cooperative Push-based CDN

0. P

ush

repl

icas

Significantly reduce # of replicas and consequently,the update cost (only 4% of un-coop pull)

6

Outlines


7

Related Work

• Many existing work model replica placement as NP-hard problem and propose greedy algorithms

• Ignore scalability problem

• Clustering of Web contents based on individuals’ access patterns for

– Pre-fetching, Web organization, etc.

• Little on the dynamics of replica placement / clustering

8

Problem Formulation

• Subject to the total replication cost (e.g., # of URL replicas)• Find a scalable, adaptive replication strategy to reduce avg access cost

9

Outlines


10

Simulation Methodology• Network Topology

– Pure-random, Waxman & transit-stub models from GT-ITM– A real AS-level topology from 7 widely-dispersed BGP peers

• Web Workload

Web Site

Period Duration # Requests avg –min-max

# Clients avg –min-max

# Client groups avg –min-max

MSNBC Aug-Oct/1999 10–11am 1.5M–642K–1.7M 129K–69K–150K 15.6K-10K-17K

NASA Jul-Aug/1995 All day 79K-61K-101K 5940-4781-7671 2378-1784-3011

– Aggregate MSNBC Web clients with BGP prefix» BGP tables from a BBNPlanet router» 10K groups left, chooses top 10% covering >70% of requests

– Aggregate NASA Web clients with domain names– Map the client groups onto the topology

• Performance Metric: average retrieval cost– Sum of edge costs from client to its closest replica

11

Outlines


12

Per Web site

1

2

3

4

1

2

3

4

Per URL

13

Where R: # of replicas/URL K: # of clusters M: # of URLs (M >> K)

C: # of clients S: # of CDN serversf: placement adaptation frequency

Replication Scheme States to Maintain Computation Cost

Per Website O (R) f × O(R × S × C)

Per Cluster O(R × K + M) f × O(K × R × (K + S × C))

Per URL O(R × M) f × O(M × R × (M + S × C))

• 60 – 70% average retrieval cost reduction for Per URL scheme

• Per URL is too expensive for management!

Replica Placement: Per Website vs. Per URL

14

Clustering Web Content

• General clustering framework– Define the correlation distance between URLs– Cluster diameter: the max distance b/t any two

members» Worst correlation in a cluster

– Generic clustering: minimize the max diameter of all clusters

• Correlation distance definition based on– Spatial locality– Temporal locality– Popularity– Semantics (e.g., directory)

15

Spatial Clustering

k

i

k

i ii

k

iii

BA

BABAdistcor

1 1

22

11),(_

• Correlation distance between two URLs defined as– Euclidean distance– Vector similarity

• URL spatial access vector

– Blue URL

1

2

3

4

0

2

0

1

16

Clustering Web Content (cont’d)

• Popularity-based clustering

– OR even simpler, sort them and put the first N/K elements into the first cluster, etc. - binary correlation

|)(_)(_|),(_ BfreqaccessAfreqaccessBAdistcor

)()(

),(_1),(_

BoccurrenceAoccurrence

BAoccurrencecoBAdistcor

• Temporal clustering– Divide traces into multiple individuals’ access sessions [ABQ01]

– In each session,

– Average over multiple sessions in one day

17

0

10

20

30

40

50

60

70

80

90

100

1 10 100 1000Number of clusters

Ave

rag

e re

trie

val c

ost

Spatial clustering: Euclidean distance

Spatial clustering: cosine similarity

Temporal clustering

Access frequency clustering0

10

20

30

40

50


Co

mp

uta

tio

nal

co

st (

ho

urs

)

Performance of Cluster-based Replication

• Tested over various topologies and traces• Spatial clustering with Euclidean distance and

popularity-based clustering perform the best– Even small # of clusters (with only 1-2% of # of URLs) can

achieve close to per-URL performance, with much less overhead

MSNBC, 8/2/1999, 5 replicas/URL

18

Effects of the Non-Uniform Size of URLs

• Replication cost constraint : bytes• Similar trends exist

– Per URL based replication outperforms per Website dramatically – Spatial clustering with Euclidean distance and popularity-based

clustering are very cost-effective

1

2

3

4

19

Outlines

• Motivation• Architecture• Related Work • Problem Formulation• Simulation methodology• Granularity of replication• Dynamic clustering and replication

– Static clustering– Incremental clustering

• Conclusions

20

Static clustering and replication

• Two daily traces: old trace and new trace

• Static clustering performs poorly beyond a week– Average retrieval cost almost doubles

Methods Static 1 Static 2 Optimal

Traces used for clustering Old Old New

Traces used for replication

Old New New

Traces used for evaluation

New New New

0

10

20

30

40

50

60

8/3 8/4 8/5 8/10 8/11 9/27 9/28 9/29 9/30 10/1New traces

Av

era

ge

re

trie

va

l c

os

t Staticclustering 1

Staticclustering 2

Reclustering,re-replication(optimal)

21

Incremental Clustering• Generic framework

1. If new URL u match with existing clusters c, add u to c and replicate u to existing replicas of c

2. Else create new clusters and replicate them

• Online incremental clustering– Push before accessed -> high availability– Predict access patterns based on semantics– Simplify to popularity prediction – Groups of URLs with similar popularity? Use hyperlink

structures!» Groups of siblings» Groups of the same hyperlink depth: smallest # of links

from root

22

Online Popularity Prediction

• Experiments– Use WebReaper to crawl http://www.msnbc.com on 5/3/2002

with hyperlink depth 4, then group the URLs– Use corresponding access logs to analyze the correlation– Groups of siblings has the best correlation

• Measure the divergence of URL popularity within a group:

)_(

)_(_

frequencyaccessaverage

frequencyaccessdevstd

access freq span =

23

Online Incremental Clustering• Semantics-based incremental clustering

– Put new URL into existing clusters with largest # of siblings– When there is a tie, choose the cluster with more replicas

• Simulation on 5/3/2002 MSNBC– 8-10am trace: static popularity clustering + replication– At 10am: 16 new URLs emerged - online incremental

clustering + replication – Evaluation with 10-12am trace: 16 URLs has 33,262

requests

1 2 3 4 5 6

+ ?

2 3 5 61 4

1

4

2

3

5 6

24

Online Incremental Clustering & Replication

Results

0

50

100

150

200

250

300

350

400

450

500

No replicationof new URLs

Randomreplication of

new URLs

Onlineincremental

clustering andreplication

Staticclustering and

replication(oracle)

Ave

rag

e re

trie

val c

ost

25

Offline Incremental Clustering

• Assume access history as input• Study spatial clustering and popularity-based clustering• For instance, spatial clustering with Euclidean distance

c

r

• Find the closest c for new URL u• Match if (s < r)• More than 98% new URLs

match with old clusters

• Cluster orphan URLs with diameter of dmax

• Replicate them with the average replicas/URL

us

26

Offline Incremental Clustering Results• Performance close to

optimal • With only 25-45%

replication cost

0

5

10

15

20

25

30

35

40

8/3 8/4 8/5 8/10 8/11 9/27 9/28 9/29 9/30 10/1

New traces

Av

era

ge

re

trie

va

l c

os

t Offline incrementalclustering, step 1

Offline incrementalclustering, step 2

Re-clustering, re-replication (optimal)

0

5

10

15

20

25

30

35

40

45

50

8/3 8/4 8/5 8/10 8/11 9/27 9/28 9/29 9/30 10/1

New date

Rep

licat

ion

co

st, n

orm

aliz

ed b

y th

at

of

the

op

tim

al c

ase

27

Conclusions• CDN operators: cooperative, clustering-based

replication– Cooperative: big savings on replica management and

update cost– Per URL replication outperforms per Website scheme by

60-70%– Clustering solves the scalability issues, and gives the

full spectrum of flexibility» Spatial clustering and popularity-based clustering

recommended

• To adapt to users’ access patterns: incremental clustering – Hyperlink-based online incremental clustering for

» High availability» Performance improvement

– Offline incremental clustering performs close to optimal

28

Performance of Cluster-based Replication

• Tested over various topologies and traces• Spatial clustering with Euclidean distance and

popularity-based clustering perform the best– Even small # of clusters (with only 1-2% of # of URLs) can

achieve close to per-URL performance

0

20

40

60

80

100

120

140


Ave

rag

e re

trie

val c

ost



Temporal clustering

Access frequency clustering

MSNBC, 8/2/1999, 5 replicas/URL NASA, 7/1/1995, 3 replicas/URL

0

10

20

30

40

50

60

70

80

90

100


Ave

rag

e re

trie

val c

ost



Temporal clustering

Access frequency clustering

Documents

1 Clustering Web Content for Efficient Replication Yan Chen, Lili Qiu*, Weiyu Chen, Luan Nguyen, Randy H. Katz EECS Department UC Berkeley *Microsoft Research

1 Clustering Web Content for Efficient Replication Yan Chen, Lili Qiu, Weiyu Chen, Luan Nguyen, Randy H. Katz EECS Department UC Berkeley Microsoft Research