View
230
Download
0
Tags:
Embed Size (px)
Citation preview
1
Clustering Web Content for Efficient Replication
Yan Chen, Lili Qiu*, Weiyu Chen, Luan Nguyen, Randy H. Katz
EECS DepartmentUC Berkeley
*Microsoft Research
2
Motivation• Amazing growth in WWW traffic
– Daily growth of roughly 7M Web pages– Annual growth of 200% predicted for next 4 years
• Content Distribution Network (CDN) commercialized to improve Web performance
– Un-cooperative pull-based replication
• Paradigm shift: cooperative push more cost-effective– Push replicas with greedy algorithms can achieve close to
optimal performance [JJKRS01, QPV01]– Improving availability during flash crowds and disasters
• Orthogonal issue: scalability– Per Website? Per URL? -> Clustering! – Clustering based on aggregated clients’ access patterns
• Adapt to users’ dynamic access patterns– Incremental clustering (online and offline)
3
Outlines
• Motivation• Architecture• Related Work • Problem Formulation• Simulation methodology• Granularity of replication• Dynamic clustering and replication• Conclusions
4
CDN name server
Client 1
Local DNS server
Local CDN server
1. G
ET r
equest
4. lo
cal C
DN
serv
er
IP
addre
ss
Web content server
Client 2
Local DNS server
Local CDN server
2. Request for hostname resolution
3. Reply: local CDN server IP
address
5.GET request
8. Response6.GET request if cache miss
ISP 2
ISP 1
Conventional CDN: Un-cooperative Pull
7. Response
Big waste of replication!
5
CDN name server
Client 1
Local DNS server
Local CDN server
1. G
ET r
equest
4. R
edir
ect
ed
serv
er
IP
addre
ss
Web content server
Client 2
Local DNS server
Local CDN server
2. Request for hostname resolution
3. Reply: nearby replica server or
Web server IP address
ISP 2
ISP 1
5.GET request
6. Response
6. Response
5.GET request if no replica yet
Cooperative Push-based CDN
0. P
ush
repl
icas
Significantly reduce # of replicas and consequently,the update cost (only 4% of un-coop pull)
6
Outlines
• Motivation• Architecture• Related Work • Problem Formulation• Simulation methodology• Granularity of replication• Dynamic clustering and replication• Conclusions
7
Related Work
• Many existing work model replica placement as NP-hard problem and propose greedy algorithms
• Ignore scalability problem
• Clustering of Web contents based on individuals’ access patterns for
– Pre-fetching, Web organization, etc.
• Little on the dynamics of replica placement / clustering
8
Problem Formulation
• Subject to the total replication cost (e.g., # of URL replicas)• Find a scalable, adaptive replication strategy to reduce avg access cost
9
Outlines
• Motivation• Architecture• Related Work • Problem Formulation• Simulation methodology• Granularity of replication• Dynamic clustering and replication• Conclusions
10
Simulation Methodology• Network Topology
– Pure-random, Waxman & transit-stub models from GT-ITM– A real AS-level topology from 7 widely-dispersed BGP peers
• Web Workload
Web Site
Period Duration # Requests avg –min-max
# Clients avg –min-max
# Client groups avg –min-max
MSNBC Aug-Oct/1999 10–11am 1.5M–642K–1.7M 129K–69K–150K 15.6K-10K-17K
NASA Jul-Aug/1995 All day 79K-61K-101K 5940-4781-7671 2378-1784-3011
– Aggregate MSNBC Web clients with BGP prefix» BGP tables from a BBNPlanet router» 10K groups left, chooses top 10% covering >70% of requests
– Aggregate NASA Web clients with domain names– Map the client groups onto the topology
• Performance Metric: average retrieval cost– Sum of edge costs from client to its closest replica
11
Outlines
• Motivation• Architecture• Related Work • Problem Formulation• Simulation methodology• Granularity of replication• Dynamic clustering and replication• Conclusions
12
Per Web site
1
2
3
4
1
2
3
4
Per URL
13
Where R: # of replicas/URL K: # of clusters M: # of URLs (M >> K)
C: # of clients S: # of CDN serversf: placement adaptation frequency
Replication Scheme States to Maintain Computation Cost
Per Website O (R) f × O(R × S × C)
Per Cluster O(R × K + M) f × O(K × R × (K + S × C))
Per URL O(R × M) f × O(M × R × (M + S × C))
• 60 – 70% average retrieval cost reduction for Per URL scheme
• Per URL is too expensive for management!
Replica Placement: Per Website vs. Per URL
14
Clustering Web Content
• General clustering framework– Define the correlation distance between URLs– Cluster diameter: the max distance b/t any two
members» Worst correlation in a cluster
– Generic clustering: minimize the max diameter of all clusters
• Correlation distance definition based on– Spatial locality– Temporal locality– Popularity– Semantics (e.g., directory)
15
Spatial Clustering
k
i
k
i ii
k
iii
BA
BABAdistcor
1 1
22
11),(_
• Correlation distance between two URLs defined as– Euclidean distance– Vector similarity
• URL spatial access vector
– Blue URL
1
2
3
4
0
2
0
1
16
Clustering Web Content (cont’d)
• Popularity-based clustering
– OR even simpler, sort them and put the first N/K elements into the first cluster, etc. - binary correlation
|)(_)(_|),(_ BfreqaccessAfreqaccessBAdistcor
)()(
),(_1),(_
BoccurrenceAoccurrence
BAoccurrencecoBAdistcor
• Temporal clustering– Divide traces into multiple individuals’ access sessions [ABQ01]
– In each session,
– Average over multiple sessions in one day
17
0
10
20
30
40
50
60
70
80
90
100
1 10 100 1000Number of clusters
Ave
rag
e re
trie
val c
ost
Spatial clustering: Euclidean distance
Spatial clustering: cosine similarity
Temporal clustering
Access frequency clustering0
10
20
30
40
50
1 10 100 1000Number of clusters
Co
mp
uta
tio
nal
co
st (
ho
urs
)
Performance of Cluster-based Replication
• Tested over various topologies and traces• Spatial clustering with Euclidean distance and
popularity-based clustering perform the best– Even small # of clusters (with only 1-2% of # of URLs) can
achieve close to per-URL performance, with much less overhead
MSNBC, 8/2/1999, 5 replicas/URL
18
Effects of the Non-Uniform Size of URLs
• Replication cost constraint : bytes• Similar trends exist
– Per URL based replication outperforms per Website dramatically – Spatial clustering with Euclidean distance and popularity-based
clustering are very cost-effective
1
2
3
4
19
Outlines
• Motivation• Architecture• Related Work • Problem Formulation• Simulation methodology• Granularity of replication• Dynamic clustering and replication
– Static clustering– Incremental clustering
• Conclusions
20
Static clustering and replication
• Two daily traces: old trace and new trace
• Static clustering performs poorly beyond a week– Average retrieval cost almost doubles
Methods Static 1 Static 2 Optimal
Traces used for clustering Old Old New
Traces used for replication
Old New New
Traces used for evaluation
New New New
0
10
20
30
40
50
60
8/3 8/4 8/5 8/10 8/11 9/27 9/28 9/29 9/30 10/1New traces
Av
era
ge
re
trie
va
l c
os
t Staticclustering 1
Staticclustering 2
Reclustering,re-replication(optimal)
21
Incremental Clustering• Generic framework
1. If new URL u match with existing clusters c, add u to c and replicate u to existing replicas of c
2. Else create new clusters and replicate them
• Online incremental clustering– Push before accessed -> high availability– Predict access patterns based on semantics– Simplify to popularity prediction – Groups of URLs with similar popularity? Use hyperlink
structures!» Groups of siblings» Groups of the same hyperlink depth: smallest # of links
from root
22
Online Popularity Prediction
• Experiments– Use WebReaper to crawl http://www.msnbc.com on 5/3/2002
with hyperlink depth 4, then group the URLs– Use corresponding access logs to analyze the correlation– Groups of siblings has the best correlation
• Measure the divergence of URL popularity within a group:
)_(
)_(_
frequencyaccessaverage
frequencyaccessdevstd
access freq span =
23
Online Incremental Clustering• Semantics-based incremental clustering
– Put new URL into existing clusters with largest # of siblings– When there is a tie, choose the cluster with more replicas
• Simulation on 5/3/2002 MSNBC– 8-10am trace: static popularity clustering + replication– At 10am: 16 new URLs emerged - online incremental
clustering + replication – Evaluation with 10-12am trace: 16 URLs has 33,262
requests
1 2 3 4 5 6
+ ?
2 3 5 61 4
1
4
2
3
5 6
24
Online Incremental Clustering & Replication
Results
0
50
100
150
200
250
300
350
400
450
500
No replicationof new URLs
Randomreplication of
new URLs
Onlineincremental
clustering andreplication
Staticclustering and
replication(oracle)
Ave
rag
e re
trie
val c
ost
25
Offline Incremental Clustering
• Assume access history as input• Study spatial clustering and popularity-based clustering• For instance, spatial clustering with Euclidean distance
c
r
• Find the closest c for new URL u• Match if (s < r)• More than 98% new URLs
match with old clusters
• Cluster orphan URLs with diameter of dmax
• Replicate them with the average replicas/URL
us
26
Offline Incremental Clustering Results• Performance close to
optimal • With only 25-45%
replication cost
0
5
10
15
20
25
30
35
40
8/3 8/4 8/5 8/10 8/11 9/27 9/28 9/29 9/30 10/1
New traces
Av
era
ge
re
trie
va
l c
os
t Offline incrementalclustering, step 1
Offline incrementalclustering, step 2
Re-clustering, re-replication (optimal)
0
5
10
15
20
25
30
35
40
45
50
8/3 8/4 8/5 8/10 8/11 9/27 9/28 9/29 9/30 10/1
New date
Rep
licat
ion
co
st, n
orm
aliz
ed b
y th
at
of
the
op
tim
al c
ase
27
Conclusions• CDN operators: cooperative, clustering-based
replication– Cooperative: big savings on replica management and
update cost– Per URL replication outperforms per Website scheme by
60-70%– Clustering solves the scalability issues, and gives the
full spectrum of flexibility» Spatial clustering and popularity-based clustering
recommended
• To adapt to users’ access patterns: incremental clustering – Hyperlink-based online incremental clustering for
» High availability» Performance improvement
– Offline incremental clustering performs close to optimal
28
Performance of Cluster-based Replication
• Tested over various topologies and traces• Spatial clustering with Euclidean distance and
popularity-based clustering perform the best– Even small # of clusters (with only 1-2% of # of URLs) can
achieve close to per-URL performance
0
20
40
60
80
100
120
140
1 10 100 1000Number of clusters
Ave
rag
e re
trie
val c
ost
Spatial clustering: Euclidean distance
Spatial clustering: cosine similarity
Temporal clustering
Access frequency clustering
MSNBC, 8/2/1999, 5 replicas/URL NASA, 7/1/1995, 3 replicas/URL
0
10
20
30
40
50
60
70
80
90
100
1 10 100 1000Number of clusters
Ave
rag
e re
trie
val c
ost
Spatial clustering: Euclidean distance
Spatial clustering: cosine similarity
Temporal clustering
Access frequency clustering