An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Jingyu Zhou*§, Lingkun Chu*, Tao Yang*§

* Ask Jeeves    § University of California at Santa Barbara
Outline

- Background & motivation
- Membership protocol design
- Implementation
- Evaluation
- Related work
- Conclusion
Background

- Large-scale 24x7 Internet services
- Thousands of machines connected by many level-2 and level-3 switches (e.g., 10,000 at Ask Jeeves)
- Multi-tiered architecture with data partitioning and replication
- Some machines are frequently unavailable due to failures, operational errors, and scheduled service updates
Network Topology in Service Clusters

- Multiple hosting centers across the Internet
- In a hosting center: thousands of nodes, many level-2 and level-3 switches, complex switch topology

[Figure: data centers in California, New York, and Asia behind a 3DNS WAN load balancer serve CA, NY, and Asian users over the Internet; inside a center, nodes hang off level-2 switches that connect through level-3 switches.]
Motivation

- Membership protocol
  - Yellow page directory: discovery of services and their attributes
  - Server aliveness: quick fault detection
- Challenges: efficiency, scalability, fast detection
- Fast failure detection is crucial. For an online auction service, even with replication:
  - Failure of one replica: 7-12 s
  - Service unavailable: 10-13 s
[Figure: auction service with three replicas.]
Communication Cost for Fast Detection

- Communication requirement: propagate to all nodes
- Fast detection needs a higher packet rate
- High bandwidth: higher hardware cost, more chances of failures
Design Requirements of Membership Protocol for Large-Scale Clusters

- Efficient: bandwidth, # of packets
- Topology-adaptive: localize traffic within switches
- Scalable: scale to tens of thousands of nodes
- Fast failure detection and information propagation
Approaches

- Centralized: easy to implement, but a single point of failure, not scalable, extra delay
- Distributed
  - All-to-all broadcast [Shen'01]: doesn't scale well
  - Gossip [Renesse'98]: probabilistic guarantee
  - Ring: slow to handle multiple failures
- None of these considers network topology
TAMP: Topology-Adaptive Membership Protocol

- Topology-awareness: form a hierarchical tree according to the network topology
- Topology-adaptiveness
  - Network changes: add/remove/move switches
  - Service changes: add/remove/move nodes
- Exploits the TTL field in IP packets (a scoped-multicast sketch follows)
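To make the TTL mechanism concrete, here is a minimal Python sketch (an illustration, not the authors' implementation) of opening a TTL-scoped IP multicast channel; the group address, port, and TTL value are assumptions that follow the 239.255.0.2x channels of the example later in the talk.

import socket
import struct

GROUP = "239.255.0.20"  # illustrative level-0 channel
PORT = 5000
TTL = 1                 # small TTL confines packets to nearby switches

# Sender: IP_MULTICAST_TTL bounds how many hops a packet may cross,
# so traffic on a low-TTL channel stays local to its switch neighborhood.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, TTL)
send_sock.sendto(b"HELLO", (GROUP, PORT))

# Receiver: join the group to hear announcements on this channel.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
recv_sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
recv_sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)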
Hierarchical Tree Formation Algorithm
1. Form small multicast groups with low TTL values;
2. Each multicast group performs elections;
3. Group leaders form higher level groups with larger TTL values;
4. Stop when max. TTL value is reached; otherwise, goto Step 2.
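A minimal Python sketch of this loop, assuming hypothetical join_group and elect_leader helpers; the channel numbering and TTL cap mirror the example on the next slide.

MAX_TTL = 4  # illustrative cap corresponding to the top of the hierarchy

def form_hierarchy(node):
    level, ttl = 0, 1
    while True:
        channel = f"239.255.0.{20 + level}"    # one multicast group per level
        group = node.join_group(channel, ttl)  # steps 1/3: TTL-scoped join
        leader = group.elect_leader()          # step 2: per-group election
        if leader is not node:
            return level                       # non-leaders stop climbing
        if ttl >= MAX_TTL:
            return level                       # step 4: max TTL reached
        level, ttl = level + 1, ttl + 1        # leaders form the next level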
An Example

- 3 level-3 switches with 9 nodes

[Figure: four-level multicast hierarchy over the nodes. Level 0: groups 0a/0b/0c on 239.255.0.20, TTL=1; level 1: groups 1a/1b/1c on 239.255.0.21, TTL=2; level 2: groups 2a/2b on 239.255.0.22, TTL=3; level 3: group 3a on 239.255.0.23, TTL=4. Group leaders at each level join the group one level up.]
Node Joining Procedure

- Purpose: find/elect a leader; exchange membership information
- Process (election sketched below):
  1. Join a channel and listen;
  2. If a leader exists, stop and bootstrap with the leader;
  3. Otherwise, elect a leader (bully algorithm);
  4. If elected leader, increase channel ID & TTL, go to 1.
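For step 3, here is a sketch of a bully-style election within one group; send and wait_for are hypothetical message primitives, and the timeouts are illustrative.

def bully_election(self_id, peer_ids, send, wait_for):
    # Bully algorithm: the highest-ID live node becomes leader.
    higher = [p for p in peer_ids if p > self_id]
    for p in higher:
        send(p, "ELECTION")
    if not wait_for("ALIVE", timeout=1.0):       # no higher node answered,
        for p in peer_ids:                       # so declare self the leader
            send(p, ("COORDINATOR", self_id))
        return self_id
    return wait_for("COORDINATOR", timeout=3.0)  # defer to the winner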
Properties of TAMP

- Upward propagation guarantee
  - A node is always aware of its leader
  - Messages can always be propagated to nodes in the higher levels
- Downward propagation guarantee
  - A node at level i must know the leaders of levels i-1, i-2, ..., 0
  - Messages can always be propagated to lower-level nodes
- Eventual convergence: the view of every node converges
Update protocol when cluster structure changes

[Figure: a three-level hierarchy over nodes A-I (level-0 groups A-C, D-F, G-I; level-1 leaders B, E, H; level-2 leader E). When node C fails, its leader B detects the loss (step 1), multicasts the update within its group and up to level 1 (step 2), the update propagates across the level-1 group (step 3), and the other leaders multicast it down to their groups (step 4), so C disappears from every node's view.]

- Heartbeat for failure detection (liveness sketch below)
- When a leader receives an update, it multicasts the update both up and down the tree
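A minimal sketch of heartbeat-based liveness checking; the period and loss limit are taken from the evaluation settings later in the talk (1 packet/s, dead after 5 consecutive losses).

import time

HEARTBEAT_PERIOD = 1.0  # seconds between heartbeats
MISS_LIMIT = 5          # consecutive losses before declaring a node dead

def check_liveness(last_seen, now=None):
    # last_seen maps member -> timestamp of its most recent heartbeat,
    # updated whenever a packet arrives on the member's channel.
    now = time.time() if now is None else now
    deadline = MISS_LIMIT * HEARTBEAT_PERIOD
    return {m for m, t in last_seen.items() if now - t > deadline}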
Fault Tolerance Techniques

- Leader failure: backup leader or re-election
- Network partition failure
  - Time out all nodes managed by a failed leader
  - Hierarchical timeout: longer timeout for higher levels (sketch below)
- Packet loss
  - Leaders exchange deltas since the last update
  - Piggyback the last three changes
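The slides only state that higher levels use longer timeouts; this sketch assumes an exponential scaling rule, which is one plausible choice rather than the authors' stated formula.

BASE_TIMEOUT = 5.0  # level-0 timeout: 5 missed heartbeats at 1 packet/s

def level_timeout(level, factor=2.0):
    # Higher-level leaders aggregate more nodes and cross more switches,
    # so each level gets a longer grace period before timing out.
    return BASE_TIMEOUT * (factor ** level)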
Scalability Analysis

- Protocols compared: all-to-all, gossip, and TAMP
- Basic performance factors
  - Failure detection time (T_fail_detect)
  - View convergence time (T_converge)
  - Communication cost in terms of bandwidth (B)
Scalability Analysis (Cont.)

Two metrics:
- BDP = B * T_fail_detect: low failure detection time with low bandwidth is desired
- BCP = B * T_converge: low convergence time with low bandwidth is desired

             BDP              BCP
All-to-all   O(n^2)           O(n^2)
Gossip       O(n^2 log n)     O(n^2 log n)
TAMP         O(n)             O(n) + O(B log_k n)

n: total # of nodes; k: each group size, a constant
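As a rough illustration of these bounds (constants and the log base are omitted, so only the relative growth between cluster sizes is meaningful):

from math import log

def bdp_bounds(n):
    return {"all-to-all": n**2, "gossip": n**2 * log(n), "tamp": n}

for n in (1_000, 10_000):
    print(n, bdp_bounds(n))
# Growing n by 10x multiplies the all-to-all bound by 100x (and gossip by
# slightly more), while TAMP's bound grows by only 10x.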
Implementation

- Inside the Neptune middleware [Shen'01]: programming and runtime support for building cluster-based Internet services
- Can be easily coupled into other clustering frameworks
Hierarchical Membership Service

[Figure: module architecture. A local service status data structure in shared memory (SHM Local) is exposed through the /proc file system; an Announcer publishes status on the multicast channels, while a Receiver, StatusTracker, and Informer track incoming updates; a Contender handles leader election; service code and client code access the external SHM view.]
Evaluation: Objectives & Settings

- Metrics: bandwidth, failure detection time, view convergence time
- Hardware settings
  - 100 dual PIII 1.4 GHz nodes
  - 2 switches connected by a Gigabit switch
- Protocol-related settings
  - Frequency: 1 packet/s
  - A node is deemed dead after 5 consecutive losses
  - Gossip mistake probability: 0.1%
  - # of nodes: 20-100 in steps of 20
Bandwidth Consumption

- All-to-All & Gossip: quadratic increase
- TAMP: close to linear

Failure Detection Time

- Gossip: log(N) increase
- All-to-All & TAMP: constant

View Convergence Time

- Gossip: log(N) increase
- All-to-All & TAMP: constant
Related Work

- Membership & failure detection: [Chandra'96], [Fetzer'99], [Fetzer'01], [Neiger'96], and [Stok'94]
- Gossip-style protocols: SCAMP, [Kempe'01], and [Renesse'98]
- High-availability systems (e.g., HA-Linux, Linux Heartbeat)
- Cluster-based network services: TACC, Porcupine, Neptune, Ninja
- Resource monitoring: Ganglia, NWS, MDS2
Contributions & Conclusions

- TAMP is a highly efficient and scalable membership protocol for very large clusters
- Exploits the TTL count in IP packets for a topology-adaptive design
- Verified through property analysis and experimentation
- Deployed on Ask Jeeves clusters with thousands of machines
Questions?