
Page 1: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Jingyu Zhou*§, Lingkun Chu*, Tao Yang*§

* Ask Jeeves
§ University of California at Santa Barbara

Page 2: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Outline

Background & motivation
Membership protocol design
Implementation
Evaluation
Related work
Conclusion

Page 3: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Background

Large-scale 24x7 Internet services:
Thousands of machines connected by many level-2 and level-3 switches (e.g., 10,000 at Ask Jeeves)
Multi-tiered architecture with data partitioning and replication
Some machines are frequently unavailable due to failures, operational errors, and scheduled service updates

Page 4: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Network Topology in Service Clusters

Multiple hosting centers across the Internet

In a hosting center:
Thousands of nodes
Many level-2 and level-3 switches
Complex switch topology

[Figure: a 3DNS WAN load balancer directs Asian, NY, and CA users to data centers in Asia, New York, and California; inside each data center, level-2 switches connect to level-3 switches.]

Page 5: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Motivation

Membership protocol:
Yellow-page directory – discovery of services and their attributes
Server aliveness – quick fault detection

Challenges: efficiency, scalability, fast detection

Page 6: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Fast Failure Detection is Crucial

Online auction service, even with replication:
Failure of one replica: 7s - 12s
Service unavailable: 10s - 13s

[Figure: an auction service backed by three replicas (Replica1, Replica2, Replica3).]

Page 7: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Communication Cost for Fast Detection

Communication requirement: propagate to all nodes
Fast detection needs a higher packet rate
High bandwidth means:
Higher hardware cost
More chances of failures

Page 8: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Design Requirements of Membership Protocol for Large-scale Clusters

Efficient: bandwidth, # of packets
Topology-adaptive: localize traffic within switches
Scalable: scale to tens of thousands of nodes
Fast failure detection and information propagation

Page 9: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Approaches

Centralized:
Easy to implement
Single point of failure, not scalable, extra delay

Distributed:
All-to-all broadcast [Shen’01]: doesn’t scale well
Gossip [Renesse’98]: probabilistic guarantee
Ring: slow to handle multiple failures

None of these consider network topology

Page 10: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

TAMP: Topology-Adaptive Membership Protocol

Topology-awareness: form a hierarchical tree according to the network topology

Topology-adaptiveness:
Network changes: add/remove/move switches
Service changes: add/remove/move nodes

Key mechanism: exploit the TTL field in IP packets to scope multicast traffic (see the sketch below)
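TTL scoping can be done with a standard UDP multicast socket. A minimal sketch in Python, assuming a hypothetical helper name and port; the group address and TTL mirror the example groups shown later (e.g., 239.255.0.20 with TTL=1), which are configuration choices rather than protocol constants:

```python
import socket
import struct

def open_scoped_sender(group: str, port: int, ttl: int) -> socket.socket:
    """Open a UDP sender whose multicast packets are limited to `ttl` router hops."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Each level-3 (routing) hop decrements the TTL, so a small TTL keeps the
    # group's traffic local to nearby switches.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack('b', ttl))
    sock.connect((group, port))
    return sock

# e.g., a lowest-level group: open_scoped_sender("239.255.0.20", 9000, 1)
```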

Page 11: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Hierarchical Tree Formation Algorithm

1. Form small multicast groups with low TTL values;

2. Each multicast group performs elections;

3. Group leaders form higher level groups with larger TTL values;

4. Stop when the maximum TTL value is reached; otherwise, go to Step 2.
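A minimal sketch of this loop, assuming a hypothetical per-level table of (group address, port, TTL) tuples like the one in the example on the next page, and hypothetical join_group()/elect_leader() helpers; the real implementation's interfaces may differ:

```python
# Per-level configuration, ordered from the smallest scope upward
# (addresses and port are illustrative, mirroring the example that follows).
LEVELS = [
    ("239.255.0.20", 9000, 1),
    ("239.255.0.21", 9000, 2),
    ("239.255.0.22", 9000, 3),
    ("239.255.0.23", 9000, 4),
]

def form_hierarchy(node):
    for addr, port, ttl in LEVELS:
        group = node.join_group(addr, port, ttl)   # Step 1: join the TTL-scoped group
        leader = group.elect_leader()              # Step 2: election within the group
        if leader is not node:
            return                                 # non-leaders stay at this level
        # Step 3: the leader continues into the next, wider group;
        # Step 4: the loop ends once the maximum TTL level has been handled.
```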

Page 12: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

An Example

3 level-3 switches with 9 nodes

[Figure: the resulting hierarchy. Level-0 groups 0a, 0b, 0c use 239.255.0.20 with TTL=1; level-1 groups 1a, 1b, 1c use 239.255.0.21 with TTL=2; level-2 groups 2a, 2b use 239.255.0.22 with TTL=3; the top-level group 3a uses 239.255.0.23 with TTL=4. Group leaders (e.g., nodes A, B, C) advance into the higher-level groups.]

Page 13: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Node Joining Procedure

Purpose:
Find/elect a leader
Exchange membership information

Process (a sketch of the election step follows):
1. Join a channel and listen;
2. If a leader exists, stop and bootstrap with the leader;
3. Otherwise, elect a leader (bully algorithm);
4. If this node is the leader, increase the channel ID & TTL and go to Step 1.
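A hedged sketch of the election in Step 3: with the bully algorithm, the reachable node with the highest identifier becomes the group leader. responsive_members() and announce_leader() are hypothetical helpers standing in for the real protocol messages exchanged over the multicast channel:

```python
def bully_election(channel, self_id):
    # Collect the identifiers of members that answered on this channel,
    # including ourselves; the highest identifier wins the election.
    candidates = set(channel.responsive_members()) | {self_id}
    leader_id = max(candidates)
    if leader_id == self_id:
        # Announce leadership, then (Step 4) move on to the channel with the
        # next ID and a larger TTL.
        channel.announce_leader(self_id)
    return leader_id
```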

Page 14: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Properties of TAMP

Upward propagation guarantee:
A node is always aware of its leader
Messages can always be propagated to nodes in the higher levels

Downward propagation guarantee:
A node at level i is a leader at levels i-1, i-2, …, 0
Messages can always be propagated to lower-level nodes

Eventual convergence:
The view of every node converges

Page 15: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Update protocol when cluster structure changes

[Figure: a three-level hierarchy over nodes A–I, with level-0 groups {A, B, C}, {D, E, F}, {G, H, I}, level-1 leaders B, E, H, and E at level 2; numbered steps show how a membership change (node C leaving) propagates through the leaders until every node's view drops C.]

Heartbeats are used for failure detection
When a leader receives an update, it multicasts the update both up and down the hierarchy (see the sketch below)
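A minimal sketch of that propagation rule (method names are illustrative, not from the implementation): a leader that learns of a membership change forwards it to the higher-level group it participates in and to the lower-level group it leads, so the update reaches both the root and the leaves.

```python
def on_membership_update(node, update, from_above=False):
    node.view.apply(update)          # update this node's local membership view
    if not node.is_leader:
        return
    if not from_above and node.upper_group is not None:
        node.upper_group.multicast(update)   # propagate up toward the root
    if node.lower_group is not None:
        node.lower_group.multicast(update)   # propagate down to group members
```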

Page 16: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Fault Tolerance Techniques

Leader failure: backup leader or re-election

Network partition failure:
Time out all nodes managed by a failed leader
Hierarchical timeout: longer timeouts for higher levels (see the sketch below)

Packet loss:
Leaders exchange deltas since the last update
Piggyback the last three changes
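A minimal sketch of the hierarchical timeout rule, assuming the base timeout implied by the evaluation settings (5 missed heartbeats at 1 packet/s) and an illustrative growth factor; the actual values are deployment choices:

```python
BASE_TIMEOUT_S = 5.0   # 5 consecutive missed 1 Hz heartbeats at the lowest level
LEVEL_FACTOR = 2.0     # illustrative: each higher level tolerates twice the silence

def timeout_for_level(level: int) -> float:
    """Higher levels wait longer before declaring a silent leader dead."""
    return BASE_TIMEOUT_S * (LEVEL_FACTOR ** level)

# level 0 -> 5 s, level 1 -> 10 s, level 2 -> 20 s
```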

Page 17: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Scalability Analysis

Protocols compared: all-to-all, gossip, and TAMP

Basic performance factors:
Failure detection time (T_fail_detect)
View convergence time (T_converge)
Communication cost in terms of bandwidth (B)

Page 18: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Scalability Analysis (Cont.)

Two metrics:
BDP = B * T_fail_detect (low failure detection time with low bandwidth is desired)
BCP = B * T_converge (low convergence time with low bandwidth is desired)

             BDP             BCP
All-to-all   O(n^2)          O(n^2)
Gossip       O(n^2 log n)    O(n^2 log n)
TAMP         O(n)            O(n) + O(B * log_k n)

n: total # of nodes
k: group size, a constant
(a growth comparison sketch follows)
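Illustrative only: plugging cluster sizes into the dominant BDP terms from the table (constant factors and the bandwidth term B are dropped, so only the relative growth rates are meaningful):

```python
import math

def bdp_growth(n: int) -> dict:
    # Dominant terms from the table above, up to constant factors.
    return {"all-to-all": n ** 2, "gossip": n ** 2 * math.log(n), "TAMP": n}

for n in (100, 1_000, 10_000):
    print(n, bdp_growth(n))
```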

Page 19: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Implementation

Inside the Neptune middleware [Shen’01] – programming and runtime support for building cluster-based Internet services
Can be easily coupled into other clustering frameworks

Hierarchical Membership Service

[Figure: architecture of the hierarchical membership service. Components include a local service status data structure in shared memory (SHM Local) fed from the /proc file system, Announcer, Receiver, Contender, StatusTracker, and Informer modules attached to the multicast channels, and an external SHM segment (SHM External) shared with service code and client code.]

Page 20: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Evaluation: Objectives & Settings

Metrics:
Bandwidth
Failure detection time
View convergence time

Hardware settings:
100 dual Pentium III 1.4 GHz nodes
2 switches connected by a Gigabit switch

Protocol-related settings (collected in the sketch below):
Frequency: 1 packet/s
A node is deemed dead after 5 consecutive losses
Gossip mistake probability: 0.1%
# of nodes: 20 – 100 in steps of 20
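The protocol-related settings above, gathered as a configuration sketch (the dictionary keys are illustrative names, not taken from the implementation):

```python
EVAL_SETTINGS = {
    "heartbeat_interval_s": 1.0,                # 1 packet/s
    "missed_heartbeats_for_failure": 5,         # node deemed dead after 5 losses
    "gossip_mistake_probability": 0.001,        # 0.1%
    "cluster_sizes": list(range(20, 101, 20)),  # 20, 40, 60, 80, 100 nodes
}
```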

Page 21: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Bandwidth Consumption

All-to-All & Gossip: quadratic increase
TAMP: close to linear

Page 22: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Failure Detection Time

Gossip: log(N) increase
All-to-All & TAMP: constant

Page 23: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

View Convergence Time

Gossip: log(N) increase
All-to-All & TAMP: constant

Page 24: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Related Work

Membership & failure detection: [Chandra’96], [Fetzer’99], [Fetzer’01], [Neiger’96], and [Stok’94]
Gossip-style protocols: SCAMP, [Kempe’01], and [Renesse’98]
High-availability systems (e.g., HA-Linux, Linux Heartbeat)
Cluster-based network services: TACC, Porcupine, Neptune, Ninja
Resource monitoring: Ganglia, NWS, MDS2

Page 25: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Contributions & Conclusions

TAMP is a highly efficient and scalable membership protocol for very large clusters
Exploits the TTL field in IP packets for a topology-adaptive design
Verified through property analysis and experimentation
Deployed in Ask Jeeves clusters with thousands of machines

Page 26: An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Questions?