An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services

Jingyu Zhou*§, Lingkun Chu*, Tao Yang*§

* Ask Jeeves    § University of California at Santa Barbara
Outline

- Background & motivation
- Membership protocol design
- Implementation
- Evaluation
- Related work
- Conclusion
Background

- Large-scale 24x7 Internet services
- Thousands of machines connected by many level-2 and level-3 switches (e.g., 10,000 at Ask Jeeves)
- Multi-tiered architecture with data partitioning and replication
- Some machines are frequently unavailable due to failures, operational errors, and scheduled service updates
Network Topology in Service Clusters

- Multiple hosting centers across the Internet
- In a hosting center: thousands of nodes, many level-2 and level-3 switches, complex switch topology

[Figure: data centers in California, New York, and Asia behind a 3DNS WAN load balancer serve CA, NY, and Asian users over the Internet; inside a center, nodes hang off level-2 switches that connect through level-3 switches.]
Motivation

- Membership protocol
  - Yellow page directory: discovery of services and their attributes
  - Server aliveness: quick fault detection
- Challenges: efficiency, scalability, fast detection
- Fast failure detection is crucial. For an online auction service, even with replication:
  - Failure of one replica: 7-12 s
  - Service unavailable: 10-13 s
[Figure: auction service with three replicas.]
Communication Cost for Fast Detection

- Communication requirement: propagate to all nodes
- Fast detection needs a higher packet rate
- High bandwidth: higher hardware cost, more chances of failures
Design Requirements of Membership Protocol for Large-Scale Clusters

- Efficient: bandwidth, # of packets
- Topology-adaptive: localize traffic within switches
- Scalable: scale to tens of thousands of nodes
- Fast failure detection and information propagation
Approaches

- Centralized: easy to implement, but a single point of failure, not scalable, extra delay
- Distributed
  - All-to-all broadcast [Shen'01]: doesn't scale well
  - Gossip [Renesse'98]: probabilistic guarantee
  - Ring: slow to handle multiple failures
- None of these considers network topology
TAMP: Topology-Adaptive Membership Protocol

- Topology-awareness: form a hierarchical tree according to the network topology
- Topology-adaptiveness
  - Network changes: add/remove/move switches
  - Service changes: add/remove/move nodes
- Exploits the TTL field in IP packets (a scoped-multicast sketch follows)
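To make the TTL mechanism concrete, here is a minimal Python sketch (an illustration, not the authors' implementation) of opening a TTL-scoped IP multicast channel; the group address, port, and TTL value are assumptions that follow the 239.255.0.2x channels of the example later in the talk.

import socket
import struct

GROUP = "239.255.0.20"  # illustrative level-0 channel
PORT = 5000
TTL = 1                 # small TTL confines packets to nearby switches

# Sender: IP_MULTICAST_TTL bounds how many hops a packet may cross,
# so traffic on a low-TTL channel stays local to its switch neighborhood.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, TTL)
send_sock.sendto(b"HELLO", (GROUP, PORT))

# Receiver: join the group to hear announcements on this channel.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
recv_sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
recv_sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)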
Hierarchical Tree Formation Algorithm
1. Form small multicast groups with low TTL values;
2. Each multicast group performs elections;
3. Group leaders form higher level groups with larger TTL values;
4. Stop when max. TTL value is reached; otherwise, goto Step 2.
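A minimal Python sketch of this loop, assuming hypothetical join_group and elect_leader helpers; the channel numbering and TTL cap mirror the example on the next slide.

MAX_TTL = 4  # illustrative cap corresponding to the top of the hierarchy

def form_hierarchy(node):
    level, ttl = 0, 1
    while True:
        channel = f"239.255.0.{20 + level}"    # one multicast group per level
        group = node.join_group(channel, ttl)  # steps 1/3: TTL-scoped join
        leader = group.elect_leader()          # step 2: per-group election
        if leader is not node:
            return level                       # non-leaders stop climbing
        if ttl >= MAX_TTL:
            return level                       # step 4: max TTL reached
        level, ttl = level + 1, ttl + 1        # leaders form the next level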
An Example

- 3 level-3 switches with 9 nodes

[Figure: four-level multicast hierarchy over the nodes. Level 0: groups 0a/0b/0c on 239.255.0.20, TTL=1; level 1: groups 1a/1b/1c on 239.255.0.21, TTL=2; level 2: groups 2a/2b on 239.255.0.22, TTL=3; level 3: group 3a on 239.255.0.23, TTL=4. Group leaders at each level join the group one level up.]
Node Joining Procedure

- Purpose: find/elect a leader; exchange membership information
- Process (election sketched below):
  1. Join a channel and listen;
  2. If a leader exists, stop and bootstrap with the leader;
  3. Otherwise, elect a leader (bully algorithm);
  4. If elected leader, increase channel ID & TTL, go to 1.
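For step 3, here is a sketch of a bully-style election within one group; send and wait_for are hypothetical message primitives, and the timeouts are illustrative.

def bully_election(self_id, peer_ids, send, wait_for):
    # Bully algorithm: the highest-ID live node becomes leader.
    higher = [p for p in peer_ids if p > self_id]
    for p in higher:
        send(p, "ELECTION")
    if not wait_for("ALIVE", timeout=1.0):       # no higher node answered,
        for p in peer_ids:                       # so declare self the leader
            send(p, ("COORDINATOR", self_id))
        return self_id
    return wait_for("COORDINATOR", timeout=3.0)  # defer to the winner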
Properties of TAMP

- Upward propagation guarantee
  - A node is always aware of its leader
  - Messages can always be propagated to nodes in the higher levels
- Downward propagation guarantee
  - A node at level i must know the leaders of levels i-1, i-2, ..., 0
  - Messages can always be propagated to lower-level nodes
- Eventual convergence: the view of every node converges
Update protocol when cluster structure changes

[Figure: a three-level hierarchy over nodes A-I (level-0 groups A-C, D-F, G-I; level-1 leaders B, E, H; level-2 leader E). When node C fails, its leader B detects the loss (step 1), multicasts the update within its group and up to level 1 (step 2), the update propagates across the level-1 group (step 3), and the other leaders multicast it down to their groups (step 4), so C disappears from every node's view.]

- Heartbeat for failure detection (liveness sketch below)
- When a leader receives an update, it multicasts the update both up and down the tree
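A minimal sketch of heartbeat-based liveness checking; the period and loss limit are taken from the evaluation settings later in the talk (1 packet/s, dead after 5 consecutive losses).

import time

HEARTBEAT_PERIOD = 1.0  # seconds between heartbeats
MISS_LIMIT = 5          # consecutive losses before declaring a node dead

def check_liveness(last_seen, now=None):
    # last_seen maps member -> timestamp of its most recent heartbeat,
    # updated whenever a packet arrives on the member's channel.
    now = time.time() if now is None else now
    deadline = MISS_LIMIT * HEARTBEAT_PERIOD
    return {m for m, t in last_seen.items() if now - t > deadline}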
Fault Tolerance Techniques

- Leader failure: backup leader or re-election
- Network partition failure
  - Time out all nodes managed by a failed leader
  - Hierarchical timeout: longer timeout for higher levels (sketch below)
- Packet loss
  - Leaders exchange deltas since the last update
  - Piggyback the last three changes
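The slides only state that higher levels use longer timeouts; this sketch assumes an exponential scaling rule, which is one plausible choice rather than the authors' stated formula.

BASE_TIMEOUT = 5.0  # level-0 timeout: 5 missed heartbeats at 1 packet/s

def level_timeout(level, factor=2.0):
    # Higher-level leaders aggregate more nodes and cross more switches,
    # so each level gets a longer grace period before timing out.
    return BASE_TIMEOUT * (factor ** level)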
Scalability Analysis

- Protocols compared: all-to-all, gossip, and TAMP
- Basic performance factors
  - Failure detection time (T_fail_detect)
  - View convergence time (T_converge)
  - Communication cost in terms of bandwidth (B)
Scalability Analysis (Cont.)

Two metrics:
- BDP = B * T_fail_detect: low failure detection time with low bandwidth is desired
- BCP = B * T_converge: low convergence time with low bandwidth is desired

             BDP              BCP
All-to-all   O(n^2)           O(n^2)
Gossip       O(n^2 log n)     O(n^2 log n)
TAMP         O(n)             O(n) + O(B log_k n)

n: total # of nodes; k: each group size, a constant
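As a rough illustration of these bounds (constants and the log base are omitted, so only the relative growth between cluster sizes is meaningful):

from math import log

def bdp_bounds(n):
    return {"all-to-all": n**2, "gossip": n**2 * log(n), "tamp": n}

for n in (1_000, 10_000):
    print(n, bdp_bounds(n))
# Growing n by 10x multiplies the all-to-all bound by 100x (and gossip by
# slightly more), while TAMP's bound grows by only 10x.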
Implementation

- Inside the Neptune middleware [Shen'01]: programming and runtime support for building cluster-based Internet services
- Can be easily coupled into other clustering frameworks
Hierarchical Membership Service

[Figure: module architecture. A local service status data structure in shared memory (SHM Local) is exposed through the /proc file system; an Announcer publishes status on the multicast channels, while a Receiver, StatusTracker, and Informer track incoming updates; a Contender handles leader election; service code and client code access the external SHM view.]
Evaluation: Objectives & Settings

- Metrics: bandwidth, failure detection time, view convergence time
- Hardware settings
  - 100 dual PIII 1.4 GHz nodes
  - 2 switches connected by a Gigabit switch
- Protocol-related settings
  - Frequency: 1 packet/s
  - A node is deemed dead after 5 consecutive losses
  - Gossip mistake probability: 0.1%
  - # of nodes: 20-100 in steps of 20
Bandwidth Consumption

- All-to-All & Gossip: quadratic increase
- TAMP: close to linear

Failure Detection Time

- Gossip: log(N) increase
- All-to-All & TAMP: constant

View Convergence Time

- Gossip: log(N) increase
- All-to-All & TAMP: constant
Related Work

- Membership & failure detection: [Chandra'96], [Fetzer'99], [Fetzer'01], [Neiger'96], and [Stok'94]
- Gossip-style protocols: SCAMP, [Kempe'01], and [Renesse'98]
- High-availability systems (e.g., HA-Linux, Linux Heartbeat)
- Cluster-based network services: TACC, Porcupine, Neptune, Ninja
- Resource monitoring: Ganglia, NWS, MDS2
Contributions & Conclusions

- TAMP is a highly efficient and scalable membership protocol for very large clusters
- Exploits the TTL count in IP packets for a topology-adaptive design
- Verified through property analysis and experimentation
- Deployed on Ask Jeeves clusters with thousands of machines
Questions?