Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure. Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony L. T. Rowstron. IEEE Journal on Selected Areas in Communications, October 2002

Page 1: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony L. T. Rowstron

IEEE Journal on Selected Areas in Communications, Oct, 2002

Page 2: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Outline

Pastry: a peer-to-peer location and routing substrate

Scribe: built on top of Pastry

Experimental evaluation: delay penalty, node stress (routing tables), and link stress (network bandwidth)

Page 3: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Pastry (1/2)

Each Pastry node has a unique 128-bit nodeId.

The set of existing nodeIds is uniformly distributed. This is achieved by basing the nodeId on a secure hash of the node’s public key or IP address.
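As a rough illustration of deriving a 128-bit nodeId from a secure hash, here is a minimal Python sketch. The choice of SHA-1, the truncation to 128 bits, and the function name `node_id_from` are assumptions made for this example; the slides do not say which hash is used.

```python
import hashlib

NODE_ID_BITS = 128  # nodeId width stated on the slide


def node_id_from(identity: str) -> int:
    """Map a public key or IP address (as text) to a 128-bit integer nodeId.

    Assumption: SHA-1 truncated to its first 16 bytes. Any secure hash
    whose output is spread uniformly would serve the same purpose.
    """
    digest = hashlib.sha1(identity.encode("utf-8")).digest()
    return int.from_bytes(digest[: NODE_ID_BITS // 8], "big")


if __name__ == "__main__":
    # Print the nodeId as 32 hexadecimal digits (128 bits).
    print(format(node_id_from("10.0.0.1"), "032x"))
```

Because the hash output is effectively uniform, nodeIds generated this way spread evenly over the identifier space regardless of how the IP addresses or keys are distributed.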

Page 4: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Pastry (2/2)

Each node contains:

A routing table (covering some of the live nodes). Each entry maps a nodeId to the associated node’s IP address.

IP addresses for the nodes in its “leaf set”.

Leaf set (l nodes in total): the l/2 nodes with the numerically closest larger nodeIds and the l/2 nodes with the numerically closest smaller nodeIds.
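The leaf set definition can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `leaf_set` is hypothetical, nodeIds are plain integers, and wraparound at the ends of the identifier space is deliberately ignored.

```python
def leaf_set(node_id, live_ids, l=4):
    """Return (smaller, larger): up to l/2 numerically closest nodeIds on each side.

    Assumptions for this sketch: integer nodeIds, l >= 2, and no wraparound
    at the ends of the identifier space.
    """
    half = l // 2
    if half < 1:
        raise ValueError("l must be at least 2")
    others = sorted(set(live_ids) - {node_id})
    smaller = [i for i in others if i < node_id][-half:]  # closest smaller nodeIds
    larger = [i for i in others if i > node_id][:half]    # closest larger nodeIds
    return smaller, larger
```

For a node with id 50 among live nodes 10, 20, ..., 80 and l = 4, the sketch returns the two closest ids on each side.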

Page 5: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Routing

Given a message and a key, Pastry reliably routes the message to the live node whose nodeId is numerically closest to the key.

In each routing step, the current node normally forwards the message to a node whose nodeId shares a longer prefix with the key.

The key can be different from the destination nodeId.
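The prefix-based forwarding rule can be sketched as follows. This is a simplified model, not Pastry's actual implementation: nodeIds are hexadecimal strings, each node's known peers are given as a plain list (the hypothetical `tables` dictionary), and the fallback used when no known node has a longer prefix (move to a numerically closer node with an equally long prefix) is an assumption of this sketch.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current, key, known):
    """Pick the next node for `key`, or None if `current` should keep the message."""
    cur_len = shared_prefix_len(current, key)
    # Normal case: forward to a node sharing a longer prefix with the key.
    longer = [n for n in known if shared_prefix_len(n, key) > cur_len]
    if longer:
        return max(longer, key=lambda n: shared_prefix_len(n, key))

    # Assumed fallback: same prefix length but numerically closer to the key.
    def dist(n):
        return abs(int(n, 16) - int(key, 16))

    closer = [n for n in known
              if shared_prefix_len(n, key) == cur_len and dist(n) < dist(current)]
    return min(closer, key=dist) if closer else None


def route(start, key, tables):
    """Follow next_hop from `start` until no better node is known."""
    path = [start]
    while True:
        nxt = next_hop(path[-1], key, tables.get(path[-1], []))
        if nxt is None:
            return path
        path.append(nxt)
```

Each hop either lengthens the shared prefix or, at equal prefix length, reduces the numeric distance to the key, so the loop always terminates.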

Page 6: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Routing a message from node 65a1fc with key d46a1c

Page 7: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Locality properties

Short routes property: concerns the total distance that messages travel along Pastry routes. In each step, a message is routed to the nearest node with a longer prefix match.

Route convergence property: concerns the distance traveled by two messages sent to the same key before their routes converge.

[Figure: route convergence; node labels A, B, C, D, E; annotation "Converge"]

Page 8: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Node addition

A new node with nodeId X can initialize its state by contacting a nearby node A.

A routes a special message using X as the key.

This message is routed to the existing node Z whose nodeId is numerically closest to X.

X then obtains the leaf set and the routing table from Z.

Z is X’s closest neighbor in the nodeId space, so their leaf sets are almost the same and their routing tables are very similar.

Page 9: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Failure

To handle node failures, neighboring nodes in the nodeId space periodically exchange keep-alive messages.

If a node is unresponsive for a period T, it is presumed failed.

All members of the failed node’s leaf set are then notified, and they update their leaf sets.

Routing table entries that refer to the failed node are repaired lazily.
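The timeout rule can be sketched as a small check over the time each neighbor was last heard from. The names `presumed_failed`, `drop_from_leaf_set`, and `last_heard` are hypothetical, and timestamps are plain numbers in seconds; this only illustrates the "unresponsive for a period T" test and the leaf set update.

```python
def presumed_failed(last_heard, now, period_t):
    """Return the neighbors that have been silent for longer than period_t.

    last_heard maps a neighbor's nodeId to the time of its latest keep-alive.
    """
    return {node for node, t in last_heard.items() if now - t > period_t}


def drop_from_leaf_set(leaf_set, failed):
    """Leaf set with presumed-failed nodes removed (refilling it is not shown)."""
    return [node for node in leaf_set if node not in failed]
```

Routing table repair is not modeled here, since the slide notes that those entries are repaired lazily.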

Page 10: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Scribe

Scribe uses Pastry to manage group creation and group joining, and to build a per-group multicast tree.

Implementation (message types): CREATE, JOIN, MULTICAST, LEAVE

Page 11: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Multicast tree creation

[Figure: multicast tree creation among nodes 1100, 1111, 1001, 0100, 0111, and 1101. Labels in the figure: CREATE (1100), JOIN (0111, 0100), forwarder (1001, 1101, 1111).]

b = 1 (prefixes are matched 1 bit at a time).

Because b = 1, either 1111 or 1101 can be a forwarder.

Page 12: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Membership

Rendezvous point: the root of the multicast tree. It can be changed.

Forwarder: a Scribe node that is part of a group’s multicast tree. Forwarders may or may not be members of the group. Each forwarder maintains a children table.
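How JOIN messages build the children tables can be sketched as below. This is a simplified model under stated assumptions: the Pastry route from the joining node to the rendezvous point is passed in as a ready-made list, `children` is a hypothetical dictionary mapping each forwarder to its set of children for one group, and a JOIN stops travelling once it reaches a node that is already part of the tree.

```python
def create(children, root):
    """CREATE: register the rendezvous point as the root of the group's tree."""
    children.setdefault(root, set())


def join(children, route):
    """JOIN: walk the route [joiner, hop1, ..., root] and add child links.

    Every node on the route becomes a forwarder and records the previous hop
    in its children table. The message stops at the first node that was
    already in the tree.
    """
    for child, parent in zip(route, route[1:]):
        already_in_tree = parent in children
        children.setdefault(parent, set()).add(child)
        if already_in_tree:
            break
    return children


if __name__ == "__main__":
    tree = {}
    create(tree, "1100")
    # Hypothetical routes toward the rendezvous point 1100.
    join(tree, ["0111", "1001", "1101", "1100"])
    join(tree, ["0100", "1001", "1101", "1100"])
    print(tree)
```

In the usage example, the second JOIN stops at 1001 because that node is already a forwarder, so the upper part of the tree is shared by both members.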

Page 13: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Multicast message dissemination

Multicast sources use Pastry to locate the rendezvous point of a group.

They route to the rendezvous point and ask it to return its IP address.

They cache the rendezvous point’s IP address and use it in subsequent multicasts to the group.

Multicast messages are disseminated from the rendezvous point along the multicast tree.

Why? Each multicast source could also be viewed as the root. However, if each source transmitted data by itself, the worst-case delay penalty could double.
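Dissemination from the rendezvous point can be sketched as a breadth-first walk over the children tables. The `children` dictionary (forwarder to set of children) and the function name `disseminate` are assumptions of this sketch; real delivery happens through network messages rather than a local loop.

```python
from collections import deque


def disseminate(children, root):
    """Return the nodes reached, in the order a multicast would visit them.

    Starts at the rendezvous point and follows each forwarder's children
    table. Children are visited in sorted order only to make the output
    deterministic. The tree is assumed to be acyclic.
    """
    reached = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        reached.append(node)
        queue.extend(sorted(children.get(node, ())))
    return reached
```

Each node forwards only to the entries in its own children table, so no node needs to know the full group membership.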

Page 14: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Reliability

Each nonleaf node in the tree sends a heartbeat message to its children.

A child suspects that its parent is faulty when it fails to receive heartbeat messages.

Upon detecting the failure of its parent, a node calls Pastry to route a JOIN message to a new parent.

If the failed node is the root, a new root (the live node with the nodeId numerically closest to the groupId) will replace it.
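The root replacement rule amounts to picking the live node whose nodeId is numerically closest to the groupId. A minimal sketch, assuming integer ids, ignoring wraparound of the identifier space, and breaking ties toward the smaller nodeId (the tie-break is an assumption of this example):

```python
def rendezvous_point(group_id, live_ids):
    """Live nodeId numerically closest to group_id (ties go to the smaller id)."""
    return min(live_ids, key=lambda n: (abs(n - group_id), n))
```

With 4-bit ids and groupId 1100, the root is node 1100 while it is alive; once it fails, the same rule selects 1101 from the remaining nodes.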

Page 15: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Experimental evaluation

Comparison with IP multicast: delay penalty, node stress, link stress

Experimental setup: a network topology with 5,050 routers; Scribe running on 100,000 end nodes; 1,500 groups

Page 16: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Delay penalty

Scribe increases the delay to deliver messages relative to IP multicast.

RMD: the ratio between the maximum delay using Scribe and the maximum delay using IP multicast.

RAD: the ratio between the average delay using Scribe and the average delay using IP multicast.
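The two ratios can be computed directly from per-member delivery delays. In this sketch the input lists and the function name `delay_penalty` are hypothetical; the delays are made-up numbers used only to show the arithmetic.

```python
def delay_penalty(scribe_delays, ip_delays):
    """Return (RMD, RAD) for one group.

    RMD = max Scribe delay / max IP multicast delay
    RAD = average Scribe delay / average IP multicast delay
    """
    rmd = max(scribe_delays) / max(ip_delays)
    avg_scribe = sum(scribe_delays) / len(scribe_delays)
    avg_ip = sum(ip_delays) / len(ip_delays)
    return rmd, avg_scribe / avg_ip


if __name__ == "__main__":
    # Illustrative delays in milliseconds for a three-member group.
    print(delay_penalty([20, 50, 80], [10, 40, 50]))
```

A value of 1.0 for either ratio would mean Scribe matches IP multicast on that measure; larger values indicate a delay penalty.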

Page 17: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Delay penalty

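[Figure: cumulative distribution of delay penalty. x-axis: delay relative to IP multicast (Scribe / IP multicast); y-axis: the number of groups with a RAD or RMD lower than or equal to the relative delay.]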

Page 18: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Node stress (1/2)

Page 19: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Node stress (2/2)

On average, each node keeps only a few children table entries.

Long tail

Page 20: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Link stress

[Figure: link stress. Annotated values: IP multicast 950; Scribe 4031.]

Page 21: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Bottleneck remover (1/3)

Reasons

Some nodes may have less computational power or bandwidth available than others.

The distribution of children table entries has a long tail.

Algorithm

When a node is overloaded, it selects the group that consumes the most resources.

It chooses the child in this group that is farthest away.
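The overloaded node's selection step can be sketched as two maximizations. Assumptions in this sketch: "consumes the most resources" is approximated by the number of children table entries for the group, "farthest away" means the largest measured delay, and the names `pick_child_to_drop`, `children_tables`, and `delay_to_child` are hypothetical.

```python
def pick_child_to_drop(children_tables, delay_to_child):
    """Return (group, child) that an overloaded node would shed.

    children_tables maps groupId -> set of children for that group.
    delay_to_child maps child -> measured delay from this node.
    Sorting first keeps the choice deterministic when there are ties.
    """
    # Assumption: resource use is proportional to the number of children.
    group = max(sorted(children_tables), key=lambda g: len(children_tables[g]))
    child = max(sorted(children_tables[group]), key=lambda c: delay_to_child[c])
    return group, child
```

A real node might measure resource use by bandwidth or message rate instead of entry count; the selection logic would stay the same.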

Page 22: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Bottleneck remover (2/3)

The parent drops the chosen child by sending it a message containing the children table for the group.

When the child receives the message:

It measures the delay between itself and the other nodes in the table.

It computes the total delay between itself and the parent via each node in the table.

It sends a join message to the node that provides the smallest combined delay.
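The dropped child's decision can be sketched as a minimum over combined delays. Both delay maps are assumed to be available to the child in this example (how the candidate-to-parent delays are obtained is not stated on the slide), and the function name `choose_new_parent` is hypothetical.

```python
def choose_new_parent(delay_to_candidate, delay_candidate_to_parent):
    """Return the candidate giving the smallest child -> candidate -> parent delay.

    delay_to_candidate: delay measured by the child to each node in the
        children table it received.
    delay_candidate_to_parent: delay between each of those nodes and the
        old parent (assumed known for this sketch).
    Ties are broken by candidate name to keep the result deterministic.
    """
    return min(
        delay_to_candidate,
        key=lambda n: (delay_to_candidate[n] + delay_candidate_to_parent[n], n),
    )
```

The child then sends its JOIN to the returned node, which moves one children table entry off the overloaded parent.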

Page 23: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Bottleneck remover (3/3)

Page 24: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Node stress

No long tail

Page 25: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Scalability

Evaluating Scribe’s scalability with a large number of groups.

Experimental setup: 50,000 Scribe nodes; 30,000 groups with 11 members each

Page 26: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Node stress (1/2)

The “collapse” variant is introduced on a later slide.

Page 27: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Node stress (2/2)

Scribe is poorly suited to small groups!

Long tail

Page 28: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Scribe collapse (1/2)

If a multicast group has few members, the group may require many other nodes to become forwarders. (The tree is inefficient.)

The new algorithm collapses long paths in the tree by removing nodes that are not members of the group and have only one entry in the group’s children table.
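The collapse rule can be sketched as a tree rewrite that skips over every non-member forwarder with exactly one child. This is an offline illustration under assumptions: the whole tree is available as a `children` dictionary, the root is always kept, and the function name `collapse` is hypothetical. The actual algorithm runs in a distributed fashion, which is not modeled here.

```python
def collapse(children, members, root):
    """Return a new children table with single-child non-member forwarders removed."""

    def first_kept(node):
        # Skip down a chain of non-member nodes that have exactly one child.
        while node not in members and len(children.get(node, ())) == 1:
            (node,) = children[node]
        return node

    result = {}
    stack = [root]  # assumption: the root (rendezvous point) is never removed
    while stack:
        node = stack.pop()
        kids = {first_kept(c) for c in children.get(node, ())}
        if kids:
            result[node] = kids
        stack.extend(kids)
    return result
```

In a path such as root, f1, f2 where only f2 branches out to members, f1 is removed and f2 becomes a direct child of the root.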

Page 29: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Scribe collapse (2/2)

Page 30: Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure

Link stress

[Figure: link stress comparison of naïve unicast, Scribe, IP multicast, and Scribe collapse]