23
Making a DSM Consistency Protocol Hierarchy-Aware: An Efficient Synchronization Scheme Gabriel Antoniu, Luc Bougé , Sébastien Lacour IRISA / INRIA & ENS Cachan, France Dagstuhl Seminar on Consistency Models October 21, 2003 Work originally presented at the DSM Workshop, CCGRID 2003, Tokyo

Making a DSM Consistency Protocol Hierarchy-Aware: An Efficient Synchronization Scheme

  • Upload
    zytka

  • View
    28

  • Download
    0

Embed Size (px)

DESCRIPTION

Making a DSM Consistency Protocol Hierarchy-Aware: An Efficient Synchronization Scheme. Gabriel Antoniu, Luc Bougé , Sébastien Lacour IRISA / INRIA & ENS Cachan, France Dagstuhl Seminar on Consistency Models October 21, 2003. Work originally presented at the DSM Workshop, CCGRID 2003, Tokyo. - PowerPoint PPT Presentation

Citation preview

Page 1: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

Making a DSM Consistency Protocol Hierarchy-Aware: An Efficient Synchronization

Scheme

Gabriel Antoniu, Luc Bougé, Sébastien Lacour

IRISA / INRIA & ENS Cachan, France

Dagstuhl Seminar on Consistency ModelsOctober 21, 2003

Work originally presented at the DSM Workshop, CCGRID 2003, Tokyo

Page 2: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

2

Distributed Shared Memory Distributed computing nodes Logically-shared virtual address space Multithreaded programming environment

Mem. Mem. Mem.

CPU CPU CPU

DSM

Network

Page 3: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

3

Hierarchical Network Architectures

Latencies: FastEthernet: 50-100 µs / SCI/Myrinet: 5 µs

Flat architecture: Expensive network Technically difficult

Hierarchical architecture: Cheaper network Technically easier

Page 4: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

4

DSM on Clusters of Clusters

Motivation: high-performance

computing / code coupling

Coupling with explicit data transfer: MPI, FTP, etc.

Key factor: Network latency

Solid mechanics

Thermodynamics

Optics

Dynamics

Satellitedesign

Multi-threaded code coupling application

Page 5: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

5

Latency Heterogeneity and Memory Consistency Protocol

50-100 µs

5 µs

Cluster A Cluster BCommunications over high-latency

links?

Page 6: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

6

Roadmap

Design, implement, evaluate a hierarchical memory consistency protocol

Same semantics as with a flat protocol Well-suited for clusters of clusters High-performance oriented Few related works:

Clusters of SMP nodes: Cashmere-2L (1997, Rochester, NY)

Clusters of clusters: Clustered Lazy Release Consistency (CLRC, 2000, LIP6, Paris) Cache data locally

Page 7: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

7

DSM-PM2:the Implementation Platform

User-defined DSM protocols

Wide portability POSIX-like

multi-threading

DSM-PM2

Communications(Madeleine)

Threads(Marcel)

Consistency Protocols Library

Basic Building Blocks

PageManagement

CommunicationManagement

PM2

www.pm2.org

Page 8: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

8

Starting Point: Home-Based Release Consistency

Each page is attached to a Home-Node Multiple writers, eager version Home-Node:

holds up-to-date version of the page it hosts gathers / applies page diffs

Home-node

Sending page modifications at lock

release time

Up-to-date

version of the

page

Page request

write

readdiff

Page 9: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

9

Flat Protocol & Hierarchical Architectures:

Where Does Time Get Wasted?

1. At synchronization operations: lock acquisition and release

2. While waiting for message acknowledgements (consistency protocol)

While retrieving a page from a remote node (data locality CLRC, Paris)

Page 10: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

10

1. Improving upon Synchronization Operations

Cluster A Cluster B

1 2

3 4

5

Executing incritical section

Page 11: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

11

Hierarchy-UnawareLock Acquisitions

Cluster A Cluster B

1 2

3 4

5

4 high-latency communication

s!

Page 12: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

12

Priority to Local Threads

Cluster A Cluster B

1 2

3 4

5

3 high-latency

+ 1 low-latency

communications!

Page 13: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

13

Priority to Local Nodes

Cluster A Cluster B

1 2

3 4

5

1 high-latency

+ 3 low-latency

communications!

Page 14: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

14

Diff Aggregation at Node Level

1 25

Flat protocol

Hierarchical protocol

diff

Home-Nodediff

1 25

Home-Node

aggregated diffs: 1 + 5

Page 15: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

15

Avoiding Starvation

Control the lack of fairness Bounds to limit the number of

consecutive acquisitions of a lock: By the threads of a node By the nodes of a cluster Can be tuned at run-time

Page 16: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

16

Flat Protocol & Hierarchical Architectures:

Where Does Time Get Wasted?

1. At synchronization operations: lock acquisition and release

2. While waiting for message acknowledgements (consistency protocol)

While retrieving a page from a remote node (data locality CLRC, Paris)

Page 17: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

17

Lock Release in a Flat Protocol

Home

node

Cluster B Cluster C

diff

ack

invalidate

release

ackLock fully released

lock

Cluster A

Node 0

Node 1

Node 2

Page 18: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

18

Lock Release in a Hierarchical Architecture

Home

node

Cluster B Cluster C

diff

ack

invalidate

release

ack

Full releaselock

Cluster A

Node 0

Node 1

Node 2

Time wasted waiting

Page 19: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

19

2. Partial Lock Release in a Hierarchical Architecture

Home

node

Cluster B Cluster C

diff

ackinvalidat

e

release

ack

Full releaselock

Cluster A

Node 0

Node 1

Node 2

Time saved

lockPartial Release

Page 20: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

20

Partial Lock Release in a Hierarchical Architecture

Cluster B Cluster C

ack

ack

Full release

notif

Node 0

Node 1

Node 2

Time save

d

lockPartial Release

Partially released locks can travel within a cluster

Fully released locks (acks received from all clusters) can travel to remote clusters

Page 21: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

21

Performance Evaluation:Thread Level

SCI Cluster(10,000 C.S.per thread)

lock

max_tp 1 5 15 25 infinity

speed-up

1 3.4 5.8 6.9 ~60 (unfair)

Local Thread Priority

max_tp 1 5 15 25 infinity

speed-up

1 2.1

4.7

7.3

unfair

Modification Aggregation

Inter-node message Intra-node message

Less inter-node messages

Page 22: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

22

Partial Lock Release (node level): Performance Gain

lock

0

20

40

60

80

100

120

140

0 1 2 3 4 5 6 7

Number of SCI clusters (2 nodes / cluster)

Tim

e p

er

C.S

. (u

s)

Hierarchy-Aware Hierarchy-UNaware

(10,000 C.S.per thread)

Inter-cluster message receipt overlapped with intra-cluster

computation

Page 23: Making a DSM Consistency Protocol Hierarchy-Aware:  An Efficient Synchronization Scheme

23

Conclusion

Hierarchy-aware approach to distributed synchronization

Complementary with local data caches (CLRC) New concept of "Partial Lock Release":

Applicable to other synchronization objects (semaphores, monitors), except for barriers

Applicable to other eager release consistency protocols More hierarchy levels, greater latency ratios:

PING paraplapla.irisa.fr (131.254.12.8) from 131.254.12.68 : 56(84) bytes of data.

64 bytes from paraplapla.irisa.fr (131.254.12.8): icmp_seq=9 ttl=255 time=385 usecPING ccgrid2003.apgrid.org (192.50.75.123) from 131.254.12.68 : 56(84) bytes of data.

64 bytes from sf280.hpcc.jp (192.50.75.123): icmp_seq=9 ttl=235 time=322.780 msec