
Distributed-Memory Multicomputers

Prof. Sivarama Dandamudi

School of Computer Science

Carleton University

Roadmap

Cray T3E
  Architecture details on the video
NCUBE
  Communication primitives
  Binary collapsing
Job scheduling
  Space-sharing
  Time-sharing
  Hybrid
Hierarchical scheduling
  Performance

Cray T3E

Distributed memory MIMD system
  Predecessor model: T3D
Named after the interconnect used: a 3-D torus
  Video gives details
T3E followed T3D
  Systems up to 126 processors: air cooled
  Larger systems (up to 2048 processors): liquid cooled
  Uses DEC Alpha 21164A RISC processor

Cray T3E (cont’d)

Each node consists of a processing element (PE)
  Processor and memory
  Communication engine
    Takes care of the communication between PEs

Memory
  256 or 512 MB local memory (50 ns DRAM) per node
  Total memory: 10 GB to 1 TB
  Cache coherent and physically distributed
  Globally addressable
  SECDED data error protection
  Peak memory bandwidth: 1200 MB/s per PE

All I/O channels are accessible and controllable from all PEs

Cray T3E (cont’d)

I/O is done through GigaRing channels
  Each I/O channel uses a dual-ring technique
    Two rings with data moving in opposite directions
  Maximum bandwidth of 500 MB/sec

Processor
  DEC Alpha 21164A (EV5.6), 675 MHz
  Superscalar RISC processor
  2 floating-point operations/cycle
  32- and 64-bit IEEE arithmetic
  1350 MFLOPS per PE (peak) & 2700 MIPS per PE
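The per-PE peak figures follow directly from the clock rate and the issue width; as a quick check (my arithmetic, using the 21164A's four-way instruction issue, which the slide does not state):

\[ 675\ \text{MHz} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} = 1350\ \text{MFLOPS}, \qquad 675\ \text{MHz} \times 4\ \tfrac{\text{instructions}}{\text{cycle}} = 2700\ \text{MIPS} \]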

Cray T3E (cont’d)

Interconnection
  Uses 3-D torus interconnect (like the T3D)
  Peak bisection bandwidth:
    42 GB/sec (64 PEs)
    166 GB/sec (512 PEs)

Module
  8 PEs per module
  One I/O interface per module

System size
  40 to 2176 PEs per system
    In increments of 8 PEs
  Peak performance: 54 GFLOPS to 3 TFLOPS
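The system peaks are simply the PE count times the per-PE peak; checking the two endpoints (my arithmetic):

\[ 40 \times 1.35\ \text{GFLOPS} = 54\ \text{GFLOPS}, \qquad 2176 \times 1.35\ \text{GFLOPS} \approx 2.94\ \text{TFLOPS} \approx 3\ \text{TFLOPS} \]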

Cray T3E (cont’d)

Supports explicit as well as implicit parallelism

Explicit methods
  CF90 and C/C++
  PVM
  MPI (see the sketch below)
  SHMEM

Implicit methods
  HPF
  Cray CRAFT work-sharing features
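To illustrate the explicit style, here is a minimal message-passing sketch using the standard MPI Fortran bindings. It is generic MPI code, not taken from the slides and not T3E-specific; the program name PSUM and the dummy partial value are illustrative only. Each process contributes one partial result and MPI_REDUCE combines them on process 0, much like the NCUBE sum-of-squares example later in the deck.

      PROGRAM PSUM
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, NPROCS
      REAL S, TOTAL
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
C     Each process computes a local partial result (a dummy value here)
      S = REAL(RANK)
C     Combine the partial results on process 0 with a sum reduction
      CALL MPI_REDUCE(S, TOTAL, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, IERR)
      IF (RANK .EQ. 0) PRINT *, 'TOTAL = ', TOTAL
      CALL MPI_FINALIZE(IERR)
      END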

NCUBE

Uses a binary hypercube as the interconnect

We look at the NCUBE/ten
  Uses a 10-d hypercube
  1024 32-bit processors
    Custom-made processors
  128 KB memory per node
    Small by current standards

NCUBE (cont’d)

Each printed circuit board (16" x 22") holds
  64 processors
  Memories
  Interconnect

Total system is composed of
  16 processor boards
  Up to 8 I/O boards

Entire system is housed in a small air-cooled enclosure

NCUBE (cont’d)

Inter-PCB communication
  Requires 640 connections per board
    Each node has 4 off-board bi-directional channels: 64 * 4 * 2 = 512 wires
    Each node has one I/O bi-directional channel: 64 * 2 = 128 wires
    Total wires: 512 + 128 = 640
  Bit-serial links to conserve pins/connections

NCUBE (cont’d)

Communication
  Channels operate at 10 MHz with parity check
    Data transfer rate = 1 MB/s in each direction (see the note below)
  Each channel has two 32-bit write-only registers
    One for the message buffer address
    One for the byte count
      Indicates the number of bytes left to send/receive
  A ready flag and an interrupt enable flag for each channel
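The quoted rate is consistent with the bit-serial links mentioned on the previous slide; as a rough check (my estimate, with the overhead split being an assumption):

\[ 10\ \text{MHz} \times 1\ \tfrac{\text{bit}}{\text{cycle}} = 10\ \tfrac{\text{Mbit}}{\text{s}} = 1.25\ \tfrac{\text{MB}}{\text{s}}\ \text{(raw)}, \]

and parity plus handshaking overhead leaves roughly the quoted 1 MB/s of payload in each direction.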

NCUBE (cont’d)

Communication primitives

nwrite
  To send a message

  nwrite (message, length, dest, type, status, error)

  status = indicates when the message has left the buffer
    Buffer is reusable
  error = error code

NCUBE (cont’d)

nread
  To receive a message
    Blocking receive

  nread (message, length, source, type, status, error)

  source = -1 (wildcard)
  type = -1 (wildcard)
  Both can be -1 (wildcard)
    Receives the next message

NCUBE (cont’d)

NCUBE example
  Computes the sum-of-squares of K elements: sum of V[i]^2, i = 1, ..., K
  K = N * 2^M
    Nodes in the cube = 2^M
    Each node receives N elements from the host
  Final sum is returned to the host
  Uses binary collapsing (see the pairing below)
    Works on one dimension at a time
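For reference, the pairing performed by the collapsing loop shown below (DO 2 ...) can be spelled out as follows (my restatement of that code, reading .NEQV. as XOR): at step I = M, M-1, ..., 1 only the nodes with PN < 2**I participate; each pairs with NPN = PN XOR 2**(I-1), the higher-numbered node of the pair sends its partial sum, and the lower-numbered node adds it to its own. For M = 3 (8 nodes):

  I = 3:  4 sends to 0,  5 to 1,  6 to 2,  7 to 3
  I = 2:  2 sends to 0,  3 to 1
  I = 1:  1 sends to 0

After M steps node 0 holds the complete sum.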

NCUBE (cont’d)

NCUBE example

call whoami(PN, PROC, HOST, M)

  PN = logical processor #
  PROC = process # in node
  HOST = host for cube communication
  M = dimension of the allocated cube

SR = nread(V, N*4, HOST, TYPEH, FLAG1)

  Receive vector V of length N (N*4 bytes) from HOST

NCUBE (cont’d)

Local computation:

      S = 0
      DO 1 I = 1, N
    1 S = S + V(I)**2

Local computation is done by each node
Once done, we use binary collapsing to compute the final sum

NCUBE (cont’d)

      DO 2 I = M, 1, -1
         IF (PN .LT. 2**I) THEN
            NPN = PN .NEQV. (2**(I-1))
            IF (NPN .LT. PN) THEN
               SW = nwrite (S, 4, NPN, TYPEN, FLAG2)
            ELSE
               SR = nread (A, 4, NPN, TYPEN, FLAG3)
               S = S + A
            ENDIF
         ENDIF
    2 CONTINUE

.NEQV. acts as the XOR operator here

NCUBE (cont’d)

Send the final result back to the host:

      IF (PN .EQ. 0) THEN
         SW = nwrite (S, 4, HOST, TYPEH, FLAG4)
      ENDIF

This code is executed by node 0 only

Scheduling in Multicomputers

Principle (in the absence of priority)
  Share processing power equally among the jobs

Uniprocessors
  Round-robin/processor sharing

Multicomputers
  Equal sharing can be done
    Spatially
      Space-sharing policies
    Temporally
      Time-sharing policies

Space-Sharing Policies

Space-sharing policies
  System is divided into several partitions
  Each partition is assigned to a parallel job
  The assigned job keeps the partition until completion
    Run-to-completion strategy

Three types of policies
  Fixed
  Static
  Dynamic

Space-Sharing Policies (cont’d)

Fixed space-sharing
  Partitioning is a system configuration parameter
    Long term
  Job characteristics can be used
    Maximum job parallelism
    Average job parallelism
  Partition is kept by the job until completion

  Advantage
    Simple implementation
  Not the best way
    Several problems

Space-Sharing Policies (cont’d)

Problems with fixed space-sharing
  Difficult to partition the system
    What is the best partition?
  Does not adapt to system load conditions and resource requirements of jobs
  Internal fragmentation (this refers to leaving some allocated processors idle)
    Example: allocating 50 processors to a job that requires only 40 processors
  Leads to under-utilization of resources
    In the last example, 10 processors idle

Space-Sharing Policies (cont’d)

Static space-sharing
  Partitions are allocated on a job-by-job basis at schedule time
    No pre-partitioning as in fixed space-sharing
  Eliminates the mismatch between a job's required partition size and the allocated partition size
  As in fixed policies, the partition is kept until the job completes

Advantages
  Internal fragmentation is avoided
  Better than fixed space-sharing

Space-Sharing Policies (cont’d)

Problems with static space-sharing
  External fragmentation is possible
    We can reduce this by using
      First-fit
      Best-fit (a sketch of both follows below)
    A related problem: fairness
  Another solution [Tucker and Gupta 1989]: adjust the software structure to fit the partition size
    Suitable for some applications
    Not suitable for applications that require the partition size at compile time in order to optimize the code
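As an illustration of the two allocation rules, here is a minimal sketch in the same Fortran style as the NCUBE example. Everything in it is hypothetical (the function PICK, the FREE array of free-partition sizes, and the BEST flag are my own names, not part of any policy in these slides): it returns the index of the free block chosen for a job needing NEED processors, or 0 if nothing fits (the job would then wait).

      INTEGER FUNCTION PICK(FREE, NFREE, NEED, BEST)
C     FREE(1..NFREE) holds the sizes of the free partitions
C     BEST = .FALSE. selects first-fit, BEST = .TRUE. selects best-fit
      INTEGER NFREE, NEED
      INTEGER FREE(NFREE)
      LOGICAL BEST
      INTEGER I, CHOICE
      CHOICE = 0
      DO 10 I = 1, NFREE
         IF (FREE(I) .GE. NEED) THEN
            IF (.NOT. BEST) THEN
C              First-fit: take the first block that is large enough
               CHOICE = I
               GOTO 20
            ELSE IF (CHOICE .EQ. 0) THEN
               CHOICE = I
            ELSE IF (FREE(I) .LT. FREE(CHOICE)) THEN
C              Best-fit: keep the smallest block that still fits
               CHOICE = I
            ENDIF
         ENDIF
   10 CONTINUE
   20 PICK = CHOICE
      END

First-fit stops at the first hole that is large enough, while best-fit scans every hole and keeps the tightest match.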

Space-Sharing Policies (cont’d)

Fragmentation can also occur due to
  System imposed constraints
    Example: in hypercube machines, a contiguous set of nodes may not be available to form a sub-cube
  All-or-nothing allocation
    Partial allocation may be acceptable to many applications

Central allocator may create performance problems
  It can become a bottleneck
  Fault-tolerance/reliability

Space-Sharing Policies (cont’d)

Example Policy

Original policy
  Partition size = MAX(1, Total processors / (Q + 1))
    Q = job queue length
  Problem: does not take scheduled jobs into account

Modified policy
  Partition size = MAX(1, Total processors / (Q + f*S + 1))
    Q = job queue length
    S = number of scheduled jobs
    f = weight of scheduled jobs (between 0 and 1)
  (worked example below)
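A worked illustration with assumed numbers (none of these values appear on the slide): on a 64-processor system with Q = 3 waiting jobs, S = 2 scheduled jobs and f = 0.5,

  Original policy:  MAX(1, 64 / (3 + 1)) = 16 processors
  Modified policy:  MAX(1, 64 / (3 + 0.5*2 + 1)) = MAX(1, 12.8) = 12 processors (assuming the quotient is truncated to an integer)

Counting the scheduled jobs makes the allocator hand out smaller partitions when work is already in the system.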

Space-Sharing Policies (cont’d)

Dynamic space-sharing
  Processors are not allocated on a lifetime basis
    Processors are taken away from jobs if they cannot use them
  Particularly useful for jobs that exhibit a varying degree of parallelism

Advantage
  Eliminates some forms of external fragmentation by not allocating partitions for the lifetime of jobs

Space-Sharing Policies (cont’d)

Problems with dynamic space-sharing
  Difficult to implement on distributed-memory multicomputers
    Expensive to take processors away in distributed-memory multicomputers
    Processors may be taken only when the computation reaches a desired "yielding point"
  Central allocator may become a bottleneck
  Not used with multicomputer systems

Time-Sharing Policies

Space-sharing
  Fixed policies: long-term commitment
  Static policies: commitments at the job level
  Dynamic policies: commitments at the task or sub-task level

Time-sharing
  Changes focus from jobs to processors
  Time-sharing uses preemption to rotate processors amongst a number of jobs
  Usually specified by the multiprogramming level (MPL)

Time-Sharing Policies (cont’d)

Two policies
  Task-based round-robin (RRTask)
    Quantum size is fixed per task
    Violates our "equal allocation of processing power" principle
      Larger jobs tend to dominate (see the note below)
  Job-based round-robin (RRJob)
    Quantum size is fixed per job
    Equal allocation is possible

Preemption can be
  Coordinated (gang scheduling)
  Uncoordinated
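The dominance of larger jobs under RRTask can be put in one line (my restatement of the argument): with quantum q,

  RRTask: a job with k tasks receives about k*q units of service per round of the queue
  RRJob:  every job receives q per round, regardless of its number of tasks

so under RRTask a job's share of the machine grows with its task count.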

Time-Sharing Policies (cont’d)

Problems with time-sharing
  Requires a central coordinator
    Coordinator can become a bottleneck for large systems
  Central task queue can create bottleneck problems
  Could use local RRJob
    Apply round-robin at the processor level
    Not as effective

Hybrid version is effective
  Combined space- and time-sharing
  Partition as in space-sharing, but time-share each partition

Hierarchical Scheduling

Motivation
  Should be self-scheduling to avoid bottlenecks
  Should not cause bottleneck problems
    For the global task queue and coordinator
  Should minimize internal fragmentation
    As in time sharing
  Should minimize external fragmentation
    Implies partial allocation
    Handling system imposed constraints
  Should be a hybrid policy
    Space-sharing at low system loads, time-sharing at moderate to high loads

Hierarchical Scheduling (cont’d)

Performance

Ideal workload
  Example: job service time = 16 minutes, divided into four tasks: 4, 4, 4, 4 minutes

50-50 workload
  50% of the tasks get 50% of the evenly distributed per-task service time (worked out below)
  Example: 2, 2, 6, 6 minutes

50-25 workload
  Example: 1, 1, 7, 7 minutes

50-75 workload
  Example: 3, 3, 5, 5 minutes
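To see how these task times are generated (my reading of the definition, checked against all three examples): the even per-task service time is 16 / 4 = 4 minutes, and in an "X-Y" workload X% of the tasks each get Y% of that even share, with the rest of the 16 minutes split evenly over the remaining tasks. For the 50-25 workload:

  2 tasks x (0.25 * 4 min) = 1, 1 minutes
  remaining 16 - 2 = 14 minutes over 2 tasks = 7, 7 minutes

giving 1, 1, 7, 7; the 50-50 and 50-75 rows follow the same recipe (2, 2, 6, 6 and 3, 3, 5, 5).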

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the ideal workload, comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 1), comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 15), comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-25 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-75 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing scheduling]
