
Distributed-Memory Multicomputers

Prof. Sivarama Dandamudi

School of Computer Science

Carleton University

Roadmap

Cray T3E
  Architecture details on the video
NCUBE
  Communication primitives
  Binary collapsing
Job scheduling
  Space-sharing
  Time-sharing
  Hybrid
Hierarchical scheduling
  Performance

Cray T3E

Distributed memory MIMD system
  Predecessor model: T3D
Named after the interconnect used: a 3-D torus
  Video gives details
T3E followed T3D
  Systems up to 126 processors: air cooled
  Larger systems (up to 2048 processors): liquid cooled
  Uses DEC Alpha 21164A RISC processor

Cray T3E (cont’d)

Each node consists of a processing element (PE)
  Processor and memory
  Communication engine
    Takes care of the communication between PEs

Memory
  256 or 512 MB local memory (50 ns DRAM) per node
  Total memory: 10 GB to 1 TB
  Cache coherent and physically distributed
  Globally addressable
  SECDED data error protection
  Peak memory bandwidth: 1200 MB/s per PE

All I/O channels are accessible and controllable from all PEs

Cray T3E (cont’d)

I/O is done through GigaRing channels
  Each I/O channel uses a dual-ring technique
    Two rings with data moving in opposite directions
  Maximum bandwidth of 500 MB/sec

Processor
  DEC Alpha 21164A (EV5.6), 675 MHz
  Superscalar RISC processor
  2 floating-point operations/cycle
  32- and 64-bit IEEE arithmetic
  1350 MFLOPS per PE (peak) & 2700 MIPS per PE
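The per-PE peak figures follow directly from the clock rate and the issue width; as a quick check (my arithmetic, using the 21164A's four-way instruction issue, which the slide does not state):

\[ 675\ \text{MHz} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} = 1350\ \text{MFLOPS}, \qquad 675\ \text{MHz} \times 4\ \tfrac{\text{instructions}}{\text{cycle}} = 2700\ \text{MIPS} \]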

Cray T3E (cont’d)

Interconnection
  Uses 3-D torus interconnect (like the T3D)
  Peak bisection bandwidth:
    42 GB/sec (64 PEs)
    166 GB/sec (512 PEs)

Module
  8 PEs per module
  One I/O interface per module

System size
  40 to 2176 PEs per system
    In increments of 8 PEs
  Peak performance: 54 GFLOPS to 3 TFLOPS
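The system peaks are simply the PE count times the per-PE peak; checking the two endpoints (my arithmetic):

\[ 40 \times 1.35\ \text{GFLOPS} = 54\ \text{GFLOPS}, \qquad 2176 \times 1.35\ \text{GFLOPS} \approx 2.94\ \text{TFLOPS} \approx 3\ \text{TFLOPS} \]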

Cray T3E (cont’d)

Supports explicit as well as implicit parallelism

Explicit methods
  CF90 and C/C++
  PVM
  MPI (see the sketch below)
  SHMEM

Implicit methods
  HPF
  Cray CRAFT work-sharing features
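To illustrate the explicit style, here is a minimal message-passing sketch using the standard MPI Fortran bindings. It is generic MPI code, not taken from the slides and not T3E-specific; the program name PSUM and the dummy partial value are illustrative only. Each process contributes one partial result and MPI_REDUCE combines them on process 0, much like the NCUBE sum-of-squares example later in the deck.

      PROGRAM PSUM
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, NPROCS
      REAL S, TOTAL
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
C     Each process computes a local partial result (a dummy value here)
      S = REAL(RANK)
C     Combine the partial results on process 0 with a sum reduction
      CALL MPI_REDUCE(S, TOTAL, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, IERR)
      IF (RANK .EQ. 0) PRINT *, 'TOTAL = ', TOTAL
      CALL MPI_FINALIZE(IERR)
      END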

NCUBE

Uses a binary hypercube as the interconnect

We look at the NCUBE/ten
  Uses a 10-d hypercube
  1024 32-bit processors
    Custom-made processors
  128 KB memory per node
    Small by current standards

NCUBE (cont’d)

Each printed circuit board (16" x 22") holds
  64 processors
  Memories
  Interconnect

Total system is composed of
  16 processor boards
  Up to 8 I/O boards

Entire system is housed in a small air-cooled enclosure

NCUBE (cont’d)

Inter-PCB communication
  Requires 640 connections per board
    Each node has 4 off-board bi-directional channels: 64 * 4 * 2 = 512 wires
    Each node has one I/O bi-directional channel: 64 * 2 = 128 wires
    Total wires: 512 + 128 = 640
  Bit-serial links to conserve pins/connections

NCUBE (cont’d)

Communication
  Channels operate at 10 MHz with parity check
    Data transfer rate = 1 MB/s in each direction (see the note below)
  Each channel has two 32-bit write-only registers
    One for the message buffer address
    One for the byte count
      Indicates the number of bytes left to send/receive
  A ready flag and an interrupt enable flag for each channel
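The quoted rate is consistent with the bit-serial links mentioned on the previous slide; as a rough check (my estimate, with the overhead split being an assumption):

\[ 10\ \text{MHz} \times 1\ \tfrac{\text{bit}}{\text{cycle}} = 10\ \tfrac{\text{Mbit}}{\text{s}} = 1.25\ \tfrac{\text{MB}}{\text{s}}\ \text{(raw)}, \]

and parity plus handshaking overhead leaves roughly the quoted 1 MB/s of payload in each direction.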

NCUBE (cont’d)

Communication primitives

nwrite
  To send a message

  nwrite (message, length, dest, type, status, error)

  status = indicates when the message has left the buffer
    Buffer is reusable
  error = error code

NCUBE (cont’d)

nread
  To receive a message
    Blocking receive

  nread (message, length, source, type, status, error)

  source = -1 (wildcard)
  type = -1 (wildcard)
  Both can be -1 (wildcard)
    Receives the next message

NCUBE (cont’d)

NCUBE example
  Computes the sum-of-squares of K elements: sum of V[i]^2, i = 1, ..., K
  K = N * 2^M
    Nodes in the cube = 2^M
    Each node receives N elements from the host
  Final sum is returned to the host
  Uses binary collapsing (see the pairing below)
    Works on one dimension at a time
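For reference, the pairing performed by the collapsing loop shown below (DO 2 ...) can be spelled out as follows (my restatement of that code, reading .NEQV. as XOR): at step I = M, M-1, ..., 1 only the nodes with PN < 2**I participate; each pairs with NPN = PN XOR 2**(I-1), the higher-numbered node of the pair sends its partial sum, and the lower-numbered node adds it to its own. For M = 3 (8 nodes):

  I = 3:  4 sends to 0,  5 to 1,  6 to 2,  7 to 3
  I = 2:  2 sends to 0,  3 to 1
  I = 1:  1 sends to 0

After M steps node 0 holds the complete sum.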

NCUBE (cont’d)

NCUBE example

call whoami(PN, PROC, HOST, M)

  PN = logical processor #
  PROC = process # in node
  HOST = host for cube communication
  M = dimension of the allocated cube

SR = nread(V, N*4, HOST, TYPEH, FLAG1)

  Receive vector V of length N (N*4 bytes) from HOST

NCUBE (cont’d)

Local computation:

      S = 0
      DO 1 I = 1, N
    1 S = S + V(I)**2

Local computation is done by each node
Once done, we use binary collapsing to compute the final sum

NCUBE (cont’d)

      DO 2 I = M, 1, -1
         IF (PN .LT. 2**I) THEN
            NPN = PN .NEQV. (2**(I-1))
            IF (NPN .LT. PN) THEN
               SW = nwrite (S, 4, NPN, TYPEN, FLAG2)
            ELSE
               SR = nread (A, 4, NPN, TYPEN, FLAG3)
               S = S + A
            ENDIF
         ENDIF
    2 CONTINUE

.NEQV. acts as the XOR operator here

NCUBE (cont’d)

Send the final result back to the host:

      IF (PN .EQ. 0) THEN
         SW = nwrite (S, 4, HOST, TYPEH, FLAG4)
      ENDIF

This code is executed by node 0 only

Scheduling in Multicomputers

Principle (in the absence of priority)
  Share processing power equally among the jobs

Uniprocessors
  Round-robin/processor sharing

Multicomputers
  Equal sharing can be done
    Spatially
      Space-sharing policies
    Temporally
      Time-sharing policies

Space-Sharing Policies

Space-sharing policies
  System is divided into several partitions
  Each partition is assigned to a parallel job
  The assigned job keeps the partition until completion
    Run-to-completion strategy

Three types of policies
  Fixed
  Static
  Dynamic

Space-Sharing Policies (cont’d)

Fixed space-sharing
  Partitioning is a system configuration parameter
    Long term
  Job characteristics can be used
    Maximum job parallelism
    Average job parallelism
  Partition is kept by the job until completion

  Advantage
    Simple implementation
  Not the best way
    Several problems

Space-Sharing Policies (cont’d)

Problems with fixed space-sharing
  Difficult to partition the system
    What is the best partition?
  Does not adapt to system load conditions and resource requirements of jobs
  Internal fragmentation (this refers to leaving some allocated processors idle)
    Example: allocating 50 processors to a job that requires only 40 processors
  Leads to under-utilization of resources
    In the last example, 10 processors idle

Space-Sharing Policies (cont’d)

Static space-sharing
  Partitions are allocated on a job-by-job basis at schedule time
    No pre-partitioning as in fixed space-sharing
  Eliminates the mismatch between a job's required partition size and the allocated partition size
  As in fixed policies, the partition is kept until the job completes

Advantages
  Internal fragmentation is avoided
  Better than fixed space-sharing

Space-Sharing Policies (cont’d)

Problems with static space-sharing
  External fragmentation is possible
    We can reduce this by using
      First-fit
      Best-fit (a sketch of both follows below)
    A related problem: fairness
  Another solution [Tucker and Gupta 1989]: adjust the software structure to fit the partition size
    Suitable for some applications
    Not suitable for applications that require the partition size at compile time in order to optimize the code
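As an illustration of the two allocation rules, here is a minimal sketch in the same Fortran style as the NCUBE example. Everything in it is hypothetical (the function PICK, the FREE array of free-partition sizes, and the BEST flag are my own names, not part of any policy in these slides): it returns the index of the free block chosen for a job needing NEED processors, or 0 if nothing fits (the job would then wait).

      INTEGER FUNCTION PICK(FREE, NFREE, NEED, BEST)
C     FREE(1..NFREE) holds the sizes of the free partitions
C     BEST = .FALSE. selects first-fit, BEST = .TRUE. selects best-fit
      INTEGER NFREE, NEED
      INTEGER FREE(NFREE)
      LOGICAL BEST
      INTEGER I, CHOICE
      CHOICE = 0
      DO 10 I = 1, NFREE
         IF (FREE(I) .GE. NEED) THEN
            IF (.NOT. BEST) THEN
C              First-fit: take the first block that is large enough
               CHOICE = I
               GOTO 20
            ELSE IF (CHOICE .EQ. 0) THEN
               CHOICE = I
            ELSE IF (FREE(I) .LT. FREE(CHOICE)) THEN
C              Best-fit: keep the smallest block that still fits
               CHOICE = I
            ENDIF
         ENDIF
   10 CONTINUE
   20 PICK = CHOICE
      END

First-fit stops at the first hole that is large enough, while best-fit scans every hole and keeps the tightest match.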

Space-Sharing Policies (cont’d)

Fragmentation can also occur due to
  System imposed constraints
    Example: in hypercube machines, a contiguous set of nodes may not be available to form a sub-cube
  All-or-nothing allocation
    Partial allocation may be acceptable to many applications

Central allocator may create performance problems
  It can become a bottleneck
  Fault-tolerance/reliability

Space-Sharing Policies (cont’d)

Example Policy

Original policy
  Partition size = MAX(1, Total processors / (Q + 1))
    Q = job queue length
  Problem: does not take scheduled jobs into account

Modified policy
  Partition size = MAX(1, Total processors / (Q + f*S + 1))
    Q = job queue length
    S = number of scheduled jobs
    f = weight of scheduled jobs (between 0 and 1)
  (worked example below)
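A worked illustration with assumed numbers (none of these values appear on the slide): on a 64-processor system with Q = 3 waiting jobs, S = 2 scheduled jobs and f = 0.5,

  Original policy:  MAX(1, 64 / (3 + 1)) = 16 processors
  Modified policy:  MAX(1, 64 / (3 + 0.5*2 + 1)) = MAX(1, 12.8) = 12 processors (assuming the quotient is truncated to an integer)

Counting the scheduled jobs makes the allocator hand out smaller partitions when work is already in the system.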

Space-Sharing Policies (cont’d)

Dynamic space-sharing
  Processors are not allocated on a lifetime basis
    Processors are taken away from jobs if they cannot use them
  Particularly useful for jobs that exhibit a varying degree of parallelism

Advantage
  Eliminates some forms of external fragmentation by not allocating partitions for the lifetime of jobs

Space-Sharing Policies (cont’d)

Problems with dynamic space-sharing
  Difficult to implement on distributed-memory multicomputers
    Expensive to take processors away in distributed-memory multicomputers
    Processors may be taken only when the computation reaches a desired "yielding point"
  Central allocator may become a bottleneck
  Not used with multicomputer systems

Time-Sharing Policies

Space-sharing
  Fixed policies: long-term commitment
  Static policies: commitments at the job level
  Dynamic policies: commitments at the task or sub-task level

Time-sharing
  Changes focus from jobs to processors
  Time-sharing uses preemption to rotate processors amongst a number of jobs
  Usually specified by the multiprogramming level (MPL)

Time-Sharing Policies (cont’d)

Two policies
  Task-based round-robin (RRTask)
    Quantum size is fixed per task
    Violates our "equal allocation of processing power" principle
      Larger jobs tend to dominate (see the note below)
  Job-based round-robin (RRJob)
    Quantum size is fixed per job
    Equal allocation is possible

Preemption can be
  Coordinated (gang scheduling)
  Uncoordinated
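The dominance of larger jobs under RRTask can be put in one line (my restatement of the argument): with quantum q,

  RRTask: a job with k tasks receives about k*q units of service per round of the queue
  RRJob:  every job receives q per round, regardless of its number of tasks

so under RRTask a job's share of the machine grows with its task count.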

Time-Sharing Policies (cont’d)

Problems with time-sharing
  Requires a central coordinator
    Coordinator can become a bottleneck for large systems
  Central task queue can create bottleneck problems
  Could use local RRJob
    Apply round-robin at the processor level
    Not as effective

Hybrid version is effective
  Combined space- and time-sharing
  Partition as in space-sharing, but time-share each partition

Hierarchical Scheduling

Motivation
  Should be self-scheduling to avoid bottlenecks
  Should not cause bottleneck problems
    For the global task queue and coordinator
  Should minimize internal fragmentation
    As in time sharing
  Should minimize external fragmentation
    Implies partial allocation
    Handling system imposed constraints
  Should be a hybrid policy
    Space-sharing at low system loads, time-sharing at moderate to high loads

Hierarchical Scheduling (cont’d)

Performance

Ideal workload
  Example: job service time = 16 minutes, divided into four tasks: 4, 4, 4, 4 minutes

50-50 workload
  50% of the tasks get 50% of the evenly distributed per-task service time (worked out below)
  Example: 2, 2, 6, 6 minutes

50-25 workload
  Example: 1, 1, 7, 7 minutes

50-75 workload
  Example: 3, 3, 5, 5 minutes
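To see how these task times are generated (my reading of the definition, checked against all three examples): the even per-task service time is 16 / 4 = 4 minutes, and in an "X-Y" workload X% of the tasks each get Y% of that even share, with the rest of the 16 minutes split evenly over the remaining tasks. For the 50-25 workload:

  2 tasks x (0.25 * 4 min) = 1, 1 minutes
  remaining 16 - 2 = 14 minutes over 2 tasks = 7, 7 minutes

giving 1, 1, 7, 7; the 50-50 and 50-75 rows follow the same recipe (2, 2, 6, 6 and 3, 3, 5, 5).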

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the ideal workload, comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 1), comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 15), comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-25 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing scheduling]

Hierarchical Scheduling (cont’d)

[Figure: mean response time vs. utilization (%) for the 50-75 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing scheduling]
