Master Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing Systems and Enabling Platforms
Marco Vanneschi
Precourse in Computer Architecture Fundamentals
2. Firmware Level
Firmware level
• At this level, the system is viewed as a collection of cooperating modules called PROCESSING UNITS (simply: units).
• Each Processing Unit is
  – autonomous
     • it has its own control, i.e. self-control capability: it is an active computational entity
  – described by a sequential program, called a microprogram.
• Cooperation is realized through COMMUNICATIONS
  – Communication channels are implemented by physical links and a firmware protocol.
• Parallelism among units.
[Figure: units U1, U2, …, Uj, …, Un; each unit Uj consists of a Control Part j and an Operating Part j]
Examples
A uniprocessor architecture

[Figure: a uniprocessor architecture: CPU (Processor, Memory Management Unit, Cache Memory, Interrupt Arbiter), Main Memory, and I/O units I/O1 … I/Ok]

• In turn, the CPU is a collection of communicating units (on the same chip).
• In turn, the Cache Memory can be implemented as a collection of parallel units: instruction and data cache; primary, secondary, tertiary… cache.
• In turn, the Main Memory can be implemented as a collection of parallel units: modular memory, interleaved memory, buffer memory, or …
• In turn, the Processor can be implemented as a (large) collection of parallel, pipelined units.
• In turn, an advanced I/O unit can be a parallel architecture (vector / SIMD coprocessor / GPU / …), or a powerful multiprocessor-based network interface, or ….
Firmware interpretation of assembly language
[Figure: levels stack (Applications / Processes / Assembler / Firmware / Hardware); architectures at the firmware level: Uniprocessor (Instruction Level Parallelism), Shared Memory Multiprocessor (SMP, NUMA, …), Distributed Memory (Cluster, MPP, …)]

• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.

The firmware interpretation of any assembly language instruction has a decentralized organization, even for simple uniprocessor systems. The interpreter functionalities are partitioned among the various processing units (belonging to the CPU, M, I/O subsystems), each unit having its own Control Part and its own microprogram. Cooperation of units occurs through explicit message passing.
Processing Unit model: Control Part + Operating Part
PC and PO are automata, implemented by synchronous sequential networks (sequential circuits).

Clock cycle: time interval τ for executing a (any) microinstruction. Synchronous model of computation.

[Figure: Control Part (PC) and Operating Part (PO), exchanging condition variables {x} (PO to PC) and control variables γ = {α} ∪ {β} (PC to PO); external inputs and outputs; clock signal]
Example of a simple unit design
Specification: ∀ IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
while (true) do
  B = IN; C = 0;
  for (J = 0; J < N; J++)
    if (A[J] == B) then C = C + 1;
  OUT = C

[Figure: unit U with 32-bit input stream IN, 32-bit output stream OUT, and internal integer array A[N] implemented as a register RAM/ROM]
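The interpreter pseudo-code maps directly onto ordinary sequential code. A minimal C sketch of one service of the unit follows (the value of N, the array contents, and the function name are ours, purely for illustration):

#include <stdint.h>

#define N 16  /* hypothetical array size */

/* The unit's internal array A[N] (the register RAM/ROM in the PO). */
static const int32_t A[N] = { 3, 7, 3, 1, 3, 9, 7, 3,
                              0, 3, 2, 3, 7, 3, 5, 3 };

/* One "service" of unit U: given one stream element IN,
   produce OUT = number of occurrences of IN in A. */
int32_t number_of_occurrences(int32_t in) {
    int32_t b = in, c = 0;            /* B = IN; C = 0            */
    for (int j = 0; j < N; j++)       /* for (J = 0; J < N; J++)  */
        if (A[j] == b) c = c + 1;     /* if (A[J] == B) C = C + 1 */
    return c;                         /* OUT = C                  */
}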
Approach to processing unit design
• PO structure is composed of
  – Registers: to implement variables
     • independent registers
     • register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
Synchronized registers
when clock do R_output = R_input

[Figure: register R with input R_input and output R_output = R_present_state; clock (pulse signal)]

In PO, the clock signal is AND-ed with a control variable (β), in order to enable register writing explicitly by PC.

[Figure: register R whose write enable is the AND of the clock and the Register_writing signal]

Register output and state remain STABLE during a clock cycle.
Example of PO structure

[Figure: the PO of the example unit: 32-bit registers B, C, J, OUT and a 1-bit register Z, each with a write-enable signal (β) from PC; multiplexers KC, KJ, KALU1 with selection variables (α) from PC; arithmetic-logic combinatorial circuits ALU1 (–, +1) and ALU2 (+1); register array A[N] (RAM/ROM, N words, 32 bits/word) addressed by Jm, with output outA; condition variables x0 = J0 and x1 = Z sent to PC]
Approach to processing unit design
• PO structure is composed of
  – Registers: to implement variables
     • independent registers
     • register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
  – Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
Example of micro-operation

[Figure: PO data path for the execution of zero (A[J] – B) → Z, with the values of the control variables that select it]

Executed in one clock cycle: all the involved combinatorial circuits become stable during the clock cycle.
Approach to processing unit design
• PO structure is composed of
  – Registers: to implement variables
     • independent registers
     • register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations.
  – Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations.
  – Example: A + B → C, D – E → D
Example of parallel micro-operation

[Figure: PO data path for the execution of J + 1 → J, C + 1 → C, with the values of the control variables that select it]
These register-transfer operations are logically independent, and they can actually be executed in parallel if proper PO resources are provided (e.g., two ALUs). Both operations are executed during the same clock cycle.
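A way to see why both transfers can share one clock cycle: under the synchronous model, all register inputs are computed from the current state, and all enabled registers commit together at the clock pulse. A minimal C sketch of this two-phase semantics (variable and function names are ours):

#include <stdio.h>

/* Current state of the two registers involved. */
static int J = 5, C = 2;

/* One clock cycle executing the parallel micro-operation
   J + 1 -> J, C + 1 -> C.
   Phase 1: both ALU outputs stabilize from the CURRENT state.
   Phase 2: both registers are written at the clock pulse. */
static void clock_cycle(void) {
    int input_J = J + 1;   /* ALU2 output, stable during the cycle */
    int input_C = C + 1;   /* ALU1 output, stable during the cycle */
    J = input_J;           /* commit: when clock do R_output = R_input */
    C = input_C;
}

int main(void) {
    clock_cycle();
    printf("J = %d, C = %d\n", J, C);   /* prints J = 6, C = 3 */
    return 0;
}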
Approach to processing unit design
• PO structure is composed of
  – Registers: to implement variables
     • independent registers
     • register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle it interacts with PO in order to control the execution of the current microinstruction.
Example of microinstruction

i. if (Asign = 0)
      then A + B → A, J + 1 → J, goto j
      else A – B → B, goto k
j. …
…
k. …

Anatomy of a microinstruction: the current microinstruction label (i); a logical condition on one or more condition variables (Asign denotes the most significant bit of register A); a (parallel) micro-operation; the label of the next microinstruction.

The Control Part (PC) is an automaton, whose state graph corresponds to the structure of the microprogram:

[Figure: state graph with nodes i, j, k;
 Asign = 0: execute A + B → A, J + 1 → J in PO, and goto j;
 Asign = 1: execute A – B → B in PO, and goto k]
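As a software analogy (purely illustrative: PC is a sequential circuit, not a program), the state graph can be read as a switch on the current label that tests the condition variables, selects the micro-operation, and returns the next label:

/* Illustrative C rendering of the PC automaton for labels i, j, k.
   A, B, J play the role of PO registers; Asign is the condition variable. */
#include <stdint.h>

enum label { I, JJ, K };   /* microinstruction labels i, j, k */

static int32_t A, B, J;

static enum label step(enum label current) {
    switch (current) {
    case I: {
        int asign = (A < 0);          /* most significant bit of A */
        if (asign == 0) {             /* Asign = 0 */
            int32_t nA = A + B, nJ = J + 1;
            A = nA; J = nJ;           /* A + B -> A, J + 1 -> J, in parallel */
            return JJ;                /* goto j */
        } else {                      /* Asign = 1 */
            B = A - B;                /* A - B -> B */
            return K;                 /* goto k */
        }
    }
    case JJ: /* ... */ return JJ;
    case K:  /* ... */ return K;
    }
    return current;
}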
Approach to processing unit design
• PO structure is composed of
  – Registers: to implement variables
     • independent registers
     • register array: ROM/RAM memory
  – Standard combinatorial circuits: ALUs (arithmetic-logic operations), multiplexers, and a few others
• The microprogram is expressed by means of register-transfer micro-operations. Example: A + B → C
  – i.e., the contents of registers A and B are added, and the result is written into register C.
• Parallelism inside micro-operations. Example: A + B → C, D – E → F
• Conditional microinstructions: if … then … else, switch case …
• PC: at each clock cycle it interacts with PO in order to control the execution of the current microinstruction (generation of the proper values of the control variables).
• Each microinstruction is executed in one clock cycle.
Clock cycle: informal meaning
[Figure: timing diagram of PC and PO over one clock cycle: the pulse width covers the stabilization of register contents; the values of the control variables select the micro-operation; execution of A + B → C (A, B, C: registers of PO) means input_C = A + B stabilizes during the cycle and C = input_C at the clock pulse; A + B → C, D + 1 → D execute in parallel in the same clock cycle via input_D = D + 1, D = input_D]

Clock cycle τ = maximum stabilization delay time for the execution of a (any) microinstruction, i.e. for the stabilization of the inputs of all the PC and PO registers.

Clock frequency f = 1/τ; e.g. τ = 0.25 nsec → f = 4 GHz.
Clock cycle: formally
[Figure: timing diagram of PC and PO: condition variables {x} from PO to PC, control variables {γ} from PC to PO; stabilization delays TωPC, TσPC, TωPO, TσPO; δ = pulse width]

clock cycle τ = TωPO + max (TωPC + TσPO, TσPC) + δ

ωA: output function of automaton A
σA: state transition function of automaton A
TF: stabilization delay of the combinatorial network implementing function F
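A worked instance of the formula, with hypothetical stabilization delays (the numbers below are illustrative only, not taken from any real design):

#include <stdio.h>

/* Hypothetical stabilization delays, in picoseconds. */
#define T_OMEGA_PO 120  /* PO output function (ALUs, multiplexers) */
#define T_OMEGA_PC  30  /* PC output function (control variables)  */
#define T_SIGMA_PO  60  /* PO state transition (register inputs)   */
#define T_SIGMA_PC  40  /* PC state transition (next label)        */
#define DELTA       20  /* clock pulse width                       */

static int max(int a, int b) { return a > b ? a : b; }

int main(void) {
    /* tau = T_wPO + max(T_wPC + T_sPO, T_sPC) + delta */
    int tau = T_OMEGA_PO + max(T_OMEGA_PC + T_SIGMA_PO, T_SIGMA_PC) + DELTA;
    printf("tau = %d ps, f = %.2f GHz\n", tau, 1000.0 / tau);  /* 230 ps, ~4.35 GHz */
    return 0;
}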
Example of a simple unit design
Specification: ∀ IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
while (true) do
  B = IN; C = 0;
  for (J = 0; J < N; J++)
    if (A[J] == B) then C = C + 1;
  OUT = C

Microprogram (= interpreter implementation)
0. IN → B, 0 → C, 0 → J, goto 1
1. if (Jmost = 0) then zero (A[Jmodule] – B) → Z, goto 2
   else C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
   else J + 1 → J, C + 1 → C, goto 1
// to be completed with synchronized communications //

[Figure: unit U with 32-bit input stream IN, 32-bit output stream OUT, and integer array A[N] implemented as a register RAM/ROM]

[Figure: Control Part state graph (automaton): states 0, 1, 2; 0 → 1 executed 1 time; 1 → 2 (Jmost = 0) and 2 → 1 executed N times each; 1 → 0 (Jmost = 1) executed 1 time]

Service time of U: Tcalc = k τ = (2 + 2N) τ ≈ 2N τ
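A quick sanity check of the cycle count: simulating the microprogram's control flow and counting one clock cycle per executed microinstruction gives k = 2 + 2N regardless of the array contents (a hypothetical test harness):

#include <stdio.h>

#define N 16  /* hypothetical array size */

/* Count the clock cycles of one service of the microprogram:
   state 0 once, state 1 (N+1) times, state 2 N times. */
int main(void) {
    int cycles = 0, state = 0, J = 0;
    for (;;) {
        cycles++;                     /* one microinstruction per clock cycle */
        if (state == 0) { J = 0; state = 1; }
        else if (state == 1) {
            if (J < N) state = 2;     /* Jmost = 0: test A[Jmodule]        */
            else break;               /* Jmost = 1: C -> OUT, service done */
        }
        else { J++; state = 1; }      /* state 2: J + 1 -> J (and maybe C) */
    }
    printf("k = %d (expected 2 + 2N = %d)\n", cycles, 2 + 2 * N);
    return 0;
}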
PO structure

[Figure: the PO of the example unit, as above: registers B, C, J, OUT (32 bits) and Z (1 bit) with write-enable signals from PC; multiplexers with selection variables from PC; ALU1 (–, +1); ALU2 (+1); register array A[N] (RAM/ROM, N words, 32 bits/word); condition variables x0 = J0 and x1 = Z to PC]
PO structure

[Figure: data path for the execution of J + 1 → J, C + 1 → C, with the values of the control variables that select it]
PO structure

[Figure: data path for the execution of zero (A[J] – B) → Z, with the values of the control variables that select it]
Service time, and other performance parameters
• Computations on streams
• Stream: (possibly infinite) sequence of values of the same type, also called tasks
• Service time: average time interval between the beginnings of processing of two consecutive tasks, i.e. between two consecutive services
• Latency: average duration of the processing of a single task (from input acquisition to result generation): calculation time
  – Service time is equal to latency for sequential computations only (a single module, or unit)
• Processing bandwidth = 1/service_time: average number of processed tasks per second (e.g. images per second, operations per second, instructions per second, bits transmitted per second, …), also called throughput for generic applications, or performance for CPUs (instructions or floating-point operations per second)
• Completion time, for a finite-length (m) stream: time interval needed to fully process all the tasks. For large m and parallel computations, often the following approximate relation holds:

completion_time ≈ m × service_time

The equality relation is rigorous for sequential computations.
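The relations above are easy to capture as helper functions (a minimal sketch; names and the sample numbers are ours):

#include <stdio.h>

/* Performance parameters for stream computations (illustrative helpers). */

double bandwidth(double service_time) {            /* tasks per second */
    return 1.0 / service_time;
}

double completion_time(long m, double service_time) {
    /* approximate for parallel computations, exact for sequential ones */
    return (double)m * service_time;
}

int main(void) {
    double t = 8e-9;                               /* hypothetical service time: 8 ns */
    printf("bandwidth = %.2e tasks/s\n", bandwidth(t));
    printf("completion time (m = 1e6) = %.3f s\n", completion_time(1000000, t));
    return 0;
}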
Communications at the firmware level
• Dedicated links (one-to-one, unidirectional)
• Shared link: bus (many-to-many, bidirectional). It can support one-to-one, one-to-many, and many-to-one communications.
• In general, for computer structures: parallel links (e.g. one or more words per link)
• Valid for
  – uniprocessor computers,
  – multiprocessors with shared memory,
  – some multicomputers with distributed memory (MPP)
• Not valid for
  – distributed architectures with standard communication networks,
  – some kinds of simple I/O.
Basic firmware structure for dedicated links
[Figure: Usource with an output interface, connected by the LINK to the input interface of Udestination]

Usource::
  while (true) do
    < produce message MSG >;
    < send MSG to Udestination >

Udestination::
  while (true) do
    < receive MSG from Usource >;
    < process MSG >
Ready – acknowledgement protocol
[Figure: units U1 and U2 connected by the MSG line (n bits), the RDY line (1 bit), and the ACK line (1 bit); U1 holds the output register OUT, U2 the input register IN]

< send MSG > ::
  wait ACK;
  write MSG into OUT, signal RDY, …

< receive MSG > ::
  wait RDY;
  utilize IN, signal ACK, …

Asynchronous communication: asynchrony degree = 1
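As a behavioral sketch of the protocol (software threads and shared flags standing in for the two units and the 1-bit lines; this is not how the firmware level implements it):

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static atomic_int RDY = 0, ACK = 1;   /* ACK starts at 1: OUT is initially free */
static int MSG;                       /* models the n-bit link */

static void send_msg(int m) {
    while (!atomic_load(&ACK)) ;      /* wait ACK */
    atomic_store(&ACK, 0);
    MSG = m;                          /* write MSG into OUT */
    atomic_store(&RDY, 1);            /* signal RDY */
}

static int receive_msg(void) {
    while (!atomic_load(&RDY)) ;      /* wait RDY */
    atomic_store(&RDY, 0);
    int m = MSG;                      /* utilize IN */
    atomic_store(&ACK, 1);            /* signal ACK */
    return m;
}

static int producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) send_msg(i);
    return 0;
}

int main(void) {
    thrd_t t;
    thrd_create(&t, producer, NULL);
    for (int i = 0; i < 5; i++) printf("received %d\n", receive_msg());
    thrd_join(t, NULL);
    return 0;
}

Note the asynchrony degree of 1: the sender may produce the next message while the receiver processes the previous one, but it must wait for ACK before overwriting OUT.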
Implementation
[Figure: U1 with output register OUT and interface indicators RDYOUT, ACKOUT; U2 with input register IN and indicators RDYIN, ACKIN; MSG line (n bits), RDY line (1 bit), ACK line (1 bit); write enables βRDYOUT, βACKOUT, βRDYIN, βACKIN]

< send MSG > ::
  wait until ACKOUT;
  MSG → OUT, set RDYOUT, reset ACKOUT

< receive MSG > ::
  wait until RDYIN;
  f (IN, …) → …, set ACKIN, reset RDYIN

Level-transition interface: detects a level transition on the input line, provided that it occurs when S = Y.

[Figure: the RDY line feeds an EXOR gate together with the output Y of a modulo-2 counter; S is the sampled line level; the EXOR output is RDYIN; clock]

set X: complements the content of X (a level transition is produced onto the output link)
reset X: forces the X output to zero (by forcing element Y to become equal to S)
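A tiny behavioral model of the level-transition scheme (names are ours): an event is a toggle of the line level, the receiver's indicator is the EXOR of the line with its modulo-2 counter, and resetting the indicator copies the line level into the counter:

/* Behavioral model of the level-transition interface (illustrative). */
static int line;       /* the 1-bit RDY (or ACK) line, driven by X */
static int X;          /* sender-side flip-flop driving the line   */
static int Y;          /* receiver-side modulo-2 counter           */

void set_indicator(void)   { X ^= 1; line = X; }  /* produce a transition     */
int  indicator(void)       { return line ^ Y; }   /* RDYIN = S exor Y         */
void reset_indicator(void) { Y = line; }          /* force Y = S: RDYIN -> 0  */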
Example of a simple unit design with communication
Specification: ∀ IN: OUT = number_of_occurrences (IN, A)

Pseudo-code of interpreter:
while (true) do
  receive (IN, B); C = 0;
  for (J = 0; J < N; J++)
    if (A[J] == B) then C = C + 1;
  send (OUT, C)

[Figure: unit U with 32-bit input stream IN (lines RDYIN, ACKIN), 32-bit output stream OUT (lines RDYOUT, ACKOUT), and internal array A[N]]

Microprogram (= interpreter implementation)
0. if (RDYIN = 0) then nop, goto 0
   else reset RDYIN, set ACKIN, IN → B, 0 → C, 0 → J, goto 1
1. case (Jmost ACKOUT = 0 –) do zero (A[Jmodule] – B) → Z, goto 2
   case (Jmost ACKOUT = 1 0) do nop, goto 1
   case (Jmost ACKOUT = 1 1) do reset ACKOUT, set RDYOUT, C → OUT, goto 0
2. if (Z = 0) then J + 1 → J, goto 1
   else J + 1 → J, C + 1 → C, goto 1
Impact of communication protocol on performance parameters
[Figure: Control Part state graph (automaton): states 0, 1, 2; self-loop on 0 for RDYIN = 0; 0 → 1 for RDYIN = 1 (1 time); 1 → 2 for Jmost = 0 and 2 → 1 (N times each); self-loop on 1 for Jmost = 1 and ACKOUT = 0; 1 → 0 for Jmost = 1 and ACKOUT = 1 (1 time)]

Synchronization implies wait conditions (wait RDYIN, wait ACKOUT), corresponding to arcs with the same source and destination node (stable internal states) in the state graph.
• In the evaluation of the internal calculation time of U, these wait conditions are not to be taken into account.
• That is, we are interested in evaluating the processing capabilities of U independently of the current situation of the partner units: as if the source and destination units were always ready to communicate.
• On the other hand, the communication protocol has "no cost", i.e. no overhead.
• Thus, the calculation time of U is again Tcalc = 2Nτ.
• However, communication might have an impact on the service time.
Communication: performance parameters
• Because of the communication impact, we distinguish between:
  – Ideal service time = internal calculation time
  – Effective service time, or simply service time
• Overhead of the RDY-ACK protocol at the firmware level, for dedicated links: null for the calculation time (ideal service time)!
  – all the operations required by the RDY-ACK protocol are executed in a single clock cycle, in parallel with the operations of the true computation.
• Communication latency: time needed to complete a communication, i.e. for the ACK reception.
• Service time: considering the effect of communication, it depends on the calculation time and on the communication latency:
  – once a send operation has been executed, the sender can execute a new send operation only after the communication latency time interval (not before).
Communication latency and impact on service time
[Figure: timing diagram of sender and receiver: RDY, message transmission over the link (transmission latency Ttr, link only), ACK; a clock cycle for calculation only vs a clock cycle for calculation and communication]

Communication latency: Lcom = 2 (Ttr + τ)

Calculation time Tcalc; Tcom = communication time NOT overlapped with (i.e. not masked by) the internal calculation.

Service time: T = Tcalc + Tcom

Tcalc ≥ Lcom ⇒ Tcom = 0
“Coarse grain” and “fine grain” computations
• Two possible situations: Tcalc ≥ Lcom or Tcalc < Lcom
• Tcalc ≥ Lcom (Coarse Grain): Tcom = 0, service time T = Tcalc
• Tcalc < Lcom (Fine Grain): Tcom > 0, service time T = Lcom

T = Tcalc + Tcom = max (Tcalc, Lcom)
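In code form (a trivial helper; the function name and the normalization τ = 1 are ours):

#include <stdio.h>

/* Effective service time with the RDY-ACK protocol on dedicated links:
   T = Tcalc + Tcom = max(Tcalc, Lcom). All times in units of tau. */
static double service_time(double t_calc, double t_tr) {
    double l_com = 2.0 * (t_tr + 1.0);              /* Lcom = 2 (Ttr + tau) */
    return t_calc > l_com ? t_calc : l_com;
}

int main(void) {
    printf("%.0f tau\n", service_time(40.0, 9.0));  /* coarse grain: 40 tau   */
    printf("%.0f tau\n", service_time(10.0, 9.0));  /* too fine grain: 20 tau */
    return 0;
}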
In the example:

Tcalc = 2Nτ. Let Ttr = 9τ. Then Lcom = 2 (τ + Ttr) = 20τ.

For N ≥ 10: Tcalc ≥ Lcom. In this case Tcom = 0 and the service time is T = Tcalc = 2Nτ, i.e. a "coarse grain" calculation (with respect to communications). However, for relatively low N, provided that N ≥ 10, even a "fine grain" computation (e.g. a few tens of clock cycles) is able to fully mask communications.

For N < 10: Tcalc < Lcom, so Tcom > 0 and the service time is T = Lcom = 20τ, i.e. a "too fine" grain calculation.
Communication latency and impact on service time
Comm. latency Lcom = 2 (Ttr + τ); service time T = Tcalc + Tcom; Tcalc ≥ Lcom ⇒ Tcom = 0, efficiency ε ≈ 1, ideal processing bandwidth (throughput).

• The impact of Ttr on the service time T is significant
  – e.g. memory access time = response time of the "external" memory to processor requests ("external" = outside the CPU chip)
• Inside the same chip: Ttr ≈ 0
  – e.g. between processor – MMU – cache
  – except for on-chip interconnection structures with "long wires" (e.g. buses, rings, meshes, …)
• With Ttr ≈ 0, very "fine grain" computations are possible without communication penalty on the service time: Lcom = 2τ; if Tcalc ≥ 2τ then T = Tcalc
• Important applications in Instruction Level Parallelism processors and in Interconnection Network switch units.
Request-reply communication
Example: a simple cooperation between a processor P and a memory unit M (read or write request from P to M, reply from M to P).

[Figure: timing diagram of P and M: request transmission (Ttr), memory access (tM), reply transmission (Ttr)]

P-M request-reply latency: ta = 2 (τ + Ttr) + tM

ta is the memory access time as "seen" by the processor.
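Plugging in the numbers used earlier (Ttr = 9τ) together with a hypothetical memory clock cycle tM = 50τ:

#include <stdio.h>

int main(void) {
    double tau = 1.0;          /* CPU clock cycle, taken as the time unit */
    double t_tr = 9.0 * tau;   /* link transmission latency (as before)   */
    double t_M = 50.0 * tau;   /* hypothetical memory clock cycle         */

    /* P-M request-reply latency: ta = 2 (tau + Ttr) + tM */
    double t_a = 2.0 * (tau + t_tr) + t_M;
    printf("ta = %.0f tau\n", t_a);   /* 70 tau: much greater than tau */
    return 0;
}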
Integrated hardware technologies
• Integration on silicon semiconductor CHIPS
• Chip area vs clock cycle:
  – Each logical HW component takes up a given area on silicon (currently: on the order of fractions of nanometers).
  – Increasing the number of components per chip, i.e. increasing the chip area, is positive provided that components are intensively "packed".
  – Relatively long links on chip are the main source of inefficiency.
  – Some macro-components are more suitable for intensive packing (notably: memories), others are very critical (e.g. large combinatorial networks realized by AND-OR gates).
  – Pin count problem: increasing the number of interfaces increases the chip perimeter linearly, thus the chip area increases quadratically. Many interfaces (e.g. over 10² – 10³ pins) are motivated only if the area increase is properly exploited by a quadratic increase in intensively packed HW components.
• Technology of CPUs vs technology of memories and links:
  – CPU clock cycle: τ ≈ 10⁰ nsec (frequency: a few GHz)
  – Static Main Memory clock cycle: tM ≈ 10¹ – 10² τ
  – Dynamic Main Memory clock cycle: tM ≈ 10² – 10⁴ τ
  – Inter-chip link transmission latency: Ttr ≈ 10⁰ – 10² τ
  – Thus, ta is (one or) more orders of magnitude greater than the CPU clock cycle.
The processor-memory gap (Von Neumann bottleneck)
• For every (known) technology, the ratio ta/τ increases at every technological improvement
  – the frequency increase of the CPU clock is greater than the frequency increase of the memory clock and of the links (inverse of the unit latency of links).
• This is the central problem of computer architecture (yesterday, today, tomorrow)
  – SOLUTIONS ARE BASED ON MEMORY HIERARCHIES AND ON PARALLELISM
• "Moore's law" (the CPU frequency doubles every 1.5 years) has been valid until 2004 (we'll see why). Memory and link technologies have been "stopped" too.
  – Multi-core evolution / revolution.
  – Moore's law is now applied to the number of cores (CPUs) in multi-core and many-core chips.
  – What about programming? One main motivation of the SPA and SPM courses.
Interconnection structures based on dedicated links: first examples
Trees: interconnection of a single input stream and of a single output stream to a collection of n independent units.

[Figure: a tree of E (emitter) nodes distributing the input stream to units U1 … U4, and a tree of C (collector) nodes merging their results into the output stream]

Communication latency = O(lg n)

Rings / Tori: communication latency = O(n)

[Figure: ring interconnection of units U1 – U2 – U3 – U4]
Interconnection structures based on dedicated links: first examples

Meshes (possibly toroidal):

[Figure: 3 × 3 mesh of units U]

Communication latency = O(n^(1/2))
Interconnection structures based on dedicated links

• In general, all the most interesting interconnection structures for
  – shared memory multiprocessors
  – distributed memory multicomputers, clusters, MPP
  are "limited degree" networks
  – based on dedicated links
  – with a limited number of links per unit: PIN COUNT problem
• Examples (Course Part 2):
  – n-dimension cubes
  – n-dimension butterflies
  – n-dimension Trees / Fat Trees
  – rings and multi-rings
• Currently the "best" commercial networks belong to these classes:
  – Infiniband
  – Myrinet
  – …
• Trend: on-chip optical interconnection networks
(Traditional, old-style) Bus
[Figure: sender behaviour and receiver behaviour on a shared bus]

Shared, bidirectional link. At most one message at a time can be transmitted. Arbitration mechanisms must be provided. Serious problems of bandwidth (contention) arise for bus-based parallel structures.
Arbiter mechanisms (for bus writing)
• Independent Request arbiters
• Daisy Chaining
• Polling (similar)
Decentralized arbiters
• Token ring
• Non-deterministic with collision detection
Firmware: summary
[Figure: levels stack (Applications / Processes / Assembler / Firmware / Hardware); architectures at the firmware level: Uniprocessor (Instruction Level Parallelism), Shared Memory Multiprocessor (SMP, NUMA, …), Distributed Memory (Cluster, MPP, …)]

• The true architectural level.
• Microcoded interpretation of assembler instructions.
• This level is fundamental (it cannot be eliminated).
• Physical resources: combinatorial and sequential circuit components, physical links.
• Decentralized control of the firmware interpreter
• Synchronous computation model
• Clock cycle
• Parallelism inside microinstructions
• Clear cost model for calculation and for communication
• Efficient communications can be implemented on dedicated links
• Overlapping of communication and internal calculation