Multi-Processor Systems
Roberto Airoldi Department of Computer Systems Tampere University of Technology
TG 307
Lecture outline
Why are multi-processor systems needed?
Basics of MP design strategies
– memory hierarchy
– Homogeneous or Heterogeneous
– Communication infrastructure
– …
A few words on the research work of our research group
Multi-Processor Systems 2
Uniprocessor Performance (SPECint)
[Figure: uniprocessor performance (vs. VAX-11/780), log scale from 1 to 10000, for the years 1978-2006. Growth was about 25%/year until 1986, 52%/year from 1986 onwards, and ??%/year in recent years.]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
Other Factors ⇒ Multiprocessors
Growth in data-intensive applications
– Data bases, file servers, …
Growing interest in servers and server performance
Increasing desktop performance less important
– Outside of graphics
Improved understanding of how to use multiprocessors effectively
– Especially servers, where there is significant natural TLP
Advantage of leveraging design investment by replication
– Rather than unique design
Flynn’s Taxonomy
Flynn classified by data and control streams in 1966
SIMD ⇒ Data Level Parallelism
MIMD ⇒ Thread Level Parallelism
MIMD popular because
– Flexible: N programs or 1 multithreaded program
– Cost-effective: same MPU in desktop & MIMD
Single Instruction Single Data (SISD) (Uniprocessor)
Single Instruction Multiple Data SIMD (single PC: Vector, CM-2)
Multiple Instruction Single Data (MISD) (????)
Multiple Instruction Multiple Data MIMD (Clusters, SMP servers)
M.J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.
Back to Basics
“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Parallel Architecture = Computer Architecture + Communication Architecture
2 classes of multiprocessors WRT memory:
– Centralized Memory Multiprocessor
  • < few dozen processor chips (and < 100 cores) in 2006
  • Small enough to share a single, centralized memory
– Physically Distributed-Memory Multiprocessor
  • Larger number of chips and cores
  • BW demands ⇒ memory distributed among processors
Centralized vs. Distributed Memory
[Figure: two block diagrams. Left, centralized memory: processors P1…Pn, each with a cache ($), reach shared memory modules through a single interconnection network. Right, distributed memory: each processor has a cache and a local memory module, and the processor-memory nodes communicate through the interconnection network. Scale grows from the centralized to the distributed organization.]
Centralized Memory Multiprocessor
Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors
Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors
Can scale to a few dozen processors by using a switch and many memory banks
Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases
Distributed Memory Multiprocessor
Pro: Cost-effective way to scale memory bandwidth
– If most accesses are to local memory
Pro: Reduces latency of local memory accesses
Con: Communicating data between processors is more complex
Con: Must change software to take advantage of increased memory BW
2 Models for Communication and Memory Architecture
Communication occurs by explicitly passing messages among the processors:
– message-passing multiprocessors
Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either
– UMA (Uniform Memory Access time) for shared-address, centralized-memory MP
– NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MP
In the past, confusion over whether “sharing” means sharing physical memory (Symmetric MP) or sharing the address space
Challenges of Parallel Processing
The first challenge is the fraction of a program that is inherently sequential
Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential?
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

80 = 1 / ((1 - Fraction_parallel) + Fraction_parallel / 100)
80 × (1 - Fraction_parallel) + 0.8 × Fraction_parallel = 1
79 = 79.2 × Fraction_parallel
Fraction_parallel = 79 / 79.2 = 99.75%

⇒ only 0.25% of the original program can be sequential
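The Amdahl's-law calculation above can be checked numerically. A minimal sketch (the function names are ours, not from the lecture):

```python
def overall_speedup(parallel_fraction, enhanced_speedup):
    # Amdahl's law: the sequential part runs at speed 1,
    # the parallel part is sped up by enhanced_speedup.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / enhanced_speedup)

def required_parallel_fraction(target_speedup, n_processors):
    # Solve target = 1 / ((1 - F) + F/n) for F.
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_processors)

f = required_parallel_fraction(80, 100)
print(f"parallel fraction needed: {f:.4%}")  # 99.7475%, i.e. only ~0.25% sequential
```

The closed form follows directly from solving Amdahl's law for the parallel fraction; plugging f back into `overall_speedup(f, 100)` reproduces the target speedup of 80.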
Challenges of Parallel Processing
The second challenge is the long latency to remote memory
Suppose a 32-CPU MP, 2 GHz clock, 200 ns remote memory access, all local accesses hit in the memory hierarchy, and a base CPI of 0.5
(Remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles)
What is the performance impact if 0.2% of instructions involve a remote access?
CPI = Base CPI + Remote request rate × Remote request cost
CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
The MP with no communication is 1.3 / 0.5 = 2.6X faster than the one where 0.2% of instructions involve a remote access
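The same arithmetic as a small script (a sketch; the variable names are ours):

```python
clock_hz = 2e9                 # 2 GHz
remote_latency_s = 200e-9      # 200 ns remote memory access

# 200 ns at 0.5 ns per cycle = 400 clock cycles
remote_cycles = round(remote_latency_s * clock_hz)

base_cpi = 0.5
remote_rate = 0.002            # 0.2% of instructions go remote

# Effective CPI = base CPI + remote request rate x remote request cost
cpi = base_cpi + remote_rate * remote_cycles   # 0.5 + 0.8 = 1.3
slowdown = cpi / base_cpi                      # 2.6x vs. no communication
print(remote_cycles, cpi, slowdown)
```

Even a 0.2% remote-access rate dominates the base CPI here, which is why the slide stresses reducing the frequency of remote accesses.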
Challenges of Parallel Processing
Application parallelism ⇒ primarily via new algorithms that have better parallel performance
Long remote latency impact ⇒ addressed both by the architect and by the programmer
For example, reduce the frequency of remote accesses either by
– Caching shared data (HW)
– Restructuring the data layout to make more accesses local (SW)
Fundamental MP design decisions
We have already discussed:
– Shared memory versus Message passing?
Other decisions:
– Homogeneous versus Heterogeneous?
– Bus versus Network?
– Generic versus Application specific?
– What types of parallelism to support?
– Focus on Performance, Power or Cost?
– Memory organization?
Homogeneous or Heterogeneous
Homogeneous:
– replication effect
– memory dominated anyway
– solve realization issues once and for all
– less flexible
Heterogeneous:
– better fit to application domain
– smaller increments
Hybrid approaches
Middle-of-the-road approach
– Flexible tiles
– Fixed tile structure at the top level
Shared vs. Switched Media
Shared media: nodes share a single interconnection medium (e.g., Ethernet, Bus)
– Only one message sent at a time
– Inexpensive: one medium used by all processors
– Limited bandwidth: the medium becomes the bottleneck
– Needs arbitration
Switched media: allow direct communication between source and destination nodes (e.g., ATM)
– Multiple users at a time
– More expensive: need to replicate the medium
– Higher total bandwidth
– No arbitration
– Added latency to go through the switch
– Claimed to be more scalable
  • but router overhead
Which types of parallelism to support?
– Kernel level: ILP, DLP
– Module level: Program/Thread level parallelism (TLP)
– Application level
Heterogeneity
[Diagram: heterogeneity versus granularity level, from kernel level through module level to application level, ranging from homogeneous to very heterogeneous.]
Example MP supporting DLP and ILP: IMAGINE Stream Processor
Focus
Low cost?
Low power?
High performance?
Depends on platform class
Intrinsic computational efficiency of silicon
[Figure: computational efficiency [MOPS/W] versus feature size [µm] (2 µm down to 0.07 µm), log scale from 10^0 to 10^6. The intrinsic computational efficiency of silicon lies far above measured processors: i386SX, i486DX, P5, 68040, microSPARC, SuperSPARC, 601, 604, 604e, UltraSPARC, P6, 21164a, 21364, TurboSPARC, 7400, C6411, Itanium2, XScale, C5510, Xtensa. Ref: Roza]
Computational efficiency of microprocessors
Problem: Energy & Performance of Memories
– Bandwidth
  • Slow (memory wall)
– Energy expensive
  • RISC: 27% of power in the instruction cache
    – StrongARM SA110: a 160 MHz 32b 0.5 W CMOS ARM processor
  • VLIW: 30% of power in the instruction cache
    – C6x, 0.15 µm, 250 MHz, Texas Instruments Inc.
    – Lx, HP & STMicroelectronics, L. Benini et al. (ISLPED)
Wiring hierarchy
How far can signal reach in one local clock cycle (local frequency) ?
[Figure: line length (mm) reachable in one local clock cycle versus technology node (100 nm down to 40 nm), from 0.0 to 3.0 mm, with curves for local, intermediate, and global line lengths.]
INTERCONNECTION NETWORKS
Network design parameters
Large network design space:
– topology, degree
– routing algorithm
  • path, path control, collision resolution, network support, deadlock handling, livelock handling
– virtual layer support
– flow control
– buffering
– QoS guarantees
– error handling
– etc., etc.
Network-on-Chip
Some common characteristics of NoC
– more than a single shared medium (like a bus)
  • a true network
– provides a point-to-point connection between any two hosts connected to the network
  • real point-to-point by a crossbar switch (or equivalent)
  • virtual point-to-point connection
– higher aggregate bandwidth due to parallelism
  • many concurrent packet flows or parallel circuit paths
– clearly separates communication from computation
– a layered approach is used for the network implementation
– a practical implication of the above points is that the network contains storage elements for the data
  • implicit pipelining
Additional reading: J. Nurmi, "Network-on-Chip – A New Paradigm for System-on-Chip Design," in Proc. International Symposium on System-on-Chip, Nov. 2005, pp. 2-6.
Protocol layers
Circuit-switching and packet-switching
Circuit-switching
– A connection has to be established between the communicating parties before the transfer can start
– The line is reserved for the whole time independently of the amount of traffic (like a phone line is reserved whether you are silent or talk at your maximum rate)
– The connection is explicitly closed when there is no more need for the communication
Packet-switching
– The data is packed to carry also information about the destination address (like a letter or packet in mail)
– Separate packets or sub-packets (flits or phits) can be sent along different routes (if the network allows that)
– Also other senders can inject packets into the network to travel along (partially) the same route
Nodes, switches, resources
A node or a switch
– Routes the packet along the network
– Often includes some buffering (FIFOs)
– The name ’switch’ refers to a crossbar switch connecting any of the inputs to any of the outputs
– Different kinds of switch architectures can be used in a NoC
– Often also connects the RESOURCES to the network
A resource
– processor
– memory
– fixed-function or reconfigurable block
Node characteristics – examples
3 x 3 switch
– This size of crossbar is suitable for connecting a resource to a bidirectional ring
5 x 5 switch
– This is the typical size for mesh networks: four nearest neighbors + the local resource
16 x 16 switch
– From this size upwards, the switches start to absorb the whole network operation into a single centralized switch
– The switch is the network!
Network Topology
Switched media have a topology that indicates how nodes are connected
Topology determines
– Degree: number of links from a node
– Diameter: max number of links crossed between nodes
– Average distance: number of links to a random destination
– Bisection: minimum number of links that separate the network into two halves
– Bisection bandwidth: link bandwidth × bisection
Common Topologies
Type       Degree   Diameter        Ave Dist       Bisection
1D mesh    2        N-1             N/3            1
2D mesh    4        2(N^1/2 - 1)    2N^1/2 / 3     N^1/2
3D mesh    6        3(N^1/3 - 1)    3N^1/3 / 3     N^2/3
nD mesh    2n       n(N^1/n - 1)    nN^1/n / 3     N^(n-1)/n
Ring       2        N/2             N/4            2
2D torus   4        N^1/2           N^1/2 / 2      2N^1/2
Hypercube  Log2 N   n = Log2 N      n/2            N/2
2D tree    3        2 Log2 N        ~2 Log2 N      1
Crossbar   N-1      1               1              N^2/2
N = number of nodes, n = dimension
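Two rows of the table can be reproduced in a few lines. A sketch with our own helper names (square 2D mesh and hypercube; the other rows follow the same pattern):

```python
import math

def mesh_2d(N):
    # Square 2D mesh with N = k*k nodes: interior degree 4,
    # diameter 2(k-1) hops corner to corner, bisection k links.
    k = math.isqrt(N)
    assert k * k == N, "N must be a perfect square"
    return {"degree": 4, "diameter": 2 * (k - 1), "bisection": k}

def hypercube(N):
    # n-dimensional hypercube, N = 2**n nodes: degree and diameter n,
    # bisection N/2 links.
    n = N.bit_length() - 1
    assert 1 << n == N, "N must be a power of two"
    return {"degree": n, "diameter": n, "bisection": N // 2}

print(mesh_2d(64))    # {'degree': 4, 'diameter': 14, 'bisection': 8}
print(hypercube(64))  # {'degree': 6, 'diameter': 6, 'bisection': 32}
```

For 64 nodes the hypercube trades a higher degree (6 vs. 4) for a much smaller diameter and a far larger bisection, which is exactly the trade-off the table summarizes.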
Network topologies - Mesh
Two-dimensional mesh
– very intuitive, indeed
[Figure: a 2D mesh, with resources attached to nodes and nodes connected by links.]
Network topologies - Torus
Torus and folded torus extend the mesh
– folded torus has 2x the link length compared to mesh!
Network topologies - Ring
Ring is the simplest of the networks
– Unidirectional or bidirectional
– Does not correspond to the physical placement of resources!
Network topologies – Octagon and Spidergon
Extend the ring by introducing shortcuts
– Octagon: 4 diagonals
– Spidergon: N diagonals
Routing schemes
Deterministic vs. adaptive
– deterministic routing is fixed (by a routing table or information in the packet)
– adaptive routing can adapt to e.g. network congestion
Packet handling property
– store-and-forward
  • the whole packet is always stored in a node
– virtual cut-through
  • the routing onward may start if there is enough room in the next node
– wormhole routing
  • the flit will be routed onward immediately
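The packet-handling choice shows up directly in zero-load latency. A sketch using the standard textbook cycle counts (idealized: one flit per link per cycle, no contention; these formulas are ours, not from the slides):

```python
def store_and_forward_cycles(hops, packet_flits):
    # Each node receives and buffers the whole packet before forwarding it,
    # so the full packet transfer time is paid on every hop.
    return hops * packet_flits

def wormhole_cycles(hops, packet_flits):
    # The header flit pipelines through the hops while the
    # remaining flits stream behind it.
    return hops + (packet_flits - 1)

print(store_and_forward_cycles(4, 16))  # 64
print(wormhole_cycles(4, 16))           # 19
```

For a 16-flit packet over 4 hops, wormhole routing cuts the zero-load latency from 64 to 19 cycles; the price is that a blocked worm occupies buffers along its whole path, which is why deadlock handling matters.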
Network performance
[Figure: throughput and latency as functions of network load (10-90%).]
Throughput
– throughput/input – aggregate throughput – bisection throughput
Latency
– time from data injection to readout completion
Inter-related, and a function of network loading!
Quality-of-Service (QoS)
Normally packet-switched networks provide best-effort throughput to all traffic
– ”the mileage may vary...”
For some services, guaranteed service is needed
– Predictable delays
– Guaranteed throughput
This can be handled by
– a circuit-switched channel
– time-multiplexing (and giving a certain number of time slots to the priority service)
– can be approximated by assigning packet priority (may lead to starvation of other services)
Fault and error tolerance
More permanent manufacturing defects in large complex circuits
More soft errors due to lower error margins
More dynamic transient faults due to crosstalk and other noise on-chip
For error tolerance, error detection and correction methods from telecommunications can be used to some extent
– must not cause too much overhead in area and power
– resending or error correction codes???
Fault tolerance, too, may cause too much overhead
– the network provides natural redundancy for fixing faults
Synchronous, asynchronous, GALS
A single-clock SoC is not practical
A single-clock NoC (communication part) may be...
However, synchronization between clock domains is needed anyway
GALS (Globally Asynchronous Locally Synchronous) is a natural mode of operation for large NoCs with many clock domains connected
– synchronization carried out between the network and the synchronous resource
– a place for trade-offs: where to place the A/S border???
Asynchronous functional blocks are rare, but the network can be fully asynchronous (with synchronizing wrappers)
Our Research group
Prof. Jari Nurmi leads the group
Research areas:
– Computer Architecture: Networks-on-Chip, Multi-Processor architecture design, Reconfigurable hardware, Implementation of signal processing for communications
– Navigation and Positioning: GNSS receivers, Indoor positioning