High Performance Computing: Concepts, Methods, & Means
Parallel Computer Architecture
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 30, 2007
NEWS Alert!!
Intel and IBM announce development of high-k metal gate materials for dramatic reduction in leakage current, enabling continuation of Moore's law down to at least 22 nm feature size; considered the biggest breakthrough in 40 years in the semiconductor industry.
2
Topics
• Introduction
• Review Performance Factors (good & bad)
• What is Computer Architecture
• Parallel Structures & Performance Issues
• Performance Metrics
• Coarse-grained MIMD Processing – MPPs
• Very Fine-grained Vector Processing and PVPs
• SIMD array and SPMD
• Special Purpose Devices and Systolic Structures
• An Introduction to Shared Memory Multiprocessors
• Summary – Material for the Test
3
Opening Remarks
• This week is about supercomputer architecture
  – Today: major factors, classes, and system level
  – Next time: modern microprocessor and multicore node
• Architecture exploits device technology to deliver its innate computational performance potential
• Between device technology and architecture is circuit design
  – Circuit design converts devices to logic gates and higher-level logical structures (e.g., multiplexers, adders)
  – but this is outside the scope of this course
• We will assume a basic logic abstraction with characterizing properties:
  – Functional behavior (the logical operation it performs)
  – Switching speed
  – Propagation delay or latency
  – Size and power
5
Topics
• IntroductionReview Performance Factors (good & bad)• Review Performance Factors (good & bad)
• What is Computer Architecture• Parallel Structures & Performance Issues• Parallel Structures & Performance Issues• Performance Metrics• Course-grained MIMD Processing – MPPs g g• Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors
6
• Summary – Material for the Test6
Performance Factors: Technology
• Logic switching speed
• Logic latency time
• On-chip data transfer rate
• On-chip clock speed (clock cycle time)
• Logic density
• Memory cycle time
• Memory access time
• Memory density
• Processor-to-memory access latency
• Network data rate
• Network latency
7
Performance Factors: Parallelism
• Fully independent processing elements operating concurrently on separate tasks
• Instruction-level parallelism
• Execution pipeline
  – Overlapping sequential operations in execution pipeline
• SIMD operations
  – ALU arrays
  – Vector pipelines
• Overlapping of computation and communication
  – Asynchronous
  – Prefetching
• Multithreading
8
Performance Degradation
• Overhead
  – Critical work for management of concurrent tasks and parallel resources, not required for sequential execution
• Latency
  – Time for propagation of an action through a channel or sequence of stages
• Contention
  – Idle cycles waiting for access to shared resources
• Starvation
  – Insufficient parallelism
  – Inadequate load balancing
9
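The degradation factors above can be made concrete with a small analytic sketch. This is a hypothetical model (not from the slides): it charges a fixed per-processor overhead and a load-imbalance factor against ideal speedup; all parameter names and values are illustrative.

```python
# Hypothetical model of parallel execution time: ideal work division,
# degraded by per-processor overhead and by load imbalance.
# All names and numbers here are illustrative, not from the slides.

def parallel_time(work, p, overhead=0.0, imbalance=1.0):
    """Time to finish `work` units on p processors.

    overhead  -- extra management work per processor (the slides' 'overhead')
    imbalance -- >= 1.0; inadequate load balancing shows up as imbalance > 1
    """
    return (work / p) * imbalance + overhead

def speedup(work, p, overhead=0.0, imbalance=1.0):
    # ratio of sequential time (no overhead, no imbalance) to parallel time
    return parallel_time(work, 1) / parallel_time(work, p, overhead, imbalance)

if __name__ == "__main__":
    # With no overhead and perfect balance, speedup is ideal (= p).
    print(speedup(1000.0, 10))                                    # 10.0
    # Overhead and imbalance both degrade the achieved speedup.
    print(round(speedup(1000.0, 10, overhead=25.0, imbalance=1.2), 3))  # 6.897
```

Contention and latency could be folded into the same model as additional additive terms; the point is only that each degradation source subtracts from the ideal speedup of p.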
Computer Architecture
• Structure
  – Functional elements
  – Organization and balance
  – Interconnect and data flow paths
• Semantics
  – Meaning of the logical constructs
  – Primitive data types
  – Manifest as the Instruction Set Architecture abstraction layer
• Mechanisms
  – Primitive functions that are usually implemented in hardware or sometimes firmware
  – Determines preferred actions and sequences
  – Enables efficiency and scalability
• Policy
  – Actions and decisions outside the ISA or name space
11
Structure
• Functional elements
  – The form of functional elements made up of more primitive logical modules
  – e.g., a vector arithmetic unit comprising a pipeline of simple stages
• Organization and balance
  – Number of major elements of different types
  – Hierarchy of collections of elements
• Data flow
  – Interconnection of functional, state, and communication elements
  – Control of dataflow paths determines actions of processor and system
12
Semantics
• Meaning of the logical constructs
  – Basic operations that can be performed on data
• Primitive data types
  – Defines actions that can be performed on binary strings
• Instruction Set Architecture
  – Defined set of actions that can be performed and the data objects on which they can be applied
13
Mechanisms
• Primitive functions that are usually implemented in hardware or sometimes firmware
  – Lower level than instruction set operations
  – Multiple such mechanisms contribute to execution of an operation
• Determines preferred actions and sequences
  – Usually time-effective primitives
  – Usually widely used by many instructions
• Enables efficiency and scalability
  – Establishes basic performance properties of the machine
• Examples
  – Basic arithmetic and logic unit functions
  – Thread context switching
  – TLB (Translation Lookaside Buffer) address translation
  – Cache line replacement
14
Policy
• Not all machine decisions are visible to the instructions of the system
• Not all machine choices are available to the name space of the operands
• Examples
  – Cache structure, size, and speed
  – Cache replacement policies
  – Order of operation execution
  – Branch prediction
  – Allocation of shared resources
  – Network routers
15
Parallel Structures
• Pipelining
  – Vector processing
  – Execution pipeline
• Multiple arithmetic units
  – Instruction-level parallelism
  – Systolic arrays
• Multiple processors
  – MIMD: separate control
  – SIMD: single controller
17
Performance Issues
• Pipelining increases throughput
  – More operations per unit time
• Pipelining increases latency time
  – Operation on a single operand pair can take longer than in a non-pipelined functional unit
• Multiple ALUs
  – Increases peak performance
  – Requires application instruction-level parallelism
  – Average usually significantly lower than peak
• Multiple processors require overhead operations
  – Synchronization
  – Communications
18
Topics
• Introduction• Review Performance Factors (good & bad)• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance IssuesParallel Structures & Performance Issues• Performance Metrics• Course-grained MIMD Processing – MPPs g g• Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors
19
• Summary – Material for the Test19
Performance Metrics
• Peak floating point operations per second (flops)
• Peak instructions per second (ips)
• Sustained throughput
  – flops, Mflops, Gflops, Tflops, Pflops
  – flops, Megaflops, Gigaflops, Teraflops, Petaflops
  – ips, Mips, …
• Cycles per instruction
  – cpi
  – Alternatively: instructions per cycle, ipc
• Memory access latency
  – cycles (or seconds)
• Memory access bandwidth
  – bytes per second
  – or Gigabytes per second, GBps, GB/s
• Bisection bandwidth
  – bytes per second
20
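The metrics above are related by simple conversions; a minimal sketch, with made-up machine parameters (the 2 GHz clock and cpi value are invented examples, not real hardware figures):

```python
# Illustrative conversions between the performance metrics listed above.
# All machine parameters here are invented examples.

def ipc_from_cpi(cpi):
    # cycles-per-instruction and instructions-per-cycle are reciprocals
    return 1.0 / cpi

def sustained_ips(clock_hz, cpi):
    # instructions per second = clock rate / cycles per instruction
    return clock_hz / cpi

def sustained_flops(clock_hz, cpi, flops_per_instruction):
    # flops follows from the instruction rate and the floating-point
    # work carried by each instruction
    return sustained_ips(clock_hz, cpi) * flops_per_instruction

if __name__ == "__main__":
    clock = 2.0e9    # hypothetical 2 GHz clock
    cpi = 0.5        # i.e., ipc = 2
    print(ipc_from_cpi(cpi))          # 2.0
    print(sustained_ips(clock, cpi))  # 4000000000.0 (4e9 instructions/sec)
```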
Topics
• Introduction• Review Performance Factors (good & bad)• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance IssuesParallel Structures & Performance Issues• Performance Metrics• Course-grained MIMD Processing – MPPsg g• Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors
21
• Summary – Material for the Test21
Multiprocessor
• A general class of system
• Integrates multiple processors into an interconnected ensemble
• MIMD: Multiple Instruction stream, Multiple Data stream
• Different memory models
  – Distributed memory
    • Nodes support separate address spaces
  – Shared memory
    • Symmetric multiprocessor
    • UMA – uniform memory access
    • Cache coherent
  – Distributed shared memory
    • NUMA – non-uniform memory access
    • Cache coherent
  – PGAS
    • Partitioned global address space
    • NUMA
    • Not cache coherent
    • Ensemble of distributed shared memory nodes
• Massively Parallel Processor, MPP
22
Massively Parallel Processor
• MPP
• General class of large-scale multiprocessor
• Represents the largest systems
  – IBM BG/L
  – Cray XT3
• Distinguished by memory strategy
  – Distributed memory
  – Distributed shared memory
    • Cache coherent
    • Partitioned global address space
• Potentially heterogeneous
  – May incorporate accelerators to boost peak performance
23
IBM Blue Gene/L
24
BG/L packaging hierarchy
25
Pipeline Structures
• Partitioning of a functional unit into a sequence of stages
  – Execution time of each stage is < that of the original unit
  – Total time through the sequence of stages is usually > that of the original unit
• Pipeline permits overlapping of multiple operations
  – At any one time, each stage is performing a different operation
  – # of operations being performed in parallel = # stages
• Performance
  – Pipeline increments at the clock rate of the slowest pipeline stage
  – Response time for an operation is the product of # stages and clock cycle time
  – Throughput = clock rate
    • i.e., one operation result per clock cycle of the pipeline
• Pipeline structures are employed in many parts of a computer architecture
  – to enable high throughput in the presence of high latency
  – to enable faster clock rates
27
Pipeline: Concepts

t_p << T_c
T_p = N_p × t_p
T_c < T_p
Perf_c = 1 / T_c
Perf_p = 1 / t_p

Where:
• T_c is the logic latency
• T_p is the aggregated pipeline latency
• t_p is the latency for each pipelined step
28
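The pipeline relations on this slide can be checked numerically; a small sketch with invented latencies (T_c, N_p, and t_p are illustrative values, not measurements):

```python
# Numeric check of the pipeline model, with invented values:
#   T_p = N_p * t_p,  Perf_c = 1/T_c,  Perf_p = 1/t_p

T_c = 10.0          # latency of the unpipelined logic (ns)
N_p = 5             # number of pipeline stages
t_p = 2.5           # latency of one pipelined step (ns); stage overhead
                    # is why t_p > T_c / N_p in practice

T_p = N_p * t_p     # aggregated pipeline latency: 12.5 ns > T_c
Perf_c = 1.0 / T_c  # unpipelined throughput: one result per 10 ns
Perf_p = 1.0 / t_p  # pipelined throughput: one result per 2.5 ns

# The pipeline raises latency for any single operation (T_p > T_c)
# but multiplies throughput (Perf_p / Perf_c = 4x here).
print(T_p, Perf_p / Perf_c)   # 12.5 4.0
```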
Vector Processors
• Supports SIMD semantics
  – Many instances of the same operation performed concurrently under the same control element
  – Operates on vector data structures rather than single scalar values
  – Vector-scalar operations
    • Scale a vector by a scalar factor (multiply each vector element by the scalar)
  – Inter-vector operations
    • e.g., pairwise multiplies
  – Intra-vector operations
    • Reduction operators
    • e.g., sum all elements of a vector
• Exploits pipeline structure
  – Arithmetic units
  – Vector registers
  – Overlap of memory bank access cycles
  – Overlap of communication with computation
• Limited scaling – upper bound on number of pipeline stages
29
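The three operation classes above can be sketched in plain Python, with lists standing in for vector registers (illustrative only: a real vector unit performs these in hardware pipelines, not element-by-element in software):

```python
# Sketch of the three vector operation classes, using plain Python lists
# as stand-ins for vector registers (illustration, not a hardware model).

def vector_scalar(v, s):
    # vector-scalar: scale every element of v by the scalar s
    return [x * s for x in v]

def inter_vector(a, b):
    # inter-vector: pairwise multiply of two equal-length vectors
    return [x * y for x, y in zip(a, b)]

def intra_vector(v):
    # intra-vector: reduction -- here, sum all elements to one scalar
    total = 0
    for x in v:
        total += x
    return total

if __name__ == "__main__":
    v = [1, 2, 3, 4]
    print(vector_scalar(v, 10))   # [10, 20, 30, 40]
    print(inter_vector(v, v))     # [1, 4, 9, 16]
    print(intra_vector(v))        # 10
```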
[Figure: a vector register of length N_R feeds a vector ALU of N_S stages; a high-speed memory bus connects memory banks M_1, M_2, …, M_N]

T_MV = t_s + Σ t_a
Ideal vector performance: P_M = N / T_MV
Achieved vector performance: Perf(N_R) = N_R / ((N_S + N_R) × t_c)
Half of peak performance is reached at the vector length N_1/2 = N_S:
Perf(N_1/2) = N_S / ((2 × N_S) × t_c) = 1 / (2 × t_c)

Where:
• t_a is the time for memory access
• t_s is the startup time
• T_MV is the combined time for memory vector access
• P_M is the memory performance
• t_c is the ALU clock time of each step
30
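The achieved-vector-performance model above is easy to evaluate numerically; a sketch with invented stage count and clock (N_S and t_c are illustrative values):

```python
# Numeric sketch of the achieved-vector-performance model,
#   Perf(N_R) = N_R / ((N_S + N_R) * t_c),
# with invented values for the stage count and clock.

N_S = 8        # pipeline stages (illustrative)
t_c = 1.0      # ALU clock time per step (illustrative units)

def perf(n_r):
    # performance achieved for a vector of length n_r
    return n_r / ((N_S + n_r) * t_c)

peak = 1.0 / t_c    # asymptotic rate as the vector length grows
half = perf(N_S)    # at N_R = N_S the model gives exactly half of peak

print(half, peak / 2)                # 0.5 0.5
print(round(perf(1024) / peak, 3))   # 0.992 -- long vectors approach peak
```

This is the familiar pipeline fill cost: short vectors pay the N_S-stage startup proportionally more, which is why vector machines reward long vectors.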
Cray 1
• First announced in 1975-76
• 80 MHz clock rate
• Theoretical peak performance 160 MIPS; average performance 136 megaflops; vector-optimized peak performance 150 megaflops
• 1 million 64-bit words of high-speed memory
• Manufactured by Cray Research Inc.
• First customer was the National Center for Atmospheric Research (NCAR), for 8.86 million dollars
[Figure: the Cray 1 system and Cray 1 logic boards]
31
src : http://en.wikipedia.org/wiki/Cray-1
32
Parallel-Vector-Processors: PVP
• Combines strengths of vector and MPP
  – Efficiency of vector processing
    • Capability computing
  – Scalability of massively parallel processing
    • Capacity and cooperative computing
• Two levels of parallelism
  – Ultra fine-grain vector parallelism with vector pipelining
  – Medium- to coarse-grain processor parallelism
• Memory model
  – Alternative ways of organizing memory & address space
  – Distributed memory
    • Shared memory within a node of multiple vector processors
    • Fragmented or decoupled address space between nodes
  – Partitioned global address space
    • Globally accessible address space
    • No cache coherence between nodes
33
Earth Simulator (Images)
34
src : http://www.es.jamstec.go.jp/esc/eng/
Earth Simulator (Facts)
• Located in Yokohama, Japan
• Size of the entire center is about 4 tennis courts
• Can execute 35.86 trillion (35,860,000,000,000) FLOPS, or 35.86 TFLOPS
• Consists of 640 nodes, each node consisting of 8 vector processors and 16 GB of memory
• Totaling 5120 processors and 10 Terabytes of memory
• Aggregated disk storage of 700 Terabytes and around 1.6 Petabytes of storage in tape drives
• Costs about 350 million dollars
• First on the Top500 list for 5 consecutive times; surpassed by IBM's BlueGene/L prototype on September 24, 2004
35
PVP Examples
• Early machines
  – CRI XMP, YMP, C-90, T-90
  – Cray 2
  – Fujitsu VP5000
• SX-8
• Cray X1
Steve Scott, Cray Inc.
36
SIMD Array
• SIMD semantics
  – Single Instruction stream, Multiple Data stream
  – Data set partitioned into blocks
    • One or two dimensions (vectors or matrices)
  – Each data block is processed separately
  – Each data block is controlled by the same instruction sequence
  – Data exchange cycle
• SIMD parallel structure
  – Array of arithmetic units, each coupled to local memory
  – Interconnect network for global data exchange
  – Single controller to issue instructions to array nodes
    • Early systems broadcast one instruction at a time
    • Modern systems point to a sequence of cached instructions
• SPMD
  – Single Program Multiple Data stream
  – Microprocessor-based system where each node runs the same program
38
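The SIMD execution pattern above can be sketched in plain Python: one "instruction" (here a function, standing in for the broadcast operation) applied in lockstep to every element of every partitioned data block. The names and data are invented for illustration.

```python
# SIMD-style execution sketch: a single broadcast instruction applied to
# every partitioned data block. Illustration only -- real SIMD hardware
# does this with one controller driving an array of ALUs.

def partition(data, n_blocks):
    # split the data set into equal blocks, one per processing element
    size = len(data) // n_blocks
    return [data[i * size:(i + 1) * size] for i in range(n_blocks)]

def simd_step(blocks, instruction):
    # the single controller broadcasts `instruction`; every element of
    # every block executes the same operation
    return [[instruction(x) for x in block] for block in blocks]

if __name__ == "__main__":
    blocks = partition(list(range(8)), 4)
    print(blocks)                              # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(simd_step(blocks, lambda x: x * x))  # [[0, 1], [4, 9], [16, 25], [36, 49]]
```

SPMD relaxes this: each node runs the whole program independently on its own block, rather than receiving one broadcast instruction at a time.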
Simplified SIMD Diagram
[Figure: a control processor (sequencer with instruction memory MI) drives an instruction broadcast bus feeding an array of data processors; each processing element is paired with local data memory (MD), and the elements exchange data through a switch]
39
CM-2
General specifications:
• Processors: 65,536
• Memory: 512 Mbytes
• Memory bandwidth: 300 Gbits/sec
• I/O channels: 8; capacity per channel: 40 Mbytes/sec max
• Transfer rate: 320 Mbytes/sec
• Performance in excess of 2500 MIPS
• Floating-point performance in excess of 2500 MFlops
DataVault specifications:
• Storage capacity: 5 or 10 Gbytes
• I/O interfaces: 2
• Transfer rate, burst: 40 Mbytes/sec max
• Aggregate rate: 320 Mbytes/sec
• Originated at MIT
• Commercialized at Thinking Machines Corp.
40
src : http://www.svisions.com/sv/cm-dv.html
ClearSpeed SIMD Accelerator
41
Special Purpose Devices
• SPD
• Optimized for a given algorithm or class of problems
• Functional elements and dataflow paths mirror the requirements of a specific algorithm
• Usually exploits fine-grain parallelism for very high parallelism
• Best for arithmetic (or logic) intensive applications with limited memory access requirements
• Best for strong temporal and spatial locality
• Systolic arrays are one class of such machines, widely used in digital signal processing
• Examples
  – MD-Grape: first Petaflops machine, for the N-body problem
  – GPU: Graphics Processing Unit, e.g. NVIDIA
  – FPGA: field programmable gate array
    • Allows reconfiguration of the logic array for a given application
43
Systolic Arrays

c_ij = Σ (k = 1 to n) a_ik × b_kj

[Figure: Warp processor array, in which a host connects through an interface unit to a linear array of cells (Cell 1, Cell 2, Cell 3, …, Cell n); the A and B operands stream through the X and Y paths while each cell accumulates elements of C]

Example implementation: Warp architecture
Matrix multiplication on a systolic array
References:
M. Annaratone, E. Arnould, et al., "The Warp Computer: Architecture, Implementation, and Performance"
Y. Yang, W. Zhao, and Y. Inoue, "High-Performance Systolic Arrays for Band Matrix Multiplication"
44
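The computation each systolic cell accumulates is the matrix-product sum c_ij = Σ_k a_ik b_kj; a minimal sequential sketch of that sum (this evaluates the result, it does not model the pulsed dataflow of the hardware):

```python
# Sequential sketch of the matrix product a systolic array computes:
#   c[i][j] = sum over k of a[i][k] * b[k][j]
# Each systolic cell would accumulate one such c[i][j] as the rows of A
# and columns of B pulse through the array; here we just evaluate the sum.

def matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul(a, b))   # [[19, 22], [43, 50]]
```

In the array itself the three nested loops disappear: the k-loop becomes the stream of operands flowing past a cell, and the i/j loops become the spatial layout of cells.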
Introduction to SMP
• Symmetric Multiprocessor
• Building block for large MPPs
• Multiple processors
  – 2 to 32 processors
  – Now multicore
• UMA shared memory
• Cache coherent
• Will be discussed in significant detail next time
46
Summary – Material for the Test
• Need to know content on slides 11, 15, 22, 23, 33
• Understand how each of the technologies listed on slide 7 affects performance
• Understand concepts on slides 8, 9
• Understand concepts on slides 17, 18, 20
• Understand pipelining concepts and equations detailed in slides 27, 28
• Understand vector processing concepts and equations detailed in slides 29, 30
48