High Performance Computing: Concepts, Methods, & Means
Parallel Computer Architecture
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 30, 2007
NEWS Alert!!
Intel and IBM announce development of high-k metal gate materials for dramatic reduction in leakage current, enabling continuation of Moore's law down to at least 22 nm feature size; considered the biggest breakthrough in 40 years in the semiconductor industry.
2
Topics
• Introduction
• Review Performance Factors (good & bad)
• What is Computer Architecture
• Parallel Structures & Performance Issues
• Performance Metrics
• Coarse-grained MIMD Processing – MPPs
• Very Fine-grained Vector Processing and PVPs
• SIMD array and SPMD
• Special Purpose Devices and Systolic Structures
• An Introduction to Shared Memory Multiprocessors
• Summary – Material for the Test
3
Opening Remarks
• This week is about supercomputer architecture
  – Today: major factors, classes, and system level
  – Next time: modern microprocessor and multicore node
• Architecture exploits device technology to deliver its innate computational performance potential
• Between device technology and architecture is circuit design
  – Circuit design converts devices to logic gates and higher-level logical structures (e.g., multiplexers, adders)
  – but this is outside the scope of this course
• We will assume a basic logic abstraction with characterizing properties:
  – Functional behavior (the logical operation it performs)
  – Switching speed
  – Propagation delay or latency
  – Size and power
5
Topics
• IntroductionReview Performance Factors (good & bad)• Review Performance Factors (good & bad)
• What is Computer Architecture• Parallel Structures & Performance Issues• Parallel Structures & Performance Issues• Performance Metrics• Course-grained MIMD Processing – MPPs g g• Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors
6
• Summary – Material for the Test6
Performance Factors: Technology
• Logic switching speed
• Logic latency time
• On-chip data transfer rate
• On-chip clock speed (clock cycle time)
• Logic density
• Memory cycle time
• Memory access time
• Memory density
• Processor-to-memory access latency
• Network data rate
• Network latency
7
Performance Factors: Parallelism
• Fully independent processing elements operating concurrently on separate tasks
• Instruction-level parallelism
• Execution pipeline
  – Overlapping sequential operations in execution pipeline
• SIMD operations
  – ALU arrays
  – Vector pipelines
• Overlapping of computation and communication
  – Asynchronous
  – Prefetching
• Multithreading
8
Performance Degradation
• Overhead
  – Critical work for management of concurrent tasks and parallel resources, not required for sequential execution
• Latency
  – Time for propagation of an action through a channel or sequence of stages
• Contention
  – Idle cycles waiting for access to shared resources
• Starvation
  – Insufficient parallelism
  – Inadequate load balancing
9
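The degradation factors above can be made concrete with a small analytic sketch. This is a hypothetical model (not from the slides): it charges a fixed per-processor overhead and a load-imbalance factor against ideal speedup; all parameter names and values are illustrative.

```python
# Hypothetical model of parallel execution time: ideal work division,
# degraded by per-processor overhead and by load imbalance.
# All names and numbers here are illustrative, not from the slides.

def parallel_time(work, p, overhead=0.0, imbalance=1.0):
    """Time to finish `work` units on p processors.

    overhead  -- extra management work per processor (the slides' 'overhead')
    imbalance -- >= 1.0; inadequate load balancing shows up as imbalance > 1
    """
    return (work / p) * imbalance + overhead

def speedup(work, p, overhead=0.0, imbalance=1.0):
    # ratio of sequential time (no overhead, no imbalance) to parallel time
    return parallel_time(work, 1) / parallel_time(work, p, overhead, imbalance)

if __name__ == "__main__":
    # With no overhead and perfect balance, speedup is ideal (= p).
    print(speedup(1000.0, 10))                                    # 10.0
    # Overhead and imbalance both degrade the achieved speedup.
    print(round(speedup(1000.0, 10, overhead=25.0, imbalance=1.2), 3))  # 6.897
```

Contention and latency could be folded into the same model as additional additive terms; the point is only that each degradation source subtracts from the ideal speedup of p.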
Computer Architecture
• Structure
  – Functional elements
  – Organization and balance
  – Interconnect and data flow paths
• Semantics
  – Meaning of the logical constructs
  – Primitive data types
  – Manifest as the Instruction Set Architecture abstraction layer
• Mechanisms
  – Primitive functions that are usually implemented in hardware or sometimes firmware
  – Determines preferred actions and sequences
  – Enables efficiency and scalability
• Policy
  – Actions and decisions outside the ISA or name space
11
Structure
• Functional elements
  – The form of functional elements made up of more primitive logical modules
  – e.g., a vector arithmetic unit comprising a pipeline of simple stages
• Organization and balance
  – Number of major elements of different types
  – Hierarchy of collections of elements
• Data flow
  – Interconnection of functional, state, and communication elements
  – Control of dataflow paths determines actions of processor and system
12
Semantics
• Meaning of the logical constructs
  – Basic operations that can be performed on data
• Primitive data types
  – Defines actions that can be performed on binary strings
• Instruction Set Architecture
  – Defined set of actions that can be performed and the data objects on which they can be applied
13
Mechanisms
• Primitive functions that are usually implemented in hardware or sometimes firmware
  – Lower level than instruction set operations
  – Multiple such mechanisms contribute to execution of an operation
• Determines preferred actions and sequences
  – Usually time-effective primitives
  – Usually widely used by many instructions
• Enables efficiency and scalability
  – Establishes basic performance properties of the machine
• Examples
  – Basic arithmetic and logic unit functions
  – Thread context switching
  – TLB (Translation Lookaside Buffer) address translation
  – Cache line replacement
14
Policy
• Not all machine decisions are visible to the instructions of the system
• Not all machine choices are available to the name space of the operands
• Examples
  – Cache structure, size, and speed
  – Cache replacement policies
  – Order of operation execution
  – Branch prediction
  – Allocation of shared resources
  – Network routers
15
Parallel Structures
• Pipelining
  – Vector processing
  – Execution pipeline
• Multiple arithmetic units
  – Instruction-level parallelism
  – Systolic arrays
• Multiple processors
  – MIMD: separate control
  – SIMD: single controller
17
Performance Issues
• Pipelining increases throughput
  – More operations per unit time
• Pipelining increases latency time
  – Operation on a single operand pair can take longer than in a non-pipelined functional unit
• Multiple ALUs
  – Increases peak performance
  – Requires application instruction-level parallelism
  – Average usually significantly lower than peak
• Multiple processors require overhead operations
  – Synchronization
  – Communications
18
Topics
• Introduction• Review Performance Factors (good & bad)• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance IssuesParallel Structures & Performance Issues• Performance Metrics• Course-grained MIMD Processing – MPPs g g• Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors
19
• Summary – Material for the Test19
Performance Metrics
• Peak floating point operations per second (flops)
• Peak instructions per second (ips)
• Sustained throughput
  – flops, Mflops, Gflops, Tflops, Pflops
  – flops, Megaflops, Gigaflops, Teraflops, Petaflops
  – ips, Mips, …
• Cycles per instruction
  – cpi
  – Alternatively: instructions per cycle, ipc
• Memory access latency
  – cycles (or seconds)
• Memory access bandwidth
  – bytes per second
  – or Gigabytes per second, GBps, GB/s
• Bisection bandwidth
  – bytes per second
20
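The metrics above are related by simple conversions; a minimal sketch, with made-up machine parameters (the 2 GHz clock and cpi value are invented examples, not real hardware figures):

```python
# Illustrative conversions between the performance metrics listed above.
# All machine parameters here are invented examples.

def ipc_from_cpi(cpi):
    # cycles-per-instruction and instructions-per-cycle are reciprocals
    return 1.0 / cpi

def sustained_ips(clock_hz, cpi):
    # instructions per second = clock rate / cycles per instruction
    return clock_hz / cpi

def sustained_flops(clock_hz, cpi, flops_per_instruction):
    # flops follows from the instruction rate and the floating-point
    # work carried by each instruction
    return sustained_ips(clock_hz, cpi) * flops_per_instruction

if __name__ == "__main__":
    clock = 2.0e9    # hypothetical 2 GHz clock
    cpi = 0.5        # i.e., ipc = 2
    print(ipc_from_cpi(cpi))          # 2.0
    print(sustained_ips(clock, cpi))  # 4000000000.0 (4e9 instructions/sec)
```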
Topics
• Introduction• Review Performance Factors (good & bad)• Review Performance Factors (good & bad)• What is Computer Architecture• Parallel Structures & Performance IssuesParallel Structures & Performance Issues• Performance Metrics• Course-grained MIMD Processing – MPPsg g• Very Fine-grained Vector Processing and PVPs• SIMD array and SPMD• Special Purpose Devices and Systolic Structures• An Introduction to Shared Memory Multiprocessors
21
• Summary – Material for the Test21
Multiprocessor
• A general class of system
• Integrates multiple processors into an interconnected ensemble
• MIMD: Multiple Instruction stream, Multiple Data stream
• Different memory models
  – Distributed memory
    • Nodes support separate address spaces
  – Shared memory
    • Symmetric multiprocessor
    • UMA – uniform memory access
    • Cache coherent
  – Distributed shared memory
    • NUMA – non-uniform memory access
    • Cache coherent
  – PGAS
    • Partitioned global address space
    • NUMA
    • Not cache coherent
    • Ensemble of distributed shared memory nodes
• Massively Parallel Processor, MPP
22
Massively Parallel Processor
• MPP
• General class of large-scale multiprocessor
• Represents the largest systems
  – IBM BG/L
  – Cray XT3
• Distinguished by memory strategy
  – Distributed memory
  – Distributed shared memory
    • Cache coherent
    • Partitioned global address space
• Potentially heterogeneous
  – May incorporate accelerators to boost peak performance
23
IBM Blue Gene/L
24
BG/L packaging hierarchy
25
Pipeline Structures
• Partitioning of a functional unit into a sequence of stages
  – Execution time of each stage is < that of the original unit
  – Total time through the sequence of stages is usually > that of the original unit
• Pipeline permits overlapping of multiple operations
  – At any one time, each stage is performing a different operation
  – # of operations being performed in parallel = # stages
• Performance
  – Pipeline increments at the clock rate of the slowest pipeline stage
  – Response time for an operation is the product of # stages and clock cycle time
  – Throughput = clock rate
    • i.e., one operation result per clock cycle of the pipeline
• Pipeline structures are employed in many parts of a computer architecture
  – to enable high throughput in the presence of high latency
  – to enable faster clock rates
27
Pipeline: Concepts

t_p << T_c
T_p = N_p × t_p
T_c < T_p
Perf_c = 1 / T_c
Perf_p = 1 / t_p

Where:
• T_c is the logic latency
• T_p is the aggregated pipeline latency
• t_p is the latency for each pipelined step
28
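The pipeline relations on this slide can be checked numerically; a small sketch with invented latencies (T_c, N_p, and t_p are illustrative values, not measurements):

```python
# Numeric check of the pipeline model, with invented values:
#   T_p = N_p * t_p,  Perf_c = 1/T_c,  Perf_p = 1/t_p

T_c = 10.0          # latency of the unpipelined logic (ns)
N_p = 5             # number of pipeline stages
t_p = 2.5           # latency of one pipelined step (ns); stage overhead
                    # is why t_p > T_c / N_p in practice

T_p = N_p * t_p     # aggregated pipeline latency: 12.5 ns > T_c
Perf_c = 1.0 / T_c  # unpipelined throughput: one result per 10 ns
Perf_p = 1.0 / t_p  # pipelined throughput: one result per 2.5 ns

# The pipeline raises latency for any single operation (T_p > T_c)
# but multiplies throughput (Perf_p / Perf_c = 4x here).
print(T_p, Perf_p / Perf_c)   # 12.5 4.0
```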
Vector Processors
• Supports SIMD semantics
  – Many instances of the same operation performed concurrently under the same control element
  – Operates on vector data structures rather than single scalar values
  – Vector-scalar operations
    • Scale a vector by a scalar factor (multiply each vector element by the scalar)
  – Inter-vector operations
    • e.g., pairwise multiplies
  – Intra-vector operations
    • Reduction operators
    • e.g., sum all elements of a vector
• Exploits pipeline structure
  – Arithmetic units
  – Vector registers
  – Overlap of memory bank access cycles
  – Overlap of communication with computation
• Limited scaling – upper bound on number of pipeline stages
29
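The three operation classes above can be sketched in plain Python, with lists standing in for vector registers (illustrative only: a real vector unit performs these in hardware pipelines, not element-by-element in software):

```python
# Sketch of the three vector operation classes, using plain Python lists
# as stand-ins for vector registers (illustration, not a hardware model).

def vector_scalar(v, s):
    # vector-scalar: scale every element of v by the scalar s
    return [x * s for x in v]

def inter_vector(a, b):
    # inter-vector: pairwise multiply of two equal-length vectors
    return [x * y for x, y in zip(a, b)]

def intra_vector(v):
    # intra-vector: reduction -- here, sum all elements to one scalar
    total = 0
    for x in v:
        total += x
    return total

if __name__ == "__main__":
    v = [1, 2, 3, 4]
    print(vector_scalar(v, 10))   # [10, 20, 30, 40]
    print(inter_vector(v, v))     # [1, 4, 9, 16]
    print(intra_vector(v))        # 10
```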
[Figure: a vector register of length N_R feeds a vector ALU of N_S stages; a high-speed memory bus connects memory banks M_1, M_2, …, M_N]

T_MV = t_s + Σ t_a
Ideal vector performance: P_M = N / T_MV
Achieved vector performance: Perf(N_R) = N_R / ((N_S + N_R) × t_c)
Half of peak performance is reached at the vector length N_1/2 = N_S:
Perf(N_1/2) = N_S / ((2 × N_S) × t_c) = 1 / (2 × t_c)

Where:
• t_a is the time for memory access
• t_s is the startup time
• T_MV is the combined time for memory vector access
• P_M is the memory performance
• t_c is the ALU clock time of each step
30
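The achieved-vector-performance model above is easy to evaluate numerically; a sketch with invented stage count and clock (N_S and t_c are illustrative values):

```python
# Numeric sketch of the achieved-vector-performance model,
#   Perf(N_R) = N_R / ((N_S + N_R) * t_c),
# with invented values for the stage count and clock.

N_S = 8        # pipeline stages (illustrative)
t_c = 1.0      # ALU clock time per step (illustrative units)

def perf(n_r):
    # performance achieved for a vector of length n_r
    return n_r / ((N_S + n_r) * t_c)

peak = 1.0 / t_c    # asymptotic rate as the vector length grows
half = perf(N_S)    # at N_R = N_S the model gives exactly half of peak

print(half, peak / 2)                # 0.5 0.5
print(round(perf(1024) / peak, 3))   # 0.992 -- long vectors approach peak
```

This is the familiar pipeline fill cost: short vectors pay the N_S-stage startup proportionally more, which is why vector machines reward long vectors.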
Cray 1
• First announced in 1975-76
• 80 MHz clock rate
• Theoretical peak performance 160 MIPS; average performance 136 megaflops; vector-optimized peak performance 150 megaflops
• 1 million 64-bit words of high-speed memory
• Manufactured by Cray Research Inc.
• First customer was the National Center for Atmospheric Research (NCAR), for 8.86 million dollars
[Figure: the Cray 1 system and Cray 1 logic boards]
31
src : http://en.wikipedia.org/wiki/Cray-1
32
Parallel-Vector-Processors: PVP
• Combines strengths of vector and MPP
  – Efficiency of vector processing
    • Capability computing
  – Scalability of massively parallel processing
    • Capacity and cooperative computing
• Two levels of parallelism
  – Ultra fine-grain vector parallelism with vector pipelining
  – Medium- to coarse-grain processor parallelism
• Memory model
  – Alternative ways of organizing memory & address space
  – Distributed memory
    • Shared memory within a node of multiple vector processors
    • Fragmented or decoupled address space between nodes
  – Partitioned global address space
    • Globally accessible address space
    • No cache coherence between nodes
33
Earth Simulator (Images)
34
src : http://www.es.jamstec.go.jp/esc/eng/
Earth Simulator (Facts)
• Located in Yokohama, Japan
• Size of the entire center is about 4 tennis courts
• Can execute 35.86 trillion (35,860,000,000,000) FLOPS, or 35.86 TFLOPS
• Consists of 640 nodes, each node consisting of 8 vector processors and 16 GB of memory
• Totaling 5120 processors and 10 Terabytes of memory
• Aggregated disk storage of 700 Terabytes and around 1.6 Petabytes of storage in tape drives
• Costs about 350 million dollars
• First on the Top500 list for 5 consecutive times; surpassed by IBM's BlueGene/L prototype on September 24, 2004
35
PVP Examples
• Early machines
  – CRI XMP, YMP, C-90, T-90
  – Cray 2
  – Fujitsu VP5000
• SX-8
• Cray X1
Steve Scott, Cray Inc.
36
SIMD Array
• SIMD semantics
  – Single Instruction stream, Multiple Data stream
  – Data set partitioned into blocks
    • One or two dimensions (vectors or matrices)
  – Each data block is processed separately
  – Each data block is controlled by the same instruction sequence
  – Data exchange cycle
• SIMD parallel structure
  – Array of arithmetic units, each coupled to local memory
  – Interconnect network for global data exchange
  – Single controller to issue instructions to array nodes
    • Early systems broadcast one instruction at a time
    • Modern systems point to a sequence of cached instructions
• SPMD
  – Single Program Multiple Data stream
  – Microprocessor-based system where each node runs the same program
38
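The SIMD execution pattern above can be sketched in plain Python: one "instruction" (here a function, standing in for the broadcast operation) applied in lockstep to every element of every partitioned data block. The names and data are invented for illustration.

```python
# SIMD-style execution sketch: a single broadcast instruction applied to
# every partitioned data block. Illustration only -- real SIMD hardware
# does this with one controller driving an array of ALUs.

def partition(data, n_blocks):
    # split the data set into equal blocks, one per processing element
    size = len(data) // n_blocks
    return [data[i * size:(i + 1) * size] for i in range(n_blocks)]

def simd_step(blocks, instruction):
    # the single controller broadcasts `instruction`; every element of
    # every block executes the same operation
    return [[instruction(x) for x in block] for block in blocks]

if __name__ == "__main__":
    blocks = partition(list(range(8)), 4)
    print(blocks)                              # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(simd_step(blocks, lambda x: x * x))  # [[0, 1], [4, 9], [16, 25], [36, 49]]
```

SPMD relaxes this: each node runs the whole program independently on its own block, rather than receiving one broadcast instruction at a time.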
Simplified SIMD Diagram
[Figure: a control processor (sequencer with instruction memory MI) drives an instruction broadcast bus feeding an array of data processors; each processing element is paired with local data memory (MD), and the elements exchange data through a switch]
39
CM-2
General specifications:
• Processors: 65,536
• Memory: 512 Mbytes
• Memory bandwidth: 300 Gbits/sec
• I/O channels: 8; capacity per channel: 40 Mbytes/sec max
• Transfer rate: 320 Mbytes/sec
• Performance in excess of 2500 MIPS
• Floating-point performance in excess of 2500 MFlops
DataVault specifications:
• Storage capacity: 5 or 10 Gbytes
• I/O interfaces: 2
• Transfer rate, burst: 40 Mbytes/sec max
• Aggregate rate: 320 Mbytes/sec
• Originated at MIT
• Commercialized at Thinking Machines Corp.
40
src : http://www.svisions.com/sv/cm-dv.html
ClearSpeed SIMD Accelerator
41
Special Purpose Devices
• SPD
• Optimized for a given algorithm or class of problems
• Functional elements and dataflow paths mirror the requirements of a specific algorithm
• Usually exploits fine-grain parallelism for very high parallelism
• Best for arithmetic (or logic) intensive applications with limited memory access requirements
• Best for strong temporal and spatial locality
• Systolic arrays are one class of such machines, widely used in digital signal processing
• Examples
  – MD-Grape: first Petaflops machine, for the N-body problem
  – GPU: Graphics Processing Unit, e.g. NVIDIA
  – FPGA: field programmable gate array
    • Allows reconfiguration of the logic array for a given application
43
Systolic Arrays

c_ij = Σ (k = 1 to n) a_ik × b_kj

[Figure: Warp processor array, in which a host connects through an interface unit to a linear array of cells (Cell 1, Cell 2, Cell 3, …, Cell n); the A and B operands stream through the X and Y paths while each cell accumulates elements of C]

Example implementation: Warp architecture
Matrix multiplication on a systolic array
References:
M. Annaratone, E. Arnould, et al., "The Warp Computer: Architecture, Implementation, and Performance"
Y. Yang, W. Zhao, and Y. Inoue, "High-Performance Systolic Arrays for Band Matrix Multiplication"
44
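The computation each systolic cell accumulates is the matrix-product sum c_ij = Σ_k a_ik b_kj; a minimal sequential sketch of that sum (this evaluates the result, it does not model the pulsed dataflow of the hardware):

```python
# Sequential sketch of the matrix product a systolic array computes:
#   c[i][j] = sum over k of a[i][k] * b[k][j]
# Each systolic cell would accumulate one such c[i][j] as the rows of A
# and columns of B pulse through the array; here we just evaluate the sum.

def matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul(a, b))   # [[19, 22], [43, 50]]
```

In the array itself the three nested loops disappear: the k-loop becomes the stream of operands flowing past a cell, and the i/j loops become the spatial layout of cells.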
Introduction to SMP
• Symmetric Multiprocessor
• Building block for large MPPs
• Multiple processors
  – 2 to 32 processors
  – Now multicore
• UMA shared memory
• Cache coherent
• Will be discussed in significant detail next time
46
Summary – Material for the Test
• Need to know content on slides 11, 15, 22, 23, 33
• Understand how each of the technologies listed on slide 7 affects performance
• Understand concepts on slides 8, 9
• Understand concepts on slides 17, 18, 20
• Understand pipelining concepts and equations detailed in slides 27, 28
• Understand vector processing concepts and equations detailed in slides 29, 30
48