Parallel Computer Organization and Design EDA282 Slide 1


Slide 2

Why Study Parallel Computers?

Almost ALL computers are now parallel

Understanding hardware is important for producing good software (converse also true!)

It’s fun!

Logistics

EL43 1:15-3:00 T/Th (often 1:15 F, too)

Expected participation
- Attend lectures, participate in discussion
- Complete labs (including a satisfactory writeup); dates/times TBD
- Read papers
- Complete quizzes
- Write (short) survey article (in teams)
- Finish (short) take-home exam

Canvas course-management system
- https://canvas.instructure.com/courses/777378
- Link: http://www.cse.chalmers.se/~mckee/eda282

Slide 3


Personnel

Prof. Sally McKee
- Office hours: arrange meetings via email
- Available for discussions after class
- [email protected]

Jacob Lidman
- [email protected]

Slide 4


Course Materials

“Parallel Computer Organization and Design”, Dubois, Annavaram, Stenström (at Cremona)

Research and survey papers (linked to web page)

Slide 5

Course Structure/Contents

Intro today

Programming models
- Data parallelism
- Shared address spaces
- Message passing
- Hybrid

Design principles/tradeoffs (this is the bulk of the material)
- Small-scale systems
- Scalable systems
- Interconnects

Slide 6

For Each Big Topic, We’ll Discuss . . .

History
- How concepts originated in old machines
- How they show up in current machines

Basics required in any parallel machine
- Memory coherence
- Communication
- Synchronization

How Did We Get Here?

Transistor count doubling every ~2 years

Transistor feature sizes shrinking

Costs changing

Clock speeds hitting limits

Parallelism per processor increasing

Looking at trends is important when designing new systems!

Slide 8

Costs of Parallel Machines

Things to keep in mind when designing a machine . . .

What does it cost to design the mechanism?
What does it cost to verify?
What does it cost to manufacture?
What does it cost to test?
What does it cost to program it?
What does it cost to deploy (turn on)?
What does it cost to keep it running? (power costs, maintenance)
What does it cost to use it?
What does it cost to dispose of it at the end of its lifetime? (how long is a "lifetime"?)

Slide 9

Slide 10

Interesting Questions (i.e., course content)

What do we mean by parallel?
- Task parallelism (SPMD, MPMD)
- Data parallelism (SIMD)
- Thread parallelism (Hyperthreading, SMT)

How do the processors coordinate their work?
- Shared memory/message passing
- Interconnection network (at least one!)
- Synchronization primitives
- Many combinations/variations

What’s the best way to put these pieces together?
- What do you want to run?
- How fast do you have to run it?
- How much can you spend?
- How much energy can you use?

Moore’s Law: Transistor Counts

Slide 11

Feature Sizes

Slide 12

Costs: Apple

Slide 13

Costs of Consumer Electronics Today

Slide 14

History

Pascal adding machine, 1642

Leibniz adder/multiplier, ~1670

Babbage analytical engine, 1837 (punch cards, memory, printer!)

Hollerith punchcards, 1890 (used for US census data)

Aiken digital computer, 1940s (Harvard)

Von Neumann stored-program computer, 1945

Eckert/Mauchly ENIAC GP computer, 1946

Slide 15

Evolution of Electronic Computers

Vacuum tubes replaced by transistors, late 1950s
- Smaller, faster, more versatile logic elements
- Lower power
- Longer lifetime

Integrated circuits, late 1960s
- Many transistors fabricated on silicon substrate
- Wires plated in place
- Lower price
- Smaller size
- Lower failure rate

LSI/VLSI/microprocessors, 1970s
- 1000s of interconnected transistors etched into silicon
- Could check 8 switches at once → 8-bit “byte”

Slide 16

History of Supercomputers

IBM 7030 Stretch, 1961
- 2K sq. ft.
- Fastest computer in world at time
- Slower than expected!
- Cost initially $13M, dropped to $8.5M
- Instruction pipelining, prefetching/decoding, memory interleaving

CDC 6600, 1964
- Size ~= 4 filing cabinets
- Cost $8M ($60M today)
- 40 MHz, 3 MFLOPS at peak
- Freon cooled
- CPU == 10 FUs, multiple PCBs
- 60-bit words/regs

Slide 17

History of Supercomputers (2)

Cray 1, 1976
- 64-bit words
- 80 MHz
- 136 MFLOPS!
- Speed-critical parts placed inside
- 1662 PCBs w/ 144 ICs
- 80 sold in 10 years
- $5-8M ($25M now)

Slide 18

History of Supercomputers (3)

Cray XMP, 1982
- Up to 4 CPUs in 1 chassis
- Up to 16M 64-bit words (128 MB, all SRAM!)
- Up to 32 1.2 GB disks
- 105 MHz
- Up to 800 MFLOPS (200/CPU)
- Double memory bandwidth wrt Cray 1

Cray 2, 1985
- Again ICs packed on logic boards
- Again, horseshoe shape
- Boards packed tightly, submersed in Fluorinert to cool (see http://archive.computerhistory.org/resources/text/Cray/Cray.Cray2.1985.102646185.pdf)
- Up to 8 CPUs, 1.9 GFLOPS
- Mainstream software/Unix System V OS

Slide 19

History of Supercomputers (4)

Intel Paragon, 1989
- i860-based
- 32- or 64-bit
- Up to 4K CPUs
- 2D MIMD topology
- Poor memory bandwidth utilization

ASCI Red, 1996
- First to use off-the-shelf CPUs (Pentium Pros, Xeons)
- 6K CPUs
- Broke 1 TFLOPS barrier
- Cost $46M ($67M now)
- Upgrade had 9298 Xeons for 3.1 TFLOPS
- Over 1 MW power!

Slide 20

History of Supercomputers (5)

Hitachi SR2201, 1996
- H-shaped chassis
- 2048 CPUs
- 600 GFLOPS peak

Other similar machines (many Japanese)
- 100s of CPUs
- 2D or 3D networks (e.g., Cray torus)
- MIMD

Seymour Cray leaves Cray Research
- Cray Computer Corp (CCC)
  - Cray 3 first gallium arsenide chips
  - Cray 4 failed → bankruptcy
- SRC Computers (see http://www.srccomp.com/about/aboutus.asp)

Slide 21

Biggest Machine Today

Sequoia: IBM BlueGene/Q machine at the U.S. Dept. of Energy Lawrence Livermore National Lab

Slide 22

Types of Parallelism

Instruction-Level Parallelism (ILP)
- Superscalar issue
- Out-of-order execution
- Very Long Instruction Word (VLIW)

Thread-Level Parallelism (TLP)
- Loop-level
- Multithreading
  - Explicit
  - Speculative
  - Simultaneous/Hyperthreading

Task-Level Parallelism

Program-Level Parallelism

Data-Level Parallelism

Slide 23

Parallelism in Sequential Programs

Programming model: C (sequential)

Architecture: superscalar
- ILP
- Communication through registers
- Synchronization through pipeline interlocks

Slide 24

for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
    d[i] := C*a[i];

Iteration:   0      1      2      …   N-1
Loop 1:      a[1]   a[2]   a[3]   …   a[0]
Loop 2:      a[0]   a[1]   a[2]   …   a[N-1]

Data dependencies: each a[k] written in Loop 1 is read in Loop 2.
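The two loops above can be traced with a minimal Python sketch. The function names and the sample inputs are illustrative assumptions, not part of the slides; the point is only that Loop 2 iteration i consumes a value produced by a different Loop 1 iteration.

```python
def run_sequential(b, c, C):
    """Run the two loops from the slide in program order."""
    n = len(b)
    a = [0] * n
    # Loop 1: iteration i writes a[(i+1) mod N].
    for i in range(n):
        a[(i + 1) % n] = b[i] + c[i]
    # Loop 2: iteration i reads a[i], which Loop 1 wrote
    # in iteration (i-1) mod N.
    d = [C * a[i] for i in range(n)]
    return a, d


def producer_iteration(i, n):
    """Loop 1 iteration that produces the a[i] read by Loop 2 iteration i."""
    return (i - 1) % n


a, d = run_sequential([1, 2, 3, 4], [10, 20, 30, 40], 2)
# a == [44, 11, 22, 33], d == [88, 22, 44, 66]
```

Within each loop the iterations are independent of each other; the dependence runs only from Loop 1 to Loop 2, which is what the parallel versions on later slides have to respect.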

Parallel Programming Models

Extend semantics to express
- Units of parallelism
  - Instructions
  - Threads
  - Programs
- Communication and coordination between units via
  - Registers
  - Memory
  - I/O

Slide 25

Model vs. Architecture

- Communication abstraction supports model
- Communication architecture (ISA + comm/sync) implements part of model
- Hw/sw boundary defines which parts of comm arch implemented in which

Slide 26

[Figure: layers of a parallel system, top to bottom]
Parallel applications: CAD, Databases, Scientific modeling
Programming models: Multiprogramming, Shared address, Message passing, Data parallel
Communication abstraction (user/system boundary)
Compiler or library; Operating system support
Hardware/software boundary
Communication hardware
Physical communication medium

Shared Address Space Model

TLP

Communication/coordination among threads via shared global address space

Slide 27

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        d[j] := C*a[j];

Communication abstraction supported by HW/SW interface

[Figure: processors P, P, P sharing one Memory]
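A minimal Python sketch of the shared address space version, assuming threads stand in for the P processes and a simple block partition supplies the i0[i] and in[i] bounds. The function name and the partition formula are assumptions for illustration. In CPython the global interpreter lock prevents real speedup here, so this shows the model, not the performance.

```python
import threading


def shared_address_space(b, c, C, P):
    n = len(b)
    a = [0] * n   # shared: every thread reads and writes these lists
    d = [0] * n
    barrier = threading.Barrier(P)
    # Thread i owns indices lo..hi inclusive (the slide's i0[i]..in[i]).
    bounds = [(i * n // P, (i + 1) * n // P - 1) for i in range(P)]

    def worker(i):
        lo, hi = bounds[i]
        for j in range(lo, hi + 1):
            a[(j + 1) % n] = b[j] + c[j]
        # No thread may read a[] until every thread has finished writing it.
        barrier.wait()
        for j in range(lo, hi + 1):
            d[j] = C * a[j]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return d


print(shared_address_space([1, 2, 3, 4], [10, 20, 30, 40], 2, 2))
# prints [88, 22, 44, 66]
```

Communication is implicit: the thread that owns index hi writes a[(hi+1) mod N], and the neighbouring thread simply reads it after the barrier.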

Message Passing Model

Process-level parallelism (separate addr spaces)

Communication/coordination via messages

Slide 28

for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        if j = in[i] then send(a[index], (j+1) mod P, a[j]);
    end_for
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        if j = i0[i] then recv(tmp, (P+j-1) mod P, a[j]);
        d[j] := C * tmp;
    end_for
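One way to read the message passing version is sketched below in Python. This is an emulation under stated assumptions, not the slide's code: threads stand in for processes, each worker touches only its private slices, one `queue.Queue` per worker plays the role of send/recv, and the shared `results` list exists only to collect the output. The sketch sends the boundary element to worker (i+1) mod P and uses the received value only for the first local element.

```python
import queue
import threading


def message_passing(b, c, C, P):
    n = len(b)
    if not 1 <= P <= n:
        raise ValueError("need 1 <= P <= N so every worker owns an element")
    bounds = [(i * n // P, (i + 1) * n // P - 1) for i in range(P)]
    inbox = [queue.Queue() for _ in range(P)]   # one mailbox per worker
    results = [None] * P                         # output collection only

    def worker(i):
        lo, hi = bounds[i]
        # Private copies stand in for a separate address space.
        b_loc, c_loc = b[lo:hi + 1], c[lo:hi + 1]
        a_loc = [0] * len(b_loc)                 # a_loc[k] holds a[lo + k]
        for k in range(len(b_loc)):
            value = b_loc[k] + c_loc[k]          # this is a[(lo+k+1) mod N]
            if k == len(b_loc) - 1:
                inbox[(i + 1) % P].put(value)    # send boundary element
            else:
                a_loc[k + 1] = value
        a_loc[0] = inbox[i].get()                # recv a[lo] from worker i-1
        results[i] = [C * x for x in a_loc]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [x for part in results for x in part]


print(message_passing([1, 2, 3, 4], [10, 20, 30, 40], 2, 2))
# prints [88, 22, 44, 66]
```

In this sketch the blocking receive already orders the read after the neighbour's write, so no separate barrier is needed; only one element per worker crosses an address-space boundary.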

Data Parallelism (SIMD)

Programming model
- Operations done in parallel on multiple data elements
- Single thread of control

Architectural model
- Array of simple, cheap processors w/ little memory
- Attached to control proc that issues instructions
- Specialized + general comm, cheap sync

Slide 29

[Figure: control processor attached to a 3 x 3 array of processing elements (PEs)]

parallel (i:0->N-1)
    a[(i+1) mod N] := b[i] + c[i];
parallel (i:0->N-1)
    d[i] := C * a[i];
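The data-parallel form can be read as three whole-array operations. The Python sketch below is an illustrative assumption (plain lists, a hypothetical function name); each comprehension stands for one conceptual step applied to all elements at once, and the index shift becomes a rotation, which is where inter-PE communication would occur.

```python
def data_parallel(b, c, C):
    # Elementwise add over all indices (conceptually one parallel step).
    sums = [x + y for x, y in zip(b, c)]
    # a[(i+1) mod N] := sums[i] is a rotate-right by one position.
    a = sums[-1:] + sums[:-1]
    # Elementwise scale, again one conceptual step.
    return [C * x for x in a]


print(data_parallel([1, 2, 3, 4], [10, 20, 30, 40], 2))
# prints [88, 22, 44, 66]
```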

Coarser-Grain Data Parallelism

Single-Program Multiple-Data

More broadly applicable than SIMD

Slide 30

Creating a Parallel Program

ID work that can be done in parallel
- Computation
- Data access
- I/O

Partition work/data among entities
- Processes
- Threads

Manage data access, comm, sync

Speedup(P) = Performance(P)/Performance(1) = Time(1)/Time(P)
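As a small worked example of the speedup definition, here is a Python sketch. The timings are made-up numbers, and `efficiency` (speedup divided by P) is a commonly used companion metric that does not appear on the slide.

```python
def speedup(time_1, time_p):
    """Speedup(P) = Time(1) / Time(P)."""
    if time_1 <= 0 or time_p <= 0:
        raise ValueError("times must be positive")
    return time_1 / time_p


def efficiency(time_1, time_p, p):
    """Speedup per processor; not on the slide, added for illustration."""
    return speedup(time_1, time_p) / p


# Hypothetical measurement: 120 s on 1 processor, 20 s on 8 processors.
print(speedup(120.0, 20.0))        # prints 6.0
print(efficiency(120.0, 20.0, 8))  # prints 0.75
```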

Slide 31

Steps

Decomposition

Assignment

Orchestration

Mapping

Can be done by
- Programmer
- Compiler
- Runtime
- Hardware (speculatively)

Slide 32

Architecture independent

Parallelization

Architecture dependent

Slide 33

[Figure: Sequential computation → Tasks → Processes (P0, P1, P2, P3) → Parallel program → Processors]

Concepts

Task
- Arbitrary piece of work from computation
- Sequentially executed
- Could be fine- or coarse-grained

Process (or thread)
- What gets executed by a core
- Abstract entity that performs tasks assigned to it
- Processes comm & sync to perform tasks

Processor (core)
- Physical engine on which processes run
- Virtualized machine view for programmer

Slide 34

Decomposition

Purpose: Break up computation into tasks to be divided among processes
- Tasks may become available dynamically
- Number of available tasks may vary with time
- i.e., identify concurrency and decide level at which to exploit it

Goal: keep processes busy, but keep management reasonable
- Number of tasks creates upper bound on speedup
- Too many tasks requires too much coordination

Slide 35

Assignment

Specify mechanism to divide work among processes
- Strive for balance
- Reduce communication, management

Structured approach recommended
- Inspect code
- Apply well known heuristics

Programmer focuses on decomp/assign 1st
- Largely independent of architecture/programming model
- Choice of primitives (cost/complexity) affects decisions

Architects assume program(mer) does decent job
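One simple static assignment is a balanced block partition, which would produce the i0[i] and in[i] bounds used in the earlier pseudocode. The Python sketch below is one possible heuristic under that assumption, not a method prescribed by the slides; `in_` is spelled with an underscore because `in` is a Python keyword.

```python
def block_partition(n, p):
    """Split indices 0..n-1 into p contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n, p)
    i0, in_ = [], []
    start = 0
    for i in range(p):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more
        i0.append(start)
        in_.append(start + size - 1)           # inclusive upper bound
        start += size
    return i0, in_


print(block_partition(10, 4))
# prints ([0, 3, 6, 8], [2, 5, 7, 9])
```

Contiguous blocks keep communication low for the example loops, since only one boundary element per process crosses to a neighbour.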

Slide 36

Orchestration

Purpose
- Name data, structure comm/sync
- Organize data structures, schedule tasks (temporally)

Goals
- Reduce costs of comm/sync from processor POV
- Improve data locality
- Reduce overhead of managing parallelism

Choices depend heavily on comm abstraction, efficiency of primitives

Architects must provide appropriate, efficient primitives

Slide 37

Mapping

Two aspects
- Which processes to run on same processor
- Which process runs on which processor

One extreme: space-sharing
- Partition machine s.t. only 1 app at a time in a subset
- Pin processes to cores (or let OS balance workloads)

Another extreme
- Control complete resource management in OS
- Use performance techniques for dynamic balancing

Real world is between the two
- User specifies desires in some aspects
- System may ignore
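As a sketch of the "pin processes to cores" option, the Python snippet below uses `os.sched_setaffinity`, which Python exposes only on some platforms (notably Linux). The helper name `pin_to_core` and its fallback behaviour are assumptions for illustration; where the call is missing or refused, the helper returns None and the OS keeps balancing the workload.

```python
import os


def pin_to_core(core=None):
    """Try to pin the calling process to one core; return the core or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None                       # platform without this API
    allowed = sorted(os.sched_getaffinity(0))
    target = allowed[0] if core is None else core
    try:
        os.sched_setaffinity(0, {target})  # 0 means the calling process
    except OSError:
        return None                       # core not available to this process
    return target


pinned = pin_to_core()
print("pinned to core", pinned if pinned is not None else "(not pinned)")
```

This matches the last bullet above: the user states a preference, and the system may or may not honour it.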

Slide 38

High-Level Goals

High performance

Low resource usage

Low development effort

Low power consumption

Implications for algorithm designers and architects
- Algorithm designers: high-performance, low resource needs
- Architects: high-performance, low cost, reduced programming effort

Slide 39


Slide 40
