Parallel Computer Organization and Design
EDA282
Slide 1
Slide 2
Why Study Parallel Computers?
Almost ALL computers are now parallel
Understanding hardware is important for producing good software (converse also true!)
It’s fun!
Logistics
EL43 1:15-3:00 T/Th (often 1:15 F, too)
Expected participation
  - Attend lectures, participate in discussion
  - Complete labs (including a satisfactory writeup); dates/times TBD
  - Read papers
  - Complete quizzes
  - Write (short) survey article (in teams)
  - Finish (short) take-home exam
Canvas course-management system
  - https://canvas.instructure.com/courses/777378
  - Link: http://www.cse.chalmers.se/~mckee/eda282
Slide 3
Personnel
Prof. Sally McKee
  - Office hours: arrange meetings via email
  - Available for discussions after class
  - [email protected]
Jacob Lidman
  - [email protected]
Slide 4
Course Materials
“Parallel Computer Organization and Design,” Dubois, Annavaram, Stenström (at Cremona)
Research and survey papers (linked to web page)
Slide 5
Course Structure/Contents
Intro today
Programming models
  - Data parallelism
  - Shared address spaces
  - Message passing
  - Hybrid
Design principles/tradeoffs (this is the bulk of the material)
  - Small-scale systems
  - Scalable systems
  - Interconnects
Slide 6
For Each Big Topic, We’ll Discuss . . .
History
  - How concepts originated in old machines
  - How they show up in current machines
Basics required in any parallel machine
  - Memory coherence
  - Communication
  - Synchronization
How Did We Get Here?
Transistor count doubling every ~2 years
Transistor feature sizes shrinking
Costs changing
Clock speeds hitting limits
Parallelism per processor increasing
Looking at trends is important when designing new systems!
Slide 8
Costs of Parallel Machines
Things to keep in mind when designing a machine . . .
What does it cost to design the mechanism?
What does it cost to verify?
What does it cost to manufacture?
What does it cost to test?
What does it cost to program it?
What does it cost to deploy (turn on)?
What does it cost to keep it running? (power costs, maintenance)
What does it cost to use it?
What does it cost to dispose of it at the end of its lifetime? (How long is a "lifetime"?)
Slide 9
Slide 10
Interesting Questions (i.e., course content)
What do we mean by parallel?
  - Task parallelism (SPMD, MPMD)
  - Data parallelism (SIMD)
  - Thread parallelism (Hyperthreading, SMT)
How do the processors coordinate their work?
  - Shared memory/message passing
  - Interconnection network (at least one!)
  - Synchronization primitives
  - Many combinations/variations
What’s the best way to put these pieces together?
  - What do you want to run?
  - How fast do you have to run it?
  - How much can you spend?
  - How much energy can you use?
Moore’s Law: Transistor Counts
Slide 11
Feature Sizes
Slide 12
Costs: Apple
Slide 13
Costs of Consumer Electronics Today
Slide 14
History
Pascal adding machine, 1642
Leibniz adder/multiplier, ~1670
Babbage analytical engine, 1837 (punch cards, memory, printer!)
Hollerith punchcards, 1890 (used for US census data)
Aiken digital computer, 1940s (Harvard)
Von Neumann stored-program computer, 1945
Eckert/Mauchly ENIAC GP computer, 1946
Slide 15
Evolution of Electronic Computers
Vacuum tubes replaced by transistors, late 1950s
  - Smaller, faster, more versatile logic elements
  - Lower power
  - Longer lifetime
Integrated circuits, late 1960s
  - Many transistors fabricated on silicon substrate
  - Wires plated in place
  - Lower price
  - Smaller size
  - Lower failure rate
LSI/VLSI/microprocessors, 1970s
  - 1000s of interconnected transistors etched into silicon
  - Could check 8 switches at once → 8-bit “byte”
Slide 16
History of Supercomputers
IBM 7030 Stretch, 1961
  - 2K sq. ft.
  - Fastest computer in world at time
  - Slower than expected!
  - Cost initially $13M, dropped to $8.5M
  - Instruction pipelining, prefetching/decoding, memory interleaving
CDC 6600, 1964
  - Size ~= 4 filing cabinets
  - Cost $8M ($60M today)
  - 40 MHz, 3 MFLOPS at peak
  - Freon cooled
  - CPU == 10 FUs, multiple PCBs
  - 60-bit words/regs
Slide 17
History of Supercomputers (2)
Cray 1, 1976
  - 64-bit words
  - 80 MHz
  - 136 MFLOPS!
  - Speed-critical parts placed inside
  - 1662 PCBs w/ 144 ICs
  - 80 sold in 10 years
  - $5-8M ($25M now)
Slide 18
History of Supercomputers (3)
Cray XMP, 1982
  - Up to 4 CPUs in 1 chassis
  - Up to 16M 64-bit words (128 MB, all SRAM!)
  - Up to 32 1.2GB disks
  - 105 MHz
  - Up to 800 MFLOPS (200/CPU)
  - Double memory bandwidth wrt Cray 1
Cray 2, 1985
  - Again ICs packed on logic boards
  - Again, horseshoe shape
  - Boards packed tightly, submersed in Fluorinert to cool (see http://archive.computerhistory.org/resources/text/Cray/Cray.Cray2.1985.102646185.pdf)
  - Up to 8 CPUs, 1.9 GFLOPS
  - Mainstream software/Unix System V OS
Slide 19
History of Supercomputers (4)
Intel Paragon, 1989
  - i860-based
  - 32- or 64-bit
  - Up to 4K CPUs
  - 2D MIMD topology
  - Poor memory bandwidth utilization
ASCI Red, 1996
  - First to use off-the-shelf CPUs (Pentium Pros, Xeons)
  - 6K CPUs
  - Broke 1 TFLOP barrier
  - Cost $46M ($67M now)
  - Upgrade had 9298 Xeons for 3.1 TFLOPS
  - Over 1 MW power!
Slide 20
History of Supercomputers (5)
Hitachi SR2201, 1996
  - H-shaped chassis
  - 2048 CPUs
  - 600 GFLOPS peak
Other similar machines (many Japanese)
  - 100s of CPUs
  - 2D or 3D networks (e.g., Cray torus)
  - MIMD
Seymour Cray leaves Cray Research
  - Cray Computer Corp (CCC)
      - Cray 3 first gallium arsenide chips
      - Cray 4 failed → bankruptcy
  - SRC Computers (see http://www.srccomp.com/about/aboutus.asp)
Slide 21
Biggest Machine Today
Sequoia: IBM BlueGene/Q machine at the U.S. Dept. of Energy Lawrence Livermore National Lab
Slide 22
Types of Parallelism
Instruction-Level Parallelism (ILP)
  - Superscalar issue
  - Out-of-order execution
  - Very Long Instruction Word (VLIW)
Thread-Level Parallelism (TLP)
  - Loop-level
  - Multithreading
      - Explicit
      - Speculative
      - Simultaneous/Hyperthreading
Task-Level Parallelism
Program-Level Parallelism
Data-Level Parallelism
Slide 23
Parallelism in Sequential Programs
Programming model: C (sequential)
Architecture: superscalar
  - ILP
  - Communication through registers
  - Synchronization through pipeline interlocks
Slide 24
for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
    d[i] := C*a[i];

Iteration:        0      1      2      …   N-1
Loop 1 writes:    a[1]   a[2]   a[3]   …   a[0]
Loop 2 reads:     a[0]   a[1]   a[2]   …   a[N-1]
Data dependencies: each a[k] read in Loop 2 was written in Loop 1
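The dependency pattern between the two loops can be checked directly. The following is a minimal Python sketch of the slide's sequential loops; the function names `two_loops` and `producer_iteration` are illustrative and not part of the course material.

```python
def two_loops(b, c, C):
    """Run the two sequential loops from the slide and return (a, d)."""
    n = len(b)
    a = [0] * n
    d = [0] * n
    # Loop 1: iteration i writes a[(i+1) mod N].
    for i in range(n):
        a[(i + 1) % n] = b[i] + c[i]
    # Loop 2: iteration i reads a[i].
    for i in range(n):
        d[i] = C * a[i]
    return a, d


def producer_iteration(i, n):
    """Loop 1 iteration that writes the a[i] read by Loop 2 iteration i."""
    return (i - 1) % n


if __name__ == "__main__":
    a, d = two_loops([1, 2, 3, 4], [10, 20, 30, 40], 2)
    print(a, d)
    print([producer_iteration(i, 4) for i in range(4)])
```

Every Loop 2 iteration depends on exactly one Loop 1 iteration, so iterations within each loop are independent of each other; only the boundary between the two loops needs ordering.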
Parallel Programming Models
Extend semantics to express
  - Units of parallelism
      - Instructions
      - Threads
      - Programs
  - Communication and coordination between units via
      - Registers
      - Memory
      - I/O
Slide 25
Model vs. Architecture
  - Communication abstraction supports model
  - Communication architecture (ISA + comm/sync) implements part of model
  - Hw/sw boundary defines which parts of comm arch implemented in which
Slide 26
[Figure: layers of abstraction, top to bottom]
Parallel applications: CAD, databases, scientific modeling
Programming models: multiprogramming, shared address, message passing, data parallel
Communication abstraction (user/system boundary)
Compiler or library; operating system support
(hardware/software boundary)
Communication hardware
Physical communication medium
Shared Address Space Model
TLP
Communication/coordination among threads via shared global address space
Slide 27
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        d[j] := C*a[j];

[Figure: processors P, P, P sharing one Memory; communication abstraction supported by HW/SW interface]
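A minimal Python sketch of the shared-address-space version follows. It assumes threads stand in for the P processors and that `threading.Barrier` plays the role of `barrier`. The name `shared_memory_run` and the block-bound computation are illustrative assumptions, and `in_` replaces `in` because `in` is a Python keyword. CPython threads do not run this loop truly in parallel, so the sketch shows the coordination pattern only.

```python
import threading


def shared_memory_run(b, c, C, P):
    """Compute d with P threads that share the arrays a and d."""
    n = len(b)
    a = [0] * n  # shared: written in phase 1, read in phase 2
    d = [0] * n  # shared: each thread writes only its own block
    barrier = threading.Barrier(P)

    # Inclusive block bounds per thread (assumed block partitioning).
    chunk = (n + P - 1) // P
    i0 = [min(i * chunk, n) for i in range(P)]
    in_ = [min((i + 1) * chunk, n) - 1 for i in range(P)]

    def worker(i):
        # Phase 1: write a[(j+1) mod N]; the last write of a block lands
        # in the neighbouring thread's block.
        for j in range(i0[i], in_[i] + 1):
            a[(j + 1) % n] = b[j] + c[j]
        # No thread may read a[] until every thread has finished writing.
        barrier.wait()
        # Phase 2: read a[j], including the value a neighbour produced.
        for j in range(i0[i], in_[i] + 1):
            d[j] = C * a[j]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return d


if __name__ == "__main__":
    print(shared_memory_run([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 2, 3))
```

Communication is implicit here: a thread simply reads `a[j]`, and the barrier is the only explicit coordination.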
Message Passing Model
Process-level parallelism (separate addr spaces)
Communication/coordination via messages
Slide 28
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        index = (j+1) mod N;
        a[index] := b[j] + c[j];
        if j = in[i] then send(a[index], (j+1) mod P, a[j]);
    end_for
barrier;
for_all i = 0 to P-1
    for j = i0[i] to in[i]
        if j = i0[i] then recv(tmp, (P+j-1) mod P, a[j]);
        d[j] := C * tmp;
    end_for
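The message-passing pattern can be sketched in Python under some stated assumptions. Threads with private local lists simulate separate address spaces, and one `queue.Queue` per process acts as its mailbox. The sketch assumes the boundary value goes to the neighbouring process `(i+1) mod P` and that each process computes `d` from its own local copy of `a`. That is one reading of the pseudocode's intent, not its literal `send`/`recv` arguments. The name `message_passing_run` is illustrative.

```python
import queue
import threading


def message_passing_run(b, c, C, P):
    """Compute d with P simulated processes that exchange one value each."""
    n = len(b)
    if not 1 <= P <= n:
        raise ValueError("need 1 <= P <= len(b)")

    # Balanced half-open blocks [lo, hi); none is empty because P <= n.
    bounds, start = [], 0
    for i in range(P):
        size = n // P + (1 if i < n % P else 0)
        bounds.append((start, start + size))
        start += size

    mailboxes = [queue.Queue() for _ in range(P)]  # one inbox per process
    results = [None] * P

    def process(i):
        lo, hi = bounds[i]
        local_b, local_c = b[lo:hi], c[lo:hi]  # private copies
        local_a = [None] * (hi - lo)
        for k in range(hi - lo):
            value = local_b[k] + local_c[k]  # global a[(lo+k+1) mod N]
            if k + 1 < hi - lo:
                local_a[k + 1] = value  # stays inside this block
            else:
                mailboxes[(i + 1) % P].put(value)  # send to next process
        # Blocking receive of a[lo] from process (i-1) mod P.
        local_a[0] = mailboxes[i].get()
        results[i] = [C * x for x in local_a]

    threads = [threading.Thread(target=process, args=(i,)) for i in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [x for block in results for x in block]


if __name__ == "__main__":
    print(message_passing_run([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 2, 3))
```

In this sketch the blocking receive provides the needed ordering by itself, so no separate barrier appears; all data exchange is explicit in the `put`/`get` pair.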
Data Parallelism (SIMD)
Programming model
  - Operations done in parallel on multiple data elements
  - Single thread of control
Architectural model
  - Array of simple, cheap processors w/ little memory
  - Attached to control proc that issues instructions
  - Specialized + general comm, cheap sync
Slide 29
[Figure: control processor attached to a 3x3 grid of processing elements (PEs)]

parallel (i:0->N-1)
    a[(i+1) mod N] := b[i] + c[i];
parallel (i:0->N-1)
    d[i] := C * a[i];
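The data-parallel formulation can be sketched in plain Python as whole-array operations issued by a single thread of control. Python lists execute these element by element, so this illustrates the programming model only, not SIMD hardware. The names `rotate_right` and `data_parallel_run` are illustrative.

```python
def rotate_right(xs):
    """Move element i to position (i+1) mod N."""
    return xs[-1:] + xs[:-1]


def data_parallel_run(b, c, C):
    """Express both 'parallel' statements as whole-array operations."""
    # parallel (i:0->N-1) a[(i+1) mod N] := b[i] + c[i]
    summed = [x + y for x, y in zip(b, c)]  # one elementwise add
    a = rotate_right(summed)                # the (i+1) mod N index shift
    # parallel (i:0->N-1) d[i] := C * a[i]
    return [C * x for x in a]               # one elementwise scale


if __name__ == "__main__":
    print(data_parallel_run([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 2))
```

The index shift becomes a single communication step (a rotation) between two purely elementwise steps, which is the kind of operation the specialized communication on a SIMD array is meant to make cheap.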
Coarser-Grain Data Parallelism
Single-Program Multiple-Data
More broadly applicable than SIMD
Slide 30
Creating a Parallel Program
Identify work that can be done in parallel
  - Computation
  - Data access
  - I/O
Partition work/data among entities
  - Processes
  - Threads
Manage data access, comm, sync
Speedup(P) = Performance(P)/Performance(1)
= Time(1)/Time(P)
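As a small worked example of the speedup formula, the sketch below computes Speedup(P) from measured times. The `efficiency` helper (speedup divided by P) is a conventional companion metric added here for illustration; it does not appear on the slide, and the timings are made-up inputs.

```python
def speedup(time_1, time_p):
    """Speedup(P) = Time(1) / Time(P)."""
    if time_1 <= 0 or time_p <= 0:
        raise ValueError("times must be positive")
    return time_1 / time_p


def efficiency(time_1, time_p, p):
    """Fraction of ideal linear speedup achieved on p processors."""
    if p <= 0:
        raise ValueError("p must be positive")
    return speedup(time_1, time_p) / p


if __name__ == "__main__":
    # Hypothetical measurements: 120 s on 1 processor, 20 s on 8.
    print(speedup(120.0, 20.0))
    print(efficiency(120.0, 20.0, 8))
```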
Slide 31
Steps
Decomposition
Assignment
Orchestration
Mapping
Can be done by
  - Programmer
  - Compiler
  - Runtime
  - Hardware (speculatively)
Slide 32
Architecture independent
Parallelization
Architecture dependent
Slide 33
[Figure: Sequential computation → Tasks → Processes → Parallel program → Processors (P0, P1, P2, P3)]
Concepts
Task
  - Arbitrary piece of work from computation
  - Sequentially executed
  - Could be fine- or coarse-grained
Process (or thread)
  - What gets executed by a core
  - Abstract entity that performs tasks assigned to it
  - Processes comm & sync to perform tasks
Processor (core)
  - Physical engine on which processes run
  - Virtualized machine view for programmer
Slide 34
Decomposition
Purpose: Break up computation into tasks to be divided among processes
  - Tasks may become available dynamically
  - Number of available tasks may vary with time
  - i.e., identify concurrency and decide level at which to exploit it
Goal: keep processes busy, but keep management reasonable
  - Number of tasks creates upper bound on speedup
  - Too many tasks requires too much coordination
Slide 35
Assignment
Specify mechanism to divide work among processes
  - Strive for balance
  - Reduce communication, management
Structured approach recommended
  - Inspect code
  - Apply well known heuristics
Programmer focuses on decomp/assign 1st
  - Largely independent of architecture/programming model
  - Choice of primitives (cost/complexity) affects decisions
Architects assume program(mer) does decent job
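One simple, well-known assignment heuristic is static block partitioning, which is one way to produce the i0[i] and in[i] bounds used in the earlier pseudocode. The sketch below is an assumed implementation, not the course's; `block_assignment` and `imbalance` are illustrative names, and `in_` stands in for `in` because `in` is a Python keyword.

```python
def block_assignment(n, p):
    """Split iterations 0..n-1 into p contiguous blocks of near-equal size.

    Returns (i0, in_) with inclusive bounds; an empty block has in_ < i0.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    base, extra = divmod(n, p)
    i0, in_ = [], []
    start = 0
    for i in range(p):
        size = base + (1 if i < extra else 0)  # first `extra` get one more
        i0.append(start)
        in_.append(start + size - 1)
        start += size
    return i0, in_


def imbalance(i0, in_):
    """Difference between the largest and smallest block sizes."""
    sizes = [hi - lo + 1 for lo, hi in zip(i0, in_)]
    return max(sizes) - min(sizes)


if __name__ == "__main__":
    i0, in_ = block_assignment(10, 4)
    print(i0, in_, imbalance(i0, in_))
```

Contiguous blocks keep the imbalance at most one iteration and leave only one boundary value per process to communicate, which addresses both bullets under "Specify mechanism" when every iteration costs about the same.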
Slide 36
Orchestration
Purpose
  - Name data, structure comm/sync
  - Organize data structures, schedule tasks (temporally)
Goals
  - Reduce costs of comm/sync from processor POV
  - Improve data locality
  - Reduce overhead of managing parallelism
Choices depend heavily on comm abstraction, efficiency of primitives
Architects must provide appropriate, efficient primitives
Slide 37
Mapping
Two aspects
  - Which processes to run on same processor
  - Which process runs on which processor
One extreme: space-sharing
  - Partition machine s.t. only 1 app at a time in a subset
  - Pin processes to cores (or let OS balance workloads)
Another extreme
  - Control complete resource management in OS
  - Use performance techniques for dynamic balancing
Real world is between the two
  - User specifies desires in some aspects
  - System may ignore
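Pinning a process to a core can be sketched with Python's `os.sched_setaffinity`. That call exists only on some platforms (notably Linux), so the sketch returns `None` elsewhere. The helper name `pin_to_one_core` is illustrative, and choosing the lowest-numbered allowed core is an arbitrary assumption.

```python
import os


def pin_to_one_core():
    """Pin the calling process to one allowed core; return its id or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform has no affinity API; the OS decides placement
    allowed = sorted(os.sched_getaffinity(0))  # cores this process may use
    core = allowed[0]
    os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    return core


if __name__ == "__main__":
    print(pin_to_one_core())
```

This is the "user specifies desires" end of the spectrum; without such a call, placement is left to the OS scheduler.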
Slide 38
High-Level Goals
High performance
Low resource usage
Low development effort
Low power consumption
Implications for algorithm designers and architects
  - Algorithm designers: high-performance, low resource needs
  - Architects: high-performance, low cost, reduced programming effort
Slide 39
Slide 40