CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences University of California, Berkeley



Page 1

CSCE 930 Advanced Computer

Architecture

Introductions

Adopted from Professor David Patterson & David Culler

Electrical Engineering and Computer Sciences
University of California, Berkeley

Page 2: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Outline

• Computer Science at a Crossroads: Parallelism

– Architecture: multi-core and many-cores

– Program: multi-threading

• Parallel Architecture

– What is Parallel Architecture?

– Why Parallel Architecture?

– Evolution and Convergence of Parallel Architectures

– Fundamental Design Issues

• Parallel Programs

– Why bother with programs?

– Important for whom?

• Memory & Storage Subsystem Architectures

04/18/23, slide 2: CSCE930-Advanced Computer Architecture, Introduction

Page 3

• Old Conventional Wisdom: Power is free, Transistors expensive

• New Conventional Wisdom: “Power wall” Power expensive, Xtors free (Can put more on chip than can afford to turn on)

• Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, …)

• New CW: “ILP wall” law of diminishing returns on more HW for ILP

• Old CW: Multiplies are slow, Memory access is fast

• New CW: “Memory wall” Memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)

• Old CW: Uniprocessor performance 2X / 1.5 yrs

• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall

– Uniprocessor performance now 2X / 5(?) yrs

Sea change in chip design: multiple “cores” (2X processors per chip / ~ 2 years)

» Many simpler processors are more power-efficient

Crossroads: Conventional Wisdom in Comp. Arch

Page 4

[Figure: uniprocessor performance relative to the VAX-11/780, log scale 1-10,000, 1978-2006; growth at 25%/year to 1986, 52%/year to 2002, ??%/year since]

Crossroads: Uniprocessor Performance

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006

Page 5

Sea Change in Chip Design

• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip

• Processor is the new transistor?

• RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip

• 125 mm2 chip, 0.065 micron CMOS = 2312 RISC II+FPU+Icache+Dcache

– RISC II shrinks to ~ 0.02 mm2 at 65 nm

– Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?

– Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)

(Ivan Sutherland @ Sun / Berkeley)
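The "~0.02 mm2" shrink quoted above follows from square-law area scaling. A minimal sketch with the slide's own numbers, assuming an ideal linear shrink:

```python
# Ideal (square-law) area scaling: a design's die area shrinks with the
# square of the feature-size ratio. Slide numbers: RISC II was 60 mm^2
# in a 3-micron process; estimate its footprint at 65 nm (0.065 micron).
def scaled_area(area_mm2, old_feature_um, new_feature_um):
    """Die area after an ideal linear shrink between feature sizes."""
    return area_mm2 * (new_feature_um / old_feature_um) ** 2

risc2_at_65nm = scaled_area(60, 3.0, 0.065)
print(f"RISC II at 65 nm: {risc2_at_65nm:.3f} mm^2")  # ~0.028 mm^2
```

Real shrinks deviate from the ideal (wires and pads do not scale perfectly), so this is an estimate, consistent with the slide's "~0.02 mm2".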

Page 6

Déjà vu all over again?

• Multiprocessors imminent in 1970s, ’80s, ’90s, …

• “… today’s processors … are nearing an impasse as technologies approach the speed of light.”
David Mitchell, The Transputer: The Time Is Now (1989)

• Transputer was premature ⇒ custom multiprocessors strove to lead uniprocessors ⇒ procrastination rewarded: 2X seq. perf. / 1.5 years

• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”

Paul Otellini, President, Intel (2004)

• Difference is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs)

⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
⇒ Biggest programming challenge: going from 1 to 2 CPUs

Page 7

Problems with Sea Change

• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip,

• Architectures not ready for 1000 CPUs / chip

• Unlike Instruction Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but also cannot be solved without their participation

• This course explores ILP (Instruction Level Parallelism) and its shift to Thread Level Parallelism / Data Level Parallelism

Page 8

Outline

• Computer Science at a Crossroads: Parallelism

– Architecture: multi-core and many-cores

– Program: multi-threading

• Parallel Architecture

– What is Parallel Architecture?

– Why Parallel Architecture?

– Evolution and Convergence of Parallel Architectures

– Fundamental Design Issues

• Parallel Programs

– Why bother with programs?

– Important for whom?

• Memory & Storage Subsystem Architectures

Page 9

What is Parallel Architecture?

• A parallel computer is a collection of processing elements that cooperate to solve large problems fast

• Some broad issues:

– Resource Allocation:

» how large a collection?

» how powerful are the elements?

» how much memory?

– Data access, Communication and Synchronization

» how do the elements cooperate and communicate?

» how are data transmitted between processors?

» what are the abstractions and primitives for cooperation?

– Performance and Scalability

» how does it all translate into performance?

» how does it scale?

Page 10

Why Study Parallel Architecture?

Role of a computer architect:

To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost.

Parallelism:

• Provides alternative to faster clock for performance

• Applies at all levels of system design

• Is a fascinating perspective from which to view architecture

• Is increasingly central in information processing

Page 11

Why Study it Today?

• History: diverse and innovative organizational structures, often tied to novel programming models

• Rapidly maturing under strong technological constraints

– The “killer micro” is ubiquitous

– Laptops and supercomputers are fundamentally similar!

– Technological trends cause diverse approaches to converge

• Technological trends make parallel computing inevitable

– In the mainstream with the reality of multi-cores and many-cores

• Need to understand fundamental principles and design tradeoffs, not just taxonomies

– Naming, Ordering, Replication, Communication performance

Page 12

Inevitability of Parallel Computing

• Application demands: Our insatiable need for computing cycles

– Scientific computing: VR simulations in Biology, Chemistry, Physics, ...

– General-purpose computing: Video, Graphics, CAD, Databases, AR, VI, TP...

• Technology Trends

– Number of cores per chip growing rapidly (new Moore’s Law)
– Clock rates expected to go up only slowly (tech. wall)

• Architecture Trends

– Instruction-level parallelism valuable but limited
– Coarser-level parallelism, i.e., thread-level parallelism, the most viable approach

• Economics

• Current trends:

– Today’s microprocessors are multiprocessors

Page 13

Application Trends

• Demand for cycles fuels advances in hardware, and vice-versa

– This demand cycle drives exponential increase in microprocessor performance

– Drives parallel architecture harder: most demanding applications

• Range of performance demands

– Need range of system performance with progressively increasing cost

– Platform pyramid

• Goal of applications in using parallel machines: Speedup

• Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time

• Speedup, fixed problem (p processors) = Time (1 processor) / Time (p processors)
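The fixed-problem definition above maps directly to code. A small sketch; the timings are hypothetical, chosen only to illustrate the formula:

```python
def speedup(time_1proc, time_pproc):
    """Fixed-problem speedup: Time(1 processor) / Time(p processors)."""
    return time_1proc / time_pproc

# Hypothetical run: 100 s on 1 processor, 8 s on p = 16 processors.
s = speedup(100.0, 8.0)           # 12.5x
efficiency = s / 16               # parallel efficiency = speedup / p
print(f"speedup {s:.1f}x, efficiency {efficiency:.0%}")  # 12.5x, 78%
```

Efficiency below 100% is the common case: communication and synchronization keep speedup under p.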

Page 14

Scientific Computing Demand

Page 15

Engineering Computing Demand

• Large parallel machines a mainstay in many industries

– Petroleum (reservoir analysis)

– Automotive (crash simulation, drag analysis, combustion efficiency),

– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism),

– Computer-aided design

– Pharmaceuticals (molecular modeling)

– Visualization

» In all of the above

» Entertainment (3D films like Avatar & 3D games)

» Architecture (walk-throughs and rendering)

» Virtual Reality/Immersion (museums, teleporting, etc)

– Financial modeling (yield and derivative analysis)

– Etc.

Page 16

Learning Curve for Parallel Applications

• AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on a Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D

Page 17

Commercial Computing

• Also relies on parallelism for the high end

– Scale not so large, but use much more widespread

– Computational power determines scale of business that can be handled

• Databases, online-transaction processing, decision support, data mining, data warehousing ...

• TPC benchmarks (TPC-C order entry, TPC-D decision support)

– Explicit scaling criteria provided

– Size of enterprise scales with size of system

– Problem size no longer fixed as p increases, so throughput is used as a performance measure (transactions per minute or tpm)

Page 18

Similar Story for Storage

• Divergence between memory capacity and speed more pronounced

– Capacity increased by 1000x from 1980-95, speed only 2x

– Gigabit DRAM by c. 2000, but gap with processor speed much greater

• Larger memories are slower, while processors get faster

– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?

• Parallelism increases effective size of each level of hierarchy, without increasing access time

• Parallelism and locality within memory systems too

– New designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
– Buffer caches hold the most recently accessed data

• Disks too: Parallel disks plus caching

Page 19

[Figure: storage demands: medical imaging >100 TB; digital body, 1 TB/body; high-performance computing >1 PB; NASA >100 TB; H.B Glass, 5 GB/day]

Real-world applications demand high-performing and reliable storage

Page 20

[Figure: GIS >1 PB; oil prospecting >1 PB; ocean resource data >1 PB; Google, Yahoo, … >1 PB/year]

1 PB = 1000 TB = 10^15 bytes, equal to the capacity of 10,000 100-GB disks.

Page 21

Technology Trends: Moore’s Law: 2X transistors / “year” ⇒ 2X cores / “year”

• “Cramming More Components onto Integrated Circuits”– Gordon Moore, Electronics, 1965

• # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
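The doubling law above compounds quickly. A minimal sketch; the 18-month doubling period is one assumed value within the stated 12-24-month range:

```python
def transistors(initial_count, months, doubling_period_months):
    """Transistor count after `months`, doubling every doubling period."""
    return initial_count * 2 ** (months / doubling_period_months)

# At an assumed 18-month doubling period, counts grow roughly 100x per decade:
per_decade = transistors(1, 120, 18)
print(f"growth over 10 years: {per_decade:.0f}x")  # ~102x
```

Varying N between 12 and 24 months swings the per-decade growth between roughly 1000x and 32x, which is why the doubling period matters so much in long-range projections.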

Page 22

Tracking Technology Performance Trends

• Drill down into 4 technologies: disks, memory, network, processors

• Compare ~1980 Archaic (Nostalgic) vs. ~2000 Modern (Newfangled)

– Performance Milestones in each technology

• Compare for Bandwidth vs. Latency improvements in performance over time

• Bandwidth: number of events per unit time– E.g., M bits / second over network, M bytes / second from disk

• Latency: elapsed time for a single event– E.g., one-way network delay in microseconds,

average disk access time in milliseconds

Page 23

Disks: Archaic(Nostalgic) v. Modern(Newfangled)

• Seagate 373453, 2003

• 15000 RPM (4X)

• 73.4 GBytes (2500X)

• Tracks/Inch: 64000 (80X)

• Bits/Inch: 533,000 (60X)

• Four 2.5” platters (in 3.5” form factor)

• Bandwidth: 86 MBytes/sec (140X)

• Latency: 5.7 ms (8X)

• Cache: 8 MBytes

• CDC Wren I, 1983

• 3600 RPM

• 0.03 GBytes capacity

• Tracks/Inch: 800

• Bits/Inch: 9550

• Three 5.25” platters

• Bandwidth: 0.6 MBytes/sec

• Latency: 48.3 ms

• Cache: none
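The rounded improvement factors in parentheses above can be sanity-checked from the raw numbers on this slide:

```python
# Raw numbers: CDC Wren I (1983) vs. Seagate 373453 (2003).
capacity_x  = 73.4 / 0.03        # ~2447  (slide rounds to 2500X)
tpi_x       = 64000 / 800        # 80     (80X)
bpi_x       = 533000 / 9550      # ~56    (60X)
bandwidth_x = 86 / 0.6           # ~143   (140X)
latency_x   = 48.3 / 5.7         # ~8.5   (8X)
print(capacity_x, tpi_x, bpi_x, bandwidth_x, latency_x)
```

Note the asymmetry already visible here: bandwidth improved ~140x over the two decades while latency improved only ~8x, the pattern the following slides generalize.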

Page 24

Latency Lags Bandwidth (for last ~20 years)

• Performance Milestones

• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)

[Figure: relative bandwidth improvement vs. relative latency improvement, log-log scale (BW 1-10,000, latency 1-100); the disk point lies far above the “latency improvement = bandwidth improvement” line]

Page 25

Memory: Archaic (Nostalgic) v. Modern (Newfangled)

• 1980 DRAM (asynchronous)

• 0.06 Mbits/chip

• 64,000 xtors, 35 mm2

• 16-bit data bus per module, 16 pins/chip

• 13 Mbytes/sec

• Latency: 225 ns

• (no block transfer)

• 2000 Double Data Rate Synchr. (clocked) DRAM

• 256.00 Mbits/chip (4000X)

• 256,000,000 xtors, 204 mm2

• 64-bit data bus per DIMM, 66 pins/chip (4X)

• 1600 Mbytes/sec (120X)

• Latency: 52 ns (4X)

• Block transfers (page mode)

Page 26

Latency Lags Bandwidth (last ~20 years)

• Performance Milestones

• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)

• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)

[Figure: relative bandwidth improvement vs. relative latency improvement, log-log scale; memory and disk points both lie above the “latency improvement = bandwidth improvement” line]

Page 27

LANs: Archaic (Nostalgic)v. Modern (Newfangled)

• Ethernet 802.3

• Year of Standard: 1978

• 10 Mbits/s link speed

• Latency: 3000 µsec

• Shared media

• Coaxial cable

• Ethernet 802.3ae

• Year of Standard: 2003

• 10,000 Mbits/s link speed (1000X)

• Latency: 190 µsec (15X)

• Switched media

• Category 5 copper wire

Coaxial cable: copper core, insulator, braided outer conductor, plastic covering.

Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect; “Cat 5” is 4 twisted pairs in a bundle.

Page 28

Latency Lags Bandwidth (last ~20 years)

• Performance Milestones

• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)

• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)

• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)

[Figure: relative bandwidth improvement vs. relative latency improvement, log-log scale; memory, network, and disk points all lie above the “latency improvement = bandwidth improvement” line]

Page 29

CPUs: Archaic (Nostalgic) v. Modern (Newfangled)

• 1982 Intel 80286

• 12.5 MHz

• 2 MIPS (peak)

• Latency 320 ns

• 134,000 xtors, 47 mm2

• 16-bit data bus, 68 pins

• Microcode interpreter, separate FPU chip

• (no caches)

• 2001 Intel Pentium 4

• 1500 MHz (120X)

• 4500 MIPS (peak) (2250X)

• Latency 15 ns (20X)

• 42,000,000 xtors, 217 mm2

• 64-bit data bus, 423 pins

• 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution

• On-chip 8KB Data caches, 96KB Instr. Trace cache, 256KB L2 cache

Page 30

Latency Lags Bandwidth (last ~20 years)

• Performance Milestones

• Processor: ’286, ’386, ’486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)

• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)

• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)

• Disk : 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

[Figure: relative bandwidth improvement vs. relative latency improvement, log-log scale; processor, network, disk, and memory points all lie above the “latency improvement = bandwidth improvement” line, with CPU high and memory low (“Memory Wall”)]

Page 31

Rule of Thumb for Latency Lagging BW

• In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4

(and capacity improves faster than bandwidth)

• Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency
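The squared-latency statement can be checked against the milestone factors quoted in the charts above (latency improvement, bandwidth improvement over ~20 years):

```python
# (latency improvement, bandwidth improvement), per the milestone charts.
milestones = {
    "processor": (21, 2250),
    "ethernet":  (16, 1000),
    "memory":    (4, 120),
    "disk":      (8, 143),
}
for name, (lat_x, bw_x) in milestones.items():
    # Rule of thumb: bandwidth gain exceeds the square of the latency gain.
    assert bw_x > lat_x ** 2
    print(f"{name}: BW {bw_x}x vs latency^2 {lat_x ** 2}x")
```

All four technologies satisfy the rule, with disk the tightest case (143x bandwidth vs. 64x for latency squared).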

Page 32

6 Reasons Latency Lags Bandwidth

1. Moore’s Law helps BW more than latency

• Faster transistors, more transistors, more pins help bandwidth

» MPU Transistors: 0.130 vs. 42 M xtors (300X)

» DRAM Transistors: 0.064 vs. 256 M xtors (4000X)

» MPU Pins: 68 vs. 423 pins (6X)

» DRAM Pins: 16 vs. 66 pins (4X)

• Smaller, faster transistors but communicate over (relatively) longer lines: limits latency

» Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)

» MPU Die Size: 47 vs. 217 mm2 (ratio sqrt 2X)

» DRAM Die Size: 35 vs. 204 mm2 (ratio sqrt 2X)

Page 33

6 Reasons Latency Lags Bandwidth (cont’d)

2. Distance limits latency

• Size of DRAM block ⇒ long bit and word lines ⇒ most of DRAM access time

• Speed of light limits communication between computers on a network

• 1. & 2. explain linear latency vs. square BW?

3. Bandwidth easier to sell (“bigger = better”)

• E.g., 10 Gbits/s Ethernet (“10 Gig”) vs. 10 µsec latency Ethernet

• 4400 MB/s DIMM (“PC4400”) vs. 50 ns latency

• Even if just marketing, customers now trained

• Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance

Page 34

4. Latency helps BW, but not vice versa

• Spinning the disk faster improves both bandwidth and rotational latency

» 3600 RPM ⇒ 15000 RPM = 4.2X

» Average rotational latency: 8.3 ms ⇒ 2.0 ms

» Other things being equal, also helps BW by 4.2X

• Lower DRAM latency ⇒ more accesses/second (higher bandwidth)

• Higher linear density helps disk BW (and capacity), but not disk latency

» 9,550 BPI ⇒ 533,000 BPI ⇒ 60X in BW
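The rotational-latency numbers above follow from first principles: on average the head waits half a revolution. A quick sketch:

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: half a revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm

old, new = avg_rotational_latency_ms(3600), avg_rotational_latency_ms(15000)
print(f"{old:.1f} ms -> {new:.1f} ms ({old / new:.1f}x)")  # 8.3 ms -> 2.0 ms (4.2x)
```

This is why spindle speed improves latency and bandwidth by the same factor, while linear density improves only bandwidth.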

6 Reasons Latency Lags Bandwidth (cont’d)

Page 35

5. Bandwidth hurts latency

• Queues help bandwidth, hurt latency (queuing theory)

• Adding chips to widen a memory module increases Bandwidth but higher fan-out on address lines may increase Latency

6. Operating System overhead hurts Latency more than Bandwidth

• Long messages amortize overhead; overhead bigger part of short messages

6 Reasons Latency Lags Bandwidth (cont’d)

Page 36

Summary of Technology Trends

• For disk, LAN, memory, and microprocessor, bandwidth improves by square of latency improvement

– In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X

• Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components

– Multiple processors in a cluster or even in a chip

– Multiple disks in a disk array

– Multiple memory modules in a large memory

– Simultaneous communication in switched LAN

• HW and SW developers should innovate assuming Latency Lags Bandwidth

– If everything improves at the same rate, then nothing really changes

– When rates vary, real innovation is required

Page 37

Architectural Trends

• Architecture translates technology’s gifts to performance and capability

• Resolves the tradeoff between parallelism and locality

– Current microprocessor: 1/4 compute, 1/2 cache, 1/4 off-chip connect

– Tradeoffs may change with scale and technology advances

• Four generations of architectural history: tube, transistor, IC, VLSI

– Here focus only on VLSI generation

• Greatest delineation in VLSI has been in type of parallelism exploited

Page 38

Architectural Trends

• Greatest trend in VLSI generation is increase in parallelism

– Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit

» slows after 32 bit

» adoption of 64-bit now under way, 128-bit far (not performance issue)

» great inflection point when 32-bit micro and cache fit on a chip

– Mid 80s to mid 90s: instruction level parallelism

» pipelining and simple instruction sets, + compiler advances (RISC)

» on-chip caches and functional units => superscalar execution

» greater sophistication: out-of-order execution, speculation, prediction

• to deal with control transfer and latency problems

– Next step: thread level parallelism

Page 39

Architectural Trends: ILP

• Reported speedups for superscalar processors

• Horst, Harris, and Jardine [1990]: 1.37

• Wang and Wu [1988]: 1.70

• Smith, Johnson, and Horowitz [1989]: 2.30

• Murakami et al. [1989]: 2.55

• Chang et al. [1991]: 2.90

• Jouppi and Wall [1989]: 3.20

• Lee, Kwok, and Briggs [1991]: 3.50

• Wall [1991]: 5

• Melvin and Patt [1991]: 8

• Butler et al. [1991]: 17+

• Large variance due to differences in

– application domain investigated (numerical versus non-numerical)
– capabilities of the processor modeled

Page 40

ILP Ideal Potential

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming

– real caches and non-zero miss latencies

[Figure: left chart: fraction of total cycles (%) vs. number of instructions issued (0 to 6+); right chart: speedup (0-3) vs. instructions issued per cycle (0-15)]

Page 41

Results of ILP Studies

[Figure: speedups of 1x-4x for Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, Melvin_91; bars compare 1 branch unit/real prediction vs. perfect branch prediction]

• Concentrate on parallelism for 4-issue machines

• Realistic studies show only ~2-fold speedup

• Recent studies show that finding more ILP requires looking across threads

Page 42

Architectural Trends: Bus-based MPs

• Micro on a chip makes it natural to connect many to shared memory

– dominates server and enterprise market, moving down to desktop

• Faster processors began to saturate the bus, then bus technology advanced

– today, range of sizes for bus-based systems, desktop to large servers

[Figure: number of processors (0-70) in bus-based shared-memory machines vs. year, 1984-1998; machines include Sequent B8000/B2100, Symmetry 21/81, SS690MP 120/140, SGI PowerSeries, Challenge, and PowerChallenge/XL, CRAY CS6400, AS8400, HP K400, Sun SC2000, E6000, and E10000]

Page 43

Economics

• Commodity microprocessors not only fast but CHEAP

• Development cost is tens of millions of dollars ($5-100M typical)

• BUT, many more are sold compared to supercomputers

– Crucial to take advantage of the investment, and use the commodity building block

– Exotic parallel architectures are no more than special-purpose machines

• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors

• Standardization by Intel makes small, bus-based SMPs commodity

• Desktop: few smaller processors versus one larger one?

– Multiprocessor on a chip

Page 44

Consider Scientific Supercomputing

• Proving ground and driver for innovative architecture and techniques

– Market smaller relative to commercial as MPs become mainstream

– Dominated by vector machines starting in 70s

– Microprocessors have made huge gains in floating-point performance

» high clock rates

» pipelined floating point units (e.g., multiply-add every cycle)

» instruction-level parallelism

» effective use of caches (e.g., automatic blocking)

– Plus economics

• Large-scale multiprocessors replace vector supercomputers

– Most top-performing machines on Top-500 list are multiprocessors

Page 45

Summary: Why Parallel Architecture?

• Increasingly attractive

– Economics, technology, architecture, application demand

• Increasingly central and mainstream

• Parallelism exploited at many levels

– Instruction-level parallelism

– Thread-level parallelism (multi-cores and many-cores)

– Application/node-level parallelism ("MPPs", clusters, grids, clouds)

• Focus of this class: thread-level parallelism

• Same story from memory/storage system perspective but with a “twist” (data-intensive computing, etc)

• Wide range of parallel architectures make sense

– Different cost, performance and scalability

Page 46:

Convergence of Parallel Architectures

Page 47:

History

Historically, parallel architectures were tied to programming models.

[Figure: Application Software and System Software layered over divergent architectures — SIMD, Message Passing, Shared Memory, Dataflow, Systolic Arrays]

• Divergent architectures, with no predictable pattern of growth

• Uncertainty of direction paralyzed parallel software development!

Page 48:

Today

• Extension of “computer architecture” to support communication and cooperation

– OLD: Instruction Set Architecture

– NEW: Communication Architecture

• Defines – Critical abstractions, boundaries, and primitives (interfaces)

– Organizational structures that implement interfaces (hw or sw)

• Compilers, libraries and OS are important bridges today

Page 49:

Modern Layered Framework

[Figure: modern layered framework]

Parallel applications: CAD, Database, Scientific modeling
Programming models: Multiprogramming, Shared address, Message passing, Data parallel
Communication abstraction ← user/system boundary
Compilation or library
Operating systems support
Communication hardware ← hardware/software boundary
Physical communication medium

Page 50:

Programming Model

• What programmer uses in coding applications

• Specifies communication and synchronization

• Examples:

– Multiprogramming: no communication or synchronization at program level

– Shared address space: like bulletin board

– Message passing: like letters or phone calls, explicit point to point

– Data parallel: more regimented, global actions on data

» Implemented with shared address space or message passing

Page 51:

Communication Abstraction

• User-level communication primitives provided

– Realizes the programming model

– Mapping exists between language primitives of programming model and these primitives

• Supported directly by hw, or via OS, or via user sw

• Much debate about what to support in sw, and about the gap between layers

• Today:

– Hw/sw interface tends to be flat, i.e. complexity roughly uniform

– Compilers and software play important roles as bridges today

– Technology trends exert strong influence

• Result is convergence in organizational structure

– Relatively simple, general-purpose communication primitives

Page 52:

Communication Architecture

• = User/System Interface + Implementation

• User/System Interface:

– Comm. primitives exposed to user level by hw and system-level sw

• Implementation:

– Organizational structures that implement the primitives: hw or OS

– How optimized are they? How integrated into the processing node?

– Structure of network

• Goals:

– Performance

– Broad applicability

– Programmability

– Scalability

– Low cost

Page 53:

Evolution of Architectural Models

• Historically, machines were tailored to programming models

– Prog. model, comm. abstraction, and machine organization lumped together as the “architecture”

• Evolution helps understand convergence– Identify core concepts

• Shared Address Space

• Message Passing

• Data Parallel

• Others:

– Dataflow

– Systolic Arrays

• Examine programming model, motivation, intended applications, and contributions to convergence

Page 54:

Shared Address Space Architectures

• Any processor can directly reference any memory location

– Communication occurs implicitly as a result of loads and stores

• Convenient:

– Location transparency

– Similar programming model to time-sharing on uniprocessors

» Except processes run on different processors

» Good throughput on multiprogrammed workloads

• Naturally provided on a wide range of platforms

– History dates at least to precursors of mainframes in the early 60s

– Wide range of scale: few to hundreds of processors

• Popularly known as shared memory machines or model

– Ambiguous: memory may be physically distributed among processors

Page 55:

Shared Address Space Model

• Process: virtual address space plus one or more threads of control

• Portions of address spaces of processes are shared

• Writes to shared addresses are visible to other threads (in other processes too)

• Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization

• OS uses shared memory to coordinate processes

[Figure: virtual address spaces for a collection of processes communicating via shared addresses — each process Pi has a private portion and a shared portion of its address space; shared portions map to common physical addresses in the machine's physical address space, so a store by one process (e.g., P1) is seen by a load in another (e.g., P2)]

Page 56:

Communication Hardware

• Also natural extension of uniprocessor

• Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort

Memory capacity increased by adding modules, I/O by controllers

• Add processors for processing!

• For higher-throughput multiprogramming, or parallel programs

[Figure: processors, memory modules (Mem), and I/O controllers attached to a shared interconnect, with I/O devices behind the controllers]

Page 57:

History

[Figure: “mainframe” organization — processors (P), I/O, and memory modules (M) joined by a crossbar — vs. “minicomputer” organization — processors with caches ($), memory, and I/O on a shared bus]

• “Mainframe” approach

– Motivated by multiprogramming

– Extends crossbar used for memory bandwidth and I/O

– Originally, processor cost limited scale; later, crossbar cost did

– Bandwidth scales with p

– High incremental cost; use multistage networks instead

• “Minicomputer” approach

– Almost all microprocessor systems have a bus

– Motivated by multiprogramming, TP

– Used heavily for parallel computing

– Called symmetric multiprocessor (SMP)

– Latency larger than for a uniprocessor

– Bus is bandwidth bottleneck

» Caching is key: coherence problem

– Low incremental cost

Page 58:

Example: Intel Pentium Pro Quad

– All coherence and multiprocessing glue in processor module

– Highly integrated, targeted at high volume

– Low latency and bandwidth

[Figure: four P-Pro modules (CPU, 256-KB L2 $, interrupt controller, bus interface, MIU) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller driving 1-, 2-, or 4-way interleaved DRAM and two PCI bridges to PCI buses with I/O cards]

Page 59:

Example: SUN Enterprise

– 16 cards of either type: processors + memory, or I/O

– All memory accessed over bus, so symmetric

– Higher bandwidth, higher latency bus

[Figure: CPU/mem cards (two processors, each with $ and L2 $, plus memory controller) and I/O cards (SBUS slots, 100bT, SCSI, 2 Fibre Channel) attached via bus interface/switch to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]

Page 60:

Scaling Up

– Problem is interconnect: cost (crossbar) or bandwidth (bus)

– Dance-hall: bandwidth still scalable, but lower cost than crossbar

» latencies to memory uniform, but uniformly large

– Distributed memory or non-uniform memory access (NUMA)

» Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)

– Caching shared (particularly nonlocal) data?

[Figure: “dance hall” organization — processors with caches ($) on one side of the network, memory modules (M) on the other — vs. distributed-memory organization — a memory module attached to each processor node, nodes joined by the network]

Page 61:

Example: Cray T3E

– Scales up to 1024 processors, 480 MB/s links

– Memory controller generates comm. request for nonlocal references

– No hardware mechanism for coherence (SGI Origin etc. provide this)

[Figure: Cray T3E node — processor with $, memory controller and network interface, local memory, and external I/O, attached to a switch with X, Y, Z links]

Page 62:

Message Passing Architectures

• Complete computer as building block, including I/O

– Communication via explicit I/O operations

• Programming model: directly access only private address space (local memory), comm. via explicit messages (send/receive)

• High-level block diagram similar to distributed-memory SAS

– But comm. integrated at I/O level, needn't be into memory system

– Like networks of workstations (clusters), but tighter integration

– Easier to build than scalable SAS

• Programming model more removed from basic hardware operations

– Library or OS intervention

Page 63:

Message-Passing Abstraction

– Send specifies buffer to be transmitted and receiving process

– Recv specifies sending process and application storage to receive into

– Memory-to-memory copy, but need to name processes

– Optional tag on send and matching rule on receive

– User process names local data and entities in process/tag space too

– In simplest form, the send/recv match achieves a pairwise synchronization event

» Other variants too

– Many overheads: copying, buffer management, protection

[Figure: local address spaces of processes P and Q; P's “Send X, Q, t” matches Q's “Receive Y, P, t”, copying the data at address X in P to address Y in Q]

Page 64:

Evolution of Message-Passing Machines

• Early machines: FIFO on each link

– Hw close to programming model; synchronous ops

– Replaced by DMA, enabling non-blocking ops

» Buffered by system at destination until recv

• Diminishing role of topology

– Store-and-forward routing: topology important

– Introduction of pipelined routing made it less so

– Cost is in node-network interface

– Simplifies programming

[Figure: hypercube topology over nodes 000–111]

Page 65:

Example: IBM SP-2

– Made out of essentially complete RS6000 workstations

– Network interface integrated in I/O bus (bw limited by I/O bus)

[Figure: IBM SP-2 node — Power 2 CPU with L2 $ on the memory bus, memory controller with 4-way interleaved DRAM, and a NIC (i860, NI, DMA) on the MicroChannel I/O bus; nodes joined by a general interconnection network formed from 8-port switches]

Page 66:

Example: Intel Paragon

[Figure: Intel Paragon node — two i860 processors with L1 $ and an NI/DMA driver on a 64-bit, 50 MHz memory bus, memory controller with 4-way interleaved DRAM; a processing node attached to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links. Pictured: Sandia's Intel Paragon XP/S-based supercomputer]

Page 67:

Toward Architectural Convergence

• Evolution and role of software have blurred the boundary

– Send/recv supported on SAS machines via buffers

– Can construct global address space on MP using hashing

– Page-based (or finer-grained) shared virtual memory

• Hardware organization converging too

– Tighter NI integration even for MP (low-latency, high-bandwidth)

– At lower level, even hardware SAS passes hardware messages

• Even clusters of workstations/SMPs are parallel systems

– Emergence of fast system area networks (SANs)

• Programming models distinct, but organizations converging

– Nodes connected by general network and communication assists

– Implementations also converging, at least in high-end machines

Page 68:

Data Parallel Systems

• Programming model

– Operations performed in parallel on each element of data structure

– Logically single thread of control, performs sequential or parallel steps

– Conceptually, a processor associated with each data element

• Architectural model

– Array of many simple, cheap processors with little memory each

» Processors don’t sequence through instructions

– Attached to a control processor that issues instructions

– Specialized and general communication, cheap global synchronization

[Figure: grid of PEs attached to a control processor that issues instructions]

Original motivations:

• Matches simple differential equation solvers

• Centralize high cost of instruction fetch/sequencing

Page 69:

Application of Data Parallelism

– Each PE contains an employee record with his/her salary

If salary > 100K then

salary = salary *1.05

else

salary = salary *1.10

– Logically, the whole operation is a single step

– Some processors enabled for arithmetic operation, others disabled

• Other examples:

– Finite differences, linear algebra, ...

– Document searching, graphics, image processing, ...

• Some early machines:

– Thinking Machines CM-1, CM-2 (and CM-5)

– Maspar MP-1 and MP-2

• Today’s on-chip examples:

– IBM’s Cell processor

– GPU, GPGPU
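The salary update above is a one-step data-parallel operation; written out in plain Python (a stand-in for the PE array — not how a CM-2 or a GPU kernel would literally express it), the predicate plays the role of enabling and disabling PEs:

```python
# One employee record per logical PE.
salaries = [80_000, 120_000, 95_000, 150_000]

# Logically a single step: every PE evaluates the predicate; the enabled
# subset applies one multiplier and the disabled subset the other.
salaries = [s * 1.05 if s > 100_000 else s * 1.10 for s in salaries]
print(salaries)
```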

Page 70:

Dataflow Architectures

• Represent computation as a graph of essential dependences

– Logical processor at each node, activated by availability of operands

– Messages (tokens) carrying tag of next instruction sent to next processor

– Tag compared with others in matching store; a match fires execution

[Figure: dataflow graph for a = (b + 1) × (b − c); d = c × e; f = a × d, and a dataflow processor pipeline — token store, waiting/matching, instruction fetch from program store, execute, form token, token queue, all connected by the network]
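The firing rule can be mimicked in a few lines (a toy sketch with assumed input values b = 3, c = 1, e = 2; real machines match token tags in hardware): each node executes as soon as all of its operand tokens are present.

```python
# Toy dataflow firing: a node fires when all its operand tokens have arrived.
nodes = {
    "a": (("b", "c"), lambda v: (v["b"] + 1) * (v["b"] - v["c"])),
    "d": (("c", "e"), lambda v: v["c"] * v["e"]),
    "f": (("a", "d"), lambda v: v["a"] * v["d"]),
}
tokens = {"b": 3.0, "c": 1.0, "e": 2.0}    # initial input tokens

while len(tokens) < len(nodes) + 3:        # until every node has produced a token
    for name, (deps, fire) in nodes.items():
        if name not in tokens and all(d in tokens for d in deps):
            tokens[name] = fire(tokens)    # operands matched: fire the node
print(tokens["f"])  # 16.0, since a = 8 and d = 2
```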

Page 71:

Evolution and Convergence

• Key characteristics

– Ability to name operations, synchronization, dynamic scheduling

• Problems

– Operations have locality across them, useful to group together

– Handling complex data structures like arrays

– Complexity of matching store and memory units

– Expose too much parallelism (?)

• Converged to use conventional processors and memory

– Support for large, dynamic set of threads to map to processors

– Typically shared address space as well

– But separation of progr. model from hardware (like data-parallel)

• Lasting contributions:

– Integration of communication with thread (handler) generation

– Tightly integrated communication and fine-grained synchronization

– Remained a useful concept for software (compilers etc.)

Page 72:

Systolic Architectures

– Replace single processor with array of regular processing elements

– Orchestrate data flow for high throughput with less memory access

• Different from pipelining: nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory

• Different from SIMD: each PE may do something different

• Initial motivation: VLSI enables inexpensive special-purpose chips

• Represent algorithms directly by chips connected in regular pattern

[Figure: conventional processor–memory pair vs. a memory feeding a linear array of PEs]

Page 73:

Systolic Arrays (contd.)

– Practical realizations (e.g. iWARP) use quite general processors

» Enable variety of algorithms on same hardware

– But dedicated interconnect channels

» Data transfer directly from register to register across channel

– Specialized, and same problems as SIMD

» General-purpose systems work well for the same algorithms (locality etc.)

Example: systolic array for 1-D convolution

y(i) = w1 × x(i) + w2 × x(i + 1) + w3 × x(i + 2) + w4 × x(i + 3)

[Figure: x values (x1, x2, ...) stream through a chain of cells holding weights w4, w3, w2, w1, producing y1, y2, y3, ...; each cell computes xout = xin, yout = yin + w × xin]
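A functional (not cycle-accurate) simulation of the array: each pass over j below plays the role of one weight-holding cell applying yout = yin + w × xin as the x stream flows past.

```python
def systolic_conv(x, w):
    """1-D convolution as computed by a chain of len(w) systolic cells."""
    n = len(x) - len(w) + 1
    y = [0.0] * n                      # partial sums streaming through the cells
    for j, wj in enumerate(w):         # one iteration per PE (cell j holds w[j])
        for i in range(n):
            y[i] += wj * x[i + j]      # the cell's yout = yin + w * xin
    return y

print(systolic_conv([1, 2, 3, 4, 5, 6], [1, 0, 2, 0]))  # [7.0, 10.0, 13.0]
```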

Page 74:

Convergence: Generic Parallel Architecture

• A generic modern multiprocessor

• Node: processor(s), memory system, plus communication assist

– Network interface and communication controller

• Scalable network

• Convergence allows lots of innovation, now within a framework

– Integration of assist with node, what operations, how efficiently...

[Figure: nodes — processor (P), cache ($), memory (Mem), and communication assist (CA) — connected by a scalable network]

Page 75:

Fundamental Design Issues

Page 76:

Understanding Parallel Architecture

• Traditional taxonomies not very useful

• Programming models not enough, nor hardware structures

– Same one can be supported by radically different architectures

• Architectural distinctions that affect software

– Compilers, libraries, programs

• Design of user/system and hardware/software interface

– Constrained from above by progr. models and below by technology

• Guiding principles provided by layers

– What primitives are provided at communication abstraction

– How programming models map to these

– How they are mapped to hardware

Page 77:

Fundamental Design Issues

• At any layer, interface (contract) aspect and performance aspects

– Naming: How are logically shared data and/or processes referenced?

– Operations: What operations are provided on these data?

– Ordering: How are accesses to data ordered and coordinated?

– Replication: How are data replicated to reduce communication?

– Communication Cost: Latency, bandwidth, overhead, occupancy

• Understand at programming model first, since that sets requirements

• Other issues

– Node granularity: How to split between processors and memory?

– ...

Page 78:

Sequential Programming Model

• Contract

– Naming: Can name any variable in virtual address space

» Hardware (and perhaps compilers) does translation to physical addresses

– Operations: Loads and Stores

– Ordering: Sequential program order

• Performance

– Rely on dependences on single location (mostly): dependence order

– Compilers and hardware violate other orders without getting caught

– Compiler: reordering and register allocation

– Hardware: out of order, pipeline bypassing, write buffers

– Transparent replication in caches

Page 79:

Shared Address Space (SAS) Programming Model

• Naming: Any process can name any variable in shared space

• Operations: loads and stores, plus those needed for ordering

• Simplest ordering model:

– Within a process/thread: sequential program order

– Across threads: some interleaving (as in time-sharing)

– Additional orders through synchronization

– Again, compilers/hardware can violate orders without getting caught

» Different, more subtle ordering models also possible (discussed later)

Page 80:

Synchronization

• Mutual exclusion (locks)

– Ensure certain operations on certain data can be performed by only one process at a time

– Room that only one person can enter at a time

– No ordering guarantees

• Event synchronization

– Ordering of events to preserve dependences

» e.g. producer —> consumer of data

– 3 main types:

» point-to-point

» global

» group
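Both flavors in miniature (Python threading purely as an illustration): a Lock gives mutual exclusion with no ordering guarantee, while an Event enforces the producer → consumer dependence — point-to-point event synchronization.

```python
import threading

data = None
ready = threading.Event()      # point-to-point event synchronization

def producer():
    global data
    data = 42                  # write *before* signaling: preserves the dependence
    ready.set()

def consumer(out):
    ready.wait()               # blocks until the producer signals
    out.append(data)           # guaranteed to see the produced value

out = []
c = threading.Thread(target=consumer, args=(out,))
p = threading.Thread(target=producer)
c.start()
p.start()
c.join()
p.join()
print(out)  # [42]
```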

Page 81:

Message Passing Programming Model

• Naming: Processes can name private data directly

– No shared address space

• Operations: Explicit communication through send and receive

– Send transfers data from private address space to another process

– Receive copies data from process to private address space

– Must be able to name processes

• Ordering:

– Program order within a process

– Send and receive can provide pt to pt synch between processes

– Mutual exclusion inherent

• Can construct global address space:

– Process number + address within process address space

– But no direct operations on these names

Page 82:

Design Issues Apply at All Layers

• Prog. model’s position provides constraints/goals for system

• In fact, each interface between layers supports or takes a position on:

– Naming model

– Set of operations on names

– Ordering model

– Replication

– Communication performance

• Any set of positions can be mapped to any other by software

• Let’s see issues across layers

– How lower layers can support contracts of programming models

– Performance issues

Page 83:

Naming and Operations

• Naming and operations in programming model can be directly supported by lower levels, or translated by compiler, libraries or OS

• Example: Shared virtual address space in programming model

• Hardware interface supports shared physical address space

– Direct support by hardware through v-to-p mappings, no software layers

• Hardware supports independent physical address spaces

– Can provide SAS through OS, so in system/user interface

» v-to-p mappings only for data that are local

» remote data accesses incur page faults; brought in via page fault handlers

» same programming model, different hardware requirements and cost model

– Or through compilers or runtime, so above sys/user interface

» shared objects, instrumentation of shared accesses, compiler support

Page 84:

Naming and Operations (contd)

• Example: Implementing Message Passing

• Direct support at hardware interface

– But match and buffering benefit from more flexibility

• Support at sys/user interface or above in software (almost always)

– Hardware interface provides basic data transport (well suited)

– Send/receive built in sw for flexibility (protection, buffering)

– Choices at user/system interface:

» OS each time: expensive

» OS sets up once/infrequently, then little sw involvement each time

– Or lower interfaces provide SAS, and send/receive built on top with buffers and loads/stores

• Need to examine the issues and tradeoffs at every layer

– Frequencies and types of operations, costs

Page 85:

Ordering

• Message passing: no assumptions on orders across processes except those imposed by send/receive pairs

• SAS: How processes see the order of other processes’ references defines semantics of SAS

– Ordering very important and subtle

– Uniprocessors play tricks with orders to gain parallelism or locality

– These are more important in multiprocessors

– Need to understand which old tricks are valid, and learn new ones

– How programs behave, what they rely on, and hardware implications

Page 86:

Replication

• Very important for reducing data transfer/communication

• Again, depends on naming model

• Uniprocessor: caches do it automatically

– Reduce communication with memory

• Message passing naming model at an interface

– A receive replicates, giving a new name; subsequently use new name

– Replication is explicit in software above that interface

• SAS naming model at an interface

– A load brings in data transparently, so can replicate transparently

– Hardware caches do this, e.g. in shared physical address space

– OS can do it at page level in shared virtual address space, or objects

– No explicit renaming, many copies for same name: coherence problem

» in uniprocessors, “coherence” of copies is natural in memory hierarchy

Page 87:

Communication Performance

• Performance characteristics determine usage of operations at a layer

– Programmer, compilers etc make choices based on this

• Fundamentally, three characteristics:

– Latency: time taken for an operation

– Bandwidth: rate of performing operations

– Cost: impact on execution time of program

• If the processor does one thing at a time: bandwidth ≈ 1/latency

– But actually more complex in modern systems

• Characteristics apply to overall operations, as well as individual components of a system, however small

• We’ll focus on communication or data transfer across nodes
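The three characteristics are often combined into a first-order transfer-time model (the linear model below is a common textbook assumption, not a formula from the slide): moving n bytes takes roughly the startup latency plus n divided by peak bandwidth, so small messages see only a fraction of peak.

```python
def transfer_time(n_bytes, latency_s, peak_Bps):
    """First-order model: T(n) = L + n / B."""
    return latency_s + n_bytes / peak_Bps

def effective_bandwidth(n_bytes, latency_s, peak_Bps):
    """Delivered rate n / T(n); approaches the peak only for large messages."""
    return n_bytes / transfer_time(n_bytes, latency_s, peak_Bps)

# Assumed link: 1 us startup latency, 1 GB/s peak bandwidth.
print(effective_bandwidth(64, 1e-6, 1e9))         # ~6.0e7 B/s -- far below peak
print(effective_bandwidth(1_000_000, 1e-6, 1e9))  # ~1.0e9 B/s -- near peak
```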

Page 88:

Outline

• Computer Science at a Crossroads: Parallelism

– Architecture: multi-core and many-cores

– Program: multi-threading

• Parallel Architecture

– What is Parallel Architecture?

– Why Parallel Architecture?

– Evolution and Convergence of Parallel Architectures

– Fundamental Design Issues

• Parallel Programs

– Why bother with programs?

– Important for whom?

• Memory & Storage Subsystem Architectures

Page 89:

Why Bother with Programs?

• They’re what runs on the machines we design– Helps make design decisions

– Helps evaluate systems tradeoffs

• Led to the key advances in uniprocessor architecture

– Caches and instruction set design

• More important in multiprocessors– New degrees of freedom

– Greater penalties for mismatch between program and architecture

04/18/23 89CSCE930-Advanced Computer Architecture, Introduction

Page 90: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Important for Whom?

• Algorithm designers– Designing algorithms that will run well on real systems

• Programmers– Understanding key issues and obtaining best performance

• Architects– Understand workloads, interactions, important degrees of freedom

– Valuable for design and for evaluation

04/18/23 90CSCE930-Advanced Computer Architecture, Introduction

Page 91: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Software

• 1. Parallel programs– Process of parallelization– What parallel programs look like in major programming models

• 2. Programming for performance– Key performance issues and architectural interactions

• 3. Workload-driven architectural evaluation– Beneficial for architects and for users in procuring machines

• Unlike on sequential systems, can’t take workload for granted

– Software base not mature; evolves with architectures for performance

– So need to open the box

• Let’s begin with parallel programs ...

04/18/23 91CSCE930-Advanced Computer Architecture, Introduction

Page 92: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Outline

• Motivating Problems (application case studies)

• Steps in creating a parallel program

• What a simple parallel program looks like– In the three major programming models

– What primitives must a system support?

• Later: Performance issues and architectural interactions

04/18/23 92CSCE930-Advanced Computer Architecture, Introduction

Page 93: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Motivating Problems

• Simulating Ocean Currents– Regular structure, scientific computing

• Simulating the Evolution of Galaxies– Irregular structure, scientific computing

• Rendering Scenes by Ray Tracing– Irregular structure, computer graphics

• Data Mining– Irregular structure, information processing

– Not discussed here (read in book)

04/18/23 93CSCE930-Advanced Computer Architecture, Introduction

Page 94: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Simulating Ocean Currents

– Model as two-dimensional grids– Discretize in space and time

» finer spatial and temporal resolution => greater accuracy

– Many different computations per time step» set up and solve equations

– Concurrency across and within grid computations

(a) Cross sections (b) Spatial discretization of a cross section

04/18/23 94CSCE930-Advanced Computer Architecture, Introduction

Page 95: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Simulating Galaxy Evolution

– Simulate the interactions of many stars evolving over time

– Computing forces is expensive

– O(n2) brute force approach

– Hierarchical Methods take advantage of force law: F = G·m1·m2 / r²

•Many time-steps, plenty of concurrency across stars within one

(Figure: a star on which forces are being computed; a star too close to approximate; a small group far enough away to approximate to its center of mass; a large group far enough away to approximate)

04/18/23 95CSCE930-Advanced Computer Architecture, Introduction

Page 96: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Rendering Scenes by Ray Tracing

– Shoot rays into scene through pixels in image plane

– Follow their paths

» they bounce around as they strike objects

» they generate new rays: ray tree per input ray

– Result is color and opacity for that pixel

– Parallelism across rays

• All case studies have abundant concurrency

04/18/23 96CSCE930-Advanced Computer Architecture, Introduction

Page 97: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Creating a Parallel Program

• Assumption: Sequential algorithm is given– Sometimes need very different algorithm, but beyond scope

• Pieces of the job:– Identify work that can be done in parallel

– Partition work and perhaps data among processes

– Manage data access, communication and synchronization

– Note: work includes computation, data access and I/O

• Main goal: Speedup (plus low prog. effort and resource needs)

• Speedup(p) = Performance(p) / Performance(1)

• For a fixed problem:

• Speedup(p) = Time(1) / Time(p)

04/18/23 97CSCE930-Advanced Computer Architecture, Introduction

Page 98: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Steps in Creating a Parallel Program

• 4 steps: Decomposition, Assignment, Orchestration, Mapping

– Done by programmer or system software (compiler, runtime, ...)– Issues are the same, so assume programmer does it all explicitly

P0

Tasks Processes Processors

P1

P2 P3

p0 p1

p2 p3

p0 p1

p2 p3

Partitioning

Sequentialcomputation

Parallelprogram

Assignment

Decomposition

Mapping

Orchestration

04/18/23 98CSCE930-Advanced Computer Architecture, Introduction

Page 99: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Some Important Concepts
• Task:

– Arbitrary piece of undecomposed work in parallel computation

– Executed sequentially; concurrency is only across tasks

– E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace

– Fine-grained versus coarse-grained tasks

• Process (thread): – Abstract entity that performs the tasks assigned to processes

– Processes communicate and synchronize to perform their tasks

• Processor: – Physical engine on which process executes

– Processes virtualize machine to programmer

» first write program in terms of processes, then map to processors

04/18/23 99CSCE930-Advanced Computer Architecture, Introduction

Page 100: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Decomposition

• Break up computation into tasks to be divided among processes

– Tasks may become available dynamically

– No. of available tasks may vary with time

• i.e. identify concurrency and decide level at which to exploit it

• Goal: Enough tasks to keep processes busy, but not too many

– No. of tasks available at a time is upper bound on achievable speedup

04/18/23 100CSCE930-Advanced Computer Architecture, Introduction

Page 101: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Limited Concurrency: Amdahl’s Law

– Most fundamental limitation on parallel speedup

– If fraction s of seq execution is inherently serial, speedup <= 1/s

– Example: 2-phase calculation

» sweep over n-by-n grid and do some independent computation

» sweep again and add each value to global sum

– Time for first phase = n²/p

– Second phase serialized at global variable, so time = n²

– Speedup ≤ 2n² / (n²/p + n²), or at most 2

– Trick: divide second phase into two
» accumulate into private sum during sweep
» add per-process private sum into global sum

– Parallel time is n²/p + n²/p + p, and speedup at best 2n²p / (2n² + p²)

04/18/23 101CSCE930-Advanced Computer Architecture, Introduction
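The two-phase arithmetic can be checked numerically; a minimal sketch (function names are ours, not from the slides):

```python
def speedup_serialized(n, p):
    """Phase 1 parallel (n^2/p), phase 2 serialized at the global sum (n^2)."""
    return 2 * n**2 / (n**2 / p + n**2)

def speedup_private_sums(n, p):
    """Both sweeps parallel (n^2/p each) plus p serial additions of the
    per-process private sums: 2*n^2 / (2*n^2/p + p) = 2*n^2*p / (2*n^2 + p^2)."""
    return 2 * n**2 / (2 * n**2 / p + p)

# The serialized version is capped at 2 no matter how many processors;
# with private sums the speedup approaches p for large n.
```

For n = 1000 and p = 10, the private-sum version is within 0.01 of the ideal speedup of 10, while the serialized version never reaches 2.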

Page 102: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Pictorial Depiction

(Figure: work done concurrently vs. time for the three versions. (a) Phase 1 runs at concurrency p for time n²/p, then phase 2 runs serially for time n². (b) Sweep work runs at concurrency p, with the accumulations into the global sum serialized. (c) With private sums, both phases run at concurrency p for n²/p each, followed by a short serial accumulation of p values.)

04/18/23 102CSCE930-Advanced Computer Architecture, Introduction

Page 103: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Concurrency Profiles

– Area under curve is total work done, or time with 1 processor
– Horizontal extent is lower bound on time (infinite processors)

– Speedup is the ratio: Speedup(p) = [ Σ_{k=1..∞} f_k · k ] / [ Σ_{k=1..∞} f_k · ⌈k/p⌉ ]; base case (serial fraction s): 1 / (s + (1−s)/p)

– Amdahl’s law applies to any overhead, not just limited concurrency

• Cannot usually divide into serial and parallel part

(Figure: concurrency profile of a sample application — degree of concurrency, 0 to 1,400, plotted against clock cycle number, roughly cycles 150 to 733)

04/18/23 103CSCE930-Advanced Computer Architecture, Introduction
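The profile-based speedup expression can be evaluated directly. In this sketch (our naming), f_k is the time spent at degree of concurrency k when processors are unlimited:

```python
import math

def profile_speedup(profile, p):
    """profile: {k: f_k}, time spent at concurrency degree k with unlimited
    processors. Total work = sum f_k * k; with p processors each k-wide
    slice of the profile takes ceil(k/p) steps."""
    work = sum(f * k for k, f in profile.items())
    time_p = sum(f * math.ceil(k / p) for k, f in profile.items())
    return work / time_p

# Work split between concurrency 1 and concurrency 100: the serial
# portion limits the speedup, exactly as Amdahl's law predicts.
s = profile_speedup({1: 50, 100: 50}, 100)
```

A profile entirely at concurrency 1 gives speedup 1 on any machine; a profile entirely at concurrency k ≤ p gives speedup k.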

Page 104: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Assignment
• Specifying mechanism to divide work up among processes
– E.g. which process computes forces on which stars, or which rays
– Together with decomposition, also called partitioning
– Balance workload, reduce communication and management cost

• Structured approaches usually work well
– Code inspection (parallel loops) or understanding of application
– Well-known heuristics
– Static versus dynamic assignment

• As programmers, we worry about partitioning first
– Usually independent of architecture or prog model
– But cost and complexity of using primitives may affect decisions

• As architects, we assume program does reasonable job of it

04/18/23 104CSCE930-Advanced Computer Architecture, Introduction

Page 105: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Orchestration
– Naming data

– Structuring communication

– Synchronization

– Organizing data structures and scheduling tasks temporally

• Goals
– Reduce cost of communication and synch. as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce overhead of parallelism management

• Closest to architecture (and programming model & language)

– Choices depend a lot on comm. abstraction, efficiency of primitives

– Architects should provide appropriate primitives efficiently

04/18/23 105CSCE930-Advanced Computer Architecture, Introduction

Page 106: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Mapping• After orchestration, already have parallel program

• Two aspects of mapping:– Which processes will run on same processor, if necessary– Which process runs on which particular processor

» mapping to a network topology

• One extreme: space-sharing– Machine divided into subsets, only one app at a time in a subset

– Processes can be pinned to processors, or left to OS

• Another extreme: complete resource management control to OS

– OS uses the performance techniques we will discuss later

• Real world is between the two– User specifies desires in some aspects, system may ignore

• Usually adopt the view: process <-> processor

04/18/23 106CSCE930-Advanced Computer Architecture, Introduction

Page 107: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Parallelizing Computation vs. Data

• Above view is centered around computation– Computation is decomposed and assigned (partitioned)

• Partitioning Data is often a natural view too– Computation follows data: owner computes

– Grid example; data mining; High Performance Fortran (HPF)

• But not general enough
– Distinction between comp. and data stronger in many applications
» Barnes-Hut, Raytrace (later)

– Retain computation-centric view

– Data access and communication is part of orchestration

04/18/23 107CSCE930-Advanced Computer Architecture, Introduction

Page 108: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

High-level Goals

• High performance (speedup over sequential program), but low resource usage and development effort

• Implications for algorithm designers and architects
– Algorithm designers: high-perf., low resource needs
– Architects: high-perf., low cost, reduced programming effort

» e.g. gradually improving perf. with programming effort may be preferable to sudden threshold after large programming effort

Table 2.1 Steps in the Parallelization Process and Their Goals

Step          | Architecture-Dependent? | Major Performance Goals
Decomposition | Mostly no               | Expose enough concurrency but not too much
Assignment    | Mostly no               | Balance workload; reduce communication volume
Orchestration | Yes                     | Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping       | Yes                     | Put related processes on the same processor if necessary; exploit locality in network topology

04/18/23 108CSCE930-Advanced Computer Architecture, Introduction

Page 109: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

What Parallel Programs Look Like

Page 110: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Parallelization of An Example Program
• Motivating problems all lead to large, complex programs

• Examine a simplified version of a piece of Ocean simulation

– Iterative equation solver

• Illustrate parallel program in low-level parallel language

– C-like pseudocode with simple extensions for parallelism

– Expose basic comm. and synch. primitives that must be supported

– State of most real parallel programming today

04/18/23 110CSCE930-Advanced Computer Architecture, Introduction

Page 111: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Grid Solver Example

– Simplified version of solver in Ocean simulation

– Gauss-Seidel (near-neighbor) sweeps to convergence
» interior n-by-n points of (n+2)-by-(n+2) grid updated in each sweep
» updates done in-place in grid, and diff. from prev. value computed
» accumulate partial diffs into global diff at end of every sweep
» check if error has converged (to within a tolerance parameter)
» if so, exit solver; if not, do another sweep

Expression for updating each interior point:

A[i,j] = 0.2 × (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

04/18/23 111CSCE930-Advanced Computer Architecture, Introduction

Page 112: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

1.  int n;                    /*size of matrix: (n+2)-by-(n+2) elements*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n);                /*read input parameter: matrix size*/
6.    A ← malloc(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);          /*initialize the matrix A somehow*/
8.    Solve(A);               /*call the routine to solve equation*/
9.  end main
10. procedure Solve(A)        /*solve the equation system*/
11.   float **A;              /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do        /*outermost loop over sweeps*/
16.     diff = 0;             /*initialize maximum difference to 0*/
17.     for i ← 1 to n do     /*sweep over nonborder points of grid*/
18.       for j ← 1 to n do
19.         temp = A[i,j];    /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);  /*compute average*/
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

04/18/23 112CSCE930-Advanced Computer Architecture, Introduction
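A directly runnable transliteration of the sequential solver into Python (our transliteration; the pseudocode above is the authoritative form):

```python
def solve(A, tol=1e-5):
    """Gauss-Seidel sweeps over the interior of an (n+2)x(n+2) grid,
    updating in place until the average change per point is below tol.
    Returns the number of sweeps performed."""
    n = len(A) - 2
    sweeps = 0
    done = False
    while not done:
        diff = 0.0
        for i in range(1, n + 1):          # sweep over nonborder points
            for j in range(1, n + 1):
                temp = A[i][j]             # save old value of element
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                 A[i][j+1] + A[i+1][j])
                diff += abs(A[i][j] - temp)
        sweeps += 1
        if diff / (n * n) < tol:
            done = True
    return sweeps

# 4x4 grid: border fixed at 1.0, interior starts at 0.0 and relaxes toward 1.0
grid = [[1.0] * 4 for _ in range(4)]
grid[1][1] = grid[1][2] = grid[2][1] = grid[2][2] = 0.0
n_sweeps = solve(grid)
```

With all border points held at 1.0, the interior converges toward 1.0, which makes the fixed point easy to verify by hand.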

Page 113: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Decomposition

• Simple way to identify concurrency is to look at loop iterations
– dependence analysis; if not enough concurrency, then look further

• Not much concurrency here at this level (all loops sequential)

• Examine fundamental dependences, ignoring loop structure
– Concurrency O(n) along anti-diagonals, serialization O(n) along diag.
– Retain loop structure, use pt-to-pt synch; problem: too many synch ops
– Restructure loops, use global synch; imbalance and too much synch

04/18/23 113CSCE930-Advanced Computer Architecture, Introduction

Page 114: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Exploit Application Knowledge

• Reorder grid traversal: red-black ordering
– Different ordering of updates: may converge quicker or slower
– Red sweep and black sweep are each fully parallel
– Global synch between them (conservative but convenient)
– Ocean uses red-black; we use simpler, asynchronous one to illustrate
» no red-black, simply ignore dependences within sweep
» sequential order same as original; parallel program is nondeterministic

(Figure: checkerboard of red and black grid points)

04/18/23 114CSCE930-Advanced Computer Architecture, Introduction
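A sketch of one red-black sweep (our helper, following the slide's description: points where i+j is even form one color, and each half-sweep touches only one color, so it is fully parallel):

```python
def red_black_sweep(A):
    """One red-black sweep over the interior of an (n+2)x(n+2) grid:
    update all 'red' points ((i+j) even), then all 'black' points
    ((i+j) odd). Within one color, no point reads another point of the
    same color, so each half-sweep could run fully in parallel."""
    n = len(A) - 2
    diff = 0.0
    for parity in (0, 1):                  # red half-sweep, then black
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 != parity:
                    continue
                temp = A[i][j]
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                 A[i][j+1] + A[i+1][j])
                diff += abs(A[i][j] - temp)
    return diff
```

Only a global synchronization between the red and black half-sweeps is needed, which is the conservative-but-convenient point the slide makes.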

Page 115: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Decomposition Only

– Decomposition into elements: degree of concurrency n2

– To decompose into rows, make line 18 loop sequential; degree n– for_all leaves assignment left to system

» but implicit global synch. at end of for_all loop

15. while (!done) do              /*a sequential loop*/
16.   diff = 0;
17.   for_all i ← 1 to n do       /*a parallel loop nest*/
18.     for_all j ← 1 to n do
19.       temp = A[i,j];
20.       A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                A[i,j+1] + A[i+1,j]);
22.       diff += abs(A[i,j] - temp);
23.     end for_all
24.   end for_all
25.   if (diff/(n*n) < TOL) then done = 1;
26. end while

04/18/23 115CSCE930-Advanced Computer Architecture, Introduction

Page 116: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Assignment

• Static assignments (given decomposition into rows)
– block assignment of rows: row i is assigned to process ⌊i/b⌋, where b = n/p rows per process
– cyclic assignment of rows: process i is assigned rows i, i+p, and so on

– Dynamic assignment
» get a row index, work on the row, get a new row, and so on

– Static assignment into rows reduces concurrency (from n to p)
» block assign. reduces communication by keeping adjacent rows together

– Let’s dig into orchestration under three programming models

(Figure: block assignment of contiguous row blocks to processes P0, P1, P2, …)

04/18/23 116CSCE930-Advanced Computer Architecture, Introduction
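The two static assignments can be written down directly (a sketch with our function names; block assignment assumes nprocs divides n evenly, as the slides also assume):

```python
def block_owner(i, n, nprocs):
    """Block assignment: row i (0-based) goes to process i // b,
    where b = n // nprocs is the number of rows per process."""
    return i // (n // nprocs)

def cyclic_owner(i, nprocs):
    """Cyclic assignment: process i % nprocs owns rows i, i+nprocs, ..."""
    return i % nprocs
```

Block assignment keeps a process's rows adjacent (less border communication); cyclic assignment interleaves rows (better load balance when work per row varies).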

Page 117: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Data Parallel Solver
1.  int n, nprocs;      /*grid size (n+2-by-n+2) and number of processes*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n); read(nprocs);   /*read input grid size and number of processes*/
6.    A ← G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);           /*initialize the matrix A somehow*/
8.    Solve(A);                /*call the routine to solve equation*/
9.  end main
10. procedure Solve(A)         /*solve the equation system*/
11.   float **A;               /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.  DECOMP A[BLOCK,*, nprocs];
15.   while (!done) do         /*outermost loop over sweeps*/
16.     mydiff = 0;            /*initialize maximum difference to 0*/
17.     for_all i ← 1 to n do  /*sweep over non-border points of grid*/
18.       for_all j ← 1 to n do
19.         temp = A[i,j];     /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);  /*compute average*/
22.         mydiff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.    REDUCE(mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

04/18/23 117CSCE930-Advanced Computer Architecture, Introduction

Page 118: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Shared Address Space Solver

– Assignment controlled by values of variables used as loop bounds

Single Program Multiple Data (SPMD)

(Figure: SPMD structure — each process executes Solve; within it, a sweep followed by a convergence test)

04/18/23 118CSCE930-Advanced Computer Architecture, Introduction

Page 119: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

1.  int n, nprocs;      /*matrix dimension and number of processors to be used*/
2a. float **A, diff;    /*A is global (shared) array representing the grid;
                          diff is global (shared) maximum difference in current sweep*/
2b. LOCKDEC(diff_lock); /*declaration of lock to enforce mutual exclusion*/
2c. BARDEC(bar1);       /*barrier declaration for global synchronization between sweeps*/
3.  main()
4.  begin
5.    read(n); read(nprocs);  /*read input matrix size and number of processes*/
6.    A ← G_MALLOC(a two-dimensional array of size n+2 by n+2 doubles);
7.    initialize(A);          /*initialize A in an unspecified way*/
8a.   CREATE(nprocs-1, Solve, A);
8.    Solve(A);               /*main process becomes a worker too*/
8b.   WAIT_FOR_END(nprocs-1); /*wait for all child processes created to terminate*/
9.  end main
10. procedure Solve(A)
11.   float **A;      /*A is entire n+2-by-n+2 shared array, as in the sequential program*/
12. begin
13.   int i, j, pid, done = 0;
14.   float temp, mydiff = 0;            /*private variables*/
14a.  int mymin = 1 + (pid * n/nprocs);  /*assume that n is exactly divisible by*/
14b.  int mymax = mymin + n/nprocs - 1;  /*nprocs for simplicity here*/
15.   while (!done) do            /*outermost loop over sweeps*/
16.     mydiff = diff = 0;        /*set global diff to 0 (okay for all to do it)*/
16a.    BARRIER(bar1, nprocs);    /*ensure all reach here before anyone modifies diff*/
17.     for i ← mymin to mymax do /*for each of my rows*/
18.       for j ← 1 to n do       /*for all nonborder elements in that row*/
19.         temp = A[i,j];
20.         A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);
22.         mydiff += abs(A[i,j] - temp);
23.       endfor
24.     endfor
25a.    LOCK(diff_lock);          /*update global diff if necessary*/
25b.    diff += mydiff;
25c.    UNLOCK(diff_lock);
25d.    BARRIER(bar1, nprocs);    /*ensure all reach here before checking if done*/
25e.    if (diff/(n*n) < TOL) then done = 1;  /*check convergence; all get same answer*/
25f.    BARRIER(bar1, nprocs);
26.   endwhile
27. end procedure
04/18/23 119CSCE930-Advanced Computer Architecture, Introduction

Page 120: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Notes on SAS Program

– SPMD: not lockstep or even necessarily same instructions

– Assignment controlled by values of variables used as loop bounds

» unique pid per process, used to control assignment

– Done condition evaluated redundantly by all

– Code that does the update identical to sequential program

» each process has private mydiff variable

– Most interesting special operations are for synchronization

» accumulations into shared diff have to be mutually exclusive

» why the need for all the barriers?

04/18/23 120CSCE930-Advanced Computer Architecture, Introduction

Page 121: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Need for Mutual Exclusion
– Code each process executes:

• load the value of diff into register r1

• add the register r2 to register r1

• store the value of register r1 into diff

– A possible interleaving:

P1: r1 ← diff        {P1 gets 0 in its r1}
P2: r1 ← diff        {P2 also gets 0}
P1: r1 ← r1+r2       {P1 sets its r1 to 1}
P2: r1 ← r1+r2       {P2 sets its r1 to 1}
P1: diff ← r1        {P1 sets diff to 1}
P2: diff ← r1        {P2 also sets diff to 1}

– Need the sets of operations to be atomic (mutually exclusive)

04/18/23 121CSCE930-Advanced Computer Architecture, Introduction
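The lost update above can be replayed deterministically, and fixed with a lock; a small sketch (Python threading standing in for the shared address space; variable names are ours):

```python
import threading

# Deterministic replay of the interleaving: both processes load diff = 0,
# both add their private r2 = 1, both store 1 -- one update is lost.
diff = 0
r1_p1 = diff            # P1 loads 0 into its r1
r1_p2 = diff            # P2 also loads 0
r1_p1 = r1_p1 + 1       # P1 sets its r1 to 1
r1_p2 = r1_p2 + 1       # P2 sets its r1 to 1
diff = r1_p1            # P1 stores 1
diff = r1_p2            # P2 also stores 1: should have been 2
assert diff == 1

# LOCK/UNLOCK around the load-add-store makes the sequence atomic:
lock = threading.Lock()
shared = {"diff": 0}

def add_mydiff(mydiff):
    with lock:                      # critical section
        shared["diff"] += mydiff

threads = [threading.Thread(target=add_mydiff, args=(1,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert shared["diff"] == 4          # no update lost
```

Accumulating into a private mydiff and taking the lock once per sweep, as the SAS program does, keeps this serialization off the inner loop.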

Page 122: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Mutual Exclusion

• Provided by LOCK-UNLOCK around critical section

– Set of operations we want to execute atomically

– Implementation of LOCK/UNLOCK must guarantee mutual excl.

• Can lead to significant serialization if contended– Especially since expect non-local accesses in critical section

– Another reason to use private mydiff for partial accumulation

04/18/23 122CSCE930-Advanced Computer Architecture, Introduction

Page 123: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Global Event Synchronization

• BARRIER(nprocs): wait here till nprocs processes get here

– Built using lower level primitives– Global sum example: wait for all to accumulate before using sum– Often used to separate phases of computation

• Process P_1 Process P_2 Process P_nprocs

• set up eqn system set up eqn system set up eqn system

• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)

• solve eqn system solve eqn system solve eqn system

• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)

• apply results apply results apply results

• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)

– Conservative form of preserving dependences, but easy to use

• WAIT_FOR_END (nprocs-1)

04/18/23 123CSCE930-Advanced Computer Architecture, Introduction
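The set-up / solve phase structure above can be expressed directly with a barrier; a sketch using Python's threading.Barrier (names are ours):

```python
import threading

NPROCS = 4
bar = threading.Barrier(NPROCS)      # plays the role of BARRIER(name, nprocs)
log = []
log_lock = threading.Lock()

def worker(pid):
    with log_lock:
        log.append(("setup", pid))   # "set up eqn system"
    bar.wait()                       # no one solves until all have set up
    with log_lock:
        log.append(("solve", pid))   # "solve eqn system"
    bar.wait()                       # separate solving from the next phase

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "setup" entry precedes every "solve" entry:
assert all(kind == "setup" for kind, _ in log[:NPROCS])
assert all(kind == "solve" for kind, _ in log[NPROCS:])
```

As the slide says, this is a conservative way to preserve dependences: it orders whole phases even when only a few cross-phase dependences exist.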

Page 124: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Pt-to-pt Event Synch (Not Used Here)

• One process notifies another of an event so it can proceed– Common example: producer-consumer (bounded buffer)

– Concurrent programming on uniprocessor: semaphores

– Shared address space parallel programs: semaphores, or use ordinary variables as flags

•Busy-waiting or spinning

P1:                          P2:
A = 1;                       a: while (flag is 0) do nothing;
b: flag = 1;                 print A;

04/18/23 124CSCE930-Advanced Computer Architecture, Introduction
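The flag idiom runs as-is under CPython threads, whose interpreter lock supplies the ordering and visibility that real shared-memory hardware must provide through its memory model (a sketch; names are ours):

```python
import threading

shared = {"A": 0, "flag": 0}
printed = []

def p1():
    shared["A"] = 1              # produce the data
    shared["flag"] = 1           # b: flag = 1 signals the consumer

def p2():
    while shared["flag"] == 0:   # a: busy-wait (spin) on the flag
        pass
    printed.append(shared["A"])  # sees A == 1: A was written before flag

consumer = threading.Thread(target=p2)
consumer.start()
threading.Thread(target=p1).start()
consumer.join()
assert printed == [1]
```

On hardware with weaker memory ordering, the store to A must be ordered before the store to flag (e.g. with fences or release/acquire semantics) for this idiom to be safe.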

Page 125: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Group Event Synchronization

• Subset of processes involved– Can use flags or barriers (involving only the subset)

– Concept of producers and consumers

• Major types:– Single-producer, multiple-consumer

– Multiple-producer, single-consumer

– Multiple-producer, multiple-consumer

04/18/23 125CSCE930-Advanced Computer Architecture, Introduction

Page 126: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Message Passing Grid Solver

– Cannot declare A to be shared array any more

– Need to compose it logically from per-process private arrays

» usually allocated in accordance with the assignment of work

» process assigned a set of rows allocates them locally

– Transfers of entire rows between traversals

– Structurally similar to SAS (e.g. SPMD), but orchestration different

» data structures and data access/naming

» communication

» synchronization

04/18/23 126CSCE930-Advanced Computer Architecture, Introduction

Page 127: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

1.  int pid, n, b;     /*process id, matrix dimension and number of processors to be used*/
2.  float **myA;
3.  main()
4.  begin
5.    read(n); read(nprocs);    /*read input matrix size and number of processes*/
8a.   CREATE(nprocs-1, Solve);
8b.   Solve();                  /*main process becomes a worker too*/
8c.   WAIT_FOR_END(nprocs-1);   /*wait for all child processes created to terminate*/
9.  end main
10. procedure Solve()
11. begin
13.   int i, j, pid, n' = n/nprocs, done = 0;
14.   float temp, tempdiff, mydiff = 0;    /*private variables*/
6.    myA ← malloc(a 2-d array of size [n/nprocs + 2] by n+2);  /*my assigned rows of A*/
7.    initialize(myA);          /*initialize my rows of A, in an unspecified way*/
15.   while (!done) do
16.     mydiff = 0;             /*set local diff to 0*/
16a.    if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.    if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.    if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.    if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
        /*border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*]*/
17.     for i ← 1 to n' do      /*for each of my (nonghost) rows*/
18.       for j ← 1 to n do     /*for all nonborder elements in that row*/
19.         temp = myA[i,j];
20.         myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                  myA[i,j+1] + myA[i+1,j]);
22.         mydiff += abs(myA[i,j] - temp);
23.       endfor
24.     endfor
        /*communicate local diff values and determine if done;
          can be replaced by reduction and broadcast*/
25a.    if (pid != 0) then      /*process 0 holds global total diff*/
25b.      SEND(mydiff, sizeof(float), 0, DIFF);
25c.      RECEIVE(done, sizeof(int), 0, DONE);
25d.    else                    /*pid 0 does this*/
25e.      for i ← 1 to nprocs-1 do   /*for each other process*/
25f.        RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.        mydiff += tempdiff;      /*accumulate into total*/
25h.      endfor
25i.      if (mydiff/(n*n) < TOL) then done = 1;
25j.      for i ← 1 to nprocs-1 do   /*for each other process*/
25k.        SEND(done, sizeof(int), i, DONE);
25l.      endfor
25m.    endif
26.   endwhile
27. end procedure

04/18/23 127CSCE930-Advanced Computer Architecture, Introduction

Page 128: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Notes on Message Passing Program
– Use of ghost rows

– Receive does not transfer data, send does

» unlike SAS which is usually receiver-initiated (load fetches data)

– Communication done at beginning of iteration, so no asynchrony

– Communication in whole rows, not element at a time

– Core similar, but indices/bounds in local rather than global space

– Synchronization through sends and receives

» Update of global diff and event synch for done condition

» Could implement locks and barriers with messages

– Can use REDUCE and BROADCAST library calls to simplify code

/*communicate local diff values and determine if done, using reduction and broadcast*/
25b.  REDUCE(0, mydiff, sizeof(float), ADD);
25c.  if (pid == 0) then
25i.    if (mydiff/(n*n) < TOL) then done = 1;
25k.  endif
25m.  BROADCAST(0, done, sizeof(int), DONE);

04/18/23 128CSCE930-Advanced Computer Architecture, Introduction
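A runnable miniature of the message-passing structure for two processes (our construction: Python threads with FIFO queues standing in for SEND/RECEIVE channels between private slabs; the pseudocode above is the actual model). Each worker owns half the rows plus ghost rows, exchanges border rows each iteration, and worker 0 performs the diff reduction and broadcasts the done flag:

```python
import threading
import queue

N, TOL = 4, 1e-6       # 4x4 interior; (N+2)x(N+2) grid, borders fixed at 1.0
HALF = N // 2

q01, q10 = queue.Queue(), queue.Queue()   # per-direction FIFO channels

def worker(pid, slab, send, recv, out):
    """slab: my HALF owned rows plus one boundary/ghost row above and below.
    Each iteration: exchange border rows, sweep my rows, then do a
    2-process reduce + broadcast of the convergence decision."""
    done = False
    while not done:
        if pid == 0:
            send(list(slab[HALF]))        # SEND my last owned row down
            slab[HALF + 1] = recv()       # RECEIVE neighbor's row as ghost
        else:
            send(list(slab[1]))           # SEND my first owned row up
            slab[0] = recv()
        mydiff = 0.0
        for i in range(1, HALF + 1):      # sweep over my (nonghost) rows
            for j in range(1, N + 1):
                temp = slab[i][j]
                slab[i][j] = 0.2 * (slab[i][j] + slab[i][j-1] + slab[i-1][j] +
                                    slab[i][j+1] + slab[i+1][j])
                mydiff += abs(slab[i][j] - temp)
        if pid == 1:
            send(mydiff)                  # send local diff to worker 0
            done = recv()                 # receive the done flag
        else:
            done = (mydiff + recv()) / (N * N) < TOL
            send(done)                    # broadcast the decision
    out.extend(slab[1:HALF + 1])          # report my owned rows

grid = [[1.0] * (N + 2) for _ in range(N + 2)]
for i in range(1, N + 1):
    for j in range(1, N + 1):
        grid[i][j] = 0.0
out0, out1 = [], []
t0 = threading.Thread(target=worker, args=(
    0, [row[:] for row in grid[0:HALF + 2]], q01.put, q10.get, out0))
t1 = threading.Thread(target=worker, args=(
    1, [row[:] for row in grid[HALF:N + 2]], q10.put, q01.get, out1))
t0.start(); t1.start(); t0.join(); t1.join()
interior = out0 + out1
```

Both sides send before receiving, which is safe here only because the channels buffer; with synchronous (unbuffered) sends this same structure deadlocks, exactly the point raised on the next slide.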

Page 129: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Send and Receive Alternatives

– Affect event synch (mutual excl. by fiat: only one process touches data)– Affect ease of programming and performance

• Synchronous messages provide built-in synch. through match– Separate event synchronization needed with asynch. messages

• With synch. messages, our code is deadlocked. Fix?

– Can extend functionality: stride, scatter-gather, groups

– Semantic flavors: based on when control is returned
» affects when data structures or buffers can be reused at either end

Send/Receive
  Synchronous
  Asynchronous
    Blocking asynch.
    Nonblocking asynch.

04/18/23 129CSCE930-Advanced Computer Architecture, Introduction

Page 130: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Orchestration: Summary

• Shared address space– Shared and private data explicitly separate– Communication implicit in access patterns– No correctness need for data distribution– Synchronization via atomic operations on shared data– Synchronization explicit and distinct from data

communication

• Message passing– Data distribution among local address spaces needed– No explicit shared structures (implicit in comm. patterns)– Communication is explicit– Synchronization implicit in communication (at least in synch.

case)

» mutual exclusion by fiat

04/18/23 130CSCE930-Advanced Computer Architecture, Introduction

Page 131: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Correctness in Grid Solver Program

• Decomposition and Assignment similar in SAS and message-passing

• Orchestration is different– Data structures, data access/naming, communication, synchronization

                                     SAS       Msg-Passing
Explicit global data structure?      Yes       No
Assignment indept of data layout?    Yes       No
Communication                        Implicit  Explicit
Synchronization                      Explicit  Implicit
Explicit replication of border rows? No        Yes

Requirements for performance are another story ...

04/18/23 131CSCE930-Advanced Computer Architecture, Introduction