CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences University of California, Berkeley



Page 1

CSCE 930 Advanced Computer

Architecture

Introductions

Adopted from Professor David Patterson & David Culler

Electrical Engineering and Computer Sciences
University of California, Berkeley

Page 2: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Outline

• Computer Science at a Crossroads: Parallelism

– Architecture: multi-core and many-cores

– Program: multi-threading

• Parallel Architecture

– What is Parallel Architecture?

– Why Parallel Architecture?

– Evolution and Convergence of Parallel Architectures

– Fundamental Design Issues

• Parallel Programs

– Why bother with programs?

– Important for whom?

• Memory & Storage Subsystem Architectures

04/18/23, slide 2: CSCE930-Advanced Computer Architecture, Introduction

Page 3

• Old Conventional Wisdom: Power is free, Transistors expensive

• New Conventional Wisdom: “Power wall” Power expensive, Xtors free (Can put more on chip than can afford to turn on)

• Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, …)

• New CW: “ILP wall” law of diminishing returns on more HW for ILP

• Old CW: Multiplies are slow, Memory access is fast

• New CW: “Memory wall” Memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)

• Old CW: Uniprocessor performance 2X / 1.5 yrs

• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall

– Uniprocessor performance now 2X / 5(?) yrs

Sea change in chip design: multiple “cores” (2X processors per chip / ~ 2 years)

» Many simpler processors are more power-efficient

Crossroads: Conventional Wisdom in Comp. Arch

Page 4

[Figure: uniprocessor performance relative to the VAX-11/780, log scale 1-10,000, 1978-2006; growth at 25%/year to 1986, 52%/year to 2002, ??%/year since]

Crossroads: Uniprocessor Performance

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006

Page 5

Sea Change in Chip Design

• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip

• Processor is the new transistor?

• RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip

• 125 mm2 chip, 0.065 micron CMOS = 2312 RISC II+FPU+Icache+Dcache

– RISC II shrinks to ~ 0.02 mm2 at 65 nm

– Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?

– Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)

(Ivan Sutherland @ Sun / Berkeley)
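The "~0.02 mm2" shrink quoted above follows from square-law area scaling. A minimal sketch with the slide's own numbers, assuming an ideal linear shrink:

```python
# Ideal (square-law) area scaling: a design's die area shrinks with the
# square of the feature-size ratio. Slide numbers: RISC II was 60 mm^2
# in a 3-micron process; estimate its footprint at 65 nm (0.065 micron).
def scaled_area(area_mm2, old_feature_um, new_feature_um):
    """Die area after an ideal linear shrink between feature sizes."""
    return area_mm2 * (new_feature_um / old_feature_um) ** 2

risc2_at_65nm = scaled_area(60, 3.0, 0.065)
print(f"RISC II at 65 nm: {risc2_at_65nm:.3f} mm^2")  # ~0.028 mm^2
```

Real shrinks deviate from the ideal (wires and pads do not scale perfectly), so this is an estimate, consistent with the slide's "~0.02 mm2".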

Page 6

Déjà vu all over again?

• Multiprocessors imminent in 1970s, ’80s, ’90s, …

• “… today’s processors … are nearing an impasse as technologies approach the speed of light.”
David Mitchell, The Transputer: The Time Is Now (1989)

• Transputer was premature ⇒ custom multiprocessors strove to lead uniprocessors ⇒ procrastination rewarded: 2X seq. perf. / 1.5 years

• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”

Paul Otellini, President, Intel (2004)

• Difference is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs)

⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
⇒ Biggest programming challenge: going from 1 to 2 CPUs

Page 7

Problems with Sea Change

• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip,

• Architectures not ready for 1000 CPUs / chip

• Unlike Instruction Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but also cannot be solved without their participation

• This course explores ILP (Instruction Level Parallelism) and its shift to Thread Level Parallelism / Data Level Parallelism

Page 8

Outline

• Computer Science at a Crossroads: Parallelism

– Architecture: multi-core and many-cores

– Program: multi-threading

• Parallel Architecture

– What is Parallel Architecture?

– Why Parallel Architecture?

– Evolution and Convergence of Parallel Architectures

– Fundamental Design Issues

• Parallel Programs

– Why bother with programs?

– Important for whom?

• Memory & Storage Subsystem Architectures

Page 9

What is Parallel Architecture?

• A parallel computer is a collection of processing elements that cooperate to solve large problems fast

• Some broad issues:

– Resource Allocation:

» how large a collection?

» how powerful are the elements?

» how much memory?

– Data access, Communication and Synchronization

» how do the elements cooperate and communicate?

» how are data transmitted between processors?

» what are the abstractions and primitives for cooperation?

– Performance and Scalability

» how does it all translate into performance?

» how does it scale?

Page 10

Why Study Parallel Architecture?

Role of a computer architect:

To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost.

Parallelism:

• Provides alternative to faster clock for performance

• Applies at all levels of system design

• Is a fascinating perspective from which to view architecture

• Is increasingly central in information processing

Page 11

Why Study it Today?

• History: diverse and innovative organizational structures, often tied to novel programming models

• Rapidly maturing under strong technological constraints

– The “killer micro” is ubiquitous

– Laptops and supercomputers are fundamentally similar!

– Technological trends cause diverse approaches to converge

• Technological trends make parallel computing inevitable

– In the mainstream with the reality of multi-cores and many-cores

• Need to understand fundamental principles and design tradeoffs, not just taxonomies

– Naming, Ordering, Replication, Communication performance

Page 12

Inevitability of Parallel Computing

• Application demands: Our insatiable need for computing cycles

– Scientific computing: VR simulations in Biology, Chemistry, Physics, ...

– General-purpose computing: Video, Graphics, CAD, Databases, AR, VI, TP...

• Technology Trends

– Number of cores per chip growing rapidly (new Moore’s Law)
– Clock rates expected to go up only slowly (tech. wall)

• Architecture Trends

– Instruction-level parallelism valuable but limited
– Coarser-level parallelism, i.e., thread-level parallelism, the most viable approach

• Economics

• Current trends:

– Today’s microprocessors are multiprocessors

Page 13

Application Trends

• Demand for cycles fuels advances in hardware, and vice-versa

– This demand cycle drives exponential increase in microprocessor performance

– Drives parallel architecture harder: most demanding applications

• Range of performance demands

– Need range of system performance with progressively increasing cost

– Platform pyramid

• Goal of applications in using parallel machines: Speedup

• Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time

• Speedup, fixed problem (p processors) = Time (1 processor) / Time (p processors)
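The fixed-problem definition above maps directly to code. A small sketch; the timings are hypothetical, chosen only to illustrate the formula:

```python
def speedup(time_1proc, time_pproc):
    """Fixed-problem speedup: Time(1 processor) / Time(p processors)."""
    return time_1proc / time_pproc

# Hypothetical run: 100 s on 1 processor, 8 s on p = 16 processors.
s = speedup(100.0, 8.0)           # 12.5x
efficiency = s / 16               # parallel efficiency = speedup / p
print(f"speedup {s:.1f}x, efficiency {efficiency:.0%}")  # 12.5x, 78%
```

Efficiency below 100% is the common case: communication and synchronization keep speedup under p.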

Page 14

Scientific Computing Demand

Page 15

Engineering Computing Demand

• Large parallel machines a mainstay in many industries

– Petroleum (reservoir analysis)

– Automotive (crash simulation, drag analysis, combustion efficiency),

– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism),

– Computer-aided design

– Pharmaceuticals (molecular modeling)

– Visualization

» In all of the above

» Entertainment (3D films like Avatar & 3D games)

» Architecture (walk-throughs and rendering)

» Virtual Reality/Immersion (museums, teleporting, etc)

– Financial modeling (yield and derivative analysis)

– Etc.

Page 16

Learning Curve for Parallel Applications

• AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on a Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D

Page 17

Commercial Computing

• Also relies on parallelism for the high end

– Scale not so large, but use much more widespread

– Computational power determines scale of business that can be handled

• Databases, online-transaction processing, decision support, data mining, data warehousing ...

• TPC benchmarks (TPC-C order entry, TPC-D decision support)

– Explicit scaling criteria provided

– Size of enterprise scales with size of system

– Problem size no longer fixed as p increases, so throughput is used as a performance measure (transactions per minute or tpm)

Page 18

Similar Story for Storage

• Divergence between memory capacity and speed more pronounced

– Capacity increased by 1000x from 1980-95, speed only 2x

– Gigabit DRAM by c. 2000, but gap with processor speed much greater

• Larger memories are slower, while processors get faster

– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?

• Parallelism increases effective size of each level of hierarchy, without increasing access time

• Parallelism and locality within memory systems too

– New designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
– Buffer caches hold the most recently accessed data

• Disks too: Parallel disks plus caching

Page 19

[Figure: storage demands: medical imaging >100 TB; digital body, 1 TB/body; high-performance computing >1 PB; NASA >100 TB; H.B Glass, 5 GB/day]

Real-world applications demand high-performing and reliable storage

Page 20

[Figure: GIS >1 PB; oil prospecting >1 PB; ocean resource data >1 PB; Google, Yahoo, … >1 PB/year]

1 PB = 1000 TB = 10^15 bytes, equal to the capacity of 10,000 100-GB disks.

Page 21

Technology Trends: Moore’s Law: 2X transistors / “year” ⇒ 2X cores / “year”

• “Cramming More Components onto Integrated Circuits”– Gordon Moore, Electronics, 1965

• # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
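The doubling law above compounds quickly. A minimal sketch; the 18-month doubling period is one assumed value within the stated 12-24-month range:

```python
def transistors(initial_count, months, doubling_period_months):
    """Transistor count after `months`, doubling every doubling period."""
    return initial_count * 2 ** (months / doubling_period_months)

# At an assumed 18-month doubling period, counts grow roughly 100x per decade:
per_decade = transistors(1, 120, 18)
print(f"growth over 10 years: {per_decade:.0f}x")  # ~102x
```

Varying N between 12 and 24 months swings the per-decade growth between roughly 1000x and 32x, which is why the doubling period matters so much in long-range projections.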

Page 22

Tracking Technology Performance Trends

• Drill down into 4 technologies: disks, memory, network, processors

• Compare ~1980 Archaic (Nostalgic) vs. ~2000 Modern (Newfangled)

– Performance Milestones in each technology

• Compare for Bandwidth vs. Latency improvements in performance over time

• Bandwidth: number of events per unit time– E.g., M bits / second over network, M bytes / second from disk

• Latency: elapsed time for a single event– E.g., one-way network delay in microseconds,

average disk access time in milliseconds

Page 23

Disks: Archaic(Nostalgic) v. Modern(Newfangled)

• Seagate 373453, 2003

• 15000 RPM (4X)

• 73.4 GBytes (2500X)

• Tracks/Inch: 64000 (80X)

• Bits/Inch: 533,000 (60X)

• Four 2.5” platters (in 3.5” form factor)

• Bandwidth: 86 MBytes/sec (140X)

• Latency: 5.7 ms (8X)

• Cache: 8 MBytes

• CDC Wren I, 1983

• 3600 RPM

• 0.03 GBytes capacity

• Tracks/Inch: 800

• Bits/Inch: 9550

• Three 5.25” platters

• Bandwidth: 0.6 MBytes/sec

• Latency: 48.3 ms

• Cache: none
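The rounded improvement factors in parentheses above can be sanity-checked from the raw numbers on this slide:

```python
# Raw numbers: CDC Wren I (1983) vs. Seagate 373453 (2003).
capacity_x  = 73.4 / 0.03        # ~2447  (slide rounds to 2500X)
tpi_x       = 64000 / 800        # 80     (80X)
bpi_x       = 533000 / 9550      # ~56    (60X)
bandwidth_x = 86 / 0.6           # ~143   (140X)
latency_x   = 48.3 / 5.7         # ~8.5   (8X)
print(capacity_x, tpi_x, bpi_x, bandwidth_x, latency_x)
```

Note the asymmetry already visible here: bandwidth improved ~140x over the two decades while latency improved only ~8x, the pattern the following slides generalize.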

Page 24

Latency Lags Bandwidth (for last ~20 years)

• Performance Milestones

• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)

[Figure: relative bandwidth improvement vs. relative latency improvement, log-log scale (BW 1-10,000, latency 1-100); the disk point lies far above the “latency improvement = bandwidth improvement” line]

Page 25

Memory: Archaic (Nostalgic) v. Modern (Newfangled)

• 1980 DRAM (asynchronous)

• 0.06 Mbits/chip

• 64,000 xtors, 35 mm2

• 16-bit data bus per module, 16 pins/chip

• 13 Mbytes/sec

• Latency: 225 ns

• (no block transfer)

• 2000 Double Data Rate Synchr. (clocked) DRAM

• 256.00 Mbits/chip (4000X)

• 256,000,000 xtors, 204 mm2

• 64-bit data bus per DIMM, 66 pins/chip (4X)

• 1600 Mbytes/sec (120X)

• Latency: 52 ns (4X)

• Block transfers (page mode)

Page 26

Latency Lags Bandwidth (last ~20 years)

• Performance Milestones

• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)

• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)

[Figure: relative bandwidth improvement vs. relative latency improvement, log-log scale; memory and disk points both lie above the “latency improvement = bandwidth improvement” line]

Page 27

LANs: Archaic (Nostalgic)v. Modern (Newfangled)

• Ethernet 802.3

• Year of Standard: 1978

• 10 Mbits/s link speed

• Latency: 3000 µsec

• Shared media

• Coaxial cable

• Ethernet 802.3ae

• Year of Standard: 2003

• 10,000 Mbits/s link speed (1000X)

• Latency: 190 µsec (15X)

• Switched media

• Category 5 copper wire

Coaxial cable: copper core, insulator, braided outer conductor, plastic covering.

Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect; “Cat 5” is 4 twisted pairs in a bundle.

Page 28

Latency Lags Bandwidth (last ~20 years)

• Performance Milestones

• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)

• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)

• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)

[Figure: relative bandwidth improvement vs. relative latency improvement, log-log scale; memory, network, and disk points all lie above the “latency improvement = bandwidth improvement” line]

Page 29

CPUs: Archaic (Nostalgic) v. Modern (Newfangled)

• 1982 Intel 80286

• 12.5 MHz

• 2 MIPS (peak)

• Latency 320 ns

• 134,000 xtors, 47 mm2

• 16-bit data bus, 68 pins

• Microcode interpreter, separate FPU chip

• (no caches)

• 2001 Intel Pentium 4

• 1500 MHz (120X)

• 4500 MIPS (peak) (2250X)

• Latency 15 ns (20X)

• 42,000,000 xtors, 217 mm2

• 64-bit data bus, 423 pins

• 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution

• On-chip 8KB Data caches, 96KB Instr. Trace cache, 256KB L2 cache

Page 30

Latency Lags Bandwidth (last ~20 years)

• Performance Milestones

• Processor: ’286, ’386, ’486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)

• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)

• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)

• Disk : 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

[Figure: relative bandwidth improvement vs. relative latency improvement, log-log scale; processor, network, disk, and memory points all lie above the “latency improvement = bandwidth improvement” line, with CPU high and memory low (“Memory Wall”)]

Page 31

Rule of Thumb for Latency Lagging BW

• In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4

(and capacity improves faster than bandwidth)

• Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency
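The squared-latency statement can be checked against the milestone factors quoted in the charts above (latency improvement, bandwidth improvement over ~20 years):

```python
# (latency improvement, bandwidth improvement), per the milestone charts.
milestones = {
    "processor": (21, 2250),
    "ethernet":  (16, 1000),
    "memory":    (4, 120),
    "disk":      (8, 143),
}
for name, (lat_x, bw_x) in milestones.items():
    # Rule of thumb: bandwidth gain exceeds the square of the latency gain.
    assert bw_x > lat_x ** 2
    print(f"{name}: BW {bw_x}x vs latency^2 {lat_x ** 2}x")
```

All four technologies satisfy the rule, with disk the tightest case (143x bandwidth vs. 64x for latency squared).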

Page 32

6 Reasons Latency Lags Bandwidth

1. Moore’s Law helps BW more than latency

• Faster transistors, more transistors, more pins help bandwidth

» MPU Transistors: 0.130 vs. 42 M xtors (300X)

» DRAM Transistors: 0.064 vs. 256 M xtors (4000X)

» MPU Pins: 68 vs. 423 pins (6X)

» DRAM Pins: 16 vs. 66 pins (4X)

• Smaller, faster transistors but communicate over (relatively) longer lines: limits latency

» Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)

» MPU Die Size: 47 vs. 217 mm2 (ratio sqrt 2X)

» DRAM Die Size: 35 vs. 204 mm2 (ratio sqrt 2X)

Page 33

6 Reasons Latency Lags Bandwidth (cont’d)

2. Distance limits latency

• Size of DRAM block ⇒ long bit and word lines ⇒ most of DRAM access time

• Speed of light limits communication between computers on a network

• 1. & 2. explain linear latency vs. square BW?

3. Bandwidth easier to sell (“bigger = better”)

• E.g., 10 Gbits/s Ethernet (“10 Gig”) vs. 10 µsec latency Ethernet

• 4400 MB/s DIMM (“PC4400”) vs. 50 ns latency

• Even if just marketing, customers now trained

• Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance

Page 34

4. Latency helps BW, but not vice versa

• Spinning the disk faster improves both bandwidth and rotational latency

» 3600 RPM ⇒ 15000 RPM = 4.2X

» Average rotational latency: 8.3 ms ⇒ 2.0 ms

» Other things being equal, also helps BW by 4.2X

• Lower DRAM latency ⇒ more accesses/second (higher bandwidth)

• Higher linear density helps disk BW (and capacity), but not disk latency

» 9,550 BPI ⇒ 533,000 BPI ⇒ 60X in BW
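The rotational-latency numbers above follow from first principles: on average the head waits half a revolution. A quick sketch:

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: half a revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm

old, new = avg_rotational_latency_ms(3600), avg_rotational_latency_ms(15000)
print(f"{old:.1f} ms -> {new:.1f} ms ({old / new:.1f}x)")  # 8.3 ms -> 2.0 ms (4.2x)
```

This is why spindle speed improves latency and bandwidth by the same factor, while linear density improves only bandwidth.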

6 Reasons Latency Lags Bandwidth (cont’d)

Page 35

5. Bandwidth hurts latency

• Queues help bandwidth, hurt latency (queuing theory)

• Adding chips to widen a memory module increases Bandwidth but higher fan-out on address lines may increase Latency

6. Operating System overhead hurts Latency more than Bandwidth

• Long messages amortize overhead; overhead bigger part of short messages

6 Reasons Latency Lags Bandwidth (cont’d)

Page 36

Summary of Technology Trends

• For disk, LAN, memory, and microprocessor, bandwidth improves by square of latency improvement

– In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X

• Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components

– Multiple processors in a cluster or even in a chip

– Multiple disks in a disk array

– Multiple memory modules in a large memory

– Simultaneous communication in switched LAN

• HW and SW developers should innovate assuming Latency Lags Bandwidth

– If everything improves at the same rate, then nothing really changes

– When rates vary, real innovation is required

Page 37

Architectural Trends

• Architecture translates technology’s gifts to performance and capability

• Resolves the tradeoff between parallelism and locality

– Current microprocessor: 1/4 compute, 1/2 cache, 1/4 off-chip connect

– Tradeoffs may change with scale and technology advances

• Four generations of architectural history: tube, transistor, IC, VLSI

– Here focus only on VLSI generation

• Greatest delineation in VLSI has been in type of parallelism exploited

Page 38

Architectural Trends

• Greatest trend in VLSI generation is increase in parallelism

– Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit

» slows after 32 bit

» adoption of 64-bit now under way, 128-bit far (not performance issue)

» great inflection point when 32-bit micro and cache fit on a chip

– Mid 80s to mid 90s: instruction level parallelism

» pipelining and simple instruction sets, + compiler advances (RISC)

» on-chip caches and functional units => superscalar execution

» greater sophistication: out-of-order execution, speculation, prediction

• to deal with control transfer and latency problems

– Next step: thread level parallelism

Page 39

Architectural Trends: ILP

• Reported speedups for superscalar processors

• Horst, Harris, and Jardine [1990]: 1.37

• Wang and Wu [1988]: 1.70

• Smith, Johnson, and Horowitz [1989]: 2.30

• Murakami et al. [1989]: 2.55

• Chang et al. [1991]: 2.90

• Jouppi and Wall [1989]: 3.20

• Lee, Kwok, and Briggs [1991]: 3.50

• Wall [1991]: 5

• Melvin and Patt [1991]: 8

• Butler et al. [1991]: 17+

• Large variance due to differences in

– application domain investigated (numerical versus non-numerical)
– capabilities of the processor modeled

Page 40

ILP Ideal Potential

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming

– real caches and non-zero miss latencies

[Figure: left chart: fraction of total cycles (%) vs. number of instructions issued (0 to 6+); right chart: speedup (0-3) vs. instructions issued per cycle (0-15)]

Page 41

Results of ILP Studies

[Figure: speedups of 1x-4x for Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, Melvin_91; bars compare 1 branch unit/real prediction vs. perfect branch prediction]

• Concentrate on parallelism for 4-issue machines

• Realistic studies show only ~2-fold speedup

• Recent studies show that finding more ILP requires looking across threads

Page 42

Architectural Trends: Bus-based MPs

• Micro on a chip makes it natural to connect many to shared memory

– dominates server and enterprise market, moving down to desktop

• Faster processors began to saturate the bus, then bus technology advanced

– today, range of sizes for bus-based systems, desktop to large servers

[Figure: number of processors (0-70) in bus-based shared-memory machines vs. year, 1984-1998; machines include Sequent B8000/B2100, Symmetry 21/81, SS690MP 120/140, SGI PowerSeries, Challenge, and PowerChallenge/XL, CRAY CS6400, AS8400, HP K400, Sun SC2000, E6000, and E10000]

Page 43

Economics

• Commodity microprocessors not only fast but CHEAP

• Development cost is tens of millions of dollars ($5-100M typical)

• BUT, many more are sold compared to supercomputers

– Crucial to take advantage of the investment, and use the commodity building block

– Exotic parallel architectures are no more than special-purpose machines

• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors

• Standardization by Intel makes small, bus-based SMPs commodity

• Desktop: few smaller processors versus one larger one?

– Multiprocessor on a chip

Page 44

Consider Scientific Supercomputing

• Proving ground and driver for innovative architecture and techniques

– Market smaller relative to commercial as MPs become mainstream

– Dominated by vector machines starting in 70s

– Microprocessors have made huge gains in floating-point performance

» high clock rates

» pipelined floating point units (e.g., multiply-add every cycle)

» instruction-level parallelism

» effective use of caches (e.g., automatic blocking)

– Plus economics

• Large-scale multiprocessors replace vector supercomputers

– Most top-performing machines on Top-500 list are multiprocessors

Page 45

Summary: Why Parallel Architecture?

• Increasingly attractive

– Economics, technology, architecture, application demand

• Increasingly central and mainstream

• Parallelism exploited at many levels

– Instruction-level parallelism

– Thread-level parallelism (multi-cores and many-cores)

– Application/node-level parallelism ("MPPs", clusters, grids, clouds)

• Focus of this class: thread-level parallelism

• Same story from memory/storage system perspective but with a “twist” (data-intensive computing, etc)

• Wide range of parallel architectures make sense

– Different cost, performance and scalability

Page 46:

Convergence of Parallel Architectures

Page 47:

History

Historically, parallel architectures were tied to programming models.

[Figure: Application Software and System Software layered over divergent architectures — SIMD, Message Passing, Shared Memory, Dataflow, Systolic Arrays]

• Divergent architectures, with no predictable pattern of growth

• Uncertainty of direction paralyzed parallel software development!

Page 48:

Today

• Extension of “computer architecture” to support communication and cooperation

– OLD: Instruction Set Architecture

– NEW: Communication Architecture

• Defines – Critical abstractions, boundaries, and primitives (interfaces)

– Organizational structures that implement interfaces (hw or sw)

• Compilers, libraries and OS are important bridges today

Page 49:

Modern Layered Framework

[Figure: modern layered framework]

Parallel applications: CAD, Database, Scientific modeling
Programming models: Multiprogramming, Shared address, Message passing, Data parallel
Communication abstraction ← user/system boundary
Compilation or library
Operating systems support
Communication hardware ← hardware/software boundary
Physical communication medium

Page 50:

Programming Model

• What programmer uses in coding applications

• Specifies communication and synchronization

• Examples:

– Multiprogramming: no communication or synchronization at program level

– Shared address space: like bulletin board

– Message passing: like letters or phone calls, explicit point to point

– Data parallel: more regimented, global actions on data

» Implemented with shared address space or message passing

Page 51:

Communication Abstraction

• User-level communication primitives provided

– Realizes the programming model

– Mapping exists between language primitives of programming model and these primitives

• Supported directly by hw, or via OS, or via user sw

• Much debate about what to support in sw, and about the gap between layers

• Today:

– Hw/sw interface tends to be flat, i.e. complexity roughly uniform

– Compilers and software play important roles as bridges today

– Technology trends exert strong influence

• Result is convergence in organizational structure

– Relatively simple, general-purpose communication primitives

Page 52:

Communication Architecture

• = User/System Interface + Implementation

• User/System Interface:

– Comm. primitives exposed to user level by hw and system-level sw

• Implementation:

– Organizational structures that implement the primitives: hw or OS

– How optimized are they? How integrated into the processing node?

– Structure of network

• Goals:

– Performance

– Broad applicability

– Programmability

– Scalability

– Low cost

Page 53:

Evolution of Architectural Models

• Historically, machines were tailored to programming models

– Prog. model, comm. abstraction, and machine organization lumped together as the “architecture”

• Evolution helps understand convergence– Identify core concepts

• Shared Address Space

• Message Passing

• Data Parallel

• Others:

– Dataflow

– Systolic Arrays

• Examine programming model, motivation, intended applications, and contributions to convergence

Page 54:

Shared Address Space Architectures

• Any processor can directly reference any memory location

– Communication occurs implicitly as a result of loads and stores

• Convenient:

– Location transparency

– Similar programming model to time-sharing on uniprocessors

» Except processes run on different processors

» Good throughput on multiprogrammed workloads

• Naturally provided on a wide range of platforms

– History dates at least to precursors of mainframes in the early 60s

– Wide range of scale: few to hundreds of processors

• Popularly known as shared memory machines or model

– Ambiguous: memory may be physically distributed among processors

Page 55:

Shared Address Space Model

• Process: virtual address space plus one or more threads of control

• Portions of address spaces of processes are shared

• Writes to shared addresses are visible to other threads (in other processes too)

• Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization

• OS uses shared memory to coordinate processes

[Figure: virtual address spaces for a collection of processes communicating via shared addresses — each process Pi has a private portion and a shared portion of its address space; shared portions map to common physical addresses in the machine's physical address space, so a store by one process (e.g., P1) is seen by a load in another (e.g., P2)]

Page 56:

Communication Hardware

• Also natural extension of uniprocessor

• Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort

Memory capacity increased by adding modules, I/O by controllers

• Add processors for processing!

• For higher-throughput multiprogramming, or parallel programs

[Figure: processors, memory modules (Mem), and I/O controllers attached to a shared interconnect, with I/O devices behind the controllers]

Page 57:

History

[Figure: “mainframe” organization — processors (P), I/O, and memory modules (M) joined by a crossbar — vs. “minicomputer” organization — processors with caches ($), memory, and I/O on a shared bus]

• “Mainframe” approach

– Motivated by multiprogramming

– Extends crossbar used for memory bandwidth and I/O

– Originally, processor cost limited scale; later, crossbar cost did

– Bandwidth scales with p

– High incremental cost; use multistage networks instead

• “Minicomputer” approach

– Almost all microprocessor systems have a bus

– Motivated by multiprogramming, TP

– Used heavily for parallel computing

– Called symmetric multiprocessor (SMP)

– Latency larger than for a uniprocessor

– Bus is bandwidth bottleneck

» Caching is key: coherence problem

– Low incremental cost

Page 58:

Example: Intel Pentium Pro Quad

– All coherence and multiprocessing glue in processor module

– Highly integrated, targeted at high volume

– Low latency and bandwidth

[Figure: four P-Pro modules (CPU, 256-KB L2 $, interrupt controller, bus interface, MIU) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller driving 1-, 2-, or 4-way interleaved DRAM and two PCI bridges to PCI buses with I/O cards]

Page 59:

Example: SUN Enterprise

– 16 cards of either type: processors + memory, or I/O

– All memory accessed over bus, so symmetric

– Higher bandwidth, higher latency bus

[Figure: CPU/mem cards (two processors, each with $ and L2 $, plus memory controller) and I/O cards (SBUS slots, 100bT, SCSI, 2 Fibre Channel) attached via bus interface/switch to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]

Page 60:

Scaling Up

– Problem is interconnect: cost (crossbar) or bandwidth (bus)

– Dance-hall: bandwidth still scalable, but lower cost than crossbar

» latencies to memory uniform, but uniformly large

– Distributed memory or non-uniform memory access (NUMA)

» Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)

– Caching shared (particularly nonlocal) data?

[Figure: “dance hall” organization — processors with caches ($) on one side of the network, memory modules (M) on the other — vs. distributed-memory organization — a memory module attached to each processor node, nodes joined by the network]

Page 61:

Example: Cray T3E

– Scales up to 1024 processors, 480 MB/s links

– Memory controller generates comm. request for nonlocal references

– No hardware mechanism for coherence (SGI Origin etc. provide this)

[Figure: Cray T3E node — processor with $, memory controller and network interface, local memory, and external I/O, attached to a switch with X, Y, Z links]

Page 62:

Message Passing Architectures

• Complete computer as building block, including I/O

– Communication via explicit I/O operations

• Programming model: directly access only private address space (local memory), comm. via explicit messages (send/receive)

• High-level block diagram similar to distributed-memory SAS

– But comm. integrated at I/O level, needn't be into memory system

– Like networks of workstations (clusters), but tighter integration

– Easier to build than scalable SAS

• Programming model more removed from basic hardware operations

– Library or OS intervention

Page 63:

Message-Passing Abstraction

– Send specifies buffer to be transmitted and receiving process

– Recv specifies sending process and application storage to receive into

– Memory-to-memory copy, but need to name processes

– Optional tag on send and matching rule on receive

– User process names local data and entities in process/tag space too

– In simplest form, the send/recv match achieves a pairwise synchronization event

» Other variants too

– Many overheads: copying, buffer management, protection

[Figure: local address spaces of processes P and Q; P's “Send X, Q, t” matches Q's “Receive Y, P, t”, copying the data at address X in P to address Y in Q]

Page 64:

Evolution of Message-Passing Machines

• Early machines: FIFO on each link

– Hw close to programming model; synchronous ops

– Replaced by DMA, enabling non-blocking ops

» Buffered by system at destination until recv

• Diminishing role of topology

– Store-and-forward routing: topology important

– Introduction of pipelined routing made it less so

– Cost is in node-network interface

– Simplifies programming

[Figure: hypercube topology over nodes 000–111]

Page 65:

Example: IBM SP-2

– Made out of essentially complete RS6000 workstations

– Network interface integrated in I/O bus (bw limited by I/O bus)

[Figure: IBM SP-2 node — Power 2 CPU with L2 $ on the memory bus, memory controller with 4-way interleaved DRAM, and a NIC (i860, NI, DMA) on the MicroChannel I/O bus; nodes joined by a general interconnection network formed from 8-port switches]

Page 66:

Example: Intel Paragon

[Figure: Intel Paragon node — two i860 processors with L1 $ and an NI/DMA driver on a 64-bit, 50 MHz memory bus, memory controller with 4-way interleaved DRAM; a processing node attached to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links. Pictured: Sandia's Intel Paragon XP/S-based supercomputer]

Page 67:

Toward Architectural Convergence

• Evolution and role of software have blurred the boundary

– Send/recv supported on SAS machines via buffers

– Can construct global address space on MP using hashing

– Page-based (or finer-grained) shared virtual memory

• Hardware organization converging too

– Tighter NI integration even for MP (low-latency, high-bandwidth)

– At lower level, even hardware SAS passes hardware messages

• Even clusters of workstations/SMPs are parallel systems

– Emergence of fast system area networks (SANs)

• Programming models distinct, but organizations converging

– Nodes connected by general network and communication assists

– Implementations also converging, at least in high-end machines

Page 68:

Data Parallel Systems

• Programming model

– Operations performed in parallel on each element of data structure

– Logically single thread of control, performs sequential or parallel steps

– Conceptually, a processor associated with each data element

• Architectural model

– Array of many simple, cheap processors with little memory each

» Processors don’t sequence through instructions

– Attached to a control processor that issues instructions

– Specialized and general communication, cheap global synchronization

[Figure: grid of PEs attached to a control processor that issues instructions]

Original motivations:

• Matches simple differential equation solvers

• Centralize high cost of instruction fetch/sequencing

Page 69:

Application of Data Parallelism

– Each PE contains an employee record with his/her salary

If salary > 100K then

salary = salary *1.05

else

salary = salary *1.10

– Logically, the whole operation is a single step

– Some processors enabled for arithmetic operation, others disabled

• Other examples:

– Finite differences, linear algebra, ...

– Document searching, graphics, image processing, ...

• Some early machines:

– Thinking Machines CM-1, CM-2 (and CM-5)

– Maspar MP-1 and MP-2

• Today’s on-chip examples:

– IBM’s Cell processor

– GPU, GPGPU
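The salary update above is a one-step data-parallel operation; written out in plain Python (a stand-in for the PE array — not how a CM-2 or a GPU kernel would literally express it), the predicate plays the role of enabling and disabling PEs:

```python
# One employee record per logical PE.
salaries = [80_000, 120_000, 95_000, 150_000]

# Logically a single step: every PE evaluates the predicate; the enabled
# subset applies one multiplier and the disabled subset the other.
salaries = [s * 1.05 if s > 100_000 else s * 1.10 for s in salaries]
print(salaries)
```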

Page 70:

Dataflow Architectures

• Represent computation as a graph of essential dependences

– Logical processor at each node, activated by availability of operands

– Messages (tokens) carrying tag of next instruction sent to next processor

– Tag compared with others in matching store; a match fires execution

[Figure: dataflow graph for a = (b + 1) × (b − c); d = c × e; f = a × d, and a dataflow processor pipeline — token store, waiting/matching, instruction fetch from program store, execute, form token, token queue, all connected by the network]
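The firing rule can be mimicked in a few lines (a toy sketch with assumed input values b = 3, c = 1, e = 2; real machines match token tags in hardware): each node executes as soon as all of its operand tokens are present.

```python
# Toy dataflow firing: a node fires when all its operand tokens have arrived.
nodes = {
    "a": (("b", "c"), lambda v: (v["b"] + 1) * (v["b"] - v["c"])),
    "d": (("c", "e"), lambda v: v["c"] * v["e"]),
    "f": (("a", "d"), lambda v: v["a"] * v["d"]),
}
tokens = {"b": 3.0, "c": 1.0, "e": 2.0}    # initial input tokens

while len(tokens) < len(nodes) + 3:        # until every node has produced a token
    for name, (deps, fire) in nodes.items():
        if name not in tokens and all(d in tokens for d in deps):
            tokens[name] = fire(tokens)    # operands matched: fire the node
print(tokens["f"])  # 16.0, since a = 8 and d = 2
```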

Page 71:

Evolution and Convergence

• Key characteristics

– Ability to name operations, synchronization, dynamic scheduling

• Problems

– Operations have locality across them, useful to group together

– Handling complex data structures like arrays

– Complexity of matching store and memory units

– Expose too much parallelism (?)

• Converged to use conventional processors and memory

– Support for large, dynamic set of threads to map to processors

– Typically shared address space as well

– But separation of progr. model from hardware (like data-parallel)

• Lasting contributions:

– Integration of communication with thread (handler) generation

– Tightly integrated communication and fine-grained synchronization

– Remained a useful concept for software (compilers etc.)

Page 72:

Systolic Architectures

– Replace single processor with array of regular processing elements

– Orchestrate data flow for high throughput with less memory access

• Different from pipelining: nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory

• Different from SIMD: each PE may do something different

• Initial motivation: VLSI enables inexpensive special-purpose chips

• Represent algorithms directly by chips connected in regular pattern

[Figure: conventional processor–memory pair vs. a memory feeding a linear array of PEs]

Page 73:

Systolic Arrays (contd.)

– Practical realizations (e.g. iWARP) use quite general processors

» Enable variety of algorithms on same hardware

– But dedicated interconnect channels

» Data transfer directly from register to register across channel

– Specialized, and same problems as SIMD

» General-purpose systems work well for the same algorithms (locality etc.)

Example: systolic array for 1-D convolution

y(i) = w1 × x(i) + w2 × x(i + 1) + w3 × x(i + 2) + w4 × x(i + 3)

[Figure: x values (x1, x2, ...) stream through a chain of cells holding weights w4, w3, w2, w1, producing y1, y2, y3, ...; each cell computes xout = xin, yout = yin + w × xin]
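A functional (not cycle-accurate) simulation of the array: each pass over j below plays the role of one weight-holding cell applying yout = yin + w × xin as the x stream flows past.

```python
def systolic_conv(x, w):
    """1-D convolution as computed by a chain of len(w) systolic cells."""
    n = len(x) - len(w) + 1
    y = [0.0] * n                      # partial sums streaming through the cells
    for j, wj in enumerate(w):         # one iteration per PE (cell j holds w[j])
        for i in range(n):
            y[i] += wj * x[i + j]      # the cell's yout = yin + w * xin
    return y

print(systolic_conv([1, 2, 3, 4, 5, 6], [1, 0, 2, 0]))  # [7.0, 10.0, 13.0]
```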

Page 74:

Convergence: Generic Parallel Architecture

• A generic modern multiprocessor

• Node: processor(s), memory system, plus communication assist

– Network interface and communication controller

• Scalable network

• Convergence allows lots of innovation, now within a framework

– Integration of assist with node, what operations, how efficiently...

[Figure: nodes — processor (P), cache ($), memory (Mem), and communication assist (CA) — connected by a scalable network]

Page 75:

Fundamental Design Issues

Page 76:

Understanding Parallel Architecture

• Traditional taxonomies not very useful

• Programming models not enough, nor hardware structures

– Same one can be supported by radically different architectures

• Architectural distinctions that affect software

– Compilers, libraries, programs

• Design of user/system and hardware/software interface

– Constrained from above by progr. models and below by technology

• Guiding principles provided by layers

– What primitives are provided at communication abstraction

– How programming models map to these

– How they are mapped to hardware

Page 77:

Fundamental Design Issues

• At any layer, interface (contract) aspect and performance aspects

– Naming: How are logically shared data and/or processes referenced?

– Operations: What operations are provided on these data?

– Ordering: How are accesses to data ordered and coordinated?

– Replication: How are data replicated to reduce communication?

– Communication Cost: Latency, bandwidth, overhead, occupancy

• Understand at programming model first, since that sets requirements

• Other issues

– Node granularity: How to split between processors and memory?

– ...

Page 78:

Sequential Programming Model

• Contract

– Naming: Can name any variable in virtual address space

» Hardware (and perhaps compilers) does translation to physical addresses

– Operations: Loads and Stores

– Ordering: Sequential program order

• Performance

– Rely on dependences on single location (mostly): dependence order

– Compilers and hardware violate other orders without getting caught

– Compiler: reordering and register allocation

– Hardware: out of order, pipeline bypassing, write buffers

– Transparent replication in caches

Page 79:

Shared Address Space (SAS) Programming Model

• Naming: Any process can name any variable in shared space

• Operations: loads and stores, plus those needed for ordering

• Simplest ordering model:

– Within a process/thread: sequential program order

– Across threads: some interleaving (as in time-sharing)

– Additional orders through synchronization

– Again, compilers/hardware can violate orders without getting caught

» Different, more subtle ordering models also possible (discussed later)

Page 80:

Synchronization

• Mutual exclusion (locks)

– Ensure certain operations on certain data can be performed by only one process at a time

– Room that only one person can enter at a time

– No ordering guarantees

• Event synchronization

– Ordering of events to preserve dependences

» e.g. producer —> consumer of data

– 3 main types:

» point-to-point

» global

» group
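Both flavors in miniature (Python threading purely as an illustration): a Lock gives mutual exclusion with no ordering guarantee, while an Event enforces the producer → consumer dependence — point-to-point event synchronization.

```python
import threading

data = None
ready = threading.Event()      # point-to-point event synchronization

def producer():
    global data
    data = 42                  # write *before* signaling: preserves the dependence
    ready.set()

def consumer(out):
    ready.wait()               # blocks until the producer signals
    out.append(data)           # guaranteed to see the produced value

out = []
c = threading.Thread(target=consumer, args=(out,))
p = threading.Thread(target=producer)
c.start()
p.start()
c.join()
p.join()
print(out)  # [42]
```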

Page 81:

Message Passing Programming Model

• Naming: Processes can name private data directly

– No shared address space

• Operations: Explicit communication through send and receive

– Send transfers data from private address space to another process

– Receive copies data from process to private address space

– Must be able to name processes

• Ordering:

– Program order within a process

– Send and receive can provide pt to pt synch between processes

– Mutual exclusion inherent

• Can construct global address space:

– Process number + address within process address space

– But no direct operations on these names

Page 82:

Design Issues Apply at All Layers

• Prog. model’s position provides constraints/goals for system

• In fact, each interface between layers supports or takes a position on:

– Naming model

– Set of operations on names

– Ordering model

– Replication

– Communication performance

• Any set of positions can be mapped to any other by software

• Let’s see issues across layers

– How lower layers can support contracts of programming models

– Performance issues

Page 83:

Naming and Operations

• Naming and operations in programming model can be directly supported by lower levels, or translated by compiler, libraries or OS

• Example: Shared virtual address space in programming model

• Hardware interface supports shared physical address space

– Direct support by hardware through v-to-p mappings, no software layers

• Hardware supports independent physical address spaces

– Can provide SAS through OS, so in system/user interface

» v-to-p mappings only for data that are local

» remote data accesses incur page faults; brought in via page fault handlers

» same programming model, different hardware requirements and cost model

– Or through compilers or runtime, so above sys/user interface

» shared objects, instrumentation of shared accesses, compiler support

Page 84:

Naming and Operations (contd)

• Example: Implementing Message Passing

• Direct support at hardware interface

– But match and buffering benefit from more flexibility

• Support at sys/user interface or above in software (almost always)

– Hardware interface provides basic data transport (well suited)

– Send/receive built in sw for flexibility (protection, buffering)

– Choices at user/system interface:

» OS each time: expensive

» OS sets up once/infrequently, then little sw involvement each time

– Or lower interfaces provide SAS, and send/receive built on top with buffers and loads/stores

• Need to examine the issues and tradeoffs at every layer

– Frequencies and types of operations, costs

Page 85:

Ordering

• Message passing: no assumptions on orders across processes except those imposed by send/receive pairs

• SAS: How processes see the order of other processes’ references defines semantics of SAS

– Ordering very important and subtle

– Uniprocessors play tricks with orders to gain parallelism or locality

– These are more important in multiprocessors

– Need to understand which old tricks are valid, and learn new ones

– How programs behave, what they rely on, and hardware implications

Page 86:

Replication

• Very important for reducing data transfer/communication

• Again, depends on naming model

• Uniprocessor: caches do it automatically

– Reduce communication with memory

• Message passing naming model at an interface

– A receive replicates, giving a new name; subsequently use new name

– Replication is explicit in software above that interface

• SAS naming model at an interface

– A load brings in data transparently, so can replicate transparently

– Hardware caches do this, e.g. in shared physical address space

– OS can do it at page level in shared virtual address space, or objects

– No explicit renaming, many copies for same name: coherence problem

» in uniprocessors, “coherence” of copies is natural in memory hierarchy

Page 87:

Communication Performance

• Performance characteristics determine usage of operations at a layer

– Programmer, compilers etc make choices based on this

• Fundamentally, three characteristics:

– Latency: time taken for an operation

– Bandwidth: rate of performing operations

– Cost: impact on execution time of program

• If the processor does one thing at a time: bandwidth ≈ 1/latency

– But actually more complex in modern systems

• Characteristics apply to overall operations, as well as individual components of a system, however small

• We’ll focus on communication or data transfer across nodes
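The three characteristics are often combined into a first-order transfer-time model (the linear model below is a common textbook assumption, not a formula from the slide): moving n bytes takes roughly the startup latency plus n divided by peak bandwidth, so small messages see only a fraction of peak.

```python
def transfer_time(n_bytes, latency_s, peak_Bps):
    """First-order model: T(n) = L + n / B."""
    return latency_s + n_bytes / peak_Bps

def effective_bandwidth(n_bytes, latency_s, peak_Bps):
    """Delivered rate n / T(n); approaches the peak only for large messages."""
    return n_bytes / transfer_time(n_bytes, latency_s, peak_Bps)

# Assumed link: 1 us startup latency, 1 GB/s peak bandwidth.
print(effective_bandwidth(64, 1e-6, 1e9))         # ~6.0e7 B/s -- far below peak
print(effective_bandwidth(1_000_000, 1e-6, 1e9))  # ~1.0e9 B/s -- near peak
```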

Page 88:

Outline

• Computer Science at a Crossroads: Parallelism

– Architecture: multi-core and many-cores

– Program: multi-threading

• Parallel Architecture

– What is Parallel Architecture?

– Why Parallel Architecture?

– Evolution and Convergence of Parallel Architectures

– Fundamental Design Issues

• Parallel Programs

– Why bother with programs?

– Important for whom?

• Memory & Storage Subsystem Architectures

Page 89:

Why Bother with Programs?

• They’re what runs on the machines we design– Helps make design decisions

– Helps evaluate systems tradeoffs

• Led to the key advances in uniprocessor architecture

– Caches and instruction set design

• More important in multiprocessors– New degrees of freedom

– Greater penalties for mismatch between program and architecture

04/18/23 89CSCE930-Advanced Computer Architecture, Introduction

Page 90: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Important for Whom?

• Algorithm designers– Designing algorithms that will run well on real systems

• Programmers– Understanding key issues and obtaining best performance

• Architects– Understand workloads, interactions, important degrees of freedom

– Valuable for design and for evaluation

04/18/23 90CSCE930-Advanced Computer Architecture, Introduction

Page 91: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Software

• 1. Parallel programs– Process of parallelization– What parallel programs look like in major programming models

• 2. Programming for performance– Key performance issues and architectural interactions

• 3. Workload-driven architectural evaluation– Beneficial for architects and for users in procuring machines

• Unlike on sequential systems, can’t take workload for granted

– Software base not mature; evolves with architectures for performance

– So need to open the box

• Let’s begin with parallel programs ...

04/18/23 91CSCE930-Advanced Computer Architecture, Introduction

Page 92: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Outline

• Motivating Problems (application case studies)

• Steps in creating a parallel program

• What a simple parallel program looks like– In the three major programming models

– What primitives must a system support?

• Later: Performance issues and architectural interactions

04/18/23 92CSCE930-Advanced Computer Architecture, Introduction

Page 93: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Motivating Problems

• Simulating Ocean Currents– Regular structure, scientific computing

• Simulating the Evolution of Galaxies– Irregular structure, scientific computing

• Rendering Scenes by Ray Tracing– Irregular structure, computer graphics

• Data Mining– Irregular structure, information processing

– Not discussed here (read in book)

04/18/23 93CSCE930-Advanced Computer Architecture, Introduction

Page 94: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Simulating Ocean Currents

– Model as two-dimensional grids– Discretize in space and time

» finer spatial and temporal resolution => greater accuracy

– Many different computations per time step» set up and solve equations

– Concurrency across and within grid computations

(a) Cross sections (b) Spatial discretization of a cross section

04/18/23 94CSCE930-Advanced Computer Architecture, Introduction

Page 95: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Simulating Galaxy Evolution

– Simulate the interactions of many stars evolving over time

– Computing forces is expensive

– O(n2) brute force approach

– Hierarchical Methods take advantage of force law: F = G·m1·m2 / r²

•Many time-steps, plenty of concurrency across stars within one

(Figure: a star on which forces are being computed; a star too close to approximate; a small group far enough away to approximate to its center of mass; a large group far enough away to approximate)

04/18/23 95CSCE930-Advanced Computer Architecture, Introduction

Page 96: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Rendering Scenes by Ray Tracing

– Shoot rays into scene through pixels in image plane

– Follow their paths

» they bounce around as they strike objects

» they generate new rays: ray tree per input ray

– Result is color and opacity for that pixel

– Parallelism across rays

• All case studies have abundant concurrency

04/18/23 96CSCE930-Advanced Computer Architecture, Introduction

Page 97: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Creating a Parallel Program

• Assumption: Sequential algorithm is given– Sometimes need very different algorithm, but beyond scope

• Pieces of the job:– Identify work that can be done in parallel

– Partition work and perhaps data among processes

– Manage data access, communication and synchronization

– Note: work includes computation, data access and I/O

• Main goal: Speedup (plus low prog. effort and resource needs)

• Speedup(p) = Performance(p) / Performance(1)

• For a fixed problem:

• Speedup(p) = Time(1) / Time(p)

04/18/23 97CSCE930-Advanced Computer Architecture, Introduction

Page 98: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Steps in Creating a Parallel Program

• 4 steps: Decomposition, Assignment, Orchestration, Mapping

– Done by programmer or system software (compiler, runtime, ...)– Issues are the same, so assume programmer does it all explicitly

P0

Tasks Processes Processors

P1

P2 P3

p0 p1

p2 p3

p0 p1

p2 p3

Partitioning

Sequentialcomputation

Parallelprogram

Assignment

Decomposition

Mapping

Orchestration

04/18/23 98CSCE930-Advanced Computer Architecture, Introduction

Page 99: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Some Important Concepts
• Task:

– Arbitrary piece of undecomposed work in parallel computation

– Executed sequentially; concurrency is only across tasks

– E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace

– Fine-grained versus coarse-grained tasks

• Process (thread): – Abstract entity that performs the tasks assigned to processes

– Processes communicate and synchronize to perform their tasks

• Processor: – Physical engine on which process executes

– Processes virtualize machine to programmer

» first write program in terms of processes, then map to processors

04/18/23 99CSCE930-Advanced Computer Architecture, Introduction

Page 100: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Decomposition

• Break up computation into tasks to be divided among processes

– Tasks may become available dynamically

– No. of available tasks may vary with time

• i.e. identify concurrency and decide level at which to exploit it

• Goal: Enough tasks to keep processes busy, but not too many

– No. of tasks available at a time is upper bound on achievable speedup

04/18/23 100CSCE930-Advanced Computer Architecture, Introduction

Page 101: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Limited Concurrency: Amdahl’s Law

– Most fundamental limitation on parallel speedup

– If fraction s of seq execution is inherently serial, speedup <= 1/s

– Example: 2-phase calculation

» sweep over n-by-n grid and do some independent computation

» sweep again and add each value to global sum

– Time for first phase = n²/p

– Second phase serialized at global variable, so time = n²

– Speedup ≤ 2n² / (n²/p + n²), or at most 2

– Trick: divide second phase into two
» accumulate into private sum during sweep
» add per-process private sum into global sum

– Parallel time is n²/p + n²/p + p, and speedup at best 2n²p / (2n² + p²)

04/18/23 101CSCE930-Advanced Computer Architecture, Introduction
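The two-phase arithmetic can be checked numerically; a minimal sketch (function names are ours, not from the slides):

```python
def speedup_serialized(n, p):
    """Phase 1 parallel (n^2/p), phase 2 serialized at the global sum (n^2)."""
    return 2 * n**2 / (n**2 / p + n**2)

def speedup_private_sums(n, p):
    """Both sweeps parallel (n^2/p each) plus p serial additions of the
    per-process private sums: 2*n^2 / (2*n^2/p + p) = 2*n^2*p / (2*n^2 + p^2)."""
    return 2 * n**2 / (2 * n**2 / p + p)

# The serialized version is capped at 2 no matter how many processors;
# with private sums the speedup approaches p for large n.
```

For n = 1000 and p = 10, the private-sum version is within 0.01 of the ideal speedup of 10, while the serialized version never reaches 2.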

Page 102: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Pictorial Depiction

(Figure: work done concurrently vs. time for the three versions. (a) Phase 1 runs at concurrency p for time n²/p, then phase 2 runs serially for time n². (b) Sweep work runs at concurrency p, with the accumulations into the global sum serialized. (c) With private sums, both phases run at concurrency p for n²/p each, followed by a short serial accumulation of p values.)

04/18/23 102CSCE930-Advanced Computer Architecture, Introduction

Page 103: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Concurrency Profiles

– Area under curve is total work done, or time with 1 processor
– Horizontal extent is lower bound on time (infinite processors)

– Speedup is the ratio: Speedup(p) = [ Σ_{k=1..∞} f_k · k ] / [ Σ_{k=1..∞} f_k · ⌈k/p⌉ ]; base case (serial fraction s): 1 / (s + (1−s)/p)

– Amdahl’s law applies to any overhead, not just limited concurrency

• Cannot usually divide into serial and parallel part

(Figure: concurrency profile of a sample application — degree of concurrency, 0 to 1,400, plotted against clock cycle number, roughly cycles 150 to 733)

04/18/23 103CSCE930-Advanced Computer Architecture, Introduction
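The profile-based speedup expression can be evaluated directly. In this sketch (our naming), f_k is the time spent at degree of concurrency k when processors are unlimited:

```python
import math

def profile_speedup(profile, p):
    """profile: {k: f_k}, time spent at concurrency degree k with unlimited
    processors. Total work = sum f_k * k; with p processors each k-wide
    slice of the profile takes ceil(k/p) steps."""
    work = sum(f * k for k, f in profile.items())
    time_p = sum(f * math.ceil(k / p) for k, f in profile.items())
    return work / time_p

# Work split between concurrency 1 and concurrency 100: the serial
# portion limits the speedup, exactly as Amdahl's law predicts.
s = profile_speedup({1: 50, 100: 50}, 100)
```

A profile entirely at concurrency 1 gives speedup 1 on any machine; a profile entirely at concurrency k ≤ p gives speedup k.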

Page 104: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Assignment
• Specifying mechanism to divide work up among processes
– E.g. which process computes forces on which stars, or which rays
– Together with decomposition, also called partitioning
– Balance workload, reduce communication and management cost

• Structured approaches usually work well
– Code inspection (parallel loops) or understanding of application
– Well-known heuristics
– Static versus dynamic assignment

• As programmers, we worry about partitioning first
– Usually independent of architecture or prog model
– But cost and complexity of using primitives may affect decisions

• As architects, we assume program does reasonable job of it

04/18/23 104CSCE930-Advanced Computer Architecture, Introduction

Page 105: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Orchestration
– Naming data

– Structuring communication

– Synchronization

– Organizing data structures and scheduling tasks temporally

• Goals
– Reduce cost of communication and synch. as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce overhead of parallelism management

• Closest to architecture (and programming model & language)

– Choices depend a lot on comm. abstraction, efficiency of primitives

– Architects should provide appropriate primitives efficiently

04/18/23 105CSCE930-Advanced Computer Architecture, Introduction

Page 106: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Mapping• After orchestration, already have parallel program

• Two aspects of mapping:– Which processes will run on same processor, if necessary– Which process runs on which particular processor

» mapping to a network topology

• One extreme: space-sharing– Machine divided into subsets, only one app at a time in a subset

– Processes can be pinned to processors, or left to OS

• Another extreme: complete resource management control to OS

– OS uses the performance techniques we will discuss later

• Real world is between the two– User specifies desires in some aspects, system may ignore

• Usually adopt the view: process <-> processor

04/18/23 106CSCE930-Advanced Computer Architecture, Introduction

Page 107: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Parallelizing Computation vs. Data

• Above view is centered around computation– Computation is decomposed and assigned (partitioned)

• Partitioning Data is often a natural view too– Computation follows data: owner computes

– Grid example; data mining; High Performance Fortran (HPF)

• But not general enough
– Distinction between comp. and data stronger in many applications
» Barnes-Hut, Raytrace (later)

– Retain computation-centric view

– Data access and communication is part of orchestration

04/18/23 107CSCE930-Advanced Computer Architecture, Introduction

Page 108: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

High-level Goals

• High performance (speedup over sequential program), but low resource usage and development effort

• Implications for algorithm designers and architects
– Algorithm designers: high-perf., low resource needs
– Architects: high-perf., low cost, reduced programming effort

» e.g. gradually improving perf. with programming effort may be preferable to sudden threshold after large programming effort

Table 2.1 Steps in the Parallelization Process and Their Goals

Step          | Architecture-Dependent? | Major Performance Goals
Decomposition | Mostly no               | Expose enough concurrency but not too much
Assignment    | Mostly no               | Balance workload; reduce communication volume
Orchestration | Yes                     | Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping       | Yes                     | Put related processes on the same processor if necessary; exploit locality in network topology

04/18/23 108CSCE930-Advanced Computer Architecture, Introduction

Page 109: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

What Parallel Programs Look Like

Page 110: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Parallelization of An Example Program
• Motivating problems all lead to large, complex programs

• Examine a simplified version of a piece of Ocean simulation

– Iterative equation solver

• Illustrate parallel program in low-level parallel language

– C-like pseudocode with simple extensions for parallelism

– Expose basic comm. and synch. primitives that must be supported

– State of most real parallel programming today

04/18/23 110CSCE930-Advanced Computer Architecture, Introduction

Page 111: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Grid Solver Example

– Simplified version of solver in Ocean simulation

– Gauss-Seidel (near-neighbor) sweeps to convergence
» interior n-by-n points of (n+2)-by-(n+2) grid updated in each sweep
» updates done in-place in grid, and diff. from prev. value computed
» accumulate partial diffs into global diff at end of every sweep
» check if error has converged (to within a tolerance parameter)
» if so, exit solver; if not, do another sweep

Expression for updating each interior point:

A[i,j] = 0.2 × (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

04/18/23 111CSCE930-Advanced Computer Architecture, Introduction

Page 112: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

1.  int n;                    /*size of matrix: (n+2)-by-(n+2) elements*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n);                /*read input parameter: matrix size*/
6.    A ← malloc(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);          /*initialize the matrix A somehow*/
8.    Solve(A);               /*call the routine to solve equation*/
9.  end main
10. procedure Solve(A)        /*solve the equation system*/
11.   float **A;              /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do        /*outermost loop over sweeps*/
16.     diff = 0;             /*initialize maximum difference to 0*/
17.     for i ← 1 to n do     /*sweep over nonborder points of grid*/
18.       for j ← 1 to n do
19.         temp = A[i,j];    /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);  /*compute average*/
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

04/18/23 112CSCE930-Advanced Computer Architecture, Introduction
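A directly runnable transliteration of the sequential solver into Python (our transliteration; the pseudocode above is the authoritative form):

```python
def solve(A, tol=1e-5):
    """Gauss-Seidel sweeps over the interior of an (n+2)x(n+2) grid,
    updating in place until the average change per point is below tol.
    Returns the number of sweeps performed."""
    n = len(A) - 2
    sweeps = 0
    done = False
    while not done:
        diff = 0.0
        for i in range(1, n + 1):          # sweep over nonborder points
            for j in range(1, n + 1):
                temp = A[i][j]             # save old value of element
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                 A[i][j+1] + A[i+1][j])
                diff += abs(A[i][j] - temp)
        sweeps += 1
        if diff / (n * n) < tol:
            done = True
    return sweeps

# 4x4 grid: border fixed at 1.0, interior starts at 0.0 and relaxes toward 1.0
grid = [[1.0] * 4 for _ in range(4)]
grid[1][1] = grid[1][2] = grid[2][1] = grid[2][2] = 0.0
n_sweeps = solve(grid)
```

With all border points held at 1.0, the interior converges toward 1.0, which makes the fixed point easy to verify by hand.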

Page 113: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Decomposition

• Simple way to identify concurrency is to look at loop iterations
– dependence analysis; if not enough concurrency, then look further

• Not much concurrency here at this level (all loops sequential)

• Examine fundamental dependences, ignoring loop structure
– Concurrency O(n) along anti-diagonals, serialization O(n) along diag.
– Retain loop structure, use pt-to-pt synch; problem: too many synch ops
– Restructure loops, use global synch; imbalance and too much synch

04/18/23 113CSCE930-Advanced Computer Architecture, Introduction

Page 114: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Exploit Application Knowledge

• Reorder grid traversal: red-black ordering
– Different ordering of updates: may converge quicker or slower
– Red sweep and black sweep are each fully parallel
– Global synch between them (conservative but convenient)
– Ocean uses red-black; we use simpler, asynchronous one to illustrate
» no red-black, simply ignore dependences within sweep
» sequential order same as original; parallel program is nondeterministic

(Figure: checkerboard of red and black grid points)

04/18/23 114CSCE930-Advanced Computer Architecture, Introduction
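A sketch of one red-black sweep (our helper, following the slide's description: points where i+j is even form one color, and each half-sweep touches only one color, so it is fully parallel):

```python
def red_black_sweep(A):
    """One red-black sweep over the interior of an (n+2)x(n+2) grid:
    update all 'red' points ((i+j) even), then all 'black' points
    ((i+j) odd). Within one color, no point reads another point of the
    same color, so each half-sweep could run fully in parallel."""
    n = len(A) - 2
    diff = 0.0
    for parity in (0, 1):                  # red half-sweep, then black
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (i + j) % 2 != parity:
                    continue
                temp = A[i][j]
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                 A[i][j+1] + A[i+1][j])
                diff += abs(A[i][j] - temp)
    return diff
```

Only a global synchronization between the red and black half-sweeps is needed, which is the conservative-but-convenient point the slide makes.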

Page 115: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Decomposition Only

– Decomposition into elements: degree of concurrency n2

– To decompose into rows, make line 18 loop sequential; degree n– for_all leaves assignment left to system

» but implicit global synch. at end of for_all loop

15. while (!done) do              /*a sequential loop*/
16.   diff = 0;
17.   for_all i ← 1 to n do       /*a parallel loop nest*/
18.     for_all j ← 1 to n do
19.       temp = A[i,j];
20.       A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                A[i,j+1] + A[i+1,j]);
22.       diff += abs(A[i,j] - temp);
23.     end for_all
24.   end for_all
25.   if (diff/(n*n) < TOL) then done = 1;
26. end while

04/18/23 115CSCE930-Advanced Computer Architecture, Introduction

Page 116: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Assignment

• Static assignments (given decomposition into rows)
– block assignment of rows: row i is assigned to process ⌊i/b⌋, where b = n/p rows per process
– cyclic assignment of rows: process i is assigned rows i, i+p, and so on

– Dynamic assignment
» get a row index, work on the row, get a new row, and so on

– Static assignment into rows reduces concurrency (from n to p)
» block assign. reduces communication by keeping adjacent rows together

– Let’s dig into orchestration under three programming models

(Figure: block assignment of contiguous row blocks to processes P0, P1, P2, …)

04/18/23 116CSCE930-Advanced Computer Architecture, Introduction
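The two static assignments can be written down directly (a sketch with our function names; block assignment assumes nprocs divides n evenly, as the slides also assume):

```python
def block_owner(i, n, nprocs):
    """Block assignment: row i (0-based) goes to process i // b,
    where b = n // nprocs is the number of rows per process."""
    return i // (n // nprocs)

def cyclic_owner(i, nprocs):
    """Cyclic assignment: process i % nprocs owns rows i, i+nprocs, ..."""
    return i % nprocs
```

Block assignment keeps a process's rows adjacent (less border communication); cyclic assignment interleaves rows (better load balance when work per row varies).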

Page 117: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Data Parallel Solver
1.  int n, nprocs;      /*grid size (n+2-by-n+2) and number of processes*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n); read(nprocs);   /*read input grid size and number of processes*/
6.    A ← G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);           /*initialize the matrix A somehow*/
8.    Solve(A);                /*call the routine to solve equation*/
9.  end main
10. procedure Solve(A)         /*solve the equation system*/
11.   float **A;               /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.  DECOMP A[BLOCK,*, nprocs];
15.   while (!done) do         /*outermost loop over sweeps*/
16.     mydiff = 0;            /*initialize maximum difference to 0*/
17.     for_all i ← 1 to n do  /*sweep over non-border points of grid*/
18.       for_all j ← 1 to n do
19.         temp = A[i,j];     /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);  /*compute average*/
22.         mydiff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.    REDUCE(mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

04/18/23 117CSCE930-Advanced Computer Architecture, Introduction

Page 118: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Shared Address Space Solver

– Assignment controlled by values of variables used as loop bounds

Single Program Multiple Data (SPMD)

(Figure: SPMD structure — each process executes Solve; within it, a sweep followed by a convergence test)

04/18/23 118CSCE930-Advanced Computer Architecture, Introduction

Page 119: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

1.  int n, nprocs;      /*matrix dimension and number of processors to be used*/
2a. float **A, diff;    /*A is global (shared) array representing the grid;
                          diff is global (shared) maximum difference in current sweep*/
2b. LOCKDEC(diff_lock); /*declaration of lock to enforce mutual exclusion*/
2c. BARDEC(bar1);       /*barrier declaration for global synchronization between sweeps*/
3.  main()
4.  begin
5.    read(n); read(nprocs);  /*read input matrix size and number of processes*/
6.    A ← G_MALLOC(a two-dimensional array of size n+2 by n+2 doubles);
7.    initialize(A);          /*initialize A in an unspecified way*/
8a.   CREATE(nprocs-1, Solve, A);
8.    Solve(A);               /*main process becomes a worker too*/
8b.   WAIT_FOR_END(nprocs-1); /*wait for all child processes created to terminate*/
9.  end main
10. procedure Solve(A)
11.   float **A;      /*A is entire n+2-by-n+2 shared array, as in the sequential program*/
12. begin
13.   int i, j, pid, done = 0;
14.   float temp, mydiff = 0;            /*private variables*/
14a.  int mymin = 1 + (pid * n/nprocs);  /*assume that n is exactly divisible by*/
14b.  int mymax = mymin + n/nprocs - 1;  /*nprocs for simplicity here*/
15.   while (!done) do            /*outermost loop over sweeps*/
16.     mydiff = diff = 0;        /*set global diff to 0 (okay for all to do it)*/
16a.    BARRIER(bar1, nprocs);    /*ensure all reach here before anyone modifies diff*/
17.     for i ← mymin to mymax do /*for each of my rows*/
18.       for j ← 1 to n do       /*for all nonborder elements in that row*/
19.         temp = A[i,j];
20.         A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);
22.         mydiff += abs(A[i,j] - temp);
23.       endfor
24.     endfor
25a.    LOCK(diff_lock);          /*update global diff if necessary*/
25b.    diff += mydiff;
25c.    UNLOCK(diff_lock);
25d.    BARRIER(bar1, nprocs);    /*ensure all reach here before checking if done*/
25e.    if (diff/(n*n) < TOL) then done = 1;  /*check convergence; all get same answer*/
25f.    BARRIER(bar1, nprocs);
26.   endwhile
27. end procedure
04/18/23 119CSCE930-Advanced Computer Architecture, Introduction

Page 120: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Notes on SAS Program

– SPMD: not lockstep or even necessarily same instructions

– Assignment controlled by values of variables used as loop bounds

» unique pid per process, used to control assignment

– Done condition evaluated redundantly by all

– Code that does the update identical to sequential program

» each process has private mydiff variable

– Most interesting special operations are for synchronization

» accumulations into shared diff have to be mutually exclusive

» why the need for all the barriers?

04/18/23 120CSCE930-Advanced Computer Architecture, Introduction

Page 121: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Need for Mutual Exclusion
– Code each process executes:

• load the value of diff into register r1

• add the register r2 to register r1

• store the value of register r1 into diff

– A possible interleaving:

P1: r1 ← diff        {P1 gets 0 in its r1}
P2: r1 ← diff        {P2 also gets 0}
P1: r1 ← r1+r2       {P1 sets its r1 to 1}
P2: r1 ← r1+r2       {P2 sets its r1 to 1}
P1: diff ← r1        {P1 sets diff to 1}
P2: diff ← r1        {P2 also sets diff to 1}

– Need the sets of operations to be atomic (mutually exclusive)

04/18/23 121CSCE930-Advanced Computer Architecture, Introduction
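The lost update above can be replayed deterministically, and fixed with a lock; a small sketch (Python threading standing in for the shared address space; variable names are ours):

```python
import threading

# Deterministic replay of the interleaving: both processes load diff = 0,
# both add their private r2 = 1, both store 1 -- one update is lost.
diff = 0
r1_p1 = diff            # P1 loads 0 into its r1
r1_p2 = diff            # P2 also loads 0
r1_p1 = r1_p1 + 1       # P1 sets its r1 to 1
r1_p2 = r1_p2 + 1       # P2 sets its r1 to 1
diff = r1_p1            # P1 stores 1
diff = r1_p2            # P2 also stores 1: should have been 2
assert diff == 1

# LOCK/UNLOCK around the load-add-store makes the sequence atomic:
lock = threading.Lock()
shared = {"diff": 0}

def add_mydiff(mydiff):
    with lock:                      # critical section
        shared["diff"] += mydiff

threads = [threading.Thread(target=add_mydiff, args=(1,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert shared["diff"] == 4          # no update lost
```

Accumulating into a private mydiff and taking the lock once per sweep, as the SAS program does, keeps this serialization off the inner loop.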

Page 122: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Mutual Exclusion

• Provided by LOCK-UNLOCK around critical section

– Set of operations we want to execute atomically

– Implementation of LOCK/UNLOCK must guarantee mutual excl.

• Can lead to significant serialization if contended– Especially since expect non-local accesses in critical section

– Another reason to use private mydiff for partial accumulation

04/18/23 122CSCE930-Advanced Computer Architecture, Introduction

Page 123: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Global Event Synchronization

• BARRIER(nprocs): wait here till nprocs processes get here

– Built using lower level primitives– Global sum example: wait for all to accumulate before using sum– Often used to separate phases of computation

• Process P_1 Process P_2 Process P_nprocs

• set up eqn system set up eqn system set up eqn system

• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)

• solve eqn system solve eqn system solve eqn system

• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)

• apply results apply results apply results

• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)

– Conservative form of preserving dependences, but easy to use

• WAIT_FOR_END (nprocs-1)

04/18/23 123CSCE930-Advanced Computer Architecture, Introduction
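The set-up / solve phase structure above can be expressed directly with a barrier; a sketch using Python's threading.Barrier (names are ours):

```python
import threading

NPROCS = 4
bar = threading.Barrier(NPROCS)      # plays the role of BARRIER(name, nprocs)
log = []
log_lock = threading.Lock()

def worker(pid):
    with log_lock:
        log.append(("setup", pid))   # "set up eqn system"
    bar.wait()                       # no one solves until all have set up
    with log_lock:
        log.append(("solve", pid))   # "solve eqn system"
    bar.wait()                       # separate solving from the next phase

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "setup" entry precedes every "solve" entry:
assert all(kind == "setup" for kind, _ in log[:NPROCS])
assert all(kind == "solve" for kind, _ in log[NPROCS:])
```

As the slide says, this is a conservative way to preserve dependences: it orders whole phases even when only a few cross-phase dependences exist.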

Page 124: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Pt-to-pt Event Synch (Not Used Here)

• One process notifies another of an event so it can proceed– Common example: producer-consumer (bounded buffer)

– Concurrent programming on uniprocessor: semaphores

– Shared address space parallel programs: semaphores, or use ordinary variables as flags

•Busy-waiting or spinning

P1:                          P2:
A = 1;                       a: while (flag is 0) do nothing;
b: flag = 1;                 print A;

04/18/23 124CSCE930-Advanced Computer Architecture, Introduction
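The flag idiom runs as-is under CPython threads, whose interpreter lock supplies the ordering and visibility that real shared-memory hardware must provide through its memory model (a sketch; names are ours):

```python
import threading

shared = {"A": 0, "flag": 0}
printed = []

def p1():
    shared["A"] = 1              # produce the data
    shared["flag"] = 1           # b: flag = 1 signals the consumer

def p2():
    while shared["flag"] == 0:   # a: busy-wait (spin) on the flag
        pass
    printed.append(shared["A"])  # sees A == 1: A was written before flag

consumer = threading.Thread(target=p2)
consumer.start()
threading.Thread(target=p1).start()
consumer.join()
assert printed == [1]
```

On hardware with weaker memory ordering, the store to A must be ordered before the store to flag (e.g. with fences or release/acquire semantics) for this idiom to be safe.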

Page 125: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Group Event Synchronization

• Subset of processes involved– Can use flags or barriers (involving only the subset)

– Concept of producers and consumers

• Major types:– Single-producer, multiple-consumer

– Multiple-producer, single-consumer

– Multiple-producer, multiple-consumer

04/18/23 125CSCE930-Advanced Computer Architecture, Introduction

Page 126: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Message Passing Grid Solver

– Cannot declare A to be shared array any more

– Need to compose it logically from per-process private arrays

» usually allocated in accordance with the assignment of work

» process assigned a set of rows allocates them locally

– Transfers of entire rows between traversals

– Structurally similar to SAS (e.g. SPMD), but orchestration different

» data structures and data access/naming

» communication

» synchronization

04/18/23 126CSCE930-Advanced Computer Architecture, Introduction

Page 127: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

1.  int pid, n, b;     /*process id, matrix dimension and number of processors to be used*/
2.  float **myA;
3.  main()
4.  begin
5.    read(n); read(nprocs);    /*read input matrix size and number of processes*/
8a.   CREATE(nprocs-1, Solve);
8b.   Solve();                  /*main process becomes a worker too*/
8c.   WAIT_FOR_END(nprocs-1);   /*wait for all child processes created to terminate*/
9.  end main
10. procedure Solve()
11. begin
13.   int i, j, pid, n' = n/nprocs, done = 0;
14.   float temp, tempdiff, mydiff = 0;    /*private variables*/
6.    myA ← malloc(a 2-d array of size [n/nprocs + 2] by n+2);  /*my assigned rows of A*/
7.    initialize(myA);          /*initialize my rows of A, in an unspecified way*/
15.   while (!done) do
16.     mydiff = 0;             /*set local diff to 0*/
16a.    if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.    if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.    if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.    if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
        /*border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*]*/
17.     for i ← 1 to n' do      /*for each of my (nonghost) rows*/
18.       for j ← 1 to n do     /*for all nonborder elements in that row*/
19.         temp = myA[i,j];
20.         myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                  myA[i,j+1] + myA[i+1,j]);
22.         mydiff += abs(myA[i,j] - temp);
23.       endfor
24.     endfor
        /*communicate local diff values and determine if done;
          can be replaced by reduction and broadcast*/
25a.    if (pid != 0) then      /*process 0 holds global total diff*/
25b.      SEND(mydiff, sizeof(float), 0, DIFF);
25c.      RECEIVE(done, sizeof(int), 0, DONE);
25d.    else                    /*pid 0 does this*/
25e.      for i ← 1 to nprocs-1 do   /*for each other process*/
25f.        RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.        mydiff += tempdiff;      /*accumulate into total*/
25h.      endfor
25i.      if (mydiff/(n*n) < TOL) then done = 1;
25j.      for i ← 1 to nprocs-1 do   /*for each other process*/
25k.        SEND(done, sizeof(int), i, DONE);
25l.      endfor
25m.    endif
26.   endwhile
27. end procedure

04/18/23 127CSCE930-Advanced Computer Architecture, Introduction

Page 128: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Notes on Message Passing Program
– Use of ghost rows

– Receive does not transfer data, send does

» unlike SAS which is usually receiver-initiated (load fetches data)

– Communication done at beginning of iteration, so no asynchrony

– Communication in whole rows, not element at a time

– Core similar, but indices/bounds in local rather than global space

– Synchronization through sends and receives

» Update of global diff and event synch for done condition

» Could implement locks and barriers with messages

– Can use REDUCE and BROADCAST library calls to simplify code

/*communicate local diff values and determine if done, using reduction and broadcast*/
25b.  REDUCE(0, mydiff, sizeof(float), ADD);
25c.  if (pid == 0) then
25i.    if (mydiff/(n*n) < TOL) then done = 1;
25k.  endif
25m.  BROADCAST(0, done, sizeof(int), DONE);

04/18/23 128CSCE930-Advanced Computer Architecture, Introduction
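A runnable miniature of the message-passing structure for two processes (our construction: Python threads with FIFO queues standing in for SEND/RECEIVE channels between private slabs; the pseudocode above is the actual model). Each worker owns half the rows plus ghost rows, exchanges border rows each iteration, and worker 0 performs the diff reduction and broadcasts the done flag:

```python
import threading
import queue

N, TOL = 4, 1e-6       # 4x4 interior; (N+2)x(N+2) grid, borders fixed at 1.0
HALF = N // 2

q01, q10 = queue.Queue(), queue.Queue()   # per-direction FIFO channels

def worker(pid, slab, send, recv, out):
    """slab: my HALF owned rows plus one boundary/ghost row above and below.
    Each iteration: exchange border rows, sweep my rows, then do a
    2-process reduce + broadcast of the convergence decision."""
    done = False
    while not done:
        if pid == 0:
            send(list(slab[HALF]))        # SEND my last owned row down
            slab[HALF + 1] = recv()       # RECEIVE neighbor's row as ghost
        else:
            send(list(slab[1]))           # SEND my first owned row up
            slab[0] = recv()
        mydiff = 0.0
        for i in range(1, HALF + 1):      # sweep over my (nonghost) rows
            for j in range(1, N + 1):
                temp = slab[i][j]
                slab[i][j] = 0.2 * (slab[i][j] + slab[i][j-1] + slab[i-1][j] +
                                    slab[i][j+1] + slab[i+1][j])
                mydiff += abs(slab[i][j] - temp)
        if pid == 1:
            send(mydiff)                  # send local diff to worker 0
            done = recv()                 # receive the done flag
        else:
            done = (mydiff + recv()) / (N * N) < TOL
            send(done)                    # broadcast the decision
    out.extend(slab[1:HALF + 1])          # report my owned rows

grid = [[1.0] * (N + 2) for _ in range(N + 2)]
for i in range(1, N + 1):
    for j in range(1, N + 1):
        grid[i][j] = 0.0
out0, out1 = [], []
t0 = threading.Thread(target=worker, args=(
    0, [row[:] for row in grid[0:HALF + 2]], q01.put, q10.get, out0))
t1 = threading.Thread(target=worker, args=(
    1, [row[:] for row in grid[HALF:N + 2]], q10.put, q01.get, out1))
t0.start(); t1.start(); t0.join(); t1.join()
interior = out0 + out1
```

Both sides send before receiving, which is safe here only because the channels buffer; with synchronous (unbuffered) sends this same structure deadlocks, exactly the point raised on the next slide.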

Page 129: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Send and Receive Alternatives

– Affect event synch (mutual excl. by fiat: only one process touches data)– Affect ease of programming and performance

• Synchronous messages provide built-in synch. through match– Separate event synchronization needed with asynch. messages

• With synch. messages, our code is deadlocked. Fix?

– Can extend functionality: stride, scatter-gather, groups

– Semantic flavors: based on when control is returned
» affects when data structures or buffers can be reused at either end

Send/Receive
  Synchronous
  Asynchronous
    Blocking asynch.
    Nonblocking asynch.

04/18/23 129CSCE930-Advanced Computer Architecture, Introduction

Page 130: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Orchestration: Summary

• Shared address space– Shared and private data explicitly separate– Communication implicit in access patterns– No correctness need for data distribution– Synchronization via atomic operations on shared data– Synchronization explicit and distinct from data

communication

• Message passing– Data distribution among local address spaces needed– No explicit shared structures (implicit in comm. patterns)– Communication is explicit– Synchronization implicit in communication (at least in synch.

case)

» mutual exclusion by fiat

04/18/23 130CSCE930-Advanced Computer Architecture, Introduction

Page 131: CSCE 930 Advanced Computer Architecture Introductions Adopted from Professor David Patterson & David Culler Electrical Engineering and Computer Sciences

Correctness in Grid Solver Program

• Decomposition and Assignment similar in SAS and message-passing

• Orchestration is different– Data structures, data access/naming, communication, synchronization

                                     SAS       Msg-Passing
Explicit global data structure?      Yes       No
Assignment indept of data layout?    Yes       No
Communication                        Implicit  Explicit
Synchronization                      Explicit  Implicit
Explicit replication of border rows? No        Yes

Requirements for performance are another story ...

04/18/23 131CSCE930-Advanced Computer Architecture, Introduction