CSCE 930 Advanced Computer Architecture
Introductions
Adapted from Professor David Patterson & David Culler
Electrical Engineering and Computer Sciences, University of California, Berkeley
Outline
• Computer Science at a Crossroads: Parallelism– Architecture: multi-core and many-cores
– Program: multi-threading
• Parallel Architecture– What is Parallel Architecture?
– Why Parallel Architecture?
– Evolution and Convergence of Parallel Architectures
– Fundamental Design Issues
• Parallel Programs– Why bother with programs?
– Important for whom?
• Memory & Storage Subsystem Architectures
• Old Conventional Wisdom: Power is free, Transistors expensive
• New Conventional Wisdom: “Power wall” Power expensive, Xtors free (Can put more on chip than can afford to turn on)
• Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, …)
• New CW: “ILP wall” law of diminishing returns on more HW for ILP
• Old CW: Multiplies are slow, Memory access is fast
• New CW: “Memory wall” Memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall– Uniprocessor performance now 2X / 5(?) yrs
Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years)
» More, simpler processors are more power efficient
Crossroads: Conventional Wisdom in Comp. Arch
[Figure: uniprocessor performance relative to the VAX-11/780, 1978-2006, growing at 25%/year, then 52%/year, then ??%/year]
Crossroads: Uniprocessor Performance
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006
Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor,
2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip
• Processor is the new transistor?
• RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip
• 125 mm2 chip, 0.065 micron CMOS = 2312 RISC II+FPU+Icache+Dcache
– RISC II shrinks to ~ 0.02 mm2 at 65 nm
– Caches via DRAM or 1 transistor SRAM (www.t-ram.com)?
– Proximity Communication via capacitive coupling at > 1 TB/s?
(Ivan Sutherland @ Sun / Berkeley)
Déjà vu all over again?
• Multiprocessors imminent in 1970s, '80s, '90s, …
• "… today's processors … are nearing an impasse as technologies approach the speed of light …"
David Mitchell, The Transputer: The Time Is Now (1989)
• Transputer was premature ⇒ Custom multiprocessors strove to lead uniprocessors ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years
• "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing"
Paul Otellini, President, Intel (2004)
• Difference is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs)
⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
⇒ Biggest programming challenge: going from 1 to 2 CPUs
Problems with Sea Change
• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
• Architectures not ready for 1000 CPUs / chip
• Unlike Instruction Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but it also cannot be solved without the participation of computer architects
• This course explores ILP (Instruction Level Parallelism) and its shift to Thread Level Parallelism / Data Level Parallelism
Outline
• Computer Science at a Crossroads: Parallelism– Architecture: multi-core and many-cores
– Program: multi-threading
• Parallel Architecture– What is Parallel Architecture?
– Why Parallel Architecture?
– Evolution and Convergence of Parallel Architectures
– Fundamental Design Issues
• Parallel Programs– Why bother with programs?
– Important for whom?
• Memory & Storage Subsystem Architectures
What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some broad issues:– Resource Allocation:
» how large a collection?
» how powerful are the elements?
» how much memory?
– Data access, Communication and Synchronization
» how do the elements cooperate and communicate?
» how are data transmitted between processors?
» what are the abstractions and primitives for cooperation?
– Performance and Scalability
» how does it all translate into performance?
» how does it scale?
Why Study Parallel Architecture?
Role of a computer architect:
To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost.
Parallelism:• Provides alternative to faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing
Why Study it Today?
• History: diverse and innovative organizational structures, often tied to novel programming models
• Rapidly maturing under strong technological constraints
– The “killer micro” is ubiquitous
– Laptops and supercomputers are fundamentally similar!
– Technological trends cause diverse approaches to converge
• Technological trends make parallel computing inevitable
– In the mainstream with the reality of multi-cores and many-cores
• Need to understand fundamental principles and design tradeoffs, not just taxonomies
– Naming, Ordering, Replication, Communication performance
Inevitability of Parallel Computing
• Application demands: Our insatiable need for computing cycles
– Scientific computing: VR simulations in Biology, Chemistry, Physics, ...
– General-purpose computing: Video, Graphics, CAD, Databases, AR, VI, TP, ...
• Technology Trends
– Number of cores on chip growing rapidly (new Moore's Law)
– Clock rates expected to go up only slowly (technology wall)
• Architecture Trends– Instruction-level parallelism valuable but limited– Coarser-level parallelism, or thread-level parallelism, the most viable
approach
• Economics
• Current trends:– Today’s microprocessors are multiprocessors
Application Trends
• Demand for cycles fuels advances in hardware, and vice-versa
– Cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder: most demanding applications
• Range of performance demands– Need range of system performance with progressively increasing cost
– Platform pyramid
• Goal of applications in using parallel machines: Speedup
• Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time
• Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many industries– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
» In all of the above
» Entertainment (3D films like Avatar & 3D games )
» Architecture (walk-throughs and rendering)
» Virtual Reality/Immersion (museums, teleporting, etc)
– Financial modeling (yield and derivative analysis)
– Etc.
Learning Curve for Parallel Applications
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray C90, 406 MFLOPS for the final version on a 128-processor Paragon, 891 MFLOPS on a 128-processor Cray T3D
Commercial Computing
• Also relies on parallelism for high end
– Scale not so large, but use much more widespread
– Computational power determines scale of business that can be handled
• Databases, online-transaction processing, decision support, data mining, data warehousing ...
• TPC benchmarks (TPC-C order entry, TPC-D decision support)
– Explicit scaling criteria provided
– Size of enterprise scales with size of system
– Problem size no longer fixed as p increases, so throughput is used as a performance measure (transactions per minute or tpm)
Similar Story for Storage
• Divergence between memory capacity and speed more pronounced– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster– Need to transfer more data in parallel– Need deeper cache hierarchies– How to organize caches?
• Parallelism increases effective size of each level of hierarchy, without increasing access time
• Parallelism and locality within memory systems too– New designs fetch many bits within memory chip; follow with fast pipelined transfer
across narrower interface– Buffer caches most recently accessed data
• Disks too: Parallel disks plus caching
Real-world applications demand high-performing and reliable storage:
– Medical imaging: >100 TB
– Digital body: 1 TB/body
– High-performance computing: >1 PB
– NASA: >100 TB
– H.B. Glass: 5 GB/day
– GIS: >1 PB
– Oil prospecting: >1 PB
– Ocean resource data: >1 PB
– Google, Yahoo, …: >1 PB/year
(1 PB = 1000 TB = 10^15 bytes, equal to the capacity of 10,000 100-GB disks.)
Technology Trends: Moore's Law: 2X transistors / "year" ⇒ 2X cores / "year"
• “Cramming More Components onto Integrated Circuits”– Gordon Moore, Electronics, 1965
• # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
Tracking Technology Performance Trends
• Drill down into 4 technologies:– Disks, – Memory, – Network, – Processors
• Compare ~1980 Archaic (Nostalgic) vs. ~2000 Modern (Newfangled)
– Performance Milestones in each technology
• Compare for Bandwidth vs. Latency improvements in performance over time
• Bandwidth: number of events per unit time– E.g., M bits / second over network, M bytes / second from disk
• Latency: elapsed time for a single event– E.g., one-way network delay in microseconds,
average disk access time in milliseconds
Disks: Archaic(Nostalgic) v. Modern(Newfangled)
• CDC Wren I, 1983
• 3600 RPM
• 0.03 GBytes capacity
• Tracks/Inch: 800
• Bits/Inch: 9,550
• Three 5.25" platters
• Bandwidth: 0.6 MBytes/sec
• Latency: 48.3 ms
• Cache: none
• Seagate 373453, 2003
• 15000 RPM (4X)
• 73.4 GBytes (2500X)
• Tracks/Inch: 64,000 (80X)
• Bits/Inch: 533,000 (60X)
• Four 2.5" platters (in 3.5" form factor)
• Bandwidth: 86 MBytes/sec (140X)
• Latency: 5.7 ms (8X)
• Cache: 8 MBytes
Latency Lags Bandwidth (for last ~20 years)
• Performance Milestones
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best-case)
[Figure: relative bandwidth improvement vs. relative latency improvement (log-log) for Disk, with the reference line where latency improvement = bandwidth improvement]
Memory: Archaic (Nostalgic) v. Modern (Newfangled)
• 1980 DRAM (asynchronous)
• 0.06 Mbits/chip
• 64,000 xtors, 35 mm2
• 16-bit data bus per module, 16 pins/chip
• 13 Mbytes/sec
• Latency: 225 ns
• (no block transfer)
• 2000 Double Data Rate Synchr. (clocked) DRAM
• 256.00 Mbits/chip (4000X)
• 256,000,000 xtors, 204 mm2
• 64-bit data bus per DIMM, 66 pins/chip (4X)
• 1600 Mbytes/sec (120X)
• Latency: 52 ns (4X)
• Block transfers (page mode)
Latency Lags Bandwidth (last ~20 years)
• Performance Milestones
• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best-case)
[Figure: relative bandwidth improvement vs. relative latency improvement (log-log) for Memory and Disk, with the reference line where latency improvement = bandwidth improvement]
LANs: Archaic (Nostalgic)v. Modern (Newfangled)
• Ethernet 802.3
• Year of Standard: 1978
• 10 Mbits/s link speed
• Latency: 3000 μsec
• Shared media
• Coaxial cable
• Ethernet 802.3ae
• Year of Standard: 2003
• 10,000 Mbits/s link speed (1000X)
• Latency: 190 μsec (15X)
• Switched media
• Category 5 copper wire
Coaxial cable: copper core, insulator, braided outer conductor, plastic covering
Twisted pair ("Cat 5" is 4 twisted pairs in a bundle): copper, 1 mm thick, twisted to avoid antenna effect
Latency Lags Bandwidth (last ~20 years)
• Performance Milestones
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)
• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best-case)
[Figure: relative bandwidth improvement vs. relative latency improvement (log-log) for Memory, Network, and Disk, with the reference line where latency improvement = bandwidth improvement]
CPUs: Archaic (Nostalgic) v. Modern (Newfangled)
• 1982 Intel 80286
• 12.5 MHz
• 2 MIPS (peak)
• Latency 320 ns
• 134,000 xtors, 47 mm2
• 16-bit data bus, 68 pins
• Microcode interpreter, separate FPU chip
• (no caches)
• 2001 Intel Pentium 4
• 1500 MHz (120X)
• 4500 MIPS (peak) (2250X)
• Latency 15 ns (20X)
• 42,000,000 xtors, 217 mm2
• 64-bit data bus, 423 pins
• 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution
• On-chip 8KB Data caches, 96KB Instr. Trace cache, 256KB L2 cache
Latency Lags Bandwidth (last ~20 years)
• Performance Milestones
• Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x)
• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x)
• Disk : 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
[Figure: relative bandwidth improvement vs. relative latency improvement (log-log) for Processor, Memory, Network, and Disk, with the reference line where latency improvement = bandwidth improvement; Processor sits highest and Memory lowest (the "Memory Wall")]
Rule of Thumb for Latency Lagging BW
• In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
(and capacity improves faster than bandwidth)
• Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency
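A quick sanity check of this rule (an illustrative calculation, not from the original slides): if latency improves by 1.4X while bandwidth doubles, then (latency improvement)^2 = 1.4^2 = 1.96 ≈ 2 = bandwidth improvement, so a 1.2X-1.4X latency gain per bandwidth doubling is exactly the "square" relationship stated above.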
6 Reasons Latency Lags Bandwidth
1. Moore’s Law helps BW more than latency • Faster transistors, more transistors,
more pins help Bandwidth
» MPU Transistors: 0.130 vs. 42 M xtors (300X)
» DRAM Transistors: 0.064 vs. 256 M xtors (4000X)
» MPU Pins: 68 vs. 423 pins (6X)
» DRAM Pins: 16 vs. 66 pins (4X)
• Smaller, faster transistors, but they communicate over (relatively) longer lines: limits latency
» Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
» DRAM Die Size: 35 vs. 204 mm2 (ratio sqrt ⇒ 2X)
» MPU Die Size: 47 vs. 217 mm2 (ratio sqrt ⇒ 2X)
6 Reasons Latency Lags Bandwidth (cont’d)
2. Distance limits latency
• Size of DRAM block ⇒ long bit and word lines ⇒ most of DRAM access time
• Speed of light limits computers on a network
• 1. & 2. explain linear latency vs. square BW?
3. Bandwidth easier to sell ("bigger = better")
• E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10 μsec latency Ethernet
• 4400 MB/s DIMM ("PC4400") vs. 50 ns latency
• Even if just marketing, customers now trained
• Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance
4. Latency helps BW, but not vice versa • Spinning disk faster improves both bandwidth and
rotational latency
» 3600 RPM → 15000 RPM = 4.2X
» Average rotational latency: 8.3 ms → 2.0 ms
» Other things being equal, also helps BW by 4.2X
• Lower DRAM latency → more accesses/second (higher bandwidth)
• Higher linear density helps disk BW (and capacity), but not disk latency
» 9,550 BPI → 533,000 BPI ⇒ 60X in BW
6 Reasons Latency Lags Bandwidth (cont’d)
5. Bandwidth hurts latency
• Queues help Bandwidth, hurt Latency (Queuing Theory)
• Adding chips to widen a memory module increases Bandwidth but higher fan-out on address lines may increase Latency
6. Operating System overhead hurts Latency more than Bandwidth
• Long messages amortize overhead; overhead bigger part of short messages
6 Reasons Latency Lags Bandwidth (cont’d)
Summary of Technology Trends
• For disk, LAN, memory, and microprocessor, bandwidth improves by square of latency improvement
– In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
• Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components
– Multiple processors in a cluster or even in a chip
– Multiple disks in a disk array
– Multiple memory modules in a large memory
– Simultaneous communication in switched LAN
• HW and SW developers should innovate assuming Latency Lags Bandwidth
– If everything improves at the same rate, then nothing really changes
– When rates vary, require real innovation
Architectural Trends
• Architecture translates technology’s gifts to performance and capability
• Resolves the tradeoff between parallelism and locality
– Current microprocessor: 1/4 compute, 1/2 cache, 1/4 off-chip connect
– Tradeoffs may change with scale and technology advances
• Four generations of architectural history: tube, transistor, IC, VLSI
– Here focus only on VLSI generation
• Greatest delineation in VLSI has been in type of parallelism exploited
Architectural Trends
• Greatest trend in VLSI generation is increase in parallelism– Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit
» slows after 32 bit
» adoption of 64-bit now under way, 128-bit far (not performance issue)
» great inflection point when 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction level parallelism
» pipelining and simple instruction sets, + compiler advances (RISC)
» on-chip caches and functional units => superscalar execution
» greater sophistication: out of order execution, speculation, prediction (to deal with control transfer and latency problems)
– Next step: thread level parallelism
Architectural Trends: ILP
• Reported speedups for superscalar processors
• Horst, Harris, and Jardine [1990] ...................... 1.37
• Wang and Wu [1988] .......................................... 1.70
• Smith, Johnson, and Horowitz [1989] .............. 2.30
• Murakami et al. [1989] ........................................ 2.55
• Chang et al. [1991] ............................................. 2.90
• Jouppi and Wall [1989] ...................................... 3.20
• Lee, Kwok, and Briggs [1991] ........................... 3.50
• Wall [1991] .......................................................... 5
• Melvin and Patt [1991] ....................................... 8
• Butler et al. [1991] ............................................. 17+
• Large variance due to differences in
– application domain investigated (numerical versus non-numerical)
– capabilities of processor modeled
ILP Ideal Potential
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
– real caches and non-zero miss latencies
[Figure: (left) fraction of total cycles (%) vs. number of instructions issued per cycle (0 to 6+); (right) speedup vs. number of instructions issued (0 to 15), under infinite resources and perfect branch prediction and renaming, but real caches and non-zero miss latencies]
Results of ILP Studies
[Figure: speedups between 1x and 4x reported for Jouppi_89, Smith_89, Murakami_89, Chang_91, Butler_91, and Melvin_91, comparing 1 branch unit/real prediction against perfect branch prediction]
• Concentrate on parallelism for 4-issue machines
• Realistic studies show only 2-fold speedup
• Recent studies show that finding more ILP requires looking across threads
Architectural Trends: Bus-based MPs
•Micro on a chip makes it natural to connect many to shared memory
– dominates server and enterprise market, moving down to desktop
•Faster processors began to saturate bus, then bus technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
[Figure: number of processors (0-70) in bus-based shared-memory machines vs. year of introduction, 1984-1998, including Sequent B8000/B2100, Symmetry21/81, Power, SS690MP 120/140, SGI PowerSeries, SS10/SS20, SE10/SE30/SE60/SE70, SS1000/SS1000E, SC2000/SC2000E, SGI Challenge, CRAY CS6400, AS2100, AS8400, HP K400, SGI PowerChallenge/XL, P-Pro, Sun E6000, and Sun E10000]
Economics
• Commodity microprocessors are not only fast but CHEAP
• Development cost is tens of millions of dollars (5-100 million typical)
• BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the commodity building block
– Exotic parallel architectures no more than special-purpose
• Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?– Multiprocessor on a chip
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in 70s
– Microprocessors have made huge gains in floating-point performance
» high clock rates
» pipelined floating point units (e.g., multiply-add every cycle)
» instruction-level parallelism
» effective use of caches (e.g., automatic blocking)
– Plus economics
• Large-scale multiprocessors replace vector supercomputers
– Most top-performing machines on Top-500 list are multiprocessors
Summary: Why Parallel Architecture?
• Increasingly attractive– Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Thread-level parallelism (multi-cores and many-cores)
– Application/node-level parallelism ("MPPs", clusters, grids, clouds)
• Focus of this class: thread-level parallelism
• Same story from memory/storage system perspective but with a “twist” (data-intensive computing, etc)
• Wide range of parallel architectures make sense– Different cost, performance and scalability
Convergence of Parallel Architectures
History
Historically, parallel architectures tied to programming models
• Divergent architectures, with no predictable pattern of growth
[Figure: application software and system software sitting atop divergent architectures: SIMD, Message Passing, Shared Memory, Dataflow, Systolic Arrays]
• Uncertainty of direction paralyzed parallel software development!
Today
• Extension of “computer architecture” to support communication and cooperation– OLD: Instruction Set Architecture
– NEW: Communication Architecture
• Defines – Critical abstractions, boundaries, and primitives (interfaces)
– Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges today
Modern Layered Framework
[Figure: layered framework, top to bottom:
– Parallel applications (CAD, Database, Scientific modeling)
– Programming models (Multiprogramming, Shared address, Message passing, Data parallel)
– Communication abstraction (user/system boundary)
– Compilation or library / Operating systems support
– Communication hardware (hardware/software boundary)
– Physical communication medium]
Programming Model
• What programmer uses in coding applications
• Specifies communication and synchronization
• Examples:– Multiprogramming: no communication or synch. at program
level
– Shared address space: like bulletin board
– Message passing: like letters or phone calls, explicit point to point
– Data parallel: more regimented, global actions on data
» Implemented with shared address space or message passing
Communication Abstraction
• User level communication primitives provided– Realizes the programming model
– Mapping exists between language primitives of programming model and these primitives
• Supported directly by hw, or via OS, or via user sw
• Lot of debate about what to support in sw and gap between layers
• Today:– Hw/sw interface tends to be flat, i.e. complexity roughly uniform
– Compilers and software play important roles as bridges today
– Technology trends exert strong influence
• Result is convergence in organizational structure– Relatively simple, general purpose communication primitives
Communication Architecture
• = User/System Interface + Implementation
• User/System Interface:– Comm. primitives exposed to user-level by hw and system-level sw
• Implementation:– Organizational structures that implement the primitives: hw or OS– How optimized are they? How integrated into processing node?– Structure of network
• Goals:– Performance– Broad applicability– Programmability– Scalability– Low Cost
Evolution of Architectural Models
• Historically machines tailored to programming models– Prog. model, comm. abstraction, and machine organization lumped together as
the “architecture”
• Evolution helps understand convergence– Identify core concepts
• Shared Address Space
• Message Passing
• Data Parallel
• Others: – Dataflow
– Systolic Arrays
• Examine programming model, motivation, intended applications, and contributions to convergence
Shared Address Space Architectures
• Any processor can directly reference any memory location – Communication occurs implicitly as result of loads and stores
• Convenient: – Location transparency
– Similar programming model to time-sharing on uniprocessors
» Except processes run on different processors
» Good throughput on multiprogrammed workloads
• Naturally provided on wide range of platforms– History dates at least to precursors of mainframes in early 60s
– Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model– Ambiguous: memory may be physically distributed among processors
Shared Address Space Model
• Process: virtual address space plus one or more threads of control
• Portions of address spaces of processes are shared
• Writes to shared addresses visible to other threads (in other processes too)
• Natural extension of uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
• OS uses shared memory to coordinate processes
[Figure: virtual address spaces for a collection of processes P0 … Pn communicating via shared addresses; the shared portion of each address space maps to common physical addresses, the private portion to per-process physical memory, so a store by one process is seen by a load in another]
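A minimal sketch of this model in C with POSIX threads (an illustration, not part of the original slides; compile with -pthread): communication is an ordinary store and load on a shared variable, and synchronization uses special operations (a mutex and a join here).

#include <pthread.h>
#include <stdio.h>

int shared_value = 0;                              /* shared portion of the address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* special operation for synchronization */

void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;                             /* ordinary store communicates the value */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);                         /* event synchronization with the producer */
    pthread_mutex_lock(&lock);
    printf("consumer read %d\n", shared_value);    /* ordinary load sees the producer's store */
    pthread_mutex_unlock(&lock);
    return 0;
}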
Communication Hardware
• Also natural extension of uniprocessor
• Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort
Memory capacity increased by adding modules, I/O by controllers
•Add processors for processing!
•For higher-throughput multiprogramming, or parallel programs
[Figure: processors, memory modules, and I/O controllers connected by an interconnect, with I/O devices attached to the I/O controllers]
History
[Figures: crossbar-based "mainframe" organization (processors, memories, and I/O connected by a crossbar) and bus-based "minicomputer" organization (processors with caches, memory, and I/O on a shared bus)]
• "Mainframe" approach
– Motivated by multiprogramming
– Extends crossbar used for memory bandwidth and I/O
– Originally processor cost limited it to small scale; later, cost of crossbar did
– Bandwidth scales with p
– High incremental cost; use multistage networks instead
• "Minicomputer" approach
– Almost all microprocessor systems have a bus
– Motivated by multiprogramming, TP
– Used heavily for parallel computing
– Called symmetric multiprocessor (SMP)
– Latency larger than for uniprocessor
– Bus is bandwidth bottleneck
» caching is key: coherence problem
– Low incremental cost
Example: Intel Pentium Pro Quad
– All coherence and multiprocessing glue in processor module
– Highly integrated, targeted at high volume
– Low latency and bandwidth
[Figure: four P-Pro modules (each with CPU, interrupt controller, bus interface, and 256-KB L2 $) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), plus a memory controller (MIU) with 1-, 2-, or 4-way interleaved DRAM and two PCI bridges driving PCI buses with I/O cards]
Example: SUN Enterprise
– 16 cards of either type: processors + memory, or I/O
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus
[Figure: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards (each with 2 processors and their caches, a memory controller, and a bus interface/switch) and I/O cards (bus interface, SBUS slots, 2 Fiber Channel, 100bT, SCSI)]
Scaling Up
– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
» latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
» Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?
[Figure: "Dance hall" organization (processors with caches on one side of a network, memory modules on the other) vs. distributed-memory organization (each processor has a cache and local memory, nodes connected by a network)]
Example: Cray T3E
– Scale up to 1024 processors, 480 MB/s links
– Memory controller generates comm. request for nonlocal references
– No hardware mechanism for coherence (SGI Origin etc. provide this)
[Figure: T3E node with processor, cache, memory, and a memory controller/network interface attached to a switch in a 3D (X, Y, Z) network; external I/O attached to the switch]
Message Passing Architectures
• Complete computer as building block, including I/O– Communication via explicit I/O operations
• Programming model: directly access only private address space (local memory), comm. via explicit messages (send/receive)
• High-level block diagram similar to distributed-memory SAS– But comm. integrated at IO level, needn’t be into memory system– Like networks of workstations (clusters), but tighter integration– Easier to build than scalable SAS
• Programming model more removed from basic hardware operations– Library or OS intervention
Message-Passing Abstraction
– Send specifies buffer to be transmitted and receiving process– Recv specifies sending process and application storage to receive into– Memory to memory copy, but need to name processes– Optional tag on send and matching rule on receive– User process names local data and entities in process/tag space too– In simplest form, the send/recv match achieves pairwise synch event
» Other variants too– Many overheads: copying, buffer management, protection
[Figure: process P executes Send X, Q, t while process Q executes Receive Y, P, t; the matched send/receive copies local address X in P's address space into local address Y in Q's address space]
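As a concrete illustration of the send/receive abstraction above, here is a minimal MPI sketch (MPI is one widely used realization of this model, not something prescribed by the original slides); run it with two processes, e.g. mpirun -np 2 ./a.out.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x = 0, y = 0, tag = 7;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {              /* process P: send local X to process Q with tag t */
        x = 42;
        MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {       /* process Q: receive from P into local Y, matching the tag */
        MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Q received %d\n", y);
    }
    MPI_Finalize();
    return 0;
}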
Evolution of Message-Passing Machines
• Early machines: FIFO on each link– Hw close to prog. Model; synchronous ops– Replaced by DMA, enabling non-blocking ops
» Buffered by system at destination until recv
• Diminishing role of topology– Store&forward routing: topology important– Introduction of pipelined routing made it less so– Cost is in node-network interface– Simplifies programming
[Figure: early message-passing machine topology, a 3-D hypercube with nodes 000-111]
Example: IBM SP-2
– Made out of essentially complete RS6000 workstations
– Network interface integrated in I/O bus (bw limited by I/O bus)
[Figure: IBM SP-2 node, essentially a complete RS6000: Power 2 CPU with L2 $ on a memory bus, memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus holding the NIC (i860, NI, DMA, DRAM); nodes connected by a general interconnection network formed from 8-port switches]
Example: Intel Paragon
[Figure: Paragon node with two i860 processors (each with L1 $) on a 64-bit, 50 MHz memory bus, a memory controller with 4-way interleaved DRAM, and an NI with DMA and driver; nodes attached to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links; photo of Sandia's Intel Paragon XP/S-based supercomputer]
Toward Architectural Convergence
• Evolution and role of software have blurred boundary– Send/recv supported on SAS machines via buffers
– Can construct global address space on MP using hashing
– Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too– Tighter NI integration even for MP (low-latency, high-
bandwidth)
– At lower level, even hardware SAS passes hardware messages
• Even clusters of workstations/SMPs are parallel systems– Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
– Nodes connected by general network and communication assists
– Implementations also converging, at least in high-end machines
Data Parallel Systems
• Programming model – Operations performed in parallel on each element of data structure
– Logically single thread of control, performs sequential or parallel steps
– Conceptually, a processor associated with each data element
• Architectural model– Array of many simple, cheap processors with little memory each
» Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synchronization
[Figure: grid of PEs attached to a control processor that broadcasts instructions]
Original motivations:
• Matches simple differential equation solvers
• Centralize high cost of instruction fetch/sequencing
Application of Data Parallelism
– Each PE contains an employee record with his/her salary
If salary > 100K then
salary = salary *1.05
else
salary = salary *1.10
– Logically, the whole operation is a single step
– Some processors enabled for arithmetic operation, others disabled (see the C sketch below)
• Other examples: – Finite differences, linear algebra, ... – Document searching, graphics, image processing, ...
• Some early machines: – Thinking Machines CM-1, CM-2 (and CM-5)– Maspar MP-1 and MP-2,
• Today's on-chip examples: – IBM's Cell processor,
– GPU, GPGPU
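A C sketch of the salary update above, written as a data-parallel loop (illustration only; the Employee type and field names are hypothetical). On a SIMD machine each element would be handled by its own PE, with the if/else realized by enabling some PEs and disabling the rest.

typedef struct { float salary; } Employee;         /* hypothetical record held by each PE */

void raise_salaries(Employee *e, int n)
{
    for (int i = 0; i < n; i++) {                  /* logically a single parallel step */
        if (e[i].salary > 100000.0f)
            e[i].salary *= 1.05f;                  /* PEs enabled for this branch */
        else
            e[i].salary *= 1.10f;                  /* remaining PEs enabled for this one */
    }
}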
Dataflow Architectures
• Represent computation as a graph of essential dependences
– Logical processor at each node, activated by availability of operands– Message (tokens) carrying tag of next instruction sent to next processor– Tag compared with others in matching store; match fires execution
Example program: a = (b + 1) × (b − c);  d = c × e;  f = a × d
[Figure: dataflow graph for the example, with +, −, and × nodes producing a, d, and finally f; and a dataflow processor pipeline in which tokens arriving from the network enter a token store, waiting/matching pairs up operands, a match triggers instruction fetch from the program store and execution, form-token builds result tokens, and a token queue feeds them back to the network]
Evolution and Convergence• Key characteristics
– Ability to name operations, synchronization, dynamic scheduling
• Problems– Operations have locality across them, useful to group together– Handling complex data structures like arrays– Complexity of matching store and memory units– Expose too much parallelism (?)
• Converged to use conventional processors and memory– Support for large, dynamic set of threads to map to processors– Typically shared address space as well– But separation of progr. model from hardware (like data-parallel)
• Lasting contributions:– Integration of communication with thread (handler) generation– Tightly integrated communication and fine-grained synchronization– Remained useful concept for software (compilers etc.)
Systolic Architectures
– Replace single processor with array of regular processing elements
– Orchestrate data flow for high throughput with less memory access
• Different from pipelining
– Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular pattern
[Figure: memory feeding a linear array of PEs, contrasted with a single PE between two memories]
Systolic Arrays (contd.)
– Practical realizations (e.g. iWARP) use quite general processors
» Enable variety of algorithms on same hardware– But dedicated interconnect channels
» Data transfer directly from register to register across channel– Specialized, and same problems as SIMD
» General purpose systems work well for same algorithms (locality etc.)
Example: systolic array for 1-D convolution
y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) + w4·x(i+3)
Each cell holds one weight w and a register x, and on every step computes:
xout = x;   x = xin;   yout = yin + w·xin
[Figure: four cells holding w4, w3, w2, w1; the inputs x1, x2, … stream in from the left while the partial sums y1, y2, y3 accumulate as they move through the array]
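For reference, a plain sequential C sketch of the sums this array produces (an illustration of the equation above, not of the cycle-by-cycle cell pipeline; the function name is illustrative).

/* y[i] = w[0]*x[i] + w[1]*x[i+1] + w[2]*x[i+2] + w[3]*x[i+3], with w[0..3] = w1..w4 */
void convolve_1d(const float x[], int n, const float w[4], float y[])
{
    for (int i = 0; i + 3 < n; i++) {
        y[i] = 0.0f;
        for (int j = 0; j < 4; j++)
            y[i] += w[j] * x[i + j];   /* the same sum each cell accumulates as data streams by */
    }
}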
Convergence: Generic Parallel Architecture
• A generic modern multiprocessor
• Node: processor(s), memory system, plus communication assist
– Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within framework
– Integration of assist with node, what operations, how efficiently...
[Figure: nodes, each with processor, cache, memory, and communication assist (CA), connected by a scalable network]
Fundamental Design Issues
Understanding Parallel Architecture
• Traditional taxonomies not very useful
• Programming models not enough, nor hardware structures
– Same one can be supported by radically different architectures
• Architectural distinctions that affect software– Compilers, libraries, programs
• Design of user/system and hardware/software interface– Constrained from above by progr. models and below by technology
• Guiding principles provided by layers– What primitives are provided at communication abstraction
– How programming models map to these
– How they are mapped to hardware
Fundamental Design Issues
• At any layer, interface (contract) aspect and performance aspects
– Naming: How are logically shared data and/or processes referenced?
– Operations: What operations are provided on these data
– Ordering: How are accesses to data ordered and coordinated?
– Replication: How are data replicated to reduce communication?
– Communication Cost: Latency, bandwidth, overhead, occupancy
• Understand at programming model first, since that sets requirements
• Other issues
– Node Granularity: How to split between processors and memory?
– ...
Sequential Programming Model
• Contract– Naming: Can name any variable in virtual address space
» Hardware (and perhaps compilers) does translation to physical addresses
– Operations: Loads and Stores
– Ordering: Sequential program order
• Performance– Rely on dependences on single location (mostly): dependence order
– Compilers and hardware violate other orders without getting caught
– Compiler: reordering and register allocation
– Hardware: out of order, pipeline bypassing, write buffers
– Transparent replication in caches
Shared Address Space (SAS) Programming Model
• Naming: Any process can name any variable in shared space
• Operations: loads and stores, plus those needed for ordering
• Simplest Ordering Model: – Within a process/thread: sequential program order
– Across threads: some interleaving (as in time-sharing)
– Additional orders through synchronization
– Again, compilers/hardware can violate orders without getting caught
» Different, more subtle ordering models also possible (discussed later)
Synchronization
• Mutual exclusion (locks)– Ensure certain operations on certain data can be performed by
only one process at a time
– Room that only one person can enter at a time
– No ordering guarantees
• Event synchronization – Ordering of events to preserve dependences
» e.g. producer —> consumer of data
– 3 main types:
» point-to-point
» global
» group
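A minimal POSIX-threads sketch of both kinds of synchronization (an illustration, not from the original slides): a mutex gives mutual exclusion on shared data, and a condition variable gives point-to-point event synchronization from producer to consumer.

#include <pthread.h>

int data = 0, data_ready = 0;                       /* shared state */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;      /* mutual exclusion (lock) */
pthread_cond_t  c = PTHREAD_COND_INITIALIZER;       /* point-to-point event synchronization */

void produce(int v)
{
    pthread_mutex_lock(&m);                         /* only one thread inside at a time */
    data = v;
    data_ready = 1;
    pthread_cond_signal(&c);                        /* event: wake the waiting consumer */
    pthread_mutex_unlock(&m);
}

int consume(void)
{
    pthread_mutex_lock(&m);
    while (!data_ready)                             /* waiting preserves the producer -> consumer dependence */
        pthread_cond_wait(&c, &m);
    int v = data;
    pthread_mutex_unlock(&m);
    return v;
}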
Message Passing Programming Model
• Naming: Processes can name private data directly. – No shared address space
• Operations: Explicit communication through send and receive
– Send transfers data from private address space to another process
– Receive copies data from process to private address space
– Must be able to name processes
• Ordering: – Program order within a process
– Send and receive can provide pt to pt synch between processes
– Mutual exclusion inherent
• Can construct global address space:– Process number + address within process address space
– But no direct operations on these names
Design Issues Apply at All Layers
• Prog. model’s position provides constraints/goals for system
• In fact, each interface between layers supports or takes a position on:
– Naming model
– Set of operations on names
– Ordering model
– Replication
– Communication performance
• Any set of positions can be mapped to any other by software
• Let’s see issues across layers– How lower layers can support contracts of programming models
– Performance issues
Naming and Operations
• Naming and operations in programming model can be directly supported by lower levels, or translated by compiler, libraries or OS
• Example: Shared virtual address space in programming model
• Hardware interface supports shared physical address space – Direct support by hardware through v-to-p mappings, no software layers
• Hardware supports independent physical address spaces– Can provide SAS through OS, so in system/user interface
» v-to-p mappings only for data that are local
» remote data accesses incur page faults; brought in via page fault handlers (see the sketch below)
» same programming model, different hardware requirements and cost model
– Or through compilers or runtime, so above sys/user interface
» shared objects, instrumentation of shared accesses, compiler support
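A rough sketch of the page-fault path mentioned above, using POSIX mprotect and a SIGSEGV handler (an assumption-laden illustration, not a real SVM system; fetch_remote_page() is a hypothetical placeholder for copying the page from its owning node).

#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

extern void fetch_remote_page(void *page);          /* hypothetical: copy page contents from owner */

static long page_size;

static void svm_fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1));
    mprotect(page, page_size, PROT_READ | PROT_WRITE);   /* make the local copy accessible */
    fetch_remote_page(page);                             /* bring the remote data in */
}                                                        /* returning retries the faulting access */

void install_svm_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    page_size = sysconf(_SC_PAGESIZE);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = svm_fault_handler;
    sigaction(SIGSEGV, &sa, NULL);                       /* shared pages would start mapped PROT_NONE */
}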
Naming and Operations (contd)
• Example: Implementing Message Passing
• Direct support at hardware interface– But match and buffering benefit from more flexibility
• Support at sys/user interface or above in software (almost always)
– Hardware interface provides basic data transport (well suited)
– Send/receive built in sw for flexibility (protection, buffering)
– Choices at user/system interface:
» OS each time: expensive
» OS sets up once/infrequently, then little sw involvement each time
– Or lower interfaces provide SAS, and send/receive built on top with buffers and loads/stores
• Need to examine the issues and tradeoffs at every layer– Frequencies and types of operations, costs
Ordering
• Message passing: no assumptions on orders across processes except those imposed by send/receive pairs
• SAS: How processes see the order of other processes’ references defines semantics of SAS
– Ordering very important and subtle
– Uniprocessors play tricks with orders to gain parallelism or locality
– These are more important in multiprocessors
– Need to understand which old tricks are valid, and learn new ones
– How programs behave, what they rely on, and hardware implications
Replication
• Very important for reducing data transfer/communication
• Again, depends on naming model
• Uniprocessor: caches do it automatically– Reduce communication with memory
• Message Passing naming model at an interface– A receive replicates, giving a new name; subsequently use new name
– Replication is explicit in software above that interface
• SAS naming model at an interface– A load brings in data transparently, so can replicate transparently
– Hardware caches do this, e.g. in shared physical address space
– OS can do it at page level in shared virtual address space, or objects
– No explicit renaming, many copies for same name: coherence problem
» in uniprocessors, “coherence” of copies is natural in memory hierarchy
Communication Performance
• Performance characteristics determine usage of operations at a layer
– Programmer, compilers etc make choices based on this
• Fundamentally, three characteristics:– Latency: time taken for an operation
– Bandwidth: rate of performing operations
– Cost: impact on execution time of program
• If processor does one thing at a time: bandwidth ∝ 1/latency
• Characteristics apply to overall operations, as well as individual components of a system, however small
• We’ll focus on communication or data transfer across nodes
Outline
• Computer Science at a Crossroads: Parallelism– Architecture: multi-core and many-cores
– Program: multi-threading
• Parallel Architecture– What is Parallel Architecture?
– Why Parallel Architecture?
– Evolution and Convergence of Parallel Architectures
– Fundamental Design Issues
• Parallel Programs– Why bother with programs?
– Important for whom?
• Memory & Storage Subsystem Architectures
Why Bother with Programs?
• They’re what runs on the machines we design– Helps make design decisions
– Helps evaluate systems tradeoffs
• Led to the key advances in uniprocessor architecture
– Caches and instruction set design
• More important in multiprocessors– New degrees of freedom
– Greater penalties for mismatch between program and architecture
Important for Whom?
• Algorithm designers– Designing algorithms that will run well on real systems
• Programmers– Understanding key issues and obtaining best performance
• Architects– Understand workloads, interactions, important degrees of
freedom
– Valuable for design and for evaluation
Software
• 1. Parallel programs– Process of parallelization– What parallel programs look like in major programming models
• 2. Programming for performance– Key performance issues and architectural interactions
• 3. Workload-driven architectural evaluation– Beneficial for architects and for users in procuring machines
• Unlike on sequential systems, can’t take workload for granted
– Software base not mature; evolves with architectures for performance
– So need to open the box
• Let’s begin with parallel programs ...
Outline
• Motivating Problems (application case studies)
• Steps in creating a parallel program
• What a simple parallel program looks like– In the three major programming models
– What primitives must a system support?
• Later: Performance issues and architectural interactions
Motivating Problems
• Simulating Ocean Currents– Regular structure, scientific computing
• Simulating the Evolution of Galaxies– Irregular structure, scientific computing
• Rendering Scenes by Ray Tracing– Irregular structure, computer graphics
• Data Mining– Irregular structure, information processing
– Not discussed here (read in book)
Simulating Ocean Currents
– Model as two-dimensional grids– Discretize in space and time
» finer spatial and temporal resolution => greater accuracy
– Many different computations per time step» set up and solve equations
– Concurrency across and within grid computations
[Figure: (a) cross sections; (b) spatial discretization of a cross section]
Simulating Galaxy Evolution
– Simulate the interactions of many stars evolving over time
– Computing forces is expensive
– O(n2) brute force approach
– Hierarchical methods take advantage of the force law: F = G·m1·m2 / r²
• Many time-steps, plenty of concurrency across stars within one
[Figure: Barnes-Hut grouping around the star on which forces are being computed: stars too close to approximate, a small group far enough away to approximate by its center of mass, and a large group far enough away to approximate]
Rendering Scenes by Ray Tracing
– Shoot rays into scene through pixels in image plane
– Follow their paths
» they bounce around as they strike objects
» they generate new rays: ray tree per input ray
– Result is color and opacity for that pixel
– Parallelism across rays
• All case studies have abundant concurrency
Creating a Parallel Program
• Assumption: Sequential algorithm is given– Sometimes need very different algorithm, but beyond scope
• Pieces of the job:– Identify work that can be done in parallel
– Partition work and perhaps data among processes
– Manage data access, communication and synchronization
– Note: work includes computation, data access and I/O
• Main goal: Speedup (plus low prog. effort and resource needs)
• Speedup (p) = Performance(p) / Performance(1)
• For a fixed problem:
• Speedup (p) = Time(1) / Time(p)
Steps in Creating a Parallel Program
• 4 steps: Decomposition, Assignment, Orchestration, Mapping
– Done by programmer or system software (compiler, runtime, ...)
– Issues are the same, so assume programmer does it all explicitly
[Figure: a sequential computation is decomposed into tasks, tasks are assigned to processes (decomposition + assignment = partitioning), orchestration turns them into a parallel program, and mapping places processes p0-p3 onto processors P0-P3]
Some Important Concepts
• Task:
– Arbitrary piece of undecomposed work in parallel computation
– Executed sequentially; concurrency is only across tasks
– E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
– Fine-grained versus coarse-grained tasks
• Process (thread): – Abstract entity that performs the tasks assigned to processes
– Processes communicate and synchronize to perform their tasks
• Processor: – Physical engine on which process executes
– Processes virtualize machine to programmer
» first write program in terms of processes, then map to processors
Decomposition
• Break up computation into tasks to be divided among processes
– Tasks may become available dynamically
– No. of available tasks may vary with time
• i.e. identify concurrency and decide level at which to exploit it
• Goal: Enough tasks to keep processes busy, but not too many
– No. of tasks available at a time is upper bound on achievable speedup
Limited Concurrency: Amdahl’s Law
– Most fundamental limitation on parallel speedup
– If fraction s of seq execution is inherently serial, speedup <= 1/s
– Example: 2-phase calculation
» sweep over n-by-n grid and do some independent computation
» sweep again and add each value to global sum
– Time for first phase = n²/p
– Second phase serialized at global variable, so time = n²
– Speedup <= 2n² / (n²/p + n²), or at most 2
– Trick: divide second phase into two
» accumulate into private sum during sweep
» add per-process private sum into global sum
– Parallel time is n²/p + n²/p + p, and speedup at best 2n² / (2n² + p²)
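A small C sketch that plugs numbers into the two bounds above (illustrative values n = 1000, p = 64, not from the original slides):

#include <stdio.h>

int main(void)
{
    double n = 1000.0, p = 64.0;
    double serial   = 2.0 * n * n;                 /* two sweeps on one processor */
    double version1 = n * n / p + n * n;           /* second phase fully serialized */
    double version2 = n * n / p + n * n / p + p;   /* private sums, then p additions */
    printf("speedup with serialized phase 2: %.2f\n", serial / version1);  /* approaches 2 */
    printf("speedup with private sums:       %.2f\n", serial / version2);  /* close to p */
    return 0;
}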
Pictorial Depiction
[Figure: pictorial depiction of work done concurrently vs. time for the three cases discussed above, from the fully serialized second phase (n²/p followed by n²) to the private-sum version with parallel time n²/p + n²/p + p]
Concurrency Profiles
– Area under curve is total work done, or time with 1 processor
– Horizontal extent is lower bound on time (infinite processors)
– Speedup is the ratio  Σ(k=1..∞) f_k·k  /  Σ(k=1..∞) f_k·⌈k/p⌉ ;  base case: 1 / (s + (1−s)/p)
– Amdahl's law applies to any overhead, not just limited concurrency
• Cannot usually divide into serial and parallel part
[Figure: concurrency profile of a parallel run, plotting concurrency (0 to ~1,400) against clock cycle number]
Assignment
• Specifying mechanism to divide work up among processes
– E.g. which process computes forces on which stars, or which rays
– Together with decomposition, also called partitioning
– Balance workload, reduce communication and management cost
• Structured approaches usually work well
– Code inspection (parallel loops) or understanding of application
– Well-known heuristics
– Static versus dynamic assignment
• As programmers, we worry about partitioning first
– Usually independent of architecture or prog model
– But cost and complexity of using primitives may affect decisions
• As architects, we assume program does reasonable job of it
Orchestration
– Naming data
– Structuring communication
– Synchronization
– Organizing data structures and scheduling tasks temporally
• Goals
– Reduce cost of communication and synch. as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce overhead of parallelism management
• Closest to architecture (and programming model & language)
– Choices depend a lot on comm. abstraction, efficiency of primitives
– Architects should provide appropriate primitives efficiently
Mapping
• After orchestration, already have parallel program
• Two aspects of mapping:
– Which processes will run on same processor, if necessary
– Which process runs on which particular processor
» mapping to a network topology
• One extreme: space-sharing
– Machine divided into subsets, only one app at a time in a subset
– Processes can be pinned to processors, or left to OS
• Another extreme: complete resource management control to OS
– OS uses the performance techniques we will discuss later
• Real world is between the two– User specifies desires in some aspects, system may ignore
• Usually adopt the view: process <-> processor
Parallelizing Computation vs. Data
• Above view is centered around computation
  – Computation is decomposed and assigned (partitioned)
• Partitioning data is often a natural view too
  – Computation follows data: owner computes
  – Grid example; data mining; High Performance Fortran (HPF)
• But not general enough
  – Distinction between comp. and data stronger in many applications
    » Barnes-Hut, Raytrace (later)
  – Retain computation-centric view
  – Data access and communication is part of orchestration
04/18/23 107CSCE930-Advanced Computer Architecture, Introduction
High-level Goals
• High performance (speedup over sequential program)
• But low resource usage and development effort
• Implications for algorithm designers and architects
  – Algorithm designers: high-perf., low resource needs
  – Architects: high-perf., low cost, reduced programming effort
    » e.g. gradually improving perf. with programming effort may be preferable to a sudden threshold after large programming effort
Table 2.1 Steps in the Parallelization Process and Their Goals

Step           Architecture-Dependent?   Major Performance Goals
Decomposition  Mostly no                 Expose enough concurrency but not too much
Assignment     Mostly no                 Balance workload
                                         Reduce communication volume
Orchestration  Yes                       Reduce noninherent communication via data locality
                                         Reduce communication and synchronization cost as seen by the processor
                                         Reduce serialization at shared resources
                                         Schedule tasks to satisfy dependences early
Mapping        Yes                       Put related processes on the same processor if necessary
                                         Exploit locality in network topology
04/18/23 108CSCE930-Advanced Computer Architecture, Introduction
What Parallel Programs Look Like
Parallelization of An Example Program
• Motivating problems all lead to large, complex programs
• Examine a simplified version of a piece of Ocean simulation
– Iterative equation solver
• Illustrate parallel program in low-level parallel language
– C-like pseudocode with simple extensions for parallelism
– Expose basic comm. and synch. primitives that must be supported
– State of most real parallel programming today
04/18/23 110CSCE930-Advanced Computer Architecture, Introduction
Grid Solver Example
– Simplified version of solver in Ocean simulation
– Gauss-Seidel (near-neighbor) sweeps to convergence
  » interior n-by-n points of (n+2)-by-(n+2) grid updated in each sweep
  » updates done in-place in grid, and diff. from prev. value computed
  » accumulate partial diffs into global diff at end of every sweep
  » check if error has converged (to within a tolerance parameter)
  » if so, exit solver; if not, do another sweep
– Expression for updating each interior point:
  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
04/18/23 111CSCE930-Advanced Computer Architecture, Introduction
1.  int n;                  /*size of matrix: (n + 2)-by-(n + 2) elements*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n);              /*read input parameter: matrix size*/
6.    A ← malloc(a 2-d array of size n + 2 by n + 2 doubles);
7.    initialize(A);        /*initialize the matrix A somehow*/
8.    Solve(A);             /*call the routine to solve equation*/
9.  end main

10. procedure Solve(A)      /*solve the equation system*/
11.   float **A;            /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do      /*outermost loop over sweeps*/
16.     diff = 0;           /*initialize maximum difference to 0*/
17.     for i ← 1 to n do   /*sweep over nonborder points of grid*/
18.       for j ← 1 to n do
19.         temp = A[i,j];  /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure
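For readers who want to run the kernel outside the pseudocode, a compilable C sketch of the same sequential solver is shown below (the grid size, tolerance, and boundary initialization are my own choices, not part of the slide):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define N   256          /* interior grid is N x N; borders add one row/column on each side */
    #define TOL 1e-3f        /* convergence tolerance (arbitrary choice) */

    int main(void) {
        int n = N, i, j, done = 0;
        float (*A)[N + 2] = calloc(n + 2, sizeof *A);   /* (n+2) x (n+2) grid, zero-initialized */
        if (A == NULL) return 1;
        for (i = 0; i < n + 2; i++) A[i][0] = 1.0f;     /* arbitrary nonzero boundary condition */

        while (!done) {                                 /* outermost loop over sweeps */
            float diff = 0.0f;
            for (i = 1; i <= n; i++)                    /* sweep over nonborder points */
                for (j = 1; j <= n; j++) {
                    float temp = A[i][j];               /* save old value of element */
                    A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                      A[i][j+1] + A[i+1][j]);
                    diff += fabsf(A[i][j] - temp);
                }
            if (diff / ((float)n * n) < TOL) done = 1;  /* convergence check */
        }
        printf("converged\n");
        free(A);
        return 0;
    }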
04/18/23 112CSCE930-Advanced Computer Architecture, Introduction
Decomposition
• Simple way to identify concurrency is to look at loop iterations
  – dependence analysis; if not enough concurrency, then look further
• Not much concurrency here at this level (all loops sequential)
• Examine fundamental dependences, ignoring loop structure
  – Concurrency O(n) along anti-diagonals, serialization O(n) along diag.
  – Retain loop structure, use pt-to-pt synch; Problem: too many synch ops.
  – Restructure loops, use global synch; imbalance and too much synch
04/18/23 113CSCE930-Advanced Computer Architecture, Introduction
Exploit Application Knowledge
• Reorder grid traversal: red-black ordering
  [Figure: grid checkerboard of alternating red and black points]
  – Different ordering of updates: may converge quicker or slower
  – Red sweep and black sweep are each fully parallel
  – Global synch between them (conservative but convenient)
  – Ocean uses red-black; we use simpler, asynchronous one to illustrate
    » no red-black, simply ignore dependences within sweep
    » sequential order same as original; parallel program nondeterministic
04/18/23 114CSCE930-Advanced Computer Architecture, Introduction
Decomposition Only
– Decomposition into elements: degree of concurrency n^2
– To decompose into rows, make line 18 loop sequential; degree n
– for_all leaves assignment to the system
  » but implicit global synch. at end of for_all loop

15.  while (!done) do            /*a sequential loop*/
16.    diff = 0;
17.    for_all i ← 1 to n do     /*a parallel loop nest*/
18.      for_all j ← 1 to n do
19.        temp = A[i,j];
20.        A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                 A[i,j+1] + A[i+1,j]);
22.        diff += abs(A[i,j] - temp);
23.      end for_all
24.    end for_all
25.    if (diff/(n*n) < TOL) then done = 1;
26.  end while
04/18/23 115CSCE930-Advanced Computer Architecture, Introduction
Assignment
• Static assignments (given decomposition into rows)
  – block assignment of rows: row i is assigned to process ⌊i/(n/p)⌋
  – cyclic assignment of rows: process i is assigned rows i, i+p, and so on
  [Figure: contiguous blocks of rows assigned to processes P0, P1, P2, ...]
• Dynamic assignment
  » get a row index, work on the row, get a new row, and so on
• Static assignment into rows reduces concurrency (from n to p)
  » block assign. reduces communication by keeping adjacent rows together
• Let’s dig into orchestration under three programming models
04/18/23 116CSCE930-Advanced Computer Architecture, Introduction
Data Parallel Solver

1.   int n, nprocs;             /*grid size (n + 2)-by-(n + 2) and number of processes*/
2.   float **A, diff = 0;
3.   main()
4.   begin
5.     read(n); read(nprocs);   /*read input grid size and number of processes*/
6.     A ← G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
7.     initialize(A);           /*initialize the matrix A somehow*/
8.     Solve(A);                /*call the routine to solve equation*/
9.   end main

10.  procedure Solve(A)         /*solve the equation system*/
11.    float **A;               /*A is an (n + 2)-by-(n + 2) array*/
12.  begin
13.    int i, j, done = 0;
14.    float mydiff = 0, temp;
14a.   DECOMP A[BLOCK,*, nprocs];
15.    while (!done) do         /*outermost loop over sweeps*/
16.      mydiff = 0;            /*initialize maximum difference to 0*/
17.      for_all i ← 1 to n do  /*sweep over non-border points of grid*/
18.        for_all j ← 1 to n do
19.          temp = A[i,j];     /*save old value of element*/
20.          A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                   A[i,j+1] + A[i+1,j]);   /*compute average*/
22.          mydiff += abs(A[i,j] - temp);
23.        end for_all
24.      end for_all
24a.     REDUCE(mydiff, diff, ADD);
25.      if (diff/(n*n) < TOL) then done = 1;
26.    end while
27.  end procedure
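The for_all loops plus REDUCE map fairly directly onto OpenMP; below is a minimal sketch of one sweep (my own adaptation, not course code — and, like the pseudocode, it deliberately ignores dependences within a sweep):

    #include <math.h>

    /* one sweep over the interior of an (n+2) x (n+2) grid; returns the accumulated diff */
    float sweep(int n, float A[n + 2][n + 2]) {
        float diff = 0.0f;
        /* parallel loop nest ~ for_all; reduction(+:diff) plays the role of REDUCE(mydiff, diff, ADD) */
        #pragma omp parallel for reduction(+:diff)
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        return diff;
    }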
04/18/23 117CSCE930-Advanced Computer Architecture, Introduction
Shared Address Space Solver
– Assignment controlled by values of variables used as loop bounds
Single Program Multiple Data (SPMD)
[Figure: each of the processes executes Solve; within Solve, a Sweep phase is followed by a Test Convergence phase]
04/18/23 118CSCE930-Advanced Computer Architecture, Introduction
1.    int n, nprocs;        /*matrix dimension and number of processors to be used*/
2a.   float **A, diff;      /*A is global (shared) array representing the grid*/
                            /*diff is global (shared) maximum difference in current sweep*/
2b.   LOCKDEC(diff_lock);   /*declaration of lock to enforce mutual exclusion*/
2c.   BARDEC(bar1);         /*barrier declaration for global synchronization between sweeps*/
3.    main()
4.    begin
5.      read(n); read(nprocs);   /*read input matrix size and number of processes*/
6.      A ← G_MALLOC(a two-dimensional array of size n+2 by n+2 doubles);
7.      initialize(A);           /*initialize A in an unspecified way*/
8a.     CREATE(nprocs–1, Solve, A);
8.      Solve(A);                /*main process becomes a worker too*/
8b.     WAIT_FOR_END(nprocs–1);  /*wait for all child processes created to terminate*/
9.    end main

10.   procedure Solve(A)
11.     float **A;               /*A is entire n+2-by-n+2 shared array, as in the sequential program*/
12.   begin
13.     int i, j, pid, done = 0;
14.     float temp, mydiff = 0;  /*private variables*/
14a.    int mymin = 1 + (pid * n/nprocs);   /*assume that n is exactly divisible by*/
14b.    int mymax = mymin + n/nprocs - 1;   /*nprocs for simplicity here*/
15.     while (!done) do         /*outer loop over sweeps*/
16.       mydiff = diff = 0;     /*set global diff to 0 (okay for all to do it)*/
16a.      BARRIER(bar1, nprocs); /*ensure all reach here before anyone modifies diff*/
17.       for i ← mymin to mymax do   /*for each of my rows*/
18.         for j ← 1 to n do         /*for all nonborder elements in that row*/
19.           temp = A[i,j];
20.           A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                    A[i,j+1] + A[i+1,j]);
22.           mydiff += abs(A[i,j] - temp);
23.         endfor
24.       endfor
25a.      LOCK(diff_lock);       /*update global diff if necessary*/
25b.      diff += mydiff;
25c.      UNLOCK(diff_lock);
25d.      BARRIER(bar1, nprocs); /*ensure all reach here before checking if done*/
25e.      if (diff/(n*n) < TOL) then done = 1;   /*check convergence; all get same answer*/
25f.      BARRIER(bar1, nprocs);
26.     endwhile
27.   end procedure
04/18/23 119CSCE930-Advanced Computer Architecture, Introduction
Notes on SAS Program
– SPMD: not lockstep or even necessarily same instructions
– Assignment controlled by values of variables used as loop bounds
» unique pid per process, used to control assignment
– Done condition evaluated redundantly by all
– Code that does the update identical to sequential program
» each process has private mydiff variable
– Most interesting special operations are for synchronization
» accumulations into shared diff have to be mutually exclusive
» why the need for all the barriers?
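The CREATE / WAIT_FOR_END primitives used above correspond roughly to pthread_create / pthread_join; a minimal sketch (the worker body is reduced to a stub here):

    #include <pthread.h>
    #include <stdio.h>

    #define NPROCS 4

    static void *Solve(void *arg) {      /* stand-in for the real per-process solver body */
        (void)arg;
        printf("worker sweeping\n");
        return NULL;
    }

    int main(void) {
        pthread_t tid[NPROCS - 1];
        for (int i = 0; i < NPROCS - 1; i++)      /* CREATE(nprocs-1, Solve, A) */
            pthread_create(&tid[i], NULL, Solve, NULL);
        Solve(NULL);                              /* main process becomes a worker too */
        for (int i = 0; i < NPROCS - 1; i++)      /* WAIT_FOR_END(nprocs-1) */
            pthread_join(tid[i], NULL);
        return 0;
    }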
04/18/23 120CSCE930-Advanced Computer Architecture, Introduction
Need for Mutual Exclusion
– Code each process executes:
• load the value of diff into register r1
• add the register r2 to register r1
• store the value of register r1 into diff
– A possible interleaving of P1 and P2:
  • P1: r1 ← diff        {P1 gets 0 in its r1}
  • P2: r1 ← diff        {P2 also gets 0}
  • P1: r1 ← r1+r2       {P1 sets its r1 to 1}
  • P2: r1 ← r1+r2       {P2 sets its r1 to 1}
  • P1: diff ← r1        {P1 sets diff to 1}
  • P2: diff ← r1        {P2 also sets diff to 1}
– Need the sets of operations to be atomic (mutually exclusive)
04/18/23 121CSCE930-Advanced Computer Architecture, Introduction
Mutual Exclusion
• Provided by LOCK-UNLOCK around critical section
– Set of operations we want to execute atomically
– Implementation of LOCK/UNLOCK must guarantee mutual excl.
• Can lead to significant serialization if contended
  – Especially since we expect non-local accesses in the critical section
– Another reason to use private mydiff for partial accumulation
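With POSIX threads, LOCK/UNLOCK around the accumulation would be a mutex; a minimal sketch (the names diff_lock and add_partial are mine):

    #include <pthread.h>

    float diff = 0.0f;                                        /* shared global difference */
    pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;    /* plays the role of LOCKDEC(diff_lock) */

    /* each worker accumulates into a private mydiff, then updates the shared sum once per sweep */
    void add_partial(float mydiff) {
        pthread_mutex_lock(&diff_lock);      /* LOCK(diff_lock)  : enter critical section */
        diff += mydiff;                      /* read-modify-write is now atomic w.r.t. other workers */
        pthread_mutex_unlock(&diff_lock);    /* UNLOCK(diff_lock): leave critical section */
    }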
04/18/23 122CSCE930-Advanced Computer Architecture, Introduction
Global Event Synchronization
• BARRIER(nprocs): wait here till nprocs processes get here
– Built using lower level primitives
– Global sum example: wait for all to accumulate before using sum
– Often used to separate phases of computation
• Process P_1 Process P_2 Process P_nprocs
• set up eqn system set up eqn system set up eqn system
• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)
• solve eqn system solve eqn system solve eqn system
• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)
• apply results apply results apply results
• Barrier (name, nprocs) Barrier (name, nprocs) Barrier (name, nprocs)
– Conservative form of preserving dependences, but easy to use
• WAIT_FOR_END (nprocs-1)
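Where POSIX barriers are available, the phase structure above can be written as follows (a sketch; the phase functions are stubs standing in for the real per-process work):

    #include <pthread.h>

    static void set_up_eqn_system(void) { /* per-process setup work     */ }
    static void solve_eqn_system(void)  { /* per-process solve work     */ }
    static void apply_results(void)     { /* per-process use of results */ }

    pthread_barrier_t bar1;   /* main thread, once: pthread_barrier_init(&bar1, NULL, nprocs); */

    void worker_phases(void) {
        set_up_eqn_system();
        pthread_barrier_wait(&bar1);   /* no one starts solving until all have set up */
        solve_eqn_system();
        pthread_barrier_wait(&bar1);   /* no one applies results until all have solved */
        apply_results();
    }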
04/18/23 123CSCE930-Advanced Computer Architecture, Introduction
Pt-to-pt Event Synch (Not Used Here)
• One process notifies another of an event so it can proceed
  – Common example: producer-consumer (bounded buffer)
– Concurrent programming on uniprocessor: semaphores
– Shared address space parallel programs: semaphores, or use ordinary variables as flags
• Busy-waiting or spinning

    P1:                 P2:
      A = 1;              a: while (flag is 0) do nothing;
      b: flag = 1;        print A;
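In C11, the flag would typically be an atomic so the compiler and hardware cannot reorder around it; a minimal sketch of the same P1/P2 idiom (function names are mine):

    #include <stdatomic.h>
    #include <stdio.h>

    int A = 0;                /* ordinary data produced by P1 */
    atomic_int flag = 0;      /* ordinary variable used as a flag, made atomic */

    void p1(void) {           /* producer */
        A = 1;
        atomic_store(&flag, 1);            /* b: flag = 1 — publishes A */
    }

    void p2(void) {           /* consumer */
        while (atomic_load(&flag) == 0)
            ;                              /* a: busy-wait (spin) until P1 signals */
        printf("%d\n", A);                 /* sees A = 1 thanks to the atomic flag */
    }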
04/18/23 124CSCE930-Advanced Computer Architecture, Introduction
Group Event Synchronization
• Subset of processes involved
  – Can use flags or barriers (involving only the subset)
  – Concept of producers and consumers
• Major types:
  – Single-producer, multiple-consumer
  – Multiple-producer, single-consumer
04/18/23 125CSCE930-Advanced Computer Architecture, Introduction
Message Passing Grid Solver
– Cannot declare A to be shared array any more
– Need to compose it logically from per-process private arrays
» usually allocated in accordance with the assignment of work
» process assigned a set of rows allocates them locally
– Transfers of entire rows between traversals
– Structurally similar to SAS (e.g. SPMD), but orchestration different
» data structures and data access/naming
» communication
» synchronization
04/18/23 126CSCE930-Advanced Computer Architecture, Introduction
1.    int pid, n, b;     /*process id, matrix dimension and number of processors to be used*/
2.    float **myA;
3.    main()
4.    begin
5.      read(n); read(nprocs);     /*read input matrix size and number of processes*/
8a.     CREATE(nprocs-1, Solve);
8b.     Solve();                   /*main process becomes a worker too*/
8c.     WAIT_FOR_END(nprocs–1);    /*wait for all child processes created to terminate*/
9.    end main

10.   procedure Solve()
11.   begin
13.     int i, j, pid, n’ = n/nprocs, done = 0;
14.     float temp, tempdiff, mydiff = 0;     /*private variables*/
6.      myA ← malloc(a 2-d array of size [n/nprocs + 2] by n+2);   /*my assigned rows of A*/
7.      initialize(myA);           /*initialize my rows of A, in an unspecified way*/
15.     while (!done) do
16.       mydiff = 0;              /*set local diff to 0*/
16a.      if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.      if (pid != nprocs-1) then SEND(&myA[n’,0], n*sizeof(float), pid+1, ROW);
16c.      if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.      if (pid != nprocs-1) then RECEIVE(&myA[n’+1,0], n*sizeof(float), pid+1, ROW);
          /*border rows of neighbors have now been copied into myA[0,*] and myA[n’+1,*]*/
17.       for i ← 1 to n’ do       /*for each of my (nonghost) rows*/
18.         for j ← 1 to n do      /*for all nonborder elements in that row*/
19.           temp = myA[i,j];
20.           myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                      myA[i,j+1] + myA[i+1,j]);
22.           mydiff += abs(myA[i,j] - temp);
23.         endfor
24.       endfor
          /*communicate local diff values and determine if done; can be replaced by reduction and broadcast*/
25a.      if (pid != 0) then                  /*process 0 holds global total diff*/
25b.        SEND(mydiff, sizeof(float), 0, DIFF);
25c.        RECEIVE(done, sizeof(int), 0, DONE);
25d.      else                                /*pid 0 does this*/
25e.        for i ← 1 to nprocs-1 do          /*for each other process*/
25f.          RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.          mydiff += tempdiff;             /*accumulate into total*/
25h.        endfor
25i.        if (mydiff/(n*n) < TOL) then done = 1;
25j.        for i ← 1 to nprocs-1 do          /*for each other process*/
25k.          SEND(done, sizeof(int), i, DONE);
25l.        endfor
25m.      endif
26.     endwhile
27.   end procedure
04/18/23 127CSCE930-Advanced Computer Architecture, Introduction
Notes on Message Passing Program
– Use of ghost rows
– Receive does not transfer data, send does
» unlike SAS which is usually receiver-initiated (load fetches data)
– Communication done at beginning of iteration, so no asynchrony
– Communication in whole rows, not element at a time
– Core similar, but indices/bounds in local rather than global space
– Synchronization through sends and receives
» Update of global diff and event synch for done condition
» Could implement locks and barriers with messages
– Can use REDUCE and BROADCAST library calls to simplify code
          /*communicate local diff values and determine if done, using reduction and broadcast*/
25b.      REDUCE(0, mydiff, sizeof(float), ADD);
25c.      if (pid == 0) then
25i.        if (mydiff/(n*n) < TOL) then done = 1;
25k.      endif
25m.      BROADCAST(0, done, sizeof(int), DONE);
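In MPI, the REDUCE-then-BROADCAST of the convergence test collapses into a single collective; a minimal sketch (assumes an MPI environment; the helper name check_done is mine):

    #include <mpi.h>

    /* every process contributes its local mydiff; every process gets the global sum back in diff */
    int check_done(float mydiff, int n, float tol) {
        float diff;
        MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        return (diff / ((float)n * n)) < tol;    /* all ranks compute the same answer */
    }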
04/18/23 128CSCE930-Advanced Computer Architecture, Introduction
Send and Receive Alternatives
• Send/Receive variants: synchronous vs. asynchronous; asynchronous further split into blocking and nonblocking
• Can extend functionality: stride, scatter-gather, groups
• Semantic flavors: based on when control is returned
  – Affect when data structures or buffers can be reused at either end
  – Affect event synch (mutual excl. by fiat: only one process touches data)
  – Affect ease of programming and performance
• Synchronous messages provide built-in synch. through match
  – Separate event synchronization needed with asynch. messages
• With synch. messages, our code is deadlocked. Fix? (see the sketch below)
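One standard fix is to break the symmetry so that matching sends and receives can pair up, e.g. even-numbered processes send their border row first while odd-numbered ones receive first; a minimal MPI-flavored sketch using synchronous sends (tags and the helper name are mine):

    #include <mpi.h>

    /* exchange one border row of n floats with neighbor 'nbr'; parity decides who sends first,
       so two neighbors are never both blocked in a synchronous send at the same time */
    void exchange_row(float *send_row, float *recv_row, int n, int pid, int nbr) {
        if (pid % 2 == 0) {
            MPI_Ssend(send_row, n, MPI_FLOAT, nbr, 0, MPI_COMM_WORLD);
            MPI_Recv(recv_row, n, MPI_FLOAT, nbr, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(recv_row, n, MPI_FLOAT, nbr, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Ssend(send_row, n, MPI_FLOAT, nbr, 0, MPI_COMM_WORLD);
        }
    }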
04/18/23 129CSCE930-Advanced Computer Architecture, Introduction
Orchestration: Summary
• Shared address space
  – Shared and private data explicitly separate
  – Communication implicit in access patterns
  – No correctness need for data distribution
  – Synchronization via atomic operations on shared data
  – Synchronization explicit and distinct from data communication
• Message passing
  – Data distribution among local address spaces needed
  – No explicit shared structures (implicit in comm. patterns)
  – Communication is explicit
  – Synchronization implicit in communication (at least in synch. case)
    » mutual exclusion by fiat
04/18/23 130CSCE930-Advanced Computer Architecture, Introduction
Correctness in Grid Solver Program
• Decomposition and Assignment similar in SAS and message-passing
• Orchestration is different
  – Data structures, data access/naming, communication, synchronization

                                           SAS        Msg-Passing
    Explicit global data structure?        Yes        No
    Assignment indept of data layout?      Yes        No
    Communication                          Implicit   Explicit
    Synchronization                        Explicit   Implicit
    Explicit replication of border rows?   No         Yes
Requirements for performance are another story ...
04/18/23 131CSCE930-Advanced Computer Architecture, Introduction