Transcript
Page 1: Computer Architecture, Part 7 - Electrical and Computer

Feb. 2011 Computer Architecture, Advanced Architectures Slide 1

Part VII: Advanced Architectures

Page 2

About This Presentation

This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

First edition released July 2003; revised July 2004, July 2005, Mar. 2007, and Feb. 2011*

* Minimal update, due to this part not being used for lectures in ECE 154 at UCSB

Page 3

VII Advanced Architectures

Topics in This Part
Chapter 25  Road to Higher Performance
Chapter 26  Vector and Array Processing
Chapter 27  Shared-Memory Multiprocessing
Chapter 28  Distributed Multicomputing

Performance enhancement beyond what we have seen:
• What else can we do at the instruction execution level?
• Data parallelism: vector and array processing
• Control parallelism: parallel and distributed processing

Page 4

25 Road to Higher Performance

Review past, current, and future architectural trends:
• General-purpose and special-purpose acceleration
• Introduction to data and control parallelism

Topics in This Chapter
25.1 Past and Current Performance Trends

25.2 Performance-Driven ISA Extensions

25.3 Instruction-Level Parallelism

25.4 Speculation and Value Prediction

25.5 Special-Purpose Hardware Accelerators

25.6 Vector, Array, and Parallel Processing

Page 5

25.1 Past and Current Performance Trends

Intel 4004, the first microprocessor (1971): 0.06 MIPS (4-bit processor).
Intel Pentium 4, circa 2005: 10,000 MIPS (32-bit processor).
Intermediate generations shown in the figure: 8008, 8080, 8085 (8-bit); 8086, 8088, 80186, 80188, 80286 (16-bit); 80386, 80486, Pentium, Pentium MMX, Pentium Pro, Pentium II, Celeron, Pentium III, Pentium M (32-bit).

Page 6

Architectural Innovations for Improved Performance

Architectural method                              Improvement factor
 1. Pipelining (and superpipelining)              3-8       √
 2. Cache memory, 2-3 levels                      2-5       √
 3. RISC and related ideas                        2-3       √
 4. Multiple instruction issue (superscalar)      2-3       √
 5. ISA extensions (e.g., for multimedia)         1-3       √
 6. Multithreading (super-, hyper-)               2-5       ?
 7. Speculation and value prediction              2-3       ?
 8. Hardware acceleration                         2-10      ?
 9. Vector and array processing                   2-10      ?
10. Parallel/distributed computing                2-1000s   ?

Methods 1-5 (√) are established methods, previously discussed; methods 6-10 (?) are newer methods, covered in Part VII.

Available computing power ca. 2000: GFLOPS on desktop, TFLOPS in supercomputer center, PFLOPS on drawing board.

Computer performance grew by a factor of about 10,000 between 1980 and 2000: 100 due to faster technology and 100 due to better architecture.

Page 7

Peak Performance of Supercomputers

Peak performance grew from GFLOPS (1980) through TFLOPS to PFLOPS (2010), roughly × 10 every 5 years. Milestones plotted: Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, Earth Simulator.

Dongarra, J., “Trends in High Performance Computing,” Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]

Page 8

Energy Consumption is Getting out of Hand

Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs. The plot spans the calendar years 1980-2010, with performance from kIPS to TIPS on the vertical axis; the curves show absolute processor performance, GP processor performance per watt, and DSP performance per watt.

Page 9

25.2 Performance-Driven ISA Extensions

Adding instructions that do more work per cycle:
• Shift-add: replace two instructions with one (e.g., multiply by 5)
• Multiply-add: replace two instructions with one (x := c + a × b)
• Multiply-accumulate: reduce round-off error (s := s + a × b)
• Conditional copy: to avoid some branches (e.g., in if-then-else)

Subword parallelism (for multimedia applications):
• Intel MMX (multimedia extension): 64-bit registers can hold multiple integer operands
• Intel SSE (streaming SIMD extension): 128-bit registers can hold several floating-point operands

Page 10

Intel MMX ISA Extension (Table 25.1)

Class      Instruction                   Vector         Op type         Function or results
Copy       Register copy                 32 bits                        Integer register ↔ MMX register
           Parallel pack                 4, 2           Saturate        Convert to narrower elements
           Parallel unpack low           8, 4, 2                        Merge lower halves of 2 vectors
           Parallel unpack high          8, 4, 2                        Merge upper halves of 2 vectors
Arithmetic Parallel add                  8, 4, 2        Wrap/Saturate#  Add; inhibit carry at boundaries
           Parallel subtract             8, 4, 2        Wrap/Saturate#  Subtract with carry inhibition
           Parallel multiply low         4                              Multiply, keep the 4 low halves
           Parallel multiply high        4                              Multiply, keep the 4 high halves
           Parallel multiply-add         4                              Multiply, add adjacent products*
           Parallel compare equal        8, 4, 2                        All 1s where equal, else all 0s
           Parallel compare greater      8, 4, 2                        All 1s where greater, else all 0s
Shift      Parallel left shift logical   4, 2, 1                        Shift left, respect boundaries
           Parallel right shift logical  4, 2, 1                        Shift right, respect boundaries
           Parallel right shift arith    4, 2                           Arith shift within each (half)word
Logic      Parallel AND                  1              Bitwise         dest ← (src1) ∧ (src2)
           Parallel ANDNOT               1              Bitwise         dest ← (src1) ∧ (src2)′
           Parallel OR                   1              Bitwise         dest ← (src1) ∨ (src2)
           Parallel XOR                  1              Bitwise         dest ← (src1) ⊕ (src2)
Memory     Parallel load MMX reg         32 or 64 bits                  Address given in integer register
access     Parallel store MMX reg        32 or 64 bits                  Address given in integer register
Control    Empty FP tag bits                                            Required for compatibility$

Page 11

MMX Multiplication and Multiply-Add

Figure 25.2 Parallel multiplication and multiply-add in MMX. (a) Parallel multiply low: the element pairs of two four-element vectors are multiplied and the low half of each product is kept, giving four results. (b) Parallel multiply-add: the same pairwise products are formed, then adjacent products are added, giving two double-width sums.
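The two operations in Figure 25.2 can be sketched as follows; a minimal simulation in Python, assuming 16-bit elements (the function names are illustrative, not MMX mnemonics).

```python
# Simulate the two MMX operations of Figure 25.2 on four-element vectors.

def parallel_multiply_low(v1, v2):
    """Multiply pairwise; keep only the low 16 bits of each product."""
    return [(a * b) & 0xFFFF for a, b in zip(v1, v2)]

def parallel_multiply_add(v1, v2):
    """Multiply pairwise, then add adjacent products (two wide sums)."""
    products = [a * b for a, b in zip(v1, v2)]
    return [products[0] + products[1], products[2] + products[3]]

print(parallel_multiply_low([1, 2, 3, 4], [5, 6, 7, 8]))  # [5, 12, 21, 32]
print(parallel_multiply_add([1, 2, 3, 4], [5, 6, 7, 8]))  # [17, 53]
```

The multiply-add form matches the inner product step of filtering and other multimedia kernels, which is why it earned a dedicated instruction.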

Page 12

MMX Parallel Comparisons

Figure 25.3 Parallel comparisons in MMX. (a) Parallel compare equal: each element position yields all 1s (65,535 for 16-bit elements, 255 for 8-bit elements) where the operands are equal and all 0s elsewhere. (b) Parallel compare greater: all 1s where the first operand's element is greater, all 0s elsewhere.

Page 13

25.3 Instruction-Level Parallelism

Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91]. (a) Fraction of cycles (0-30%) versus the number of issuable instructions per cycle (0-8). (b) Speedup attained (saturating near 3) versus instruction issue width (0-8).

Page 14

Instruction-Level Parallelism

Figure 25.5 A computation with inherent instruction-level parallelism.

Page 15

VLIW and EPIC Architectures

Figure 25.6 Hardware organization for IA-64. General and floating-point registers are 64 bits wide. Predicates are single-bit registers. The organization comprises memory, 128 general registers, 128 floating-point registers, 64 predicates, and multiple execution units.

VLIW: very long instruction word architecture
EPIC: explicitly parallel instruction computing

Page 16

25.4 Speculation and Value Prediction

Figure 25.7 Examples of software speculation in IA-64.

(a) Control speculation:

    Original:     ---- ---- ---- ---- load ---- ----
    Speculative:  spec load ---- ---- ---- ---- check load ---- ----

(b) Data speculation:

    Original:     ---- ---- store ---- load ---- ----
    Speculative:  spec load ---- ---- store ---- check load ---- ----

Page 17

Value Prediction

Figure 25.8 Value prediction for multiplication or division via a memo table. When the inputs are ready, they are presented both to the mult/div unit and to a memo table of previously seen operand pairs; a control unit and a mux select the memoized output on a table hit, or the functional unit's output (when done) on a miss.
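The memo-table idea of Figure 25.8 can be sketched in software; a minimal model, assuming a plain dict stands in for the hardware table and a counter stands in for the slow functional unit. Names and values are illustrative.

```python
# Value prediction via a memo table: look the operand pair up first;
# on a hit the cached result is "ready" immediately, and only on a
# miss is the slow multiply actually performed.

memo_table = {}
slow_ops = 0  # counts how often the real (slow) multiplier is used

def predicted_multiply(a, b):
    global slow_ops
    key = (a, b)
    if key in memo_table:          # hit: output ready at once
        return memo_table[key]
    slow_ops += 1                  # miss: run the functional unit
    result = a * b
    memo_table[key] = result
    return result

predicted_multiply(6, 7)
predicted_multiply(6, 7)           # second call hits the memo table
print(slow_ops)  # 1
```

The speedup comes entirely from operand-pair locality: programs that repeat the same divisions or multiplications see most operations resolved by the table.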

Page 18

25.5 Special-Purpose Hardware Accelerators

Figure 25.9 General structure of a processor with configurable hardware accelerators.

The CPU connects to data and program memory and to an FPGA-like unit, on which accelerators (Accel. 1-3) are formed via loading of configuration registers from a configuration memory; leftover resources remain unused.

Page 19

Graphic Processors, Network Processors, etc.

Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems’ network processor.

The network processor comprises an input buffer feeding a grid of sixteen processing elements (PE 0 through PE 15) arranged in four columns, each column with its own column memory, an output buffer, and a feedback path.

Page 20

25.6 Vector, Array, and Parallel Processing

Figure 25.11 The Flynn-Johnson classification of computer systems.

Flynn's categories cross single vs. multiple instruction streams with single vs. multiple data streams:
• SISD: uniprocessors
• SIMD: array or vector processors
• MISD: rarely used
• MIMD: multiprocessors or multicomputers

Johnson's expansion subdivides MIMD by memory organization (global vs. distributed) and communication (shared variables vs. message passing):
• GMSV (global memory, shared variables): shared-memory multiprocessors
• GMMP (global memory, message passing): rarely used
• DMSV (distributed memory, shared variables): distributed shared memory
• DMMP (distributed memory, message passing): distributed-memory multicomputers

Page 21

SIMD Architectures

Data parallelism: executing one operation on multiple data streams

Concurrency in time – vector processing
Concurrency in space – array processing

Example to provide context: multiplying a coefficient vector by a data vector (e.g., in filtering):

y[i] := c[i] × x[i], 0 ≤ i < n

Sources of performance improvement in vector processing (details in the first half of Chapter 26):
• One instruction is fetched and decoded for the entire operation
• The multiplications are known to be independent (no checking)
• Pipelining/concurrency in memory access as well as in arithmetic

Array processing is similar (details in the second half of Chapter 26)
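The filtering example above can be sketched in two forms: the sequential loop and the single whole-vector operation that a vector processor executes with one instruction fetch and decode. The values are illustrative.

```python
# y[i] := c[i] * x[i]: sequential loop vs. one whole-vector operation.

c = [0.5, 1.0, 2.0, 4.0]
x = [8.0, 8.0, 8.0, 8.0]

# Sequential: one multiply per loop iteration
y_seq = []
for i in range(len(x)):
    y_seq.append(c[i] * x[i])

# Vector form: the multiplications are independent, so they can proceed
# in a pipeline (concurrency in time) or on an array of units (in space)
y_vec = [ci * xi for ci, xi in zip(c, x)]

print(y_vec)  # [4.0, 8.0, 16.0, 32.0]
assert y_seq == y_vec
```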

Page 22

MISD Architecture Example

Figure 25.12 Multiple instruction streams operating on a single data stream (MISD).

In the example, instruction streams 1-5 operate in cascade on the data stream, from data in to data out.

Page 23

MIMD Architectures

Control parallelism: executing several instruction streams in parallel
• GMSV: shared global memory – symmetric multiprocessors
• DMSV: shared distributed memory – asymmetric multiprocessors
• DMMP: message passing – multicomputers

Figure 27.1 Centralized shared memory. Figure 28.1 Distributed memory.

Figure 27.1 shows p processors and m memory modules connected by a processor-to-memory network, with a separate processor-to-processor network and parallel I/O. Figure 28.1 shows p computing nodes (memory plus processor, each with a router) connected by an interconnection network with parallel input/output.

Page 24

Amdahl’s Law Revisited

Figure 4.4 Amdahl's law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast. The plot shows speedup s (0-50) versus enhancement factor p (0-50) for f = 0, 0.01, 0.02, 0.05, and 0.1.

s = 1 / (f + (1 – f)/p) ≤ min(p, 1/f)

f = sequential fraction; p = speedup of the rest (e.g., with p processors)
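The formula above is easy to check numerically; a short sketch that also verifies the min(p, 1/f) bound quoted on the slide.

```python
# Amdahl's law: fraction f is unaffected, the rest runs p times as fast.
# Speedup s = 1 / (f + (1 - f) / p), bounded above by min(p, 1/f).

def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.01, 0.02, 0.05, 0.1):
    s = amdahl_speedup(f, 50)
    assert s <= min(50, 1.0 / f)   # the bound stated on the slide
    print(f"f = {f}: speedup with p = 50 is {s:.1f}")
```

Note how quickly the sequential fraction dominates: even f = 0.02 caps the speedup below 50 regardless of how large p grows.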

Page 25

26 Vector and Array Processing

Single instruction stream operating on multiple data streams:
• Data parallelism in time = vector processing
• Data parallelism in space = array processing

Topics in This Chapter
26.1 Operations on Vectors

26.2 Vector Processor Implementation

26.3 Vector Processor Performance

26.4 Shared-Control Systems

26.5 Array Processor Implementation

26.6 Array Processor Performance

Page 26

26.1 Operations on Vectors

Sequential processor:

for i = 0 to 63 do
  P[i] := W[i] × D[i]
endfor

Vector processor:

load W
load D
P := W × D
store P

The following loop, by contrast, is unparallelizable, because each iteration depends on results of the previous one:

for i = 0 to 63 do
  X[i+1] := X[i] + Z[i]
  Y[i+1] := X[i+1] + Y[i]
endfor
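The contrast between the two loops can be sketched directly; illustrative values, with the first loop written as one vector operation and the second forced to run serially by its recurrence.

```python
# Vectorizable vs. unparallelizable loops.

W = [1, 2, 3, 4]
D = [5, 6, 7, 8]
P = [w * d for w, d in zip(W, D)]   # every P[i] is independent: one vector op

X = [1.0] + [0.0] * 4
Z = [0.5, 0.5, 0.5, 0.5]
for i in range(4):                  # recurrence: X[i+1] needs X[i], so serial
    X[i + 1] = X[i] + Z[i]

print(P)      # [5, 12, 21, 32]
print(X[-1])  # 3.0
```

A vectorizing compiler applies exactly this dependence test: elementwise independence makes the first loop a single vector instruction, while the loop-carried dependence keeps the second scalar.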

Page 27

26.2 Vector Processor Implementation

Figure 26.1 Simplified generic structure of a vector processor.

A vector register file feeds three function-unit pipelines through forwarding muxes, with additional inputs from scalar registers; two load units (A and B) and a store unit connect the vector registers to and from the memory unit.

Page 28

Conflict-Free Memory Access

Figure 26.2 Skewed storage of the elements of a 64 × 64 matrix for conflict-free memory access in a 64-way interleaved memory. Elements of column 0 are highlighted in both diagrams. (a) In conventional row-major order, every element of column j falls in bank j, so column accesses conflict. (b) In skewed row-major order, row i is rotated so that element (i, j) falls in bank (i + j) mod 64, letting any row or any column be accessed with one element per bank.
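The conflict-freedom of the skewed scheme is easy to verify; a short sketch of the bank-assignment function from Figure 26.2b and a check that every row and every column touches all 64 banks exactly once.

```python
# Skewed storage: element (i, j) of a 64 x 64 matrix goes to memory
# bank (i + j) mod 64 (row i is rotated i positions). Both rows and
# columns then hit all 64 banks exactly once: no bank conflicts.

N = 64

def bank(i, j):
    return (i + j) % N

for j in range(N):   # each column hits 64 distinct banks
    assert len({bank(i, j) for i in range(N)}) == N
for i in range(N):   # so does each row
    assert len({bank(i, j) for j in range(N)}) == N

print("rows and columns are both conflict-free")
```

Diagonals of a fixed skew also remain conflict-free or nearly so, which is why skewed schemes are popular in interleaved vector memories.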

Page 29

Overlapped Memory Access and Computation

Figure 26.3 Vector processing via segmented load/store of vectors in registers in a double-buffering scheme. Solid (dashed) lines show data flow in the current (next) segment.

Six vector registers (0-5) form two buffer sets between the memory unit and a pipelined adder: while the adder consumes one set and Store Z writes its results back, Load X and Load Y fill the other set for the next segment.

Page 30

26.3 Vector Processor Performance

Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining.

Without chaining, the addition (with its start-up time) begins only after the multiplication's start-up and full vector traversal complete. With pipeline chaining, the addition pipeline starts consuming multiply results as they emerge, so the total latency is roughly the two start-up times plus a single vector traversal.
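A rough latency model makes the chaining benefit of Figure 26.4 concrete. The vector length and start-up times below are assumptions for illustration; real values are machine-dependent.

```python
# Latency of S := X * Y + Z without and with pipeline chaining,
# counted in clock cycles under a simple one-result-per-cycle model.

n = 64            # vector length (assumed)
mul_startup = 7   # multiply pipeline start-up cycles (assumed)
add_startup = 6   # add pipeline start-up cycles (assumed)

# Without chaining: the add waits for the entire multiply result vector
without_chaining = (mul_startup + n) + (add_startup + n)

# With chaining: the add starts as soon as the first product emerges
with_chaining = mul_startup + add_startup + n

print(without_chaining, with_chaining)  # 141, 77
```

Chaining removes one full vector traversal from the critical path, which is why long vectors roughly halve their latency for chained two-operation sequences.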

Page 31

Performance as a Function of Vector Length

Figure 26.5 The per-element execution time in a vector processor as a function of the vector length.

The plot shows clock cycles per vector element (0-5) versus vector length (0-400); the per-element time falls toward its asymptote as vectors grow longer, since start-up overhead is amortized over more elements.

Page 32

26.4 Shared-Control Systems

Figure 26.6 From completely shared control to totally separate controls.

(a) Shared-control array processor, SIMD: one control unit drives all processing elements.
(b) Multiple shared controls, MSIMD: several control units each drive a group of processing elements.
(c) Separate controls, MIMD: every processing element has its own control.

Page 33

Example Array Processor

Figure 26.7 Array processor with 2D torus interprocessor communication network.

A control unit broadcasts to the processor array; switches at the array edges connect it to parallel I/O.

Page 34

26.5 Array Processor Implementation

Figure 26.8 Handling of interprocessor communication via a mechanism similar to data forwarding.

Each PE contains an ALU and register file, a data memory, a PE state flip-flop (feeding the array state register), and a communication buffer; CommunDir and CommunEn select the direction and enable transfers to and from the four NEWS (north, east, west, south) neighbors, with incoming data going to the register file and data memory.

Page 35

Configuration Switches

Figure 26.9 I/O switch states in the array processor of Figure 26.7.

Referring to the array processor of Figure 26.7, the I/O switches support three states: (a) torus operation, closing the array edges onto each other; (b) clockwise I/O; and (c) counterclockwise I/O, each routing the in and out streams around the array in one direction.

Page 36

26.6 Array Processor Performance

Array processors perform well for the same class of problems that are suitable for vector processors.

For embarrassingly (pleasantly) parallel problems, array processors can be faster and more energy-efficient than vector processors.

A criticism of array processing: for conditional computations, a significant part of the array remains idle while the “then” part is performed; subsequently, idle and busy processors reverse roles during the “else” part.

However: considering array processors inefficient due to idle processors is like criticizing mass transportation because many seats are unoccupied most of the time.

It's the total cost of computation that counts, not hardware utilization!

Page 37

27 Shared-Memory Multiprocessing

Multiple processors sharing a memory unit seems naïve:
• Didn't we conclude that memory is the bottleneck?
• How then does it make sense to share the memory?

Topics in This Chapter
27.1 Centralized Shared Memory

27.2 Multiple Caches and Cache Coherence

27.3 Implementing Symmetric Multiprocessors

27.4 Distributed Shared Memory

27.5 Directories to Guide Data Access

27.6 Implementing Asymmetric Multiprocessors

Page 38

Parallel Processing as a Topic of Study

Graduate course ECE 254B: Adv. Computer Architecture – Parallel Processing

An important area of study that allows us to overcome fundamental speed limits

Our treatment of the topic is quite brief (Chapters 26-27)

Page 39

27.1 Centralized Shared Memory

Figure 27.1 Structure of a multiprocessor with centralized shared memory: p processors and m memory modules connected by a processor-to-memory network, with a separate processor-to-processor network and parallel I/O.

Page 40

Processor-to-Memory Interconnection Network

Figure 27.2 Butterfly and the related Beneš network as examples of processor-to-memory interconnection networks in a multiprocessor. (a) The butterfly network connects 16 processors to 16 memories through rows 0-7 of 2 × 2 switches. (b) The Beneš network connects 8 processors to 8 memories, with additional switching stages that allow conflict-free routing of arbitrary permutations.

Page 41

Processor-to-Memory Interconnection Network

Figure 27.3 Interconnection of eight processors to 256 memory banks in Cray Y-MP, a supercomputer with multiple vector processors. Each processor fans out through 1 × 8 switches to 8 × 8 section switches and 4 × 4 subsection switches, reaching the memory banks in interleaved groups (e.g., banks 0, 4, 8, ..., 60 in one group, up through banks 227, 231, 235, ..., 255).

Page 42

Shared-Memory Programming: Broadcasting

Copy B[0] into all B[i] so that multiple processors can read its value without memory access conflicts:

for k = 0 to ⌈log2 p⌉ – 1
  processor j, 0 ≤ j < p, do
    B[j + 2^k] := B[j]
endfor

This is recursive doubling: the number of copies of the value doubles in each step.
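The broadcast above can be sketched as a serial simulation of the parallel steps; the value 42 and p = 12 are illustrative, and the per-step snapshot models all processors reading before any writes.

```python
# Recursive-doubling broadcast: in step k, every location already
# holding the value copies it 2**k positions ahead, so the number of
# copies doubles each step; p copies exist after ceil(log2 p) steps.

import math

def broadcast(p):
    B = [None] * p
    B[0] = 42                                  # the value to broadcast
    for k in range(math.ceil(math.log2(p))):
        old = B[:]                             # parallel reads precede writes
        for j in range(p):                     # "processor j ... do"
            if old[j] is not None and j + 2**k < p:
                B[j + 2**k] = old[j]
    return B

print(broadcast(12))  # all twelve entries are 42 after 4 steps
```

The log-step structure is the same one used by MPI-style broadcast collectives.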

Page 43

Shared-Memory Programming: Summation

Sum reduction of vector X by recursive doubling (note that the loop body must read the partial sums Z, and each parallel step's reads occur before its writes):

processor j, 0 ≤ j < p, do Z[j] := X[j]
s := 1
while s < p
  processor j, 0 ≤ j < p – s, do
    Z[j + s] := Z[j] + Z[j + s]
  s := 2 × s
endwhile

After the step with stride s = 2^k, Z[j] holds the sum over the index range max(0, j – 2^(k+1) + 1) : j; when the loop ends, Z[j] holds the prefix sum 0 : j, so Z[p – 1] is the total.
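The summation above can be sketched as a serial simulation of the parallel steps; the input vector is illustrative, and the per-step snapshot models all processors reading before any writes.

```python
# Recursive-doubling prefix sums: after ceil(log2 p) doubling steps,
# Z[j] holds X[0] + ... + X[j]; Z[-1] is the total sum.

def prefix_sums(X):
    p = len(X)
    Z = X[:]                       # processor j does Z[j] := X[j]
    s = 1
    while s < p:
        old = Z[:]                 # parallel reads precede parallel writes
        for j in range(p - s):     # processor j, 0 <= j < p - s
            Z[j + s] = old[j] + old[j + s]
        s = 2 * s
    return Z

print(prefix_sums([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```

This computes all p prefix sums, not just the grand total, in O(log p) parallel steps; it is the classic parallel-scan pattern.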

Page 44

27.2 Multiple Caches and Cache Coherence

Private processor caches reduce memory access traffic through the interconnection network but lead to challenging consistency problems.

The structure is that of Figure 27.1 with a private cache inserted between each of the p processors and the processor-to-memory network.

Page 45

Status of Data Copies

Figure 27.4 Various types of cached data blocks in a parallel processor with centralized main memory and private processor caches.

In the example, data blocks cached across the processors illustrate four states: multiple consistent (several caches hold the current value), single consistent (one cache holds the current value, memory up to date), single inconsistent (one cache holds a modified value), and invalid (a stale copy).

Page 46

A Snoopy Cache Coherence Protocol

Figure 27.5 Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches.

States: invalid, shared (read-only), exclusive (writable). Transitions:
• Invalid → Shared on a CPU read miss: signal read miss on bus
• Invalid → Exclusive on a CPU write miss: signal write miss on bus
• Shared → Exclusive on a CPU write hit: signal write miss on bus
• Shared → Invalid on a bus write miss
• Exclusive → Shared on a bus read miss: write back cache line
• Exclusive → Invalid on a bus write miss: write back cache line
• CPU read hits in the shared state, and read or write hits in the exclusive state, cause no state change

The figure also shows several processor-cache (PC) pairs snooping a common bus to memory.
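The three-state control of Figure 27.5 can be sketched as a transition table for one cache line; the event names are a plain-text reading of the figure's edge labels, and the bus side effects are reduced to returned strings.

```python
# Snoopy write-back coherence: Invalid / Shared / Exclusive states,
# driven by CPU accesses and by misses observed (snooped) on the bus.

TRANSITIONS = {
    ("Invalid",   "cpu_read_miss"):  ("Shared",    "signal read miss on bus"),
    ("Invalid",   "cpu_write_miss"): ("Exclusive", "signal write miss on bus"),
    ("Shared",    "cpu_read_hit"):   ("Shared",    None),
    ("Shared",    "cpu_write_hit"):  ("Exclusive", "signal write miss on bus"),
    ("Shared",    "bus_write_miss"): ("Invalid",   None),
    ("Exclusive", "cpu_read_hit"):   ("Exclusive", None),
    ("Exclusive", "cpu_write_hit"):  ("Exclusive", None),
    ("Exclusive", "bus_read_miss"):  ("Shared",    "write back cache line"),
    ("Exclusive", "bus_write_miss"): ("Invalid",   "write back cache line"),
}

def step(state, event):
    """Return (next_state, bus_action) for one cache line."""
    return TRANSITIONS[(state, event)]

state = "Invalid"
state, action = step(state, "cpu_read_miss")   # -> Shared
state, action = step(state, "cpu_write_hit")   # -> Exclusive
print(state)  # Exclusive
```

Tracing a shared variable through this table (one cache writing, another reading) reproduces the write-back and invalidation traffic a real snoopy bus generates.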

Page 47

27.3 Implementing Symmetric Multiprocessors

Figure 27.6 Structure of a generic bus-based symmetric multiprocessor.

Computing nodes (typically 1-4 CPUs and caches per node), interleaved memory, and I/O modules (reached through bus adapters and standard interfaces) all attach to a very wide, high-bandwidth bus.

Page 48

Bus Bandwidth Limits Performance

Example 27.1

Consider a shared-memory multiprocessor built around a single bus with a data bandwidth of x GB/s. Instructions and data words are 4 B wide, and each instruction requires access to an average of 1.4 memory words (including the instruction itself). The combined hit rate for caches is 98%. Compute an upper bound on the multiprocessor performance in GIPS. Address lines are separate and do not affect the bus data bandwidth.

Solution

Executing an instruction implies a bus transfer of 1.4 × 0.02 × 4 = 0.112 B. Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS. Assuming a bus width of 32 B, no bus cycle or data going to waste, and a bus clock rate of y GHz, the performance bound becomes 286y GIPS. This bound is highly optimistic. Buses operate in the range 0.1 to 1 GHz. Thus, a performance level approaching 1 TIPS (perhaps even ¼ TIPS) is beyond reach with this type of architecture.
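The arithmetic of Example 27.1 can be reproduced directly:

```python
# Example 27.1 arithmetic: bus traffic per instruction, and the
# resulting performance bounds per GB/s of bandwidth and per GHz
# of clock on a 32-B-wide bus.

words_per_instr = 1.4
miss_rate = 0.02          # 1 - 0.98 combined cache hit rate
bytes_per_word = 4

traffic = words_per_instr * miss_rate * bytes_per_word  # B per instruction
gips_per_gbps = 1 / traffic                             # GIPS per GB/s of bus bandwidth

print(round(traffic, 3))        # 0.112
print(round(gips_per_gbps, 2))  # 8.93
print(round(32 / traffic))      # 286 GIPS per GHz of a 32-B-wide bus
```

Plugging in a realistic 1 GHz bus clock gives 286 GIPS, confirming the slide's conclusion that TIPS-level performance is out of reach for a single shared bus.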

Page 49

Implementing Snoopy Caches

Figure 27.7 Main structure for a snoop-based cache coherence algorithm.

The cache data array is flanked by two tag-and-state stores: a main one for the processor side and a duplicate for the snoop side, each with its own comparator and cache control. Address and command buffers couple the snoop side to the system bus, so processor lookups and bus snooping proceed without contention.

Page 50

27.4 Distributed Shared Memory

Figure 27.8 Structure of a distributed shared-memory multiprocessor: p processors, each with local memory holding part of the shared address space (e.g., x: 0, y: 1, z: 0), connected through routers to an interconnection network with parallel input/output. In the example, one processor runs “y := –1; z := 1” while another runs “while z = 0 do x := x + y endwhile”, communicating through the shared variables.

Page 51

27.5 Directories to Guide Data Access

Figure 27.9 Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor.

Each of the p nodes couples a processor and cache with a directory, a memory module, and a communication and memory interface to the interconnection network, which also provides parallel input/output.

Page 52

Directory-Based Cache Coherence

Figure 27.10 States and transitions for a directory entry in a directory-based cache coherence protocol (c is the requesting cache).

States: uncached, shared (read-only), exclusive (writable). Transitions:
• Uncached → Shared on a read miss: return value, include c in sharing set (initially {c})
• Uncached → Exclusive on a write miss: return value, set sharing set to {c}
• Shared → Shared on a read miss: return value, include c in sharing set
• Shared → Exclusive on a write miss: invalidate all cached copies, set sharing set to {c}, return value
• Exclusive → Shared on a read miss: fetch data from owner, return value, include c in sharing set
• Exclusive → Exclusive on a write miss: fetch data from owner, request invalidation, return value, set sharing set to {c}
• Exclusive → Uncached on data write-back: set sharing set to { }

Page 53

27.6 Implementing Asymmetric Multiprocessors

Figure 27.11 Structure of a ring-based distributed-memory multiprocessor.

Computing nodes 0 through 3 (typically 1-4 CPUs and associated memory each, with connections to I/O controllers) are joined by links into a ring network.

Page 54

Scalable Coherent Interface (SCI)

Figure 27.11 Structure of a ring-based distributed-memory multiprocessor. In the SCI arrangement shown, four nodes (0-3) of processors and caches, each with its memory, connect to the interconnection network.

Page 55

28 Distributed Multicomputing

Computer architects' dream: connect computers like toy blocks
• Building multicomputers from loosely connected nodes
• Internode communication is done via message passing

Topics in This Chapter
28.1 Communication by Message Passing

28.2 Interconnection Networks

28.3 Message Composition and Routing

28.4 Building and Using Multicomputers

28.5 Network-Based Distributed Computing

28.6 Grid Computing and Beyond

Page 56

28.1 Communication by Message Passing

Figure 28.1 Structure of a distributed multicomputer: p computing nodes, each pairing memory and processor with a router, joined by an interconnection network that also provides parallel input/output.

Page 57

Router Design

Figure 28.2 The structure of a generic router. Input channels pass through link controllers (LC) into input queues (Q); a switch, governed by routing and arbitration logic, moves data to output queues and link controllers driving the output channels. An injection channel accepts messages from the local node's message queue, and an ejection channel delivers arriving messages to it.

Page 58

Building Networks from Switches

Straight through Crossed connection Lower broadcast Upper broadcast

Figure 28.3 Example 2 × 2 switch with point-to-point and broadcast connection capabilities.

Figure 27.2 (butterfly and Beneš networks) is repeated here for reference.

Page 59

Interprocess Communication via Messages

Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes.

Process A eventually executes “send x”; Process B executes “receive x” and is suspended until the message arrives (the communication latency), after which it is awakened and proceeds.

Page 60

28.2 Interconnection Networks

Figure 28.5 Examples of direct and indirect interconnection networks.

(a) Direct network: each node has its own router. (b) Indirect network: nodes connect through a separate network of routers.


Direct Interconnection Networks

Figure 28.6 A sampling of common direct interconnection networks. Only routers are shown; a computing node is implicit for each router.

(a) 2D torus (b) 4D hypercube

(c) Chordal ring (d) Ring of rings


Indirect Interconnection Networks

Figure 28.7 Two commonly used indirect interconnection networks.

[Figure: (a) hierarchical buses at levels 1, 2, and 3; (b) an Omega network built from stages of 2 × 2 switches.]


28.3 Message Composition and Routing

Figure 28.8 Messages and their parts for message passing.

[Figure: a message is divided into packets, first through last, with padding in the last packet; each transmitted packet consists of a header, the data or payload, and a trailer, and is further divided into flow control digits (flits).]
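The decomposition of Figure 28.8 can be sketched in a few lines: split the message into fixed-size payloads, pad the last packet, wrap each packet in a header and trailer, and chop the result into flits. The one-byte header and trailer and the field sizes below are assumptions for illustration only:

```python
# Packetizing a message as in Figure 28.8 (field sizes are illustrative):
# message -> padded fixed-size payloads -> header + payload + trailer
# packets -> flow-control digits (flits).

def packetize(message, payload_size, flit_size):
    packets = []
    for start in range(0, len(message), payload_size):
        payload = message[start:start + payload_size]
        payload += b'\x00' * (payload_size - len(payload))  # pad last packet
        packet = b'H' + payload + b'T'                      # header + trailer
        flits = [packet[i:i + flit_size]
                 for i in range(0, len(packet), flit_size)]
        packets.append(flits)
    return packets

pkts = packetize(b'hello world', payload_size=8, flit_size=2)
print(len(pkts), [len(p) for p in pkts])   # 2 packets, 5 flits each
```

Flits matter because wormhole routers allocate buffers and channels flit by flit rather than packet by packet, which is what the next slide's "worms" represent.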


Wormhole Switching

Figure 28.9 Concepts of wormhole switching.

[Figure: (a) two worms en route to their respective destinations, worm 1 moving while worm 2 is blocked; (b) deadlock due to circular waiting of four blocked worms, each blocked at the point of its attempted right turn.]
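A standard cure for the circular waiting of Figure 28.9(b) is dimension-order (X-then-Y) routing on a mesh: because every worm completes its horizontal hops before turning vertically, the cycle of blocked right turns cannot form. A sketch (the function name is illustrative; real routers implement this in hardware):

```python
# Dimension-order (X-then-Y) routing on a 2D mesh: resolving the X
# coordinate fully before any Y hop rules out the circular wait of
# Figure 28.9(b), avoiding wormhole deadlock.

def xy_route(src, dst):
    """Return the hop sequence from src = (x, y) to dst as unit moves."""
    x, y = src
    hops = []
    while x != dst[0]:                   # resolve X dimension first
        step = 1 if dst[0] > x else -1
        x += step
        hops.append(('X', step))
    while y != dst[1]:                   # then resolve Y
        step = 1 if dst[1] > y else -1
        y += step
        hops.append(('Y', step))
    return hops

print(xy_route((0, 0), (2, 1)))   # [('X', 1), ('X', 1), ('Y', 1)]
```

The price of deadlock freedom is lost path diversity: only one of the many shortest paths is ever used, which adaptive routing schemes try to recover with virtual channels.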


28.4 Building and Using Multicomputers

Figure 28.10 A task system and schedules on 1, 2, and 3 computers.

[Figure: (a) a static task graph with tasks A through H, each requiring 1, 2, or 3 time units, with inputs feeding the source tasks and outputs leaving the sink; (b) Gantt-chart schedules of the same tasks on 1, 2, and 3 computers over a time axis from 0 to 15.]
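Schedules like those of Figure 28.10(b) can be produced by greedy list scheduling: repeatedly place a ready task (one whose predecessors have all finished) on the computer that frees up earliest. A sketch; the task times and dependency edges below are assumptions, not the exact graph of Figure 28.10:

```python
# Greedy list scheduling of a task graph on m identical computers.
# The example DAG is hypothetical, chosen only to exercise the scheduler.

def schedule(times, deps, m):
    """Makespan of a precedence-respecting greedy schedule on m computers."""
    done_at = {}                  # task -> finish time
    machines = [0.0] * m          # time at which each computer becomes free
    remaining = dict(times)
    while remaining:
        # tasks whose predecessors have all finished
        ready = [t for t in remaining
                 if all(p in done_at for p in deps.get(t, []))]
        task = min(ready)         # deterministic tie-break by name
        earliest = max([done_at[p] for p in deps.get(task, [])] or [0.0])
        i = min(range(m), key=lambda k: machines[k])
        start = max(machines[i], earliest)
        machines[i] = done_at[task] = start + remaining.pop(task)
    return max(done_at.values())

times = {'A': 1, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2, 'H': 3}
deps = {'D': ['A', 'B'], 'E': ['B', 'C'], 'F': ['D'],
        'G': ['D', 'E'], 'H': ['F', 'G']}
print(schedule(times, deps, 1))   # 14.0: all work serialized on one computer
print(schedule(times, deps, 2))
```

Note that the 2-computer makespan cannot beat the longer of the critical path and (total work)/2, which is why adding computers beyond a point yields diminishing returns.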


Building Multicomputers from Commodity Nodes

Figure 28.11 Growing clusters using modular nodes.

[Figure: (a) current racks of modules, each module holding CPU(s), memory, and disks, with expansion slots; (b) futuristic toy-block construction in which such modules snap together via wireless connection surfaces.]


28.5 Network-Based Distributed Computing

Figure 28.12 Network of workstations.

[Figure: PCs attach through their system or I/O buses to NICs (fast network interfaces with large memories), which connect to a network built of high-speed wormhole switches.]


28.6 Grid Computing and Beyond

A computational grid is analogous to the power grid

It decouples the “production” and “consumption” of computational power

Homes don’t have an electricity generator; why should they have a computer?

Advantages of computational grid:

• Near-continuous availability of computational and related resources
• Resource requirements based on the sum of averages, rather than the sum of peaks
• Paying for services based on actual usage rather than peak demand
• Distributed data storage for higher reliability, availability, and security
• Universal access to specialized and one-of-a-kind computing resources

Still to be worked out as of the late 2000s: how to charge for computing usage


Computing in the Cloud

Image from Wikipedia

Computational resources, both hardware and software, are provided by, and managed within, the cloud

Users pay a fee for access

Managing and upgrading resources is much more efficient in large, centralized facilities (warehouse-sized data centers or server farms)

This is a natural continuation of the outsourcing trend for special services, so that companies can focus their energies on their main business


The Shrinking Supercomputer


Warehouse-Sized Data Centers

Image from IEEE Spectrum, June 2009
