61
LECTURE 1 : PIPELINING JAN LEMEIRE SHEN & LIPASTI CHAPTERS 2 & 4 Advanced Computer Architecture 10/12/2012 1 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

L E C T U R E 1 : P I P E L I N I N G

J A N L E M E I R E

S H E N & L I P A S T I C H A P T E R S 2 & 4

Advanced Computer Architecture

10/12/2012

1

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 2: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Historic Perspective

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

2

First employed in early 1960s

During the 1980s, it was the cornerstone of RISC approach

Intel i486 was first pipelined CISC processor (1989)

Today, almost all processors are pipelined

Page 3: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Throughput vs Latency

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

3

Latency: time between start and finish of a job

Throughput: number of jobs per second/hour/day/...

Example: sending letters via airmail

More letters on a plane: more throughput, but same latency

More planes with same amount of letters: latency decreases and throughput increases

Less planes, with more letters: same throughput, higher latency

Page 4: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipelining Principle

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

4

Goal: increase throughput without much additional HW and without additional latency

Page 5: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipelining Principle

Long operations

Combination of short operations

Pipelining

1 2 3 4

1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4

GPU Programming

time

IF

ID

OF

EX

1

1

1

1 2

2

2

2

3

3

3

3

4

4

4

4

Page 6: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipelining Overhead

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

6

Pipelining Idealisms Uniform subcomputations

Can pipeline into stages with equal delay

Balance pipeline stages, otherwise have internal fragmentation

Also: no additional delay by the interstage buffers/clocking requirements

Identical computations

Can fill pipeline with identical work and no unused pipeline stages

Unify instruction types (example later) to avoid external fragmentation

Independent computations

No relationships between work units

Minimize pipeline stalls (dynamic external fragmentation)

Are these practical? No, but can get close enough to get significant speedup

Deviations determine performance loss (inefficiencies)

Page 7: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design Tasks

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

7

Perform stage quantization

to create uniform, balanced pipeline stages

Unify different resource requirements for different instruction types

to minimize the underuse of resources

to enable execution of all instruction types

Deal with dependent operations

Since not all instructions will be independent

Must not punish execution of independent instructions

Page 8: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Impact on ISA

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

8

Uniform subcomputations

longest latency undividable instruction must be found

is usually the memory access

this should not be slowed down by complex addressing modes

caches are required because of high latency to main memory

Unifying the resource requirements

is easier for simple, less diverse RISC instructions

Deal with dependencies

is very hard in HW with complex addressing modes

can be done in HW (RISC) or in SW (VLIW)

Page 9: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Balanced Stages (1)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

9

Typically five tasks in instruction execution

IF: instruction fetch

ID: instruction decode

OF: operand fetch

EX: instruction execution

OS: operand store, often called write-back WB

Page 10: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Balanced Stages (2)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

10

Two techniques

1) merge stages 2) subdivide stages

Page 11: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Balanced Stages (3)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

11

More stages is more complex

more concurrent register file accesses

more concurrent memory accesses

pipelined memory accesses are complex

Page 12: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Unified Instruction Types (1)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

12

Different types of instructions

ALU instructions

Memory accesses

Branch Instructions

Coalescing of requirements

Analyze subcomputation sequences and resource requirements

Find commonalities and merge them

In case of flexibility, shift or reorder subcomputations to ease merging

Page 13: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Unified Instruction Types (2)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

13

Page 14: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Unified Instruction Types (3)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

14

Objectives

minimize the total number of resources

maximize utilization, minimize idling stages

limit instruction latency

put idle cycles at the end (to minimize dependencies)

Page 15: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Unified Instruction Types (4)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

15

source registers

operation to be performed

destination register

Page 16: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Unified Instruction Types (5)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

16

Page 17: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (1)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

17

Multiple instructions in pipeline, in different stages

Be carefull with data dependencies

Suppose I4 consumes value produced by I2.

Need to avoid that I4 reads operands before I2 writes them.

Solution: detect dependence and delay instruction I4.

Page 18: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

18

Pipeline Design: Minimize Pipeline Stalls (2)

Data dependencies

True dependence = read-after-write (RAW)

Anti dependence = write-after-read (WAR)

Output dependence = write-after-write (WAW)

A hazard: messing up the program by not respecting a dependency

V3 ← V1 op V2

V4 ← V3 op V5

V3 ← V1 op V2

V1 ← V4 op V5

V3 ← V1 op V2

V3 ← V4 op V5

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 19: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (3)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

19

Dependencies Through Memory?

Not possible since (1) only one stage accesses memory, and (2) all instructions pass through mem stage in program order.

Page 20: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (4)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

20

What about WAW hazards through registers?

Not possible since (1) all writes to register happen in single WB stage, and (2) all instructions pass through WB stage in program order.

Page 21: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (5)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

21

What about WAR hazards through registers?

Not possible since (1) writes occur in WB stage after reads in RD stage, and (2) all instructions pass through these stages in program order

Page 22: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (6)

10/12/2012

22

RAW hazards

Suppose I4 consumes value produced by I2.

Need to avoid that I4 reads operands before I2 writes them.

Solution: detect dependence and delay instruction I4.

Result: pipeline bubble or stall, performance drop

Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 23: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (7) 23

RAW hazards: forwarding

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 24: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (8) 24

Forwarding implementation

Delay lines

Comparators

Multiplexors

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 25: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (9)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

25

What about control dependencies and branches

EVERY 5th-6th instruction is a

branch!

Page 26: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (A)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

26

Similar solution: branch forwarding to save a cycle

Page 27: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (B)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

27

Additional solution: branch not taken prediction

keep fetching & decoding & ...

If branch is not taken, oke, keep executing

If branch is taken: flush all the instructions past the branch from the pipeline, and start fetching again

Another solution: delay slot

keep fetching & decoding & executing delay slot instructions

compiler places only instructions in the delay slot that are executed on taken and non-taken path

instructions in delay slot need not be flushed!

Page 28: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (C) 28

IF

ID

OF

EX

MEM

WB

br t

br

br

x x+1 x+2 x+3 x+4

br

br

x+5

inst

correct not-taken prediction: penalty = 0 cycli

inst inst

inst

inst

inst

inst

inst

inst

inst

inst

inst

inst

inst

inst

br

inst inst

not- taken taken

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 29: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (D) 29

IF

ID

OF

EX

MEM

WB

br t

br

br

x x+1 x+2 x+3 x+4

br

br

x+5

inst

incorrect not-taken prediction: penalty = 4 cycli

inst inst

inst

inst

inst

inst

inst

inst

inst

inst

br

inst inst

not- taken taken

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 30: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Pipeline Design: Minimize Pipeline Stalls (E) 30

IF

ID

OF

EX

MEM

WB

br t

br

br

x x+1 x+2 x+3 x+4

br

br

x+5

inst

incorrect not-taken prediction: penalty with delay slot = 3 cycli

inst inst

inst

inst

inst

inst

inst

inst

inst

inst

br

inst inst

not- taken taken inst

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 31: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

31

Limitations to Scalar Pipelines

Instruction type unification is a problem

e.g.: floating-point addition vs. integer addition

Fundamental limit: IPC ≤ 1

In-order execution: IPC < 1

stalls caused by dependencies

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 32: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

32

Problem 1: Unification

IF

ID

OF

EX

MEM

WB

EX has to execute integer addition as well as floating-point addition in one cycle ... Solution: diversification

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 33: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

33

Solution 1: Diversified Pipeline

IF

ID

OF

FP (1)

FP (2)

FP (3)

FP (4)

FP (5)

FP (6)

WB

INT MEM (1)

MEM (2)

MEM (3)

BR

higher clock frequency than unified pipeline

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 34: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

34

Problems of Diversification

Out-of-order completion

Writing back the results can happen out-of-order (= WAW hazard)

Potentially more write operations to register file in WB stage per klok cycle (= structural hazard)

Exceptions

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 35: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

35

Out-of-order completion

R4←ld[MEM]

R3←R1+R2

IF ID OF MEM1 MEM2 MEM3 WB

IF ID OF EX WB

t

... is not really a problem (except for exceptions—see later)...

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 36: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

36

but ...

R4←ld[MEM]

R4←R1+R2

IF ID OF MEM1 MEM2 MEM3 WB

IF ID OF EX WB

... this is a WAW hazard, the solution is ...

t

R4←ld[MEM]

R4←R1+R2

IF ID OF MEM1 MEM2 MEM3 WB

IF ID OF xx xx EX WB

... blocking the pipeline (a stall)

t

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 37: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

37

Multiple WBs per cycle

Is impossible because of only 1 write port to register file

Solution: Add a write port, or

Treat the write port as structural hazard (i.e. an additional stall)

R4←ld[MEM]

R3←R1+R2

IF ID OF MEM1 MEM2 MEM3 WB

IF ID OF EX WB

t

R5←R3+R2 IF ID OF EX WB

R5←R3+R2 IF ID OF xx EX WB

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 38: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

38

Interrupts vs. exceptions (1)

Interrupts

Typically because of external factors

Asynchronuous with respect to program executing

Interrupt handling:

Stop fetching new instructions

Finish executing instructions in the pipeline

Save architectural state

Handle the interrupt

Restore state and continue executing the program

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 39: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

39

Interrupts vs. exceptions (2)

Exceptions/faults

Caused by something in the execution of the program

division by zero, page fault, overflow, etc.

Precise exception

Store state from just before instruction

Handle the exception

Continue execution from instruction that caused the exception

What happens with out-of-order completion?

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 40: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

40

Exceptions en OoO completion

Because of OoO completion, precise exceptions cannot be guaranteed

Imprecise exceptions

Hard to support in modern processors, complicates design

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 41: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

41

Limitations to Scalar Pipelines

Instruction type unification is a problem

e.g.: floating-point addition vs. integer addition

solved with pipeline diversification

Fundamental limit: IPC ≤ 1

In-order execution: IPC < 1

stalls caused by dependencies

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 42: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

42

Solution 2: Superscalar pipeline (1)

FP (1)

FP (2)

FP (3)

FP (4)

FP (5)

FP (6)

INT MEM (1)

MEM (2)

MEM (3)

BR

IF

ID

OF

WB

crossbar

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 43: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 2: Superscalar pipeline (2)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

43

Temporal Parallelism

Pipeline

Relatively Cheap

Spatial Parallelism

Superscalar

Relatively expensive (more hardware)

Superscalar pipeline

Both temporal and spatial parallelism

Potential speedup of the pipeline: depth * width

Page 44: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 2: Superscalar pipeline (3)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

44

Additional HW cost: More register ports

2 x w read ports

w write ports

can do with less, at the expense of structural hazards

More bandwidth to I$ and D$ (caches)

Interconnections

To distribute instructions over pipelines

Complexity w2

Hazard detection is more complex

Page 45: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 2: Superscalar pipeline (4)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

45

Dealing with hazards

Typically before instruction is executed

i.e. at operand fetch time

Once in the function unit (execution stages of pipeline), the instruction is no longer blocked

Page 46: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 2: Superscalar pipeline (5)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

46

R4←ld[MEM]

R3←R1+R2

IF ID OF MEM1 MEM2 MEM3 WB

IF ID OF EX WB

t

R5←R3+R2 IF ID OF xx EX WB

First and second instruction executed together as they are independent. Third instruction is blocked because of RAW hazard with second instruction. Blocking happens in OF stage.

R6←R1+R2

R3←R1+R6

IF ID OF xx EX WB

IF ID OF xx xx EX WB

R5←R2+R6 IF ID OF xx xx EX WB

Page 47: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

47

Limitations to Scalar Pipelines

Instruction type unification is a problem

e.g.: floating-point addition vs. integer addition

solved with pipeline diversification

Fundamental limit: IPC ≤ 1

solved with superscalar pipeline

In-order execution: IPC < 1

stalls caused by dependencies

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 48: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Problem with in-order execution (1)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

48

R4←ld[MEM]

R3←R1+R2

IF ID OF MEM1 MEM2 MEM3 WB

IF ID OF EX WB

t

R5←R3+R2 IF ID OF xx EX WB

Fourth to sixth instruction are blocked when second instruction is blocked, even though they were not dependent on first two instructions. This is a fundamental problem of in-order issue machines.

R6←R1+R2

R3←R1+R6

IF ID OF xx EX WB

IF ID OF xx xx EX WB

R5←R2+R6 IF ID OF xx xx EX WB

Page 49: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Problem with in-order execution (2)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

49

t

Problem becomes even worse in case of long latency blocks, such as multiplication instructions or cache misses.

R4←ld[MEM]

R3←ld[R1]

IF ID OF MEM1 MEM2 MEM3 WB

IF ID OF MEM1 MEM2 MEM3 xx xx WB

R5←R3+R2 IF ID OF xx xx xx xx xx EX WB

R6←R1+R2

R3←R1+R6

IF ID OF xx xx xx xx xx EX WB

IF ID OF xx xx xx xx xx xx EX

R5←R2+R6 IF ID OF xx xx xx xx xx xx EX

cache miss

Page 50: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Problem with in-order execution (3)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

50

R4←ld[MEM]

R3←R1+R2

IF ID OF MEM1 MEM2 MEM3 WB

IF ID OF EX WB

t

R4←R3+R2 IF ID OF xx EX WB

R4←ld[MEM]

R3←R1+R2

IF ID OF MEM1 MEM2 MEM3 xx xx WB

IF ID OF EX WB

t

R4←R3+R2 IF ID OF xx xx xx EX WB

cache miss

Suppose there is only one register write port ... The cost increases very quickly with increasing instruction latencies

Page 51: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Problem with in-order execution (4)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

51

Scalar pipeline: advance in lockstep fashion (in order)

Problems: A blocking instruction blocks all following

instructions even if those are independent is particularly problematic for long-latency instructions

WAW dependencies limit parallelism even though

they are not real dependencies

Why are WAR dependencies no limitation?

Page 52: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 3: Out-of-order Execution (1)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

52

Idea:

True dependencies (RAW) determine execution

False (output and anti, WAR and WAW) dependences do not block instructions

Except when storing values becomes a resource constraint, see later

In other words, the data flow limit is reached, i.e. instructions are executed as soon as there operands are available.

Out-of-order execution implies a dynamic pipeline

Page 53: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 3: Out-of-order Execution (2)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

53

R4←ld[MEM]

R3←R1+R2

R5←R3+R4

R6←R1+R5

R3←R1+R4

R5←R4+R5

R4←ld[MEM] R3←R1+R2

R5←R3+R4

R6←R1+R5

R3←R1+R4

R5←R4+R5

Only RAW dependencies

Data flow graph determines order of execution

Data flow limit

Approximates upper limit on ILP

Page 54: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 3: Out-of-order Execution (3)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

54

Page 55: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 3: Out-of-order Execution (4) 55

fetch

dispatch

FP (1)

FP (2)

FP (3)

FP (4)

INT MEM (1)

MEM (2)

MEM (3)

BR

issue buffer

complete

retire store buffer

issue

execute

finish

in-o

rder

in-o

rder

ou

t-of-o

rder

decode

store queue

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 56: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 3: Out-of-order Execution (5)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

56

Issue buffer = reservation station Instructions are inserted in issue buffer and reorder

buffer in program order Instructions are executed on FUs and might leave them in

another order

Reorder buffer = completion buffer Out-of-order finish because of out-of-order issue en non-

uniform execution latencies In-order retirement = writing back the results in the

registers Enables precise exceptions

Store queue en store buffer (see later)

Page 57: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Solution 3: Out-of-order Execution (6)

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

57

"Sequentiality is an illusion" Programmer sees instructions executing in program order, i.e.,

sequentially

In hardware many things happen in parallel

Temporally by pipelining

Spatially by exploiting ILP

In out-of-order processors this even involves instructions that are not executed in program order

In-order retirement guarantees sequential appearance

Parallelism at the instruction level: instruction-level parallelism (ILP)

Page 58: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

58

Limitations to Scalar Pipelines

Instruction type unification is a problem

e.g.: floating-point addition vs. integer addition

solved with pipeline diversification

Fundamental limit: IPC ≤ 1

solved with superscalar pipeline

In-order execution: IPC < 1

stalls caused by dependencies

solved with out-of-order execution

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

Page 59: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Summary

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

59

Improve performance with

Pipelining: temporal parallelism

Superscalar design: spatial parallelism

Optimal pipeline depth

Limitations on deeper pipelines

Technology aspects, data dependencies, control dependencies in code

Superscalar processor

Execute more than one operation per cycle

In-order versus out-of-order

Page 60: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Important Streams

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

60

Instruction stream Front-end of the pipeline Fetch as many as possible instructions into the pipeline

per cycle Lecture 3

Data stream through registers Detect and deal with dependencies between instructions OoO execution of instructions Lecture 3

Data stream through memory Get as much as possible data in and out of memory Lecture 2

Page 61: Advanced Computer Architecture - Vrije Universiteit Brusselparallel.vub.ac.be/education/computerarchitectuur/ACA - 1 - Pipelining.pdf · Historic Perspective Advanced Computer Architecture

Acknowledgement

10/12/2012 Advanced Computer Architecture – Jan Lemeire – VUB - 2012-2013 - Lecture 1

61

Thanks for (parts of) slides

Bjorn De Sutter

Lieven Eeckhout

Mikko H. Lipasti

James C. Hoe

John P. Shen

Advanced Computer Architecture – Jan Lemeire– VUB - 2012-2013 - Lecture 0