CPE 408441 Computer Architecture, Fall 2009
Appendix A: Pipelining: Basic and Intermediate Concepts
Sa’ed R. Abed [Computer Engineering Department, Hashemite University]


Outline

Basic concept of Pipelining

The Basic Pipeline for MIPS

The Major Hurdles of Pipelining – Pipeline Hazards

CPE 0408441 @2009 S. Abed - HU, Jordan


What Is Pipelining?

Laundry Example

• Ann, Betty, Cathy, and Dave each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes



What Is Pipelining?

Sequential laundry takes 6 hours for 4 loads

[Figure: task order (loads A, B, C, D) against time from 6 PM to midnight; the loads run one after another, each taking 30, 40, and 20 minutes.]

Want to reduce the time? - Pipelining!!!


What Is Pipelining?

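• Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads

[Figure: task order (loads A, B, C, D) against time from 6 PM to 9:30 PM; the loads overlap, and the timeline is divided into segments of 30, 40, 40, 40, 40, and 20 minutes.]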
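The 6-hour and 3.5-hour figures can be checked with a small sketch. The function names below are illustrative and not part of the slides, and the model assumes that a load may wait between stages for as long as needed.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage handles one load at a time; a load enters a stage as soon
    as both the load and the stage are free (assumed waiting room between stages)."""
    stage_free_at = [0] * len(stage_minutes)  # time at which each stage becomes idle
    for _ in range(loads):
        ready_at = 0  # time at which this load has left the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready_at, stage_free_at[s])
            ready_at = start + minutes
            stage_free_at[s] = ready_at
    return stage_free_at[-1]


STAGES = [30, 40, 20]  # washer, dryer, folder, in minutes (from the slide)

print(sequential_minutes(STAGES, 4) / 60)  # 6.0 hours
print(pipelined_minutes(STAGES, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: the 40-minute dryer is the slowest stage and sets the rate.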


What Is Pipelining?

• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
  – Unbalanced lengths of pipe stages reduce speedup

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.

It takes advantage of the parallelism that exists among instructions (instruction-level parallelism).

It is the key implementation technique used to make fast CPUs.


MIPS Without Pipelining

• The execution of instructions is controlled by the CPU clock: one specific function is performed in each clock cycle.
• Every MIPS instruction takes at most 5 clock cycles, one for each of five different stages.
• Several temporary registers are introduced to implement the 5-stage structure.


MIPS Functions

Instruction Fetch (IF):
• Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction and store the result in the next program counter register (NPC).
• IR holds the instruction that will be used in the next stage.
• NPC holds the value of the next PC.

Passed To Next Stage:
IR <- Mem[PC]
NPC <- PC + 4

Note: only load-store, BEQZ, and integer ALU instructions are considered.


MIPS Functions

Instruction Decode/Register Fetch (ID):
• Decode the instruction and access the register file to read the registers.
• The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles.
• The lower 16 bits of the instruction register are sign-extended into another temporary register, Imm.

Passed To Next Stage:
A <- Regs[rs];
B <- Regs[rt];
Imm <- (IR_16)^48 ## IR_16..31   (the sign bit IR_16 replicated 48 times, concatenated with the 16-bit immediate field)
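The ID-stage transfers can be sketched in a few lines. This is an illustration only: it assumes the standard MIPS I-format layout (rs in bits 25..21 and rt in bits 20..16, counting from the least significant bit), and the helper names and the register contents are made up for the example.

```python
def sign_extend_16(ir):
    """Return the low 16 bits of an instruction word as a signed integer."""
    imm = ir & 0xFFFF      # immediate field (IR_16..31 in the slide's numbering)
    if imm & 0x8000:       # sign bit set: the value is negative
        imm -= 0x10000
    return imm


def decode(ir, regs):
    """ID stage: read both source registers and sign-extend the immediate."""
    rs = (ir >> 21) & 0x1F
    rt = (ir >> 16) & 0x1F
    a = regs[rs]           # A   <- Regs[rs]
    b = regs[rt]           # B   <- Regs[rt]
    imm = sign_extend_16(ir)  # Imm <- sign-extended immediate
    return a, b, imm


# Hypothetical register file: register n holds 10 * n.
regs = [10 * n for n in range(32)]
# 0x2049FFFC encodes rs = 2, rt = 9, immediate = 0xFFFC (-4).
print(decode(0x2049FFFC, regs))  # (20, 90, -4)
```

Python integers are unbounded, so the sketch returns a signed value rather than a 64-bit pattern; the effect is the same as replicating the sign bit.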


MIPS Functions

Execution/Effective Address Calculation (EX):
• We perform an operation (for an ALU instruction) or an address calculation (for a load/store or branch instruction).
• For an ALU instruction, actually do the operation. For an address calculation, compute the address and store it for the next cycle.

Passed To Next Stage:
Memory reference:               ALUOutput <- A + Imm;
Register-register ALU:          ALUOutput <- A func B;
Register-immediate ALU:         ALUOutput <- A op Imm;
Branch:                         ALUOutput <- NPC + (Imm << 2); Cond <- (A == 0);


MIPS Functions

Memory Access/Branch Completion (MEM):
• If it is an ALU instruction, do nothing.
• If it is a load/store instruction, access memory.
• If it is a branch instruction, update the PC if the condition holds.

Passed To Next Stage:
Load:    LMD <- Mem[ALUOutput];
Store:   Mem[ALUOutput] <- B;
Branch:  if (Cond) PC <- ALUOutput


MIPS Functions

Write-back (WB):
• Update the registers from either the ALU result or the loaded data.

Passed To Next Stage:
Register-register ALU:    Regs[rd] <- ALUOutput;
Register-immediate ALU:   Regs[rt] <- ALUOutput;
Load:                     Regs[rt] <- LMD;


The Classic Five-Stage Pipeline for MIPS

We can pipeline the execution with almost no changes by simply starting a new instruction on each clock cycle.

Each clock cycle becomes a pipe stage (a cycle in the pipeline), which results in the typical pipelined execution pattern shown in the table below.

Although each instruction takes 5 clock cycles to complete, the hardware initiates a new instruction during each clock cycle and executes some part of each of the five different instructions already in the pipeline.

It may be hard to believe that pipelining is as simple as this.

                      Clock number
Instruction number    1    2    3    4    5    6    7    8    9
Instruction i         IF   ID   EX   MEM  WB
Instruction i+1            IF   ID   EX   MEM  WB
Instruction i+2                 IF   ID   EX   MEM  WB
Instruction i+3                      IF   ID   EX   MEM  WB
Instruction i+4                           IF   ID   EX   MEM  WB
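The execution pattern in the table can be generated mechanically. The sketch below is illustrative only (the function names are not from the slides) and assumes an ideal pipeline with no stalls, so instruction k enters IF in clock cycle k.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions):
    """For each instruction, map clock cycle -> stage (no stalls assumed)."""
    return [
        {start + offset: stage for offset, stage in enumerate(STAGES)}
        for start in range(1, num_instructions + 1)
    ]


def total_cycles(num_instructions):
    """The first instruction needs len(STAGES) cycles; each later one adds one."""
    return num_instructions + len(STAGES) - 1


def render(schedule):
    """Return the schedule as text, one row per instruction."""
    last = max(max(row) for row in schedule)
    lines = []
    for i, row in enumerate(schedule):
        cells = [row.get(cycle, "").ljust(5) for cycle in range(1, last + 1)]
        lines.append(f"i+{i}".ljust(6) + "".join(cells).rstrip())
    return "\n".join(lines)


schedule = pipeline_schedule(5)
print(render(schedule))
print(total_cycles(5))  # 9, matching the nine clock cycles in the table
```

In general n instructions finish in n + 4 cycles on this five-stage pipeline, versus 5n cycles without overlap.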


Figure A.2 The pipeline can be thought of as a series of data paths shifted in time


Simple MIPS Pipeline

• The MIPS pipeline data path must deal with the problems that pipelining introduces in a real implementation.
• It is critical to ensure that instructions at different stages of the pipeline do not attempt to use the same hardware resource in the same clock cycle, e.g., performing different operations with the same functional unit (such as the ALU) on the same clock cycle.
• Instruction and data memories are separated into different caches (IM/DM).
• The register file is used in two stages: for reading in ID and for writing in WB. To handle a read and a write to the same register, the register write is performed in the first half of the clock cycle and the read in the second half.


Pipeline implementation for MIPS

In order to ensure that instructions in different stages of the pipeline do not interfere with each other, the data path is pipelined by adding a set of registers, one between each pair of pipe stages.

The registers serve to convey values and control information from one stage to the next.

Most of the data paths flow from left to right, that is, from earlier in time to later. The paths flowing from right to left (which carry the register write-back information and PC information on a branch) introduce complications into the pipeline.

Events on Pipe Stages of the MIPS Pipeline

Stage IF (any instruction):
  IF/ID.IR <- Mem[PC];
  IF/ID.NPC, PC <- (if ((EX/MEM.opcode == branch) & EX/MEM.cond) {EX/MEM.ALUOutput} else {PC + 4});

Stage ID (any instruction):
  ID/EX.A <- Regs[IF/ID.IR[rs]]; ID/EX.B <- Regs[IF/ID.IR[rt]];
  ID/EX.NPC <- IF/ID.NPC; ID/EX.IR <- IF/ID.IR;
  ID/EX.Imm <- sign-extend(IF/ID.IR[immediate field]);

Stage EX:
  ALU instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUOutput <- ID/EX.A func ID/EX.B; or
    EX/MEM.ALUOutput <- ID/EX.A op ID/EX.Imm;
  Load or store:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUOutput <- ID/EX.A + ID/EX.Imm;
    EX/MEM.B <- ID/EX.B;
  Branch:
    EX/MEM.ALUOutput <- ID/EX.NPC + (ID/EX.Imm << 2);
    EX/MEM.cond <- (ID/EX.A == 0);

Stage MEM:
  ALU instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.ALUOutput <- EX/MEM.ALUOutput;
  Load or store:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.LMD <- Mem[EX/MEM.ALUOutput]; or
    Mem[EX/MEM.ALUOutput] <- EX/MEM.B;

Stage WB:
  ALU instruction:
    Regs[MEM/WB.IR[rd]] <- MEM/WB.ALUOutput; or
    Regs[MEM/WB.IR[rt]] <- MEM/WB.ALUOutput;
  Load only:
    Regs[MEM/WB.IR[rt]] <- MEM/WB.LMD;

Figure A.19


Basic Performance Issues for Pipelining

Example: Assume that an unpipelined processor has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution time will we gain from the pipelined implementation?

Solution:
Avg. instr. exec time (unpipelined) = Clock cycle time x Avg. CPI
                                    = 1 ns x (40% x 4 + 20% x 4 + 40% x 5) = 4.4 ns

In the ideal situation without any latency impact, the average CPI is just 1 cycle for all kinds of instructions and the clock cycle time is 1.0 ns + 0.2 ns = 1.2 ns, so
Avg. instr. exec time (pipelined) = 1.2 ns x 1 = 1.2 ns

The speedup from pipelining is therefore 4.4 ns / 1.2 ns, or about 3.7 times.

What is the result if there is no overhead when implementing pipelining?
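The arithmetic of this example can be reproduced with a short sketch. The variable and function names are illustrative; the numbers are the ones given in the example.

```python
def average_cpi(mix):
    """mix is a list of (relative frequency, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


# ALU operations, branches, memory operations (from the example).
MIX = [(0.40, 4), (0.20, 4), (0.40, 5)]
CLOCK_NS = 1.0      # unpipelined clock cycle
OVERHEAD_NS = 0.2   # clock skew and setup added by pipelining

unpipelined_ns = CLOCK_NS * average_cpi(MIX)
pipelined_ns = (CLOCK_NS + OVERHEAD_NS) * 1   # ideal pipelined CPI of 1

print(round(unpipelined_ns, 2))                 # 4.4
print(round(unpipelined_ns / pipelined_ns, 2))  # 3.67
print(round(unpipelined_ns / CLOCK_NS, 2))      # 4.4, the speedup with no overhead
```

With no clock overhead the pipelined instruction time would be 1 ns, so the speedup would equal the unpipelined average CPI of 4.4.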


A.2 The Major Hurdle of Pipelining – Pipeline Hazards

Limits to pipelining: there are situations, called hazards, that prevent the next instruction from executing during its designated clock cycle and thus reduce performance from the ideal speedup. The three classes of hazards are:

– Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution, i.e., two different instructions use the same hardware in the same cycle.

– Data hazards: arise when an instruction depends on the result of a prior instruction that is still in the pipeline (RAW, WAR, and WAW).

– Control hazards: arise from the pipelining of branches and other instructions that change the PC.

A common solution is to stall the pipeline until the hazard is cleared, i.e., to insert one or more “bubbles” into the pipeline.


Performance of Pipelining with Stalls

• The pipelined CPI:

$$\mathrm{CPI}_{\mathrm{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall cycles per instr.} = 1 + \text{Pipeline stall cycles per instr.}$$

• Ignoring the cycle time overhead of pipelining, and assuming that the stages are perfectly balanced (all occupy one clock cycle) and that all instructions take the same number of cycles, the speedup from pipelining is:

$$\text{Speedup} = \frac{\mathrm{CPI}_{\mathrm{unpipelined}}}{\mathrm{CPI}_{\mathrm{pipelined}}} = \frac{\mathrm{CPI}_{\mathrm{unpipelined}}}{1 + \text{Pipeline stall cycles per instr.}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instr.}}$$
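As a quick numerical check of the relation Speedup = Pipeline depth / (1 + Pipeline stall cycles per instr.), the sketch below evaluates it under the same assumptions (ideal CPI of 1, no clock overhead, balanced stages). The stall rate of 0.25 cycles per instruction is a made-up value for illustration, not a figure from the slides.

```python
def pipelined_cpi(stall_cycles_per_instr, ideal_cpi=1.0):
    """CPI of the pipelined machine: ideal CPI plus stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instr


def speedup(pipeline_depth, stall_cycles_per_instr):
    """Speedup over the unpipelined machine, ignoring clock overhead."""
    return pipeline_depth / pipelined_cpi(stall_cycles_per_instr)


print(speedup(5, 0.0))   # 5.0: no stalls, speedup equals the pipeline depth
print(speedup(5, 0.25))  # 4.0: hypothetical 0.25 stall cycles per instruction
```

Every stall cycle per instruction goes straight into the denominator, which is why the later slides concentrate on removing stalls.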


Structural Hazards

• A structural hazard occurs when two or more different instructions want to use the same hardware resource in the same cycle, e.g., MEM uses the same memory port as IF, as shown in this slide.
• Solution: stall

[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles 1 to 7; each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]


Structural Hazards

This is another way of looking at the effect of a stall.

[Figure: instruction order (Load, Instr 1, Instr 2, Stall, Instr 3) against time in clock cycles 1 to 7; each instruction passes through Ifetch, Reg, ALU, DMem, Reg, and the stall row consists of five bubbles.]


Structural Hazards

This is another way to represent the stall.


Dealing With Structural Hazards

• Stall
  – low cost, simple
  – increases CPI
  – use for rare cases, since stalling affects performance
• Replicate resource
  – good performance
  – increases cost (+ maybe interconnect delay)
  – useful for cheap or divisible resources

E.g., we use separate instruction and data memories in the MIPS pipeline.


Data Hazards

• Data hazards occur when the pipeline changes the order of read/write accesses to operands (registers) so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.

• The real trouble arises when we have:

    instruction A
    instruction B

  and B manipulates (reads or writes) data before A does. This violates the order of the instructions, since the architecture implies that A completes entirely before B is executed.


Data Hazards

Read After Write (RAW)

InstrJ tries to read an operand before InstrI writes it

• Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Execution order is: InstrI, then InstrJ

I: dadd r1,r2,r3
J: dsub r4,r1,r3


Data Hazards

Write After Read (WAR)

InstrJ tries to write an operand before InstrI reads it
– InstrI gets the wrong operand
– Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

Execution order is: InstrI, then InstrJ

I: dsub r4,r1,r3
J: dadd r1,r2,r3
K: mul r6,r1,r7

• Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5


Data Hazards

Write After Write (WAW)

InstrJ tries to write an operand before InstrI writes it
– Leaves the wrong result (the value written by InstrI, not InstrJ)
– Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.

Execution order is: InstrI, then InstrJ

I: dsub r1,r4,r3
J: dadd r1,r2,r3
K: mul r6,r1,r7

• Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5

• We will see WAR and WAW in later, more complicated pipeline implementations
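The three cases can be told apart by comparing the destination and source registers of an instruction pair. The sketch below is an illustration only: the `Instr` tuple and `data_hazards` function are hypothetical names, and the function reports potential hazards (dependences); whether one actually causes trouble depends on the pipeline, as noted above for WAR and WAW.

```python
from collections import namedtuple

# dest is the register written; srcs are the registers read.
Instr = namedtuple("Instr", "op dest srcs")


def data_hazards(first, second):
    """Classify the potential hazards between `first` and a later `second`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")   # second writes what first reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")   # both write the same register
    return found


# Instruction pairs I and J from the three slides.
print(data_hazards(Instr("dadd", "r1", ("r2", "r3")),
                   Instr("dsub", "r4", ("r1", "r3"))))  # {'RAW'}
print(data_hazards(Instr("dsub", "r4", ("r1", "r3")),
                   Instr("dadd", "r1", ("r2", "r3"))))  # {'WAR'}
print(data_hazards(Instr("dsub", "r1", ("r4", "r3")),
                   Instr("dadd", "r1", ("r2", "r3"))))  # {'WAW'}
```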


Solutions to Data Hazards

• Simple solution to RAW
  • Hardware detects the RAW hazard and stalls until the result is written into the register
    + low cost to implement, simple
    – reduces the number of instructions executed per cycle

• Minimizing RAW stalls: forwarding (also called bypassing)
  • Key insight: the result is not really needed by the current instruction until after the previous instruction actually produces it.
  • The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
  • If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
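The selection made by the forwarding control logic can be sketched as follows. This is a simplified model, not the actual hardware description: the dictionaries stand in for the EX/MEM and MEM/WB pipeline registers, the field names `writes_reg` and `dest` are invented for the example, and loads are not modeled. Register 0 is excluded because it always reads as zero in MIPS.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose the source of one ALU operand for the instruction now in EX.

    ex_mem and mem_wb describe the ALU instructions one and two steps ahead:
    'writes_reg' says whether the instruction writes a register, 'dest' which one.
    """
    if src_reg != 0 and ex_mem["writes_reg"] and ex_mem["dest"] == src_reg:
        return "EX/MEM"    # newest value: result of the immediately preceding instruction
    if src_reg != 0 and mem_wb["writes_reg"] and mem_wb["dest"] == src_reg:
        return "MEM/WB"    # result of the instruction two ahead
    return "REGFILE"       # no hazard: use the value read in ID


dadd = {"writes_reg": True, "dest": 1}    # dadd r1,r2,r3
dsub = {"writes_reg": True, "dest": 4}    # dsub r4,r1,r3
empty = {"writes_reg": False, "dest": 0}  # nothing relevant in that register

# dsub r4,r1,r3 in EX while dadd sits in EX/MEM:
print(forward_select(1, ex_mem=dadd, mem_wb=empty))  # EX/MEM
# and r6,r1,r7 in EX while dsub is in EX/MEM and dadd in MEM/WB:
print(forward_select(1, ex_mem=dsub, mem_wb=dadd))   # MEM/WB
```

Checking EX/MEM first matters when both pipeline registers hold a write to the same register: the more recent result must win.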


Data Hazards

The use of the result of the DADD instruction in the next two instructions causes a hazard, since the register is not written until after those instructions read it.

dadd r1,r2,r3
dsub r4,r1,r3
and  r6,r1,r7
or   r8,r1,r9
xor  r10,r1,r11

[Figure: instruction order against time in clock cycles CC1 to CC9; each instruction passes through Ifetch (IF), Reg (ID), ALU (EX), DMem (MEM), Reg (WB).]


Forwarding to Avoid Data Hazards

Forwarding is the concept of making data available to the input of the ALU for subsequent instructions, even though the generating instruction has not yet reached WB to write the memory or registers.

dadd r1,r2,r3
dsub r4,r1,r3
and  r6,r1,r7
or   r8,r1,r9
xor  r10,r1,r11

[Figure: instruction order against time in clock cycles CC1 to CC9; each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]


Data Hazards Requiring Stalls

There are some instances where hazards occur even with forwarding, e.g., a loaded value is not available until after the MEM stage.

LD   R1,0(R2)
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9

[Figure: instruction order against time in clock cycles CC1 to CC8; each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]


Data Hazards Requiring Stalls

The stall is necessary in this case.

LD   R1,0(R2)
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9

[Figure: instruction order against time in clock cycles CC1 to CC8; a bubble is inserted into DSUB, AND, and OR, delaying each of them by one cycle behind the load.]


Another Representation of the Stall

Without the stall:

Clock cycle        1    2    3    4      5    6    7    8    9
LD   R1, 0(R2)     IF   ID   EX   MEM    WB
DSUB R4, R1, R5         IF   ID   EX     MEM  WB
AND  R6, R1, R7              IF   ID     EX   MEM  WB
OR   R8, R1, R9                   IF     ID   EX   MEM  WB

With the stall:

Clock cycle        1    2    3    4      5    6    7    8    9
LD   R1, 0(R2)     IF   ID   EX   MEM    WB
DSUB R4, R1, R5         IF   ID   stall  EX   MEM  WB
AND  R6, R1, R7              IF   stall  ID   EX   MEM  WB
OR   R8, R1, R9                   stall  IF   ID   EX   MEM  WB

In the top table, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the DSUB, which occurs at the same time. This problem is solved by inserting a stall, as shown in the bottom table.
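The stalled timing can be derived with a short sketch. It is a simplified model under stated assumptions: full forwarding is available, so the only stall is one cycle when an instruction uses the result of the load immediately before it. The `Instr` tuple and function names are hypothetical.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")


def load_use_stall(producer, consumer):
    """True when `consumer` needs a value that `producer` is still loading."""
    return producer.op == "LD" and producer.dest in consumer.srcs


def ex_cycles(program):
    """Clock cycle in which each instruction executes its EX stage."""
    cycles = []
    for i, instr in enumerate(program):
        cycle = 3 if i == 0 else cycles[-1] + 1  # first EX is in cycle 3 (IF, ID, EX)
        if i > 0 and load_use_stall(program[i - 1], instr):
            cycle += 1                            # one bubble after the load
        cycles.append(cycle)
    return cycles


program = [
    Instr("LD", "R1", ("R2",)),
    Instr("DSUB", "R4", ("R1", "R5")),
    Instr("AND", "R6", ("R1", "R7")),
    Instr("OR", "R8", ("R1", "R9")),
]
print(ex_cycles(program))  # [3, 5, 6, 7]
```

The output places the EX stages of DSUB, AND, and OR in cycles 5, 6, and 7, one cycle later than they would be without the stall.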


Control Hazards

• A control hazard happens when we need to find the destination of a branch and can’t fetch any new instructions until we know that destination.

– If instruction i is a taken branch, then the PC is normally not changed until the end of ID

• Control hazards can cause a greater performance loss than data hazards do.


Control Hazard on Branches: Three-Cycle Stall

12: beq r1,r3,36
16: and r2,r3,r5
20: or  r6,r1,r7
24: add r8,r1,r9
36: xor r10,r1,r11

[Figure: instruction order against time in clock cycles CC1 to CC9; each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]


Branch Stall Impact

• If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles => new CPI = 1.9!

• Two solutions to this dramatic increase:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the target address earlier

• MIPS branch tests whether a register = 0 or ≠ 0

• MIPS solution:
  – Move the zero test to the ID stage
  – Add an adder to calculate the target address in the ID stage
  – 1 clock cycle penalty for a branch versus 3
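The CPI figures follow from CPI = base CPI + branch frequency x branch penalty. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def cpi_with_branches(branch_fraction, branch_penalty, base_cpi=1.0):
    """Every branch adds `branch_penalty` stall cycles on top of the base CPI."""
    return base_cpi + branch_fraction * branch_penalty


print(round(cpi_with_branches(0.30, 3), 2))  # 1.9 with the 3-cycle stall
print(round(cpi_with_branches(0.30, 1), 2))  # 1.3 with the 1-cycle penalty
```

Moving the zero test and the target adder into ID therefore lowers the CPI from 1.9 to 1.3 for the same 30% branch frequency.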


The Pipeline of 1-Cycle Stall for Branch


Four Solutions to Branch Hazards

#1: Stall until branch direction is clear
– Simple for both software and hardware
– Branch penalty is fixed (1-cycle penalty for revised MIPS)

Branch instr.         IF   ID   EX   MEM  WB
Branch successor           IF   IF   ID   EX   MEM  WB
Branch successor+1                   IF   ID   EX   MEM  WB
Branch successor+2                        IF   ID   EX   MEM  WB


Four Solutions to Branch Hazards

#2: Predict Branch Not Taken
– Continue to fetch instructions as if the branch were a normal instruction.
– If the branch is taken, turn the fetched instruction into a no-op and restart the fetch at the target address.

Untaken branch instr.   IF   ID   EX   MEM  WB
Branch successor             IF   ID   EX   MEM  WB
Branch successor+1                IF   ID   EX   MEM  WB
Branch successor+2                     IF   ID   EX   MEM  WB
Branch successor+3                          IF   ID   EX   MEM  WB

Taken branch instr.     IF   ID   EX   MEM  WB
Branch successor             IF   idle idle idle idle
Branch target                     IF   ID   EX   MEM  WB
Branch target+1                        IF   ID   EX   MEM  WB
Branch target+2                             IF   ID   EX   MEM  WB


Four Solutions to Branch Hazards

#3: Predict Branch Taken
– As soon as the branch is decoded and the target address is computed, we assume the branch is taken and begin fetching and executing at the target.
– In MIPS, however, the target address is not known any earlier than the branch outcome.
  • MIPS still incurs a 1-cycle branch penalty
  • Useful for other machines on which the target address is known before the branch outcome


Four Solutions to Branch Hazards

#4: Delayed Branch
– The execution cycle with a branch delay of one is:
    branch instruction
    sequential successor_1
    branch target if taken
– The sequential successor is in the branch delay slot.
– The instruction in the branch delay slot is executed whether or not the branch is taken (for zero cycle penalty).

• Where to get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when the branch is taken
  – From the fall-through path: only valuable when the branch is not taken
  – Canceling or nullifying branches allow more slots to be filled (non-zero cycle penalty; its value depends on the rate of correct prediction)
  – The delay-slot instruction is turned into a no-op if incorrectly predicted


Four Solutions to Branch Hazards


Pipelining Introduction Summary

• Just overlap tasks; this is easy if the tasks are independent
• Speedup vs. pipeline depth; if the ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$

• Hazards limit performance on computers:
  – Structural: need more hardware resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction
