CMPT 334 Computer Organization Chapter 4 The Processor (Pipelining) [Adapted from Computer Organization and Design 5th Edition, Patterson & Hennessy, © 2014, MK]



Improving Performance

•Ultimate goal: improve system performance

•One idea: pipeline the CPU
▫Pipelining is a technique in which multiple instructions are overlapped in execution
▫It relies on the fact that the various parts of the CPU aren’t all used at the same time

•Let’s look at an analogy


Sequential Laundry

• Four roommates need to do laundry
• How long to do laundry sequentially?
▫Washer, dryer, “folder”, “storer” each take 30 minutes
▫Total time: 8 hours for four loads


Pipelined Laundry

•How long does it take if tasks can overlap?
▫Only 3.5 hours!
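The two laundry totals can be checked with a few lines of arithmetic. This is a minimal sketch; the constant names are illustrative, and it assumes every stage takes exactly 30 minutes and a new load can start as soon as the washer is free.

```python
STAGE_MINUTES = 30   # washer, dryer, "folder", "storer" each take 30 minutes
NUM_STAGES = 4
NUM_LOADS = 4

# Sequential: each load finishes all four stages before the next one starts.
sequential_minutes = NUM_LOADS * NUM_STAGES * STAGE_MINUTES

# Pipelined: the first load needs all four stages; every later load
# finishes one stage time after the load before it.
pipelined_minutes = (NUM_STAGES + NUM_LOADS - 1) * STAGE_MINUTES

print(sequential_minutes / 60)  # 8.0 hours
print(pipelined_minutes / 60)   # 3.5 hours
```

Each individual load still takes 2 hours from washer to storage; only the total for all four loads shrinks.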


Pipelining Notes

• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
▫How many instructions can we execute per second?

• Potential speedup = number of stages


MIPS Pipeline

• Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
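To see how the five stages overlap, the sketch below builds a cycle-by-cycle diagram for a short program. The function name `pipeline_diagram` and the three-instruction program are illustrative assumptions, and the sketch assumes an ideal pipeline with no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return one row per instruction; each row lists the stage occupied in each cycle."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i in range(len(instructions)):
        row = []
        for cycle in range(n_cycles):
            stage_index = cycle - i  # instruction i enters IF in cycle i
            in_pipeline = 0 <= stage_index < len(STAGES)
            row.append(STAGES[stage_index] if in_pipeline else "")
        rows.append(row)
    return rows

# Hypothetical program chosen so that no instruction depends on another.
program = ["lw $s3, 17($s1)", "add $t0, $t1, $t2", "sw $t3, 0($t4)"]
for instr, row in zip(program, pipeline_diagram(program)):
    print(f"{instr:<20}" + "".join(f"{stage:<5}" for stage in row))
```

Each row is shifted one cycle to the right of the row above it, so in the middle cycles several stages are busy with different instructions at once.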


Stages of the Datapath

•Stage 1: Instruction Fetch
▫No matter what the instruction, the 32-bit instruction word must first be fetched from memory
▫Every time we fetch an instruction, we also increment the PC to prepare it for the next instruction fetch
  ▫PC = PC + 4, to point to the next instruction


Stages of the Datapath

•Stage 2: Instruction Decode
▫First, read the opcode to determine instruction type and field lengths
▫Second, read in data from all necessary registers
  ▫For add, read two registers
  ▫For addi, read one register
  ▫For jal, no register read necessary


Stages of the Datapath

•Stage 3: Execution
▫Uses the ALU
▫The real work of most instructions is done here: arithmetic, logic, etc.
▫What about loads and stores, e.g., lw $t0, 40($t1)?
  ▫Address we are accessing in memory is 40 + contents of $t1
  ▫We can use the ALU to do this addition in this stage


Stages of the Datapath

•Stage 4: Memory Access
▫Only the load and store instructions do anything during this stage; the others remain idle

•Stage 5: Register Write
▫Most instructions write the result of some computation into a register
▫Examples: arithmetic, logical, shifts, loads, slt
▫What about stores, branches, jumps?
  ▫Don’t write anything into a register at the end
  ▫These remain idle during this fifth stage


Datapath Walkthrough: LW, SW

• lw $s3, 17($s1)
▫ Stage 1: fetch this instruction, increment PC
▫ Stage 2: decode to find it’s a lw, then read register $s1
▫ Stage 3: add 17 to value in register $s1 (retrieved in Stage 2)
▫ Stage 4: read value from memory address computed in Stage 3
▫ Stage 5: write value read in Stage 4 into register $s3

• sw $s3, 17($s1)
▫ Stage 1: fetch this instruction, increment PC
▫ Stage 2: decode to find it’s a sw, then read registers $s1 and $s3
▫ Stage 3: add 17 to value in register $s1 (retrieved in Stage 2)
▫ Stage 4: write value in register $s3 (retrieved in Stage 2) into memory address computed in Stage 3
▫ Stage 5: go idle (nothing to write into a register)
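The two walkthroughs above can be mirrored in a small software model. This is only a sketch: `run_lw` and `run_sw` are hypothetical helper names, registers and memory are plain dictionaries, Stage 1 is assumed to have already fetched the instruction, and word-alignment checks that real MIPS hardware performs are ignored.

```python
def run_lw(regs, mem, rt, offset, rs):
    """Model lw rt, offset(rs) stage by stage (after the fetch in Stage 1)."""
    base = regs[rs]        # Stage 2: decode, read the base register
    addr = base + offset   # Stage 3: ALU adds the offset to the base
    value = mem[addr]      # Stage 4: read data memory at the computed address
    regs[rt] = value       # Stage 5: write the loaded value into rt
    return addr

def run_sw(regs, mem, rt, offset, rs):
    """Model sw rt, offset(rs) stage by stage (after the fetch in Stage 1)."""
    base = regs[rs]        # Stage 2: decode, read the base register...
    data = regs[rt]        # ...and the register holding the value to store
    addr = base + offset   # Stage 3: ALU adds the offset to the base
    mem[addr] = data       # Stage 4: write data memory at the computed address
    return addr            # Stage 5: idle, nothing is written to a register

# Illustrative starting state, not taken from the slides.
regs = {"$s1": 100, "$s3": 0}
mem = {117: 42}
run_lw(regs, mem, "$s3", 17, "$s1")
print(regs["$s3"])  # 42
```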


Datapath Walkthrough: SLTI, ADD

• slti $s3,$s1,17
▫Stage 1: fetch this instruction, increment PC
▫Stage 2: decode to find it’s an slti, then read register $s1
▫Stage 3: compare value retrieved in Stage 2 with the integer 17
▫Stage 4: go idle
▫Stage 5: write the result of Stage 3 in register $s3

• add $s3,$s1,$s2
▫Stage 1: fetch this instruction, increment PC
▫Stage 2: decode to find it’s an add, then read registers $s1 and $s2
▫Stage 3: add the two values retrieved in Stage 2
▫Stage 4: idle (nothing to write to memory)
▫Stage 5: write result of Stage 3 into register $s3
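The same steps for slti and add can be sketched in code. `run_slti` and `run_add` are hypothetical names, the register file is a dictionary, and 32-bit overflow is deliberately ignored to keep the sketch short.

```python
def run_slti(regs, rt, rs, imm):
    """Model slti rt, rs, imm: set rt to 1 if rs < imm, else 0."""
    a = regs[rs]                   # Stage 2: read the source register
    result = 1 if a < imm else 0   # Stage 3: ALU compares with the immediate
    # Stage 4: idle, no memory access
    regs[rt] = result              # Stage 5: write the comparison result
    return result

def run_add(regs, rd, rs, rt):
    """Model add rd, rs, rt (overflow ignored in this sketch)."""
    a, b = regs[rs], regs[rt]      # Stage 2: read both source registers
    result = a + b                 # Stage 3: ALU adds the two values
    # Stage 4: idle, no memory access
    regs[rd] = result              # Stage 5: write the sum
    return result

# Illustrative register values, not taken from the slides.
regs = {"$s1": 5, "$s2": 9, "$s3": 0}
run_slti(regs, "$s3", "$s1", 17)
print(regs["$s3"])  # 1, because 5 < 17
run_add(regs, "$s3", "$s1", "$s2")
print(regs["$s3"])  # 14
```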


Pipeline Performance

•Assume time for stages is
▫100ps for register read or write
▫200ps for other stages
•Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps          -               700ps
R-format  200ps        100ps          200ps   -              100ps           600ps
beq       200ps        100ps          200ps   -              -               500ps
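The totals in the table, and the clock periods they imply, can be recomputed as follows. This is a sketch: the dictionary names are illustrative, and it assumes the single-cycle clock must fit the slowest instruction while the pipelined clock must fit the slowest stage.

```python
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages that do useful work for each instruction class in the table.
STAGES_USED = {
    "lw":       ["IF", "ID", "EX", "MEM", "WB"],
    "sw":       ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq":      ["IF", "ID", "EX"],
}

totals = {
    instr: sum(STAGE_PS[stage] for stage in stages)
    for instr, stages in STAGES_USED.items()
}

single_cycle_clock = max(totals.values())   # slowest instruction (lw)
pipelined_clock = max(STAGE_PS.values())    # slowest stage

print(totals)              # {'lw': 800, 'sw': 700, 'R-format': 600, 'beq': 500}
print(single_cycle_clock)  # 800
print(pipelined_clock)     # 200
```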


Pipeline Performance

[Figure: execution timelines for the single-cycle datapath (Tc = 800ps) and the pipelined datapath (Tc = 200ps)]
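A short calculation shows what the two clock periods mean for a small program. The three-instruction program is an assumption made for this sketch, as are the function names; the pipelined formula assumes five stages and no stalls.

```python
def single_cycle_total(n_instructions, clock_ps):
    """Each instruction takes one full (long) clock cycle."""
    return n_instructions * clock_ps

def pipelined_total(n_instructions, clock_ps, n_stages=5):
    """The first instruction takes n_stages cycles; each later one adds one cycle."""
    return (n_stages + n_instructions - 1) * clock_ps

print(single_cycle_total(3, 800))  # 2400 ps
print(pipelined_total(3, 200))     # 1400 ps
```

For only three instructions the pipeline's fill time dominates, so the improvement is well below the ratio of the clock periods.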


Pipeline Speedup

•If all stages are balanced
▫i.e., all take the same time
▫$\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
•If not balanced, speedup is less
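The effect of unbalanced stages can be illustrated with the 800ps single-cycle and 200ps pipelined clock periods from the Pipeline Performance slides. The sketch below is hedged: `speedup` is a hypothetical helper, and it assumes a five-stage pipeline with no stalls.

```python
def speedup(n_instructions, t_nonpipelined_ps, t_pipelined_ps, n_stages=5):
    """Ratio of non-pipelined to pipelined execution time for n instructions."""
    nonpipelined = n_instructions * t_nonpipelined_ps
    pipelined = (n_stages + n_instructions - 1) * t_pipelined_ps
    return nonpipelined / pipelined

# Balanced ideal: 800 ps of work split evenly over 5 stages.
ideal_time_between = 800 / 5
print(ideal_time_between)        # 160.0 ps between instructions

# Actual: the slowest stage sets the clock at 200 ps, so the
# long-run speedup approaches 800 / 200 = 4 rather than 5.
print(speedup(3, 800, 200))
print(speedup(1_000_000, 800, 200))
```

With perfectly balanced 160ps stages the same helper approaches a speedup of 5, the number of stages.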


Limits to Pipelining: Hazards

•Situations that prevent starting the next instruction in the next cycle
•Structure hazards
▫A required resource is busy
•Data hazard
▫Need to wait for previous instruction to complete its data read/write
•Control hazard
▫Deciding on control action depends on previous instruction


Data Hazards

•An instruction depends on completion of data access by a previous instruction
▫add $s0, $t0, $t1
  sub $t2, $s0, $t3
▫The sub reads $s0, which the add has not yet written: stall the pipeline
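A read-after-write dependence like the one between add and sub can be detected mechanically. The sketch below makes several assumptions that are not stated on the slide: there is no forwarding, the producer writes its register in WB (cycle 5 if it is fetched in cycle 1), the consumer reads registers in ID, and the flag `split_cycle_regfile` chooses whether a register written in the first half of a cycle can be read in the second half of the same cycle. The helper names and the simple three-register parser are illustrative only.

```python
def parse(instr):
    """Split 'op rd, rs, rt' into (op, dest, [sources]); R-format only."""
    op, operands = instr.split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return op, regs[0], regs[1:]

def stall_cycles(producer, consumer, distance=1, split_cycle_regfile=True):
    """Bubbles needed so the consumer's ID stage sees the producer's result.

    distance is how many instructions after the producer the consumer is
    issued (1 means immediately after).
    """
    _, dest, _ = parse(producer)
    _, _, sources = parse(consumer)
    if dest not in sources:
        return 0  # no read-after-write dependence
    wb_cycle = 5                                   # producer fetched in cycle 1
    earliest_id = wb_cycle if split_cycle_regfile else wb_cycle + 1
    natural_id = 2 + distance                      # consumer's ID without stalls
    return max(0, earliest_id - natural_id)

print(stall_cycles("add $s0, $t0, $t1", "sub $t2, $s0, $t3"))
print(stall_cycles("add $s0, $t0, $t1", "sub $t2, $t4, $t3"))
```

Under these assumptions the dependent pair needs two bubbles, or three if the register file cannot be written and read in the same cycle; the independent pair needs none.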


Exercise 4.8

IF     ID     EX     MEM    WB
250ps  350ps  150ps  300ps  200ps

R-type  beq  lw   sw
45%     20%  20%  15%

•What is the clock cycle time in a pipelined and non-pipelined processor?

Pipelined  Single-cycle
350 ps     1250 ps

•What is the total latency of an lw instruction in a pipelined and non-pipelined processor?

Pipelined  Single-cycle
1750 ps    1250 ps

▫Pipelined: lw passes through all 5 stages at 350 ps per cycle, so 5 × 350 ps = 1750 ps
▫Single-cycle: one 1250 ps cycle
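The clock periods and the lw latency for Exercise 4.8 can be recomputed from the stage latencies. This is a sketch with illustrative variable names; it assumes the pipelined clock is set by the slowest stage, the single-cycle clock by the sum of all stages, and that lw occupies every one of the five stages.

```python
STAGE_PS = {"IF": 250, "ID": 350, "EX": 150, "MEM": 300, "WB": 200}

pipelined_clock = max(STAGE_PS.values())      # slowest stage sets the cycle time
single_cycle_clock = sum(STAGE_PS.values())   # all stages in one long cycle

# lw passes through all five stages, one clock cycle per stage when pipelined.
lw_latency_pipelined = len(STAGE_PS) * pipelined_clock
lw_latency_single_cycle = single_cycle_clock

print(pipelined_clock, single_cycle_clock)            # 350 1250
print(lw_latency_pipelined, lw_latency_single_cycle)  # 1750 1250
```

Pipelining makes the latency of a single lw longer here, because every stage is stretched to the 350 ps of the slowest one; the gain is in throughput.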

•What is the utilization of the data memory?
▫35% (only lw and sw access data memory: 20% + 15%)

•What is the utilization of the write-register port of the “Registers” unit?
▫65% (only R-type and lw write a register: 45% + 20%)
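Both utilization figures follow directly from the instruction mix. The sketch below assumes no stalls, so that one instruction passes through each unit per cycle; the set and function names are illustrative.

```python
# Instruction mix from Exercise 4.8, in percent.
MIX_PERCENT = {"R-type": 45, "beq": 20, "lw": 20, "sw": 15}

USES_DATA_MEMORY = {"lw", "sw"}     # only loads and stores access data memory
WRITES_REGISTER = {"R-type", "lw"}  # only these write a result register

def utilization(users):
    """Percentage of cycles in which a unit is used by the given instruction classes."""
    return sum(MIX_PERCENT[instr] for instr in users)

print(utilization(USES_DATA_MEMORY))  # 35
print(utilization(WRITES_REGISTER))   # 65
```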