28
Lecture-5 (Pipelining) CS422-Spring 2018 Biswa@CSE-IITK

Lecture-5 (Pipelining) CS422-Spring 2018

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Lecture-5 (Pipelining) CS422-Spring 2018

Lecture-5 (Pipelining)CS422-Spring 2018

Biswa@CSE-IITK

Page 2: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 2

Before That: Single Cycle Design

Page 3: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 3

An R-format single Cycle CPU

opcode rs rt rd functshamt

Syntax: ADD $8 $9 $10 Semantics: $8 = $9 + $10

Sample program:ADD $8 $9 $10SUB $4 $8 $3AND $9 $8 $4...

How registers get their initial values are not of concern to us right now.

No loads or stores: machine has no use for data memory, only instruction memory.

No branches or jumps: machine only runs straight line code.

Page 4: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 4

Instruction Memory

32

Addr

Data

32

InstrMem

Reads are combinational: Put a stable address on input, a short time later data appears on output.

Not concerned about how programs are loaded into this memory.

Related to separate instruction and data caches in “real” designs.

Page 5: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 5

Let’s Fetch It

32

Addr

Data

32

InstrMem

Fetching straight-line MIPS instructions requires a machine that generates this timing diagram:

Why +4 and not +1?

Why increment every cycle?

CLK

Addr

Data IMem[PC + 8]IMem[PC + 4]IMem[PC]

PC + 8PC + 4PC

PC == Program Counter, points to next instruction.

32-bit instructions.

Straight-line code.

Page 6: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 6

Decode & Execute

32rd1

RegFile

32rd2

WE32wd

5rs1

5rs2

5ws

32ALU

32

32

op

opcode rs rt rd functshamt

Decode fields to get : ADD $8 $9 $10

Logic

Page 7: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 7

Let’s Put It Together

32

Addr Data

Instr

Mem

32D

PC

Q

32

32

+

32

32

0x4

To rs1,

rs2, ws,

op decode

logic ...

32

rd1

RegFile

32rd2

WE32

wd

5rs1

5rs2

5ws

32A

L

U

32

32

op

Page 8: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 8

R TO I Format

Page 9: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 9

Adding Data Memory

32

rd1

RegFile

32rd2

WE32

wd

5rs1

5rs2

5ws

ExtRegDest

ALUsrcExtOp

ALUctr

MemToReg

MemWr

Syntax: LW $1, 32($2)

Action: $1 = M[$2 + 32]

RegWr

Page 10: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 10

What is Single Cycle Here?

32

Addr Data

Instr

MemEqual

RegDest

RegWr

ExtOp

ALUsrc MemWr

MemToReg

PCSrc

Combinational Logic

(Only Gates, No Flip Flops)

Just specify logic functions!

rs,rt,rd,imm

Page 11: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 11

Let’s Understand This

Page 12: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 12

Implications?

Page 13: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 13

Implications?

Page 14: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 14

So 8ns enough?

Page 15: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 15

It’s the Memory Stupid ( ~ 50 ns) – Oh NO!!

So, frequency of ~ 10MHzBut Confucius says “Make the common case first” ☺

Page 16: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 16

World of Latency, Throughput, Parallelism

• Latency• time it takes to complete one instance

• Throughput• number of computations done per unit time

Page 17: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 17

Pipelining: Goal

• Goal: Increase machine throughput by making better use of available hardware resources

• Pipelining increases throughput at the cost of latency

• Hopefully not too high of a cost

• Assumptions

• Idle hardware resources exist

• parallelism!

• Work available

• parallelism!

Page 18: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 18

How?

• Partition hardware function into sub-functions so overlap can occur

• Ideally each sub-function same time

f

t

t + t

f1

f2

f3

t

t + t/3

t + 2t/3

t + tIssue op every t time units

Issue op every t/3 time units (may be slightly more than t/3-- why?)

Page 19: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 19

Malformed Pipeline

A2ns

B1ns

C1ns

X

F(X)

How many stage pipeline is this?What is the problem?

Page 20: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 20

Ideal Pipeline

• Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)

• Repetition of identical operations• The same operation is repeated on a large number of different

inputs

• Repetition of independent operations• No dependencies between repeated operations

• Uniformly partitionable suboperations• Processing can be evenly divided into uniform-latency

suboperations (that do not share resources)

• Fitting examples: automobile assembly line, doing laundry

Page 21: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 21

Ideal Pipeline

• All objects go through the same stages

• No sharing of resources between any two stages

• Propagation delay through all pipeline stages is equal

• The scheduling of an object entering the pipeline is not affected by the objects in other stages

stage1

stage2

stage3

stage4

Page 22: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 22

Ideal Pipeline

combinational logic (F,D,E,M,W)T psec

BW=~(1/T)

BW=~(2/T)T/2 ps (F,D,E) T/2 ps (M,W)

BW=~(3/T)T/3ps (F,D)

T/3ps (E,M)

T/3ps (M,W)

Page 23: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 23

More Realistic One

Nonpipelined version with delay T

BW = 1/(T+S) where S = latch delay

k-stage pipelined version

BWk-stage = 1 / (T/k +S )

BWmax = 1 / (1 gate delay + S )

T ps

T/kps

T/kps

Page 24: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 24

More Realistic One

Nonpipelined version with combinational cost G

Cost = G+L where L = latch cost

k-stage pipelined version

Costk-stage = G + Lk

G gates

G/k G/k

Page 25: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 25

World Is Not The Ideal One

Identical operations ... NOT!

different instructions do not need all stages- Forcing different instructions to go through the same multi-function pipe

external fragmentation (some pipe stages idle for some instructions)

Uniform suboperations ... NOT!

difficult to balance the different pipeline stages- Not all pipeline stages do the same amount of work

internal fragmentation (some pipe stages are too-fast but take the same clock cycle time)

Independent operations ... NOT!

instructions are not independent of each other- Need to detect and resolve inter-instruction dependencies to ensure the pipeline operates correctly Pipeline is not always moving (it stalls)

Page 26: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 26

Let’s Digress a Bit: Latch and Flip-flop (Piazza: +1)

Page 27: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 27

Simple 5-stage Pipeline

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg F

ile

MU

XM

UX

Data

Mem

ory

MU

X

SignExtend

Zero?

IF/ID

ID/E

X

ME

M/W

B

EX

/ME

M

4

Adder

Next SEQ PC Next SEQ PC

RD RD RD WB

Dat

a

• Data stationary control– local decode for each instruction phase / pipeline stage

Next PC

Addre

ss

RS1

RS2

Imm

MU

X

IR <= mem[PC];

PC <= PC + 4

A <= Reg[IRrs];

B <= Reg[IRrt]

rslt <= A opIRop B

Reg[IRrd] <= WB

WB <= rslt

Page 28: Lecture-5 (Pipelining) CS422-Spring 2018

CS422: Spring 2018 Biswabandan Panda, CSE@IITK 28

Visualise It

Instr.

Order

Time (clock cycles)

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Reg

ALU

DMemIfetch Reg

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5