Single-Cycle Processors: Datapath & Control

1

Single-Cycle Processors:Datapath & Control

ArvindComputer Science & Artificial Intelligence Lab

M.I.T.

Based on the material prepared byArvind and Krste Asanovic

6.823 L5- 2

Instruction Set Architecture (ISA) Arvind

versus Implementation

• ISA is the hardware/software interface – Defines set of programmer visible state – Defines instruction format (bit encoding) and instruction

semantics

– Examples: MIPS, x86, IBM 360, JVM

• Many possible implementations of one ISA– 360 implementations: model 30 (c. 1964), z900 (c. 2001) – x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486,

Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon, Transmeta Crusoe, SoftPC

– MIPS implementations: R2000, R4000, R10000, ...– JVM: HotSpot, PicoJava, ARM Jazelle, ...

September 26, 2005

6.823 L5- 3 Arvind

Processor Performance

Time = Instructions Cycles Time Program Program * Instruction * Cycle

– Instructions per program depends on source code, compiler technology, and ISA

– Cycles per instructions (CPI) depends upon the ISA and the microarchitecture

– Time per cycle depends upon the microarchitecture and the base technology

this lecture

Microarchitecture CPI cycle time

Microcoded >1 short

Single-cycle unpipelined 1 long

Pipelined 1 short

September 26, 2005

6.823 L5- 4 Arvind

Microarchitecture: Implementation of an ISA

Controller

Data path

control pointsstatus

lines

Structure: How components are connected. Static

Behavior: How data moves between components Dynamic

September 26, 2005

Hardware Elements• Combinational circuits OpSelect

–

Result

Comp?

A

B

ALU

Sel

O A0 A1

An-1

Mux... A

Dem

ux ...

O0 O1

On1

Sel

A

Dec

oder ...

O0 O1

On-1

Mux, Demux, Decoder, ALU, ... - Add, Sub, ... - And, Or, Xor, Not, ... - GT, LT, EQ, Zero, ...

lg(n) lg(n)

lg(n)

• Synchronous state elements – Flipflop, Register, Register file, SRAM, DRAM

Clk

D

Q

Enff

Q

D

Clk En

ff

Q0

D0

Clk En

ff

Q1

D1

ff

Q2

D2

ff

Qn-1

Dn-1

...

...

...

register

Edge-triggered: Data is sampled at the rising edge September 26, 2005

6.823 L5- 6 Arvind

Register Files Clock WE

ReadData1ReadSel1 ReadSel2

WriteSel

Register file

2R+1W

ReadData2

WriteData

rd1rs1

rs2

ws wd

rd2

we

ws clk

register 1

wd

we rs2

rs1

rd1

rd2

register 0

…

…

…

…

32

32

32

32

32

325 5

5

register 31

• No timing issues in reading a selected register • Register files with a large number of ports are difficult

to design – Intel’s Itanium, GPR File has 128 registers with 8 read ports and

4 write ports!!! September 26, 2005

6.823 L5- 7 Arvind

A Simple Memory Model

WriteEnable Clock

Address ReadData

WriteData

Reads and writes are always completed in one cycle

MAGIC RAM

• a Read can be done any time (i.e. combinational) • a Write is performed at the rising clock edge

if it is enabled ⇒ the write address and data

must be stable at the clock edge

Later in the course we will present a more realistic model of memory

September 26, 2005

6.823 L5- 8 Arvind

Implementing MIPS:

Single-cycle per instructiondatapath & control logic

September 26, 2005

6.823 L5- 9 Arvind

The MIPS ISAProcessor State

32 32-bit GPRs, R0 always contains a 032 single precision FPRs, may also be viewed as

16 double precision FPRs FP status register, used for FP compares & exceptions PC, the program counter some other special registers

Data types8-bit byte, 16-bit half word 32-bit word for integers32-bit word for single precision floating point64-bit word for double precision floating point

Load/Store style instruction setdata addressing modes- immediate & indexed branch addressing modes- PC relative & register indirect Byte addressable memory- big endian mode

All instructions are 32 bits September 26, 2005

6.823 L5- 10 Arvind

Instruction Execution

Execution of an instruction involves

1. instruction fetch2. decode and register fetch3. ALU operation4. memory operation (optional)5. write back

and the computation of the address of the next instruction

September 26, 2005

6.823 L5- 11 Arvind

Datapath: Reg-Reg ALU Instructions

0x4 Add

clk

addr inst

Inst. Memory

PC

inst<5:0>

z ALU

Control

RegWrite

clk

rd1

rs1 rs2

ws wd rd2

weinst<25:21> inst<20:16>

inst<15:11>

OpCode

ALU

GPRs

RegWrite Timing?6 5 5 5 5 6 0 rd 0 funcrs rt rd ← (rs) func (rt)

31 26 25 21 20 16 15 11 5 0 September 26, 2005

6.823 L5- 12 Arvind

Datapath: Reg-Imm ALU Instructions

Imm Ext

inst<15:0>

0x4 Add

clk

addr inst

Inst. Memory

PC

z ALU

RegWrite

clk

rd1

rs1 rs2

ws wd rd2

we

ALU Control

GPRs

inst<25:21>

inst<20:16>

inst<31:26>

OpCode ExtSel

6 5 5 16 opcode rs rt immediate rt ← (rs) op immediate

31 26 25 2120 16 15 0 September 26, 2005

6.823 L5- 13 Arvind

Conflicts in Merging Datapath

0x4 Add

addr inst

Inst. Memory

PC

clk

RegWrite Introduce

Imm Ext

z ALU

clk

rd1

rs1 rs2

ws wd rd2

we

inst<15:0>

ALU Controlinst<5:0>

muxes

GPRs

inst<25:21>

inst<20:16>

inst<31:26>

inst<15:11>

OpCode ExtSel

6 5 5 5 5 6 0 rs rt rd 0 func

opcode rs rt immediate

rd ← (rs) func (rt)

rt ← (rs) op immediate

September 26, 2005

6.823 L5- 14 Arvind

Datapath for ALU Instructions

0x4 Add

addr inst

Inst. Memory

PC

RegWrite

clk

<31:26>, <5:0>

BSrc

Imm Ext

z ALU

clk

rd1

rs1 rs2

ws wd rd2

we<25:21> <20:16>

<15:0>

ALU Control

<15:11> GPRs

OpCode RegDst ExtSel OpSel rt / rd Reg / Imm

6 5 5 5 5 60 rs rt rd 0 func

opcode rs rt immediate

rd ← (rs) func (rt)

rt ← (rs) op immediate

September 26, 2005

6.823 L5- 15 Arvind

Datapath for Memory Instructions

Should program and data memory be separate?

Harvard style: separate (Aiken and Mark 1 influence) - read-only program memory - read/write data memory

at some level the two memories have to be the same

Princeton style: the same (von Neumann’s influence) - A Load or Store instruction requires

accessing the memory more than once during its execution

September 26, 2005

6.823 L5- 16 Arvind

Load/Store Instructions:Harvard Datapath

0x4 Add

addr inst

Inst. Memory

PC

RegWrite MemWrite

clk

WBSrc ALU / Mem

“base”

disp

ALU Control

z ALU

clk

rd1

rs1 rs2

ws wd rd2

we

Imm Ext

clk

addr

wdata

rdataData Memory

we

GPRs

OpCode RegDst ExtSel OpSel BSrc

opcode rs rt displacement6 5 5 16 addressing mode

(rs) + displacement 31 26 25 21 20 16 15 0 rs is the base register rt is the destination of a Load or the source for a Store

September 26, 2005

6.823 L5- 17 Arvind

MIPS Control Instructions Conditional (on GPR) PC-relative branch

6 5 5 16 opcode rs offset BEQZ, BNEZ

Unconditional register-indirect jumps6 5 5 16

opcode rs JR, JALR

Unconditional absolute jumps 6 26

opcode target J, JAL

• PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): ±128 KB range

• Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range

• jump-&-link stores PC+4 into the link register (R31) • All Control Transfers are delayed by 1 instruction

we will worry about the branch delay slot later September 26, 2005

6.823 L5- 18 Arvind

Conditional Branches (BEQZ, BNEZ)

Add

PCSrc

clk

WBSrcMemWrite

addr

wdata

rdataData Memory

we

z

clk

clk

addr inst

Inst. Memory

PC rd1

rs1 rs2

ws wd rd2

we

Imm Ext

ALU

ALU Control

Add

br

pc+4

RegWrite

0x4

zero?

GPRs


September 26, 2005

6.823 L5- 19 Arvind

Register-Indirect Jumps (JR) RegWrite

Add Add

clk

WBSrcMemWrite

addr

wdata

rdataData Memory

we

z

clk

clk

addr inst

Inst. Memory

PC rd1

rs1 rs2

ws wd rd2

we

Imm Ext

ALU

ALU Control

PCSrc br

pc+4

rind

0x4

zero?

GPRs


September 26, 2005

6.823 L5- 20 Arvind

Register-Indirect Jump-&-Link (JALR)

RegWrite

Add Add

clk

WBSrcMemWrite

addr

wdata

rdataData Memory

we

z

clk

clk

addr inst

Inst. Memory

PC rd1

rs1 rs2

ws wd rd2

we

Imm Ext

ALU

ALU Control

31

PCSrc br

pc+4

rind

0x4

zero?

GPRs


September 26, 2005

6.823 L5- 21 Arvind

Absolute Jumps (J, JAL) RegWrite

Add Add

clk

WBSrcMemWrite

addr

wdata

rdataData Memory

we

z

clk

clk

addr inst

Inst. Memory

PC rd1

rs1 rs2

ws wd rd2

we

Imm Ext

ALU

ALU Control

31

PCSrc br

pc+4

rind jabs

0x4

zero?

GPRs


September 26, 2005

6.823 L5- 22 Arvind

Harvard-Style Datapath for MIPS RegWrite

Add Add

clk

WBSrc

addr

wdata

rdataData Memory

we

z

clk

zero?

clk

addr inst

Inst. Memory

PC rd1

rs1 rs2

ws wd rd2

we

Imm Ext

ALU

ALU Control

31

PCSrc br rind jabs pc+4

0x4

MemWrite

GPRs


September 26, 2005

23

Five-minute break to stretch your legs

6.823 L5- 24 Arvind

Single-Cycle Hardwired Control:Harvard architecture

We will assume • clock period is sufficiently long for all of

the following steps to be “completed”:

1. instruction fetch2. decode and register fetch3. ALU operation4. data fetch if required5. register write-back setup time

⇒ tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB

• At the rising edge of the following clock, the PC,the register file and the memory are updated

September 26, 2005

6.823 L5- 25

Hardwired Control is pure Arvind

Combinational Logic

combinational logic

ExtSel

BSrc

OpSelop code

MemWrite

WBSrczero?

RegDst

RegWrite

PCSrc

September 26, 2005

6.823 L5- 26 Arvind

ALU Control & Immediate Extension

Inst<31:26> (Opcode)

Decode Map

Inst<5:0> (Func)

ALUop

0?

+

OpSel ( Func, Op, +, 0? )

ExtSel ( sExt16, uExt16, High16)

September 26, 2005

6.823 L5- 27 Arvind

Hardwired Control TableOpcode ExtSel BSrc OpSel MemW RegW WBSrc RegDst PCSrc

ALU * Reg Func no yes ALU rd pc+4 ALUi sExt16 Imm Op no yes ALU rt pc+4 ALUiu uExt16 Imm Op no yes ALU rt pc+4 LW sExt16 Imm + no yes Mem rt pc+4 SW sExt16 Imm + yes no * * pc+4

BEQZz=0 sExt16 * 0? no no * * br BEQZz=1 sExt16 * 0? no no * * pc+4 J * * * no no * * jabs JAL * * * no yes PC R31 jabs JR * * * no no * * rind JALR * * * no yes PC R31 rind

BSrc = Reg / Imm WBSrc = ALU / Mem / PC RegDst = rt / rd / R31 PCSrc = pc+4 / br / rind / jabs

September 26, 2005

6.823 L5- 28 Arvind

Pipelined MIPS To pipeline MIPS:

• First build MIPS without pipelining with CPI=1

• Next, add pipeline registers to reduce cycle time while maintaining CPI=1

September 26, 2005

6.823 L5- 29 Arvind

Pipelined Datapath

0x4

Add

addrPC

we rs1 rs2

rd1 we ws addrrdata IR ALUwd rd2

rdata GPRs Data Inst. MemoryImmMemory wdataExt

writefetch decode & Reg-fetch execute memory -backphase phase phase phase phase

Clock period can be reduced by dividing the execution of an instruction into multiple cycles

tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)

However, CPI will increase unless instructions are pipelined September 26, 2005

6.823 L5- 30 Arvind

An Ideal Pipeline

stage 1

stage 2

stage 3

stage 4

• All objects go through the same stages

• No sharing of resources between any two stages

• Propagation delay through all pipeline stages is equal

• The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?

September 26, 2005

6.823 L5- 31

How to divide the datapath Arvind

into stages

Suppose memory is significantly slower than other stages. In particular, suppose

= 10 unitstIM

= 10 unitstDM

= 5 unitstALU = 1 unittRF

= 1 unittRW

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance

September 26, 2005

tC > max {tIM, tRF, tALU, tDM, tRW} = tDMtC > max {tIM, tRF+tALU, tDM, tRW} = tDM

6.823 L5- 32 Arvind

Alternative Pipelining

write -backfetch

phase execute phase

decode & Reg-fetch phase

memory phase

addr

wdata Memory

we ALU

Imm Ext

0x4

Add

addr

Inst. Memory

rd1

GPRs

rs1 rs2

ws wd rd2

we

IRPC

rdata Data

rdata

phase

tC > max {tIM, tRF+tALU, tDM+tRW} = tDM+ tRW

⇒ increase the critical path by 10%

Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase

September 26, 2005

6.823 L5- 33 Arvind

Maximum Speedup by Pipelining

Assumptions Unpipelined Pipelined Speedup tC tC

tt

1. tIM = tDM = 10, ALU = 5, RF = tRW= 1

4-stage pipeline 27 10 2.7

2. tIM = tDM = tALU = tRF = tRW = 5 4-stage pipeline 25 10 2.5

3. tIM = tDM = tALU = tRF = tRW = 5 5-stage pipeline 25 5 5.0

It is possible to achieve higher speedup with more stages in the pipeline.

September 26, 2005

34

Thank you !

Documents

Single-Cycle Processors: Datapath & Control