CS141-L4-1, Tarun Soni, Summer ‘03

Multi Cycle CPU

Previously: built a Single Cycle CPU.

Today: Exceptions; Multi-cycle CPU; Microprogramming


Mid-term Review Discussion Session

Peterson Hall 104, Tue: 2-3 pm, Tue: 3-4 pm



The Story so far:

• Instruction Set Architectures
• Performance issues
• 2s complement, Addition, Subtraction
• Multiplication, Division, Floating Point numbers
• ALUs
• Single Cycle CPU
• Exceptions
• Multicycle CPU: datapath; control
• Microprogramming


Alternative Architectures

• Design alternative:
– provide more powerful operations
– goal is to reduce number of instructions executed
– danger is a slower cycle time and/or a higher CPI
• Sometimes referred to as “RISC vs. CISC”
– virtually all new instruction sets since 1982 have been RISC
– VAX: minimize code size, make assembly language easy; instructions from 1 to 54 bytes long!
• We’ll look at Pentium, UltraSparc and JVM


Pentium


Java VM

• Most instr one byte
– ADD
– POP
• One byte arg
– ILOAD IND8
– BIPUSH CON8
• Two byte arg
– SIPUSH CON16
– IF_ICMPEQ OFFSET16
• Type = int, signed int etc.


UltraSparc


Exceptions

or

Oops!


Exceptions

• There are two sources of non-sequential control flow in a processor
– explicit branch and jump instructions
– exceptions
• Branches are synchronous and deterministic
• Exceptions are typically asynchronous and non-deterministic
• Guess which is more difficult to handle?

In this course we define:
• exceptions as any unexpected change in control flow
• interrupts as any externally-caused exception
• The literature is not consistent about these terms

Examples:

arithmetic overflow

divide by zero

I/O device signals completion to CPU

user program invokes the OS

memory parity error

illegal instruction

timer signal


Exceptions

• The machine we’ve been designing in class can generate two types of exceptions.

– arithmetic overflow

– illegal instruction

• On an exception, we need to

– save the PC (invisible to user code)

– record the nature of the exception/interrupt

– transfer control to OS

[Figure: a user program hits an exception, control transfers to the System Exception Handler, and the handler returns from the exception to the user program]


Exceptions

• MIPS architecture defines the instruction as having no effect if the instruction causes an exception.

• When we get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state.

• This aspect of handling exceptions becomes complex and potentially limits performance => why it is hard

• Interrupts
– caused by external events
– asynchronous to program execution
– may be handled between instructions
– simply suspend and resume user program
• Traps/Exceptions
– caused by internal events
   • exceptional conditions (overflow)
   • errors (parity)
   • faults (non-resident page)
– synchronous to program execution
– condition must be remedied by the handler
– instruction may be retried or simulated and program continued, or program may be aborted


Exceptions

Addressing the Exception Handler

• Traditional Approach: Interrupt Vector

– PC <- MEM[ IV_base + cause || 00]

– 370, 68000, Vax, 80x86, . . .

• RISC Handler Table

– PC <– IT_base + cause || 0000

– saves state and jumps

– Sparc, PA, M88K, . . .

• MIPS Approach: fixed entry

– PC <– EXC_addr

– Actually very small table

• RESET entry

• TLB

• other

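[Figure: interrupt vector, where iv_base + cause selects a table entry that points to the handler code; handler table, where iv_base + cause selects the handler entry code itself]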
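The three addressing schemes above differ only in how the next PC is formed from the cause code. A minimal sketch, assuming `cause || 00` means the cause code with two zero bits appended (multiply by 4) and `cause || 0000` four zero bits (multiply by 16); the function names, base addresses, and table contents are hypothetical:

```python
# Sketch of the three handler-addressing schemes from the slide.
# All names and sample addresses are illustrative assumptions.

def vectored_pc(mem, iv_base, cause):
    """Interrupt vector: PC <- MEM[IV_base + cause || 00]."""
    return mem[iv_base + (cause << 2)]   # table holds handler addresses

def handler_table_pc(it_base, cause):
    """RISC handler table: PC <- IT_base + cause || 0000."""
    return it_base + (cause << 4)        # jump straight into entry code

def fixed_entry_pc(exc_addr):
    """MIPS approach: PC <- EXC_addr; the handler inspects Cause itself."""
    return exc_addr

# Hypothetical vector table with two entries of 4 bytes each.
vector_table = {0x100: 0x4000, 0x104: 0x4800}
```

With the vector, the hardware performs an extra memory read to find the handler; with the handler table, each cause gets a small fixed-size slot of code; with a fixed entry, software does the dispatch.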


Exceptions

Saving State

• Push it onto the stack

– Vax, 68k, 80x86

• Save it in special registers

– MIPS EPC, BadVaddr, Status, Cause

• Shadow Registers

– M88k

– Save state in a shadow of the internal pipeline registers

Significant component of “interrupt response time”


Exceptions

• For our MIPS-subset architecture, we will add two registers:
– EPC: a 32-bit register to hold the user’s PC
– Cause: a register to record the cause of the exception
   • we’ll assume undefined inst = 0, overflow = 1
• We will also add three control signals:
– EPCWrite (will need to be able to subtract 4 from PC)
– CauseWrite
– IntCause
• We will extend the PCSource multiplexor to be able to latch the interrupt handler address into the PC.
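The effect of the two new registers and three new signals can be sketched as a single state update. This is an illustration under assumptions, not the lecture's datapath: the dictionary representation and the handler address are hypothetical, and the PC is assumed to have already been incremented by the fetch, which is why 4 is subtracted.

```python
# Cause encodings assumed on the slide.
UNDEFINED_INST, OVERFLOW = 0, 1
# Hypothetical handler address; the slide does not give one.
HANDLER_ADDR = 0xC0000000

def take_exception(state, int_cause):
    """Return the machine state after exception entry.

    state["PC"] is assumed to hold PC + 4 of the faulting instruction.
    """
    new = dict(state)
    new["Cause"] = int_cause                     # CauseWrite with IntCause
    new["EPC"] = (state["PC"] - 4) & 0xFFFFFFFF  # EPCWrite through sub4
    new["PC"] = HANDLER_ADDR                     # PCSource selects handler
    return new
```

For example, an overflow detected while `PC` holds `0x00400028` leaves `EPC = 0x00400024`, so the handler can find the instruction that caused it.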


[Figure: single-cycle datapath (PC, Inst Memory, register file, Extender, ALU, Data Memory, nPC_sel logic) extended with the EPC and Cause registers, the CauseWrite, IntCause, EPCWrite and PCWrite signals, a sub4 unit feeding EPC, and a PCSource mux input for InterruptHandlerAddress]
Exceptions


Exceptions: Creating a “Control line”

[Figure: Control block taking Op (Instruction<31:26>) and Fun from the instruction and driving the DATA PATH signals nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg, with Equal fed back, plus new Exception Signals]

Regs:
– EPC
– Cause

control signals:
– EPCWrite (subtract 4 from PC)
– CauseWrite
– IntCause


Exceptions: Creating the data path

[Figure: abstract datapath with PC, Next Address logic, Ideal Instruction Memory, 32 32-bit Registers (Rs, Rt, Rd, Imm fields), ALU, and Ideal Data Memory]

Regs:
– EPC
– Cause

control signals:
– EPCWrite (subtract 4 from PC)
– CauseWrite
– IntCause

Extend PCSource MUX to include jump address from int-table


CPU

Multi Cycle CPU


CPU

The Big Picture: Where are We Now?

• The Five Classic Components of a Computer

• Datapath Design, then Control Design

Control

Datapath

Memory

Processor

Input

Output


Recap: Processor Design is a Process

• Bottom-up
– assemble components in target technology to establish critical timing
• Top-down
– specify component behavior from high-level requirements
• Iterative refinement
– establish partial solution, expand and improve

[Figure: Instruction Set Architecture => processor (datapath, control) => Reg. File, Mux, ALU, Reg, Mem, Decoder, Sequencer => Cells, Gates]


CPU: The single cycle

Instruction Fetch → Instruction Decode → Operand Fetch → Execute → Result Store → Next Instruction

° Design hardware for each of these steps!!!

[Figure: Fetch, Decode, Fetch, Execute, Store, Next all inside one clock period: Execute an entire instruction]


CPU: Clocking

[Figure: clock waveform (Clk) with Setup and Hold windows around each clock edge; signal values are Don’t Care outside those windows]

• All storage elements are clocked by the same clock edge


CPU: Main Control

PLA Implementation of the Main Control

[Figure: PLA whose AND plane decodes op<5>..op<0> into the product terms R-type, ori, lw, sw, beq, jump, and whose OR plane drives RegWrite, ALUSrc, MemtoReg, MemWrite, Branch, Jump, RegDst, ExtOp, ALUop<2>, ALUop<1>, ALUop<0>]


CPU: Main Control

• In our single-cycle processor, each instruction is realized by exactly one control command or “microinstruction”
– in general, the controller is a finite state machine
– microinstruction can also control sequencing (see later)

[Figure: OPcode and Conditions feed the Control Logic / Store (PLA, ROM); the resulting microinstruction drives the Control Points of the Datapath; Instruction Decode sits between the instruction and the control logic]


CPU: Abstract View of a single cycle processor

• looks like a FSM with PC as state

[Figure: PC → Next PC → Instruction Fetch → Register Fetch → ALU → Mem Access (Data Mem) → Reg. Wrt (Result Store). Main Control and ALU control take op and fun and drive ALUctr, RegDst, ALUSrc, ExtOp, MemWr, MemRd, RegWr, nPC_sel; Equal is returned from the datapath]


CPU: Why is a CPI=1 processor bad?

• Long Cycle Time
• All instructions take as much time as the slowest
• Real memory is not so nice as our idealized memory
– cannot always get the job done in one (short) cycle

[Figure: delay path per instruction class]
Arithmetic & Logical: PC → Inst Memory → Reg File → mux → ALU → mux → setup
Load (Critical Path): PC → Inst Memory → Reg File → mux → ALU → Data Mem → mux → setup
Store: PC → Inst Memory → Reg File → mux → ALU → Data Mem
Branch: PC → Inst Memory → Reg File → cmp → mux


          I cache   Decode, R-Read   ALU   PC update   D cache   R-Write   Total
R-type       1            1          .9       -           -        .8       3.7
Load         1            1          .9       -           1        .8       4.7
Store        1            1          .9       -           1        -        3.9
beq          1            1          .9       .1          -        -        3.0

• Load needs 5 cycles
• Store and R-type need 4
• beq needs 3

Goal: balance amount of work done each cycle.
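The table can be turned into a rough comparison of the two designs. The sketch below is an illustration only: the delays and cycle counts come from the table, but the instruction mix `MIX` is a hypothetical assumption, and register overhead in the multicycle design is ignored.

```python
# Per-unit delays from the table, in table order:
# I cache, Decode/R-Read, ALU, PC update, D cache, R-Write.
DELAY = {
    "R-type": [1, 1, 0.9, 0,   0, 0.8],
    "Load":   [1, 1, 0.9, 0,   1, 0.8],
    "Store":  [1, 1, 0.9, 0,   1, 0],
    "beq":    [1, 1, 0.9, 0.1, 0, 0],
}
# Cycle counts stated on the slide.
CYCLES = {"Load": 5, "Store": 4, "R-type": 4, "beq": 3}
# Hypothetical instruction mix (fractions sum to 1); not from the lecture.
MIX = {"R-type": 0.5, "Load": 0.2, "Store": 0.1, "beq": 0.2}

# Single cycle: the clock must fit the slowest instruction (the load).
single_cycle_clock = round(max(sum(d) for d in DELAY.values()), 1)

# Multicycle: the clock only has to fit the slowest single step.
multi_cycle_clock = max(max(d) for d in DELAY.values())
multi_cycle_cpi = round(sum(MIX[i] * CYCLES[i] for i in MIX), 2)

single_time = single_cycle_clock * 1               # CPI = 1
multi_time = multi_cycle_clock * multi_cycle_cpi   # time per instruction
```

Under this assumed mix the multicycle machine averages 4.0 time units per instruction against 4.7 for the single-cycle machine; a mix with more loads narrows the gap.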



CPU: Reducing Cycle Time

• Cut combinational dependency graph and insert register / latch
• Do same work in two fast cycles, rather than one slow one

[Figure: storage element → Acyclic Combinational Logic → storage element becomes storage element → Acyclic Combinational Logic (A) → storage element → Acyclic Combinational Logic (B) → storage element]


CPU: Building blocks

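• Adder: inputs A and B (32 bits each) and CarryIn; outputs Sum (32 bits) and Carry
• MUX: inputs A and B (32 bits each) and Select; output Y (32 bits)
• ALU: inputs A and B (32 bits each) and OP; output Result (32 bits)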


CPU: Building blocks

[Figure: two 32-bit adders chained through Carry/CarryIn. The first adds A[31..0] and B[31..0] to give Sum[31..0]; its Carry feeds the CarryIn of the second, which adds A[63..32] and B[63..32] to give Sum[63..32]]

• Building a 64-bit adder from 2x32-bit adders

• Speed of addition?

• For one ADD?

• For consecutive ADDS?
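The construction can be sketched in software to make the carry dependency explicit. This is a behavioural model under assumptions (function names are hypothetical), not a timing model: the upper half cannot start until the lower half has produced its carry, which is what the "speed of addition" questions are probing.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b, carry_in=0):
    """Model of one 32-bit adder: returns (sum, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def add64(a, b):
    """64-bit add built from two 32-bit adders chained by the carry."""
    low, carry = add32(a & MASK32, b & MASK32, 0)
    # The upper adder needs the lower adder's carry before it can finish.
    high, carry_out = add32(a >> 32, b >> 32, carry)
    return (high << 32) | low, carry_out
```

For a single ADD the two halves run back to back. For a stream of consecutive ADDs, a register between the halves would let the lower half of one add overlap with the upper half of the previous one.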


Multicycle CPU: Individual operations

• Next address logic
– PC <= branch ? PC + offset : PC + 4
• Instruction Fetch
– InstructionReg <= Mem[PC]
• Register Access
– A <= R[rs]
• ALU operation
– R <= A + B

[Figure: PC → Next PC → Instruction Fetch → Operand Fetch (Reg. File) → Exec → Mem Access (Data Mem) → Result Store, with Control driving ALUctr, RegDst, ALUSrc, ExtOp, MemWr, MemRd, RegWr, nPC_sel]


Multicycle CPU: Partitioning Time

• Five execution steps (some instructions use fewer)
– IF: Instruction Fetch
– ID: Instruction Decode (& register fetch & add PC+immed)
– EX: Execute
– Mem: Memory access
– WB: Write-Back into registers

          I cache   Decode, R-Read   ALU   PC update   D cache   R-Write   Total
R-type       1            1          .9       -           -        .8       3.7
Load         1            1          .9       -           1        .8       4.7
Store        1            1          .9       -           1        -        3.9
beq          1            1          .9       .1          -        -        3.0

Multicycle CPU: Steps

[Figure: datapath divided into the IF, ID, Ex, Mem, WB steps. Note: Reuse of ALU]


Multicycle CPU

Partitioning the CPI=1 Datapath

• Add registers between smallest steps

[Figure: PC → Next PC → Instruction Fetch → Operand Fetch (Reg. File) → Exec → Mem Access (Data Mem) → Result Store, with a register inserted between each pair of steps; control signals ALUctr, RegDst, ALUSrc, ExtOp, MemWr, MemRd, RegWr, nPC_sel]


Multicycle CPU

[Figure: timing comparison]
Single Cycle Implementation: Load occupies Cycle 1; Store occupies Cycle 2, with part of the cycle wasted.
Multiple Cycle Implementation: Load takes Ifetch, Reg, Exec, Mem, Wr (Cycles 1-5); Store takes Ifetch, Reg, Exec, Mem (Cycles 6-9); the following R-type begins its Ifetch in Cycle 10.


Multicycle CPU: Instruction Types

Instruction Fetch (all): IR = Mem[PC]; PC = PC + 4
Instruction Decode / register fetch (all): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUout = PC + (sign-extend(IR[15-0]) << 2)
Execution, address computation, branch completion:
– R-type: ALUout = A op B
– Memory: ALUout = A + sign-extend(IR[15-0])
– Branch: if (A==B) then PC=ALUout
Memory access or R-type completion:
– R-type: Reg[IR[15-11]] = ALUout
– Memory: memory-data = Mem[ALUout] or Mem[ALUout] = B
Write-back:
– Memory: Reg[IR[20-16]] = memory-data


Multicycle CPU: Sharing Hardware

• Example: memory is used twice, at different times

– Ave mem access per inst = 1 + Flw + Fsw ~ 1.3

– if CPI is 4.8, imem utilization = 1/4.8, dmem =0.3/4.8

• We could reduce HW without hurting performance

– extra control

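[Figure: register-transfer flow for each instruction]
All instructions: IR <– Mem[PC]; then A <– R[rs]; B <– R[rt]
R-type: S <– A + B; R[rd] <– S; PC <– PC+4
LW: S <– A + SX; M <– Mem[S]; R[rt] <– M; PC <– PC+4
ORi: S <– A or ZX; R[rt] <– S; PC <– PC+4
SW: S <– A + SX; Mem[S] <– B; PC <– PC+4
BEQ: PC <– PC+4 (not equal) or PC <– PC+SX (equal)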


Multicycle CPU: Sharing Functional Units

[Figure: shared-unit datapath with PC, one Memory (instruction or data), Instruction register, Memory data register, Registers, A, B, ALU, ALUOut]

Single ALU, common data and instruction memory datapath

Reuse:
• ALU
• Memory

Need more:
• Muxing
• Control

Instruction fetch (all): IR = Memory[PC]; PC = PC + 4
Instruction decode/register fetch (all): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Execution, address computation, branch/jump completion:
– R-type: ALUOut = A op B
– Memory reference: ALUOut = A + sign-extend(IR[15-0])
– Branch: if (A == B) then PC = ALUOut
– Jump: PC = PC[31-28] || (IR[25-0] << 2)
Memory access or R-type completion:
– R-type: Reg[IR[15-11]] = ALUOut
– Load: MDR = Memory[ALUOut]
– Store: Memory[ALUOut] = B
Memory read completion:
– Load: Reg[IR[20-16]] = MDR


Multicycle CPU: Adding State Elements

Since we reuse logic (e.g. ALU), we need to store results between states

Need extra registers when:
– signal is computed in one clock cycle and used in another, AND
– the inputs to the combinational circuit can change before the signal is written into a state element.


IF ID Ex Mem WB

Multicycle CPU: Adding State Elements


Multicycle CPU: The Full Multi-Cycle Implementation


Cycle 1: Instruction Fetch

Datapath: IR = Memory[PC], PC = PC + 4 (may be revised later)
Control: IorD=0, MemRead=1, MemWr=0, IRwrite=1, ALUsrcA=0, etc


Cycle 2: Instruction Decode

A = Register[IR[25-21]]
B = Register[IR[20-16]]
ALUout = PC + (sign-extend (IR[15-0]) << 2)


Cycle 2: Instruction Decode & RegFetch

A = Reg[IR[25-21]]
B = Reg[IR[20-16]]
ALUout = PC + (sign-extend (IR[15-0]) << 2)

We compute target address even though we don’t know if it will be used
– Operation may not be branch
– Even if it is, branch may not be taken

Why? Everything up to this point must be instruction-independent, because we haven’t decoded the instruction. The ALU, the (incremented) PC, and the immed field are now all available.
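The decode-cycle transfers can be sketched directly from the bit ranges given above. This is a minimal illustration, assuming `pc` already holds the incremented PC and that the register file is a plain list; the function names are hypothetical.

```python
def sign_extend16(value):
    """Sign-extend the low 16 bits of value to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def decode_step(ir, pc, reg):
    """Cycle 2: read both registers and precompute the branch target."""
    rs = (ir >> 21) & 0x1F          # IR[25-21]
    rt = (ir >> 16) & 0x1F          # IR[20-16]
    a = reg[rs]
    b = reg[rt]
    # Done for every instruction, whether or not it turns out to be a branch.
    alu_out = (pc + (sign_extend16(ir) << 2)) & 0xFFFFFFFF
    return a, b, alu_out
```

Nothing here depends on the opcode, which is why the work can be done before the instruction has been decoded.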


Cycle 3 for beq: EXecute

• In cycle 1, PC was incremented by 4
• In cycle 2, ALUout was set to branch target
• This cycle, we conditionally reset PC: if (A==B) PC=ALUout

[Figure: datapath with A and B compared by the ALU and ALUout routed to the PC]


Cycle 3: R-type Instruction

• Cycle 3 (EXecute)
ALUout = A op B
• Cycle 4 (WriteBack)
Reg[IR[15-11]] = ALUout

R-type instruction is finished

[Figure: datapath during Cycle 3, ALUout = A op B, with A and B feeding the ALU]
[Figure: datapath during Cycle 4, Reg[IR[15-11]] = ALUout, with ALUout written to the register file]



Multicycle CPU: The datapath

[Figure: PC → Next PC → Instruction Fetch → IR → Operand Fetch (Reg File) → A, B → Ext / ALU → R → Mem Access (Data Mem) → M → Result Store (Reg. File); control signals ALUctr, RegDst, ALUSrc, ExtOp, MemWr, MemRd, MemToReg, RegWr, nPC_sel; Equal returned from the ALU]

Extra Registers:
• IR
• A, B
• R (sometimes called S or ALUout)
• M



• Logical Register Transfer

inst    Logical Register Transfers
ADDU    R[rd] <– R[rs] + R[rt]; PC <– PC + 4

• Physical Register Transfers

inst    Physical Register Transfers
        IR <– MEM[pc]
ADDU    A <– R[rs]; B <– R[rt]
        S <– A + B
        R[rd] <– S; PC <– PC + 4

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File; Equal returned from Exec]






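inst    Logical Register Transfers
ORI     R[rt] <– R[rs] OR zx(Im16); PC <– PC + 4

inst    Physical Register Transfers
        IR <– MEM[pc]
ORI     A <– R[rs]; B <– R[rt]
        S <– A or ZeroExt(Im16)
        R[rt] <– S; PC <– PC + 4

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File; Equal returned from Exec]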






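inst    Logical Register Transfers
LW      R[rt] <– MEM(R[rs] + sx(Im16)); PC <– PC + 4

inst    Physical Register Transfers
        IR <– MEM[pc]
LW      A <– R[rs]; B <– R[rt]
        S <– A + SignEx(Im16)
        M <– MEM[S]
        R[rt] <– M; PC <– PC + 4

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File; Equal returned from Exec]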






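inst    Logical Register Transfers
SW      MEM(R[rs] + sx(Im16)) <– R[rt]; PC <– PC + 4

inst    Physical Register Transfers
        IR <– MEM[pc]
SW      A <– R[rs]; B <– R[rt]
        S <– A + SignEx(Im16)
        MEM[S] <– B; PC <– PC + 4

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File; Equal returned from Exec]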






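inst        Logical Register Transfers
BEQ         if R[rs] == R[rt]
            then PC <= PC + sx(Im16) || 00
            else PC <= PC + 4

inst        Physical Register Transfers
            IR <– MEM[pc]
BEQ & ~Eq   PC <– PC + 4

            IR <– MEM[pc]
BEQ & Eq    PC <– PC + sx(Im16) || 00

[Figure: multicycle datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access (Data Mem), M, Reg. File; Equal returned from Exec]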


Multicycle CPU: Summary

Instruction fetch (all): IR = Memory[PC]; PC = PC + 4
Instruction decode/register fetch (all): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Execution, address computation, branch/jump completion:
– R-type: ALUOut = A op B
– Memory reference: ALUOut = A + sign-extend(IR[15-0])
– Branch: if (A == B) then PC = ALUOut
– Jump: PC = PC[31-28] || (IR[25-0] << 2)
Memory access or R-type completion:
– R-type: Reg[IR[15-11]] = ALUOut
– Load: MDR = Memory[ALUOut]
– Store: Memory[ALUOut] = B
Memory read completion:
– Load: Reg[IR[20-16]] = MDR


Multicycle CPU: Mid-term alert !!

• How many cycles will it take to execute this code?

        lw  $t2, 0($t3)
        lw  $t3, 4($t3)
        beq $t2, $t3, Label    # assume not taken
        add $t5, $t2, $t3
        sw  $t5, 8($t3)
Label:  ...

• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?
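One way to check an answer is to lay the instructions out back to back, using the per-class step counts from the summary table (load 5, store 4, R-type 4, branch 3). The sketch below is an illustration under that assumption; the step labels and function name are hypothetical shorthand.

```python
# Steps each instruction class goes through in the multicycle design.
STEPS = {
    "lw":  ["IF", "ID", "EX", "MEM", "WB"],
    "sw":  ["IF", "ID", "EX", "MEM"],
    "add": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}
PROGRAM = ["lw", "lw", "beq", "add", "sw"]   # branch assumed not taken

def schedule(program):
    """Return (cycle, instruction index, opcode, step), one entry per cycle."""
    timeline = []
    cycle = 0
    for index, op in enumerate(program):
        for step in STEPS[op]:
            cycle += 1
            timeline.append((cycle, index, op, step))
    return timeline

timeline = schedule(PROGRAM)
total_cycles = timeline[-1][0]
eighth_cycle = timeline[7]
add_ex_cycle = next(c for c, _, op, step in timeline
                    if op == "add" and step == "EX")
```

Inspecting `total_cycles`, `eighth_cycle`, and `add_ex_cycle` answers the three questions for this non-pipelined schedule.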


Multicycle CPU: Sharing Hardware

“Princeton” Organization

• Single memory for instruction and data access
– memory utilization -> 1.3/4.8
• In this case our state diagram does not change
– several additional control signals
– must ensure each bus is only driven by one source on each cycle

[Figure: bus-based datapath with Reg File, A, B, A-Bus, B-Bus, W-Bus, IR, S, PC, nextPC, ZX, SX, and a single Mem]


Multicycle CPU: Control Line Timing

[Figure: multicycle datapath with PC, Memory, Instruction register, Memory data register, Registers, Sign extend, Shift left 2, A, B, ALU, ALU control, ALUOut and the control signals IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, MemtoReg, ALUSrcA, ALUSrcB, ALUOp]

[Figure: control line timing over Cycles 1-10. Load: Ifetch, Reg, Exec, Mem, Wr; Store: Ifetch, Reg, Exec, Mem; R-type: Ifetch. IRWrite is asserted only during each Ifetch cycle]


Review: Finite State Machines

• Finite state machines:
– a set of states and
– next state function (determined by current state and the input)
– output function (determined by current state and possibly input)
– We’ll use a Moore machine (output based only on current state)

[Figure: Inputs and Current state feed the Next-state function, which produces the Next state latched on Clock; the Output function produces the Outputs]


Multicycle CPU: Control

[Figure: Control logic block. Inputs: Op5..Op0 from the instruction register opcode field and S3..S0 from the State register. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next state NS3..NS0]

if (State == Instruction Fetch)
{
    IRWrite = 1;
    // All other signals are 0;
    State = Operand Fetch;
}
if (State == Execute && InstructionOpCode == BEQ)
{
    // Do your thing..
}

ControlOutput = f(State, OpCode)
NextState = f(State, OpCode)
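The pseudocode above can be fleshed out as a small table-driven Moore machine. This is a sketch under assumptions: it uses the ten-state numbering (0 to 9) of the state diagram that follows, only a few of the output signals are shown, and the opcode strings and function names are hypothetical.

```python
# Moore outputs for a few states (signals not listed are deasserted).
OUTPUTS = {
    2: {"ALUSrcA": 1, "ALUSrcB": 0b10, "ALUOp": 0b00},   # address computation
    3: {"MemRead": 1, "IorD": 1},                         # load access
    6: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b10},   # R-type execute
    8: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b01,
        "PCWriteCond": 1, "PCSource": 0b01},              # branch completion
}

def next_state(state, op):
    """NextState = f(State, OpCode)."""
    if state == 0:                       # fetch always goes to decode
        return 1
    if state == 1:                       # dispatch on the opcode
        return {"lw": 2, "sw": 2, "r-type": 6, "beq": 8, "j": 9}[op]
    if state == 2:                       # second dispatch: load or store
        return 3 if op == "lw" else 5
    if state == 3:
        return 4
    if state == 6:
        return 7
    return 0                             # every other state returns to fetch

def control_trace(op):
    """States visited while executing one instruction, starting at fetch."""
    state, trace = 0, [0]
    while True:
        state = next_state(state, op)
        if state == 0:
            return trace
        trace.append(state)
```

The length of each trace is the instruction's cycle count: five states for a load, four for a store or R-type, three for a branch or jump.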


Multicycle CPU: Our basic FSM

[Figure: Instruction fetch → Decode and Register Fetch → one of: Memory instructions, R-type instructions, Branch instructions, Jump instruction]



[Figure: FSM in register-transfer form]
“instruction fetch”: IR <= MEM[PC]
“decode / operand fetch”: A <= R[rs]; B <= R[rt]
Execute, Memory, Write-back, by instruction:
– R-type: S <= A fun B; then R[rd] <= S, PC <= PC + 4
– ORi: S <= A or ZX; then R[rt] <= S, PC <= PC + 4
– LW: S <= A + SX; then M <= MEM[S]; then R[rt] <= M, PC <= PC + 4
– SW: S <= A + SX; then MEM[S] <= B, PC <= PC + 4
– BEQ & ~Equal: PC <= PC + 4
– BEQ & Equal: PC <= PC + SX || 00



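[Figure: complete finite state machine for the multicycle control]

State 0 (Start), Instruction fetch: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00
State 1, Instruction decode/register fetch
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access (load): MemRead, IorD = 1
State 4, Write-back step: RegDst = 0, RegWrite, MemtoReg = 1
State 5, Memory access (store): MemWrite, IorD = 1
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion: PCWrite, PCSource = 10

Transitions:
– 0 → 1
– 1 → 2 if (Op = 'LW') or (Op = 'SW'); 1 → 6 if (Op = R-type); 1 → 8 if (Op = 'BEQ'); 1 → 9 if (Op = 'J')
– 2 → 3 if (Op = 'LW'); 2 → 5 if (Op = 'SW')
– 3 → 4; 6 → 7
– 4, 5, 7, 8, 9 → 0

Number of states?

Number of bits for state?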


Multicycle CPU: Control: Assigning States

[Figure: register-transfer FSM with a 4-bit code assigned to each state]

0000 “instruction fetch”: IR <= MEM[PC]
0001 “decode”
R-type: 0100 S <= A fun B; 0101 R[rd] <= S, PC <= PC + 4
ORi: 0110 S <= A or ZX; 0111 R[rt] <= S, PC <= PC + 4
LW: 1000 S <= A + SX; 1001 M <= MEM[S]; 1010 R[rt] <= M, PC <= PC + 4
SW: 1011 S <= A + SX; 1100 MEM[S] <= B, PC <= PC + 4
BEQ & ~Equal: 0011 PC <= PC + 4
BEQ & Equal: 0010 PC <= PC + SX || 00


Multicycle CPU: Detailed control spec.

Column groups: State | Op field | Eq | Next | IR | PC (en, sel) | Ops (A, B) | Exec (Ex, Sr, ALU, S) | Mem (R, W, M) | Write-Back (M-R, Wr, Dst)

[Control values are listed in extracted order; the original column alignment is not recoverable]

State   Op field   Eq   Next   Control values
0000    ??????     ?    0001   1
0001    BEQ        0    0011   1 1
0001    BEQ        1    0010   1 1
0001    R-type     x    0100   1 1
0001    orI        x    0110   1 1
0001    LW         x    1000   1 1
0001    SW         x    1011   1 1
0010    xxxxxx     x    0000   1 1
0011    xxxxxx     x    0000   1 0
R:
0100    xxxxxx     x    0101   0 1 fun 1
0101    xxxxxx     x    0000   1 0 0 1 1
ORi:
0110    xxxxxx     x    0111   0 0 or 1
0111    xxxxxx     x    0000   1 0 0 1 0
LW:
1000    xxxxxx     x    1001   1 0 add 1
1001    xxxxxx     x    1010   1 0 0
1010    xxxxxx     x    0000   1 0 1 1 0
SW:
1011    xxxxxx     x    1100   1 0 add 1
1100    xxxxxx     x    0000   1 0 0 1


Multicycle CPU: Implementation styles

• ROM = "Read Only Memory"
– values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
– if the address is m bits, we can address 2^m entries in the ROM.
– our outputs are the bits of data that the address points to.
– 2^m is the "height", and n is the "width"

Example with m = 3 address bits and n = 4 output bits:

address   data
0 0 0     0 0 1 1
0 0 1     1 1 0 0
0 1 0     1 1 0 0
0 1 1     1 0 0 0
1 0 0     0 0 0 0
1 0 1     0 0 0 1
1 1 0     0 1 1 0
1 1 1     0 1 1 1



• How many inputs are there?
6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)

• How many outputs are there?
16 datapath-control outputs, 4 state bits = 20 outputs

• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)

• Rather wasteful, since for lots of the entries, the outputs are the same, i.e., opcode is often ignored



• Break up the table into two parts
– 4 state bits tell you the 16 outputs, 2^4 x 16 bits of ROM
– 10 bits tell you the 4 next state bits, 2^10 x 4 bits of ROM
– Total: 4.3K bits of ROM

• PLA is much smaller
– can share product terms
– only need entries that produce an active output
– can take into account don't cares

• Size is (#inputs x #product-terms) + (#outputs x #product-terms)

For this example = (10x17)+(20x17) = 510 PLA cells

• PLA cells usually about the size of a ROM cell (slightly bigger)
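The size estimates can be checked by evaluating the formulas directly. A small sketch; the function names are hypothetical, and the 17 product terms are taken from the example as given.

```python
def rom_bits(address_bits, output_bits):
    """A ROM has 2^m entries (height), each n bits wide (width)."""
    return (2 ** address_bits) * output_bits

def pla_cells(inputs, outputs, product_terms):
    """(#inputs x #product-terms) + (#outputs x #product-terms)."""
    return inputs * product_terms + outputs * product_terms

# One ROM: 10 address lines (6 opcode + 4 state), 20 outputs.
single_rom = rom_bits(10, 20)

# Two ROMs: outputs depend on state only; next state on state + opcode.
split_rom = rom_bits(4, 16) + rom_bits(10, 4)

# PLA with 10 inputs, 20 outputs, 17 product terms.
pla = pla_cells(10, 20, 17)
```

Splitting the table cuts the ROM from 20480 bits to 4352 bits, and the PLA needs 510 cells, roughly an eighth of the split ROM.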



PLA Implementation

[Figure: PLA with inputs Op5..Op0 and S3..S0 and outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3, NS2, NS1, NS0]

IRWrite = (!S0 && !S1 && !S2 && !S3)

NS0 = (S[3..0] == 0000) || (S[3..0] == 0110) || (S[3..0] == 0001 && OP[5..0] == 000010) || (…) …


Microprogramming


[Figure: the complete multicycle control FSM again, states 0-9 from Instruction fetch (Start) through Jump completion, each with its asserted control signals]

• Control is the hard part of processor design
° Datapath is fairly regular and well-organized
° Memory is highly regular
° Control is irregular and global

Consider the FSM in case of 100s of instructions !!!

• FSMs get unmanageable quickly as they grow.
– hard to specify
– hard to manipulate
– error prone
– hard to visualize
• The state diagrams that define the controller for an instruction set processor are highly structured
• Use this structure to construct a simple “microsequencer”
• Control reduces to programming this very simple device
– microprogramming


Microprogramming

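[Figure: microsequencer. The Opcode and the State Reg feed the Address Select Logic; an Adder adds 1 to the current state; the Control Logic (PLA or ROM) takes the state as Inputs and produces the Outputs that drive the Multicycle Datapath]

Types of “branching”
• Set state to 0
• Dispatch (state 1)
• Use incremented state number

Common case: State += 1;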
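The address select logic amounts to a three-way choice for the next microprogram counter value. A minimal sketch, assuming a single dispatch ROM indexed by opcode; the ROM contents, opcode strings, and function name are hypothetical.

```python
# Hypothetical dispatch ROM: first state of each instruction's microroutine.
DISPATCH_ROM = {"lw": 2, "sw": 2, "r-type": 6, "beq": 8, "j": 9}

def next_micro_pc(micro_pc, addr_ctl, opcode, dispatch_rom=DISPATCH_ROM):
    """Select the next microinstruction address.

    addr_ctl is the sequencing field of the current microinstruction.
    """
    if addr_ctl == "Fetch":       # set state to 0
        return 0
    if addr_ctl == "Dispatch":    # look the opcode up in the dispatch ROM
        return dispatch_rom[opcode]
    if addr_ctl == "Seq":         # common case: State += 1
        return micro_pc + 1
    raise ValueError("unknown sequencing field: " + addr_ctl)
```

Because most states simply fall through to the next one, the next-state logic shrinks to an adder, a small ROM, and a mux, instead of a full next-state table.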

Microprogramming: A particular strategy for implementing the control unit of a processor by "programming" at the level of register transfer operations

Microarchitecture: Logical structure and functional capabilities of the hardware as seen by the microprogrammer

Historical Note:

IBM 360 Series first to distinguish between architecture & organization. Same instruction set across wide range of implementations, each with different cost/performance


Macro-Micro programming?

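[Figure: Main Memory holds the user program plus data (ADD, SUB, AND, ..., DATA); this can change! The CPU contains the execution unit and the control memory, which holds one microsequence per instruction, e.g., the AND microsequence: Fetch, Calc Operand Addr, Fetch Operand(s), Calculate, Save Answer(s). One macroinstruction is mapped into one microsequence]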


Horizontal Microinstructions

° “Horizontal” Microcode

– control field for each control point in the machine

Microinstruction fields: µseq | µaddr | A-mux | B-mux | bus enables | register enables

[Figure: OPcode and Conditions feed the Control Logic / Store (PLA, ROM); the resulting microinstruction drives the Control Points of the Datapath; Instruction Decode sits between the instruction and the control logic]

Depending on bus organization, many potential control combinations are simply wrong, i.e., they imply transfers that can never happen at the same time.

Idea: encode fields to save ROM space

Example: mem_to_reg and ALU_to_reg should never happen simultaneously; => encode in a single bit which is decoded, rather than two separate bits


Vertical Microinstructions

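° “Vertical” Microcode

– encoded control fields with local decode

Microinstruction fields: src | dst | other control fields | next states | inputs

[Figure: the src and dst fields each pass through a decoder (DEC); a MUX selects among the inputs. Some of these may have nothing to do with registers!]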


Design Microinstruction Sets

1) Start with list of control signals

2) Group signals together that make sense (vs. random): called “fields”

3) Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last)

4) Create a symbolic legend for the microinstruction format, showing name of field values and how they set the control signals

– Use computers to design computers

5) To minimize the width, encode operations that will never be used at the same time


Microinstructions

Start with list of control signals, grouped into fields

Single Bit Control

Signal name   Effect when deasserted         Effect when asserted
ALUSelA       1st ALU operand = PC           1st ALU operand = Reg[rs]
RegWrite      None                           Reg. is written
MemtoReg      Reg. write data input = ALU    Reg. write data input = memory
RegDst        Reg. dest. no. = rt            Reg. dest. no. = rd
TargetWrite   None                           Target reg. = ALU
MemRead       None                           Memory at address is read
MemWrite      None                           Memory at address is written
IorD          Memory address = PC            Memory address = ALU
IRWrite       None                           IR = Memory
PCWrite       None                           PC = PCSource
PCWriteCond   None                           IF ALUzero then PC = PCSource

Multiple Bit Control

Signal name   Value   Effect
ALUOp         00      ALU adds
              01      ALU subtracts
              10      ALU does function code
              11      ALU does logical OR
ALUSelB       000     2nd ALU input = Reg[rt]
              001     2nd ALU input = 4
              010     2nd ALU input = sign extended IR[15-0]
              011     2nd ALU input = sign extended, shift left 2 IR[15-0]
              100     2nd ALU input = zero extended IR[15-0]
PCSource      00      PC = ALU
              01      PC = Target
              10      PC = PC+4[31-28] : IR[25–0] << 2


Microinstructions

Field Name        Width (wide)   Width (narrow)   Control Signals Set
ALU Control            4               2          ALUOp
SRC1                   2               1          ALUSelA
SRC2                   5               3          ALUSelB
ALU Destination        6               4          RegWrite, MemtoReg, RegDst, TargetWr.
Memory                 4               3          MemRead, MemWrite, IorD
Memory Register        1               1          IRWrite
PCWrite Control        5               4          PCWrite, PCWriteCond, PCSource
Sequencing             3               2          AddrCtl
Total width           30              20          bits


Microinstructions: MIPS field name and values

Field Name        Value          Function of Field with Specific Value
ALU               Add            ALU adds
                  Subt.          ALU subtracts
                  Func code      ALU does function code
                  Or             ALU does logical OR
SRC1              PC             1st ALU input = PC
                  rs             1st ALU input = Reg[rs]
SRC2              4              2nd ALU input = 4
                  Extend         2nd ALU input = sign ext. IR[15-0]
                  Extend0        2nd ALU input = zero ext. IR[15-0]
                  Extshft        2nd ALU input = sign ex., sl IR[15-0]
                  rt             2nd ALU input = Reg[rt]
ALU destination   Target         Target = ALUout
                  rd             Reg[rd] = ALUout
Memory            Read PC        Read memory using PC
                  Read ALU       Read memory using ALU output
                  Write ALU      Write memory using ALU output
Memory register   IR             IR = Mem
                  Write rt       Reg[rt] = Mem
                  Read rt        Mem = Reg[rt]
PC write          ALU            PC = ALU output
                  Target-cond.   IF ALU Zero then PC = Target
                  jump addr.     PC = PCSource
Sequencing        Seq            Go to sequential µinstruction
                  Fetch          Go to the first microinstruction
                  Dispatch       Dispatch using ROM.


Microinstructions: The datapath again

[Figure: multicycle datapath with PC, Memory, Instruction register, Memory data register, Registers, Sign extend, Shift left 2, A, B, ALU, ALU control, ALUOut and the control signals IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, MemtoReg, ALUSrcA, ALUSrcB, ALUOp]

Field Name        Value      Function of Field with Specific Value
SRC1              PC         1st ALU input = PC
                  rs         1st ALU input = Reg[rs]
SRC2              4          2nd ALU input = 4
                  Extend     2nd ALU input = sign ext. IR[15-0]
                  Extend0    2nd ALU input = zero ext. IR[15-0]
                  Extshft    2nd ALU input = sign ex., sl IR[15-0]
                  rt         2nd ALU input = Reg[rt]
ALU destination   Target     Target = ALUout
                  rd         Reg[rd] = ALUout


Microinstructions: Pros-Cons

• Specification Advantages:

– Easy to design and write

– Design architecture and microcode in parallel

• Implementation (off-chip ROM) Advantages

– Easy to change since values are in memory

– Can emulate other architectures and instruction sets

– Can make use of internal registers

• Implementation Disadvantages, SLOWER now that:

– Control is implemented on same chip as processor

– ROM is no longer faster than RAM

– No need to go back and make changes


CPU Control: Methodology

                           FSM route                      Microprogram route
Initial representation     Finite state diagram           Microprogram
Sequencing control         Explicit next state function   Microprogram counter + dispatch ROMs
Logic representation       Logic equations                Truth tables
Implementation technique   Programmable logic array       Read only memory


Microprogramming: the last word ?

Summary: Microprogramming one inspiration for RISC

• If simple instruction could execute at very high clock rate…
• If you could even write compilers to produce microinstructions…
• If most programs use simple instructions and addressing modes…
• If microcode is kept in RAM instead of ROM so as to fix bugs …
• If same memory used for control memory could be used instead as cache for “macroinstructions”…
• Then why not skip instruction interpretation by a microprogram and simply compile directly into lowest language of machine? (microprogramming is overkill when ISA matches datapath 1-1)


Exceptions

Supporting exceptions in our FSM

State 0 (Start), Instruction Fetch: MemRead, ALUSelA = 0, IorD = 0, IRWrite, ALUSelB = 01, ALUOp = 00
State 1, Instruction Decode/Register Fetch: ALUSelA = 0, ALUSelB = 11, ALUOp = 00, TargetWrite

From state 1:
• Opcode = LW or SW: Memory Inst FSM
• Opcode = R-type: R-type Inst FSM
• Opcode = BEQ: Branch Inst FSM
• Opcode = JMP: Jump Inst FSM
• Opcode = anything else: to state 10


Exceptions


R-type instructions

From state 1: ALUSelA = 1, ALUSelB = 00, ALUOp = 10
Then: ALUSelA = 1, RegDst = 1, RegWrite, MemtoReg = 0, ALUSelB = 10, ALUOp = 10; to state 0
On overflow: to state 11


Exceptions

[Figure: exception states added to the FSM, with the EPC and Cause registers, the sub4 unit, and InterruptHandlerAddress as an input to the PCSource mux; control signals PCWrite, EPCWrite, CauseWrite, IntCause, PCSource]

state 10 (illegal instruction): IntCause = 0, CauseWrite
state 11 (arithmetic overflow): IntCause = 1, CauseWrite
state 12: ALUSelA = 0, ALUSelB = 01, ALUOp = 01, EPCWrite
state 13: PCWrite, PCSource = 11; to state 0 (fetch)

In order:
• Write Cause into register
• Write PC into EPC
• Load Exception Handler address to PC


Exceptions

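[Figure: register-transfer FSM with exception states; state codes 0010 and 0011 label the BEQ states]

fetch: IR <= MEM[PC]; PC <= PC + 4
decode: dispatch on opcode
– R-type: S <= A fun B; then R[rd] <= S
   on overflow (additional condition from Datapath): EPC <= PC - 4; PC <= exp_addr; cause <= 12 (Ovf)
– ORi: S <= A op ZX; then R[rt] <= S
– LW: S <= A + SX; then M <= MEM[S]; then R[rt] <= M
– SW: S <= A + SX; then MEM[S] <= B
– BEQ: S <= A - B; if Equal, PC <= PC + SX || 00; if ~Equal, no change to PC
– other (undefined instruction): EPC <= PC - 4; PC <= exp_addr; cause <= 10 (RI)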


Summary

• multicycle CPUs make things faster.
• control is harder.
• microprogramming can simplify (conceptually) CPU control generation
• a microprogram is a small program inside the CPU that executes the individual instructions of the “real” program.
• exception-handling is difficult in the CPU, because the interactions between the executing instructions and the interrupt are complex and unpredictable.


Mid-Term Review

• Technology trends: Design for the future
• Instruction Set Architectures: types of ISAs: Addressing modes, length of instruction etc.
• MIPS instruction format, basic classes of instructions
• Registers and load store architectures
• Data types, operands, memory organization/addressing
• Basic MIPS instructions: Arithmetic, logical, data transfer, branching, jumps
• Issues in jump/branching distance and immediate addressing modes
• Stacks and frames
• E.g., swap(), leaf_procedure(), nested_procedure()
• Performance: Relative (Boeing e.g.), Metrics, Benchmarking, SPEC marks
• Execution Time = Instruction Count x Cycles/Instruction x Seconds/Cycle
• Amdahl’s law: Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )
• Arithmetic: 2s complement
• Basic digital logic, 1-bit adder, full adder, 32-bit adder/subtractor
• ALU: adder+mux+special conditions
• Delays in combinational logic, clocking
• Ripple carry vs. Carry look ahead adders


Mid-Term Review

• Multiplication & Division: grade school version
• 3 incrementally better algorithms (data paths)
• Basics of Booth arithmetic
• Floating point representation
• Floating point operations (+,-,*,/)
• Guard, round and sticky bits
• Single cycle CPU
• Building blocks: Register files, memory etc.
• Storage units, clocking methodology
• PC arithmetic
• Instruction fetch
• Datapath on various operations: Load, Store, Branch, R-type, I-type
• Control: basic control signals for the MIPS subset
• Distributed control: Main control + ALU control
• PLA implementation
• Timing diagrams


Mid-Term Review

• Multi-cycle CPU
• Datapath: registers/stages: Ifetch, A, B, Execute, Store etc.
• Various instructions through the datapath
• Control: Sharing functional units
• Finite state machine perspective for control: FSM for MIPS
• Implementation styles: ROM, PLA
• Microprogramming: Horizontal, vertical, relationship to RISC
• Exceptions: change in FSM, internal, external; need to save state.
