CS141-L4-1 Tarun Soni, Summer ‘03
Multi Cycle CPU
Previously: built a Single Cycle CPU.
Today: Exceptions; Multi-cycle CPU; Microprogramming
CS141-L4-2 Tarun Soni, Summer ‘03
Mid-term Review Discussion Session
Peterson Hall 104
Tue: 2-3 pm
Tue: 3-4 pm
CS141-L4-3 Tarun Soni, Summer ‘03
The Story so far:
• Instruction Set Architectures
• Performance issues
• 2s complement, Addition, Subtraction
• Multiplication, Division, Floating Point numbers
• ALUs
• Single Cycle CPU
Today:
• Exceptions
• Multicycle CPU: datapath; control
• Microprogramming
CS141-L4-4 Tarun Soni, Summer ‘03
• Design alternative:
– provide more powerful operations
– goal is to reduce number of instructions executed
– danger is a slower cycle time and/or a higher CPI
• Sometimes referred to as “RISC vs. CISC”
– virtually all new instruction sets since 1982 have been RISC
– VAX: minimize code size, make assembly language easy; instructions from 1 to 54 bytes long!
• We’ll look at Pentium, UltraSparc and JVM
Alternative Architectures
CS141-L4-5 Tarun Soni, Summer ‘03
Pentium
CS141-L4-6 Tarun Soni, Summer ‘03
Java VM
• Most instructions are one byte
– ADD
– POP
• One-byte argument
– ILOAD IND8
– BIPUSH CON8
• Two-byte argument
– SIPUSH CON16
– IF_ICMPEQ OFFSET16
• Type = int, signed int etc.
CS141-L4-7 Tarun Soni, Summer ‘03
UltraSparc
CS141-L4-8 Tarun Soni, Summer ‘03
Exceptions
or
Oops!
CS141-L4-9 Tarun Soni, Summer ‘03
Exceptions
• There are two sources of non-sequential control flow in a processor
– explicit branch and jump instructions
– exceptions
• Branches are synchronous and deterministic
• Exceptions are typically asynchronous and non-deterministic
• Guess which is more difficult to handle?
• Terminology (the literature is not consistent):
– exception: any unexpected change in control flow
– interrupt: any externally caused exception
• Examples to classify:
– arithmetic overflow
– divide by zero
– I/O device signals completion to CPU
– user program invokes the OS
– memory parity error
– illegal instruction
– timer signal
CS141-L4-10 Tarun Soni, Summer ‘03
Exceptions
• The machine we’ve been designing in class can generate two types of exceptions.
– arithmetic overflow
– illegal instruction
• On an exception, we need to
– save the PC (invisible to user code)
– record the nature of the exception/interrupt
– transfer control to OS
[Figure: an exception in the user program transfers control to the system exception handler, which then returns from the exception to the user program]
CS141-L4-11 Tarun Soni, Summer ‘03
Exceptions
• MIPS architecture defines the instruction as having no effect if the instruction causes an exception.
• When we get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state.
• This aspect of handling exceptions becomes complex and potentially limits performance => why it is hard
• Interrupts
– caused by external events
– asynchronous to program execution
– may be handled between instructions
– simply suspend and resume user program
• Traps/Exceptions
– caused by internal events
  • exceptional conditions (overflow)
  • errors (parity)
  • faults (non-resident page)
– synchronous to program execution
– condition must be remedied by the handler
– instruction may be retried or simulated and program continued, or program may be aborted
CS141-L4-12 Tarun Soni, Summer ‘03
Exceptions
Addressing the Exception Handler
• Traditional Approach: Interrupt Vector
– PC <- MEM[ IV_base + cause || 00]
– 370, 68000, Vax, 80x86, . . .
• RISC Handler Table
– PC <– IT_base + cause || 0000
– saves state and jumps
– Sparc, PA, M88K, . . .
• MIPS Approach: fixed entry
– PC <– EXC_addr
– Actually very small table
• RESET entry
• TLB
• other
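[Figure: iv_base plus cause selects a table entry; in the interrupt vector the entry points to the handler code, in the handler table the entry holds the handler entry code itself]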
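The three addressing schemes can be compared with a small Python sketch. The base addresses, the toy memory contents and the function names are made-up values for illustration, not taken from any real machine; `cause || 00` is modeled as a left shift by 2 and `cause || 0000` as a left shift by 4.

```python
# Hypothetical values, chosen only to make the arithmetic visible.
IV_BASE = 0x1000        # base of the interrupt vector (a table of pointers)
IT_BASE = 0x2000        # base of the RISC handler table (a table of code)
EXC_ADDR = 0x80000080   # single fixed entry point (assumed value)

# Toy memory holding two handler pointers, one word (4 bytes) apart.
memory = {IV_BASE + 0: 0x4000, IV_BASE + 4: 0x4400}

def vectored_pc(cause):
    """Interrupt vector: PC <- MEM[IV_base + cause || 00]."""
    return memory[IV_BASE + (cause << 2)]

def handler_table_pc(cause):
    """RISC handler table: PC <- IT_base + cause || 0000."""
    return IT_BASE + (cause << 4)

def fixed_entry_pc(cause):
    """Fixed entry: PC <- EXC_addr; the handler inspects the cause itself."""
    return EXC_ADDR
```

In the vectored scheme the new PC comes from a memory read, in the handler-table scheme it is computed directly (each cause gets 16 bytes of code), and in the fixed-entry scheme it does not depend on the cause at all.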
CS141-L4-13 Tarun Soni, Summer ‘03
Exceptions
Saving State
• Push it onto the stack
– Vax, 68k, 80x86
• Save it in special registers
– MIPS EPC, BadVaddr, Status, Cause
• Shadow Registers
– M88k
– Save state in a shadow of the internal pipeline registers
Significant component of “interrupt response time”
CS141-L4-14 Tarun Soni, Summer ‘03
Exceptions
• For our MIPS-subset architecture, we will add two registers:
– EPC: a 32-bit register to hold the user’s PC
– Cause: a register to record the cause of the exception
  • we’ll assume undefined inst = 0, overflow = 1
• We will also add three control signals:
– EPCWrite (will need to be able to subtract 4 from PC)
– CauseWrite
– IntCause
• We will extend the PCSource multiplexor to be able to latch the interrupt handler address into the PC.
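As a rough illustration of what these registers and signals do together, here is a minimal Python sketch of exception entry. The handler address and the dictionary-based machine state are assumptions made for the example; only the Cause encoding (undefined instruction = 0, overflow = 1) and the PC - 4 adjustment come from the slide.

```python
UNDEFINED_INST = 0          # Cause encoding assumed on the slide
OVERFLOW = 1
HANDLER_ADDR = 0x80000080   # hypothetical interrupt handler address

def take_exception(state, int_cause):
    """Return the machine state after an exception is raised.

    `state` is a dict with the key 'pc'. The PC has already been
    incremented when the exception is detected, so EPC gets PC - 4.
    """
    new_state = dict(state)
    new_state["cause"] = int_cause       # CauseWrite, value chosen by IntCause
    new_state["epc"] = state["pc"] - 4   # EPCWrite through the sub4 unit
    new_state["pc"] = HANDLER_ADDR       # PCSource selects the handler address
    return new_state

after = take_exception({"pc": 0x00400028}, OVERFLOW)
```

Here `after["epc"]` holds the address of the faulting instruction, so the handler can report it or return to it.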
CS141-L4-15 Tarun Soni, Summer ‘03
[Figure: single-cycle datapath with exception support: a Cause register (CauseWrite, IntCause), an EPC register (EPCWrite, fed from the PC through sub4), and a PCSource mux that can select the InterruptHandlerAddress]
Exceptions
CS141-L4-16 Tarun Soni, Summer ‘03
[Figure: the Control block takes Op and Fun from Instruction<31:0> and drives the DATA PATH control lines (nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg) plus the new exception signals; Equal comes back from the datapath]
Exceptions: Creating a “Control line”
Regs:
– EPC
– Cause
Control signals:
– EPCWrite (subtract 4 from PC)
– CauseWrite
– IntCause
CS141-L4-17 Tarun Soni, Summer ‘03
[Figure: datapath building blocks: PC and Next Address logic, Ideal Instruction Memory, 32 32-bit Registers (Rs, Rt, Rd, Imm), ALU, Ideal Data Memory]
Regs:
– EPC
– Cause
Control signals:
– EPCWrite (subtract 4 from PC)
– CauseWrite
– IntCause
Extend PCSource MUX to include jump address from int-table
Exceptions: Creating the data path
CS141-L4-18 Tarun Soni, Summer ‘03
CPU
Multi Cycle CPU
CS141-L4-19 Tarun Soni, Summer ‘03
CPU
The Big Picture: Where are We Now?
• The Five Classic Components of a Computer
• Datapath Design, then Control Design
[Figure: a computer made of Processor (Control and Datapath), Memory, Input and Output]
CS141-L4-20 Tarun Soni, Summer ‘03
Recap: Processor Design is a Process
• Bottom-up
– assemble components in target technology to establish critical timing
• Top-down
– specify component behavior from high-level requirements
• Iterative refinement
– establish partial solution, expand and improve
[Figure: Instruction Set Architecture => processor (datapath, control) => Reg. File, Mux, ALU, Reg, Mem, Decoder, Sequencer => Cells, Gates]
CS141-L4-21 Tarun Soni, Summer ‘03
CPU: The single cycle
Instruction Fetch -> Instruction Decode -> Operand Fetch -> Execute -> Result Store -> Next Instruction
° Design hardware for each of these steps!!!
Execute an entire instruction
[Figure: one long clock cycle covering Fetch, Decode, Fetch, Execute, Store, Next]
CS141-L4-22 Tarun Soni, Summer ‘03
CPU: Clocking
[Figure: Clk waveform with Setup and Hold windows around each clock edge; signals are Don’t Care in between]
• All storage elements are clocked by the same clock edge
CS141-L4-23 Tarun Soni, Summer ‘03
CPU: Main Control
PLA Implementation of the Main Control
[Figure: PLA with inputs op<5>..op<0>, one product term for each of R-type, ori, lw, sw, beq and jump, and outputs RegWrite, ALUSrc, MemtoReg, MemWrite, Branch, Jump, RegDst, ExtOp, ALUop<2>, ALUop<1>, ALUop<0>]
CS141-L4-24 Tarun Soni, Summer ‘03
CPU: Main Control
• In our single-cycle processor, each instruction is realized by exactly one control command or “microinstruction”
– in general, the controller is a finite state machine
– microinstruction can also control sequencing (see later)
[Figure: the opcode and the datapath conditions feed the Control Logic / Store (PLA, ROM); its output microinstruction is decoded to drive the datapath control points]
CS141-L4-25 Tarun Soni, Summer ‘03
CPU: Abstract View of a single cycle processor
• looks like a FSM with PC as state
[Figure: PC and Next PC logic, Instruction Fetch, Register Fetch, Ext, ALU, Mem Access / Data Mem, Reg. Wrt (Result Store); Main Control (op) and ALU control (fun) generate nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, MemWr; Equal is fed back to control]
CS141-L4-26 Tarun Soni, Summer ‘03
CPU: Why is a CPI=1 processor bad?
• Long Cycle Time
• All instructions take as much time as the slowest
• Real memory is not so nice as our idealized memory
– cannot always get the job done in one (short) cycle
Paths through the datapath by instruction class:
Arithmetic & Logical: PC, Inst Memory, Reg File, mux, ALU, mux, setup
Load (Critical Path): PC, Inst Memory, Reg File, mux, ALU, Data Mem, mux, setup
Store: PC, Inst Memory, Reg File, mux, ALU, Data Mem
Branch: PC, Inst Memory, Reg File, cmp, mux
CS141-L4-27 Tarun Soni, Summer ‘03
          I cache  Decode, R-Read  ALU  PC update  D cache  R-Write  Total
R-type    1        1               .9   -          -        .8       3.7
Load      1        1               .9   -          1        .8       4.7
Store     1        1               .9   -          1        -        3.9
beq       1        1               .9   .1         -        -        3.0
• Load needs 5 cycles
• Store and R-type need 4
• beq needs 3
Goal: balance amount of work done each cycle.
CPU: Why is a CPI=1 processor bad?
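The totals in the table, and the reason a single-cycle clock is slow, can be checked with a few lines of Python. This is only a sketch of the arithmetic: the delays are copied from the table, and the variable names are invented for the example.

```python
# Delay of each step, copied from the table (None where a step is unused).
# Column order: I cache, Decode/R-Read, ALU, PC update, D cache, R-Write.
delays = {
    "R-type": [1, 1, 0.9, None, None, 0.8],
    "Load":   [1, 1, 0.9, None, 1,    0.8],
    "Store":  [1, 1, 0.9, None, 1,    None],
    "beq":    [1, 1, 0.9, 0.1,  None, None],
}

totals = {name: round(sum(d for d in steps if d is not None), 1)
          for name, steps in delays.items()}

# A single-cycle clock must fit the slowest whole instruction.
single_cycle_period = max(totals.values())

# A multicycle clock only has to fit the slowest individual step.
multicycle_period = max(d for steps in delays.values()
                        for d in steps if d is not None)
```

With these numbers the single-cycle period is set by Load, while the multicycle period is set by the slowest single step, which is why balancing the work per cycle matters.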
CS141-L4-28 Tarun Soni, Summer ‘03
CPU: Reducing Cycle Time
• Cut combinational dependency graph and insert register / latch
• Do same work in two fast cycles, rather than one slow one
[Figure: storage element -> Acyclic Combinational Logic -> storage element  =>  storage element -> Acyclic Combinational Logic (A) -> storage element -> Acyclic Combinational Logic (B) -> storage element]
CS141-L4-29 Tarun Soni, Summer ‘03
CPU: Building blocks
• Adder
• MUX
• ALU
[Figure: Adder (32-bit inputs A and B, CarryIn; outputs Sum and Carry), MUX (32-bit inputs A and B, Select; output Y), ALU (32-bit inputs A and B, OP; output Result)]
CS141-L4-30 Tarun Soni, Summer ‘03
CPU: Building blocks
[Figure: two 32-bit adders: one adds A[31..0] and B[31..0] to give Sum[31..0], the other adds A[63..32] and B[63..32] to give Sum[63..32]; each has CarryIn and Carry]
• Building a 64-bit adder from 2x32-bit adders
• Speed of addition?
• For one ADD?
• For consecutive ADDS?
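A minimal Python sketch of the construction, assuming the usual wiring in which the carry out of the low adder feeds the carry-in of the high adder. The function names are invented for the example and the operands are assumed to be unsigned and below 2^64.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b, carry_in=0):
    """32-bit adder: returns (sum, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def add64(a, b):
    """64-bit add built from two 32-bit adders chained through the carry."""
    low_sum, carry = add32(a & MASK32, b & MASK32)
    high_sum, carry_out = add32(a >> 32, b >> 32, carry)
    return (high_sum << 32) | low_sum, carry_out
```

The high half cannot finish until the low half has produced its carry, which is one way to think about the speed question for a single ADD. For consecutive ADDs, a register placed between the two halves would let the low half of the next addition start while the high half of the current one completes.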
CS141-L4-31 Tarun Soni, Summer ‘03
Multicycle CPU: Individual operations
• Next address logic
– PC <= branch ? PC + offset : PC + 4
• Instruction Fetch
– InstructionReg <= Mem[PC]
• Register Access
– A <= R[rs]
• ALU operation
– R <= A + B
[Figure: PC and Next PC logic, Instruction Fetch, Operand Fetch, Exec, Mem Access / Data Mem, Reg. File, Result Store; Control generates nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, MemWr]
CS141-L4-32 Tarun Soni, Summer ‘03
• Five execution steps (some instructions use fewer)
– IF: Instruction Fetch
– ID: Instruction Decode (& register fetch & add PC+immed)
– EX: Execute
– Mem: Memory access
– WB: Write-Back into registers
IF ID EX Mem WB
          I cache  Decode, R-Read  ALU  PC update  D cache  R-Write  Total
R-type    1        1               .9   -          -        .8       3.7
Load      1        1               .9   -          1        .8       4.7
Store     1        1               .9   -          1        -        3.9
beq       1        1               .9   .1         -        -        3.0
Multicycle CPU: Partitioning Time
CS141-L4-33 Tarun Soni, Summer ‘03
IF ID Ex Mem WB
Multicycle CPU: Steps
Note: Reuse of ALU
CS141-L4-34 Tarun Soni, Summer ‘03
Multicycle CPU
Partitioning the CPI=1 Datapath
• Add registers between smallest steps
[Figure: the CPI=1 datapath (PC, Next PC, Instruction Fetch, Operand Fetch, Exec, Mem Access / Data Mem, Reg. File, Result Store) with control signals nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, MemWr]
CS141-L4-35 Tarun Soni, Summer ‘03
Multicycle CPU
Single Cycle Implementation:
Cycle 1: Load
Cycle 2: Store, followed by Waste (the cycle is sized for the slowest instruction)
Multiple Cycle Implementation:
Load: Ifetch, Reg, Exec, Mem, Wr (Cycles 1-5)
Store: Ifetch, Reg, Exec, Mem (Cycles 6-9)
R-type: Ifetch (Cycle 10) ...
CS141-L4-36 Tarun Soni, Summer ‘03
Step: action for R-type | Memory | Branch
Instruction Fetch (all): IR = Mem[PC]; PC = PC + 4
Instruction Decode/register fetch (all): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUout = PC + (sign-extend(IR[15-0]) << 2)
Execution, address computation, branch completion:
– R-type: ALUout = A op B
– Memory: ALUout = A + sign-extend(IR[15-0])
– Branch: if (A==B) then PC=ALUout
Memory access or R-type completion:
– R-type: Reg[IR[15-11]] = ALUout
– Memory: memory-data = Mem[ALUout] or Mem[ALUout] = B
Write-back:
– Memory: Reg[IR[20-16]] = memory-data
Multicycle CPU: Instruction Types
CS141-L4-37 Tarun Soni, Summer ‘03
Multicycle CPU: Sharing Hardware
• Example: memory is used twice, at different times
– Ave mem access per inst = 1 + Flw + Fsw ~ 1.3
– if CPI is 4.8, imem utilization = 1/4.8, dmem =0.3/4.8
• We could reduce HW without hurting performance
– extra control
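The utilization figures can be reproduced with a short Python calculation. The split between loads and stores is an assumption for the example (the slide only gives their sum, about 0.3); the CPI of 4.8 is taken from the slide.

```python
# Assumed instruction mix: only the sum f_lw + f_sw ~ 0.3 is given.
f_lw = 0.2   # fraction of instructions that are loads (assumption)
f_sw = 0.1   # fraction of instructions that are stores (assumption)
cpi = 4.8    # cycles per instruction, from the slide

accesses_per_inst = 1 + f_lw + f_sw           # one fetch plus data accesses
imem_utilization = 1 / cpi                    # instruction memory busy fraction
dmem_utilization = (f_lw + f_sw) / cpi        # data memory busy fraction
shared_utilization = accesses_per_inst / cpi  # one memory serving both
```

Even a single shared memory is busy well under half the time with these numbers, which is the argument for reducing hardware at the cost of extra control.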
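[Figure: state diagram of register transfers]
All instructions: IR <– Mem[PC]; then A <– R[rs]; B <– R[rt]
R-type: S <– A + B; then R[rd] <– S; PC <– PC+4
LW: S <– A + SX; M <– Mem[S]; then R[rt] <– M; PC <– PC+4
ORi: S <– A or ZX; then R[rt] <– S; PC <– PC+4
SW: S <– A + SX; then Mem[S] <– B; PC <– PC+4
BEQ: PC <– PC+4 (not equal) or PC <– PC+SX (equal)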
CS141-L4-38 Tarun Soni, Summer ‘03
Multicycle CPU: Sharing Functional Units
[Figure: shared datapath: PC, a single Memory (Address in, Instruction or data out, Data in), Instruction register, Memory data register, Registers, A, B, ALU, ALUOut]
Step name: action for R-type instructions | memory-reference instructions | branches | jumps
Instruction fetch (all): IR = Memory[PC]; PC = PC + 4
Instruction decode/register fetch (all): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Execution, address computation, branch/jump completion:
– R-type: ALUOut = A op B
– memory-reference: ALUOut = A + sign-extend(IR[15-0])
– branches: if (A == B) then PC = ALUOut
– jumps: PC = PC[31-28] || (IR[25-0] << 2)
Memory access or R-type completion:
– R-type: Reg[IR[15-11]] = ALUOut
– Load: MDR = Memory[ALUOut]
– Store: Memory[ALUOut] = B
Memory read completion:
– Load: Reg[IR[20-16]] = MDR
Reuse:
• ALU
• Memory
Need more:
• Muxing
• Control
Single ALU, common data and instruction memory datapath
CS141-L4-39 Tarun Soni, Summer ‘03
Since we reuse logic (e.g. ALU), we need to store results between states
Need extra registers when:
– signal is computed in one clock cycle and used in another, AND
– the inputs to the combinational circuit can change before the signal is written into a state element.
Multicycle CPU: Adding State Elements
CS141-L4-40 Tarun Soni, Summer ‘03
IF ID Ex Mem WB
Multicycle CPU: Adding State Elements
CS141-L4-41 Tarun Soni, Summer ‘03
Multicycle CPU: The Full Multi-Cycle Implementation
CS141-L4-42 Tarun Soni, Summer ‘03
Cycle 1: Instruction Fetch
Datapath: IR = Memory[PC], PC = PC + 4 (may be revised later)
Control: IorD=0, MemRead=1, MemWr=0, IRwrite=1, ALUsrcA=0, etc
CS141-L4-43 Tarun Soni, Summer ‘03
A = Register[IR[25-21]]
B = Register[IR[20-16]]
ALUout = PC + (sign-extend (IR[15-0]) << 2)
Cycle 2: Instruction Decode
CS141-L4-44 Tarun Soni, Summer ‘03
A = Reg[IR[25-21]]
B = Reg[IR[20-16]]
ALUout = PC + (sign-extend (IR[15-0]) << 2)
We compute target address even though we don’t know if it will be used
– Operation may not be branch
– Even if it is, branch may not be taken
Why? Everything up to this point must be instruction-independent, because we haven’t decoded the instruction.
The ALU, the (incremented) PC, and the immed field are now all available
Cycle 2: Instruction Decode & RegFetch
CS141-L4-45 Tarun Soni, Summer ‘03
Cycle 3 for beq: EXecute
• In cycle 1, PC was incremented by 4
• In cycle 2, ALUout was set to branch target
• This cycle, we conditionally reset PC: if (A==B) PC=ALUout
[Figure: datapath with A, B and ALUout highlighted]
CS141-L4-46 Tarun Soni, Summer ‘03
• Cycle 3 (EXecute)
ALUout = A op B
• Cycle 4 (WriteBack)
Reg[IR[15-11]] = ALUout
R-type instruction is finished
Cycle 3: R-type Instruction
CS141-L4-47 Tarun Soni, Summer ‘03
Cycle 3: ALUout = A op B
Cycle 4: Reg[IR[15-11]] = ALUout
[Figure: datapath with A and B highlighted]
Cycle 3: R-type Instruction
CS141-L4-48 Tarun Soni, Summer ‘03
Cycle 3: ALUout = A op B
Cycle 4: Reg[IR[15-11]] = ALUout
[Figure: datapath with A, B and ALUout highlighted]
Cycle 4: R-type Instruction
CS141-L4-49 Tarun Soni, Summer ‘03
Multicycle CPU: The datapath
[Figure: multicycle datapath: PC and Next PC, Instruction Fetch, Operand Fetch (Reg. File), Ext, ALU, Mem Access / Data Mem, Result Store (Reg File), with registers IR, A, B, R and M between stages; control signals nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, MemToReg; Equal fed back]
Extra Registers:
• IR
• A, B
• R (sometimes called S or ALUout)
• M
CS141-L4-50 Tarun Soni, Summer ‘03
Multicycle CPU: The datapath
• Logical Register Transfers
ADDU: R[rd] <– R[rs] + R[rt]; PC <– PC + 4
• Physical Register Transfers
ADDU: IR <– MEM[pc]
      A <– R[rs]; B <– R[rt]
      S <– A + B
      R[rd] <– S; PC <– PC + 4
[Figure: datapath with registers IR, A, B, S and M between Inst. Mem, Reg. File, Exec, Mem Access / Data Mem and the Reg File write port; Next PC logic uses Equal]
CS141-L4-51 Tarun Soni, Summer ‘03
Multicycle CPU: The datapath
• Logical Register Transfers
ORI: R[rt] <– R[rs] OR zx(Im16); PC <– PC + 4
• Physical Register Transfers
ORI: IR <– MEM[pc]
     A <– R[rs]; B <– R[rt]
     S <– ( A or ZeroExt(Im16) )
     R[rt] <– S; PC <– PC + 4
[Figure: datapath with registers IR, A, B, S and M between Inst. Mem, Reg. File, Exec, Mem Access / Data Mem and the Reg File write port; Next PC logic uses Equal]
CS141-L4-52 Tarun Soni, Summer ‘03
Multicycle CPU: The datapath
• Logical Register Transfers
LW: R[rt] <– MEM(R[rs] + sx(Im16)); PC <– PC + 4
• Physical Register Transfers
LW: IR <– MEM[pc]
    A <– R[rs]; B <– R[rt]
    S <– A + SignEx(Im16)
    M <– MEM[S]
    R[rt] <– M; PC <– PC + 4
[Figure: datapath with registers IR, A, B, S and M between Inst. Mem, Reg. File, Exec, Mem Access / Data Mem and the Reg File write port; Next PC logic uses Equal]
CS141-L4-53 Tarun Soni, Summer ‘03
Multicycle CPU: The datapath
• Logical Register Transfers
SW: MEM(R[rs] + sx(Im16)) <– R[rt]; PC <– PC + 4
• Physical Register Transfers
SW: IR <– MEM[pc]
    A <– R[rs]; B <– R[rt]
    S <– A + SignEx(Im16)
    MEM[S] <– B; PC <– PC + 4
[Figure: datapath with registers IR, A, B, S and M between Inst. Mem, Reg. File, Exec, Mem Access / Data Mem and the Reg File write port; Next PC logic uses Equal]
CS141-L4-54 Tarun Soni, Summer ‘03
Multicycle CPU: The datapath
• Logical Register Transfers
BEQ: if R[rs] == R[rt]
       then PC <= PC + sx(Im16) || 00
       else PC <= PC + 4
• Physical Register Transfers
BEQ & ~Eq: IR <– MEM[pc]
           PC <– PC + 4
BEQ & Eq:  IR <– MEM[pc]
           PC <– PC + sx(Im16) || 00
[Figure: datapath with registers IR, A, B, S and M between Inst. Mem, Reg. File, Exec, Mem Access / Data Mem and the Reg File write port; Next PC logic uses Equal]
CS141-L4-55 Tarun Soni, Summer ‘03
Multicycle CPU: Summary
Step name: action for R-type instructions | memory-reference instructions | branches | jumps
Instruction fetch (all): IR = Memory[PC]; PC = PC + 4
Instruction decode/register fetch (all): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Execution, address computation, branch/jump completion:
– R-type: ALUOut = A op B
– memory-reference: ALUOut = A + sign-extend(IR[15-0])
– branches: if (A == B) then PC = ALUOut
– jumps: PC = PC[31-28] || (IR[25-0] << 2)
Memory access or R-type completion:
– R-type: Reg[IR[15-11]] = ALUOut
– Load: MDR = Memory[ALUOut]
– Store: Memory[ALUOut] = B
Memory read completion:
– Load: Reg[IR[20-16]] = MDR
CS141-L4-56 Tarun Soni, Summer ‘03
Multicycle CPU: Mid-term alert !!
• How many cycles will it take to execute this code?
        lw  $t2, 0($t3)
        lw  $t3, 4($t3)
        beq $t2, $t3, Label   # assume not taken
        add $t5, $t2, $t3
        sw  $t5, 8($t3)
Label:  ...
• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?
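One way to work through questions like these is to lay the instructions out cycle by cycle. The Python sketch below does that under the step counts used in this lecture (lw 5, sw 4, R-type 4, beq 3) and assumes the branch is not taken; the step labels and variable names are invented for the example.

```python
# Steps each instruction class goes through in the multicycle design.
STEPS = {
    "lw":  ["IF", "ID", "EX", "MEM", "WB"],
    "sw":  ["IF", "ID", "EX", "MEM"],
    "add": ["IF", "ID", "EX", "WB"],   # R-type completion is the 4th step
    "beq": ["IF", "ID", "EX"],
}

program = ["lw", "lw", "beq", "add", "sw"]   # beq assumed not taken

# timeline[i] describes clock cycle i + 1.
timeline = [(op, step) for op in program for step in STEPS[op]]

total_cycles = len(timeline)
eighth_cycle = timeline[8 - 1]
add_execute_cycle = timeline.index(("add", "EX")) + 1
```

Printing `total_cycles`, `eighth_cycle` and `add_execute_cycle` gives the three answers; changing `program` lets you try other instruction sequences.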
CS141-L4-57 Tarun Soni, Summer ‘03
Multicycle CPU: Sharing Hardware
“Princeton” Organization
• Single memory for instruction and data access
– memory utilization -> 1.3/4.8
• In this case our state diagram does not change
– several additional control signals
– must ensure each bus is only driven by one source on each cycle
[Figure: bus-based datapath: RegFile, A, B, IR, S, PC, nextPC, ZX, SX and Mem connected by the A-Bus, B-Bus and W-Bus]
CS141-L4-58 Tarun Soni, Summer ‘03
Multicycle CPU: Control Line Timing
[Figure: multicycle datapath (PC, Memory, Instruction register, Memory data register, Registers, A, B, Sign extend, Shift left 2, ALU, ALUOut) with control lines IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, MemtoReg, ALUSrcA, ALUSrcB, ALUOp; below it, the timing of IRWrite over Cycles 1-10 while a Load (Ifetch, Reg, Exec, Mem, Wr), a Store (Ifetch, Reg, Exec, Mem) and an R-type (Ifetch) execute]
CS141-L4-59 Tarun Soni, Summer ‘03
Review: Finite State Machines
• Finite state machines:
– a set of states and
– next state function (determined by current state and the input)
– output function (determined by current state and possibly input)
– We’ll use a Moore machine (output based only on current state)
[Figure: Inputs and Current state feed the Next-state function; Next state is latched on the Clock; the Output function produces the Outputs]
CS141-L4-60 Tarun Soni, Summer ‘03
Multicycle CPU: Control
[Figure: Control logic block. Inputs: Op5..Op0 from the instruction register opcode field and S3..S0 from the State register. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next-state bits NS3..NS0]
If (State == Instruction Fetch)
{
IRWrite = 1;
// All other signals are 0;
State = Operand Fetch;
}
If (State == Execute && InstructionOpCode == BEQ )
{
// Do your thing..
}
ControlOutput = f(State, OpCode)
NextState = f(State, OpCode)
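The next-state half of this pair can be sketched in Python for the five instruction classes covered here. The state numbering 0 to 9 and the state names are assumptions that follow the usual textbook multicycle FSM; the output function is left out to keep the example short.

```python
# Assumed state numbering (0-9), following the usual textbook FSM.
FETCH, DECODE, MEM_ADDR, MEM_READ, WRITE_BACK = 0, 1, 2, 3, 4
MEM_WRITE, EXECUTE, RTYPE_DONE, BRANCH_DONE, JUMP_DONE = 5, 6, 7, 8, 9

def next_state(state, opcode):
    """NextState = f(State, OpCode) for lw, sw, R-type, beq and j."""
    if state == FETCH:
        return DECODE
    if state == DECODE:
        return {"lw": MEM_ADDR, "sw": MEM_ADDR, "R-type": EXECUTE,
                "beq": BRANCH_DONE, "j": JUMP_DONE}[opcode]
    if state == MEM_ADDR:
        return MEM_READ if opcode == "lw" else MEM_WRITE
    if state == MEM_READ:
        return WRITE_BACK
    if state == EXECUTE:
        return RTYPE_DONE
    return FETCH   # every completion state returns to instruction fetch

def cycles(opcode):
    """Count the states visited by one instruction, starting at FETCH."""
    state, count = FETCH, 1
    while True:
        state = next_state(state, opcode)
        if state == FETCH:
            return count
        count += 1
```

Note that only two states (decode and memory address computation) actually look at the opcode, which is why a full truth table over state and opcode contains so many repeated rows.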
CS141-L4-61 Tarun Soni, Summer ‘03
Multicycle CPU: Our basic FSM
[Figure: Instruction fetch -> Decode and Register Fetch -> one of: Memory instructions, R-type instructions, Branch instructions, Jump instruction]
CS141-L4-62 Tarun Soni, Summer ‘03
Multicycle CPU: Control
“instruction fetch”: IR <= MEM[PC]
“decode / operand fetch”: A <= R[rs]; B <= R[rt]
Execute / Memory / Write-back, by instruction:
R-type: S <= A fun B; then R[rd] <= S, PC <= PC + 4
ORi: S <= A or ZX; then R[rt] <= S, PC <= PC + 4
LW: S <= A + SX; then M <= MEM[S]; then R[rt] <= M, PC <= PC + 4
SW: S <= A + SX; then MEM[S] <= B, PC <= PC + 4
BEQ & ~Equal: PC <= PC + 4
BEQ & Equal: PC <= PC + SX || 00
CS141-L4-63 Tarun Soni, Summer ‘03
Multicycle CPU: Control
Start -> state 0
State 0, Instruction fetch: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1, Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
State 2, Memory address computation, (Op = 'LW') or (Op = 'SW'): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access, (Op = 'LW'): MemRead, IorD = 1
State 4, Write-back step: RegDst = 0, RegWrite, MemtoReg = 1
State 5, Memory access, (Op = 'SW'): MemWrite, IorD = 1
State 6, Execution, (Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
State 8, Branch completion, (Op = 'BEQ'): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion, (Op = 'J'): PCWrite, PCSource = 10
Number of states?
Number of bits for state?
CS141-L4-64 Tarun Soni, Summer ‘03
Multicycle CPU: Control: Assigning States
0000 “instruction fetch”: IR <= MEM[PC]
0001 “decode”: A <= R[rs]; B <= R[rt]
0100 R-type, Execute: S <= A fun B
0101 R-type, Write-back: R[rd] <= S; PC <= PC + 4
0110 ORi, Execute: S <= A or ZX
0111 ORi, Write-back: R[rt] <= S; PC <= PC + 4
1000 LW, Execute: S <= A + SX
1001 LW, Memory: M <= MEM[S]
1010 LW, Write-back: R[rt] <= M; PC <= PC + 4
1011 SW, Execute: S <= A + SX
1100 SW, Memory: MEM[S] <= B; PC <= PC + 4
0011 BEQ & ~Equal: PC <= PC + 4
0010 BEQ & Equal: PC <= PC + SX || 00
CS141-L4-65 Tarun Soni, Summer ‘03
Multicycle CPU: Detailed control spec.
Control-bit columns, in order: IR; PC (en, sel); Ops (A, B); Exec (Ex, Sr, ALU, S); Mem (R, W, M); Write-Back (M-R, Wr, Dst). Each row lists only its non-blank entries.
State  Op field  Eq  Next   Control bits
0000   ??????    ?   0001   1
0001   BEQ       0   0011   1 1
0001   BEQ       1   0010   1 1
0001   R-type    x   0100   1 1
0001   orI       x   0110   1 1
0001   LW        x   1000   1 1
0001   SW        x   1011   1 1
0010   xxxxxx    x   0000   1 1
0011   xxxxxx    x   0000   1 0
R:
0100   xxxxxx    x   0101   0 1 fun 1
0101   xxxxxx    x   0000   1 0 0 1 1
ORi:
0110   xxxxxx    x   0111   0 0 or 1
0111   xxxxxx    x   0000   1 0 0 1 0
LW:
1000   xxxxxx    x   1001   1 0 add 1
1001   xxxxxx    x   1010   1 0 0
1010   xxxxxx    x   0000   1 0 1 1 0
SW:
1011   xxxxxx    x   1100   1 0 add 1
1100   xxxxxx    x   0000   1 0 0 1
CS141-L4-66 Tarun Soni, Summer ‘03
Multicycle CPU: Implementation styles
• ROM = "Read Only Memory"
– values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
– if the address is m bits, we can address 2^m entries in the ROM.
– our outputs are the bits of data that the address points to.
– 2^m is the "height", and n is the "width"
Example with m = 3 address bits and n = 4 output bits:
address   data
0 0 0     0 0 1 1
0 0 1     1 1 0 0
0 1 0     1 1 0 0
0 1 1     1 0 0 0
1 0 0     0 0 0 0
1 0 1     0 0 0 1
1 1 0     0 1 1 0
1 1 1     0 1 1 1
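Read this way, a ROM is just an indexed lookup. A minimal Python sketch of the example table, with m = 3 and n = 4; the names `ROM` and `rom_read` are invented for the illustration.

```python
# Contents of the example ROM: index = 3-bit address, value = 4-bit word.
ROM = [0b0011, 0b1100, 0b1100, 0b1000,
       0b0000, 0b0001, 0b0110, 0b0111]

def rom_read(address):
    """Return the n output bits stored at an m-bit address."""
    return ROM[address]

m = 3              # address bits
n = 4              # output bits per entry
height = 2 ** m    # number of entries
```

For control logic, the address would be the concatenation of state and opcode bits, and the stored word would hold the control outputs and next-state bits.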
CS141-L4-67 Tarun Soni, Summer ‘03
Multicycle CPU: Implementation styles
• How many inputs are there?
6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
16 datapath-control outputs, 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
• Rather wasteful, since for lots of the entries, the outputs are the same
– i.e., opcode is often ignored
CS141-L4-68 Tarun Soni, Summer ‘03
Multicycle CPU: Implementation styles
• Break up the table into two parts
– 4 state bits tell you the 16 outputs, 2^4 x 16 bits of ROM
– 10 bits tell you the 4 next state bits, 2^10 x 4 bits of ROM
– Total: 4.3K bits of ROM
• PLA is much smaller
– can share product terms
– only need entries that produce an active output
– can take into account don't cares
• Size is (#inputs x #product-terms) + (#outputs x #product-terms)
For this example = (10x17)+(20x17) = 510 PLA cells
• PLA cells usually about the size of a ROM cell (slightly bigger)
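The size comparison is plain arithmetic and can be checked in Python. The variable names are invented for the example; the bit counts and the 17 product terms are the figures used on these slides.

```python
opcode_bits, state_bits = 6, 4
datapath_outputs = 16
inputs = opcode_bits + state_bits          # 10 address lines
outputs = datapath_outputs + state_bits    # 20 outputs

single_rom_bits = 2 ** inputs * outputs    # one big ROM

# Split ROMs: outputs depend on state only, next state on all inputs.
output_rom_bits = 2 ** state_bits * datapath_outputs
next_state_rom_bits = 2 ** inputs * state_bits
split_rom_bits = output_rom_bits + next_state_rom_bits

product_terms = 17
pla_cells = inputs * product_terms + outputs * product_terms
```

This gives 20480 bits for the single ROM, 256 + 4096 = 4352 bits for the split ROMs, and 170 + 340 = 510 cells for the PLA.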
CS141-L4-69 Tarun Soni, Summer ‘03
Multicycle CPU: Implementation styles
PLA Implementation
[Figure: PLA with inputs Op5..Op0 and S3..S0, and outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3, NS2, NS1, NS0]
IRWrite = (!S0 && !S1 && !S2 && !S3)
NS0 = ( S[3..0] == 0000 ) || ( S[3..0] == 0110 ) || ( S[3..0] == 1001 && OP[5..0] == 000010 ) || (…)(…)
CS141-L4-70 Tarun Soni, Summer ‘03
Microprogramming
[Figure: the ten-state multicycle control FSM (instruction fetch, instruction decode/register fetch, memory address computation, memory access, write-back step, execution, R-type completion, branch completion, jump completion)]
• Control is the hard part of processor design
° Datapath is fairly regular and well-organized
° Memory is highly regular
° Control is irregular and global
Consider the FSM in case of 100s of instructions !!!
• FSMs get unmanageable quickly as they grow.
– hard to specify
– hard to manipulate
– error prone
– hard to visualize
• The state diagrams that define the controller for an instruction set processor are highly structured
• Use this structure to construct a simple “microsequencer”
• Control reduces to programming this very simple device
– microprogramming
CS141-L4-71 Tarun Soni, Summer ‘03
Microprogramming
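[Figure: microsequencer: the State Reg feeds the Control Logic (PLA or ROM), whose Outputs drive the Multicycle Datapath; an Adder computes state + 1, and the Address Select Logic picks the next state using the Opcode]
Types of “branching”
• Set state to 0
• Dispatch (state 1)
• Use incremented state number
Common case: State += 1;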
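The address select logic can be sketched as a three-way choice. In this Python example the dispatch table contents and the names `DISPATCH`, `next_micro_state` and `addr_ctl` are hypothetical; only the three kinds of "branching" come from the slide.

```python
# Hypothetical dispatch table: opcode -> first state of its sequence.
DISPATCH = {"lw": 2, "sw": 2, "R-type": 6, "beq": 8, "j": 9}

def next_micro_state(state, addr_ctl, opcode):
    """Address select logic with three choices.

    addr_ctl is "fetch" (go to state 0), "dispatch" (look up the
    opcode) or "seq" (use the incremented state number).
    """
    if addr_ctl == "fetch":
        return 0
    if addr_ctl == "dispatch":
        return DISPATCH[opcode]
    return state + 1   # common case: State += 1
```

Because most transitions are the sequential case, the next-state information shrinks to a small sequencing field per state plus a dispatch table indexed by opcode.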
Microprogramming: a particular strategy for implementing the control unit of a processor by "programming" at the level of register transfer operations
Microarchitecture: logical structure and functional capabilities of the hardware as seen by the microprogrammer
Historical Note: the IBM 360 Series was the first to distinguish between architecture & organization. Same instruction set across a wide range of implementations, each with different cost/performance
CS141-L4-72 Tarun Soni, Summer ‘03
Macro-Micro programming?
[Figure: Main Memory holds the user program (ADD, SUB, AND, ...) plus DATA, and this can change; the CPU contains the execution unit and the control memory; each instruction in main memory is mapped into one microsequence in control memory, e.g. the AND microsequence: Fetch, Calc Operand Addr, Fetch Operand(s), Calculate, Save Answer(s)]
CS141-L4-73 Tarun Soni, Summer ‘03
Horizontal Microinstructions
° “Horizontal” Microcode
– control field for each control point in the machine
Fields: µseq | µaddr | A-mux | B-mux | bus enables | register enables
[Figure: the opcode and the datapath conditions feed the Control Logic / Store (PLA, ROM); its output microinstruction is decoded to drive the datapath control points]
Depending on bus organization, many potential control combinations are simply wrong, i.e., they imply transfers that can never happen at the same time.
Idea: encode fields to save ROM space
Example: mem_to_reg and ALU_to_reg should never happen simultaneously; => encode in a single bit which is decoded, rather than two separate bits
CS141-L4-74 Tarun Soni, Summer ‘03
Vertical Microinstructions
° “Vertical” Microcode
– encoded control fields with local decode
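[Figure: microinstruction with encoded src and dst fields, each decoded locally (DEC), plus other control fields; a MUX selects the next state from the inputs]
Some of these may have nothing to do with registers!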
CS141-L4-75 Tarun Soni, Summer ‘03
Design Microinstruction Sets
1) Start with list of control signals
2) Group signals together that make sense (vs. random): called “fields”
3) Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last)
4) Create a symbolic legend for the microinstruction format, showing name of field values and how they set the control signals
– Use computers to design computers
5) To minimize the width, encode operations that will never be used at the same time
CS141-L4-76 Tarun Soni, Summer ‘03
Microinstructions
Start with list of control signals, grouped into fields
Single Bit Control:
Signal name    Effect when deasserted         Effect when asserted
ALUSelA        1st ALU operand = PC           1st ALU operand = Reg[rs]
RegWrite       None                           Reg. is written
MemtoReg       Reg. write data input = ALU    Reg. write data input = memory
RegDst         Reg. dest. no. = rt            Reg. dest. no. = rd
TargetWrite    None                           Target reg. = ALU
MemRead        None                           Memory at address is read
MemWrite       None                           Memory at address is written
IorD           Memory address = PC            Memory address = ALU
IRWrite        None                           IR = Memory
PCWrite        None                           PC = PCSource
PCWriteCond    None                           IF ALUzero then PC = PCSource
Multiple Bit Control:
Signal name    Value    Effect
ALUOp          00       ALU adds
               01       ALU subtracts
               10       ALU does function code
               11       ALU does logical OR
ALUSelB        000      2nd ALU input = Reg[rt]
               001      2nd ALU input = 4
               010      2nd ALU input = sign extended IR[15-0]
               011      2nd ALU input = sign extended, shift left 2 IR[15-0]
               100      2nd ALU input = zero extended IR[15-0]
PCSource       00       PC = ALU
               01       PC = Target
               10       PC = (PC+4)[31-28] : (IR[25-0] << 2)
CS141-L4-77 Tarun Soni, Summer ‘03
Microinstructions
Field Name         Width (wide)   Width (narrow)   Control Signals Set
ALU Control        4              2                ALUOp
SRC1               2              1                ALUSelA
SRC2               5              3                ALUSelB
ALU Destination    6              4                RegWrite, MemtoReg, RegDst, TargetWr.
Memory             4              3                MemRead, MemWrite, IorD
Memory Register    1              1                IRWrite
PCWrite Control    5              4                PCWrite, PCWriteCond, PCSource
Sequencing         3              2                AddrCtl
Total width        30             20               bits
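The two totals can be checked by summing the per-field widths. A small Python sketch; the dictionary name and layout are invented for the example, and the numbers are copied from the table.

```python
# (wide, narrow) width in bits of each microinstruction field.
FIELD_WIDTHS = {
    "ALU Control":     (4, 2),
    "SRC1":            (2, 1),
    "SRC2":            (5, 3),
    "ALU Destination": (6, 4),
    "Memory":          (4, 3),
    "Memory Register": (1, 1),
    "PCWrite Control": (5, 4),
    "Sequencing":      (3, 2),
}

wide_total = sum(wide for wide, _ in FIELD_WIDTHS.values())
narrow_total = sum(narrow for _, narrow in FIELD_WIDTHS.values())
```

The narrow format saves a third of the bits in every microinstruction, at the cost of decoding each field before it reaches the control points.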
CS141-L4-78 Tarun Soni, Summer ‘03
Microinstructions: MIPS field name and values
Field Name        Value          Function of Field with Specific Value
ALU               Add            ALU adds
                  Subt.          ALU subtracts
                  Func code      ALU does function code
                  Or             ALU does logical OR
SRC1              PC             1st ALU input = PC
                  rs             1st ALU input = Reg[rs]
SRC2              4              2nd ALU input = 4
                  Extend         2nd ALU input = sign ext. IR[15-0]
                  Extend0        2nd ALU input = zero ext. IR[15-0]
                  Extshft        2nd ALU input = sign ex., sl IR[15-0]
                  rt             2nd ALU input = Reg[rt]
ALU destination   Target         Target = ALUout
                  rd             Reg[rd] = ALUout
Memory            Read PC        Read memory using PC
                  Read ALU       Read memory using ALU output
                  Write ALU      Write memory using ALU output
Memory register   IR             IR = Mem
                  Write rt       Reg[rt] = Mem
                  Read rt        Mem = Reg[rt]
PC write          ALU            PC = ALU output
                  Target-cond.   IF ALU Zero then PC = Target
                  jump addr.     PC = PCSource
Sequencing        Seq            Go to sequential µinstruction
                  Fetch          Go to the first microinstruction
                  Dispatch       Dispatch using ROM.
CS141-L4-79 Tarun Soni, Summer ‘03
Microinstructions: The datapath again
[Figure: multicycle datapath (PC, Memory, Instruction register, Memory data register, Registers, A, B, Sign extend, Shift left 2, ALU, ALUOut) with control lines IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, MemtoReg, ALUSrcA, ALUSrcB, ALUOp]
Field Name        Value     Function of Field with Specific Value
SRC1              PC        1st ALU input = PC
                  rs        1st ALU input = Reg[rs]
SRC2              4         2nd ALU input = 4
                  Extend    2nd ALU input = sign ext. IR[15-0]
                  Extend0   2nd ALU input = zero ext. IR[15-0]
                  Extshft   2nd ALU input = sign ex., sl IR[15-0]
                  rt        2nd ALU input = Reg[rt]
ALU destination   Target    Target = ALUout
                  rd        Reg[rd] = ALUout
CS141-L4-80 Tarun Soni, Summer ‘03
Microinstructions: Pros-Cons
• Specification Advantages:
– Easy to design and write
– Design architecture and microcode in parallel
• Implementation (off-chip ROM) Advantages
– Easy to change since values are in memory
– Can emulate other architectures and instruction sets
– Can make use of internal registers
• Implementation Disadvantages, SLOWER now that:
– Control is implemented on same chip as processor
– ROM is no longer faster than RAM
– No need to go back and make changes
CS141-L4-81 Tarun Soni, Summer ‘03
CPU Control: Methodology
Initial representation:    Finite state diagram  or  Microprogram
Sequencing control:        Explicit next-state function  or  Microprogram counter + dispatch ROMs
Logic representation:      Logic equations  or  Truth tables
Implementation technique:  Programmable logic array  or  Read-only memory
CS141-L4-82 Tarun Soni, Summer ‘03
Microprogramming: the last word ?
Summary: Microprogramming was one inspiration for RISC
• If simple instructions could execute at a very high clock rate…
• If you could even write compilers to produce microinstructions…
• If most programs use simple instructions and addressing modes…
• If microcode is kept in RAM instead of ROM so as to fix bugs…
• If the same memory used for control memory could be used instead as a cache for “macroinstructions”…
• Then why not skip instruction interpretation by a microprogram and simply compile directly into the lowest language of the machine? (Microprogramming is overkill when the ISA matches the datapath 1-1.)
CS141-L4-83 Tarun Soni, Summer ‘03
Exceptions
Supporting exceptions in our FSM
[Figure: FSM for instruction fetch and decode, extended for exceptions]
Start: enter state 0
State 0 (Instruction Fetch): MemRead, ALUSelA = 0, IorD = 0, IRWrite, ALUSelB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1 (Instruction Decode/Register Fetch): ALUSelA = 0, ALUSelB = 11, ALUOp = 00, TargetWrite
Transitions out of state 1:
  Opcode = LW or SW          -> Memory Inst FSM
  Opcode = R-type            -> R-type Inst FSM
  Opcode = BEQ               -> Branch Inst FSM
  Opcode = JMP               -> Jump Inst FSM
  Opcode = anything else     -> state 10
CS141-L4-84 Tarun Soni, Summer ‘03
Exceptions
Supporting exceptions in our FSM
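[Figure: FSM for R-type instructions, extended for overflow]
First state (from state 1): ALUSelA = 1, ALUSelB = 00, ALUOp = 10
Second state: ALUSelA = 1, RegDst = 1, RegWrite, MemtoReg = 0, ALUSelB = 10, ALUOp = 10
Transitions: normally to state 0; on overflow, to state 11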
CS141-L4-85 Tarun Soni, Summer ‘03
Exceptions
[Figure: exception states added to the FSM, and the datapath additions]
State 10 (illegal instruction): IntCause = 0, CauseWrite
State 11 (arithmetic overflow): IntCause = 1, CauseWrite
State 12: ALUSelA = 0, ALUSelB = 01, ALUOp = 01, EPCWrite
State 13: PCWrite, PCSource = 11, then to state 0 (fetch)
Datapath additions shown: EPC, Cause, sub4, InterruptHandlerAddress; signals PCWrite, EPCWrite, CauseWrite, IntCause, PCSource
Supporting exceptions in our FSM
• Write Cause into register
• Write PC into EPC
• Load Exception Handler address to PC
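The three steps above (write Cause, write EPC, load the handler address into PC) can be sketched as plain register updates. This is an illustration under assumptions: the `CpuState` class, the `take_exception` name, and the handler address value are hypothetical, and the PC is assumed to have already been incremented by 4 during fetch, which is why the ALU subtracts 4 before EPCWrite.

```python
from dataclasses import dataclass

# Assumption: placeholder for InterruptHandlerAddress; the slides give no value.
HANDLER_ADDRESS = 0x40000040

INT_CAUSE_ILLEGAL = 0   # IntCause = 0: illegal instruction
INT_CAUSE_OVERFLOW = 1  # IntCause = 1: arithmetic overflow


@dataclass
class CpuState:
    pc: int          # assumed to hold PC + 4 of the faulting instruction
    epc: int = 0
    cause: int = 0


def take_exception(cpu, int_cause):
    """Apply the three exception steps to the register state."""
    cpu.cause = int_cause      # CauseWrite, value selected by IntCause
    cpu.epc = cpu.pc - 4       # ALU computes PC - 4, then EPCWrite
    cpu.pc = HANDLER_ADDRESS   # PCWrite with PCSource = 11
    return cpu
```

After `take_exception(CpuState(pc=0x104), INT_CAUSE_OVERFLOW)`, EPC points back at the faulting instruction at 0x100 and the next fetch starts at the handler.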
CS141-L4-86 Tarun Soni, Summer ‘03
Exceptions
[Figure: state diagram in register-transfer notation, extended for exceptions]
Fetch:    IR <= MEM[PC]; PC <= PC + 4
Decode:   A <= R[rs]; B <= R[rt]
R-type:   S <= A fun B; R[rd] <= S
ORi:      S <= A op ZX; R[rt] <= S
LW:       S <= A + SX; M <= MEM[S]; R[rt] <= M
SW:       S <= A + SX; MEM[S] <= B
BEQ:      S <= A - B; Equal: PC <= PC + SX || 00; ~Equal: PC unchanged
other (undefined instruction):  EPC <= PC - 4; PC <= exp_addr; cause <= 10 (RI)
overflow (additional condition from datapath):  EPC <= PC - 4; PC <= exp_addr; cause <= 12 (Ovf)
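To make the two exception exits concrete, the following sketch models only the execute step for a single R-type operation (`add`) plus the "other" case. It is not the full state machine. The function names and the value used for `exp_addr` are assumptions; the cause codes 10 (RI) and 12 (Ovf) are the ones shown on the slide.

```python
MASK32 = 0xFFFFFFFF
EXP_ADDR = 0x40000040          # assumption: placeholder for exp_addr
CAUSE_RI, CAUSE_OVF = 10, 12   # cause codes as given on the slide


def to_signed(x):
    """Interpret the low 32 bits of x as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def execute(op, a, b, pc):
    """Model S <= A fun B for 'add'; pc is the already incremented PC.

    Returns (result, new_pc, epc, cause); epc and cause are None
    when no exception is taken.
    """
    if op == "add":
        s = to_signed(a) + to_signed(b)
        if not -(1 << 31) <= s < (1 << 31):
            # Overflow: EPC <= PC - 4; PC <= exp_addr; cause <= 12
            return None, EXP_ADDR, pc - 4, CAUSE_OVF
        return s & MASK32, pc, None, None
    # Anything else is treated as an undefined instruction:
    # EPC <= PC - 4; PC <= exp_addr; cause <= 10
    return None, EXP_ADDR, pc - 4, CAUSE_RI
```

Adding 1 to 0x7FFFFFFF takes the overflow exit, while an unrecognized operation takes the reserved-instruction exit. In both cases EPC holds the address of the offending instruction.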
CS141-L4-87 Tarun Soni, Summer ‘03
Summary
• Multicycle CPUs make things faster: each instruction takes only the cycles it needs.
• Control is harder.
• Microprogramming can simplify (conceptually) CPU control generation.
• A microprogram is a small program inside the CPU that executes the individual instructions of the “real” program.
• Exception handling is difficult in the CPU because the interactions between the executing instructions and the interrupt are complex and unpredictable.
CS141-L4-88 Tarun Soni, Summer ‘03
Mid-Term Review
• Technology trends: Design for the future
• Instruction Set Architectures: types of ISAs: addressing modes, length of instruction, etc.
• MIPS instruction format: basic classes of instructions
• Registers and load-store architectures
• Data types, operands, memory organization/addressing
• Basic MIPS instructions: arithmetic, logical, data transfer, branching, jumps
• Issues in jump/branching distance and immediate addressing modes
• Stacks and frames
• E.g., swap(), leaf_procedure(), nested_procedure()
• Performance: relative (e.g., Boeing), metrics, benchmarking, SPECmarks
• Execution Time = Instruction Count x Cycles/Instruction x Seconds/Cycle
• Amdahl’s law: Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )
• Arithmetic: 2s complement
• Basic digital logic, 1-bit adder, full adder, 32-bit adder/subtractor
• ALU: adder + mux + special conditions
• Delays in combinational logic, clocking
• Ripple carry vs. carry lookahead adders
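The two performance formulas in this list can be checked with a few lines of arithmetic. The sketch below reads the left-hand side of Amdahl's law as the execution time after the improvement. The function names and the sample numbers are made up for illustration.

```python
def cpu_time(instruction_count, cycles_per_instruction, seconds_per_cycle):
    """Execution time = Instruction Count x Cycles/Instruction x Seconds/Cycle."""
    return instruction_count * cycles_per_instruction * seconds_per_cycle


def time_after_improvement(time_unaffected, time_affected, improvement):
    """Amdahl's law: unaffected time plus affected time divided by the speedup."""
    return time_unaffected + time_affected / improvement


# Illustrative numbers: a 100 s program spends 80 s in multiplies,
# and multiplies become 4 times faster.
new_time = time_after_improvement(20.0, 80.0, 4.0)   # 20 + 80/4 = 40 s
overall_speedup = 100.0 / new_time                    # 2.5x, not 4x
```

The unaffected 20 s bounds the result: even an arbitrarily large improvement to the multiplies cannot bring this program below 20 s.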
CS141-L4-89 Tarun Soni, Summer ‘03
Mid-Term Review
• Multiplication & Division: grade school version
• 3 incrementally better algorithms (data paths)
• Basics of Booth arithmetic
• Floating point representation
• Floating point operations (+, -, *, /)
• Guard, round, and sticky bits
• Single cycle CPU
• Building blocks: register files, memory, etc.
• Storage units, clocking methodology
• PC arithmetic
• Instruction fetch
• Datapath on various operations: Load, Store, Branch, R-type, I-type
• Control: basic control signals for the MIPS subset
• Distributed control: Main control + ALU control
• PLA implementation
• Timing diagrams
CS141-L4-90 Tarun Soni, Summer ‘03
Mid-Term Review
• Multi-cycle CPU
• Datapath: registers/stages: Ifetch, A, B, Execute, Store, etc.
• Various instructions through the datapath
• Control: sharing functional units
• Finite state machine perspective for control: FSM for MIPS
• Implementation styles: ROM, PLA
• Microprogramming: horizontal, vertical, relationship to RISC
• Exceptions: change in FSM, internal, external; need to save state.