[email protected] Computing Systems The Processor: Datapath and Control

[email protected] 1

Computing Systems

The Processor: Datapath and Control

2

Introduction

The performance of a machine depends on 3 key factors:

instruction count clock cycle time clock cycles per instructions (CPI)

Implement a basic MIPS simplified to contains only:

memory-reference instructions: lw,sw arithmetic-logical instructions: add,sub,and,or,slt control-flow instructions: beq,j

compiler and ISA

hardware implementation

3

Overview of the implementation

Generic Implementation:

use the program counter (PC) to supply instruction address get the instruction from memory read registers use the instruction to decide exactly what to do

The actions required to complete an instruction depend on the instruction class

Even across instruction classes there are some similarities

e.g., all instructions use the ALU after reading the registersWhy? memory-reference? arithmetic? control flow?

4

Functional units to implement the processor

Two types of functional units: elements that operate on data values (combinational) elements that contain state (sequential)

Combinational units The current outputs depends only on the current

inputs

Sequential units The current outputs depends on the current inputs but

also on the past inputs the element “remember” its history i.e., it has the capability of storing the input

provided

5

State elements: latches and flip-flops Output is equal to the stored value (state) inside the element

(don't need to ask for permission to look at the value)

Change of state (value) is based on the “control” signal

Unclocked: Latches

state update is level triggered i.e., the state can change whenever the “control” change

Clocked: Flip-Flops

State update is edge triggered i.e., state can change only on clock edges

clock

rising edge

clock cycle

falling edge

6

Clocking methodology

clock

state element 1

state element 2

combinational logic

Computer design cannot tolerate unpredictability clocking methodology is designed to prevent this circumstance. specify the timing of reads and writes of the state elements

The easiest solution is to use a synchronous clocking scheme edge triggered methodology

Typical execution: read contents of some state elements, send values through some combinational logic write results to state elements

7

What functional units do we need ?

we need an ALU we need memory to store instructions and data

instruction memory takes address and supply instruction data memory takes address and supply data (lw) data memory address and data and write into memory (sw)

we need to manage a PC and its update we need a register file to include 32 registers

read two operands and write a result back sometime the operand comes from the instruction we need additional support for the immediate class of instructions

(sign extension) we need additional support for the jump instruction

8

Datapath building blocks

9

Register File (read circuits)

Make sure you understand what is the “mux” above?

Built using D flip-flops

The clock signal is not shown

10

Register File (write circuits)

This is just a diagram to illustrate the “principle”.

In practice never gate a clock signal !!!

11

Datapath implementation

Use multiplexors to stitch the variousfunctional units together

12

The datapath operation

1. Fetching instructions and incrementing PC

13

Arithmetic-logical instructions

2. two registers read from the register file3. ALU operates on the data from the two registers4. The result from ALU is written in the register file

14

Memory-reference instructions: sw

2. Two registers are read from the register file3. ALU add the value read from one of the register and the sign-extended, lower 16 bits of instruction (offset)4. The value read from the second register is written in data memory at the address given by the sum computed by the ALU

15

Memory-reference instructions: lw

2. A registers is read from the register file3. ALU computes the sum of the value read from the register file and the sign-extended, lower 16 bits of instruction (offset)4. The sum from the ALU is used as the address for the data memory in data memory 5. The data from the memory is written in the register file at the destination register

16

Control flow instructions: beq

branch target address

2. Two registers are read from the register file3. ALU performs a subtract on the values from the register file the value of PC+4 is added to the sign- extended, lower 16 bits of the instruction (offset) shifted left by 2 4. The Zero result from the ALU is used to decide which adder result to store into the PC

17

Arithmetic-logical: add,sub,and,or,slt (R-type)

Example:

add $t0,$t1,$t2 # $t1 in rs, $t2 in rt, $t0 in rd

Memory-reference: lw,sw (I-type)

Example:

lw $t0,offset($t1) # $t1 in rs, $t0 in rt

sw $t0,offset($t1) # $t1 in rs, $t0 in rt

Format of the instructions

31:26 25:21 20:16 15:11 10:6 5:0

0 rs rt rd shamt funct

31:26 25:21 20:16 15:0

35 or 43 rs rt offset

Oops !! The destination register can be in two possible places. For load is in bit 20:16 (rt), while for R-type instructionit is in bit positions 15:11 (rd). We have a small bug !!!

18

branch: beq (I-type)

Example:

beq $t0,$t1,label # $t0 in rs, $t1 in rt

jump: j (J-type)

Example:

j address

Format of the instructions

31:26 25:21 20:16 15:0

4 rs rt address

31:26 25:0

5 address

We will leave the implementation of j out until the very end

19

A small bug fix !!!

We need to add a mux to select which field of the instruction is used to indicate the register to be written

20

The control unit

Selecting the operations to perform (ALU, read/write of data

memory and register file)

Controlling the flow of data (multiplexor inputs)

Information comes from the 32 bits of the instruction

Example:add $8, $17, $18

000000 10001 10010 01000 00000 100000 op rs rt rd shamt funct

ALU's operation based on instruction type and function code

21

The control unit

ALU control input (ALU operation lines)

0000 and0001 or0010 add0110 subtract0111 set-on-less-than1100 nor

Why is the code for subtract 0110 and not 0011?

22

The control unit The control unit must compute 4-bit ALU control input:

given function code for arithmetic given instruction type

ALUop=00 for lw,sw ALUop=01 for beq, ALUop=10 for arithmetic

Describe it using a truth table (can turn into gates):

ALUop is an intermediate 2-bit code computed from the opcode field of the instruction to simplify the logic needed for computing the 4-bit ALU operation control input

23

ALUop

ALU operation

Instruction RegDst ALUSrcMemto-

RegReg

WriteMem Read

Mem Write

Branch ALUOp1 ALUp0

R-format 1 0 0 1 0 0 0 1 0lw 0 1 1 1 1 0 0 0 0sw X 1 X 0 0 1 0 0 0beq X 0 X 0 0 0 1 0 1

24

The control unit

• Simple combinational logic (truth tables)

25

Implementing jump

26

Single-cycle control structure

Every instruction begins execution on one clock edge

and completes execution on the next clock edge

We use a single long clock cycle for every instruction

All of the control logic is combinational

We wait for everything to settle down, and the right thing

to be done

ALU might not produce “right answer” right away

we use write signals along with clock to determine

when to write

Cycle time determined by length of the longest path

27

Single-cycle implementation

Critical path for different instruction classes

Instruction class Functional units used by the instruction class

R-typeinstr.

fetchreg.

accessALU

reg. access

Loadinstr.

fetchreg.

accessALU

mem. access

reg. access

Storeinstr.

fetchreg.

accessALU

mem. access

Branchinstr.

fetchreg.

accessALU

Jumpinstr.

fetch

28

Performance of single-cycle machines

Assume the major functional units of a machine have the following delays: Memory Units: 200 ps ALU and adders: 100 ps Register File (read or write): 50 ps Muxes, control unit,

PC accesses, sign extension unit: no delay

Instruction mix 25% loads, 10% stores, 45% ALU instructions, 15% branches,

and 5% jump

What is execution time for an implementation in which every instruction operates in 1 clock cycle of a fixed length ? every instruction executes in 1 clock cycle using a variable length clock ?

29

Timing for different instruction classes

Instr.

classInstr.

Mem.

Reg.

ReadALU

Data

Mem.

Reg.

WriteTotal

R-type 200 50 100 0 50 400 ps

Load 200 50 100 200 50 600 ps

Store 200 50 100 200 550 ps

Branch 200 50 100 350 ps

Jump 200 200 ps

30

Single-cycle machine: performance

The clock cycle for a machine with a single clock for all instructions will be determined by the longest instruction

CPU clock cycle (single clock) = 600 ps

The average clock cycle for a machine with a variable clock is:

CPU clock cycle (variable clock) = 600 x 25% + 550 x 10% + 400 x 45% + 350 x 15% + 200 x 3% = 447.5 ps

CPU execution time = IC x CPI x clock cycle time

1.34447.5

600

timeexecutionCPU

timeexecutionCPU

eperformancCPU

eperformancCPU

clockvariable

clocksingle

clocksingle

clockvariable

31

Single-cycle problems

Single cycle Problems: clock cycle is equal to the worst-case delay for all instructions i.e., we violate the key design principle of making the common

case fast if we implement more complicated instructions

(e.g., floating point arithmetic) the performance penalty is unbearable !!!

some functional units must be duplicated (wasteful of area)

One possible solution: a “multicycle” datapath:

use a “smaller” cycle time have different instructions take different numbers of cycles

32

Multi-cycle approach

We will be reusing functional units ALU used to compute address and to increment PC Memory used for instruction and data

Break up the instructions into steps, each step takes a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit

At the end of a cycle store values for use in later cycles (easiest thing to do) introduce additional “internal” registers

Our control signals will not be determined directly by instruction e.g., what should the ALU do for a “subtract” instruction?

We’ll use a finite state machine for control

33

PCLoad

Multi-cycle datapath with control lines(without jump)

IR

Both for instruction and data MDR

34

Five execution steps

Instruction Fetch

Instruction Decode and Register Fetch

Execution, Memory Address Computation, or Branch Completion

Memory Access or R-type instruction completion

Write-back step

Instructions take from 3 to 5 cycles !

35

Step 1: instruction fetch

Use PC to get instruction and put it in the Instruction Register.

Increment the PC by 4 and put the result back in the PC. Can be described succinctly using RTL "Register-

Transfer Language"

IR <= Memory[PC];PC <= PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?

36

Step 2: instruction decode and register fetch

Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch RTL:

A <= Reg[IR[25:21]];B <= Reg[IR[20:16]];ALUOut <= PC + (sign-extend(IR[15:0]) <<

2);

We aren't setting any control lines based on the instruction type (the instruction is still being decoded in the control unit)

37

Step 3: instruction dependent

ALU is performing one of three functions, based on instruction type

Memory Reference: (address computation)ALUOut <= A + sign-extend(IR[15:0]);

R-type: (execution of the operation)ALUOut <= A op B;

Branch completion: (write PC)if (A==B) PC <= ALUOut;

Jump completion: (write PC)PC <= {PC[31:28], IR[25:0], 2b’00};

38

Step 4: R-type or memory-access

Memory Reference: (loads and stores access memory)

load: MDR <= Memory[ALUOut]; (read access)

orstore completion step: Memory[ALUOut] <= B; (write access)

R-type instruction completion step: (write destination register)

Reg[IR[15:11]] <= ALUOut;

The write actually takes place at the end of the cycle on the edge

39

Step 5: Write back

Memory–Reference:

load completion step: Reg[IR[20:16]] <= MDR;

40

Summary:

41

Simple Questions

How many cycles will it take to execute this code?

lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, Label #assume not add $t5, $t2, $t3 sw $t5, 8($t3)

Label: ...

What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3

takes place?

42

Complete multi-cycle machine

43

Review: Finite State Machine (FSM)

Finite state machines: a set of states and next state function (determined by current state and the input) output function (determined by current state and possibly input)

We’ll use a Moore machine (output based only on current state)

next-statefunction

inputscurrent state

outputfunction

outputs

Mealy Machine

Moore Machine output register

44

Implementingthe control unit

Note:

don’t care if not mentioned

asserted if name only

otherwise exact value

How many state bits will we need?

45

Implementing the control unit

Value of control signals is dependent upon: what instruction is being executed which step is being performed

In each clock cycle decide all the actions that need to be taken

Control unit is the most complex part of the design Can be “hard-wired”, ROM based, or micro-programmed

Simpler instructions lead to simpler control Sometime simple instructions are more effective than a

single complex instruction Complex instructions may have to be maintained for

compatibility reasons

46

Historical perspective

Historical context of CISC: Too much logic to put on a single chip Use a ROM (or even RAM) to hold the microcode It’s easy to add new instructions

Microprogramming appropriate if hundreds of opcodes, modes, cycles, etc. signals specified symbolically using microinstructions

47

Microprogramming

= IR[31:35]

48

Microprogramming detailDispatch ROM 1 Dispatch ROM 2

Op Opcode name Value Op Opcode name Value000000 R-format 0110 100011 lw 0011000010 jmp 1001 101011 sw 0101000100 beq 1000100011 lw 0010101011 sw 0010

State number Address-control action Value of AddrCtl0 Use incremented state 31 Use dispatch ROM 1 12 Use dispatch ROM 2 23 Use incremented state 34 Replace state number by 0 05 Replace state number by 0 06 Use incremented state 37 Replace state number by 0 08 Replace state number by 0 09 Replace state number by 0 0

State

Adder

1

PLA or ROM

Mux3 2 1 0

Dispatch ROM 1Dispatch ROM 2

0

AddrCtl

Address select logic

Instruction registeropcode field

Documents

[email protected] Computing Systems The Processor: Datapath and Control