Computer Architecture The Processor: Datapath and Control

The Processor

We are going to see how the processor is implemented, starting with a very simple processor and then adding some more complexity

This processor implements a subset of the MIPS instruction set:
  Memory-reference instructions: lw and sw
  ALU instructions: add, sub, or, slt
  Control flow instructions: beq and j

We’ll see how implementation choices affect the performance characteristics of the machine: clock rate and CPI

10,000ft View

Three steps:
  1. Send the PC to the memory to load the next instruction from memory (32 bits)
  2. Read 0, 1, or 2 registers, using the corresponding fields of the instruction to know which one(s) to read, if any
  3. Execute the instruction

Luckily there are some commonalities: most instructions need to use the ALU
  To calculate a numerical value, or to calculate an address

Of course there are differences:
  Only 2 instructions need to access the data memory
  The store does not write into registers
  Only the branch and jump set the PC to something other than PC+4

Our Processor, sort of ...

To implement these steps, our processor will have 5 main components:
  The PC register
  The memory from which instructions are loaded
  The memory to which data is stored and from which data is loaded
  The ALU
  The Register file

The trick is to interconnect them in a way that’s useful, cheap, and fast

Note that above we distinguish two memories, which conceptually goes against the Von Neumann architecture model
  Let’s just go with this for now
  Note that we have separate Instruction and Data caches anyway!

Our Processor, sort of...

What’s missing:
  How to combine inputs that are “joined” together
  How to tell each component what to do

Multiplexers and Controllers

In the previous figure we have two or more “wires” going into the same input of a component

This is because, depending on the instruction being executed, a different input should be provided

So, based on the instruction, we need to decide which input should be selected

This is done with a multiplexer

[Figure: a multiplexer (MUX) with inputs 1 through n, one selected output, and a control input of ceil(log2(n)) bits]
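As a rough illustration of what a multiplexer does, here is a minimal Python sketch. The function names `control_bits` and `mux` are made up for this example, and real hardware does this with gates rather than a list lookup.

```python
def control_bits(n):
    """Number of control bits needed to select among n inputs.

    For n >= 1 this equals ceil(log2(n)), computed with integers only.
    """
    return (n - 1).bit_length()


def mux(inputs, select):
    """Return the input chosen by the control value (0-based)."""
    if not 0 <= select < len(inputs):
        raise ValueError("select value does not name an input")
    return inputs[select]


# A 4-input multiplexer needs 2 control bits; control value 2 picks the third input.
width = control_bits(4)
chosen = mux([10, 20, 30, 40], 2)
```

Here `width` is 2 and `chosen` is 30.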

What about the Control?

So great, now we can control multiplexers

Besides, there are other things to control

Example: the ALU has a bunch of control bits that tell it what to do:

2-bit control
  00: ADD
  01: SUB
  10: MUL
  11: SHIFT

The Control unit

We need a controller that sends the appropriate control bits to all the multiplexers and the components

The control unit sends these signals based on the nature of the instruction

It uses the bits of the opcode to infer the appropriate control bits

Example:
  Say that a branch has an opcode of the form XXXXX1, and that all other opcodes are of the form XXXXX0
  Then the control can decide whether the PC should come from (current PC+4) or from the output of the ALU
  Let’s show this on a figure (that makes _many_ simplifying assumptions, which we will clear up in what follows)

Control Unit (Simplified) Example

[Figure: the instruction register, the PC, an adder with the constant 4, an offset path, and a MUX whose input 0 or input 1 is selected by a control bit that is 0 or 1]
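The simplified example can be sketched in a few lines of Python. This only mirrors the toy assumption made on the slide (branches are exactly the opcodes whose lowest bit is 1); the names `is_branch` and `next_pc` are made up for this sketch and this is not how real MIPS opcodes are assigned.

```python
def is_branch(opcode):
    """Toy rule from the example: opcode XXXXX1 is a branch, XXXXX0 is not."""
    return opcode & 1


def next_pc(pc, opcode, alu_output):
    """Pick the next PC with a 2-input multiplexer driven by one control bit."""
    mux_inputs = [pc + 4, alu_output]   # input 0, input 1
    return mux_inputs[is_branch(opcode)]


sequential = next_pc(100, 0b000010, 200)   # not a branch: PC + 4
branching = next_pc(100, 0b000011, 200)    # branch: ALU output
```

With these inputs, `sequential` is 104 and `branching` is 200.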

A more complete picture (5.2)

Logic Design Convention

In everything we’ve talked about so far there was no well-defined notion of time
  Although we all know there is such a thing as a clock

Let’s review some elements of logic design
  See Section 5.2 and Appendix B for more details if needed (some 331 things)

Logic design uses two kinds of elements:
  Combinational elements: elements whose output depends only on their inputs
    They always give the same output for the same inputs
    There is no notion of internal storage, state, etc.
    They simply look at the voltage on the input lines and always produce a given voltage on the output line
  State elements: elements that have a state

State Elements

State elements are used for things like registers and memories
  Conceptually the same

A state element has at least two inputs and one output
  Inputs:
    The value to be written into the element
    The clock: determines when the data is written
  Output:
    The value that was written in a previous clock cycle (the state element can be read at any time)

Let’s try to understand how the clock works

The Clock

The clock cycle/period is divided into two portions:
  high clock
  low clock

We use edge-triggered clocking, meaning that state changes (in state elements) occur at a clock edge
  Using either the rising edge or the falling edge

[Figure: clock waveform over one clock cycle, with the rising edge and the falling edge marked]

The Clock

[Figure: over one clock cycle, state element #1 (stable) feeds a combinational circuit (stable by the edge), which feeds state element #2 (updated on the edge)]

In the above, we want to use the value in state element #1 to modify the value in state element #2:
  It takes one cycle
  We need all signals to be stabilized by the clock edge

The Clock

The nice thing about the previous system is that since we know that all state elements are updated on a clock edge, we can (sort of) ignore the clock signal and just remember that we’re using edge-triggered clocking

Of course, some state elements are not updated at every cycle!

What we do then is AND the clock signal with some control bit, and pass that as the second input to the state element

Assuming a rising-edge update:
  While the control bit stays at 0, nothing happens
  If we set the control bit to 1, the state element will be updated at the next rising edge
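The behavior of a state element whose clock is ANDed with a control bit can be mimicked in software. The sketch below is only a behavioral model under that assumption; the class name `StateElement` and its attribute names are invented for this example.

```python
class StateElement:
    """A register that is updated only on a rising edge when its control bit is 1."""

    def __init__(self, value=0):
        self.value = value        # current output, readable at any time
        self.data_in = value      # value currently present on the input line
        self.write_enable = 0     # control bit that is ANDed with the clock

    def rising_edge(self):
        # The element only "sees" the edge when (clock AND control bit) is 1.
        if self.write_enable:
            self.value = self.data_in


reg = StateElement(5)
reg.data_in = 9
reg.rising_edge()            # control bit is 0: nothing happens
held = reg.value
reg.write_enable = 1
reg.rising_edge()            # control bit is 1: updated on this edge
updated = reg.value
```

After the first edge the register still holds 5 (`held`); after the second edge it holds 9 (`updated`).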

Read/Write in a Clock Cycle

A great implication of edge-triggered clocking: a state element can be read and written in the same clock cycle
  No race condition (i.e., non-deterministic behavior)
  We will say things like: “reads happen in the first half of the clock cycle, writes happen in the second half”
  You can read the state of a state element S at one rising edge, and have it be updated at the next rising edge

[Figure: the same circuit as before (state element #1, combinational circuit, state element #2), with state element #2 also being read during the cycle]

Busses and bus width

Many of the state elements and combinational elements take multi-bit inputs (often 32-bit inputs)

The term “bus” refers to a wire that carries more than one bit
  multiple 1-bit wires, really

We simply indicate the width of the busses as follows:

[Figure: bus lines annotated with their widths (16 and 8), and a control signal line]

Building a Datapath

A datapath element is an element in the processor that is supposed to operate on or hold data
  instruction memory, data memory, register file, ALU, adders

Let’s re-examine the datapath elements we only barely introduced earlier

Fetching Instructions

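[Figure: the 32-bit PC provides the read address to the Instruction Memory, which outputs the 32-bit Instruction; an adder adds 4 to the PC to produce the next 32-bit PC value]

The PC gets updated in 1 clock cycle because we use edge-triggered clocking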
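A minimal Python sketch of the fetch step follows. Modeling the instruction memory as a dictionary from byte address to 32-bit word is an assumption made for this example, as are the function name `fetch` and the two example instruction words.

```python
def fetch(pc, instruction_memory):
    """Return (instruction, next_pc) for one instruction fetch."""
    instruction = instruction_memory[pc]      # read the 32-bit word at address PC
    next_pc = (pc + 4) & 0xFFFFFFFF           # the adder: PC + 4, kept to 32 bits
    return instruction, next_pc


# Hypothetical instruction memory holding two example words.
imem = {0: 0x012A4820, 4: 0x8D280004}
inst, pc = fetch(0, imem)
```

The new PC value is computed during the cycle but, as the slide says, the PC register itself only takes it at the clock edge.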

What about R-type instructions?

These instructions take 3 registers as arguments:
  1 output register
  2 input registers

Example: add t1, t1, t2

Each register has a 5-bit code that can be extracted from the 32-bit instruction code

We need an input that contains data to be written into the output register
  Typically comes from the ALU

We need a Write signal to trigger the register write on the next clock edge
  A write anytime during the clock cycle could lead to race conditions if that register is also read

Let’s see how we can start representing the components to build this: Register File and ALU

Register File and ALU

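[Figure: the Register File has three 5-bit inputs (Read register 1, Read register 2, Write register), a 32-bit Write data input, a RegWrite control, and two 32-bit outputs (Read data 1, Read data 2). The ALU has two 32-bit inputs, a 4-bit Operation control, a 32-bit result output, and a zero output]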

Add t1, t1, t2 (sketch)

[Figure: the instruction supplies t1 and t2 as the read registers and t1 as the write register; Read data 1 and Read data 2 feed the ALU (4-bit Operation control, zero output), whose 32-bit result goes to Write data; RegWrite must be set only at the next edge]
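The sketch above can be traced in Python. The field positions (bits 25-21, 20-16, and 15-11) and the register numbers for t1 and t2 (9 and 10) are the standard MIPS conventions rather than something stated on the slide, and the function names are invented for this example.

```python
def decode_r_type(instruction):
    """Extract the three 5-bit register codes of an R-type instruction."""
    rs = (instruction >> 21) & 0x1F   # first input register
    rt = (instruction >> 16) & 0x1F   # second input register
    rd = (instruction >> 11) & 0x1F   # output register
    return rs, rt, rd


def execute_add(instruction, registers):
    """Return the register file contents after the next clock edge."""
    rs, rt, rd = decode_r_type(instruction)
    read_data_1, read_data_2 = registers[rs], registers[rt]
    result = (read_data_1 + read_data_2) & 0xFFFFFFFF   # ALU add, 32-bit wraparound
    new_registers = list(registers)    # reads above saw the old values
    new_registers[rd] = result         # the write takes effect at the edge
    return new_registers


regs = [0] * 32
regs[9], regs[10] = 7, 5                 # t1 = 7, t2 = 5
regs = execute_add(0x012A4820, regs)     # encoding of add t1, t1, t2
```

After the edge, register 9 (t1) holds 12. Building a new list instead of writing in place is one way to show that the read and the write of t1 do not race within the cycle.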

What about the Load/Store

lw t1, offset(t2)

The memory address is computed by adding the 16-bit signed offset to the input register
  Both the register file and the ALU are needed

The offset is 16 bits wide, but memory addresses are 32 bits wide
  Therefore, the offset must be sign-extended into a 32-bit value before being added to the input register

The memory has both read and write control

Let’s see how we depict the above on a figure

Implementing Load/Store

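[Figure: a sign-extend unit that turns a 16-bit input into a 32-bit output; a Data Memory with a 32-bit Address input, a 32-bit Write data input, a 32-bit Read data output, and the MemRead and MemWrite controls]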
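Sign extension and the address computation for a load or store can be sketched in Python as follows. The function names `sign_extend_16` and `effective_address` are made up for this example; values are kept as unsigned 32-bit integers.

```python
def sign_extend_16(value):
    """Extend a 16-bit two's-complement value to 32 bits."""
    value &= 0xFFFF
    if value & 0x8000:            # sign bit set: fill the upper 16 bits with 1s
        value |= 0xFFFF0000
    return value


def effective_address(base, offset16):
    """Address used by lw/sw: base register plus sign-extended offset."""
    return (base + sign_extend_16(offset16)) & 0xFFFFFFFF


forward = effective_address(0x1000, 0x0004)    # offset +4
backward = effective_address(0x1000, 0xFFFC)   # offset -4
```

Here `forward` is 0x1004 and `backward` is 0x0FFC, which shows why zero-extending the offset would give the wrong address for negative offsets.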

Implementing lw s1, offset(s2) (sketch)

[Figure: the instruction supplies s2 as a read register, s1 as the write register, and the 16-bit offset; the offset is sign-extended to 32 bits and added to the value read for s2 to form the Data Memory address; MemRead is set and MemWrite is not set; the 32-bit data read goes to the Register File’s Write data input, with RegWrite set on the next edge]

What about the Branch

beq t1, t2, offset
  Note that as humans we write a symbolic target (e.g., “next”)
  But the assembler transforms it into an offset

To do a branch we must:
  compute the branch’s target address based on its offset
  decide whether the branch is taken or not taken

Let’s see it on a figure

Implementing a Branch (sketch)

. . .
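The two things a branch must do can be sketched in Python. This assumes the usual MIPS convention, which the slide does not spell out: the 16-bit offset counts words and is relative to PC+4. The function names are invented for this example.

```python
def sign_extend_16(value):
    """Extend a 16-bit two's-complement value to 32 bits."""
    value &= 0xFFFF
    return value | 0xFFFF0000 if value & 0x8000 else value


def beq_next_pc(pc, reg_a, reg_b, offset16):
    """Return the next PC for beq reg_a, reg_b, offset."""
    zero = ((reg_a - reg_b) & 0xFFFFFFFF) == 0          # ALU subtracts; zero output means equal
    target = (pc + 4 + (sign_extend_16(offset16) << 2)) & 0xFFFFFFFF
    return target if zero else (pc + 4) & 0xFFFFFFFF


taken = beq_next_pc(0x100, 3, 3, 2)        # equal registers: branch taken
not_taken = beq_next_pc(0x100, 3, 4, 2)    # different registers: fall through
```

With these inputs, `taken` is 0x10C and `not_taken` is 0x104.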

Putting it all together

We can combine everything we’ve seen in a single datapath

The simplest design is one in which all instructions are executed in a single clock cycle
  Will probably be a pretty long clock cycle

In this case, every element of the datapath is used only once per clock cycle
  No duplication of hardware needed
    Or only of a few adders perhaps, here and there
    And we need separate Data and Instruction memories

Let’s first put together the pieces for the R-type (ALU) instructions and the memory instructions, as they are quite similar

(not quite) all together

We “simply” add multiplexers for choosing between the datapath for the ALU instructions and the datapath for the memory instructions (making sure we have logic to set all the control signals)

(almost) all together

Missing support for jumps

What now?

At this point we’ve identified most of the components of an almost full datapath for a very simple implementation of the MIPS ISA

Let us now design the logic that makes it all work
  i.e., how we set the control signals

The Control Unit

The Control Unit takes in the instruction opcode and sets a bunch of useful signals

Its operation is defined by a truth table

[Figure: Instruction bits [31-26] enter the Control Unit, which outputs control1 through control6]

opcode   c1  c2  c3  c4  c5  c6
000000   1   X   0   X   0   1
000001   1   1   X   0   0   0
. . .
111110   1   X   0   X   X   0
111111   0   0   X   0   1   X

X = don’t care

Control Unit

Let’s go through the types of control signals that need to be generated

An important set of signals is for the ALU

Our ALU has four control signals:

ALU controls   Function
0 0 0 0        AND
0 0 0 1        OR
0 0 1 0        add
0 1 1 0        subtract
0 1 1 1        set on less than
1 1 0 0        NOR
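The table of ALU control values can be turned into a small behavioral model. This Python sketch is only illustrative: the function names are made up, operands are unsigned 32-bit integers, and "set on less than" is modeled as a signed comparison, as in MIPS slt.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement number."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for the 4-bit control values in the table."""
    if control == 0b0000:
        result = a & b                                # AND
    elif control == 0b0001:
        result = a | b                                # OR
    elif control == 0b0010:
        result = (a + b) & MASK                       # add
    elif control == 0b0110:
        result = (a - b) & MASK                       # subtract
    elif control == 0b0111:
        result = int(to_signed(a) < to_signed(b))     # set on less than
    elif control == 0b1100:
        result = ~(a | b) & MASK                      # NOR
    else:
        raise ValueError("unused ALU control value")
    return result, int(result == 0)


difference, zero = alu(0b0110, 5, 5)   # subtract equal values
```

Subtracting equal values gives a result of 0 with the zero output set to 1, which is what a branch-on-equal relies on.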

Controlling the ALU

Depending on the instruction, the ALU will have to do different things
  For Load/Store: the ALU needs to add
  For R-type instructions: depends on the 6-bit function field in the low-order bits of the instruction (Remember Chapter 2)
  For branch: the ALU needs to subtract

We can generate the 4-bit ALU control using a small control unit that takes:
  2 control bits called ALUOp
    add (00), sub (01), depends (10)
  the instruction’s function field

We have a simple truth table to obtain ALUOp from the opcode

Figures 5.12 and 5.13 show how we obtain a final truth table

The truth table can be implemented with a few AND, OR, and NOT gates
  See ICS313 for how to build this
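The small ALU control unit can be sketched as a lookup in Python. The function-field encodings used below are the standard MIPS values (they are not listed on the slide), and the names `FUNCT_TO_ALU` and `alu_control` are invented for this example. Real hardware uses a gate-level truth table with don't-cares instead of a dictionary.

```python
# Function field (low 6 bits of an R-type instruction) -> 4-bit ALU control.
FUNCT_TO_ALU = {
    0b100000: 0b0010,   # add
    0b100010: 0b0110,   # sub
    0b100100: 0b0000,   # and
    0b100101: 0b0001,   # or
    0b101010: 0b0111,   # slt
}


def alu_control(alu_op, funct):
    """Combine the 2-bit ALUOp with the function field."""
    if alu_op == 0b00:          # load/store: always add
        return 0b0010
    if alu_op == 0b01:          # branch: always subtract
        return 0b0110
    if alu_op == 0b10:          # R-type: depends on the function field
        return FUNCT_TO_ALU[funct]
    raise ValueError("unused ALUOp value")


for_load = alu_control(0b00, 0b000000)     # function field is ignored
for_slt = alu_control(0b10, 0b101010)
```

Here `for_load` is 0b0010 (add) and `for_slt` is 0b0111 (set on less than).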

All Control Lines

The Control Unit

Datapath in use for R-type

Datapath in use for Load

Datapath in use for a beq

Setting of control lines

Inst.     RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format  1       0       0         1         0        0         0       1       0
lw        0       1       1         1         1        0         0       0       0
sw        X       1       X         0         0        1         0       0       0
beq       X       0       X         0         0        0         1       0       1

Truth table based on opcode

Signal      R-format  lw  sw  beq

Input
  Op5       0         1   1   0
  Op4       0         0   0   0
  Op3       0         0   1   0
  Op2       0         0   0   1
  Op1       0         1   1   0
  Op0       0         1   1   0

Output
  RegDst    1         0   X   X
  ALUSrc    0         1   1   0
  MemtoReg  0         1   X   X
  RegWrite  1         1   0   0
  MemRead   0         1   0   0
  MemWrite  0         0   1   0
  Branch    0         0   0   1
  ALUOp1    1         0   0   0
  ALUOp0    0         0   0   1
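The truth table can be written down as a Python lookup keyed by the 6-bit opcode. This is a behavioral sketch only: the names `CONTROL` and `control_unit` are made up, don't-care entries are represented as `None`, and the two ALUOp bits are merged into one 2-bit value.

```python
X = None   # don't care

# opcode (bits 31-26) -> control signal values, copied from the truth table
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    0b101011: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    0b000100: dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def control_unit(instruction):
    """Return the control signals for a 32-bit instruction word."""
    opcode = (instruction >> 26) & 0x3F
    return CONTROL[opcode]


signals = control_unit(0x8D280004)   # an lw instruction (opcode 100011)
```

For the lw word above, `signals` has MemRead, MemtoReg, and RegWrite set to 1 and MemWrite set to 0.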

Implementing jump

The jump instruction is actually very simple: the target address is the concatenation of
  The upper 4 bits of the current PC+4
  The 26 bits of the instruction’s immediate field
  and 00

So we can simply do this in hardware and use an extra multiplexer to pick the desired address
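The concatenation can be written with shifts and masks. The Python sketch below is illustrative; the function name `jump_target` and the example instruction word are made up for this example.

```python
def jump_target(pc, instruction):
    """Build the 32-bit jump address from PC+4 and the 26-bit immediate field."""
    upper = (pc + 4) & 0xF0000000          # upper 4 bits of PC+4
    immediate = instruction & 0x03FFFFFF   # low 26 bits of the instruction
    return upper | (immediate << 2)        # shifting left by 2 appends the two 0 bits


# A jump at address 0x10000000 whose immediate field is 4.
target = jump_target(0x10000000, 0x08000004)
```

With these inputs, `target` is 0x10000010: the upper 4 bits come from PC+4 and the immediate 4 becomes byte address 0x10.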

Implementing jump

Single-Cycle Implementation

The design we just developed is very simple, which is good

But it is terribly inefficient
  Each instruction takes one cycle, so the cycle time is that needed by the longest instruction
  The load uses five functional units in series: instruction memory, register file, ALU, data memory, register file

This violates the “common case fast” principle

The single-cycle approach was abandoned a long time ago

Instead, it is better to use multiple shorter clock cycles for the instructions

Multi-Cycle Implementation

In the interest of time, we’ll just describe this at a very, very high level, without showing hardware diagrams

The idea is to have the functional units we’ve seen before and a set of additional registers to hold important values in between the cycles of a single instruction

This way a functional unit can be shared between cycles of the same instruction, provided some multiplexers are added to decide where the input should come from

From a functional unit? From one of the additional registers?

Question: How do we split instructions?

Multi-Cycle Instructions

We need to think of instructions as running in multiple cycles

At each cycle we need to identify which functional units an instruction must use

For MIPS, we can think of the instruction running in 5 1-cycle stages:
  Instruction fetch (IF)
  Instruction decode (ID)
  Execution (EX)
  Memory access (Mem)
  Write back (WB)

These stages do more than what their names imply

IF: Instruction Fetch

Fetch the instruction from memory into the Instruction Register (IR) and compute PC+4

In this step we don’t know yet what the instruction does

ID: Instruction Decode

Read the register names (perhaps) specified in the instruction code and read their values from the register file into temporary registers
  It may be that we won’t need them, but this can’t hurt
  Can all be done at once because MIPS uses fixed encoding, so we know where the register names are

Compute the branch address with the ALU and save it in a temporary register
  Just in case the instruction is a branch

Do needed sign extensions

The Control unit sets a bunch of controls based on the opcode of the instruction being decoded

EX: Execution

If the instruction is a load/store
  ALU adds operands (register and immediate value read in the previous stage) to obtain an address

If the instruction is an R-type instruction
  ALU performs whatever operation is needed on the operands: registers read in the previous stage

If the instruction is a branch
  ALU does the “equal” comparison between the two registers read in the previous stage

If the instruction is a jump
  The PC is replaced by the jump address

The above also sets useful control signals

Mem: Memory

If the instruction is a load:
  Data is retrieved from memory and stored into a temporary register

If the instruction is a store:
  Data is written to memory

If the instruction is an R-type instruction:
  Place the result from the ALU into a temporary register

The above also sets useful control signals

WB: Write-back

Write back into the register file the results obtained in the previous steps
  Data from memory on a load
  ALU result in an R-type instruction

The schematic view

Stage  Uses
IF     the memory
ID     the register file
EX     the ALU
Mem    the memory
WB     the register file

Very important to remember the content of this slide

Conclusion

We haven’t dived into the gory details of implementing a multi-cycle processor

This will be saved for a future lecture
