georgiana-watts
Computer Architecture
The Processor: Datapath and Control
The Processor
We are going to see how the processor is implemented, starting with a very simple processor and adding some more complexity
This processor implements a subset of the MIPS instruction set:
  Memory-reference instructions: lw and sw
  The ALU instructions: add, sub, or, slt
  Control flow instructions: beq and j
We’ll see how implementation choices affect the performance characteristics of the machine: clock rate, CPI
10,000ft View
Three steps
  Send the PC to the memory to load the next instruction from memory (32 bits)
  Read 0, 1 or 2 registers, using the corresponding fields of the instruction to know which one(s) to read, if any
  Execute the instruction
Luckily there are some commonalities: most instructions need to use the ALU
  To calculate a numerical value, or to calculate a memory address (@)
Of course there are differences
  Only 2 instructions (lw and sw) need to access the data memory
  The store does not write into registers
  Only the branch and jump set the PC to something other than PC+4
Our Processor, sort of ...
To implement these steps, our processor will have 5 main components
  The PC register
  The memory from which instructions are loaded
  The memory that data is loaded from and stored to
  The ALU
  The Register file
The trick is to interconnect them in a way that’s useful, cheap, and fast
Note that above we distinguish two memories, which conceptually goes against the Von Neumann architecture model
  Let’s just go with this for now
  Note that we have separate Instruction and Data caches anyway!
What’s missing
  How to combine inputs that are “joined” together?
  How to tell each component what to do?
Multiplexers and Controllers
In the previous figure we have two or more “wires” going into the input of a component
This is because depending on the instruction being executed different input should be provided
So, based on the instruction, we need to decide which input should be selected
This is done with a multiplexer
[Figure: a MUX with inputs 1 through n, one selected output, and a control input of ceil(log2(n)) bits]
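To make the multiplexer concrete, here is a minimal Python sketch. The function names `control_width` and `mux` are made up for illustration; the only thing taken from the slide is that selecting among n inputs takes ceil(log2(n)) control bits.

```python
def control_width(n_inputs):
    """Number of control bits needed to select one of n_inputs inputs.

    For n_inputs >= 1, (n_inputs - 1).bit_length() equals ceil(log2(n_inputs))
    and avoids floating-point rounding.
    """
    return (n_inputs - 1).bit_length()


def mux(inputs, select):
    """Return the input chosen by the control value `select` (0-based)."""
    return inputs[select]


# A 4-input MUX needs 2 control bits; control value 2 picks the third input.
print(control_width(4), mux([10, 20, 30, 40], 2))
```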
What about the Control?
So great, now we can control multiplexers
Besides, there are other things to control
  Example: the ALU has a bunch of control bits that tell it what to do:
  2-bit control
    00: ADD
    01: SUB
    10: MUL
    11: SHIFT
The Control unit
We need a controller that sends the appropriate control bits to all the multiplexers and the components
The control unit sends these signals based on the nature of the instruction
It uses the bits of the opcode to infer the appropriate control bits
Example:
  Say that a branch has an opcode of the form XXXXX1, and that all other opcodes are of the form XXXXX0
  Then, the control can decide whether the PC should come from (current PC+4) or from the output of the ALU
Let’s show this on a figure (that makes _many_ simplifying assumptions, which we will clear up in what follows)
Control Unit (Simplified) Example
[Figure: the instruction register, the PC, an adder computing PC + 4, the offset, and a MUX with inputs 0 and 1 selected by a control bit (0 or 1)]
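The simplified example can be sketched in a few lines of Python. This assumes the made-up opcode convention from the example (low bit 1 means branch), and `next_pc` is a hypothetical name, not part of any real design.

```python
def next_pc(pc, opcode, alu_output):
    """Pick the next PC with a 2-input MUX driven by the opcode's low bit.

    Assumption (from the simplified example): branch opcodes look like
    XXXXX1, every other opcode looks like XXXXX0.
    """
    select = opcode & 1                 # the 1-bit control signal
    mux_inputs = [pc + 4, alu_output]   # input 0, input 1
    return mux_inputs[select]


print(next_pc(100, 0b000010, 400))  # not a branch: falls through to PC + 4
print(next_pc(100, 0b000011, 400))  # branch: takes the ALU output
```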
A more complete picture (5.2)
Logic Design Convention
In everything we’ve talked about so far there was no well-defined notion of time
  Although we all know there is such a thing as a clock
Let’s review some elements of logic design
  See Section 5.2 and Appendix B for more details if needed (some 331 things)
Logic design uses two kinds of elements
  Combinational elements: elements whose output depends only on inputs
    They always give the same output for the same inputs
    There is no notion of internal storage, state, etc.
    They simply look at the voltage on the input lines and always produce a given voltage on the output line
  State elements: elements that have a state
State Elements
State elements are used for things like registers and memories
  Conceptually the same
A state element has at least two inputs and one output
  Input:
    The value to be written into the element
    The clock: determines when the data is written
  Output:
    The value that was written in a previous clock cycle (the state element can be read at any time)
Let’s try to understand how the clock works
The Clock
The clock cycle/period is divided into two portions
  high clock
  low clock
We use edge-triggered clocking, meaning that state changes (in state elements) occur at a clock edge
  Using either the rising edge or the falling edge
[Figure: one clock cycle, with the rising edge and the falling edge marked]
The Clock
In the above, we want to use the value in state element #1 to modify the value in state element #2:
  It takes one cycle
  We need all signals to be stabilized
[Figure: over one clock cycle, state element #1 (stable) feeds a combinational circuit (stable by the edge), whose output is written into state element #2 (updated on the edge)]
The Clock
The nice thing about the previous system is that since we know that all state elements are updated on a clock edge, we can (sort of) ignore the clock signal and just know we’re using edge-triggered clocking
Some state elements of course are not always updated at every cycle!
What we do then, is AND the clock signal with some control bit, and pass that as the second input to the state element
Assuming a rising edge update:
  While the control bit stays at 0, nothing happens
  If we set the control bit to 1, the state element will be updated at the next rising edge
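The gated clock can be imitated in software. The sketch below is only a behavioral model under simple assumptions: `StateElement` and `tick` are invented names, the clock is sampled as 0 or 1, and the control bit is assumed to change only while the clock is low (otherwise the AND itself would create a spurious edge).

```python
class StateElement:
    """Rising-edge-triggered storage with a write-enable control bit."""

    def __init__(self, value=0):
        self.value = value
        self._prev_gated = 0  # previous sample of (clock AND control bit)

    def tick(self, clock, write_enable, data_in):
        """Sample the inputs once; update only on a rising gated-clock edge."""
        gated = clock & write_enable
        if self._prev_gated == 0 and gated == 1:
            self.value = data_in
        self._prev_gated = gated
        return self.value


reg = StateElement(5)
reg.tick(0, 0, 9)
reg.tick(1, 0, 9)   # rising edge, but control bit is 0: nothing happens
reg.tick(0, 1, 9)   # control bit set while the clock is low
reg.tick(1, 1, 9)   # next rising edge: the element is updated
print(reg.value)
```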
Read/Write in a Clock Cycle
A great implication of edge-triggered clocking: a state element can be read and written in the same clock cycle
  No race condition (i.e., non-deterministic behavior)
We will say things like: “reads happen in the first half of the clock cycle, writes happen in the second half”
You can read S’s state at the rising edge, and have it be updated at the next rising edge
[Figure: same setup as before (state element #1, combinational circuit, state element #2), with state element #2 being read during the cycle and updated on the edge]
Busses and bus width
Many of the state elements and combinational elements take multi-bit inputs (often 32-bit inputs)
The term “bus” refers to a wire that carries more than one bit
  multiple 1-bit wires, really
We simply indicate the width of the busses as follows:
[Figure: wires annotated with their widths (16 and 8), plus a control signal]
Building a Datapath
A datapath element is an element in the processor that operates on or holds data
  instruction memory, data memory, register file, ALU, adders
Let’s re-examine the datapath elements we only barely introduced earlier
Fetching Instructions
[Figure: the 32-bit PC supplies the read address to the Instruction Memory, which outputs the 32-bit instruction; an adder computes PC + 4]
The PC gets updated in 1 clock cycle because we use edge-triggered clocking
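As a rough software analogy of the fetch step, the sketch below assumes the instruction memory is a Python list of 32-bit words indexed by word number, and that the PC is a byte address that is a multiple of 4. The name `fetch` is hypothetical.

```python
def fetch(instruction_memory, pc):
    """Return the instruction at byte address `pc` and the next PC (PC + 4)."""
    instruction = instruction_memory[pc // 4]  # 4 bytes per instruction
    return instruction, pc + 4


memory = [0x012A4820, 0x8C000000]  # two arbitrary instruction words
instruction, pc = fetch(memory, 4)
print(hex(instruction), pc)
```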
What about R-type instructions?
These instructions take 3 registers as arguments:
  1 output register
  2 input registers
  Example: add t1, t1, t2
Each register has a 5-bit code that can be extracted from the 32-bit instruction code
We need an input that contains the data to be written into the output register
  Typically comes from the ALU
We need a Write signal to trigger the register write on the next clock edge
  A write anytime during the clock cycle could lead to race conditions if that register is also read
Let’s see how we can start representing the components to build this: Register File and ALU
Register File and ALU
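[Figure: the Register File has three 5-bit inputs (Read register 1, Read register 2, Write register), a 32-bit Write data input, a RegWrite control, and two 32-bit outputs (Read data 1, Read data 2); the ALU takes two 32-bit inputs and a 4-bit Operation control, and produces a 32-bit result and a zero output]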
Add t1, t1, t2 (sketch)
[Figure: the same Register File and ALU wired for add t1, t1, t2: fields of the instruction select t1 and t2 as the read registers and t1 as the write register; the ALU result feeds Write data; RegWrite must be set only at the next edge]
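The sketch can be mirrored in software. The field positions below are the standard MIPS R-type layout (rs in bits 25-21, rt in bits 20-16, rd in bits 15-11, funct in bits 5-0); the function names and the list-based register file are illustrative assumptions. In MIPS register numbering, t1 is register 9 and t2 is register 10, so add t1, t1, t2 encodes as 0x012A4820.

```python
def decode_r_type(instruction):
    """Extract the 5-bit register codes and the 6-bit function field."""
    return {
        "rs": (instruction >> 21) & 0x1F,  # first input register
        "rt": (instruction >> 16) & 0x1F,  # second input register
        "rd": (instruction >> 11) & 0x1F,  # output register
        "funct": instruction & 0x3F,
    }


def execute_add(registers, instruction):
    """Return a new register file after an R-type add (32-bit wraparound)."""
    fields = decode_r_type(instruction)
    result = (registers[fields["rs"]] + registers[fields["rt"]]) & 0xFFFFFFFF
    updated = list(registers)       # the write takes effect "at the next edge"
    updated[fields["rd"]] = result
    return updated


regs = [0] * 32
regs[9], regs[10] = 5, 7            # t1 = 5, t2 = 7
print(execute_add(regs, 0x012A4820)[9])
```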
What about the Load/Store
lw t1, offset(t2)
  The memory address (@) is computed by adding the 16-bit signed offset to the input register
  Both the register file and the ALU are needed
The offset is 16-bit, but memory addresses are 32-bit
  Therefore, the offset must be sign-extended into a 32-bit value before being added to the input register
The memory has both read and write control
Let’s see how we depict the above on a figure
Implementing Load/Store
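[Figure: a sign-extend unit (16 bits in, 32 bits out) and the Data Memory, with 32-bit Address and Write data inputs, a 32-bit Read data output, and MemRead and MemWrite controls]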
Implementing lw s1, offset(s2) (sketch)
[Figure: the instruction supplies s2 (read register), s1 (write register), and the offset; the sign-extended offset is added to the value of s2 to form the Data Memory address; MemRead is set and MemWrite is not set; Read data feeds the Register File’s Write data, with RegWrite set on the next edge]
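The address computation for a load or store can be written out as a short Python sketch. The function names are made up, and 32-bit values are modeled as unsigned Python integers masked to 32 bits.

```python
def sign_extend_16(value):
    """Sign-extend a 16-bit value to 32 bits (result as an unsigned int)."""
    value &= 0xFFFF
    if value & 0x8000:          # sign bit set: fill the upper 16 bits with 1s
        value |= 0xFFFF0000
    return value


def effective_address(base, offset16):
    """Address = register value + sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend_16(offset16)) & 0xFFFFFFFF


print(hex(effective_address(0x1000, 8)))       # positive offset
print(hex(effective_address(0x1000, 0xFFFC)))  # offset of -4
```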
What about the Branch
beq t1, t2, offset
  Note that as humans we write a symbolic target (e.g., “next”)
  But the assembler transforms it into an offset
To do a branch we must
  compute the branch’s target address based on its offset
  decide whether the branch is taken or not taken
Let’s see it on a figure
Implementing a Branch (sketch)
. . .
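The two things a branch must do can be sketched as follows. One detail is not spelled out on the slide and is added here from the MIPS definition of beq: the offset counts instructions, so it is shifted left by 2 and added to PC + 4. The name `branch_next_pc` is hypothetical.

```python
def branch_next_pc(pc, reg_a, reg_b, offset16):
    """Return the next PC for beq: the target if taken, else PC + 4."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF

    # Interpret the 16-bit field as a signed number of instructions.
    offset = offset16 & 0xFFFF
    if offset & 0x8000:
        offset -= 0x10000
    target = (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF

    # The ALU subtracts; its zero output says whether the registers are equal.
    zero = ((reg_a - reg_b) & 0xFFFFFFFF) == 0
    return target if zero else pc_plus_4


print(hex(branch_next_pc(0x100, 3, 3, 2)))  # taken
print(hex(branch_next_pc(0x100, 3, 4, 2)))  # not taken
```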
Putting it all together
We can combine everything we’ve seen in a single datapath
The simplest design is one in which all instructions are executed in a single clock cycle
  Will probably be a pretty long clock cycle
In this case, every element of the datapath is used only once per clock cycle
  No duplication of hardware needed, except perhaps a few adders here and there
  And we need separate Data and Instruction memories
Let’s at first put together the pieces for the R-type (ALU) instructions and the memory instructions, as they are quite similar
(not quite) all together
We “simply” add multiplexers for choosing between the datapath for the ALU instructions and the memory instructions (making sure we have logic to set all the control signals)
(almost) all together
missing support for jumps
What now?
At this point we’ve identified most of the components of an almost full datapath for a very simple implementation of the MIPS ISA
Let us now design the logic that makes it all work
  i.e., how we set the control signals
The Control Unit
The Control Unit takes in the instruction opcode and sets a bunch of useful signals
Its operation is defined by a truth table
[Figure: Instruction [31-26] feeds the Control Unit, which outputs control1 through control6]
opcode  c1  c2  c3  c4  c5  c6
000000  1   X   0   X   0   1
000001  1   1   X   0   0   0
. . .
111110  1   X   0   X   X   0
111111  0   0   X   0   1   X
X = don’t care
Control Unit
Let’s go through the types of control signals that need to be generated
An important set of signals is for the ALU
Our ALU has four control signals:
ALU controls  Function
0 0 0 0       AND
0 0 0 1       OR
0 0 1 0       add
0 1 1 0       subtract
0 1 1 1       set on less than
1 1 0 0       NOR
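The table above translates directly into a small behavioral model. This is only a sketch: `alu` and `to_signed` are invented names, values are 32-bit unsigned Python integers, and "set on less than" is modeled as a signed comparison.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for the 4-bit ALU control values in the table."""
    if control == 0b0000:
        result = a & b
    elif control == 0b0001:
        result = a | b
    elif control == 0b0010:
        result = a + b
    elif control == 0b0110:
        result = a - b
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    elif control == 0b1100:
        result = ~(a | b)
    else:
        raise ValueError("ALU control value not in the table")
    result &= MASK                      # keep 32 bits
    return result, int(result == 0)     # zero output is 1 when result is 0


print(alu(0b0010, 2, 3), alu(0b0110, 7, 7))
```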
Controlling the ALU
Depending on the instruction, the ALU will have to do different things
  For Load/Store: the ALU needs to add
  For R-type instructions: depends on the 6-bit function field in the low-order bits of the instruction (Remember Chapter 2)
  For branch: the ALU needs to subtract
We can generate the 4-bit ALU control using a small control unit that takes:
  2 control bits called ALUOp
    add (00), sub (01), depends (10)
  the instruction’s function field
We have a simple truth table to obtain ALUOp from the opcode
Figure 5.12 and 5.13 show how we obtain a final truth table
The truth table can be implemented with a few AND, OR, and NOT gates
  See ICS313 for how to build this
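Here is one way to write the small ALU control unit as a lookup, as a sketch rather than the gate-level design. The function field values are the standard MIPS encodings (add 100000, sub 100010, and 100100, or 100101, slt 101010); the names `alu_control` and `FUNCT_TO_ALU` are made up.

```python
# R-type function field -> 4-bit ALU control (standard MIPS funct codes).
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}


def alu_control(alu_op, funct):
    """Combine the 2-bit ALUOp with the function field."""
    if alu_op == 0b00:      # load/store: the ALU adds
        return 0b0010
    if alu_op == 0b01:      # branch: the ALU subtracts
        return 0b0110
    if alu_op == 0b10:      # R-type: depends on the function field
        return FUNCT_TO_ALU[funct]
    raise ValueError("unused ALUOp value")


print(format(alu_control(0b10, 0b101010), "04b"))
```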
All Control Lines
The Control Unit
Datapath in use for R-type
Datapath in use for Load
Datapath in use for a beq
Setting of control lines
Inst.     RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp2
R-format  1       0       0         1         0        0         0       1       0
lw        0       1       1         1         1        0         0       0       0
sw        X       1       X         0         0        1         0       0       0
beq       X       0       X         0         0        0         1       0       1
Truth table based on opcode
Signal R-format lw sw beq
Input
Op5 0 1 1 0
Op4 0 0 0 0
Op3 0 0 1 0
Op2 0 0 0 1
Op1 0 1 1 0
Op0 0 1 1 0
Output
RegDst 1 0 X X
ALUSrc 0 1 1 0
MemtoReg 0 1 X X
RegWrite 1 1 0 0
MemRead 0 1 0 0
MemWrite 0 0 1 0
Branch 0 0 0 1
ALUOp1 1 0 0 0
ALUOp2 0 0 0 1
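The truth table can be read as a lookup from the opcode (instruction bits 31-26) to the control lines. The sketch below is one possible software rendering: don’t-care entries (X) are stored as None, and the two ALUOp columns are packed into a single 2-bit value in the order they appear in the table. `CONTROL` and `main_control` are hypothetical names.

```python
def signals(reg_dst, alu_src, mem_to_reg, reg_write,
            mem_read, mem_write, branch, alu_op):
    """Bundle one column of the truth table into a dictionary."""
    return {
        "RegDst": reg_dst, "ALUSrc": alu_src, "MemtoReg": mem_to_reg,
        "RegWrite": reg_write, "MemRead": mem_read, "MemWrite": mem_write,
        "Branch": branch, "ALUOp": alu_op,
    }


X = None  # don't care
CONTROL = {
    0b000000: signals(1, 0, 0, 1, 0, 0, 0, 0b10),  # R-format
    0b100011: signals(0, 1, 1, 1, 1, 0, 0, 0b00),  # lw
    0b101011: signals(X, 1, X, 0, 0, 1, 0, 0b00),  # sw
    0b000100: signals(X, 0, X, 0, 0, 0, 1, 0b01),  # beq
}


def main_control(instruction):
    """Look up the control lines from Op5..Op0 (instruction bits 31-26)."""
    opcode = (instruction >> 26) & 0x3F
    return CONTROL[opcode]


print(main_control(0x8C000000))  # an lw instruction word
```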
Implementing jump
The jump instruction is actually very simple: the target address is the concatenation of
  The upper 4 bits of the current PC+4
  The 26 bits of the instruction’s immediate field
  and 00
So we can simply do this in hardware and use an extra multiplexer to pick the desired address
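The concatenation is easy to check with a few bit operations. A minimal sketch, with `jump_target` as an invented name:

```python
def jump_target(pc, instruction):
    """Concatenate: upper 4 bits of PC + 4, the 26-bit field, then 00."""
    upper = (pc + 4) & 0xF0000000      # upper 4 bits of PC + 4
    field = instruction & 0x03FFFFFF   # low 26 bits of the instruction
    return upper | (field << 2)        # shifting left by 2 appends 00


print(hex(jump_target(0x00400000, 0x08100004)))
```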
Single-Cycle Implementation
The design we just developed is very simple, which is good
But it is terribly inefficient
  Each instruction takes a cycle, so the cycle time is that needed by the longest instruction
  The load uses five functional units in series: instruction memory, register file, ALU, data memory, register file
This violates the “common case fast” principle
The single-cycle approach was abandoned a long time ago
Instead, it is better to use multiple shorter clock cycles for the instructions
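To see why the longest instruction sets the cycle time, here is a small calculation. The latencies are made-up round numbers chosen only for illustration, and the unit sequences for sw, R-type, and beq are assumptions inferred from the datapath description (only the load’s sequence is stated above).

```python
# Hypothetical latencies in picoseconds; not measurements of any real design.
LATENCY_PS = {"imem": 200, "regread": 50, "alu": 100, "dmem": 200, "regwrite": 50}

# Functional units each instruction class goes through, in series.
PATHS = {
    "lw": ["imem", "regread", "alu", "dmem", "regwrite"],
    "sw": ["imem", "regread", "alu", "dmem"],
    "R-type": ["imem", "regread", "alu", "regwrite"],
    "beq": ["imem", "regread", "alu"],
}


def path_delay(kind):
    """Total delay of one instruction class."""
    return sum(LATENCY_PS[unit] for unit in PATHS[kind])


def single_cycle_clock():
    """The clock period must cover the slowest instruction."""
    return max(path_delay(kind) for kind in PATHS)


for kind in PATHS:
    print(kind, path_delay(kind))
print("clock period:", single_cycle_clock())
```

With these numbers every instruction pays for the load’s full path, even a beq that would need far less time.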
Multi-Cycle Implementation
In the interest of time, we’ll just describe this at a very, very high level, without showing hardware diagrams
The idea is to have the functional units we’ve seen before and a set of additional registers to hold important values in between the cycles of a single instruction
This way a functional unit can be shared between cycles of the same instruction, provided some multiplexers are added to decide where the input should come from
  From a functional unit?
  From one of the additional registers?
Question: How do we split instructions?
Multi-Cycle Instructions
We need to think of instructions as running in multiple cycles
At each cycle we need to identify which functional units an instruction must use
For MIPS, we can think of the instruction running in 5 1-cycle stages:
  Instruction fetch (IF)
  Instruction decode (ID)
  Execution (EX)
  Memory access (Mem)
  Write back (WB)
These stages do more than what their names imply
IF: Instruction Fetch
Fetch the instruction from memory into the Instruction Register (IR) and compute PC+4
In this step we don’t know yet what the instruction does
ID: Instruction Decode
Read the register names (perhaps) specified in the instruction code and read their values from the register file into temporary registers
  It may be that we won’t need them, but this can’t hurt
  Can all be done at once because MIPS uses fixed encoding, so we know where the register names are
Compute the branch address with the ALU and save it in a temporary register
  Just in case the instruction is a branch
Do needed sign extensions
The Control unit sets a bunch of controls based on the opcode of the instruction being decoded
EX: Execution
If the instruction is a load/store
  ALU adds operands (register and immediate value read in the previous stage) to obtain a memory address (@)
If the instruction is an R-type instruction
  ALU performs whatever operation is needed on the operands: registers read in the previous stage
If the instruction is a branch
  ALU does the “equal” comparison between the two registers read in the previous stage
If the instruction is a jump
  The PC is replaced by the jump address (@)
The above also sets useful control signals
Mem: Memory
If the instruction is a load:
  Data is retrieved from memory and stored into a temporary register
If the instruction is a store:
  Data is written to memory
If the instruction is an R-type instruction:
  Place the result from the ALU into a temporary register
The above sets useful control signals
WB: Write-back
Write back into the register file the results obtained in the previous steps
  Data from memory on a load
  ALU result in an R-type instruction
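The stage descriptions can be summarized as the last stage in which each instruction class still does work. This is a reading of the stages as described here (R-type instructions pass through Mem and finish in WB, stores finish in Mem), plus the assumption that branches and jumps are done after EX. `LAST_STAGE`, `stages_used`, and `cycles` are illustrative names.

```python
STAGES = ["IF", "ID", "EX", "Mem", "WB"]

# Last stage in which each instruction class does something
# (assumption: branches and jumps complete in EX).
LAST_STAGE = {"lw": "WB", "R-type": "WB", "sw": "Mem", "beq": "EX", "j": "EX"}


def stages_used(kind):
    """Stages an instruction goes through, one cycle per stage."""
    return STAGES[: STAGES.index(LAST_STAGE[kind]) + 1]


def cycles(kind):
    """Number of cycles the instruction takes under this model."""
    return len(stages_used(kind))


for kind in LAST_STAGE:
    print(kind, cycles(kind), stages_used(kind))
```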
The schematic view
IF: uses the memory
ID: uses the register file
EX: uses the ALU
Mem: uses the memory
WB: uses the register file
Very important to remember the content of this slide
Conclusion
We haven’t dived into the gory details of implementing a multi-cycle processor
This will be saved for a future lecture