Single-Cycle Processors: Datapath & Control
Arvind
Computer Science & Artificial Intelligence Lab
M.I.T.
Based on the material prepared by Arvind and Krste Asanovic
6.823 L5- 2
Instruction Set Architecture (ISA) versus Implementation
• ISA is the hardware/software interface
  – Defines set of programmer-visible state
  – Defines instruction format (bit encoding) and instruction semantics
  – Examples: MIPS, x86, IBM 360, JVM
• Many possible implementations of one ISA
  – 360 implementations: model 30 (c. 1964), z900 (c. 2001)
  – x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon, Transmeta Crusoe, SoftPC
  – MIPS implementations: R2000, R4000, R10000, ...
  – JVM: HotSpot, PicoJava, ARM Jazelle, ...
September 26, 2005
Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
– Instructions per program depends on source code, compiler technology, and ISA
– Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
– Time per cycle depends upon the microarchitecture and the base technology
this lecture
Microarchitecture           CPI    cycle time
Microcoded                  >1     short
Single-cycle unpipelined    1      long
Pipelined                   1      short
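As a rough illustration of the performance equation, the sketch below multiplies the three factors for the three microarchitecture styles in the table. The instruction count, CPI values, and cycle times are made-up numbers chosen only to mirror the table's qualitative entries (CPI >1 or 1, short or long cycle); they are not measurements.

```python
def execution_time(instructions, cpi, cycle_time_ns):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cpi * cycle_time_ns


# Hypothetical values, for illustration only.
INSTRUCTIONS = 1_000_000
designs = {
    "Microcoded": {"cpi": 4.0, "cycle_time_ns": 2.0},                 # CPI > 1, short cycle
    "Single-cycle unpipelined": {"cpi": 1.0, "cycle_time_ns": 10.0},  # CPI = 1, long cycle
    "Pipelined": {"cpi": 1.0, "cycle_time_ns": 2.0},                  # CPI = 1, short cycle
}

times_ns = {
    name: execution_time(INSTRUCTIONS, d["cpi"], d["cycle_time_ns"])
    for name, d in designs.items()
}

for name, t in times_ns.items():
    print(f"{name}: {t / 1e6:.1f} ms")
```

With these assumed numbers the pipelined design wins because it keeps CPI at 1 while also keeping the cycle short.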
Microarchitecture: Implementation of an ISA
[Figure: a Controller connected to the Data path; control points run from the Controller to the Data path, and status lines run back]
Structure: how components are connected (static)
Behavior: how data moves between components (dynamic)
Hardware Elements
• Combinational circuits
  – Mux, Demux, Decoder, ALU, ...
  – ALU OpSelect: Add, Sub, ...; And, Or, Xor, Not, ...; GT, LT, EQ, Zero, ...
[Figure: Mux (inputs A0, A1, ..., An-1; lg(n)-bit Sel; output O), Demux (input A; lg(n)-bit Sel; outputs O0, O1, ..., On-1), Decoder (lg(n)-bit input A; outputs O0, O1, ..., On-1), ALU (inputs A and B; OpSelect; outputs Result and Comp?)]
• Synchronous state elements
  – Flipflop, Register, Register file, SRAM, DRAM
[Figure: a flip-flop (ff) with input D, output Q, clock Clk, and enable En; a register built from n flip-flops (D0/Q0, D1/Q1, D2/Q2, ..., Dn-1/Qn-1) that share Clk and En]
Edge-triggered: Data is sampled at the rising edge
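The hardware elements listed above can be mimicked in software. The following is a minimal behavioral sketch, not a hardware description: the function and class names are hypothetical, signals are modeled as plain Python integers, and the `Register` detects a rising edge by remembering the previous clock value.

```python
def mux(inputs, sel):
    """n-to-1 multiplexer: output O is the input chosen by Sel."""
    return inputs[sel]


def decoder(a, n):
    """lg(n)-to-n decoder: exactly one of the n outputs is 1."""
    return [1 if i == a else 0 for i in range(n)]


def demux(a, sel, n):
    """1-to-n demultiplexer: input A is routed to output Sel; the others are 0."""
    return [a if i == sel else 0 for i in range(n)]


class Register:
    """Edge-triggered register with enable: D is sampled only at a rising clock edge."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def tick(self, clk, d, en=1):
        rising = self._prev_clk == 0 and clk == 1
        if rising and en:
            self.q = d
        self._prev_clk = clk
        return self.q


r = Register()
r.tick(clk=0, d=7)          # clock low: Q keeps its old value
q_after_edge = r.tick(clk=1, d=7)   # rising edge: Q takes the value of D
q_while_high = r.tick(clk=1, d=9)   # clock still high, no edge: Q unchanged
print(q_after_edge, q_while_high)
```

Changing D while the clock stays high has no effect on Q, which is the property the "edge-triggered" remark describes.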
Register Files
[Figure: a 2R+1W register file holding 32 registers of 32 bits each (register 0, register 1, ..., register 31). Inputs: Clock (clk), WE (we), 5-bit ReadSel1 (rs1), 5-bit ReadSel2 (rs2), 5-bit WriteSel (ws), 32-bit WriteData (wd). Outputs: 32-bit ReadData1 (rd1) and ReadData2 (rd2).]
• No timing issues in reading a selected register
• Register files with a large number of ports are difficult to design
  – Intel's Itanium GPR file has 128 registers with 8 read ports and 4 write ports!
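A behavioral sketch of the 2R+1W register file may help fix the port names. The class below is hypothetical; it models the two read ports as a combinational lookup and the single write port as an update that happens only when `clock_edge` is called with `we` asserted.

```python
class RegisterFile:
    """2R+1W register file: 32 registers, 32 bits each (sizes taken from the figure)."""

    def __init__(self, num_regs=32, width=32):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1

    def read(self, rs1, rs2):
        """Combinational read of both ports; returns (rd1, rd2)."""
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        """Rising clock edge: write wd into register ws only if we is asserted."""
        if we:
            self.regs[ws] = wd & self.mask


rf = RegisterFile()
rf.clock_edge(we=1, ws=5, wd=0x1_0000_0001)  # value is truncated to 32 bits
rf.clock_edge(we=0, ws=5, wd=99)             # we deasserted: no change
print(rf.read(5, 0))
```

The sketch does not model the MIPS convention that R0 always reads as 0; that belongs to the ISA rather than to a generic register file.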
A Simple Memory Model
[Figure: MAGIC RAM with inputs WriteEnable, Clock, Address, and WriteData, and output ReadData]
Reads and writes are always completed in one cycle
• a Read can be done any time (i.e., combinational)
• a Write is performed at the rising clock edge if it is enabled
  ⇒ the write address and data must be stable at the clock edge
Later in the course we will present a more realistic model of memory
Implementing MIPS: single-cycle per instruction datapath & control logic
The MIPS ISA
Processor State
  32 32-bit GPRs, R0 always contains a 0
  32 single precision FPRs, may also be viewed as 16 double precision FPRs
  FP status register, used for FP compares & exceptions
  PC, the program counter
  some other special registers
Data types
  8-bit byte, 16-bit half word
  32-bit word for integers
  32-bit word for single precision floating point
  64-bit word for double precision floating point
Load/Store style instruction set
  data addressing modes: immediate & indexed
  branch addressing modes: PC relative & register indirect
  Byte addressable memory: big endian mode
All instructions are 32 bits
Instruction Execution
Execution of an instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back
and the computation of the address of the next instruction
Datapath: Reg-Reg ALU Instructions
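[Figure: the PC addresses the Inst. Memory while an adder computes PC + 0x4. inst<25:21> and inst<20:16> drive rs1 and rs2 of the GPRs and inst<15:11> drives ws; rd1 and rd2 feed the ALU, and the ALU result returns to wd. ALU Control takes inst<5:0>; RegWrite drives we. Labels: OpCode, "RegWrite Timing?"]
Instruction format (field widths 6, 5, 5, 5, 5, 6 bits):
  0 <31:26> | rs <25:21> | rt <20:16> | rd <15:11> | 0 <10:6> | func <5:0>
  rd ← (rs) func (rt)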
Datapath: Reg-Imm ALU Instructions
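[Figure: the PC addresses the Inst. Memory while an adder computes PC + 0x4. inst<25:21> drives rs1 and inst<20:16> drives ws of the GPRs; inst<15:0> passes through Imm Ext to the ALU alongside rd1, and the ALU result returns to wd. ALU Control takes inst<31:26>; RegWrite drives we. Labels: OpCode, ExtSel]
Instruction format (field widths 6, 5, 5, 16 bits):
  opcode <31:26> | rs <25:21> | rt <20:16> | immediate <15:0>
  rt ← (rs) op immediate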
Conflicts in Merging Datapath
[Figure: the Reg-Reg and Reg-Imm datapaths overlaid. The two disagree on the source of ws (inst<15:11> or inst<20:16>), on the second ALU operand (rd2 or the Imm Ext output of inst<15:0>), and on the ALU Control input (inst<5:0> or inst<31:26>). Labels: OpCode, ExtSel, RegWrite]
Introduce muxes
Instruction formats:
  0 | rs | rt | rd | 0 | func       (6, 5, 5, 5, 5, 6 bits)    rd ← (rs) func (rt)
  opcode | rs | rt | immediate      (6, 5, 5, 16 bits)         rt ← (rs) op immediate
Datapath for ALU Instructions
[Figure: the merged datapath. A RegDst mux (rt / rd) chooses ws from <20:16> or <15:11>; a BSrc mux (Reg / Imm) chooses the second ALU operand from rd2 or the Imm Ext output of <15:0>; ALU Control takes <31:26> and <5:0>. Control signals derived from the OpCode: RegDst, ExtSel, OpSel, BSrc, RegWrite]
Instruction formats:
  0 | rs | rt | rd | 0 | func       (6, 5, 5, 5, 5, 6 bits)    rd ← (rs) func (rt)
  opcode | rs | rt | immediate      (6, 5, 5, 16 bits)         rt ← (rs) op immediate
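To make the two formats and the two muxes concrete, here is a small sketch of field extraction followed by one pass through the ALU datapath. It is an illustration under assumptions: `decode`, `execute_alu`, and `add` are hypothetical names, the ALU operation is passed in as a function rather than decoded from func or opcode, and the immediate is always sign-extended (the ExtSel choice is not modeled).

```python
def sign_extend16(imm):
    """Interpret a 16-bit field as a signed two's-complement value."""
    return ((imm & 0xFFFF) ^ 0x8000) - 0x8000


def decode(inst):
    """Split a 32-bit instruction word into the fields used by both formats."""
    return {
        "opcode": (inst >> 26) & 0x3F,  # inst<31:26>
        "rs": (inst >> 21) & 0x1F,      # inst<25:21>
        "rt": (inst >> 16) & 0x1F,      # inst<20:16>
        "rd": (inst >> 11) & 0x1F,      # inst<15:11>
        "func": inst & 0x3F,            # inst<5:0>
        "imm": inst & 0xFFFF,           # inst<15:0>
    }


def execute_alu(inst, gprs, reg_dst, b_src, alu_fn):
    """Return (ws, wd): the destination register and the value to write back."""
    f = decode(inst)
    a = gprs[f["rs"]]                                                # rd1
    b = gprs[f["rt"]] if b_src == "Reg" else sign_extend16(f["imm"])  # BSrc mux
    ws = f["rd"] if reg_dst == "rd" else f["rt"]                     # RegDst mux
    return ws, alu_fn(a, b) & 0xFFFFFFFF


def add(a, b):
    return a + b


gprs = [0] * 32
gprs[1], gprs[2] = 5, 7

# Field values below are arbitrary examples, not a claim about real opcode assignments.
r_type = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20  # rd=3 <- (rs=1) + (rt=2)
i_type = (8 << 26) | (1 << 21) | (4 << 16) | 0xFFFF            # rt=4 <- (rs=1) + (-1)

reg_reg_result = execute_alu(r_type, gprs, reg_dst="rd", b_src="Reg", alu_fn=add)
reg_imm_result = execute_alu(i_type, gprs, reg_dst="rt", b_src="Imm", alu_fn=add)
print(reg_reg_result, reg_imm_result)
```

The same function serves both formats; only the two mux selections differ, which is the point of merging the datapaths.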
Datapath for Memory Instructions
Should program and data memory be separate?
Harvard style: separate (Aiken and Mark 1 influence)
  - read-only program memory
  - read/write data memory
  - at some level the two memories have to be the same
Princeton style: the same (von Neumann's influence)
  - A Load or Store instruction requires accessing the memory more than once during its execution
Load/Store Instructions: Harvard Datapath
[Figure: the ALU datapath extended with a Data Memory (addr, wdata, rdata, we). The ALU adds the "base" register and the extended disp to form addr; rd2 supplies wdata; MemWrite drives the memory we; a WBSrc mux (ALU / Mem) chooses the value written to the GPRs. Control signals derived from the OpCode: RegDst, ExtSel, OpSel, BSrc, RegWrite, MemWrite, WBSrc]
Instruction format (field widths 6, 5, 5, 16 bits):
  opcode <31:26> | rs <25:21> | rt <20:16> | displacement <15:0>
  addressing mode: (rs) + displacement
rs is the base register
rt is the destination of a Load or the source for a Store
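The addressing mode (rs) + displacement can be sketched as follows. This is a simplified model under stated assumptions: memory is a Python dict keyed by byte address that holds whole words, unwritten locations read as 0, and alignment and endianness are ignored. The function names are hypothetical.

```python
def sign_extend16(imm):
    """Interpret a 16-bit field as a signed two's-complement value."""
    return ((imm & 0xFFFF) ^ 0x8000) - 0x8000


def effective_address(gprs, rs, displacement):
    """addr = (rs) + sign-extended displacement, wrapped to 32 bits."""
    return (gprs[rs] + sign_extend16(displacement)) & 0xFFFFFFFF


def load_word(gprs, memory, rs, rt, displacement):
    """Load: rt is the destination; WBSrc selects Mem."""
    gprs[rt] = memory.get(effective_address(gprs, rs, displacement), 0)


def store_word(gprs, memory, rs, rt, displacement):
    """Store: rt is the source; MemWrite is asserted and RegWrite is not."""
    memory[effective_address(gprs, rs, displacement)] = gprs[rt]


gprs = [0] * 32
memory = {}
gprs[1] = 0x1000   # base register
gprs[2] = 42       # value to store

store_word(gprs, memory, rs=1, rt=2, displacement=0xFFFC)  # displacement = -4
load_word(gprs, memory, rs=1, rt=3, displacement=0xFFFC)
print(hex(effective_address(gprs, 1, 0xFFFC)), gprs[3])
```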
MIPS Control Instructions
Conditional (on GPR) PC-relative branch
  opcode (6) | rs (5) | (5) | offset (16)        BEQZ, BNEZ
Unconditional register-indirect jumps
  opcode (6) | rs (5) | (5) | (16)               JR, JALR
Unconditional absolute jumps
  opcode (6) | target (26)                       J, JAL
• PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): ±128 KB range
• Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range
• jump-&-link stores PC+4 into the link register (R31)
• All Control Transfers are delayed by 1 instruction; we will worry about the branch delay slot later
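The target-address rules in the bullets above can be checked with a few lines of arithmetic. The sketch follows the slide's wording (the jump uses PC<31:28>) and ignores the branch delay slot; the function names are hypothetical.

```python
def sign_extend16(imm):
    """Interpret a 16-bit field as a signed two's-complement value."""
    return ((imm & 0xFFFF) ^ 0x8000) - 0x8000


def branch_target(pc, offset):
    """PC-relative branch: (PC + 4) + offset * 4, with a signed 16-bit word offset."""
    return (pc + 4 + (sign_extend16(offset) << 2)) & 0xFFFFFFFF


def jump_target(pc, target):
    """Absolute jump: PC<31:28> followed by the 26-bit target times 4."""
    return (pc & 0xF0000000) | ((target & 0x3FFFFFF) << 2)


# Range checks: 2^15 words of 4 bytes in either direction, and 2^26 words of 4 bytes.
branch_reach_bytes = (1 << 15) * 4
jump_region_bytes = (1 << 26) * 4

print(hex(branch_target(0x1000, 0x0003)))    # forward by 3 words from PC+4
print(hex(branch_target(0x1000, 0xFFFF)))    # back by 1 word from PC+4
print(hex(jump_target(0x10001000, 0x100)))
print(branch_reach_bytes // 1024, "KB;", jump_region_bytes // (1024 * 1024), "MB")
```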
Conditional Branches (BEQZ, BNEZ)
[Figure: the Harvard datapath (PC, Inst. Memory, GPRs, Imm Ext, ALU, ALU Control, Data Memory) with a second adder that computes the branch target br, a zero? test feeding the control logic, and a PCSrc mux choosing between pc+4 and br. Control signals derived from the OpCode: RegDst, ExtSel, OpSel, BSrc, RegWrite, MemWrite, WBSrc, PCSrc]
Register-Indirect Jumps (JR)
[Figure: the Harvard datapath (PC, Inst. Memory, GPRs, Imm Ext, ALU, ALU Control, Data Memory, two adders, zero? test) with the PCSrc mux choosing among pc+4, br, and rind, the register-indirect target read from the GPRs. Control signals derived from the OpCode: RegDst, ExtSel, OpSel, BSrc, RegWrite, MemWrite, WBSrc, PCSrc]
Register-Indirect Jump-&-Link (JALR)
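[Figure: the Harvard datapath (PC, Inst. Memory, GPRs, Imm Ext, ALU, ALU Control, Data Memory, two adders, zero? test) with the PCSrc mux choosing among pc+4, br, and rind. The constant 31 is added as a write-register choice and pc+4 as a write-back choice so that the link value can be stored in R31. Control signals derived from the OpCode: RegDst, ExtSel, OpSel, BSrc, RegWrite, MemWrite, WBSrc, PCSrc]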
Absolute Jumps (J, JAL)
[Figure: the Harvard datapath (PC, Inst. Memory, GPRs, Imm Ext, ALU, ALU Control, Data Memory, two adders, zero? test, constant 31 for the link register) with the PCSrc mux choosing among pc+4, br, rind, and jabs, the absolute jump target. Control signals derived from the OpCode: RegDst, ExtSel, OpSel, BSrc, RegWrite, MemWrite, WBSrc, PCSrc]
Harvard-Style Datapath for MIPS
[Figure: the complete datapath: PC, Inst. Memory, GPRs, Imm Ext, ALU with ALU Control, zero? test, Data Memory, two adders (PC + 0x4 and the branch target), the constant 31 for the link register, and the PCSrc mux (br / rind / jabs / pc+4). Control signals derived from the OpCode: RegDst, ExtSel, OpSel, BSrc, RegWrite, MemWrite, WBSrc, PCSrc]
Five-minute break to stretch your legs
Single-Cycle Hardwired Control: Harvard architecture
We will assume
• clock period is sufficiently long for all of the following steps to be "completed":
  1. instruction fetch
  2. decode and register fetch
  3. ALU operation
  4. data fetch if required
  5. register write-back setup time
  ⇒ tC > tIFetch + tRFetch + tALU + tDMem + tRWB
• At the rising edge of the following clock, the PC, the register file and the memory are updated
Hardwired Control is pure Combinational Logic
[Figure: a combinational logic block with inputs op code and zero?, and outputs ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc]
ALU Control & Immediate Extension
[Figure: Inst<31:26> (Opcode) passes through a Decode Map to give ALUop; OpSel (Func, Op, +, 0?) selects the ALU operation from Inst<5:0> (Func), ALUop, +, or 0?. ExtSel (sExt16, uExt16, High16) selects how the 16-bit immediate is extended.]
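The three ExtSel choices can be sketched as a single function. The slide only names the options; the interpretation below is an assumption: sExt16 is taken to be sign extension, uExt16 zero extension, and High16 placement of the immediate in the upper 16 bits of the word. The function name `imm_ext` is hypothetical.

```python
def imm_ext(imm16, ext_sel):
    """Extend a 16-bit immediate to 32 bits according to ExtSel."""
    imm16 &= 0xFFFF
    if ext_sel == "sExt16":
        # Assumed meaning: replicate bit 15 into bits 31..16.
        return ((imm16 ^ 0x8000) - 0x8000) & 0xFFFFFFFF
    if ext_sel == "uExt16":
        # Assumed meaning: fill bits 31..16 with zeros.
        return imm16
    if ext_sel == "High16":
        # Assumed meaning: immediate goes to bits 31..16, low half is zero.
        return imm16 << 16
    raise ValueError(f"unknown ExtSel value: {ext_sel}")


for sel in ("sExt16", "uExt16", "High16"):
    print(sel, hex(imm_ext(0x8001, sel)))
```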
Hardwired Control Table
Opcode     ExtSel   BSrc   OpSel   MemW   RegW   WBSrc   RegDst   PCSrc
ALU        *        Reg    Func    no     yes    ALU     rd       pc+4
ALUi       sExt16   Imm    Op      no     yes    ALU     rt       pc+4
ALUiu      uExt16   Imm    Op      no     yes    ALU     rt       pc+4
LW         sExt16   Imm    +       no     yes    Mem     rt       pc+4
SW         sExt16   Imm    +       yes    no     *       *        pc+4
BEQZ z=0   sExt16   *      0?      no     no     *       *        br
BEQZ z=1   sExt16   *      0?      no     no     *       *        pc+4
J          *        *      *       no     no     *       *        jabs
JAL        *        *      *       no     yes    PC      R31      jabs
JR         *        *      *       no     no     *       *        rind
JALR       *        *      *       no     yes    PC      R31      rind

BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs
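Because hardwired control is pure combinational logic, the table can be read as a lookup from instruction class (plus the z input for BEQZ) to control signals. The sketch below transcribes the rows as printed, including the z=0 / z=1 labels, and uses "*" for don't-care entries. The names `COLUMNS`, `TABLE`, and `control` are hypothetical.

```python
COLUMNS = ("ExtSel", "BSrc", "OpSel", "MemW", "RegW", "WBSrc", "RegDst", "PCSrc")

# Rows copied from the control table; "*" marks a don't-care entry.
TABLE = {
    "ALU":      ("*",      "Reg", "Func", "no",  "yes", "ALU", "rd",  "pc+4"),
    "ALUi":     ("sExt16", "Imm", "Op",   "no",  "yes", "ALU", "rt",  "pc+4"),
    "ALUiu":    ("uExt16", "Imm", "Op",   "no",  "yes", "ALU", "rt",  "pc+4"),
    "LW":       ("sExt16", "Imm", "+",    "no",  "yes", "Mem", "rt",  "pc+4"),
    "SW":       ("sExt16", "Imm", "+",    "yes", "no",  "*",   "*",   "pc+4"),
    "BEQZ z=0": ("sExt16", "*",   "0?",   "no",  "no",  "*",   "*",   "br"),
    "BEQZ z=1": ("sExt16", "*",   "0?",   "no",  "no",  "*",   "*",   "pc+4"),
    "J":        ("*",      "*",   "*",    "no",  "no",  "*",   "*",   "jabs"),
    "JAL":      ("*",      "*",   "*",    "no",  "yes", "PC",  "R31", "jabs"),
    "JR":       ("*",      "*",   "*",    "no",  "no",  "*",   "*",   "rind"),
    "JALR":     ("*",      "*",   "*",    "no",  "yes", "PC",  "R31", "rind"),
}


def control(opcode, z=None):
    """Combinational lookup: the outputs depend only on the opcode class and z."""
    key = f"BEQZ z={z}" if opcode == "BEQZ" else opcode
    return dict(zip(COLUMNS, TABLE[key]))


print(control("LW"))
print(control("BEQZ", z=0)["PCSrc"], control("BEQZ", z=1)["PCSrc"])
```

No state is kept between calls, which mirrors the claim that the controller needs no storage in a single-cycle design.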
Pipelined MIPS
To pipeline MIPS:
• First build MIPS without pipelining with CPI=1
• Next, add pipeline registers to reduce cycle time while maintaining CPI=1
Pipelined Datapath
[Figure: the datapath (PC, 0x4 adder, Inst. Memory, IR, GPRs, Imm Ext, ALU, Data Memory) divided into five phases: fetch, decode & Reg-fetch, execute, memory, write-back]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined
An Ideal Pipeline
[Figure: stage 1 → stage 2 → stage 3 → stage 4]
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?
How to divide the datapath into stages
Suppose memory is significantly slower than other stages. In particular, suppose
  tIM = 10 units
  tDM = 10 units
  tALU = 5 units
  tRF = 1 unit
  tRW = 1 unit
Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance
tC > max {tIM, tRF, tALU, tDM, tRW} = tDM
tC > max {tIM, tRF+tALU, tDM, tRW} = tDM
Alternative Pipelining
[Figure: the pipelined datapath (PC, 0x4 adder, Inst. Memory, IR, GPRs, Imm Ext, ALU, Data Memory) labeled with the fetch, decode & Reg-fetch, execute, memory, and write-back phases]
Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase
tC > max {tIM, tRF+tALU, tDM+tRW} = tDM + tRW
⇒ increase the critical path by 10%
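The effect of merging stages can be checked numerically. The sketch below uses the delays assumed earlier in the lecture (tIM = tDM = 10, tALU = 5, tRF = tRW = 1) and treats each stage's delay as the sum of the parts it contains; pipeline-register overhead is ignored, and the names are hypothetical.

```python
DELAYS = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}


def clock_period(stages):
    """Lower bound on tC: the delay of the slowest stage."""
    return max(sum(DELAYS[part] for part in stage) for stage in stages)


five_stage = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]
four_stage = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]   # merge decode and execute
three_stage = [["IM"], ["RF", "ALU"], ["DM", "RW"]]    # also merge memory and write-back

t5 = clock_period(five_stage)
t4 = clock_period(four_stage)
t3 = clock_period(three_stage)
increase = (t3 - t4) / t4

print(t5, t4, t3, f"{increase:.0%}")
```

Merging RF with ALU costs nothing because 1 + 5 is still below the 10-unit memory delay, while folding write-back into the memory phase lengthens the slowest stage from 10 to 11 units.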
Maximum Speedup by Pipelining
Assumptions                                     Unpipelined tC   Pipelined tC   Speedup
1. tIM = tDM = 10, tALU = 5, tRF = tRW = 1
   4-stage pipeline                             27               10             2.7
2. tIM = tDM = tALU = tRF = tRW = 5
   4-stage pipeline                             25               10             2.5
3. tIM = tDM = tALU = tRF = tRW = 5
   5-stage pipeline                             25               5              5.0
It is possible to achieve higher speedup with more stages in the pipeline.
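The three rows of the speedup table can be reproduced with a short calculation. One assumption is made here that the table does not spell out: the 4-stage pipeline is taken to be the grouping from the earlier slide, with register fetch and ALU sharing a stage. The unpipelined clock is the sum of all delays and the pipelined clock is the slowest stage.

```python
def speedup(delays, stages):
    """Return (unpipelined tC, pipelined tC, speedup) for a given stage grouping."""
    unpipelined = sum(delays.values())
    pipelined = max(sum(delays[part] for part in stage) for stage in stages)
    return unpipelined, pipelined, unpipelined / pipelined


FOUR_STAGE = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]  # assumed grouping
FIVE_STAGE = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]

slow_memory = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}
balanced = {"IM": 5, "RF": 5, "ALU": 5, "DM": 5, "RW": 5}

case1 = speedup(slow_memory, FOUR_STAGE)
case2 = speedup(balanced, FOUR_STAGE)
case3 = speedup(balanced, FIVE_STAGE)

for label, result in (("1", case1), ("2", case2), ("3", case3)):
    print(label, result)
```

Only the balanced 5-stage case reaches a speedup equal to the number of stages, since that is the only case in which every stage takes the same time.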
Thank you !