EENG 449bG/CPSC 439bG Computer Systems Lecture 3 MIPS Instruction Set & Intro to Pipelining. January 20, 2004 Prof. Andreas Savvides Spring 2004 http://www.eng.yale.edu/courses/eeng449bG. The MIPS Architecture. Features: GPRs with load-store - PowerPoint PPT Presentation
EENG449b/SavvidesLec 3.1
1/20/04
January 20, 2004
Prof. Andreas Savvides
Spring 2004
http://www.eng.yale.edu/courses/eeng449bG
EENG 449bG/CPSC 439bG Computer Systems
Lecture 3
MIPS Instruction Set & Intro to Pipelining
The MIPS Architecture
Features:
• GPRs with load-store
• Displacement, Immediate and Register Indirect Addressing Modes
• Data sizes: 8-, 16-, 32-, 64-bit integers and 64-bit floating point numbers
• Simple instructions: load, store, add, sub, move register-register, shift
• Compare equal, compare not equal, compare less, branch, jump, call and return
• Fixed instruction encoding for performance, variable instruction encoding for size
• Provide at least 16 general purpose registers
MIPS Architecture Features
Registers:
• 32 64-bit GPRs (R0, R1 … R31)
– Note: R0 is always 0!
• 32 64-bit Floating Point Registers (F0, F1 … F31)
Data types:
• 8-bit bytes, 16-bit half words
• 32-bit single precision and 64-bit double precision floating point numbers
Addressing Modes:
• Immediate (Add R4, #3: Regs[R4] <- Regs[R4] + 3)
• Displacement (Add R4, 100(R1): Regs[R4] <- Regs[R4] + Mem[100 + Regs[R1]])
• Register indirect (place 0 in the displacement field)
– E.g. Add R4, 0(R1)
• Absolute Addressing (place R0 as the base register)
– E.g. Add R4, 1000(R0)
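The addressing modes above all reduce to one address calculation, displacement + Regs[base]. A minimal Python sketch, assuming Add accumulates into its destination register; the register and memory contents are made-up values for illustration only:

```python
# Toy machine state; every value here is a made-up illustration.
regs = {"R0": 0, "R1": 200, "R4": 7}
mem = {200: 11, 300: 22, 1000: 33}

def effective_address(base_reg, displacement):
    """Displacement addressing: address = displacement + Regs[base]."""
    return displacement + regs[base_reg]

def add_immediate(dest, value):
    """Immediate operand: the constant comes from the instruction itself."""
    regs[dest] = regs[dest] + value

def add_from_memory(dest, displacement, base_reg):
    """Memory operand: fetched from the computed effective address."""
    regs[dest] = regs[dest] + mem[effective_address(base_reg, displacement)]

add_immediate("R4", 3)             # immediate: 7 + 3 = 10
add_from_memory("R4", 100, "R1")   # displacement: + Mem[100 + 200] = 32
add_from_memory("R4", 0, "R1")     # register indirect: + Mem[0 + 200] = 43
add_from_memory("R4", 1000, "R0")  # absolute: + Mem[1000 + 0] = 76
print(regs["R4"])
```

Register indirect and absolute addressing need no extra hardware: they are the displacement mode with a zero displacement or with R0 as the base.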
MIPS Instruction Format
op – opcode (basic operation of the instruction)
rs – first register operand
rt – second register operand
rd – register destination operand
shamt – shift amount
funct – function
Example: LW $t0, 1200($t1)
Fields (op, rs, rt, address): 35 9 8 1200
Binary: 100011 01001 01000 0000 0100 1011 0000
Note: The numbers for these examples are from “Computer Organization & Design”, Chapter 3
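The LW encoding can be checked by packing the fields into a 32-bit word. A minimal sketch, assuming the usual I-type layout of op (6 bits), rs (5), rt (5) and a 16-bit address; the function name encode_i_type is made up for this example:

```python
def encode_i_type(op, rs, rt, immediate):
    """Pack op (6 bits), rs (5), rt (5), immediate (16) into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)

# LW $t0, 1200($t1): op = 35, rs = 9 ($t1), rt = 8 ($t0), address = 1200
word = encode_i_type(op=35, rs=9, rt=8, immediate=1200)
bits = f"{word:032b}"
print(bits[0:6], bits[6:11], bits[11:16], bits[16:32])
```

The printed groups correspond to the op, rs, rt and address fields in that order.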
MIPS Instruction Format
Example: Add $t0, $s2, $t0
Fields (op, rs, rt, rd, shamt, funct): 0 18 8 8 0 32
Binary: 000000 10010 01000 01000 00000 100000
Note: The numbers for these examples are from “Computer Organization & Design”, Chapter 3
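The register-format example can be packed and unpacked the same way. A minimal sketch, assuming the usual R-type field widths of 6, 5, 5, 5, 5 and 6 bits; encode_r_type and decode_r_type are hypothetical helper names:

```python
def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields (6, 5, 5, 5, 5, 6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def decode_r_type(word):
    """Split a 32-bit word back into (op, rs, rt, rd, shamt, funct)."""
    return (
        (word >> 26) & 0x3F,
        (word >> 21) & 0x1F,
        (word >> 16) & 0x1F,
        (word >> 11) & 0x1F,
        (word >> 6) & 0x1F,
        word & 0x3F,
    )

# Add $t0, $s2, $t0: rs = 18 ($s2), rt = 8 ($t0), rd = 8 ($t0), funct = 32
word = encode_r_type(op=0, rs=18, rt=8, rd=8, shamt=0, funct=32)
print(f"{word:032b}")
print(decode_r_type(word))
```

Because every field sits at a fixed bit position, decoding is a handful of shifts and masks, which is part of what makes fixed-length encodings convenient for pipelining.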
MIPS Instruction Format
Example: j 10000
Fields (op, target address): 2 10000
Binary: ? ?
You fill it in!
MIPS Operations
Four broad classes supported:
1. Loads and stores (figure 2.28)
– Different data sizes: LD, LW, LH, LB, LBU …
2. ALU Operations (figure 2.29)
– Add, sub, and, or …
– They are all register-register operations
3. Control Flow Instructions (figure 2.30)
– Branches (conditional) and Jumps (unconditional)
4. Floating Point Operations
Levels of Representation
High Level Language Program:
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
↓ Compiler
Assembly Language Program:
lw $15, 0($t2)
lw $16, 4($t2)
sw $16, 0($t2)
sw $15, 4($t2)
↓ Assembler
Machine Language Program:
0000 1001 1100 0110 1010 1111 0101 1000
1010 1111 0101 1000 0000 1001 1100 0110
1100 0110 1010 1111 0101 1000 0000 1001
0101 1000 0000 1001 1100 0110 1010 1111
↓ Machine Interpretation
Control Signal Specification
Execution Cycle
• Instruction Fetch: Obtain instruction from program storage
• Instruction Decode: Determine required actions and instruction size
• Operand Fetch: Locate and obtain operand data
• Execute: Compute result value or status
• Result Store: Deposit results in storage for later use
• Next Instruction: Determine successor instruction
5 Steps of MIPS Datapath
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back
[Figure: datapath diagram showing Next PC and Next SEQ PC, the Adder (+4), instruction Memory (Address, Inst), Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU with Zero? test, Data Memory, LMD and WB Data]
Announcements
• Homework 1 is out
– Chapter 1: Problems 1.2, 1.3, 1.17
– Chapter 2: Problems 2.5, 2.11, 2.12, 2.19
– Appendix A: Problems A.1, A.5, A.6, A.7, A.11
– Due Thursday, Feb 5, 2:00pm
• Note the paper on DSP processors on the website
• Reading for this week: Patterson and Hennessy Appendix A
– This lecture we are covering A.1 and A.2, next lecture will cover the rest of the appendix
• Need to form teams for projects
– Select a topic
– Sign up for group appointments with me
List of Possible Projects
• Power saving schemes in embedded microprocessors
• Embedded operating system enhancements and scheduling schemes for sensor interfaces
– Available operating systems: TinyOS, PALOS, uCOS-II
• Time synchronization in sensor networks and its hardware implications
• Efficient microcontroller interfaces and control mechanisms for articulated nodes
• Network protocols and/or data memory management for sensor networks
• I also encourage you to propose your own project
Introduction to Pipelining
Pipelining: leverage parallelism in hardware by overlapping instruction execution
Fast, Pipelined Instruction Interpretation
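Stages: Next Instruction (NI), Instruction Fetch (IF), Decode & Operand Fetch (D), Execute (E), Store Results (W)
[Figure: five instructions staggered along the Time axis, each passing through NI, IF, D, E, W one stage behind its predecessor; the stages communicate through the Instruction Address, Instruction Register, Operand Registers, Result Registers, and Registers or Mem]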
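The staggered picture can be regenerated as text. A minimal sketch, assuming one new instruction starts every cycle and no stalls occur; pipeline_schedule is a hypothetical helper name:

```python
STAGES = ["NI", "IF", "D", "E", "W"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage held in cycle c, or ''."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows

schedule = pipeline_schedule(5)
for row in schedule:
    print(" ".join(f"{cell:>2}" for cell in row))
```

Five instructions finish in 5 + 5 - 1 = 9 cycles instead of the 25 that strictly sequential interpretation would need.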
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: task order A, B, C, D against time from 6 PM to midnight; each load takes 30 + 40 + 20 minutes, one load after another]
Pipelined Laundry: Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
[Figure: task order A, B, C, D against time from 6 PM to midnight, with loads overlapped; successive intervals of 30, 40, 40, 40, 40, 20 minutes]
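The 6 hour and 3.5 hour figures follow from the stage times. A minimal sketch, assuming three stages of 30, 40 and 20 minutes per load (the stage names in the comments are an assumption, not from the slide):

```python
def sequential_minutes(loads, stage_minutes):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)

def pipelined_minutes(loads, stage_minutes):
    """The first load fills the pipeline; each later load finishes one
    bottleneck interval (the slowest stage) after the previous one."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)

stage_minutes = [30, 40, 20]  # e.g. wash, dry, fold
seq = sequential_minutes(4, stage_minutes)
pipe = pipelined_minutes(4, stage_minutes)
print(seq / 60, pipe / 60)  # hours for 4 loads, sequential vs. pipelined
```

The speedup here is 360 / 210, about 1.7, well below the 3 stages available: the 40 minute stage sets the rate, and filling and draining the pipeline costs time that only amortizes over many loads.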
Pipelining Lessons
• Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = Number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline for loads A, B, C, D from 6 PM to 9 PM; successive intervals of 30, 40, 40, 40, 40, 20 minutes]
Instruction Pipelining
• Execute billions of instructions, so throughput is what matters
– except when?
• What is desirable in instruction sets for pipelining?
– Variable length instructions vs. all instructions same length?
– Memory operands part of any operation vs. memory operands only in loads or stores?
– Register operand many places in instruction format vs. registers located in same place?
Requirements for Pipelining
Goal: Start a new instruction at every cycle
What are the hardware implications?
• Two different tasks should not attempt to use the same datapath resource on the same clock cycle.
• Instructions should not interfere with each other
• Need to have separate data and instruction memories
• Need increased memory bandwidth
– A 5-stage pipeline operating at the same clock rate as the unpipelined version requires 5 times the bandwidth
• Need to introduce pipeline registers
• Register file used in two places, in the ID and WB stages
– Perform writes in the first half of the cycle and reads in the second half, so a value written in WB can be read in ID in the same cycle.
Pipeline Requirements…
Need separate instruction and data memories: Structural Hazard
Register file: write in the first half of the cycle, read in the second half
Add registers between pipeline stages
• Prevent interference between 2 instructions
• Carry data from one stage to the next
• Edge triggered
Pipelining Hazards
Hazards: circumstances that would cause incorrect execution if the next instruction were launched
Structural Hazards: Attempting to use the same hardware to do two different things at the same time
Data Hazards: Instruction depends on result of prior instruction still in the pipeline
Control Hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
Common Solution: “Stall” the pipeline until the hazard is resolved, by inserting one or more “bubbles” in the pipeline
Data Hazards
Occurs when the relative timing of instructions is altered because of pipelining
Consider the following code:
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
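The dependences in this sequence can be found mechanically. A minimal sketch, assuming a five-stage pipeline without forwarding in which a register written in WB can be read by an instruction in ID during the same cycle; under that assumption only consumers one or two instructions behind the producer read a stale value. The tuple layout and function name are made up for this example:

```python
# (opcode, destination register, source registers)
program = [
    ("DADD", "R1", ["R2", "R3"]),
    ("DSUB", "R4", ["R1", "R5"]),
    ("AND", "R6", ["R1", "R7"]),
    ("OR", "R8", ["R1", "R9"]),
    ("XOR", "R10", ["R1", "R11"]),
]

def raw_dependences(program):
    """Return (producer_index, consumer_index, register) for every source
    register that was written by an earlier instruction in the sequence."""
    last_writer = {}
    deps = []
    for i, (_, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        last_writer[dest] = i
    return deps

deps = raw_dependences(program)
# Assumption: distance 1 or 2 reads the register before the producer's WB.
hazards = [d for d in deps if d[1] - d[0] <= 2]
print(deps)
print(hazards)
```

All four later instructions depend on R1, but under the stated assumption only DSUB and AND would read it too early; OR reads in the cycle of the write and XOR reads after it.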
Data Hazard
Data Hazards: Data Forwarding
Data Hazards Requiring Stalls
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
HAVE to stall for 1 cycle…
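The one-cycle stall can be detected with a simple check. A minimal sketch, assuming full forwarding, so that the only remaining stall is a load immediately followed by an instruction that reads the loaded register; the tuple layout and function name are made up for this example:

```python
def load_use_stalls(program, load_opcodes=("LD", "LW")):
    """Count 1-cycle stalls: a load directly followed by a reader of its result."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_sources = current
        if prev_op in load_opcodes and prev_dest in cur_sources:
            stalls += 1
    return stalls

# (opcode, destination register, source registers)
program = [
    ("LD", "R1", ["R2"]),
    ("DSUB", "R4", ["R1", "R5"]),
    ("AND", "R6", ["R1", "R7"]),
    ("OR", "R8", ["R1", "R9"]),
]
stalls = load_use_stalls(program)
print(stalls)
```

Only DSUB triggers the stall: the loaded value leaves data memory at the end of MEM, too late for the instruction that needs it in EX during that same cycle, while AND and OR can receive it by forwarding.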
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
» MIPS still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
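The first three alternatives can be compared by their expected penalty per branch. A minimal sketch, assuming a 1-cycle penalty whenever the fetched instruction turns out to be wrong and, purely for illustration, that 20% of instructions are branches:

```python
def expected_branch_penalty(taken_fraction, penalty_if_taken, penalty_if_not_taken):
    """Average penalty cycles per branch, weighted by how often it is taken."""
    return (taken_fraction * penalty_if_taken
            + (1 - taken_fraction) * penalty_if_not_taken)

TAKEN = 0.53             # fraction of MIPS branches taken, from the slide
BRANCH_FREQUENCY = 0.20  # assumed share of branches in the instruction mix

stall_always = expected_branch_penalty(TAKEN, 1, 1)
predict_not_taken = expected_branch_penalty(TAKEN, 1, 0)

def cpi_with_branches(penalty_per_branch, branch_frequency=BRANCH_FREQUENCY):
    """Ideal CPI of 1 plus the branch stall cycles per instruction."""
    return 1 + branch_frequency * penalty_per_branch

print(cpi_with_branches(stall_always), cpi_with_branches(predict_not_taken))
```

Predict taken is not listed separately: with the target address unknown until the outcome is known, it costs the same 1 cycle per branch as stalling in this MIPS pipeline.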
Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction

branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
........
branch target if taken

(sequential successors 1 to n: branch delay of length n)

– 1 slot delay allows proper decision and branch target address in 5 stage pipeline
– MIPS uses this
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Canceling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
Pipelining Performance Issues
Consider an unpipelined processor
Clock cycle: 1 ns

Operation            Cycles   Frequency
ALU operations       4        40%
Branches             4        20%
Memory operations    5        40%

Pipelining overhead: 0.2 ns
For the unpipelined processor:
$$\text{Average instruction execution time} = \text{Clock cycle} \times \text{Average CPI} = 1\,\text{ns} \times ((40\% + 20\%) \times 4 + 40\% \times 5) = 4.4\,\text{ns}$$
Speedup from Pipelining
Now if we had a pipelined processor, we assume that each instruction takes 1 cycle BUT we also have overhead so instructions take 1ns + 0.2 ns = 1.2ns
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{4.4\,\text{ns}}{1.2\,\text{ns}} = 3.7 \text{ times}$$
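The arithmetic of this example can be replayed in a few lines. A minimal sketch, taking the 1 ns figure as the clock cycle of the unpipelined processor; the variable names are made up for this example:

```python
clock_ns = 1.0
overhead_ns = 0.2

# operation class: (cycles, relative frequency)
mix = {
    "ALU": (4, 0.40),
    "branch": (4, 0.20),
    "memory": (5, 0.40),
}

average_cpi = sum(cycles * freq for cycles, freq in mix.values())
unpipelined_ns = clock_ns * average_cpi   # average time per instruction
pipelined_ns = clock_ns + overhead_ns     # one instruction per (longer) cycle
speedup = unpipelined_ns / pipelined_ns

print(round(average_cpi, 2), round(pipelined_ns, 2), round(speedup, 2))
```

The overhead is what keeps the speedup near 3.7 rather than the 4.4 that an overhead-free pipeline would give for this mix.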
Considering the stall overhead
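$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}} = \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$

$$\text{CPI pipelined} = \text{Ideal CPI} + \text{Average stall cycles per instruction} = 1 + \text{Pipeline stall cycles per instruction}$$

With equal clock cycles:
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}$$

Taking CPI unpipelined equal to the pipeline depth:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time unpipelined}}{\text{Cycle Time pipelined}}$$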
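The final speedup formula is easy to evaluate for different stall rates. A minimal sketch, assuming the unpipelined CPI equals the pipeline depth; the function name and the sample stall values are made up for illustration:

```python
def pipeline_speedup(depth, stall_cycles_per_instruction,
                     cycle_time_unpipelined=1.0, cycle_time_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined / pipelined cycle time)."""
    cycle_ratio = cycle_time_unpipelined / cycle_time_pipelined
    return depth / (1 + stall_cycles_per_instruction) * cycle_ratio

ideal = pipeline_speedup(depth=5, stall_cycles_per_instruction=0.0)
with_stalls = pipeline_speedup(depth=5, stall_cycles_per_instruction=0.25)
slower_clock = pipeline_speedup(5, 0.25, cycle_time_unpipelined=1.0,
                                cycle_time_pipelined=1.2)
print(ideal, with_stalls, round(slower_clock, 2))
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction and every bit of added cycle time pulls it below that bound.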