What is Multiprocessing?

Parallelism at the instruction level is limited by data dependencies => the achievable speedup is limited!

Program-level parallelism is abundant, e.g., loop-level parallelism in a Fortran DO I = 1, 1000 loop. How about employing multiple processors to execute the loop iterations? => Parallel processing, or multiprocessing.

With a billion transistors on a chip, we can put a few CPUs on one chip => chip multiprocessor.
Memory Latency Problem

Even if we increase CPU power, memory is the real bottleneck. Techniques to alleviate the memory latency problem:
1. Memory hierarchy – exploit program locality: cache memory, multilevel caches, pages, and context switching.
2. Prefetching – fetch instructions/data before the CPU needs them. This works well for instructions because of sequential locality, so all modern processors use prefetch buffers for instructions. What to do with data?
3. Multithreading – can the CPU jump to another program while waiting on a memory access? It is like multiprogramming!
Hardware Multithreading

We need a hardware multithreading technique because switching between threads in software is very time-consuming (why?), so software switching is not suitable for hiding main memory latency (as opposed to I/O latency). Ex: multitasking.

Provide multiple PCs and register sets on the CPU so that a thread switch can occur without storing the register contents in main memory (on the stack, as is done for context switching).

Several threads reside in the CPU simultaneously, and execution switches between the threads on main memory accesses.

How about both multiprocessing and multithreading on a chip? => Network processor.
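The switch-on-memory-access idea can be sketched in a toy simulation. This is purely illustrative: the MEM_LATENCY value, the 'alu'/'mem' instruction model, and the round-robin pick are assumptions, not from the slides.

```python
# Toy model of hardware multithreading: the CPU switches to another
# ready thread whenever the current thread issues a main-memory access.
MEM_LATENCY = 3  # cycles a thread is blocked after a memory access (assumed)

def run(threads):
    """threads: list of instruction lists, each item 'alu' or 'mem'.
    Returns (total_cycles, busy_cycles)."""
    n = len(threads)
    pc = [0] * n              # per-thread program counter
    ready_at = [0] * n        # cycle at which each thread can run again
    cycle = busy = 0
    while any(pc[i] < len(threads[i]) for i in range(n)):
        # pick the first ready, unfinished thread
        runnable = [i for i in range(n)
                    if pc[i] < len(threads[i]) and ready_at[i] <= cycle]
        if runnable:
            t = runnable[0]
            op = threads[t][pc[t]]
            pc[t] += 1
            busy += 1
            if op == 'mem':   # switch away; this thread blocks on memory
                ready_at[t] = cycle + 1 + MEM_LATENCY
        cycle += 1            # if nothing is runnable, this is an idle cycle
    return cycle, busy

# Two threads hide part of each other's memory latency:
total, busy = run([['alu', 'mem', 'alu'], ['alu', 'mem', 'alu']])
```

Running each thread alone takes 6 cycles (3 of them idle); interleaving the two threads finishes both in 8 cycles instead of 12.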
Architectural Comparisons (cont.)

[Figure: issue slots over time (in processor cycles) for Multiprocessing vs. Simultaneous Multithreading, with Threads 1-5 shown in different shades and idle slots marked.]
Single-cycle Datapath for MIPS

Use the datapath figure to represent the pipeline:

[Figure: single-cycle datapath with PC, Instruction Memory (Imem), Registers, ALU, and Data Memory (Dmem), divided into Stage 1 through Stage 5; abbreviated IFtch, Dcd, Exec, Mem, WB (IM, Reg, ALU, DM, Reg).]
Stages of Execution in Pipelined MIPS

Five-stage instruction pipeline:
1) I-fetch: fetch instruction, increment PC
2) Decode: decode instruction, read registers
3) Execute:
   Mem-reference: calculate address
   R-format: perform ALU operation
4) Memory:
   Load: read data from data memory
   Store: write data to data memory
5) Write Back: write data to register
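The overlap of these five stages can be written as a small helper (hypothetical, for illustration): given an instruction index and a clock cycle, it reports which stage the instruction occupies, assuming one instruction issues per cycle with no stalls.

```python
STAGES = ['IFtch', 'Dcd', 'Exec', 'Mem', 'WB']

def stage_of(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in a given
    1-based clock cycle, assuming no stalls; None if not in the pipe."""
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None

# Instruction 0 writes back in cycle 5; instruction 1 one cycle later, etc.
```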
Pipelined Execution Representation

To simplify the pipeline, every instruction takes the same number of steps, called stages.
One clock cycle per stage.

[Figure: program flow with five overlapped instructions, each passing through IFtch, Dcd, Exec, Mem, WB, staggered by one cycle.]
Datapath Timing: Single-cycle vs. Pipelined

Assume the following delays for major functional units:
  2 ns for a memory access or ALU operation
  1 ns for a register file read or write

Total datapath delay for single-cycle:

Insn    Insn   Reg   ALU   Data    Reg    Total
Type    Fetch  Read  Oper  Access  Write  Time
beq     2ns    1ns   2ns                  5ns
R-form  2ns    1ns   2ns           1ns    6ns
sw      2ns    1ns   2ns   2ns            7ns
lw      2ns    1ns   2ns   2ns     1ns    8ns

In a pipelined machine, each stage = length of the longest delay = 2 ns; 5 stages = 10 ns.
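The table rows follow directly from the unit delays; a minimal sketch (the dictionary names are illustrative):

```python
# Per-unit delays from the table above (2 ns memory/ALU, 1 ns register file).
DELAY = {'imem': 2, 'reg_read': 1, 'alu': 2, 'dmem': 2, 'reg_write': 1}

# Which units each instruction class uses in the single-cycle datapath.
UNITS = {
    'beq':    ['imem', 'reg_read', 'alu'],
    'R-form': ['imem', 'reg_read', 'alu', 'reg_write'],
    'sw':     ['imem', 'reg_read', 'alu', 'dmem'],
    'lw':     ['imem', 'reg_read', 'alu', 'dmem', 'reg_write'],
}

def single_cycle_time(kind):
    """Total single-cycle datapath delay in ns for one instruction."""
    return sum(DELAY[u] for u in UNITS[kind])
```

The single-cycle clock must accommodate the slowest instruction (lw, 8 ns), while the pipelined clock only needs to cover the slowest stage (2 ns).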
Pipelining Lessons

Pipelining doesn't help the latency (execution time) of a single task; it helps the throughput of the entire workload.
Multiple tasks operate simultaneously using different resources.
Potential speedup = number of pipe stages.
Time to "fill" the pipeline and time to "drain" it reduces the speedup.
The pipeline rate is limited by the slowest pipeline stage; unbalanced lengths of pipe stages also reduce the speedup.
Single Cycle Datapath (From Ch 5)

[Figure: single-cycle datapath. PC feeds the Imem read address; instruction fields [25:21], [20:16], and [15:11] select ReadReg1, ReadReg2, and (via the RegDst mux) WriteReg in Regs. Field [15:0] is sign-extended to 32 bits for the ALU (via the ALUSrc mux) and, shifted left 2, added to PC+4 to form the branch target (selected by the PCSrc mux). The ALU result drives the Dmem address; the MemToReg mux selects between Dmem Read data and the ALU result for the register WriteData. Control signals: RegWrite, ALUOp, ALU-con, ALUSrc, RegDst, MemRead, MemWrite, MemToReg, PCSrc.]
Required Changes to Datapath

Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
The next PC value is computed in the 3rd stage, but we need to bring in the next instruction in the very next cycle – move the PCSrc mux to the 1st stage. The PC is incremented unless there is a new branch address.
The branch address is computed in the 3rd stage. With the pipeline, the PC value has changed by then! We must carry the PC value along with the instruction. Width of the IF/ID register = (IR) + (PC) = 64 bits.
Pipelined Datapath (with Pipeline Regs)

[Figure: the datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers inserted between the Fetch, Decode, Execute, Memory, and Write Back stages. PC, Imem, Regs, sign extend, ALU, Dmem, the branch adder (Add with shift left 2), and the muxes appear as before; the 5-bit write register number is carried forward through the pipeline registers to MEM/WB.]
Pipelined Control

Start with the single-cycle controller.
Group control lines by the pipeline stage that needs them.
Extend the pipeline registers with control bits.

[Figure: Control decodes the instruction in ID and produces EX (RegDst, ALUOp, ALUSrc), Mem (Branch, MemRead, MemWrite), and WB (MemToReg, RegWrite) control groups, carried along through the ID/EX, EX/MEM, and MEM/WB registers.]
Pipelined Processor: Datapath + Control

More work is needed to correctly handle pipeline hazards.

[Figure: complete pipelined datapath with control — the pipelined datapath above plus the Control unit, with the EX, M, and WB control fields flowing through the ID/EX, EX/MEM, and MEM/WB registers; signals ALUSrc, ALUOp, ALUcontrol, RegDst, RegWrite, MemRead, MemWrite, MemToReg, Branch, PCSrc; instruction fields [20:16], [15:11], and [15:0] (sign-extended) feed the register-destination and ALU-source muxes.]
In short

If we can keep all pipeline stages busy, we can retire (complete) up to one instruction per clock cycle, thereby achieving single-cycle throughput.
The pipeline paradox (for MIPS): any individual instruction still takes 5 cycles to execute, even though we can retire one instruction per cycle.
Problems for Pipelining

Hazards prevent the next instruction from executing during its designated clock cycle, limiting speedup:
Structural hazards: the HW cannot support this combination of instructions (e.g., a single memory for instructions and data).
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
Control hazards: conditional branches and other instructions may stall the pipeline, delaying later instructions.
Single Memory is a Structural Hazard

[Figure: five instructions (Load, Instr 1-4) flowing through M, Reg, ALU, M, Reg stages over time (clock cycles), instruction order top to bottom; in one cycle the Load's data access and a later instruction's fetch both need the single memory.]

• Can't read the same memory twice in the same clock cycle.
Ex: MIPS multicycle datapath: Structural Hazard in Memory

[Figure: multicycle datapath in which a single memory holds both instructions and data — PC or ALUOut selects the memory Address, the memory output feeds the Instruction Register or the Data register, and the register file (ReadReg1/2, WriteReg, Read data 1/2 into A and B) feeds the ALU.]
Structural Hazards limit performance

Example: if there are 1.3 memory accesses per instruction (30% of instructions execute loads and stores) and only one memory access is possible per cycle, then average CPI ≥ 1.3; otherwise the memory would have to be more than 100% utilized.
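A quick way to state the bound (a sketch; min_cpi is a hypothetical helper):

```python
def min_cpi(mem_accesses_per_instr, ports=1):
    """Lower bound on CPI when memory is the only structural constraint:
    `ports` accesses can be serviced per cycle, so the accesses per
    instruction must fit into at least that many cycles."""
    return max(1.0, mem_accesses_per_instr / ports)

# 1 instruction fetch + 0.3 data accesses = 1.3 accesses per instruction.
```

With a second memory port (or a split instruction/data cache), the bound drops back to the ideal CPI of 1.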
Structural Hazard Solution: Add more Hardware
Speed Up Equation for Pipelining

CPIpipelined = Ideal CPI + Pipeline stall clock cycles per instruction

            Ideal CPI x Pipeline depth     Clock Cycle unpipelined
Speedup  = ---------------------------- x -------------------------
           Ideal CPI + Pipeline stall CPI   Clock Cycle pipelined

With Ideal CPI = 1:

               Pipeline depth         Clock Cycle unpipelined
Speedup  = ----------------------- x -------------------------
            1 + Pipeline stall CPI     Clock Cycle pipelined
Example: Dual-port vs. Single-port

Machine A: dual-ported memory.
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate.
Ideal CPI = 1 for both; loads are 40% of the instructions executed.

SpeedUpA = Pipeline Depth / (1 + 0) x (clock unpipe / clock pipe) = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock unpipe / (clock unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05 = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster.
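The speedup equation and this example can be checked numerically; a sketch assuming a 5-stage pipeline:

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine, per the equation above.
    clock_ratio = ClockCycle_unpipelined / ClockCycle_pipelined."""
    return (ideal_cpi * depth / (ideal_cpi + stall_cpi)) * clock_ratio

depth = 5                                                  # assumed depth
a = pipeline_speedup(depth, stall_cpi=0.0)                 # dual-ported memory
b = pipeline_speedup(depth, stall_cpi=0.4, clock_ratio=1.05)
ratio = a / b                                              # machine A vs. B
```

With depth 5, machine A reaches a speedup of 5.0, machine B only 3.75, so A is 1.33 times faster despite B's faster clock.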
Data Hazard on Register $1 (6.4)

add $1, $2, $3
sub $4, $1, $3
and $6, $1, $7
or  $8, $1, $9
xor $10, $1, $11

• "Forward" the result from one stage to another.
• The "or" is OK if the register file is implemented properly (write in the first half of the cycle, read in the second half).
Data Hazard Solution: Forwarding

add $1, $2, $3
sub $4, $1, $3
and $6, $1, $7
or  $8, $1, $9
xor $10, $1, $11

[Figure: the five instructions over time (clock cycles), instruction order top to bottom, each in IF, ID/RF, EX, MEM, WB stages; arrows forward the add's ALU result to the following dependent instructions.]
Hazard Detection for Forwarding

A hazard must be detected just before execution so that, in case of a hazard, the data can be forwarded to the input of the ALU.
A hazard exists when a source register (Rs, Rt, or both) of the instruction in the EX stage equals the destination register (Rd) of an instruction still in the pipeline (in the MEM or WB stage).
Compare the Rs and Rt values in the ID/EX register with Rd in the EX/MEM and MEM/WB registers => we need to carry the Rs, Rt, and Rd values from the IF/ID register into the ID/EX register (only Rd was carried before).
If they match, forward the data to the input of the ALU through the multiplexor.
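The comparison just described can be sketched as a forwarding-unit function. Signal names are illustrative; the rd != 0 check (so that $0 is never forwarded) and the preference for the newer EX/MEM result are standard refinements, not spelled out on this slide.

```python
def forward_select(ex_rs, ex_rt, exmem_rd, exmem_regwrite,
                   memwb_rd, memwb_regwrite):
    """Forwarding-unit sketch: for each ALU input of the instruction in EX,
    return 'EX/MEM', 'MEM/WB', or 'REG' (no forwarding needed)."""
    def src(reg):
        # The newer result (EX/MEM) takes priority over MEM/WB.
        if exmem_regwrite and exmem_rd != 0 and exmem_rd == reg:
            return 'EX/MEM'
        if memwb_regwrite and memwb_rd != 0 and memwb_rd == reg:
            return 'MEM/WB'
        return 'REG'
    return src(ex_rs), src(ex_rt)

# add $1,$2,$3 followed immediately by sub $4,$1,$3:
# $1 must come from the EX/MEM register, $3 from the register file.
```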
Forwarding: What about Loads?

lw  $1, 0($2)
sub $4, $1, $3

[Figure: lw and sub in IF, ID/RF, EX, MEM, WB stages; the loaded value is available only after lw's MEM stage, but sub needs it at the start of its EX stage.]

• Dependencies backward in time are hazards.
• Can't solve with forwarding alone; must stall the instruction dependent on the load.
• This is a "load-use" hazard.
• Must stall the pipeline 1 cycle (insert 1 bubble).

Data Hazard Even with Forwarding

lw  $1, 0($2)
sub $4, $1, $6
and $6, $1, $7
or  $8, $1, $9

[Figure: the four instructions over time (clock cycles); a bubble is inserted after lw, delaying sub, and, and or by one cycle so the loaded value can then be forwarded.]
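The stall condition can be sketched as a hazard-detection function (signal names illustrative): stall when the instruction in EX is a load whose target register is needed by the instruction in ID.

```python
def load_use_stall(idex_memread, idex_rt, ifid_rs, ifid_rt):
    """Hazard-detection-unit sketch: return True when a one-cycle bubble
    must be inserted because of a load-use dependence."""
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)

# lw $1,0($2) in EX (MemRead=1, rt=$1), sub $4,$1,$6 in ID (rs=$1, rt=$6):
# the pipeline must stall for one cycle.
```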
Compiler Schemes to Improve Load Delay

The compiler detects the data dependency and inserts nop instructions until the data is available:

sub $2, $1, $3
nop
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Better: the compiler finds independent instructions to fill in the delay slots.
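A minimal sketch of such a nop-inserting pass (illustrative; it pads only a one-slot dependence between adjacent instructions, matching the example above):

```python
def insert_nops(instrs):
    """instrs: list of (dest, src1, src2) register numbers; dest is None
    for instructions that write no register (stores, branches).
    Insert a 'nop' whenever an instruction reads a register written by
    the instruction immediately before it."""
    out = []
    for dest, s1, s2 in instrs:
        if out and out[-1] != 'nop':
            prev_dest = out[-1][0]
            if prev_dest is not None and prev_dest in (s1, s2):
                out.append('nop')   # pad the one-slot dependence
        out.append((dest, s1, s2))
    return out

# sub $2,$1,$3 ; and $12,$2,$5 ; or $13,$6,$2 ; add $14,$2,$2
code = [(2, 1, 3), (12, 2, 5), (13, 6, 2), (14, 2, 2)]
scheduled = insert_nops(code)   # one nop lands between sub and and
```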
Software Scheduling to Avoid Load Hazards

Try producing fast code for

a = b + c;
d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:
LW  Rb, b
LW  Rc, c
ADD Ra, Rb, Rc
SW  a, Ra
LW  Re, e
LW  Rf, f
SUB Rd, Re, Rf
SW  d, Rd

Fast code:
LW  Rb, b
LW  Rc, c
LW  Re, e
ADD Ra, Rb, Rc
LW  Rf, f
SW  a, Ra
SUB Rd, Re, Rf
SW  d, Rd