CPUs
CPU performance
CPU power consumption
Elements of CPU performance
Cycle time, CPU pipeline, memory system
Pipelining
Several instructions are executed simultaneously at different stages of completion
Various conditions can cause pipeline bubbles that reduce utilization: branches, memory system delays, etc.
Pipeline structures
ARM7 has a 3-stage pipeline: fetch instruction from memory, decode opcode and operands, execute.
ARM9 has a 5-stage pipeline: instruction fetch, decode, execute, data memory access, register write.
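The overlap between stages can be sketched with a small model. The following is an illustrative sketch, not ARM documentation: it assumes an ideal, stall-free in-order pipeline in which instruction i enters stage s in cycle i + s + 1.

```python
# Illustrative sketch of ideal in-order pipeline overlap (an assumption,
# not ARM timing data): instruction i enters stage s in cycle i + s + 1.
def pipeline_timeline(instructions, stages):
    """Return {instruction: {stage: cycle}} for a stall-free pipeline."""
    timeline = {}
    for i, instr in enumerate(instructions):
        timeline[instr] = {stage: i + s + 1 for s, stage in enumerate(stages)}
    return timeline

arm9_stages = ["fetch", "decode", "execute", "memory", "write"]
t = pipeline_timeline(["add", "sub", "cmp"], arm9_stages)
# "add" retires in cycle 5; "cmp", entering two cycles later, retires in cycle 7.
```

This shows why three instructions finish in 7 cycles rather than 15: after the pipeline fills, one instruction completes per cycle.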
ARM9 core instruction pipeline
ARM7 pipeline execution

                time 1   time 2   time 3   time 4   time 5
add r0,r1,#5    fetch    decode   execute
sub r2,r3,r6             fetch    decode   execute
cmp r2,#3                         fetch    decode   execute
Performance measures
Latency: time it takes for an instruction to get through the pipeline
Throughput: number of instructions executed per time period
Pipelining increases throughput without reducing latency
Pipeline stalls
If every stage cannot be completed in the same amount of time, the pipeline stalls.
Bubbles introduced by a stall increase latency and reduce throughput.
ARM multi-cycle LDMIA instruction

                    time 1   time 2   time 3       time 4       time 5   time 6
ldmia r0,{r2,r3}    fetch    decode   ex (ld r2)   ex (ld r3)
sub r2,r3,r6                 fetch    decode       (stall)      ex sub
cmp r2,#3                             fetch        (stall)      decode   ex cmp
Control stalls
Branches often introduce stalls (branch penalty).
Stall time may depend on whether the branch is taken.
May have to squash instructions that already started executing.
Don't know what to fetch until the condition is evaluated.
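The cost of control stalls is often summarized with a simple average-CPI model. The frequencies and penalty below are assumed illustrative numbers, not ARM measurements:

```python
# Illustrative branch-penalty model (numbers are assumptions): taken
# branches pay `penalty` bubble cycles while the pipeline refills, so
# average CPI = base_cpi + f_branch * f_taken * penalty.
def average_cpi(f_branch, f_taken, penalty, base_cpi=1.0):
    return base_cpi + f_branch * f_taken * penalty

# Assume 20% of instructions are branches, 60% of those are taken,
# and a taken branch costs 2 bubble cycles.
cpi = average_cpi(f_branch=0.2, f_taken=0.6, penalty=2)
```

With these numbers the machine averages 1.24 cycles per instruction instead of the ideal 1.0, which is why branch prediction matters so much.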
ARM pipelined branch

                    time 1   time 2   time 3   time 4   time 5   time 6
bne foo             fetch    decode   ex       ex       ex
sub r2,r3,r6                 fetch    decode   (squashed)
foo add r0,r1,r2                               fetch    decode   ex
Delayed branch
To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch always be executed, whether or not the branch is taken.
Example: ARM7 execution time
Determine the execution time of an FIR filter:
for (i = 0; i < N; i++)
    f = f + c[i] * x[i];
The only branch is the loop test, which may take more than one cycle: BLT loop takes 1 cycle best case, 3 cycles worst case.
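The loop's total time can be bounded by counting cycles per iteration. The body cost below is an assumed placeholder (the real count depends on the compiled instruction sequence); only the BLT best/worst figures come from the text above:

```python
# Hedged estimate of the FIR loop's ARM7 execution time. `body` is an
# ASSUMED per-iteration cost for the loads, multiply, accumulate, and
# index update; BLT costs 1 cycle best case, 3 worst case (from the text).
def fir_cycles(N, body, blt_cycles):
    return N * (body + blt_cycles)

best = fir_cycles(100, body=6, blt_cycles=1)    # 700 cycles with assumed body
worst = fir_cycles(100, body=6, blt_cycles=3)   # 900 cycles with assumed body
```

The spread between best and worst case comes entirely from the branch, which is why the loop test dominates the timing analysis.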
ARM10 processor execution time
It is impossible to describe briefly the exact behavior of all instructions in all circumstances. Relevant factors include:
branch prediction
prefetch buffer
branch folding
the independent load/store unit
data alignment
how many accesses hit in the cache and TLB
ARM10 integer core
Prefetch unit:
fetches instructions from the I-cache or external memory
predicts the outcome of branches whenever it can
Integer unit:
decode, barrel shifter, ALU, multiplier, main instruction sequencer
Load/store unit:
loads or stores two registers (64 bits) per cycle
decouples from the integer unit after the first access of an LDM or STM instruction
supports hit-under-miss (HUM) operation
Pipeline
Fetch: I-cache access, branch prediction
Issue: initial instruction decode
Decode: final instruction decode, register read for ALU ops, forwarding, and initial interlock resolution
Execute: data address calculation, shift, flag setting, condition-code check, branch mispredict detection, and store data register read
Memory: data cache access
Write: register writes, instruction retirement
Typical operations
Load or store operation
LDR operation that misses
Interlocks
The integer core uses forwarding to resolve data dependencies between instructions.
Pipeline interlocks:
Data dependency interlocks: instructions that have a source register that is loaded from memory by the previous instruction.
Hardware dependency interlocks:
a new load waiting for the LSU to finish an existing LDM or STM
a load that misses when the HUM slot is already occupied
a new multiply waiting for a previous multiply to free up the first stage of the multiply
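The load-use case can be detected mechanically by scanning adjacent instruction pairs. This is a toy sketch of the dependency check, not the ARM10's actual interlock hardware; the tuple encoding is an assumption for illustration:

```python
# Toy sketch of load-use interlock detection (not real ARM10 hardware):
# a one-cycle bubble is needed when an instruction reads a register that
# the *previous* instruction loads from memory.
def load_use_bubbles(program):
    """program: list of (op, dest, sources) tuples; returns bubble count."""
    bubbles = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "ldr" and prev[1] in cur[2]:
            bubbles += 1
    return bubbles

prog = [("ldr", "r0", ("r1", "r2")),   # ldr r0, [r1, r2]
        ("str", "r3", ("r0", "r4"))]   # str r3, [r0, r4] -- reads loaded r0
```

Here the str's address uses r0, which the ldr is still fetching from memory, so one bubble is inserted even with full forwarding.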
Pipeline forwarding paths
Example of interlocking and forwarding
Execute-to-execute:
mov r0, #1
add r1, r0, #1
Memory-to-execute:
mov r0, #1
sub r1, r2, #2
add r2, r0, #1
Example of interlocking and forwarding, cont’d
Single-cycle interlock:
ldr r0, [r1, r2]
str r3, [r0, r4]

                  time 1   time 2   time 3   time 4       time 5          time 6                time 7
ldr r0,[r1,r2]    fetch    issue    decode   ex (r1+r2)   mem (r0 read)   write (r0)
str r3,[r0,r4]             fetch    issue    decode       (stall)         ex (r0+r4, r3 read)   mem (r3 write)
Superscalar execution
Superscalar processor can execute several instructions per cycle. Uses multiple pipelined data paths.
Programs execute faster, but it is harder to determine how much faster.
Data dependencies
Execution time depends on operands, not just opcode.
Superscalar CPU checks data dependencies dynamically:
add r2,r0,r1
add r3,r2,r5
The second add reads r2, which is produced by the first add: this data dependency (r0, r1 → r2; then r2, r5 → r3) forces the two instructions to complete in order.
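The effect of such dependencies on a dual-issue machine can be sketched with a toy issue model. This is an assumed simplification, not a real ARM scheduler: up to two instructions issue per cycle unless the second reads the first's destination:

```python
# Toy dual-issue model (an assumption, not a real superscalar scheduler):
# up to two instructions issue per cycle, but an instruction whose source
# is written by its pair partner must wait for the next cycle.
def dual_issue_cycles(program):
    """program: list of (dest, sources) tuples; returns cycles to issue all."""
    cycles, i = 0, 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program) and program[i][0] not in program[i + 1][1]:
            i += 2          # independent pair issues together
        else:
            i += 1          # dependency: second instruction waits a cycle
    return cycles

# add r2,r0,r1 ; add r3,r2,r5 -- the second reads r2, so 2 cycles, not 1.
dep = dual_issue_cycles([("r2", ("r0", "r1")), ("r3", ("r2", "r5"))])
```

This is exactly why superscalar speedup is hard to predict: the same two instructions would take one cycle if they were independent.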
Memory system performance
Caches introduce indeterminacy in execution time; whether an access hits or misses depends on the order of execution.
Cache miss penalty: added time due to a cache miss
Several reasons for a miss: compulsory, conflict, capacity
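The average cost of the miss penalty is captured by the standard average-memory-access-time formula; the hit time, miss rate, and penalty below are assumed example numbers:

```python
# Standard average-memory-access-time formula (example numbers are assumed):
# t_avg = t_hit + miss_rate * miss_penalty.
def avg_access_time(t_hit, miss_rate, miss_penalty):
    return t_hit + miss_rate * miss_penalty

# Assume a 1-cycle hit, 5% miss rate, and a 20-cycle miss penalty.
t_avg = avg_access_time(t_hit=1, miss_rate=0.05, miss_penalty=20)
```

Even a 5% miss rate doubles the average access time here, which is why miss classification (compulsory, conflict, capacity) guides cache tuning.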