27
CPUs CPU performance CPU power consumption

CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Embed Size (px)

Citation preview

Page 1: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

CPUs

CPU performance

CPU power consumption

Page 2: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Elements of CPU performance

Cycle time CPU pipeline Memory system

Page 3: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Pipelining

Several instructions are executed simultaneously at different stages of completion

Various conditions can cause pipeline bubbles that reduce utilization: branches memory system delays etc.

Page 4: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Pipeline structures ARM7 has 3-stage pipes:

fetch instruction from memory decode opcode and operands execute

ARM9 have 5-stage pipes: Instruction fetch Decode Execute Data memory access Register write

Page 5: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

ARM9 core instruction pipeline

Page 6: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

ARM7 pipeline execution

add r0,r1,#5

sub r2,r3,r6

cmp r2,#3

fetch

time

decode

fetch

execute

decode

fetch

execute

decode execute

1 2 3

Page 7: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Performance measures

Latency: time it takes for an instruction to get through the pipeline

Throughput: number of instructions executed per time period

Pipelining increases throughput without reducing latency

Page 8: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Pipeline stalls

If every step cannot be completed in the same amount of time, pipeline stalls

Bubbles introduced by stall increase latency, reduce throughput

Page 9: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

ARM multi-cycle LDMIA instruction

fetch decodeex ld r2ldmia r0,{r2,r3}

sub r2,r3,r6

cmp r2,#3

ex ld r3

fetch

time

decode ex sub

fetch decodeex cmp

Page 10: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Control stalls

Branches often introduce stalls (branch penalty) Stall time may depend on whether branch is

taken May have to squash instructions that

already started executing Don’t know what to fetch until condition is

evaluated

Page 11: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

ARM pipelined branch

time

fetch decode ex bnebne foo

sub r2,r3,r6

fetch decode

foo add r0,r1,r2

ex bne

fetch decode ex add

ex bne

Page 12: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Delayed branch

To increase pipeline efficiency, delayed branch mechanism requires n instructions after branch always executed whether branch is executed or not

Page 13: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Example: ARM7 execution time

Determine execution time of FIR filter:for (i=0; i<N; i++)

f = f + c[i]*x[i]; Only branch in loop test may take more

than one cycle. BLT loop takes 1 cycle best case, 3 worst

case.

Page 14: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

ARM10 processor execution time

Impossible to describe briefly the exact behavior of all instructions in all circumstances Branch prediction

Prefetch buffer Branch folding

The independent Load/Store Unit Data alignment How many accesses hit in the cache and TLB

Page 15: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

ARM10 integer core

3 instr’s

Page 16: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Integer core Prefetch Unit

Fetches instructions from I-cache or external memory Predicts the outcome of branches whenever it can

Integer Unit Decode Barrel shifter, ALU, Multiplier Main instruction sequencer

Load/store Unit Load or store two registers(64bits) per cycle Decouple from the integer unit after the first access of a LDM or

STM instruction Supports Hit-Under-Miss (HUM) operation

Page 17: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Pipeline Fetch

I-cache access, branch prediction Issue

Initial instruction decode Decode

Final instruction decode, register read for ALU op, forwarding, and initial interlock resolution

Execute Data address calculation, shift, flag setting, CC check, branch mispredict

detection, and store data register read Memory

Data cache access Write

Register writes, instruction retirement

Page 18: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Typical operations

Page 19: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Load or store operation

Page 20: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

LDR operation that misses

Page 21: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Interlocks Integer core

forwarding to resolve data dependencies between instructions Pipeline interlocks

Data dependency interlocks: Instructions that have a source register that is loaded from memory by

the previous instruction Hardware dependency

A new load waiting for the LSU to finish an existing LDM or STM A load that misses when the HUM slot is already occupied A new multiply waiting for a previous multiply to free up the first stage

of the multiply

Page 22: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Pipeline forwarding paths

Page 23: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Example of interlocking and forwarding

Execute-to-executemov r0, #1

add r1, r0, #1

Memory-to-executemov r0, #1

sub r1, r2, #2

add r2, r0, #1

Page 24: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Example of interlocking and forwarding, cont’d

Single cycle interlockldr r0, [r1, r2]str r3, [r0, r4]

fetch decode writememoryexecuteissue

ldr r1+r2

fetch decode writememoryexecuteissue

str

r0 read

r0+r4r3 read

r3 write

Page 25: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Superscalar execution

Superscalar processor can execute several instructions per cycle. Uses multiple pipelined data paths.

Programs execute faster, but it is harder to determine how much faster.

Page 26: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Data dependencies

Execution time depends on operands, not just opcode.

Superscalar CPU checks data dependencies dynamically:

add r2,r0,r1add r3,r2,r5

data dependency r0 r1

r2 r5

r3

Page 27: CPUs CPU performance CPU power consumption Elements of CPU performance Cycle time CPU pipeline Memory system

Memory system performance

Caches introduce indeterminacy in execution time Depends on order of execution

Cache miss penalty: added time due to a cache miss

Several reasons for a miss: compulsory, conflict, capacity