
Page 1: COMPSYS 304

COMPSYS 304

Computer Architecture

Speculation & Branching

Morning visitors - Paradise Bay, Bay of Islands

Page 2: COMPSYS 304

Speculation

• High Tech Gambling?
• Data Prefetch
  • Cache instruction dcbt : data cache block touch
  • Attempts to bring data into cache so that it will be “close” when needed
  • Allows the SIU to use idle bus bandwidth
    • if there’s no spare bandwidth, this read can be given low priority
  • Speculative because
    • a branch may occur before it’s used
    • we speculate that this data may be needed
• dcbt is a PowerPC mnemonic - similar opcodes are found in other architectures: SPARC v9, MIPS, …
  (a software sketch of the idea follows below)
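
The same prefetch idea is also exposed to programmers through compiler builtins. A minimal sketch in C, assuming GCC or Clang, whose __builtin_prefetch is typically lowered to a touch-style instruction such as dcbt on PowerPC; the function name and the prefetch distance of 16 elements are illustrative choices, not fixed values:

    #include <stddef.h>

    /* Sum an array, touching data a few iterations ahead so that it is
       already in the cache by the time it is actually loaded.           */
    long sum(const long *a, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                /* hint only: read access (0), modest temporal locality (1) */
                __builtin_prefetch(&a[i + 16], 0, 1);
            total += a[i];
        }
        return total;
    }

As with dcbt, this is only advice: if the data is not in the cache when the real load arrives, the program is still correct, just slower.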

Page 3: COMPSYS 304

Speculation - General

• Some functional units are almost always idle
  • Make them do some (possibly useful) work rather than sit idle
• If the speculation was incorrect, the results are simply abandoned
  • No loss in efficiency; chance of a gain
• Researchers are actively looking at software prefetch schemes
  • Fetch data well before it’s needed
  • Reduce latency when it’s actually needed
• Speculative operations have low priority and use idle resources

Page 4: COMPSYS 304

Branching

• Expensive
  • 2-3 cycles lost in the pipeline
  • All instructions following the branch are ‘flushed’
  • Bandwidth is wasted fetching unused instructions
  • Stall while the branch target is fetched
• We can speculate about the target of a branch
• Terminology
  • Branch Target : address to which the branch jumps
  • Branch Taken : control transfers to a non-sequential address (the target)
  • Branch Not Taken : the next sequential instruction is executed

Page 5: COMPSYS 304

Branching - Prediction

• Branches can be
  • unconditional: the branch is always taken
    • call subroutine
    • return from subroutine
  • conditional: the branch depends on the state of the computation, eg has the loop terminated yet?
• Unconditional branches are simple
  • New instructions are fetched as soon as the branch is recognized
    • As early in the pipeline as possible
  • Branch units are often placed with the fetch & decode stages

Page 6: COMPSYS 304

Branching - Branch Unit

• PowerPC 603 logical layout

Page 7: COMPSYS 304

Branching - Speculation

• We have the following code: if ( cond ) s1; else s2;
• Superscalar machine
  • Multiple functional units
  • Start executing both branches (s1 and s2)
  • Keep idle functional units busy!
• One of the two is speculative and will be abandoned
  • The processor will eventually calculate the branch condition and select which result should be retained (written back)
• MIPS R10000 - up to 4 speculative branches at once

Page 8: COMPSYS 304

Branching - Speculation

• MIPS R10000
  • Up to 4 speculative branches at once
  • Instructions are “tagged” with a 4-bit mask
    • Indicates to which branch instruction each belongs
  • As soon as the condition is determined, mis-predicted instructions are aborted
    (a toy illustration of the tagging idea follows below)
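
A toy software illustration of that tagging idea, not the actual R10000 hardware: each speculatively issued instruction carries one bit per unresolved branch, and resolving a branch as mis-predicted squashes every instruction whose mask contains that bit. All names and the instruction strings are hypothetical:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        const char *text;        /* instruction (for printing only)          */
        uint8_t     spec_mask;   /* bit i set => depends on pending branch i */
        int         squashed;
    } Insn;

    /* Abort every in-flight instruction that depends on mis-predicted branch b. */
    static void squash(Insn *window, int n, int b)
    {
        for (int i = 0; i < n; i++)
            if (window[i].spec_mask & (1u << b))
                window[i].squashed = 1;
    }

    int main(void)
    {
        Insn window[] = {
            { "add r1,r2,r3", 0x0, 0 },   /* not speculative                  */
            { "lw  r4,0(r1)", 0x1, 0 },   /* issued after branch 0 predicted  */
            { "sub r5,r4,r6", 0x1, 0 },
        };
        squash(window, 3, 0);             /* branch 0 resolves: mis-predicted */
        for (int i = 0; i < 3; i++)
            printf("%-14s %s\n", window[i].text,
                   window[i].squashed ? "squashed" : "retained");
        return 0;
    }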

Page 9: COMPSYS 304

Branching - Prediction

• We have a sequence of instructions:

      L1:  add         ; some mixture of arithmetic,
           lw          ; load, store, etc, instructions
           sub
           brne L1     ; branch on some condition
      L2:  or          ; some more arithmetic,
           st          ; load, store, etc, instructions

• If you were asked to guess which branch should be preferred, which would you choose:
  • the next sequential instruction (L2)?
  • the branch target (L1)?

Page 10: COMPSYS 304

Branching - Prediction

• Studies show that backward branches are taken most of the time!
• Because of loops:

      L1:  add         ; any mix of arith,
           lw          ; load, store, etc,
           sub         ; instructions
           brne L1     ; branch back to loop start
      L2:  or          ; some more arith,
           st          ; memory, etc, instructions

Page 11: COMPSYS 304

Branching - Prediction Rule

• A simple prediction rule: take backward branches
  • works amazingly well!
  • For a loop with n iterations, this is wrong in only 1/n cases - the single, final fall-through exit
    (eg a 100-iteration loop is predicted correctly 99% of the time)
• A system working on this rule alone would
  • detect the backward branch and
  • start fetching from the branch target rather than the next instruction
    (a sketch of the rule follows below)
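
A minimal sketch of this rule as C code, assuming the branch target address is already known when the branch is fetched (eg decoded from a PC-relative offset); the function name is illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    /* Static rule: predict "taken" whenever the branch jumps backwards,
       ie its target address is at or below the branch's own address.    */
    static bool predict_taken(uint32_t branch_pc, uint32_t target_pc)
    {
        return target_pc <= branch_pc;   /* backward branch => probably a loop */
    }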

Page 12: COMPSYS 304

Branching - Improving the prediction

• Static prediction systems
  • The compiler can mark branches
    • as likely to be taken or not
  • The instruction fetch unit will use the marking as advice on which instruction to fetch
  • The compiler is often able to give the right advice
    • Loops are easily detected
    • Other patterns in conditions can be recognized
      • Checking for EOF when reading a file
      • Error checking
        (a source-level sketch follows below)
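
At source level, this sort of advice can be written with compiler hints. A minimal sketch, assuming GCC or Clang, whose __builtin_expect tells the compiler which way a condition usually goes so it can lay out (and, on targets that support it, mark) the likely path; the file name and macro names are illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int main(void)
    {
        FILE *f = fopen("data.txt", "r");
        if (unlikely(f == NULL)) {             /* error path: rarely taken */
            perror("fopen");
            return EXIT_FAILURE;
        }

        int c, count = 0;
        while (likely((c = fgetc(f)) != EOF))  /* EOF test fails only once */
            count++;

        printf("%d bytes\n", count);
        fclose(f);
        return 0;
    }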

Page 13: COMPSYS 304

Branching - Improving the prediction

• Dynamic prediction systems
  • Program history determines the most likely branch
  • Branch Target Buffers - another cache!

Page 14: COMPSYS 304

Branching - Branch Target Buffer

• Instruction address bits Addr[11:3] select the BTB entry
• The tag determines a “hit”
• Status bits select taken / not taken
• Pentium 4: >91% prediction accuracy with a 4K-entry BHT (Branch History Table); G4e: 2K entries
  (a sketch of the lookup follows below)
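
A minimal sketch of such a direct-mapped BTB lookup in C, assuming 512 entries indexed by address bits [11:3] as on the slide; the field names, the tag width and the stored prediction field are illustrative, and the status-bit behaviour is covered on the later slides:

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512                    /* indexed by address bits [11:3] */

    typedef struct {
        uint32_t tag;                          /* address bits above the index   */
        uint32_t target;                       /* predicted branch target        */
        bool     predict_taken;                /* current prediction             */
        uint8_t  confidence;                   /* status bits (see later slides) */
        bool     valid;
    } BtbEntry;

    static BtbEntry btb[BTB_ENTRIES];

    /* On a hit, report the stored prediction and target for this branch address. */
    static bool btb_lookup(uint32_t pc, bool *predict_taken, uint32_t *target)
    {
        uint32_t index = (pc >> 3) & (BTB_ENTRIES - 1);   /* bits [11:3] */
        uint32_t tag   = pc >> 12;                        /* bits above  */
        const BtbEntry *e = &btb[index];

        if (e->valid && e->tag == tag) {
            *predict_taken = e->predict_taken;
            *target        = e->target;
            return true;                       /* hit: use the prediction            */
        }
        return false;                          /* miss: fall back to a default rule  */
    }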

Page 15: COMPSYS 304

Branching - Branch Target Buffer

• BTB – just another cache
  • Works on the temporal locality principle
    • If this branch is taken (not taken) now, it’s likely to be taken (not taken) next time
  • Replace on conflicts (newest is best)
  • Any cache organization could be used
    • Direct mapped, associative, set-associative
  • No write-back is needed
    • Flushed (evicted) entries are simply rebuilt the next time the branch is seen
• Major difference from other caches: the status bits …

Page 16: COMPSYS 304

Branching - Branch Target Buffer

• Status bits
  • Provide hysteresis in the behaviour
  • Without hysteresis, a change in behaviour would cause the prediction to update immediately
• Example:

      if ( cond ) s1;
      else        s2;

  • If the program takes branch s1 a few times, the BTB will predict that s1 is more likely than s2
  • If s2 is then taken, the usual cache behaviour suggests that the prediction should be updated to s2
  • but program branching behaviour is a little different …

Page 17: COMPSYS 304

Branching - Branch Target Buffer

• Status bits
  • Common branch behaviour looks like this
  • List of taken branches:
        s1 s1 s1 s1 s1 s2 s1 s1 s1 s2 s1 …
  • Usually s1 is executed, occasionally s2, eg
    • s2 handles errors
    • s2 follows a loop
  • ‘Standard’ cache update policies (which assume the most recent outcome will occur next) would update the prediction from s1 to s2 immediately
    • This would cause many mis-predictions: each s2 would be mis-predicted and then the following s1 would be mis-predicted as well

Page 18: COMPSYS 304

Branching - Branch Target Buffer

• Status bits
  • However, if the BTB waits until it has seen s2 a number of times before changing the prediction, the previous stream is predicted well
  • So the status bits (say 2 bits) hold a count of correct predictions
    • A correct prediction increments the count (saturating at 2 – ie it counts to a maximum of 2)
    • A mis-prediction decrements the count
    • A mis-prediction when the count is 0 updates the prediction
  • This accommodates an occasional break from a pattern (eg s1 is usually taken) without disturbing the best prediction (take s1)
  • It also handles situations where the behaviour changes sometimes
    (a sketch of the update rule follows below)
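
A minimal sketch of this update rule in C, matching the scheme just described (a stored prediction plus a saturating count of correct predictions); the structure and field names are illustrative, not a particular processor’s implementation:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool    predict_taken;    /* current prediction for this branch */
        uint8_t confidence;       /* count of correct predictions, 0..2 */
    } BranchStatus;

    /* Update the status bits once the real branch outcome is known. */
    static void update_status(BranchStatus *s, bool actually_taken)
    {
        if (actually_taken == s->predict_taken) {
            if (s->confidence < 2)               /* correct: count up, saturate at 2    */
                s->confidence++;
        } else if (s->confidence > 0) {
            s->confidence--;                     /* wrong: lose one unit of confidence  */
        } else {
            s->predict_taken = actually_taken;   /* wrong with count 0: flip prediction */
        }
    }

Applied to the stream s1 s1 s1 s1 s1 s2 s1 …, the single s2 only drops the count from 2 to 1, so the predictor keeps saying s1 and mis-predicts just once per occasional s2.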

Page 19: COMPSYS 304

Branching - Branch Target Buffer

• Status bits - count correct predictions
  • Handles situations where the behaviour changes sometimes
  • Programs which move from one ‘region’ to another, eg
    • Image processing code looking for an orange object
      • processes background (non-orange) pixels,
      • finds the orange thing,
      • counts orange pixels for a while, then
      • reverts back to background

      // search for orange object in row of pixels
      for ( j = 0; j < width; j++ ) {
          if ( pixel[j].colour != orange )    // s1
              bg_cnt++;
          else {                              // s2
              o_cnt++;
              if ( o_cnt > obj_width ) …      // found it!
          }
      }

Page 20: COMPSYS 304

Branching - Branch Target Buffer

• Status bits - count correct predictions
  • Handles situations where the behaviour changes sometimes
  • Programs which move from one ‘region’ to another ..
• Example:
  • Image processing program looking for an orange object
    • processes background (non-orange) pixels,
    • finds the orange thing,
    • counts orange pixels for a while, then
    • reverts back to background
• List of taken branches:

      Taken branches:  s1  s1  s1  s2  s2  s2  …  s2  s1  s1  s1  s1
      Region:          BG  BG  BG  OR  OR  OR  …  OR  BG  BG  BG  BG
      Prediction:      s1  s1  s1  s1  s1  s2  …  s2  s2  s2  s1  s1
      Correct:         …

Page 21: COMPSYS 304

Branching - Branch Target Buffer

• Status bits - count correct predictions
  • A reasonable compromise behaviour for most situations
    • Tolerates an occasional ‘error’ branch well
    • Changes to a new behaviour with only a small delay
  • Typically about 90% correct predictions with a BTB of 2k – 4k entries

Page 22: COMPSYS 304

Speculation & Branching - Summary

• Data speculation
  • Try to bring data ‘closer’ to the CPU (ie into cache) before it is needed
  • Reduce memory access latency
• Techniques
  • Special ‘touch’ instructions
    • Advice to the processor – fetch if resources are available
  • Software
    • eg a dummy reference
• Instruction (Branch) speculation ..

Page 23: COMPSYS 304

Speculation & Branching - Summary

• Branches are expensive!!
• Instruction (Branch) speculation
  • Execute both branches of a conditional branch
  • ‘Squash’ (abandon) the results from the wrong branch when the branch condition is eventually evaluated
  • The compiler can also mark the most probable branch
• Branch prediction
  • Simplest rule: take backward branches
  • Branch Target Buffer
    • A cache containing the most recent branch targets
    • A ‘standard’ cache, except for the status bits
      • Introduce hysteresis into the behaviour
      • Only update the branch target when it’s definitely the right choice

Page 24: COMPSYS 304

Superscalar - summary

• Superscalar machines have multiple functional units (FUs)
  • eg 2 x integer ALU, 1 x FPU, 1 x branch unit, 1 x load/store unit
• They require a complex instruction fetch unit (IFU)
  • Able to issue multiple instructions per cycle (typically 4)
  • Able to detect hazards (unavailability of operands)
  • Able to re-order instruction issue
• The aim is to keep all the FUs busy
• Typically, 6-way superscalars achieve an instruction-level parallelism of only 2-3