A Critical Look At IA-64: Massive Resources, Massive ILP, But Can It Deliver? Martin Hopkins, IBM Research, 2/7/00. Sampoorani, Sivakumar and Joshua

Design decisions common to modern processors


Page 1: Design decisions common to modern processors

A Critical Look At IA-64
Massive Resources, Massive ILP, But Can It Deliver?

Martin Hopkins, IBM Research
2/7/00

Sampoorani, Sivakumar and Joshua

Page 2: Design decisions common to modern processors

Design decisions common to modern processors

Pipelining
Micro Ops
Large ROB
Single path execution
Dynamic scheduling

Page 3: Design decisions common to modern processors

At what cost?

Accurate Branch Prediction
Dependency Checking
Register Renaming
Alias Detection Hardware

Page 4: Design decisions common to modern processors

Performance of IA-64

Execution time = Cycle Time * IC * CPI
No improvement reported in frequency
Possible Reasons?
– Reducing CPI at the cost of cycle time
– Compares and branches in same cycle
– Predicated Execution

=> more FUs => more complexity + longer wires
limit on frequency
=> more power
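To make the trade-off in the execution-time formula concrete, here is a minimal sketch. All numbers are hypothetical and chosen only for illustration; none of them come from the slides. It shows that a 20% CPI reduction bought with a 25% longer cycle leaves execution time unchanged, and that any growth in instruction count (IC) then turns it into a net loss.

```python
import math


def execution_time(cycle_time_ns, instruction_count, cpi):
    """Execution time = Cycle Time * IC * CPI, in nanoseconds."""
    return cycle_time_ns * instruction_count * cpi


# Hypothetical baseline machine.
baseline = execution_time(cycle_time_ns=1.0, instruction_count=1_000_000, cpi=1.0)

# Hypothetical wide machine: CPI drops 20%, but the cycle is 25% longer.
wide = execution_time(cycle_time_ns=1.25, instruction_count=1_000_000, cpi=0.8)

# Same wide machine with a 20% longer dynamic path (more instructions executed).
wide_longer_path = execution_time(cycle_time_ns=1.25, instruction_count=1_200_000, cpi=0.8)

print(math.isclose(wide, baseline))   # the CPI gain is cancelled by the cycle time
print(wide_longer_path / baseline)    # slowdown once IC grows as well
```

The point of the sketch is only that the three factors multiply, so a gain in one can be erased by a loss in either of the others.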

Page 5: Design decisions common to modern processors

Dynamic Path Length (IC)

Longer than other architectures

Reasons?
– Speculation
– Check operations and recovery code
– Predication
– No sign extended loads
– No integer multiply or divide

Page 6: Design decisions common to modern processors

Dynamic Path Length (IC)

Loads and Stores – Only post execution update of base register

ldsz.ldtype.ldhint r1 = [r3]          no base update form
ldsz.ldtype.ldhint r1 = [r3], r2      register base update
ldsz.ldtype.ldhint r1 = [r3], imm     immediate base update
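The effect of having only post-update addressing can be sketched with a small register-and-memory model. The function names, the dictionary-based memory, and the instruction counts returned are illustrative assumptions, not part of the slides. The sketch shows that a post-increment load does the load and the base update in one instruction, while an access at base plus displacement needs a separate add first.

```python
def load_post_increment(memory, regs, dest, base, increment):
    """Model of `ld dest = [base], increment`: load first, then update the base.

    Returns the number of instructions this access costs (1).
    """
    regs[dest] = memory[regs[base]]
    regs[base] += increment
    return 1


def load_with_displacement(memory, regs, dest, base, disp, scratch):
    """Model of an access at base + disp when no base+disp form exists.

    An explicit add must form the address in a scratch register before the
    load, so the access costs 2 instructions.
    """
    regs[scratch] = regs[base] + disp       # add scratch = disp, base
    regs[dest] = memory[regs[scratch]]      # ld  dest = [scratch]
    return 2


memory = {100: 7, 108: 9}   # hypothetical memory contents
regs = {"r3": 100}

n_post = load_post_increment(memory, regs, "r1", "r3", 8)
n_disp = load_with_displacement(memory, regs, "r4", "r3", -8, "r5")

print(regs["r1"], regs["r3"], n_post)   # loaded value, updated base, cost
print(regs["r4"], n_disp)               # same location reached via explicit add
```

Striding through an array benefits from the first form. Field accesses at fixed offsets from a pointer pay for the second, which is one of the contributors to path length listed above.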

Page 7: Design decisions common to modern processors

CPI

Cache Effects
– Larger code footprint
  » 128 bit bundle - 3 instructions
  » Restrictions on placing instructions
  » Branch target - beginning of bundle
– Recovery code
  » Pollutes I-Cache and/or triggers page faults
– Speculative loads - Pollute D-cache

Page 8: Design decisions common to modern processors

Stalls possible

Example
load ra =
load rb = ;;    // end of bundle
add rx = ra
load ry = [rb];;

If load ra causes a cache miss, stall.
Superscalar out-of-order processors – can execute non-dependent instructions in parallel with the cache miss.
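The stall in this example can be illustrated with a deliberately idealized timing model. The latencies (100 cycles for the missing load, 2 for a hit, 1 for the add) are hypothetical, and both functions are simplifications written for this sketch: the in-order model holds a whole group until every source it needs is ready, and the out-of-order model ignores issue width and window size.

```python
def in_order_groups(groups, latency):
    """Issue whole groups in program order.

    A group cannot issue until every source register used by any instruction
    in it is ready, so one pending load delays the entire group.
    Returns {dest register: cycle at which its value is ready}.
    """
    ready = {}
    cycle = 0
    for group in groups:
        sources = [s for _, srcs in group for s in srcs]
        issue = max([cycle] + [ready[s] for s in sources])
        for dest, _ in group:
            ready[dest] = issue + latency[dest]
        cycle = issue + 1
    return ready


def out_of_order(groups, latency):
    """Idealized dataflow model: each instruction starts once its own sources are ready."""
    ready = {}
    for group in groups:
        for dest, srcs in group:
            start = max([0] + [ready[s] for s in srcs])
            ready[dest] = start + latency[dest]
    return ready


# The slide's example: (destination, [sources]) per instruction, two groups.
groups = [
    [("ra", []), ("rb", [])],
    [("rx", ["ra"]), ("ry", ["rb"])],
]
latency = {"ra": 100, "rb": 2, "rx": 1, "ry": 2}   # hypothetical: load ra misses

static_schedule = in_order_groups(groups, latency)
dynamic_schedule = out_of_order(groups, latency)

print(static_schedule["ry"], dynamic_schedule["ry"])
```

Under these assumptions the independent load of ry finishes only after the miss in the in-order model, while the out-of-order model overlaps it with the miss.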

Page 9: Design decisions common to modern processors

Comparing Complexities

Support for speculative execution
– Superscalar processors
  » reorder buffer
  » register renaming hardware
– EPIC
  » need to expose parallelism, speculation
  » hardware just does what the compiler says

Page 10: Design decisions common to modern processors

IA-64: Exposing Speculative Execution

Control speculation (moving loads above branches)
Data speculation (moving loads above stores)

Page 11: Design decisions common to modern processors

Control Speculation

Hardware for deferring exceptions exposed to software
– NaT (Not a Thing or poison bits)
  » set NaT bit associated with a register on exception
  » perform an explicit check before using the register
– Increase in machine state
  » 2 NaT registers
  » instructions to modify, test, and retrieve NaT values
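The deferred-exception mechanism described above can be sketched in a few lines. This is a behavioural model under stated assumptions, not the hardware definition: `NAT` is a hypothetical sentinel standing in for a set NaT bit, a missing dictionary key stands in for a faulting address, and the function names are invented for the sketch.

```python
NAT = object()   # hypothetical sentinel: "this register's NaT bit is set"


def speculative_load(memory, address):
    """Model of a speculative load: on a fault, defer it by returning NaT."""
    return memory.get(address, NAT)


def add(a, b):
    """Dependent operations propagate NaT instead of raising."""
    return NAT if a is NAT or b is NAT else a + b


def check(value, recovery):
    """Model of the explicit check: run recovery code only if NaT is seen."""
    return recovery() if value is NAT else value


memory = {0x10: 5}   # hypothetical memory; any other address "faults"

# Speculation succeeded: the check passes the value through.
ok = check(add(speculative_load(memory, 0x10), 1), recovery=lambda: "recovered")

# Speculation faulted: the exception surfaces only at the check.
deferred = check(add(speculative_load(memory, 0x99), 1), recovery=lambda: "recovered")

print(ok, deferred)
```

The sketch also shows where the cost comes from: every speculated value needs a check instruction on the path that uses it, plus recovery code that must exist even if it rarely runs.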

Page 12: Design decisions common to modern processors

Data Speculation

Explicit memory-alias-detection table
– ALAT (Advanced Load Address Table)
  » loads place their entries in ALAT
  » stores remove the entry if addresses match
– Hardware cost:
  » ALAT is 32 entry, 2 way set associative
  » recovery code requires that operands be maintained (until the store is seen the operands have to be maintained)
  » increased register requirements (128 Int + 128 FP)
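The ALAT behaviour listed above (loads insert, matching stores remove, a later check looks the entry up) can be sketched as follows. This is a simplified model under assumptions made for the sketch: the table is treated as fully associative with oldest-first eviction rather than 2-way set associative, addresses are compared exactly, and the class and method names are invented.

```python
class ALAT:
    """Simplified model of an advanced load address table."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}   # target register -> address of its advanced load

    def advanced_load(self, reg, address):
        """A load hoisted above a store records its target register and address."""
        if reg not in self.entries and len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))   # dicts keep insertion order
            del self.entries[oldest]
        self.entries[reg] = address

    def store(self, address):
        """A store removes every entry whose address matches."""
        for reg in [r for r, a in self.entries.items() if a == address]:
            del self.entries[reg]

    def check(self, reg):
        """True if the advanced load is still valid; False means recovery is needed."""
        return reg in self.entries


alat = ALAT()
alat.advanced_load("r1", 0x100)

alat.store(0x200)                     # unrelated store: speculation still valid
valid_after_other_store = alat.check("r1")

alat.store(0x100)                     # aliasing store: entry removed
valid_after_alias = alat.check("r1")

print(valid_after_other_store, valid_after_alias)
```

Note that an evicted entry also fails the check, so a small table turns capacity pressure into spurious recovery even when no alias occurred.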

Page 13: Design decisions common to modern processors

Data Speculation Hardware Costs

Increased register pressure implies
– more state to be saved across functions
– to avoid this:
  » Register stacking (SPARC register windows)
    (0-31) global registers, others dynamically mapped
  » CFM (Current Frame Marker)
  » Register Stack engine

Should also handle stack overflows
Additional complexity due to rotating registers
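A rough sketch of the "global plus dynamically mapped" split is given below. It assumes that registers 32 to 127 form the stacked region and that a frame is located by a single base offset; the function name and the `frame_base` parameter are illustrative, and the real mapping also involves the frame marker fields and register rotation, which are not modelled here.

```python
NUM_STATIC = 32     # r0-r31 are global, as stated on the slide
NUM_STACKED = 96    # assumption for this sketch: r32-r127 are the stacked region


def physical_register(virtual, frame_base):
    """Map a virtual register number to a physical one for the current frame.

    Static registers map to themselves. Stacked registers are offset by the
    frame base and wrap around the stacked region.
    """
    if virtual < NUM_STATIC:
        return virtual
    return NUM_STATIC + (frame_base + virtual - NUM_STATIC) % NUM_STACKED


print(physical_register(5, frame_base=40))     # global: unchanged
print(physical_register(32, frame_base=10))    # first stacked register of the frame
print(physical_register(127, frame_base=10))   # wraps around the stacked region
```

The wrap-around is where the overflow handling mentioned above comes in: once frames wrap onto registers still owned by callers, something has to spill those registers to memory and restore them later.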

Page 14: Design decisions common to modern processors

Hardware Costs

Reorder buffer
Register rename mechanism

NaT bits, associated instructions
ALAT
Increased number of registers
Reg Stack Engine
– Additional complexities due to rotating registers, page faults, …

Page 15: Design decisions common to modern processors

Runtime Information

Information about behavior of programs
– Can’t be predicted at compile time
– Profiling helps
  » But costly…

Superscalar machines
– Dynamic selection of instructions to execute
– Rely upon information known at run time

Page 16: Design decisions common to modern processors

EPIC

Depends mostly on compiler
– Run time information is not used so much

Consider the following code sequence
cmp p1, p2 = ..                         /* set predicate registers */
(p1) br.cond low_probability_path ;;    /* if (p1) goto ... */
l ra = [rb];;
add rc = ra, rd;;
use of (rc)

4 bundles, load not hoisted over a branch (which is not usually taken)

Page 17: Design decisions common to modern processors

As Scheduled by IA-64 Compiler

Optimize for the most probable path
l.s ra = [rb];;
add rc = ra, rd
cmp p1, p2 = ...
(p1) br.cond low_probability_path ;;
check.s rc, recovery_code
use of (rc)

3 bundles

Page 18: Design decisions common to modern processors

When Low Probability Path Is Taken

Superscalar processor
– Execute the load as early as possible
– Cancel if found to be mis-speculated
– Change assumptions dynamically

EPIC
– load has to complete since dependent add is in next bundle
– may take 100s of cycles if the pointer is random
– Heavy penalty if the compiler gets the probabilities wrong
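The size of that penalty depends on how often the low-probability path is actually taken, which a small expected-cost calculation makes visible. The cycle counts are hypothetical: 4 versus 3 cycles on the common path echo the 4-bundle and 3-bundle schedules, the 100-cycle figure stands in for "100s of cycles", and the sketch assumes the hoisted load misses only when the rare path is taken.

```python
import math


def expected_cycles(p_rare, common_cost, rare_cost):
    """Average cost given the probability that the rare path is taken."""
    return (1 - p_rare) * common_cost + p_rare * rare_cost


def break_even_probability(unhoisted, hoisted):
    """Rare-path probability at which both schedules cost the same.

    Each argument is a (common_cost, rare_cost) pair in cycles.
    """
    gain = unhoisted[0] - hoisted[0]     # cycles saved on the common path
    loss = hoisted[1] - unhoisted[1]     # extra cycles paid on the rare path
    return gain / (gain + loss)


UNHOISTED = (4, 1)     # hypothetical: load stays below the branch
HOISTED = (3, 100)     # hypothetical: load hoisted, must complete on the rare path

p_star = break_even_probability(UNHOISTED, HOISTED)
at_5_percent_hoisted = expected_cycles(0.05, *HOISTED)
at_5_percent_unhoisted = expected_cycles(0.05, *UNHOISTED)

print(p_star)
print(at_5_percent_hoisted, at_5_percent_unhoisted)
```

With these made-up numbers the hoisted schedule only pays off if the branch is taken less than about 1% of the time, which is why a wrong profile estimate is so costly for a statically scheduled machine.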

Page 19: Design decisions common to modern processors

Dependence on Profiling

RISC and CISC find profiling useful, but not essential
IA-64 is much more dependent on profiling
Difficulties involved with profiling
– Additional responsibility for programmer
– Creating a representative test suite
– Using in demanding, diverse development environments

Page 20: Design decisions common to modern processors

Code Bloat

Factor                                  Increase (%)
RISC instructions                       50
3 instructions per 128 bits             33
Avg of 2 instructions per bundle        33
Branch target at beginning of bundle    10
Check ops
Recovery code                           20
No base+disp addressing                 15
No sign-extended loads
Predication
Optimizations                           30

IA-64 code should be 4.8 times x86 code
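One way to read the code-bloat figures is as percentage increases that compound multiplicatively. That reading is an assumption: the slide does not say how the numbers combine, and pairing each figure with the row it follows is also a guess because the original table was flattened. A minimal sketch of the arithmetic under those assumptions:

```python
import math

# Percent increases as they appear on the slide. Which row each figure
# belongs to is an assumption made for this sketch.
increases_percent = {
    "RISC instructions": 50,
    "3 instructions per 128 bits": 33,
    "Avg of 2 instructions per bundle": 33,
    "Branch target at beginning of bundle": 10,
    "Check ops / recovery code": 20,
    "No base+disp addressing": 15,
}
optimizations_percent = 30


def compound(percentages):
    """Multiply the growth factors (1 + p/100) together."""
    return math.prod(1 + p / 100 for p in percentages)


without_optimizations = compound(increases_percent.values())
with_optimizations = without_optimizations * (1 + optimizations_percent / 100)

print(round(without_optimizations, 2), round(with_optimizations, 2))
```

Under this reading the factors other than optimizations multiply out to roughly 4, and adding the 30% figure gives a little over 5. That does not reproduce the 4.8 stated on the slide, so the author presumably combined or rounded the figures differently.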

Page 21: Design decisions common to modern processors

Some things that may reduce code size

Post-increment loads can eliminate an add in a loop
– e.g. accessing an array in strides
Combining a compare and a logical op
r1 + r2 + 1
Rotating register files for s/w pipelining

All the above amount to <5% difference.
So net code bloat is about 4 times (excluding optimization overhead).
Code bloat => More memory b/w requirement.

Page 22: Design decisions common to modern processors

Performance comparison

800MHz Itanium

SPECint
<68% Alpha 21264 (1GHz) (20% less power)
<60% P4 (2GHz)

SPECfp
>20% Alpha 21264
>8% P4

Power – a major hurdle

Page 23: Design decisions common to modern processors

Conclusion

The IA-64 gamble – power is not going to be a critical limitation in future.
This allows use of massive resources