23
www.compaq.com Simultaneous Simultaneous Multithreading: Multithreading: Multiplying Alpha Multiplying Alpha Performance Performance Dr. Joel Emer Dr. Joel Emer Principal Member Technical Staff Principal Member Technical Staff Alpha Development Group Alpha Development Group Compaq Computer Corporation Compaq Computer Corporation

Www.compaq.com Simultaneous Multithreading: Multiplying Alpha Performance Dr. Joel Emer Principal Member Technical Staff Alpha Development Group Compaq

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

www.compaq.com

Simultaneous Multithreading:Simultaneous Multithreading:Multiplying Alpha PerformanceMultiplying Alpha Performance

Dr. Joel EmerDr. Joel EmerPrincipal Member Technical StaffPrincipal Member Technical Staff

Alpha Development GroupAlpha Development GroupCompaq Computer CorporationCompaq Computer Corporation

www.compaq.com

OutlineOutline

Alpha Processor RoadmapAlpha Processor Roadmap Motivation for Introducing SMTMotivation for Introducing SMT Implementation of an SMT CPUImplementation of an SMT CPU Performance EstimatesPerformance Estimates Architectural AbstractionArchitectural Abstraction

www.compaq.com

Higher Performance

Lo

wer

Co

st

2000 2001 2002 20031998 1999

21264 21264 EV6 EV6

2126421264EV68EV68

0.35m

2126421264EV67EV67

0.28m

0.18m

EV7EV70.18m

...

EV8EV80.125m

Alpha Microprocessor OverviewAlpha Microprocessor Overview

EV78EV780.125m

First System Ship

www.compaq.com

EV8 Technology OverviewEV8 Technology Overview

Leading edge process technology – 1.2-2.0GHzLeading edge process technology – 1.2-2.0GHz0.125µm CMOS0.125µm CMOSSOI-compatibleSOI-compatibleCu interconnectCu interconnect low-k dielectricslow-k dielectrics

Chip characteristicsChip characteristics~1.2V Vdd~1.2V Vdd~250 Million transistors~250 Million transistors~1100 signal pins in flip chip packaging~1100 signal pins in flip chip packaging

www.compaq.com

EV8 Architecture OverviewEV8 Architecture Overview

Enhanced out-of-order executionEnhanced out-of-order execution 8-wide superscalar8-wide superscalar Large on-chip L2 cacheLarge on-chip L2 cache Direct RAMBUS interfaceDirect RAMBUS interface On-chip router for system interconnect On-chip router for system interconnect Glueless, directory-based, ccNUMA for up to 512-way SMPGlueless, directory-based, ccNUMA for up to 512-way SMP 4-way simultaneous multithreading (SMT)4-way simultaneous multithreading (SMT)

www.compaq.com

GoalsGoals

Leadership single stream performanceLeadership single stream performance

Extra multistream performance with multithreadingExtra multistream performance with multithreadingWithout major architectural changesWithout major architectural changesWithout significant additional costWithout significant additional cost

www.compaq.com

Instruction IssueInstruction Issue

Reduced function unit utilization due to dependencies

Time

www.compaq.com

Superscalar IssueSuperscalar Issue

Superscalar leads to more performance, but lower utilization

Time

www.compaq.com

Predicated IssuePredicated Issue

Adds to function unit utilization, but results are thrown away

Time

www.compaq.com

Chip MultiprocessorChip Multiprocessor

Limited utilization when only running one thread

Time

www.compaq.com

Fine Grained MultithreadingFine Grained Multithreading

Intra-thread dependencies still limit performance

Time

www.compaq.com

Simultaneous MultithreadingSimultaneous Multithreading

Maximum utilization of function units by independent operations

Time

www.compaq.com

Basic Out-of-order PipelineBasic Out-of-order Pipeline

Fetch Decode/Map

Queue Reg Read

Execute Dcache/Store Buffer

Reg Write

Retire

PC

Icache

RegisterMap

DcacheRegs Regs

Thread-blind

www.compaq.com

SMT PipelineSMT Pipeline

Fetch Decode/Map

Queue Reg Read

Execute Dcache/Store Buffer

Reg Write

Retire

IcacheDcache

PC

RegisterMap

Regs Regs

www.compaq.com

Changes for SMTChanges for SMT

Basic pipeline – unchangedBasic pipeline – unchanged

Replicated resourcesReplicated resources Program countersProgram counters Register mapsRegister maps

Shared resourcesShared resources Register file (size increased)Register file (size increased) Instruction queueInstruction queue First and second level cachesFirst and second level caches Translation buffersTranslation buffers Branch predictorBranch predictor

www.compaq.com

Multiprogrammed workloadMultiprogrammed workload

0%

50%

100%

150%

200%

250%

SpecInt SpecFP Mixed Int/FP

1T

2T

3T

4T

www.compaq.com

Decomposed SPEC95 ApplicationsDecomposed SPEC95 Applications

0%

50%

100%

150%

200%

250%

Turb3d Swm256 Tomcatv

1T

2T

3T

4T

www.compaq.com

Multithreaded ApplicationsMultithreaded Applications

0%

50%

100%

150%

200%

250%

300%

Barnes Chess Sort TP

1T

2T

4T

www.compaq.com

Architectural AbstractionArchitectural Abstraction

1 CPU with 4 Thread Processing Units (TPUs)1 CPU with 4 Thread Processing Units (TPUs) Shared hardware resourcesShared hardware resources

TPU 0 TPU1 TPU2 TPU3

Icache TLB Dcache

Scache

www.compaq.com

System Block DiagramSystem Block Diagram

EV8M

IOEV8

M

IOEV8

M

IO

EV8M

IOEV8

M

IOEV8

M

IO

EV8M

IOEV8

M

IOEV8

M

IO

0 1 2 3

www.compaq.com

Quiescing Idle ThreadsQuiescing Idle Threads

Problem:Problem:Spin looping thread consumes resourcesSpin looping thread consumes resources

Solution:Solution:Provide quiescing operation that allows aProvide quiescing operation that allows aTPU to sleep until a memory location changesTPU to sleep until a memory location changes

www.compaq.com

SummarySummary

Alpha will maintain single stream performance leadership Alpha will maintain single stream performance leadership

SMT will significantly enhance multistream performanceSMT will significantly enhance multistream performanceAcross a wide range of applications,Across a wide range of applications,Without significant hardware cost, andWithout significant hardware cost, andWithout major architectural changesWithout major architectural changes

www.compaq.com

ReferencesReferences

""Simultaneous Multithreading: Maximizing On-Chip ParallelismSimultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, " by Tullsen, Eggers and Levy in ISCA95.Eggers and Levy in ISCA95.

""Exploiting Choice: Instruction Fetch and Issue on an Implementable Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreaded ProcessorSimultaneous Multithreaded Processor" by Tullsen, Eggers, Emer, Levy, Lo " by Tullsen, Eggers, Emer, Levy, Lo and Stamm in ISCA96.and Stamm in ISCA96.

““Converting Thread-Level Parallelism to Instruction-Level Parallelism via Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous MultithreadingSimultaneous Multithreading” by Lo, Eggers, Emer, Levy, Stamm and Tullsen ” by Lo, Eggers, Emer, Levy, Stamm and Tullsen in ACM Transactions on Computer Systems, August 1997.in ACM Transactions on Computer Systems, August 1997.

““Simultaneous Multithreading: A Platform for Next-Generation Prcoessors” by Simultaneous Multithreading: A Platform for Next-Generation Prcoessors” by Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October, 1997.Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October, 1997.