View
19
Download
0
Category
Preview:
Citation preview
www.compaq.com
Simultaneous Multithreading:Simultaneous Multithreading:Multiplying Alpha PerformanceMultiplying Alpha Performance
Dr. Joel EmerDr. Joel EmerPrincipal Member Technical StaffPrincipal Member Technical Staff
Alpha Development GroupAlpha Development GroupCompaq Computer CorporationCompaq Computer Corporation
www.compaq.com
OutlineOutline
uu Alpha Processor RoadmapAlpha Processor Roadmap
uu Motivat ion for Introducing SMTMotivat ion for Introducing SMT
uu Implementat ion of an SMT CPUImplementat ion of an SMT CPU
uu Performance Est imatesPerformance Est imates
uu Architectural Abstract ionArchitectural Abstract ion
www.compaq.com
Higher Performance
Low
er C
ost
2000 2001 2002 20031998 1999
2126421264EV6EV6
2126421264EV68EV68
0.35µ m
2126421264EV67EV67
0.28µ m
0.18µ m
EV7EV70.18µ m
...
EV8EV80.125µ m
Alpha Microprocessor OverviewAlpha Microprocessor Overview
EV78EV780.125µ m
First System Ship
www.compaq.com
EV8 Technology OverviewEV8 Technology Overview
uu Leading edge process technology – 1.2- 2.0GHzLeading edge process technology – 1.2- 2.0GHzll 0.125µm CMOS0.125µm CMOSll SOI- compatibleSOI- compatiblell Cu interconnectCu interconnectll low- k dielectricslow- k dielectrics
uu Chip characterist icsChip characterist icsll ~1.2V ~1.2V VddVdd
ll ~250 Million transistors~250 Million transistors
ll ~1100 signal pins in f lip chip packaging~1100 signal pins in f lip chip packaging
www.compaq.com
EV8 Architecture OverviewEV8 Architecture Overview
uu Enhanced out- of- order execut ionEnhanced out- of- order execut ion
uu 8- wide 8- wide superscalarsuperscalar
uu Large on- chip L2 cacheLarge on- chip L2 cache
uu Direct RAMBUS interfaceDirect RAMBUS interface
uu On- chip router for system interconnectOn- chip router for system interconnect
uu GluelessGlueless, directory- based, , directory- based, ccNUMAccNUMA for up to 512- way for up to 512- waySMPSMP
uu 4- way simultaneous mult ithreading (SMT)4- way simultaneous mult ithreading (SMT)
www.compaq.com
GoalsGoals
uu Leadership single stream performanceLeadership single stream performance
uu Extra mult istream performance withExtra mult istream performance withmult ithreadingmult ithreading
ll Without major architectural changesWithout major architectural changes
ll Without signif icant addit ional costWithout signif icant addit ional cost
www.compaq.com
Instruction IssueInstruction Issue
Reduced function unit utilization due to dependencies
Time
www.compaq.com
Superscalar Superscalar IssueIssue
Superscalar leads to more performance, but lower utilization
Time
www.compaq.com
Predicated IssuePredicated Issue
Adds to function unit utilization, but results are thrown away
Time
www.compaq.com
Chip MultiprocessorChip Multiprocessor
Limited utilization when only running one thread
Time
www.compaq.com
Fine Grained MultithreadingFine Grained Multithreading
Intra-thread dependencies still limit performance
Time
www.compaq.com
Simultaneous MultithreadingSimultaneous Multithreading
Maximum utilization of function units by independent operations
Time
www.compaq.com
Basic Out-of-order PipelineBasic Out-of-order Pipeline
Fetch Decode/ Map
Queue RegRead
Execute
Dcache/ StoreBuffer
RegWrite
Retire
PC
Icache
RegisterMap
DcacheRegs Regs
Thread-blind
www.compaq.com
SMT PipelineSMT Pipeline
Fetch Decode/ Map
Queue RegRead
Execute
Dcache/ StoreBuffer
RegWrite
Retire
Icache
Dcache
PC
RegisterMap
Regs Regs
www.compaq.com
Changes for SMTChanges for SMT
uu Basic pipeline – unchangedBasic pipeline – unchanged
uu Replicated resourcesReplicated resourcesll Program countersProgram countersll Register mapsRegister maps
uu Shared resourcesShared resourcesll Register f ile (size increased)Register f ile (size increased)ll Instruct ion queueInstruct ion queuell First and second level cachesFirst and second level cachesll Translat ion buffersTranslat ion buffersll Branch predictorBranch predictor
www.compaq.com
Multiprogrammed workloadMultiprogrammed workload
0%
50%
100%
150%
200%
250%
SpecInt SpecFP Mixed Int/FP
1T2T3T4T
www.compaq.com
Decomposed SPEC95 ApplicationsDecomposed SPEC95 Applications
0%
50%
100%
150%
200%
250%
Turb3d Swm256 Tomcatv
1T2T3T4T
www.compaq.com
Multithreaded ApplicationsMultithreaded Applications
0%
50%
100%
150%
200%
250%
300%
Barnes Chess Sort TP
1T2T4T
www.compaq.com
Architectural AbstractionArchitectural Abstraction
uu 1 CPU with 4 Thread Processing Units (1 CPU with 4 Thread Processing Units (TPUsTPUs))
uu Shared hardware resourcesShared hardware resources
TPU 0 TPU1 TPU2 TPU3
I cache TLB Dcache
Scache
www.compaq.com
System Block DiagramSystem Block Diagram
EV8M
IOEV8
M
IOEV8
M
IO
EV8M
IOEV8
M
IOEV8
M
IO
EV8M
IOEV8
M
IOEV8
M
IO
0 1 2 3
www.compaq.com
QuiescingQuiescing Idle Threads Idle Threads
uu Problem:Problem:Spin looping thread consumes resourcesSpin looping thread consumes resources
uu Solut ion:Solut ion:Provide Provide quiescingquiescing operat ion that allows a operat ion that allows aTPU to sleep unt il a memory locat ion changesTPU to sleep unt il a memory locat ion changes
www.compaq.com
SummarySummary
uu Alpha will maintain single stream performanceAlpha will maintain single stream performanceleadershipleadership
uu SMT will signif icant ly enhance mult istreamSMT will signif icant ly enhance mult istreamperformanceperformance
ll Across a wide range of applicat ions,Across a wide range of applicat ions,
ll Without signif icant hardware cost, andWithout signif icant hardware cost, and
ll Without major architectural changesWithout major architectural changes
www.compaq.com
ReferencesReferences
uu ""Simultaneous Multithreading: Maximizing On-Chip ParallelismSimultaneous Multithreading: Maximizing On-Chip Parallelism" by" by Tullsen Tullsen, Eggers, Eggersand Levy in ISCA95.and Levy in ISCA95.
uu ""Exploiting Choice: Instruction Fetch and Issue on anExploiting Choice: Instruction Fetch and Issue on an Implementable Implementable Simultaneous SimultaneousMultithreaded ProcessorMultithreaded Processor" by" by Tullsen Tullsen, Eggers, Emer, Levy, Lo and, Eggers, Emer, Levy, Lo and Stamm Stamm in inISCA96.ISCA96.
uu ““Converting Thread-Level Parallelism to Instruction-Level Parallelism via SimultaneousConverting Thread-Level Parallelism to Instruction-Level Parallelism via SimultaneousMultithreadingMultithreading” by Lo, Eggers, Emer, Levy, ” by Lo, Eggers, Emer, Levy, StammStamm and and Tullsen Tullsen in ACMin ACMTransact ions on Computer Systems, August 1997.Transact ions on Computer Systems, August 1997.
uu “Simultaneous Mult ithreading: A Plat form for Next- Generat ion“Simultaneous Mult ithreading: A Plat form for Next- Generat ionPrcoessorsPrcoessors” by Eggers, Emer, Levy, Lo, ” by Eggers, Emer, Levy, Lo, Stamm Stamm and and Tullsen Tullsen in IEEEin IEEEMicro, October, 1997.Micro, October, 1997.
Recommended