View
215
Download
0
Embed Size (px)
Citation preview
www.compaq.com
Simultaneous Multithreading:Simultaneous Multithreading:Multiplying Alpha PerformanceMultiplying Alpha Performance
Dr. Joel EmerDr. Joel EmerPrincipal Member Technical StaffPrincipal Member Technical Staff
Alpha Development GroupAlpha Development GroupCompaq Computer CorporationCompaq Computer Corporation
www.compaq.com
OutlineOutline
Alpha Processor RoadmapAlpha Processor Roadmap Motivation for Introducing SMTMotivation for Introducing SMT Implementation of an SMT CPUImplementation of an SMT CPU Performance EstimatesPerformance Estimates Architectural AbstractionArchitectural Abstraction
www.compaq.com
Higher Performance
Lo
wer
Co
st
2000 2001 2002 20031998 1999
21264 21264 EV6 EV6
2126421264EV68EV68
0.35m
2126421264EV67EV67
0.28m
0.18m
EV7EV70.18m
...
EV8EV80.125m
Alpha Microprocessor OverviewAlpha Microprocessor Overview
EV78EV780.125m
First System Ship
www.compaq.com
EV8 Technology OverviewEV8 Technology Overview
Leading edge process technology – 1.2-2.0GHzLeading edge process technology – 1.2-2.0GHz0.125µm CMOS0.125µm CMOSSOI-compatibleSOI-compatibleCu interconnectCu interconnect low-k dielectricslow-k dielectrics
Chip characteristicsChip characteristics~1.2V Vdd~1.2V Vdd~250 Million transistors~250 Million transistors~1100 signal pins in flip chip packaging~1100 signal pins in flip chip packaging
www.compaq.com
EV8 Architecture OverviewEV8 Architecture Overview
Enhanced out-of-order executionEnhanced out-of-order execution 8-wide superscalar8-wide superscalar Large on-chip L2 cacheLarge on-chip L2 cache Direct RAMBUS interfaceDirect RAMBUS interface On-chip router for system interconnect On-chip router for system interconnect Glueless, directory-based, ccNUMA for up to 512-way SMPGlueless, directory-based, ccNUMA for up to 512-way SMP 4-way simultaneous multithreading (SMT)4-way simultaneous multithreading (SMT)
www.compaq.com
GoalsGoals
Leadership single stream performanceLeadership single stream performance
Extra multistream performance with multithreadingExtra multistream performance with multithreadingWithout major architectural changesWithout major architectural changesWithout significant additional costWithout significant additional cost
www.compaq.com
Instruction IssueInstruction Issue
Reduced function unit utilization due to dependencies
Time
www.compaq.com
Superscalar IssueSuperscalar Issue
Superscalar leads to more performance, but lower utilization
Time
www.compaq.com
Predicated IssuePredicated Issue
Adds to function unit utilization, but results are thrown away
Time
www.compaq.com
Chip MultiprocessorChip Multiprocessor
Limited utilization when only running one thread
Time
www.compaq.com
Fine Grained MultithreadingFine Grained Multithreading
Intra-thread dependencies still limit performance
Time
www.compaq.com
Simultaneous MultithreadingSimultaneous Multithreading
Maximum utilization of function units by independent operations
Time
www.compaq.com
Basic Out-of-order PipelineBasic Out-of-order Pipeline
Fetch Decode/Map
Queue Reg Read
Execute Dcache/Store Buffer
Reg Write
Retire
PC
Icache
RegisterMap
DcacheRegs Regs
Thread-blind
www.compaq.com
SMT PipelineSMT Pipeline
Fetch Decode/Map
Queue Reg Read
Execute Dcache/Store Buffer
Reg Write
Retire
IcacheDcache
PC
RegisterMap
Regs Regs
www.compaq.com
Changes for SMTChanges for SMT
Basic pipeline – unchangedBasic pipeline – unchanged
Replicated resourcesReplicated resources Program countersProgram counters Register mapsRegister maps
Shared resourcesShared resources Register file (size increased)Register file (size increased) Instruction queueInstruction queue First and second level cachesFirst and second level caches Translation buffersTranslation buffers Branch predictorBranch predictor
www.compaq.com
Multiprogrammed workloadMultiprogrammed workload
0%
50%
100%
150%
200%
250%
SpecInt SpecFP Mixed Int/FP
1T
2T
3T
4T
www.compaq.com
Decomposed SPEC95 ApplicationsDecomposed SPEC95 Applications
0%
50%
100%
150%
200%
250%
Turb3d Swm256 Tomcatv
1T
2T
3T
4T
www.compaq.com
Multithreaded ApplicationsMultithreaded Applications
0%
50%
100%
150%
200%
250%
300%
Barnes Chess Sort TP
1T
2T
4T
www.compaq.com
Architectural AbstractionArchitectural Abstraction
1 CPU with 4 Thread Processing Units (TPUs)1 CPU with 4 Thread Processing Units (TPUs) Shared hardware resourcesShared hardware resources
TPU 0 TPU1 TPU2 TPU3
Icache TLB Dcache
Scache
www.compaq.com
System Block DiagramSystem Block Diagram
EV8M
IOEV8
M
IOEV8
M
IO
EV8M
IOEV8
M
IOEV8
M
IO
EV8M
IOEV8
M
IOEV8
M
IO
0 1 2 3
www.compaq.com
Quiescing Idle ThreadsQuiescing Idle Threads
Problem:Problem:Spin looping thread consumes resourcesSpin looping thread consumes resources
Solution:Solution:Provide quiescing operation that allows aProvide quiescing operation that allows aTPU to sleep until a memory location changesTPU to sleep until a memory location changes
www.compaq.com
SummarySummary
Alpha will maintain single stream performance leadership Alpha will maintain single stream performance leadership
SMT will significantly enhance multistream performanceSMT will significantly enhance multistream performanceAcross a wide range of applications,Across a wide range of applications,Without significant hardware cost, andWithout significant hardware cost, andWithout major architectural changesWithout major architectural changes
www.compaq.com
ReferencesReferences
""Simultaneous Multithreading: Maximizing On-Chip ParallelismSimultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, " by Tullsen, Eggers and Levy in ISCA95.Eggers and Levy in ISCA95.
""Exploiting Choice: Instruction Fetch and Issue on an Implementable Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreaded ProcessorSimultaneous Multithreaded Processor" by Tullsen, Eggers, Emer, Levy, Lo " by Tullsen, Eggers, Emer, Levy, Lo and Stamm in ISCA96.and Stamm in ISCA96.
““Converting Thread-Level Parallelism to Instruction-Level Parallelism via Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous MultithreadingSimultaneous Multithreading” by Lo, Eggers, Emer, Levy, Stamm and Tullsen ” by Lo, Eggers, Emer, Levy, Stamm and Tullsen in ACM Transactions on Computer Systems, August 1997.in ACM Transactions on Computer Systems, August 1997.
““Simultaneous Multithreading: A Platform for Next-Generation Prcoessors” by Simultaneous Multithreading: A Platform for Next-Generation Prcoessors” by Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October, 1997.Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October, 1997.