Pushing Performance, Efficiency and Scalability of Microprocessors
CERCS IAB Meeting, Fall 2006
Gabriel Loh
Research Overview
• Funding from state of GA, Intel, MARCO
• Currently 2 PhD students, 2 MS
– Active undergrad research as well
• Collaborations
– Universities: PSU, UO, Rutgers
– Industry: Intel, IBM
Research Focus
• “Near-term” microprocessor design issues
– ~5-year time scale
– Power/performance/complexity
– Traditional uniprocessor performance
– Multi-core performance
• “Longer-term”
– Keeping Moore’s Law alive for the longer term
– Primarily, 3D integration for now
Scaling Performance and Efficiency
• Multi-cores are here, but single-thread perf still matters
– Intel Core 2 Duo is multi-core, but…
– Single core is more OOO than ever
• Larger instruction window, improved branch prediction, speculative load-store ordering, wider pipe and decoders
– But power also really matters
• Lower clock speeds, different channel-length transistors, more uop fusion, …
Research Focus
• Maximum performance within bounds
– Bounds = power, area, TDP, …
• Single-core performance helps multi-core performance, too
– For future multi-core systems, need to strike a good balance between 1T and MT
• Most of our research is at the uarch level
– Caches, branch predictors, instruction schedulers, memory queue design, memory dependence prediction, etc.
Highlight: Traditional Caching [MICRO’06]
• Well known that different apps respond differently to different replacement policies
• Previous work in the OS domain has described adaptive replacement with provable bounds on performance
• Adapted techniques for on-chip caches
Idea…
Adaptive Cache Implementation
• Theoretical Guarantees
– Miss rate provably bounded to be within a factor of two of the better algorithm
– In practice, it’s much better
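The adaptive idea can be sketched in a few lines of Python: run shadow directories for two candidate replacement policies on the same access stream, and let the real cache evict according to whichever policy is currently missing less. This toy model is purely illustrative, not the MICRO’06 design (which targets set-associative hardware caches and carries the factor-of-two bound); the LRU/LFU policy pair and the miss-count selector are assumptions made for the sketch.

```python
from collections import Counter, OrderedDict

class AdaptiveCache:
    """Toy fully-associative cache that adaptively chooses between
    LRU and LFU replacement (illustrative sketch only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.real = OrderedDict()   # real contents, in recency order
        self.lru = OrderedDict()    # shadow directory simulating pure LRU
        self.lfu = set()            # shadow directory simulating pure LFU
        self.freq = Counter()       # access counts for the LFU policy
        self.lru_misses = self.lfu_misses = self.real_misses = 0

    def _shadow_lru(self, addr):
        if addr in self.lru:
            self.lru.move_to_end(addr)
        else:
            self.lru_misses += 1
            if len(self.lru) >= self.capacity:
                self.lru.popitem(last=False)
            self.lru[addr] = True

    def _shadow_lfu(self, addr):
        self.freq[addr] += 1
        if addr not in self.lfu:
            self.lfu_misses += 1
            if len(self.lfu) >= self.capacity:
                # evict least-frequent block (ties -> lowest address)
                victim = min(sorted(self.lfu), key=lambda a: self.freq[a])
                self.lfu.remove(victim)
            self.lfu.add(addr)

    def access(self, addr):
        self._shadow_lru(addr)
        self._shadow_lfu(addr)
        if addr in self.real:
            self.real.move_to_end(addr)
            return True
        self.real_misses += 1
        if len(self.real) >= self.capacity:
            if self.lru_misses <= self.lfu_misses:
                self.real.popitem(last=False)            # LRU victim
            else:
                victim = min(self.real, key=lambda a: self.freq[a])
                del self.real[victim]                    # LFU victim
        self.real[addr] = True
        return False

c = AdaptiveCache(2)
for addr in [1, 2, 1, 2, 3, 1, 2, 1, 2]:
    c.access(addr)
```

On this trace the real cache tracks whichever shadow policy is ahead; the actual hardware mechanism additionally provides the theoretical miss-rate bound noted above.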
Current Research
• Working on multi-core generalizations of adaptive caching and other ways to manage shared resources
• Uniprocessor microarchitecture
– Scalable memory scheduling [MICRO’06]
– Memory dependence prediction [HPCA’06]
– Branch prediction […]
– And more…
Longer-Term Processor Scaling
• Limitations/Obstacles
– Wire scaling
• Latency/performance
• Power
– Feature size
• Lithography, parametric variations
– Off-chip communication
3D Integration
• Wire
– Power/perf.
• Off-chip
• Feature size
– Limitations, variations
[Figure: die/wafer stacking with two active layers and two sets of metal layers, connected by die-to-die vias]
Less RC → faster, lower power
Example: Caches
3D Bitline Stacking (vs. simplified 2D SRAM array)
• Wordline length halved
– In our studies, WL was critical for latency
3D Wordline Stacking
• Bitline length halved
– BL reduction has greater impact on power savings
– Split decoder → no activity stacking
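The latency and power benefits of folding a wordline or bitline across dies follow from first-order wire physics: for a distributed wire, resistance and capacitance both grow with length, so Elmore delay scales as L² and switched energy as L; halving the wire roughly quarters its delay and halves its dynamic energy. The constants and the simple model below are illustrative assumptions, not figures from the studies:

```python
def wire_delay(length_um, r_per_um=1.0, c_per_um=1.0):
    """Elmore-style distributed-wire delay: 0.5 * R * C, so ~ L^2."""
    return 0.5 * (r_per_um * length_um) * (c_per_um * length_um)

def switching_energy(length_um, c_per_um=1.0, vdd=1.0):
    """Dynamic energy to charge the wire capacitance: C * Vdd^2, so ~ L."""
    return (c_per_um * length_um) * vdd ** 2

# Illustrative 2D wire vs. the same wire folded across two dies
full, folded = 100.0, 50.0  # um
delay_ratio = wire_delay(folded) / wire_delay(full)        # ~ 1/4
energy_ratio = switching_energy(folded) / switching_energy(full)  # ~ 1/2
```

Real SRAM latency and power also include drivers, sense amps, and decoder logic, so the actual gains are smaller than this wire-only bound suggests.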
We’ve studied a wide variety of other CPU building blocks
Uarch-level 3D design
Example: 4-die significance-partitioned datapath
• Use uarch prediction mechanism for early determination of width
• Smaller footprint → faster and lower power
• Width-based gating → even lower power, close to original power density
Overall: 47% performance gain at only a 2-degree temperature increase
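One way to picture the width-prediction mechanism: a small per-PC predictor guesses whether a result will fit in the low-order bits; when it predicts "narrow," the datapath slices on the upper dies can stay gated, falling back to full width on a misprediction. The 16-bit threshold, the last-outcome predictor, and all names below are hypothetical illustrations, not details of the actual design:

```python
NARROW_BITS = 16  # assumed threshold for a "narrow" operand

def is_narrow(value):
    """True if value fits in a signed NARROW_BITS-bit slice."""
    return -(1 << (NARROW_BITS - 1)) <= value < (1 << (NARROW_BITS - 1))

class WidthPredictor:
    """Predict 'narrow' iff the last result at this PC was narrow."""
    def __init__(self):
        self.last_narrow = {}

    def predict(self, pc):
        return self.last_narrow.get(pc, False)

    def update(self, pc, result):
        self.last_narrow[pc] = is_narrow(result)

pred = WidthPredictor()
gated = mispredicts = 0
# (pc, result) pairs: PC 0x40 produces narrow values, 0x44 a wide one
trace = [(0x40, 3), (0x40, 7), (0x44, 1 << 20), (0x40, 5), (0x44, 9)]
for pc, result in trace:
    if pred.predict(pc):
        gated += 1                  # upper-die slices gated this op
        if not is_narrow(result):
            mispredicts += 1        # would replay at full width
    pred.update(pc, result)
```

In this toy trace the predictor gates two of the five operations with no mispredictions; in hardware, the misprediction path must replay the operation through the full-width datapath.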
3D Research Summary
• Circuit-level [ICCD’05,ISVLSI’06,ISCAS’06,GLSVLSI’06]
• Uarch-level [MICRO’06 (w/ ),HPCA’07]
• Tutorial papers [JETC’06]
• Tutorial [MICRO’06]
• Tools [DATE’06,TCAD’07] w/ GTCAD &
• Parametric Variations w/ Jim Meindl
• Funding, equip from ,
Summary
• loh@cc• http://www.cc.gatech.edu/~loh
• Lots of exciting work going on here