Pushing Performance, Efficiency and Scalability of Microprocessors
CERCS IAB Meeting, Fall 2006
Gabriel Loh
Research Overview
• Funding from state of GA, Intel, MARCO
• Currently 2 PhD students, 2 MS
– Active undergrad research as well
• Collaborations
– Universities: PSU, UO, Rutgers
– Industry: Intel, IBM
Research Focus
• “Near-term” microprocessor design issues
– ~5-year time scale
– Power/performance/complexity
– Traditional uniprocessor performance
– Multi-core performance
• “Longer-term”
– Keeping Moore’s Law alive for the longer term
– Primarily, 3D integration for now
Scaling Performance and Efficiency
• Multi-cores are here, but single-thread perf still matters
– Intel Core 2 Duo is multi-core, but…
– Single core is more OOO than ever
• Larger instruction window, improved branch prediction, speculative load-store ordering, wider pipe and decoders
– But power also really matters
• Lower clock speeds, different channel-length transistors, more uop fusion, …
Research Focus
• Maximum performance within bounds
– Bounds = power, area, TDP, …
• Single-core performance helps multi-core performance, too
– For future multi-core systems, need to strike a good balance between 1T and MT
• Most of our research is at the uarch level
– Caches, branch predictors, instruction schedulers, memory queue design, memory dependence prediction, etc.
Highlight: Traditional Caching [MICRO’06]
• Well known that different apps respond differently to different replacement policies
• Previous work in the OS domain has described adaptive replacement with provable bounds on performance
• Adapted techniques for on-chip caches
Idea…
Adaptive Cache Implementation
• Theoretical Guarantees
– Miss rate provably bounded to be within a factor of two of the better algorithm
– In practice, it’s much better
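The adaptive idea can be sketched in a few lines of Python: run shadow directories for two candidate replacement policies on the same access stream, and let the real cache evict according to whichever policy is currently missing less. This toy model is purely illustrative, not the MICRO’06 design (which targets set-associative hardware caches and carries the factor-of-two bound); the LRU/LFU policy pair and the miss-count selector are assumptions made for the sketch.

```python
from collections import Counter, OrderedDict

class AdaptiveCache:
    """Toy fully-associative cache that adaptively chooses between
    LRU and LFU replacement (illustrative sketch only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.real = OrderedDict()   # real contents, in recency order
        self.lru = OrderedDict()    # shadow directory simulating pure LRU
        self.lfu = set()            # shadow directory simulating pure LFU
        self.freq = Counter()       # access counts for the LFU policy
        self.lru_misses = self.lfu_misses = self.real_misses = 0

    def _shadow_lru(self, addr):
        if addr in self.lru:
            self.lru.move_to_end(addr)
        else:
            self.lru_misses += 1
            if len(self.lru) >= self.capacity:
                self.lru.popitem(last=False)
            self.lru[addr] = True

    def _shadow_lfu(self, addr):
        self.freq[addr] += 1
        if addr not in self.lfu:
            self.lfu_misses += 1
            if len(self.lfu) >= self.capacity:
                # evict least-frequent block (ties -> lowest address)
                victim = min(sorted(self.lfu), key=lambda a: self.freq[a])
                self.lfu.remove(victim)
            self.lfu.add(addr)

    def access(self, addr):
        self._shadow_lru(addr)
        self._shadow_lfu(addr)
        if addr in self.real:
            self.real.move_to_end(addr)
            return True
        self.real_misses += 1
        if len(self.real) >= self.capacity:
            if self.lru_misses <= self.lfu_misses:
                self.real.popitem(last=False)            # LRU victim
            else:
                victim = min(self.real, key=lambda a: self.freq[a])
                del self.real[victim]                    # LFU victim
        self.real[addr] = True
        return False

c = AdaptiveCache(2)
for addr in [1, 2, 1, 2, 3, 1, 2, 1, 2]:
    c.access(addr)
```

On this trace the real cache tracks whichever shadow policy is ahead; the actual hardware mechanism additionally provides the theoretical miss-rate bound noted above.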
Current Research
• Working on multi-core generalizations of adaptive caching and other ways to manage shared resources
• Uniprocessor microarchitecture
– Scalable memory scheduling [MICRO’06]
– Memory dependence prediction [HPCA’06]
– Branch prediction […]
– And more…
Longer-Term Processor Scaling
• Limitations/Obstacles
– Wire scaling
• Latency/performance
• Power
– Feature size
• Lithography, parametric variations
– Off-chip communication
3D Integration
• Wire
– Power/perf.
• Off-chip
• Feature size
– Limitations, variations
[Figure: die/wafer stacking with two active layers and two sets of metal layers, connected by die-to-die vias]
Less RC → faster, lower power
Example: Caches
3D Bitline Stacking (vs. simplified 2D SRAM array)
• Wordline length halved
– In our studies, WL was critical for latency
3D Wordline Stacking
• Bitline length halved
– BL reduction has greater impact on power savings
– Split decoder → no activity stacking
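The latency and power benefits of folding a wordline or bitline across dies follow from first-order wire physics: for a distributed wire, resistance and capacitance both grow with length, so Elmore delay scales as L² and switched energy as L; halving the wire roughly quarters its delay and halves its dynamic energy. The constants and the simple model below are illustrative assumptions, not figures from the studies:

```python
def wire_delay(length_um, r_per_um=1.0, c_per_um=1.0):
    """Elmore-style distributed-wire delay: 0.5 * R * C, so ~ L^2."""
    return 0.5 * (r_per_um * length_um) * (c_per_um * length_um)

def switching_energy(length_um, c_per_um=1.0, vdd=1.0):
    """Dynamic energy to charge the wire capacitance: C * Vdd^2, so ~ L."""
    return (c_per_um * length_um) * vdd ** 2

# Illustrative 2D wire vs. the same wire folded across two dies
full, folded = 100.0, 50.0  # um
delay_ratio = wire_delay(folded) / wire_delay(full)        # ~ 1/4
energy_ratio = switching_energy(folded) / switching_energy(full)  # ~ 1/2
```

Real SRAM latency and power also include drivers, sense amps, and decoder logic, so the actual gains are smaller than this wire-only bound suggests.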
We’ve studied a wide variety of other CPU building blocks
Uarch-level 3D design
Example: 4-die significance-partitioned datapath
• Use uarch prediction mechanism for early determination of width
• Smaller footprint → faster and lower power
• Width-based gating → even lower power, close to original power density
Overall: 47% performance gain at only a 2-degree temperature increase
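One way to picture the width-prediction mechanism: a small per-PC predictor guesses whether a result will fit in the low-order bits; when it predicts "narrow," the datapath slices on the upper dies can stay gated, falling back to full width on a misprediction. The 16-bit threshold, the last-outcome predictor, and all names below are hypothetical illustrations, not details of the actual design:

```python
NARROW_BITS = 16  # assumed threshold for a "narrow" operand

def is_narrow(value):
    """True if value fits in a signed NARROW_BITS-bit slice."""
    return -(1 << (NARROW_BITS - 1)) <= value < (1 << (NARROW_BITS - 1))

class WidthPredictor:
    """Predict 'narrow' iff the last result at this PC was narrow."""
    def __init__(self):
        self.last_narrow = {}

    def predict(self, pc):
        return self.last_narrow.get(pc, False)

    def update(self, pc, result):
        self.last_narrow[pc] = is_narrow(result)

pred = WidthPredictor()
gated = mispredicts = 0
# (pc, result) pairs: PC 0x40 produces narrow values, 0x44 a wide one
trace = [(0x40, 3), (0x40, 7), (0x44, 1 << 20), (0x40, 5), (0x44, 9)]
for pc, result in trace:
    if pred.predict(pc):
        gated += 1                  # upper-die slices gated this op
        if not is_narrow(result):
            mispredicts += 1        # would replay at full width
    pred.update(pc, result)
```

In this toy trace the predictor gates two of the five operations with no mispredictions; in hardware, the misprediction path must replay the operation through the full-width datapath.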
3D Research Summary
• Circuit-level [ICCD’05,ISVLSI’06,ISCAS’06,GLSVLSI’06]
• Uarch-level [MICRO’06 (w/ ),HPCA’07]
• Tutorial papers [JETC’06]
• Tutorial [MICRO’06]
• Tools [DATE’06,TCAD’07] w/ GTCAD &
• Parametric Variations w/ Jim Meindl
• Funding, equip from ,
Summary
• loh@cc• http://www.cc.gatech.edu/~loh
• Lots of exciting work going on here