Pushing Performance, Efficiency and Scalability of Microprocessors CERCS IAB Meeting, Fall 2006 Gabriel Loh



Page 1: Pushing Performance, Efficiency and Scalability of Microprocessors

Pushing Performance, Efficiency and Scalability of Microprocessors
CERCS IAB Meeting, Fall 2006
Gabriel Loh

Page 2: Pushing Performance, Efficiency and Scalability of Microprocessors

Research Overview

• Funding from state of GA, Intel, MARCO

• Currently 2 PhD students, 2 MS
  – Active undergrad research as well

• Collaborations
  – Universities: PSU, UO, Rutgers
  – Industry: Intel, IBM

Page 3: Pushing Performance, Efficiency and Scalability of Microprocessors

Research Focus

• “Near-term” microprocessor design issues
  – ~5-year time scale
  – Power/performance/complexity
  – Traditional uniprocessor performance
  – Multi-core performance

• “Longer-term”
  – Keeping Moore’s Law alive for the longer term
  – Primarily, 3D integration for now

Page 4: Pushing Performance, Efficiency and Scalability of Microprocessors

Scaling Performance and Efficiency

• Multi-cores are here, but single-thread perf still matters
  – Intel Core 2 Duo is multi-core, but…
  – Single core is more OOO than ever
    • Larger instruction window, improved branch prediction, speculative load-store ordering, wider pipe and decoders
  – But power also really matters
    • Lower clock speeds, different channel length transistors, more uop fusion, …

Page 5: Pushing Performance, Efficiency and Scalability of Microprocessors

Research Focus

• Maximum performance within bounds
  – Bounds = power, area, TDP, …

• Single-core performance helps multi-core performance, too
  – For future multi-core systems, need to strike a good balance between 1T and MT

• Most of our research is at the uarch level
  – Caches, branch predictors, instruction schedulers, memory queue design, memory dependence prediction, etc.

Page 6: Pushing Performance, Efficiency and Scalability of Microprocessors

Highlight: Traditional Caching [MICRO’06]

• Well known that different apps respond differently to different replacement policies

• Previous work in the OS domain has described adaptive replacement with provable bounds on performance

• Adapted techniques for on-chip caches

Page 7: Pushing Performance, Efficiency and Scalability of Microprocessors

Idea…

Page 8: Pushing Performance, Efficiency and Scalability of Microprocessors

Adaptive Cache Implementation

• Theoretical Guarantees
  – Miss rate provably bounded to be within a factor of two of the better algorithm

In practice, it’s much better
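The adaptive idea can be illustrated with a minimal sketch. The slides do not name the component policies or the selection mechanism, so everything here is an assumption made for illustration: a fully associative cache, LRU and FIFO as the two component policies, shadow tag arrays that track what each policy would hold, and cumulative miss counters to decide which policy to imitate. It is not the MICRO’06 design.

```python
from collections import OrderedDict


class PolicyCache:
    """Shadow tag array: what a pure LRU or FIFO cache would contain."""

    def __init__(self, capacity, policy):
        self.capacity = capacity
        self.policy = policy          # "LRU" or "FIFO" (assumed policies)
        self.blocks = OrderedDict()   # oldest entry first
        self.misses = 0

    def access(self, addr):
        if addr in self.blocks:
            if self.policy == "LRU":
                self.blocks.move_to_end(addr)   # refresh recency on a hit
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)     # evict the oldest entry
        self.blocks[addr] = None
        return False


class AdaptiveCache:
    """Imitates whichever shadow policy has missed less so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.shadows = [PolicyCache(capacity, "LRU"),
                        PolicyCache(capacity, "FIFO")]
        self.blocks = set()
        self.misses = 0

    def access(self, addr):
        for shadow in self.shadows:
            shadow.access(addr)
        if addr in self.blocks:
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            # Ties go to the first policy in the list (LRU).
            better = min(self.shadows, key=lambda s: s.misses)
            # Evict a block that the better policy no longer holds.
            candidates = self.blocks - set(better.blocks)
            victim = min(candidates) if candidates else min(self.blocks)
            self.blocks.remove(victim)
        self.blocks.add(addr)
        return False


def run(trace, capacity):
    """Return (adaptive, LRU, FIFO) miss counts for an address trace."""
    cache = AdaptiveCache(capacity)
    for addr in trace:
        cache.access(addr)
    return cache.misses, cache.shadows[0].misses, cache.shadows[1].misses
```

For example, `run([1, 2, 1, 3, 1, 2], capacity=2)` returns the miss counts for the adaptive cache and its two shadow policies, so the adaptive count can be compared against the better of the two. Hardware would replace the cumulative counters with something cheaper; the sketch only shows the follow-the-better-policy structure.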

Page 9: Pushing Performance, Efficiency and Scalability of Microprocessors

Current Research

• Working on multi-core generalizations of adaptive caching and other ways to manage shared resources

• Uniprocessor microarchitecture
  – Scalable memory scheduling [MICRO’06]
  – Memory dependence prediction [HPCA’06]
  – Branch prediction […]
  – And more…

Page 10: Pushing Performance, Efficiency and Scalability of Microprocessors

Longer-Term Processor Scaling

• Limitations/Obstacles
  – Wire scaling
    • Latency/performance
    • Power
  – Feature size
    • Lithography, parametric variations
  – Off-chip communication

Page 11: Pushing Performance, Efficiency and Scalability of Microprocessors

3D Integration

• Wire
  – Power/perf.

• Off-chip

• Feature size
  – Limitations, variations

[Figure: Die/Wafer Stacking, with labels Active Layer 1, Active Layer 2, Metal Layers 1, Metal Layers 2, Die-to-Die Vias]

Less RC → faster, lower-power
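A first-order model shows why less RC means both faster and lower-power wires. It assumes an unrepeated wire of length L with resistance r and capacitance c per unit length, driven at supply voltage V_dd; these symbols are introduced here for illustration and do not appear in the slides.

```latex
% Distributed RC wire: delay grows quadratically with length,
% switched energy grows linearly with length.
\[
  t_{\text{wire}} \approx 0.38\, r\, c\, L^{2},
  \qquad
  E_{\text{switch}} \propto c\, L\, V_{dd}^{2}
\]
% Halving the wire length, for example by folding a block across two dies:
\[
  t_{\text{wire}}\!\left(\tfrac{L}{2}\right) \approx \tfrac{1}{4}\, t_{\text{wire}}(L),
  \qquad
  E_{\text{switch}}\!\left(\tfrac{L}{2}\right) \approx \tfrac{1}{2}\, E_{\text{switch}}(L)
\]
```

Under these assumptions a halved wire has roughly a quarter of the RC delay and half of the switched capacitance. The model ignores the parasitics of the die-to-die vias, so the real gains are smaller.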

Page 12: Pushing Performance, Efficiency and Scalability of Microprocessors

Example: Caches

[Figure panels: Simplified 2D SRAM Array; 3D Bitline Stacking; 3D Wordline Stacking]

Wordline length halved
• In our studies, WL was critical for latency

Bitline length halved
• BL reduction has greater impact on power savings
• Split decoder → no activity stacking

We’ve studied a wide variety of other CPU building blocks

Page 13: Pushing Performance, Efficiency and Scalability of Microprocessors

Uarch-level 3D design

Example: 4-die significance-partitioned datapath
Use uarch prediction mechanism for early determination of width

Smaller footprint → faster and lower-power

Width-based gating → even lower power, close to original power density

Overall: 47% performance gain at only 2 degree temperature increase
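Width prediction combined with gating can be sketched as follows. The slides give no implementation details, so the parameters are assumptions made for illustration: a 64-bit datapath split into four 16-bit slices (one per die), a hypothetical last-width predictor table indexed by instruction address, and a cost model that simply counts how many dies are active per operation.

```python
SLICE_BITS = 16   # assumption: 64-bit datapath split evenly over the dies
NUM_DIES = 4


def dies_needed(value):
    """Number of 16-bit slices needed to hold an unsigned 64-bit value."""
    value &= (1 << (SLICE_BITS * NUM_DIES)) - 1
    needed = 1
    while value >> (SLICE_BITS * needed):
        needed += 1
    return needed


class WidthPredictor:
    """Hypothetical last-width predictor indexed by instruction address."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [NUM_DIES] * entries   # start conservative: all dies on

    def predict(self, pc):
        return self.table[pc % self.entries]

    def update(self, pc, value):
        self.table[pc % self.entries] = dies_needed(value)


def simulate(stream):
    """stream: iterable of (pc, result_value) pairs.

    Returns (active_die_count, under_predictions).
    """
    predictor = WidthPredictor()
    active = 0
    under_predictions = 0
    for pc, value in stream:
        predicted = predictor.predict(pc)
        actual = dies_needed(value)
        if predicted < actual:
            # Under-prediction: the upper dies have to be enabled late.
            under_predictions += 1
            predicted = actual
        active += predicted   # dies above `predicted` stay gated
        predictor.update(pc, value)
    return active, under_predictions
```

Feeding `simulate` a stream of mostly narrow values shows the point of the gating: the active-die count stays well below `NUM_DIES` per operation, and an under-prediction is the case where a real design would pay a recovery penalty.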

Page 14: Pushing Performance, Efficiency and Scalability of Microprocessors

3D Research Summary

• Circuit-level [ICCD’05,ISVLSI’06,ISCAS’06,GLSVLSI’06]

• Uarch-level [MICRO’06 (w/ ),HPCA’07]

• Tutorial papers [JETC’06]

• Tutorial [MICRO’06]

• Tools [DATE’06,TCAD’07] w/ GTCAD &

• Parametric Variations w/ Jim Meindl

• Funding, equip from ,

Page 15: Pushing Performance, Efficiency and Scalability of Microprocessors

Summary

• loh@cc

• http://www.cc.gatech.edu/~loh

• Lots of exciting work going on here