Ralph K. Cavin, III March 18, 2009 Brussels

Ralph K. Cavin, III March 18, 2009 Brussels. Is there a Carnot-like theorem for computation? ◦ e.g., a limit on rate of information throughput/power



Is there a Carnot-like theorem for computation?
◦ e.g., a limit on rate of information throughput/power consumed?

The MIND architecture benchmarking activity for novel devices

Memory Architectures

Inference Architectures


Chose a simple one-bit, four-instruction processor

All transistors operate at ~kT switching energy

Interconnects dissipate energy at ~kT/gate length

Transistor average fan-out is three


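[Figure: block diagram of the one-bit processor. Blocks: Memory, Program Counter, 2-4 decoder (DEC), 2-bit Counter, and CPU with ALU (inputs X, Y, C0; outputs Z, C1; instruction lines I1, I2; control signals S1-S6). Red numbers = # transistors per block; counts appearing in the figure include 144, 98, 24, 12, 6, and 1.]

Total: 314 devices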


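Operational energy of the Minimal Turing Machine

n = 314

Joyner tiling: $\mathrm{Area}_{min} = n \cdot 8a^2 = 314 \cdot 8a^2 \approx 2500a^2 = 50a \times 50a$

$a_{min} = 1.5$ nm: $\mathrm{Area}_{min} = 75\,\mathrm{nm} \times 75\,\mathrm{nm}$

Von Neumann threshold: $E_{op} = n \cdot \frac{9}{2} k_B T \ln 2 \approx 980\, k_B T \approx 4 \times 10^{-18}$ J/cycle

Per full CPU operation: $E_{op} = 3 \times 4 \times 10^{-18}$ J/cycle $\approx 10^{-17}$ J/operation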
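The area and energy figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not the author's method. It assumes room temperature (T = 300 K, which the slide does not state) and reads the slide's numbers as n = 314 devices, 8a² of area per device under the tiling, a_min = 1.5 nm, and roughly 980 k_B T of energy per cycle. All variable names are illustrative.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # assumed room temperature in kelvin (not stated on the slide)

n_devices = 314      # device count from the slide
a_min_nm = 1.5       # minimum feature size from the slide, nm

# Area: 8 a^2 per device under the tiling, summed over all devices.
area_per_device_nm2 = 8 * a_min_nm ** 2
total_area_nm2 = n_devices * area_per_device_nm2
side_nm = math.sqrt(total_area_nm2)   # side of the equivalent square

# Energy per cycle: the slide's ~980 k_B T figure, evaluated at the assumed T.
energy_per_cycle_j = 980 * K_B * T

print(f"total area: {total_area_nm2:.0f} nm^2 (square side {side_nm:.1f} nm)")
print(f"energy per cycle: {energy_per_cycle_j:.2e} J")
```

Under these assumptions the square side comes out close to 75 nm and the energy per cycle close to 4 × 10⁻¹⁸ J, in line with the values quoted on the slide.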


Energy per cycle: $4 \times 10^{-18}$ J/cycle

Area: 75 nm × 75 nm

Devices: 314

Device density: $5.6 \times 10^{12}$ cm$^{-2}$

Time per cycle: ~2 ps

Power: 2 µW

Power density: ~30 kW/cm$^2$

BITS = density × freq. = $10^{14}$ bit/s

MIPS: $2 \times 10^{5}$
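The derived quantities on this slide follow from three inputs: energy per cycle, layout area, and cycle time. The sketch below recomputes them as a rough check. Two readings are assumptions rather than statements from the slides: the throughput is taken as device count times clock frequency, and one full operation is taken to need three cycles. Variable names are illustrative.

```python
energy_per_cycle_j = 4e-18   # J per cycle, from the slide
side_cm = 75e-7              # 75 nm expressed in cm
n_devices = 314              # device count, from the slide
cycle_time_s = 2e-12         # ~2 ps per cycle, from the slide
cycles_per_operation = 3     # assumption: three cycles per full CPU operation

area_cm2 = side_cm ** 2
device_density_cm2 = n_devices / area_cm2          # devices per cm^2
power_w = energy_per_cycle_j / cycle_time_s        # average power in watts
power_density_w_cm2 = power_w / area_cm2           # W per cm^2
frequency_hz = 1.0 / cycle_time_s

# Assumed reading of "BITS": every device can switch once per cycle.
bits_per_s = n_devices * frequency_hz

# Millions of instructions per second under the assumed cycles per operation.
mips = frequency_hz / cycles_per_operation / 1e6

print(f"device density: {device_density_cm2:.2e} cm^-2")
print(f"power: {power_w:.2e} W, power density: {power_density_w_cm2:.2e} W/cm^2")
print(f"binary throughput: {bits_per_s:.2e} bit/s, MIPS: {mips:.2e}")
```

With these inputs the power is about 2 × 10⁻⁶ W and the power density a few tens of kW/cm², the same order of magnitude as the rounded values on the slide.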


[Figure: log-log plot of instructions per second (MIPS, $10^{-1}$ to $10^{9}$) versus max. binary throughput (bit/s, $10^{9}$ to $10^{25}$). Annotations: Brain: $10^{19}$ bit/s, $10^{8}$ MIPS, 30 W; $10^{6}$ W/cm$^2$.]

Sources: The Intel Microprocessor Quick Reference Guide and TSCP Benchmark Scores


The Minimal Turing Machine lies on a different performance trajectory from conventional computers
◦ It has the slope needed to meet brain performance

More detailed physics-based analysis is needed
◦ System thermodynamics of computation

Carnot’s equivalent for a Computational Engine?

Lessons from Biological Computation?

Candidates for beyond-CMOS nano-electronics should be evaluated in the context of system scaling
◦ e.g. spintronic minimal Turing Machine?


NRI Focus Centers

Kerry Bernstein, IBM

February 2009 Update


1. Short Term – Switches that supplement CMOS and are CMOS-compatible, supporting performance via hardware acceleration

2. Long Term – Switches that replace CMOS for general purpose high performance compute applications


1) CMOS is not going away anytime soon. Charge (state variable) and the MOSFET (fundamental switch) will remain the preferred HPC solution until new switches appear as the long-term replacement solution in 10-20 years.

2) Hardware accelerators execute selected functions faster than software performing them on the CPU. Accelerators are responsible for substantial improvements in throughput.

3) Alternative switches often exhibit emergent, idiosyncratic behavior. We should exploit them. Certain physical behaviors may emulate selected HPC instruction sequences. Some operations may be superior to digital solutions.

4) New switches may improve high-utility accelerators. The shorter-term supplemental solution (5-15 years) improves or replaces accelerators “built in CMOS and designed for CMOS”, either on-chip, on-planar, or on-3D-stack.


Hierarchical Benchmarking

Level Metric The Good The Bad

Device CV/I Charge-Based

F Specific to Clked

Ion/Ioff Current-Based

Circuit NAND2 Dly, Pwr, Energy, Area Incomplete

E-D-A Product Optimization

P-D-A Product Optimization

Logical Effort Constrained

Architecture MIPS, IPC Equivalent T/P Synchronous

Ops/Joule Energy Efficiency Discrete Ops

SpecInt Industry Stds New Capability


1) Derive values for the conventional quantitative ITRS benchmarks shown in Benchmark 1.

2) Derive quantitative values and qualitative entries for architecture benchmarks shown in Benchmark 2a and 2b

3) Identify specific logic operations performed elegantly by your switch, where physical device behavior complements the desired logic operation. Determine the equivalent IPC and power of that function performed in the new switch, as shown in the Benchmark 3 example. Determine the actual IPC and Operations/Watt had the function been performed via software in the CPU.


Benchmark 1: Device Metrics

Defined by ITRS ERD Working Group

Captures Fundamental Device Properties


Benchmark 2: Architectural Metrics

2a. Quantitative

Communication Metrics (one "Values" column, to be filled in):
◦ AREA of die/host accessible within 1 switch delay
◦ NO. OF SWITCHES accessible within 1 switch delay
◦ Sq BW / unit area: (Channels x Freq)X x (Channels x Freq)Y
◦ Sq Comm Channels (NX x NY) per unit area
◦ (Accessible Area within one switch delay) / (Area of 1 switch)
◦ Mem. delay / Logic Delay

Logic Metrics (columns: Delay, Power, Energy, Area, No. of Sw.; to be filled in):
◦ 32 Bit Adder
◦ Inverter with FO4
◦ NAND2 FO1
◦ Generic Noise Immunity (dB)
◦ Generic Logical Effort
◦ Comp. Density (MIPS/no of devs)
◦ PETE1 (EDDA)
◦ PETE2 (PDA)

2b. Qualitative
◦ CMOS Compatibility
◦ Clocking infrastructure and Locality
◦ Memory Reqmts and compatibility
◦ Scalability
◦ Reconfigurability or Library Dimensions
◦ Logic Execution / Architecture
◦ Specific Logic Function performed well
◦ Useful Specific Physical Behaviors


Transport Mode: Information Token
◦ Diffusion: Direct Excitons, Indirect Excitons, Spin, Pseudospin
◦ Drift: Indirect Excitons, Spin, Pseudospin
◦ Ballistic Transport (Fermi Velocity): Spin, Pseudospin
◦ Spin Wave: Spin
◦ EM Waves: Photons, Plasmons

[Figure: Delay versus Length for Various Transport Mechanisms. Azad Naeemi, Georgia Tech]

New State Variables will impact communication and fan-out


[Figure: chart of Equivalent IPC, MIPS/Watt, and Ops/Joule of switch in application. Labels: Crypto, Comp, H.264, Quantum, MQCA, BTBTFET, ...]

Compare apples-to-apples, independent of particular strength


Matching Logic Functions & New Switch Behaviors

Popular Accelerators:
◦ Encrypt / Decrypt
◦ Compr / Decompr
◦ Reg. Expression Scan
◦ Discrete COS Trnsfrm
◦ Bit Serial Operations
◦ H.264 Std Filtering
◦ DSP, A/D, D/A
◦ Viterbi Algorithms
◦ Image, Graphics

New Switch Ideas:
◦ Single Spin
◦ Spin Domain
◦ Tunnel-FETs
◦ NEMS
◦ MQCA
◦ Molecular
◦ Bio-inspired
◦ CMOL
◦ Excitonics
◦ ?

Example: Cryptography Hardware Acceleration
◦ Operations required: Rotate, Byte Alignmt, EXORs, Multiply, Table Lookup
◦ Circuits used in Accel: Transmission Gates (“T-Gates”)
◦ New Switch Opportunity: A number of new switches (e.g. T-FETs) don’t have thermionic barriers, so they won’t suffer from CMOS pass-gate VT drop, body effect, or source-follower delay.
◦ Potential Opportunity: Replace 4 T-Gate MOSFETs with 1 low power switch.
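To make the listed cryptography operations concrete, here is a minimal software sketch of three of them (rotate, EXOR, and table lookup) combined into one toy mixing step. It is purely illustrative: `rotl32`, `TOY_SBOX`, and `toy_round` are hypothetical names, the table values are placeholders, and the step is not part of any real cipher or of the accelerator discussed on the slide. The multiply operation is omitted for brevity.

```python
def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


# Hypothetical 16-entry substitution table (placeholder values, not from any real cipher).
# 7 is coprime to 16, so the table is a permutation of 0..15.
TOY_SBOX = [(7 * i + 3) % 16 for i in range(16)]


def toy_round(word: int, key: int) -> int:
    """One toy mixing step: EXOR with a key, byte rotate, nibble table lookup."""
    mixed = (word ^ key) & 0xFFFFFFFF        # EXOR
    rotated = rotl32(mixed, 8)               # rotate by one byte (byte alignment)
    low = TOY_SBOX[rotated & 0xF]            # table lookup on the low nibble
    return (rotated & 0xFFFFFFF0) | low


print(hex(toy_round(0x12345678, 0x0F0F0F0F)))
```

In software each of these steps costs several instructions per word. The slide's point is that a hardware accelerator, and possibly a new switch, could perform the same data movement with far fewer devices.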


2.8E-4

Bernstein, 1/25/09

• Example of HPC hardware accelerator contribution to power, area, instruction retirement rate, and energy efficiency improvement.
• Purdue Emerging Technology Evaluator (PETE) metric is a convolution of power/energy, delay, and area.
• IPC and Ops/nJ provide an apples-to-apples comparison of new switches.


Paul Franzon

Department of Electrical and Computer Engineering


Goal:
◦ Determine research needs for a ~2015 1000 Petaflop computer, and smaller equivalents

Major Conclusions:
◦ Major Challenge #1: Power efficiency. Communication overhead in computation
◦ Major Challenge #2: Resiliency. Completing computation in presence of permanent and transient faults
◦ Major Challenge #3: Performance Scaling. Performance scaling limited by software, communications bisection bandwidth, and memory speed


Critical Needs:
◦ Reduced power SRAM replacements
45 nm L1 Cache: 3.6 pJ/bit
Note: re-architecting in 3D can save ~50%
What is the potential for an ERD to reduce to 0.3 pJ/bit?
Note: Would require low-swing on bit lines, while retaining speed and low SET rate
◦ Reduced power switched interconnect
Esp. packet routed interconnect (NOC)
What is the potential for a memory-style ERD to be used for fast switchable interconnect?
Flash devices can do this for static reconfiguration BUT faster switching devices will be needed for dynamic reconfiguration
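The size of the gap between today's cache energy and the ERD target is easy to quantify. The sketch below uses only the figures quoted on the slide (3.6 pJ/bit for a 45 nm L1 cache, a saving of about 50% from 3D re-architecting, and a 0.3 pJ/bit target). Variable names are illustrative, and treating the 3D saving as exactly 50% is an assumption.

```python
l1_energy_pj_per_bit = 3.6    # 45 nm L1 cache, from the slide
saving_from_3d = 0.5          # ~50% saving, treated as exactly one half here
target_pj_per_bit = 0.3       # ERD target posed on the slide

# Energy per bit after 3D re-architecting alone.
after_3d_pj_per_bit = l1_energy_pj_per_bit * (1.0 - saving_from_3d)

# Reduction factors an ERD would have to deliver to reach the target.
reduction_vs_baseline = l1_energy_pj_per_bit / target_pj_per_bit
reduction_vs_3d = after_3d_pj_per_bit / target_pj_per_bit

print(f"after 3D re-architecting: {after_3d_pj_per_bit:.1f} pJ/bit")
print(f"reduction needed vs. baseline: {reduction_vs_baseline:.0f}x")
print(f"reduction still needed after 3D: {reduction_vs_3d:.0f}x")
```

In other words, the target is about a twelvefold reduction from the baseline, of which 3D re-architecting alone would cover roughly a factor of two.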


Blue Gene system reliability:

◦ Most of the DRAM failures are due to DIMM socket failures, not device failures

◦ Critical need: Sub-system level checkpointing and roll-back


ERD requirement:
◦ Tightly embedded Flash-like state “capture” memory for checkpointing
◦ Requirements:
Tightly embedded, e.g. Shadow registers, with minimum process change
Slow read/write OK
~10 M writes minimum extrinsic reliability requirement
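The ~10 M write requirement can be translated into an operating lifetime once a checkpoint interval is chosen. The sketch below is a back-of-the-envelope estimate, assuming that every checkpoint writes each memory cell exactly once. The checkpoint intervals are hypothetical examples and do not come from the slides.

```python
WRITE_ENDURANCE = 10_000_000   # ~10 M writes, from the slide
SECONDS_PER_DAY = 86_400


def lifetime_days(checkpoint_interval_s: float) -> float:
    """Days until the write budget is used up, at one write per checkpoint."""
    return WRITE_ENDURANCE * checkpoint_interval_s / SECONDS_PER_DAY


# Hypothetical checkpoint intervals: every second, every minute, every hour.
for interval_s in (1, 60, 3600):
    print(f"checkpoint every {interval_s} s: about {lifetime_days(interval_s):.0f} days")
```

At one checkpoint per second the budget lasts only a few months, while at one per minute it lasts well over a decade, so the acceptable endurance depends strongly on how often state is captured.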


1. Metrics for cache replacement

Read & Write Speed for 64 kbit array

Energy/bit for 64 kbit array

Area for 64 kbit array

SEU rate

Added process complexity


2. Metrics for programmed routability

Stage Delay for 2x2 switch box

Energy/bit for routing through 2x2 switch box

Area for 2x2 switch box

Configuration change speed for 2x2 switch box

Added process and design complexity


3. Metrics for Local Check-pointing memory

Read/write delay per bit

Energy/bit for routing for write

Area per bit

Write lifetime

Added process and design complexity


In future computing, both General Purpose and Application Specific, the bottleneck is not in logic operations but in memory, communications, and reliability

Opportunities arise for memory style devices to solve these bottlenecks:
◦ Low power SRAM replacement
◦ Ultra-low swing, routable interconnect replacement
◦ Local non-volatile memory as an aid to resiliency


The Memory Wall for multi-core

In general purpose multi-core processors, the tradeoff for L1-L3 between memory bandwidth and memory size is dramatic.
◦ At constant BW, two cores may require as much as 8x the memory of one core
◦ At 2x BW, two cores require only about 2x the memory of a single core system
◦ Kerry Bernstein, “New Dimensions in Performance”, Feb. 2009


Workshop: “Technology Maturity for Adaptive, Massively Parallel Computing” – March 2009, Portland Oregon http://www.technologydashboard.com/adaptivecomputing/.
◦ General theme: Inference Architectures and Technology

Karlheinz Meier, U. Heidelberg, “VLSI Implementation of Very Large Scale Neuromorphic Circuits – Achievements, Challenges, Hopes”

Progress in architectures is being made but many technology challenges remain. (Complexity)

Can Emerging Research Devices accelerate realization of Inference Architectures?


Continue work on ERD Architectural Benchmarking
◦ Work with NRI MIND benchmarking effort

Develop section on memory architectures for Emerging Research Memories

Look at role of ERD/ERM in novel architectures where unique properties can provide substantial leverage, e.g. inference architectures