Ralph K. Cavin, III March 18, 2009 Brussels. Is there a Carnot-like theorem for computation? ◦ e.g., a limit on rate of information throughput/power

Ralph K. Cavin, III, March 18, 2009, Brussels

Is there a Carnot-like theorem for computation?
◦ e.g., a limit on the rate of information throughput per unit of power consumed?

The MIND architecture benchmarking activity for novel devices

Memory Architectures

Inference Architectures

Chose a simple one-bit, four instruction processor

All transistors operate at ~kT switching energy

Interconnects dissipate energy at ~kT/gate length

Transistor average fan-out is three

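[Figure: block diagram of the one-bit Minimal Turing Machine: Memory, Program Counter, 2-4 decoder, 2-bit Counter, and CPU with ALU (inputs X, Y, C0; output Z; instruction inputs I1, I2; control lines S1-S6). Red numbers on the diagram = number of transistors per block. Total: 314 devices.]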

Operational energy of the Minimal Turing Machine

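Joyner tiling: $Area_{min} = 8a^2 \cdot n = 8a^2 \cdot 314 \approx 2500a^2 = 50a \times 50a$

For $a_{min} = 1.5$ nm: $Area_{min} = 75\ \text{nm} \times 75\ \text{nm}$

Energy per cycle: $E_{op} \approx \tfrac{9}{2}\, n\, k_B T \ln 2 \approx 980\, k_B T/\text{cycle} \approx 4 \times 10^{-18}$ J/cycle

Per full CPU operation: $E_{op} \approx 3 \times 4 \times 10^{-18}$ J/cycle $\approx 10^{-17}$ J/operation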
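The energy figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not the author's calculation: the temperature of 300 K and the reading of the factor 3 as cycles per CPU operation are assumptions, since the slide states neither explicitly.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant in J/K (exact SI value)
T_KELVIN = 300.0        # ASSUMPTION: room temperature; the slide gives no T
N_DEVICES = 314         # device count from the slide
E_CYCLE_IN_KT = 980.0   # energy per cycle from the slide, in units of k_B*T
CYCLES_PER_OP = 3       # ASSUMPTION: the factor 3 is cycles per CPU operation

kt_joule = K_B * T_KELVIN                    # thermal energy k_B*T
landauer_joule = kt_joule * math.log(2)      # k_B*T*ln2, for comparison
e_cycle_joule = E_CYCLE_IN_KT * kt_joule     # energy per machine cycle
e_op_joule = CYCLES_PER_OP * e_cycle_joule   # energy per full CPU operation
kt_per_device = E_CYCLE_IN_KT / N_DEVICES    # average k_B*T spent per device

print(f"k_B*T             = {kt_joule:.3e} J")
print(f"k_B*T*ln2         = {landauer_joule:.3e} J")
print(f"energy per cycle  = {e_cycle_joule:.3e} J")
print(f"energy per op     = {e_op_joule:.3e} J")
print(f"k_B*T per device  = {kt_per_device:.2f}")
```

Under these assumptions the per-cycle energy lands close to the slide's 4 × 10^-18 J, and each of the 314 devices spends roughly 3 k_B T per cycle on average.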

n = 314 (Von Neumann threshold)

Energy per cycle: $4 \times 10^{-18}$ J/cycle

Area: 75 nm × 75 nm
Devices: 314
Device density: $5.6 \times 10^{12}$ cm$^{-2}$

Time per cycle: ~2 ps

Power: 2 µW
Power density: ~30 kW/cm²

BITS = density × freq. = $10^{14}$ bit/s
MIPS: $2 \times 10^{5}$
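The system-level figures follow from the area, device count, energy per cycle, and cycle time. The sketch below recomputes them under two assumptions that are not stated in the slides: three cycles per instruction, and binary throughput taken as device count times clock frequency (which reproduces the quoted order of magnitude).

```python
A_MIN_NM = 1.5        # minimum feature size a, from the slide
SIDE_IN_A = 50        # machine footprint is 50a x 50a
N_DEVICES = 314       # device count from the slide
E_CYCLE_J = 4e-18     # energy per cycle from the slide
T_CYCLE_S = 2e-12     # time per cycle from the slide
CYCLES_PER_OP = 3     # ASSUMPTION: cycles per instruction

side_cm = SIDE_IN_A * A_MIN_NM * 1e-7           # 1 nm = 1e-7 cm
area_cm2 = side_cm ** 2                         # footprint in cm^2
density_per_cm2 = N_DEVICES / area_cm2          # devices per cm^2
power_w = E_CYCLE_J / T_CYCLE_S                 # watts
power_density_w_cm2 = power_w / area_cm2        # W/cm^2
freq_hz = 1.0 / T_CYCLE_S                       # clock frequency
bits_per_s = N_DEVICES * freq_hz                # ASSUMPTION: devices x freq
mips = freq_hz / CYCLES_PER_OP / 1e6            # million instructions per s

print(f"device density : {density_per_cm2:.2e} cm^-2")
print(f"power          : {power_w:.1e} W")
print(f"power density  : {power_density_w_cm2 / 1e3:.0f} kW/cm^2")
print(f"throughput     : {bits_per_s:.1e} bit/s")
print(f"MIPS           : {mips:.1e}")
```

The computed power density comes out somewhat above 30 kW/cm², the same order as the slide's rounded value.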

[Figure: instructions per second (MIPS, axis from 1.E-01 to 1.E+09) versus maximum binary throughput (BIT, bit/s, axis from 1.E+09 to 1.E+25). Annotations: Brain: 10^19 bit/s, 10^8 MIPS, 30 W; 10^6 W/cm². Sources: The Intel Microprocessor Quick Reference Guide and TSCP Benchmark Scores.]

The Minimal Turing Machine lies on a different performance trajectory from conventional computers
◦ Its trajectory has the slope needed to reach brain performance

More detailed physics-based analysis is needed
◦ System thermodynamics of computation

Carnot's equivalent for a computational engine?

Lessons from biological computation?

Candidates for beyond-CMOS nanoelectronics should be evaluated in the context of system scaling
◦ e.g., a spintronic minimal Turing Machine?

NRI Focus Centers

Kerry Bernstein, IBM

February 2009 Update

1. Short Term – Switches that supplement CMOS and are CMOS-compatible, supporting performance via hardware acceleration

2. Long Term – Switches that replace CMOS for general purpose high performance compute applications

1) CMOS is not going away anytime soon. Charge (the state variable) and the MOSFET (the fundamental switch) will remain the preferred HPC solution until new switches appear as the long-term replacement in 10-20 years.

2) Hardware accelerators execute selected functions faster than software performing them on the CPU. Accelerators are responsible for substantial improvements in throughput.

3) Alternative switches often exhibit emergent, idiosyncratic behavior, which we should exploit. Certain physical behaviors may emulate selected HPC instruction sequences, and some operations may be superior to their digital solutions.

4) New switches may improve high-utility accelerators. The shorter-term supplemental solution (5-15 years) improves or replaces accelerators “built in CMOS and designed for CMOS”, either on-chip, on-planar, or on-3D-stack.

Level | Metric | The Good / The Bad
Device | CV/I | Charge-Based
Device | F | Specific to Clked
Device | Ion/Ioff | Current-Based
Circuit | NAND2 Dly, Pwr, Energy, Area | Incomplete
Circuit | E-D-A Product | Optimization
Circuit | P-D-A Product | Optimization
Circuit | Logical Effort | Constrained
Architecture | MIPS, IPC | Equivalent T/P / Synchronous
Architecture | Ops/Joule | Energy Efficiency / Discrete Ops
Architecture | SpecInt | Industry Stds / New Capability

Hierarchical Benchmarking

1) Derive values for the conventional quantitative ITRS benchmarks shown in Benchmark 1.

2) Derive quantitative values and qualitative entries for architecture benchmarks shown in Benchmark 2a and 2b

3) Identify specific logic operations performed elegantly by your switch, where physical device behavior complements the desired logic operation. Determine the equivalent IPC and power of that function performed in the new switch, as shown in the Benchmark 3 example. Determine the actual IPC and Operations/Watt had the function been performed via software in the CPU.
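Step 3 amounts to computing the same two numbers, IPC and operations per watt, for two implementations of one function. The sketch below shows the bookkeeping only. The `Implementation` class and every number in it are hypothetical placeholders, not measurements from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Implementation:
    """One way of executing the benchmark function (hypothetical data)."""
    name: str
    operations: float   # operations completed in the benchmark run
    cycles: float       # clock cycles the run took
    clock_hz: float     # clock frequency
    power_w: float      # average power drawn during the run

    @property
    def ipc(self) -> float:
        # (Equivalent) instructions per cycle.
        return self.operations / self.cycles

    @property
    def ops_per_second(self) -> float:
        return self.ipc * self.clock_hz

    @property
    def ops_per_watt(self) -> float:
        # Operations per second per watt, i.e. operations per joule.
        return self.ops_per_second / self.power_w


# HYPOTHETICAL numbers, chosen only to exercise the arithmetic.
software = Implementation("software on CPU", 1e6, 4e6, 2e9, 50.0)
new_switch = Implementation("new-switch accelerator", 1e6, 5e5, 1e9, 2.0)

efficiency_gain = new_switch.ops_per_watt / software.ops_per_watt
for impl in (software, new_switch):
    print(f"{impl.name}: IPC={impl.ipc:.2f}, ops/W={impl.ops_per_watt:.2e}")
print(f"energy-efficiency gain: {efficiency_gain:.0f}x")
```

Keeping both implementations in the same units is what makes the comparison independent of any one switch's particular strength.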

Benchmark 1: Device Metrics

Defined by ITRS ERD Working Group

Captures Fundamental Device Properties

Benchmark 2: Architectural Metrics

2a. Quantitative

Communication Metrics (a value is reported for each):
◦ AREA of die/host accessible within 1 switch delay
◦ NO. OF SWITCHES accessible within 1 switch delay
◦ Sq BW / unit area: (Channels × Freq)X × (Channels × Freq)Y
◦ Sq Comm Channels (NX × NY) per unit area
◦ (Accessible area within one switch delay) / (Area of 1 switch)
◦ Mem. delay / Logic delay

Logic Metrics:
◦ Reported as Delay, Power, Energy, Area, and No. of Switches: 32-bit adder; Inverter with FO4; NAND2 FO1
◦ Generic noise immunity (dB)
◦ Generic logical effort
◦ Computational density (MIPS / no. of devices)
◦ PETE1 (EDDA)
◦ PETE2 (PDA)

2b. Qualitative
◦ CMOS compatibility
◦ Clocking infrastructure and locality
◦ Memory requirements and compatibility
◦ Scalability
◦ Reconfigurability or library dimensions

Logic Execution / Architecture

Specific Logic Function performed well

Useful Specific Physical Behaviors

Transport Mode | Information Token
Diffusion | Direct Excitons, Indirect Excitons, Spin, Pseudospin
Drift | Indirect Excitons, Spin, Pseudospin
Ballistic Transport (Fermi Velocity) | Spin, Pseudospin
Spin Wave | Spin
EM Waves | Photons, Plasmons

Delay versus Length for Various Transport Mechanisms

Azad Naeemi, Georgia Tech

New State Variables will impact communication and fan-out

[Figure: Equivalent IPC, MIPS/Watt, and Ops/Joule of a switch in an application. Applications shown: Crypto, Comp, H.264, Quantum. Switches shown: MQCA, BTBT FET, and others.]

Compare apples-to-apples, independent of particular strength

Matching Logic Functions & New Switch Behaviors

Popular Accelerators:
◦ Encrypt / Decrypt
◦ Compress / Decompress
◦ Regular Expression Scan
◦ Discrete Cosine Transform
◦ Bit-Serial Operations
◦ H.264 Standard Filtering
◦ DSP, A/D, D/A
◦ Viterbi Algorithms
◦ Image, Graphics

New Switch Ideas:
◦ Single Spin
◦ Spin Domain
◦ Tunnel-FETs
◦ NEMS
◦ MQCA
◦ Molecular
◦ Bio-inspired
◦ CMOL
◦ Excitonics
◦ ?

Example: Cryptography Hardware Acceleration
◦ Operations required: Rotate, Byte Alignment, EXORs, Multiply, Table Lookup
◦ Circuits used in the accelerator: Transmission Gates (“T-Gates”)
◦ New switch opportunity: A number of new switches (e.g., T-FETs) don’t have thermionic barriers, so they won’t suffer from CMOS pass-gate VT drop, body effect, or source-follower delay.
◦ Potential opportunity: Replace 4 T-Gate MOSFETs with 1 low-power switch.
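To make the listed cryptographic operations concrete, the sketch below shows three of them (rotate, EXOR, table lookup) as they would appear in software on a 32-bit word. The 4-bit substitution table and the `toy_round` function are illustrative inventions with no cryptographic strength, and they are not taken from any real cipher or from the presentation.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits."""
    x &= MASK32
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32


# TOY 4-bit substitution table (a permutation of 0..15), illustration only.
TOY_SBOX = [(7 * i + 3) % 16 for i in range(16)]


def sub_nibbles(x: int) -> int:
    """Table lookup: replace each 4-bit nibble of a 32-bit word."""
    out = 0
    for shift in range(0, 32, 4):
        nibble = (x >> shift) & 0xF
        out |= TOY_SBOX[nibble] << shift
    return out


def toy_round(x: int, key: int) -> int:
    """EXOR with a key, substitute nibbles, then rotate by one byte."""
    return rotl32(sub_nibbles((x ^ key) & MASK32), 8)


word = 0x12345678
print(hex(rotl32(word, 8)))
print(hex(toy_round(word, 0xA5A5A5A5)))
```

In a hardware accelerator each of these steps is wiring plus pass gates rather than a sequence of instructions, which is why the transmission-gate count matters in the example.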


Bernstein, 1/25/09

• Example of an HPC hardware accelerator's contribution to power, area, instruction retirement rate, and energy-efficiency improvement.
• The Purdue Emerging Technology Evaluator (PETE) metric is a convolution of power/energy, delay, and area.
• IPC and Ops/nJ provide an apples-to-apples comparison of new switches.
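The slides do not give the PETE formula, so the sketch below only illustrates the two simpler composite figures named in the hierarchy table, the energy-delay-area (E-D-A) and power-delay-area (P-D-A) products. The function names and all device values are hypothetical. For both products a smaller number is better.

```python
def eda_product(energy_j: float, delay_s: float, area_cm2: float) -> float:
    """Energy x delay x area; smaller is better."""
    return energy_j * delay_s * area_cm2


def pda_product(power_w: float, delay_s: float, area_cm2: float) -> float:
    """Power x delay x area; smaller is better."""
    return power_w * delay_s * area_cm2


# HYPOTHETICAL device figures, for illustration only.
devices = {
    "reference CMOS gate": dict(energy_j=1e-15, power_w=1e-4,
                                delay_s=1e-11, area_cm2=1e-10),
    "candidate switch":    dict(energy_j=1e-16, power_w=2e-6,
                                delay_s=5e-11, area_cm2=5e-11),
}

scores = {}
for name, d in devices.items():
    scores[name] = (
        eda_product(d["energy_j"], d["delay_s"], d["area_cm2"]),
        pda_product(d["power_w"], d["delay_s"], d["area_cm2"]),
    )
    print(f"{name}: EDA={scores[name][0]:.2e}, PDA={scores[name][1]:.2e}")
```

A slower switch can still score better on such a product when its energy and area savings outweigh the added delay, which is the kind of trade-off a composite metric is meant to expose.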

Paul Franzon

Department of Electrical and Computer Engineering

Goal:
◦ Determine research needs for a ~2015 1000 Petaflop computer, and smaller equivalents

Major Conclusions:
◦ Major Challenge #1: Power efficiency
  Communication overhead in computation
◦ Major Challenge #2: Resiliency
  Completing computation in the presence of permanent and transient faults
◦ Major Challenge #3: Performance scaling
  Performance scaling limited by software, communications bisection bandwidth, and memory speed

Critical Needs:
◦ Reduce power: SRAM replacements
  45 nm L1 cache: 3.6 pJ/bit
  Note: re-architecting in 3D can save ~50%
  What is the potential for an ERD to reduce this to 0.3 pJ/bit?
  Note: would require low swing on bit lines, while retaining speed and a low SET rate
◦ Reduced-power switched interconnect
  Especially packet-routed interconnect (NoC)
  What is the potential for a memory-style ERD to be used for fast switchable interconnect?
  Flash devices can do this for static reconfiguration, but faster-switching devices will be needed for dynamic reconfiguration
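The cache energy targets can be put side by side with a short calculation. This is a sketch of the arithmetic only. The access rate used to convert energy per bit into power is a hypothetical figure chosen for illustration, not a number from the presentation.

```python
BASELINE_PJ_PER_BIT = 3.6      # 45 nm L1 cache, from the slide
SAVING_FROM_3D = 0.50          # ~50% saving from re-architecting in 3D
TARGET_PJ_PER_BIT = 0.3        # ERD target, from the slide
ACCESS_RATE_BIT_PER_S = 1e12   # HYPOTHETICAL cache traffic: 1 Tbit/s

energy_3d_pj = BASELINE_PJ_PER_BIT * (1.0 - SAVING_FROM_3D)
reduction_vs_baseline = BASELINE_PJ_PER_BIT / TARGET_PJ_PER_BIT
reduction_vs_3d = energy_3d_pj / TARGET_PJ_PER_BIT


def cache_power_w(pj_per_bit: float, bits_per_s: float) -> float:
    """Average power for a given energy per bit and access rate."""
    return pj_per_bit * 1e-12 * bits_per_s


print(f"3D re-architected : {energy_3d_pj:.1f} pJ/bit")
print(f"target vs baseline: {reduction_vs_baseline:.0f}x lower")
print(f"target vs 3D      : {reduction_vs_3d:.0f}x lower")
for label, e in [("baseline", BASELINE_PJ_PER_BIT),
                 ("3D", energy_3d_pj),
                 ("ERD target", TARGET_PJ_PER_BIT)]:
    print(f"{label:10s}: {cache_power_w(e, ACCESS_RATE_BIT_PER_S):.2f} W")
```

The 0.3 pJ/bit target is therefore about an order of magnitude below today's figure, and still several times below what 3D re-architecting alone is expected to deliver.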

Blue Gene system reliability:

◦ Most of the DRAM failures are due to DIMM socket failures, not device failures

◦ Critical need: Sub-system level checkpointing and roll-back

ERD requirement:
◦ Tightly embedded Flash-like state “capture” memory for checkpointing
◦ Requirements:
  Tightly embedded, e.g., shadow registers, with minimum process change
  Slow read/write OK
  ~10 M writes minimum extrinsic reliability requirement
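The checkpoint and roll-back idea can be sketched in software as a working state paired with a shadow copy. The `CheckpointedState` class below is a hypothetical illustration, not a design from the presentation: it models the shadow registers as a deep copy and counts checkpoint writes against an endurance budget that defaults to the ~10 M writes mentioned above.

```python
import copy


class CheckpointedState:
    """Working state plus one shadow copy used for roll-back."""

    def __init__(self, state, endurance_writes=10_000_000):
        self.state = state
        self._shadow = None                       # no checkpoint taken yet
        self.endurance_writes = endurance_writes  # write budget of the shadow
        self.writes_used = 0

    def checkpoint(self):
        """Capture the current state into the shadow copy."""
        if self.writes_used >= self.endurance_writes:
            raise RuntimeError("checkpoint memory write budget exhausted")
        self._shadow = copy.deepcopy(self.state)
        self.writes_used += 1

    def rollback(self):
        """Restore the state captured by the last checkpoint."""
        if self._shadow is None:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self._shadow)


unit = CheckpointedState({"pc": 0, "acc": 0})
unit.checkpoint()
unit.state["pc"] = 7
unit.state["acc"] = 99    # pretend a transient fault corrupted this value
unit.rollback()
print(unit.state, unit.writes_used)
```

Because only checkpoint writes consume endurance, the checkpoint interval sets how long a given write budget lasts, which is why slow read/write is acceptable but a minimum write lifetime is not negotiable.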

1. Metrics for cache replacement
◦ Read and write speed for 64 kbit array
◦ Energy/bit for 64 kbit array
◦ Area for 64 kbit array
◦ SEU rate
◦ Added process complexity

2. Metrics for programmed routability
◦ Stage delay for 2x2 switch box
◦ Energy/bit for routing through 2x2 switch box
◦ Area for 2x2 switch box
◦ Configuration change speed for 2x2 switch box
◦ Added process and design complexity

3. Metrics for local checkpointing memory
◦ Read/write delay per bit
◦ Energy/bit for routing for write
◦ Area per bit
◦ Write lifetime
◦ Added process and design complexity

In future computing, both General Purpose and Application Specific, the bottleneck is not in logic operations but in memory, communications, and reliability

Opportunities arise for memory-style devices to solve these bottlenecks:
◦ Low-power SRAM replacement
◦ Ultra-low-swing, routable interconnect replacement
◦ Local non-volatile memory as an aid to resiliency

The Memory Wall for multi-core

In general-purpose multi-core processors, the tradeoff for L1-L3 between memory bandwidth and memory size is dramatic.
◦ At constant BW, two cores may require as much as 8x the memory of one core
◦ At 2x BW, two cores require only about 2x the memory of a single-core system
◦ Kerry Bernstein, “New Dimensions in Performance”, Feb. 2009

Workshop: “Technology Maturity for Adaptive, Massively Parallel Computing”, March 2009, Portland, Oregon, http://www.technologydashboard.com/adaptivecomputing/
◦ General theme: Inference Architectures and Technology
  Karlheinz Meier, U. Heidelberg, “VLSI Implementation of Very Large Scale Neuromorphic Circuits – Achievements, Challenges, Hopes”

Progress in architectures is being made but many technology challenges remain. (Complexity)

Can Emerging Research Devices accelerate realization of Inference Architectures?

Continue work on ERD Architectural Benchmarking
◦ Work with NRI MIND benchmarking effort

Develop section on memory architectures for Emerging Research Memories

Look at role of ERD/ERM in novel architectures where unique properties can provide substantial leverage; e.g. inference architectures
