Ralph K. Cavin, III
March 18, 2009
Brussels
Is there a Carnot-like theorem for computation?
◦ e.g., a limit on rate of information throughput/power consumed?
The MIND architecture benchmarking activity for novel devices
Memory Architectures
Inference Architectures
Chose a simple one-bit, four-instruction processor
All transistors operate at ~kT switching energy
Interconnects dissipate energy at ~kT/gate length
Transistor average fan-out is three
[Figure: block diagram of the one-bit processor (Memory, Program Counter, 2-4 decoder, 2-bit counter, adder, and CPU with ALU; signals I1, I2, S1-S6, X, Y, Z, C0, C1), annotated with per-block transistor counts in red. Total: 314 devices.]
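Operational energy of the Minimal Turing Machine
Joyner tiling: $\mathrm{Area}_{\min} = n \cdot 8a^2 = 314 \cdot 8a^2 \approx 2500a^2 = 50a \times 50a$
$a_{\min} = 1.5$ nm, so $\mathrm{Area}_{\min} = 75\ \mathrm{nm} \times 75\ \mathrm{nm}$
Energy per cycle: $E_{op} = n \cdot \tfrac{9}{2}\, k_B T \ln 2 \approx 980\, k_B T/\mathrm{cycle} \approx 4 \times 10^{-18}$ J/cycle
Per full CPU operation: $E_{op} \approx 3 \times 4 \times 10^{-18}$ J/cycle $\approx 10^{-17}$ J/operation
Von Neumann threshold: $n = 314$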
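The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a reading of the slide, not the author's calculation: it assumes room temperature (T = 300 K, not stated on the slide), a Joyner tiling of 8a² per device, an energy per cycle of n · (9/2) · k_B·T · ln 2, and three cycles per full CPU operation. All variable names are illustrative.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # ASSUMPTION: room temperature, not stated on the slide

n_devices = 314      # device count from the slide
a_min_nm = 1.5       # minimum feature size from the slide, nm

# Joyner tiling as read from the slide: each device occupies 8 * a^2.
area_nm2 = n_devices * 8 * a_min_nm ** 2
side_nm = math.sqrt(area_nm2)

# ASSUMPTION: energy per cycle read as n * (9/2) * k_B * T * ln 2.
e_cycle_kT = n_devices * 4.5 * math.log(2)
e_cycle_J = e_cycle_kT * K_B * T

# ASSUMPTION: three cycles per full CPU operation (the slide's factor of 3).
e_op_J = 3 * e_cycle_J

print(f"tile side        : {side_nm:.1f} nm")
print(f"energy per cycle : {e_cycle_kT:.0f} kT = {e_cycle_J:.2e} J")
print(f"energy per op    : {e_op_J:.2e} J")
```

Under these assumptions the results land close to the slide's rounded values of 75 nm, about 980 k_B·T (4 × 10⁻¹⁸ J) per cycle, and about 10⁻¹⁷ J per operation.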
Energy per cycle: $4 \times 10^{-18}$ J/cycle
Area: 75 nm × 75 nm
Devices: 314
Device density: $5.6 \times 10^{12}$ cm$^{-2}$
Time per cycle: ~2 ps
Power: 2 µW
Power density: ~30 kW/cm$^2$
BITS = density × freq. = $10^{14}$ bit/s
MIPS: $2 \times 10^5$
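The derived quantities on this slide follow from the energy per cycle, the cycle time, the device count, and the tile size. The sketch below recomputes them as a consistency check. Two points are assumptions rather than statements from the slide: the 10¹⁴ bit/s figure is reproduced by multiplying the device count (not the areal density) by the clock frequency, and the MIPS figure assumes three cycles per instruction.

```python
e_cycle_J = 4e-18    # energy per cycle from the slide, J
t_cycle_s = 2e-12    # time per cycle from the slide, s
n_devices = 314      # device count from the slide
side_cm = 75e-7      # 75 nm tile side expressed in cm

area_cm2 = side_cm ** 2
density_per_cm2 = n_devices / area_cm2

power_W = e_cycle_J / t_cycle_s
power_density_W_cm2 = power_W / area_cm2

freq_Hz = 1.0 / t_cycle_s
# ASSUMPTION: binary throughput taken as device count times frequency.
bits_per_s = n_devices * freq_Hz
# ASSUMPTION: three cycles per instruction.
mips = freq_Hz / 3 / 1e6

print(f"device density : {density_per_cm2:.2e} cm^-2")
print(f"power          : {power_W:.1e} W")
print(f"power density  : {power_density_W_cm2 / 1e3:.0f} kW/cm^2")
print(f"throughput     : {bits_per_s:.1e} bit/s")
print(f"MIPS           : {mips:.1e}")
```

The power works out to 2 × 10⁻⁶ W, and the remaining values agree with the slide to within its rounding.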
[Figure: instructions per second (MIPS, 1.E-01 to 1.E+09) versus maximum binary throughput (bit/s, 1.E+09 to 1.E+25). Annotations: Brain: 10^19 bit/s, 10^8 MIPS, 30 W; 10^6 W/cm2. Sources: The Intel Microprocessor Quick Reference Guide and TSCP Benchmark Scores.]
The Minimal Turing Machine lies on a different performance trajectory from conventional computers
◦ It has the slope to meet brain performance
More detailed physics-based analysis is needed
◦ System thermodynamics of computation
Carnot’s equivalent for a Computational Engine?
Lessons from Biological Computation?
Candidates for beyond-CMOS nano-electronics should be evaluated in the context of system scaling
◦ e.g. spintronic minimal Turing Machine?
NRI Focus Centers
Kerry Bernstein, IBM
February 2009 Update
1. Short Term – Switches that supplement CMOS and are CMOS-compatible, supporting performance via hardware acceleration
2. Long Term – Switches that replace CMOS for general purpose high performance compute applications
1) CMOS is not going away anytime soon. Charge (state variable), and the MOSFET (fundamental switch) will remain the preferred HPC solution until new switches appear as the long term replacement solution in 10-20 years.
2) Hardware accelerators execute selected functions faster than software performing them on the CPU. Accelerators are responsible for substantial improvements in throughput.
3) Alternative switches often exhibit emergent, idiosyncratic behavior. We should exploit them. Certain physical behaviors may emulate selected HPC instruction sequences. Some operations may be superior to digital solutions.
4) New switches may improve high-utility accelerators. The shorter term supplemental solution (5-15 years) improves or replaces accelerators “built in CMOS and designed for CMOS”, either on-chip, or on-planar, or on-3D-stack.
Level         Metric                            The Good / The Bad
Device        CV/I                              Charge-Based
Device        F                                 Specific to Clked
Device        Ion/Ioff                          Current-Based
Circuit       NAND2 Dly, Pwr, Energy, Area      Incomplete
Circuit       E-D-A Product                     Optimization
Circuit       P-D-A Product                     Optimization
Circuit       Logical Effort                    Constrained
Architecture  MIPS, IPC                         Equivalent T/P / Synchronous
Architecture  Ops/Joule                         Energy Efficiency / Discrete Ops
Architecture  SpecInt                           Industry Stds / New Capability
Hierarchical Benchmarking
1) Derive values for the conventional quantitative ITRS benchmarks shown in Benchmark 1.
2) Derive quantitative values and qualitative entries for architecture benchmarks shown in Benchmark 2a and 2b
3) Identify specific logic operations performed elegantly by your switch: where physical device behavior complements desired logic operation. Determine the equivalent IPC and power of that function performed in the new switch, as shown in Benchmark 3 example. Determine the actual IPC and Operations/Watt had the function been performed via software in the CPU.
Benchmark 1: Device Metrics
Defined by ITRS ERD Working Group
Captures Fundamental Device Properties
Benchmark 2: Architectural Metrics
2a. Quantitative
Communication Metrics (Values)
◦ AREA of die/host accessible within 1 switch delay
◦ NO. OF SWITCHES accessible within 1 switch delay
◦ Sq BW / unit area: (Channels x Freq)X x (Channels x Freq)Y
◦ Sq Comm Channels (NX x NY) per unit area
◦ (Accessible Area within one switch delay) / (Area of 1 switch)
◦ Mem. delay / Logic Delay
Logic Metrics (columns: Metric, Delay, Power, Energy, Area, No. of Sw.)
◦ 32 Bit Adder
◦ Inverter with FO4
◦ NAND2 FO1
◦ Generic Noise Immunity (dB)
◦ Generic Logical Effort
◦ Comp. Density (MIPS/no of devs)
◦ PETE1 (EDDA)
◦ PETE2 (PDA)
2b. Qualitative
◦ CMOS Compatibility
◦ Clocking infrastructure and Locality
◦ Memory Reqmts and compatibility
◦ Scalability
◦ Reconfigurability or Library Dimensions
◦ Logic Execution / Architecture
◦ Specific Logic Function performed well
◦ Useful Specific Physical Behaviors
Transport Mode: Information Token
Diffusion: Direct Excitons, Indirect Excitons, Spin, Pseudospin
Drift: Indirect Excitons, Spin, Pseudospin
Ballistic Transport (Fermi Velocity): Spin, Pseudospin
Spin Wave: Spin
EM Waves: Photons, Plasmons
Delay versus Length for Various Transport Mechanisms
Azad Naeemi, Georgia Tech
New State Variables will impact communication and fan-out
[Figure: Equivalent IPC, MIPS/Watt, and Ops/Joule of switch in application. Categories: Crypto, Comp, H.264, Quantum, MQCA, BTBT FET.]
Compare apples-to-apples, independent of particular strength
Matching Logic Functions & New Switch Behaviors
Popular Accelerators
◦ Encrypt / Decrypt
◦ Compr / Decompr
◦ Reg. Expression Scan
◦ Discrete COS Trnsfrm
◦ Bit Serial Operations
◦ H.264 Std Filtering
◦ DSP, A/D, D/A
◦ Viterbi Algorithms
◦ Image, Graphics
New Switch Ideas
◦ Single Spin
◦ Spin Domain
◦ Tunnel-FETs
◦ NEMS
◦ MQCA
◦ Molecular
◦ Bio-inspired
◦ CMOL
◦ Excitonics
◦ ?
Example: Cryptography Hardware Acceleration
Operations required: Rotate, Byte Alignmt, EXORs, Multiply, Table Lookup
Circuits used in Accel: Transmission Gates (“T-Gates”)
New Switch Opportunity: A number of new switches (e.g. T-FETs) don’t have thermionic barriers: they won’t suffer from CMOS Pass-gate VT drop, Body Effect, or Source-Follower delay.
Potential Opportunity: Replace 4 T-Gate MOSFETs with 1 low power switch.
Bernstein, 1/25/09
• Example of HPC hardware accelerator contribution to power, area, instruction retirement rate, energy efficiency improvement.
• Purdue Emerging Technology Evaluator (PETE) metric is convolution of power/energy, delay, and area.
• IPC and Ops/nJ provide apples-to-apples comparison of new switches.
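As a rough illustration of how composite figures of merit of this kind can be compared across switches, the sketch below computes plain energy-delay-area and power-delay-area products plus Ops/nJ for two entries. This is a simplification and not the PETE definition, which the slides describe only as a convolution of power/energy, delay, and area. The class name, field names, and every numeric value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SwitchFigures:
    """Figures for one function implemented in one technology (hypothetical)."""
    name: str
    energy_j: float  # energy per operation, J
    delay_s: float   # delay per operation, s
    power_w: float   # average power while running the function, W
    area_m2: float   # area of the function block, m^2

    def eda(self) -> float:
        # Energy x delay x area product (lower is better).
        return self.energy_j * self.delay_s * self.area_m2

    def pda(self) -> float:
        # Power x delay x area product (lower is better).
        return self.power_w * self.delay_s * self.area_m2

    def ops_per_nj(self) -> float:
        # Operations completed per nanojoule (higher is better).
        return 1e-9 / self.energy_j


# HYPOTHETICAL numbers, chosen only to exercise the formulas.
baseline = SwitchFigures("cmos_accel", energy_j=1e-15, delay_s=1e-11,
                         power_w=1e-4, area_m2=1e-13)
candidate = SwitchFigures("new_switch", energy_j=1e-16, delay_s=5e-11,
                          power_w=2e-6, area_m2=1e-13)

for s in (baseline, candidate):
    print(f"{s.name}: EDA={s.eda():.2e}  PDA={s.pda():.2e}  "
          f"ops/nJ={s.ops_per_nj():.2e}")

eda_ratio = candidate.eda() / baseline.eda()
pda_ratio = candidate.pda() / baseline.pda()
print(f"candidate/baseline: EDA x{eda_ratio:.2f}, PDA x{pda_ratio:.2f}")
```

Comparing ratios rather than absolute values keeps the comparison independent of the units chosen for area or delay, which is one way to read the slide's apples-to-apples goal.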
Paul Franzon
Department of Electrical and Computer Engineering
[email protected]
Goal:
◦ Determine research needs for ~2015 1000 Petaflop computer, and smaller equivalents
Major Conclusions:
◦ Major Challenge #1: Power efficiency
   Communication overhead in computation
◦ Major Challenge #2: Resiliency
   Completing computation in presence of permanent and transient faults
◦ Major Challenge #3: Performance Scaling
   Performance scaling limited by software, communications bisection bandwidth, and memory speed
Critical Needs:
◦ Reduced power SRAM replacements
   45 nm L1 Cache: 3.6 pJ/bit
   Note: re-architecting in 3D can save ~50%
   What is the potential for an ERD to reduce to 0.3 pJ/bit?
   Note: Would require low-swing on bit lines, while retaining speed and low SET rate
◦ Reduced power switched interconnect
   Esp. packet routed interconnect (NOC)
   What is the potential for a memory-style ERD to be used for fast switchable interconnect?
   Flash devices can do this for static reconfiguration BUT faster switching devices will be needed for dynamic reconfiguration
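The cache energy figures above imply specific reduction factors, which the sketch below works out. The 3.6 pJ/bit, ~50% 3D saving, and 0.3 pJ/bit target come from the slide. The access rate used to turn energy per bit into watts is a hypothetical value chosen only for illustration.

```python
l1_pj_per_bit = 3.6        # 45 nm L1 cache energy from the slide
saving_3d = 0.5            # ~50% saving from 3D re-architecting, from the slide
erd_target_pj_per_bit = 0.3  # ERD target from the slide

after_3d_pj_per_bit = l1_pj_per_bit * (1 - saving_3d)

# Reduction factors the ERD target would represent.
reduction_vs_l1 = l1_pj_per_bit / erd_target_pj_per_bit
reduction_vs_3d = after_3d_pj_per_bit / erd_target_pj_per_bit

# HYPOTHETICAL sustained cache traffic, used only to express the gap in watts.
access_rate_bit_per_s = 1e12


def cache_power_w(pj_per_bit: float) -> float:
    """Power drawn at the assumed access rate for a given energy per bit."""
    return pj_per_bit * 1e-12 * access_rate_bit_per_s


print(f"after 3D re-architecting: {after_3d_pj_per_bit:.1f} pJ/bit")
print(f"ERD target is {reduction_vs_l1:.0f}x below 45 nm L1, "
      f"{reduction_vs_3d:.0f}x below the 3D case")
for label, e in (("45 nm L1", l1_pj_per_bit),
                 ("3D", after_3d_pj_per_bit),
                 ("ERD target", erd_target_pj_per_bit)):
    print(f"{label:>10}: {cache_power_w(e):.1f} W at 1e12 bit/s")
```

So the ERD question on the slide amounts to asking for a 12x reduction from today's L1 figure, or 6x beyond what 3D re-architecting alone is expected to deliver.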
Blue Gene system reliability:
◦ Most of the DRAM failures are due to DIMM socket failures, not device failures
◦ Critical need: Sub-system level checkpointing and roll-back
ERD requirement:
◦ Tightly embedded Flash-like state “capture” memory for checkpointing
◦ Requirements:
   Tightly embedded, e.g. Shadow registers, with minimum process change
   Slow read/write OK
   ~10 M writes minimum extrinsic reliability requirement
1. Metrics for cache replacement
◦ Read & Write Speed for 64 kbit array
◦ Energy/bit for 64 kbit array
◦ Area for 64 kbit array
◦ SEU rate
◦ Added process complexity
2. Metrics for programmed routability
◦ Stage Delay for 2x2 switch box
◦ Energy/bit for routing through 2x2 switch box
◦ Area for 2x2 switch box
◦ Configuration change speed for 2x2 switch box
◦ Added process and design complexity
3. Metrics for Local Check-pointing memory
◦ Read/write delay per bit
◦ Energy/bit for routing for write
◦ Area per bit
◦ Write lifetime
◦ Added process and design complexity
In future computing, both General Purpose and Application Specific, the bottleneck is not in logic operations but in memory, communications, and reliability
Opportunities arise for memory style devices to solve these bottlenecks:
◦ Low power SRAM replacement
◦ Ultra-low swing, routable interconnect replacement
◦ Local non-volatile memory as an aid to resiliency
The Memory Wall for multi-core
In general purpose multi-core processors, the tradeoff for L1-L3 between memory bandwidth and memory size is dramatic.
◦ At constant BW, two cores may require as much as 8x memory of one core
◦ At 2x BW, two cores require only about 2x memory of a single core system
◦ Kerry Bernstein, “New Dimensions in Performance”, Feb. 2009
Workshop: “Technology Maturity for Adaptive, Massively Parallel Computing”, March 2009, Portland, Oregon http://www.technologydashboard.com/adaptivecomputing/.
◦ General theme: Inference Architectures and Technology
   Karlheinz Meier, U. Heidelberg, “VLSI Implementation of Very Large Scale Neuromorphic Circuits – Achievements, Challenges, Hopes”
Progress in architectures is being made but many technology challenges remain. (Complexity)
Can Emerging Research Devices accelerate realization of Inference Architectures?
Continue work on ERD Architectural Benchmarking
◦ Work with NRI MIND benchmarking effort
Develop section on memory architectures for Emerging Research Memories
Look at role of ERD/ERM in novel architectures where unique properties can provide substantial leverage; e.g. inference architectures