32
NanoSystems Department of EE & Department of CS Stanford University The N3XT 1,000 × Subhasish Mitra

NanoSystems The N3XT 1,000

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: NanoSystems The N3XT 1,000

NanoSystems

Department of EE & Department of CS Stanford University

The N3XT 1,000×

Subhasish Mitra

Page 2: NanoSystems The N3XT 1,000

NanoSystemsNew nanotech

Devices

Fabrication

Sensors

imperfections?

large-scale fabrication?

variability?

New systemsNew applications

Newarchitectures

a

2

Page 3: NanoSystems The N3XT 1,000

Application Challenges

Compute Memory

5%

3

95%

•Dense connectivity

•Energy efficiency

•Footprint

Abundant-data apps.

Memory wall

Brain-inspired ⊃ Neural Nets

Chip realization ?

•Compute + memory

Page 4: NanoSystems The N3XT 1,000

N3XT NanoSystems

4

Computation immersed in memory

Page 5: NanoSystems The N3XT 1,000

Computation immersed in memoryIncreased

functionalityFine-grained,ultra-dense

3D

MemoryComputing logic

Impossible with business as usual5

N3XT NanoSystems

Page 6: NanoSystems The N3XT 1,000

Nano-Engineered Computing Systems Technology

6[Aly IEEE Computer 15, Proc. IEEE 19] Stanford + CMU + MIT + NTU Singapore + UC Berkeley + U. Michigan

Page 7: NanoSystems The N3XT 1,000

MRAM (quick access)

3D Resistive RAM (massive) No TSV

thermal

thermal

thermal

1D CNFET, 2D FET (logic)

Ultra-dense, fine-grained

vias

Silicon compatible

N3XT Computation Immersed in Memory

1D CNFET, 2D FET (logic)

1D CNFET, 2D FET (logic)

7

Page 8: NanoSystems The N3XT 1,000

DARPA 3DSoC Program

Max Shulaker Anantha Chandrakasan

Subhasish Mitra H.-S. Philip Wong, Simon S. Wong

Brad Ferguson Mark Nelson

Jefford Humes

8

Page 9: NanoSystems The N3XT 1,000

Carbon Nanotube FET (CNFET)CNT: d = 1.2nm

Gate

2 µm

Gate

2 µm

d

CNFET

Sub-litho

SRAMarrays

Monolithic 3D imager

Scaled devices

Major progressSkyWater Microprocessor

MIT (Shulaker), Peking U. (Peng), Stanford (Mitra, Wong) & others 9

Page 10: NanoSystems The N3XT 1,000

3D Integration

Nano-scale inter-layer vias (ILVs)Through silicon via (TSV)

l Massive ILV density >> TSV density

TSV (chip stacking) Dense, e.g., monolithic

10

Page 11: NanoSystems The N3XT 1,000

Realizing Monolithic 3D

Top Electrode

Metal Oxide

Btm Electrode

Emerging logic (e.g., CNFET)

Emerging RAM (e.g., RRAM)

Monolithic 3D

Naturally enabled: < 400 °C fabrication11

Device + Architecture benefits

+ +

Page 12: NanoSystems The N3XT 1,000

Foundry CNFET + RRAM + Monolithic 3D

12

Page 13: NanoSystems The N3XT 1,000

RRAM Advances

95

97

99

Accurate ML inference

1st RRAM System Multi-bits/cell

1bit/cell

2.6bits/cell

Accu

racy

(%)

New RRAMEndurance

10-year continuous ML inference

Accu

racy

(%)

ENDURER

No resilience

100

01s 10yr

Application lifetime

28× Lower vs. Flash

Low-energy Operation

Activ

e en

ergy

(uJ)

0

40

28×2.3×

FlashRRAM

[Wu ISSCC 19] Stanford + CEA LETI + NTU Singapore 13

Page 14: NanoSystems The N3XT 1,000

First Multiple bits-per-cell RRAM System

Bits per cell

Cells measured

Our work new algorithms

3 Full arrays

Prior workad hoc 2-6.5

Single cell, fewhand-picked cells

Neural netsOptimized weight encoding

Cross-layer

RRAM arraysmultiple bits-per-cell

2.3× accurate inferenceSame hardware, bigger model

14

Page 15: NanoSystems The N3XT 1,000

ENDURER

1 mo 10 yrs0

50

10-year lifetime (measured)100

No resilience

1 s 1 min 1 hr

With ENDURER

Mea

sure

din

fere

nce

ac

cura

cy(%

)

Continuous inference

System test setup

15

Non-volatile IoT microcontroller

Page 16: NanoSystems The N3XT 1,000

N3XT Cross-Layer ModSim

Many chips system

Few chips system

Architecture

Circuit

Device16

Page 17: NanoSystems The N3XT 1,000

OpenSPARC T2 Processor Core

0.05

0.5

0.1 10

tota

l ene

rgy

perc

ycle

(nJ)

1clock frequency (GHz)

FinFETNanowire FET

CNFET

preferred

[Hills IEEE TNANO 18] Stanford + IMEC + TSMC 17

Page 18: NanoSystems The N3XT 1,000

Many FAQs Answered

18

1. Where do benefits come from ?

2. Wires limit performance ? Why research FETs ?

3. CNFET contact resistance ?

4. Aren’t FETs good enough already ?

5. CNT variations ?

Page 19: NanoSystems The N3XT 1,000

NanoSystem Design Kit

Available: nanohub.org 19

Page 20: NanoSystems The N3XT 1,000

N3XT Cross-Layer ModSim

Many chips system

Few chips system

Architecture

Circuit

Device20

Page 21: NanoSystems The N3XT 1,000

Explore architecturesEnergy,

exec. time

Physical design,

Heterogeneous technologies Abundant-

data apps

N3XT Simulation FrameworkJoint technology, design & app. exploration

yield, reliability21

Page 22: NanoSystems The N3XT 1,000

Massive BenefitsDeep Learning, Graph Analytics, …

10×

100×

PageRank ConnectedComponents

Breadth-First

Search

Linear Regression

Language model (LSTMNeural

AlexNet(NeuralNetwork)

Bene

fits 851× 400× 510× 970× 1,950× 210×

~1,000× benefits, existing software

Energy Execution Time Network)22

Page 23: NanoSystems The N3XT 1,000

Compute + Memory Integration

Off-chip DRAM

PCB

Limited I/O connectivity

Si interposermicron scale

Energy-efficient

logic (thin

layers)

Traditional 2D system

23

Logic die 3D TSV system

HBM-type DRAMTSV +µBump(micronscale)

Logic die

2.5D system HBM-type DRAM

Logicdie

N3XT system High-density on-chip non-volatile memory

Dense ILV(nm

scale)

Page 24: NanoSystems The N3XT 1,000

Compute + Memory Integration

1000x

100x

10x

1x

LSTM (Language Model, batch 1)

8.7×

24

102×

CNN (AlexNet, batch 16)1,971×

5.7×23× 30×

Syst

em-le

vel E

DP

bene

fits

(v

s. 2

Dba

selin

e)

Number of vertical connections to 3D RRAM in N3XT (normalized to number of connections to off-chip DRAM in 2D baseline)

16xN3XT with 5µm

via pitch

512xN3XT with 1µm

via pitch

16,384xN3XT with 100 nm

via pitch

Page 25: NanoSystems The N3XT 1,000

Compute + Memory Integration

10x

1x

1000x

100x

Logic EDP benefits in upper tiers vs. SiFETs (with CNFET-based compute tier for all configurations)

LSTM (Language Model, batch 1)

25

202×

CNN (AlexNet, batch 16)1,971×

3.7×15× 30×

Sys

tem

-leve

l ED

P be

nefit

s

(vs.

2D

base

line)

0.01xLTFET upper tiers

1xXFET upper tiers

10xCNFET upper tiers

Page 26: NanoSystems The N3XT 1,000

Resilience

128MBytes

32KBytes 1 year 10 years

Preferred

Lifetime

Extra

SRAM

Flash-inspired wear leveling

ENDURER

Continuous inference

[Grossi IEEE TED 19, Stanford + CEA LETI] Machine learning accelerator + 4 GBytes RRAM 26

Page 27: NanoSystems The N3XT 1,000

l Software mapping

l New micro-architectures & memory hierarchy

l Dense compute + thermal

27

Many More (ModSim) Opportunities

Page 28: NanoSystems The N3XT 1,000

N3XT Cross-Layer ModSim

Few chips system

Architecture

Circuit

Device

Many chips system

28

Collaborators: LBL, ORNL

Page 29: NanoSystems The N3XT 1,000

Distributed Simulation ChallengesServer 0 Server N

MPI Rank 0 MPI Rank N

Slowdown!

29

SDP/TCP timeouts, etc.

instrumentation instrumentation

l Instrumentation slows execution

l Remote nodes can’t reply before timeout

Slowdown!

Page 30: NanoSystems The N3XT 1,000

Ongoing: Decoupled Tracing + ModSim

35

30

70Graph500 BFS, 1M-entry traces

Trac

e da

ta c

aptu

red

(GBy

tes)

0512 1,024 2,048 4,096 8,192

Worker processes (ranks)

ORNL TITAN cluster 512 nodes,8,192 ranks

Instruction-level traces

Page 31: NanoSystems The N3XT 1,000

Thanks: Students, Sponsors, Collaborators

31

Page 32: NanoSystems The N3XT 1,000

Conclusion

32

l NanoSystems today

l RRAM, CNFET, dense (monolithic) 3D

▪ In fabs now (from labs)

l Many cross-layer opportunities