
CHIMERA: Efficient DNN Inference and Training at the Edge


Page 1: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA: Efficient DNN Inference and Training at the Edge with On-Chip Resistive RAM

CHIMERA: A 0.92 TOPS, 2.2 TOPS/W Edge AI Accelerator with 2 MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference, M. Giordano, K. Prabhu, K. Koul, R. M. Radway, A. Gural, R. Doshi, Z. F. Khan, J. W. Kustin, T. Liu, G. B. Lopes, V. Turbiner, W.-S. Khwa, Y.-D. Chih, M.-F. Chang, G. Lallement, B. Murmann, S. Mitra, P. Raina, VLSI 2021

KARTIK PRABHU

Page 2: CHIMERA: Efficient DNN Inference and Training at the Edge

Machine Learning at the Edge

Running ML at the edge (rather than in the fog) has benefits: it is private, user-specific, energy-efficient, and responsive. It also has challenges: on-device training, large DNN model sizes, low power budgets, and high throughput requirements. These are challenging with today's memory technologies (SRAM, DRAM, Flash).

Page 3: CHIMERA: Efficient DNN Inference and Training at the Edge

On-Chip Non-Volatile Resistive RAM (RRAM)

RRAM is a voltage-controlled resistance: a conductive filament forms between a top electrode and a bottom electrode, and each bit cell stores its value in the resistance state (low resistance = 0, high resistance = 1).

[Wong and Salahuddin, Nat. Nanotech '15]

Page 4: CHIMERA: Efficient DNN Inference and Training at the Edge

On-Chip Non-Volatile Resistive RAM (RRAM)

Benefits

❑ Non-volatile

❑ High density (53F²)

❑ Low read energy (8 pJ/Byte)

❑ High bandwidth (6 Gbits/s)

Challenges

❑ High write energy (40.3 nJ/Byte)

❑ High write latency (0.4 μs/Byte)

❑ Low endurance (10k-100k writes)

[Chou et al., ISSCC ‘18]

Page 5: CHIMERA: Efficient DNN Inference and Training at the Edge

Key Contributions

1. RRAM-optimized ML SoC: 2 MB on-chip RRAM with a DNN accelerator
2. Illusion system for scale-out: ≤5% energy and latency overheads
3. First on-device RRAM training: RRAM-aware training algorithm (LRT)

[Plot: accuracy (%) vs. samples (0-40k) during on-device incremental training; previously learned classes stay near 100%.]

Page 6: CHIMERA: Efficient DNN Inference and Training at the Edge

Key Contributions (recap of Page 5, highlighting the RRAM-optimized ML SoC discussed next)

Page 7: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA SoC

• 2 MBytes foundry RRAM

• DNN weights

• CPU instructions

• 512 KBytes SRAM

• DNN accelerator

• 0.92 TOPS

• 64b RISC-V CPU

• 2 chip-to-chip (C2C) links

• Inter-chip communication

Page 8: CHIMERA: Efficient DNN Inference and Training at the Edge

Design Space Exploration Methodology

Design space exploration: the neural network specification, the design space specification, and component energy models (MAC, RF, memories) feed an architectural design space exploration (Interstellar), which produces the optimal architectural parameters (# MACs, memory sizes, etc.).

Hardware: the parameters configure parameterized DNN accelerator HLS code (Catapult HLS), which generates the DNN accelerator RTL; the RTL is integrated into the RISC-V SoC and pushed through the VLSI flow for power and performance analysis.

Software: quantization and optimal loop tiling (Interstellar) produce the low-level accelerator driver code.
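
As a concrete illustration of the exploration step, here is a minimal Python sketch that enumerates candidate buffer sizes and MAC counts and keeps the minimum energy-delay point. The energy constants, function names, and layer numbers are illustrative assumptions, not values from the paper or from Interstellar.

```python
# Hypothetical sketch of the architectural exploration step: enumerate a few
# candidate buffer sizes and MAC counts, score each design point with toy
# per-component energy models, and keep the minimum energy-delay point.
# All constants, names, and layer numbers are illustrative assumptions.
from itertools import product

MAC_PJ = 0.1  # assumed energy per multiply-accumulate (pJ)

def sram_read_pj(kbytes: float) -> float:
    """Toy model: a larger SRAM costs more energy per access."""
    return 0.5 + 0.02 * kbytes

def edp(input_kb: float, acc_kb: float, num_macs: int, layer: dict) -> float:
    """Toy energy-delay product for one layer on a candidate design point."""
    energy_pj = layer["macs"] * MAC_PJ + layer["accesses"] * sram_read_pj(input_kb + acc_kb)
    cycles = layer["macs"] / num_macs  # more parallel MACs => fewer cycles
    return energy_pj * cycles

layer = {"macs": 231e6, "accesses": 4e6}  # ResNet-18-like conv layer (assumed)
candidates = product([32, 64, 128],   # input buffer size (KB)
                     [16, 32, 64],    # accumulation buffer size (KB)
                     [64, 128, 256])  # number of MACs
best = min(candidates, key=lambda p: edp(*p, layer))
print("minimum-EDP design point (input KB, acc KB, MACs):", best)
```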

Page 9: CHIMERA: Efficient DNN Inference and Training at the Edge

DNN Accelerator

Page 10: CHIMERA: Efficient DNN Inference and Training at the Edge

Conventional DRAM-based DNN Accelerator

A 16x16 systolic array of PEs is fed by an input double buffer, an SRAM weight buffer, and an accumulation buffer; weights and inputs are fetched from off-chip DRAM, which costs high-energy accesses.

[Charts: capacity (MB) and energy (μJ) of the weight, accumulation, and input buffers, sized for minimum energy.] [Yang et al., ASPLOS '20]

Page 11: CHIMERA: Efficient DNN Inference and Training at the Edge

1. On-Chip RRAM-based DNN Accelerator

DNN weights are loaded directly from on-chip RRAM, so the SRAM weight buffer is removed.

[Charts: buffer capacity (MB) and energy (μJ), buffers sized for minimum energy.] [Yang et al., ASPLOS '20]

Page 12: CHIMERA: Efficient DNN Inference and Training at the Edge

1. On-Chip RRAM-based DNN Accelerator (continued)

With weights loaded directly from RRAM, the input and accumulation buffers also shrink 2x.

[Charts: buffer capacity (MB) and energy (μJ), buffers sized for minimum energy.]

Page 13: CHIMERA: Efficient DNN Inference and Training at the Edge

2. PE Filter Unrolling to Reduce Input Buffer

Conventional accelerators only unroll input and output channels.

[Figure: 16x16 systolic array; processing element with weights, inputs, and 9 multiply-accumulates. Charts: buffer capacity and energy for an X·Y·C0·C1 tiling, buffers sized for minimum energy.]

Page 14: CHIMERA: Efficient DNN Inference and Training at the Edge

2. PE Filter Unrolling to Reduce Input Buffer (continued)

The filter dimensions (Fx = Fy = 3) are unrolled inside the PE: each PE performs 9 multiply-accumulates over a 3x3 weight/input patch. This reduces the input buffer by 9x while achieving the same data reuse.

[Figure: 16x16 systolic array; PE with 3x3 filter unrolling. Charts: buffer capacity and energy for an X·Y·C0·C1 tiling, buffers sized for minimum energy.]
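
To make the dataflow concrete, below is a minimal NumPy sketch (an illustrative model, not the accelerator's RTL) of a stride-1 convolution with the 3x3 filter loop unrolled into the PE, so each PE step performs the slide's 9 multiply-accumulates on one input patch rather than iterating Fx and Fy in outer loops.

```python
# A minimal NumPy sketch (an illustrative model, not the accelerator's RTL) of
# a stride-1 "valid" convolution with the 3x3 filter loop unrolled into the PE:
# each PE step performs 9 multiply-accumulates on one 3x3 input patch, so the
# input buffer only needs patches, not whole input planes.
import numpy as np

def pe_unrolled_mac(patch3x3: np.ndarray, weights3x3: np.ndarray) -> float:
    """One PE step: 9 multiply-accumulates, fully unrolled over the filter."""
    return float(np.sum(patch3x3 * weights3x3))

def conv2d_filter_unrolled(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stride-1 valid convolution; the Fx/Fy loops live inside the PE."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):       # output rows (Y)
        for j in range(W - 2):   # output columns (X)
            out[i, j] = pe_unrolled_mac(x[i:i + 3, j:j + 3], w)
    return out
```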

Page 15: CHIMERA: Efficient DNN Inference and Training at the Edge

3. Stride Registers to Reduce Input Accesses

In-PE filter unrolling allows immediate reuse of overlapped pixels during striding.

[Figure: 16x16 systolic array with input double buffer; stride annotation. Charts: buffer capacity and energy, buffers sized for minimum energy.]

Page 16: CHIMERA: Efficient DNN Inference and Training at the Edge

3. Stride Registers to Reduce Input Accesses (continued)

A stride register in each PE exploits horizontal input reuse: as the 3x3 window strides across the input, only 3 new pixels are fetched instead of all 9, reducing input buffer accesses and energy by 3x.

[Charts: final buffer capacity and energy with stride registers; 2.8x and 20.5x reductions vs. the conventional design, buffers sized for minimum energy.]
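
A small sketch of the stride-register idea under the same illustrative assumptions: the PE keeps the previous two input columns in registers and fetches only one new column per output step, so each 3x3 window costs 3 buffer reads instead of 9 (the 3x reduction cited above).

```python
# Illustrative stride-register sketch: keep the previous two input columns in
# registers and fetch only the new column per output step, so each 3x3 window
# costs 3 input-buffer reads instead of 9.
import numpy as np

def conv_row_with_stride_registers(rows3: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Convolve three input rows (shape 3 x W) with a 3x3 filter, stride 1."""
    _, W = rows3.shape
    out = np.zeros(W - 2)
    regs = [rows3[:, 0], rows3[:, 1]]    # stride registers: last two columns
    for j in range(2, W):
        regs.append(rows3[:, j])         # the only input-buffer fetch this step
        window = np.stack(regs, axis=1)  # 3x3 window rebuilt from registers
        out[j - 2] = float(np.sum(window * w))
        regs.pop(0)                      # shift the registers left
    return out
```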

Page 17: CHIMERA: Efficient DNN Inference and Training at the Edge

ResNet-18 Results

Measured on ImageNet:
  Energy: 8.1 mJ/image
  Latency: 60 ms/image
  Average power: 136 mW
  Efficiency: 2.2 TOPS/W

With on-chip RRAM, total power is dominated by the DNN accelerator, NOT by memory accesses.

Simulated power breakdown: systolic array interconnect 19%, MACs 16%, RRAM + controller 16%, input and accumulation buffers 15%, address generation/control 12%, other DNN accelerator logic 7%, IO 6%, others 4%, main SRAM 3%, RISC-V 2%. The DNN accelerator ("DA") accounts for 69% of total power.

Page 18: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERAv2 and CHIMERAvS

• CHIMERAv2 implements several optimizations:
  • Increased RRAM to 4 MBytes
  • Data prefetching
  • Loop replication
• CHIMERAvS replaces CHIMERAv2's 4 MB of RRAM with SRAM

                                  v1     v2     vS
Weights memory capacity (MB)       2      4      4
Weights memory area (mm²)       10.8   20.2   30.5
Total die area (mm²)            29.2   29.2   39.5

Page 19: CHIMERA: Efficient DNN Inference and Training at the Edge

ResNet-18 Inference Results

                        v1     v2     vS
Energy (mJ)            8.1    1.8    1.5
Latency (ms)          60.0   15.2   15.1
Average power (mW)     135    118     99

Page 20: CHIMERA: Efficient DNN Inference and Training at the Edge


On-Chip RRAM vs Traditional SRAM


Page 22: CHIMERA: Efficient DNN Inference and Training at the Edge

RRAM-Optimized DNN Accelerator

1. On-chip RRAM overcomes the memory access bottleneck
+
2. Low read-energy RRAM weights reduce SRAM buffer sizes and energy
+
3. New dataflow: filter unrolling for larger filters improves input reuse
=
Efficient on-device inference

Page 23: CHIMERA: Efficient DNN Inference and Training at the Edge

Key Contributions (recap of Page 5, highlighting the Illusion system for scale-out discussed next)

Page 24: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA: Scale Out to Larger DNNs

A single CHIMERA has 2 MBytes of RRAM.

DNN size     Approach     Benefits from on-chip RRAM
≤2 MBytes    CHIMERA      ✔

[Radway et al., Nat. Electronics '21]

Page 25: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA: Scale Out to Larger DNNs (continued)

DNN size     Approach                           Benefits from on-chip RRAM
≤2 MBytes    CHIMERA                            ✔
≥2 MBytes    Single CHIMERA with more RRAM      ✔, but a bigger chip is needed

A "Dream Chip" with ≥2 MBytes of RRAM is a moving target as DNNs continue growing.

[Radway et al., Nat. Electronics '21]

Page 26: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA: Scale Out to Larger DNNs (continued)

DNN size     Approach                              Benefits from on-chip RRAM
≤2 MBytes    CHIMERA                               ✔
≥2 MBytes    Single CHIMERA with more RRAM         ✔, but a bigger chip is needed
≥2 MBytes    CHIMERA + off-chip memory (DRAM)      No (high-energy accesses)

[Radway et al., Nat. Electronics '21]

Page 27: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA: Scale Out to Larger DNNs (continued)

DNN size     Approach                                   Benefits from on-chip RRAM
≤2 MBytes    CHIMERA                                    ✔
≥2 MBytes    Single CHIMERA with more RRAM              ✔, but a bigger chip is needed
≥2 MBytes    CHIMERA + off-chip memory (DRAM)           No (high-energy accesses)
≥2 MBytes    Multiple CHIMERAs in an Illusion system    ✔

Six chips (Chip 0 through Chip 5), each with 2 MBytes of RRAM, create the "illusion" of the Dream single chip.

[Radway et al., Nat. Electronics '21]

Page 28: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA Illusion System for 12-MByte DNNs

Six CHIMERA chips (C0-C5) are connected by chip-to-chip (C2C) links (77 pJ/bit, 1.9 Gbits/s), with power-gating devices so idle chips can be shut down.

Page 29: CHIMERA: Efficient DNN Inference and Training at the Edge

RRAM Leads to Quick Chip Wakeup/Shutdown

• SoC ready from wakeup in 33 μs
• Full SoC shutdown in 10 ns
• SoC instructions & DNN weights preserved through shutdown (RRAM is non-volatile)
• CHIMERAs are on only when actively computing

[Waveform: Power, Reset, CLK, Ready over a 0-40 μs wake-up window; Ready asserts 33 μs after power-up. Diagram: Chips 0-5 marked active or shut down.]

Page 30: CHIMERA: Efficient DNN Inference and Training at the Edge

Minimizing Inter-Chip Messaging Overheads

Traditional multi-chip systems require high message traffic to keep all chips occupied. CHIMERA has enough weight RRAM per chip that data transfers are infrequent, and Illusion maps the DNN to minimize inter-chip traffic.

Page 31: CHIMERA: Efficient DNN Inference and Training at the Edge

Mapping 1: Minimize Energy / Inter-Chip Messages

An energy-efficient ResNet-18 mapping for sporadic inference: the network runs through Chips 0-5 in sequence from input to classification, with inactive chips shut down, so inter-chip traffic is minimal (373 KBytes vs. the 12-MByte DNN).

Page 32: CHIMERA: Efficient DNN Inference and Training at the Edge

Mapping 2: Minimize Latency

A low-latency ResNet-18 mapping for high-throughput inference: the input is split and weights are duplicated across chips for parallel compute, at the cost of higher inter-chip traffic (1.5 MBytes).

Page 33: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA Illusion System Measurements

[Chart: execution time and energy of Mapping 1 and Mapping 2, normalized to the ResNet-18 per-layer total measured on a single CHIMERA (i.e., a single "Dream" chip).]

ResNet-18 performance:
• Mapping 1 (minimal energy): +4% energy, +5% execution time
• Mapping 2 (minimal latency): +35% energy, -20% execution time

The minimal-energy mapping costs ≤4% energy and ≤5% execution time overhead.

Page 34: CHIMERA: Efficient DNN Inference and Training at the Edge

Compared to Traditional MCMs

ResNet-50 inference                       CHIMERAv2    Simba
Technology                                40 nm        16 nm
Core energy (mJ)                          5.18         16.3
Single-chip latency (ms)                  5.29*        4.64
Multi-chip throughput (inferences/ms)     1.74**       2.06

*Scaled for iso-frequency. **Scaled for iso-MACs and frequency.

[Shao et al., MICRO '19; Zimmer et al., VLSI '19; Zimmer et al., JSSC '20; Venkatesan et al., Hot Chips '19]

Page 35: CHIMERA: Efficient DNN Inference and Training at the Edge

CHIMERA Illusion System

1. A fast, energy-efficient RRAM-optimized DNN accelerator
+
2. Quick chip wakeup/shutdown: < 58 μJ / 33 μs
+
3. Mappings with minimal messages: < 1/30th of DNN weights transferred
=
Scalable inference system enabled by RRAM

Page 36: CHIMERA: Efficient DNN Inference and Training at the Edge

Key Contributions (recap of Page 5, highlighting on-device training with the RRAM-aware LRT algorithm discussed next)

Page 37: CHIMERA: Efficient DNN Inference and Training at the Edge

Training the DNN Off-Line

The DNN model is a feature extractor followed by a classifier (a fully connected layer). It is pre-trained off-line on generic classes with abundant data (the full ImageNet1000 dataset, 1M images) plus user-specific classes with limited data (a subset of Flowers102, 1 image/class).

Page 38: CHIMERA: Efficient DNN Inference and Training at the Edge

On-Device Incremental Training

The deployed model then trains incrementally at the edge on the opposite distribution: generic classes with limited data (random ImageNet images, 6 images/class) and user-specific classes with abundant data (the full Flowers102 dataset, 60 images/class). Only the fully connected layer is updated; the feature extractor is frozen.

Page 39: CHIMERA: Efficient DNN Inference and Training at the Edge

On-Device Training Challenges with RRAM

SGD (stochastic gradient descent) is challenging on RRAM-based edge devices. With batch = 1 it gives immediate classification but requires frequent weight updates and makes fixed-point training difficult; with large batches it requires large additional SRAM.

Our RRAM-aware Low-Rank Training (LRT) algorithm [Gural et al., https://arxiv.org/abs/2009.03887] keeps batch = 1 for immediate classification, while making RRAM updates infrequent and fixed-point training feasible, and requiring little additional SRAM.

Page 40: CHIMERA: Efficient DNN Inference and Training at the Edge

On-Device Training

For each image i, feature extraction produces a feature vector, the fully connected layer produces class probabilities, and subtracting the one-hot correct class from the probabilities yields the gradients.
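
As a sketch (assuming a softmax + cross-entropy classifier, which the slide does not spell out), the gradient at the fully connected layer's output is just the class-probability vector minus the one-hot correct class:

```python
# A sketch assuming a softmax + cross-entropy classifier (the slide does not
# spell out the loss): the gradient at the fully connected layer's output is
# the class-probability vector minus the one-hot correct class.
import numpy as np

def fc_output_gradient(features: np.ndarray, W: np.ndarray, correct_class: int) -> np.ndarray:
    """Output-side gradient for one image; W is (num_features x num_classes)."""
    logits = W.T @ features
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax class probabilities
    grad = probs.copy()
    grad[correct_class] -= 1.0           # probabilities - one-hot correct class
    return grad                          # np.outer(features, grad) is dL/dW
```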

Page 41: CHIMERA: Efficient DNN Inference and Training at the Edge

Low-Rank Training (LRT): Initialization

LRT maintains a features matrix F (columns Feature 1 through Feature R) and a gradients matrix G (columns Gradient 1 through Gradient R), holding the accumulated weight gradient in factored, rank-R form as F·Gᵀ.

Page 42: CHIMERA: Efficient DNN Inference and Training at the Edge

Low-Rank Training (LRT): New Sample

Each new sample's feature vector and gradient vector are appended as new columns of F and G, growing the factorization to rank R + 1.

Page 43: CHIMERA: Efficient DNN Inference and Training at the Edge

Low-Rank Training (LRT): QR Decomposition

F and G are QR-decomposed: F = QF·RF and G = QG·RG. The decomposition reduces the matrix dimensions: RF and RG are (Rank + 1) x (Rank + 1), much smaller than the number of features/gradients.

Page 44: CHIMERA: Efficient DNN Inference and Training at the Edge

Low-Rank Training (LRT): QR Decomposition (continued)

With both factors decomposed, F·Gᵀ = QF·(RF·RGᵀ)·QGᵀ, so only the small (Rank + 1) x (Rank + 1) product RF·RGᵀ needs further processing.

Page 45: CHIMERA: Efficient DNN Inference and Training at the Edge

LRT: Singular Value Decomposition

A singular value decomposition (SVD) of the small product finds its most significant components: RF·RGᵀ = U·diag(σ1, ..., σR, σR+1)·Vᵀ, a (Rank + 1)-sized problem.

Page 46: CHIMERA: Efficient DNN Inference and Training at the Edge

LRT: Dimensionality Reduction

Dropping the least significant singular value truncates the factorization back to rank R: dimensionality reduction with minimal loss of information.
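
Pages 42-46 combine into one compression step. Here is a minimal NumPy sketch of that step (floating-point, for illustration only; the chip uses a fixed-point implementation): the append is assumed done, then QR, the small SVD, and truncation back to rank R.

```python
# Minimal NumPy sketch of one LRT compression step, assuming the factored
# gradient accumulator is F (m x (R+1)) and G (n x (R+1)) after appending a
# new sample. Variable names follow the slides; this is illustrative, not the
# paper's fixed-point implementation.
import numpy as np

def lrt_compress(F: np.ndarray, G: np.ndarray, R: int):
    """Truncate the rank-(R+1) factorization F @ G.T back to rank R."""
    QF, RF = np.linalg.qr(F)               # F = QF @ RF, RF is (R+1) x (R+1)
    QG, RG = np.linalg.qr(G)               # G = QG @ RG
    U, s, Vt = np.linalg.svd(RF @ RG.T)    # small (R+1) x (R+1) SVD
    # Keep the R most significant components; split sqrt(sigma) between factors.
    sqrt_s = np.sqrt(s[:R])
    F_new = QF @ U[:, :R] * sqrt_s         # m x R
    G_new = QG @ Vt[:R].T * sqrt_s         # n x R
    return F_new, G_new                    # F_new @ G_new.T ≈ F @ G.T
```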

Page 47: CHIMERA: Efficient DNN Inference and Training at the Edge

LRT: SRAM Update

The truncated factors F' and G' are written back to SRAM. The LRT-compressed gradients fit in a small SRAM (6-22 KBytes depending on the rank).

Page 48: CHIMERA: Efficient DNN Inference and Training at the Edge

LRT: Repeat for Multiple Images

The append-and-compress step repeats for each new image (Feature/Gradient i+1, i+2, ...), so the SRAM for the compressed gradient does not grow with the set size; update sets hold up to 1024 images.

Page 49: CHIMERA: Efficient DNN Inference and Training at the Edge

LRT: RRAM Weight Update

At the end of the sample set, the features and gradients are expanded to update the RRAM weights:

W = W − α(F·Gᵀ), where α is the learning rate.

A larger set means fewer RRAM updates.
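
And the end-of-set update, continuing the sketch above (alpha is the learning rate α; the bulk write to RRAM is the expensive, infrequent step):

```python
# Illustrative end-of-set RRAM weight update from the factored accumulator
# (continuing the lrt_compress sketch above).
import numpy as np

def lrt_weight_update(W: np.ndarray, F: np.ndarray, G: np.ndarray, alpha: float) -> np.ndarray:
    """Expand the rank-R factors and apply W = W - alpha * (F @ G.T)."""
    return W - alpha * (F @ G.T)   # one infrequent bulk RRAM write per set
```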

Page 50: CHIMERA: Efficient DNN Inference and Training at the Edge

Optimal Rank and Set-Size Search

[Charts: RRAM weight-update steps and energy-delay product (J·s) vs. set size S (1 to 1000) for ranks R = 1 to 16, with SGD as the reference; all lines at iso-SGD accuracy.]

A larger rank costs more computation but permits a larger set size. At the optimum (R = 14), LRT achieves 340x lower EDP and 101x fewer RRAM weight updates than SGD.

Page 51: CHIMERA: Efficient DNN Inference and Training at the Edge

Optimizing Training at the Edge

Example with m = 1000, n = 1000, B = 100, R = 4 (note R ≪ B):

                          Weight update   Batch weight   Store FC input    Low-Rank
                          every sample    update         and output        Training
Capacity                  m·n ~ 1E6       m·n ~ 1E6      B·(m+n) ~ 2E5     R·(m+n) ~ 1E3
Weight updates/sample     m·n ~ 1E6       m·n/B ~ 1E4    m·n/B ~ 1E4       m·n/B ~ 1E4
SRAM writes/sample        Not needed      Not needed     m+n ~ 2E3         R·(m+n) ~ 2E3
Ops/sample                1E6             1E6            1E6               (R+1)²·(m+n+R) ~ 5E4

Reduction in memory usage and mathematical complexity!

Page 52: CHIMERA: Efficient DNN Inference and Training at the Edge

LRT Accuracy Measured On-Device

[Plot: accuracy (%) vs. epochs (0-4) for the off-line pre-trained model on ImageNet (1000 classes) and Flowers102 (102 classes), showing the maximum achievable accuracy with limited field data.]

Page 53: CHIMERA: Efficient DNN Inference and Training at the Edge

LRT Accuracy Measured On-Device (continued)

[Plot: accuracy (%) vs. on-device training images (0-5k) for on-device LRT, alongside the off-line pre-trained baseline.]

ImageNet1000 degradation is negligible (< 2%) as the Flowers102 classes are learned, and LRT accuracy is equivalent to conventional SGD.

Page 54: CHIMERA: Efficient DNN Inference and Training at the Edge

LRT Measured Energy and Latency

LRT with rank = 14 and set size = 64 (the minimal-training-EDP point).

Training energy per image:
  DNN inference    8.1 mJ   46%
  LRT              4.5 mJ   25%
  RRAM update      5.5 mJ   29%
  Total           18.1 mJ

Training latency per image:
  DNN inference    60 ms    38%
  LRT              44 ms    27%
  RRAM update      55 ms    35%
  Total           159 ms

[Chart: training EDP and RRAM writes vs. set size, marking the minimal-EDP point; pie charts of the training overhead split between feature extraction (FE), LRT, and RRAM update.]

Page 55: CHIMERA: Efficient DNN Inference and Training at the Edge

Low-Rank Training Algorithm

1. Gradients are compressed with minimal information loss
+
2. High write-energy RRAM updates are amortized over large update sets
+
3. Accuracy equivalent to conventional SGD
=
Efficient RRAM-based on-device incremental training

Page 56: CHIMERA: Efficient DNN Inference and Training at the Edge

Conclusion

• RRAM-optimized ML SoC with a DNN accelerator for efficient on-device inference
  - CHIMERAv2 achieves ResNet-18 inference at 15.2 ms and 1.8 mJ
• Multi-chip Illusion System for scalable DNN inference
  - CHIMERAv2 consumes 3x less energy than traditional MCMs for ResNet-50 inference
• Low-rank training for efficient RRAM-based on-device incremental training
  - LRT achieves 340x lower EDP and 101x fewer RRAM weight updates compared to SGD at iso-accuracy