CHIMERA: Efficient DNN Inference and Training at the Edge with On-Chip Resistive RAM

CHIMERA: A 0.92 TOPS, 2.2 TOPS/W Edge AI Accelerator with 2 MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference, M. Giordano, K. Prabhu, K. Koul, R. M. Radway, A. Gural, R. Doshi, Z. F. Khan, J. W. Kustin, T. Liu, G. B. Lopes, V. Turbiner, W.-S. Khwa, Y.-D. Chih, M.-F. Chang, G. Lallement, B. Murmann, S. Mitra, P. Raina, VLSI 2021

Kartik Prabhu
Machine Learning at the Edge
Benefits:
• Private
• User-specific
• Responsive
• Low-power
Challenges:
• Energy-efficient on-device training
• Large DNN model sizes
• High throughput
Challenging with today's memory technologies (SRAM, DRAM, Flash)
[Figure: edge/fog computing hierarchy]
On-Chip Non-Volatile Resistive RAM (RRAM)
Voltage-controlled resistance: a conductive filament between the top and bottom electrodes determines the bit cell's resistance state (low resistance = 0, high resistance = 1). [Wong and Salahuddin, Nat. Nanotech. '15]
Benefits:
❑ Non-volatile
❑ High density (53 F²)
❑ Low read energy (8 pJ/Byte)
❑ High bandwidth (6 Gbits/s)
Challenges:
❑ High write energy (40.3 nJ/Byte)
❑ High write latency (0.4 μs/Byte)
❑ Low endurance (10k-100k writes)
[Chou et al., ISSCC '18]
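To make the read/write asymmetry above concrete, a quick back-of-the-envelope check (my arithmetic, using the per-byte figures cited from Chou et al.): a write costs roughly 5,000× the energy of a read, so rewriting the full 2 MByte array is a millijoule-scale event while reading it is microjoule-scale.

\[
\frac{40.3\ \text{nJ/Byte}}{8\ \text{pJ/Byte}} \approx 5000,\qquad
E_{\text{write}}(2\ \text{MB}) \approx 40.3\ \text{nJ} \times 2^{21} \approx 84.5\ \text{mJ},\qquad
E_{\text{read}}(2\ \text{MB}) \approx 8\ \text{pJ} \times 2^{21} \approx 16.8\ \mu\text{J}
\]

This asymmetry is why the accelerator can stream weights out of RRAM freely, while the training algorithm must make RRAM writes rare.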
Key Contributions
1. RRAM-Optimized ML SoC: 2 MB on-chip RRAM with DNN accelerator
2. Illusion System for scale out: ≤5% energy and latency overheads
3. RRAM-aware training algorithm (LRT): first on-device RRAM training
[Figure: accuracy (%) vs. samples (0-40k) during on-device incremental training; previously learned classes hold near 100%]
CHIMERA SoC
• 2 MBytes foundry RRAM: DNN weights and CPU instructions
• 512 KBytes SRAM
• DNN accelerator: 0.92 TOPS
• 64-bit RISC-V CPU
• 2 chip-to-chip (C2C) links for inter-chip communication
Design Space Exploration Methodology
Design space exploration:
• Inputs: neural network specification, design space specification, and component energy models (MAC, RF, memories)
• Architectural design space exploration (Interstellar) → optimal architectural parameters (# MACs, memory sizes, etc.)
Hardware:
• Parameterized DNN accelerator HLS code (Catapult HLS) → DNN accelerator RTL → integration into the RISC-V SoC → VLSI flow → power & performance analysis
Software:
• Quantization and optimal loop tiling (Interstellar) → low-level accelerator driver code
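As a rough illustration of the energy-model-driven search in this flow (a minimal sketch only; the cost model, candidate sizes, and names are my own stand-ins, not Interstellar's actual interface):

```python
import itertools

# Illustrative numbers only: per-access energy (pJ) for candidate SRAM
# buffer sizes (KB), and a per-byte cost for refilling from backing memory.
SRAM_PJ = {32: 6.0, 64: 7.0, 128: 9.0, 256: 13.0}
BACKING_PJ_PER_BYTE = 20.0
N_ACCESSES = 50_000_000          # buffer reads issued by the MAC array
WORKING_SET_KB = 1024            # tensor footprint that must be tiled

def modeled_energy(buf_kb: int) -> float:
    """Toy cost model: bigger buffers cost more per access but cut refills."""
    reloads = -(-WORKING_SET_KB // buf_kb)   # ceil division: refetch count
    sram = N_ACCESSES * SRAM_PJ[buf_kb]
    backing = reloads * WORKING_SET_KB * 1024 * BACKING_PJ_PER_BYTE
    return sram + backing                    # total energy in pJ

# Sweep the (input buffer, accumulation buffer) design space exhaustively.
best = min(itertools.product(SRAM_PJ, repeat=2),
           key=lambda c: modeled_energy(c[0]) + modeled_energy(c[1]))
print("min-energy (input_kb, acc_kb):", best)
```

The real flow evaluates far richer loop tilings and dataflows, but the structure is the same: enumerate configurations, score each with component energy models, keep the minimum.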
Conventional DRAM-Based DNN Accelerator
Off-chip DRAM feeds an SRAM weight buffer, an input double buffer, and an accumulation buffer surrounding a 16×16 systolic array of PEs. Even with buffers sized for minimum energy, off-chip weight traffic makes memory accesses high-energy. [Yang et al., ASPLOS '20]
[Charts: buffer capacity (MB) and energy (μJ), split into weight, accumulation, and input contributions; annotations mark the 2.8× capacity and 20.5× energy reductions the RRAM-optimized design will achieve]
1. On-Chip RRAM-Based DNN Accelerator
DNN weights are loaded directly from RRAM, so the SRAM weight buffer is removed.
[Charts: buffer capacity (MB) and energy (μJ), buffers sized for min. energy; the weight-buffer contribution disappears]
With weights streamed from RRAM, the input and accumulation buffers also shrink by 2×.
[Charts: buffer capacity (MB) and energy (μJ), buffers sized for min. energy]
2. PE Filter Unrolling to Reduce Input Buffer
Conventional accelerators only unroll input and output channels: the systolic array maps the X·Y·C0·C1 loop nest, and each processing element holds 9 multiply-accumulate units fed from streamed inputs and weights.
[Charts: buffer capacity (MB) and energy (μJ), buffers sized for min. energy]
Instead, the filter dimensions (Fx = Fy = 3) are unrolled inside the PE: the 3×3 filter taps stay resident and a 3×3 input patch is consumed per step, reducing the input buffer by 9× while achieving the same data reuse (see the sketch below).
[Charts: buffer capacity (MB) and energy (μJ), buffers sized for min. energy]
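A minimal numpy sketch of this in-PE 3×3 filter unrolling (my own illustration; names like `pe_filter_unrolled` are hypothetical, and the real PE operates on fixed-point systolic data rather than floats):

```python
import numpy as np

def pe_filter_unrolled(patch3x3: np.ndarray, taps3x3: np.ndarray) -> float:
    """One PE step: 9 multiply-accumulates on a resident 3x3 filter."""
    assert patch3x3.shape == taps3x3.shape == (3, 3)
    return float(np.sum(patch3x3 * taps3x3))

def conv2d_unrolled(x: np.ndarray, w: np.ndarray, stride: int = 1) -> np.ndarray:
    """Valid 2-D convolution built from the unrolled-PE primitive:
    each output pixel needs one patch fetch, not nine weight re-streams."""
    H, W = x.shape
    out = np.empty(((H - 3) // stride + 1, (W - 3) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            r, c = i * stride, j * stride
            out[i, j] = pe_filter_unrolled(x[r:r+3, c:c+3], w)
    return out
```

Because all nine taps are applied at once, a given input pixel is touched while it is already inside the PE, which is what shrinks the input buffer without losing reuse.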
3. Stride Registers to Reduce Input Accesses
In-PE filter unrolling allows immediate reuse of overlapped pixels during striding: as the 3×3 window slides across the input double buffer, most of its pixels were already fetched for the previous position.
[Charts: buffer capacity (MB) and energy (μJ), buffers sized for min. energy, with and without stride registers]
Stride registers exploit this horizontal input reuse: for each new 3×3 window, only 3 fresh pixels enter the stride register while 9 are consumed by the PE, reducing input buffer accesses and energy by 3× (sketch below).
Overall, versus the conventional DRAM-based baseline: 2.8× less buffer capacity and 20.5× less buffer energy, with buffers sized for min. energy.
[Charts: buffer capacity (MB) and energy (μJ), weight/accumulation/input breakdown]
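A small sketch of the stride-register behavior (again my own illustration, with hypothetical names): as the window slides by one column, two of its three columns are already held in registers, so only one new 3-pixel column is read from the input buffer, i.e. 3 buffer reads instead of 9 per step.

```python
import numpy as np

def stream_patches_with_stride_regs(x: np.ndarray):
    """Yield successive 3x3 patches of a 3-row band x (shape 3 x W),
    reusing held columns so each step fetches only 3 new pixels."""
    assert x.shape[0] == 3
    regs = x[:, 0:3].copy()          # initial fill: 9 buffer reads
    yield regs.copy()
    for col in range(3, x.shape[1]):
        regs[:, 0:2] = regs[:, 1:3]  # shift: reuse the 6 held pixels
        regs[:, 2] = x[:, col]       # fetch only one new 3-pixel column
        yield regs.copy()
```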
ResNet-18 Results
Measured on ImageNet:
• Energy: 8.1 mJ/image
• Latency: 60 ms/image
• Average power: 136 mW
• Efficiency: 2.2 TOPS/W
With on-chip RRAM, total power is dominated by the DNN accelerator (69%), NOT by memory accesses.
Simulated power breakdown: systolic array interconnect 19%, MACs 16%, RRAM + controller 16%, input and accumulation buffers 15%, address generation/control 12%, other DNN accelerator logic 7%, IO 6%, others 4%, main SRAM 3%, RISC-V 2%.
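As a consistency check (my arithmetic on the figures above), the average power follows directly from energy and latency per image:

\[
P_{\text{avg}} = \frac{8.1\ \text{mJ/image}}{60\ \text{ms/image}} = 135\ \text{mW},
\]

matching the ~135-136 mW reported here and in the v1 column of the results table below.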
CHIMERAv2 and CHIMERAvS
• CHIMERAv2 implements several optimizations: RRAM increased to 4 MBytes, data prefetching, loop replication
• CHIMERAvS replaces CHIMERAv2's 4 MB of RRAM with SRAM

                               v1     v2     vS
Weights memory capacity (MB)   2      4      4
Weights memory area (mm²)      10.8   20.2   30.5
Total die area (mm²)           29.2   29.2   39.5
ResNet-18 Inference Results

                      v1     v2     vS
Energy (mJ)           8.1    1.8    1.5
Latency (ms)          60.0   15.2   15.1
Average power (mW)    135    118    99
On-Chip RRAM vs. Traditional SRAM
[Figure-only slides: on-chip RRAM vs. traditional SRAM comparison]
RRAM-Optimized DNN Accelerator
1. On-chip RRAM overcomes the memory access bottleneck
+
2. Low-read-energy RRAM weights reduce SRAM buffer sizes and energy
+
3. New dataflow: in-PE filter unrolling improves input reuse
=
Efficient on-device inference
Key Contribution 2: Illusion System for Scale Out (≤5% energy and latency overheads)
CHIMERA: Scale Out to Larger DNNs
DNNs ≤ 2 MBytes fit a single CHIMERA (2 MBytes of RRAM) and benefit from on-chip RRAM ✔ [Radway et al., Nat. Electronics '21]
DNNs ≥ 2 MBytes, option 1: a single CHIMERA with more RRAM ✔, but a bigger chip is needed, and this "Dream Chip" is a moving target as DNNs continue growing.
Option 2: CHIMERA + off-chip memory (DRAM) ✘; high-energy accesses forfeit the benefit of on-chip RRAM.
Option 3: multiple CHIMERAs (Chip 0 - Chip 5) in an Illusion System ✔, creating the "illusion" of the Dream single chip.

DNN size     Approach                               Benefits from on-chip RRAM
≤2 MBytes    CHIMERA                                ✔
≥2 MBytes    Single CHIMERA, more RRAM              ✔ (bigger chip needed)
≥2 MBytes    CHIMERA + off-chip memory (DRAM)       ✘
≥2 MBytes    Multiple CHIMERAs (Illusion System)    ✔
CHIMERA Illusion System for 12 MByte DNNs
Six CHIMERA chips (C0-C5) with power-gating devices, connected by chip-to-chip (C2C) links (77 pJ/bit; 1.9 Gbits/s).
RRAM Leads to Quick Chip Wakeup/Shutdown
• SoC ready from wakeup in 33 μs
• Full SoC shutdown in 10 ns
• SoC instructions & DNN weights preserved through shutdown
• CHIMERAs are on only when actively computing
[Figure: Ready/CLK/Reset/Power waveforms showing the 33 μs wake-up time; active/shutdown timeline across Chips 0-5]
Minimizing Inter-Chip Messaging Overheads
Traditional: high message traffic to keep all chips occupied.
CHIMERA: enough weight RRAM for infrequent data transfers.
Illusion: map the DNN to minimize inter-chip traffic (a toy mapping sketch follows).
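A minimal sketch of the mapping idea (my own illustration, not the actual Illusion mapper): pack consecutive layers into each chip's 2 MB weight budget, so that only one activation tensor crosses each chip boundary.

```python
# Hypothetical greedy layer-to-chip mapping. Assumes every layer's
# weights fit within a single chip's RRAM budget.
CHIP_BUDGET = 2 * 1024 * 1024  # bytes of weight RRAM per chip

def map_layers(weight_bytes, act_bytes):
    """weight_bytes[i]: weights of layer i; act_bytes[i]: activation out of i.
    Returns a list of ((first_layer, last_layer), egress_traffic_bytes)."""
    assert all(w <= CHIP_BUDGET for w in weight_bytes)
    chips, start, used = [], 0, 0
    for i, w in enumerate(weight_bytes):
        if used + w > CHIP_BUDGET:                 # layer i starts a new chip
            chips.append(((start, i - 1), act_bytes[i - 1]))
            start, used = i, 0
        used += w
    chips.append(((start, len(weight_bytes) - 1), 0))  # last chip: no egress
    return chips
```

The point of the sketch: inter-chip traffic is a handful of activation tensors, small against the 12 MBytes of weights, which is consistent in spirit with the 373 KBytes reported for Mapping 1 below.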
Mapping 1: Minimizes Energy / Inter-Chip Messages
Energy-efficient ResNet-18 mapping for sporadic inference: layers are partitioned across Chips 0-5, with the input entering Chip 0 and the classification exiting Chip 5, and only the working chip active at a time.
Minimal inter-chip traffic: 373 KBytes vs. the 12 MByte DNN.
Mapping 2: Minimizes Latency
Low-latency ResNet-18 mapping for high-throughput inference: the input is split and weights are duplicated so the chips compute in parallel.
Higher inter-chip traffic: 1.5 MBytes.
CHIMERA Illusion System Measurements
ResNet-18 performance, normalized to the per-layer total measured on a single CHIMERA (i.e., a single "Dream" chip):
• Mapping 1 (minimal energy): +4% energy, +5% execution time
• Mapping 2 (minimal latency): -20% execution time, +35% energy
Minimal-energy mapping: ≤4% energy and ≤5% execution-time overhead.
[Chart: normalized execution time and energy (0.00×-1.50×) for the per-layer total, Mapping 1, and Mapping 2]
Compared to Traditional MCMs

ResNet-50 inference                      CHIMERAv2   Simba
Technology                               40 nm       16 nm
Core energy (mJ)                         5.18        16.3
Single-chip latency (ms)                 5.29*       4.64
Multi-chip throughput (inferences/ms)    1.74**      2.06

*Scaled for iso-frequency. **Scaled for iso-MACs and frequency.
[Shao et al., MICRO '19; Zimmer et al., VLSI '19; Zimmer et al., JSSC '20; Venkatesan et al., Hot Chips '19]
CHIMERA Illusion System
1. A fast, energy-efficient RRAM-optimized DNN accelerator
+
2. Quick chip wakeup/shutdown: < 58 μJ / 33 μs
+
3. Mappings with minimal messages: < 1/30th of DNN weights
=
Scalable inference system enabled by RRAM
Key Contribution 3: RRAM-Aware Training Algorithm (LRT), the first on-device RRAM training
Training the DNN Off-Line
The DNN model consists of a feature extractor followed by a classifier (a fully connected layer). Off-line, it is trained on the full ImageNet1000 dataset (1M images: generic classes, abundant data) plus a subset of Flowers102 at 1 image/class (user-specific classes, limited data).
On-Device Incremental Training
The DNN model is deployed at the edge and incrementally trained on the full Flowers102 dataset at 60 images/class (user-specific classes, now abundant data) plus random ImageNet images at 6 images/class (generic classes, limited data). Only the fully connected layer is updated.
On-Device Training Challenges with RRAM
SGD (stochastic gradient descent) at batch = 1 gives immediate classification, but its frequent weight updates and difficult fixed-point training make it challenging on RRAM-based edge devices; large batches make updates infrequent but require large additional SRAM.
Our RRAM-aware Low-Rank Training (LRT) algorithm [Gural et al., arXiv:2009.03887] keeps batch = 1 and achieves: immediate classification, infrequent RRAM updates, feasible fixed-point training, and little additional SRAM.
On-Device Training
For each image i: the feature extractor produces a feature vector; the fully connected layer maps it to class probabilities; the gradient for that sample is (class probabilities - correct class).
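A minimal numpy sketch of this per-sample, FC-only step (my own illustration; the hardware uses fixed-point arithmetic, and the feature extractor is elided):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 512, 102                        # illustrative feature dim, # classes
W = rng.normal(0.0, 0.01, (M, N))      # FC weights: the only trained layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fc_training_step(feature, label, lr=0.01):
    """One sample: probabilities, gradient = (probs - one-hot correct class),
    then the rank-1 weight update W <- W - lr * feature @ gradient^T."""
    global W
    probs = softmax(feature @ W)       # class probabilities
    grad = probs.copy()
    grad[label] -= 1.0                 # subtract the correct class (one-hot)
    W -= lr * np.outer(feature, grad)  # plain SGD: one weight update per sample
```

Doing this update in RRAM for every sample is exactly the frequent-write pattern flagged above; LRT instead accumulates the (feature, gradient) pairs in compressed form and writes RRAM once per set, as the next slides walk through.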
Low-Rank Training (LRT): Initialization
LRT maintains two rank-R factor matrices in SRAM: F, whose columns are features (Feature 1 ... Feature R), and G, whose columns are the corresponding gradients (Gradient 1 ... Gradient R).
Low-Rank Training (LRT): New Sample
Append the new sample's feature and gradient as columns of F and G, raising the rank to R + 1.
Low-Rank Training (LRT): QR Decomposition
Factor F = QF·RF and G = QG·RG. The decomposition reduces matrix dimensions: RF and RG are (R+1)×(R+1), much smaller than the number of features/gradients.
The accumulated weight-gradient information is then captured by the small (R+1)×(R+1) core matrix RF·RGᵀ.
LRT: Singular Value Decomposition
Singular value decomposition (SVD) finds the most significant components: RF·RGᵀ ≈ U·Σ·Vᵀ, with singular values σ1 ≥ ... ≥ σR+1 on the diagonal of Σ.
LRT: Dimensionality Reduction
Dropping the least significant singular value truncates the rank from R + 1 back to R: dimensionality reduction with minimal loss of information.
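Putting the last few slides together, a minimal numpy sketch of one LRT accumulation step (my own illustration of the append → QR → SVD → truncate pipeline; function and variable names are hypothetical):

```python
import numpy as np

def lrt_step(F, G, feature, gradient, R):
    """Fold one (feature, gradient) pair into rank-R factors F (m x R), G (n x R)."""
    F = np.column_stack([F, feature])        # append: rank R -> R + 1
    G = np.column_stack([G, gradient])
    QF, RF = np.linalg.qr(F)                 # F = QF RF, RF is (R+1) x (R+1)
    QG, RG = np.linalg.qr(G)                 # G = QG RG
    U, s, Vt = np.linalg.svd(RF @ RG.T)      # core matrix: RF RG^T = U diag(s) V^T
    U, s, Vt = U[:, :R], s[:R], Vt[:R, :]    # drop the least significant component
    F_new = QF @ U * np.sqrt(s)              # re-expand truncated factors so that
    G_new = QG @ Vt.T * np.sqrt(s)           # F_new @ G_new.T ~ F @ G.T
    return F_new, G_new
```

Each stored column pair costs m + n values, so the SRAM footprint stays at roughly R·(m + n) no matter how many samples are folded in.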
LRT: SRAM Update
The truncated factors F' and G' (the LRT-compressed gradients) are stored back to a small SRAM: 6-22 KBytes, depending on the rank.
LRT: Repeat for Multiple Images
The LRT step repeats for feature/gradient i+1, i+2, ... . The SRAM for the compressed gradient does not grow with the set size; up to 1024 images go into one update set.
LRT: RRAM Weight Update
At the end of a sample set, the features & gradients are expanded to update the RRAM weights:

\[
W = W - \alpha\,(F \cdot G^{T}), \qquad \alpha = \text{learning rate}
\]

A larger set means fewer RRAM updates.
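The corresponding expansion step, continuing the sketch above (again my own numpy illustration):

```python
def rram_weight_update(W, F, G, alpha):
    """End of update set: expand the compressed factors into one bulk write.
    W (m x n) absorbs the whole set's worth of gradient, so high-energy
    RRAM writes happen once per set instead of once per sample."""
    return W - alpha * (F @ G.T)
```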
Optimal Rank and Set-Size Search
A larger rank means more computation but supports a larger set size; all configurations below match SGD accuracy.
[Charts: energy-delay product (J·s) and RRAM weight-update steps vs. set size S (1-1000), for ranks R = 1, 2, 4, 12, 14, 16, all at iso-SGD accuracy; arrows mark the 340× EDP and 101× update reductions vs. SGD]
Result: LRT achieves 340× lower EDP and 101× fewer RRAM weight updates than SGD.
Optimizing Training at the Edge
Comparison for an m×n fully connected layer with m = 1000, n = 1000, batch B = 100, rank R = 4 (R ≪ B):

                         Weight update    Batch weight    Store FC input    Low-Rank
                         every sample     update          and output        Training
Capacity                 m·n ~ 1E6        m·n ~ 1E6       B·(m+n) ~ 2E5     R·(m+n) ~ 1E3
Weight updates/sample    m·n ~ 1E6        m·n/B ~ 1E4     m·n/B ~ 1E4       m·n/B ~ 1E4
SRAM writes/sample       not needed       not needed      m+n ~ 2E3         R·(m+n) ~ 2E3
Ops/sample               1E6              1E6             1E6               (R+1)²·(m+n+R) ~ 5E4

Reduction in memory usage and mathematical complexity!
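Checking the LRT ops-per-sample figure with the example parameters (my arithmetic):

\[
(R+1)^2\,(m+n+R) = 5^2 \times 2004 \approx 5\times10^{4},
\]

roughly 20× fewer operations per sample than the 10⁶ required by the dense alternatives.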
LRT Accuracy Measured On-Device
[Chart: accuracy (%) vs. epochs (0-4) for ImageNet (1000 classes) and Flowers102 (102 classes); dashed pre-trained off-line levels mark the maximum achievable accuracy with limited field data]
On-device LRT shows negligible ImageNet1000 degradation (< 2%) as the Flowers102 classes are learned, and its accuracy is equivalent to conventional SGD.
[Chart: accuracy (%) vs. images (0-5k), on-device LRT against the pre-trained off-line baseline]
LRT Measured Energy and Latency
LRT configured at rank = 14 and set size = 64 (the minimal-training-EDP point).

Training latency per image:
DNN inference    60 ms    38%
LRT              44 ms    27%
RRAM update      55 ms    35%
Total           159 ms

Training energy per image:
DNN inference    8.1 mJ   46%
LRT              4.5 mJ   25%
RRAM update      5.5 mJ   29%
Total           18.1 mJ

[Chart: energy-delay product (J·s) vs. set size S, marking the minimal-training-EDP configuration; training-overhead breakdown across feature extraction (FE), LRT, and RRAM]
Low-Rank Training Algorithm
1. Gradients are compressed with minimal information loss
+
2. High-write-energy RRAM updates are amortized over large update sets
+
3. Accuracy equivalent to conventional SGD
=
Efficient RRAM-based on-device incremental training
Conclusion
• RRAM-optimized ML SoC with DNN accelerator for efficient on-device inference
  - CHIMERAv2 achieves ResNet-18 inference at 15.2 ms and 1.8 mJ
• Multi-chip Illusion System for scalable DNN inference
  - CHIMERAv2 consumes 3× less energy than traditional MCMs for ResNet-50 inference
• Low-rank training for efficient RRAM-based on-device incremental training
  - LRT achieves 340× lower EDP and 101× fewer RRAM weight updates compared to SGD at iso-accuracy