Applying Ternary Quantization to Already Constrained Networks
Dibakar Gope, Jesse Beu, Ganesh Dasika, Urmish Thakker, Matthew Mattina
Arm ML Research Lab
© 2019 Arm Limited
Executive Summary
• ML algorithms are increasingly deployed in IoT devices
• Challenge:
  - Highly constrained memory budgets
  - Aggressive model compression is required to target severely constrained devices and deploy more applications on them
• Our solutions:
  - Hybrid neural-tree network to improve over an already optimized IoT network (SysML 2019, https://arxiv.org/abs/1903.01531)
  - Per-layer hybrid filter banks to improve over the already optimized MobileNets (under review, arXiv: https://arxiv.org/abs/1911.01028)
BBC Micro:Bit (Arm Cortex M0) (16KB RAM)
Apollo Lite (Arm Cortex M4) (32KB SRAM)
Hybrid Neural-Tree Network Summary
• "DS-CNN" – a highly optimized network for keyword spotting
• How do we optimize it further at iso-accuracy? Can it fit into 32KB?
• Ternarize weight values using Strassen's algorithm → memory footprint reduced by 30%
• Selectively use decision trees to reduce compute → #operations reduced by 12%
• < 0.3% loss in accuracy for these savings
[Chart: Accuracy (%) vs. Memory Footprint (KB) – DS-CNN vs. HybridNet]
[Chart: Accuracy (%) vs. #Operations (M) – DS-CNN vs. HybridNet]
Known Solutions
• Architectural optimizations, e.g., MobileNet
• Pruning
• Quantization
• Low-rank factorizations of weights and activations
Our Strategy
• Optimize depthwise separable (DS) convolutional layers
  - Model complexity is dominated by the 1×1 filters
  - DS-CNN has 4 DS layers
  - Optimizing DS-CNN ↔ optimizing the 1×1 filters
• Other techniques to reduce complexity
  - Binary/ternary quantization – loses accuracy
  - Model pruning – loses accuracy
• Our approach
  - Exploit StrassenNets, which approximates matrix multiplication using Strassen's algorithm
  - > 99% reduction in MULs for 3×3 filters, mostly ternary weights, accuracy preserved
  - Apply StrassenNets to the 1×1 filters
[Figure: 1×1 filters are very compact; 3×3 filters are over-parameterized]
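The claim that the 1×1 filters dominate DS-layer complexity can be checked with a quick MAC count. A minimal sketch, with illustrative feature-map and channel sizes (not the actual DS-CNN dimensions):

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    # Every output pixel of every output channel needs a k*k*c_in dot product
    return h * w * c_out * k * k * c_in

def ds_conv_macs(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 filters mix the channels
    return depthwise, pointwise

# Illustrative sizes only
dw, pw = ds_conv_macs(25, 5, 64, 64)
print(pw / (dw + pw))   # fraction of DS-layer MACs spent in the 1x1 filters
```

With 64 input and output channels, the 1×1 pointwise stage accounts for roughly 88% of the DS layer's MACs, which is why optimizing DS-CNN amounts to optimizing the 1×1 filters.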
Standard Matrix Multiplication (8 MULs)
• Multiplying two 2×2 matrices the standard way costs 2 × 2 × 2 = 8 MULs
Strassen's Matrix Multiplication (7 MULs)
• Turns a sum of products into a sum of products of sums
• But uses fewer products overall
• The sum networks can be represented via ternary weights
• 7 intermediate terms
Strassen's Matrix Multiplication (7 MULs)
• Ternary combinations of the entries of the activations A = [[a, b], [c, d]] and the weights B = [[e, f], [g, h]]:
  - Activation side: (a), (a+b), (c+d), (d), (a+d), (b−d), (a−c)
  - Weight side: (f−h), (h), (e), (g−e), (e+h), (g+h), (e+f)
• Point-wise multiplying the two sides gives the 7 intermediate terms p1 … p7
• A second ternary combination of p1 … p7 produces the exact outputs C00, C01, C10, C11
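The scheme on this slide can be written out directly. A minimal sketch using exactly the activation/weight combinations shown above:

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar MULs (Strassen's algorithm)."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    # Point-wise products of ternary (+1/0/-1) combinations of the entries
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    # Ternary recombination of the 7 terms gives the exact product
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4, p1 + p5 - p3 - p7]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# matches the naive 8-MUL product: [[19, 22], [43, 50]]
```

Both the input combinations and the output recombination use only +1/0/−1 coefficients, i.e., ternary weights; only the 7 point-wise products need real multiplications.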
StrassenNets (< 7 MULs)
• Keep the same structure, but learn the ternary combination matrices with a narrower hidden layer (<< 7 intermediate terms)
• Point-wise multiplication of the two sides still produces the hidden terms, but the recombined result is now an approximate output rather than the exact one
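StrassenNets casts this as vec(C) ≈ Wc · ((Wa · vec(A)) ⊙ (Wb · vec(B))) with learned ternary matrices, where the number of rows of Wa/Wb is the hidden-layer width r. A NumPy sketch: with r = 7 the ternary matrices below (encoding the combinations from the previous slide) recover exact Strassen, and training with r < 7 trades exactness for fewer MULs:

```python
import numpy as np

# Ternary matrices encoding Strassen's 2x2 scheme (r = 7 hidden units).
# vec([[a, b], [c, d]]) = [a, b, c, d] (row-major).
Wa = np.array([[ 1, 0, 0, 0],    # a
               [ 1, 1, 0, 0],    # a + b
               [ 0, 0, 1, 1],    # c + d
               [ 0, 0, 0, 1],    # d
               [ 1, 0, 0, 1],    # a + d
               [ 0, 1, 0, -1],   # b - d
               [ 1, 0, -1, 0]])  # a - c
Wb = np.array([[ 0, 1, 0, -1],   # f - h
               [ 0, 0, 0, 1],    # h
               [ 1, 0, 0, 0],    # e
               [-1, 0, 1, 0],    # g - e
               [ 1, 0, 0, 1],    # e + h
               [ 0, 0, 1, 1],    # g + h
               [ 1, 1, 0, 0]])   # e + f
Wc = np.array([[ 0, -1, 0, 1, 1, 1, 0],   # C00
               [ 1, 1, 0, 0, 0, 0, 0],    # C01
               [ 0, 0, 1, 1, 0, 0, 0],    # C10
               [ 1, 0, -1, 0, 1, 0, -1]]) # C11

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
# 7 MULs: the point-wise product of the two hidden activations
p = (Wa @ A.reshape(-1)) * (Wb @ B.reshape(-1))
C = (Wc @ p).reshape(2, 2)
print(np.allclose(C, A @ B))  # exact for r = 7
```

At inference, the weight-side term Wb · vec(B) is a constant that can be precomputed, so only the r point-wise products cost MULs.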
Convert DS-CNN to Strassenified DS-CNN
• DS-CNN convolutional layers: Input → Conv1 → DS-Conv1 → DS-Conv2 → DS-Conv3 → DS-Conv4 → Pooling layer → Output layer
• Each convolution becomes a Strassen convolution: ternary combinations of the weights and the input feed a hidden layer of intermediate terms, and a ternary recombination maps them to the output
• #Intermediate terms ↔ width of the hidden layer
• Different hidden-layer widths (narrow vs. wide) ↔ different accuracy-vs-cost trade-offs
DS-CNN vs. Strassenified DS-CNN
[Chart: Accuracy (%) vs. #Operations (M) – DS-CNN vs. Strassenified DS-CNN]
[Chart: Accuracy (%) vs. Model Size (KB) – DS-CNN vs. Strassenified DS-CNN]
• 98% reduction in MULs, but only modest savings in model size
• > 50% increase in ADDs for 1×1 convolutions to achieve iso-accuracy
• Over-parameterized 3×3 convolutions do not have this problem
• See our paper for details
StrassenNets Pitfalls and Potential Solutions
• How do we address the increase in ADDs?
• Explore tree-based learning, e.g., Bonsai decision trees [ICML 2017]
• Pros:
  - Accuracy on par with neural models
  - Computationally efficient compared to neural networks
• Cons:
  - Poor feature extractor
  - Poor prediction accuracy for complex applications
  - > 10% drop in accuracy for keyword spotting
Advantages & Limitations of StrassenNets & Decision Trees
[Diagram: DS-CNN (Input → Conv1 → DS-Conv1 … DS-Conv4 → Pooling layer → Output layer), its Strassenified version, and a decision tree over the input x]
[Chart: Operations (M) vs. Model Size (KB) – DS-CNN vs. Strassenified DS-CNN vs. Decision Tree; lower left is better]
• Even this large tree cannot preserve accuracy
Hybrid Neural-Tree Network for KWS
• Keep only 3 convolutional layers (Conv1, DS-Conv1, DS-Conv2) instead of the 5 in the baseline DS-CNN (Conv1, DS-Conv1 … DS-Conv4 plus pooling and output layers)
• The rich features D they produce are fed to a depth-2 decision tree
• Each internal tree node branches on θₖᵀD > 0 (Yes/No), and the nodes along the chosen path contribute
  y = W₁ᵀD ⊙ tanh(V₁ᵀD) + W₂ᵀD ⊙ tanh(V₂ᵀD) + W₃ᵀD ⊙ tanh(V₃ᵀD)
• StrassenNets on the baseline DS-CNN (5 convs.) vs. StrassenNets on HybridNet (3 convs. + depth-2 tree) → reduction in computational cost
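The tree side of the hybrid network can be sketched in a few lines. A minimal NumPy sketch, assuming a Bonsai-style tree where every node on the root-to-leaf path contributes; all dimensions and the random parameters are illustrative stand-ins, and Bonsai's projection and scaling details are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 32, 12                  # illustrative: feature dim and #keyword classes
n_nodes = 7                    # full binary tree of depth 2, nodes indexed 0..6
W = rng.standard_normal((n_nodes, k, d))
V = rng.standard_normal((n_nodes, k, d))
theta = rng.standard_normal((n_nodes, d))

def predict(D):
    """Sum the node predictors W_k D * tanh(V_k D) along the root-to-leaf
    path chosen by the branch decisions theta_k . D > 0."""
    y = np.zeros(k)
    node = 0
    for _ in range(3):                   # root + internal node + leaf
        y += (W[node] @ D) * np.tanh(V[node] @ D)
        node = 2 * node + (1 if theta[node] @ D > 0 else 2)
    return y

D = rng.standard_normal(d)     # stands in for features from the 3-conv stack
print(predict(D).shape)        # (12,)
```

Because the branch decisions and leaf predictors are cheap linear operations on D, the tree replaces the two pruned DS-Conv layers at a fraction of their compute.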
DS-CNN vs. Strassenified Hybrid Neural-Tree Network
[Chart: Accuracy (%) vs. #Operations (M) – DS-CNN vs. Strassenified HybridNet (8b activations)]
[Chart: Accuracy (%) vs. Overall Memory Footprint (KB) – DS-CNN vs. Strassenified HybridNet (8b activations)]
• Strassenified HybridNet vs. DS-CNN: 11.1% reduction in Ops, 30.6% improvement in memory footprint, 0.27% loss in accuracy
• 98.9% reduction in MULs, 12.2% reduction in ADDs (MACs = ADDs = MULs)
• Ops are dominated by ADDs, which are 3–30× more area- and energy-efficient than MULs [Horowitz 2014]
Ternary MobileNets via Per-layer Hybrid Filter Banks
arXiv: https://arxiv.org/abs/1911.01028
Apply Ternary Quantization to MobileNets
• MobileNets V1 – 13 (1×1) convolutional layers (photo credit: Google AI blog on MobileNets V1)
[Figure: 1×1 filters are very compact; 3×3 filters are over-parameterized]
Performance of MobileNet with Existing Ternary Quantization
[Chart: Accuracy (%) vs. #MACs/ADDs Operations (M) – Baseline vs. TWN vs. StrassenNets]
• TWN (NeurIPS 2016) – drops accuracy by 9.6%
• StrassenNets (ICML 2018) – increases ADDs by > 300%
• The increase in ADDs under StrassenNets ↔ wide hidden layers are needed to closely approximate each 1×1 filter of MobileNets
Different Filters Respond Differently to Quantization
• Wide hidden layers might be required for a few 1×1 filters
• But NOT for all
• Filters capture different features ↔ they respond differently to quantization
Different Filters Respond Differently to Quantization
• Vertical-line detector:
  -1  2 -1
  -1  2 -1
  -1  2 -1
  L2-loss of the ternary approximation – 2 hidden units: 0.02, 4 hidden units: 0.0
• Sharpen filter:
   0 -1  0
  -1  5 -1
   0 -1  0
  L2-loss – 2 hidden units: 0.09, 4 hidden units: 0.09, 8 hidden units: 0.01
• Filters exist with similar value structure:
  - Vertical-line detector (flattened): -1 2 -1 -1 2 -1 -1 2 -1
  - Horizontal-line detector (flattened): -1 -1 -1 2 2 2 -1 -1 -1
  - They share common values at 5 places: the corners and the center
• How do we exploit that?
  1) Bank filters with similar value structure together
  2) Use fewer hidden units to approximate each bank
  3) See our paper for details
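The "common values at 5 places" observation is easy to verify directly on the two flattened detectors:

```python
vertical   = [-1, 2, -1, -1, 2, -1, -1, 2, -1]   # vertical-line detector, flattened
horizontal = [-1, -1, -1, 2, 2, 2, -1, -1, -1]   # horizontal-line detector, flattened

# Positions where the two filters agree
shared = [i for i, (v, h) in enumerate(zip(vertical, horizontal)) if v == h]
print(shared)   # [0, 2, 4, 6, 8] -> the four corners and the center of the 3x3 grid
```

Banking such filters together means one set of ternary hidden units can serve the shared structure, so fewer hidden units per filter are needed overall.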
MobileNet via Per-layer Hybrid Filter Banks
• Each convolutional layer of MobileNet gets a hybrid filter bank operating on the output of the previous depthwise convolutional layer:
  - Precision-critical filters: traditional convolution with full-precision weights W
  - Quantization-tolerant filters: StrassenNets (W · vec(A)) with narrow hidden layers, restricting the increase in ADDs
  - The outputs of the two groups are joined by channel concatenation
• See our paper for details
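The layer structure above can be sketched at the shape level. A minimal NumPy sketch for one spatial position of a 1×1 layer; the channel split, hidden width, and the random ternary matrices are illustrative stand-ins for trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_fp, c_tern, r = 64, 16, 48, 96   # illustrative channel split / hidden width
x = rng.standard_normal(c_in)            # one spatial position of the feature map

# Precision-critical filters: ordinary full-precision 1x1 convolution
W_fp = rng.standard_normal((c_fp, c_in))
y_fp = W_fp @ x

# Quantization-tolerant filters: Strassenified 1x1 convolution.
# Wa, Wc are ternary; b stands in for Wb . vec(weights), which is
# precomputed at deploy time, so only the r point-wise products cost MULs.
Wa = rng.choice([-1, 0, 1], size=(r, c_in))
Wc = rng.choice([-1, 0, 1], size=(c_tern, r))
b = rng.standard_normal(r)
y_tern = Wc @ ((Wa @ x) * b)

# Channel concatenation joins the two groups into one layer output
y = np.concatenate([y_fp, y_tern])
print(y.shape)   # (64,)
```

Only the small precision-critical group pays full-precision MUL cost (16 × 64 here), while the larger group needs just r = 96 MULs instead of 48 × 64, which is how the hybrid bank caps both the MUL count and the ADD blow-up.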
Performance of MobileNet with Hybrid Filter Banks
[Chart: Accuracy (%) vs. Model Size (KB) – Baseline vs. TWN vs. StrassenNets vs. Hybrid Filter Banks]
[Chart: Accuracy (%) vs. Energy/Inference (normalized) – Baseline vs. TWN vs. StrassenNets vs. Hybrid Filter Banks]
• Hybrid Filter Banks – 47% reduction in MULs with only a 48% increase in ADDs (vs. > 300% for StrassenNets)
• 28% reduction in energy/inference, 51% reduction in model size
• Comparable accuracy, no degradation in inference throughput
Conclusion
• It is easier to compress over-parameterized networks
• It is significantly harder to compress already highly optimized networks
• Hybrid neural-tree network for a highly optimized keyword-spotting network
  - 30% reduction in memory footprint, 12% reduction in #Ops, comparable accuracy
  - SysML 2019: https://arxiv.org/abs/1903.01531
• Hybrid filter banks for the highly optimized MobileNets
  - 50% reduction in model size, 28% improvement in energy, same throughput, comparable accuracy
  - Under review: https://arxiv.org/abs/1911.01028
Thank You