Applying Ternary Quantization to Already Constrained Networks
Dibakar Gope, Jesse Beu, Ganesh Dasika, Urmish Thakker, Matthew Mattina
Arm ML Research Lab
© 2019 Arm Limited
Executive Summary
• ML algorithms are increasingly deployed in IoT devices
• Challenge:
  - Highly constrained memory budgets
  - Aggressive model compression is required to target severely constrained devices and deploy more applications on them
• Our solutions:
  - Hybrid neural-tree network to improve over an already optimized IoT network (SysML 2019, https://arxiv.org/abs/1903.01531)
  - Per-layer hybrid filter banks to improve over the already optimized MobileNets (under review, arXiv: https://arxiv.org/abs/1911.01028)
BBC Micro:Bit (Arm Cortex M0) (16KB RAM)
Apollo Lite (Arm Cortex M4) (32KB SRAM)
Hybrid Neural-Tree Network Summary
• "DS-CNN" – a highly optimized network for keyword spotting
• How do we optimize it further at iso-accuracy? Can it fit into 32KB?
• Ternarize weight values using Strassen's algorithm → memory footprint reduced by 30%
• Selectively use decision trees to reduce compute → #operations reduced by 12%
• < 0.3% loss in accuracy for these savings
[Chart: Accuracy (%) vs. Memory Footprint (KB) – DS-CNN vs. HybridNet]
[Chart: Accuracy (%) vs. #Operations (M) – DS-CNN vs. HybridNet]
Known Solutions
• Architectural optimizations, e.g., MobileNet
• Pruning
• Quantization
• Low-rank factorizations of weights and activations
Our Strategy
• Optimize depthwise separable (DS) convolutional layers
  - Model complexity is dominated by the 1×1 filters
  - DS-CNN has 4 DS layers
  - Optimizing DS-CNN ↔ optimizing the 1×1 filters
• Other techniques to reduce complexity
  - Binary/ternary quantization – loses accuracy
  - Model pruning – loses accuracy
• Our approach
  - Exploit StrassenNets, which approximates matrix multiplication using Strassen's algorithm
  - > 99% reduction in MULs for 3×3 filters, mostly ternary weights, accuracy preserved
  - Apply StrassenNets to the 1×1 filters
[Figure: 1×1 filters are very compact; 3×3 filters are over-parameterized]
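The claim that the 1×1 filters dominate DS-layer complexity can be checked with a quick MAC count. A minimal sketch, with illustrative feature-map and channel sizes (not the actual DS-CNN dimensions):

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    # Every output pixel of every output channel needs a k*k*c_in dot product
    return h * w * c_out * k * k * c_in

def ds_conv_macs(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 filters mix the channels
    return depthwise, pointwise

# Illustrative sizes only
dw, pw = ds_conv_macs(25, 5, 64, 64)
print(pw / (dw + pw))   # fraction of DS-layer MACs spent in the 1x1 filters
```

With 64 input and output channels, the 1×1 pointwise stage accounts for roughly 88% of the DS layer's MACs, which is why optimizing DS-CNN amounts to optimizing the 1×1 filters.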
Standard Matrix Multiplication (8 MULs)
• Multiplying two 2×2 matrices the standard way costs 2 × 2 × 2 = 8 MULs
Strassen's Matrix Multiplication (7 MULs)
• Turns a sum of products into a sum of products of sums
• But uses fewer products overall
• The sum networks can be represented via ternary weights
• 7 intermediate terms
Strassen's Matrix Multiplication (7 MULs)
• Ternary combinations of the entries of the activations A = [[a, b], [c, d]] and the weights B = [[e, f], [g, h]]:
  - Activation side: (a), (a+b), (c+d), (d), (a+d), (b−d), (a−c)
  - Weight side: (f−h), (h), (e), (g−e), (e+h), (g+h), (e+f)
• Point-wise multiplying the two sides gives the 7 intermediate terms p1 … p7
• A second ternary combination of p1 … p7 produces the exact outputs C00, C01, C10, C11
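The scheme on this slide can be written out directly. A minimal sketch using exactly the activation/weight combinations shown above:

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar MULs (Strassen's algorithm)."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    # Point-wise products of ternary (+1/0/-1) combinations of the entries
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    # Ternary recombination of the 7 terms gives the exact product
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4, p1 + p5 - p3 - p7]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# matches the naive 8-MUL product: [[19, 22], [43, 50]]
```

Both the input combinations and the output recombination use only +1/0/−1 coefficients, i.e., ternary weights; only the 7 point-wise products need real multiplications.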
StrassenNets (< 7 MULs)
• Keep the same structure, but learn the ternary combination matrices with a narrower hidden layer (<< 7 intermediate terms)
• Point-wise multiplication of the two sides still produces the hidden terms, but the recombined result is now an approximate output rather than the exact one
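StrassenNets casts this as vec(C) ≈ Wc · ((Wa · vec(A)) ⊙ (Wb · vec(B))) with learned ternary matrices, where the number of rows of Wa/Wb is the hidden-layer width r. A NumPy sketch: with r = 7 the ternary matrices below (encoding the combinations from the previous slide) recover exact Strassen, and training with r < 7 trades exactness for fewer MULs:

```python
import numpy as np

# Ternary matrices encoding Strassen's 2x2 scheme (r = 7 hidden units).
# vec([[a, b], [c, d]]) = [a, b, c, d] (row-major).
Wa = np.array([[ 1, 0, 0, 0],    # a
               [ 1, 1, 0, 0],    # a + b
               [ 0, 0, 1, 1],    # c + d
               [ 0, 0, 0, 1],    # d
               [ 1, 0, 0, 1],    # a + d
               [ 0, 1, 0, -1],   # b - d
               [ 1, 0, -1, 0]])  # a - c
Wb = np.array([[ 0, 1, 0, -1],   # f - h
               [ 0, 0, 0, 1],    # h
               [ 1, 0, 0, 0],    # e
               [-1, 0, 1, 0],    # g - e
               [ 1, 0, 0, 1],    # e + h
               [ 0, 0, 1, 1],    # g + h
               [ 1, 1, 0, 0]])   # e + f
Wc = np.array([[ 0, -1, 0, 1, 1, 1, 0],   # C00
               [ 1, 1, 0, 0, 0, 0, 0],    # C01
               [ 0, 0, 1, 1, 0, 0, 0],    # C10
               [ 1, 0, -1, 0, 1, 0, -1]]) # C11

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
# 7 MULs: the point-wise product of the two hidden activations
p = (Wa @ A.reshape(-1)) * (Wb @ B.reshape(-1))
C = (Wc @ p).reshape(2, 2)
print(np.allclose(C, A @ B))  # exact for r = 7
```

At inference, the weight-side term Wb · vec(B) is a constant that can be precomputed, so only the r point-wise products cost MULs.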
Convert DS-CNN to Strassenified DS-CNN
• DS-CNN convolutional layers: Input → Conv1 → DS-Conv1 → DS-Conv2 → DS-Conv3 → DS-Conv4 → Pooling layer → Output layer
• Each convolution becomes a Strassen convolution: ternary combinations of the weights and the input feed a hidden layer of intermediate terms, and a ternary recombination maps them to the output
• #Intermediate terms ↔ width of the hidden layer
• Different hidden-layer widths (narrow vs. wide) ↔ different accuracy-vs-cost trade-offs
DS-CNN vs. Strassenified DS-CNN
[Chart: Accuracy (%) vs. #Operations (M) – DS-CNN vs. Strassenified DS-CNN]
[Chart: Accuracy (%) vs. Model Size (KB) – DS-CNN vs. Strassenified DS-CNN]
• 98% reduction in MULs, but only modest savings in model size
• > 50% increase in ADDs for 1×1 convolutions to achieve iso-accuracy
• Over-parameterized 3×3 convolutions do not have this problem
• See our paper for details
StrassenNets Pitfalls and Potential Solutions
• How do we address the increase in ADDs?
• Explore tree-based learning, e.g., Bonsai decision trees [ICML 2017]
• Pros:
  - Accuracy on par with neural models
  - Computationally efficient compared to neural networks
• Cons:
  - Poor feature extractor
  - Poor prediction accuracy for complex applications
  - > 10% drop in accuracy for keyword spotting
Advantages & Limitations of StrassenNets & Decision Trees
[Diagram: DS-CNN (Input → Conv1 → DS-Conv1 … DS-Conv4 → Pooling layer → Output layer), its Strassenified version, and a decision tree over the input x]
[Chart: Operations (M) vs. Model Size (KB) – DS-CNN vs. Strassenified DS-CNN vs. Decision Tree; lower left is better]
• Even this large tree cannot preserve accuracy
Hybrid Neural-Tree Network for KWS
• Keep only 3 convolutional layers (Conv1, DS-Conv1, DS-Conv2) instead of the 5 in the baseline DS-CNN (Conv1, DS-Conv1 … DS-Conv4 plus pooling and output layers)
• The rich features D they produce are fed to a depth-2 decision tree
• Each internal tree node branches on θₖᵀD > 0 (Yes/No), and the nodes along the chosen path contribute
  y = W₁ᵀD ⊙ tanh(V₁ᵀD) + W₂ᵀD ⊙ tanh(V₂ᵀD) + W₃ᵀD ⊙ tanh(V₃ᵀD)
• StrassenNets on the baseline DS-CNN (5 convs.) vs. StrassenNets on HybridNet (3 convs. + depth-2 tree) → reduction in computational cost
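The tree side of the hybrid network can be sketched in a few lines. A minimal NumPy sketch, assuming a Bonsai-style tree where every node on the root-to-leaf path contributes; all dimensions and the random parameters are illustrative stand-ins, and Bonsai's projection and scaling details are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 32, 12                  # illustrative: feature dim and #keyword classes
n_nodes = 7                    # full binary tree of depth 2, nodes indexed 0..6
W = rng.standard_normal((n_nodes, k, d))
V = rng.standard_normal((n_nodes, k, d))
theta = rng.standard_normal((n_nodes, d))

def predict(D):
    """Sum the node predictors W_k D * tanh(V_k D) along the root-to-leaf
    path chosen by the branch decisions theta_k . D > 0."""
    y = np.zeros(k)
    node = 0
    for _ in range(3):                   # root + internal node + leaf
        y += (W[node] @ D) * np.tanh(V[node] @ D)
        node = 2 * node + (1 if theta[node] @ D > 0 else 2)
    return y

D = rng.standard_normal(d)     # stands in for features from the 3-conv stack
print(predict(D).shape)        # (12,)
```

Because the branch decisions and leaf predictors are cheap linear operations on D, the tree replaces the two pruned DS-Conv layers at a fraction of their compute.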
DS-CNN vs. Strassenified Hybrid Neural-Tree Network
[Chart: Accuracy (%) vs. #Operations (M) – DS-CNN vs. Strassenified HybridNet (8b activations)]
[Chart: Accuracy (%) vs. Overall Memory Footprint (KB) – DS-CNN vs. Strassenified HybridNet (8b activations)]
• Strassenified HybridNet vs. DS-CNN: 11.1% reduction in Ops, 30.6% improvement in memory footprint, 0.27% loss in accuracy
• 98.9% reduction in MULs, 12.2% reduction in ADDs (MACs = ADDs = MULs)
• Ops are dominated by ADDs, which are 3–30× more area- and energy-efficient than MULs [Horowitz 2014]
Ternary MobileNets via Per-layer Hybrid Filter Banks
arXiv: https://arxiv.org/abs/1911.01028
Apply Ternary Quantization to MobileNets
• MobileNets V1 – 13 (1×1) convolutional layers (photo credit: Google AI blog on MobileNets V1)
[Figure: 1×1 filters are very compact; 3×3 filters are over-parameterized]
Performance of MobileNet with Existing Ternary Quantization
[Chart: Accuracy (%) vs. #MACs/ADDs Operations (M) – Baseline vs. TWN vs. StrassenNets]
• TWN (NeurIPS 2016) – drops accuracy by 9.6%
• StrassenNets (ICML 2018) – increases ADDs by > 300%
• The increase in ADDs under StrassenNets ↔ wide hidden layers are needed to closely approximate each 1×1 filter of MobileNets
Different Filters Respond Differently to Quantization
• Wide hidden layers might be required for a few 1×1 filters
• But NOT for all
• Filters capture different features ↔ they respond differently to quantization
Different Filters Respond Differently to Quantization
• Vertical-line detector:
  -1  2 -1
  -1  2 -1
  -1  2 -1
  L2-loss of the ternary approximation – 2 hidden units: 0.02, 4 hidden units: 0.0
• Sharpen filter:
   0 -1  0
  -1  5 -1
   0 -1  0
  L2-loss – 2 hidden units: 0.09, 4 hidden units: 0.09, 8 hidden units: 0.01
• Filters exist with similar value structure:
  - Vertical-line detector (flattened): -1 2 -1 -1 2 -1 -1 2 -1
  - Horizontal-line detector (flattened): -1 -1 -1 2 2 2 -1 -1 -1
  - They share common values at 5 places: the corners and the center
• How do we exploit that?
  1) Bank filters with similar value structure together
  2) Use fewer hidden units to approximate each bank
  3) See our paper for details
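The "common values at 5 places" observation is easy to verify directly on the two flattened detectors:

```python
vertical   = [-1, 2, -1, -1, 2, -1, -1, 2, -1]   # vertical-line detector, flattened
horizontal = [-1, -1, -1, 2, 2, 2, -1, -1, -1]   # horizontal-line detector, flattened

# Positions where the two filters agree
shared = [i for i, (v, h) in enumerate(zip(vertical, horizontal)) if v == h]
print(shared)   # [0, 2, 4, 6, 8] -> the four corners and the center of the 3x3 grid
```

Banking such filters together means one set of ternary hidden units can serve the shared structure, so fewer hidden units per filter are needed overall.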
MobileNet via Per-layer Hybrid Filter Banks
• Each convolutional layer of MobileNet gets a hybrid filter bank operating on the output of the previous depthwise convolutional layer:
  - Precision-critical filters: traditional convolution with full-precision weights W
  - Quantization-tolerant filters: StrassenNets (W · vec(A)) with narrow hidden layers, restricting the increase in ADDs
  - The outputs of the two groups are joined by channel concatenation
• See our paper for details
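The layer structure above can be sketched at the shape level. A minimal NumPy sketch for one spatial position of a 1×1 layer; the channel split, hidden width, and the random ternary matrices are illustrative stand-ins for trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_fp, c_tern, r = 64, 16, 48, 96   # illustrative channel split / hidden width
x = rng.standard_normal(c_in)            # one spatial position of the feature map

# Precision-critical filters: ordinary full-precision 1x1 convolution
W_fp = rng.standard_normal((c_fp, c_in))
y_fp = W_fp @ x

# Quantization-tolerant filters: Strassenified 1x1 convolution.
# Wa, Wc are ternary; b stands in for Wb . vec(weights), which is
# precomputed at deploy time, so only the r point-wise products cost MULs.
Wa = rng.choice([-1, 0, 1], size=(r, c_in))
Wc = rng.choice([-1, 0, 1], size=(c_tern, r))
b = rng.standard_normal(r)
y_tern = Wc @ ((Wa @ x) * b)

# Channel concatenation joins the two groups into one layer output
y = np.concatenate([y_fp, y_tern])
print(y.shape)   # (64,)
```

Only the small precision-critical group pays full-precision MUL cost (16 × 64 here), while the larger group needs just r = 96 MULs instead of 48 × 64, which is how the hybrid bank caps both the MUL count and the ADD blow-up.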
Performance of MobileNet with Hybrid Filter Banks
[Chart: Accuracy (%) vs. Model Size (KB) – Baseline vs. TWN vs. StrassenNets vs. Hybrid Filter Banks]
[Chart: Accuracy (%) vs. Energy/Inference (normalized) – Baseline vs. TWN vs. StrassenNets vs. Hybrid Filter Banks]
• Hybrid Filter Banks – 47% reduction in MULs with only a 48% increase in ADDs (vs. > 300% for StrassenNets)
• 28% reduction in energy/inference, 51% reduction in model size
• Comparable accuracy, no degradation in inference throughput
Conclusion
• It is easier to compress over-parameterized networks
• It is significantly harder to compress already highly optimized networks
• Hybrid neural-tree network for a highly optimized keyword-spotting network
  - 30% reduction in memory footprint, 12% reduction in #Ops, comparable accuracy
  - SysML 2019: https://arxiv.org/abs/1903.01531
• Hybrid filter banks for the highly optimized MobileNets
  - 50% reduction in model size, 28% improvement in energy, same throughput, comparable accuracy
  - Under review: https://arxiv.org/abs/1911.01028
Thank You