Squeezing Deep Learning Into Mobile Phones

Published on 21-Mar-2017

Squeezing Deep Learning into Mobile Phones - A Practitioner's Guide
Anirudh Koul

https://hongkongphooey.wordpress.com/2009/02/18/first-look-huawei-android-phone/
https://medium.com/@startuphackers/building-a-deep-learning-neural-network-startup-7032932e09c1


Anirudh Koul, @anirudhkoul, http://koul.ai
Project Lead, Seeing AI
Applied Researcher, Microsoft AI & Research
akoul at microsoft dot com

Currently working on applying artificial intelligence for productivity, augmented reality and accessibility

Along with Eugene Seleznev, Saqib Shaikh, Meher Kasam


Why Deep Learning On Mobile?
Latency

Privacy


Mobile Deep Learning Recipe
(Efficient) Mobile Inference Engine + (Efficient) Pretrained Model = DL App


Building a DL App in _ time


Building a DL App in 1 hour

No, don't do it right now. Do it in the next session.

Use Cloud APIs
Microsoft Cognitive Services
Clarifai
Google Cloud Vision
IBM Watson Services
Amazon Rekognition


Microsoft Cognitive Services
Models won the 2015 ImageNet Large Scale Visual Recognition Challenge
Vision, Face, Emotion, Video and 21 other topics
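As a concrete starting point for the cloud-API route, here is a minimal Python sketch of sending an image to the Computer Vision endpoint of Microsoft Cognitive Services. The endpoint URL, region, API version and subscription key below are illustrative assumptions, not values from this talk; check the service documentation for the current ones.

```python
# Minimal sketch: call a cloud vision API with an image and print a caption.
# The endpoint/region/version and the API key are placeholders (assumptions),
# not values from this talk -- consult the service docs for current values.
import requests

API_KEY = "YOUR_SUBSCRIPTION_KEY"            # hypothetical key
ENDPOINT = ("https://westus.api.cognitive.microsoft.com"
            "/vision/v1.0/analyze")          # assumed 2017-era endpoint

def describe_image(image_path):
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = requests.post(
        ENDPOINT,
        params={"visualFeatures": "Description,Tags"},
        headers={
            "Ocp-Apim-Subscription-Key": API_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = describe_image("photo.jpg")
    print(result["description"]["captions"][0]["text"])
```

The other services listed on the previous slide follow the same general pattern: POST the image bytes with an API key and parse the JSON response.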


Building a DL App in 1 day



http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/

(Chart: Energy to train a Convolutional Neural Network vs. energy to use a Convolutional Neural Network)


Base Pre-Trained Model
ImageNet 1000 Object Categorizer
Inception
ResNet
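To make "base pre-trained model" concrete, here is a minimal sketch that loads an ImageNet-pretrained Inception network and classifies one image. It assumes the Keras applications package and a local file cat.jpg; neither is prescribed by the talk.

```python
# Minimal sketch (assumes Keras is installed; the talk does not prescribe a
# framework): load an ImageNet-pretrained Inception model and classify one
# image into one of the 1000 ImageNet categories.
import numpy as np
from keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)
from keras.preprocessing import image

model = InceptionV3(weights="imagenet")          # downloads the weights once

img = image.load_img("cat.jpg", target_size=(299, 299))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])       # [(class_id, name, prob), ...]
```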


Running pre-trained models on mobile
MXNet
Tensorflow
CNNdroid
DeepLearningKit
Caffe
Torch

Speedup tip: no need to decode JPEGs; deal directly with camera image buffers.

MXNet
Amalgamation: pack all the code into a single source file

Pro: Cross-platform (iOS, Android), easy porting; usable from any programming language

Con: CPU only, slow

https://github.com/Leliana/WhatsThis

Very memory efficient. MXNet can consume as little as 4 GB of memory when serving deep networks with as many as 1000 layers

Deep learning (DL) systems are complex and often have many dependencies, so it is often painful to port a DL library to different platforms, especially smart devices. One fun way to solve this problem is to provide a light interface and put all required code into a single file with minimal dependencies. The idea of amalgamation comes from SQLite and other projects that pack all code into a single source file: to create the library, you only need to compile that single file, which simplifies porting to various platforms. Thanks to Jack Deng, MXNet provides an amalgamation script that compiles all the code needed for prediction with trained DL models into a single .cc file of roughly 30K lines, whose only dependency is a BLAS library. The compiled library can be used from any other programming language.

By using amalgamation, we can easily port the prediction library to mobile devices with nearly no dependencies. Compiling on a smart platform is no longer a painful task, and after compiling the library for the platform, the last step is to call the C API from the target language (Java/Swift). This does not use the GPU: the dependency on BLAS suggests it runs on the CPU on mobile.

BLAS (Basic Linear Algebra Subprograms) is at the heart of AI computation. Because of the sheer amount of number-crunching involved in these complex models, the math routines must be optimized as much as possible. The computational firepower of GPUs makes them ideal processors for AI models.

It appears that MXNet can use Atlas (libblas), OpenBLAS, and MKL. These are CPU-based libraries.

Currently the main option for running BLAS on a GPU is CuBLAS, developed specifically for NVIDIA (CUDA) GPUs. Apparently MXNet can use CuBLAS in addition to the CPU libraries.

The GPU in many mobile devices is a lower-power chip that works with ARM architectures and doesn't have a dedicated BLAS library yet.

What are my other options?

Just go with the CPU. Since it's the training that's extremely compute-intensive, using the CPU for inference isn't the show-stopper you think it is. In OpenBLAS, the routines are written in assembly and hand-optimized for each CPU it can run on. This includes ARM.

Using a C++-based framework like MXNet is probably the best choice if you are trying to go cross-platform.
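For orientation, here is a minimal Python-side sketch of loading a pretrained MXNet checkpoint and running one CPU forward pass. The checkpoint prefix/epoch ("Inception-BN", 126) and the 224x224 input shape are assumptions for illustration; on a phone, the same symbol/params files are consumed through the amalgamated C predict API from Java or Swift rather than through Python.

```python
# Minimal sketch: load a pretrained MXNet checkpoint and run one CPU forward
# pass. The checkpoint prefix/epoch and input size are assumptions; on device,
# the same .json/.params files are fed to the amalgamated C predict API.
from collections import namedtuple
import mxnet as mx
import numpy as np

Batch = namedtuple("Batch", ["data"])

sym, arg_params, aux_params = mx.model.load_checkpoint("Inception-BN", 126)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(for_training=False, data_shapes=[("data", (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

img = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")  # stand-in image
mod.forward(Batch([mx.nd.array(img)]))
prob = mod.get_outputs()[0].asnumpy()
print("top-1 class id:", prob.argmax())
```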


Tensorflow
Easy pipeline to bring Tensorflow models to mobile
Great documentation
Optimizations to bring the model to mobile
Upcoming: XLA (Accelerated Linear Algebra) compiler to optimize for hardware
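A common step in that pipeline is freezing the trained graph, i.e. folding variables into constants so a single .pb file can ship inside the app. Here is a minimal sketch using the TensorFlow 1.x-era API; the checkpoint path and output node name are placeholders, not values from the talk.

```python
# Minimal sketch (TensorFlow 1.x-era API, current at the time of this talk):
# freeze a trained graph by folding variables into constants so one .pb file
# can be bundled with the mobile app. Paths and node names are placeholders.
import tensorflow as tf
from tensorflow.python.framework import graph_util

with tf.Session() as sess:
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, "model.ckpt")

    frozen_graph_def = graph_util.convert_variables_to_constants(
        sess,
        sess.graph.as_graph_def(),
        ["softmax_output"],               # placeholder output node name
    )

with tf.gfile.GFile("frozen_model.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())
```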


CNNdroid
GPU-accelerated CNNs for Android
Supports Caffe, Torch and Theano models
~30-40x speedup using the mobile GPU vs the CPU (AlexNet)

Internally, CNNdroid expresses data parallelism for the different layers itself, instead of leaving it to the GPU's hardware scheduler.

Different methods are employed to accelerate the different layers in CNNdroid. Convolution and fully connected layers, which are data-parallel and normally more compute-intensive, are accelerated on the mobile GPU using the RenderScript framework.

A considerable portion of these two layers can be expressed as dot products, which are calculated more efficiently on the SIMD units of the target mobile GPU. The computation is therefore divided into many vector operations using the predefined dot function of the RenderScript framework; in other words, this level of parallelism is expressed explicitly in software and, unlike in CUDA-based desktop libraries, is not left to the GPU's hardware scheduler. Compared with convolution and fully connected layers, the other layers are relatively less compute-intensive and not efficient on the mobile GPU, so they are accelerated on the multi-core mobile CPU via multi-threading. Since a ReLU layer usually appears after a convolution or fully connected layer, it is embedded into its preceding layer to increase performance when multiple images are fed to the CNNdroid engine.
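To illustrate the "convolution as dot products" idea, here is a conceptual NumPy sketch of lowering a convolution to a matrix of dot products (im2col). This is for intuition only; CNNdroid itself does this with RenderScript kernels on the GPU, not Python.

```python
# Conceptual sketch only (NumPy, not RenderScript): a convolution layer can be
# lowered to a big matrix of dot products ("im2col"), which is the form that
# maps well onto SIMD dot-product units like those CNNdroid targets.
import numpy as np

def conv2d_as_dot_products(image, kernels):
    """image: (C, H, W); kernels: (K, C, kh, kw). Valid convolution, stride 1."""
    C, H, W = image.shape
    K, _, kh, kw = kernels.shape
    out_h, out_w = H - kh + 1, W - kw + 1

    # im2col: every receptive field becomes one column vector.
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=image.dtype)
    idx = 0
    for y in range(out_h):
        for x in range(out_w):
            cols[:, idx] = image[:, y:y + kh, x:x + kw].ravel()
            idx += 1

    # One matrix multiply = out_h * out_w * K dot products.
    flat_kernels = kernels.reshape(K, -1)
    out = flat_kernels @ cols
    return out.reshape(K, out_h, out_w)

# Tiny usage example with random data.
feature_maps = conv2d_as_dot_products(
    np.random.rand(3, 8, 8).astype(np.float32),
    np.random.rand(4, 3, 3, 3).astype(np.float32))
print(feature_maps.shape)  # (4, 6, 6)
```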

DeepLearningKit
Platform: iOS, OS X and tvOS (Apple TV)
DNN type: CNN models trained in Caffe
Runs on the mobile GPU, uses Metal
Pro: Fast, directly ingests Caffe models
Con: Unmaintained


Caffe
Caffe for Android: https://github.com/sh1r0/caffe-android-lib
Sample app: https://github.com/sh1r0/caffe-android-demo

Caffe for iOS: https://github.com/aleph7/caffe
Sample app: https://github.com/noradaiko/caffe-ios-sample
Pro: Usually a couple of lines to port a pretrained model to the mobile CPU
Con: Unmaintained

Mostly community contributions, not part of the main app.
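To give a feel for the "couple of lines" claim, here is a minimal pycaffe-style sketch of loading a pretrained Caffe model and running a forward pass. The file names and blob names ("data", "prob") are typical placeholders, not values from the talk; the Android/iOS ports above wrap essentially the same load-and-forward calls in C++/Java.

```python
# Minimal pycaffe sketch (file and blob names are placeholders); the mobile
# ports above expose essentially the same load-and-forward surface.
import numpy as np
import caffe

caffe.set_mode_cpu()
net = caffe.Net("deploy.prototxt", "model.caffemodel", caffe.TEST)

# Fake 227x227 RGB input just to show the forward pass; a real app would
# feed preprocessed camera frames here.
net.blobs["data"].data[...] = np.random.rand(1, 3, 227, 227).astype(np.float32)
probs = net.forward()["prob"]
print("predicted class:", probs[0].argmax())
```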

Running pre-trained models on mobile

Mobile Library  | Platform    | GPU | DNN Architectures Supported | Trained Models Supported
Tensorflow      | iOS/Android | Yes | CNN, RNN, LSTM, etc.        | Tensorflow
CNNdroid        | Android     | Yes | CNN                         | Caffe, Torch, Theano
DeepLearningKit | iOS         | Yes | CNN                         | Caffe
MXNet           | iOS/Android | No  | CNN, RNN, LSTM, etc.        | MXNet
Caffe           | iOS/Android | No  | CNN                         | Caffe
Torch           | iOS/Android | No  | CNN, RNN, LSTM, etc.        | Torch


Building a DL App in 1 week



Learning to play an accordion: 3 months



Learning to play an accordion: 3 months
Already knows piano? Fine-tune those skills: 1 week


I got a dataset. Now what?
Step 1: Find a pre-trained model
Step 2: Fine-tune the pre-trained model
Step 3: Run it using existing frameworks
"Don't be a hero" - Andrej Karpathy


How to find pretrained models for my task?
Search the model zoos

Microsoft Cognitive Toolkit (previously called CNTK): 50 models
Caffe Model Zoo
Keras
Tensorflow
MXNet


AlexNet, 2012 (simplified)

[Krizhevsky, Sutskever, Hinton 2012]

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks, 2011
n-dimensional feature representation

Learned hierarchical features from a deep learning algorithm. Each feature can be thought of as a filter which filters the input image for that feature (e.g., a nose). If the feature is found, the responsible units generate large activations, which can be picked up by later classifier stages as a good indicator that the class is present.

Deciding how to fine-tune

Size of New Dataset | Similarity to Original Dataset | What to do?
Large               | High                           | Fine-tune.
Small               | High                           | Don't fine-tune, it will overfit. Train a linear classifier on CNN features.
Small               | Low                            | Train a classifier from activations in lower layers; higher layers are specific to the original dataset.
Large               | Low                            | Train the CNN from scratch.

http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html

In practice, we don't usually train an entire DCNN from scratch with random initialization, because it is relatively rare to have a dataset of the size required for the depth of network needed. Instead, it is common to pre-train a DCNN on a very large dataset and then use the trained DCNN weights either as an initialization or as a fixed feature extractor for the task of interest.

Fine-tuning: Transfer learning strategies depend on various factors, but the two most important ones are the size of the new dataset and its similarity to the original dataset. Keeping in mind that DCNN features are more generic in early layers and more dataset-specific in later layers, there are four major scenarios:

New dataset is smaller in size and similar in content compared to the original dataset: If the data is small, it is not a good idea to fine-tune the DCNN due to overfitting concerns. Since the data is similar to the original data, we expect the higher-level features in the DCNN to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN features.

New dataset is relatively large in size and similar in content compared to the original dataset: Since we have more data, we can have more confidence that we would not overfit if we were to fine-tune through the full network.

New dataset is smaller in size but very different in content compared to the original dataset: Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train a classifier from activations somewhere earlier in the network.

New dataset is relatively large in size and very different in content compared to the original dataset: Since the dataset is very large, we may expect that we can afford to train a DCNN from scratch. However, in practice it is very often still beneficial to initialize with weights from a pre-trained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
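As a worked example of the table above, here is a minimal fine-tuning sketch for the similar-content case: take an ImageNet-pretrained base, replace the classifier head, freeze the pretrained layers, and train only the new head. It assumes Keras; the framework, layer sizes and optimizer are illustrative choices, not prescriptions from the talk.

```python
# Minimal fine-tuning sketch (assumes Keras; the talk does not prescribe a
# framework). Take an ImageNet-pretrained base, replace the classifier head,
# freeze the base, and train only the new layers on the new dataset.
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

NUM_CLASSES = 10   # placeholder: number of categories in the new dataset

base = InceptionV3(weights="imagenet", include_top=False)

# New classifier head on top of the pretrained convolutional features.
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
predictions = Dense(NUM_CLASSES, activation="softmax")(x)
model = Model(inputs=base.input, outputs=predictions)

# Freeze the pretrained layers first; for a small, similar dataset you would
# stop here (train only the head), while for a large, similar dataset you
# could later unfreeze some top layers and continue with a small learning rate.
for layer in base.layers:
    layer.trainable = False

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # supply your own data here
```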

