
Using ML for ML to span the gamut of TinyML hardware

tinyML Summit 2020 · Jason Knight, Co-founder and CPO at OctoML

Motivation

CNN, GAN, RNN, MLP, DQN, ... A Cambrian explosion of models, workloads, and use cases.

Rapidly evolving ML software ecosystem

Silicon scaling limitations (end of Dennard scaling, slowing of Moore's law).

Cambrian explosion of HW backends. Heterogeneous HW.

Growing set of requirements: cost, latency, power, security.


Hard enough on server-class hardware…

But now we want to do it on bare-metal devices?

Bare-Metal Devices: FPGAs, ARM M-class and RISC-V MCUs, etc.

Programming bare-metal devices is cumbersome.

Running ML on bare-metal hardware is hard

ML Model → ? ML from scratch in C…

Running ML on bare-metal hardware is pretty hard

ML Framework + ML Model → slow kernels, incomplete coverage

Running ML on bare-metal hardware is really hard

ML Framework + ML Model → hand-optimize kernels (not much help here…), then bugs!

ML Framework → run ML models → debug kernels → hand-optimize kernels → repeat

Optimizing ML on bare-metal hardware is a long road

μTVM simplifies ML software on bare-metal

ML Framework → run ML models, debug kernels, hand-optimize kernels: automatically

Our Goal

Minimize the effort to run and optimize ML on bare-metal.

So what’s our approach?

The TVM in µTVM

How do we get from here (models) to here (CPUs, GPUs, FPGAs, ASICs)?

TVM’s compilation stack:

High-Level Differentiable IR
↓
Tensor Expression IR
↓
LLVM, CUDA, VTA (FPGA/ASIC)

Hand-optimization? No: AutoTVM.

Where TVM is being used today

Every “Alexa” wake-up today across all devices uses a model optimized with TVM.

“[TVM enabled] real-time on mobile CPUs for free... We are excited about the performance TVM achieves.” More than 85x speed-up for a speech recognition model.

Bing query understanding: 112 ms (TensorFlow) → 34 ms (TVM). QnA bot: 73 ms → 28 ms (CPU), 10.1 ms → 5.5 ms (GPU).

“TVM is key to ML Access on Hexagon”

Cf. TVM Conference 2019 (Dec): videos and slides available online.

Many hardware targets already enjoy speedups from TVM…

Except for microcontrollers... µTVM adds a new backend to the stack, alongside LLVM, CUDA, and VTA.

Today’s Talk

1. AutoTVM
2. µTVM
3. AutoTVM on µTVM

AutoTVM

The AutoTVM loop: TVM + Program → compile program → Hardware Backend → run and measure → return timing feedback → improve implementation → repeat.

Automatically adapt to hardware type by learning
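The loop above can be sketched in a few lines. This is a hypothetical illustration, not AutoTVM's actual API: a simple grid search over tile sizes stands in for AutoTVM's learned search strategy, and `measure_on_backend` is a made-up cost function standing in for real on-device timing.

```python
# Stand-in for compiling a kernel with a given tile size and timing it on
# the hardware backend; real AutoTVM measures on the actual device.
def measure_on_backend(tile_size):
    return abs(tile_size - 8) + 1.0  # pretend a tile size of 8 is optimal

def tune(candidates):
    """Compile, run, and measure each candidate; keep the fastest (timing feedback)."""
    best_cfg, best_time = None, float("inf")
    for cfg in candidates:             # improve implementation (here: exhaustive search)
        t = measure_on_backend(cfg)    # run and measure
        if t < best_time:              # use timing feedback to keep the best
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

best_cfg, best_time = tune([1, 2, 4, 8, 16, 32])
print(best_cfg, best_time)  # -> 8 1.0
```

Real AutoTVM replaces the exhaustive loop with a learned cost model that proposes promising configurations, which is what lets it adapt to a new hardware type automatically.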

Today’s Talk: µTVM

Using µTVM · µTVM Runtime · Graph Runtime

How easy is it to use μTVM?

• Implement an interface to:
  - Read and write to device memory
  - Start code execution
• Provide a cross-compiler for the device

ResNet Example in Mainline TVM

resnet, params = get_resnet()
module = graph_runtime_build(resnet, params)
module.run(image)
> 'cat'

TVM → µTVM

TVM:

resnet, params = get_resnet()
module = graph_runtime_build(resnet, params)
module.run(image)

µTVM (wrap in a session; different build function):

resnet, params = get_resnet()
with micro.Session('openocd') as sess:
    module = sess.build(resnet, params)
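The session pattern above can be sketched as a plain context manager. This is a hypothetical mock that mimics the shape of the slide's `micro.Session`, not TVM's actual implementation: the transport name and the `build` return value are illustrative.

```python
# Hypothetical sketch: a session opens a transport to the device, builds
# modules against it, and tears the connection down on exit.
class Session:
    def __init__(self, transport):
        self.transport = transport
        self.open = False

    def __enter__(self):
        self.open = True          # e.g. connect to OpenOCD here
        return self

    def __exit__(self, *exc):
        self.open = False         # close the device connection

    def build(self, graph, params):
        assert self.open, "build() must run inside the session"
        # A real build would compile the graph for the attached device.
        return {"graph": graph, "params": params, "target": self.transport}

with Session("openocd") as sess:
    module = sess.build("resnet", {"w": 1})
print(module["target"], sess.open)  # -> openocd False
```

The point of the design: everything built inside the `with` block is tied to one live device connection, which is why µTVM only needs a "different build function" on top of the mainline TVM flow.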

µTVM Runtime

Components: C Code Generator · μDevice Interface · µTVM Runtime

Code Generation

Why not LLVM? Lack of support for many bare-metal targets, so µTVM generates C instead and relies on each device’s cross-compiler.

µDevice Interface

Read · Write · Execute

Sockets? JTAG? Doesn’t matter.
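The contract is small enough to sketch: anything that can read and write device memory and start execution can back µTVM, regardless of transport. The class and method names below are illustrative, not µTVM's actual API; the "device" is a bytearray standing in for real memory behind OpenOCD, JTAG, or a socket.

```python
class InMemoryDevice:
    """Mock device: a flat address space backed by a bytearray."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.executed_at = None

    def read(self, addr, n):
        return bytes(self.mem[addr:addr + n])

    def write(self, addr, data):
        self.mem[addr:addr + len(data)] = data

    def execute(self, addr):
        # A real transport would set the device's program counter here.
        self.executed_at = addr

dev = InMemoryDevice(1024)
dev.write(0x100, b"\x01\x02\x03\x04")   # load a "binary" into device memory
dev.execute(0x100)                       # start code execution
print(dev.read(0x100, 4))                # -> b'\x01\x02\x03\x04'
```

Because the runtime only sees these three operations, swapping JTAG for sockets means swapping this one class.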

The Runtime in Action

μTVM Runtime = C Code Generator + μDevice Interface

1. Set up a read/write/execute (RWX) communication interface with the board.
2. The code generator lowers TVM IR to C code (func.c).
3. The device compiler cross-compiles the generated code (gcc → func.o).
4. μTVM remaps the generated binary to custom device memory locations (ld linker) so that multiple libraries can be loaded.
5. A custom loader loads the remapped binary onto the device.
6. Function invocation on the host sends parameters to the device for execution → 'cat'.
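The remapping step can be illustrated with a little address bookkeeping. Real µTVM drives the ld linker with generated linker scripts; the sketch below only shows the idea of giving each compiled library its own base address so several can coexist in device memory. The base address, section sizes, and alignment are assumptions for illustration.

```python
TEXT_START = 0x20000000   # assumed start of usable device RAM

def remap(libraries, base=TEXT_START):
    """Lay out each library's .text section end to end, 4-byte aligned."""
    layout, cursor = {}, base
    for name, text_size in libraries:
        layout[name] = cursor                # this library's load address
        cursor += (text_size + 3) & ~3       # round size up to word alignment
    return layout

layout = remap([("conv2d.o", 410), ("dense.o", 120)])
print(hex(layout["conv2d.o"]), hex(layout["dense.o"]))
```

Each entry in the layout is what the loader needs: where in device memory to place the library, and (implicitly) where calls into it should jump.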

Today’s Talk: AutoTVM on µTVM

AutoTVM on µTVM

● Same pipeline as usual
● Load kernels into RAM instead of flash

End-to-End CIFAR-10 Evaluation

Replicated an int8-quantized CNN from an ARM Mbed tutorial:

Conv2D: 3@32x32 → 32@32x32
MaxPool2D: 32@32x32 → 32@16x16
Conv2D: 32@16x16 → 32@16x16
AvgPool2D: 32@16x16 → 32@8x8
Conv2D: 32@8x8 → 64@8x8
AvgPool2D: 64@8x8 → 64@4x4
Dense: 64@4x4 → 1x10
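The layer shapes can be checked by propagating the feature-map dimensions through the network. This sketch assumes 'same'-padded convolutions and 2x2, stride-2 pooling, which is consistent with the shapes on the slide; the actual kernel sizes of the Mbed tutorial model are not stated here.

```python
def conv2d_same(shape, out_channels):
    _, h, w = shape
    return (out_channels, h, w)   # 'same' padding keeps spatial size

def pool2x2(shape):
    c, h, w = shape
    return (c, h // 2, w // 2)    # 2x2 pooling with stride 2 halves H and W

shape = (3, 32, 32)               # CIFAR-10 input
shape = conv2d_same(shape, 32)    # Conv2D    -> 32@32x32
shape = pool2x2(shape)            # MaxPool2D -> 32@16x16
shape = conv2d_same(shape, 32)    # Conv2D    -> 32@16x16
shape = pool2x2(shape)            # AvgPool2D -> 32@8x8
shape = conv2d_same(shape, 64)    # Conv2D    -> 64@8x8
shape = pool2x2(shape)            # AvgPool2D -> 64@4x4
flat = shape[0] * shape[1] * shape[2]   # features into the Dense layer
print(shape, flat)                # -> (64, 4, 4) 1024
```

So the Dense layer maps 1024 features to the 10 CIFAR-10 classes.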

Preliminary CIFAR-10 CNN Results

● Ran on an ARM Cortex-M7
● Compared against CMSIS-NN
● Vanilla template
● ~5 hours of tuning
● No vectorization

Preliminary Int8 Conv2D Results

Baselines: arm_convolve_HWC_q7_fast and arm_convolve_HWC_q7_RGB in CMSIS-NN

Coming Soon - Self-hosted models

Coming Soon - Ultra low bit-width quantization

Riptide: Fast End-to-End Binarized Neural Networks. MLSys 2020 (March 3rd). Joshua Fromm, Meghan Cowan, Matthai Philipose, Luis Ceze, Shwetak Patel.

[Figure: SqueezeNet on Raspberry Pi 3]

• TVM already supports flexible code generation for a variety of data types
• Natural to combine this with µTVM
• See Josh Fromm’s MLSys talk on March 3rd to learn more

Coming soon - µTVM as a service

• OctoML is offering a hosted version of AutoTVM for cloud, mobile, and embedded:
  • Cloud CPU/GPU
  • ARM A-class CPU/GPU
  • ARM M-class microcontrollers
  • More on request
• Trained model in, optimized binary out
• Currently in private beta

Octomizer: TensorFlow, PyTorch, ONNX serialized models → API and web UI → auto-tuning using OctoML clusters → optimized deployment artifacts

Status and next steps

• Functional on x86, ARM, and RISC-V
• In-depth write-up early March on our blog, with upstream TVM documentation/tutorial
• Follow us at https://octoml.ai or @octoml
• Interop with the CMSIS-NN library and TF Lite Micro (better together!)
• First-class SIMD vectorization support for ARM schedules (in addition to the current tensorization approach)

Questions or interest? Please reach out: Jason Knight, jknight@octoml.ai

Acknowledgments: Logan Weber (AutoTVM), Joshua Fromm (Riptide)

Thanks!

Copyright Notice

The presentation(s) in this publication comprise the proceedings of tinyML® Summit 2020. The content reflects the opinion of the authors and their respective companies. This version of the presentation may differ from the version that was presented at the tinyML Summit. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

www.tinyML.org
